Monitoring Transactional Replication in SQL Server

11 April 2013
by Francis Hanlon

If you are using replication in SQL Server, you can monitor it in SSMS, but it makes sense to monitor distribution jobs automatically, especially if you can set up alerts or even set up first-line remedial action when a problem is detected. Francis shows how to do it in TSQL as an agent job.

There are several variables that we can examine in order to determine the health of our transactional replication system. Whilst the values  of these various counters may not, by themselves, indicate a problem, they can serve to give us clues as to what we can do to get replication back into a healthy status.

Normally we can monitor replication with the Replication Monitor Tool that is part of SQL Server Management Studio.  Since we cannot watch this 24 hours a day, it is useful to supplement this with a programmatic way to monitor replication.  Even if you opt to use a 3rd party tool like Red Gate’s SQL Monitor you may need to create a custom alert. Even if your tool of choice allows ‘out-of-the-box’ monitoring of replication, you might still need to set up a specific response to the counters if they exceed a threshold value.  More on that later.

One of the more useful tools supplied by Microsoft is the system stored procedure called dbo.sp_replmonitorsubscriptionpendingcmds which is stored in the distribution database.

This procedure returns both the number of commands that are pending for the subscription and an estimate of the number of seconds required to deliver all of the pending commands to the Subscriber.  I have a job that runs every 10 minutes and saves the data in a table as an historical record of the status.  I keep this data for 14 days.

This table is created in a DBA database installed on the subscriber server.  The code to create the table is:

CREATE TABLE dbo.Replication_Qu_History(

       Subscriber_db varchar(50) NOT NULL,

       Records_In_Que numeric(18, 0) NULL,

       CatchUpTime numeric(18, 0) NULL,

       LogDate datetime NOT NULL,



       Subscriber_db ASC, LogDate DESC



Script 1

The data in this table is populated by the monitoring procedures and provides an historical context for examining issues.  But to monitor what is happening right now more is required.

There are three things that help to determine the health of replication.

  1. The status of the replication related jobs
  2. The latency (especially the distribution latency) as measured by the counter Dist:Delivery Latency  
  3. The number of outstanding commands that are pending for the subscription

I have focused on the distribution latency because past experience has shown this to be more of an issue than log reader latency. Much of the time, the distribution latency is due to an increase in  transaction volume. For example doing an index rebuild on a large table in the publisher database can cause a large increase in the volume of the transaction log which results in a higher than normal volume of data to be replicated. For further discussion of this idea see Linchi Shea’s blog from 2009 Distribution latency in transactional replication: Is a volume surge the culprit?

If there are a large  number of outstanding commands waiting to be distributed, then sometimes this is due to a distribution agent job that is not running.  On the other hand, sometimes this job is running but not keeping up. By stopping and starting the agent the job starts to process the outstanding commands.

   To start off, we need to get information about replication, such as the name of the publisher and the subscriber, the names of the distribution agent jobs etc.  Microsoft has supplied several procedures in the Distributor database to help to gather information.  My Distributor database is on the same server as my Subscriber database so my script is simpler than if they were on separate servers. Firstly executing sp_replmonitorhelppublisher returns monitoring information for all Publishers using this Distributor. Executing sp_replmonitorhelppublication returns monitoring information for all publications using this Distributor. Lastly, executing sp_replmonitorhelpsubscription returns monitoring information for all subscriptions.  This information includes some latency numbers so I after executing this procedure I already have some key information.

This is the code I use to gather this information:


DECLARE @publisher SYSNAME, @publisher_db SYSNAME, @publication SYSNAME, @pubtype INT

DECLARE @subscriber SYSNAME, @subscriber_db SYSNAME, @subtype INT

DECLARE @cmdcount INT, @processtime INT

DECLARE @ParmDefinition NVARCHAR(500)


DECLARE @minutes INT, @threshold INT, @maxCommands INT, @mail CHAR(1) = 'N'

SET @minutes = 60 --> Define how many minutes latency before you would like to be notified

SET @maxCommands = 80000  --->  change this to represent the max number of outstanding commands to be proceduresed before notification

SET @threshold = @minutes * 60


SELECT * INTO #PublisherInfo


, 'SET FMTONLY OFF EXEC distribution.dbo.sp_replmonitorhelppublisher')


SELECT @publisher = publisher FROM #PublisherInfo     



,''SET FMTONLY OFF EXEC distribution.dbo.sp_replmonitorhelppublication @publisher='

+ @publisher + ''')'

--select @cmd

EXEC sp_executesql @cmd


SELECT @publisher_db=publisher_db, @publication=publication, @pubtype=publication_type  FROM ##PublicationInfo



,''SET FMTONLY OFF EXEC distribution.dbo.sp_replmonitorhelpsubscription @publisher='

+ @publisher + ',@publication_type=' + CONVERT(CHAR(1),@pubtype) + ''')'

--select @cmd

EXEC sp_executesql @cmd



ALTER TABLE ##SubscriptionInfo

ADD  PendingCmdCount INT NULL,

EstimatedProcessTime INT NULL


Script 2

Once I know some of this basic information about my publishers and my subscribers I can then check the status of the distribution jobs.  They should all be running.  If any are not running then I need to start them.  If I need to restart a job I set a flag to force an email alert to be sent.  I don’t want to actually send the email yet since I need to check the status of all my subscriptions. Again, if this number exceeds my threshold then I also set the flag to trigger my email alert. I used a cursor to go through my subscriptions because this was the easiest way I could gather the information and then use this information as parameters to other stored procedures to determine if the distribution agent was running and potentially stop and restart the agent:



SELECT @publisher, s.publisher_db, s.publication, s.subscriber, s.subscriber_db, s.subtype, s.distribution_agentname

FROM ##SubscriptionInfo s


OPEN cur_sub  

FETCH NEXT FROM cur_sub INTO @publisher, @publisher_db, @publication, @subscriber, @subscriber_db, @subtype, @JobName




       SET @cmd = 'SELECT @cmdcount=pendingcmdcount, @processtime=estimatedprocesstime FROM OPENROWSET(''SQLOLEDB'',''SERVER=(LOCAL);TRUSTED_CONNECTION=YES''

        ,''SET FMTONLY OFF EXEC distribution.dbo.sp_replmonitorsubscriptionpendingcmds @publisher=' + @publisher

       + ',@publisher_db=' + @publisher_db + ',@publication=' + @publication

       + ',@subscriber=' + @subscriber + ',@subscriber_db=' + @subscriber_db

       + ',@subscription_type=' + CONVERT(CHAR(1),@subtype) + ';' + ''')'

       SET @ParmDefinition = N'@cmdcount INT OUTPUT,

                        @processtime INT OUTPUT'

       --select @cmd

       EXEC sp_executesql @cmd,@ParmDefinition,@cmdcount OUTPUT, @processtime OUTPUT


       UPDATE ##SubscriptionInfo

       SET PendingCmdCount = @cmdcount

        , EstimatedProcessTime = @processtime

       WHERE subscriber_db = @subscriber_db


    INSERT INTO DBA.dbo.Replication_Que_History

    VALUES(@subscriber_db, @cmdcount, @processtime, GETDATE())

       --  find out if the distribution job with the high number of outstanding commands running or not

       --  if it is running then sometimes stopping and starting the agent fixes the issue


                           DROP TABLE ##JobInfo



              ,''SET FMTONLY OFF EXEC msdb.dbo.sp_help_job @job_name='''''

              + @JobName + ''''',@job_aspect=''''JOB'''''')'

       EXEC sp_executesql @cmd


       IF @cmdcount > @maxCommands OR (@processtime > @threshold AND @cmdcount > 0)


              IF (SELECT current_execution_status FROM ##JobInfo) = 1 --  This means job is currently executing so stop/start it


                     EXEC distribution.dbo.sp_MSstopdistribution_agent

                           @publisher = @publisher

                           , @publisher_db = @publisher_db

                           , @publication = @publication

                           , @subscriber = @subscriber

                           , @subscriber_db = @subscriber_db

                     WAITFOR DELAY '00:00:05' ---- 5 Second Delay

                     SET @mail = 'Y' 



       --SELECT name, current_execution_status FROM ##JobInfo

       IF (SELECT current_execution_status FROM ##JobInfo) <> 1 -- if the job is not running start it


              EXEC distribution.dbo.sp_MSstartdistribution_agent

                     @publisher = @publisher

                     , @publisher_db = @publisher_db

                     , @publication = @publication

                     , @subscriber = @subscriber

                     , @subscriber_db = @subscriber_db

              SET @mail = 'Y'      -- Send email if job has stopped and needed to be restarted


       DROP TABLE ##JobInfo

       FETCH NEXT FROM cur_sub INTO @publisher, @publisher_db, @publication, @subscriber, @subscriber_db, @subtype, @JobName



CLOSE cur_sub  


Script 3

I run the Microsoft-supplied procedure sp_replmonitorsubscriptionpendingcmds to gather information on outstanding commands and the expected number of seconds to catch up.

This is the information I want to store in my history table so I can get an idea of how replication is performing.

We need to decide on an acceptable threshold for latency.  I use sixty minutes, meaning that I want to be notified if the replicated database is more than sixty minutes behind the publisher database.  The other decision is the maximum number of undistributed commands.  If this number trends upwards as opposed to fluctuating, then there may be a problem. It is your choice how high you want to let this number get before taking action.  I choose to let the system get to 80000 undistributed commands. 

I picked these numbers by letting my Replication Queue checker job run for a couple of weeks, ensuring all maintenance jobs like index rebuild jobs had run.  Then I looked at the periods with the greatest number of undistributed commands and the greatest latency and made sure my settings were higher.  I don’t want to get called at night if a Rebuild Index job causes the system to backup temporarily but is capable of easily recovering by itself.

The code below requires the Ad Hoc Distributed Queries server configuration option be enabled. Here I create the email to be sent assuming the previous Script 3 found an issue.

IF @mail = 'Y'


   DECLARE @msg VARCHAR(MAX) = 'Replication on ' + @@SERVERNAME

       + ' may be experiencing some problems.  Attempts to restart the distribution agent have been made. '

       + 'If this is not the first message like this that you have received within the last hour, please investigate.'







       SET @xml1 = CAST(( SELECT subscriber AS 'td','',subscriber_db AS 'td','',

       latency AS 'td','', PendingCmdCount AS 'td','', EstimatedProcessTime AS 'td'

       FROM  ##SubscriptionInfo s



       SET @tab1 ='<html><body><H4>Subscription Information </H4>

         <table border = 1> <tr>

         <th> Subscriber </th> <th> Subscriber Database </th> <th> Latency(seconds)</th>

         <th> Undistributed Commands </th> <th> Estimated Catch Up Time</th></tr>'   

--  this command gives us the last 10 measurements of latency for each subscriber

       SET @xml2 = CAST(( SELECT s.Subscriber_db AS 'td','', s.Records_In_Que AS 'td','', s.CatchUpTime AS 'td','', CONVERT(CHAR(22),LogDate, 100) AS 'td'

              FROM (SELECT ROW_NUMBER() OVER ( PARTITION BY subscriber_db ORDER BY Logdate DESC ) AS 'RowNumber',


              , Records_In_Que

              , CatchUpTime

              , Logdate

              FROM DBA.dbo.Replication_Que_History

              ) s

              WHERE RowNumber <= 8

              FOR XML PATH('tr'), ELEMENTS ) AS NVARCHAR(MAX))


       SET @tab2 ='<br><br><H4>Historical Latency Information </H4>

         <table border = 1>


         <th>Subscriber</th> <th>Undistributed Commands</th> <th> Catch Up Time </th> <th> Date\Time </th></tr>'


       SET @body = @msg + @tab1 + @xml1 + '</table>'

       + @tab2 + @xml2 + '</body></html>'


       DECLARE @to NVARCHAR(200)


       EXEC msdb.dbo.sp_send_dbmail

         @body = @body,

         @body_format ='HTML',

         @recipients = @to,

         @subject = 'Possible Replication Problem' ;


DROP TABLE #PublisherInfo

DROP TABLE ##PublicationInfo

DROP TABLE ##SubscriptionInfo


Script 4

 The last process is to periodically delete rows from the replication status table so the data does not get stale.

DECLARE @delDate datetime = getdate()-10

  DELETE FROM DBA.dbo.Replication_Que_History

  WHERE LogDate < @deldate

Script 5

If any of the thresholds you set up in this script are met, then the distribution agent corresponding to the subscription with the offending counter will be stopped and restarted unless it is already stopped in which case the job will just be started. An email will be sent to let you know that some action was taken.  In many cases, restarting the distribution agent fixed the issue and replication started to work again.  If this did not fix the issue then when this job next runs the same action will be taken and a second email sent.  At this point, if you aren’t already examining the situation then you need to start.

If you have a 3rd party alerting system you could just use a modified form of the 3rd script supplied to stop and restart the distribution agent jobs if any of the thresholds are met.  Otherwise, run script 1 to create the required table.  Create a new job and put scripts 2-4 in the first step, then put script 5 as the second step of the job and schedule this job for whatever frequency makes sense to you.  I am currently running this every 10 minutes.

This process should help with the period stoppages of transactional replication but automating the recovery process.  It is still important to periodically monitor the process using the Replication Monitor tool.  This process is only meant to help prevent the off-hours calls that can be easily fixed by kick starting a distribution agent job.