Click here to monitor SSC
  • Av rating:
  • Total votes: 21
  • Total comments: 8
Francis Hanlon

Monitoring Transactional Replication in SQL Server

11 April 2013

If you are using replication in SQL Server, you can monitor it in SSMS, but it makes sense to monitor distribution jobs automatically, especially if you can set up alerts or even set up first-line remedial action when a problem is detected. Francis shows how to do it in TSQL as an agent job.

There are several variables that we can examine in order to determine the health of our transactional replication system. Whilst the values  of these various counters may not, by themselves, indicate a problem, they can serve to give us clues as to what we can do to get replication back into a healthy status.

Normally we can monitor replication with the Replication Monitor Tool that is part of SQL Server Management Studio.  Since we cannot watch this 24 hours a day, it is useful to supplement this with a programmatic way to monitor replication.  Even if you opt to use a 3rd party tool like Red Gate’s SQL Monitor you may need to create a custom alert. Even if your tool of choice allows ‘out-of-the-box’ monitoring of replication, you might still need to set up a specific response to the counters if they exceed a threshold value.  More on that later.

One of the more useful tools supplied by Microsoft is the system stored procedure called dbo.sp_replmonitorsubscriptionpendingcmds which is stored in the distribution database.

This procedure returns both the number of commands that are pending for the subscription and an estimate of the number of seconds required to deliver all of the pending commands to the Subscriber.  I have a job that runs every 10 minutes and saves the data in a table as an historical record of the status.  I keep this data for 14 days.

This table is created in a DBA database installed on the subscriber server.  The code to create the table is:

CREATE TABLE dbo.Replication_Qu_History(

       Subscriber_db varchar(50) NOT NULL,

       Records_In_Que numeric(18, 0) NULL,

       CatchUpTime numeric(18, 0) NULL,

       LogDate datetime NOT NULL,

 CONSTRAINT PK_EPR_Replication_Que_History PRIMARY KEY CLUSTERED

(

       Subscriber_db ASC, LogDate DESC

) ON PRIMARY

GO

Script 1

The data in this table is populated by the monitoring procedures and provides an historical context for examining issues.  But to monitor what is happening right now more is required.

There are three things that help to determine the health of replication.

  1. The status of the replication related jobs
  2. The latency (especially the distribution latency) as measured by the counter Dist:Delivery Latency  
  3. The number of outstanding commands that are pending for the subscription

I have focused on the distribution latency because past experience has shown this to be more of an issue than log reader latency. Much of the time, the distribution latency is due to an increase in  transaction volume. For example doing an index rebuild on a large table in the publisher database can cause a large increase in the volume of the transaction log which results in a higher than normal volume of data to be replicated. For further discussion of this idea see Linchi Shea’s blog from 2009 Distribution latency in transactional replication: Is a volume surge the culprit?

If there are a large  number of outstanding commands waiting to be distributed, then sometimes this is due to a distribution agent job that is not running.  On the other hand, sometimes this job is running but not keeping up. By stopping and starting the agent the job starts to process the outstanding commands.

   To start off, we need to get information about replication, such as the name of the publisher and the subscriber, the names of the distribution agent jobs etc.  Microsoft has supplied several procedures in the Distributor database to help to gather information.  My Distributor database is on the same server as my Subscriber database so my script is simpler than if they were on separate servers. Firstly executing sp_replmonitorhelppublisher returns monitoring information for all Publishers using this Distributor. Executing sp_replmonitorhelppublication returns monitoring information for all publications using this Distributor. Lastly, executing sp_replmonitorhelpsubscription returns monitoring information for all subscriptions.  This information includes some latency numbers so I after executing this procedure I already have some key information.

This is the code I use to gather this information:

DECLARE @cmd NVARCHAR(max)

DECLARE @publisher SYSNAME, @publisher_db SYSNAME, @publication SYSNAME, @pubtype INT

DECLARE @subscriber SYSNAME, @subscriber_db SYSNAME, @subtype INT

DECLARE @cmdcount INT, @processtime INT

DECLARE @ParmDefinition NVARCHAR(500)

DECLARE @JobName SYSNAME

DECLARE @minutes INT, @threshold INT, @maxCommands INT, @mail CHAR(1) = 'N'

SET @minutes = 60 --> Define how many minutes latency before you would like to be notified

SET @maxCommands = 80000  --->  change this to represent the max number of outstanding commands to be proceduresed before notification

SET @threshold = @minutes * 60

 

SELECT * INTO #PublisherInfo

FROM OPENROWSET('SQLOLEDB', 'SERVER=(LOCAL);TRUSTED_CONNECTION=YES;'

, 'SET FMTONLY OFF EXEC distribution.dbo.sp_replmonitorhelppublisher')

 

SELECT @publisher = publisher FROM #PublisherInfo     

 

SET @cmd = 'SELECT * INTO ##PublicationInfo FROM OPENROWSET(''SQLOLEDB'',''SERVER=(LOCAL);TRUSTED_CONNECTION=YES''

,''SET FMTONLY OFF EXEC distribution.dbo.sp_replmonitorhelppublication @publisher='

+ @publisher + ''')'

--select @cmd

EXEC sp_executesql @cmd

 

SELECT @publisher_db=publisher_db, @publication=publication, @pubtype=publication_type  FROM ##PublicationInfo

 

SET @cmd = 'SELECT * INTO ##SubscriptionInfo FROM OPENROWSET(''SQLOLEDB'',''SERVER=(LOCAL);TRUSTED_CONNECTION=YES''

,''SET FMTONLY OFF EXEC distribution.dbo.sp_replmonitorhelpsubscription @publisher='

+ @publisher + ',@publication_type=' + CONVERT(CHAR(1),@pubtype) + ''')'

--select @cmd

EXEC sp_executesql @cmd

 

 

ALTER TABLE ##SubscriptionInfo

ADD  PendingCmdCount INT NULL,

EstimatedProcessTime INT NULL

 

Script 2

Once I know some of this basic information about my publishers and my subscribers I can then check the status of the distribution jobs.  They should all be running.  If any are not running then I need to start them.  If I need to restart a job I set a flag to force an email alert to be sent.  I don’t want to actually send the email yet since I need to check the status of all my subscriptions. Again, if this number exceeds my threshold then I also set the flag to trigger my email alert. I used a cursor to go through my subscriptions because this was the easiest way I could gather the information and then use this information as parameters to other stored procedures to determine if the distribution agent was running and potentially stop and restart the agent:

 

DECLARE cur_sub CURSOR READ_ONLY FOR

SELECT @publisher, s.publisher_db, s.publication, s.subscriber, s.subscriber_db, s.subtype, s.distribution_agentname

FROM ##SubscriptionInfo s

 

OPEN cur_sub  

FETCH NEXT FROM cur_sub INTO @publisher, @publisher_db, @publication, @subscriber, @subscriber_db, @subtype, @JobName

 

WHILE @@FETCH_STATUS = 0  

BEGIN  

       SET @cmd = 'SELECT @cmdcount=pendingcmdcount, @processtime=estimatedprocesstime FROM OPENROWSET(''SQLOLEDB'',''SERVER=(LOCAL);TRUSTED_CONNECTION=YES''

        ,''SET FMTONLY OFF EXEC distribution.dbo.sp_replmonitorsubscriptionpendingcmds @publisher=' + @publisher

       + ',@publisher_db=' + @publisher_db + ',@publication=' + @publication

       + ',@subscriber=' + @subscriber + ',@subscriber_db=' + @subscriber_db

       + ',@subscription_type=' + CONVERT(CHAR(1),@subtype) + ';' + ''')'

       SET @ParmDefinition = N'@cmdcount INT OUTPUT,

                        @processtime INT OUTPUT'

       --select @cmd

       EXEC sp_executesql @cmd,@ParmDefinition,@cmdcount OUTPUT, @processtime OUTPUT

 

       UPDATE ##SubscriptionInfo

       SET PendingCmdCount = @cmdcount

        , EstimatedProcessTime = @processtime

       WHERE subscriber_db = @subscriber_db

 

    INSERT INTO DBA.dbo.Replication_Que_History

    VALUES(@subscriber_db, @cmdcount, @processtime, GETDATE())

       --  find out if the distribution job with the high number of outstanding commands running or not

       --  if it is running then sometimes stopping and starting the agent fixes the issue

       IF EXISTS(SELECT * FROM tempdb.INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME LIKE '##JobInfo%')

                           DROP TABLE ##JobInfo

        

       SET @cmd = 'SELECT * INTO ##JobInfo FROM OPENROWSET(''SQLOLEDB'',''SERVER=(LOCAL);TRUSTED_CONNECTION=YES''

              ,''SET FMTONLY OFF EXEC msdb.dbo.sp_help_job @job_name='''''

              + @JobName + ''''',@job_aspect=''''JOB'''''')'

       EXEC sp_executesql @cmd

    

       IF @cmdcount > @maxCommands OR (@processtime > @threshold AND @cmdcount > 0)

              BEGIN

              IF (SELECT current_execution_status FROM ##JobInfo) = 1 --  This means job is currently executing so stop/start it

                     BEGIN

                     EXEC distribution.dbo.sp_MSstopdistribution_agent

                           @publisher = @publisher

                           , @publisher_db = @publisher_db

                           , @publication = @publication

                           , @subscriber = @subscriber

                           , @subscriber_db = @subscriber_db

                     WAITFOR DELAY '00:00:05' ---- 5 Second Delay

                     SET @mail = 'Y' 

                     END

              END   

       --SELECT name, current_execution_status FROM ##JobInfo

       IF (SELECT current_execution_status FROM ##JobInfo) <> 1 -- if the job is not running start it

              BEGIN

              EXEC distribution.dbo.sp_MSstartdistribution_agent

                     @publisher = @publisher

                     , @publisher_db = @publisher_db

                     , @publication = @publication

                     , @subscriber = @subscriber

                     , @subscriber_db = @subscriber_db

              SET @mail = 'Y'      -- Send email if job has stopped and needed to be restarted

              END   

       DROP TABLE ##JobInfo

       FETCH NEXT FROM cur_sub INTO @publisher, @publisher_db, @publication, @subscriber, @subscriber_db, @subtype, @JobName

END  

 

CLOSE cur_sub  

DEALLOCATE cur_sub

Script 3

I run the Microsoft-supplied procedure sp_replmonitorsubscriptionpendingcmds to gather information on outstanding commands and the expected number of seconds to catch up.

This is the information I want to store in my history table so I can get an idea of how replication is performing.

We need to decide on an acceptable threshold for latency.  I use sixty minutes, meaning that I want to be notified if the replicated database is more than sixty minutes behind the publisher database.  The other decision is the maximum number of undistributed commands.  If this number trends upwards as opposed to fluctuating, then there may be a problem. It is your choice how high you want to let this number get before taking action.  I choose to let the system get to 80000 undistributed commands. 

I picked these numbers by letting my Replication Queue checker job run for a couple of weeks, ensuring all maintenance jobs like index rebuild jobs had run.  Then I looked at the periods with the greatest number of undistributed commands and the greatest latency and made sure my settings were higher.  I don’t want to get called at night if a Rebuild Index job causes the system to backup temporarily but is capable of easily recovering by itself.

The code below requires the Ad Hoc Distributed Queries server configuration option be enabled. Here I create the email to be sent assuming the previous Script 3 found an issue.

IF @mail = 'Y'

   BEGIN

   DECLARE @msg VARCHAR(MAX) = 'Replication on ' + @@SERVERNAME

       + ' may be experiencing some problems.  Attempts to restart the distribution agent have been made. '

       + 'If this is not the first message like this that you have received within the last hour, please investigate.'

       DECLARE @body NVARCHAR(MAX)

       DECLARE @xml1 NVARCHAR(MAX)

       DECLARE @tab1 NVARCHAR(MAX)

       DECLARE @xml2 NVARCHAR(MAX)

       DECLARE @tab2 NVARCHAR(MAX)

 

       SET @xml1 = CAST(( SELECT subscriber AS 'td','',subscriber_db AS 'td','',

       latency AS 'td','', PendingCmdCount AS 'td','', EstimatedProcessTime AS 'td'

       FROM  ##SubscriptionInfo s

       FOR XML PATH('tr'), ELEMENTS ) AS NVARCHAR(MAX))

 

       SET @tab1 ='<html><body><H4>Subscription Information </H4>

         <table border = 1> <tr>

         <th> Subscriber </th> <th> Subscriber Database </th> <th> Latency(seconds)</th>

         <th> Undistributed Commands </th> <th> Estimated Catch Up Time</th></tr>'   

--  this command gives us the last 10 measurements of latency for each subscriber

       SET @xml2 = CAST(( SELECT s.Subscriber_db AS 'td','', s.Records_In_Que AS 'td','', s.CatchUpTime AS 'td','', CONVERT(CHAR(22),LogDate, 100) AS 'td'

              FROM (SELECT ROW_NUMBER() OVER ( PARTITION BY subscriber_db ORDER BY Logdate DESC ) AS 'RowNumber',

         subscriber_db

              , Records_In_Que

              , CatchUpTime

              , Logdate

              FROM DBA.dbo.Replication_Que_History

              ) s

              WHERE RowNumber <= 8

              FOR XML PATH('tr'), ELEMENTS ) AS NVARCHAR(MAX))

 

       SET @tab2 ='<br><br><H4>Historical Latency Information </H4>

         <table border = 1>

         <tr>

         <th>Subscriber</th> <th>Undistributed Commands</th> <th> Catch Up Time </th> <th> Date\Time </th></tr>'

 

       SET @body = @msg + @tab1 + @xml1 + '</table>'

       + @tab2 + @xml2 + '</body></html>'

      

       DECLARE @to NVARCHAR(200)

       SELECT @to = '' -- INSERT YOUR EMAIL ADDRESS HERE

       EXEC msdb.dbo.sp_send_dbmail

         @body = @body,

         @body_format ='HTML',

         @recipients = @to,

         @subject = 'Possible Replication Problem' ;

   END

DROP TABLE #PublisherInfo

DROP TABLE ##PublicationInfo

DROP TABLE ##SubscriptionInfo

 

Script 4

 The last process is to periodically delete rows from the replication status table so the data does not get stale.

DECLARE @delDate datetime = getdate()-10

  DELETE FROM DBA.dbo.Replication_Que_History

  WHERE LogDate < @deldate

Script 5

If any of the thresholds you set up in this script are met, then the distribution agent corresponding to the subscription with the offending counter will be stopped and restarted unless it is already stopped in which case the job will just be started. An email will be sent to let you know that some action was taken.  In many cases, restarting the distribution agent fixed the issue and replication started to work again.  If this did not fix the issue then when this job next runs the same action will be taken and a second email sent.  At this point, if you aren’t already examining the situation then you need to start.

If you have a 3rd party alerting system you could just use a modified form of the 3rd script supplied to stop and restart the distribution agent jobs if any of the thresholds are met.  Otherwise, run script 1 to create the required table.  Create a new job and put scripts 2-4 in the first step, then put script 5 as the second step of the job and schedule this job for whatever frequency makes sense to you.  I am currently running this every 10 minutes.

This process should help with the period stoppages of transactional replication but automating the recovery process.  It is still important to periodically monitor the process using the Replication Monitor tool.  This process is only meant to help prevent the off-hours calls that can be easily fixed by kick starting a distribution agent job.

    Francis Hanlon

    Author profile:

    Francis works as the Senior SQL Server DBA for Manitoba eHealth; a Canadian organization providing IT solutions for health care facilities in the province of Manitoba. He has been work involved in IT since using punch cards was a common practice in programming. He has been involved with SQL Server databases since the 20th century.

    Search for other articles by Francis Hanlon

    Rate this article:   Avg rating: from a total of 21 votes.


    Poor

    OK

    Good

    Great

    Must read
    Have Your Say
    Do you have an opinion on this article? Then add your comment below:
    You must be logged in to post to this forum

    Click here to log in.


    Subject: Server Timeout
    Posted by: scruppy43 (view profile)
    Posted on: Thursday, April 18, 2013 at 1:34 AM
    Message: what possible cause why some server are caused of server timeout..when i run my queries...

    Subject: Replication stats question?
    Posted by: Patrick Index (view profile)
    Posted on: Wednesday, April 24, 2013 at 7:24 AM
    Message: Francis
    Apart from putting triggers on the subscriber tables or doctoring the replication sp's, is there a simple way (for transactional replication) to establish how many records are replicated by article, by transaction type (i.e. UPDATE, DELETE, INSERT) for a given time period say 1 day? Thanks.

    Subject: Issue with sp_replmonitorsubscriptionpendingcmds for 2005 and 2008
    Posted by: pkpetroff (view profile)
    Posted on: Monday, May 06, 2013 at 10:36 AM
    Message: There is an issue with sp_replmonitorsubscriptionpendingcmds if you are still with SQL Server 2005 and 2008:

    https://connect.microsoft.com/SQLServer/feedback/details/507595/sp-replmonitorsubscriptionpendingcmds

    Subject: Issue with sp_replmonitorsubscriptionpendingcmds for 2005 and 2008
    Posted by: fhanlon (view profile)
    Posted on: Thursday, July 11, 2013 at 2:49 PM
    Message: That link to the connect article is a very interesting read. I have noticed this behaviour at times. The suggestion to use a "fixed" version of the sp_replmonitorsubscriptionpendingcmds proc may be of benefit.

    Thanks for pointing that out

    Subject: Replication stats question?
    Posted by: fhanlon (view profile)
    Posted on: Thursday, July 11, 2013 at 2:51 PM
    Message: Patrick I don't have an easy answer for this. I would have to do some research on this.

    Subject: Im getting errors! Pls help
    Posted by: Tiana0000 (view profile)
    Posted on: Monday, November 11, 2013 at 9:52 AM
    Message: When i run script 3 i get the error msg:

    Msg 207, Level 16, State 1, Procedure Repl_Monitor, Line 0
    Invalid column name 'PendingCmdCount'.
    Msg 207, Level 16, State 1, Procedure Repl_Monitor, Line 0
    Invalid column name 'EstimatedProceduresTime'

    *Note Repl_Monitor is the name of my procedure

    From what i see, its updating temp table '##SubscriptionInfo' with column that don't exist in the table ie column 'PendingCmdCount' and 'EstimatedProceduresTime'.
    I tried selecting into different temp table name like so...

    SET @cmd = 'SELECT @cmdcount=pendingcmdcount, @procedurestime=estimatedprocedurestime INTO ##Sub FROM OPENROWSET(''SQLOLEDB'',''SERVER=(LOCAL);TRUSTED_CONNECTION=YES''..etc

    And then i updated my ##Sub temp table but it showed an error message at WHERE clause of the UPDATE statement ie ' WHERE subscriber_db = @subscriber_db' because subscriber_db column doesnt exist in the execution of stored procedure sp_replmonitorsubscriptionpendingcmds. The second thing is that your cursor is not close nor deallocated. Even after doing so an error msg still shows saying 'The cursor 'cur_sub' already exit. Pls need help. Thanks in advance.


    Subject: Im getting errors! Pls help (follow up)
    Posted by: fhanlon (view profile)
    Posted on: Thursday, November 14, 2013 at 11:35 AM
    Message: Tiana:
    Thanks for your comments. I made some corrections to the script. The corrected scripts can be downloaded from the top of the article. I will try to get these corrections into the body of the article as soon as possible

    Subject: Im getting errors! Pls help (follow up)
    Posted by: Tiana0000 (view profile)
    Posted on: Monday, November 18, 2013 at 7:23 AM
    Message: Thanks for much fhanlon, it works now :)

     

    Phil Factor
    Searching for Strings in SQL Server Databases

    Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine like Google.... Read more...

     View the blog

    Top Rated

    Searching for Strings in SQL Server Databases
     Sometimes, you just want to do a search in a SQL Server database as if you were using a search engine... Read more...

    The SQL Server Sqlio Utility
     If, before deployment, you need to push the limits of your disk subsystem in order to determine whether... Read more...

    The PoSh DBA - Reading and Filtering Errors
     DBAs regularly need to keep an eye on the error logs of all their SQL Servers, and the event logs of... Read more...

    MySQL Compare: The Manual That Time Forgot, Part 1
     Although SQL Compare, for SQL Server, is one of Red Gate's best-known products, there are also 'sister'... Read more...

    Highway to Database Recovery
     Discover the best backup and recovery articles on Simple-Talk, all in one place. Read more...

    Most Viewed

    Beginning SQL Server 2005 Reporting Services Part 1
     Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

    Ten Common Database Design Mistakes
     If database design is done right, then the development, deployment and subsequent performance in... Read more...

    SQL Server Index Basics
     Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

    Reading and Writing Files in SQL Server using T-SQL
     SQL Server provides several "standard" techniques by which to read and write to files but, just... Read more...

    Concatenating Row Values in Transact-SQL
     It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

    Why Join

    Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.