Monitoring Transactional Replication in SQL Server

If you are using replication in SQL Server, you can monitor it in SSMS, but it makes sense to monitor distribution jobs automatically, especially if you can set up alerts or even set up first-line remedial action when a problem is detected. Francis shows how to do it in TSQL as an agent job.

There are several variables that we can examine in order to determine the health of our transactional replication system. Whilst the values  of these various counters may not, by themselves, indicate a problem, they can serve to give us clues as to what we can do to get replication back into a healthy status.

Normally we can monitor replication with the Replication Monitor Tool that is part of SQL Server Management Studio.  Since we cannot watch this 24 hours a day, it is useful to supplement this with a programmatic way to monitor replication.  Even if you opt to use a 3rd party tool like Red Gate’s SQL Monitor you may need to create a custom alert. Even if your tool of choice allows ‘out-of-the-box’ monitoring of replication, you might still need to set up a specific response to the counters if they exceed a threshold value.  More on that later.

One of the more useful tools supplied by Microsoft is the system stored procedure called dbo.sp_replmonitorsubscriptionpendingcmds which is stored in the distribution database.

This procedure returns both the number of commands that are pending for the subscription and an estimate of the number of seconds required to deliver all of the pending commands to the Subscriber.  I have a job that runs every 10 minutes and saves the data in a table as an historical record of the status.  I keep this data for 14 days.

This table is created in a DBA database installed on the subscriber server.  The code to create the table is:

Script 1

The data in this table is populated by the monitoring procedures and provides an historical context for examining issues.  But to monitor what is happening right now more is required.

There are three things that help to determine the health of replication.

  1. The status of the replication related jobs
  2. The latency (especially the distribution latency) as measured by the counter Dist:Delivery Latency  
  3. The number of outstanding commands that are pending for the subscription

I have focused on the distribution latency because past experience has shown this to be more of an issue than log reader latency. Much of the time, the distribution latency is due to an increase in  transaction volume. For example doing an index rebuild on a large table in the publisher database can cause a large increase in the volume of the transaction log which results in a higher than normal volume of data to be replicated. For further discussion of this idea see Linchi Shea’s blog from 2009 Distribution latency in transactional replication: Is a volume surge the culprit?

If there are a large  number of outstanding commands waiting to be distributed, then sometimes this is due to a distribution agent job that is not running.  On the other hand, sometimes this job is running but not keeping up. By stopping and starting the agent the job starts to process the outstanding commands.

   To start off, we need to get information about replication, such as the name of the publisher and the subscriber, the names of the distribution agent jobs etc.  Microsoft has supplied several procedures in the Distributor database to help to gather information.  My Distributor database is on the same server as my Subscriber database so my script is simpler than if they were on separate servers. Firstly executing sp_replmonitorhelppublisher returns monitoring information for all Publishers using this Distributor. Executing sp_replmonitorhelppublication returns monitoring information for all publications using this Distributor. Lastly, executing sp_replmonitorhelpsubscription returns monitoring information for all subscriptions.  This information includes some latency numbers so I after executing this procedure I already have some key information.

This is the code I use to gather this information:

Script 2

Once I know some of this basic information about my publishers and my subscribers I can then check the status of the distribution jobs.  They should all be running.  If any are not running then I need to start them.  If I need to restart a job I set a flag to force an email alert to be sent.  I don’t want to actually send the email yet since I need to check the status of all my subscriptions. Again, if this number exceeds my threshold then I also set the flag to trigger my email alert. I used a cursor to go through my subscriptions because this was the easiest way I could gather the information and then use this information as parameters to other stored procedures to determine if the distribution agent was running and potentially stop and restart the agent:

Script 3

I run the Microsoft-supplied procedure sp_replmonitorsubscriptionpendingcmds to gather information on outstanding commands and the expected number of seconds to catch up.

This is the information I want to store in my history table so I can get an idea of how replication is performing.

We need to decide on an acceptable threshold for latency.  I use sixty minutes, meaning that I want to be notified if the replicated database is more than sixty minutes behind the publisher database.  The other decision is the maximum number of undistributed commands.  If this number trends upwards as opposed to fluctuating, then there may be a problem. It is your choice how high you want to let this number get before taking action.  I choose to let the system get to 80000 undistributed commands. 

I picked these numbers by letting my Replication Queue checker job run for a couple of weeks, ensuring all maintenance jobs like index rebuild jobs had run.  Then I looked at the periods with the greatest number of undistributed commands and the greatest latency and made sure my settings were higher.  I don’t want to get called at night if a Rebuild Index job causes the system to backup temporarily but is capable of easily recovering by itself.

The code below requires the Ad Hoc Distributed Queries server configuration option be enabled. Here I create the email to be sent assuming the previous Script 3 found an issue.

Script 4

 The last process is to periodically delete rows from the replication status table so the data does not get stale.

Script 5

If any of the thresholds you set up in this script are met, then the distribution agent corresponding to the subscription with the offending counter will be stopped and restarted unless it is already stopped in which case the job will just be started. An email will be sent to let you know that some action was taken.  In many cases, restarting the distribution agent fixed the issue and replication started to work again.  If this did not fix the issue then when this job next runs the same action will be taken and a second email sent.  At this point, if you aren’t already examining the situation then you need to start.

If you have a 3rd party alerting system you could just use a modified form of the 3rd script supplied to stop and restart the distribution agent jobs if any of the thresholds are met.  Otherwise, run script 1 to create the required table.  Create a new job and put scripts 2-4 in the first step, then put script 5 as the second step of the job and schedule this job for whatever frequency makes sense to you.  I am currently running this every 10 minutes.

This process should help with the period stoppages of transactional replication but automating the recovery process.  It is still important to periodically monitor the process using the Replication Monitor tool.  This process is only meant to help prevent the off-hours calls that can be easily fixed by kick starting a distribution agent job.

For more articles like this, sign up to the fortnightly Simple-Talk newsletter.

Downloads

Tags: , , , , ,

  • 36518 views

  • Rate
    [Total: 33    Average: 4.5/5]
  • scruppy43

    Server Timeout
    what possible cause why some server are caused of server timeout..when i run my queries…

  • Patrick Index

    Replication stats question?
    Francis
    Apart from putting triggers on the subscriber tables or doctoring the replication sp’s, is there a simple way (for transactional replication) to establish how many records are replicated by article, by transaction type (i.e. UPDATE, DELETE, INSERT) for a given time period say 1 day? Thanks.

  • pkpetroff

    Issue with sp_replmonitorsubscriptionpendingcmds for 2005 and 2008
    There is an issue with sp_replmonitorsubscriptionpendingcmds if you are still with SQL Server 2005 and 2008:

    https://connect.microsoft.com/SQLServer/feedback/details/507595/sp-replmonitorsubscriptionpendingcmds

  • fhanlon

    Issue with sp_replmonitorsubscriptionpendingcmds for 2005 and 2008
    That link to the connect article is a very interesting read. I have noticed this behaviour at times. The suggestion to use a "fixed" version of the sp_replmonitorsubscriptionpendingcmds proc may be of benefit.

    Thanks for pointing that out

  • fhanlon

    Replication stats question?
    Patrick I don’t have an easy answer for this. I would have to do some research on this.

  • Tiana0000

    Im getting errors! Pls help
    When i run script 3 i get the error msg:

    Msg 207, Level 16, State 1, Procedure Repl_Monitor, Line 0
    Invalid column name ‘PendingCmdCount’.
    Msg 207, Level 16, State 1, Procedure Repl_Monitor, Line 0
    Invalid column name ‘EstimatedProceduresTime’

    *Note Repl_Monitor is the name of my procedure

    From what i see, its updating temp table ‘##SubscriptionInfo’ with column that don’t exist in the table ie column ‘PendingCmdCount’ and ‘EstimatedProceduresTime’.
    I tried selecting into different temp table name like so…

    SET @cmd = ‘SELECT @cmdcount=pendingcmdcount, @procedurestime=estimatedprocedurestime INTO ##Sub FROM OPENROWSET(”SQLOLEDB”,”SERVER=(LOCAL);TRUSTED_CONNECTION=YES”..etc

    And then i updated my ##Sub temp table but it showed an error message at WHERE clause of the UPDATE statement ie ‘ WHERE subscriber_db = @subscriber_db’ because subscriber_db column doesnt exist in the execution of stored procedure sp_replmonitorsubscriptionpendingcmds. The second thing is that your cursor is not close nor deallocated. Even after doing so an error msg still shows saying ‘The cursor ‘cur_sub’ already exit. Pls need help. Thanks in advance.

  • fhanlon

    Im getting errors! Pls help (follow up)
    Tiana:
    Thanks for your comments. I made some corrections to the script. The corrected scripts can be downloaded from the top of the article. I will try to get these corrections into the body of the article as soon as possible

  • Tiana0000

    Im getting errors! Pls help (follow up)
    Thanks for much fhanlon, it works now 🙂

  • sqlMilieu

    SQL Server 2012 Errors!
    Thanks for a great post!
    Getting errors in 2012:
    Msg 11514, Level 16, State 1, Procedure sp_describe_first_result_set, Line 1
    The metadata could not be determined because statement ‘exec @cmd N’if is_member(N”db_owner”) = 1 or isnull(is_member(N”replmonitor”),0) = 1 set @has_ac’ in procedure ‘sp_MSrepl_DistributorReplMonitorAccess’ contains dynamic SQL. Consider using the WITH RESULT SETS clause to explicitly describe the result set.

  • fhanlon

    SQL Server 2012 Errors!
    The code associated with this article was never tested with SQL 2012. I didn’t make use of sp_describe_first_result_set so I can’t comment on what might need to change. There is more than one way to approach this issue. I was trying to convey how I approached replication monitoring when this article was written and there is certainly room to adapt the code and enhance the code to take advantage of constructs in later versions. I will look into what might need to change with SQL 2012 when my system gets upgraded from the current 2008 R2.

  • sqlMilieu

    SQL Server 2012 Errors!
    Thanks for the prompt reply Hanlon! Not sure why my previous comments shows twice!

  • fhanlon

    The distributor_AgentName is not found
    When I look at the Microsoft documentation for sp_replmonitorsubscriptionpendingcmds it indicates the @subscription_type parameter is either set to 0 for a push subscription or 1 for a pull subscription. I would check to ensure the command is created with all the correct parameters for your situation. You may need to make modifications. There is not a different system proc for pull and push subscriptions. Print out the command that is created for you and check that all the parameters are filled in correctly. There may be more than 1 issue.

  • benhur216

    The distributor_AgentName is not found
    When I run this script on my distributor database, I get an error that the Distributor_AgentName could not be found. I have looked at the job name that is being returned, and it is the job name as if I were doing a push subscription. However, my environment requires a Pull subscription. Which I have successfully configured. Is there another system stored procedure that I could use instead of "sp_replmonitorsubscriptionpendingcmds" to get the correct job name?

  • benhur216

    The Distributor_AgentName is not found Con’t
    Sorry, I specified the wrong system stored procedure. The name of the stored procedure that I am having trouble with, is "sp_replmonitorhelpsubscription". Thanks in advance for your help.

  • fhanlon

    The Distributor_AgentName is not found Con’t
    There are many parameters that can be passed to the sp_replmonitorhelpsubscription proc. I did not use all of them. One of the parameters I notice in the Microsoft documentation is the @subtype parameter which defaults to 0 meaning push subscription. Changing this to 1 represents a Pull subscription. The code already includes a subscription type variable – @subtype. I don’t remember noticing this when I originally ran the code and it worked since I had a push subscription. Adding this parameter and populating it using the existing variable @subtype would not be difficult. This way it should work for both pull and push subscriptions. Good catch

  • benhur216

    The Distributor_AgentName is not found Con’t
    I have tried using the parameter of @SubType to the sp_replmonitorhelpsubscription and it gave me an error that the @SubType is not a valid parameter. As I look at the specification for the stored procedure, I notice that the subtype is a result variable not a parameter variable.

  • fhanlon

    The Distributor_AgentName is not found Con’t
    So you are saying when you execute the command EXEC distribution.dbo.sp_replmonitorhelpsubscription @publisher=servername,@publication_type=0
    then the publication you are interested in does not show up? If this is transactional replication then all subscriptions should show up. If its not transactional publication then the @publication_type must be changed to the appropriate type. If your subscription does not show up and replication is working then I’m sorry but I can’t answer why based on the information given. As far as I know there is no other command to list subscriptions.

  • benhur216

    The Distributor_AgentName is not found Con’t
    In further testing I hvave noticed that in the result table for the sp_replmonitorhelpsubscription stored procedure is returning a SubType of 1 which translates to a Pull subscription. However, the Distributor_AgentName is the same name as if the replication is a push subscription. So, thank you for your responses, but I believe that Microsoft has not implemented this stored procedure properly.

  • benhur216

    The Distributor_AgentName is not found Con’t
    Sir,
    I am seeing the publication and the publication type that I expect. I receive the SubType that I expect, however, the Distributor_AgenName is not the Job name that I expect to see.

  • UltraSQL

    In Script 3,@cmd value should be:
    SET @cmd = ‘SELECT @cmdcount=pendingcmdcount, @processtime=estimatedprocesstime FROM OPENROWSET(”SQLOLEDB”,
    ”SERVER=(LOCAL);TRUSTED_CONNECTION=YES”,
    ”SET FMTONLY OFF EXEC distribution.dbo.sp_replmonitorsubscriptionpendingcmds @publisher=””’ + @publisher
    + ””’,@publisher_db=””’ + @publisher_db + ””’,@publication=””’ + @publication
    + ””’,@subscriber=””’ + @subscriber + ””’,@subscriber_db=””’ + @subscriber_db
    + ””’,@subscription_type=””’ + CONVERT(CHAR(1),@subtype) + ””’;’ + ”’)’