Click here to monitor SSC

Jonathan has been working with SQL Server since 1999. He enjoys performance tuning, development and using SQL Server to provide appropriate business solutions. He is the founder and leader of the PASS SQL South West user group http://www.sqlsouthwest.co.uk , is a moderator at SQL Q + A forum ask.sqlservercentral.com and is on twitter at @fatherjack. He has spoken at SQLBits and SQL in the City, SQL Saturdays and local user groups across the UK and Europe.

Alerts are good, aren’t they?

Published 27 June 2011 5:00 am

It is accepted best practise to set some alerts on every SQL instance you install. They aren’t particularly well publicised but I have never seen any one not recommend setting up alerts for Error 823, 824 and 825. These alerts are focussed on successful access(IO) to the hard drives that SQL Server is using. If there are  any errors when reading or writing to the drives then one of these errors will be returned. Having the alerts on these errors means that any IO issues will be brought to the DBAs attention well before a disk actually fails.

The way these alerts work is that SQL Server will try to read/write to a disk up to 4 times, if after that it fails the Error 823 or 824 get fired and there is an entry on the error log reflecting this. Now because you are an outstanding DBA you check your error log every day and if you see this sort of thing you take action. (For the exact (boring) details of what, when and how this happens then you can read all about it at http://support.microsoft.com/kb/828339 (823) and   http://support.microsoft.com/kb/2015756 (824)). The important thing to note however is that the failure has to happen 4 times for the errors to be raised, what if there is an IO that fails once, twice or three times? This could be an early(ier) sign of an impending hard drive crash that you want to be aware of. That’s where error 825 comes in (http://support.microsoft.com/kb/2015757) , it is the error that signifies a read or write retry happened, but there is no error added to the error log. This is because the severity of Error 825 is different from the others:

SELECT  m.[message_id] ,
       
m.[severity] ,
       
m.[is_event_logged]
FROM    sys.messages AS m
WHERE   language_id = 1033
       
AND m.[message_id] IN ( 823, 824, 825 )
AND
is_event_logged = 1
ORDER BY severity ,
       
m.[message_id]

imageChecking the Raiserror help we know that "Severity levels from 20 through 25 are considered fatal. If a fatal severity level is encountered, the client connection is terminated after receiving the message, and the error is logged in the error and application logs."
Setting an alert on these errors however means that regardless of your inspection of the Error Log you will know if they happen, even Alert 825. Alerts can be set to automatically email Operators when they fire so you can sit back watching Wimbledon SQL Monitor‘s traces of your server performance, and not worry you are missing advance notice of a possible hardware crash.

However, make sure that the Alerts are set up correctly.

I have recently had to pick up on a case where these 3 alerts were in place on these errors to alert the DBA team by email. When a server restarted a drive was offline. The 823 alert did it’s job and fired an email via the Exchange server to the DBA email group. However the drive didn’t come online as there was a power supply issue. The Exchange server did it’s best to keep up but croaked at approx. 160k emails. The alert fired over 200k times, right up until the server was rebooted with the power supply to all drives as expected. The issue was that the Alert was set up with no delay between responses value. The default was to alert every time the drive was referenced, it should have had a delay – say every ten minutes or so and that would have massively reduced the number of alerts and the unexpected exercise for the Exchange server

Just to make sure you have these alerts, and that they are set up with a delay you can use this script on your servers:

USE [msdb]
GO
SELECT
   
[s].[name] ,
   
[s].[event_source] ,
   
[s].[event_id] ,
   
[s].[message_id] ,
   
[s].[severity] ,
   
[s].[enabled] ,
   
[s].[delay_between_responses] ,
   
[s].[include_event_description] ,
   
[s].[occurrence_count] ,
   
[s].[has_notification]
FROM
   
[dbo].[sysalerts] AS s
WHERE
   
--( message_id IN ( 823, 824, 825 )
    --  AND [s].[delay_between_responses] <> 600
    --)
    --OR
   
( ENABLED = 1
        
AND [s].[delay_between_responses] = 0
      
)
ORDER BY
   
severity ,
   
message_id
comment / uncomment the WHERE clause sections as you need to to locate all alerts and/or those that are set up with no delay.

PS. If you have never set up any Alerts, this is where to start:
SNAGHTML1b945ce
I’ll do a follow up blog on creating Operators and Alerts in a while, get in touch if you want it done soon.

4 Responses to “Alerts are good, aren’t they?”

  1. Anonymous says:

    Good post Jonathan, you are right that this sort of thing is not often talked about I seem to recall these errors and alerting was talked about in last years Paul Randal and Kimberely Tripp’s pre-con at sqlpass for “Accidental Dba session” and remember wondering then exactly how many DBAs really actively monitor for these types of thing.

    Too many DBAs bury their head in the sand when storage is involved imho and defer far too often to the “specialist” SAN/ Storage guru. Not me, I question every little detail and try to stay one step ahead of them!

    Thanks for raising this issue to the wider public, I’m sure its going to prevent corruption for someones SQL instance/s.

Leave a Reply