10 September 2014

SQL Server Performance Troubleshooting with SQL Monitor 5

Any SQL Server monitoring tool must gather the metrics that will allow a DBA to diagnose CPU, memory or I/O issues on their SQL Servers. It should also provide a set of accurate, reliable, configurable alerts that will inform the DBA of any abnormal or undesirable conditions and properties, as well as specific errors, on any of the monitored servers. This article provides an in-depth guide to the monitoring and alerting functionality available in one such tool, Redgate SQL Monitor. It focuses on the latest edition (5.0), which includes several key new features, such as performance diagnosis using wait statistics, the ability to compare to baselines, and more.

Any SQL Server monitoring tool must gather the metrics that will allow a DBA to diagnose CPU, memory or I/O issues on their SQL Servers. It should also provide a set of accurate, reliable, configurable alerts that will inform the DBA of any abnormal or undesirable conditions and properties, as well as specific errors, on any of the monitored servers. This article provides an in-depth guide to the monitoring and alerting functionality available in one such tool, Redgate SQL Monitor. It focuses on the latest edition (5.0), which includes several key new features, such as performance diagnosis using wait statistics, the ability to compare to baselines, and more.

Any SQL Server monitoring tool must gather the metrics that will allow a DBA to diagnose CPU, memory or I/O issues on their SQL Servers. It should also provide a set of accurate, reliable, configurable alerts that will inform the DBA of any abnormal or undesirable conditions and properties, as well as specific errors, on any of the monitored servers.

This article provides an in-depth guide to the monitoring and alerting functionality available in one such tool, Redgate SQL Monitor. It explains how the tool works, reviews its features, its architecture, shows how to configure it, and offers some practical examples of how to use it to monitor your SQL Server database servers and instances. It covers the latest version of the tool (5.0), which supports versions of SQL Server from 2005 up to and including SQL Server 2016.

The tool provides core monitoring functionality, such as built-in and customizable alerting, performance diagnosis using wait statistics, and the ability to capture and compare performance baselines, as well as new features such as Availability Group monitoring, and integration with Active Directory.


Why you need a SQL Server monitoring solution

“When I first started out as a SQL Server DBA, I was in charge of about 20 instances. My morning ritual was to log remotely into each instance, one at a time, and work through a series of checklists, verifying each instance was running as expected, checking disk space, confirming that all overnight jobs ran successfully, examining error logs in case of issues, checking key performance counters, and much more. Even with only 20 instances, this manual checking was very time consuming. In fact, on some very busy days, I didn’t even have time to check every instance. Also, the process got boring and tedious very quickly.” – Brad McGehee , SQL Server DBA

If Brad’s description of his early days as a DBA bears any similarity to a typical day for you, or if you spend a lot of time gathering performance and other metrics from your SQL Server instances and databases, manually, in response to various reported issues, then it’s likely that a SQL Server-specific monitoring solution will save you a lot of time.

The monitoring solution will gather the metrics you need to diagnose any CPU, memory or I/O issues on your SQL Servers. It will provide a set of configurable alerts that will inform you when an error occurs, when jobs fail or fail to run, database or server properties deviate from the norm, there are sustained spikes in resource usage, or abnormal trends, and so on.


How SQL Monitor works

SQL Monitor is both a performance monitoring tool and an alerting tool. It continuously monitors a set of target SQL Servers, collecting a wide range of server-, instance-, and database-level diagnostic data that will help a DBA to evaluate current activity levels on the server, and to identify resource bottlenecks, query performance problems, and other common errors and issues on their database systems.

SQL Monitor raises an alert whenever a particular event occurs (such as a SQL Server error or a job failure) or whenever some pre-defined, but configurable, threshold value for some metric (such as disk space usage, or query duration) is exceeded. Via the SQL Monitor web interface, users can drill into the list of all the alerts raised on the monitored servers, along with various overviews and summaries of the collected diagnostic data.

Therefore, for every alert raised, we can see various snapshots, aggregations and summaries of resource usage and query activity on the server around the time of the alert, helping us to determine quickly what might have caused it. We can also compare resource metrics at the time a certain issue occurred to their baseline values at a previous time, such as the same period on a previous day, in order to see exactly “what changed”.

Alongside the pre-set summaries of the data that SQL Monitor automatically displays in the web interface, we also create analysis graphs displaying the values of multiple performance counters over a desired time range. Outside the web interface, we can even run custom SSRS reports directly against the SQL Monitor database.


Lightweight implementation

Users access the monitoring data and alerts via the SQL Monitor web interface, meaning the data is accessible to authorized users on any device that has a web browser.

SQL Monitor collects all its data using a lightweight, agentless, 3-tier architecture that minimizes the danger of the observer effect (where the act of collecting performance data from a target server affects the performance of that server), and stores all the historical data in a SQL Server SQL Monitor database.


Pre-configured and customizable

Perhaps the biggest strength of SQL Monitor is that it is both pre-configured and highly customizable. It comes with a pre-defined set of base metrics that will allow it to start collecting server-, instance- and database-level metrics as soon as we add the first SQL Server we wish to monitor. It also comes with a pre-configured set of alerts for common resource usage, query performance and common errors and issues.

In other words, out-of-the-box, with minimal set up and zero configuration, SQL Monitor satisfies the Pareto Principle (also known as the 80-20 rule); it will not identify and raise an alert for every possible SQL Server-related problem, but it will catch the most common ones.

From this pre-configured base, we can start monitoring and alerting immediately and, as we learn more about our servers, and the level of alerting, start manually adjusting alerting thresholds so that they are best suited for the level of activity on our servers. SQL Monitor tries to configure sensible threshold values for alerts, but there are very few threshold values that aren’t fully dependent on the specific environment, and on the applications and the patterns of activity on your servers.

Alert tuning, and even disabling alerts, if necessary, is a vital part of maintaining a strong signal-to-noise ratio and therefore a healthy monitoring system that the team actually use rather than just attempt to ignore. Over time you need to tune and adjust your monitoring system so that all alerts are meaningful, and allow the team to take fast, corrective action.

The rules for effective SQL Server monitoring

I recommend reading The Ten Commandments of SQL Server Monitoring by Adam Machanic article, as well as Eight Steps to Effective SQL Server Monitoring by Tony Davis. Make sure your monitoring system, SQL Monitor or otherwise, obeys the rules and guidelines these articles propose. If you find there is some aspect of how it works that bends or flouts the rules, report it (in the case of SQL Monitor, via the dedicated forum).

Once comfortable with the level of alerting, and confident that the most important alerts don’t get lost in a “sea of noise”, we can start to build out our monitoring and alerting capabilities. SQL Monitor lets us write our own Transact-SQL code to create custom metrics that will capture the specific monitoring data we need from our servers, instances and databases, and create alerts on these custom metrics, as required. We’ll cover the creation of custom metrics and alerts later in the article.


Pre-defined alerts

SQL Monitor’s key mode of operation is to raise alerts based on abnormal or undesirable conditions and properties, as well as specific errors, on any of the monitored servers, instances and databases. For example, it might raise an alert if the monitoring data reveals:

  • High resource usage -memory or CPU usage for a server exceeds a certain threshold value
  • Disk space running low – find out that the drive housing the transaction log is nearly full while there is still time to act
  • Query performance issues -for example, a query against one of the databases runs longer than a specified duration
  • Specific SQL Server errors and problems -such as a deadlock, or some other error that SQL Server writes to the error log
  • Failed or long running jobs – such as database backup jobs
  • Abnormal values for instance , database or object-level properties – such as a database’s page verification option set to NONE, or an index’s fragmentation level exceeding a certain value
  • Availability Groups Problems – relating to the availability or performance of availability databases in the primary and secondary replicas

We can find a full list of the pre-defined alerts at monitor.red-gate.com/Configuration/Alerts. Out of the box, SQL Monitor will raise 22 alerts directly related to our SQL Server instances, ranging alerts on long running queries, to overdue backups, to blocking and deadlocks. It will also raise another 11 related to the host machine (disk space, memory and CPU usage, and so on). There are also a further 7 alerts specific to Availability Groups. Each alert is pre-configured with default thresholds, but it is very easy to adjust the threshold settings.

The monitor.red-gate.com website is a live copy of SQL Monitor that allows anyone on the Internet to view the alerts and performance of the two SQL Server instances that run the backend databases for both the SQLServerCentral.com and Simple-Talk.com websites.

Later in this article, we’ll review some specific examples of these alerts, and the supporting diagnostic data that SQL Monitor presents for each one.


Pre-defined metrics

SQL Monitor comes with a pre-defined set of base metrics that means it can start collecting server-, instance- and database-level metrics as soon as we add a SQL Server. It also collects various server, instance and database properties, all of which are useful in identifying the root cause of any alerts raised on the monitored servers.

It interrogates the Windows server for metrics such as CPU and memory use, as well as disk space usage. At the SQL Server instance and database-levels, it uses various built-in functions, system tables, views and performance counters to collect a wide range of SQL Server and diagnostic data and properties that can expose resource usage issues (for example, high memory or CPU use), performance issues (such as long running queries, blocking), as well as specific error and problems (such as deadlocks, index fragmentation, SQL Server errors). It also collects other important information, such database configuration properties.

The following sections summarize, briefly, some of the key predefined metrics that SQL Monitor collects. As noted earlier, and as I’ll demonstrate later, we can expand this list by defining our own custom metrics

Performance Monitor counters

SQL Monitor tracks a wide variety of Performance Monitor counters to identify bottlenecks in CPU, memory, disk I/O, or network resources. To make this easier, Performance Monitor counters are divided into four categories:

  • Ten counters related specifically to the host machine on which the instance is running
  • Eighteen counters related specifically to each instance
  • Nine counters related to each individual database on each instance
  • Six counters related to any Availability Group of which the instance is a member

For a full list of all the Performance Counters collected by SQL Monitor, visit http://monitor.red-gate.com/Analysis/Graphs.

As well as viewing strategic snapshots of this data, at various places in the web interface, we can also perform custom analysis of these metrics, on the Analysis tab in the SQL Monitor web interface (see later).

Diagnostic data from Dynamic Management Views

Alongside PerfMon metrics, SQL Monitor tracks data held in various Dynamic Management Views (DMVs), offering deeper insight into the workload distribution across an instance, and into databases or specific ‘hot’ tables that are the source of long running queries and resource waits.

Top queries

SQL Monitor sets alerts on current, long running queries, based on data from the sys.dm_exec_ requests DMV and also queries information from sys.dm_exec_sql_text and sys.dm_exec_query_plan DMVs.

SQL Monitor uses this information to present a list of the most expensive queries, by instance and by database, plus the SQL text of the executing batch and associated query plan handle, at the current time or at the time of an alert. You can sort these by execution time, count and resources used as necessary.

SQL Monitor also captures which resource waits have affected that query, if any.

Top resource waits

Every time a session has to wait for some reason before the requested work can continue, SQL Server records the length of time waited, and the resource on which the session is waiting. For currently-waiting tasks, it exposes the waits through the sys.dm_os_waiting_tasks DMV. SQL Monitor mines this data to present a list of the top ten resources waits on the server over a given time (such as for the previous 15 minutes), so we can see if the most significant waits are, for example, related to log writes, or waits to acquire locks and latches, or waits on some other resource. Clicking on a significant wait type reveals any queries that were delayed by this wait, query plan details, target database and the duration of the wait.

For top-level, system-wide waits, SQL Monitor also mines data from the sys.dm_os_wait_stats DMV, which exposes wait statistics aggregated across all session IDs, to provide a summary review of where the major waits are on a given instance.

Index fragmentation

SQL Monitor collects diagnostic data that allows it to report fragmented indexes in any of the monitored databases, and will raise an alert at a pre-defined fragmentation threshold.

Currently, for SQL Server 2000 compatibility, it collects this data by running periodically the deprecated DBCC SHOWCONTIG command, although it will likely switch at some point soon to using the sys.dm_db_index_physical_stats function.

By either route, collecting this data is potentially resource intensive, because SQL Server will need to scan every index page, and collect statistics, each time. For this reason, SQL Monitor only collects this data once a week, at 2AM on Sunday morning, and uses the FAST collection mode (roughly equivalent to using LIMITED mode with the sys.dm_db _index_physical_stats function).

Profiler data

While PerfMon and DMV data is very useful when troubleshooting problems, sometimes we need even more detailed information. SQL Monitor has the ability to collect selected Profiler data, as needed. Normally, this feature is left off because it can be resource intensive to collect lots of Profiler data, even though a server-side trace is used. Generally, it is only turned on to help provide more extensive data about specific alerts. For example, if tracing has been turned on, and a specific alert fires, then you will be able to see the Profiler events that occurred before and after the alert.

The Configuration | Trace page on SQL Monitor displays a warning icon to make it easier to see which instances have trace enabled.

Expensive system processes

As well as expensive queries running against SQL Server instances, it is often important to know what processes are running on your SQL Server box, and this feature keeps track of the most resource intensive processes running on your server, in terms of CPU time. This will help to determine if it is a SQL Server instance, or perhaps another application, that is using up valuable resources.

Basic server, instance, and database properties

For each server, instance, and database, essential properties are collected, such as the OS version, SQL Server version, Service Pack levels, and much more. This information helps the DBA stay informed about the basic configuration of the environment.


Metric Overviews and Analysis

SQL Monitor presents the data it collects in various ways across the Overviews, Alerts and Analysis screens that comprise the three main sections of the SQL Monitor web interface.

Overviews

The Overviews screen presents various predefined views into the monitoring data, offering summaries at the machine right down to database level of what’s happening on the system right now, or at a specified time:

  • Server / operating system level, including:
    • CPU, Memory and Network usage
    • Disk metrics – drive space used, average read/write time
    • OS properties
    • Top system processes
  • SQL Server instance level, including:
    • Instance-wide PerfMon metrics, such as batch requests/sec, page splits etc.
    • Host machine resource usage summaries
    • Instance properties – SQL Server edition, version, collation
    • Performance Diagnostics:
      • Top waits at current or specific time
      • Top expensive queries at current or specified time
      • Top SQL user processes, by CPU, in 30 seconds prior to current or specified time
    • Recent errors – last 10 error log entries
  • Database level, including:
    • Database file properties – data and log file size and location, plus log file usage
    • Database-level configuration settings – recovery model, collation, Auto create statistic status etc.
    • Performance metrics such as transactions/sec
    • Performance diagnostics – most expensive queries
  • The Availability Group level, including
    • Status and health of the AG service, the instances and the databases
    • Rate at which logs are being transferred to the secondary replicas
    • Various statuses including the settings for Failover

On the Overviews screen, at any level, we can use the back in time feature to see a summary of the metrics at a previous point in time.

Alerts

The Alerts screen displays details of all current, un-cleared alerts for all monitored SQL Servers. Accompanying the details of each alert, SQL Monitor provides a summary of diagnostic data collected at the time the alert was raised.

We’ll look at alerts in more detail later in the article, but the supporting data includes summaries of:

  • Host machine resource usage levels
  • SQL Server activity levels
  • System processes that were running concurrently
  • SQL Processes that were running concurrently (for SQL Server instance or database level alerts)
  • Queries that used the most resources during a 15 minute period, starting 10 minutes before the alert was triggered
  • The SQL Profiler trace stack (if enabled)

Analysis

On the Analysis screen, we can plot specific metrics, namely performance counter metrics plus custom metrics, over time.

We can also display and overlay multiple metrics for a given time period, and we can create metric baselines, and compare them to usage levels for the same metric, at the time of the alert. Again, we’ll discuss how this works a little later.


SQL Monitor architecture

Before we dive into the details of how to install, configure and use SQL Monitor, we need to take a high-level look at its design. SQL Monitor is divided into three key components:

  • Monitoring Service ( Base Monitor ) – the main workhorse of SQL Monitor. It runs as a normal Windows service, continuously monitoring our SQL Servers, collecting the alerts and performance data and storing it in the SQL Monitor database.
  • Web Server – serves up the SQL Monitor user interface via a web browser. We can choose to install a dedicated, standalone SQL Monitor Webserver, or use an existing IIS web server (IIS 7 and up).
  • SQL Monitor Database – a SQL Server database that stores all the configuration and historical data collected by the base monitor.

SQL Monitor is an agentless architecture; it collects diagnostic data without the need to install an agent on each SQL Server instance. Instead, the base monitor connects remotely to each server, collecting the data using a combination of WMI, T-SQL, remote registry calls, remote file access, and ping.

While an agentless environment minimizes the monitoring overhead, it is still unwise to install the SQL Monitor components on the same machine as the SQL Server instances we wish to monitor. The overhead of running the base monitor and web server, as well as writing all the data to the repository, which can grow large on a busy server, could affect the performance of the monitored server (the Observer effect). Equally, it’s of little help if the monitoring service, responsible for telling us that something’s failed on the server, goes down at the same time as the server.

Therefore, at the very least, we need one additional machine to host the web server, base monitor and SQL Monitor database, although the recommended configuration, generally, is two separate servers, one to host the base monitor and web server and one to host the SQL Monitor database. Ultimately, if we can install each of the three components on a dedicated server, it will allow us to scale SQL Monitor to monitor the largest number of SQL Server instances.

The SQL Monitor 5 documentation provides an overview of the SQL Monitor components, reproduced in Figure 1.

2353-1-317100b2-f767-40c1-b6e5-e14ba1daa

Figure 1: SQL Monitor architecture.

The Supported platforms and hardware and performance guidelines section of the documentation provides full details of supported platforms for the three SQL Monitor components, as well as supported SQL Server versions for monitoring, and supported browser versions for the web clients. It also offers hardware guidelines for the machines hosting the base monitor and SQL Monitor database.


SQL Monitor installation and start up

The SQL Monitor installation process is straightforward; simply run SQLMonitor.exe on the server on which you wish to install the web server component (and possibly also monitoring service) and complete the installation steps in the wizard. The installation documentation covers these steps in detail, so here we’ll just review briefly the main considerations.

You can either install the web server and monitoring service on the same machine, or on separate machines. If you choose the latter option, you’ll install the web server first, and then a separate installer allows you to install only the monitoring service, on a separate server.

For the examples in this article, we used a 2-server architecture for SQL Monitor, and so installed the web server and base monitor on the same virtual machine.


The web server

The default option in the installation wizard is to install the dedicated SQL Monitor web server, which uses the .NET 4 runtime. If you chose this option, SQL Monitor will install and configure the SQL Monitor 5 Web Service. See the documentation for the required permissions for the Web Service account.

However, the installation process will also detect if the server hosts a supported version of IIS (v7 or later) and, if so, the alternative is to add SQL Monitor to the existing IIS server. If you choose to use IIS, the SQL Monitor Web Service is not installed.

If IIS isn’t installed, it’s worth installing the dedicated SM web server, and you may still opt to do so even if the machine already hosts IIS v7 or later. The dedicated web server is much easier to set up, as you don’t have to juggle any IIS permissions, and is very lightweight.

Using SQL Monitor with IIS

If you do need to use IIS, to support a ‘heavyweight’ SQL Monitor deployment, then the documentation provides details on installing SQL Monitor with IIS 7 and later, and on how to move hosting from the SQL Monitor Webserver to IIS, if required.

The examples in this article use the dedicated SQL Monitor web server. By default, it uses TCP port number 8080 for incoming connections, and you can test the port assignment to ensure no other program is already using it. Here, for this reason, we assign the SQL Monitor Webserver to port 8081.

If you want to access the SQL Monitor web server outside your organization’s firewall, you will need to configure the web server so that it can be accessed through the firewall.


Install the monitoring service

In this example, we simply install the monitoring service ( a.k.a. base monitor) on the same machine as the web server. The Base Monitor service uses default TCP port 7399 to communicate with the Web Server. The base monitor will need an account that has access to the SQL Monitor database, as well as an account, which may be the same account or a different one, to access to each of the SQL Server instances and databases that we wish to monitor. The former we specify during installation and the latter post-installation, during initial configuration. The Requirements section of the documentation specifies the minimum permissions required for each of these accounts.

The base monitor can connect to the SQL Monitor database using Windows (the default, and used here) or SQL Server authentication.

If you run Services.msc on the machine where you installed the base monitor, the name of the base monitor service is SQL Monitor 5 Base Monitor. If you ever see the message “Unable to detect base monitor” when starting up SQL Monitor, check to see that the service is still running (it will stop working if, for example, we change the account password and forget to update the service to use the new one).


Set up the SQL Monitor database

The final major step in the installation is to set up the SQL Monitor database. This will store all the data collected from each of the monitored SQL servers. Depending on the size of your SQL Server databases, and activity on your servers, the SQL Monitor database can grow large, quickly, and it’s essential to pre-plan for this during set up.

The documentation estimates that monitoring one server, hosting a single SQL Server instance, will require between 150 and 450 MB of storage per day. Therefore, for 10 servers, we can expect the SQL Monitor database to grow by 10 to 30 GB per week.

Decide how many weeks’ worth of monitoring data you wish to retain (covered in more detail in the Configuring SQL Monitor section, later) and then create a custom SQL Monitor database, with initial size and filegrowth for the data and log files that will avoid excessive autogrowth events, which can cause file fragmentation.

If, instead, you decide to let SQL Monitor create the SQL Monitor database for you, it will create a repository database called RedGateMonitor using the defaults supplied by the model database for that instance. It then alters the database, first to set the initial size, and growth increment, of the data file to 512MB, and then to set its recovery model to SIMPLE.

For the examples in this article, we created a new SQL Server database, sized appropriately, and supplied a Windows account for the base monitor to connect to the database, as discussed earlier.

SQL Monitor database architecture

If you would like to know a little more about the schema of the SQL Monitor database, and how it organizes and stores the monitoring data, and alerts, try Chris Lambrou’s short series of posts: https://www.simple-talk.com/blogs/author/198217-chris-lambrou/ (Chris was one of the SQL Monitor developers).


Initial start up

With the installation process complete, all we need to do is launch SQL Monitor. The first time we do so, create the administrative role for the SQL Monitor web interface and then log in, and start adding the SQL Server machines and instances that we wish to monitor.

The Administrator user has full access to all SQL Monitor features. Subsequently, the administrator can create less-privileged user roles (covered in the Configuration section, a little later in the article) for users who only need to view and analyze the monitoring data.

Licensing and activation

After installing SQL Monitor, you will also need to enter licensing information and activate it. See the Help file for instructions on how to do this. You don’t have to enter the licensing information immediately, as SQL Monitor will work as a free trial for 14 days.

Accessing SQL Monitor

We can access SQL Monitor from any web client, simply by supplying the URL with the address and port number for the web server component:

http://<webservername>.<domain>:8080.

On the local machine hosting the web server, we can also launch SQL Monitor from the Start menu.

Adding an instance to SQL Monitor

In order to start using SQL Monitor, we just need to add some SQL Servers for it to monitor. Having done so, it starts monitoring immediately, based on a pre-defined set of base performance metrics and alerts.

We add new server and instances in the Configuration | Monitored Servers section of SQL Monitor (we can also reach this page from the left hand menu of the Global Overviews page)

To add a new server, simply click on Add SQL Server.

2353-1-0bdd6c17-d084-4e6e-963d-5689ba0eb

Figure 2: Adding a SQL Server instance

We need to supply the following information:

  • The path to SQL Server host machine or Cluster we wish to monitor – the fully-qualified name
  • A Host machine account – used by the base monitor service to connect to, and gather server-level data from, the machine hosting the SQL Server we wish to monitor
  • A SQL Server instance account – used by the base monitor service to connect to the and gather data from the SQL Server instance

By default, SQL Monitor will use the Base Monitor service account, which we set up during installation to allow it to connect to the SQL Monitor database, for both the host machine account and the SQL Server instance account.

However, it is a best practice to create separate, dedicated domain accounts for each credential. The host machine account will generally have local admin rights to the server where the SQL Server instance is located, and the SQL Server instance account will generally have SQL Server sysadmin rights, although in neither case is the use of such privileged access mandatory (see Requirements).

As we add new servers over time, the Monitored servers page will provide a full list of all monitored servers and instances, reporting their current status and allowing us to perform actions such as edit credentials, suspend monitoring, retry a connection, and set a maintenance window.

Active Directory integration

New to SQL Monitor 5 is integration into Active Directory to help manage and control access to the monitored SQL Servers. If certain users/groups are not allowed to view a certain server then that server won’t appear in SQL Monitor when viewed by those users.

On the Configuration page, click Authentication Settings and you’ll see options to log in to SQL Monitor using either SQL Monitor credentials, or Active Directory credentials. If you choose the latter, you’ll need to supply at minimum a domain name and credentials for a service account in Active Directory.

Having done so, tested the connection and saved the settings, you’ll see the message in Figure 3, where we need to confirm the AD user or group that will have administrative access to SQL Monitor.

2353-1-398dd6c1-e92d-426f-abd6-185acf405

Figure 3: Adding an AD user or group as an Administrator

We can then use the Configuration | Manage Users page to create lower-privileged AD users and groups that have access only to specific servers.

2353-1-cc5467ea-b2ee-42b4-b8a3-8761e303f

Figure 4: Adding a standard user or group with access only to specific servers

If permissions conflict, the least restrictive is taken. For example, if a user is a member of an AD group that doesn’t have access to a certain server but also a member of an AD group that does have access, then SQL Monitor will give that user access.


Working with SQL Monitor

SQL Monitor allows us to track alerts and analyze associated monitoring data across three different screens, as described previously:

  • Overviews – snapshots and summaries of monitoring data, describing what is currently occurring on the system, or what the system looked like at a specified moment in time.
  • Alerts – detailed information about alerts that SQL Monitor has generated. We can also manage those alerts, marking them as Read (or Cleared) in order to keep the alert list clean, and so maintain a strong signal to noise ratio
  • Analysis – tools to analyze the gathered metrics over time, establish baselines, and compare different metrics.

Combined, these tools offer a degree of proactive control over the performance of our database systems, as well as the ability to respond to emergencies as they arise.


Tracking alerts

Having installed SQL Monitor, and added some SQL Server instances, it automatically begins to collect data and raise alerts.

SQL Monitor raises an alert in the following situations:

  • Every time the event related to the alert occurs – such as a deadlock occurring, a job failing, or an instance becoming unavailable
  • When the defined threshold level for the alert firing is breached – for example, a query runs longer than the threshold value, or available disk space drops below a threshold value, or CPU usage exceeds a threshold value for a longer than a set duration

We can configure each alert, setting the desired threshold levels appropriate for our environment, or even disabling it, if required. We’ll cover alert configuration in detail later in the article.

When SQL Monitor raises an alert, it stays raised until someone with proper permission clears it. It is very important that the team are diligent in clearing alerts that they have investigated and resolved, or that simply don’t require immediate attention. In this way, the list of alerts SQL Monitor displays will always provide an excellent place to begin troubleshooting.

Global Overview

The first screen that comes up upon starting SQL Monitor is the “Global Overview” screen, as shown in Figure 5.

2353-1-d78b7639-5f63-4b8a-ad57-a2972980d

Figure 5: The “Global Overview” screen in SQL Monitor.

The Monitored Servers explorer, on the left, lists host machines and SQL Server instances in a hierarchy. The ” (2)” next to Global Overview indicates one host machine (called win-3xxx) and two instances (local and dojo), although of course a host machine might run two or more SQL Server instances.

Monitoring SQL Server clusters

If you run SQL Server in a cluster environment, which SQL Monitor supports fully, you will see a hierarchy of cluster name, host machine names, and instance names. SQL Monitor automatically groups these together, when monitoring a cluster.

The main portion of the screen summarizes all un-cleared and unread alerts across all monitored machines and instances. SQL Monitor raises specific alerts both for the host machine level, and at the SQL Server instance-level. The highest level of the hierarchy (win- 3 xx, in this example), rolls up the alerts for all the lower levels. In Figure 5, we see some un-cleared alerts, two high severity (red) alerts, two medium severity (orange) alerts and 2 low severity (blue) alerts.

If we click down to the instance level, we see the alerts per database on the local instance.

2353-1-5fdeb56f-ac52-47a9-8daf-632855cd5

Figure 6: Alerts at the instance level

Alerts screen

We can drill down into the details of the alerts either by clicking on the number of un-cleared alerts listed for a server or instance, or by switching to the Alerts screen. In this example, I clicked on the five un-cleared alerts listed for the local instance, as shown in Figure 7.

2353-1-e1111311-8023-4370-aac4-5673fb9c6

Figure 7: The Alerts screen showing six alerts

The severity of the alert runs from Low to Medium to High, with the latter being alerts that should be of immediate concern, as they are the ones most likely to lead to outages or major performance problems. In the Actions column, we can, for example, mark an alert as read, or clear an alert, although we’re more likely to do that on the screen for each individual alert, as we’ll discuss shortly.

Troubleshooting alerts (Deadlock example)

Simply by clicking on each alert, we can see the full details for each alert, when it occurred, and on which database or instance, along with snapshots and summaries of resource usage and activity levels on the host machine, and SQL Server instance, at the time of the alert, as well as details of SQL Server queries and processes, or other server-level processes, which were running concurrently.

To give you a feel for the troubleshooting process, let’s drill deeper into one of the three alerts shown in Figure 7, namely the Deadlock alert.

Clicking on any alert takes us to the Details tab for the alert, as shown in Figure 8.

2353-1-184b0f54-2976-465e-8ebd-8b32b252f

Figure 8: The Details tab for a Deadlock alert

At the top, for any alert, we see the time the alert was raised (and ended, if relevant). Below that we can see the current status of the alert. In this case, someone has read the alert but it has not yet been cleared. We can simply click those boxes to changes its status. We can also add useful comments, perhaps regarding resolution of the problem.

In the body of the Details tab, we see the details of the alert. Obviously, the information presented here is specific to the alert. In this case, we see details of the sessions and objects involved in the deadlock.

We can see straight away that processes 57 and 59 were engaged in a deadlock and that SQL Server chose process 59, an UPDATE on the Person.AddressType table, as the deadlock victim and rolled it back.

Troubleshooting Deadlocks

You may find it interesting to compare the information provided by SQL Monitor, for troubleshooting deadlocks, with that obtained from use of Trace Flag 1222 and SQL Trace’s Deadlock Graph event (or using the system_health default event session in Extended Events). For more in-depth coverage, including how to analyse deadlock graphs, I can recommend the Handling Deadlocks chapter of the free e-book Troubleshooting SQL Server, by Jonathan Kehayias.

The Lock details column of the output provides details of the database and specific objects involved in the deadlock. From this, we can see that Process 57 holds an Exclusive (X) lock on the Primary Key index of the Person.Address table and Process 59 holds an Exclusive (X) lock on the Primary Key index of the Person.AddressType table. Each process needs to obtain an X lock to update rows in the index on which the other process already holds a lock, leading to a deadlock.

Alongside the Details tab are five other tabs that provide further useful information about the alert. The Output tab is alert-specific, but the others are common to all alerts.

  • Output – displays the entire deadlock graph
  • Comments – we can add comments regarding alert resolution
  • Alert History – the lifecycle of the current occurrence of the alert
  • Occurrences – how many times SQL Monitor has raised this alert, for the current object
  • Description – useful background information on troubleshooting the alert type

Scrolling down further, we reach the Performance Data section. On the Host machine tab, SQL Monitor provides a snapshot of the values for various server-level Performance Counters, captured around the time that the alert fired.

2353-1-ce406940-7bcd-4ea3-9ad4-e55873d62

Figure 9: Performance metrics capture at the time the alert fired

The gray bar in each graph indicates the time that SQL Monitor fired the alert. A green bar, if there is one, indicates when the alert ended. Note that these bars represent the time that the alert fired and ended, not the exact time that the underlying queries started and ended, although they are a close approximation. An alert won’t fire until its threshold is reached, and this will be some time after the query began. Similarly, the SQL Server tab provides snapshots of performance counter metrics for the SQL Server instance.

For even more information, the System processes tab lists concurrent system processes, and SQL processes / Profiler trace identifies the processes running at the time of the alert, plus the trace stack, if you have trace enabled. In this case, this tab will identify who and what was running the session that caused the deadlock (OK, I confess, it was me, in SSMS, armed with a nasty, long-running UPDATE query).

Finally, we can see the Top queries from around the time of the alert. By default, SQL Monitor lists them in order of duration, but by clicking on the desired column (such as execution count), we can reorder the results by that column.

Clicking on the long-running query will reveal the full text of the query, its plan handle, the target database, and the query duration. Clicking on the arrow, below the query text will show the query in the context of its containing batch. This is especially useful when the query is run as part of a stored procedure, as it will give you the name of that procedure

2353-1-0119c33e-5b00-4dc8-8160-4aba0b240

Figure 10: Top queries running at the time an alert fired

Armed with knowledge of the queries involved in the deadlock, we can take steps to resolve the problem, to ensure it does not recur. This might involve query tuning, adding a new index, or modifying an existing one, or some other standard mitigation for deadlocks.

Clearing alerts

When we start to work with an alert, SQL Monitor marks is automatically as Read. This means that other DBAs or data professionals looking at the Alert won’t also start to troubleshoot the issue. Having resolved the problem, it is very important to mark the alert as Cleared.

Other alert examples

We haven’t got room in this article to walk through all of the 33 possible alerts, but we’ll take a brief look at the information available for the two other alerts raised in Figure 7, namely the Blocked Process alert and the Long Running Query alert.

A full list of pre-configured alerts

The documentation provides a complete list of alerts, along with the default threshold settings for each one and the frequency at which SQL Monitor collects data for the related metric. See, http://documentation.red-gate.com/display/SM5/List+of+alerts.

The basic troubleshooting process is similar for every alert, in that SQL Monitor will reveal both the relevant alert details, and performance data captured around the time the alert was raised, which the DBA can use to diagnose the cause. For example, a Fragmented Indexes alert will identify the indexes, their size and degree of fragmentation, a Disk Space alert will identify the space-challenged disk and summarize the size of the databases that store data on that disk, and so on.

Blocked Process

Figure 11 shows the top portion of the Details tab for the Blocked Process alert we saw in Figure 7.

2353-1-feb7af11-d69a-4e37-bbf1-c66308112

Figure 11: Alert details for a blocked process report

The Details tab displays the time and duration of the blocking event and the database in which it occurred. More useful though is the information presented in the Processes tab, which reveals information about both the blocked process and the blocking process.

2353-1-14786525-f147-4be0-97fe-838ba775f

Figure 12: Details of the blocked and blocking processes that caused a blocking alert

Process ID 59, the UPDATE transaction against the Person.Address table is blocked. You can see the complete T-SQL statement for the blocking process, shown as ID 57. Armed with the process ID and the start time, we can use the Rewind Time feature (covered shortly) to examine additional queries that were active at this time.

The problem may be a poorly-designed query that needs tuning, or simply an open transaction that needs to be committed or rolled back. In this example, on examining the running queries, the blocking process was identified as an open transaction, affecting a single row, which was dealt with by killing the process and letting the transaction roll back.

Long Running Query

Another alert we saw in Figure 7 was the Long Running Query alert. As revealed by the Details pane for this alert, SQL Monitor raised it due to the same blocked query that caused the Blocked Process alert.

2353-1-bcf90079-96b4-4127-84c4-db352e200

Figure 13: Details of the blocked and blocking processes that caused a blocking alert

It is quite common to see these two events together. By default, SQL Monitor raises a High severity Blocked Process alert after one minute of blocking, and also a Low severity Long Running Query alert when a query runs longer than one minute. However, we might prefer the long-running query event to be separated in time from the blocked process event, and we can do this easily by modifying the settings for the long-running query alert to make the threshold value shorter or longer than the blocked process alert, using the Configuration page.

In this example, if we resolve the blocking issue then we’ll also resolve the long running query problem. However, more generally, we would want to examine the query details, in the accompanying performance data, find out how often the query occurred in the workload, and so on. If this is the first occurrence of the query, it might be a one-off ad hoc query that may never happen again. If, instead, we see many occurrences of this same alert for the same query, then we will want to investigate what application or user is running it, detail we can get form the SQL Processes / Profiler trace tab, as described previously.

Once again, be sure to clear each alert after you address it.

Set up email notifications for alerts

Immediately after an instance is added to SQL Monitor, it begins to collect data and report any alerts, on the Alerts screen. For important alerts, it’s advisable to set up email notifications so that the relevant DBA receives immediate notification. Each email contains much of the information on the alert details page. We can configure alert emails to be sent when alerts are first raised, when they are escalated (from low to medium or high severity, for example), and when they are closed.

First, on the Email Settings page of the Configuration section, enter the default email address to which to send notifications along with the mail server settings. Alternatively, we can disable all email notifications, if they are not required.

At Configuration | Alert Settings, we can override the default notification address for individual alerts, as required, or simply disable notifications for that alert by selecting Specify an email recipient, and then unchecking the Send Email notification to: box.

SQL Monitor’s alerting feature works with PagerDuty, which allows email messages to be sent via phone and SMS. This is a third-party cloud-based application subscription service.


Using Overviews

On the Overviews page, SQL Monitor presents various aggregations and summaries of its collected diagnostic data, offering a snapshot of the system activity and resource usage at the current time, or at a specific point in time. SQL Monitor presents relevant data at each level, as we can drill down from the server level, to the SQL Server instance and database levels, as summarized earlier in the article.

At the SQL Server instance level, SQL Monitor displays in the Performance Diagnostics section various PerfMon counter metrics for the instance, and also snapshots of data mined from Dynamic Management Views, presented as Top queries that were executing at the specific time and the Top queries captured over the past, say, 15 minutes.

We’re going to focus on the wait statistics, because they offer a great way to understand what is causing the system to slow down, right now.

Wait statistics

A frequent occurrence for a DBA is to receive a phone call informing him or her that that the server is “running slow” now, or was running slow “just before I went to lunch”. With no more specifically-defined issue on which to base their investigation, a good starting point is to review the wait statistics for instance, over the reported period of slowdown.

In order to do this, generally speaking, the DBA would need to be capturing and saving regular snapshots of the sys.dm_os_wait_stats DMV, and then comparing them to find out ‘what changed’ over the slow period. A tool like SQL Monitor takes care of this “heavy lifting” for us, and presents a “top waits” view of the data, as shown in Figure 14. By default, SQL Monitor sorts the waits by duration ( i.e. wait time).

2353-1-d572f583-38a1-460c-83f3-9f57b07ff

Figure 14: Current Top waits

On my instance, the top four entries show a number of tasks waiting for a long cumulative period, and only using up a certain amount of CPU (as evidenced by the comparatively low Signal wait time). The first is a relatively obscure wait type associated primarily with extended events monitoring and we can be safely ignore it.

For all of the more common waits, SQL Monitor will display helpful descriptive text. In Figure 14, the more interesting are the waits after the first, and in particular the PAGELATCH_SH wait type. The data shows that 450,675 tasks waited for this resource (i.e. waited to acquire a shared lock on a data page that is in the buffer pool), over the previous 15 minute period.

This suggests a possible issue, so let’s drill deeper by clicking on this wait type. The first point to note is that on clicking on a specific wait type, below the list of waits appears detailed information about that wait type, what often causes it and possible diagnostic routes.

2353-1-dcfd2b9b-276d-40ba-9761-f2ff9ccc9

Figure 15: SQL Monitor provides detailed descriptions and troubleshooting advice for each wait type

In the main part of the screen, for the selected wait type, SQL Monitor displays any queries that experienced this type of wait over that period.

2353-1-56568e5c-a6bb-4525-9aa0-8d3dc7f82

Figure 16: Queries affected by a wait type

In this case, all the waits are coming from a single query (of course, since this is just my test instance, resource usage levels are not terribly high).

We can click on the query to reveal more information about the source of the problem, the query involved, the plan handle to pull the plan from cache, and other waits associated with the query.

2353-1-a5246654-61d9-4502-93d7-77f26169c

Figure 17: Further details on the query associated with a particular wait type

In this example, we have a fairly complex, recursive CTE query (in the uspGetEmployeeManagers stored procedure in AdventureWorks) that uses up a lot of tempdb space. This causes the latch contention in tempdb, and the subsequent waits cause the query to be slow running. In short, it needs to be tuned.

The Rewind Time feature

Sometimes the phone call you get is about a problem that occurred yesterday. While SQL Monitor maintains and displays real-time data, it also stores past data in a database, which means that we can roll back to see what was happening at a particular point in time. This is great for diagnosing problems that occurred when you were not around. By default, data is stored for 1-2 months, but this is configurable (see later).

To rewind to a previous point in time, we use the aptly-named Rewind Time feature, available at the top of any Overviews page.

2353-1-41555f8c-c83e-44db-9349-be79471d1

Figure 18: The Rewind Time feature

Use of the control is largely self-explanatory. We can add or subtract minutes or hours by clicking on the appropriate buttons, or click the calendar to choose a specific date and time, or simply type in a date and time (using a 24 hour clock). The orange “skip forward” arrow will return us to the current time.

SQL Monitor presents the same views into the performance and diagnostic data, for the specified time rather than current time, but the information for the databases and alerts is blank.

2353-1-3315ae5c-5b3f-4605-bba0-609a0bf2d

Figure 19: The Overviews screen, at the SQL Server instance level, for a previous point in time

Further down the screen, we’ll also find the top waits, expensive queries, and so on, for the chosen time. We can go as far back in time as we have data stored in the SQL Monitor database. Having completed analyzing the previous point in time data, re-click the ‘skip forward’ button to return to the current time.


Analyzing performance metrics

We use the Analysis screen to build and display graphs of various metrics over different time periods, and to perform baseline testing. This is the go-to location for long term analysis based on the performance counter metrics collected over time by SQL Monitor. We can use this information for trending, historical comparisons, and other types of troubleshooting and reporting.

By default, the Analysis screen will display a graph showing a single metric, Machine processor time (total CPU utilization) for the server first in the list of monitored servers, over the last hour.

2353-1-12892dce-a857-43e7-860a-a4549059e

Figure 20: The default Analysis Graph

The x-axis displays the time range and the y-axis the metric value; of course, the y-axis will display a variable range based on the types of metrics selected. As you can see, my system is initially under a reasonably consistent load of about 30% processor time (I have a load generation PowerShell script running), followed by some spikes and rises in load as I ran additional functions.

We can adjust the time range using one of the pre-set options, such as “Today” or “Last 24h”, or we can simply use the time range and calendar options. Using the forward and back buttons will display the metric values for the same internal on the previous or next day.

Hovering the mouse pointer over the graph will reveal the exact metric value for a specific point in time.

2353-1-0ce75452-9156-4cfc-86f2-1963e6b75

Figure 21: Viewing the metric value for a specific point in time

Below the graph is the selection box to allow us to add further metrics to the graph, and so compare different types of activity over the same time period. For server-level metrics, such as processor time, we can choose the server for which we wish to display the metric. For SQL Server instance- and database-level metrics, we can display them for all databases on a specific instance, or for a specific database.

Let’s see how that works in more detail.

Multiple metric graphs

Consider a situation where the wait statistics are showing high WRITELOG waits, at certain times, for the tempdb database, and we want to find out the root cause.

SQL Monitor collects several PerfMon counters related to log activity for a database, so we start by displaying the Log flush waits/sec metric just for the tempdb database, over the last two hours.

2353-1-6cc1cbb3-6a7f-4113-af81-cf354562f

Figure 22: Viewing a SQL Server metric for a specific database

Next, click the button Add another metric and add the Log flushes/sec metric. Then double click (As Above) for the server selection, SQL Monitor will automatically display the new metric for the same server, instance and database as the previous Log flush waits/sec metric.

2353-1-cf99e1f1-75cf-497f-b449-5cf0f7ab7

Figure 23: Plotting two metrics over the same period

It’s worth noting that, if we change the server, instance or database selection in the first metric (listed in the Show: line in Figure 23), then it will also change for all subsequent metrics for which we selected As Above. It makes it very easy to compare the same set of metrics for different databases.

On the current graph, we can see that the y-axis now display no scale, since the scale for each of the two metrics is different. SQL Monitor simply scales them proportionally within the time frame within the selected time range. Remember that we can still hover over each graph to see absolute values for each metric, at a specific point in time.

In any event, what we’re interested in here is not so much absolute values as corresponding patterns in behavior between different metrics. A log flush occurs, writing log records from the log cache to the physical log file, every time a transaction commits and during CHECKPOINT operations. By plotting the two metrics together, we should see if there is any correlation, i.e. whether ‘spikes’ in the number of log flushes per second correlates to increased numbers of log records waiting to be written to disk. In this case, we see no correlation between spikes in the number of log flushes and spikes in the log flush waits.

Let’s add a third metric to the graph, Log space used.

2353-1-89e37188-b339-4074-b740-909c4c4b9

Figure 24: A three-metric graph for the tempdb database

Now, we see a correlation between the behavior in the Log space used and Log flush waits/sec metrics. For the tempdb database, periodic drops in the log space used correspond closely to spikes in log flush waits.

For the tempdb database, SQL Server runs a CHECKPOINT operation when the tempdb log file reaches 70% full. Since tempdb always operates in SIMPLE recovery model, SQL Server truncates the log once the CHECKPOINT is complete, removing all log records that are no longer required. Clearly, we can see that the spikes in log flush waits correspond to times when SQL Server is clearing space in the tempdb log file. In this case, there is quite a lot of activity, but that’s in keeping with the current load on the system (my test script runs lots of inserts and deletes over and again, in a loop).

As you can see, combining metrics in this fashion allows us to spot correlations in metric patterns.

Using Baselines

While comparing various metrics over a given time period is useful, often what a DBA really wants to know, in response to hearing that a server was “running slow” at some random time, is “what changed?” What was happening on the system at that time that was different from “normal”? That’s where baselines come in.

To see baselines in action, I’m going to display some metrics for a different, busier SQL Server. Once again, we start with an analysis graph showing a single metric, Machine: processor time, for the selected server, this time for a 24 hour period.

Comparing to a baseline

Clicking the Compare baseline button automatically adds to the graph a second set of values for the same metric, over the previous 24 hour period, therefore providing a baseline, or comparison point. In Figure 25, the dark blue line represents the current time range of 24 hours of data and the lighter blue line represents the previous 24 hours’ worth of data.

2353-1-45a0a4de-a8f2-4a68-aa9e-10982bc76

Figure 25: Comparing today to yesterday

In this example, we do see day-to-day similarities in the data, with similar spikes in CPU usage at 3AM, for example. Sometimes, we will see tighter correlations by comparing to the same day the previous week and we can do this simply by modifying the Compare With value such that the graph compares the last 24 hours with the same time period seven days earlier.

2353-1-9f1b52b7-a338-4b6e-83b1-8cb5a87ef

Figure 26: Comparing today to the same day last week

Now we can see some variations, as well as interesting correlations. The spike at 3AM also occurred seven days ago, but CPU usage sustained a higher level for a period after the spike, around 20% usage compared to about 15% in the most recent 24 hours.

In this manner, we can begin to understand if the current pattern of activity on the server is normal, as compared to previous behavior.

Extending Baselines

The Extend Baseline button allows us to extend the baseline comparison a set number of previous time ranges. So, for example, if we are initially comparing a 1 hour time range to the same period the previous day, then extending the baseline by 7 time ranges will add to the graph the same 1-hour time range for the previous 7 days.

The example in Figure 27 compares the current 24 hours with the preceding 24 hours, and extends the baseline by 7 time ranges, so what’s on display is basically the last weeks’ worth of data collections, with each line being fainter the further back in history.

2353-1-364412a1-de6b-433f-b2aa-dc2b053ce

Figure 27: Extending the baseline to cover a whole week

We can change the time ranges, or we can modify the data selected, based on the information stored in the SQL Monitor database. Obviously, we can’t request data from counters that weren’t collected, or for time ranges that have been purged from the SQL Monitor database.

In Figure 28, I’ve opted to compare the current day to the same day the previous week, and then extended the baseline by seven time ranges, so the graph displays CPU usage for the current Tuesday and the previous seven Tuesdays.

2353-1-331488a3-8c28-4c07-9c13-f725dcb0d

Figure 28: Comparing CPU usage for the same 24 hour period across 7 days

In this way, we begin to get a true idea of what “normal” looks like and can spot unusual spikes and see correlations in behavior.

However, for an even clearer picture of what “normal” looks like, we can display averaged regions, for the baselines, rather than separate lines. To do this, simply click the Regions button at the top of the graph.

2353-1-34759562-1d55-4206-88c3-f795f2808

Figure 29: Comparing behavior to averaged baseline regions

SQL Monitor still displays the current data as a line, so we know what happened in the current 24 hour period. The rest of the data has been averaged out and smoothed down.

Essentially, the Regions view is based on the same data, but shows the 0th, 1st, 3rd and 4th quartiles, with quite strong Loess smoothing applied and the area between 1st and 3rd quartiles (the ‘mid-range’ values) shaded dark blue, and the area between 0th and 1st, and 3rd and 4th (the low- and high-range values) shaded a lighter blue.

With this knowledge, we can ascertain quickly that, in this case, the last 24 hours are not abnormal. In fact, they are exceedingly normal, falling almost exclusively within the medium blue, or mid-range average values throughout the 24 hour period.

Baselines provide an excellent mechanism for making comparisons over time so that we can understand how our system is currently behaving compared to how it behaved previously. The knowledge generated by the baselines allows us to spot trends and identify outliers in order to assist in both administration and troubleshooting of our servers.

It’s possible, with a little patience, to write T-SQL or PowerShell scripts to collect some or all of this data “by hand”, for example from the various system views, performance counters, Dynamic Management Views, and so on. We would need to collect baseline measurements for each key metric and then capture the data into a table at regular intervals. Erin Stellato in her Capturing Baselines series on SQLServerCentral.com explains how to start collecting this data manually.

The problem however is that writing and maintaining the scripts to capture the required data, the scheduled jobs to capture it, and the tables to store the data becomes, itself, a big task and one that can detract from the more important task of analysing the data to spot problems on our SQL Server instances. A tool like SQL Monitor will remove a lot of this heavy lifting.


Monitoring availability groups

Introduced with SQL Server 2012, Availability Groups (AG) are now a fundamental aspect of the High Availability and Disaster Recovery (HA/DR) functionality within SQL Server.

Availability Groups are maintained at the database level, not at the server level, so they bring with them different types of monitoring functionality. Further, the HA/DR nature of AG means that you have to monitor differently on the Primary node, the currently active instance maintaining your database, and any of the Secondary nodes, the available backup instances.

SQL Monitor provides all the additional functionality necessary to monitor your AG for up time, availability, current status and performance with appropriate monitoring, alerts and metrics, on both the Primary and Secondary nodes.

When you add in an Availability Group, the Global Overview screen lists the AG along with the servers and instances contained within that AG, as shown in Figure 30.

2353-1-b05e6df3-1193-4332-b2a8-d5eac8fd4

Figure 30: AGs on the Overview screen

If you click on any of the servers or instances, you’ll see the standard overview screen for that server or instance. If you click on the AG itself, you’ll see a new overview, specific to the AG, as shown in Figure 31.

2353-1-df9d82cf-ebad-4e0f-88cb-b6d1c4a88

Figure 31: Overview of metrics and alerts for a specific AG

There’s quite a lot of information displayed on this screen, so we’ll break it down in more detail. On the left, you have the basic status and health of the AG service itself, the instances and the databases. You can also see any AG-related alerts that have been fired, as well as the current status of synchronization.

2353-1-0dad26e1-8ab5-4d5c-8427-66e315fe7

Figure 32: AG status and health

You can also see the rate at which logs are being transferred to the secondary replicas.

2353-1-da2c657e-8787-4f2e-81bb-a04ca9f66

Figure 33: Log transfer rate to secondaries

You can see various statuses including the settings for Failover including Synchronized and Automatic, Synchronized with No Data Loss, and Asynchronous allowing Data Loss. The Asynchronous connection which allows data loss is what you can see in the final line of Figure 33.

We can drill down into each of the items on this screen. Clicking on the servers and databases will open the standard overview pages for those object types. Clicking on an individual alert will open the Alerts page. The spark lines for the Redo Queue, and others, will open the Analysis page.

For an AG service, you’re going to want to ensure that you monitor both the primary and the secondary instances. With that you can then be sure to look at Analysis of the metrics that shows both the load on the primary, but also how well the primary is interacting with the secondary instances as it sends across the transactions and validates their safe arrival.

When you add a cluster that’s part of an Availability Group, SQL Monitor automatically detects which instances are primary, and which are secondary. All of the instances are then added.

2353-1-b8442e22-d65d-478d-8768-c6f2c232c

Figure 34: Detailed analysis of metrics for primary and secondary replicas

All the information within the Analysis screen for the AG works exactly the same as for the Analysis screen for of any other set of counters. The Analysis screen is useful here because we can add counters from the Primary and from Secondary servers, as you can see in Figure 34. This allows you to see how the different servers are behaving.

SQL Monitor 5 offers a number of new Alerts for Availability Group management, as shown in Figure 35.

2353-1-999a7a09-020f-41ae-be78-b9df0667b

Figure 35: AG-specific alerts

All this functionality will, collectively, allow you to manage your AG instances much more effectively and proactively.


Customizing alerts and metrics

After some time working with a monitoring and alerting tool such as SQL Monitor, we begin to get a good feel for the level of activity, and subsequent alerting, on our monitored servers, as well as for how much monitoring data we’re collecting, and the growth rate of the central repository for storing it.

We’ll start to understand that certain alerts are either unnecessary in our environment, or require different threshold settings. At the same time, we may discover that our database suffer from particular problems for which there are no pre-defined alerts.

As discussed, SQL Monitor is highly configurable, and it is very easy to change alert settings, and to add new custom metrics and associated alerts.


Tuning alert settings

To customize any alert, either to turn it off or change its threshold settings, navigate to the Configuration page in SQL Monitor and select Alert settings to reveal the full list of host machine and SQL Server alerts.

To change the settings for any alert, simply click on it. Figure 36 shows the configuration page for the Long-running query alert.

2353-1-7cf92ee3-93d9-486f-9478-2ef50599e

Figure 36: Customizing alerts

We can turn an alert on or off, set the thresholds for when different types of alerts will fire and, for this particular alert, specify particular SQL process names or queries that we want to exclude from the alert. For example, let’s say that a once a month, a report runs that typically takes 30 minutes. Since we know this is always the case, and if there is nothing we can do about it, we can exclude it so that we won’t be alerted every time it runs.

At the bottom of the screen, not shown, we can specify if and where to send email notifications for this alert, as described previously.


Adding custom metrics

Creating custom metrics is a powerful feature and it is very easy to add one to the default metrics already built into SQL Monitor.

Simply navigate to Configuration | Custom metrics, and use a three page wizard to:

  1. Define the metric.
  2. Optionally create an alert on the metric.
  3. See a summary of the options you have selected as last review before it is created.

Define the custom metric

To start the Custom metric wizard, click Create custom metric. Figure 37 shows the top half of the screen. In this example, we’re creating a custom metric alert that fires when a database file increases in size. It’s a good practice not to rely on auto-growth to grow data files. Instead, we should size the data files at time of database creation to cope with the initial data load plus predicted growth over a reasonable period, and then grow the files manually, as required. With this practice in place, it’s wise to set an alert to notify us when a data file grows, because it indicates that the database has grown much faster than predicted, or that someone else has grown the file manually.

2353-1-a7e58ff9-1b82-40f0-ac9d-0eaf46ff4

Figure 37: Defining a custom metric to monitor data files growth events

We give the metric a name, a description, and then enter the Transact-SQL code that we want to run to collect the metric data. It is a good idea to create and test this Transact-SQL code in SSMS, first, to ensure that it not only works, but that it is as lightweight as possible. We want this code to run as fast as possible, using the least possible amount of server resources.

In the bottom half of the same screen, we specify the servers, instances, and databases for which the metric should collect data, and how often.

2353-1-45f63268-5fc3-466d-a4f5-762afa7e4

Figure 38: Defining instances to which the alert applies, and data collection frequency

In this example, the alert applies to all user databases on selected SQL Server instances, and will collect data once every 30 minutes.

Data collection frequency

The more often a custom metric collects data, the more resources it will need. Collect data only as often as is truly necessary.

On this screen, we also specify whether SQL Monitor should display the collected value, or calculate a rate of change between the current and previous value. This option is often needed for counters that increment continuously whenever SQL Server is restarted.

Notice near the middle of Figure 38, a button called Test metric collection , which we can use to test that the metric works as intended. I highly recommend you perform this test to ensure that the metric returns exactly the data you expect.

Add a custom alert

On the second page of the wizard, we can create an alert associated with our metric and set the threshold values on which it will fire, and the severity of alert to raise at each threshold.

2353-1-09d0845d-3217-4755-b69e-95c7aa7c7

Figure 39: Defining an alert for a custom metric

Since the purpose of the alert here is simply to notify us of any growth event on any database, Figure 39 simply specifies a High severity alert if a data file grows by 1 MB/hr. Alternatively, if we were actually interested in the growth rate, we could set different severity alerts depending on the rate.

Create the custom metric and alert

The final step is simply to review the custom metric before creating it. If everything looks OK, click on Create metric and alert, and they will act just like any of the built-in metrics and alerts, and will begin working immediately.

The SQLMonitorMetrics website

Redgate maintains the http://sqlmonitormetrics.red-gate.com/ website, where users can submit, comment on and help improve a community-maintained set of custom metrics, all of which are tested thoroughly by the development team at Redgate, before posting to the site.

On the site, you’ll find ready-made and tested custom metrics in categories ranging from auditing to detecting I/O issues to replication and security. If you find a metric that would be useful in your environment, simply click on Install metric to install it automatically in SQL Monitor. Likewise, if you’ve developed a metric you think others would find useful, submit it for consideration for inclusion on the website.


Other Maintenance and Configuration options

Although we can start using SQL Monitor “out of the box”, most DBAs will want to perform some minor additional configuration, including:

  • Determining how often to purge historical data
  • Setting Maintenance windows to avoid alerting during these periods
  • Organizing monitored servers into logical groups
  • Assigning users different access permissions depending on their requirements

We do all of this work via the Configuration section SQL Monitor’s web interface.


Purging historical data

A very important consideration is how much historical data to store because, over time, SQL Monitor will store a high volume of monitoring data, both host machine data and SQL Server data, as well as basic alert data and SQL trace data, if enabled.

SQL Monitor offers granular control over the length of time to retain different categories of data, defaulting to between 1-2 months for various types of machine data, 1-2 months for various types of SQL Server data, 1 month for basic alert data and 3 days for SQL trace data.

In most cases, you’ll want to keep a month’s worth for most types of data, which will make it easier to view trends over the typical business cycle of a month.

To customize the default data purging settings, go to Configuration | Data purging.


Configuring groups

We can group monitored SQL Server instances into logical groups, such as “Production” and “Development”, or into whatever groups make sense in your environment.

Having done this, SQL Monitor will:

  • Display the related instances together
  • Filter and configure Alerts for all of the instances in the group at one time, which can save work

To create a group, navigate to Configuration | Groups, or Overviews | Manage groups.


Defining user roles

We can assign users to different user roles depending on their required level of access, as follows:

  • Administrator – unrestricted access to every feature in SQL Monitor
  • Standard user role – can manage and configure alerts, view most administrative screens, and view reports
  • Read-Only user role – cannot do any configuration, but can view most areas

We create the administrator role on initial startup, and if you’re the only DBA who needs to access SQL Monitor then you won’t need to create additional roles. To allow others access, but with restricted privileges, simply navigate to Configuration | Manage user roles, define passwords for the other roles, and supply them to the users who require access. If you opt to use Active Directory authentication, you’ll probably only create the administrator role, which will be needed if you ever need to revert to SQL Monitor authentication. Other roles will be managed through Active Directory group permissions.


Setting maintenance windows

Most SQL Server instances will have maintenance windows where a lot of additional server activity will occur, such as backups, rebuilding indexes, and running DBCC CHECKDB, all of which can place an extra burden on the server’s resources, and could trigger some performance-related alerts that are essentially “false positives”.

To avoid this situation, we can set a maintenance window for each monitored server, which tells SQL Monitor not to trigger any performance, but not operational, alerts during this time interval, although monitoring still continues. Unfortunately, it is not currently possible to disable only specific alerts, over specific periods, or set different thresholds for alerts during those periods (such as while database backups are running at 2AM).

To set a maintenance window for the host machine, navigate to Configuration | Monitored servers, and use the Actions dropdown.


Custom SQL Monitor reporting

In addition to analyzing metrics directly in the SQL Monitor web interface, we can also build SQL Server Reporting Services (SSRS) reports that access the SQL Monitor database directly. There is not room in this article to discuss the topic in detail, but Redgate provides a free SSRS Pack to help in getting started, which includes ready-made SSRS reports on CPU and memory usage, database growth trends, general server properties, and more.


Summary

The main point of any monitoring effort, whether homegrown or using a dedicated tool such as SQL Monitor, is to be able to spot problems as quickly as possible, and respond to them before they affect too many people, and to avoid certain problems by predicting when they are likely to occur and taking evasive action.

Our overall goal is to troubleshoot the issues we discover and so improve the performance of the servers and the applications that use them, and to start to see a downward trend in the number of alerts raised, and optimization of resource usage across the servers.

Keep up to date with Simple-Talk

For more articles like this delivered fortnightly, sign up to the Simple-Talk newsletter

This post has been viewed 23917 times – thanks for reading.

Tags: , , , , , , , , ,

  • Rate
    [Total: 12    Average: 4.7/5]
  • Share

Tony Davis is an Editor with Red Gate Software, based in Cambridge (UK), specializing in databases, and especially SQL Server. He edits articles and writes editorials for both the Simple-talk.com and SQLServerCentral.com websites and newsletters, with a combined audience of over 1.5 million subscribers. You can sample his short-form writing at either his Simple-Talk.com blog or his SQLServerCentral.com author page.

As the editor behind most of the SQL Server books published by Red Gate, he spends much of his time helping others express what they know about SQL Server. He is also the lead author of the book, SQL Server Transaction Log Management.

In his spare time, he enjoys running, football, contemporary fiction and real ale.

View all articles by Tony Davis

Grant Fritchey is a SQL Server MVP with over 20 years’ experience in IT including time spent in support and development. Grant has worked with SQL Server since version 6.0 back in 1995. He has developed in VB, VB.Net, C# and Java. Grant has authored books for Apress and Simple-Talk, and joined Red Gate as a Product Evangelist in January 2011. Find Grant on Twitter @GFritchey or on his blog.

View all articles by Grant Fritchey

  • Anonymous

    You lost me
    You lost me at Windows XP as a "real-world" server.

  • David Lewis

    I wonder who is the aikido practitioner
    Just a guess, based on the names of the server/db….
    Nice article, thanks for the overview.

  • SDCSQL

    Wish list
    SQL Monitor is a great tool. I just wish you could save analysis plots instead of having to rebuild them every time. You could make a dashboard with them. It’d be nice.