Click here to monitor SSC
  • Av rating:
  • Total votes: 26
  • Total comments: 1
Warwick Rudd

AlwaysOn Availability Groups - What Not to do when Adding Databases

10 March 2014

SQL Server's AlwaysOn Availability Groups provide a very resilient way of providing High-availability for SQL Server databases, but there are inevitable limits to their capacity. How many databases can you reasonably add? It depends on the resources available and the workload, but you can come up with a reasonable estimate  as Warwick Rudd explains

AlwaysOn Availability Groups (AGs) were first introduced as a new High Availability (HA) / Disaster Recovery (DR) feature with the release of SQL Server 2012. In this article I am going to assume you already have set up and configured AGs in your environment. If you need to know how to setup and configure your AGs have a look at my previous article “SQL Server 2012 AlwaysOn”.

Like any new feature or technology without understanding how it works can make it easy to make mistakes and cause problems in your environment. This feature is definitely one that should not fall into the PILOT (Production In Lieu Of Testing) category. Careful planning and testing should be undertaken to ensure you have covered all of your bases.

Recently I was asked what issues might emerge from the process of adding a large number of databases to an AG. In this case, a two-node Windows Server Failover Cluster (WSFC), configured on virtual machines, was being used with four vCPUs on each of the cluster nodes. A script was put together to add over 300 databases to a single AG on this virtual machine setup. During the running of the script, some issues were experienced which left those databases that were added in a variety of states of play. The dashboard was showing some of the databases to be offline, others online, while some on the replica were not connected to the AG. This was a case of trying to over-utilise the resources that were available, and the WSFC not coping. The upshot was that all the databases could be recovered, because this was the initial addition of the replica databases, and all had full backups and associated transaction log backups. However, the design needed to be re-evaluated to meet the HA/DR requirements. This work led me to put this article together to mention some of the things you should not do with AGs.

When you implement AGs, you must remember that there is a limit to the number of AGs and availability database per server, either physical or virtual. The limit to the number of databases isn’t a hard-and-fast number because it depends on the resources available and the workload that will be undertaken. With the release of SQL Server 2012, Microsoft extensively tested a hundred databases spread evenly across ten AGs (ten databases per AG on average) per physical machine. The first signs of overload in your server environment from having too many databases configured will be Worker Thread exhaustion and a slow response times for system views or DMVs. -see MSDN’s ‘Prerequisites, Restrictions, and Recommendations for AlwaysOn Availability Groups (SQL Server)’.

Worker Thread Exhaustion

It is not just with AGs that you’ll see error messages related to Worker Thread Exhaustion, but when you do, it is likely to be a symptom of adding too many databases to your AG environment. Your system will become unresponsive and, if you try to open a new connection to your instance of SQL Server through SSMS, you will more than likely be met with an error message such as “Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding”. Alternatively you could investigate the SQL Server error log and you may see a message like “New queries assigned to process on Node x have not been picked up by a worker thread in the last x seconds.”

If you do experience such an error, you may then have some difficulty in connecting to the affected instance. If you do, you could restart your instance of SQL Server: This will clear all connections and so should allow you to connect. If not, you can still gain access if you are administrator by using the Dedicated Admin Connection (DAC). Once you have a connection you can go looking for the ID of the troublesome session. More than likely, this will be your process that is trying to add the three hundred Databases.

Worker Threads

A worker thread is what is used by SQL Server to perform some form of processing. A number of factors will determine the number of worker threads available on your system. They are created automatically by SQL Server as required. SQL Server has an instance-level configuration that is used to limit the number of worker threads available to SQL Server processes.

Max Worker Threads

The ‘Max Worker Threads’ configuration option is at the Instance-level and is found in sys.configurations. It can be set or changed using the sp_configure system stored procedure. This configuration option enables SQL Server to create a pool of worker threads that are then available to service a large number of SQL Server processes: This may improve performance. By default this value is set to 0. A value of 0 allows SQL Server to automatically configure the number of worker threads at start up time.

The following table shows the number of worker threads that are automatically-configured based on the number of CPUs present on the server:

Number of CPUs

32-bit

64-Bit

<= 4 Processors

256

512

8 Processors

288

576

16 Processors

352

704

32 Processors

480

960

64 Processors

736

1472

128 Processors

4224

4480

256 Processors

8320

8576

As this article is looking at the use of AGs we will only be concerned with the 64-bit worker thread numbers.

AG Worker Thread Requirements

There are a number of factors to consider when undertaking the capacity-planning phase for your AG environment. One such factor is the amount of resources each replica will require to provide appropriate performance for the system. To work this out, you need to take into account not only the normal activity that your environment will undertake but also the number of databases that will participate in the data transfers from your primary replica to your identified secondary replicas. There are, of course, many other factors as well that need to be taken into consideration, and some of them are covered by Jeremiah Peschka’s article “AlwaysOn Availability Groups: The Average of its Parts” but for this article we are only concerned with working out the scale of the resources in terms of the Worker Threads.

AGs have the following worker thread requirements (As quoted from “Prerequisites, Restrictions, and Recommendations for AlwaysOn Availability Groups (SQL Server) – Thread Usage by Availability Groups”):

  • An idle instance of SQL Server, AlwaysOn Availability Groups uses 0 threads
  • The maximum number of threads used by Availability Groups is the configured setting for the maximum number of threads (‘max worker threads’) minus 40
  • The availability replicas hosted on a given server instance share a single thread pool (HADR Worker Pool)
  • Threads are shared on an on-demand basis :
    • Typically there are 3-10 shared threads, but this can increase depending on the primary replica workload
    • If a given thread is idle for a while it is released back into the general SQL Server thread pool. Normally, an inactive thread is released after ~ 15 seconds of inactivity. However, depending on the last activity, an idle thread might be retained longer
  • In addition, availability groups use unshared threads as follows:
    • Each primary replica uses 1 Log capture thread for each primary database. In addition, it uses 1 Log send thread for each secondary database. Log send threads are released after ~ 15 seconds of inactivity.
    • Each Secondary replica uses 1 redo thread for each secondary database. Redo threads are released after ~ 15 seconds of inactivity
    • A backup on a secondary replica holds a thread on the primary replica for the duration of the backup operation.

Even though AGs are an extension of the database mirroring technology, the worker thread usage for AGs is different than with database mirroring. Database mirroring used dedicated threads per database compared to the request queue and worker pool (HADR Worker Pool) to handle the requests. Bob Dorr gives a really in-depth account of how the worker pool is utilised by AGs. You can read more in his article – HADRON Learning Series: Worker Pool Usage for HADRON enabled Databases.

There are other worker thread requirements beyond these. The minimum number of work threads required just to facilitate having AGs configured can be calculated from

  • the number of AGs you have configured in your instance of SQL Server
  • the number of availability databases in each of the AGs
  • the number of availability replicas (2-5 replicas. A maximum of 4 secondary replicas with SQL Server 2012)

To calculate the minimum worker thread pool size required, for this article our environment will be configured as follows:

  • One Primary Replica
  • One Secondary Replica
  • One Availability Group
  • One Hundred Availability Databases

The information that we will use in the minimum pool size calculation is:

  • Database Count (DC)
  • Secondary Replica Count (SRC)
  • Log Capture Thread (LCT)
  • Log Send Thread (LST)
  • Message Handler Thread (MHT)

For each database participating in an AG, one LCT is used to capture the transactions occurring on the database. One LST is required for each secondary replica in the AG. To allow the process to work at least one MHT is required to handle the communication occurring between replicas.

The algorithm we are going to use as outlined by Bob Dorr’s Article - HADRON Learning Series: Worker Pool Usage for HADRON enabled Databases is:

  • Minimum Pool size = (DC x (LCT + (LST x SRC))) + MHT

In a worst case scenario for our environment mentioned above all one hundred replica databases are actively being used.

  • Minimum Pool Size = (100 x (1 + (1 x 1))) + 1
  • Minimum Pool Size = (100 x (1 + (1))) + 1
  • Minimum Pool Size = (100 x 2) + 1
  • Minimum Pool Size = 201

The Minimum Pool size required for our environment in this article would be 201. If we then looked at the worker thread table above, we’d see that the number of threads available to the system on a 64 bit 4 cpu server would be 512. From the AG requirements outlined earlier in this article “The maximum number of threads used by Availability Groups is the configured setting for the maximum number of threads (‘max worker threads’) minus 40” we can calculate the maximum available for your AGs would be 472 (512 – 40). Our calculation shows we will need 201 threads to service AG workload however this is not taking into account the number of threads that would be required for the transactional activity that will be occurring on the system or the number required for performing backups on the system. So we can see that the number of threads available are being used up very quickly not leaving any available resources for workload utilisation spikes.

If you try to add too many availability databases to an AG via a script, you will run out of worker threads very quickly. Your script will lose its connection and will not complete the process. Your AG environment will have databases sitting in a state where they are not part of the AG. In order to get back online with at least your primary replica databases, you will need to drop your AG(s) and restore, where necessary any irrecoverable databases on the primary replica with the backups taken as part of the initial AG setup process.

Because of the risk of things going wrong, any deployment into an AG needs to be carefully planned, with a prior calculation of the resources required, especially when these AGs are part of your production environment. Make sure you understand your existing capacity requirements and forecast the potential growth in activity or utilisation so you are able to identify appropriate server resources to meet your needs.

References

Other useful resources

Warwick Rudd

Author profile:

Warwick Rudd is a Microsoft Certified Master – SQL 2008, MVP, MCT and the Principal Consultant at SQL Masters Consulting (http://www.sqlmastersconsulting.com.au). Warwick is a frequent speaker at local SQL Server User Groups and SQL Saturday events in Australia and New Zealand. You can find Warwick online with his blog (http://www.sqlmastersconsulting.com.au/blog) or on Twitter (@Warwick_Rudd). When he is not playing with the SQL Server Stack, he likes to get away to the cold and spend time snowboarding.

Search for other articles by Warwick Rudd

Rate this article:   Avg rating: from a total of 26 votes.


Poor

OK

Good

Great

Must read
Have Your Say
Do you have an opinion on this article? Then add your comment below:
You must be logged in to post to this forum

Click here to log in.


Subject: CPU worker thread capabilities
Posted by: PaulGoldstraw (view profile)
Posted on: Friday, March 4, 2016 at 3:42 AM
Message: Hi Warwick,

I've read this article a number of times as we're currently experiencing issues with AlwaysOn that we strongly suspect could be related to the number of worker threads available. We have a significant number of databases on the system and we frequently see the primary saying it is disconnected from the secondaries, we see flow-control kick in, and taking one of the secondaries out significantly alleviates the issues. We also see frequent THREADPOOL waits (though no threadpool starvation), but having increased the maximum worker threads already, we've not really seen that make a difference. What's interesting is that we don't see the worker threads ever really hit the limit we've imposed, suggesting that perhaps we're hitting some sort of physical limitation of the CPU; that it simply cannot spin up enough worker threads fast enough.

This has led me down the road of considering how I would assess the capability of our hardware. The table listed above specifies defaults for machines with a given number of cores. I presume these are quite conservative estimates with some overhead to push harder if necessary, but how do I know how many worker threads are too many? Do you have any thoughts on

a) whether THREADPOOL waits and the AlwaysOn issues described above indicates a shortage of worker threads?

b) whether the fact we don't ever hit our defined limit set in SQL Server suggests that we're hitting the limitations of the CPU before we're hitting the SQL Server defined limit, and that the only way to resolve this is to throw more cores at the problem?

Any thoughts you have on the matter would be appreciated

 
Simple-Talk Database Delivery

DLM
Patterns & Practices Library

Visit our patterns and practices library to learn more about database lifecycle management.

Find out how to automate the process of building, testing and deploying your database changes to reduce risk and make rapid releases possible.

Get started

Phil Factor
Automatically Creating UML Database Diagrams for SQL Server

SQL Server database developers seem reluctant to use diagrams when documenting their databases. It is probably... Read more...

 View the blog

Top Rated

In-Memory OLTP - Row Structure and Indexes
 There are several decisions to be made when designing indexes for Memory-optimized tables in In-Memory... Read more...

In-Memory OLTP – Understanding Memory-Optimized Tables
 How do you get started with In-memory OLTP? Murilo Miranda first advises on the setup of the server and... Read more...

Automatically Creating UML Database Diagrams for SQL Server
 SQL Server database developers seem reluctant to use diagrams when documenting their databases. It is... Read more...

SQL Server Security Audit Basics
 SQL Server Server Audit has grown in functionality over the years but it can be tricky to maintain and... Read more...

The SQL Server 2016 Query Store: Analyzing Query Store Performance
 There are some obvious advantages to having the Query Store, but what is the performance impact that it... Read more...

Most Viewed

Beginning SQL Server 2005 Reporting Services Part 1
 Steve Joubert begins an in-depth tour of SQL Server 2005 Reporting Services with a step-by-step guide... Read more...

Ten Common Database Design Mistakes
 If database design is done right, then the development, deployment and subsequent performance in... Read more...

Temporary Tables in SQL Server
 Temporary tables are used by every DB developer, but they're not likely to be too adventurous with... Read more...

SQL Server Index Basics
 Given the fundamental importance of indexes in databases, it always comes as a surprise how often the... Read more...

Concatenating Row Values in Transact-SQL
 It is an interesting problem in Transact SQL, for which there are a number of solutions and... Read more...

Why Join

Over 400,000 Microsoft professionals subscribe to the Simple-Talk technical journal. Join today, it's fast, simple, free and secure.