Data, data everywhere…
Project Lucy users who run Spotlight on SQL Server have been uploading SQL Server wait data since May 2011. We now have hundreds of millions of samples of wait data from thousands of servers. We recently pushed an update to the Project Lucy home page that (amongst other things) displays a breakdown of the SQL Server population by hardware manufacturer. This visualisation also splits the instances into two groups according to their performance ‘health’. To do this we came up with a model that calculates the performance health of a SQL Server instance based on wait data.
The Nature of the data
We are using wait statistics from SQL Server within this model. A variant of this model is well known in Oracle circles, and we wanted to see if it could be applied to SQL Server. To do this we needed data, which is lucky, because Spotlight on SQL Server users who have opted in to uploading data to Project Lucy contribute four samples of wait stats data per day for each server that Spotlight monitors. Since our users have been uploading this data since May 2011, we now have a large corpus to play with.
Currently the model evaluates three variables.
CPU wait and I/O wait – the presumption being that if you are waiting on these, at least you are doing ‘work’. CPU wait is simply signal wait: a wait for CPU time after having waited on something else. I/O wait consists of a wait on any of the following wait types:
- ASYNC_DISKPOOL_LOCK, ASYNC_IO_COMPLETION, CHKPT, DISKIO_SUSPEND
- FCB_REPLICA_READ, FCB_REPLICA_WRITE, IO_COMPLETION
- LOGBUFFER, LOGMGR, LOGMGR_FLUSH, LOGMGR_RESERVE_APPEND
- PAGEIOLATCH_DT, PAGEIOLATCH_EX, PAGEIOLATCH_KP, PAGEIOLATCH_NL, PAGEIOLATCH_SH, PAGEIOLATCH_UP
- SOAP_READ, SOAP_WRITE, WRITELOG
‘Bad’ waits are the waits that tell you that you are waiting for something that is stopping you from actually doing any useful work. There are a number of categories of ‘bad’ wait:
- Lock (LCK_M_*)
- Latch (LATCH_*, PAGELATCH_*, TRAN_MARK_*, TRAN_MARKLATCH_*)
- CLR (CLR_*, CLRHOST_*, SQLCLR_*)
- Other wait types (too numerous to list)
We exclude Network waits (ASYNC_NETWORK_IO, NETWORKIO) from our model, since these are more indicative of client-side issues.
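To make the categorisation concrete, here is a minimal sketch (in Python, purely illustrative – not the actual Project Lucy pipeline) of how the wait types above could be bucketed. The `bad/other` catch-all stands in for the wait types too numerous to list:

```python
# Wait-type lists taken from the text above. CPU wait is measured via
# signal wait time rather than a wait type, so it is not classified here.
IO_WAITS = {
    "ASYNC_DISKPOOL_LOCK", "ASYNC_IO_COMPLETION", "CHKPT", "DISKIO_SUSPEND",
    "FCB_REPLICA_READ", "FCB_REPLICA_WRITE", "IO_COMPLETION", "LOGBUFFER",
    "LOGMGR", "LOGMGR_FLUSH", "LOGMGR_RESERVE_APPEND", "PAGEIOLATCH_DT",
    "PAGEIOLATCH_EX", "PAGEIOLATCH_KP", "PAGEIOLATCH_NL", "PAGEIOLATCH_SH",
    "PAGEIOLATCH_UP", "SOAP_READ", "SOAP_WRITE", "WRITELOG",
}
EXCLUDED_WAITS = {"ASYNC_NETWORK_IO", "NETWORKIO"}  # client-side, ignored

def classify_wait(wait_type: str) -> str:
    """Bucket a single wait type per the model's categories."""
    if wait_type in EXCLUDED_WAITS:
        return "excluded"
    if wait_type in IO_WAITS:
        return "good/io"
    if wait_type.startswith("LCK_M_"):
        return "bad/lock"
    if wait_type.startswith(("LATCH_", "PAGELATCH_",
                             "TRAN_MARK_", "TRAN_MARKLATCH_")):
        return "bad/latch"
    if wait_type.startswith(("CLR_", "CLRHOST_", "SQLCLR_")):
        return "bad/clr"
    return "bad/other"  # everything else counts as 'bad' wait
```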
I/O Subsystem Response
Databases are large data pumps. They exist to get data into and out of permanent storage as fast as possible. Performance is definitely going to suffer if the database cannot get pages on or off the disk fast enough, so we need to include in our model a measure of the responsiveness of the I/O subsystem. We are using Single Page Read latency which is simply the average time per wait for PAGEIOLATCH_* waits.
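As a sketch, the latency calculation boils down to total PAGEIOLATCH_* wait time divided by the number of such waits. The rows below are made-up illustrative values shaped like `sys.dm_os_wait_stats` output (wait type, waiting tasks count, wait time in ms):

```python
def single_page_read_latency_ms(rows):
    """Average ms per wait across all PAGEIOLATCH_* waits."""
    total_ms = sum(ms for wt, n, ms in rows if wt.startswith("PAGEIOLATCH_"))
    total_waits = sum(n for wt, n, ms in rows if wt.startswith("PAGEIOLATCH_"))
    return total_ms / total_waits if total_waits else 0.0

rows = [
    ("PAGEIOLATCH_SH", 1000, 20_000),  # 1,000 waits totalling 20s
    ("PAGEIOLATCH_EX",  500, 16_000),
    ("WRITELOG",        300,  2_400),  # not a page read, ignored
]
print(single_page_read_latency_ms(rows))  # (20000+16000)/(1000+500) = 24.0
```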
In English please…?
In simple terms, we are positing that as long as your single page read latency is low and the majority of time spent waiting is spent waiting on CPU or I/O (i.e. you’re actually doing work), then you have no performance bottlenecks.
Doing the Math
Now we are going to use our model to calculate a score for a SQL Server. Currently we treat a score above 80 as ‘everything is OK’. As an example, let’s take a SQL Server instance that has a single page read latency of 24ms and spends 50% of total wait time waiting on CPU and 40% waiting on I/O.
We start with a score of 100 for Single page read latency. This score is reduced by 20 for each increment of 5ms above 5ms.
In the case of our example, the score is 20 because our single page read latency of 24ms falls in the 20–25ms band.
We start with a score of 100 for CPU + I/O. For each 5-point band that the sum of CPU and I/O wait, expressed as a percentage of total wait, falls below 95%, we subtract 20 (a sum of 95% or above scores the full 100).
Again, in our case we have figures of 50% and 40% respectively, for a total of 90% CPU + I/O as a percentage of total wait. 90% falls into the 90–95% band, so the second score becomes 80.
These two scores are then equally weighted to produce a single score. (20 + 80) / 2 = 50.
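Putting the two band rules together, the whole calculation can be sketched as follows. The exact handling of band boundaries is our reading of the rules above, so treat this as illustrative rather than the production scoring code:

```python
import math

def latency_score(latency_ms):
    # 100, minus 20 for each started 5ms band above 5ms; floored at 0.
    return max(0, 100 - 20 * math.ceil(max(0, latency_ms - 5) / 5))

def cpu_io_score(cpu_io_pct):
    # 100 for a CPU + I/O share of 95% or more, minus 20 for each
    # started 5-point band below 95%; floored at 0.
    return max(0, min(100, 100 - 20 * math.ceil((95 - cpu_io_pct) / 5)))

def performance_score(latency_ms, cpu_pct, io_pct):
    # The two sub-scores are equally weighted.
    return (latency_score(latency_ms) + cpu_io_score(cpu_pct + io_pct)) / 2

# The worked example: 24ms latency -> 20, and 50% + 40% = 90% -> 80.
print(performance_score(24, 50, 40))  # (20 + 80) / 2 = 50.0
```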
In this case we would argue that the performance of this SQL Server is not OK because even though it is doing work, it cannot get pages on or off the disk fast enough because the I/O subsystem is not fast enough.
Running our algorithm over the whole corpus gave us the following distribution of performance scores. As you can see the majority of SQL Servers in the population are OK in terms of performance.
It is interesting to note the ‘hump’ around the 50 mark. This leads us to think that our algorithm could use some refinement. Perhaps single page read latency is less important when there is minimal I/O occurring, so the weighting for this part of the score should be lower under these conditions. Feedback and discussion on this is welcome.
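One possible (purely hypothetical) shape for such a refinement would be to scale the latency weight by the share of total wait that is I/O wait. The `0.25`/`0.5` constants below are arbitrary placeholders, not values we have validated:

```python
def blended_score(latency_sub_score, cpu_io_sub_score, io_share):
    # io_share: fraction of total wait time that is I/O wait (0.0 - 1.0).
    # The latency sub-score's weight shrinks when little I/O is happening,
    # since single page read latency matters less on a quiet I/O subsystem.
    w = 0.25 + 0.5 * io_share  # latency weight ranges 0.25 .. 0.75
    return w * latency_sub_score + (1 - w) * cpu_io_sub_score
```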
Real world examples
So what does a SQL Server with a ‘good’ performance score look like? Here is an example. This server got a score of 100. As you can see, the only wait is ‘good’ wait which indicates that the SQL Server is doing actual work and that there is no contention for other resources getting in the way. In addition, the single page read latency is 1ms which means that there is no I/O subsystem bottleneck.
What about a SQL Server with a ‘bad’ performance score? Here is an example with a score of 30. Note the large proportion of time spent waiting on latches. Whilst these waits are occurring, less time is available to do useful work. Also note the large single page read latency – 22ms is an unacceptably long time to wait for an I/O to complete. This server clearly has both contention issues and I/O subsystem issues.
Here is an example showing a performance score of 50. This server has a low single page read latency which is good, but is spending a lot of time waiting on Locks and Latches. It is possible that the I/O read latency is low because the contention on Locks and Latches means that not enough I/O is being done to saturate the I/O subsystem.
This model seems to be a reasonable way to quickly gauge whether a SQL Server instance has any performance bottlenecks and what those bottlenecks are. Note that at this stage the model does not make any judgement about whether the work being done is ‘useful’. That is, even if there are no bottlenecks, response times may still be unacceptable because of application design issues.
We welcome any comment or feedback you might have.