Latch Waits. What they mean, and when are they bad?

Recently our shop migrated to the new Version 11 of Dell’s Spotlight on SQL Server Enterprise, our monitoring tool.  Immediately we began seeing alerts that read like this:

The latch wait time rate is 7,500 ms per second.

Sounds ominous!  You are waiting 7.5 seconds every second to get a needed latch?!  Why is this happening to my servers?  They seem alright, because queries are moving along nicely and the users are happy.  However, this bad statistic must mean that there is a massive latent problem that this new default alert in the monitoring tool caught.

Research time!

What is a latch wait?

Well, first, we should define a latch.  A latch is a type of lock that protects access to SQL Server’s structures in memory like the buffer pool or SQL Server’s internal data structures.  They’re vitally necessary to keep integrity on in memory data pages.

Latches are serialized.  One latch completes work before the next latch is granted.  If a process needs a latch and one cannot be obtained, it is suspended until the latch becomes available.

Now, with that description, the latch wait seems even more ominous!  Waiting 7.5 milliseconds for each latch of memory?  Yipes!  Anything more than a microsecond’s wait is an eternity – correct?

Well, let’s use a DMV – sys.dm_os_latch_stats to see what’s going on:

latch_stats

I looked at the total wait time quickly, and see that ACCESS_METHODS_DATASET_PARENT is the predominant latch wait on this server.  Paul Randal wrote a wonderful post that explains many of these latch waits.  This particular wait is created by parallelism.  As Paul warns in his post, let’s not go bananas and set MAXDOP to 1.

Instead, the first thing to check is whether hyperthreading was accidentally enabled on this server.  In this case, the answer appears to be no, but the sysadmins need to check this (you can’t do it from SQL but there is a VB Script out there that can detect it — if your sysadmins will let you run that script on a production server).

While the Sysadmins investigate, I note that the second most prevalent class is BUFFER.  This is the classic wait for an in memory page or structure.  This one seems to be one that I can live with, especially since ACCESS_METHODS_DATASET_PARENT is responsible for over 90% of the wait time and also the highest max wait times.

As Paul’s article tells us, the NESTING_TRANSACTION_READONLY and NESTING_TRANSACTION_FULL latch classes are related to (duh) nested transactions.  Parallel nested transactions.

What My problem was

In my case, it seems that parallelism is the cause.  Hyperthreading is not, in this case, the problem.  SSIS also runs packages on this server to do ETL work.  We have four sockets with two cores each.  What I did to resolve this issue was to configure max degree of parallelism to 4, allowing four cores to be free to handle the SSIS packages and other SQL traffic.

We are seeing an incremental improvement in query performance on this server.  What was good is now a little better.  The alert hasn’t showed up since we changed configuration.

Does the new alert help?

Short answer:  yes.  However, this alert needs to be tuned server-by-server in Spotlight, and may need to be ignored for certain test servers.

I will take this opportunity to strongly encourage every DBA who manages more than five SQL Servers to review all of the available monitoring products out there.  I like Spotlight.  You might like SQL Sentry or Idera or RedGate.  Find the one that works best with your style of DB management and implement it.  Be ready to tune the thing for 30-120 days.

Spotlight did me right with the latest alert in v11.

Leave a Reply

Your email address will not be published. Required fields are marked *