Page Life Expectancy — A Metric on the Wane?

For years, we’ve seen the metric of Page Life Expectancy (PLE) touted as a health barometer for SQL Server. Moreover, we have seen PLE treated as a magic number: “300 seconds (5 minutes) means that you have a generally healthy SQL Server.” About 5 years ago, Jonathan Kehayias and Paul Randal both spoke strongly against this notion, and Jonathan argued that a good number is better determined by this formula:

 (DataCacheSizeInGB / 4GB) * 300

Looks nifty, but the formula dates to 2011, when the wide availability of SSD was more of a pipe dream than the reality it is now. Even more so, all-memory SANs such as the Violin Memory line-up give a whole new meaning (or perhaps no meaning at all) to the PLE metric itself.
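If you want to check your own servers against the formula, both inputs are exposed in sys.dm_os_performance_counters. Here is a minimal sketch, assuming a single NUMA node (on multi-node boxes, the per-node Buffer Node counters are worth a look too):

-- Current PLE and the Kehayias target, from the Buffer Manager counters.
-- 'Database pages' is the size of the data cache, in 8KB pages.
DECLARE @PLE bigint, @CachePages bigint;

SELECT @PLE = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Page life expectancy';

SELECT @CachePages = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%'
  AND counter_name = 'Database pages';

SELECT @PLE AS CurrentPLE,
       @CachePages * 8 / 1048576.0 AS DataCacheGB,
       (@CachePages * 8 / 1048576.0) / 4 * 300 AS KehayiasTargetPLE;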

Is PLE relevant anymore?  A case study.

Here is a server; we will call it A. A runs SQL Server 2008, fully patched (an application requirement prevents me from upgrading it to 2014), has 97GB available to SQL Server, and is attached to a SAN with three tiers of storage:

  • SSD (used for high performance tables and indexes in main OLTP app)
  • 15K (used mainly for master, msdb, model and lower-performance databases)
  • SATA (slow – used for storage of BLOB data)

This server presents with a consistent 25-27 minute PLE – about 1500-1620 seconds. The buffer cache hovers at a very reliable 87-88GB. The main OLTP database is near 3TB and uses storage in the SSD and SATA tiers, keeping only the BLOBs on SATA. Also, tempdb is entirely on SSD.

Sounds healthy under the old “300 second” rule.  However, under the Kehayias formula above, we come up with a preferred PLE of:

(87 / 4) * 300 = 6525

Wow.  According to this, the PLE should be 4 1/3 times longer!  This server is on the brink of setting fire to the SAN!  The new outfit for data center workers near this SAN:

[Image: fire suit]

PLE is a measurement of how long a page sits unreferenced in memory, on average.  The idea is to keep enough pages in memory so that SQL Server finds the data page there and does not need to perform physical I/O to retrieve data.

But how important is that in an era when SSD is becoming commonplace and some SANs actually manage a block’s location based on the amount of traffic it sees (think Compellent as a pioneering example)? A physical read from SSD is fast. Does it choke up the SAN bus and fabric with waits on mechanical disks, as it would have in many SANs five years ago and more? I would venture to say no.

The old 300-second interpretation says that my server is healthy, and the empirical data seem to agree: even big queries run quickly as far as the users are concerned. The more rigorous Kehayias formula says the PLE should be far higher. However, hardware advances seem to be challenging even that refinement in the present day.

Here are some other metrics from that same server (a quick way to pull similar numbers yourself follows the list):

  • Roughly 327 logical reads per physical read
  • 180 compiles per second
  • 50-60 log flushes per second
  • Checkpoints at 120 pages per second
  • Read-ahead at 120 pages per second
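Most of these are cumulative counters, so a single read of sys.dm_os_performance_counters won’t give you a rate; you take two snapshots and divide by the interval. A sketch over a 10-second window (the logical-to-physical read ratio is ‘Page lookups/sec’ divided by ‘Page reads/sec’):

DECLARE @c1 TABLE (counter_name nchar(128), cntr_value bigint);
DECLARE @c2 TABLE (counter_name nchar(128), cntr_value bigint);

-- Snapshot 1
INSERT @c1
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Page lookups/sec', 'Page reads/sec',
                       'SQL Compilations/sec', 'Checkpoint pages/sec',
                       'Readahead pages/sec')
   OR (counter_name = 'Log Flushes/sec' AND instance_name = '_Total');

WAITFOR DELAY '00:00:10';

-- Snapshot 2, then diff to get per-second rates
INSERT @c2
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Page lookups/sec', 'Page reads/sec',
                       'SQL Compilations/sec', 'Checkpoint pages/sec',
                       'Readahead pages/sec')
   OR (counter_name = 'Log Flushes/sec' AND instance_name = '_Total');

SELECT c2.counter_name,
       (c2.cntr_value - c1.cntr_value) / 10.0 AS per_second
FROM @c2 c2
JOIN @c1 c1 ON c1.counter_name = c2.counter_name;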

My options with this server are to:

  1. Do nothing and be happy.
  2. Spend a few grand to increase RAM to 256GB and thereby increase the SQL Server buffer cache to ~200GB.
  3. Go nuts and replace this machine with a $200,000 monster server with terabytes of RAM.

I think that PLE as a metric tells me little in this case unless I apply Jonathan’s formula to it.  I also think that Jonathan’s metric tells me to add some RAM to the server, and we will see an incremental boost in throughput.  I don’t think it will be massive; I also don’t think it will be negligible.  The $6000 we will spend will be well worth it.

Conclusion

PLE needs to be evaluated with more than a simple “300 second” threshold.  Jonathan Kehayias’ formula is a good starting point, but even more rigor is required.  The metric needs to be re-examined, and perhaps expressed as a ratio or something else that accounts for the size of RAM versus databases.

Right now, a fixed PLE threshold in a monitoring tool isn’t much use unless you tweak the alert for each server you monitor. Something to keep in mind as you juggle the servers, keep the plates spinning, and don the fire suit as you pass by the SAN.

More Benefits of a Good Monitoring Tool.

The importance of a good monitoring tool cannot be over-emphasized. I’ve posted on this subject before; now I’d like to go into a tad more detail. If you are responsible for more than 2-3 servers, you need the alerts and reports. However, there are other reasons to employ a good tool.

Note: This article features Dell Spotlight on SQL Server.  I like the tool a great deal, but there are other good tools out there.

Benefit: Good visual understanding of emergent situations

Have a peek at this screenshot. Since it comes from an actual production environment, certain information has been obfuscated.

[Screenshot: snap1]

In this particular situation, the tool informs me that the server has a process that has been blocked for a long enough period to trip the red alert level.

Clicking on the Blocked process icon gives us the details of the error, and clicking further gives us an analysis window:

[Screenshot: snap2]

This shows that the lead blocker is blocking one process and is sitting “AWAITING COMMAND.” Investigation revealed a massive update in progress from the application. Further checking was not necessary because the blocking cleared up about a minute later.

The above exemplifies a situation which a good DBA could have diagnosed by using sp_who2 and DBCC INPUTBUFFER on the lead blocker and the blocked task(s). However, that would take minutes where the tool took 4-5 seconds, and moreover, the DBA can examine the issue proactively rather than reactively (that is, after the angry user calls up with a hung session).
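For the script-inclined, a rough DMV equivalent (a sketch, not the tool’s own query) looks something like this:

-- Who is blocked, by whom, and what they are running.
SELECT r.session_id,
       r.blocking_session_id,
       r.status,
       r.wait_type,
       r.wait_time,
       t.text AS current_sql
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id <> 0;

-- The lead blocker may be idle ("AWAITING COMMAND") with no active
-- request, so fetch its last statement from the connection instead.
SELECT c.session_id,
       t.text AS last_sql
FROM sys.dm_exec_connections c
CROSS APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) t
WHERE c.session_id IN (SELECT blocking_session_id
                       FROM sys.dm_exec_requests
                       WHERE blocking_session_id <> 0);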

Let’s take a look at a slightly more critical scenario:

[Screenshot: snap3]

The above, my friends, is a classic “code brown” situation. The server showed up on the heat map (more on that later) as a giant red blob taking up over a quarter of the real estate on the screen. Why? Unless you zoomed in on the screenshot, you probably couldn’t see this:

[Screenshot: snap4]

That’s right, 27.6 seconds average I/O wait! You can guess that the SAN is being murdered by something truly massive that is whipping tempdb. As an aside, the SAN in question hosts 14 tempdb files on SSD drives; the bottleneck was the SAN fabric itself. Ever seen the classic picture of Brent Ozar standing in front of a porta-potty labeled “Temp DB”? He makes an important point, and with good reason.
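Numbers like that are easy to corroborate from the file-stats DMV; here’s a sketch of average stall per read and write by file (cumulative since instance start, so trend it rather than trust a single reading):

SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.io_stall_read_ms / NULLIF(vfs.num_of_reads, 0) AS avg_read_stall_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY avg_read_stall_ms DESC;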

In this case, there was a table-valued function querying a 100+ million row table twice and combining the results with a UNION (not UNION ALL). This function created a 100GB heap in tempdb and then proceeded to thrash that heap to eliminate duplicates for the UNION. I was one very upset DBA. More than once, this culprit blocked several production jobs, including peer-to-peer replication.
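If the distinction is fuzzy: UNION must de-duplicate the combined result, which on a set this size means an enormous distinct sort or hash backed by tempdb, while UNION ALL simply concatenates. A contrived sketch (dbo.BigOrders is hypothetical, not the actual function’s table):

-- UNION removes duplicates across the combined set; on a 100M+ row
-- table that means a huge distinct sort/hash spilling into tempdb.
SELECT CustomerID, OrderDate FROM dbo.BigOrders WHERE Region = 'East'
UNION
SELECT CustomerID, OrderDate FROM dbo.BigOrders WHERE Region = 'West';

-- UNION ALL just appends the second result set to the first: no
-- de-duplication, no tempdb work table. If the branches cannot overlap
-- (each row has exactly one Region), the results are identical.
SELECT CustomerID, OrderDate FROM dbo.BigOrders WHERE Region = 'East'
UNION ALL
SELECT CustomerID, OrderDate FROM dbo.BigOrders WHERE Region = 'West';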

We were able to diagnose this issue with minutes of work, and again, we were able to raise it proactively with the developers and ensure that fixing it was a high priority.

Benefit: Enterprise overview

This is vitally important.  When you have your monitoring tool set up properly, you should be able to look at the screen and see what’s up with the enterprise. Something like this:

[Screenshot: snap5]

The above screenshot reflects nirvana: all pretty green blocks. This tool monitors VMware clusters, Windows servers, SQL Server instances, and replication publishers: just over 40 hosts and just over 135 connections. The screenshot was zoomed out to hide the name and type of each block, again for reasons of confidentiality.

We keep a full-sized window with the tool open on a 42-inch monitor in the room where the DBAs sit, which lets us check the state of the enterprise at a glance. Even better, we can do it from our iPhones too:

[Screenshot: snap6]

We can take a look whenever we like and evaluate the state of the enterprise.

Benefit: Alerting

Any good tool won’t just sit there and wait for you to see a problem; it will send you an email (or even page your phone) when a situation requiring attention emerges.

[Screenshot: snap7]

We have our alerts configured to send us email for some high alerts. Others that absolutely require attention go to our phones as pages (texts). This is easy because we have configured our email server with DBA Notify (email) and DBA Page (phones) addresses. A good alerting tool will let you send to one or more addresses, and will even let you execute a program upon an alert; for example, if replication to the reporting servers is backlogged, run a program to pause the report queues so that we don’t publish reports with stale data.

Alerting is the one thing in any decent monitoring tool that will save the DBA’s backside. For this reason, you’ll find that many tools come with hundreds of possible alerts, many already set up to “turn red” on a hair trigger. What the DBA must do is look at each alert and set it up conservatively. Better to receive 100 alerts and find out that you can dial 75 of them back than to set up a rose-colored (or should I say green-colored) view of the world that misses the one thing that takes the shop down.

Caveat: Snoozing alerts

In my shop, disk space on the SAN is at a premium in our most critical environments. We’ve come close to filling the SAN, and I have embarked on size-reduction projects. In one environment, I’ve freed up nearly a terabyte of space inside the databases, but the data files still come close to filling the drives. As a result, I get disk space alerts from the Windows Server level of monitoring, and I snooze them a month at a time (knowing that the data files can’t grow further and have plenty of free space inside them).
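That “plenty of free space inside them” is worth verifying before you snooze; a sketch, run in the database in question:

-- Allocated vs. used space inside each data file of the current database.
-- size and FILEPROPERTY(...) are in 8KB pages; divide by 128 for MB.
SELECT name AS logical_file,
       size / 128.0 AS allocated_mb,
       FILEPROPERTY(name, 'SpaceUsed') / 128.0 AS used_mb,
       (size - FILEPROPERTY(name, 'SpaceUsed')) / 128.0 AS free_inside_mb
FROM sys.database_files
WHERE type_desc = 'ROWS';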

It’s better to snooze a pile of alerts, and get pinged monthly or even less often, than it is to turn the alert off for that machine or (yipes!) the enterprise.

Working the Tool to Work the Servers.

In our shop, whoever is on duty works the monitoring tool to know when to work the servers themselves. If I get chronic yellow alerts for processor queue length on a VM hosting SQL Server, and I’m snoozing that alert more than once or twice a month (daily would be bad), I know to bring the issue up with the network admin and see if I can get that VM some more horsepower.

This is an optimal situation when you are part of a small DBA staff shepherding a large number of servers. You want to be in a position to know where to apply your force, and the monitoring tool is what military types call a “force multiplier.”

When you are off duty, the tool pages you if something is wrong, and if you are on duty, you can see things emerge and address them before they have business impact.

Conclusion:  It’s worth it!

I cannot imagine any shop where DBAs have to manage more than 3-4 SQL Servers being effective without a top-flight monitoring tool. I have identified the tool that I use, but that doesn’t mean it’s the only tool that can do the job. Look for yourself, and get comfortable with the tools using demos.

Then make the business case to your management: reduced downtime, better utilization of employee time, force multiplication. Ensure that your management knows that this is not a “toy” for your DBA staff to “look cool” with, but rather an integral and necessary part of the toolset you need to do your job. I’m privileged to work for a company that recognizes and supports my team, and the end result has been a DBA team that gives management nothing to worry about.