For years, we’ve seen the metric of Page Life Expectancy (PLE) touted as a health barometer for SQL Server. Moreover, we have seen PLE touted as a magic number: “300 seconds (5 minutes) means that you have a generally healthy SQL Server.” About 5 years ago, Jonathan Kehayias and Paul Randal both spoke strongly against this notion, and Jonathan argued that a good number is better determined by this formula:

(buffer pool memory in GB / 4) * 300 seconds
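On a live server, both inputs to that formula are available from the DMVs. Here is a minimal sketch (my query, not Jonathan's), assuming the buffer pool size is taken from the Buffer Manager's "Database pages" counter, which is counted in 8 KB pages:

```sql
-- Compare current PLE against the (buffer GB / 4) * 300 target.
-- "Database pages" is the buffer pool size in 8 KB pages.
DECLARE @ple BIGINT, @buffer_gb DECIMAL(10, 2);

SELECT @ple = cntr_value
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%'
  AND counter_name = 'Page life expectancy';

SELECT @buffer_gb = cntr_value * 8.0 / 1024 / 1024   -- 8 KB pages -> GB
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%'
  AND counter_name = 'Database pages';

SELECT @ple                    AS current_ple_seconds,
       (@buffer_gb / 4) * 300  AS target_ple_seconds;
```

Run it anywhere you have VIEW SERVER STATE permission; the two columns make the gap between “healthy by 300 seconds” and “healthy by the formula” obvious at a glance.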
Looks nifty, but this was again said in 2011, when the wide availability of SSDs was more a pipe dream than the reality it is now. Even more so, the availability of all-memory SANs such as the Violin Memory line-up gives a whole new meaning, or perhaps a lack thereof, to the PLE metric itself.
Is PLE relevant anymore? A case study.
Here is a server; we will call it A. A has SQL 2008 fully patched (and an application requirement preventing me from upgrading it to 2014), 97 GB free to SQL Server, and is attached to a SAN with three tiers of storage:
- SSD (used for high performance tables and indexes in main OLTP app)
- 15K (used mainly for master, msdb, model and lower-performance databases)
- SATA (slow – used for storage of BLOB data)
This server presents with a consistent 25-27 minute PLE, about 1500-1620 seconds. The buffer cache hovers at a very reliable 87-88 GB. The main OLTP database hovers near 3 TB and uses storage in the SSD and SATA tiers, keeping only the BLOBs on SATA. Also, tempdb is completely on SSD.
Sounds healthy under the old “300 second” rule. However, under the Kehayias formula above, we come up with a preferred PLE of:
(87 / 4) * 300 = 6,525 seconds
Wow. According to this, the PLE should be about 4 1/3 times longer! This server is on the brink of setting fire to the SAN; better issue fire suits to the data center workers who walk past it.
PLE is a measurement of how long a page sits unreferenced in memory, on average. The idea is to keep enough pages in memory so that SQL Server finds the data page there and does not need to perform physical I/O to retrieve data.
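One caveat worth noting when reading that number: on NUMA hardware, the server-level PLE is an aggregate of per-node values, so one starved node can hide behind a healthy-looking total. The per-node figures live under the Buffer Node counter object:

```sql
-- Per-NUMA-node PLE; the server-level Buffer Manager value
-- is an aggregate of these per-node figures.
SELECT instance_name AS numa_node,
       cntr_value    AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Node%'
  AND counter_name = 'Page life expectancy';
```

On a single-node box this returns one row matching the Buffer Manager value; on a multi-socket server, a big spread between nodes is itself worth investigating.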
But how important is that in an era where SSD is becoming commonplace, and some SANs actually migrate each block between tiers based on the amount of traffic it sees (think Compellent as a pioneering example)? SSDs are fast. Does a buffer pool miss still choke the SAN bus and fabric with waits on mechanical disks, as it would have on many SANs five years ago and more? I would venture to say no.
One common PLE interpretation says that my server is healthy. Even big queries run quickly as far as the users are concerned; the empirical data seem to agree. Another more rigorous analysis suggests that PLE is outdated and proposes a newer metric. However, hardware advances seem to be giving that metric a challenge in the present day.
Here are some other metrics from that same server:
- Roughly 327 logical reads per physical read
- 180 compiles per second
- 50-60 log flushes per second
- Checkpoints at 120 pages per second
- Read-ahead at 120 pages per second
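That 327:1 logical-to-physical ratio implies a buffer cache hit ratio somewhere near 99.7%, which you can confirm from the counters. A sketch, keeping in mind that the raw ratio counter is meaningless until divided by its companion "base" counter:

```sql
-- Buffer cache hit ratio; the raw counter only makes sense
-- when divided by its "base" counter.
SELECT (a.cntr_value * 100.0 / NULLIF(b.cntr_value, 0)) AS hit_ratio_pct
FROM sys.dm_os_performance_counters AS a
JOIN sys.dm_os_performance_counters AS b
  ON a.[object_name] = b.[object_name]
WHERE a.[object_name] LIKE '%Buffer Manager%'
  AND a.counter_name = 'Buffer cache hit ratio'
  AND b.counter_name = 'Buffer cache hit ratio base';
```

Like PLE, the hit ratio on its own says nothing about how much a miss actually costs; it just quantifies how often we go to disk at all.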
My options with this server are to:
- Do nothing and be happy
- Spend a few grand to increase RAM to 256 GB, and thereby increase the SQL Server buffer cache to roughly 200 GB
- Go nuts and reprovision this machine with a $200,000 monster server with terabytes of RAM.
I think that PLE as a metric tells me little in this case unless I apply Jonathan’s formula to it. I also think that Jonathan’s metric tells me to add some RAM to the server, and we will see an incremental boost in throughput. I don’t think it will be massive; I also don’t think it will be negligible. The $6000 we will spend will be well worth it.
PLE needs to be evaluated with more than a simple “300 second” threshold. Jonathan Kehayias’ formula is a good starting point, but even more rigor is required. The metric needs to be re-examined, and perhaps expressed as a ratio or something else that accounts for the size of RAM versus databases.
Right now, a fixed PLE threshold in a monitoring tool isn’t much good unless you tweak the alert for each server you monitor. Something to keep in mind as you juggle the servers, keep the plates spinning, and don the fire suit as you pass by the SAN.