The importance of a good monitoring tool cannot be over-emphasized. I’ve posted on this issue before. Now, I’d like to get into a tad more detail about this issue. If you have responsibility for more than 2-3 servers, you need to have the alerts and reports. However, you have other reasons to employ a good tool.
Note: This article features Dell Spotlight on SQL Server. I like the tool a great deal, but there are other good tools out there.
Benefit: Good visual understanding of Emergent situations.
Have a peek at this screen shot. All screen shots are enlarged if you click on them. Since this is from an actual production environment, certain information must be obfuscated.
In this particular situation, the tool informs me that the server has a process that has been blocked for a long enough period to trip the red alert level.
Clicking on the Blocked process icon gives us the details of the error, and clicking further gives us an analysis window:
This shows that the lead blocker is blocking one process, and the blocker is running but “AWAITING COMMAND.” In this case, investigation reveals a massive update in process by the application. Further checking was not necessary in this case because the blocking cleared up about a minute later.
The above exemplifies a situation which a good DBA could have diagnosed by using sp_who2 and DBCC INPUTBUFFER on the lead blocker and the blocked task(s). However, that would require time that the 4-5 seconds to examine in the tool did not, and moreover, the DBA is able to examine the issue proactively and not reactively (meaning when the angry user calls up with a hung session).
Let’s take a look at a slightly more critical scenario:
The above, my friends, is a classic “code Brown” situation. The server showed up on the heat map (see later on) as a giant red blob taking up over a quarter of the real estate on the screen. Why? If you haven’t zoomed in, you probably couldn’t see this:
That’s right, 27.6 seconds average I/O wait! You can guess that the SAN is being murdered by something truly massive that is whipping tempdb. As an aside the SAN in question placed 14 files of TEMPDB on SSD drives – the bottleneck was the SAN fabric itself. Ever seen the classic drawing of Brent Ozar standing in front of a porta-potty labeled “Temp DB?” He illustrates an important point and with good reason.
In this case, there was a table-valued function querying a 100+ million row table — twice — using a UNION (not UNION ALL). This function created a 100GB heap on the tempDB and then proceeded to thrash the heap for eliminating duplicates in the UNION. I was one very upset DBA. More than once, this culprit blocked several production jobs, including the peer-to-peer replication.
We were able to diagnose this issue with minutes of work, and again, we were able to proactively raise it to the developers and ensure that fixing the issue was a high priority.
Benefit: Enterprise overview
This is vitally important. When you have your monitoring tool set up properly, you should be able to look at the screen and see what’s up with the enterprise. Something like this:
The above screen shot reflects nirvana. All pretty green blocks. This tool monitors VMWare clusters, Windows servers, SQL instances and replication publishers. Just over 40 hosts and just over 135 connections. The above screen shot was zoomed down to reduce away the name and type of each block, again for reasons of confidentiality.
We have a full sized window with the tool open on a 42 inch monitor in the room where DBA sits, and this permits us to instantly check the state of the enterprise. Even better, we can do it from our iPhones too:
Therefore, we have good ability to take a look whenever we desire and evaluate the state of the enterprise.
Any good tool won’t just sit there and wait for you to see a problem; it will send you an email (or even page your phone) when a situation requiring attention emerges.
We have our alerts configured to send us email for some high alerts. Others that absolutely require attention are sent to our phones as pages (texts). This is easy because we have configured our email server with DBA Notify (email) and DBA Page (phones) email addresses. A good alerting tool will let you send to one or more emails, and will even let you execute some program(s) upon an alert. For example, if replication to the reporting servers is backlogged, run a program to pause the report queues so that we don’t publish reports with stale data.
Alerting is the one thing in any decent monitoring tool that will save the DBA’s backside. For this reason, you’ll find out that many tools come with hundreds of possible alerts, with many already set up to “turn red” on a hair trigger. What the DBA must do is to look at each alert and set it up conservatively. Better to receive 100 alerts and find out that you can dial 75 of them back, rather than to set up a rose-colored (or should I say green-colored) view of the world that misses the one thing that takes the shop down.
Caveat: Snoozing alerts
In my shop, disk space on the SAN is at a premium in our most critical environments. We’ve come close to filling the SAN, and I embarked upon size reduction projects. In one database, I’ve freed up nearly a terabyte of space inside of the databases, but the data files come close to filling up the drives. Therefore, I get alerts on disk space usage from the Windows Server level of monitoring, and I snooze them for a month at a time (knowing that the data files can’t grow and have plenty of free space inside them).
It’s better to snooze a pile of alerts, and get pinged monthly or even less often, than it is to turn the alert off for that machine or (yipes!) the enterprise.
Working the Tool to Work the Servers.
In our shop, while on duty we work the monitoring tool to know when to work the server itself. If I get chronic yellow alerts for processor queue length on a VM hosting a SQL server, and I’m snoozing that alert more than once or twice a month (daily would be bad), I know to bring the issue up with the network admin to see if I can get that VM some more horsepower.
This is an optimal situation when you are part of a small DBA staff shepherding a large number of servers. You want to be in a position to know where to apply your force, and the monitoring tool is what military types call a “force multiplier.”
When you are off duty, the tool pages you if something is wrong, and if you are on duty, you can see things emerge and address them before they have business impact.
Conclusion: It’s worth it!
I cannot imagine any shop where DBA have to manage more than 3-4 SQL servers being effective without a top-flight monitoring tool. I have identified the tool that I use, but that doesn’t mean that it is the only tool that can do the job. Look for yourself. Get comfortable with the tools using demos.
Then make the business case to your management. Reduced downtime. Better utilization of employee time. Force multiplication. Ensure that your management knows that this is not a “toy” for your DBA staff to “look cool;” but rather an integral and necessary part of the toolset that you need to do your job. I’m privileged to work for a company that recognizes and supports my team, and the end result has been a DBA team that does not engender any worry from management.