Category Archives: Administration

Tips on how to administer a company’s data resources at a higher level.

Is a Non-Clustered GUID Primary Key a Problem?

Recently I received pushback from a developer about the use of a GUID primary key.  The proposed database schema, intended to be used with Entity Framework, contained numerous tables primary-keyed on a column named ID and typed as nvarchar(50).  The tables were clustered on this column, which was intended to hold a GUID populated by the .NET call System.Guid.NewGuid().

Cue Blood-Curdling Screaming Now.

I can hear the blood-curdling screams from every DBA worth his or her salt.  I asked for the column to be redefined as bigint IDENTITY, citing the well-known adverse effects of clustering on a GUID.  This case was worse, because the GUID would be stored as a 36-character Unicode string: 72 bytes of data plus 2 bytes of variable-length overhead, for a minimum of 74 bytes per row.

The developer objected because of round trips to the SQL Server; this application requires extreme throughput and low latency.  With an IDENTITY key, the application would have to retrieve each generated identity value before it could populate dependent rows in child tables, so parent and child rows would have to be inserted separately.  With a program-generated GUID, no database access is required until insert time, and the inserts can be done in bulk through EF.  The developer therefore offered to cluster on an IDENTITY column but keep the primary key as a GUID.

I still objected.  In experiments, the nonclustered index reached 99.5% fragmentation after only about 1,000 rows, even with a fill factor of 50% defined for that index.  The index was also huge, whether I used the nvarchar(50) or the uniqueidentifier (16 bytes) data type to hold the GUID.  The developer argued that the fragmentation of a nonclustered index was meaningless.
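For anyone who wants to repeat that kind of experiment, fragmentation can be read from the index physical stats DMV.  This is only a sketch: the table name dbo.ORDER_MAST is an assumption borrowed from the examples later in this post, and LIMITED mode is chosen simply because it is the cheapest scan.

```sql
-- Sketch: report fragmentation for every index on one table.
-- dbo.ORDER_MAST is a hypothetical table name; run this in the database that owns it.
SELECT i.name                          AS index_name,
       i.type_desc                     AS index_type,
       ps.avg_fragmentation_in_percent AS fragmentation_pct,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.ORDER_MAST'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
ORDER BY ps.avg_fragmentation_in_percent DESC;
```

Running it before and after a batch of GUID-keyed inserts shows how quickly the nonclustered index degrades.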

Should I Care?

Now the question is whether I should care about a highly fragmented nonclustered index.  Remember that this GUID is also a foreign key, and it must (for performance reasons) be indexed for join purposes on both the parent and child tables.  But this is a SEEK operation, is it not?  Numerous articles argue that SEEK operations on nonclustered indexes, even when highly fragmented (which is the norm when the key is a GUID), are not expensive.

However, focusing on individual seek performance misses an important point here.  The query below explains why I care:

where O.GUID = '28592B77-5684-4C8E-9B70-997C8A7DC7E0';

Order lines may be OK, because there will be an index on ORDER_GUID for the ORDER_LINE table.  However, every line item seek would not be so lucky.  The items will be wildly randomized in the ITEM table, and each ORDER_LINE will need to join ITEM.  A smaller but virtually identical issue arises as ORDER_MAST joins to CUSTOMER.
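Only the WHERE clause of that query is preserved above.  A rough sketch of the kind of join being described might look like the following.  The table names and the ORDER_GUID column come from the text; every other column name (CUSTOMER_GUID, ITEM_GUID, CUSTOMER_NAME, ITEM_NAME, QUANTITY) is a hypothetical stand-in, so treat this as an illustration and not as the original query.

```sql
-- Sketch only: column names other than GUID and ORDER_GUID are assumed.
SELECT C.CUSTOMER_NAME,
       O.ORDER_DATE,
       L.QUANTITY,
       I.ITEM_NAME
FROM ORDER_MAST AS O
JOIN CUSTOMER   AS C ON C.GUID       = O.CUSTOMER_GUID  -- one seek into CUSTOMER
JOIN ORDER_LINE AS L ON L.ORDER_GUID = O.GUID           -- served by the ORDER_GUID index
JOIN ITEM       AS I ON I.GUID       = L.ITEM_GUID      -- one random seek per order line
WHERE O.GUID = '28592B77-5684-4C8E-9B70-997C8A7DC7E0';
```

The join to ITEM is the expensive part: each order line points at a GUID that can live anywhere in the index.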

Now let’s make the point even more directly:

WHERE O.ORDER_DATE between '01/01/2015' and '04/01/2015'

Assume that the ORDER_MAST table is clustered on an identity and the ORDER_DATE column is a datetime and has a nonclustered index.

Assume also that there are 10,000 orders with 10 lines each, or 100,000 order lines.  At 99.5% fragmentation, nearly every ITEM lookup lands on a different page, so there is a likelihood of reading 99,500 separate pages of data into memory in order to perform the ORDER_LINE to ITEM join.  Furthermore, without a sequence number column in the ORDER_LINE table, there would be no reliable way to return the lines in the order in which they were inserted into the table.

Defragmenting the GUID index would reduce those 99,500 page reads to about 50,000 page reads for a non-clustered index with a 50% fill factor.  If you have an 8 byte bigint clustered index that is also the PK, you would likely get many rows per page (depending on row size) and in the end increase speed by more than one order of magnitude.

Hence a sequential surrogate PK has its uses.  It would dramatically speed up this query. We haven’t discussed the speed implications of insert performance as the index pages are split over and over again, and the concept that defragmenting a nonclustered index based on a GUID could actually degrade insert performance by increasing the number of page splits.  Therefore, if anyone uses a utility such as dba_indexdefrag by Michelle Ufford, one is likely to generate slowness in inserts.

Do I care?  Heck yes!  This is a hill to die on for any SQL Server DBA.


In this case, the developer agreed to use an internally generated, ever-increasing unique number based on time, machine ID, and an increasing integer.  The generated number should fit into a BIGINT (8 bytes) or a Decimal(19,0) (9 bytes) data type, resulting in eight or nine bytes per row of PK and not 74.  This agreement solves all of the problems in a manner substantially identical to the IDENTITY or a SEQUENCE, while meeting the developer’s performance concerns, which were rightly expressed because this application requires maximum throughput.
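The size difference between the candidate key types is easy to check for yourself.  The sketch below uses DATALENGTH, which reports data bytes only; it does not include the 2 bytes of variable-length overhead that an nvarchar column adds per row.  The decimal(19,0) case is documented as 9 bytes of storage.

```sql
-- Sketch: compare the data size of each candidate key type.
DECLARE @g uniqueidentifier = NEWID();

SELECT DATALENGTH(CAST(@g AS nvarchar(50)))  AS guid_as_nvarchar_bytes,   -- 36 characters * 2 bytes = 72
       DATALENGTH(@g)                        AS guid_as_uniqueidentifier, -- 16
       DATALENGTH(CAST(1 AS bigint))         AS bigint_bytes,             -- 8
       DATALENGTH(CAST(1 AS decimal(19, 0))) AS decimal_19_0_bytes;
```

Whatever the key width is, it is repeated in every nonclustered index and every child table that references it, which is why the narrower types pay off so quickly.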

Lesson?  Don’t allow GUIDs as clustered or non-clustered PK. Nope. Never.

Page Life Expectancy — A Metric on the Wane?

For years, we’ve seen the metric of Page Life Expectancy (PLE) touted as a health barometer for SQL Server.  Moreover, we have seen PLE touted as a magic number:  “300 seconds (5 minutes) means that you have a generally healthy SQL Server.”  About 5 years ago, Jonathan Kehayias and Paul Randal both spoke strongly against this notion, and Jonathan argued that a good number is better determined by this formula:

(buffer cache size in GB / 4) * 300

Looks nifty, but this formula dates from 2011, when the wide availability of SSD was more of a pipe dream than the reality it is now.  Even more so, the availability of all-flash SAN storage such as the Violin Memory line-up gives a whole new meaning, or perhaps a lack thereof, to the PLE metric itself.

Is PLE relevant anymore?  A case study.

Here is a server; we will call it A.  A runs SQL Server 2008, fully patched (with an application requirement preventing me from upgrading it to 2014), has 97GB of memory available to SQL Server, and is attached to a SAN with three tiers of storage:

  • SSD (used for high performance tables and indexes in main OLTP app)
  • 15K (used mainly for master, msdb, model and lower-performance databases)
  • SATA (slow – used for storage of BLOB data)

This server presents with a consistent 25-27 minute PLE – about 1500-1620 seconds.  The buffer cache hovers at a very reliable 87-88GB.  The main OLTP database hovers near 3TB and uses storage in the SSD and SATA groups, keeping only the BLOBs on SATA.  Also, tempdb is completely on SSD.

Sounds healthy under the old “300 second” rule.  However, under the Kehayias formula above, we come up with a preferred PLE of:

(87/4)* 300= 6525

Wow.  According to this, the PLE should be 4 1/3 times longer!  This server is on the brink of setting fire to the SAN!  The new outfit for data center workers near this SAN:

[Image: fire suit]

PLE is a measurement of how long a page sits unreferenced in memory, on average.  The idea is to keep enough pages in memory so that SQL Server finds the data page there and does not need to perform physical I/O to retrieve data.
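Both numbers in the case study, the current PLE and the threshold from the (buffer cache in GB / 4) * 300 rule, can be pulled from the performance counter DMV.  This is a minimal sketch; it assumes the "Database pages" counter is an acceptable stand-in for the size of the buffer cache, and on a NUMA server you may want the per-node Buffer Node counters instead.

```sql
-- Sketch: current PLE next to the buffer-cache-scaled threshold.
SELECT ple.cntr_value                                   AS ple_seconds,
       pages.cntr_value * 8 / 1048576.0                 AS data_cache_gb,      -- 8 KB pages converted to GB
       (pages.cntr_value * 8 / 1048576.0) / 4 * 300     AS suggested_min_ple_seconds
FROM sys.dm_os_performance_counters AS ple
JOIN sys.dm_os_performance_counters AS pages
  ON pages.object_name = ple.object_name
WHERE ple.object_name LIKE '%Buffer Manager%'
  AND ple.counter_name   = 'Page life expectancy'
  AND pages.counter_name = 'Database pages';
```

Sampling this on a schedule is more informative than a single reading, because PLE drops sharply after large scans and then climbs back.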

But how important is that in an era where SSD is becoming commonplace, and some SANs actually manage a block’s location based on the amount of traffic it sees (think Compellent as a pioneering example)?  SSD is fast.  Therefore, does a physical read still choke up the SAN bus and fabric with waits on mechanical drives, as one would have seen in many SANs five years ago and more?  I would venture to say no.

One common PLE interpretation says that my server is healthy. Even big queries run quickly as far as the users are concerned; the empirical data seem to agree.  Another more rigorous analysis suggests that PLE is outdated and proposes a newer metric. However, hardware advances seem to be giving that metric a challenge in the present day.

Here are some other metrics from that same server:

  • Roughly 327 logical reads per physical read.
  • 180 compiles per second
  • 50-60 log flushes per second
  • Checkpoints at 120 pages per second
  • Read ahead at 120 pages per second

My options with this server are to:

  1. Do nothing and be happy
  2. Spend a few grand to increase RAM to 256GB and thereby increase the SQL Server buffer cache to ~200GB
  3. Go nuts and reprovision this machine with a $200,000 monster server with terabytes of RAM.

I think that PLE as a metric tells me little in this case unless I apply Jonathan’s formula to it.  I also think that Jonathan’s metric tells me to add some RAM to the server, and we will see an incremental boost in throughput.  I don’t think it will be massive; I also don’t think it will be negligible.  The $6000 we will spend will be well worth it.


PLE needs to be evaluated with more than a simple “300 second” threshold.  Jonathan Kehayias’ formula is a good starting point, but even more rigor is required.  The metric needs to be re-examined, and perhaps expressed as a ratio or something else that accounts for the size of RAM versus databases.

Right now, a threshold in a monitoring tool isn’t good, unless you tweak the alert for each server you monitor.  Something to keep in mind as you juggle the servers, keep the plates spinning, and don the fire suit as you pass by the SAN.

More Benefits of a Good Monitoring Tool.

The importance of a good monitoring tool cannot be over-emphasized.  I’ve posted on this issue before.  Now, I’d like to get into a tad more detail about this issue.  If you have responsibility for more than 2-3 servers, you need to have the alerts and reports. However, you have other reasons to employ a good tool.

Note: This article features Dell Spotlight on SQL Server.  I like the tool a great deal, but there are other good tools out there.

Benefit: Good visual understanding of Emergent situations.

Have a peek at this screen shot.  Since this is from an actual production environment, certain information must be obfuscated.


In this particular situation, the tool informs me that the server has a process that has been blocked for a long enough period to trip the red alert level.

Clicking on the Blocked process icon gives us the details of the error, and clicking further gives us an analysis window:



This shows that the lead blocker is blocking one process, and the blocker is running but “AWAITING COMMAND.”  In this case, investigation reveals a massive update in progress by the application.  Further checking was not necessary because the blocking cleared up about a minute later.

The above exemplifies a situation which a good DBA could have diagnosed by using sp_who2 and DBCC INPUTBUFFER on the lead blocker and the blocked task(s).  However, that takes considerably longer than the 4-5 seconds needed to examine the problem in the tool.  Moreover, the DBA is able to examine the issue proactively and not reactively (meaning when the angry user calls up with a hung session).
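For comparison, the manual route looks roughly like this.  It is a sketch of the usual steps, and the session ID 57 is a made-up value that you would replace with the blocker reported by the first two queries.

```sql
-- Sketch: manual blocking diagnosis without a monitoring tool.
EXEC sp_who2;                         -- the BlkBy column shows who is blocking whom

SELECT session_id,
       blocking_session_id,
       wait_type,
       wait_time,
       wait_resource
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;       -- only requests that are currently blocked

DBCC INPUTBUFFER(57);                 -- 57 is a hypothetical session ID for the lead blocker
```

None of this is hard, but it only happens after somebody notices the problem, which is the point of the comparison.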

Let’s take a look at a slightly more critical scenario:



The above, my friends, is a classic “code Brown” situation.  The server showed up on the heat map (see later on) as a giant red blob taking up over a quarter of the real estate on the screen.  Why?  If you haven’t zoomed in, you probably couldn’t see this:


That’s right, 27.6 seconds average I/O wait!  You can guess that the SAN is being murdered by something truly massive that is whipping tempdb.  As an aside, the SAN in question placed 14 tempdb files on SSD drives; the bottleneck was the SAN fabric itself.  Ever seen the classic drawing of Brent Ozar standing in front of a porta-potty labeled “Temp DB?”  He illustrates an important point, and with good reason.

In this case, there was a table-valued function querying a 100+ million row table twice, using a UNION (not UNION ALL).  This function created a 100GB heap in tempdb and then proceeded to thrash the heap to eliminate duplicates for the UNION.  I was one very upset DBA.  More than once, this culprit blocked several production jobs, including the peer-to-peer replication.

We were able to diagnose this issue with minutes of work, and again, we were able to proactively raise it to the developers and ensure that fixing the issue was a high priority.

Benefit: Enterprise overview

This is vitally important.  When you have your monitoring tool set up properly, you should be able to look at the screen and see what’s up with the enterprise. Something like this:


The above screen shot reflects nirvana.  All pretty green blocks.  This tool monitors VMware clusters, Windows servers, SQL instances and replication publishers.  Just over 40 hosts and just over 135 connections.  The screen shot was zoomed out to hide the name and type of each block, again for reasons of confidentiality.

We have a full-sized window with the tool open on a 42-inch monitor in the room where the DBAs sit, and this permits us to instantly check the state of the enterprise.  Even better, we can do it from our iPhones too:



Therefore, we have good ability to take a look whenever we desire and evaluate the state of the enterprise.

Benefit:  Alerting

Any good tool won’t just sit there and wait for you to see a problem; it will send you an email (or even page your phone) when a situation requiring attention emerges.

We have our alerts configured to send us email for some high alerts. Others that absolutely require attention are sent to our phones as pages (texts).  This is easy because we have configured our email server with DBA Notify (email) and DBA Page (phones) email addresses.  A good alerting tool will let you send to one or more emails, and will even let you execute some program(s) upon an alert. For example, if replication to the reporting servers is backlogged, run a program to pause the report queues so that we don’t publish reports with stale data.

Alerting is the one thing in any decent monitoring tool that will save the DBA’s backside.  For this reason, you’ll find out that many tools come with hundreds of possible alerts, with many already set up to “turn red” on a hair trigger. What the DBA must do is to look at each alert and set it up conservatively. Better to receive 100 alerts and find out that you can dial 75 of them back, rather than to set up a rose-colored (or should I say green-colored) view of the world that misses the one thing that takes the shop down.

Caveat:  Snoozing alerts

In my shop, disk space on the SAN is at a premium in our most critical environments.  We’ve come close to filling the SAN, and I embarked upon size reduction projects. In one database, I’ve freed up nearly a terabyte of space inside of the databases, but the data files come close to filling up the drives. Therefore, I get alerts on disk space usage from the Windows Server level of monitoring, and I snooze them for a month at a time (knowing that the data files can’t grow and have plenty of free space inside them).
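Before snoozing an alert like that, it is worth confirming that the files really do have room inside them.  Here is a minimal sketch, run in the database in question; sizes are reported in 8 KB pages, so dividing by 128 gives megabytes.

```sql
-- Sketch: free space inside each file of the current database.
SELECT name                                             AS logical_file_name,
       type_desc,
       size / 128.0                                     AS file_size_mb,
       FILEPROPERTY(name, 'SpaceUsed') / 128.0          AS space_used_mb,
       (size - FILEPROPERTY(name, 'SpaceUsed')) / 128.0 AS free_space_mb,
       growth,                                          -- 0 means the file cannot autogrow
       is_percent_growth
FROM sys.database_files;
```

If free_space_mb is large and growth is 0, the Windows-level disk alert is noise for that drive and snoozing it is a defensible choice.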

It’s better to snooze a pile of alerts, and get pinged monthly or even less often, than it is to turn the alert off for that machine or (yipes!) the enterprise.

Working the Tool to Work the Servers.

In our shop, while on duty we work the monitoring tool to know when to work the server itself.  If I get chronic yellow alerts for processor queue length on a VM hosting a SQL server, and I’m snoozing that alert more than once or twice a month (daily would be bad), I know to bring the issue up with the network admin to see if I can get that VM some more horsepower.

This is an optimal situation when you are part of a small DBA staff shepherding a large number of servers. You want to be in a position to know where to apply your force, and the monitoring tool is what military types call a “force multiplier.”

When you are off duty, the tool pages you if something is wrong, and if you are on duty, you can see things emerge and address them before they have business impact.

Conclusion:  It’s worth it!

I cannot imagine any shop where DBA have to manage more than 3-4 SQL servers being effective without a top-flight monitoring tool. I have identified the tool that I use, but that doesn’t mean that it is the only tool that can do the job.  Look for yourself.  Get comfortable with the tools using demos.

Then make the business case to your management.  Reduced downtime. Better utilization of employee time. Force multiplication. Ensure that your management knows that this is not a “toy” for your DBA staff to “look cool;” but rather an integral and necessary part of the toolset that you need to do your job. I’m privileged to work for a company that recognizes and supports my team, and the end result has been a DBA team that does not engender any worry from management.



DBCC CHECKDB in tough and easy situations.

Every production database should be subject to an integrity check with DBCC CHECKDB.  Every week at least.  It’s not just a best practice; the failure to do so should be considered dereliction of DBA duty.  The longer you go with an integrity error in your database, the higher the likelihood of a catastrophic, call-Paul-Randal-for-help situation.

DBCC CHECKDB in a HA tight-disk situation

However, what happens when you have a high-uptime VLDB with little to no disk to spare?  You might have the feeling that you are in a catch-22: if you use TABLOCK to avoid the snapshots, you need downtime, but if you don’t use TABLOCK to do the DBCC, you have no disk for the snapshot (on the drive with the file where your object’s pages are stored) and your DBCC crashes with error 17053, root cause OS Error 112.

There is a way around this:  Take a full backup of the database, and restore it to another server, where the database may be accessed without blocking other applications.  Once restore is complete, run DBCC CHECKDB with TABLOCK.  BACKUP DATABASE copies database pages with data into the backup, and RESTORE DATABASE copies them back.  Running DBCC CHECKDB will give you a reliable picture of the integrity of the production database.  Moreover, this strategy gives you a reliable test of your backup strategy.
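The backup-and-restore approach can be sketched as follows.  Every name here is a placeholder: the database SalesDB, the share path, the drive letters, and the logical file names are assumptions for illustration, and RESTORE FILELISTONLY will tell you the real logical names in your backup.

```sql
-- Sketch only: all database names, paths, and logical file names are hypothetical.

-- 1. On the production server: full backup with page checksums verified.
BACKUP DATABASE [SalesDB]
TO DISK = N'\\backupshare\sql\SalesDB_full.bak'
WITH CHECKSUM, INIT;

-- 2. On the secondary server: find the logical file names, then restore.
RESTORE FILELISTONLY
FROM DISK = N'\\backupshare\sql\SalesDB_full.bak';

RESTORE DATABASE [SalesDB_Check]
FROM DISK = N'\\backupshare\sql\SalesDB_full.bak'
WITH MOVE N'SalesDB'     TO N'D:\Data\SalesDB_Check.mdf',
     MOVE N'SalesDB_log' TO N'L:\Log\SalesDB_Check.ldf',
     REPLACE, RECOVERY;

-- 3. Nobody else is using the copy, so TABLOCK costs nothing here.
DBCC CHECKDB (N'SalesDB_Check') WITH TABLOCK, NO_INFOMSGS, ALL_ERRORMSGS;
```

A clean result on the restored copy also proves that the backup file itself is restorable, which is the second benefit mentioned above.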

DBCC CHECKDB as a Scheduled Job

What if you can run DBCC CHECKDB without TABLOCK and not have problems, or if you can get the database into single-user mode for a clean run with TABLOCK?  Then you need to make a scheduled job out of it, for each and every database on your server.  SQL Server Maintenance plans will do this for you.

You should schedule DBCC CHECKDB weekly.  I recommend that you schedule this job to notify DBA whether or not the job succeeds. It is good to put the DBCC output in your face every single week, to give you assurance that it has been completed.
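If you would rather script the job step yourself than use a maintenance plan, a loop over sys.databases is one way to do it.  This is a sketch, not a hardened job: it checks every online database in turn and relies on the SQL Agent job's own notification settings to email the outcome.

```sql
-- Sketch: run DBCC CHECKDB against every online database on the instance.
DECLARE @name sysname,
        @sql  nvarchar(max);

DECLARE db_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE state_desc = 'ONLINE';

OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @name;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- QUOTENAME protects against odd characters in database names.
    SET @sql = N'DBCC CHECKDB (' + QUOTENAME(@name)
             + N') WITH NO_INFOMSGS, ALL_ERRORMSGS;';
    EXEC sys.sp_executesql @sql;

    FETCH NEXT FROM db_cursor INTO @name;
END

CLOSE db_cursor;
DEALLOCATE db_cursor;
```

Schedule it as a SQL Agent job step and set the job to notify on completion, not only on failure, so that the result lands in front of you every week.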

What if DBCC CHECKDB finds a problem?

If you have a consistency error, then, regardless of any other issues, you get with management and you bring the system down as soon as possible!  Get that database into single-user mode and start to fix the issue.  The exact things you should do to fix the errors depend on the errors, and those repair attempts are best left to a more detailed post.

The important thing to remember is that time is of the essence.  Every second you are up with an integrity error is another second where the opportunity exists for that error to become serious and perhaps irreparable.

Once the error(s) are corrected and a DBCC CHECKDB with TABLOCK passes in single-user mode, then take a full backup at once.  Assume that all previous backups are corrupt.
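The verification and backup sequence at the end of a repair might look like this.  It is a sketch with a hypothetical database name and backup path, and it deliberately leaves out the repair itself, which depends on the errors reported.

```sql
-- Sketch only: SalesDB and the backup path are hypothetical.
ALTER DATABASE [SalesDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- Confirm the database is clean while nobody else can touch it.
DBCC CHECKDB (N'SalesDB') WITH TABLOCK, NO_INFOMSGS, ALL_ERRORMSGS;

ALTER DATABASE [SalesDB] SET MULTI_USER;

-- Start a fresh backup chain immediately after the clean check.
BACKUP DATABASE [SalesDB]
TO DISK = N'X:\Backup\SalesDB_post_repair.bak'
WITH CHECKSUM, INIT;
```

ROLLBACK IMMEDIATE disconnects everyone and rolls back their open transactions, so run it only once management has agreed to the outage.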

Your managers may be grumpy that a production system had to come down for hours to repair a data inconsistency, but that is better than failure to report a potential problem that escalates to disaster.  If your managers refuse to bring the system down after you warn them of the error and its potential risks, then the responsibility lies with them and not you.

Whatever you do, running and verifying regular integrity checks on all production databases should be considered a core responsibility of your job.

How SQL AD Group Security promotes Clean DBA Living.

We know that we can improve auditability and simplify database security administration by doing the following:

  1. Eliminate SQL Server authentication in favor of integrated security; and
  2. Create AD Groups, and set those groups up as logins and database users, with specific permissions granted and/or denied as the business requires; and
  3. Assign AD users to those groups.

But why the extra level of AD Groups?  Why not just add the AD users as database logins with specific permissions at the more granular user level?

We could certainly do that, but the administrative tasks that come from this approach grow geometrically.

AD Groups reduce SQL Login maintenance.

Sheer Overhead

Let’s assume that you have 100 AD users who each need access to 4 databases spread across 2 SQL Servers.  That is the size of a very large workgroup or a small enterprise.  Without a grouping approach maintained at the AD level, you have to create 200 logins and 400 database users, and assign roles or (worse) individual permissions to each of these users.  That is a boatload of maintenance work.

On the other hand, grouping means that the user’s DB access is provisioned with his or her AD profile, and DBA has no involvement aside from a one-time provisioning of the access to the AD group.

DBA resources are freed to work on more pressing issues.

Orphan Prevention

Assume for argument that a person is terminated and his or her AD account is deleted.  That does nothing to delete the SQL Server logins or database users.  You now have an orphaned login, and if you drop the login you have orphaned database users.

While it is true that you can write or download T-SQL scripts that can check AD for missing logins, identify orphans, and sweep the databases to drop users and then logins, this is maintenance work that requires attendance and usually automation in the form of SQL jobs that run regularly.
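As a rough idea of what that maintenance involves, the sketch below uses the built-in sp_validatelogins procedure to find Windows logins that no longer exist in the domain, and then lists database users with no matching server login.  Treat the second result as a list of candidates to review: a Windows user who connects through a group login can legitimately appear there.

```sql
-- Sketch: find logins and users left behind after AD accounts are deleted.

-- Windows logins mapped in SQL Server that no longer exist in the domain.
EXEC sys.sp_validatelogins;

-- Database users in the current database with no matching server login.
SELECT dp.name      AS database_user,
       dp.type_desc AS user_type
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp
       ON sp.sid = dp.sid
WHERE sp.sid IS NULL
  AND dp.type IN ('S', 'U', 'G')   -- SQL users, Windows users, Windows groups
  AND dp.principal_id > 4;         -- skip dbo, guest, INFORMATION_SCHEMA, sys
```

The second query has to be run in every database on every server, which is exactly the overhead the group approach avoids.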

On the other hand, if the users are organized into AD Groups and one of the AD accounts is deleted, there is no SQL Server maintenance at all.

Auditability Redux

Assume that Jane Doe is promoted from Payroll assistant to sales manager.  Jane no longer needs access to the payroll tables, meaning that her roles must be individually maintained by DBA, separately and in addition to any maintenance of her profile in Active Directory.  Worse, if the roles don’t cover everything and there are individual GRANT commands to tables and other entities that need to be revoked, Jane’s simple change results in significant script work for the DBA.

On the other hand, if you have an AD Group DOMAIN\PayrollWorker with all permissions set there, then the network admin who provisions Jane’s profile need only drop her from the group and her permissions change instantly.

That kind of control over access is the kind of stuff that gives auditors cause for celebration.


I’ve previously discussed the security benefits of eliminating SQL Server authentication.  Combined with the approach of Active Directory group and role based authentication, the security benefit also comes with a savings in DBA time.

Security with Auditability.

If you are a DBA in any of the following:

  • A publicly-traded company
  • A company that has to comply with HIPAA or PCI
  • A company with DBA that hate disorder

…then you need to have an audit-friendly method of securing your SQL assets.  In general, a few rules make this easier on DBA and on management.

1. Get rid of SQL Server authentication!

The sooner the better.  If you do not have a sa login, it is really hard for a third party to log in as sa.  And even if you have disabled sa in favor of another sysadmin-powered login with a name like “janitor_readonly,” if that login is hacked then your data is in the open.

On the other hand, in the modern era, there is little to no need for SQL Server authentication at all.  With SQL 2008 and newer, the DBA actually has to opt in to using SQL Server authentication.  Your database servers should exclusively use integrated security.
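A quick way to see where an instance stands is sketched below.  It only reads settings and disables sa; switching the instance itself to Windows-only authentication is a server property change (made in Management Studio, for example) that needs a service restart, so it is not shown.

```sql
-- Sketch: audit SQL Server authentication on an instance.

-- 1 = Windows authentication only, 0 = mixed mode.
SELECT SERVERPROPERTY('IsIntegratedSecurityOnly') AS windows_auth_only;

-- Every SQL-authenticated login that still exists.
SELECT name, is_disabled, create_date, modify_date
FROM sys.sql_logins
ORDER BY name;

-- Disable sa once you have confirmed nothing depends on it.
ALTER LOGIN [sa] DISABLE;
```

Any enabled row in sys.sql_logins that is not accounted for by a known application is worth chasing down before an auditor does.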

Why?  First of all, every login is mapped to the AD user that was running the program that logged in.  Server applications have their own logins, and the clients are logged in with the end-user’s login.

2. Groups and Roles

The first thing you might think is something like, “I have 10,000 people logging into application X!  I’ll be spending every waking minute maintaining SQL logins and permissions to the databases they need.”

Nope.  Create Active Directory groups.  Assign them logins and database users with the minimum permissions that are needed. Examples:

  • DOMAIN\Accounting_RO – Read only to Accounting
  • DOMAIN\Accounting_RW – Read/Write to Accounting
  • DOMAIN\Accounting_Payroll – Read/Write Payroll Data
  • DOMAIN\SQL Admins – DBA with sysadmin privileges

Usually each database has a limited set of privilege levels.  In my shop, we have a limited set of groups and we use T-SQL to generate DSADD commands to add the group accounts to Active Directory, and pass these on to the sysadmins to be added (no DBA should be a Domain Admin – would you trust any DBA with that much power?  😀 ).

Each domain user is added to one or the other domain groups (or multiple groups; the permissions accumulate unless an explicit DENY is issued).

Use database roles to add a further level of indirection between users and objects.  You GRANT permissions to the database role, and then add the AD Group’s database user to that role.

Then when the desktop/sysadmins provision a user, they are added to the groups using ADUC (Active Directory Users and Computers), and the database users and permissions are defined once at the outset, and only need minor review as DDL changes are made.

User DOMAIN\CFO would be added to DOMAIN\Accounting_RW and DOMAIN\Accounting_Payroll.  Then he or she would have the needed permissions at the outset.
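On the SQL Server side, the one-time setup for a group might look like the sketch below.  The group name matches the read-only example in the list above; the Accounting database name, the AccountingReader role, and the schema-level grant are assumptions chosen for illustration.  sp_addrolemember is used because it works on SQL Server 2008; newer versions also accept ALTER ROLE ... ADD MEMBER.

```sql
-- Sketch only: the database name, role name, and grant scope are hypothetical.

-- Server level: one login for the whole AD group.
CREATE LOGIN [DOMAIN\Accounting_RO] FROM WINDOWS;

USE Accounting;

-- Database level: one user for the group, one role that carries the permissions.
CREATE USER [DOMAIN\Accounting_RO] FOR LOGIN [DOMAIN\Accounting_RO];
CREATE ROLE AccountingReader;
GRANT SELECT ON SCHEMA::dbo TO AccountingReader;

-- Tie the group's database user to the role.
EXEC sp_addrolemember N'AccountingReader', N'DOMAIN\Accounting_RO';
```

After this, adding or removing a person is purely an Active Directory operation and no T-SQL has to change.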

3. Auditing Made Easy

If you enable auditing of successful logins so that they are written to the SQL Server error log, you will see the actual AD user that was logged in, along with the machine from which the login was made.  This information is also shown in sp_who2 and elsewhere in good monitoring tools.
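The same identity information is available live from the sessions DMV.  This is a small sketch that lists who is connected right now, from which machine, and with which program.

```sql
-- Sketch: who is connected, from where, and with what.
SELECT session_id,
       login_name,            -- the AD account when integrated security is used
       original_login_name,   -- differs from login_name only after impersonation
       host_name,
       program_name,
       login_time
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
ORDER BY login_time DESC;
```

With integrated security, login_name is a real person or a named service account, not a shared SQL login.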

Auditing tools now have access as well.  Now, if someone went in and ran a query like this:

update EmployeeSalary set SalaryRate = SalaryRate * 2 where EmployeeType = 'Janitor';

…any good auditing tool would capture the user ID and workstation name / IP of the culprit.  If someone did the same query with the sa login, you would have a lot less information on the culprit than would be advisable.

In short, it’s high time that SQL Server Authentication join the relics of Information Technology in the museum, right next to the IBM System/370-158.

Missing Index Suggestions — Discretion Required!

If you are in Management Studio and have seen a suggested index, or if you have used a resource such as Pinal Dave’s Missing Indexes Script, you will be tempted to apply the suggested index without exercising discretion in doing so.

Please, be careful!  I have seen a table with 10GB of data pages and 50GB+ of index pages.  This is a terrifying thing for write performance, and often, a little bit of analysis is in order.

A common anomaly is to see multiple indexes with identical sort columns and different groups of include columns.  Consolidating these into a single covering index is low-hanging fruit.  Most of the time, you can simply drop the duplicate indexes and replace them with one index that has all of the necessary included columns.

What if all of these included columns add up to most or all of the table columns?  You are then keeping a second copy of the table in the nonclustered index and are causing numerous extra write I/O operations.

In that case, the best bet, most of the time, is to eliminate all included columns and just keep the sort columns as one index.  You may have an outlier query that requires included columns to evade a performance bottleneck.  How do you determine this?  The best way is with the missing-index DMVs included in SQL Server.  Analyze the missing indexes based on the estimated impact of the index change, multiplied by the number of executions that asked for this missing index.  Look at the highest values of the product.

Pinal Dave’s script calculates this number.  Again, however, you have to analyze each recommendation. Just throwing the indexes at the database may solve a read issue, and create a write or disk space issue.
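For readers who want to see the shape of that calculation, here is a minimal sketch against the missing-index DMVs.  It is my own simplified version and not a copy of any published script; the weighting (estimated impact multiplied by the number of seeks and scans that wanted the index) follows the description above.

```sql
-- Sketch: rank missing-index suggestions by impact times demand.
SELECT TOP (25)
       mid.statement                                       AS table_name,
       mid.equality_columns,
       mid.inequality_columns,
       mid.included_columns,
       migs.user_seeks + migs.user_scans                   AS times_requested,
       migs.avg_user_impact                                AS estimated_impact_pct,
       migs.avg_user_impact
         * (migs.user_seeks + migs.user_scans)             AS impact_x_requests
FROM sys.dm_db_missing_index_group_stats AS migs
JOIN sys.dm_db_missing_index_groups      AS mig
  ON mig.index_group_handle = migs.group_handle
JOIN sys.dm_db_missing_index_details     AS mid
  ON mid.index_handle = mig.index_handle
ORDER BY impact_x_requests DESC;
```

The included_columns column is the one to read skeptically: when several rows for the same table differ only there, that is the duplicate-index pattern described earlier.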

We are DBA. We thrive on using the simplest solution and moving on.  However, in many cases, we must do a little digging to determine what solution is truly the simplest.

Auto Close – Who Needs it?

When someone has a database set to Auto Close ON, after the last database user exits, the database is shut down cleanly and SQL Server frees the resources used.  While this may have had some use in the past, that is certainly no longer the case.

Auto Close set to ON will cause database down warnings in modern monitoring tools, which cannot (yet) tell the difference between an auto-closed database and one that is offline or suspect.  In a production environment that leads to numerous “cry-wolf” error messages.

The best thing to do is to ensure that all of your databases are set to auto close OFF.  Some people say that auto close can be advantageous in Development or testing environments.  However, I find that the risks outweigh the benefits.

Auto-close usually defaults to OFF.  However, databases created on the free SQL Server Express edition default to AUTO_CLOSE ON, and they may still carry that setting when you move them up to another edition.

Setting this pernicious feature off is easy:

ALTER DATABASE [YourDatabase] SET AUTO_CLOSE OFF;

Save yourself time and headache and do this to every database on every instance.
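To cover every database on an instance, you can let the catalog tell you which ones need the change.  This sketch only generates the statements; review the output and then run it yourself.

```sql
-- Sketch: generate an ALTER DATABASE statement for each database with AUTO_CLOSE on.
SELECT N'ALTER DATABASE ' + QUOTENAME(name) + N' SET AUTO_CLOSE OFF;' AS fix_statement
FROM sys.databases
WHERE is_auto_close_on = 1;
```

An empty result means the instance is already clean.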

I’m open to hear of a scenario where this option helps; I just cannot see one at this time.

The Importance of Monitoring Tools.

How does the modern DBA deal with 20-1,000 SQL Servers?  There are a few choices.

The first thing a DBA could do is configure a set of alerts to catch error situations. While useful, alerts are reactive in nature and the DBA finds out there is a problem after it happens.

The second thing a DBA could do is to configure a set of scheduled jobs that gather and report on metrics that might predict a failure.  One example is a script set that measures database space available and/or free disk space on the drives/mount points, to help catch things like dwindling free space before it hits you at 3:00am on a Sunday.  Again, this is an excellent thing to do, but implementation may vary by machine and it complicates deployment.
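A script of that kind can be as small as the sketch below, which reports free space on every volume or mount point that holds a database file.  It assumes SQL Server 2008 R2 SP1 or later, where sys.dm_os_volume_stats is available; the 15 percent threshold mentioned in the comment is an arbitrary example.

```sql
-- Sketch: free space on each volume that hosts a database file.
SELECT DISTINCT
       vs.volume_mount_point,
       vs.total_bytes     / 1073741824.0            AS total_gb,
       vs.available_bytes / 1073741824.0            AS free_gb,
       vs.available_bytes * 100.0 / vs.total_bytes  AS free_pct   -- e.g. alert below 15
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs
ORDER BY free_pct;
```

Logging the output to a table on a schedule gives you a trend line, which is what predicts the 3:00am failure.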

The third thing a DBA could do is to acquire a good monitoring tool. There are several: SQL Sentry, Idera SQL Diagnostic Manager, Redgate SQL Monitor, and Dell (formerly Quest) Spotlight on SQL Server Enterprise.

A DBA who has to monitor the health of more than five SQL servers needs the “force multiplier” effect of a good monitoring tool. Scripts and alerts won’t cut it.  My environment uses Spotlight, and has previously dallied with Idera and SQL Sentry. The tool you choose should be determined by your comfort level with said tool, and how effectively it helps you to manage your environment.

What do packaged monitoring tools offer you?

1. Solid Monitoring

Your monitoring product should monitor each SQL instance for performance, blocking, PLE, cache hit rates and more.  It should also monitor the health of services such as replication and/or Availability Groups.  It should be able to gather Windows metrics, and if the server is a VM, it should gather at least some basic metrics from the underlying hypervisor.

2.  Solid Alerting

A good monitoring tool should enable the DBA to configure alerts that page and/or email.  Alert thresholds need to have sensible defaults but be customizable for a given machine or SQL Server instance.  Alerts should be able to drive actions.  For example, a replication alert might be set up to stop a user process in the event of excess latency.

3. Solid High-Level Reporting

Without making myself into a Spotlight evangelist, this is one area where the tool truly excels. It has a heat map with every instance in several colors, from green for all-OK to red for an immediate-attention item. Moreover, multiple alerts increase the size of the non-green instance icon to draw attention. Even better, there is an iPhone app that we can reference anywhere to temperature-check the environment.

4. Easy Provisioning

One should be able to commence monitoring with little to no effort, and preferably without installing client software or scripts.

5. A Solid Repository

This is important. You should be able to store metrics over time for trend reporting, and to access those metrics for custom reporting as needed. Such power, used wisely, can stabilize environments and make production DBA proactive and not reactive.

Some companies may view monitoring tools as a waste of money. I’m blessed to work someplace where management understands the need for the tools, and the end result is a stable environment that is easier for me to manage and is much more effective at meeting corporate needs.

I suggest that you obtain evaluation copies of the popular tools and see for yourself what they buy you in terms of proactive management of your SQL Server environment.