Recently I received pushback from a developer about the use of a GUID primary key. The proposed database schema, intended to be used with Entity Framework, contained numerous tables primary-keyed on a column named ID and typed as nvarchar(50). the tables were clustered on this column, and it was intended to hold a GUID that would be populated with the .NET call System.Guid.NewGuid().
Cue Blood-Curdling Screaming Now.
I can hear the blood-curdling screams from every DBA worth his or her salt. I asked for the column to be redefined as bigint IDENTITY, citing the well-known adverse effects of clustering a GUID, and in this case worse because the GUID would be stored as a 37-character Unicode value, taking a minimum of 74 bytes per row.
The developer objected because of turns to the SQL Server; this application requires extreme throughput and low latency. The developer would have to retrieve the generated identity value, and thereby dependent rows in child tables would have to be separately populated. If the developer used a program-generated GUID, no database access is required until insert time, which can be done in bulk through EF. Developer therefore offered to cluster on an IDENTITY column but keep the Primary Key as a GUID.
I still objected because the nonclustered index quickly reached 99.5% fragmentation in experiments after only about 1000 rows, and this was with a fill factor of 50% defined for that index. The developer argued that the fragmentation of the nonclustered index was meaningless. The index size was huge, whether I used the nvarchar(50) or the uniqueidentifier (16 bytes) data types to hold the GUID.
Should I Care?
Now the question is whether I should care about a highly fragmented nonclustered index. Remember that this is a foreign key, and this GUID must (for performance reasons) be indexed for join purposes on both parent and child table. But this is a SEEK operation, is it not? Numerous articles argue that SEEK operations on nonclustered indexes, even when highly fragmented (which is the norm when the key is a GUID) are not expensive.
However, focusing on individual seek performance misses an important point here. The query below explains why I care:
Select O.ORDER_NUMBER , C.CUSTOMER_NAME , I.ITEM_DESCRIPTION , L.QUANTITY , L.UNIT_COST , (L.UNIT_COST * L.QUANTITY) as EXTENDED from ORDER_MAST O INNER JOIN CUSTOMER C ON C.GUID = O.CUST_GUID INNER JOIN ORDER_LINE L ON L.ORDER_GUID = O.GUID INNER JOIN ITEM I ON I.GUID = L.ITEM_GUID where O.GUID = '28592B77-5684-4C8E-9B70-997C8A7DC7E0';
Order lines may be OK, because there will be an index on ORDER_GUID for the ORDER_LINE table. However, every line item seek would not be so lucky. The items will be wildly randomized in the ITEM table, and each ORDER_LINE will need to join ITEM. A smaller but virtually identical issue arises as ORDER_MAST joins to CUSTOMER.
Now let’s make the point even more directly:
Select O.ORDER_NUMBER , C.CUSTOMER_NAME , I.ITEM_DESCRIPTION , L.QUANTITY , L.UNIT_COST , (L.UNIT_COST * L.QUANTITY) as EXTENDED from ORDER_MAST O INNER JOIN CUSTOMER C ON C.GUID = O.CUST_GUID INNER JOIN ORDER_LINE L ON L.ORDER_GUID = O.GUID INNER JOIN ITEM I ON I.GUID = L.ITEM_GUID WHERE O.ORDER_DATE between '01/01/2015' and '04/01/2015' order by O.ORDER_NUMBER, I.ITEM_DESCRIPTION;
Assume that the ORDER_MAST table is clustered on an identity and the ORDER_DATE column is a datetime and has a nonclustered index.
Assume also that there are 10,000 orders with 10 lines each. That means that there is a likelihood of reading 99,500 separate pages of data into memory in order to perform the ORDER_LINE to ITEM join. Furthermore, without a sequence number column in the order_line table, one would be highly likely to never be able to order the lines into the order in which they were inserted into the table.
Defragmenting the GUID index would reduce those 99,500 page reads to about 50,000 page reads for a non-clustered index with a 50% fill factor. If you have an 8 byte bigint clustered index that is also the PK, you would likely get many rows per page (depending on row size) and in the end increase speed by more than one order of magnitude.
Hence a sequential surrogate PK has its uses. It would dramatically speed up this query. We haven’t discussed the speed implications of insert performance as the index pages are split over and over again, and the concept that defragmenting a nonclustered index based on a GUID could actually degrade insert performance by increasing the number of page splits. Therefore, if anyone uses a utility such as dba_indexdefrag by Michelle Ufford, one is likely to generate slowness in inserts.
Do I care? Heck yes! This is a hill to die on for any SQL Server DBA.
In this case, the developer agreed to use an internally generated ever increasing unique number based on time, machine ID, and an increasing integer. The generated number should fit into a BIGINT or a Decimal(19,0) data type, resulting in nine and not 74 bytes per row of PK. This agreement solves all of the problems in a manner substantially identical to the IDENTITY or a SEQUENCE, while meeting the developer’s performance concerns — rightly expressed because this application requires maximum throughput.
Lesson? Don’t allow GUIDs as clustered or non-clustered PK. Nope. Never.