Saturday, December 24, 2011

Cost of GUIDs as PK...greatly exaggerated

The cost of using GUIDs as a primary key in your DB can be quite high if your DB grows to tens of millions of records, especially if you use random GUIDs. However, the dangers of these practices have been greatly exaggerated and fear mongering from some respected DBAs has caused concern among some developers. So much so that many would NEVER use a GUID as a primary key.

Below, you’ll see that, if use properly, the performance difference between integers and GUIDCOMBs is less than 10%.

For modern, quick and innovative software it’s my opinion that dismissing GUIDs as a primary key is a dangerous practice and has dangerous consequences to the integrity and maintainability of our applications. I’m not saying you can’t design, build, and maintain DBs that use integers as primary keys (because of these fears we do everyday), but it’s much harder if you span multiple environments and mock up and test your data properly.

For me this is a very interesting conversation (I know, I know! Most of you have already stopped reading…). I used to work for Kaiser over in Oakland. Kaiser has some truly big tables and I occasionally ran some queries that matched against nearly 60,000,000 rows. In this type of environment even small performance gains add up.

Here at AKQA, we have a few large tables – but most tables here are well under 100,000 rows. At these sizes, even the worst DB designs are not so bad on performance. So the discussion below really only applies when the tables get large, over 500,000 rows.

This is not just an academic exercise. There are serious and valid business reasons for using something like a GUID as a PK:
  • More object ID–like primary key unique across the complete DB.
  • @@IDENTITY has been a source of problems for INSERT triggers.
  • You can move the creation of entities from the data tier to the business or consumer tiers.
  • In .NET, it's a big advantage to know the ID value at the time entities are added to a use case.
  • GUIDs big enough for most scenarios. INTEGERs not big enough for some scenarios.
  • You can prepare for merge replication.
  • IDENTITY can't be used for tables in distributed partitioned views.
  • It's common to see INTEGER keys in the UI (can cause problems). GUIDs rarely displayed.
  • The GUIDs more closely mirrors the NOSQL movement – MongoDB, CouchDB, etc.
To be fair, there are also a few generic (non-db) downsides to GUIDs:
  • GUIDs are hard to remember.
  • They can be a pain to generate when others need them.
  • If accidentally reused they can be hard to spot.
  • They are longer than stringified integers.
The argument against GUIDs as PK has been going on for a few years. Kimberly Tripp has written some seemingly devastating articles on the topic of DB keys.

“…it has a HUGE impact on overall performance…”
“…This is already SCARY and should go down into the "Are you kidding me category?…”
“…you have GOT to be wondering why things are going so horribly wrong?…”
“…It's a COMPLETE NIGHTMARE!…”
“… the random GUID db is really NOT fairing very well...:
10K more rows in SalesDBInts takes 00:26 seconds
10K more rows in SalesDBGUIDs takes 10:10 minutes
10K more rows in SalesDBSeqGUIDs takes 01:12 minutes”

As bad as Tripp makes all this sound, she didn't do her homework. She only considered one type of sequential GUID. In fact, you’ll soon see that using a GUIDCOMB eliminates every one of Tripp’s complaints except size and human-readability. She’s got a legitimate complaint about the size of GUIDs. However, unless you are extremely cheap, I hardly see how space is going to be much of an issue. In even the most extreme cases, the space we are talking about is measured in the low gigabytes. With 2TB drives going for under $100, I don’t really care if my DB is 500mb or 1gb. If size has a serious impact on performance, then that’s a problem.

Even Tripp’s article shows a very small performance difference between integers and SeqGUIDs unless you are dealing with thousands of records. To put her numbers in perspective, the difference is 0.0026 secs/record vs 0.0072 secs/record. I don’t know about you but if I could do record inserts in 0.0072 seconds I would be very happy. Even her worst case scenario is moving along at 0.061 secs/record.

For most of the apps we do, insert speed is not a real factor and the differences above would be insignificant.

Jimmy Nilsson did an analysis in an article called, “The Cost of GUIDs as Primary Keys.” He takes his time and really nails the topic. Indeed, his first tests using standard GUIDs confirm Tripp’s findings showing that there is a 10 – 1 performance difference (or higher) for using GUIDs on tables with over 1,000,000 rows. In his worst test the difference between GUIDs and integers is 30 – 1. Yikes.

Nilsson then continues and does the test that Tripp does not. He creates a truly sequential GUID, combining a random sequence with one generated by datetime. With his new combination GUID, or GUIDCOMB as they are known, he repeats the 30 – 1 test. Again the ratio between GUIDS and integers is 30 – 1, however the difference between GUIDCOMBs and integers is just 1.1 – 1. This 10% performance difference is probably due solely to the size difference between GUIDs (16 bytes) and integers (4 bytes).

So, armed with this knowledge, how hard was it to switch from using GUIDs, to GUIDCOMBs? I wanted to do this to remove any concerns about performance. Well, it was very easy. Using the class at the end of this post, the change means that instead of using this:
var key = Guid.NewGuid();
I now use this:
var key = GuidComb.NewComb();
That’s it. Problem solved and it took about 10 minutes to add the new class to our shared library.

Here’s the class I used to gen the GUIDCOMBs. It should be very easy to port this to your environment if you can’t find a handy GUIDCOMB generator.
using System;

namespace Akqa.Shared.Infrastructure
{
    /// <summary>
    /// A class for generating globally sequential Guids
    /// Source: http://www.developmentalmadness.com/archive/2010/09/28/sequential-guid-algorithm-ndash-implementing-combs-in-.net.aspx
    /// </summary>
    public static class GuidComb
    {
        /// <summary>
        /// Creates a new sequential guid (aka comb) <see cref="http://www.informit.com/articles/article.aspx?p=25862&seqNum=7"/>.
        /// </summary>
        /// <remarks>A comb provides the benefits of a standard Guid w/o the database performance problems.</remarks>
        /// <returns>A new sequential guid (comb).</returns>
        public static Guid NewComb()
        {
            byte[] uid = Guid.NewGuid().ToByteArray();
            byte[] binDate = BitConverter.GetBytes(DateTime.UtcNow.Ticks); // use UTC now to prevent conflicts w/ date time savings

            // create comb in SQL Server sort order
            byte[] comb = new byte[uid.Length];

            // the first 7 bytes are random - if two combs
            // are generated at the same point in time
            // they are not guaranteed to be sequential.
            // But for every DateTime.Tick there are
            // 72,057,594,037,927,935 unique possibilities so
            // there shouldn't be any collisions
            comb[3] = uid[0];
            comb[2] = uid[1];
            comb[1] = uid[2];
            comb[0] = uid[3];
            comb[5] = uid[4];
            comb[4] = uid[5];
            comb[7] = uid[6];

            // set the first 'nibble of the 7th byte to '1100' so 
            // later we can validate it was generated by us
            comb[6] = (byte)(0xc0 | (0xf & uid[7]));

            // the last 8 bytes are sequential,
            // these will reduce index fragmentation
            // to a degree as long as there are not a large
            // number of Combs generated per millisecond
            comb[9] = binDate[0];
            comb[8] = binDate[1];
            comb[15] = binDate[2];
            comb[14] = binDate[3];
            comb[13] = binDate[4];
            comb[12] = binDate[5];
            comb[11] = binDate[6];
            comb[10] = binDate[7];

            return new Guid(comb);
        }

        /// <summary>
        /// Validates if comb was generated by this class
        /// </summary>
        /// <remarks>
        /// Guids generated by Guid.NewGuid() have a value of 
        /// 0100 for the first 4 bits of the 7th byte. Ours will
        /// have a value of 1100 for the 6th byte. We're checking that here.
        /// 
        /// We could do additional validation by verifying that
        /// the value of a new Guid is greater than the
        /// one being validated (or that the last 6 bytes
        /// resolve to a valid DateTime), but this should
        /// be enough for now.
        /// </remarks>
        public static bool IsComb(Guid value)
        {
            // get the 7th byte
            byte b = value.ToByteArray()[6];

            // make sure the first 'nibble' == 1100
            return (0xc0 & b) == 0xc0;
        }

        /// <summary>
        /// Validates Guid to determine the supplied
        /// value was generated by Comb.NewComb. If
        /// invalid an ArgumentException is thrown.
        /// </summary>
        /// <param name="value"></param>
        public static void ValidateComb(Guid value)
        {
            if (!GuidComb.IsComb(value))
                throw new ArgumentException("The supplied Id value was not generated by Comb.NewComb.");
        }
    }
}
Thanks

No comments:

Post a Comment