Monday, March 17, 2014

Creating a truly unique value from several others in T-SQL

Update 2015 August 28: I've replaced the function master.sys.fn_varbintohexstr with CONVERT, with the extra parameter 2, which translates a binary field into a hexadecimal string with no leading 0x. In addition to being ugly to use, fn_varbintohexstr is very slow.

Sometimes you need to create a unique identifier for a bunch of values so that you use it as an ID in the database. The immediately obvious choice is the CHECKSUM and BINARYCHECKSUM functions. But beware, the purpose of these functions is to detect changes in a string, not to uniquely identify it. It might seem strange, but the two concepts are very different. The change modification functionality is only meant to generate very different values on small changes. The uniqueness is trying to create a value as distinctive as possible for any string. That is why when you use a checksum you will get a lot of similar values for (very) different strings.

Enter HASHBYTES, another function that has the purpose of creating a cryptographic hash for a string. It is mainly used for password hashing, but it will fit nicely for our purpose. There are some caveats, though. First, CHECKSUM gets a variable number of parameters, HASHBYTES only accepts one, so we must take care of the cumbersome concatenation of multiple values. Unfortunately SQL functions do not have the option of variable parameters, which is truly a shame, so we can't hack it. Also, the value that HASHBYTES returns is a varbinary. We could cast it to NVARCHAR, but it turns into a weird Chinese characters string. In order to turn it into a proper string, we need to use the same function used by SQL Server to display varbinary when selecting it: master.sys.fn_varbintohexstr the CONVERT function with a parameter of 2 (hex string without the leading 0x).

So let's compare the two usages. Suppose we have this nice table that contains company data: company name, contact first name, contact last name, phone, email, yearly value. We need to create a unique ID based on these values.
First CHECKSUM:
SELECT CHECKSUM(companyName, firstName, lastName, phone, email, yearlyValue) FROM OurTable
So easy! Just add the columns, no matter how many or what type they have, and get a value as a result. You can even use * to select all columns in a row. You also have the advantage of getting the same checksum for differently capitalized strings. If you don't want this behaviour, use BINARYCHECSUM, which works even better.

Second HASHBYTES:
SELECT CONVERT(VARCHAR(Max),HASHBYTES('SHA1',companyName+'|'+firstName+'|'+lastName+'|'+phone+'|'+email+'|'+CAST(yearlyValue as NVARCHAR(100))),2) as id,*
FROM OurTable
Ugly! You need to create a string from different types, using ugly casts. Also, this works more like BINARYCHECKSUM. If you want to get the same functionality as CHECKSUM you need to use LOWER(LTRIM(RTRIM(value))). Horrid!
However, it works.

WARNING: using CAST to NVARCHAR from a FLOAT loses precision. You should use STR instead!

A middle solution is to use XCHECKSUM. What is that, you ask? A placeholder that can be replaced with some regular expression search and replace, of course :)

Update: I've created a query that creates the script to update the value of a column called 'ValuesHash', for tables that have it, with the hash of all columns that are not in a list of names, they are not primary keys and they are not foreign keys, plus they are not computed, rowguidcol or filestream.
Imagine the scenario where you have something like this:
  • Table A:
    1. Id: primary identity key
    2. Data1: some data
    3. Data2: some data
    4. CreateTime: the creation time
    5. ValuesHash: a VARBINARY(50) column - only 20 are required normally, but let's make sure :)
  • Table B:
    1. Id: primary identity key
    2. AId: foreign key to A
    3. Data1: some data
    4. Data2: some data
    5. ModifyTime: the modification time
    6. ValuesHash: a VARBINARY(50) column - only 20 are required normally, but let's make sure :)
  • Table C:
    1. Id: primary identity key
    2. AId: foreign key to A
    3. Data1: some data
    4. Data2: some data
The query below will update ValuesHash for A and B (because C doesn't have the ValuesHash column) with a hash constructed from the Data columns. The Id columns will be ignored for being primary keys (and for being in the list of columns to ignore), the AId columns will be ignored for being foreign keys, ValuesHash and CreateTime and ModifyTime will be ignored for being in a list of custom columns)

WARNING: each column data is always truncated to 4000 characters, then the corresponding string is also truncated to 4000 bytes before running HASHBYTES (which only accepts a maximum of 8000 bytes). This hash will help in determining unique records, but it is not 100%.

SELECT * 
FROM   (
SELECT t.name, 
  'UPDATE [' + t.name+ '] SET ValuesHash = HASHBYTES(''SHA1'',SUBSTRING(' 
  + Stuff( 
    (SELECT '+ ''|''+ ISNULL('+CASE 
      WHEN tp.name IN ('float', 'real') THEN 'STR('+c.name+',30,30)' 
      WHEN tp.name IN ('binary', 'varbinary') THEN 'CONVERT(NVARCHAR(4000),'+c.name+',2)'
      ELSE 'CONVERT(NVARCHAR(4000),'+c.name+')' END+','''')'
     FROM sys.all_columns c 
         INNER JOIN sys.types tp 
      ON c.system_type_id=tp.system_type_id
      AND c.user_type_id=tp.user_type_id 
     LEFT JOIN sys.index_columns ic
      ON ic.object_id=t.object_id
      AND ic.column_id=c.column_id
     LEFT JOIN sys.indexes i
      ON ic.object_id=i.object_id
      AND ic.index_id=i.index_id 
     LEFT JOIN sys.foreign_key_columns fc
      ON fc.parent_object_id=t.object_id 
            AND c.column_id=fc.parent_column_id 
     WHERE t.object_id=c.object_id
      AND ISNULL(c.is_identity, 0)=0
      AND ISNULL(c.is_computed, 0)=0
      AND ISNULL(c.is_filestream, 0)=0
      AND ISNULL(c.is_rowguidcol, 0)=0
      AND ISNULL(i.is_primary_key, 0)=0 
      AND fc.parent_column_id IS NULL
      AND c.name NOT IN ('Id', 'CreateTime' , 'AcquireTime' , 'IntermediateCreateTime', 'IntermediateModifyTime', 'IntermediateDeleteTime', 'ValuesHash')
     ORDER BY Sign(c.max_length) DESC, c.max_length, Lower(c.name)
     FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
    , 1, 7, '') 
  + ',0,4000)) WHERE ValuesHash IS NULL' AS computed 
FROM   sys.tables t 
INNER JOIN sys.all_columns c 
  ON t.object_id = c.object_id 
WHERE  c.name = 'ValuesHash') x 
WHERE  computed IS NOT NULL 
ORDER  BY name 

Change it to suit your needs. It is by no means perfect, but it's a start for whatever you need.

Update:

A new FORMAT function was introduced in SQL Server 2012, working somewhat similar to the .NET ToString method. Using that function is slightly more precise:

SELECT * 
FROM   (
SELECT t.name, 
  'UPDATE [' + t.name+ '] SET ValuesHash = HASHBYTES(''SHA1'',SUBSTRING(' 
  + Stuff( 
    (SELECT '+ ''|''+ ISNULL('+CASE 
      WHEN tp.name IN ('float', 'real') THEN 'FORMAT('+c.name+',''R'')' 
      WHEN tp.name IN ('decimal') THEN 'FORMAT('+c.name+',''G'')' 
      WHEN tp.name IN ('datetime','datetime2') THEN 'FORMAT('+c.name+',''O'')' 
      WHEN tp.name IN ('binary', 'varbinary') THEN 'CONVERT(NVARCHAR(4000),'+c.name+',2)'
      ELSE 'CONVERT(NVARCHAR(4000),'+c.name+')' END+','''')'
     FROM sys.all_columns c 
         INNER JOIN sys.types tp 
      ON c.system_type_id=tp.system_type_id
      AND c.user_type_id=tp.user_type_id 
     LEFT JOIN sys.index_columns ic
      ON ic.object_id=t.object_id
      AND ic.column_id=c.column_id
     LEFT JOIN sys.indexes i
      ON ic.object_id=i.object_id
      AND ic.index_id=i.index_id 
     LEFT JOIN sys.foreign_key_columns fc
      ON fc.parent_object_id=t.object_id 
            AND c.column_id=fc.parent_column_id 
     WHERE t.object_id=c.object_id
      AND ISNULL(c.is_identity, 0)=0
      AND ISNULL(c.is_computed, 0)=0
      AND ISNULL(c.is_filestream, 0)=0
      AND ISNULL(c.is_rowguidcol, 0)=0
      AND ISNULL(i.is_primary_key, 0)=0 
      AND fc.parent_column_id IS NULL
      AND c.name NOT IN ('Id', 'CreateTime' , 'AcquireTime' , 'IntermediateCreateTime', 'IntermediateModifyTime', 'IntermediateDeleteTime', 'ValuesHash')
     ORDER BY Sign(c.max_length) DESC, c.max_length, Lower(c.name)
     FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')
    , 1, 7, '') 
  + ',0,4000)) WHERE ValuesHash IS NULL' AS computed 
FROM   sys.tables t 
INNER JOIN sys.all_columns c 
  ON t.object_id = c.object_id 
WHERE  c.name = 'ValuesHash'
) x 
WHERE  computed IS NOT NULL 
ORDER  BY name 

0 comments: