Bulk Comparison with Hashbytes
Imagine we have two tables (or views), TableNew
and TableOld
, and we want to know if their data is the same or different.
Assume they have the same schema.
Further assume that their primary key is a composite key with two columns, PK1
and PK2
.
With
NEWSET as (
SELECT
MBT.PK1,
MBT.PK2,
hashbytes('MD5',
(SELECT MBT.*
FROM (VALUES(NULL))foo(bar)
FOR xml auto)) AS [Hash]
FROM TableNew AS MBT)
,OLDSET as (
SELECT
MBT.PK1,
MBT.PK2,
hashbytes('MD5',
(SELECT MBT.*
FROM (VALUES(NULL))foo(bar)
FOR xml auto)) AS [Hash]
FROM TableOld AS MBT)
,Comparison as (
Select
n.PK1,
n.PK2,
Case
when o.[Hash] is null then 'Not in old set'
when o.[Hash] != n.[Hash] then 'Changed'
else 'Same'
end as Summary
from NewSet n
left outer join OldSet o on n.PK1 = o.Pk1 and n.PK2 = o.PK2
)
,Comparison2 as (
Select
o.PK1,
o.PK2,
'Not in new set' as Summary
from OldSet o
left outer join NewSet n on n.PK1 = o.Pk1 and n.PK2 = o.PK2
where n.PK1 is null and n.PK2 is null -- i.e. not match
)
Select Summary, Count(*) as [Count] from Comparison group by Summary
UNION
Select Summary, Count(*) as [Count] from Comparison2 group by Summary
This will give you a summary table that tells you:
- how many are in the old set but not the new (based on primary key)
Not in new set
- how many are in the new set but not the old (based on primary key)
Not in old set
- how many are in both sets but with different details in at least one column
Changed
- how many are in both sets and with identical details in every column
Same
e.g.
Summary | Count |
---|---|
Changed | 607534 |
Not in new set | 56 |
Not in old set | 7017 |
Same | 65449 |
Problem
The maximum input to hashbytes
-- prior to SQL Server 2016
-- is 8000 bytes.
Solution for versions less than SQL Server 2016
This yak took many cans of shaving cream!
You can use this function to hash varbinary(max)
... (note you don't specify which hash function is used... so I'm guessing it's possible that two different database instances might use different hash functions inside the repl hash binary... it may be configurable? but within one database instance this should work consistently...)
master.sys.fn_repl_hash_binary(SOME_VARBINARY)
...but we don't have a var binary we have a nvarchar(max)
!
So cast the nvarchar(max)
to varbinary(max)
before running the function....
master.sys.fn_repl_hash_binary(cast(SOME_NVARCHAR_MAX as varbinary(max)))
...And that appears to work, in SSMS... but the output is actually an array of bytes, not a varchar of any sort.
I found the trick this time is to convert the result to varchar(32) with a third parameter of 2
, i.e.
Convert(varchar(32), master.sys.fn_repl_hash_binary(cast(SOME_NVARCHAR_MAX as varbinary(max))), 2) as Hasho
So here's our function modified to work with large nvarchar's prior to SQL Server 2016
. (I'm writing/needing this in 2020!)
With
NEWSET as (
SELECT
MBT.PK1,
MBT.PK2,
Convert(varchar(32), master.sys.fn_repl_hash_binary(cast((SELECT MBT.*
FROM (VALUES(NULL))foo(bar)
FOR xml auto) as varbinary(max))), 2) as [Hash]
FROM TableNew AS MBT)
,OLDSET as (
SELECT
MBT.PK1,
MBT.PK2,
Convert(varchar(32), master.sys.fn_repl_hash_binary(cast((SELECT MBT.*
FROM (VALUES(NULL))foo(bar)
FOR xml auto) as varbinary(max))), 2) as [Hash]
FROM TableOld AS MBT)
,Comparison as (
Select
n.PK1,
n.PK2,
Case
when o.[Hash] is null then 'Not in old set'
when o.[Hash] != n.[Hash] then 'Changed'
else 'Same'
end as Summary
from NewSet n
left outer join OldSet o on n.PK1 = o.Pk1 and n.PK2 = o.PK2
)
,Comparison2 as (
Select
o.PK1,
o.PK2,
'Not in new set' as Summary
from OldSet o
left outer join NewSet n on n.PK1 = o.Pk1 and n.PK2 = o.PK2
where n.PK1 is null and n.PK2 is null -- i.e. not match
)
Select Summary, Count(*) as [Count] from Comparison group by Summary
UNION
Select Summary, Count(*) as [Count] from Comparison2 group by Summary