May 10, 2022

Introduction

I work heavily with data warehouses. Instead of using physical tables for fact and dimension tables, I tend to virtualize the data by creating views for Analysis Services to consume. As a general rule, I have a unique ID column in each dimension that is used to connect to the fact table. Over the years, I have simply used the window function ROW_NUMBER() OVER() to create that unique ID field in the views. The problem with this approach is that I then have to join my fact view to all the dimension views to look up the unique ID for each row, which causes every view to be queried whenever the fact view is queried.
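To illustrate that pattern, here is a minimal sketch of what those joins look like. The view and column names (Customers, DimCustomer, FactSales, and so on) are hypothetical, not taken from an actual warehouse:

```sql
-- Hypothetical dimension view: ROW_NUMBER() generates the surrogate key.
CREATE VIEW dbo.DimCustomer AS
SELECT
	  ROW_NUMBER() OVER(ORDER BY c.first_name, c.last_name, c.email) AS customer_id
	, c.first_name
	, c.last_name
	, c.email
FROM dbo.Customers c;
GO

-- The fact view then has to join back to the dimension view just to
-- pick up customer_id, forcing the dimension query to run as well.
CREATE VIEW dbo.FactSales AS
SELECT
	  d.customer_id
	, s.sale_amount
	, s.sale_date
FROM dbo.Sales s
JOIN dbo.DimCustomer d
	ON  d.first_name = s.first_name
	AND d.last_name  = s.last_name
	AND d.email      = s.email;
```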

A couple of weeks ago I was working on some views with very complex joins, and performance was very slow because of them. I got to thinking that there had to be a better way to come up with a repeatable unique ID, since the window function was not consistent per row: the numbers would change if new rows were added. If I could find a method to do this, I could generate the key in the fact view from the same source database tables I generated the dimension views from, since I had to join to those tables anyway.

The fact that I needed a unique identifier made me immediately think of hashing certain fields to get a unique value per row. Hash collisions are extremely rare, though not impossible. For the small dataset I was working with, an MD5 hash was more than sufficient, and it eliminated the need to run multiple complex queries at the same time.

The other downside to this method is that a hash in SQL Server is returned as a VARBINARY. If you are connecting views directly, that isn't a big deal. However, this view is consumed by SQL Server Analysis Services, which does not recognize the VARBINARY data type. My workaround was to cast the VARBINARY to a BIGINT, which allows Analysis Services to recognize the unique identifier column.
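As a quick sketch of that cast (one thing worth knowing: MD5 returns 16 bytes, and converting a VARBINARY longer than 8 bytes to BIGINT truncates it on the left, so only the last 8 bytes of the hash survive):

```sql
-- HASHBYTES returns VARBINARY(16) for MD5.
SELECT HASHBYTES('MD5', 'Alan') AS raw_hash;

-- Casting to BIGINT keeps only the last 8 bytes of the hash,
-- but yields a type that Analysis Services understands.
SELECT CAST(HASHBYTES('MD5', 'Alan') AS BIGINT) AS id;
```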

The other day I was talking with a co-worker about the solution I came up with. He was curious about the speed difference between the two approaches. I had been dealing with a small enough dataset that I hadn't given it much thought before, but after our discussion I decided to run some tests and see.

Resolving the Issue

For this test, I created a table and pulled some randomly generated data from a data-generating website, ending up with 129,992 rows. I ran each query 5 times and averaged the times for the final comparison.

The first test used the window function ROW_NUMBER() OVER() to generate the unique ID.

SELECT 
	  ROW_NUMBER() OVER(PARTITION BY 1 ORDER BY first_name, last_name, email) rn
	, [user_seed_id]
	, [first_name]
	, [last_name]
	, [email]
	, [address]
	, [city]
	, [postal_code]
	, [country]
FROM [dbo].[UserSeedData];

-- Ran 5 times:
-- 00:00:01.007
-- 00:00:00.972
-- 00:00:00.941
-- 00:00:00.938
-- 00:00:00.945

-- Average: 
-- 00:00:00.961

Using ROW_NUMBER() OVER(), the query averaged 961 milliseconds.

Next, I ran essentially the same query using HASHBYTES() on the same fields as the query above, then cast the result to a BIGINT to produce a unique key.

For anyone not familiar with CONCAT_WS(), it is a function that concatenates multiple values into a single string. The first argument is a separator that is placed between each of the remaining values.

SELECT CONCAT_WS('-', 'a', 'b', 'c');

-- Result: 
-- a-b-c

Using CONCAT_WS() allowed me to combine the fields into a single value that I could pass to HASHBYTES().

SELECT 
	  CAST(HASHBYTES('MD5', CONCAT_WS('-', [first_name], [last_name], [email])) AS BIGINT) id
	, [user_seed_id]
	, [first_name]
	, [last_name]
	, [email]
	, [address]
	, [city]
	, [postal_code]
	, [country]
FROM [dbo].[UserSeedData];


-- Ran 5 times:
-- 00:00:00.849
-- 00:00:00.798
-- 00:00:00.822
-- 00:00:00.809
-- 00:00:00.805

-- Average: 
-- 00:00:00.817

The results here surprised me a little: this method ran in only 817 milliseconds, which is 144 milliseconds faster on average.

Considerations

The situation I ran into required me to generate a unique identifier per row, and the key columns in each row also had to be unique. With this method, I could tell very quickly whether there were duplicates in the table based on the new identifier field I had created.
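A quick way to surface those duplicates is to group on the generated hash ID, along the lines of this sketch against the test table:

```sql
-- Rows whose key fields hash to the same value are duplicates.
SELECT
	  CAST(HASHBYTES('MD5', CONCAT_WS('-', [first_name], [last_name], [email])) AS BIGINT) AS id
	, COUNT(*) AS row_count
FROM [dbo].[UserSeedData]
GROUP BY CAST(HASHBYTES('MD5', CONCAT_WS('-', [first_name], [last_name], [email])) AS BIGINT)
HAVING COUNT(*) > 1;
```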

ROW_NUMBER() OVER() also created a unique identifier for each row, but it did not take duplicate key values into account. In fact, the results could change between runs when key values were duplicated, since the sort order among ties is not deterministic. There are reasons to use either one of these methods.
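As a small illustration, two rows with identical key values still receive distinct row numbers, and which row gets which number is not guaranteed:

```sql
-- Duplicate key values still get distinct row numbers.
SELECT
	  ROW_NUMBER() OVER(ORDER BY first_name) AS rn
	, first_name
FROM (VALUES ('Alan'), ('Alan')) AS t(first_name);
-- Returns rn values 1 and 2, but the assignment between
-- the two identical rows is arbitrary.
```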

Another thing worth pointing out is that HASHBYTES() is case sensitive. "Alan" and "ALAN" hash to different values, even though the database may treat them as the same under a case-insensitive collation. If you are dealing with string data, I recommend wrapping the varchar fields in UPPER(). That keeps the values consistent so HASHBYTES() hashes them the same way and duplicates are still detected. It would look something like this.

SELECT 
	  CAST(HASHBYTES('MD5', CONCAT_WS('-', UPPER([first_name]), UPPER([last_name]), UPPER([email]))) AS BIGINT) id
	, [user_seed_id]
	, [first_name]
	, [last_name]
	, [email]
	, [address]
	, [city]
	, [postal_code]
	, [country]
FROM [dbo].[UserSeedData];

Conclusion

The queries above demonstrate two ways to generate unique identifiers in views.

ROW_NUMBER() OVER() generates a number that is unique for every row, regardless of whether the values in your key columns are duplicated.

Using HASHBYTES() to generate the identifier means the ID will repeat wherever the key fields contain duplicate data. If you need to expose duplicate data, HASHBYTES() is a fast solution for generating an identifier field.

These tests were run on SQL Server 2019 Developer Edition.

I don't have a comments section yet, so feel free to send me feedback on this blog.


Kevin Williams

Kevin is a data engineer and is the Business Intelligence Practice Lead at Software Design Partners specializing in data warehousing. He is a father, an occasional gamer, and lover of many different types of music.


The opinions expressed on this site are my own and may not represent my employer's view.