
(Submitting on behalf of a Snowflake client)

.........................

I want to create a dimension table with features stored as a JSON attribute.

I am thinking of using HASH to uniquely identify my rows, hashing all columns including the JSON column.

I expect to have a few million rows in that dimension.

Snowflake documentation (https://docs.snowflake.net/manuals/sql-reference/functions/hash.html) says that HASH is likely to produce duplicates at around 4 billion rows or more, and warns against using HASH as a key.

Is using a HASH value as a key a reasonable approach when the dimension only has a few million rows?
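Roughly what I have in mind (table, column, and staging names below are just placeholders):

```sql
-- Sketch of the idea: a dimension keyed by HASH over all attributes,
-- including the JSON (VARIANT) column.
CREATE OR REPLACE TABLE feature_dim (
    dim_key     NUMBER,      -- would hold HASH(...) of the row
    entity_name VARCHAR,
    features    VARIANT      -- the JSON attribute
);

INSERT INTO feature_dim
SELECT
    HASH(entity_name, features) AS dim_key,  -- 64-bit hash used as the key
    entity_name,
    features
FROM staging_features;                        -- hypothetical staging table
```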

.........................

Any ideas, alternative recommendations, or possible work-arounds? THANK YOU.


1 Answer


That's a fun question.

Assuming the hash is truly random, calculating the probability of a collision is an extension of the birthday problem. We can approximate the probability as

p(collision) ≈ 1 - e^(-n² / (2d))

where n is the number of values and d is the size of the domain (the number of possible hash values). Plugging in 2^32 (about 4 billion) as n and 2^64 as d, we get p ≈ 0.39, so there's a pretty high likelihood of a collision.

But if n is only a few million, this probability is much lower. E.g., for n = 10,000,000, we get p ≈ 0.0000027. That sounds pretty safe, but there's clearly some risk. And this assumes that the hash is perfect, so you should probably nudge that probability up a bit.
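If you want to check those numbers yourself, the approximation is easy to evaluate directly in Snowflake; a quick sketch:

```sql
-- Birthday-problem approximation p ≈ 1 - e^(-n² / (2d)) with d = 2^64.

-- n = 2^32 (about 4 billion) rows: p ≈ 0.39
SELECT 1 - EXP(-POWER(2, 32) * POWER(2, 32) / (2 * POWER(2, 64))) AS p_collision_4b;

-- n = 10 million rows: p ≈ 0.0000027
SELECT 1 - EXP(-POWER(1e7, 2) / (2 * POWER(2, 64))) AS p_collision_10m;
```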

You could try a longer, more standard hash, like SHA-2, which Snowflake supports. There's always some risk of collision, but if you make the hash long enough, this will become vanishingly small—which is all you can hope for with a hash.
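As a rough sketch (column names are placeholders), you could derive a 256-bit key from all of the attributes, JSON included:

```sql
-- SHA2 returns a hex string; at 256 bits, an accidental collision across a
-- few million rows is effectively impossible.
-- Note: || returns NULL if any operand is NULL, so wrap nullable columns
-- (e.g. with COALESCE) before concatenating.
SELECT
    SHA2(entity_name || '|' || TO_VARCHAR(features), 256) AS dim_key,
    entity_name,
    features
FROM feature_dim;
```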

A better alternative to hashing, though, might be to put the JSON in a separate table and use AUTOINCREMENT to assign a real unique identifier to each record. You'd then join using this key. If you do it right, it should always work, and I'd expect join performance to be better as well.
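A sketch of what that could look like (again, table and column names are just placeholders):

```sql
-- Surrogate key generated by Snowflake instead of derived from the data.
CREATE OR REPLACE TABLE feature_json (
    feature_id INTEGER AUTOINCREMENT,  -- guaranteed-unique surrogate key
    features   VARIANT                 -- the JSON attribute lives here
);

CREATE OR REPLACE TABLE feature_dim (
    dim_key     INTEGER,               -- references feature_json.feature_id
    entity_name VARCHAR
);

-- Joins then use the integer key, which is cheap to compare:
SELECT d.entity_name, j.features
FROM feature_dim d
JOIN feature_json j
  ON j.feature_id = d.dim_key;
```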