0
votes

I will incrementally insert rows to a table. This table stores sales facts, and has some columns that will be used to define a identifier: business id (int), product name (string), product price (float). E.g. <1, heineken, 1.0>, <1, heineken, 22.99>.

Certainly, these values will be used in joins. When thinking the SQL way, I would create a hashed column using those columns. This way, I would be able to optimize some queries.

How about data lake and u-sql? Should I calculate the hash on insert? Should I leave it as is? Should I simply concatenate the values and create a big string?

Thanks in advance.

1

1 Answers

1
votes

While U-SQL supports clustering and distribution schemes on multiple columns, you probably could gain some additional performance in your joins if you find an efficient value to do your equi-join comparison. So you could calculate a hash or concatenate.

However, I think finding the right distribution scheme and clustering is the better "bang for your buck".

And, more importantly, please do not incrementally insert small number of rows, but use bulk insertion of many rows at the same time (e.g., daily or weekly). And regularly rebuild the table or table partition to avoid table fragmentation that will have a much bigger impact on your query performance.