I have a big Hive table in which, to identify a unique record, I need to check 4 columns, let's say col_A, col_B, col_C, and col_D. I use these columns to identify changed records and overwrite Hive table partitions.
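For context, the overwrite step I mean is the usual dynamic-partition pattern, roughly like this (the target/changed table names and the dt partition column are placeholders, not my real schema):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- 'changed' stands for the rows selected by one of the join strategies below
INSERT OVERWRITE TABLE target PARTITION (dt)
SELECT col_A, col_B, col_C, col_D, lat, `long`, dt
FROM changed;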
I am trying several options to do this:
1) Join on all 4 columns separately (this was slow).
2) Concatenate all 4 fields to create a single key column (string type).
3) Use the key from step 2 and convert the string to binary (is binary faster to join on?).
4) Use a hash function to create the key (output type is int). When I used the HASH() function in Hive over a billion records, I got around a million duplicate keys.
5) Use other functions such as HEX, MD5 (string), SHA1/SHA2, or CRC32 to form the key column. (A sketch of options 2-5 follows this list.)
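To make options 2-5 concrete, this is roughly what I build (a sketch over my table T; the '||' delimiter matters because a plain concat lets ('ab','c') and ('a','bc') produce the same key, and md5 assumes Hive 1.3+):

SELECT
  concat_ws('||', col_A, col_B, col_C, cast(col_D AS string)) AS key_concat,               -- option 2
  cast(concat_ws('||', col_A, col_B, col_C, cast(col_D AS string)) AS binary) AS key_bin,  -- option 3
  hash(col_A, col_B, col_C, col_D) AS key_hash,   -- option 4: 32-bit int, so collisions over ~1B rows are unavoidable
  md5(concat_ws('||', col_A, col_B, col_C, cast(col_D AS string))) AS key_md5              -- option 5
FROM T;

A quick collision check for any candidate key looks like this (note it also counts genuinely duplicated rows, not only hash collisions):

SELECT key_hash, count(*) AS cnt
FROM (SELECT hash(col_A, col_B, col_C, col_D) AS key_hash FROM T) x
GROUP BY key_hash
HAVING count(*) > 1;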
Have you used anything similar in your project? What gave better performance? Is there another, better way to do this join?
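For reference, these are the two join shapes I am comparing, as a rough sketch (t_new and t_old are placeholders for my staging and target tables):

-- option 1: join on all 4 natural-key columns
SELECT n.*
FROM t_new n
JOIN t_old o
  ON n.col_A = o.col_A
 AND n.col_B = o.col_B
 AND n.col_C = o.col_C
 AND n.col_D = o.col_D;

-- options 2-5: precompute one surrogate key on each side, then join on it
SELECT n.*
FROM (SELECT *, md5(concat_ws('||', col_A, col_B, col_C, cast(col_D AS string))) AS k FROM t_new) n
JOIN (SELECT *, md5(concat_ws('||', col_A, col_B, col_C, cast(col_D AS string))) AS k FROM t_old) o
  ON n.k = o.k;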
P.S. I have already set several optimization parameters, such as:
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
set hive.vectorized.execution.enabled = true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=false;
ANALYZE TABLE T COMPUTE STATISTICS FOR COLUMNS;
etc. My main goal is to understand the best and most efficient way to join tables in the above scenario.
Sample data:

col_A | col_B | col_C | col_D            | lat      | long
abc   | df    | qw    | 2005-10-30T10:45 | 12.3256  | -50.2368
abc   | df    | qw    | 2005-10-30T10:45 | 11.12478 | -54.23587
abc   | d     | xx    | 2005-10-30T10:45 | 15.3397  | -58.6982