
I have a big Hive table in which, to identify a unique record, I need to check 4 columns, let's say col_A, col_B, col_C and col_D. I use these columns to identify changed records and overwrite Hive table partitions.

I am trying several options to do this:

1) Join on all 4 columns separately (this was slow).

2) Concatenate all 4 fields to create a key column (string type).

3) Use step 2 and convert the string to binary (is binary faster to join on?).

4) Use a hash function to create a key (I got around a million duplicates when I used the HASH() function in Hive over a billion records; the output is of type int).

5) Use other functions like HEX, MD5 (string), SHA1/SHA2 or CRC32 to form a key column (a rough sketch of options 2 and 5 follows this list).
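For reference, this is roughly what I mean by options 2 and 5. It is only a sketch; my_table and the '||' separator are placeholders, not my real names:

-- Build candidate surrogate keys from the 4 business columns (placeholder table name).
-- NOTE: concat_ws() skips NULL arguments, so NULLs in any of the 4 columns
-- still need explicit handling (e.g. coalesce()).
SELECT
    concat_ws('||', col_A, col_B, col_C, cast(col_D AS string))      AS concat_key,  -- option 2
    md5(concat_ws('||', col_A, col_B, col_C, cast(col_D AS string))) AS md5_key      -- option 5 (md5() needs Hive 1.3.0/2.0.0+)
FROM my_table;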

Have you used anything similar in your projects? What gave better performance? Is there another, better way to do this join?

P.S. I have already set several optimization parameters, such as:

SET hive.compute.query.using.stats=true;

SET hive.stats.fetch.column.stats=true;

SET hive.stats.fetch.partition.stats=true;

set hive.vectorized.execution.enabled = true;

set hive.auto.convert.join=true;

set hive.auto.convert.join.noconditionaltask=false;

Analyze table T compute statistics for columns;

etc. My main goal is to understand the best and most efficient way to join tables in the above scenario.

col_A | col_B | col_C | col_D | lat | long

abc | df | qw | 2005-10-30 T 10:45 | 12.3256 | -50.2368

abc | df | qw | 2005-10-30 T 10:45 | 11.12478 | -54.23587

abc | d | xx | 2005-10-30 T 10:45 | 15.3397 | -58.6982

Are you using ORC as the file format? Can you give more details about your table configuration (partitions etc) - Pradeep Gollakota
Hi Pradeep, I am using an ORC table with Snappy compression and a few other optimizations, like the following: 'orc.compress'='SNAPPY', 'orc.create.index'='true', 'orc.bloom.filter.columns'='col_C', 'orc.bloom.filter.fpp'='0.05', 'orc.stripe.size'='268435456', 'orc.row.index.stride'='10000' - mgm_data
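(For context, the table definition is roughly along these lines; the table and column names are illustrative, and partitioning is omitted:)

CREATE TABLE my_table (
  col_A STRING, col_B STRING, col_C STRING, col_D TIMESTAMP,
  lat DOUBLE, `long` DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.bloom.filter.columns'='col_C',
  'orc.bloom.filter.fpp'='0.05',
  'orc.stripe.size'='268435456',
  'orc.row.index.stride'='10000');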

1 Answer


Caveats: I say this based on documentation, intuition, experience and tribal knowledge. It is very unscientific and I have not tested any of this.

I don't think that binary is necessarily faster than strings for joining, since the limiting factor is most likely going to be I/O and serialization. Fixed-length types (such as ints and longs) typically tend to be faster because of the low cost of deserialization.

There are certain techniques you can use to improve the performance of joins, though. If it fits your use case, you should bucket and sort your tables by your join key. This will allow Hive to perform a sort-merge-bucket (SMB) map join, which tends to be faster than a reduce-side hash join. Read Hive Join Optimization for more information.
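As a rough sketch (assumed table name, key column, bucket count and settings, not your actual schema; both sides of the join need the same bucketing and sort key):

-- Bucket and sort the table on the join key, stored as ORC.
CREATE TABLE fact_bucketed (
  join_key STRING, lat DOUBLE, `long` DOUBLE
)
CLUSTERED BY (join_key) SORTED BY (join_key) INTO 64 BUCKETS
STORED AS ORC;

-- Settings that let Hive choose a sort-merge-bucket map join.
SET hive.enforce.bucketing = true;   -- needed when loading on older Hive versions
SET hive.enforce.sorting = true;
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.auto.convert.sortmerge.join = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;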

There's also some behavior with predicate push-down and outer joins that IMO everyone should be aware of.
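In short, with an outer join it matters whether a predicate sits in the ON clause or the WHERE clause (illustrative table and column names, along the lines of the Hive OuterJoinBehavior wiki):

-- 1) Filter in the ON clause: applied to t2 before the join; all t1 rows are kept.
SELECT t1.key, t2.val
FROM t1 LEFT OUTER JOIN t2
  ON t1.key = t2.key AND t2.ds = '2009-01-01';

-- 2) Same filter in the WHERE clause: applied after the join; non-matching rows
--    (where t2.ds is NULL) are dropped, so this effectively becomes an inner join.
SELECT t1.key, t2.val
FROM t1 LEFT OUTER JOIN t2
  ON t1.key = t2.key
WHERE t2.ds = '2009-01-01';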

Having said that, I'm not sure exactly how you identify "changed" records, but if you need mutations on your records, Hive Streaming Data Ingest may fit your use case better.
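The Streaming Data Ingest API itself is a Java API, but on the same ACID-table foundation a MERGE statement (Hive 2.2+ against a transactional table) is one SQL-level way to apply changed records. A rough sketch with placeholder table names:

-- target_tbl must be a transactional (ACID) table; staging_tbl holds the incoming data.
MERGE INTO target_tbl AS t
USING staging_tbl AS s
ON  t.col_A = s.col_A AND t.col_B = s.col_B
AND t.col_C = s.col_C AND t.col_D = s.col_D
WHEN MATCHED THEN
  UPDATE SET lat = s.lat, `long` = s.`long`
WHEN NOT MATCHED THEN
  INSERT VALUES (s.col_A, s.col_B, s.col_C, s.col_D, s.lat, s.`long`);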