
We have a very large Hadoop dataset containing more than a decade of historical transaction data - 6.5B rows and counting. We have partitioned it on year and month.

Performance is poor for a number of reasons. Nearly all of our queries can also be qualified by customer_id, but we have 500 customers and that number is growing quickly. If we narrow a query to a given month, we still need to scan every record in that partition just to find the rows for one customer. The data is stored as Parquet now, so the main performance issue is not reading the full contents of every record.
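
For illustration, a typical query against our current layout might look like this (table and column names here are hypothetical):

    -- Pruning on year/month limits the scan to one partition directory,
    -- but every Parquet file in it must still be read to filter on customer_id.
    SELECT txn_id, amount
    FROM transactions
    WHERE year = 2016 AND month = 11
      AND customer_id = 12345;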

We hesitated to add a partition on customer because with 120 year-month partitions and 500 customers in each, we would have 60K partitions - more than the Hive metastore can handle effectively. We also hesitated to partition only on customer_id because some customers are huge and others tiny, giving us a natural data skew.
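
A sketch of the two layouts we weighed (hypothetical names; the arithmetic is the point):

    -- Current layout: ~120 partitions after ten years of data.
    CREATE TABLE transactions (
      txn_id      BIGINT,
      customer_id INT,
      amount      DECIMAL(12,2),
      txn_ts      TIMESTAMP
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET;

    -- Rejected layout: 120 year/month partitions x 500 customers = ~60K
    -- partitions, which we feared the metastore could not handle.
    -- PARTITIONED BY (year INT, month INT, customer_id INT)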

Ideally, we would partition historical data, which is used far less frequently, by one rule (perhaps year + customer_id), and current data by another (like year/month + customer_id). We have considered using multiple datasets, but managing that over time seems like more work, more change management, and so on.
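
To make the idea concrete, the multiple-dataset approach might look roughly like this, with a view hiding the split from queries (a sketch only; all names are hypothetical):

    -- Historical data: coarser partitioning, queried less often.
    CREATE TABLE txn_archive (txn_id BIGINT, amount DECIMAL(12,2), month INT)
    PARTITIONED BY (year INT, customer_id INT)
    STORED AS PARQUET;

    -- Current data: finer partitioning for the hot queries.
    CREATE TABLE txn_current (txn_id BIGINT, amount DECIMAL(12,2))
    PARTITIONED BY (year INT, month INT, customer_id INT)
    STORED AS PARQUET;

    -- One view so readers never need to know which table holds the rows.
    CREATE VIEW transactions_v AS
      SELECT txn_id, amount, year, month, customer_id FROM txn_archive
      UNION ALL
      SELECT txn_id, amount, year, month, customer_id FROM txn_current;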

Are there strategies or capabilities of Hive that provide a way to handle a case like this, where we "want" lots of partitions for performance but are limited by the metastore?

I am also confused about the benefit of bucketing. Suitable bucketing based on customer_id, for example, would seem to help in much the same way as partitioning. Yet Hortonworks "strongly recommends against" buckets (with no explanation why). Several other pages suggest bucketing is useful for sampling. Another good discussion of bucketing from Hortonworks indicates that Hive cannot do pruning with buckets the way it can with partitions.
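
For reference, the bucketed variant I have in mind would look something like this (a sketch, not tested DDL):

    -- Keep the year/month partitions, but hash each partition's rows
    -- into a fixed number of buckets on customer_id.
    CREATE TABLE transactions_bucketed (
      txn_id      BIGINT,
      customer_id INT,
      amount      DECIMAL(12,2)
    )
    PARTITIONED BY (year INT, month INT)
    CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 64 BUCKETS
    STORED AS PARQUET;

    -- Older Hive versions also require this for bucketed inserts:
    -- SET hive.enforce.bucketing = true;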

We're on a recent version of Hive/Hadoop (moving from CDH 5.7 to AWS EMR).

My 2 cents: use a columnar format which stores min/max counters for every column in every stripe (allowing for Skip Scans, especially if your data is sorted judiciously on INSERT), such as ORC. Tweak the Hive config cf. slideshare.net/Hadoop_Summit/… pp. 12-21. Then consolidate the partitions. And don't store the columnar files on S3 (it kills random access, and hence the Skip Scan feature)... - Samson Scharfrichter
It seems that you outline a solution - why not keep 2 tables partitioned differently? - Alex Libov
My 2 cents, continued - about ORC tuning slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data (and BTW Parquet is not a bad choice, either...) - Samson Scharfrichter
Thanks Samson and @alex-libov. I have clarified the question. We have indeed considered adding a second table partitioned differently, but this creates a lot of management overhead (archiving old data periodically, managing access by a view, etc.) - Tom Harrison
You can try Hive indexes: create an index on the customer_id column, keep the partitioning as-is (year and month), and see how it goes. - kkmishra
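
For anyone trying kkmishra's suggestion, the (pre-3.0) Hive index DDL looks roughly like this; note that Hive removed indexes in 3.0, and the names below are hypothetical:

    CREATE INDEX idx_customer
    ON TABLE transactions (customer_id)
    AS 'COMPACT'
    WITH DEFERRED REBUILD;

    -- Indexes are not maintained automatically; rebuild after each load:
    ALTER INDEX idx_customer ON transactions REBUILD;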

1 Answer


In reality, 60K partitions is not a big problem for Hive. I have experience with about 2MM partitions for one Hive table, and it works pretty fast. You can find some details at https://andr83.io/1123. Of course, you need to write queries carefully. I can also recommend using the ORC format with index and bloom filter support.
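
For example, the ORC setup might look like this (a sketch; the TBLPROPERTIES keys are standard ORC properties, everything else is hypothetical):

    CREATE TABLE transactions_orc (
      txn_id      BIGINT,
      customer_id INT,
      amount      DECIMAL(12,2)
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS ORC
    TBLPROPERTIES (
      'orc.create.index' = 'true',                -- min/max row-group indexes
      'orc.bloom.filter.columns' = 'customer_id', -- bloom filter on the hot filter column
      'orc.bloom.filter.fpp' = '0.05'
    );

    -- Sorting on customer_id at write time makes the min/max indexes
    -- selective, so most stripes can be skipped at read time.
    -- Dynamic partitioning needs: SET hive.exec.dynamic.partition.mode = nonstrict;
    INSERT OVERWRITE TABLE transactions_orc PARTITION (year, month)
    SELECT txn_id, customer_id, amount, year, month
    FROM transactions
    SORT BY customer_id;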