2
votes

What is the optimal size for an external table partition? I am planning to partition the table by year/month/day, and we are getting about 2 GB of data daily.

3
What do you want to "optimize"? Do you have a lot of small nodes, a few small nodes, a few big nodes? With local disks, network drives, or an S3 object store? Do you have any control over the size of incoming files (i.e. thousands of small files out of Flume, or a single FTP'ed file)? Do you have control over the file format (CSV, Avro, Parquet)? Do you have control over file compression (none, Snappy, GZip)? What is your default container size in MB? Do you use Tez or MapReduce? Do you prefer a few long-running Mappers or a lot of short-lived ones? Do you have to Reduce a lot? - Samson Scharfrichter
On the other hand, if you don't know what you are doing and just want a magic number that means nothing, then the Generally Accepted Meaningful Number is 42, cf. en.wikipedia.org/wiki/… - Samson Scharfrichter
@SamsonScharfrichter I meant the optimal size of the directory the partition points to. I'll try to make files between 64 and 128 MB - Igor K.

3 Answers

2
votes

Optimal table partitioning is partitioning that matches your table usage scenario. Choose the partitioning scheme based on:

  1. how the data is being queried (if you mostly need to work with daily data, then partition by date);
  2. how the data is being loaded (parallel threads should load into their own partitions, not overlapping ones).

2 GB is not too much even for a single file, though again it depends on your usage scenario. Avoid unnecessarily complex and redundant partitioning like (year, month, date): in this case, date alone is enough for partition pruning.
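For illustration, here is a minimal sketch of the single-key approach. Table, column, and path names are hypothetical, and the DATE-typed partition column assumes a reasonably recent Hive (0.12 or later):

```sql
-- Single date key instead of redundant (year INT, month INT, day INT)
CREATE EXTERNAL TABLE events (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (event_date DATE)
LOCATION '/data/events';

-- Partition pruning works directly on the single key
SELECT COUNT(*) FROM events WHERE event_date = '2016-01-15';
```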

2
votes

Hive partition definitions are stored in the metastore, so too many partitions will take up a lot of space there.

Partitions are stored as directories in HDFS, so multiple partition keys produce nested directory hierarchies, which makes scanning them slower.

Your query will be executed as a MapReduce job, so it's pointless to create overly tiny partitions.

It depends on the case; think about how your data will be queried. For your case I would prefer a single key formatted as 'yyyymmdd': that gives 365 partitions per year, only one directory level in the table directory, and about 2 GB of data per partition, which is a nice size for a MapReduce job.
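A minimal sketch of the 'yyyymmdd' string key described above (table and path names are hypothetical):

```sql
-- Single string-typed partition key, one directory level
CREATE EXTERNAL TABLE daily_data (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)   -- e.g. dt='20160115'
LOCATION '/data/daily_data';

-- Register one day's ~2 GB directory as a partition
ALTER TABLE daily_data ADD PARTITION (dt='20160115')
  LOCATION '/data/daily_data/20160115';

-- Pruned to a single partition
SELECT COUNT(*) FROM daily_data WHERE dt = '20160115';
```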

For completeness: if you use Hive < 0.12, make your partition key string-typed, see here.

Useful blog here.

0
votes

Hive partitioning is most effective when the data is sparse. By sparse I mean that the data has natural divisions, such as year, month, or day.

In your case, partitioning by date doesn't make much sense, as each day will hold only 2 GB of data, which is not too big to handle. Partitioning by week or month makes more sense, as it will optimize query time without creating too many small partitions.
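A minimal sketch of the coarser monthly key this answer suggests; the names are illustrative, not a definitive implementation:

```sql
-- Monthly partition key: ~60 GB per partition at 2 GB/day
CREATE EXTERNAL TABLE monthly_data (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (ym STRING)   -- e.g. ym='201601'
LOCATION '/data/monthly_data';

-- Pruned to one month's directory
SELECT COUNT(*) FROM monthly_data WHERE ym = '201601';
```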