2
votes

What is the optimal size for an external table partition? I am planning to partition the table by year/month/day, and we are getting about 2 GB of data daily.

3
What do you want to "optimize"? Do you have a lot of small nodes, a few small nodes, a few big nodes? With local disks, network drives, or an S3 object store? Do you have any control over the size of incoming files (i.e. thousands of small files out of Flume, or a single FTP'ed file)? Do you have control over the file format (CSV, Avro, Parquet)? Do you have control over file compression (none, Snappy, GZip)? What is your default container size in MB? Do you use Tez or MapReduce? Do you prefer a few long-running Mappers or a lot of short-lived ones? Do you have to Reduce a lot? - Samson Scharfrichter
On the other hand, if you don't know what you are doing and just want a magic number that means nothing, then the Generally Accepted Meaningful Number is 42, cf. en.wikipedia.org/wiki/… - Samson Scharfrichter
@SamsonScharfrichter I meant the optimal size of the directory the partition points to. I'll try to make files between 64 and 128 MB - Igor K.

3 Answers

2
votes

Optimal table partitioning is partitioning that matches your table usage scenario. Choose the partitioning scheme based on:

  1. how the data is being queried (if you mostly need to work with daily data, then partition by date);
  2. how the data is being loaded (parallel threads should load into their own partitions, not overlapping ones).

2 GB is not too much even for a single file, though again it depends on your usage scenario. Avoid unnecessarily complex and redundant partitioning like (year, month, date): in this case, date alone is enough for partition pruning.
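For illustration, here is a minimal sketch of the single-key approach. Table, column, and path names are hypothetical, and the DATE-typed partition column assumes a reasonably recent Hive (0.12 or later):

```sql
-- Single date key instead of redundant (year INT, month INT, day INT)
CREATE EXTERNAL TABLE events (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (event_date DATE)
LOCATION '/data/events';

-- Partition pruning works directly on the single key
SELECT COUNT(*) FROM events WHERE event_date = '2016-01-15';
```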

2
votes

Hive partition definitions are stored in the metastore, so too many partitions will take up a lot of space there.

Partitions are stored as directories in HDFS, so multiple partition keys produce nested directory hierarchies, which makes scanning them slower.

Your query will be executed as a MapReduce job, so it's pointless to create overly tiny partitions.

It depends on the case; think about how your data will be queried. For your case I would prefer a single key formatted as 'yyyymmdd': that gives 365 partitions per year, only one directory level in the table directory, and about 2 GB of data per partition, which is a nice size for a MapReduce job.
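A minimal sketch of the 'yyyymmdd' string key described above (table and path names are hypothetical):

```sql
-- Single string-typed partition key, one directory level
CREATE EXTERNAL TABLE daily_data (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)   -- e.g. dt='20160115'
LOCATION '/data/daily_data';

-- Register one day's ~2 GB directory as a partition
ALTER TABLE daily_data ADD PARTITION (dt='20160115')
  LOCATION '/data/daily_data/20160115';

-- Pruned to a single partition
SELECT COUNT(*) FROM daily_data WHERE dt = '20160115';
```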

For completeness: if you use Hive < 0.12, make your partition key string-typed, see here.

Useful blog here.

0
votes

Hive partitioning is most effective when the data is sparse. By sparse I mean that the data has natural divisions, such as year, month, or day.

In your case, partitioning by date doesn't make much sense, as each day will hold only 2 GB of data, which is not too big to handle. Partitioning by week or month makes more sense, as it will optimize query time without creating too many small partitions.
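A minimal sketch of the coarser monthly key this answer suggests; the names are illustrative, not a definitive implementation:

```sql
-- Monthly partition key: ~60 GB per partition at 2 GB/day
CREATE EXTERNAL TABLE monthly_data (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (ym STRING)   -- e.g. ym='201601'
LOCATION '/data/monthly_data';

-- Pruned to one month's directory
SELECT COUNT(*) FROM monthly_data WHERE ym = '201601';
```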