0
votes

I have an existing bucketed table that has YEAR, MONTH, DAY partitioning, but I want to add additional partitioning by INGESTION_KEY, a column that doesn't exist in the existing table. This is to accommodate future table inserts so that I don't have to OVERWRITE a YEAR, MONTH, DAY partition every time I ingest data for that date; I can just do a simple INSERT INTO and create a new INGESTION_KEY partition.

I need a year's worth of data in my new table to start, so I want to copy a year of partitions from my existing table to a new table. Rather than doing a Hive INSERT for each partition, I thought it would be quicker to use distcp to copy files into the new table's partition directories in the Hive warehouse directory in HDFS, then ADD PARTITION to the new table.

So, this is all I'm doing:

hadoop distcp /apps/hive/warehouse/src_db.db/src_tbl/year=2017/month=02/day=06 /apps/hive/warehouse/dest_db.db/dest_tbl/year=2017/month=02/day=06/ingestion_key=123

hive -e "ALTER TABLE dest_tbl ADD PARTITION (year=2017,month=02,day=06,ingestion_key='123')"

Both are managed tables, the new table dest_tbl is clustered by the same column into the same number of buckets as the src_tbl, and the only difference in schema is the addition of INGESTION_KEY.

So far my SELECT * FROM dest_tbl shows everything in the new table looking normal. So my question is: is there anything wrong with this approach? Is it bad to INSERT to a managed, bucketed table this way, or is this an acceptable alternative to INSERT if no transformations are being done on the copied data?

Thanks!!

1
You should not think in terms of INSERT, rather in terms of backup/restore operations. The Hive metastore does not keep trace of the data files, only of the directory structure; so as long as the file structure match the metadata, it's fair. - Samson Scharfrichter
BTW it's also a convenient way to sample Production data and ship it to a Test or QA cluster... - Samson Scharfrichter

1 Answers

3
votes

Although i prefer copying by Hive query just to make it all in Hive, but it's ok to copy data files using other tools, but ..

  • There is a dedicated command that add the new partitions metadata, you can use it in place of alter table add partition.., and it can add many partitions at once :

    MSCK REPAIR TABLE dest_tbl;
    
  • Keep using Hive default partitioning format : partionKey=partitionValue