I want to apply an archive-and-purge mechanism to Hive tables, covering internal and external tables, both partitioned and non-partitioned.

I have a site_visitors table partitioned by visit_date. I want to archive the site_visitors data for users who have not visited my site in the last year. At the same time, I don't want to keep this archived data in the same table directory; it can live in some other specific location.

1 Answer

You can handle that at the partition level in the HDFS directory; below is one way to achieve it.

Your internal table/main table sits on top of HDFS, and its directory will look something like below:

hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-01
hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-02
hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-03
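For context, that layout is what Hive produces for a table partitioned by visit_date. A minimal sketch of such a DDL (the non-partition columns here are assumptions, not from the question):

CREATE TABLE schema.site_visitors (
    user_id STRING,
    last_visit_ts TIMESTAMP
)
PARTITIONED BY (visit_date STRING);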

You can create an archive table on top of HDFS, or if you just want to archive the data you can move the partitions to another location in HDFS. Either way, your archive HDFS location will look something like below.

hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-01
hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-02
hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-03
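If you go the archive-table route, declaring it EXTERNAL over that location means dropping the table later never deletes the archived files. A sketch, assuming the archive table mirrors the main table's columns (archiveSchema comes from the repair step further below; the columns are placeholders):

CREATE EXTERNAL TABLE archiveSchema.archive_table (
    user_id STRING,
    last_visit_ts TIMESTAMP
)
PARTITIONED BY (visit_date STRING)
LOCATION 'hdfs://namenode/hdfs_location/site_visitors';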

You can run a shell script, or a script in any other language used in your environment, to move the files from the main HDFS location to the archive HDFS location based on the partition dates, as in the sketch below.
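Here is a minimal sketch of that move (the paths, the example cutoff date, and CONN_URL are assumptions; adjust them to your cluster):

#!/bin/bash
# Partitions strictly older than this date get archived (example cutoff)
cutoff="2017-01-01"
SRC=hdfs://namenode/user/hive/warehouse/schema.db/site_visitors
DST=hdfs://namenode/hdfs_location/site_visitors
hdfs dfs -mkdir -p "$DST"
# Walk the partition directories under the main table and move the old ones
for dir in $(hdfs dfs -ls "$SRC" | awk '{print $NF}' | grep 'visit_date='); do
    dt=${dir##*visit_date=}
    # Lexical comparison is safe for zero-padded yyyy-MM-dd dates
    if [[ "$dt" < "$cutoff" ]]; then
        hdfs dfs -mv "$dir" "$DST/"
    fi
done
# The files are already moved, so this only removes the stale metastore entries
beeline -u ${CONN_URL} -e "ALTER TABLE schema.site_visitors DROP PARTITION (visit_date < '${cutoff}')"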

You can also take the approach below, where you load the old data into an archive table and then drop those partitions from the original table.

#!/bin/bash
# Number of days of data to keep; everything older gets archived
ARCHIVE=$1
now=$(date +%Y-%m-%d)
# archive_dt is the cutoff date derived from ARCHIVE, and it will be used for the alterations and loading below
archive_dt=$(date --date="${now} - ${ARCHIVE} day" +%Y-%m-%d)
# CONN_URL, SCHEMA and TABLE_NAME are expected to be set in the environment
# You can use hive, beeline or impala to insert the data into the archive table; I'm using beeline for my example
# (dynamic partition inserts may need hive.exec.dynamic.partition.mode=nonstrict)
beeline -u ${CONN_URL} -e "INSERT INTO TABLE ${SCHEMA}.archive_table PARTITION (visit_date) SELECT * FROM ${SCHEMA}.${TABLE_NAME} WHERE visit_date < '${archive_dt}'"
# After the data has been loaded into the archive table, drop those partitions from the original table
beeline -u ${CONN_URL} -e "ALTER TABLE ${SCHEMA}.${TABLE_NAME} DROP PARTITION (visit_date < '${archive_dt}')"
# Repair the tables to sync the metastore after the alterations
beeline -u ${CONN_URL} -e "MSCK REPAIR TABLE ${SCHEMA}.${TABLE_NAME}; MSCK REPAIR TABLE archiveSchema.archive_table"