
We have an Amazon EMR cluster (v5.19.0) where we use Presto (v0.212) on top of Hive (v2.3.2) to process data. When it comes to reading and writing data, it is a monster and does everything pretty fast.

On the other hand, I am pretty frustrated with the options for deleting data. There are a lot of posts on the Internet about reading and writing data, but almost nothing beyond basic use cases when it comes to deleting it. Here are some methods I have tried:

  • Presto's DELETE statement, which seems to fail randomly. It works fine for small tables, but for other tables it starts to raise random exceptions (most of them about missing files, i.e. the very files it is deleting). We are planning to upgrade the EMR version soon to see whether this issue goes away, but for now it isn't reliable (or we have configured something wrong);

  • Hive's DROP PARTITION statement. This one is surprisingly slow. For bigger tables (more than 4,000 partitions) it takes minutes to drop a partition that references an empty/deleted folder. I really don't understand how this command can be so slow;

  • Removing the directory directly on Amazon S3 / HDFS. This is the one we actually use; it can remove a partition's data in less than a second.

The last method seems to work fine when we use Presto queries to access the data. However, we noticed that the partitions are still present in the Hive Metastore, which makes Hive raise exceptions when executing any query and inflates the number of partitions registered there. Since Hive is super slow at dropping partitions, we don't know how to keep the Metastore clean while keeping the process fast.
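For reference, the only way I found to see which partitions the Metastore still believes exist is SHOW PARTITIONS, whose output can then be diffed against the folders actually present on S3/HDFS (table name below is a placeholder):

```sql
-- Lists every partition registered in the Metastore, whether or not
-- its backing directory still exists on S3/HDFS
SHOW PARTITIONS mytable;
```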

In the Hive documentation there is a section about the MSCK REPAIR TABLE command, which includes an option to remove missing partitions. Sadly, when I try to run it in my terminal with the DROP PARTITIONS parameter, it fails with the error message "FAILED: ParseException line 1:34 missing EOF at 'drop' near 'TABLENAME'". So I believe that my Hive version doesn't support it, or there is a bug in it.
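For context, this is the statement I was trying to run, as described in the documentation (a sketch; the table name is a placeholder, and the option is only accepted by Hive releases that support it):

```sql
-- Deregisters Metastore partitions whose directories no longer exist;
-- older Hive versions only accept plain MSCK REPAIR TABLE
MSCK REPAIR TABLE mytable DROP PARTITIONS;
```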

So, do you know a good way to delete partitions on a real system with a configuration like mine? Please tell me how you delete data in your big-data pipeline, so I can find some inspiration for my problem. Also, please let me know if you know a way to remove only the partition references from Hive, or to list all partitions whose data has been deleted. Thanks!


3 Answers

1
votes

As you observed, if you drop partition data (files & directories) on S3 or HDFS, the partitions still need to be deregistered from the Hive metastore.

The Hive way of syncing storage's state with metastore's state is MSCK REPAIR TABLE.

The Presto way of syncing storage's state with the metastore's state is the system.sync_partition_metadata procedure of the Presto Hive connector.
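A sketch of the call (schema and table names are placeholders; the procedure is only available in newer Presto releases):

```sql
-- Mode 'DROP' deregisters metastore partitions whose storage
-- directories are gone; 'ADD' and 'FULL' are the other modes
CALL system.sync_partition_metadata('my_schema', 'my_table', 'DROP');
```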

0
votes

Try ALTER TABLE table_name RECOVER PARTITIONS; instead of the MSCK REPAIR TABLE command. It should work fine on AWS.

0
votes

I'm including here more details on how I solved this issue. Note that, if possible, you should avoid this workaround and use the delete facilities of your data processing tool instead.

  • First, try the Hive ALTER TABLE table_name DROP PARTITION(... statement if possible;
  • If it is too slow, use the partition key fields to build the partition folder's path and remove it with a command like aws s3 rm or hadoop fs -rm;
  • Finally, in Hive, temporarily turn your table into an EXTERNAL table, just so that dropping the partition cleans its metadata faster, since in the previous step you already deleted its data:
ALTER TABLE tablename SET TBLPROPERTIES('EXTERNAL'='TRUE');
ALTER TABLE tablename DROP PARTITION(...
ALTER TABLE tablename SET TBLPROPERTIES('EXTERNAL'='FALSE');
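As a concrete illustration of the sequence above (table name and partition keys are hypothetical):

```sql
-- Mark the table external so DROP PARTITION only touches metadata
ALTER TABLE mytable SET TBLPROPERTIES('EXTERNAL'='TRUE');
-- Deregister the partition; its files were already removed on S3/HDFS
ALTER TABLE mytable DROP IF EXISTS PARTITION (year='2018', month='10');
-- Restore the original managed-table behaviour
ALTER TABLE mytable SET TBLPROPERTIES('EXTERNAL'='FALSE');
```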

Also check Piotr's answer to this question for a nice way of deleting partitions if you are using a more recent Presto.