We have an Amazon EMR Cluster (v5.19.0) where we use Presto (v0.212) over Hive (v2.3.2) to process data. When the topic is data reading and writing, it is a monster and does everything pretty fast.
On the other hand, I am pretty frustrated with the data exclusion options. There are a lot of posts regarding data access and writing on Internet, but almost nothing but basic use-cases regarding data deleting. Here are some methods which I tried to use:
Presto delete statement, which seems to fail randomly. It works fine for small tables but it starts to raise random exceptions for other tables (most of them regarding missing files which it is deleting). We are planning to update the EMR version soon to see if this issue ceases, but for while it isn't reliable (or we configured something wrong);
Hive drop partition statement. This one is surprisingly slow. For bigger tables (more than 4000 partitions), it takes minutes to drop a partition referencing an empty/deleted folder. I really don't understand how this command can be so slow;
Amazon S3 / HDFS RMDIR command. Actually we are using this one, it can remove partitions in less than a second.
The last method seems to work fine when we use Presto queries to access data. However, we noticed that the partitions still present on Hive Metastore, making Hive raise exceptions when trying to execute any query and inflating the amount of partitions on it. Since Hive is super slow to drop partitions, we don't know what to do to keep the Metastore clean and have a fast process.
On the Hive documentation there is a section regarding the MSCK REPAIR TABLE command which includes an option to remove missing partitions. Sadly when I try to run it on my terminal using the "DROP PARTITIONS" parameter, it shows an error message "FAILED: ParseException line 1:34 missing EOF at 'drop' near 'TABLENAME'". So I believe that my Hive version isn't compatible or there is a bug on it.
So, do you know a good way to delete partitions on a real system using a configuration like the mine? Please tell me how you do to delete data on your big data pipeline to see if I can find some inspiration to my problem. Also please let me know if you know some method of removing only Partition references from Hive or to list all partitions whose data was deleted. Thanks!