I know that MSCK REPAIR TABLE updates the metastore with the current partitions of an external table.
To do that, you only need to do ls on the root folder of the table (given the table is partitioned by only one column) and get all its partitions, clearly a < 1s operation.
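The single-level listing I have in mind can be sketched as follows. This is only an illustration of my assumption, not what Hive or Athena actually does; the function name `partitions_from_keys`, the `warehouse/events` prefix, and the `dt=` partition column are all hypothetical, and the keys are hard-coded instead of coming from a real S3 listing.

```python
def partitions_from_keys(keys, table_prefix):
    """Return the sorted, distinct first-level partition directories under table_prefix."""
    prefix = table_prefix.rstrip("/") + "/"
    found = set()
    for key in keys:
        if not key.startswith(prefix):
            continue  # object belongs to some other table
        first, sep, _rest = key[len(prefix):].partition("/")
        # Keep only Hive-style directories such as dt=2020-01-01;
        # objects sitting directly in the table root have no separator.
        if sep and "=" in first:
            found.add(first)
    return sorted(found)


# Hypothetical object keys standing in for the result of listing the table root.
example_keys = [
    "warehouse/events/dt=2020-01-01/part-0000.parquet",
    "warehouse/events/dt=2020-01-01/part-0001.parquet",
    "warehouse/events/dt=2020-01-02/part-0000.parquet",
    "warehouse/events/_SUCCESS",
    "warehouse/other/dt=2020-01-03/part-0000.parquet",
]

print(partitions_from_keys(example_keys, "warehouse/events"))
# ['dt=2020-01-01', 'dt=2020-01-02']
```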
But in practice, the operation can take a very long time to execute (or even time out if run on AWS Athena).
So my question is, what does MSCK REPAIR TABLE actually do behind the scenes and why?
How does MSCK REPAIR TABLE find the partitions?
Additional data in case it's relevant:
Our data is all on S3, and the operation is slow both on EMR (Hive) and on Athena (Presto). There are ~450 partitions in the table, each partition has on average 90 files totaling 3 gigabytes, and the files are in Apache Parquet format.
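For scale, here is a rough count derived from those numbers, assuming the averages hold for every partition (the variable names are just for illustration):

```python
partitions = 450          # approximate number of partitions in the table
files_per_partition = 90  # average number of files per partition
gb_per_partition = 3      # average data size per partition, in gigabytes

total_files = partitions * files_per_partition
total_gb = partitions * gb_per_partition

print(total_files)  # 40500
print(total_gb)     # 1350
```

So the table holds roughly 40,500 objects and about 1.35 TB in total, while the number of partition directories directly under the root is only ~450.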
ALTER TABLE RECOVER PARTITIONS. Is it just an alias for MSCK or does it do less work? – Piotr Findeisen