What does MSCK REPAIR TABLE do behind the scenes and why it's so slow?

Question

I know that MSCK REPAIR TABLE updates the metastore with the current partitions of an external table.

To do that, you only need to do ls on the root folder of the table (given the table is partitioned by only one column), and get all its partitions, clearly a < 1s operation.

But in practice, the operation can take a very long time to execute (or even timeout if ran on AWS Athena).

So my question is, what does MSCK REPAIR TABLE actually do behind the scenes and why?

How does MSCK REPAIR TABLE find the partitions?

Additional data in case it's relevant:

_{Our data is all on S3, it's both slow when running on EMR (Hive) or Athena (Presto), there are ~450 partitions in the table, every partition has on avg 90 files, overall 3 Gigabytes for a partition, files are in Apache parquet format}

cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL mentions ALTER TABLE RECOVER PARTITIONS. Is it just an alias for MSCK or does it do less work? — Piotr Findeisen
@PiotrFindeisen Seems to be just the equivalent command for EMR. — gdoron is supporting Monica
For all I know, it lists all the partition files and gather some metadata about them. If you have 450 partitions and 90 files per partition, it might be doing 40500 calls to s3 to get every file size individually. I'm not sure if it does more than that, but if it does, it probably runs some statistical analysis on the files as well. If it's the case, you can try with this option: SET hive.stats.autogather=false; How long exactly does it take? Are we talking a around few minutes or a few hours? A few minutes wouldn't shock me. — FurryMachine

Vihang Vihang · Accepted Answer · 2018-12-10T23:28:15

You are right in the sense it reads the directory structure, creates partitions out of it and then updates the hive metastore. In fact more recently, the command was improved to remove non-existing partitions from metastore as well. The example that you are giving is very simple since it has only one level of partition keys. Consider table with multiple partition keys (2-3 partition keys is common in practice). msck repair will have to do a full-tree traversal of all the sub-directories under the table directory, parse the file names, make sure that the file names are valid, check if the partition is already existing in the metastore and then add the only partitions which are not present in the metastore. Note that each listing on the filesystem is a RPC to the namenode (in case of HDFS) or a web-service call in case of S3 or ADLS which can add to significant amount of time. Additionally, in order to figure out if the partition is already present in metastore or not, it needs to do a full listing of all the partitions which metastore knows of for the table. Both these steps can potentially increase the time taken for the command on large tables. The performance of msck repair table was improved considerably recently Hive 2.3.0 (see HIVE-15879 for more details). You may want to tune hive.metastore.fshandler.threads and hive.metastore.batch.retrieve.max to improve the performance of command.

What does MSCK REPAIR TABLE do behind the scenes and why it's so slow?

1 Answers