3 votes

We have an external Hive table that is used for processing raw log file data. The files arrive hourly, and the table is partitioned by date and source host name.

At the moment we import the files using simple Python scripts that are triggered a few times per hour. Each script creates subdirectories on HDFS as needed, copies the new files from temporary local storage, and adds any new partitions to Hive.
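
For reference, a minimal sketch of what one such import step could look like (the table name raw_logs, the /data/raw_logs path and the dt/host partition columns are illustrative assumptions, not our actual setup):

    # Hypothetical single import step: create the partition directory,
    # copy the file, and register the partition in the metastore.
    import subprocess

    def import_file(local_path, dt, host):
        hdfs_dir = "/data/raw_logs/dt={0}/host={1}".format(dt, host)
        # Create the partition directory on HDFS if it is missing
        subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir])
        # Copy the new file from temporary local storage
        subprocess.check_call(["hdfs", "dfs", "-put", local_path, hdfs_dir])
        # Register the partition; IF NOT EXISTS keeps the call idempotent
        ddl = ("ALTER TABLE raw_logs ADD IF NOT EXISTS "
               "PARTITION (dt='{0}', host='{1}') "
               "LOCATION '{2}'".format(dt, host, hdfs_dir))
        subprocess.check_call(["hive", "-e", ddl])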

Today, new partitions are created using "ALTER TABLE ... ADD PARTITION ...". However, if another Hive query is running against the table it holds a (shared) lock on it, and since ADD PARTITION requires an exclusive lock, the command will fail if the query runs for long enough.
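
When this happens, the locks currently held on the table can be inspected with SHOW LOCKS; a small diagnostic sketch (the raw_logs table name is again an assumption):

    # List the locks currently held on the table while a long query runs.
    import subprocess

    subprocess.check_call(["hive", "-e", "SHOW LOCKS raw_logs"])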

An alternative to this approach would be to use "MSCK REPAIR TABLE", which for some reason does not seem to acquire any locks on the table. However, I have gotten the impression that using MSCK REPAIR TABLE is not recommended in a production setting.
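
For comparison, the repair-based alternative just copies the files into the partition directory layout and then lets the metastore discover any partition directories it does not know about yet (same assumed table as above):

    # Alternative: after copying files into the partition directory layout,
    # let Hive discover all partitions missing from the metastore.
    import subprocess

    subprocess.check_call(["hive", "-e", "MSCK REPAIR TABLE raw_logs"])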

  • What is the best practice for adding Hive partitions programmatically in a concurrent environment?
  • What are the risks or disadvantages of using MSCK REPAIR TABLE?
  • Is there an explanation for the seemingly inconsistent locking behaviour of the two partition-adding commands? That is, do they have different effects on running queries?

1 Answer

1 vote

Not a good answer, but we have the same issue, and here are our findings:

So basically, we're still thinking about our partitioning strategy, but we will probably try to create all possible partitions in advance (before the data arrives), as we know precisely the values of all future partitions (which might not be the case for you).
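
A hedged sketch of that pre-creation idea, reusing the hypothetical raw_logs table and dt/host partition columns from the question: build one ALTER TABLE ... ADD IF NOT EXISTS statement covering every expected partition spec and run it before any data arrives.

    # Pre-create all expected partitions for a given day in a single
    # statement, before the corresponding data shows up (hypothetical names).
    import subprocess

    def precreate_partitions(dt, hosts):
        specs = " ".join(
            "PARTITION (dt='{0}', host='{1}')".format(dt, h) for h in hosts
        )
        ddl = "ALTER TABLE raw_logs ADD IF NOT EXISTS " + specs
        subprocess.check_call(["hive", "-e", ddl])

    precreate_partitions("2015-06-01", ["web01", "web02", "web03"])

The partition DDL then runs at a controlled moment (for example once per day during a quiet window) instead of competing with ad-hoc queries every time a new file arrives.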