Summary of the issue:
Whenever I insert data into a dynamically partitioned table, far too much time is being spent updating the partition statistics in the metastore.
More details:
I have several queries that select data from one hive table and insert it into another table that is dynamically partitioned into about 8000 partitions. The queries complete quickly and correctly. The output files are copied into the partition directories very quickly. But then this happens for every partition:
INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(253)) - ugi=hive ip=unknown-ip-addr cmd=append_partition : db=default tbl=some_table[14463,1410]
WARN hive.log (MetaStoreUtils.java:updatePartitionStatsFast(284)) - Updating partition stats fast for: some_table
WARN hive.log (MetaStoreUtils.java:updatePartitionStatsFast(292)) - Updated size to 1042
Each such partition update is taking about 500 milliseconds. But Hive puts an exclusive lock on the entire table while these updates are happening, and with 8000 such partitions this means that my table is locked for an unacceptably long time.
It seems to me that there must be some way to disable these partition statistics without affecting the performance of Hive too terribly; after all, I could just manually copy files to these partitions without involving Hive at all.
I've tried settings some of the "hive.stats" settings, but there's very little documentation on these settings so I don't know exactly what they're supposed to do. Specifically, I've tried setting:
set hive.stats.autogather=false;
set hive.stats.collect.rawdatasize=false;
Any suggestions on how to prevent Hive from trying to keep track of partition statistics would be greatly appreciated!