0
votes

Summary of the issue:

Whenever I insert data into a dynamically partitioned table, far too much time is being spent updating the partition statistics in the metastore.

More details:

I have several queries that select data from one hive table and insert it into another table that is dynamically partitioned into about 8000 partitions. The queries complete quickly and correctly. The output files are copied into the partition directories very quickly. But then this happens for every partition:

INFO  HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(253)) - ugi=hive    ip=unknown-ip-addr  cmd=append_partition : db=default tbl=some_table[14463,1410]    
WARN  hive.log (MetaStoreUtils.java:updatePartitionStatsFast(284)) - Updating partition stats fast for: some_table
WARN  hive.log (MetaStoreUtils.java:updatePartitionStatsFast(292)) - Updated size to 1042

Each such partition update is taking about 500 milliseconds. But Hive puts an exclusive lock on the entire table while these updates are happening, and with 8000 such partitions this means that my table is locked for an unacceptably long time.

It seems to me that there must be some way to disable these partition statistics without affecting the performance of Hive too terribly; after all, I could just manually copy files to these partitions without involving Hive at all.

I've tried settings some of the "hive.stats" settings, but there's very little documentation on these settings so I don't know exactly what they're supposed to do. Specifically, I've tried setting:

set hive.stats.autogather=false;
set hive.stats.collect.rawdatasize=false;

Any suggestions on how to prevent Hive from trying to keep track of partition statistics would be greatly appreciated!

2
Is it really necessary to have 8000 partitions...seems like a lot of partitions, especially if you are able to query all of that data quickly. - Jordan Young
It's only the insert query that runs quickly. Each partition contains hundreds of gigabytes of data, and queries against this table will always be interested in the data in exactly one of these partitions. - Jason Rosendale

2 Answers

1
votes

Using set hive.stats.autogather=false will not take effect within the application. The reason is that when the Hive connection is created, it configures the hive configs to the metastore and once it is configured, it cannot be modified anymore.

You can disable the statistics in two ways:

1. Via the Hive shell

Using the Hive shell, type hive --hiveconf hive.stats.autogather=false.

2. Updating hive-site.xml

Update the following in hive-site.xml and restart the Hive session.

<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>
0
votes

https://cwiki.apache.org/confluence/display/Hive/StatsDev

According to the Hive documentation, this should be able to disable the statistics on partitions:

set hive.stats.autogather=false;