Getting job statistics for Hive inserts

Question

While using Hive 0.10 with Cloudera CDH 4.x, it was always possible to see how many rows were inserted into a particular table by reading the command output. The line looked something like:

Loaded 1234 rows into tablename

Although not ideal (there is no programmatic interface to the query manager), it was a reasonable indication of the amount of data inserted. However, in Hive 0.13 with Cloudera CDH 5.1 that line no longer appears in the command output. I also cannot figure out how to get the inserted row count from the query manager.

How can I find out how many rows were inserted into a given table by a given query? I wondered whether accessing the Hadoop counters might do it, but I can't find any information about how Hive uses them. There doesn't appear to be anything in the Thrift interface that would allow access to these statistics.

Basically I don't want to issue a SELECT COUNT(*) against my source data just to find out how many rows are/were processed.

Slava Markeyev · Accepted Answer · 2014-12-04T08:25:47

I'm trying to figure this out myself right now. Presumably the job counters were refactored as part of HIVE-4518. This looks like a regression in functionality: the code to get and display the row counts still exists, but it never prints anything because there are no counters to read the number from.

One option is to turn on hive.stats.autogather, which makes Hive collect statistics as part of the insert, but they may or may not include the row count depending on your query.
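As a rough sketch of what that could look like in a Hive session (target_table and source_table are placeholder names, and whether a row count shows up at all depends on the query and Hive version):

```sql
-- Sketch only: target_table and source_table are hypothetical names.
-- Ask Hive to gather table statistics as part of the insert.
SET hive.stats.autogather=true;

INSERT OVERWRITE TABLE target_table
SELECT * FROM source_table;

-- If statistics were gathered, the "Table Parameters" section of the
-- output may contain a numRows entry for the table.
DESCRIBE FORMATTED target_table;
```

Note that numRows, when present, describes the table as a whole rather than a single query, so for an append (INSERT INTO) you would have to compare the value before and after the insert.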

Edit: filed ticket HIVE-9023 describing the bug.
