I'm writing a custom InputFormat (specifically, a subclass of org.apache.hadoop.mapred.FileInputFormat), OutputFormat, and SerDe for use with binary files to be read through Apache Hive. Not all records within the binary files have the same size.
I'm finding that Hive's default InputFormat, CombineHiveInputFormat, does not delegate getSplits to my custom InputFormat's implementation, so all input files are split on regular 128MB boundaries. The problem is that such a boundary may fall in the middle of a record, so every split after the first is very likely to appear to contain corrupt data.
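For concreteness, here is a minimal sketch of the kind of override involved. The class name, the key/value types, and the one-split-per-file policy are illustrative rather than my actual implementation; returning whole files is simply the easiest way to guarantee that no split starts mid-record:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MyBinaryInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

        // Records are variable-length, so a file cannot be cut at an
        // arbitrary byte offset without risking a cut mid-record.
        @Override
        protected boolean isSplitable(FileSystem fs, Path filename) {
            return false;
        }

        // One split per file (what the default getSplits would also do once
        // isSplitable returns false, made explicit here). A smarter version
        // could consult a record index and cut splits on record boundaries.
        @Override
        public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
            FileStatus[] files = listStatus(job);
            InputSplit[] splits = new InputSplit[files.length];
            for (int i = 0; i < files.length; i++) {
                splits[i] = new FileSplit(files[i].getPath(), 0, files[i].getLen(), (String[]) null);
            }
            return splits;
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            // The variable-length record reader itself is elided for brevity.
            throw new UnsupportedOperationException("record reader elided");
        }
    }

Under HiveInputFormat this getSplits is called as expected; under CombineHiveInputFormat it is bypassed, which is exactly the problem.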
I've already found a few workarounds, but I'm not pleased with any of them.
One workaround is to do:
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
When using HiveInputFormat instead of CombineHiveInputFormat, the call to getSplits is correctly delegated to my InputFormat and all is well. However, I want to make my InputFormat, OutputFormat, etc. easily available to other users, so I'd prefer not to require this extra configuration step. Additionally, I'd like to be able to take advantage of combining splits if possible.
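For anyone who does accept this workaround, the setting need not be typed per session; hive.input.format is an ordinary Hive configuration property, so it can be made the site-wide default in hive-site.xml:

    <property>
        <name>hive.input.format</name>
        <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value>
    </property>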
Another workaround is to create a StorageHandler, as sketched below. However, I'd prefer not to do this, since any table backed by a StorageHandler is non-native: all reducers write out to a single file, LOAD DATA into the table is impossible, and other niceties of native tables are lost.
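For reference, the StorageHandler route is roughly the following sketch, extending Hive's DefaultStorageHandler; MyBinaryInputFormat, MyBinaryOutputFormat, and MyBinarySerDe are placeholder names for the custom classes:

    import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
    import org.apache.hadoop.hive.serde2.SerDe;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.OutputFormat;

    public class MyBinaryStorageHandler extends DefaultStorageHandler {

        @Override
        public Class<? extends InputFormat> getInputFormatClass() {
            return MyBinaryInputFormat.class;
        }

        @Override
        public Class<? extends OutputFormat> getOutputFormatClass() {
            return MyBinaryOutputFormat.class; // placeholder OutputFormat
        }

        @Override
        public Class<? extends SerDe> getSerDeClass() {
            return MyBinarySerDe.class; // placeholder SerDe
        }
    }

A table would then be created with STORED BY 'com.example.MyBinaryStorageHandler', and it is precisely the use of STORED BY that makes the table non-native.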
Finally, I could have my InputFormat implement CombineHiveInputFormat.AvoidSplitCombination to bypass most of CombineHiveInputFormat (see the sketch below), but this interface is only available starting with Hive 1.0, and I'd like my code to work with earlier versions of Hive (at least back to 0.12).
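For completeness, if I read the Hive 1.0 API correctly, that interface exposes a single shouldSkipCombine(Path, Configuration) method, so the implementation would be roughly this (reusing the MyBinaryInputFormat sketch above):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

    public class MyCombineAvoidingInputFormat extends MyBinaryInputFormat
            implements CombineHiveInputFormat.AvoidSplitCombination {

        // Returning true tells CombineHiveInputFormat to leave these files
        // out of split combination and delegate to this class's getSplits.
        @Override
        public boolean shouldSkipCombine(Path path, Configuration conf) throws IOException {
            return true;
        }
    }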
I filed a ticket in the Hive bug tracker here, in case this behavior is unintentional: https://issues.apache.org/jira/browse/HIVE-9771
Has anyone written a custom FileInputFormat that overrides getSplits for use with Hive? If so, did you run into any trouble getting Hive to delegate the call to getSplits, and how did you overcome it?