I'm writing a custom InputFormat (specifically, a subclass of org.apache.hadoop.mapred.FileInputFormat), OutputFormat, and SerDe for use with binary files to be read through Apache Hive. Not all records within the binary files have the same size.
I'm finding that Hive's default InputFormat, CombineHiveInputFormat, does not delegate getSplits to my custom InputFormat's implementation, so all input files are split on regular 128MB boundaries. The problem is that such a boundary may fall in the middle of a record, so every split after the first is very likely to appear to contain corrupt data.
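For concreteness, here is a minimal sketch of the kind of override involved. The class name, the key/value types, and the one-split-per-file policy are illustrative rather than my actual implementation; returning whole files is simply the easiest way to guarantee that no split starts mid-record:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MyBinaryInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

        // Records are variable-length, so a file cannot be cut at an
        // arbitrary byte offset without risking a cut mid-record.
        @Override
        protected boolean isSplitable(FileSystem fs, Path filename) {
            return false;
        }

        // One split per file (what the default getSplits would also do once
        // isSplitable returns false, made explicit here). A smarter version
        // could consult a record index and cut splits on record boundaries.
        @Override
        public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
            FileStatus[] files = listStatus(job);
            InputSplit[] splits = new InputSplit[files.length];
            for (int i = 0; i < files.length; i++) {
                splits[i] = new FileSplit(files[i].getPath(), 0, files[i].getLen(), (String[]) null);
            }
            return splits;
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> getRecordReader(
                InputSplit split, JobConf job, Reporter reporter) throws IOException {
            // The variable-length record reader itself is elided for brevity.
            throw new UnsupportedOperationException("record reader elided");
        }
    }

Under HiveInputFormat this getSplits is called as expected; under CombineHiveInputFormat it is bypassed, which is exactly the problem.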
I've already found a few workarounds, but I'm not pleased with any of them.
One workaround is to do:
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
When using HiveInputFormat instead of CombineHiveInputFormat, the call to getSplits is correctly delegated to my InputFormat and all is well. However, I want to make my InputFormat, OutputFormat, etc. easily available to other users, so I'd prefer not to require this extra configuration step. Additionally, I'd like to be able to take advantage of combining splits if possible.
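For anyone who does accept this workaround, the setting need not be typed per session; hive.input.format is an ordinary Hive configuration property, so it can be made the site-wide default in hive-site.xml:

    <property>
        <name>hive.input.format</name>
        <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value>
    </property>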
Another workaround is to create a StorageHandler, as sketched below. However, I'd prefer not to do this, since any table backed by a StorageHandler is non-native: all reducers write out to a single file, LOAD DATA into the table is impossible, and other niceties of native tables are lost.
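For reference, the StorageHandler route is roughly the following sketch, extending Hive's DefaultStorageHandler; MyBinaryInputFormat, MyBinaryOutputFormat, and MyBinarySerDe are placeholder names for the custom classes:

    import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
    import org.apache.hadoop.hive.serde2.SerDe;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.OutputFormat;

    public class MyBinaryStorageHandler extends DefaultStorageHandler {

        @Override
        public Class<? extends InputFormat> getInputFormatClass() {
            return MyBinaryInputFormat.class;
        }

        @Override
        public Class<? extends OutputFormat> getOutputFormatClass() {
            return MyBinaryOutputFormat.class; // placeholder OutputFormat
        }

        @Override
        public Class<? extends SerDe> getSerDeClass() {
            return MyBinarySerDe.class; // placeholder SerDe
        }
    }

A table would then be created with STORED BY 'com.example.MyBinaryStorageHandler', and it is precisely the use of STORED BY that makes the table non-native.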
Finally, I could have my InputFormat implement CombineHiveInputFormat.AvoidSplitCombination to bypass most of CombineHiveInputFormat (see the sketch below), but this interface is only available starting with Hive 1.0, and I'd like my code to work with earlier versions of Hive (at least back to 0.12).
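For completeness, if I read the Hive 1.0 API correctly, that interface exposes a single shouldSkipCombine(Path, Configuration) method, so the implementation would be roughly this (reusing the MyBinaryInputFormat sketch above):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

    public class MyCombineAvoidingInputFormat extends MyBinaryInputFormat
            implements CombineHiveInputFormat.AvoidSplitCombination {

        // Returning true tells CombineHiveInputFormat to leave these files
        // out of split combination and delegate to this class's getSplits.
        @Override
        public boolean shouldSkipCombine(Path path, Configuration conf) throws IOException {
            return true;
        }
    }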
I filed a ticket in the Hive bug tracker here, in case this behavior is unintentional: https://issues.apache.org/jira/browse/HIVE-9771
Has anyone written a custom FileInputFormat that overrides getSplits for use with Hive? If so, did you run into any trouble getting Hive to delegate the call to getSplits, and how did you overcome it?