
I need to read Avro records serialized in Avro files in HDFS. To do that, I use AvroKeyInputFormat, so my mapper receives the records it reads as keys.

My question is: how can I control the split size? With the text input format, this consists of defining the size in bytes. Here I need to define how many records each split will contain.

I would like to treat all the files in my input directory as one big file. Do I have to use CombineFileInputFormat? Is it possible to use it with Avro?


1 Answer


Splits honor logical record boundaries, while the min and max split sizes are specified in bytes. The text input format, for example, won't break a line in a text file even though the split boundaries are defined in bytes.

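To have each file end up in a single split, you can either set the min split size to Long.MAX_VALUE (the max split size already defaults to Long.MAX_VALUE, so raising it does not stop a file from being split at block boundaries), or you can override the isSplitable method in a subclass of your input format and return false.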
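To see how the byte bounds interact, here is a minimal, Hadoop-free sketch of the split-size rule as I understand it from FileInputFormat in the mapreduce API: the split size is max(minSize, min(maxSize, blockSize)). The class name SplitSizeSketch and the countSplits helper are made up for illustration, and the split count is simplified (the real code also allows a small slop on the last split), so treat it as an approximation rather than the actual implementation.

```java
public class SplitSizeSketch {

    // Assumed to mirror the rule used by FileInputFormat (mapreduce API):
    // the block size is clamped between the min and max split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Hypothetical helper: simplified split count for one file.
    // Ignores the slop factor the real implementation applies to the last split.
    static long countSplits(long fileLength, long splitSize) {
        if (fileLength == 0) {
            return 0;
        }
        // Written this way to avoid overflow when splitSize is Long.MAX_VALUE.
        return fileLength / splitSize + (fileLength % splitSize == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;   // assumed HDFS block size
        long fileLength = 1024 * mb; // assumed 1 GB input file

        // Defaults: min = 1, max = Long.MAX_VALUE, so the block size wins.
        long byDefault = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        System.out.println("default splits: " + countSplits(fileLength, byDefault));

        // Lowering the max below the block size produces more, smaller splits.
        long smallMax = computeSplitSize(blockSize, 1L, 32 * mb);
        System.out.println("max=32MB splits: " + countSplits(fileLength, smallMax));

        // Raising the min to Long.MAX_VALUE forces one split per file.
        long hugeMin = computeSplitSize(blockSize, Long.MAX_VALUE, Long.MAX_VALUE);
        System.out.println("min=MAX_VALUE splits: " + countSplits(fileLength, hugeMin));
    }
}
```

In a real job the two bounds are normally set through the mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize properties. Either way, the bounds are in bytes, so the number of records per split is only controlled indirectly.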