
I have created a Hive external table which reads a custom file input format. This works perfectly fine when the files are small. But when the files are big, the job splits up the files and my job fails.

I'm returning false from the isSplitable method in my custom input format class. I have also tried setting mapreduce.input.fileinputformat.split.minsize and mapred.min.split.size to large values. I have created custom InputFormat, OutputFormat, and SerDe classes and used them while creating this table.
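
For reference, the splittability override in my input format looks roughly like this (a simplified sketch; the key/value types and the record reader body are placeholders, not my real implementation):

package com.hiveio.io;

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hive table input formats use the old org.apache.hadoop.mapred API.
public class CustomHiveInputFormat extends FileInputFormat<Text, Text> {

    // Returning false here is supposed to force one split per file.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        // The actual record reader for the custom file format is omitted here.
        throw new UnsupportedOperationException("record reader omitted in this sketch");
    }
}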

In my job logs I'm still seeing the splits happening:

Processing split: Paths:/user/test/testfile1:0+134217728,/user/test/testfile1:134217728+95198924,/user/test/testfile2:0+134217728,/user/test/testfile2:134217728+96092244...

The 134217728 is 128 MB, which must be my HDFS block size. Is there a way I can prevent this splitting from happening? Is it related to this issue: https://issues.apache.org/jira/browse/HIVE-8630 ?

My Create table statement is:

CREATE EXTERNAL TABLE test_data(
  key STRING, 
  body map<string, string>  
  )
PARTITIONED BY (year int, month int, day int)  
ROW FORMAT SERDE 'com.hiveio.io.CustomHiveSerde' 
STORED AS INPUTFORMAT 'com.hiveio.io.CustomHiveInputFormat' 
OUTPUTFORMAT 'com.hiveio.io.CustomHiveOutputFormat' 
LOCATION '/user/test/';
Could you elaborate on "my job fails" -- is it because your record delimiter is not the usual LF? And by the way, did you try to gzip the files to make them un-splittable? – Samson Scharfrichter
The job fails because my input file is not splittable: once it is split, my input format starts reading the files wrong and gets invalid data for the Hive table values. Gzipping the files works because the files compress to ~20 MB. Smaller files also work unzipped. It's only when the file size is large that the job fails. I haven't tried with a gzip file > 128 MB. – Manoj Sreekumar
Is the map/reduce job running off a Hive query? If so, the table must also have the input format declared, or Hive would use its default. We need more detail to answer your question, in particular what job you are running and how. – Roberto Congiu
Maybe your problem is similar to the bug reported in MAPREDUCE-2254; see the final comments: issues.apache.org/jira/browse/MAPREDUCE-2254 – Samson Scharfrichter
@RobertoCongiu: I have added my create table statement. My job is generated from a Hive count query: "select count(1) from test". I don't think this is happening due to a wrong input format, as it works for small files. @SamsonScharfrichter: Since it works for smaller files, I think it might not be due to delimiters. – Manoj Sreekumar

1 Answer


OK... actually, your mentioning https://issues.apache.org/jira/browse/HIVE-8630 rang a bell. A while ago we dealt with a very similar problem. That bug describes how CombineHiveInputFormat will still split unsplittable formats. CombineHiveInputFormat is Hive's default input format, and its purpose is to combine multiple small files to reduce overhead. You can disable it by setting

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

before the query, or set it in hive-site.xml if you want it as the default:

<property>
   <name>hive.input.format</name>
   <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value>
</property>

Note that you'll be sacrificing the combining feature, so if you have many small files, each one will take its own mapper during processing... but this should work; it did work for us.
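
For example, with the count query from your comment, the session would look like this (using your test_data table):

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(1) from test_data;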