I created a Hive external table, pointed at an S3 folder, using the following statement:
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS daily_input_file (
  log_day STRING,
  resource STRING,
  request_type STRING,
  format STRING,
  mode STRING,
  count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/my-folder';
```
When I execute a query, such as:
```sql
SELECT * FROM daily_input_file WHERE log_day IN ('20160508', '20160507');
```
I expect records to be returned, but for these days the query comes back empty.
I have verified that this data is present in the files in that folder. In fact, if I copy the file containing this particular data into a new folder, create a table over that new folder, and run the query, I get the expected results. I also get results from other files (in fact, from most files) within the original folder.
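For completeness, the isolation test looked roughly like this (the `_test` table name and the `my-folder-test` prefix are placeholders; the real names differ):

```sql
-- Same schema and delimiter as above, but pointed at a folder that
-- contains only the single file holding the 20160507/20160508 rows
CREATE EXTERNAL TABLE IF NOT EXISTS daily_input_file_test (
  log_day STRING,
  resource STRING,
  request_type STRING,
  format STRING,
  mode STRING,
  count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/my-folder-test';

-- This returns the expected rows, while the same query against
-- daily_input_file returns nothing for those days
SELECT * FROM daily_input_file_test WHERE log_day IN ('20160508', '20160507');
```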
The contents of s3://my-bucket/my-folder are simple. There are no subdirectories within the folder. The files come in two varieties (a and b), all prefixed with the date they were created (YYYYMMDD_) and all ending in .txt000.gz. Here are some examples:
- 20160508_a.txt000.gz
- 20160508_b.txt000.gz
- 20160509_a.txt000.gz
- 20160509_b.txt000.gz
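If it helps narrow things down, one check I can run is Hive's INPUT__FILE__NAME virtual column, to see which objects in the folder actually contribute rows; something like this sketch:

```sql
-- Count rows per underlying S3 object, to see whether the
-- 20160507/20160508 files are being read at all
SELECT INPUT__FILE__NAME, COUNT(*) AS rows_read
FROM daily_input_file
GROUP BY INPUT__FILE__NAME;
```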
So what might be going on? Is there a limit to the number of files within a single folder that can be processed from S3? Or is something else the culprit?
Here are the versions used:
- Release label: emr-4.7.0
- Hadoop distribution: Amazon 2.7.2
- Applications: Hive 1.0.0, Pig 0.14.0, Hue 3.7.1