I created a Hive external table, pointed at an S3 folder, using the following statement:
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS daily_input_file (
  log_day STRING,
  resource STRING,
  request_type STRING,
  format STRING,
  mode STRING,
  count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/my-folder';
```
When I execute a query, such as:
```sql
SELECT * FROM daily_input_file WHERE log_day IN ('20160508', '20160507');
```
I expect records to be returned, but for these days the query comes back empty.
I have verified that this data is present in the files in that folder. In fact, if I copy the file containing this particular data into a new folder, create a table over that new folder, and run the query, I get the expected results. I also get results from other files (in fact, from most files) within the original folder.
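For completeness, the isolation test looked roughly like this (the `_test` table name and the `my-folder-test` prefix are placeholders; the real names differ):

```sql
-- Same schema and delimiter as above, but pointed at a folder that
-- contains only the single file holding the 20160507/20160508 rows
CREATE EXTERNAL TABLE IF NOT EXISTS daily_input_file_test (
  log_day STRING,
  resource STRING,
  request_type STRING,
  format STRING,
  mode STRING,
  count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/my-folder-test';

-- This returns the expected rows, while the same query against
-- daily_input_file returns nothing for those days
SELECT * FROM daily_input_file_test WHERE log_day IN ('20160508', '20160507');
```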
The contents of s3://my-bucket/my-folder are simple. There are no subdirectories within the folder. The files come in two varieties (a and b), all prefixed with the date they were created (YYYYMMDD_) and all ending in .txt000.gz. Here are some examples:
- 20160508_a.txt000.gz
- 20160508_b.txt000.gz
- 20160509_a.txt000.gz
- 20160509_b.txt000.gz
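If it helps narrow things down, one check I can run is Hive's INPUT__FILE__NAME virtual column, to see which objects in the folder actually contribute rows; something like this sketch:

```sql
-- Count rows per underlying S3 object, to see whether the
-- 20160507/20160508 files are being read at all
SELECT INPUT__FILE__NAME, COUNT(*) AS rows_read
FROM daily_input_file
GROUP BY INPUT__FILE__NAME;
```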
So what might be going on? Is there a limit to the number of files within a single folder that can be processed from S3? Or is something else the culprit?
Here are the versions used:
- Release label: emr-4.7.0
- Hadoop distribution: Amazon 2.7.2
- Applications: Hive 1.0.0, Pig 0.14.0, Hue 3.7.1