I have a table that is being partitioned by a specific start date (ds). I can query the latest partition (the previous day's data) and it will use the partition fine.
hive> select count(1) from vtc4 where ds='2012-11-01' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 6.43 sec HDFS Read: 46281957 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 430 msec
OK
151225
Time taken: 35.007 seconds
However, when I try to query earlier partitions, hive seems to read the partition fine, but does not return any results.
hive> select count(1) from vtc4 where ds='2012-10-31' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 7.64 sec HDFS Read: 37754168 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 640 msec
OK
0
Time taken: 29.07 seconds
However, if I tell hive to run the query against the date field inside the table itself, and don't use the partition, I get the correct result.
hive> select count(1) from vtc4 where date_started >= "2012-10-31 00:00:00" and date_started < "2012-11-01 00:00:00" ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 63 Reduce: 1 Cumulative CPU: 453.52 sec HDFS Read: 16420276606 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 minutes 33 seconds 520 msec
OK
123201
Time taken: 265.874 seconds
What am I missing here? I'm running hadoop 1.03 and hive 0.9. I'm pretty new to hive/hadoop, so any help would be appreciated.
Thanks.
EDIT 1: hive> describe formatted vtc4 partition (ds='2012-10-31');
Partition Value: [2012-10-31 ]
Database: default
Table: vtc4
CreateTime: Wed Oct 31 12:02:24 PDT 2012
LastAccessTime: UNKNOWN
Protect Mode: None
Location: hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31
Partition Parameters:
transient_lastDdlTime 1351875579
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.191 seconds
The partition folders exist, but when i try to do a hadoop fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31 it says the file/directory does not exist. If I browse to that directory using the web interface, I can get into the folder , as well as see the /part-m-000* files. If I do a fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-11-01 it works fine.
describe formatted vtc4 partition (ds='2012-10-31')? What is the location on hdfs and what files do you see there? - Joe K