2 votes

I have a table that is partitioned by a start date column (ds). I can query the latest partition (the previous day's data) and it will use the partition fine.
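
For reference, the table is defined along these lines (a sketch; only date_started and the partition column ds appear in my queries, other columns and types omitted):

CREATE TABLE vtc4 (
  date_started STRING  -- type abbreviated; other data columns omitted
)
PARTITIONED BY (ds STRING);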

hive> select count(1) from vtc4 where ds='2012-11-01' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 6.43 sec   HDFS Read: 46281957 HDFS Write:  7 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 430 msec
OK
151225
Time taken: 35.007 seconds

However, when I try to query earlier partitions, hive seems to read the partition fine, but does not return any results.

hive> select count(1) from vtc4 where ds='2012-10-31' ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 7.64 sec   HDFS Read: 37754168 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 640 msec
OK
0
Time taken: 29.07 seconds

However, if I tell hive to run the query against the date field inside the table itself, rather than using the partition, I get the correct result.

hive> select count(1) from vtc4 where date_started >= "2012-10-31 00:00:00" and date_started < "2012-11-01 00:00:00" ;
...garbage...
MapReduce Jobs Launched:
Job 0: Map: 63  Reduce: 1   Cumulative CPU: 453.52 sec   HDFS Read: 16420276606 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 minutes 33 seconds 520 msec
OK
123201
Time taken: 265.874 seconds

What am I missing here? I'm running hadoop 1.0.3 and hive 0.9. I'm pretty new to hive/hadoop, so any help would be appreciated.

Thanks.

EDIT 1: hive> describe formatted vtc4 partition (ds='2012-10-31');

Partition Value:        [2012-10-31 ]
Database:               default
Table:                  vtc4
CreateTime:             Wed Oct 31 12:02:24 PDT 2012
LastAccessTime:         UNKNOWN
Protect Mode:           None
Location:               hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31
Partition Parameters:
    transient_lastDdlTime   1351875579

# Storage Information 
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
serialization.format    1
Time taken: 0.191 seconds

The partition folders exist, but when I try to do a hadoop fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31 it says the file/directory does not exist. If I browse to that directory using the web interface, I can get into the folder and see the /part-m-000* files. If I do an fs -ls on hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-11-01 it works fine.
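
Concretely, with the paths copied from above:

# says the file/directory does not exist
hadoop fs -ls hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31

# works fine and lists the /part-m-000* files
hadoop fs -ls hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-11-01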

What happens when you do describe formatted vtc4 partition (ds='2012-10-31')? What is the location on hdfs and what files do you see there? - Joe K
@Joe K I updated the post with the info you asked for. - wafflecopter

2 Answers

2 votes

Seems like either a permissions issue, or something funky with either hive's or the namenode's metadata. Here's what I would try (see the command sketch after the list):

  1. copy the data in that partition to some other location in hdfs. You may need to do this as the hive or hdfs user, depending on how your permissions are set up.
  2. alter table vtc4 drop partition (ds='2012-10-31');
  3. alter table vtc4 add partition (ds='2012-10-31');
  4. copy the data back into that partition on hdfs
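
A rough sketch of those steps as commands (warehouse path taken from the describe output above; run as whichever user owns the files):

# 1. stash the partition's data somewhere else on hdfs first
#    (dropping a partition of a managed table deletes its directory)
hadoop fs -cp hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31 /tmp/vtc4_ds_backup

# 2. + 3. drop and re-add the partition so hive rewrites its metadata
hive -e "alter table vtc4 drop partition (ds='2012-10-31');"
hive -e "alter table vtc4 add partition (ds='2012-10-31');"

# 4. copy the data back into the freshly registered partition
hadoop fs -cp /tmp/vtc4_ds_backup/* hdfs://hadoop5.internal/user/hive/warehouse/vtc4/ds=2012-10-31/
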
1 vote

Another thing with hive partitions is that they sometimes don't get registered in the metastore when they are created outside of hive (e.g. from Spark SQL). You can also try MSCK REPAIR TABLE xc_bonus; after any changes to the partitions so they are reflected correctly.
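
For the table in this question, that would look like (sketch; MSCK REPAIR asks hive to re-sync the metastore with the partition directories it finds on hdfs):

hive> MSCK REPAIR TABLE vtc4;
hive> SHOW PARTITIONS vtc4;   -- verify the partition is registered again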