1
votes

I am having flat files being delivered to hdfs. The general structure of files are as below:

<header - 2 rows>
 <data>
<footer - 1 row>

I am having an external hive table build on top of this dataset. Below is my hive ddl:

create external table ext_test
(
  id string,
  name string, 
  age string
) row format DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '<hdfs file location>'
  TBLPROPERTIES ('skip.footer.line.count'='1', 'skip.header.line.count'='2')

When i query select * from ext_test in HIVE; i am getting the exact number of rows as expected from the external table. But when i run the exact same query in IMPALA, i receive one additional row which is basically the footer property.

e.g: The below example will clear out the scenario:

Sample Data in the Feed:

Header,Sample-03122018,
ID,NAME,AGE
1,R,10
2,RR,11
3,RRR,12
Footer,End of Report,

Output when i run select command in HIVE using the above DDL [Expected Output]:

1,R,10
2,RR,11
3,RRR,12

Output when i run select command in IMPALA using the above DDL: [Issue]

1,R,10
2,RR,11
3,RRR,12
Footer,End of Report,Null

Did you face similar issue? If yes, please advice me of the suitable solution.

1

1 Answers

1
votes

This feature is not implemented in Impala yet, see this Jira IMPALA-7196.

Impala doesn't recognize skip.footer.line.count, so it is effectively ignored.

The workaround is to query using filter WHERE ID <> 'Footer', or cut the file footers before loading files into table directory.