How to ignore headers in PySpark when using Athena and AWS Glue Data Catalog

Question

Assume I have a CSV file like this:

"Col1Name", "Col2Name"
"a", "b"
"c", "d"

Assume I issue the following CREATE EXTERNAL TABLE command in Athena:

CREATE EXTERNAL TABLE test.sometable (
   col1name string,
   col2name string
) 
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
   'separatorChar' = ',',
   'quoteChar' = '\"',
   'escapeChar' = '\\'
) 
stored as textfile
location 's3://somebucket/some/path/'
tblproperties("skip.header.line.count"="1")

Then I issue the following SELECT:

SELECT * FROM test.sometable

I expect to get the following:

+----------+----------+
|  col1name|  col2name|
+----------+----------+
|         a|         b|
|         c|         d|
+----------+----------+

...and sure enough, that's exactly what I get.

On an EMR cluster using the AWS Glue metadata catalog in Spark, I issue the following in the pyspark REPL:

a = spark.sql("select * from test.sometable")
a.show()

I expect to receive the same output, but, instead, I get this:

+----------+----------+
|  col1name|  col2name|
+----------+----------+
|  col1name|  col2name|
|         a|         b|
|         c|         d|
+----------+----------+

Obviously, Athena is honoring the "skip.header.line.count" tblproperty, but PySpark appears to be ignoring it.

How can I get PySpark to ignore this header line, as Athena does?

does this question give you an answer? stackoverflow.com/questions/49866197/… — Sayat Satybald

DEEPAK LAKHOTIA DEEPAK LAKHOTIA · Accepted Answer · 2018-04-30T14:46:21

Any of the two methods should help you:

(1) Set the header row count to be skipped in the parameter:

'skip.header.line.count'='1'

(2) Or, in the select query use a where clause to filter that row. Say:

SELECT * FROM test.sometable where col1name <> 'col1name'

How to ignore headers in PySpark when using Athena and AWS Glue Data Catalog

1 Answers