Assume I have a CSV file like this:
"Col1Name", "Col2Name"
"a", "b"
"c", "d"
Assume I issue the following CREATE EXTERNAL TABLE command in Athena:
CREATE EXTERNAL TABLE test.sometable (
col1name string,
col2name string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
stored as textfile
location 's3://somebucket/some/path/'
tblproperties("skip.header.line.count"="1")
Then I issue the following SELECT:
SELECT * FROM test.sometable
I expect to get the following:
+----------+----------+
| col1name| col2name|
+----------+----------+
| a| b|
| c| d|
+----------+----------+
...and sure enough, that's exactly what I get.
On an EMR cluster using the AWS Glue metadata catalog in Spark, I issue the following in the pyspark REPL:
a = spark.sql("select * from test.sometable")
a.show()
I expect to receive the same output, but, instead, I get this:
+----------+----------+
| col1name| col2name|
+----------+----------+
| col1name| col2name|
| a| b|
| c| d|
+----------+----------+
Obviously, Athena is honoring the "skip.header.line.count" tblproperty, but PySpark appears to be ignoring it.
How can I get PySpark to ignore this header line, as Athena does?