The current behavior of EMR Spark (version 5.26) with the associated Glue catalog, when saving data to S3 and writing Glue metadata, is as follows.
I have an EMR cluster and I am running the commands below.
Scenario 1
```scala
Seq(1, 2, 3).toDF("id")
  .write
  .option("header", "true")
  .option("delimiter", "|")
  .format("csv")
  .saveAsTable("testdb.spark_csv_test_v1")
```
This generates:
- An S3 file with a header row and "|"-separated data, as expected
- Glue metadata with input format `org.apache.hadoop.mapred.SequenceFileInputFormat`, output format `org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat`, and serialization lib `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`
- Schema:

| # | Column Name | Data type | Partition key | Comment |
|---|---|---|---|---|
| 1 | col | array | - | from deserializer |
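For reference, the serde and formats above can also be confirmed from Spark itself. A minimal sketch, assuming an active `spark` session (e.g. in `spark-shell` on the EMR cluster) with the Glue-backed catalog enabled:

```scala
// Sketch: inspect the storage metadata that Spark/Glue recorded for the table.
// In spark-shell, spark.implicits._ is already imported, so $"..." works.
spark.sql("DESCRIBE FORMATTED testdb.spark_csv_test_v1")
  .filter($"col_name".isin("InputFormat", "OutputFormat", "Serde Library"))
  .show(truncate = false)
```

This avoids round-tripping through the Glue console to check what each write produced.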
This works fine when queried from EMR, but throws an error when queried from Redshift Spectrum.
Scenario 2
```scala
Seq(1, 2, 3).toDF("id").createOrReplaceTempView("df_test")

spark.sql("""
  CREATE TABLE testdb.spark_csv_test_v2
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES ('header' = 'true', 'field.delim' = '|')
  STORED AS TEXTFILE
  TBLPROPERTIES ('skip.header.line.count' = '1', 'classification' = 'csv', 'delimiter' = '|')
  AS
  SELECT * FROM df_test
""")
```
This generates:
- An S3 file with "|"-separated data, but without a header row
- Glue metadata with input format `org.apache.hadoop.mapred.TextInputFormat`, output format `org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat`, and serialization lib `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`
- Schema:

| # | Column Name | Data type | Partition key | Comment |
|---|---|---|---|---|
| 1 | id | int | - | - |
This works well with both EMR and Redshift Spectrum, but the files do not have a header row.
Question: Is there a way to write data from EMR + Spark to S3 such that:
a. the S3 files have a header row
b. the format is CSV with the provided delimiter
c. the Glue metadata has the proper schema (not a single `col` column of type array)
d. the input format is TextInputFormat and the output format is HiveIgnoreKeyTextOutputFormat
e. the data can be read from Redshift Spectrum
f. the data can be read from Spark
- Scenario 1 above gives me (a, b, f) but not (c, d, e)
- Scenario 2 above gives me (b, c, d, e, f) but not (a)
If Scenario 2 could somehow write its files with a header row, that would be the solution. I believe we could pass `.option("header", "true")`, but that option does not fit the CTAS syntax used here.
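One workaround I am considering is to split the write in two steps: emit the CSV files with a header via the DataFrame API, then register an external table over that location with the desired serde and `skip.header.line.count`. This is an untested sketch; the bucket path and table name are placeholders:

```scala
// Sketch (assumptions: bucket path and table name are examples).
// Step 1: write "|"-delimited CSV files WITH a header row to S3.
val path = "s3://my-bucket/spark_csv_test_v3/"

Seq(1, 2, 3).toDF("id")
  .write
  .option("header", "true")
  .option("delimiter", "|")
  .mode("overwrite")
  .csv(path)

// Step 2: register an external table over those files, telling readers
// (EMR, Spectrum) to skip the header line.
spark.sql(s"""
  CREATE EXTERNAL TABLE testdb.spark_csv_test_v3 (id int)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES ('field.delim' = '|')
  STORED AS TEXTFILE
  LOCATION '$path'
  TBLPROPERTIES ('skip.header.line.count' = '1', 'classification' = 'csv', 'delimiter' = '|')
""")
```

If this works, it would satisfy (a) through (f), at the cost of managing the table DDL and S3 location explicitly instead of relying on `saveAsTable`.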