The current behavior of EMR Spark (version 5.26) with the associated Glue catalog, when saving data to S3 and writing Glue metadata, is as follows.
I have an EMR cluster and I am running the commands below.
Scenario 1
```scala
Seq(1, 2, 3).toDF("id")
  .write
  .option("header", "true")
  .option("delimiter", "|")
  .format("csv")
  .saveAsTable("testdb.spark_csv_test_v1")
```
This generates:
- An S3 file with a header row and "|"-separated data, as expected
- Glue metadata with input format `org.apache.hadoop.mapred.SequenceFileInputFormat`, output format `org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat`, and serialization lib `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`
- Schema:

| # | Column Name | Data type | Partition key | Comment |
|---|---|---|---|---|
| 1 | col | array | - | from deserializer |
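For reference, the serde and formats above can also be confirmed from Spark itself. A minimal sketch, assuming an active `spark` session (e.g. in `spark-shell` on the EMR cluster) with the Glue-backed catalog enabled:

```scala
// Sketch: inspect the storage metadata that Spark/Glue recorded for the table.
// In spark-shell, spark.implicits._ is already imported, so $"..." works.
spark.sql("DESCRIBE FORMATTED testdb.spark_csv_test_v1")
  .filter($"col_name".isin("InputFormat", "OutputFormat", "Serde Library"))
  .show(truncate = false)
```

This avoids round-tripping through the Glue console to check what each write produced.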
This works fine when queried from EMR, but throws an error when queried from Redshift Spectrum.
Scenario 2
```scala
Seq(1, 2, 3).toDF("id").createOrReplaceTempView("df_test")

spark.sql("""
  CREATE TABLE testdb.spark_csv_test_v2
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES ('header' = 'true', 'field.delim' = '|')
  STORED AS TEXTFILE
  TBLPROPERTIES ('skip.header.line.count' = '1', 'classification' = 'csv', 'delimiter' = '|')
  AS
  SELECT * FROM df_test
""")
```
This generates:
- An S3 file with "|"-separated data, but without a header row
- Glue metadata with input format `org.apache.hadoop.mapred.TextInputFormat`, output format `org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat`, and serialization lib `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`
- Schema:

| # | Column Name | Data type | Partition key | Comment |
|---|---|---|---|---|
| 1 | id | int | - | - |
This works well with both EMR and Redshift Spectrum, but the files do not have a header row.
Question: Is there a way to write data from EMR + Spark to S3 such that:
a. the S3 files have a header row
b. the format is CSV with the provided delimiter
c. the Glue metadata has the proper schema (not a single `col` column of type array)
d. the input format is TextInputFormat and the output format is HiveIgnoreKeyTextOutputFormat
e. the data can be read from Redshift Spectrum
f. the data can be read from Spark
- Scenario 1 above gives me (a, b, f) but not (c, d, e)
- Scenario 2 above gives me (b, c, d, e, f) but not (a)
If Scenario 2 could somehow write its files with a header row, that would be the solution. I believe we could pass `.option("header", "true")`, but that option does not fit the CTAS syntax used here.
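One workaround I am considering is to split the write in two steps: emit the CSV files with a header via the DataFrame API, then register an external table over that location with the desired serde and `skip.header.line.count`. This is an untested sketch; the bucket path and table name are placeholders:

```scala
// Sketch (assumptions: bucket path and table name are examples).
// Step 1: write "|"-delimited CSV files WITH a header row to S3.
val path = "s3://my-bucket/spark_csv_test_v3/"

Seq(1, 2, 3).toDF("id")
  .write
  .option("header", "true")
  .option("delimiter", "|")
  .mode("overwrite")
  .csv(path)

// Step 2: register an external table over those files, telling readers
// (EMR, Spectrum) to skip the header line.
spark.sql(s"""
  CREATE EXTERNAL TABLE testdb.spark_csv_test_v3 (id int)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES ('field.delim' = '|')
  STORED AS TEXTFILE
  LOCATION '$path'
  TBLPROPERTIES ('skip.header.line.count' = '1', 'classification' = 'csv', 'delimiter' = '|')
""")
```

If this works, it would satisfy (a) through (f), at the cost of managing the table DDL and S3 location explicitly instead of relying on `saveAsTable`.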