
The current behavior of EMR Spark (version 5.26) with the associated Glue catalog, when saving data to S3 and writing Glue metadata, is as follows.

I have an EMR cluster and I am running the commands below.

Scenario 1

Seq(1,2,3).toDF("id")
    .write
    .option("header","true")
    .option("delimiter","|")
    .format("csv")
    .saveAsTable("testdb.spark_csv_test_v1")

This generates:

  1. The S3 file is written correctly, with a header row and "|"-separated data
  2. Glue metadata: Input format = org.apache.hadoop.mapred.SequenceFileInputFormat, Output format = org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, Serialization lib = org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  3. Schema:
     # | Column Name | Data type | Partition key | Comment
     1 | col         | array     | -             | from deserializer

This works great with EMR but throws an error when read from Redshift.

Scenario 2

Seq(1,2,3).toDF("id").createOrReplaceTempView("df_test")
spark.sql("""
CREATE TABLE testdb.spark_csv_test_v2
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('header'='true', 'field.delim'='|')
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count'='1', 'classification'='csv', 'delimiter'='|')
AS
select * from df_test
""")

This generates:

  1. The S3 file is written correctly with "|"-separated data, but without a header row
  2. Glue metadata: Input format = org.apache.hadoop.mapred.TextInputFormat, Output format = org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serialization lib = org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  3. Schema:
     # | Column Name | Data type | Partition key | Comment
     1 | id          | int       | -             | -

This works great with both EMR and Redshift, but the files do not have a header row.

Question: Is there a way I can write data from EMR+Spark to S3 such that:
a. the S3 files have a header row
b. the format is CSV with the provided delimiter
c. the Glue metadata is set with a proper schema (not a single array column named col)
d. the input format is TextInputFormat and the output format is HiveIgnoreKeyTextOutputFormat
e. the data can be read from Redshift Spectrum
f. the data can be read from Spark
  • In Scenario 1 above I get (a., b., f.) but not (c., d., e.)
  • In Scenario 2 above I get (b., c., d., e., f.) but not (a.)

If Scenario 2 could somehow write the files with a header row, that would be the solution. I believe we could normally pass .option("header", "true"), but that option does not apply to the CTAS syntax used here.
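One possible workaround (a sketch only, not verified against this exact EMR/Glue setup; the S3 bucket, prefix, and table name below are placeholders) is to split the two steps: write the CSV files with a header using the DataFrame writer, then register an external table over that location so the header row is skipped on read:

```scala
// Step 1: write "|"-delimited CSV files with a header row to an explicit
// S3 location. The bucket/prefix is a placeholder.
Seq(1, 2, 3).toDF("id")
  .write
  .option("header", "true")
  .option("delimiter", "|")
  .csv("s3://my-bucket/spark_csv_test_v3/")

// Step 2: register an external table over those files. LazySimpleSerDe with
// TEXTFILE gives TextInputFormat/HiveIgnoreKeyTextOutputFormat (requirement d.),
// and 'skip.header.line.count'='1' tells EMR and Redshift Spectrum to ignore
// the header row when reading (requirements a., e., f.).
spark.sql("""
CREATE EXTERNAL TABLE testdb.spark_csv_test_v3 (id INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'='|')
STORED AS TEXTFILE
LOCATION 's3://my-bucket/spark_csv_test_v3/'
TBLPROPERTIES ('skip.header.line.count'='1', 'classification'='csv', 'delimiter'='|')
""")
```

This trades the single CTAS statement for a write plus a DDL statement, but it keeps the Glue schema explicit and the files headered.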


1 Answer


In your Scenario 2 you are adding 'skip.header.line.count'='1' to your table properties, which, according to the AWS documentation, skips the header row. So can you try it without this option?
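Spelled out, the suggested variant is just the asker's Scenario 2 statement with that one table property removed (whether Spark then actually emits a header row when writing still depends on the writer, so this only illustrates the proposal):

```scala
spark.sql("""
CREATE TABLE testdb.spark_csv_test_v2
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('header'='true', 'field.delim'='|')
STORED AS TEXTFILE
TBLPROPERTIES ('classification'='csv', 'delimiter'='|')
AS
select * from df_test
""")
```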