Using Apache Spark on an EMR cluster, I have read in XML data, inferred the schema, and stored it on S3 in Parquet format. It is now, essentially, a nested table.
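For context, the ingest step looks roughly like this (this is a sketch: it assumes the spark-xml package, com.databricks:spark-xml, is on the classpath, and the row tag and S3 paths are placeholders I made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xml-to-parquet").getOrCreate()

# Read XML via the spark-xml package; the schema is inferred on read.
df = (spark.read.format("xml")
      .option("rowTag", "record")              # placeholder row tag
      .load("s3://my-bucket/raw-xml/"))        # placeholder input path

# Persist as Parquet on S3, keeping the nested structure.
df.write.mode("overwrite").parquet("s3://my-bucket/parquet/my_table/")

schema = df.schema    # a pyspark.sql.types.StructType
```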
Using Spark, I have the schema. I now want to create an external table so Redshift Spectrum can query the data.
How do I convert the schema from the format provided by Spark to that required for a CREATE EXTERNAL TABLE statement for Redshift Spectrum?
As I'm dealing with many external tables, writing each schema out by hand is not an option.
I've not been able to find any existing tool that converts a Spark schema to the Redshift Spectrum external table format (see the Amazon Nested Table Tutorial).
The Spark schema is a pyspark.sql.types.StructType, and I can convert it to JSON with schema.jsonValue().
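For reference, here is roughly what jsonValue() returns on a toy schema (the schema itself is made up purely for illustration):

```python
from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, LongType)

# Toy nested schema for illustration only.
schema = StructType([
    StructField("id", LongType()),
    StructField("tags", ArrayType(StringType())),
])

# jsonValue() returns a plain dict; schema.json() returns the same as a string.
print(schema.jsonValue())
# {'type': 'struct', 'fields': [
#   {'name': 'id', 'type': 'long', 'nullable': True, 'metadata': {}},
#   {'name': 'tags',
#    'type': {'type': 'array', 'elementType': 'string', 'containsNull': True},
#    'nullable': True, 'metadata': {}}]}
```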
I could write a tool that does the conversion myself, but if an existing tool already does this I'd prefer to use it.
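To make the question concrete, this is a minimal sketch of the kind of converter I have in mind. All the names here (to_spectrum_type, create_external_table, the SCALARS map) are made up, and the scalar type mapping is a simplification, so check it against the Spectrum data types before relying on it:

```python
from pyspark.sql.types import ArrayType, DataType, MapType, StructType

# Simplified Spark scalar type -> Spectrum DDL type mapping (assumption:
# adjust varchar lengths, decimal precision/scale, etc. to your data).
SCALARS = {
    "string": "varchar(max)",
    "long": "bigint",
    "integer": "int",
    "short": "smallint",
    "double": "double precision",
    "float": "real",
    "boolean": "boolean",
    "date": "date",
    "timestamp": "timestamp",
}

def to_spectrum_type(dt: DataType) -> str:
    """Recursively render a Spark DataType as a Spectrum DDL type."""
    if isinstance(dt, StructType):
        # Struct members use name:type (colon), per Hive-style DDL.
        inner = ",".join(f"{f.name}:{to_spectrum_type(f.dataType)}"
                         for f in dt.fields)
        return f"struct<{inner}>"
    if isinstance(dt, ArrayType):
        return f"array<{to_spectrum_type(dt.elementType)}>"
    if isinstance(dt, MapType):
        return (f"map<{to_spectrum_type(dt.keyType)},"
                f"{to_spectrum_type(dt.valueType)}>")
    return SCALARS[dt.typeName()]   # KeyError flags any unmapped type

def create_external_table(schema: StructType, table: str, location: str) -> str:
    """Render a full CREATE EXTERNAL TABLE statement for a Parquet table."""
    cols = ",\n  ".join(f"{f.name} {to_spectrum_type(f.dataType)}"
                        for f in schema.fields)
    return (f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
            f"STORED AS PARQUET\nLOCATION '{location}';")
```

Calling create_external_table(df.schema, "spectrum.my_table", "s3://my-bucket/parquet/my_table/") would then emit the full DDL. Note the asymmetry in Hive-style nested DDL: struct members are written name:type with a colon, while top-level columns are written name type with a space, as in the nested-data examples in the Amazon tutorial.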
Any thoughts / suggestions?