I have a service (Secor) that writes the raw messages it receives to Parquet files. I'd like to create a predefined schema for these Parquet files, so that Spark can combine several files even when the schema has changed between them.
We're using Spark 2.1.0.
Elaborated example: we save each day's data under a folder named after the date, and within each date the data is further split by hour. So our Parquet layout looks like this:
date=2017-03-23
|-- hour=00
|-- hour=01
|-- ...
|-- hour=23
Let's say that at the beginning of the day the schema of the messages written to Parquet contained only two fields:
root
|-- user: String
|-- id: Long
Now around mid-day we've added another field, so the schema becomes:
root
|-- user: String
|-- id: Long
|-- country: String
That means that if we try to read the entire day's data with sparkSession.read.parquet("s3a://bucket/date=2017-03-23"), Spark will crash because the sub-folders do not share the same schema. Since we rarely change the schema, I'd prefer not to use the schema-merging option, as it's quite expensive.
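For reference, this is the kind of schema-merging read I'm trying to avoid (a sketch in Scala; the path is just the example above and the variable names are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-example").getOrCreate()

// mergeSchema makes Spark reconcile the schemas of all part files,
// which is the expensive step I'd like to skip.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("s3a://bucket/date=2017-03-23")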
Bottom line: what I'd like to do is pre-define a schema and store it under date=2017-03-23, so that Spark knows which columns to look for and fills in null wherever a column is missing. In Spark 1.6 there were _metadata files, but it seems that in Spark 2.1 they're no longer there.
How can I manually create these schema files for Spark?
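For context, this is roughly the behaviour I'm after: declare the day's full schema once and have Spark return null for the column that older files don't have (a sketch, assuming that passing an explicit schema to the reader behaves this way; daySchema and day are just illustrative names):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("predefined-schema").getOrCreate()

// The full schema for the day, including the column added mid-day.
val daySchema = StructType(Seq(
  StructField("user", StringType, nullable = true),
  StructField("id", LongType, nullable = true),
  StructField("country", StringType, nullable = true)
))

// With an explicit schema the reader skips inference/merging; files written
// before "country" existed should come back with null in that column.
val day = spark.read
  .schema(daySchema)
  .parquet("s3a://bucket/date=2017-03-23")

day.printSchema()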