I have a service (Secor) that writes the raw messages it receives to Parquet files. I'd like to create a predefined schema for these Parquet files, so that Spark can combine several files even when the schema has changed between them.
We're using Spark 2.1.0.
Elaborated example: we save each day's data under a folder named after the date, and within each date the data is further split by hour. So our Parquet layout looks like this:
date=2017-03-23
|-- hour=00
|-- hour=01
|-- ...
|-- hour=23
Let's say that at the beginning of the day the schema of the messages written to Parquet contained only two fields:
root
|-- user: String
|-- id: Long
Now around mid-day we've added another field, so the schema becomes:
root
|-- user: String
|-- id: Long
|-- country: String
That means that if we try to read the entire day's data with sparkSession.read.parquet("s3a://bucket/date=2017-03-23"), Spark will crash because the sub-folders do not share the same schema. Since we rarely change the schema, I'd prefer not to use the schema-merging option, as it's quite expensive.
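For reference, this is the kind of schema-merging read I'm trying to avoid (a sketch in Scala; the path is just the example above and the variable names are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-example").getOrCreate()

// mergeSchema makes Spark reconcile the schemas of all part files,
// which is the expensive step I'd like to skip.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("s3a://bucket/date=2017-03-23")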
Bottom line: what I'd like to do is pre-define a schema and store it under date=2017-03-23, so that Spark knows which columns to look for and fills in null wherever a column is missing. In Spark 1.6 there were _metadata files, but it seems that in Spark 2.1 they're no longer there.
How can I manually create these schema files for Spark?
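For context, this is roughly the behaviour I'm after: declare the day's full schema once and have Spark return null for the column that older files don't have (a sketch, assuming that passing an explicit schema to the reader behaves this way; daySchema and day are just illustrative names):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("predefined-schema").getOrCreate()

// The full schema for the day, including the column added mid-day.
val daySchema = StructType(Seq(
  StructField("user", StringType, nullable = true),
  StructField("id", LongType, nullable = true),
  StructField("country", StringType, nullable = true)
))

// With an explicit schema the reader skips inference/merging; files written
// before "country" existed should come back with null in that column.
val day = spark.read
  .schema(daySchema)
  .parquet("s3a://bucket/date=2017-03-23")

day.printSchema()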