Spark partition projection/pushdown and schema inference with partitioned JSON

Question

I would like to read a subset of partitioned data, in JSON format, with spark (3.0.1) inferring the schema from the JSON.

My data is partitioned as s3a://bucket/path/type=[something]/dt=2020-01-01/

When I try to read this with read(json_root_path).where($"type" == x && $"dt" >= y && $"dt" <= z), spark attempts to read the entire dataset in order to infer the schema.

When I try to figure out my partition paths in advance and pass them with read(paths :_*), spark throws an error that it cannot infer the schema and I need to specify the schema manually. (Note that in this case, unless I specify basePath, spark also loses the columns for type and dt, but that's fine, I can live with that.)

What I'm looking for, I think, is some option that tells spark to either infer the schema from only the relevant partitions, so the partitioning is pushed-down, or tells it that it can infer the schema from just the JSONs in the paths I've given it. Note that I do not have the option of calling mcsk or glue to maintain a hive metastore. In addition, the schema changes over time, so it can't be specified in advance - taking advantage of spark JSON schema inference is an explicit goal.

Can anyone help?

Michael West Michael West · Accepted Answer · 2021-03-22T16:45:26

Could you read each day you are interested in using schema inference and then union the dataframes using schema merge code like this:

Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema

Spark partition projection/pushdown and schema inference with partitioned JSON

2 Answers