I would like to read a subset of partitioned data, in JSON format, with Spark (3.0.1) inferring the schema from the JSON.
My data is partitioned as s3a://bucket/path/type=[something]/dt=2020-01-01/
When I try to read this with spark.read.json(json_root_path).where($"type" === x && $"dt" >= y && $"dt" <= z), Spark attempts to read the entire dataset in order to infer the schema.
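For concreteness, here is roughly what that first attempt looks like (the filter values are stand-ins for my real ones):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// First attempt: read from the partition root and filter afterwards.
// Schema inference scans every partition under the root, not just the
// ones the filter selects.
val df = spark.read
  .json("s3a://bucket/path")
  .where($"type" === "something" && $"dt" >= "2020-01-01" && $"dt" <= "2020-01-31")
```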
When I try to figure out my partition paths in advance and pass them with spark.read.json(paths: _*), Spark throws an error saying it cannot infer the schema and that I need to specify it manually. (Note that in this case, unless I specify basePath, Spark also loses the type and dt columns, but that's fine; I can live with that.)
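The second attempt, with the partition paths enumerated up front (abbreviated to two paths here for illustration):

```scala
// Second attempt: pass the pre-computed partition paths directly.
// The basePath option is what lets Spark recover the type/dt partition
// columns from the directory names; without it they disappear.
val paths = Seq(
  "s3a://bucket/path/type=something/dt=2020-01-01",
  "s3a://bucket/path/type=something/dt=2020-01-02"
)
val df2 = spark.read
  .option("basePath", "s3a://bucket/path")
  .json(paths: _*)
// ...and this is where Spark complains that it cannot infer the schema
// and asks for it to be specified manually.
```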
What I'm looking for, I think, is some option that either tells Spark to infer the schema from only the relevant partitions (so the partition filter is pushed down), or tells it that it can infer the schema from just the JSON files in the paths I've given it. Note that I do not have the option of running msck or Glue to maintain a Hive metastore. In addition, the schema changes over time, so it can't be specified in advance; taking advantage of Spark's JSON schema inference is an explicit goal.
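To illustrate what I mean: if the explicit-path read above could infer a schema, a two-pass sketch like the one below would be enough, inferring from the selected date range (so schema drift within the range is still captured) and then reusing that schema. I'm hoping there's an option that achieves this in a single read:

```scala
// Sketch only: assumes the first pass can infer a schema from the
// selected paths, which is exactly what fails for me as described.
val inferred = spark.read.json(paths: _*).schema

// Second pass: apply the inferred schema (no full-dataset scan) and
// keep the partition columns via basePath.
val df3 = spark.read
  .schema(inferred)
  .option("basePath", "s3a://bucket/path")
  .json(paths: _*)
```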
Can anyone help?