4
votes

I would like to read a subset of partitioned data, in JSON format, with spark (3.0.1) inferring the schema from the JSON.

My data is partitioned as s3a://bucket/path/type=[something]/dt=2020-01-01/

When I try to read this with read(json_root_path).where($"type" == x && $"dt" >= y && $"dt" <= z), spark attempts to read the entire dataset in order to infer the schema.

When I try to figure out my partition paths in advance and pass them with read(paths :_*), spark throws an error that it cannot infer the schema and I need to specify the schema manually. (Note that in this case, unless I specify basePath, spark also loses the columns for type and dt, but that's fine, I can live with that.)

What I'm looking for, I think, is some option that tells spark to either infer the schema from only the relevant partitions, so the partitioning is pushed-down, or tells it that it can infer the schema from just the JSONs in the paths I've given it. Note that I do not have the option of calling mcsk or glue to maintain a hive metastore. In addition, the schema changes over time, so it can't be specified in advance - taking advantage of spark JSON schema inference is an explicit goal.

Can anyone help?

2

2 Answers

0
votes

Could you read each day you are interested in using schema inference and then union the dataframes using schema merge code like this:

Spark - Merge / Union DataFrame with Different Schema (column names and sequence) to a DataFrame with Master common schema

-1
votes

One way that comes to my mind is to extract the schema you need from a single file, and then force it when you want to read the others.

Since you know the first partition and the path, try to read first a single JSON like s3a://bucket/path/type=[something]/dt=2020-01-01/file_0001.json then extract the schema.

Run the full reading part and pass the schema that you extracted as parameter read(json_root_path).schema(json_schema).where(...

The schema should be converted into a StructType to be accepted.

I've found a question that may partially help you Create dataframe with schema provided as JSON file