
I need to process quite a big JSON file using Spark. I don't need all the fields in the JSON, and I would actually like to read only some of them (not read all fields and then project). I was wondering if I could use the JSON connector and give it a partial read schema containing only the fields I'm interested in loading.

1 Answer


It depends on whether your JSON is multi-line. Currently Spark only supports loading JSON as a DataFrame when each record sits on a single line. The upcoming Spark 2.3 release will add support for multi-line JSON.

As for your question: I don't think you can use a partial schema to read in JSON. You can first provide the full schema to read the file in as a DataFrame, then select the specific columns you need into a separate DataFrame. Since Spark uses lazy evaluation and its SQL engine can push the projection down, the performance won't be bad.