I am currently reading JSON files where the schema varies from file to file. Our current logic is: we first read a base-schema file that contains all fields, and only then read the actual data. We do this because Spark infers the schema from the first file it reads, and the first data file does not necessarily contain all the fields. So we are essentially tricking the code into learning the full schema before it starts reading the actual data.
val rdd = sc.textFile("baseSchemaWithAllColumns.json").union(sc.textFile("pathToActualFile.json"))
val df = sqlContext.read.json(rdd)
// Create a DataFrame from the combined RDD, then register it as a temp table and query it
I know the above is just a workaround, and we need a cleaner solution for accepting JSON files with varying schemas.
I understand that there are two other ways to determine the schema, as mentioned here.
However, it looks like for those we would need to parse the JSON ourselves and map each field to the data received.
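For reference, this is roughly what I mean by specifying the schema explicitly instead of relying on inference (the field names here are hypothetical, not our real schema); it avoids the union trick but means hand-maintaining the full field list:

import org.apache.spark.sql.types._

// Hypothetical superset schema listing every field any file might contain
val fullSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("score", DoubleType, nullable = true)
))

// With an explicit schema Spark skips inference; fields absent from a file come back as null
val df = sqlContext.read.schema(fullSchema).json("pathToActualFile.json")
df.registerTempTable("events")
sqlContext.sql("SELECT id, score FROM events").show()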
There also seems to be an option for Parquet schema merging, but that looks like it mostly applies when reading into the DataFrame - or am I missing something here?
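For what it's worth, my understanding of that option is roughly the following sketch (paths are made up), and it seems to take effect at read time rather than at query time:

// Convert each JSON file to Parquet first, since mergeSchema is a Parquet read option
sqlContext.read.json("file1.json").write.parquet("data/parquet/part1")
sqlContext.read.json("file2.json").write.parquet("data/parquet/part2")

// mergeSchema=true asks Spark to union the column sets of all Parquet files it reads
val merged = sqlContext.read.option("mergeSchema", "true").parquet("data/parquet/part1", "data/parquet/part2")
merged.printSchema()  // shows the union of columns from both files
merged.registerTempTable("mergedData")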
What is the best way to read JSON files whose schema changes and still work with Spark SQL for querying?
Can I just read the JSON file as-is, save it as a temp table, and then use mergeSchema=true while querying?
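In other words, something along these lines (file and table names are made up); the part I am unsure about is whether mergeSchema has any effect here, since it appears to be a Parquet read option rather than a query option:

// Read a single JSON file as-is, letting Spark infer whatever schema that file happens to have
val df = sqlContext.read.json("pathToActualFile.json")
df.registerTempTable("rawJson")

// Does anything like mergeSchema apply at this point,
// or does the temp table only ever see the columns present in this one file?
sqlContext.sql("SELECT * FROM rawJson").show()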