4
votes

My Spark program has to read from a directory. This directory has data of different schemas:

Dir/subdir1/files
1,10, Alien
1,11, Bob

Dir/subdir2/files
2,blue, 123, chicago
2,red, 34, Dallas

There are around 50 more directories with different schemas.

My Spark job has to read data from all these directories and generate a single file merging these files, as shown below:

1, 10, Alien;
1, 11, Bob;
2, blue, 123,chicago;
2, red, 34, Dallas;

A Spark DataFrame expects the schema to be the same in all directories. Is there any way I can read all these files of different schemas and merge them into a single file using Spark?

1
What is the format of the data? If it is Parquet you can use Spark's Parquet schema merging. – Avishek Bhattacharya
They are Parquet files, but all the files have different schemas. – ranjith reddy

1 Answer

4
votes

With Parquet and different schemas there are 2 strategies that I know of:

  1. If the schemas are compatible you can use mergeSchema, as sketched below:

    spark.read.option("mergeSchema", "true").parquet("Dir/")
    

Documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
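
For example, a minimal sketch of the compatible-schema case (the SparkSession setup and the recursive-lookup note are my assumptions, not part of the original snippet):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("merge-compatible-schemas") // hypothetical app name
      .getOrCreate()

    // Read every Parquet file under Dir/ and reconcile the column sets.
    // Columns present in only some subdirectories come back as null
    // for rows read from the other subdirectories.
    val merged = spark.read
      .option("mergeSchema", "true")
      // On Spark 3.0+ you may also need .option("recursiveFileLookup", "true")
      // if the subdirectories are not key=value partition directories.
      .parquet("Dir/")

    merged.printSchema()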

  2. If the columns have incompatible types, you need to read each directory individually and convert the resulting DataFrames to JSON with

    df.toJSON
    

and then union all the JSON datasets:

   df.toJSON.union(df2.toJSON)

followed by reading the unioned JSON back into a DataFrame, which you can then write out as Parquet:

   spark.read.json(finalJsonRDD)
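
Putting the second strategy together, a minimal end-to-end sketch (the subdirectory list, variable names, and output path below are illustrative placeholders, not from the original answer):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("merge-incompatible-schemas") // hypothetical app name
      .getOrCreate()

    // Read each subdirectory separately because their schemas conflict.
    val subdirs = Seq("Dir/subdir1", "Dir/subdir2") // extend with the remaining directories

    // Convert every DataFrame to JSON lines; JSON is schema-less, so rows
    // with different column sets can be unioned safely.
    val unionedJson = subdirs
      .map(path => spark.read.parquet(path).toJSON)
      .reduce(_ union _)

    // Let Spark infer a superset schema from the JSON, then write the result.
    val mergedDF = spark.read.json(unionedJson)
    mergedDF.write.parquet("Dir/merged") // hypothetical output location

Columns that are missing from a given source simply come back as null in the merged DataFrame.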