4
votes

My Spark program has to read from a directory. This directory has data of different schemas:

Dir/subdir1/files
1,10, Alien
1,11, Bob

Dir/subdir2/files
2,blue, 123, chicago
2,red, 34, Dallas

There are around 50 more directories with different schemas.

My Spark job has to read data from all these directories and generate a single file merging these files, as shown below:

1, 10, Alien;
1, 11, Bob;
2, blue, 123,chicago;
2, red, 34, Dallas;

A Spark DataFrame expects the schema to be the same in all directories. Is there any way I can read all these files of different schemas and merge them into a single file using Spark?

1
What is the format of the data? If it is Parquet you can use Spark's Parquet schema merging. – Avishek Bhattacharya
They are Parquet files, but all the files have different schemas. – ranjith reddy

1 Answer

4
votes

With Parquet and different schemas there are 2 strategies that I know of:

  1. If the schemas are compatible you can use mergeSchema, as sketched below:

    spark.read.option("mergeSchema", "true").parquet("Dir/")
    

Documentation: https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
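
For example, a minimal sketch of the compatible-schema case (the SparkSession setup and the recursive-lookup note are my assumptions, not part of the original snippet):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("merge-compatible-schemas") // hypothetical app name
      .getOrCreate()

    // Read every Parquet file under Dir/ and reconcile the column sets.
    // Columns present in only some subdirectories come back as null
    // for rows read from the other subdirectories.
    val merged = spark.read
      .option("mergeSchema", "true")
      // On Spark 3.0+ you may also need .option("recursiveFileLookup", "true")
      // if the subdirectories are not key=value partition directories.
      .parquet("Dir/")

    merged.printSchema()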

  2. If the columns have incompatible types, you need to read each directory individually and convert the resulting DataFrames to JSON with

    df.toJSON
    

and then union all the JSON datasets:

   df.toJSON.union(df2.toJSON)

followed by reading the unioned JSON back into a DataFrame, which you can then write out as Parquet:

   spark.read.json(finalJsonRDD)
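
Putting the second strategy together, a minimal end-to-end sketch (the subdirectory list, variable names, and output path below are illustrative placeholders, not from the original answer):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("merge-incompatible-schemas") // hypothetical app name
      .getOrCreate()

    // Read each subdirectory separately because their schemas conflict.
    val subdirs = Seq("Dir/subdir1", "Dir/subdir2") // extend with the remaining directories

    // Convert every DataFrame to JSON lines; JSON is schema-less, so rows
    // with different column sets can be unioned safely.
    val unionedJson = subdirs
      .map(path => spark.read.parquet(path).toJSON)
      .reduce(_ union _)

    // Let Spark infer a superset schema from the JSON, then write the result.
    val mergedDF = spark.read.json(unionedJson)
    mergedDF.write.parquet("Dir/merged") // hypothetical output location

Columns that are missing from a given source simply come back as null in the merged DataFrame.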