
I am using Spark DataFrames and trying to de-duplicate across two DataFrames with the same schema.

The schema of the DataFrame before saving to JSON is:

root
 |-- startTime: long (nullable = false)
 |-- name: string (nullable = true)

The schema of the DataFrame after loading it back from the JSON file is:

root
 |-- name: string (nullable = true)
 |-- startTime: long (nullable = true)

I save to JSON as:

newDF.write.json(filePath)

and read it back as:

existingDF = sqlContext.read.json(filePath)

After doing a unionAll:

existingDF.unionAll(newDF).distinct()

or an except:

newDF.except(existingDF)

The de-duplication fails because of the schema change: unionAll matches columns by position, not by name, so after the JSON reader reorders the fields alphabetically (and relaxes nullability), the two DataFrames no longer line up.

Can I avoid this schema conversion? Is there a way to preserve (or enforce) the column order when saving to and loading back from a JSON file?


1 Answer


I implemented a workaround that converts the schema back to what I need:

import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Rebuild the inferred schema, restoring startTime to a non-nullable long.
val newSchema = StructType(jsonDF.schema.map {
  case StructField(name, _, _, metadata) if name == "startTime" =>
    StructField(name, LongType, nullable = false, metadata)
  case other => other
})
// Re-apply the corrected schema and restore the original column order.
existingDF = sqlContext.createDataFrame(jsonDF.rdd, newSchema).select("startTime", "name")
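
Alternatively, a minimal sketch (not from the original answer): the DataFrameReader accepts an explicit schema, which skips JSON schema inference entirely, so the columns come back in the declared order without rebuilding the schema afterwards. Nullability handling on read can vary between Spark versions, so verify it on yours:

// Sketch: enforce the known schema at read time instead of repairing it after the fact.
// Assumes the JSON files were written from a DataFrame with newDF's schema.
existingDF = sqlContext.read.schema(newDF.schema).json(filePath)

// With matching column order, the positional set operations work as intended:
val deduped = existingDF.unionAll(newDF).distinct()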