I am using Spark DataFrames and trying to de-duplicate across two DataFrames with the same schema.
The schema of the DataFrame before saving it to JSON is:
root
|-- startTime: long (nullable = false)
|-- name: string (nullable = true)
The schema of the DataFrame after loading it back from the JSON file is:
root
|-- name: string (nullable = true)
|-- startTime: long (nullable = false)
I save to JSON as:
newDF.write.json(filePath)
and read back as:
existingDF = sqlContext.read.json(filePath)
After doing a unionAll
existingDF.unionAll(newDF).distinct()
or an except
newDF.except(existingDF)
the de-duplication fails because of the schema change: both operations resolve columns by position, so after the reader reorders the columns alphabetically, name in one DataFrame ends up being compared against startTime in the other.
Can I avoid this schema conversion? Is there a way to preserve (or enforce) the schema's column order when saving to and loading back from a JSON file?