How do I create/mock a Spark Scala dataframe with a case class nested inside the top level?
root
 |-- _id: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- animalCaseClass: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- gender: string (nullable = true)
I am currently unit testing a function that outputs a dataframe with the above schema. To check equality, I built the expected dataframe with toDF(), which unfortunately gives a schema with nullable = false for "_id" in the mocked dataframe, so the test fails (note that the actual output from the function has nullable = true for every field).
I also tried creating the mocked dataframe a different way, which led to errors: https://pastebin.com/WtxtgMJA
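To make the mismatch concrete, here is a minimal sketch of the toDF() attempt. The case class definitions (AnimalCaseClass with two string fields, and a hypothetical wrapper named Record) and the local SparkSession are assumptions for illustration, not the actual code under test.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumed shapes; the real definitions are not shown in this question.
case class AnimalCaseClass(name: String, gender: String)
case class Record(_id: Long, continent: String, animalCaseClass: AnimalCaseClass)

object ToDfNullability {
  // Builds the mocked dataframe the same way toDF() was used in the test.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(Record(12345L, "Asia", AnimalCaseClass("tiger", "male"))).toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("toDF-nullability")
      .getOrCreate()
    // The primitive Long field is inferred as non-nullable,
    // while the String and struct fields stay nullable.
    build(spark).printSchema()
    spark.stop()
  }
}
```

Comparing this schema with the function's output fails only on the nullable flag of "_id", even when the data is identical.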
Here is what I tried in this approach:
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Reuse the schema derived from the case class for the nested struct
val animalSchema = Encoders.product[AnimalCaseClass].schema
val schema = List(
  StructField("_id", LongType, true),
  StructField("continent", StringType, true),
  StructField("animalCaseClass", animalSchema, true)
)
// Each Row holds an AnimalCaseClass instance as its third value
val data = Seq(
  Row(12345L, "Asia", AnimalCaseClass("tiger", "male")),
  Row(12346L, "Asia", AnimalCaseClass("tigress", "female"))
)
val expected = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)
I used this approach so that I could set nullable = true on the fields that toDF() marks as nullable = false by default.
How can I build a dataframe with the same schema as the output of the function under test, while still declaring values that include a case class?
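For reference, one direction that might work is sketched below. It relies on the fact that the Row-based createDataFrame expects a struct value to be a nested Row rather than a case class instance, so each AnimalCaseClass is converted before it goes into the outer Row. The AnimalCaseClass definition, the toRow helper, and the ExpectedFrame object are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Assumed definition; the real case class is not shown in this question.
case class AnimalCaseClass(name: String, gender: String)

object ExpectedFrame {
  // Struct schema derived from the case class, as in the attempt above.
  val animalSchema: StructType = Encoders.product[AnimalCaseClass].schema

  val schema: StructType = StructType(List(
    StructField("_id", LongType, nullable = true),
    StructField("continent", StringType, nullable = true),
    StructField("animalCaseClass", animalSchema, nullable = true)
  ))

  // Hypothetical helper: turn the case class into the nested Row
  // that a struct column expects, in field order.
  def toRow(a: AnimalCaseClass): Row = Row(a.name, a.gender)

  def build(spark: SparkSession): DataFrame = {
    val data = Seq(
      Row(12345L, "Asia", toRow(AnimalCaseClass("tiger", "male"))),
      Row(12346L, "Asia", toRow(AnimalCaseClass("tigress", "female")))
    )
    spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
  }
}
```

With this shape the values are still declared through the case class, and the explicit schema keeps "_id" nullable. Whether the nested fields match the function's output exactly depends on the real AnimalCaseClass definition.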