How to mock a Spark Scala DataFrame with a nested case-class schema?

Question

How do I create/mock a Spark Scala dataframe with a case class nested inside the top level?

root
 |-- _id: long (nullable = true)
 |-- continent: string (nullable = true)
 |-- animalCaseClass: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- gender: string (nullable = true)

I am currently unit testing a function which outputs a dataframe in the above schema. To check equality, I used the toDF() which unfortunately gives a schema with nullable = true for "_id" in the mocked dataframe, thus making the test fail (Note that the "actual" output from the function has nullable = true for everything).

I also tried creating the mocked dataframe a different way which led to errors: https://pastebin.com/WtxtgMJA

Here is what I tried in this approach:

import org.apache.spark.sql.Encoders
val animalSchema = Encoders.product[AnimalCaseClass].schema

val schema = List(
  StructField("_id", LongType, true),
  StructField("continent", StringType, true),
  StructField("animalCaseClass", animalSchema, true)
)

val data = Seq(Row(12345L, "Asia", AnimalCaseClass("tiger", "male")), Row(12346L, "Asia", AnimalCaseClass("tigress", "female")))

val expected = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

I had to use this approach to make the nullable true for those fields where toDF makes the nullable false by default.

How could I make a dataframe with the same schema as the output of the mocked function and declare values which can also be a case class?

Mansoor Baba Shaik Mansoor Baba Shaik · Accepted Answer · 2018-09-18T20:10:32

From the logs you provided, you can see that

Caused by: java.lang.RuntimeException: models.AnimalCaseClass is not a valid external type for schema of struct<name:String,gender:String,,... 3 more fields>

which means you are trying to insert an object type of AnimalCaseClass into a datatype of struct<name:String,gender:String> and this was caused since you have used Row object.

import org.apache.spark.SparkConf
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.SparkSession

case class AnimalCaseClass(name: String, gender: String)

object Test extends App {

  val conf: SparkConf = new SparkConf()
  conf.setAppName("Test")
  conf.setMaster("local[2]")
  conf.set("spark.sql.test", "")
  conf.set(SQLConf.CODEGEN_FALLBACK.key, "false")

  val spark: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()

  // ** The relevant part **
  import org.apache.spark.sql.Encoders
  val animalSchema = Encoders.product[AnimalCaseClass].schema

  val expectedSchema: StructType = StructType(Seq(
    StructField("_id", LongType, true),
    StructField("continent", StringType, true),
    StructField("animalCaseClass", animalSchema, true)
  ))

  import spark.implicits._
  val data = Seq((12345L, "Asia", AnimalCaseClass("tiger", "male")), (12346L, "Asia", AnimalCaseClass("tigress", "female"))).toDF()

  val expected = spark.createDataFrame(data.rdd, expectedSchema)

  expected.printSchema()

  expected.show()

  spark.stop()
}

How to mock a Spark Scala DataFrame with a nested case-class schema?

1 Answers