I have a dataset that I extract and apply a specific schema to before writing it out as JSON.
My test dataset looks like:
cityID|retailer|postcode
123|a1|1
123|s1|2
123|d1|3
124|a1|4
124|s1|5
124|d1|6
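In the code below, cridf is this test data as a DataFrame of strings. My real job extracts it from source data, so treat this as a minimal test stand-in (spark is the usual SparkSession from the shell):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import spark.implicits._

// Stand-in for my actual extract: all three columns arrive as strings
val cridf = Seq(
  ("123", "a1", "1"),
  ("123", "s1", "2"),
  ("123", "d1", "3"),
  ("124", "a1", "4"),
  ("124", "s1", "5"),
  ("124", "d1", "6")
).toDF("cityID", "retailer", "postcode")

(The Row and types imports are also needed by the snippets further down.)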
I want to group the data by cityID, apply the schema below to get a DataFrame, and then write the result out as JSON.
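For cityID 123, for example, I expect a single JSON record shaped like this (field types as per the schema below, with postcode as an integer):

{"cityID":"123","reads":[{"retailer":"a1","postcode":1},{"retailer":"s1","postcode":2},{"retailer":"d1","postcode":3}]}

My code is as follows: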
Grouping by cityID:

val rdd1 = cridf.rdd
  .map(x => (x(0).toString, (x(1).toString, x(2).toString)))
  .groupByKey()
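With the test data, rdd1 conceptually holds one pair per city (element order within a group is not guaranteed):

("123", Seq(("a1","1"), ("s1","2"), ("d1","3")))
("124", Seq(("a1","4"), ("s1","5"), ("d1","6")))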
Mapping the RDD to Rows:

val final1 = rdd1.map(x => Row(x._1, x._2.toList))
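Each element of final1 is therefore a Row whose second field is still a List of Scala tuples, e.g.:

Row("123", List(("a1","1"), ("s1","2"), ("d1","3")))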
Applying the schema:

val schema2 = new StructType()
  .add("cityID", StringType)
  .add("reads", ArrayType(new StructType()
    .add("retailer", StringType)
    .add("postcode", IntegerType)))
Creating the DataFrame (createDataFrame does not validate the rows eagerly, so the mismatch only surfaces when the write job actually runs):

val parsedDF2 = spark.createDataFrame(final1, schema2)
Writing to a JSON file:

parsedDF2.write.mode("overwrite")
  .format("json")
  .option("header", "false") // header is a CSV option; as far as I know the JSON writer ignores it
  .save("/XXXX/json/testdata")
The job aborts due to the following error:
java.lang.RuntimeException: Error while encoding:
java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of struct
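My guess is that each element of x._2.toList is still a scala.Tuple2, whereas the reads schema expects a struct (i.e. a nested Row), and that postcode is still a String where the schema declares IntegerType. Would converting to nested Rows, along the lines of this sketch, be the right approach?

val final1 = rdd1.map { case (city, reads) =>
  Row(city, reads.toList.map { case (retailer, postcode) =>
    // Each struct element becomes a nested Row; postcode cast to Int to match IntegerType
    Row(retailer, postcode.toInt)
  })
}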