For the last few hours, I have been trying to convert a JSON file to a Scala case class with Apache Spark.
The JSON has the following structure:
{
  "12": {
    "wordA": 1,
    "wordB": 2,
    "wordC": 3
  },
  "13": {
    "wordX": 10,
    "wordY": 12,
    "wordZ": 15
  }
}
First try: set a hand-built schema
I tried to build my schema by hand:
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("", MapType(StringType, new StructType()
    .add("", StringType)
    .add("", IntegerType)))

val df = session.read
  .option("multiline", true)
  .option("mode", "PERMISSIVE")
  .schema(schema)
  .json(filePath)
df.show()
But this is obviously not right, since I have to give field names that I don't know in advance.
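For comparison, if the top-level keys were known ahead of time, the schema could name them explicitly. A minimal sketch, hard-coding the keys "12" and "13" from the sample above:

import org.apache.spark.sql.types._

// Each known top-level key becomes a column holding a word -> weight map.
val knownSchema = new StructType()
  .add("12", MapType(StringType, IntegerType))
  .add("13", MapType(StringType, IntegerType))

That only works because the keys are hard-coded, which is exactly what I want to avoid.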
Second try: map to a case class
I have also tried to create case classes, which is a bit more elegant:
case class KeywordData(keywordsByCode: Map[String, WordAndWeight])
case class WordAndWeight(word: String, weight: Int)
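A minimal sketch of how I would expect to use them, assuming a Dataset conversion via .as (the exact call isn't shown above):

import session.implicits._

val ds = session.read
  .option("multiline", true)
  .json(filePath)
  .as[KeywordData] // nothing in the file is actually named "keywordsByCode"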
Problem:
In both cases, df.show() displays:
+----+
| |
+----+
|null|
+----+
The JSON structure is not easy to manipulate, since my columns don't have fixed names. Any ideas?
Expected result
A map with 12 and 13 as keys, and List[wordA, ..., wordC] and List[wordX, ..., wordZ] respectively as values.
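Written out by hand from the sample JSON, the shape I am after is:

val expected: Map[String, List[String]] = Map(
  "12" -> List("wordA", "wordB", "wordC"),
  "13" -> List("wordX", "wordY", "wordZ")
)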
Edit: Map of Map
With the case class
case class WordAndWeight(code: Map[String, Map[String, Integer]])
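applied roughly like this (a sketch; the .as conversion is the step in question):

val df = session.read
  .option("multiline", true)
  .json(filePath)

val ds = df.as[WordAndWeight]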
Spark infers one column per top-level key, so df.show() displays:
+-------+----------+
| 12| 13|
+-------+----------+
|[1,2,3]|[10,12,15]|
+-------+----------+
but the conversion to the case class fails with the following error:
org.apache.spark.sql.AnalysisException: cannot resolve '`code`' given input columns: [12, 13];
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
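One possible direction, sketched under the assumption that the top-level keys can be listed up front (or discovered in a first pass): give every key the same MapType so the column values are homogeneous, then fold the dynamic columns into a single code column.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumed here: the keys are known up front (or read once, e.g. via df.columns).
val keys = Seq("12", "13")

// A uniform value type keeps the columns homogeneous, unlike the inferred structs.
val uniformSchema = StructType(keys.map(k => StructField(k, MapType(StringType, IntegerType))))

val raw = session.read
  .option("multiline", true)
  .schema(uniformSchema)
  .json(filePath)

// Fold the dynamic columns into one column: map("12", col("12"), "13", col("13"), ...).
val folded = raw.select(
  map(keys.flatMap(k => Seq(lit(k), col(k))): _*).as("code")
)

folded then has a single code column of type map<string, map<string, int>>, which should line up with the WordAndWeight case class above, though I haven't verified it end to end.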