I'm a total newbie to Spark and Scala, so it would be great if someone could explain this to me. Let's take the following JSON:
{
  "id": 1,
  "persons": [{
    "name": "n1",
    "lastname": "l1",
    "hobbies": [{
      "name": "h1",
      "activity": "a1"
    }, {
      "name": "h2",
      "activity": "a2"
    }]
  }, {
    "name": "n2",
    "lastname": "l2",
    "hobbies": [{
      "name": "h3",
      "activity": "a3"
    }, {
      "name": "h4",
      "activity": "a4"
    }]
  }]
}
I'm loading this JSON into an RDD via sc.textFile("path/file.json") and into a DataFrame via sqlContext.read.json("path/file.json"). So far so good: this gives me an RDD and a DataFrame (with a schema) for the JSON above. But now I want to create another RDD/DataFrame from the existing one that contains all distinct "hobbies" records. How can I achieve something like that? The only thing my operations give me are WrappedArrays for hobbies, but I cannot drill any deeper into them, nor assign them to a new DataFrame/RDD.
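For the RDD side, this is essentially all I have (a minimal sketch; I'm assuming sc.textFile is the right call for reading the file, since sc.parallelize expects an in-memory collection):

// Assumption: sc is the existing SparkContext and the file sits at "path/file.json".
// textFile reads the file line by line, so this is just an RDD[String] with no schema,
// which is why I can't reach the nested hobbies from this side either.
val jsonRDD = sc.textFile("path/file.json")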
Here is the SQLContext code I have so far:
val jsonData = sqlContext.read.json("path/file.json")
jsonData.registerTempTable("jsonData") // I get the schema for the whole file
val hobbies = sqlContext.sql("SELECT persons.hobbies FROM jsonData") // sub-schema for hobbies
hobbies.show()
That leaves me with
+--------------------+
| hobbies|
+--------------------+
|[WrappedArray([a1...|
+--------------------+
What I expect is more like:
+----+--------+
|name|activity|
+----+--------+
|  h1|      a1|
|  h2|      a2|
|  h3|      a3|
|  h4|      a4|
+----+--------+
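One direction I've been considering is explode. Here is a rough sketch of what I mean (I'm not sure the chained explodes or the $-column references are right, and the names "person" and "hobby" are just aliases I made up):

import org.apache.spark.sql.functions.explode
import sqlContext.implicits._ // for the $"..." column syntax

// Explode persons to get one row per person, then explode that person's hobbies
// array to get one row per hobby, and finally pull out the struct fields.
val hobbiesDF = jsonData
  .select(explode($"persons").as("person"))
  .select(explode($"person.hobbies").as("hobby"))
  .select($"hobby.name", $"hobby.activity")
  .distinct()

hobbiesDF.show()

Is something like that the right way to go, or is there a cleaner approach?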