I am reading JSON data into a Spark DataFrame using Scala. The schema is as follows:
root
|-- metadata: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- playerId: string (nullable = true)
| | |-- sources: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- matchId: long (nullable = true)
The data looks as follows:
{ "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 1 } ] }, { "playerId": "1235", "sources": [ { "matchId": 1 } ] } ] }
{ "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 2 } ] }, { "playerId": "1248", "sources": [ { "score": 12.2 , "matchId": 1 } ] } ] }
{ "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 3 } ] }, { "playerId": "1248", "sources": [ { "matchId": 3 } ] } ] }
The goal is: if a record contains playerId 1234 with matchId 1, return isPlayed as true. So for the sample above, only the first record should get isPlayed = true; in the other two, playerId 1234 only appears with matchId 2 and 3. The structure of sources is not fixed; there may be fields other than matchId.
I wrote a UDF that treats metadata as a WrappedArray[String], and with it I am able to read the playerId column:
import org.apache.spark.sql.functions.{col, udf}
import scala.collection.mutable.WrappedArray

// checks whether the given playerId appears as an element of the metadata array
def hasPlayer = udf((metadata: WrappedArray[String], playerId: String) => {
  metadata.contains(playerId)
})

df.withColumn("hasPlayer", hasPlayer(col("metadata"), col("superPlayerId")))
But I am not able to figure out how to query the matchId field for a given playerId. I tried reading the field as WrappedArray[WrappedArray[Long]], but that gives a typecasting exception in withColumn on the metadata.sources.matchId column.
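Roughly, the failing attempt looked like this (a sketch from memory; the exact names may differ):

import org.apache.spark.sql.functions.{col, lit, udf}
import scala.collection.mutable.WrappedArray

// try to pull matchId through both levels of nesting and check it in a UDF;
// this is the step that throws the typecasting exception
def isPlayed = udf((matchIds: WrappedArray[WrappedArray[Long]], matchId: Long) => {
  matchIds.exists(_.contains(matchId))
})

df.withColumn("isPlayed", isPlayed(col("metadata.sources.matchId"), lit(1L)))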
I am relatively new to Spark. Any help would be deeply appreciated.
Cheers!