
I am reading JSON data into a Spark DataFrame using Scala. The schema is as follows:

 root
 |-- metadata: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- playerId: string (nullable = true)
 |    |    |-- sources: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- matchId: long (nullable = true)

The data looks as follows:

{ "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 1 } ] }, { "playerId": "1235", "sources": [ { "matchId": 1 } ] } ] }
{ "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 2 } ] }, { "playerId": "1248", "sources": [ { "score": 12.2 , "matchId": 1 } ] } ] }
{ "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 3 } ] }, { "playerId": "1248", "sources": [ { "matchId": 3 } ] } ] }

The goal is to find if playerId is 1234 and matchId is 1, then return isPlayed as true. The structure of sources is not fixed. There might be fields other than matchId.
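In plain Scala collections, the check I'm after looks roughly like this (hypothetical case classes, just to pin down the logic; this is not Spark code):

```scala
case class Source(matchId: Long)                          // other fields may exist too
case class Player(playerId: String, sources: Seq[Source])

// True when some player entry matches the id AND one of its sources matches the matchId
def isPlayed(metadata: Seq[Player], playerId: String, matchId: Long): Boolean =
  metadata.exists(p => p.playerId == playerId && p.sources.exists(_.matchId == matchId))

val row1 = Seq(Player("1234", Seq(Source(1))), Player("1235", Seq(Source(1))))
isPlayed(row1, "1234", 1)  // true
isPlayed(row1, "1234", 2)  // false
```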

I wrote a UDF treating metadata as a WrappedArray[String], and I am able to read the playerId column:

def hasPlayer = udf((metadata: WrappedArray[String], playerId: String) => {
  metadata.contains(playerId)
})

df.withColumn("hasPlayer", hasPlayer(col("metadata"), col("superPlayerId")))

But I am not able to figure out how to query the matchId field for a given playerId. I tried reading the field as WrappedArray[WrappedArray[Long]], but withColumn throws a casting exception on the metadata.sources.matchId column.

I am relatively new to Spark. Any help would be deeply appreciated.

Cheers!


1 Answer


When you are dealing with JSON, get to know the built-in function explode, which turns a single cell containing a WrappedArray into multiple rows representing the structs inside. I think it helps here (twice):
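In plain collection terms, explode behaves roughly like flatMap (an analogy on ordinary Seqs, not the Spark API itself):

```scala
// Each "row" holds an array column:
val rows = Seq(Seq(1, 2), Seq(3))

// explode is essentially flatMap/flatten: one output row per array element
val exploded = rows.flatten
// exploded == Seq(1, 2, 3)
```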

df.select(explode($"metadata").as("metadata"))
  .select($"metadata.playerId", explode($"metadata.sources.matchId").as("matchId"))
  .filter($"playerId".equalTo("1234") && $"matchId".equalTo(1))
  .select($"playerId", $"matchId", lit(true).as("isPlayed"))

Basically, I use explode to create multiple rows (renaming columns to something convenient as I go), navigate the object tree down to the JSON fields I want, repeat the explode/rename step for matchId, and filter out the rows I don't want. That finally lets me use the lit function to "hardcode" a value of true in a brand-new column called isPlayed, since only the matching rows are left.

If this isn't exactly what you're looking for, hopefully it gives you some pointers. The functions library can be very helpful to you as you get to know Spark.
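If you would rather make the UDF route work: since metadata is an array of structs, Spark passes it to a Scala UDF as Seq[Row], not WrappedArray[String]. The lookup itself is just two nested exists calls; here is that logic with plain Maps standing in for Row (an illustrative sketch, not the Spark Row API):

```scala
// Maps stand in for struct Rows here; with real Rows you'd use getAs instead of get.
def hasPlayedMatch(metadata: Seq[Map[String, Any]],
                   playerId: String, matchId: Long): Boolean =
  metadata.exists { player =>
    player.get("playerId").contains(playerId) &&
      player.getOrElse("sources", Seq.empty)
        .asInstanceOf[Seq[Map[String, Any]]]
        .exists(_.get("matchId").contains(matchId))   // extra fields like score are ignored
  }

val metadata = Seq(
  Map("playerId" -> "1234", "sources" -> Seq(Map("matchId" -> 1L))),
  Map("playerId" -> "1235", "sources" -> Seq(Map("matchId" -> 1L))))

hasPlayedMatch(metadata, "1234", 1L)  // true
```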