
I store a huge list of JSON strings in my MongoDB collection. For simplicity, I have extracted a sample document into the text file businessResource.json:

{
   "data" : {
        "someBusinessData" : {
            "capacity" : {
                "fuelCapacity" : NumberLong(282)
            },
            "someField" : NumberLong(16),
            "anotherField" : {
                "isImportant" : true,
                "lastDateAndTime" : "2008-01-01T11:11",
                "specialFlag" : "YMA"
            },
   ...
}

My problem: how can I convert the "someBusinessData" into a JSON object using Spark/Scala?

Once I have done that (for example using json4s or lift-json), I hope to be able to perform basic operations on the objects, for example checking them for equality.

Bear in mind that this is a rather large JSON object. Creating a case class is not worth it in my case, since the only operations I will perform are some filtering on two fields and comparing documents for equality, after which I will export them again.
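For example, with json4s the kind of comparison I have in mind would look roughly like this (just a sketch with hand-written JSON strings based on the sample above, not on the real documents):

import org.json4s._
import org.json4s.jackson.JsonMethods._

// two JSON strings describing the same data, with the keys in a different order
val left  = """{"someField":16,"capacity":{"fuelCapacity":282}}"""
val right = """{"capacity":{"fuelCapacity":282},"someField":16}"""

// parse returns a JValue; equality on JValue ignores key order
val sameDocument: Boolean = parse(left) == parse(right)   // true

// individual fields can be reached with the \ operator
val fuel = parse(left) \ "capacity" \ "fuelCapacity"      // JInt(282)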

This is how I fetch the data:

val df: DataFrame = sparkSession.sqlContext.read.json("src/test/resources/businessResource.json")

val myData: DataFrame = df.select("data.someBusinessData")
myData.printSchema

The schema shows:

root
 |-- someBusinessData: struct (nullable = true)
 |    |-- capacity: struct (nullable = true)

Since "someBusinessData" is a structure, I cannot get it as String. When I try to print using myData.first.getStruct(0), I get a string that contains the values but not the keys: [[[282],16,[true,2008-01-01T11:11,YMA]

Thanks for your help!

2 Answers


Instead of using .json, use .textFile to read your JSON file. Then convert the RDD to a DataFrame (it will have only one string column).

Example:

import spark.implicits._  // needed for .toDF on an RDD

// read the json file as a text file and create a DataFrame with a single string column
val df = spark.sparkContext.textFile("<json_file_path>").toDF("str")

// use the get_json_object function to traverse the json string
df.selectExpr("""get_json_object(str,"$.data.someBusinessData")""").show(false)

//+-----------------------------------------------------------------------------------------------------------------------------------------------------+
//|get_json_object(str,$.data.someBusinessData)                                                                                                         |
//+-----------------------------------------------------------------------------------------------------------------------------------------------------+
//|{"capacity":{"fuelCapacity":"(282)"},"someField":"(16)","anotherField":{"isImportant":true,"lastDateAndTime":"2008-01-01T11:11","specialFlag":"YMA"}}|
//+-----------------------------------------------------------------------------------------------------------------------------------------------------+
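To do the filtering on two fields mentioned in the question you can stay on the same string column, for example (a sketch; the JSON paths are just the ones from the sample document):

import org.apache.spark.sql.functions.{col, get_json_object}

// get_json_object returns strings, hence the comparison with "true"
val filtered = df.filter(
  get_json_object(col("str"), "$.data.someBusinessData.anotherField.isImportant") === "true" &&
  get_json_object(col("str"), "$.data.someBusinessData.anotherField.specialFlag") === "YMA"
)

filtered.show(false)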

In fact, my post contained two questions:

  1. How can I convert the "someBusinessData" into a JSON object using Spark/Scala?
  2. How can I get the JSON object as a String?

1. Conversion into a JSON object

What I was doing already creates a DataFrame that can be navigated like a JSON object:

// read the json file as JSON and select the needed data
val df: DataFrame = sparkSession.sqlContext.read.json(filePath).select("data.someBusinessData")
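
This also makes the filtering on two fields from my question straightforward, because the nested fields are ordinary columns (a sketch using the field names from my sample document):

import org.apache.spark.sql.functions.col

// filter directly on the nested struct fields
val filtered = df.filter(
  col("someBusinessData.anotherField.isImportant") === true &&
  col("someBusinessData.anotherField.specialFlag") === "YMA"
)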

If you use .textFile you correctly get the String, but to parse the JSON you then need to resort to the answer from Shu.

2. How can I get the JSON object as a String?

Trivially:

    df.toJSON.first
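
Note that df.toJSON.first serializes the whole row, so the result is wrapped in an outer {"someBusinessData": ...}. If only the struct column itself is wanted as a String, to_json on that column should also work (a sketch, not tested on the original data):

import org.apache.spark.sql.functions.{col, to_json}

// serialize just the struct column back to a JSON string
val asString: String = df.select(to_json(col("someBusinessData"))).first.getString(0)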