Query JSON data column using Spark DataFrames, But Not sure about its Schema

Question

I have this particular JSON file which is a log generated by PostgreSQL. I turned this JSON file from multi-line format to single line format. In one of the fields in the Dataframe that I parse there is a String Column. This String Column is by itself in JSON format with this example:

"query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "671.64"
    },
    "ordering_operation": {
      "using_filesort": true,
      "cost_info": {
        "sort_cost": "100.00"
      },
      "table": {
        "table_name": "test1",
        "access_type": "range",
        "possible_keys": [
          "PRIMARY"
        ],
        "key": "PRIMARY",
        "used_key_parts": [
          "id"
        ],
        "key_length": "4",
        "rows_examined_per_scan": 100,
        "rows_produced_per_join": 100,
        "filtered": "100.00",
        "cost_info": {
          "read_cost": "21.64",
          "eval_cost": "20.00",
          "prefix_cost": "41.64",
          "data_read_per_join": "18K"
        },
        "used_columns": [
          "id",
          "c"
        ],
        "attached_condition": "(`proxydemo`.`test1`.`id` between 501747 and <cache>((504767 + 99)))"
      }
    }
  }
}

I know that in Spark 2.0+ I can use

from_json(e: Column, schema: StructType): Column

function from SparkSQL functions. But I am not sure what should be the schema for this String. I have done many schema and StructType definitions but this one is kinda hierarchical! and I do not understand how this schema should be defined! `

I just wanted to add that I do not ask for someone to complete the schema for me, only to know what is the technique for explaining this hierarchical data's schema? — M.Rez

M.Rez M.Rez · Accepted Answer · 2017-04-16T17:36:27

I found out how nested schemas work.

In this particular example schemas go as this:

For the root of the object:

  val query_block_schema = (new StructType)
      .add("select_id", LongType)
      .add("cost_info", StringType)
      .add("ordering_operation", StringType)

For the second layer:

  val query_plan_schema = (new StructType)
    .add("query_block", StringType)

and so on...

So I consider this problem as solved. Later on, I merge all these together in case they are not null and basically flat the whole nested object.

Query JSON data column using Spark DataFrames, But Not sure about its Schema

1 Answers