I have a JSON file that is a log generated by PostgreSQL. I converted it from multi-line format to single-line format, so that each record sits on one line. When I parse it into a DataFrame, one of the fields is a string column whose value is itself JSON, like this example (a sketch of how I load it follows the example):
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "671.64"
},
"ordering_operation": {
"using_filesort": true,
"cost_info": {
"sort_cost": "100.00"
},
"table": {
"table_name": "test1",
"access_type": "range",
"possible_keys": [
"PRIMARY"
],
"key": "PRIMARY",
"used_key_parts": [
"id"
],
"key_length": "4",
"rows_examined_per_scan": 100,
"rows_produced_per_join": 100,
"filtered": "100.00",
"cost_info": {
"read_cost": "21.64",
"eval_cost": "20.00",
"prefix_cost": "41.64",
"data_read_per_join": "18K"
},
"used_columns": [
"id",
"c"
],
"attached_condition": "(`proxydemo`.`test1`.`id` between 501747 and <cache>((504767 + 99)))"
}
}
}
}
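For context, this is roughly how I end up with that string column; the file path and the column name plan are placeholders, not my real ones:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parse-plan-json").getOrCreate()

// Each log record is now a single JSON line, so spark.read.json can parse the file directly.
val logs = spark.read.json("/path/to/log.json")

// One of the resulting columns (here called "plan") holds the JSON document above as a plain string.
logs.select("plan").show(truncate = false)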
I know that in Spark 2.1+ I can use the
from_json(e: Column, schema: StructType): Column
function from the SparkSQL functions object. But I am not sure what the schema for this string should be. I have tried many schema and StructType definitions, but this one is hierarchical, and I do not understand how such a schema should be defined.
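To show the kind of thing I have tried: here is a sketch of a nested schema for the document above, mirroring its structure by putting a StructType inside each StructField that holds a JSON object (plan is the placeholder column name from the loading sketch above). I am not sure this is the right approach:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Costs appear as quoted numbers in the document ("671.64"), so they stay StringType here.
val tableCostInfo = StructType(Seq(
  StructField("read_cost", StringType),
  StructField("eval_cost", StringType),
  StructField("prefix_cost", StringType),
  StructField("data_read_per_join", StringType)
))

val tableStruct = StructType(Seq(
  StructField("table_name", StringType),
  StructField("access_type", StringType),
  StructField("possible_keys", ArrayType(StringType)),
  StructField("key", StringType),
  StructField("used_key_parts", ArrayType(StringType)),
  StructField("key_length", StringType),
  StructField("rows_examined_per_scan", LongType),
  StructField("rows_produced_per_join", LongType),
  StructField("filtered", StringType),
  StructField("cost_info", tableCostInfo),
  StructField("used_columns", ArrayType(StringType)),
  StructField("attached_condition", StringType)
))

val orderingOperation = StructType(Seq(
  StructField("using_filesort", BooleanType),
  StructField("cost_info", StructType(Seq(StructField("sort_cost", StringType)))),
  StructField("table", tableStruct)
))

val planSchema = StructType(Seq(
  StructField("query_block", StructType(Seq(
    StructField("select_id", LongType),
    StructField("cost_info", StructType(Seq(StructField("query_cost", StringType)))),
    StructField("ordering_operation", orderingOperation)
  )))
))

// Parse the string column into a nested struct column.
val parsed = logs.withColumn("plan_parsed", from_json(col("plan"), planSchema))
parsed.select("plan_parsed.query_block.ordering_operation.table.table_name").show()

Even if something like this works, writing it by hand feels error-prone. Is this really how a hierarchical schema is supposed to be defined, or is there a less manual way?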