I'm new to Spark and working with JSON and I'm having trouble doing something fairly simple (I think). I've tried using parts of solutions to similar questions but can't quite get it right. I currently have a Spark dataframe with several columns representing variables. Each row is a unique combination of variable values. I then have a UDF that is applied to every row which takes each of the columns as input, does some analysis, and outputs a summary table as a JSON string for each row, and saves these this result in a new column of the table. Some small sample data looks like:
+------+-----+------+-------------------------------------------------------------------
|Var 1 |Var 2|Var 3 |JSON Table
+------+------------+-------------------------------------------------------------------
|True |10% |200 |[{"Out_1": "Mean", "Out_2": "25"}, {"Out_1": "Median", "Out_2": "21"}]
|False |15% |150 |[{"Out_1": "Mean", "Out_2": "19"}, {"Out_1": "Median", "Out_2": "18"}]
|True |12% |100 |[{"Out_1": "Mean", "Out_2": "22"}, {"Out_1": "Median", "Out_2": "20"}]
I would like to transform this into the following format:
+------+-----+------+------+-----+
|Var 1 |Var 2|Var 3 |Out_1 |Out_2|
+------+------------+------+-----+
|True |10% |200 |Mean |25 |
|True |10% |200 |Median|21 |
|False |15% |150 |Mean |19 |
|False |15% |150 |Median|18 |
|True |12% |100 |Mean |22 |
|True |12% |100 |Median|20 |
In reality there are many more variables, millions of rows and larger JSON strings with more outputs, but the core problem remains the same. I've basically tried getting the JSON schema and using from_json like so:
from pyspark.sql.functions import *
from pyspark.sql.types import *
schema = spark.read.json(df.rdd.map(lambda row: row["JSON Table"])).schema
df = df\
.withColumn("JSON Table", from_json("JSON Table", schema))\
.select(col('*'), col('JSON Table.*'))\
df.show()
This appears to get the JSON structure correctly (albeit every value is read as a string though most are integers), but the resulting dataframe is empty, though has the correct column headings. Any advice on how to deal with this?