I have the following data frame and I would like to explode the values column, so that each value ends up in a separate column:
id | values
-----------------------
1 | '[[532,969020406,89],[216,969100125,23],[169,39356140000,72],[399,14407358500,188],[377,13761937166.6667,24]]'
2 | '[[532,969020406,89]]'
Note that the lists under the values column can have different lengths and that the column itself is of string type.
The desired table should look like this:
id | v11 | v12 | v13 | v21 | v22...
--------------------------------------
1 | 532 | 969020406 | 89 | 216 | 969100125...
2 | 532 | 969020406 | 89 | Null | Null...
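For reproducibility, here is a minimal sketch that builds the input above (assuming a local SparkSession; the spark and df names match the snippets below, though id may get inferred as long rather than integer this way):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the sample input; 'values' is deliberately kept as a plain string
df = spark.createDataFrame(
    [
        (1, '[[532,969020406,89],[216,969100125,23],[169,39356140000,72],'
            '[399,14407358500,188],[377,13761937166.6667,24]]'),
        (2, '[[532,969020406,89]]'),
    ],
    ['id', 'values'],
)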
I tried to specify a schema and use the from_json method to create the array and then explode it, but I ran into issues: none of the schemas below seems to fit my data.
import pyspark.sql.functions as F
from pyspark.sql import types

# Attempt 1: a struct with a single field holding a struct of three fields
json_schema = types.StructType([
    types.StructField('array', types.StructType([
        types.StructField("v1", types.StringType(), True),
        types.StructField("v2", types.StringType(), True),
        types.StructField("v3", types.StringType(), True),
    ]))
])
# Attempt 2: an array of structs with three string fields
json_schema = types.ArrayType(types.StructType([
    types.StructField("v1", types.StringType(), True),
    types.StructField("v2", types.StringType(), True),
    types.StructField("v3", types.StringType(), True),
]))
# Attempt 3: an array of integer arrays
json_schema = types.ArrayType(types.ArrayType(types.IntegerType()))
df.select('id', F.from_json('values', schema=json_schema)).show()
The preceding returns only null values or an empty array ([,,]). I also got the following error: StructType can not accept object '[' in type <class 'str'>
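As a quick sanity check (this is just a guess on my part), I suspect the integer schema fails because some values overflow IntegerType (e.g. 39356140000) and 13761937166.6667 is fractional; with DoubleType elements the parsing itself seems to work, though it still leaves me with nested arrays rather than separate columns:
# Same array-of-arrays schema, but with double elements (an assumption,
# since IntegerType cannot hold 39356140000 or 13761937166.6667)
double_schema = types.ArrayType(types.ArrayType(types.DoubleType()))
df.select('id', F.from_json('values', schema=double_schema).alias('parsed')).show(truncate=False)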
Schema of the input data as inferred by PySpark:
root
|-- id: integer (nullable = true)
|-- values: string (nullable = true)
Any help would be appreciated.