
I have a Hive table with two columns (string and array<struct<type:string, cnt:int>>) like this:
|| id || params ||
|| id1 || [{type=A, cnt=4}, {type=B, cnt=2}] ||
|| id2 || [{type=A, cnt=3}, {type=C, cnt=1}, {type=D, cnt=0}] ||
|| id3 || [{type=E, cnt=1}] ||

I need to transform it into a table with separate int columns, where the column names are the type values and the cell values are the corresponding cnt:

|| id || A || B || C || D || E ||
|| id1 || 4 || 2 || NULL || NULL || NULL ||
|| id2 || 3 || NULL || 1 || 0 || NULL ||
|| id3 || NULL || NULL || NULL || NULL || 1 ||
What is the best and most effective way to transform this table? Both in Spark SQL and PySpark style. Thank you.

Explode the array, extract the values (each value into a new column: the value if it exists, 0 if not), then group by id and use the sum aggregate function. - chlebek
Please accept answers as opposed to leaving them open. - thebluephantom

1 Answer


Try this - not sure if sum is required or not, but it seems safe to assume it is:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Some variation in your data
df = spark.createDataFrame([(1, ["type=AA, cnt=4", "type=B, cnt=2222"]),
                            (2, ["type=AA, cnt=3", "type=C, cnt=1", "type=D, cnt=0"]),
                            (3, ["type=E, cnt=1"])],["id", "params"])
# Explode
df = df.select(df.id, F.explode(df.params))

# Split into separate cols, strip the leading "type=" / " cnt=" prefixes,
# and convert the count to int
split_col = F.split(df['col'], ',')
df = df.withColumn('type', split_col.getItem(0)) \
       .withColumn('count', split_col.getItem(1)) \
       .drop('col')
df = df.withColumn('type', F.expr("substring(type, 6, length(type))")) \
       .withColumn('count', F.expr("substring(count, 6, length(count))").cast(IntegerType()))

# Pivot to your format
df.groupBy("id").pivot("type").agg(F.sum("count")).sort(F.col("id").asc()).show()

returns:

+---+----+----+----+----+----+
| id|  AA|   B|   C|   D|   E|
+---+----+----+----+----+----+
|  1|   4|2222|null|null|null|
|  2|   3|null|   1|   0|null|
|  3|null|null|null|null|   1|
+---+----+----+----+----+----+