I have loaded a Parquet file and created a DataFrame, as shown below:
------------------------------------------------------------------------
time | data1                     | data2
------------------------------------------------------------------------
1-40 | [ lion -> 34, bear -> 2 ] | [ monkey -> [9, 23], goose -> [4, 5] ]
So the data1 column has type map&lt;string, int&gt;, while the data2 column has type map&lt;string, array&lt;int&gt;&gt;.
I want to explode the above DataFrame into the structure below:
--------------------------
time | key      | val
--------------------------
1-40 | lion     | 34
1-40 | bear     | 2
1-40 | monkey_0 | 9
1-40 | monkey_1 | 23
1-40 | goose_0  | 4
1-40 | goose_1  | 5
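In plain Python terms (a sketch, not Spark code; the function name `flatten_row` is mine), the flattening I am after looks like this:

```python
# Sketch of the desired flattening: map<string,int> entries become
# (key, val) rows, and map<string,array<int>> entries become
# (key_<index>, element) rows, one per array element.
def flatten_row(time, data1, data2):
    rows = []
    for key, val in data1.items():
        rows.append((time, key, val))
    for key, arr in data2.items():
        for i, val in enumerate(arr):
            rows.append((time, f"{key}_{i}", val))
    return rows

rows = flatten_row('1-40', {'lion': 34, 'bear': 2},
                   {'monkey': [9, 23], 'goose': [4, 5]})
# rows contains e.g. ('1-40', 'lion', 34) and ('1-40', 'monkey_0', 9)
```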
I tried to convert data1 and data2 into the same data type, map&lt;string, array&lt;int&gt;&gt;, using a PySpark UDF, and then exploded the merged column as shown below:
from pyspark.sql.functions import col, explode, udf
from pyspark.sql.types import ArrayType, IntegerType, MapType, StringType

def to_map(col1, col2):
    # Wrap each scalar from data1 in a single-element list and merge it into data2
    for i in col1.keys():
        col2[i] = [col1[i]]
    return col2

caster = udf(to_map, MapType(StringType(), ArrayType(IntegerType())))
pm_df = pm_df.withColumn("animals", caster('data1', 'data2'))
pm_df = pm_df.select('time', explode(col('animals')))
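The merge logic inside the UDF can be sanity-checked in plain Python, since PySpark passes MapType values to a UDF as ordinary dicts (a standalone sketch, no Spark session needed):

```python
# Plain-Python check of the UDF body: scalars from the first map are
# wrapped in single-element lists and merged into the second map.
def to_map(col1, col2):
    for i in col1.keys():
        col2[i] = [col1[i]]
    return col2

merged = to_map({'lion': 34, 'bear': 2},
                {'monkey': [9, 23], 'goose': [4, 5]})
# merged now maps every key to a list, e.g. merged['lion'] == [34]
```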
I also tried Hive SQL, on the assumption that it would perform better than a PySpark UDF:
import datetime

fields = ['time', 'data1', 'data2']  # column names for the sample DataFrame
rdd = spark.sparkContext.parallelize([[datetime.datetime.now(),
                                       {'lion': 34, 'bear': 2},
                                       {'monkey': [9, 23], 'goose': [4, 5]}]])
df = rdd.toDF(fields)
df.createOrReplaceTempView("df")
df = spark.sql("select time, explode(data1), data2 from df")
df.createOrReplaceTempView("df")
spark.sql("select time, key as animal, value, posexplode(data2) from df").show(truncate=False)
But I am stuck with the result below and don't know how to merge the split columns into my required structure. The output of the Hive SQL above is:
+--------------------------+------+-----+---+------+-------+
|time |animal|value|pos|key |value |
+--------------------------+------+-----+---+------+-------+
|2019-06-12 19:23:00.169739|bear |2 |0 |goose |[4, 5] |
|2019-06-12 19:23:00.169739|bear |2 |1 |monkey|[9, 23]|
|2019-06-12 19:23:00.169739|lion |34 |0 |goose |[4, 5] |
|2019-06-12 19:23:00.169739|lion |34 |1 |monkey|[9, 23]|
+--------------------------+------+-----+---+------+-------+
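The row multiplication in that output makes sense: the second explode runs once per already-exploded row, so every data1 entry gets paired with every data2 entry. A plain-Python sketch of the same sequence (entry order may differ from Spark's):

```python
# Why the Hive SQL output has 4 rows: explode(data1) yields one row per
# data1 entry, and posexplode(data2) then yields one row per data2 entry
# for each of those rows, i.e. a cross product of the two maps' entries.
data1 = {'lion': 34, 'bear': 2}
data2 = {'monkey': [9, 23], 'goose': [4, 5]}

rows = []
for animal, value in data1.items():                    # explode(data1)
    for pos, (key, arr) in enumerate(data2.items()):   # posexplode(data2)
        rows.append((animal, value, pos, key, arr))

# len(rows) == len(data1) * len(data2) == 4
```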
I know that Python UDFs carry a lot of serialization overhead for communication between the Python worker and the JVM. Is there any way to achieve the expected result using built-in functions or Hive SQL?