I'm working with a PySpark dataframe that looks like this:
+----+----+---+---+---+----+
| a| b| c| d| e| f|
+----+----+---+---+---+----+
| 2|12.3| 5|5.6| 6|44.7|
|null|null| 9|9.3| 19|23.5|
| 8| 4.3| 7|0.5| 21| 8.2|
| 9| 3.8| 3|6.5| 45| 4.9|
| 3| 8.7| 2|2.8| 32| 2.9|
+----+----+---+---+---+----+
To create the above dataframe:
rdd = sc.parallelize([(2, 12.3, 5, 5.6, 6, 44.7),
                      (None, None, 9, 9.3, 19, 23.5),
                      (8, 4.3, 7, 0.5, 21, 8.2),
                      (9, 3.8, 3, 6.5, 45, 4.9),
                      (3, 8.7, 2, 2.8, 32, 2.9)])
df = sqlContext.createDataFrame(rdd, ('a', 'b', 'c', 'd', 'e', 'f'))
df.show()
I want to create another column 'g' whose values are lists of tuples built from the existing non-null columns. The tuples have the form:
((column a, column b), (column c, column d), (column e, column f))
Requirements for the output column: 1) Only consider the non-null columns while building the list of tuples. 2) Return the list of tuples.
So the final dataframe with column 'g' would be:
+----+----+---+---+---+----+---------------------------+
|   a|   b|  c|  d|  e|   f|                          g|
+----+----+---+---+---+----+---------------------------+
|   2|12.3|  5|5.6|  6|44.7|[[2,12.3],[5,5.6],[6,44.7]]|
|null|null|  9|9.3| 19|23.5|        [[9,9.3],[19,23.5]]|
|   8| 4.3|  7|0.5| 21| 8.2| [[8,4.3],[7,0.5],[21,8.2]]|
|   9| 3.8|  3|6.5| 45| 4.9| [[9,3.8],[3,6.5],[45,4.9]]|
|   3| 8.7|  2|2.8| 32| 2.9| [[3,8.7],[2,2.8],[32,2.9]]|
+----+----+---+---+---+----+---------------------------+
In column 'g', the list in the second row has only two pairs instead of three, because columns 'a' and 'b' are null in that row and are omitted.
I'm not sure how to dynamically omit the null values and form the list of tuples.
I tried to partially achieve the final column with a udf:
l1 = ['a', 'c', 'e']
l2 = ['b', 'd', 'f']

def func1(r1, r2):
    l = []
    for i in range(len(l1)):
        l.append((r1[i], r2[i]))
    return l

func1_udf = udf(func1)
df = df.withColumn('g', func1_udf(array(l1), array(l2)))
df.show()
I also tried declaring the udf with an ArrayType return type, but that did not work either. Any help would be much appreciated. I'm working with PySpark 1.6. Thank you!
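For reference, the per-row behaviour I'm after can be sketched in plain Python (without Spark): zip the "first" values (columns a, c, e) with the "second" values (columns b, d, f) and drop any pair that contains a null. The function name make_pairs is just mine for illustration; I assume the real UDF would wrap logic like this and, in 1.6, would also need an explicit return schema such as ArrayType(StructType(...)).

```python
def make_pairs(firsts, seconds):
    # Pair up corresponding values and keep only the pairs
    # where neither element is None (i.e. null in Spark).
    return [(x, y) for x, y in zip(firsts, seconds)
            if x is not None and y is not None]

# Row 1: no nulls, so all three pairs survive.
print(make_pairs([2, 5, 6], [12.3, 5.6, 44.7]))
# -> [(2, 12.3), (5, 5.6), (6, 44.7)]

# Row 2: the (None, None) pair is dropped, leaving two pairs.
print(make_pairs([None, 9, 19], [None, 9.3, 23.5]))
# -> [(9, 9.3), (19, 23.5)]
```

In the UDF above, this function would receive array(l1) and array(l2) as its two arguments; I suspect the missing pieces in my attempt are this null filter and a proper return type for udf().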