How to concatenate two arrays in PySpark

Question

I have a PySpark DataFrame.

Example:

ID   |    phone   |  name <array>  | age <array>
-------------------------------------------------
12   | 827556     | ['AB','AA']    |  ['CC']
-------------------------------------------------
45   |  87346     |  null          |   ['DD']
-------------------------------------------------
56   |  98356     |  ['FF']        |  null
-------------------------------------------------
34   |  87345     |   ['AA','BB']  |  ['BB']

I want to concatenate the 2 arrays name and age. I did it like this:

from pyspark.sql import functions as F

df = df.withColumn("new_column", F.concat(df.name, df.age))
df = df.select("ID", "phone", "new_column")

But I got some missing columns; it seems the concat function works on a string, not on an array, and removes the duplicates.

Expected result:

ID   |    phone   |  new_column <array>  
----------------------------------------
12   | 827556     | ['AB','AA','CC']    
----------------------------------------
45   |  87346     |  ['DD']             
----------------------------------------
56   |  98356     |  ['FF']        
----------------------------------------
34   |  87345     |   ['AA','BB']    
----------------------------------------

How can I concatenate 2 arrays in PySpark, given that I'm using a Spark version < 2.4?

Thank you

Possible duplicate of Combine PySpark DataFrame ArrayType fields into single ArrayType field — pault

Bala Bala · Accepted Answer · 2019-10-29T12:47:00

You could use selectExpr as well.

testdata = [
    (0, ['AB', 'AA'], ['CC']),
    (1, None, ['DD']),
    (2, ['FF'], None),
    (3, ['AA', 'BB'], ['BB']),
]
df = spark.createDataFrame(testdata, ['id', 'name', 'age'])

>>> df.show()
+---+--------+----+
| id|    name| age|
+---+--------+----+
|  0|[AB, AA]|[CC]|
|  1|    null|[DD]|
|  2|    [FF]|null|
|  3|[AA, BB]|[BB]|
+---+--------+----+

>>> # concat_ws skips null arrays; split turns the joined string back into an array of elements
>>> df.selectExpr('''split(concat_ws(',', name, age), ',') as joined''').show()
+------------+
|      joined|
+------------+
|[AB, AA, CC]|
|        [DD]|
|        [FF]|
|[AA, BB, BB]|
+------------+
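
The output above still contains the duplicate 'BB' in the last row, while the expected result in the question drops it. Joining through a comma-separated string can also misbehave if an element itself contains a comma. As a minimal sketch of an alternative for Spark < 2.4, assuming a Python UDF is acceptable for your data size, the merge can be done in plain Python. The names merge_arrays and with_merged_column are hypothetical helpers made up for this illustration, and the column names name and age are taken from the test data above.

```python
def merge_arrays(first, second):
    """Concatenate two lists, skipping None inputs and dropping
    duplicates while keeping first-seen order."""
    seen = set()
    merged = []
    for arr in (first, second):
        if arr is None:
            # A null array column contributes nothing.
            continue
        for item in arr:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    # Note: returns an empty list (not null) when both inputs are None.
    return merged


def with_merged_column(df, out_col="new_column"):
    """Hypothetical helper: add out_col holding the merged 'name' and 'age' arrays."""
    # Imported here so merge_arrays can be tried without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    merge_udf = F.udf(merge_arrays, ArrayType(StringType()))
    return df.withColumn(out_col, merge_udf(F.col("name"), F.col("age")))
```

With the df defined above, with_merged_column(df).show() should display the merged column. Python UDFs are slower than built-in functions, so this is a fallback for versions that lack array support in concat rather than a general recommendation.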
