I have a PySpark DataFrame.
Example:
ID | phone | name <array> | age <array>
-------------------------------------------------
12 | 827556 | ['AB','AA'] | ['CC']
-------------------------------------------------
45 | 87346 | null | ['DD']
-------------------------------------------------
56 | 98356 | ['FF'] | null
-------------------------------------------------
34 | 87345 | ['AA','BB'] | ['BB']
I want to concatenate the two array columns name and age into a single array column. I tried this:
from pyspark.sql import functions as F

df = df.withColumn("new_column", F.concat(df.name, df.age))
df = df.select("ID", "phone", "new_column")
But I got some missing values; it seems the concat function works on strings, not on arrays, and removes the duplicates.
Expected result:
ID | phone | new_column <array>
----------------------------------------
12 | 827556 | ['AB','AA','CC']
----------------------------------------
45 | 87346 | ['DD']
----------------------------------------
56 | 98356 | ['FF']
----------------------------------------
34 | 87345 | ['AA','BB']
----------------------------------------
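For reference, my understanding is that on Spark 2.4+ the built-in array functions would handle this directly (nulls would still need special handling), but I can't upgrade:

# Spark >= 2.4 only, not available to me:
df_new = df.withColumn("new_column", F.concat(df.name, df.age))         # concatenates arrays, keeps duplicates
# df_new = df.withColumn("new_column", F.array_union(df.name, df.age))  # also removes duplicates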
How can I concatenate two array columns in PySpark, given that I'm using a Spark version older than 2.4?
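The only workaround I've come up with so far is a plain Python UDF, sketched below under the assumption that both columns are arrays of strings; it treats a null array as empty and drops duplicates (as in my expected output), but I don't know if a UDF is the right approach here:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def merge_arrays(a, b):
    # Treat null arrays as empty, concatenate, then drop duplicates
    # while keeping the first-occurrence order.
    merged = (a or []) + (b or [])
    seen = set()
    result = []
    for value in merged:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result

merge_arrays_udf = F.udf(merge_arrays, ArrayType(StringType()))

df = df.withColumn("new_column", merge_arrays_udf(df.name, df.age))
df = df.select("ID", "phone", "new_column")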
Thank you