I have a PySpark DataFrame that contains an `ArrayType(StringType())` column. This column contains duplicate strings inside the array, which I need to remove. For example, one row entry could look like `[milk, bread, milk, toast]`.
Let's say my DataFrame is named `df` and my column is named `arraycol`. I need something like:
df = df.withColumn("arraycol_without_dupes", F.remove_dupes_from_array("arraycol"))
My intuition was that there exists a simple solution to this, but after browsing Stack Overflow for 15 minutes I didn't find anything better than exploding the column, removing duplicates on the complete DataFrame, then grouping again. There has to be a simpler way that I just didn't think of, right?
I am using Spark version 2.4.0
`df = df.dropDuplicates(subset=["arraycol"])`
– YOLO