0
votes

I have a pyspark dataframe df. it has 2 columns like the example input shown below. I would like to create a new output dataframe, with a new column 'col3' that only has the alphanumeric values from the strings in col2.

I've tried using spark sql with

regexp_extract('('+col1+')','[^[A-Za-z0-9] ]', 0)

but it only returns null.

can anyone suggest how to do this?

input

df.show()


+----+----+
|col1|col2|
+----+----+
|1   |ab& |
+----+----+
|2   |efg |
+----+----+

output

+----+----+
|col1|col3|
+----+----+
|1   |ab  |
+----+----+
|2   |efg |
+----+----+
1

1 Answers

0
votes

Use regexp_replace() function in spark.

Example:

df.show()
#+----+----+
#|col1|col2|
#+----+----+
#|   1| ab&|
#|   2| efg|
#+----+----+

from pyspark.sql.functions import *

df.withColumn("col3",regexp_replace("col2",'[^A-Za-z0-9]','')).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|   1| ab&|  ab|
#|   2| efg| efg|
#+----+----+----+