Conditional string manipulation in Pyspark

Question

I have a Pyspark dataframe, among others, a column of MSNs (of string type) like the following:

+------+
| Col1 |
+------+
| 654- |
| 1859 |
| 5875 |
| 784- |
| 596- |
| 668- |
| 1075 |
+------+

As you can see, those entries with a value of less than 1000 (i.e. three characters) have a - character at the end to make a total of 4 characters.

I want to get rid of that - character, so that I end up with something like:

+------+
| Col2 |
+------+
| 654  |
| 1859 |
| 5875 |
| 784  |
| 596  |
| 668  |
| 1075 |
+------+

I have tried the following code (where df is the dataframe containing the column, but it does not appear to work:

if df.Col1[3] == "-":
        df = df.withColumn('Col2', df.series.substr(1, 3))
        return df
else:
        return df

Does anyone know how to do it?

Sreeram TP Sreeram TP · Accepted Answer · 2020-10-29T09:40:16

You can replace - in the column with empty string ("") using F.regexp_replace

See the code below,

df.withColumn("Col2", F.regexp_replace("Col1", "-", "")).show()

+----+----+
|Col1|Col2|
+----+----+
|589-| 589|
|1245|1245|
|145-| 145|
+----+----+