1
votes

I have a Pyspark dataframe, among others, a column of MSNs (of string type) like the following:

+------+
| Col1 |
+------+
| 654- |
| 1859 |
| 5875 |
| 784- |
| 596- |
| 668- |
| 1075 |
+------+

As you can see, those entries with a value of less than 1000 (i.e. three characters) have a - character at the end to make a total of 4 characters.

I want to get rid of that - character, so that I end up with something like:

+------+
| Col2 |
+------+
| 654  |
| 1859 |
| 5875 |
| 784  |
| 596  |
| 668  |
| 1075 |
+------+

I have tried the following code (where df is the dataframe containing the column, but it does not appear to work:

if df.Col1[3] == "-":
        df = df.withColumn('Col2', df.series.substr(1, 3))
        return df
else:
        return df

Does anyone know how to do it?

2

2 Answers

1
votes

You can replace - in the column with empty string ("") using F.regexp_replace

See the code below,

df.withColumn("Col2", F.regexp_replace("Col1", "-", "")).show()

+----+----+
|Col1|Col2|
+----+----+
|589-| 589|
|1245|1245|
|145-| 145|
+----+----+
0
votes

Here is a solution using the .substr() method:

df.withColumn("Col2", F.when(F.col("Col1").substr(4, 1) == "-", 
                             F.col("Col1").substr(1, 3)
                      ).otherwise(
                             F.col("Col1"))).show()

+----+----+
|Col1|Col2|
+----+----+
|654-| 654|
|1859|1859|
|5875|5875|
|784-| 784|
|596-| 596|
|668-| 668|
|1075|1075|
+----+----+