I have a Pyspark dataframe, among others, a column of MSNs (of string type) like the following:
+------+
| Col1 |
+------+
| 654- |
| 1859 |
| 5875 |
| 784- |
| 596- |
| 668- |
| 1075 |
+------+
As you can see, those entries with a value of less than 1000 (i.e. three characters) have a - character at the end to make a total of 4 characters.
I want to get rid of that - character, so that I end up with something like:
+------+
| Col2 |
+------+
| 654 |
| 1859 |
| 5875 |
| 784 |
| 596 |
| 668 |
| 1075 |
+------+
I have tried the following code (where df is the dataframe containing the column, but it does not appear to work:
if df.Col1[3] == "-":
df = df.withColumn('Col2', df.series.substr(1, 3))
return df
else:
return df
Does anyone know how to do it?