I have the following issue since I am a bit of a noob in pyspark. Based on conditions on values coming from two columns I want to populate a third column. The conditions are:
- if semicolon contained in col2, check col1
- if col1 == 1, take the value before the semicolon
- if col1 == 2, take the value after the semicolon
- if no semicolon in col1, take the value from col2 as-is
This is what the dataframe looks like.
col1 | col2 | col3 |
---|---|---|
1 | 24.9;34.9 | 24.9 |
2 | 24.9;34.9 | 34.9 |
1 | 80.8;90.9 | 80.8 |
2 | 80.8;90.9 | 90.9 |
1 | 777 | 777 |
I made the following udf which gives the error Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
import pyspark.sql.functions as F
def split_by_semicolon_if_exists(col1,col2):
if (col1.contains(';') == True):
if col2 == 1:
result = F.substring(col1, 0, (F.length(col1) - F.expr('locate('';'', col1) - 1')))
if col2 == 2:
result = F.substring(col1, F.expr('locate('';'', col1) - 1'), (F.length(col1) - F.expr('locate('';'', col1) - 1')))
return result
else:
return col1
df = df.withColumn('col3',
split_by_semicolon_if_exists(df['col1'],
df['col2']))
I have built this udf by googling for the various functions so there probably are multiple issues with it. Can you please help me build a udf for this case?