0
votes

I have the following issue since I am a bit of a noob in pyspark. Based on conditions on values coming from two columns I want to populate a third column. The conditions are:

  • if semicolon contained in col2, check col1
    • if col1 == 1, take the value before the semicolon
    • if col1 == 2, take the value after the semicolon
  • if no semicolon in col1, take the value from col2 as-is

This is what the dataframe looks like.

col1 col2 col3
1 24.9;34.9 24.9
2 24.9;34.9 34.9
1 80.8;90.9 80.8
2 80.8;90.9 90.9
1 777 777

I made the following udf which gives the error Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

import pyspark.sql.functions as F

def split_by_semicolon_if_exists(col1,col2):
    if (col1.contains(';') == True):
        if col2 == 1:
          result = F.substring(col1, 0, (F.length(col1) - F.expr('locate('';'', col1) - 1')))
        if col2 == 2:
          result = F.substring(col1, F.expr('locate('';'', col1) - 1'), (F.length(col1) - F.expr('locate('';'', col1) - 1')))     
        return result
    else:
      return col1

df = df.withColumn('col3', 
                     split_by_semicolon_if_exists(df['col1'], 
                                                  df['col2']))

I have built this udf by googling for the various functions so there probably are multiple issues with it. Can you please help me build a udf for this case?