15
votes

I have a dataframe (dfRawData) that I need to filter on column X, keeping only the rows where X is CB, CI, or CR. I used the code below:

df = dfRawData.filter(col("X").between("CB","CI","CR"))

But I am getting the following error:

between() takes exactly 3 arguments (4 given)

Please let me know how I can resolve this issue.

1
Related: stackoverflow.com/a/58541958/3712254. I found the join implementation to be faster than where. - bantmen

1 Answer

39
votes

The between function checks whether a value lies between two values: it takes exactly a lower bound and an upper bound, which is why passing three values raises the error. It cannot be used to check whether a column value is in a list. To do that, use isin:

import pyspark.sql.functions as f
df = dfRawData.where(f.col("X").isin(["CB", "CI", "CR"]))
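To make the difference concrete, here is a minimal plain-Python sketch of the two checks. The helper functions `between` and `isin` and the `sample` values are hypothetical stand-ins written for illustration, not Spark code; they assume an inclusive range check and ordinary string ordering, which should match what the column methods do for simple ASCII values like these.

```python
# Plain-Python stand-ins for the two column predicates (illustration only, not Spark).

def between(value, lower, upper):
    # Inclusive range check: true when lower <= value <= upper.
    return lower <= value <= upper

def isin(value, allowed):
    # Membership check: true only when value is one of the listed items.
    return value in allowed

# Hypothetical sample values for column X.
sample = ["CA", "CB", "CC", "CI", "CR", "CZ"]

# A range check also lets through values that merely sort between the bounds.
in_range = [v for v in sample if between(v, "CB", "CR")]

# A membership check keeps exactly the listed values.
in_list = [v for v in sample if isin(v, ["CB", "CI", "CR"])]

print(in_range)  # ['CB', 'CC', 'CI', 'CR']
print(in_list)   # ['CB', 'CI', 'CR']
```

Note that even a two-argument call such as between("CB", "CR") would not be equivalent to the list filter, because it would also keep values like CC that sort between the bounds.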