1
votes

I am trying to modify a column value in a PySpark DataFrame as follows:

df_cleaned = df_cleaned.withColumn('brand_c', when(df_cleaned['brand'] == "samsung" |\
                                                   df_cleaned['brand'] == "oppo", df_cleaned.brand)\
                                   .otherwise('others'))

This generates the following exception:

An error occurred while calling o435.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/column.py", line 115, in _
    njc = getattr(self._jc, name)(jc)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 332, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling o435.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

1 Answer

4
votes

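You are just missing a couple of parentheses. In Python, | binds more tightly than ==, so without them the expression first evaluates "samsung" | df_cleaned['brand'], which asks the JVM column for an or method that accepts a string. That is the "Method or([class java.lang.String]) does not exist" error. Try: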

from pyspark.sql.functions import when

df_cleaned = df_cleaned.withColumn('brand_c', when((df_cleaned['brand'] == "samsung") |
                (df_cleaned['brand'] == "oppo"), df_cleaned.brand).otherwise('others'))

Always wrap each comparison in parentheses when combining conditions with &, | or ~ in PySpark.
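
To see why the parentheses matter, it can help to look at how Python itself groups the two versions of the condition. The sketch below uses only the standard-library ast module, so it needs no Spark session; brand is just a placeholder name standing in for the column, and the variable names are made up for illustration.

```python
import ast

# Parse both forms of the condition. "brand" is a placeholder for the column.
unparenthesized = ast.parse('brand == "samsung" | brand == "oppo"', mode="eval").body
parenthesized = ast.parse('(brand == "samsung") | (brand == "oppo")', mode="eval").body

# Without parentheses, Python builds ONE chained comparison:
#   brand == ("samsung" | brand) == "oppo"
# so the | is applied to the string and the column first.
middle = unparenthesized.comparators[0]
print(type(unparenthesized).__name__)   # Compare
print(type(middle.op).__name__)         # BitOr
print(middle.left.value)                # samsung

# With parentheses, the top-level operation is the | of two comparisons,
# which is what PySpark needs to build a boolean OR of two conditions.
print(type(parenthesized).__name__)     # BinOp
print(type(parenthesized.left).__name__, type(parenthesized.right).__name__)  # Compare Compare
```

In the unparenthesized form the string "samsung" ends up as the left operand of |, which matches the or([class java.lang.String]) call reported in the exception.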