How to add a column in pyspark if two column values is in another dataframe?

Question

I'm very new to pyspark. I have two dataframes like this:

label column in df1 does not exist at first. I added it later. If [user_id, sku_id] pair of df1 is in df2, then I want to add a column in df1 and set it to 1, otherwise 0, just like df1 shows. How can I do it in pyspark? I'm using py2.7.

Pushkr Pushkr · Accepted Answer · 2017-04-17T03:28:13

Its possible by doing left outer join on two dataframes first, and then using when and otherwise functions on one of columns of right dataframe. here is complete solution I tried -

from pyspark.sql import functions as F
from pyspark.sql.functions import col

# this is just data input
data1 = [[4,3,3],[2,4,3],[4,2,4],[4,3,3]]
data2 = [[4,3,3],[2,3,3],[4,1,4]]

# create dataframes
df1 = spark.createDataFrame(data1,schema=['userId','sku_id','type'])
df2 = spark.createDataFrame(data2,schema=['userId','sku_id','type'])

# condition for join
cond=[df1.userId==df2.userId,df1.sku_id==df2.sku_id,df1.type==df2.type]

# magic
df1.join(df2,cond,how='left_outer')\
  .select(df1.userId,df1.sku_id,df1.type,df2.userId.alias('uid'))\
  .withColumn('label',F.when(col('uid')>0  ,1).otherwise(0))\
  .drop(col('uid'))\
  .show()

output :

+------+------+----+-----+
|userId|sku_id|type|label|
+------+------+----+-----+
|     2|     4|   3|    0|
|     4|     3|   3|    1|
|     4|     3|   3|    1|
|     4|     2|   4|    0|
+------+------+----+-----+

How to add a column in pyspark if two column values is in another dataframe?

1 Answers