I have a PySpark DataFrame (df1) that consists of 10K rows and looks like this:
id  mobile_no   value
1   1111111111  .43
2   2222222222  .54
3   3333333333  .03
4   4444444444  .22
Another PySpark DataFrame (df2) consists of 100K records and looks like this:
mobile_no     gender
912222222222  M
914444444444  M
919999999999  F
915555555555  M
918888888888  F
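For reference, a minimal sketch to reproduce the two frames (this assumes mobile_no is stored as a string in both; the sample rows are taken from the tables above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the frames above; mobile_no kept as a string column.
df1 = spark.createDataFrame(
    [(1, "1111111111", 0.43), (2, "2222222222", 0.54),
     (3, "3333333333", 0.03), (4, "4444444444", 0.22)],
    ["id", "mobile_no", "value"],
)
df2 = spark.createDataFrame(
    [("912222222222", "M"), ("914444444444", "M"), ("919999999999", "F"),
     ("915555555555", "M"), ("918888888888", "F")],
    ["mobile_no", "gender"],
)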
I want an inner join using PySpark so that the final DataFrame looks like this:
mobile_no   value  gender
2222222222  .54    M
4444444444  .22    M
The mobile_no in df2 is 12 characters long (each number carries a leading 91 prefix), while in df1 it is 10. I can join them, but it's a costly operation. Any help using PySpark? Here is what I tried:
# Register the frames as temp views so spark.sql can see them
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
common_cust = spark.sql("SELECT mobile_no, value \
                         FROM df1 \
                         WHERE mobile_no IN (SELECT DISTINCT substr(mobile_no, 3) FROM df2)")
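Note that an IN subquery is a semi-join, so it cannot bring gender into the result; a regular inner join can. One cheaper approach (a sketch, assuming the leading two characters in df2 are always a fixed "91" prefix) is to trim the prefix once and broadcast the smaller frame, which avoids shuffling the 100K-row side:

from pyspark.sql import functions as F

# Normalize the key: drop the 2-character prefix so both sides are 10 digits.
# substring() is 1-based, so position 3 with length 10 keeps the last 10 chars.
df2_trim = df2.withColumn("mobile_no", F.substring("mobile_no", 3, 10))

# df1 (10K rows) is the smaller side; broadcasting it lets Spark do a
# map-side join instead of shuffling both frames.
common_cust = (
    df2_trim.join(F.broadcast(df1), on="mobile_no", how="inner")
            .select("mobile_no", "value", "gender")
)
common_cust.show()

An equivalent join condition without rewriting df2's column would be df1.mobile_no == F.expr("substring(df2.mobile_no, 3, 10)"), but trimming up front keeps the output column clean.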