1 vote

I am new to data science and I am working on a simple personal project in Google Colab. I read data from a something1.csv file and a something2.csv file:

df1 = spark.read.csv('something1.csv', header=True)
df2 = spark.read.csv('something2.csv', header=True)
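(One possible gotcha, an observation of mine rather than something from the question: without `inferSchema=True`, `spark.read.csv` loads every column as a string, so the coordinate columns compare as text, not numbers. Plain Python's `csv` module behaves the same way and shows the effect:)

```python
import csv
import io

# Like spark.read.csv without inferSchema=True, csv.reader yields strings only.
rows = list(csv.reader(io.StringIO("country,Latitude,Longitude\nAndorra,42.506,1.5218")))
header, andorra = rows

assert andorra[1] == "42.506"   # the Latitude cell is a string...
assert andorra[1] != 42.506     # ...so comparing it to a float is always False
```

If the two files ever differ in formatting (e.g. `-31.81` vs `-31.810`), string-typed keys will fail to match even when the numbers are equal, so reading with `inferSchema=True` is the safer default here.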

The data of something1.csv looks like

#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|   col1 |  col2 |
#+----------+---------+----------+--------+-------+
#|   Andorra|   42.506|    1.5218|    2   |   1   |
#| Australia|   -31.81|    115.86|    1   |   6   |
#|   Austria|   41.597|    12.580|    4   |   9   |
#|   Belgium|   21.782|     1.286|    2   |   3   |
#|     India|   78.389|    12.972|    1   |   7   |
#|   Ireland|    9.281|     9.286|    9   |   8   |
#|       USA|   69.371|    21.819|    7   |   2   |
#+----------+---------+----------+--------+-------+

The data of something2.csv looks like

#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|   col1 |  col2 |
#+----------+---------+----------+--------+-------+
#| Australia|   -31.81|    115.86|    2   |   6   |
#|   Belgium|   21.782|     1.286|    1   |   6   |
#|     India|   78.389|    12.972|    3   |   5   |
#|       USA|   69.371|    21.819|    2   |   5   |
#+----------+---------+----------+--------+-------+

Now I want to intersect df2 with df1 on Latitude and Longitude and get the matching rows, with col1 and col2 taken from df1. The result should look like:

#+----------+---------+----------+--------+-------+
#|   country| Latitude| Longitude|   col1 |  col2 |
#+----------+---------+----------+--------+-------+
#| Australia|   -31.81|    115.86|    1   |   6   |
#|   Belgium|   21.782|     1.286|    2   |   3   |
#|     India|   78.389|    12.972|    1   |   7   |
#|       USA|   69.371|    21.819|    7   |   2   |
#+----------+---------+----------+--------+-------+

I tried the following code, but it didn't work:

new_df = df1.intersect(df2) # intersect in PySpark, but this gave me an empty table
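(For context, a sketch of my own, not from the question: `intersect` compares entire rows, so a single differing column drops the row. Modeling rows as tuples in plain Python sets reproduces the behavior:)

```python
# Rows as (country, lat, lon, col1, col2) tuples -- mimics full-row intersect.
df1_rows = {("Australia", -31.81, 115.86, 1, 6), ("USA", 69.371, 21.819, 7, 2)}
df2_rows = {("Australia", -31.81, 115.86, 2, 6), ("USA", 69.371, 21.819, 2, 5)}

# col1/col2 differ between the files, so the full-row intersection is empty,
# just like df1.intersect(df2):
assert df1_rows & df2_rows == set()

# Intersecting on the coordinate columns alone does match, like the
# two-column select-then-intersect attempt:
keys1 = {(lat, lon) for (_, lat, lon, _, _) in df1_rows}
keys2 = {(lat, lon) for (_, lat, lon, _, _) in df2_rows}
assert keys1 & keys2 == {(-31.81, 115.86), (69.371, 21.819)}
```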

Then I also tried intersecting only on Latitude and Longitude:

new_df = df2.select('Latitude','Longitude').intersect(df1.select('Latitude','Longitude')) # intersecting based on the key columns only

I tried both of the above in PySpark, but neither gave the table I want.

Spark Version ? - Srinivas
@Srinivas Yes spark version and pyspark - R0bert
df1.intersect(df2) will give an empty result because you don't have rows that match on every column, which is expected. The second one should give a result because the two key columns do have matching rows in both DataFrames. - Srinivas

1 Answer

3
votes

intersect only returns rows that are identical across all columns in both DataFrames.

  • But in your case you need col1, col2 from df1 and the other columns from df2, so join the DataFrames (left/inner as per your requirement) and select col1, col2 from df1 and the remaining columns from df2.

  • (or) As mentioned in the comments by Mohammad Murtaza Hashmi, use a left_semi join.

Example:

#using left semi join
df1.join(df2,['Latitude','Longitude'],'left_semi').show()

#using left join
df2.alias("t2").join(df1.alias("t1"),['Latitude','Longitude'],'left').select("t2.country","t2.Latitude","t2.Longitude","t1.col1","t1.col2").show()
#+---------+--------+---------+----+----+
#|  country|Latitude|Longitude|col1|col2|
#+---------+--------+---------+----+----+
#|Australia|  -31.81|   115.86|   1|   6|
#|  Belgium|  21.782|    1.286|   2|   3|
#|    India|  78.389|   12.972|   1|   7|
#|      USA|  69.371|   21.819|   7|   2|
#+---------+--------+---------+----+----+
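(The join-then-select pattern can also be mimicked in plain Python, a sketch under the assumption that (Latitude, Longitude) is unique within df1: build a lookup keyed by the coordinates, then take col1/col2 from df1 for each df2 row:)

```python
df1 = [("Australia", -31.81, 115.86, 1, 6),
       ("Belgium", 21.782, 1.286, 2, 3),
       ("India", 78.389, 12.972, 1, 7),
       ("USA", 69.371, 21.819, 7, 2)]
df2 = [("Australia", -31.81, 115.86, 2, 6),
       ("Belgium", 21.782, 1.286, 1, 6),
       ("India", 78.389, 12.972, 3, 5),
       ("USA", 69.371, 21.819, 2, 5)]

# Lookup: (Latitude, Longitude) -> (col1, col2) from df1, i.e. the t1 side.
lookup = {(lat, lon): (c1, c2) for (_, lat, lon, c1, c2) in df1}

# For each df2 row keep its identifying columns but replace col1/col2
# with df1's values -- the same shape as the left-join-plus-select above.
result = [(country, lat, lon, *lookup[(lat, lon)])
          for (country, lat, lon, _, _) in df2]

assert result == [("Australia", -31.81, 115.86, 1, 6),
                  ("Belgium", 21.782, 1.286, 2, 3),
                  ("India", 78.389, 12.972, 1, 7),
                  ("USA", 69.371, 21.819, 7, 2)]
```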

Dynamic way:

#join columns
join_cols=[x for x in df1.columns if x.startswith("L")]

#selecting cols from t1
t1_cols=["t1."+x for x in df1.columns if x.startswith("col")]

#selecting cols from t2
t2_cols=["t2."+x for x in df2.columns if not x.startswith("col")]

df2.alias("t2").join(df1.alias("t1"),join_cols,'inner').select(*t2_cols + t1_cols).show()

#+---------+--------+---------+----+----+
#|  country|Latitude|Longitude|col1|col2|
#+---------+--------+---------+----+----+
#|Australia|  -31.81|   115.86|   1|   6|
#|  Belgium|  21.782|    1.286|   2|   3|
#|    India|  78.389|   12.972|   1|   7|
#|      USA|  69.371|   21.819|   7|   2|
#+---------+--------+---------+----+----+
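(The dynamic column lists are ordinary Python comprehensions over `df.columns`, so they can be checked without a Spark session. Assuming the schemas shown in the question, where both files share the same header:)

```python
df1_columns = ["country", "Latitude", "Longitude", "col1", "col2"]
df2_columns = df1_columns  # both CSVs have the same header

# Same comprehensions as in the answer above.
join_cols = [x for x in df1_columns if x.startswith("L")]
t1_cols = ["t1." + x for x in df1_columns if x.startswith("col")]
t2_cols = ["t2." + x for x in df2_columns if not x.startswith("col")]

assert join_cols == ["Latitude", "Longitude"]
assert t2_cols + t1_cols == ["t2.country", "t2.Latitude", "t2.Longitude",
                             "t1.col1", "t1.col2"]
```

Note that the `startswith("L")` / `startswith("col")` tests are tied to these particular column names; with a different schema you would list the join keys explicitly instead.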