I have two Spark-Scala dataframes and I need to use one boolean column from one dataframe to filter the second dataframe. Both dataframes have the same number of rows.
In pandas I would do it like this:
import pandas as pd
df1 = pd.DataFrame({"col1": ["A", "B", "A", "C"], "boolean_column": [True, False, True, False]})
df2 = pd.DataFrame({"col1": ["Z", "X", "Y", "W"], "col2": [1, 2, 3, 4]})
filtered_df2 = df2[df1['boolean_column']]
# Expected filtered_df2 should be this:
# df2 = pd.DataFrame({"col1": ["Z", "Y"], "col2": [1, 3]})
How can I do the same operation in Spark-Scala in the most time-efficient way?
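For reference, here is the same example data as Spark-Scala dataframes (a sketch assuming an existing SparkSession named spark; in practice the dataframes arrive from elsewhere):

import spark.implicits._

val df1 = Seq(("A", true), ("B", false), ("A", true), ("C", false))
  .toDF("col1", "boolean_column")
val df2 = Seq(("Z", 1), ("X", 2), ("Y", 3), ("W", 4))
  .toDF("col1", "col2")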
My current solution is to add "boolean_column" from df1 to df2, then filter df2 by keeping only the rows with a true value in the newly added column, and finally drop "boolean_column" from df2, but I'm not sure it is the best solution.
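A rough sketch of that approach, assuming the rows of the two dataframes can be matched by position (the withRowId helper below is hypothetical and uses zipWithIndex to attach a positional id; names are illustrative):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: append a positional row id via zipWithIndex so the
// rows of the two dataframes can be matched by position.
def withRowId(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField("row_id", LongType, nullable = false))
  val rdd = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  df.sparkSession.createDataFrame(rdd, schema)
}

// 1. bring boolean_column from df1 into df2 (matched on the positional id),
// 2. keep only the rows where it is true,
// 3. drop the helper columns so df2 keeps its original schema.
val mask = withRowId(df1).select(col("row_id"), col("boolean_column"))
val filteredDf2 = withRowId(df2)
  .join(mask, Seq("row_id"))
  .filter(col("boolean_column"))
  .drop("row_id", "boolean_column")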
Any suggestion is appreciated.
Edit:
- The expected output is a Spark-Scala dataframe (not a list or a column) with the same schema as the second dataframe, containing only the subset of rows from df2 that satisfy the boolean mask from the "boolean_column" of df1.
- The schema of df2 presented above is just an example. I'm expecting to receive df2 as a parameter, with any number of columns of different (and not fixed) schemas.
Comments:
- Filter df1 (keep only the rows where boolean_column is true) and then use a left-semi join – Raphael Roth
- How do you know that df1[row1] corresponds to df2[row1]? Is the order of the two datasets consistent every time that you load them? In general, the only way to use/apply values from one dataset to the other is through a join, as @RaphaelRoth already mentioned. If the order is deterministic for both datasets, then you can use something like rowNumber to add a common id to both datasets, and finally join them. – abiratsis
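A sketch of the variant suggested in the comments, reusing the hypothetical withRowId helper from the sketch above: keep only the ids of the df1 rows whose boolean_column is true, then left-semi join so df2 keeps exactly those rows and none of df1's columns.

import org.apache.spark.sql.functions.col

// Ids of the rows to keep, taken from df1's boolean mask.
val trueIds = withRowId(df1)
  .filter(col("boolean_column"))
  .select("row_id")

// A left-semi join keeps only matching df2 rows and only df2's columns.
val filteredDf2 = withRowId(df2)
  .join(trueIds, Seq("row_id"), "left_semi")
  .drop("row_id")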