I am using the DataFrame API of pyspark (Apache Spark) and am running into the following problem:
When I join two DataFrames that originate from the same source DataFrame, the resulting DF will explode to a huge number of rows. A quick example:
I load a DataFrame with n rows from disk:
df = sql_context.parquetFile('data.parquet')
Then I create two DataFrames from that source.
df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')
Finally I want to (inner) join them back together:
df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'], 'inner')
The key in col1 is unique. The resulting DataFrame should have n rows, but instead it has n*n rows.
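For reference, this is roughly how I am checking the row counts (sql_context is the usual SQLContext from the pyspark shell, and the column names are just the ones from the example above):

print(df.count())         # n rows in the source
print(df_one.count())     # n rows
print(df_two.count())     # n rows
print(df_joined.count())  # n*n rows, not n as I would expect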
That does not happen when I load df_one and df_two from disk directly before joining. I am on Spark 1.3.0, but this also happens on the current 1.4.0 snapshot.
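For completeness, this is roughly what I mean by loading them from disk directly (the intermediate file names are made up for this example, and I am using the Spark 1.3 parquet methods); after this round trip the join gives n rows as expected:

df_one.saveAsParquetFile('df_one.parquet')
df_two.saveAsParquetFile('df_two.parquet')
df_one_loaded = sql_context.parquetFile('df_one.parquet')
df_two_loaded = sql_context.parquetFile('df_two.parquet')
# joining the reloaded DataFrames behaves as expected
df_joined = df_one_loaded.join(df_two_loaded, df_one_loaded['col1'] == df_two_loaded['col1'], 'inner')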
Can anyone explain why that happens?
df_one.merge(df_two, left_on='col1', right_on='col2', how='inner')? - EdChum
There is no merge on a Spark DataFrame I'm afraid. - karlson