
I am developing a dynamic script that can join any given PySpark dataframes. The problem is that the column names in the file will vary, and the number of join conditions may vary. I can handle this in a loop, but when I execute the join with a variable name it fails.

(My intention is to dynamically populate a and b, or more columns, based on the file structure and the join conditions.)

b="incrementalFile.Id1"
a="existingFile.Id"
unChangedRecords = existingFile.join(incrementalFile,(a==b),"left") 

Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 818, in join
    assert isinstance(on[0], Column), "on should be Column or list of Column"
AssertionError: on should be Column or list of Column

But the same code works fine if I don't use any variables in the join condition, as below.

unChangedRecords = existingFile.join(
    incrementalFile,
    (existingFile.Id==incrementalFile.Id1), 
    "left")
Why is this tagged 'scala'? – DYZ
@DyZ: the reason is, the logic can be the same in Scala or PySpark. – logan

1 Answer


In your second example, existingFile.Id is a Column, but in your first example it's a string, so the comparison produces a plain boolean instead of a join condition. You want to use pyspark.sql.functions.col to reference a column by name. Its docs don't include an example, but it's used in the example for alias on the same page.