
I am developing a dynamic script that can join any given PySpark dataframes. The problem is that the column names in the file will vary, and the number of join conditions may vary. I can handle this in a loop, but when I execute the join with a variable name it fails.

(My intention is to dynamically populate a and b, or more columns, based on the file structure and the join conditions.)

b="incrementalFile.Id1"
a="existingFile.Id"
unChangedRecords = existingFile.join(incrementalFile,(a==b),"left") 

Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 818, in join
    assert isinstance(on[0], Column), "on should be Column or list of Column"
AssertionError: on should be Column or list of Column

But the same code works fine if I don't use any variables in the join condition, as below.

unChangedRecords = existingFile.join(
    incrementalFile,
    (existingFile.Id==incrementalFile.Id1), 
    "left")
Why is this tagged 'scala'? – DYZ
@DyZ: the reason is, the logic can be the same in Scala or PySpark. – logan

1 Answer


In your second example, existingFile.Id is a Column, but in your first example it's a string, so the comparison produces a plain boolean instead of a join condition. You want to use pyspark.sql.functions.col to reference a column by name. Its docs don't include an example, but it's used in the example for alias on the same page.