My ultimate goal is to compare the column names in df2 against a list of values extracted from df1.
I have a list of names and a function that checks whether those names exist as column names in df1. This worked in plain Python, but it doesn't work in PySpark. The error I'm getting: AttributeError: 'DataFrame' object has no attribute 'values'.
How can I change my function so that it iterates over the column names? Or is there a way to compare my list values to df2's column names directly (i.e. against the full dataframe, without building a new dataframe containing just the column names)?
# Build the list of domain names from df1 (Entity)
entityDomainList = Entity.select("DomainName").rdd.flatMap(lambda x: x).collect()

# Function to check matching values
def checkIfDomainsExists(data, listOfValues):
    '''Check whether each element of listOfValues is a column name of data.'''
    results_true = {}
    results_false = {}
    # Iterate over the list of domains one by one
    for elem in listOfValues:
        # Check if the element appears among the dataframe's column names
        if elem in data.columns:
            results_true[elem] = True
        else:
            results_false[elem] = False
    # Return a dictionary of values and their flag
    # Only return the TRUE matches
    return results_true
# Get TRUE matched column values
results_true = checkIfDomainsExists(psv, entityDomainList)
results_true
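For what it's worth, a minimal sketch of the direct comparison asked about above: in PySpark, df.columns is an ordinary Python list of strings, so plain membership tests or set operations suffice with no extra dataframe or RDD round-trip. The column names and wanted list below are hypothetical stand-ins for df2.columns and the list extracted from df1:

```python
# Hypothetical stand-in for df2.columns (a plain Python list in PySpark)
columns = ["DomainA", "DomainB", "DomainC"]
# Hypothetical stand-in for the list of values extracted from df1
wanted = ["DomainA", "DomainX", "DomainC"]

column_set = set(columns)  # set lookup is O(1) per element
matched = [name for name in wanted if name in column_set]
missing = [name for name in wanted if name not in column_set]

print(matched)  # ['DomainA', 'DomainC']
print(missing)  # ['DomainX']
```

The same membership test works unchanged against a real Spark dataframe by replacing `columns` with `df2.columns`.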
df = spark.createDataFrame([(i, x) for i, x in enumerate(datasetDomainList)], ['index', 'col1'])
– murtihash