My ultimate goal is to compare the column names in df2 against a list of values extracted from df1, to see which names appear in that list.

I have a list of names and a function that checks whether those names exist as column names in df1. However, this worked in Python but doesn't work in PySpark. The error I'm getting: AttributeError: 'DataFrame' object has no attribute 'values'.

How can I change my function so that it iterates over the column names? Or is there a way to compare my list values to df2's column names (the full dataframe, i.e. without making a new dataframe that holds just the column names)?

# Collect the domain names from the Entity dataframe into a Python list
entityDomainList = Entity.select("DomainName").rdd.flatMap(lambda x: x).collect()

# Function to check which values exist as column names
def checkIfDomainsExists(data, listOfValues):
    '''Check if the given elements exist as column names in data'''
    results_true = {}
    results_false = {}
    # Iterate over the list of domains one by one
    for elem in listOfValues:
        # data.columns is a plain Python list of column names
        if elem in data.columns:
            results_true[elem] = True
        else:
            results_false[elem] = False
    # Only return the TRUE values
    return results_true

# Get TRUE matched column values
results_true = checkIfDomainsExists(psv, entityDomainList)
results_true

try this: df = spark.createDataFrame([(i, x) for i, x in enumerate(datasetDomainList)], ['index', 'col1']) - murtihash
@MohammadMurtazaHashmi, hmm, this only creates a df with as many rows as there are items in my list. I made changes to my question to add more clarity. - jgtrz

1 Answer

You don't need to write a function just to filter the values. You can do it in the following ways:

from pyspark.sql import functions as f

df = spark.createDataFrame([(1, 'LeaseStatus'), (2, 'IncludeLeaseInIPM'), (5, 'NonExistantDomain')], ("id", "entity"))
domainList = ['LeaseRecoveryType', 'LeaseStatus', 'IncludeLeaseInIPM', 'LeaseAccountType', 'ClassofUse', 'LeaseType']

# Flag rows whose entity is in the list, then keep only the flagged rows
df.withColumn('Exists', df.entity.isin(domainList)).filter(f.col('Exists')).show()

+---+-----------------+------+
| id|           entity|Exists|
+---+-----------------+------+
|  1|      LeaseStatus|  true|
|  2|IncludeLeaseInIPM|  true|
+---+-----------------+------+

# Or you can filter directly without adding an additional column

df.filter(f.col('entity').isin(domainList)).select('entity').collect()

[Row(entity='LeaseStatus'), Row(entity='IncludeLeaseInIPM')]
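
If the names need to be checked against the dataframe's column names rather than against the values of a column, Spark may not be needed for the comparison at all, since `DataFrame.columns` returns a plain Python list of strings. Here is a minimal sketch of that idea. The helper name `match_columns` and the sample `columns` list (standing in for something like `psv.columns`) are made up for illustration and are not from the question.

```python
def match_columns(columns, names):
    """Split `names` into those present in `columns` and those missing."""
    column_set = set(columns)  # set membership avoids rescanning the list
    matched = [n for n in names if n in column_set]
    missing = [n for n in names if n not in column_set]
    return matched, missing


# Hypothetical stand-in for psv.columns
columns = ['id', 'LeaseStatus', 'IncludeLeaseInIPM']
domainList = ['LeaseRecoveryType', 'LeaseStatus', 'IncludeLeaseInIPM',
              'LeaseAccountType', 'ClassofUse', 'LeaseType']

matched, missing = match_columns(columns, domainList)
print(matched)  # ['LeaseStatus', 'IncludeLeaseInIPM']
print(missing)  # ['LeaseRecoveryType', 'LeaseAccountType', 'ClassofUse', 'LeaseType']
```

With a real dataframe the call would presumably be `match_columns(psv.columns, entityDomainList)`. Both result lists keep the order of the input names.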

Hope it helps.