python - pySpark mapping multiple variables

Question

The code below maps values and column names of my reference df with my actual dataset, finding exact matches and if an exact match is found, return the OutputValue. However, I'm trying to add the rule that when PrimaryValue = DEFAULT to also return the OutputValue.

The solution I'm trying out to tackle this is to create a new dataframe with null values - since there was no match provided by code below. Thus the next step would be to target the null values whose corresponding PrimaryValue = DEFAULT to replace null by the OutputValue.

  #create a map based on columns from reference_df
  map_key = concat_ws('\0', final_reference.PrimaryName, final_reference.PrimaryValue)
  map_value = final_reference.OutputValue

  #dataframe of concatinated mappings to get the corresponding OutputValues from reference table
  d = final_reference.agg(collect_set(array(concat_ws('\0','PrimaryName','PrimaryValue'), 'OutputValue')).alias('m')).first().m
  #display(d)

  #iterate through mapped values 
  mappings = create_map([lit(i) for i in chain.from_iterable(d)])

  #dataframe with corresponding matched OutputValues
  dataset = datasetM.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in matched_List.items()]) 
  display(dataset)

do you mean when the requested col-name from primaryLookupAttributeName_List does not exists in datasetMatchedPortfolio which will yield ERROR? so you want to add a default name to go through the ERROR? — jxc
@jxc, so there's no error. It simply populates the column with null. The dataset will never include DEFAULT , it will have a regular value. When PrimaryLookupAttributeName is DEFAULT then I will like to replace those null (no match found) by the corresponding OutputItemNameByValue. I will update my question with more info! — jgtrz
very likely, you just need coalesce, for example: coalesce(mappings[concat_ws('\0', lit(c), col(c))], lit("DEFAULT")).alias(c_name). make sure to import pyspark.sql.functions.coalesce — jxc
@jxc, sorry, are you proposing to do this to the null df or to include it as part of my initial datasetPrimaryAttributes_False = — jgtrz
@jxc, would your recommend including a udf here with if/else/ elif loop? I have 3 cases for matched: 1) if match is found, copy the outputValue, 2) if DeFAULT, copy outputValue, 3) if no match at all and null, "Lookup not found". Otherwise, my idea so far is to continue building filter down dataframes until my last case and all dataset values have a corresponding updated value. — jgtrz

jxc jxc · Accepted Answer · 2020-05-22T22:36:50

From discussion in comments, I think you just need to add a default mappings from the existing one and then use coalease() function to find the first non-null value, see below:

from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map, coalesce

# skip some old code

d    
#[['LeaseStatus\x00Abandoned', 'Active'],
# ['LeaseStatus\x00DEFAULT', 'Pending'],
# ['LeaseRecoveryType\x00Gross-modified', 'Modified Gross'],
# ['LeaseStatus\x00Archive', 'Expired'],
# ['LeaseStatus\x00Terminated', 'Terminated'],
# ['LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'],
# ['LeaseRecoveryType\x00Gross', 'Gross']]

# original mapping
mappings = create_map([ lit(j) for i in d for j in i ])

# default mapping
mappings_default = create_map([ lit(j.split('\0')[0]) for i in d if i[0].upper().endswith('\x00DEFAULT') for j in i ])
#Column<b'map(LeaseStatus, Pending)'>

# a set of available PrimaryLookupAttributeName
available_list = set([ i[0].split('\0')[0] for i in d ])
# {'LeaseRecoveryType', 'LeaseStatus'}

# use coalesce to find the first non-null values from mappings, mappings_defaul etc
datasetPrimaryAttributes_False = datasetMatchedPortfolio.select("*",*[ 
  coalesce(
    mappings[concat_ws('\0', lit(c), col(c))],
    mappings_default[c],
    lit("Not Specified at Source" if c in available_list else "Lookup not found")
  ).alias(c_name) for c,c_name in matchedAttributeName_List.items()])

Some explanation:

(1) d is a list of lists retrieved from the reference_df, we use a list comprehension [ lit(j) for i in d for j in i ] to flatten this to a list and apply the flattened list to the create_map function:

(2) The mappings_default is similar to the above, but add a if condition to serve as a filter and keep only entries having PrimaryLookupAttributeValue (which is the first item of the inner list i[0]) ending with \x00DEFAULT and then use split to strip PrimaryLookupAttributeValue(which is basically \x00DEFAULT) off from the map_key.

python - pySpark mapping multiple variables

1 Answers