
I have a PySpark DataFrame with 87 columns. I want to pass each row of the DataFrame to a function and get a list back for each row, so that I can build a new column from it.

PySpark code

UDF:

def make_range_vector(row, categories, ledger):
    print(type(row), type(categories), type(ledger))
    category_vector = []
    for category in categories:
        if row[category] != 0:  # this line raises the ValueError in question 1
            category_percentage = func.round(row[category] * 100 / row[ledger])
            category_vector.append(category_percentage)
        else:
            category_vector.append(0)
    category_vector = sqlCtx.createDataFrame(category_vector, IntegerType())
    return category_vector

Main function

pivot_card.withColumn('category_debit_vector', make_range_vector(struct([pivot_card[x] for x in pivot_card.columns]), pivot_card.columns[3:], 'debit'))

I am a beginner in PySpark, and I can't find answers to the questions below.

  1. if(row[category]!=0): This statement gives me ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

  2. So, I printed the arguments inside the function. It outputs <class 'pyspark.sql.column.Column'> <class 'list'> <class 'str'>. Shouldn't it be a StructType? (See the sketch after this list.)

  3. Can I pass a Row object and do something similar, like we do in Pandas?
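
To make question 2 concrete, here is a minimal check, assuming pivot_card is the DataFrame above. Because make_range_vector is called directly inside withColumn, it runs on the driver while the query plan is being built, so it receives the Column produced by struct(...) rather than per-row data:

from pyspark.sql.functions import struct

# struct(...) only builds a Column expression; nothing is evaluated per row
# yet, which is why the function prints Column, list, and str, and why
# "if row[category] != 0" on a Column raises the ValueError from question 1.
row_arg = struct([pivot_card[x] for x in pivot_card.columns])
print(type(row_arg))  # <class 'pyspark.sql.column.Column'>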

I looked at many sources; my code is mostly adapted from the question "PySpark row-wise function composition" and this Hortonworks post (https://community.hortonworks.com/questions/130866/rowwise-manipulation-of-a-dataframe-in-pyspark.html).

1 Answer


I found the silly mistake I made in the code: instead of calling the UDF, I called the original function. The corrected call is below.

Main function

pivot_card.withColumn('category_debit_vector', make_range_vector_udf(struct([pivot_card[x] for x in pivot_card.columns]), pivot_card.columns[3:], 'debit'))
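
The definition of make_range_vector_udf is not shown in the post; a minimal sketch of the registration step it implies, assuming make_range_vector is changed to return a plain Python list of integers (with the sqlCtx.createDataFrame call inside it dropped):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# Assumed registration step, not part of the original answer. With the
# function returning a plain list, ArrayType(IntegerType()) is the natural
# return type; a UDF should not build DataFrames internally.
make_range_vector_udf = udf(make_range_vector, ArrayType(IntegerType()))

Note that even with this registration, the extra arguments (pivot_card.columns[3:] and 'debit') are not Columns, which is what the EDIT below runs into.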

EDIT

I have since understood that extra, non-Column arguments cannot be passed to a UDF directly. Thanks.
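
For completeness, the usual workaround is to bake the extra arguments in with a closure (functools.partial works equally well). This is a minimal sketch under the same assumptions as above, not the original poster's exact code:

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import ArrayType, IntegerType

def make_range_vector_udf(categories, ledger):
    # categories (a list of column names) and ledger (one column name) are
    # captured by the closure; only the struct of row values is a Column.
    def make_range_vector(row):
        return [
            int(round(row[c] * 100.0 / row[ledger])) if row[c] != 0 else 0
            for c in categories
        ]
    return udf(make_range_vector, ArrayType(IntegerType()))

pivot_card = pivot_card.withColumn(
    'category_debit_vector',
    make_range_vector_udf(pivot_card.columns[3:], 'debit')(
        struct([pivot_card[x] for x in pivot_card.columns])
    ),
)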