I'm facing an issue when mixing Python's map and lambda functions with PySpark in a Spark environment.
Given df1, my source dataframe:
Animals  | Food   | Home
---------+--------+--------
Monkey   | Banana | Jungle
Dog      | Meat   | Garden
Cat      | Fish   | House
Elephant | Banana | Jungle
Lion     | Meat   | Desert
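For reproducibility, df1 can be built like this (just a sketch of the setup; spark is the usual SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Recreate the table above as a DataFrame.
df1 = spark.createDataFrame(
    [("Monkey", "Banana", "Jungle"),
     ("Dog", "Meat", "Garden"),
     ("Cat", "Fish", "House"),
     ("Elephant", "Banana", "Jungle"),
     ("Lion", "Meat", "Desert")],
    ["Animals", "Food", "Home"])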
I want to create another dataframe, df2. It should have two columns, with one row per column of df1 (3 rows in my example). The first column holds the name of a df1 column; the second holds an array of that column's most frequent values (top n=3 in the example below), each paired with its count. A concrete single-column sketch follows the table.
Column  | Content
--------+-------------------------------------------------
Animals | [("Cat", 1), ("Dog", 1), ("Elephant", 1)]
Food    | [("Banana", 2), ("Meat", 2), ("Fish", 1)]
Home    | [("Jungle", 2), ("Desert", 1), ("Garden", 1)]
I tried to do it with Python's list, map, and lambda functions, but ran into conflicts with the PySpark functions:
import pyspark.sql.functions as F

def transform(df1):
    # Number of entries to keep per row of df2
    n = 3
    # Add a column used to count occurrences
    df1 = df1.withColumn("future_occurences", F.lit(1))
    # Attempt: map a lambda over df1.columns inside create_map
    df2 = df1.withColumn("Content",
        F.array(
            F.create_map(
                lambda x: (x,
                    [
                        str(row[x]) for row in df1.groupBy(x).agg(
                            F.sum("future_occurences").alias("occurences")
                        ).orderBy(
                            F.desc("occurences")
                        ).select(x).limit(n).collect()
                    ]
                ), df1.columns
            )
        )
    )
    return df2
The error is:
TypeError: Invalid argument, not a string or column: <function <lambda> at 0x7fc844430410> of type <type 'function'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
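As far as I understand, F.create_map (like the other functions in pyspark.sql.functions) only accepts Column objects or column-name strings, so passing a Python lambda raises this TypeError; DataFrame columns cannot be mapped over with map/lambda the way Python lists can. The workaround I can think of drives the loop in plain Python and uses PySpark only for the aggregation. A minimal sketch, assuming spark is the active SparkSession (note the order among tied counts is not deterministic):

import pyspark.sql.functions as F

def transform(df1, n=3):
    rows = []
    for c in df1.columns:
        # Count occurrences per value and keep the n most frequent.
        top = (df1.groupBy(c)
                  .count()
                  .orderBy(F.desc("count"))
                  .limit(n)
                  .collect())
        rows.append((c, [(str(r[c]), r["count"]) for r in top]))
    # Build df2 from the collected (column, [(value, count), ...]) pairs.
    return spark.createDataFrame(rows, ["Column", "Content"])

On the sample data this should produce rows like ("Food", [("Banana", 2), ("Meat", 2), ("Fish", 1)]), but it collects to the driver once per column.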
Is there a cleaner, more idiomatic way to fix this, ideally without collecting to the driver?
Thanks a lot!
… union the results. How do you break ties? Why is it Cat, Dog, Elephant when the other two animals also have a count of 1? – pault