
I am performing a simple filter operation on a PySpark DataFrame that has a MinHash Jaccard similarity column.

# MinHash signature of the query document, rendered as strings,
# e.g. minhash_sig == ['123', '345']
minhash_sig = [str(x) for x in minhash.signature(doc)]
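
For context, here is roughly how a signature like this can be produced. This is a minimal sketch assuming the datasketch library; the actual minhash object in my code may use a different implementation, and num_perm and the whitespace tokenization are placeholders:

from datasketch import MinHash

# Build a MinHash over the document's tokens.
m = MinHash(num_perm=128)  # num_perm is a placeholder choice
for token in doc.split():  # naive whitespace tokenization
    m.update(token.encode('utf8'))

# String-typed signature values, matching the arrays stored in the DataFrame.
minhash_sig = [str(x) for x in m.hashvalues]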


from pyspark.sql.functions import array, array_intersect, array_union, lit, size

df = spark.createDataFrame(....)  # DataFrame with 100,000 rows;
# columns are id and minhash_array (arrays of minhash signatures)

# Attach the query signature as a constant array column.
df = df.withColumn('minhash_array0', array([lit(i) for i in minhash_sig]))

# Jaccard similarity = |intersection| / |union| of the two signature arrays.
intersect = size(array_intersect('minhash_array0', 'minhash_array'))
union = size(array_union('minhash_array0', 'minhash_array'))
df = df.withColumn('minhash_sim', intersect / union)

df = df.filter(df.minhash_sim > 0.5)
df.head()

I've tried df.head() before the filter, and it only takes a few seconds to complete.

The same df.head() call after the filter doesn't complete within 15 minutes of runtime. I've checked the number of partitions of the DataFrame, and it's only 4.
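
For reference, the partition count can be inspected through the underlying RDD:

# Number of partitions backing the DataFrame (returns 4 here).
df.rdd.getNumPartitions()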

Should I reduce the number of partitions? Is there any other solution to reduce compute time?
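
For what it's worth, changing the partition count would look like this; the target counts here are placeholders:

# Reduce the number of partitions without a full shuffle.
df = df.coalesce(2)

# Or increase parallelism at the cost of a shuffle.
df = df.repartition(16)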


1 Answer


I was able to solve the issue by upgrading the cluster from m4.large to c5.2xlarge instances.