I have a Dataproc cluster with 2 worker nodes. My PySpark program is very simple:
1) Read ~500 MB of data from BigQuery
2) Apply a few UDFs
3) Display results from a PySpark SQL DataFrame based on some condition
At the third step the job gets stuck at stage 0 and does nothing. I'm new to PySpark, but I don't think the data is large enough for it to hang. Please help me.
@Adam,
My UDF uses the RDKit library. Is it possible to make the UDF efficient enough that the output comes back in seconds?
from rdkit import Chem

# Parse the query molecule once, outside the UDF.
user_smile_string = 'ONC(=O)c1ccc(I)cc1'
mol = Chem.MolFromSmiles(user_smile_string)

def Matched(smile_structure):
    # Returns True/False on a successful substructure match;
    # returns None when the SMILES string fails to parse.
    try:
        match = mol.HasSubstructMatch(Chem.MolFromSmiles(smile_structure))
    except Exception:
        return None
    else:
        return match
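One subtlety worth noting: when a row's SMILES fails to parse, `Matched` returns `None`, so the resulting Spark column silently fills with nulls and bad rows are invisible. Here is a pure-Python sketch of that control flow with an explicit fallback value instead; no RDKit is needed to run it, since `parse` is a hypothetical stand-in for `Chem.MolFromSmiles` and the substring check stands in for `mol.HasSubstructMatch(...)`:

```python
# Hypothetical stand-ins: parse() mimics a SMILES parser that raises on bad
# input, and the "I" substring check mimics mol.HasSubstructMatch(...).
def parse(smiles):
    if not smiles or " " in smiles:
        raise ValueError("unparseable SMILES: %r" % smiles)
    return smiles

def matched(smile_structure):
    try:
        candidate = parse(smile_structure)
    except Exception:
        # Return an explicit value instead of falling through with None,
        # so failed rows are distinguishable in the resulting column.
        return False
    return "I" in candidate

print(matched("ONC(=O)c1ccc(I)cc1"))  # True: the iodine token is present
print(matched("bad smiles"))          # False: parse failure handled explicitly
```

Returning a concrete `False` (or a sentinel) for unparseable rows also makes it easy to count how many rows failed, which helps rule out bad input data as the cause of a slow or stuck stage.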
The YARN logs are under /var/logs/hadoop-yarn/.
- cyxxy