Pyspark job on Dataproc gets stuck at stage 0

Question

I've a Dataproc cluster with 2 worker nodes. My pyspark program is very simple

1) Reads a 500MB data from Bigquery 2) Apply a few UDFs 3) Display results from pyspark SQL dataframe based on some condition

At the third step the jobs gets stuck at stage 0 and does nothing. I'm new to Pyspark but I don't think so the data is huge for it to get hanged. Please help me.

@Adam,

My UDF is from RDkit library. Is it possible to make the UDF efficient so the output is in seconds?

from rdkit import Chem

user_smile_string = 'ONC(=O)c1ccc(I)cc1' 
mol = Chem.MolFromSmiles(user_smile_string)

def Matched(smile_structure):
    try:
        match = mol.HasSubstructMatch(Chem.MolFromSmiles(smile_structure))
    except Exception:
        pass
    else:
        return (match)

Since .show() (the display of the df) is the action, which triggers the transformations, there could be a problem with the UDF's, but this hard to estimate without knowing the actual code, or error message. — Adam Dukkon
Did you check the resource manager and node manager logs? On a Dataproc VM these are under /var/logs/hadoop-yarn/ — cyxxy
You can check DAG to know where exactly execution stuck. Sometimes input dataset is small, however it becomes huge after several operations like join. — kalpesh

Igor Dvorzhak Igor Dvorzhak · Accepted Answer · 2020-02-06T14:29:51

As mentioned in the comments, you need to troubleshoot your job to understand what's happening.

You can start from exploring job driver output, job logs and Spark job DAG that are accessible from Google Cloud UI.

If this will not yield any useful information, then you need to enable debug logging in Spark and go from there.

Pyspark job on Dataproc gets stuck at stage 0

1 Answers