My problem is that my pyspark job is not running in parallel.
Code and data format:
My PySpark looks something like this (simplified, obviously):
class TheThing:
def __init__(self, dInputData, lDataInstance):
# ...
def does_the_thing(self):
"""About 0.01 seconds calculation time per row"""
# ...
return lProcessedData
#contains input data pre-processed from other RDDs
#done like this because one RDD cannot work with others inside its transformation
#is about 20-40MB in size
#everything in here loads and processes from BigQuery in about 7 minutes
dInputData = {'dPreloadedData': dPreloadedData}
#rddData contains about 3M rows
#is about 200MB large in csv format
#rddCalculated is about the same size as rddData
rddCalculated = (
rddData
.map(
lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
)
)
llCalculated = rddCalculated.collect()
#save as csv, export to storage
Running on Dataproc cluster:
Cluster is created via the Dataproc UI.
Job is executed like this:gcloud --project <project> dataproc jobs submit pyspark --cluster <cluster_name> <script.py>
I observed the job status via the UI, started like this. Browsing through it I noticed that only one (seemingly random) of my worker nodes was doing anything. All others were completely idle.
Whole point of PySpark is to run this thing in parallel, and is obviously not the case. I've run this data in all sorts of cluster configurations, the last one being massive, which is when I noticed it's singular-node use. And hence why my jobs take too very long to complete, and time seems independent of cluster size.
All tests with smaller datasets pass without problems on my local machine and on the cluster. I really just need to upscale.
EDIT
I changedllCalculated = rddCalculated.collect()
#... save to csv and export
torddCalculated.saveAsTextFile("gs://storage-bucket/results")
and only one node is still doing the work.