2 votes

I have migrated a portion of a C application to run on Dataproc using PySpark jobs (reading from and writing to BigQuery; the amount of data is around 10 GB). The C application runs in 8 minutes in the local data centre but takes around 4 hours on Dataproc. Could someone advise me on an optimal Dataproc configuration? At present I am using the one below:

--master-machine-type n2-highmem-32 --master-boot-disk-type pd-ssd --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n2-highmem-32 --worker-boot-disk-type pd-ssd --worker-boot-disk-size 500 --image-version 1.4-debian10
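For reference, these flags correspond to a cluster-creation command roughly like the following (the cluster name and region are placeholders, not my actual values):

```
gcloud dataproc clusters create example-cluster \
    --region us-central1 \
    --master-machine-type n2-highmem-32 \
    --master-boot-disk-type pd-ssd \
    --master-boot-disk-size 500 \
    --num-workers 2 \
    --worker-machine-type n2-highmem-32 \
    --worker-boot-disk-type pd-ssd \
    --worker-boot-disk-size 500 \
    --image-version 1.4-debian10
```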

I would really appreciate any help on an optimal Dataproc configuration.

Thanks, RP

What is the hardware configuration in your data center? You are specifying `--num-workers 2`. For jobs that benefit from parallelization, two worker nodes will not provide much of a benefit, if any, once you factor in job overhead. Edit your question with details on both environments and the code that is executing. As a tip, n2-highmem-32 is a small VM; my desktop is probably 10x as fast. When comparing systems, compare systems that are equal in memory, CPU, network and disk I/O. - John Hanley
Could you share the command that you use to run this job on Dataproc? Also, how do you parallelize processing in Spark? What data is processed and how do you partition it? - Igor Dvorzhak

1 Answer

0 votes

Here are some good articles on job performance tuning on Dataproc: "Spark job tuning tips" and "10 questions to ask about your Hadoop and Spark cluster performance".
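As a concrete illustration of the kind of tuning those articles cover, here is a minimal PySpark sketch assuming the job uses the spark-bigquery connector; the project, dataset, table names, temporary bucket, and partition count are placeholders, not values from your setup:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-job").getOrCreate()

# Read the input table through the spark-bigquery connector.
df = (spark.read.format("bigquery")
      .option("table", "my_project.my_dataset.input_table")
      .load())

# Two n2-highmem-32 workers expose roughly 64 cores; if the input arrives in
# only a handful of partitions, most of those cores sit idle. Repartitioning
# before the heavy transformations spreads the work across all of them.
df = df.repartition(128)

# ... transformations ported from the C application ...

# Write back to BigQuery; the indirect write path stages data in a GCS bucket.
(df.write.format("bigquery")
   .option("table", "my_project.my_dataset.output_table")
   .option("temporaryGcsBucket", "my-temp-bucket")
   .mode("overwrite")
   .save())
```

A partition count of roughly two to four times the total worker core count is a common starting point; whether it helps in your case depends on how the ported C logic is expressed in Spark, which is why the comments above ask for the job command and the code.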