I have migrated a portion of a C application to run on Dataproc as PySpark jobs (reading from and writing to BigQuery; around 10 GB of data). The C application runs in 8 minutes in our local data centre, but the migrated job takes around 4 hours on Dataproc. Could someone advise me on the optimal Dataproc configuration? At present I am using the one below:
--master-machine-type n2-highmem-32 --master-boot-disk-type pd-ssd --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n2-highmem-32 --worker-boot-disk-type pd-ssd --worker-boot-disk-size 500 --image-version 1.4-debian10
I would really appreciate any help with the optimal Dataproc configuration.
Thanks, RP