1 vote

I'm currently using the code below to load data into BigQuery from a PySpark job on a Dataproc cluster, but it either takes far too long to process or gets terminated with an execution-time-exceeded error. Is there a better and faster way to load a Spark DataFrame into BigQuery?

output.write \
      .format("bigquery") \
      .option("table","{}.{}".format(bq_dataset, bq_table)) \
      .option("temporaryGcsBucket", gcs_bucket) \
      .mode('append') \
      .save()

Below is my Dataproc cluster configuration:

Master node: Standard (1 master, N workers)
    Machine type: n1-standard-4
    Number of GPUs: 0
    Primary disk type: pd-standard
    Primary disk size: 500GB
Worker nodes: 3
    Machine type: n1-standard-4
    Number of GPUs: 0
    Primary disk type: pd-standard
    Primary disk size: 500GB
Image version: 1.4.30-ubuntu18
What is the size of the data? What is the size of the cluster - how many executors, CPUs, memory? - David Rabinowitz
df.count() or df.show() runs indefinitely and never completes, not sure why, but I'm guessing it should be no more than 200-300 rows. I have added the cluster configuration as part of the question. - Tracy

1 Answer

2 votes

Please make sure you are using the latest version of the spark-bigquery-connector.
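As a rough sketch (one of several ways to do it), the connector version can be pinned from the PySpark job itself via spark.jars.packages before the SparkSession is created; the com.google.cloud.spark:spark-bigquery-with-dependencies_2.11 artifact matches the Scala 2.11 build used by Dataproc 1.4, and the version below is only an example - check the connector's releases page for the current one:

from pyspark.sql import SparkSession

# Pin an explicit connector release instead of relying on whatever the image
# ships with. 0.17.3 is only an example version; use the latest release.
spark = SparkSession.builder \
    .appName("bq-load") \
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.3") \
    .getOrCreate()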

Try testing your code with the other intermediate formats, such as Avro, ORC, and Parquet. Avro tends to perform better with larger data.
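For example, the staging format can be switched per write with the connector's intermediateFormat option. This is only a sketch, assuming output, bq_dataset, bq_table and gcs_bucket are defined as in the question; note that on Spark 2.4 the Avro path also needs the spark-avro package available on the classpath:

# Same write as in the question, but staging the data as Avro instead of
# the default Parquet before the load into BigQuery.
output.write \
      .format("bigquery") \
      .option("table", "{}.{}".format(bq_dataset, bq_table)) \
      .option("temporaryGcsBucket", gcs_bucket) \
      .option("intermediateFormat", "avro") \
      .mode("append") \
      .save()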

If the data you are writing is really huge, try adding more workers or choosing a different machine type.