I'm currently using the code below to load data into BigQuery from a PySpark cluster (Dataproc), but it either takes far too long to process or gets terminated with an execution-time-exceeded error. Is there a better and faster way to load a Spark DataFrame into BigQuery?
output.write \
    .format("bigquery") \
    .option("table", "{}.{}".format(bq_dataset, bq_table)) \
    .option("temporaryGcsBucket", gcs_bucket) \
    .mode("append") \
    .save()
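For reference, one alternative I've been considering (untested) is the connector's direct write method, which skips the temporary GCS staging bucket, combined with coalescing since the data is small. This is only a sketch and assumes a spark-bigquery-connector version that supports writeMethod=direct:

# Sketch only, not verified on my cluster: write via the Storage Write API
# (direct method) instead of staging through a temporary GCS bucket.
# Requires a connector version that supports the "writeMethod" option.
output.coalesce(1) \
    .write \
    .format("bigquery") \
    .option("table", "{}.{}".format(bq_dataset, bq_table)) \
    .option("writeMethod", "direct") \
    .mode("append") \
    .save()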
Below is my Dataproc cluster configuration:
Master node: Standard (1 master, N workers)
Machine type: n1-standard-4
Number of GPUs: 0
Primary disk type: pd-standard
Primary disk size: 500 GB
Worker nodes: 3
Machine type: n1-standard-4
Number of GPUs: 0
Primary disk type: pd-standard
Primary disk size: 500 GB
Image version: 1.4.30-ubuntu18
df.count() or df.show() runs indefinitely and never completes. I'm not sure why, since it should not be more than 200-300 rows, and I have added the cluster configuration as part of the question. - Tracy
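If it's useful, here is a minimal diagnostic sketch I can run (assuming the DataFrame is named df) to check whether the hang is in the upstream read rather than the BigQuery write:

# How many partitions does Spark report for this DataFrame?
print("partitions:", df.rdd.getNumPartitions())

# Materialize a tiny sample; if this also hangs, the problem is
# likely the source read, not the BigQuery write itself.
print(df.limit(10).collect())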