I have a Google Dataproc cluster running and am submitting a PySpark job to it that reads a file from Google Cloud Storage (a 945 MB CSV file with 4 million rows; reading it takes about 48 seconds in total) into a PySpark DataFrame and applies a function to that DataFrame via parsed_dataframe = raw_dataframe.rdd.map(parse_user_agents).toDF(), which takes about 4 or 5 seconds.
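For context, the read-and-parse step looks roughly like this. It's a simplified sketch assuming Spark 2.x on the cluster; the input path is a placeholder and parse_user_agents stands in for my real parsing logic:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("parse_user_agents").getOrCreate()

# Read the 945 MB CSV from GCS into a DataFrame (this is the ~48 second step).
raw_dataframe = spark.read.csv("gs://somefolder/input.csv", header=True, inferSchema=True)

def parse_user_agents(row):
    # Placeholder for the real parsing: take a Row, return a Row with an extra derived field.
    parsed = row.asDict()
    parsed["ua_parsed"] = parsed.get("user_agent", "")
    return Row(**parsed)

# The map is a lazy transformation, so this line returns quickly (~4-5 seconds);
# the actual parsing work only runs when the result is written out.
parsed_dataframe = raw_dataframe.rdd.map(parse_user_agents).toDF()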
I then need to save those modified results back to Google Cloud Storage as a GZIP'd CSV or Parquet file. I could also save them locally and then copy them to a GCS bucket.
I repartition the DataFrame via parsed_dataframe = parsed_dataframe.repartition(15)
and then try saving that new DataFrame with any of the following:
parsed_dataframe.write.parquet("gs://somefolder/proto.parquet")
parsed_dataframe.write.format("com.databricks.spark.csv").save("gs://somefolder/", header="true")
parsed_dataframe.write.format("com.databricks.spark.csv").options(codec="org.apache.hadoop.io.compress.GzipCodec").save("gs://nyt_regi_usage/2017/max_0722/regi_usage/", header="true")
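For completeness, here is what I understand to be the equivalent write using the built-in DataFrameWriter instead of the spark-csv package (a sketch assuming the cluster runs Spark 2.x, where CSV output and a compression option are built in; the output paths are placeholders):

# Parquet with the default (snappy) compression
parsed_dataframe.write.mode("overwrite").parquet("gs://somefolder/proto.parquet")

# GZIP'd CSV via the built-in writer (Spark 2.0+), without the com.databricks.spark.csv package
parsed_dataframe.write.mode("overwrite").option("header", "true").option("compression", "gzip").csv("gs://somefolder/csv_out/")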
Each of those methods (and their variants with fewer/more partitions, and saving locally vs. to GCS) takes over 60 minutes for the 4 million rows (945 MB), which is a pretty long time.
How can I optimize this/make saving the data faster?
It's worth noting that both the Dataproc cluster and the GCS bucket are in the same region/zone, and that the cluster has an n1-highmem-8 (8 vCPUs, 52 GB memory) master node with 15+ worker nodes (these are just variables I'm still testing out).
Comment: what about write.parquet("hdfs:///tmp/foo")? – Dennis Huo