
I've been using Spark for a couple of years now, and my new team uses Redshift. I've successfully connected the Spark and Redshift clusters and can run Redshift queries via Spark and unload the results into S3.

If I understand correctly, when I generate a dataframe in spark-redshift, the actual heavy lifting is done by Redshift itself, not by Spark. I've been running the same queries on both platforms, and the Spark version takes about twice as long. Since the query itself is solved by Redshift in both cases, I believe the additional time is network I/O from Redshift to the Spark cluster.

Spark (pyspark) really just acts as a collector of the query results in the form of a convenient dataframe, which I can then use to parallelize machine learning methods with its libraries.
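
For reference, this is roughly what my read side looks like. The JDBC URL, S3 tempdir, IAM role and table/column names below are placeholders, and I'm assuming the databricks spark-redshift data source:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-to-spark").getOrCreate()

# Placeholder connection details; the query below is executed by Redshift,
# its result is unloaded to the tempdir, and Spark reads it back from S3.
df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/mydb?user=USER&password=PASS")
      .option("tempdir", "s3n://my-bucket/spark-redshift-tmp/")
      .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-s3-access")
      .option("query", "SELECT id, day_of_week, created_at FROM events "
                       "WHERE created_at >= '2017-01-01'")
      .load())

df.show()
```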

How accurate is this depiction?


EDIT: I did a quick test: I fired up Spark locally (16 GB machine) and ran a spark-redshift query over around 750 million records that returned a small 7x2 dataframe (one row for each day of the week, with the distribution of cases). It took about 3 seconds for the result to be displayed locally in my Spark shell, and about 1.2 seconds for the query to finish on Redshift standalone. My 16 GB machine could not possibly process that much data so quickly, and monitoring CPU/network shows there's minimal activity during the query. Unless I'm misunderstanding something, it really looks like the heavy aggregation and other processing is done at the Redshift level, not the Spark level.
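
A sanity check on the Spark side points the same way (connection details are placeholders again, and the events table/column are made up): with the whole aggregation inside the query option, the physical plan is just a scan of the unloaded result, and the dataframe Spark receives has only 7 rows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details and a made-up events table/column.
by_day = (spark.read
          .format("com.databricks.spark.redshift")
          .option("url", "jdbc:redshift://host:5439/mydb?user=USER&password=PASS")
          .option("tempdir", "s3n://my-bucket/spark-redshift-tmp/")
          .option("query",
                  "SELECT day_of_week, count(*) AS cases FROM events GROUP BY 1")
          .load())

# No Exchange/HashAggregate appears on the Spark side, just a scan of the
# unloaded result; the dataframe holds only the 7-row distribution by weekday.
by_day.explain()
print(by_day.count())  # 7
```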


1 Answer


If I understand correctly, when I generate a dataframe in spark-redshift, the actual heavy lifting is done by Redshift itself, not by Spark.

That is not correct. The Spark Data Source API can delegate a small part of the job (projections and simple filters) to the external source, but most of the work is done by Spark itself.
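
A minimal sketch of how far that delegation goes, with a hypothetical events table and placeholder connection details: the column pruning and the simple filter below end up in the query Redshift unloads, but the groupBy/count runs as a Spark job over everything that came back from S3:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (spark.read
          .format("com.databricks.spark.redshift")
          .option("url", "jdbc:redshift://host:5439/mydb?user=USER&password=PASS")
          .option("tempdir", "s3n://my-bucket/spark-redshift-tmp/")
          .option("dbtable", "events")   # whole table exposed to Spark
          .load())

# Delegated to Redshift: column pruning and the simple comparison filter.
recent = (events
          .select("day_of_week", "created_at")
          .filter(F.col("created_at") >= "2017-01-01"))

# NOT delegated: the aggregation runs as a Spark job over the rows that were
# unloaded to S3 and read back, not inside Redshift.
per_day = recent.groupBy("day_of_week").count()
per_day.show()
```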

The whole process is quite ugly (sketched below):

  • Spark executes the query.
  • Redshift UNLOADs the result of the query to S3.
  • Spark reads the data from S3.
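
Roughly, steps 2 and 3 look like this (the UNLOAD in the comment is only an approximation of what the connector generates, and the bucket, role and table names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 2 is performed by Redshift, triggered by the connector. Roughly:
#
#   UNLOAD ('SELECT day_of_week, count(*) FROM events GROUP BY 1')
#   TO 's3://my-bucket/spark-redshift-tmp/unload_'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-access';
#
# Step 3 is performed by Spark: read the delimited files the UNLOAD produced.
df = spark.read.csv("s3://my-bucket/spark-redshift-tmp/", sep="|")
df.show()
```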