2
votes

Our data warehouse is in Amazon Redshift (about 50 TB). Business users sometimes run big queries (many joins, inline queries generated by BI tools such as Tableau), and these big queries degrade database performance.

  1. Is it wise to use Spark on top of Redshift to offload some of the computation outside Redshift?

  2. Or will it be easier and more cost-effective to increase Redshift's compute power by adding more nodes?

  3. If I execute SELECT a.col1, b.col2 FROM table1 a, table2 b WHERE a.key = b.key in Spark, where the tables reside on Redshift and are connected via JDBC, where does the actual processing happen (in Spark or in Redshift)?


1 Answer

4
votes

Any queries on the data stored in Amazon Redshift are performed by the Amazon Redshift nodes. While Spark could make an external JDBC call, the SQL will be executed by Redshift.
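
For question 3 specifically, it depends on what you hand to the JDBC source. Here is a minimal Spark SQL sketch, assuming a hypothetical cluster endpoint, credentials, and the two tables from the question. Because the whole join is embedded in the dbtable subquery, Redshift executes the join and Spark only receives the result rows:

    -- Spark SQL: register a view backed by a Redshift JDBC subquery.
    -- The endpoint, database, user, and password are placeholders.
    -- The parenthesised subquery in dbtable is shipped to Redshift
    -- and executed there.
    CREATE TEMPORARY VIEW joined_on_redshift
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url 'jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev',
      driver 'com.amazon.redshift.jdbc42.Driver',
      dbtable '(SELECT a.col1, b.col2 FROM table1 a JOIN table2 b ON a.key = b.key) pushed',
      user 'bi_user',
      password 'placeholder'
    );

    -- Spark fetches only the already-joined rows over JDBC.
    SELECT * FROM joined_on_redshift LIMIT 10;

Note that if you instead register table1 and table2 as two separate JDBC views and join them in Spark, Spark pulls the rows out of Redshift and performs the join itself, so offloading work this way requires pushing the heavy SQL down in the subquery.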

There are many techniques to optimize Redshift query execution:

Start by looking at the queries that consume the most resources and determine whether they can be optimized by changing each table's Sort Key, Distribution Key, and Compression Encodings. Correct use of these parameters can greatly improve Redshift performance.
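
As a sketch of that workflow, assuming a hypothetical fact table that is usually joined on customer_id and filtered by date (stl_query is Redshift's query history system table; the sales table and its columns are invented for illustration):

    -- 1. Find the longest-running queries of the past week.
    SELECT query,
           DATEDIFF(seconds, starttime, endtime) AS duration_s,
           TRIM(querytxt) AS sql_text
    FROM stl_query
    WHERE starttime > DATEADD(day, -7, GETDATE())
    ORDER BY duration_s DESC
    LIMIT 20;

    -- 2. Rebuild hot tables with keys that match the dominant access pattern.
    CREATE TABLE sales (
        sale_id     BIGINT        ENCODE az64,
        customer_id INTEGER       ENCODE az64,
        sale_date   DATE          ENCODE az64,
        amount      DECIMAL(10,2) ENCODE az64  -- AZ64 compresses numeric/date types well
    )
    DISTKEY (customer_id)   -- collocates rows joined on customer_id on the same slice
    SORTKEY (sale_date);    -- date-range filters scan far fewer blocks

Running ANALYZE COMPRESSION on an existing table will also recommend encodings based on a sample of its data.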

Then, if many users are running simultaneous queries, check whether it is worth tuning the Workload Management (WLM) settings to create separate queues with different memory and concurrency settings.
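
At the session level, queries can be routed to a specific queue via a query group, and the system table stl_wlm_query shows whether time is being lost to queueing or to execution. A small sketch (the queue name is hypothetical and must match a query group defined in the WLM configuration):

    -- Route this session's queries to a dedicated WLM queue.
    SET query_group TO 'bi_dashboards';
    -- ... run the BI workload ...
    RESET query_group;

    -- Compare queue wait vs. execution time per queue (times are in microseconds).
    SELECT service_class,
           AVG(total_queue_time) / 1000000.0 AS avg_queue_s,
           AVG(total_exec_time)  / 1000000.0 AS avg_exec_s
    FROM stl_wlm_query
    WHERE service_class > 5   -- user-defined queues start at service class 6
    GROUP BY service_class
    ORDER BY service_class;

If the queue wait dominates, giving the BI workload its own queue (or more slots) may help more than adding hardware.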

Finally, if performance is still a problem, add more Redshift nodes. Dense Compute (DC) nodes offer better performance because they use SSD storage, but at a higher cost per TB of storage.