I am using Redhshift spectrum. I created an external table and uploaded a csv data file on S3 with around 5.5 million records. If fire a query on this external table, it is taking ~15 seconds whereas If I run same query on Amazon redshift, I was getting same result in ~2 seconds. What could be the reason for this performance lag where AWS claim it to be be very high performance platform. Please suggest solution for same performance using spectrum.
4 Answers
For your performance optimizations please have a look to understand your query.
Right now, the best performance is if you don't have a single CSV file but multiple. Typically, you could say you get great performance if the number of files per query is at least about an order of magnitude larger than the number of nodes of your cluster.
In addition, if you use Parquet files you get the advantage of a columnar format on S3 rather than reading CSV which will read the whole file from S3 - and decreases your cost as well.
You can use the script to convert data to Parquet:
Reply from AWS forum as follows :
I understand that you have the same query running on Redshift & Redshift Spectrum. However, the results are different, while one run in 2 seconds the other run in around 15 seconds.
First of all, we must agree that both Redshift and Spectrum are different services designed differently for different purpose. Their internal structure varies a lot from each other, while Redshift relies on EBS storage, Spectrum works directly with S3. Redshift Spectrum's queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remain in Amazon S3.
Spectrum is also designed to deal with Petabytes of data structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables while Redshift offers you the ability to store data efficiently and in a highly-optimez manner by means of Distribution and Sort Keys.
AWS does not advertise Spectrum as a faster alternative to Redshift. We offer Amazon Redshift Spectrum as an add-on solution to provide access to data stored in Amazon S3 without having to load it into Redshift (similar to Amazon Athena).
In terms of query performance, unfortunately, we can't guarantee performance improvements since Redshift Spectrum layer produces query plans completely different from the ones produced by Redshift's database engine interpreter. This particular reason itself would be enough to discourage any query performance comparison between these services as it is not fair to neither of them.
About your question about on the fly nodes, Spectrum adds them based on the demands of your queries, Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing. There aren't any specific criteria to trigger this behavior, however, bearing in mind that by following the best practices about how to improve query performance[1] and how to create data files for queries[2] you can potentially improve the overall Spectrum's performance.
For the last, I would like to point some interesting documentation to clarify you a bit more about how to achieve better performance improvements. Please see the references at the end!
These results are to be expected. The whole reason for using Amazon Redshift is that it stores data in a highly-optimized manner to provide fast queries. Data stored outside of Redshift will not run anywhere near as fast.
The intention of Amazon Redshift Spectrum is to provide access to data stored in Amazon S3 without having to load it into Redshift (similar to Amazon Athena), but it makes no performance guarantees.
I am somewhat late to answer this. As of Feb-2018, AWS is supporting the AWS Spectrum queries on files in columnar formats like Parquet, ORC etc. In your case, you are storing the file as .CSV. CSV is row based which results in pulling out the entire row for any field queried. I will suggest you to convert the files from .csv to Parquet format before querying. This will certainly result in much faster performance. Details from AWS: Amazon Redshift Spectrum