I should preface this by saying that my Redshift cluster has Enhanced VPC Routing enabled, which precludes me from querying S3 data directly from Redshift via Spectrum:
"Your cluster can't have Enhanced VPC Routing enabled. (to use Spectrum)"
My understanding is that AWS Redshift is essentially a high-octane, Postgres-derived columnar data warehouse, optimized for extremely fast analytical reads over large data volumes. So if you have lots of relational data that you want to query/analyze, then Redshift is a good choice for you.
My understanding of AWS Athena is that it's essentially a managed distributed SQL engine (Presto under the hood, similar in spirit to Apache Drill) that provides a SQL interface over any data stored in S3 buckets (relational and otherwise, and in just about any format: unstructured plaintext, JSON, XML, etc.). So if you just have data sitting in S3 that you want to query via SQL-like syntax, Athena is a good choice for you.
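As a concrete example of my mental model, here's roughly how I'd expect to poke at Athena from Python. This is just a boto3 sketch; the database, table, and bucket names are all made up:

```python
# Sketch only -- database/table/bucket names below are placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Kick off a SQL query over files that live in S3
resp = athena.start_query_execution(
    QueryString="SELECT user_id, event_type FROM events LIMIT 10",
    QueryExecutionContext={"Database": "my_s3_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Athena runs queries asynchronously, so poll until the query finishes
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```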
To begin with, can anyone confirm/clarify my understanding above? Assuming I'm more or less correct...
I have structured/relational data (stored as JSON and CSV files) that lives on S3. I'd like to create an ETL process that reads this data out of S3 and dumps it into Redshift so that downstream processes can analyze it.
So I'm thinking about creating a Spark-based ETL pipeline (sketched below) whereby:
- Spark uses Athena to query the S3 data into DataFrames (I'm also wondering if AWS Glue can do some of the heavy lifting here)
- Spark writes the contents of those DataFrames to Redshift
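In PySpark terms, I'm imagining something like the following. Again, just a sketch: it assumes the Simba Athena JDBC driver and the Databricks spark-redshift connector are on the classpath (e.g. via `spark-submit --jars`), and every URL, table name, bucket, and IAM role below is a placeholder:

```python
# Sketch only -- all resource names/URLs are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-to-redshift-etl").getOrCreate()

# Step 1: pull the S3 data into a DataFrame through Athena's JDBC endpoint
df = (
    spark.read.format("jdbc")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option(
        "url",
        "jdbc:awsathena://AwsRegion=us-east-1;"
        "S3OutputLocation=s3://my-athena-results/",
    )
    .option("dbtable", "my_s3_db.events")
    .load()
)

# ...any cleanup/transformation of the semi-structured data goes here...

# Step 2: write the DataFrame out to Redshift; as I understand it, the
# connector stages the data in the S3 tempdir and runs a COPY from there
(
    df.write.format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster:5439/mydb?user=etl&password=XXXX")
    .option("dbtable", "analytics.events")
    .option("tempdir", "s3://my-temp-bucket/redshift-staging/")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/my-redshift-copy-role")
    .mode("append")
    .save()
)
```

The appeal of the connector, if I have it right, is that the Redshift load ends up being a bulk COPY from S3 rather than row-by-row JDBC inserts.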
So my question: is this the most efficient way to port LARGE amounts of semi-structured/relational S3 data (again, stored in various file formats) into Redshift, or is there a better/simpler way?