
I should preface this with the fact that I'm using Enhanced VPC Routing for my AWS account, which precludes me from using the traditional S3 to Redshift querying:

"Your cluster can't have Enhanced VPC Routing enabled. (to use Spectrum)"


My understanding is that AWS Redshift is a high-octane Postgres-as-a-service that is optimized for extremely fast reads over large data volumes. So if you have lots of relational data that you want to query/analyze, then Redshift is a good choice for you.

My understanding of AWS Athena is that it's just using something like Apache Drill (or similar) to provide a SQL-like interface over any data stored in S3 buckets (relational and otherwise, in any format: unstructured plaintext, JSON, XML, etc.). So if you just have data in S3 that you want to query via SQL-like syntax, Athena is a good choice for you.

To begin with, can anyone confirm/clarify my understanding above? Assuming I'm more or less correct...

I have structured/relational data (stored in JSON and CSV files) that lives on S3. I'd like to create an ETL process that reads this data out of S3 and dumps it into Redshift so that downstream processes can analyze it.

So I'm thinking about creating a Spark-based ETL pipeline whereby:

  1. Spark uses Athena to query S3 data into DataFrames; I'm also wondering if AWS Glue can possibly do some heavy lifting here
  2. Spark writes the contents of those DataFrames to Redshift

So my question: is this the most efficient way to port LARGE amounts of partially-structured/relational S3 data (again stored in various file formats) into Redshift, or is there a better/simpler way?

Athena is in fact the Presto database under the cover - Piotr Findeisen
Redshift Spectrum now supports Enhanced VPC Routing: picnicerror.net/development/aws/… - GShenanigan

1 Answer


EDIT: Since your Redshift cluster does not have any access to S3 whatsoever (due to Enhanced VPC Routing), the option I see here is to use JDBC to write to Redshift.

Regarding Athena: Since you're using Spark, you don't need Athena here - Spark can read data from S3 and create a DataFrame out of it.

Regarding Glue: Glue is a serverless Spark offering which comes with the Glue Catalog (a metastore), crawlers to discover tables, and a Spark code generator for simple transformations. The best part is that it also runs plain Spark code. I don't know how you run your Spark workloads, but this could be an option.

Writing a DataFrame to a JDBC endpoint in Spark could look like this:

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read the raw files from S3 into a DataFrame
val dataframe = spark
    .read
    .format("csv") // or "json"
    .load("s3://your-s3-path")

// JDBC connection properties (user, password, driver, ...)
val connectionProps = new Properties()
connectionProps.setProperty(...)

// Write the DataFrame to Redshift over JDBC
dataframe
    .write
    .jdbc("redshift-url", "tablename", connectionProps)

For the details about JDBC connection properties, have a look at the Spark DataFrameWriter documentation.

Please note: As I mentioned in a comment - loading data into Redshift via the JDBC interface is discouraged, since all the traffic goes through the leader node, so it cannot utilize Redshift's parallelism properly.


There is a simpler way, even two:

  1. To load data from S3 to Redshift you don't need any external ETL process (provided you don't have to transform the data). Redshift supports loading data from S3 directly:

    copy venue from 's3://mybucket/venue'
    iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
    delimiter '|';
    
  2. The other option is to use Redshift Spectrum (which is very similar to AWS Athena, but with the Redshift SQL engine instead of Presto), which allows you to expose S3 buckets as tables without the need to move data anywhere. You just create an EXTERNAL SCHEMA and an EXTERNAL TABLE, and your table is visible and queryable from Redshift (and also from Athena). More in the official AWS documentation.
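For example, exposing the same venue data as an external table could look roughly like this (the schema, database, and column names here are illustrative, not from your setup):

```sql
create external schema spectrum_schema
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
create external database if not exists;

create external table spectrum_schema.venue (
    venueid   integer,
    venuename varchar(100)
)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://mybucket/venue/';
```

After that, `select * from spectrum_schema.venue` runs against the files in S3 without loading them into the cluster.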

Please Note: Redshift supports only flat schemas, so if you have any struct, array, or map in your schema, you will not be able to load it into Redshift before you flatten it somehow (AWS Glue provides a relationalize function to do that automatically). Since Presto DOES support nested schemas, you CAN use Athena to query complex nested schemas.
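To illustrate what "flattening" means here, a minimal plain-Python sketch (not Glue's actual relationalize API - just the idea) that turns a nested record into the flat shape a Redshift row requires:

```python
def flatten(record, prefix=""):
    """Recursively flatten nested dicts into a single-level dict
    with dotted keys - the flat shape a Redshift row requires."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

row = flatten({"venue": {"id": 1, "name": "Arena"}, "city": "NYC"})
# row == {"venue.id": 1, "venue.name": "Arena", "city": "NYC"}
```

Glue's relationalize additionally splits out arrays into separate tables with join keys, which a sketch like this doesn't cover.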