0 votes

Our team is trying to create an ETL pipeline into Redshift to serve as our data warehouse for reporting. We are using Microsoft SQL Server and have partitioned our database out into 40+ data sources. We are looking for a way to pipe the data from all of these identical data sources into a single Redshift database.

Looking at AWS Glue, it doesn't seem possible to achieve this. Since the job script is open to be edited by developers, I was wondering if anyone else has had experience with looping through multiple databases and transferring the same table into a single data warehouse. We are trying to avoid creating a separate job for each database... unless we can programmatically loop through the databases and create the jobs that way.
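For what it's worth, one idea we are considering is a single Glue job script that loops over the source databases instead of one job per database. A rough PySpark sketch (the host name, database list, table names, credentials and the "redshift-connection" catalog connection below are placeholders, not our real setup):

    import sys

    from awsglue.context import GlueContext
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # --TempDir is passed automatically when the job has a temporary directory configured.
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
    glue_context = GlueContext(SparkContext.getOrCreate())

    # Placeholder list of the 40+ partitioned databases; in practice this could
    # come from a job argument or a small config table.
    source_databases = ["sales_001", "sales_002", "sales_003"]

    for db_name in source_databases:
        # Read the same table from each SQL Server database over JDBC.
        frame = glue_context.create_dynamic_frame.from_options(
            connection_type="sqlserver",
            connection_options={
                "url": f"jdbc:sqlserver://my-sql-host:1433;databaseName={db_name}",
                "dbtable": "dbo.Orders",   # placeholder table
                "user": "etl_user",        # placeholder credentials
                "password": "REPLACE_ME",
            },
        )

        # Append each database's rows into one Redshift table via a Glue catalog
        # connection named "redshift-connection" (assumed to exist).
        glue_context.write_dynamic_frame.from_jdbc_conf(
            frame=frame,
            catalog_connection="redshift-connection",
            connection_options={"dbtable": "public.orders", "database": "analytics"},
            redshift_tmp_dir=args["TempDir"],
        )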

We've taken a look at DMS as well, which is helpful for getting the schema and current data over to Redshift, but it doesn't seem like it would solve the multiple partitioned data source issue either.

Some of my colleagues had a similar problem and after not being impressed by Amazon Glue they ended up using Matillion. However I think they had to do some fancy logic to coalesce the data from multiple databases. - ColdSolstice
@ColdSolstice Thanks for the info, sad to hear they weren't able to utilize Amazon Glue but I will definitely look into Matillion. Do you know if they used this solution for more than just a data migration and used it to pipe constantly as a data warehouse? - jetset
SQL Server supports partitioning and columnstores too. Are you sure you need a new database instead of a better schema? What does 40 datasources mean? Why not a single data warehouse? Why not use star schemas and columnstores? - Panagiotis Kanavos
@LucaKlaassen 1st) how much data are we talking about? Why 40 servers instead of more disks in the RAID array, or OLAP cubes, or columnstores? 2nd) SQL Server supports federated views since 2000. You can have one view that references the other tables and as long as they have the proper constraints, the optimizer knows to ask the appropriate linked table for data. - Panagiotis Kanavos
@LucasKlaassen on the other hand Lenovo published TPC-H results for a single instance 10TB data warehouse almost 2 years ago. How much data do you have? In that range Redshift would cost $10K-$55K per year - Panagiotis Kanavos

2 Answers

3 votes

This sounds like an excellent use-case for Matillion ETL for Redshift.

(Full disclosure: I am the product manager for Matillion ETL for Redshift)

Matillion is an ELT tool - it will Extract data from your (numerous) SQL Server databases and Load it, via an efficient Redshift COPY, into some staging tables (which can be stored inside Redshift in the usual way, or can be held on S3 and accessed from Redshift via Spectrum). From there you can add Transformation jobs to clean/filter/join (and much more!) into nice queryable star-schemas for your reporting users.

If the table schemas on your 40+ databases are very similar (your question doesn't clarify how you are breaking your data down into those servers - horizontally or vertically), you can parameterise the connection details in your jobs and use iteration to run them over each source database, either serially or with a level of parallelism.

Pushing down transformations to Redshift works nicely because all of those transformation queries can utilize the power of a massively parallel, scalable compute architecture. Workload Management configuration can be used to ensure ETL and user queries can happen concurrently.
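If you manage WLM through a cluster parameter group, the queue definitions can also be set programmatically. A rough boto3 sketch, assuming a hypothetical parameter group named "my-redshift-params" and a simple "etl" query group (just an illustration, not something Matillion requires):

    import json

    import boto3

    redshift = boto3.client("redshift")

    # Hypothetical two-queue setup: a small queue for ETL, the rest for reporting.
    # The last queue (no query_group) acts as the default queue.
    wlm_config = [
        {"query_group": ["etl"], "query_concurrency": 2, "memory_percent_to_use": 40},
        {"query_concurrency": 5, "memory_percent_to_use": 60},
    ]

    redshift.modify_cluster_parameter_group(
        ParameterGroupName="my-redshift-params",  # placeholder parameter group
        Parameters=[
            {
                "ParameterName": "wlm_json_configuration",
                "ParameterValue": json.dumps(wlm_config),
            }
        ],
    )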

Also, you may have other sources of data you want to mash-up inside your Redshift cluster, and Matillion supports many more - see https://www.matillion.com/etl-for-redshift/integrations/.

0 votes

You can use AWS DMS for this.

Steps:

  1. Set up and configure a DMS replication instance.
  2. Set up a target endpoint for Redshift.
  3. Set up a source endpoint for each SQL Server instance (see https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html).
  4. Set up a task for each SQL Server source. You can specify the tables to copy/synchronise, and you can use a transformation rule to specify which schema name(s) on Redshift you want to write to (a rough boto3 sketch of this follows the list).
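If creating 40+ tasks by hand is too tedious, step 4 can be scripted. A rough boto3 sketch, assuming the replication instance, Redshift target endpoint and per-database source endpoints from steps 1-3 already exist (the ARNs, database names and schema names below are placeholders):

    import json

    import boto3

    dms = boto3.client("dms")

    # Placeholder ARNs for resources created in steps 1-3.
    replication_instance_arn = "arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE"
    target_endpoint_arn = "arn:aws:dms:us-east-1:123456789012:endpoint:REDSHIFT"
    source_endpoints = {
        "sales_001": "arn:aws:dms:us-east-1:123456789012:endpoint:SALES001",
        "sales_002": "arn:aws:dms:us-east-1:123456789012:endpoint:SALES002",
    }

    for db_name, source_arn in source_endpoints.items():
        # Copy dbo.* from each source and rename the schema on Redshift so the
        # databases land side by side (e.g. dbo -> sales_001, sales_002, ...).
        table_mappings = {
            "rules": [
                {
                    "rule-type": "selection",
                    "rule-id": "1",
                    "rule-name": "include-dbo",
                    "object-locator": {"schema-name": "dbo", "table-name": "%"},
                    "rule-action": "include",
                },
                {
                    "rule-type": "transformation",
                    "rule-id": "2",
                    "rule-name": "rename-schema",
                    "rule-target": "schema",
                    "object-locator": {"schema-name": "dbo"},
                    "rule-action": "rename",
                    "value": db_name,
                },
            ]
        }

        dms.create_replication_task(
            # Task identifiers only allow letters, digits and hyphens.
            ReplicationTaskIdentifier=f"copy-{db_name}".replace("_", "-"),
            SourceEndpointArn=source_arn,
            TargetEndpointArn=target_endpoint_arn,
            ReplicationInstanceArn=replication_instance_arn,
            MigrationType="full-load-and-cdc",
            TableMappings=json.dumps(table_mappings),
        )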

You will then have all of the data in identical schemas on Redshift.

If you want to query all of those together, you can do that either by running some transformation code inside Redshift to combine them and make new tables, or you may be able to use views.
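For example, if each source database lands in its own Redshift schema, a UNION ALL view can present them as one table. A rough sketch using psycopg2 (the cluster endpoint, credentials, schema and table names are placeholders):

    import psycopg2

    # Placeholder connection details for the Redshift cluster.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="REPLACE_ME",
    )

    schemas = ["sales_001", "sales_002"]  # one schema per migrated database

    # Build one view that stacks the identical tables from every schema,
    # tagging each row with the database it came from.
    selects = " UNION ALL ".join(
        f"SELECT '{s}' AS source_db, * FROM {s}.orders" for s in schemas
    )

    with conn, conn.cursor() as cur:
        cur.execute(f"CREATE OR REPLACE VIEW public.all_orders AS {selects};")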