0
votes

We are currently working on loading data into Redshift, and we have a few different scenarios.

If the OLTP database is SQL Server residing on premises, we can consider a tool like Attunity, which can load data into Redshift via S3. Attunity handles CDC well: it identifies changes by reading the transaction log and applies them to the target accordingly. However, this kind of tool is poor at applying transformation logic during the ETL process. Attunity is not a replacement for SSIS or ODI, but it is good at extracting and loading data from various sources. So for the transformations we need a proper ETL tool. We could use Attunity to load data into a staging area inside Redshift, and then load from staging into the target tables using another ETL tool or triggers. Since triggers are not supported in Redshift, what could that ETL tool be? We have not found anything other than AWS Data Pipeline. But using two tools, Attunity and AWS Data Pipeline, might get costly. Is there an alternative? We don't think Data Pipeline can connect to an on-premises SQL Server; it seems to be only for the Amazon ecosystem.
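Since Redshift has no triggers, the staging-to-target step has to be an explicit SQL job run by whatever scheduler you choose. A minimal sketch of the common delete-then-insert upsert pattern against a Redshift staging table (all table and column names here are hypothetical, and the SQL would be executed through your Redshift driver of choice):

```python
def build_upsert_sql(staging, target, key_cols, all_cols):
    """Build a Redshift-style upsert: delete the target rows that have a
    match in staging, insert everything from staging, then clear staging,
    all in one transaction."""
    join = " AND ".join(f"{target}.{c} = {staging}.{c}" for c in key_cols)
    cols = ", ".join(all_cols)
    return (
        "BEGIN;\n"
        f"DELETE FROM {target} USING {staging} WHERE {join};\n"
        f"INSERT INTO {target} ({cols}) SELECT {cols} FROM {staging};\n"
        f"TRUNCATE {staging};\n"
        "END;"
    )

# Hypothetical example tables:
sql = build_upsert_sql("stg_orders", "dim_orders",
                       ["order_id"], ["order_id", "status", "updated_at"])
```

The delete-and-insert pairing is the standard workaround because Redshift (unlike SQL Server) has no `MERGE` statement, so an upsert has to be composed from simpler statements inside a transaction.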

Now let's consider our SQL Server deployed in Amazon RDS instead. The situation changes. We could still follow the same ETL process described above with the two tools, Attunity and AWS Data Pipeline, but this time it should be possible to use just one: AWS Data Pipeline. Is AWS Data Pipeline capable enough to handle all scenarios? We don't find that it can read the transaction log, but we should be able to apply other approaches for incremental load. A very common approach is to keep a last-modified-date column on each source table; we can then identify the rows in the RDS SQL Server tables that have been modified since the last load. However, we cannot move the changed data from RDS to Redshift directly: we have to go through either S3 or DynamoDB. We can make AWS Data Pipeline use S3 as the route, but that again seems like a headache; maybe there is some easier approach. Also, AWS Data Pipeline is quite new in a competitive market, and a big limitation of the tool is its inability to load data from sources outside AWS (say Salesforce, Oracle, etc.). Is there any other easy-to-use tool that works flawlessly inside the AWS ecosystem with minimal cost?
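The last-modified-date approach boils down to filtering on a watermark and then advancing it. In the real pipeline this filter would simply be the `WHERE last_modified > ?` clause of the source query, but the bookkeeping can be illustrated in a few lines of Python (the row shape and column name `last_modified` are hypothetical):

```python
from datetime import datetime

def changed_rows(rows, last_load_time):
    """Select rows modified since the previous load and compute the
    new watermark to persist for the next run."""
    changed = [r for r in rows if r["last_modified"] > last_load_time]
    new_watermark = max((r["last_modified"] for r in changed),
                        default=last_load_time)
    return changed, new_watermark

rows = [
    {"id": 1, "last_modified": datetime(2015, 1, 1)},
    {"id": 2, "last_modified": datetime(2015, 1, 5)},
]
changed, wm = changed_rows(rows, datetime(2015, 1, 3))
# changed holds only row id=2; wm advances to 2015-01-05
```

Note the usual caveat with this approach: unlike log-based CDC, it misses hard deletes and any update that does not touch the last-modified column.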

2
If you can simplify your question to a specific topic (e.g. how to do incremental loading from RDS MS-SQL to Redshift), the StackOverflow community will likely be more able to assist you with some answers/suggestions. – John Rotenstein

2 Answers

0
votes

I'd rely on Attunity to bring your OLTP data into a staging area, since it is pretty good at managing that part of the pipeline (though you have to build a fair amount of your own monitoring using repctl) and can be quite cost effective for a part of ETL that is traditionally very expensive to build. Pentaho DI is a good choice as an ETL tool for the procedural components of your ETL process: you can build "triggers" (and it has a few built in) to watch database tables, file systems, FTP sites, queues, etc., and have them run pretty much any kind of process you'd want. There's a nice community edition that has most of the nuts and bolts, and paying for the EE version is worth the money for the support and scheduler.

0
votes

"AWS Data Pipeline might get costly" This amazon service is free.

You can use Amazon Simple Workflow Service (SWF) to schedule the steps of your ETL transformation.