We are currently working on loading data into Redshift. We have different scenarios here. If the OLTP database is SQL Server residing on premise, then we can consider tool like Attunity that can help loading data to Redshift via S3. Attunity is smart in CDC, that identifies changes reading transaction log, and can apply changes to target accordingly. But this kind of tool is poor in applying transformation logic during the ETL process. Attunity is not a replacement of SSIS or ODI, but good in extracting and loading data from various sources. So for doing the transformation we need a proper ETL tool. We can load data using Attunity in a staging area inside Redshift, and from staging area we can load data to target tables using another ETL tool or using Triggers. As trigger is not supported in Redshift, so what could be that ETL tool? We have not found anything other than AWS Data Pipeline here. But using two tools: Attunity and AWS Data Pipeline might get costly. Is there any other alternative way? We don’t think Data Pipeline can connect to on premise SQL Server. It is only for Amazon ecosystem.
Now let’s consider our on premise SQL Server is now deployed in Amazon RDS. Then the situation might get different. We can still follow the same ETL process described above: using two tools Attunity and AWS Data Pipeline. But this time it should be easier to use only one tool: AWS Data Pipeline. Now is AWS Data Pipeline capable enough to handle all scenarios? We don’t find it can read transaction log. But we should be able to apply other approaches for incremental load. A very common approach is to consider last modified date column with each source table. Then we can identify the rows in RDS Sql Server tables, which are modified from the last load time. But, we cannot take the changed data from RDS to Redshift directly. We will have to use either S3 or DynamoDB. We can make AWS Data Pipeline to use S3 as the route. It again seems like a headache. Maybe there could be some other easier approach. Now again, AWS Data Pipeline is quite new in the competitive market. And a very big limitation to this tool is inability to load data from different sources outside AWS (say Salesforce, Oracle, etc). Is there any other easy to use tool that can work flawlessly inside AWS ecosystem without any difficulty and causing minimal cost?