5
votes

I'm trying to figure out the most efficient way to extract data from MySQL, transform it, and load it into Redshift in near real-time. Currently we have an overnight ETL process (using Pentaho) which lasts ~40 min, and we want to replace it with a near real-time one (mini-batches every 1-5 minutes).

I found a couple of tools for extracting data from the MySQL binlog (e.g. http://maxwells-daemon.io) with a connector to AWS Kinesis, and I plan to transform and join the data with Apache Spark or AWS Lambda, write it to S3, and from there load it into Redshift with the COPY command.

Does anybody have any suggestions or recommendations regarding this or a similar solution? Thanks in advance!

2
Have you considered making an exact copy of the MySQL tables in Redshift and then running the ETL inside Redshift? You could then use DMS to sync MySQL -> Redshift. DMS uses the binlog, so it can be efficient and up to date. – Jon Scott

2 Answers

2
votes

It really depends on what transformations you want to apply to the data coming from MySQL, and what language and framework are most suitable in your environment.

Below are my suggestions:

  1. Add intermediary storage between MySQL and Redshift, for example S3
  2. Consider a data re-loading mechanism in case a data load to Redshift fails

I would consider AWS Data Pipeline for this because it has ready-to-use templates, retry mechanisms and built-in logging and monitoring.

It could look like this:

MySQL -> S3 -> Redshift (staging) -> SQL -> Redshift (production)

MySQL -> S3

This data pipeline would start with the "Incremental copy of RDS MySQL to S3" template.

You can parameterize the S3 path in which to store the incremental data from MySQL, so it is easier to manage those increments over time using S3 prefixes. If the data on S3 will never be used again, you can delete those objects periodically with S3 Object Lifecycle Management.

However, having the data on S3 provides several other benefits: you can query it with AWS Athena, visualize it with QuickSight, or archive it with the Infrequent Access or Glacier storage class to reduce storage costs while keeping it available for future use.
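For example, querying those increments with Athena only needs an external table over the S3 prefix. A minimal sketch, assuming CSV increments and hypothetical column, bucket and prefix names:

CREATE EXTERNAL TABLE orders_increments (
    id          BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10,2),
    updated_at  TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-etl-bucket/mysql/orders/';   -- hypothetical bucket and prefix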

S3 -> Redshift (staging)

Of course, for your use case the data must end up in Redshift, so I recommend the AWS Lambda Redshift Loader. Its setup is a little bit complex, but once completed successfully it is very close to zero administration, as they claim. See this.

With the AWS Lambda Redshift Loader, every time data arrives on a defined S3 prefix, it is loaded into your Redshift cluster (one or more), and you can have SNS configured to notify you or some logging system about it. There are other options, such as waiting for x files before loading, or loading every y minutes.

Also, you might have a requirement to load only some of the data from S3 to Redshift, for example into a testing or development environment. With the AWS Lambda Redshift Loader you can configure only a specific S3 prefix, e.g. /data/2017/11, to be loaded into Redshift.
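If you ever need to run such a load manually (for example to backfill one prefix into a staging table), it is just a Redshift COPY from that prefix. A minimal sketch, assuming hypothetical table, bucket and IAM role names:

COPY staging.orders
FROM 's3://my-etl-bucket/mysql/orders/2017/11/'          -- hypothetical prefix holding one batch of increments
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'  -- hypothetical role with read access to the bucket
FORMAT AS CSV
TIMEFORMAT 'auto';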

By using the Redshift Loader, you are making the data load asynchronous; as such, it is a bit harder to monitor and control that process. This may or may not be an issue in your case.

If a data load to Redshift fails, you can use the Redshift Loader command-line tooling to reload specific objects into Redshift.

Redshift (staging) -> Redshift (production)

Note that Redshift does not enforce constraints such as unique keys, which means you have to have a mechanism to prevent inserting duplicate rows into Redshift tables. If you don't care about duplicates, this is irrelevant: your data is already in Redshift.

Why does it matter? Because if data retrieved from the source (that is, MySQL) is already in Redshift, you have to know what action you want to take: do you overwrite it, or just ignore it?

Doing it in Redshift itself makes it easy to compare new and existing rows using a unique column and either delete-and-insert or just update. Doing it outside Redshift would probably mean keeping track of the unique keys already in Redshift somewhere else and doing the comparison there. Where? When do you update them? How? Maybe you already have a solution for this.

So, in the previous step you inserted data into Redshift (staging). In order to ensure data integrity when moving it to the production tables, you have to do merging. AWS recommends the following techniques to merge data in Redshift.
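One such technique is a delete-then-insert merge inside a single transaction, using the staging table as the source. A minimal sketch, with hypothetical schema, table and key names:

BEGIN TRANSACTION;

-- Remove production rows that have a newer version in staging
DELETE FROM prod.orders
USING staging.orders
WHERE prod.orders.id = staging.orders.id;

-- Insert all staged rows (both new and updated ones)
INSERT INTO prod.orders
SELECT * FROM staging.orders;

END TRANSACTION;

-- TRUNCATE commits implicitly in Redshift, so empty the staging table outside the transaction
TRUNCATE staging.orders;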

If that's the case, why not do the transformations in SQL as well, so there are fewer components to manage?

You can define your transformation jobs in SQL, store those SQL scripts on S3, and reference them in a SQLActivity for execution on your Redshift cluster alongside the data merging scripts.
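Such a script can be as simple as a plain SQL statement that rebuilds a reporting table from the freshly merged data. A hypothetical example (schema, table and column names are made up):

-- Rebuild a daily aggregate from the merged production table
DELETE FROM prod.daily_order_totals;

INSERT INTO prod.daily_order_totals (order_date, customer_id, total_amount)
SELECT updated_at::DATE, customer_id, SUM(amount)
FROM prod.orders
GROUP BY updated_at::DATE, customer_id;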

AWS Kinesis might also be suitable, given its transformation and storage-to-S3 capabilities. Some of the points raised above will apply to Kinesis as well.

0
votes

MySQL is one of the world's oldest and most reliable open-source relational databases. With great security, reliability, and ease of use, MySQL has emerged as a leading choice for OLTP systems. In many applications, transferring data between MySQL and popular data warehouses like Amazon Redshift becomes pivotal.

This data transfer can be done using custom ETL scripts or by using 3rd-party applications like Hevo Data. In the former method, data is exported from MySQL into CSV/JSON files, then cleaned, transformed, and uploaded to an S3 bucket, from where it is imported into Amazon Redshift using the COPY command. The general format of the COPY command is shown below:

COPY table_name
FROM 's3://my_redshift_bucket/some-path/table_name/'
CREDENTIALS 'aws_access_key_id=my_access_key;aws_secret_access_key=my_secret_key';

Although this method can handle your data efficiently, it has several limitations, such as timeout and compatibility issues during the Redshift COPY process, network failures during the S3 upload, and failures while exporting the MySQL data into files.

Luckily, there is a solution to all these problems. Hevo Data, a fully automated no-code data pipeline and integration platform, helps you bring data from 100+ sources to Amazon Redshift in near real-time without writing any code. You can connect to any data source using Hevo's point & click UI and instantly move data from it to Amazon Redshift. Thus, this is the most efficient way you can transfer data from MySQL to Amazon Redshift.

You can also refer to the blog below for more details on transferring data from MySQL to Amazon Redshift:

MySQL to Redshift