1 vote

I need to move data from on-premises to AWS Redshift (region1). What is the fastest way?

1) Use AWS Snowball to move the on-premises data to S3 (region1), then use Redshift's SQL COPY command to copy the data from S3 to Redshift.

2) Use AWS Data Pipeline (note there is no AWS Data Pipeline in region1 yet, so I will set up a pipeline in region2, which is closest to region1) to move the on-premises data to S3 (region1), and another AWS Data Pipeline (region2) to copy the data from S3 (region1) to Redshift (region1) using the AWS-provided template (this template uses RedshiftCopyActivity to copy data from S3 to Redshift).

Which of the above solutions is faster? Or is there another solution? Also, is RedshiftCopyActivity faster than running Redshift's COPY command directly?

Note that this is a one-time movement, so I do not need AWS Data Pipeline's scheduling function.

Here is AWS Data Pipeline's link: AWS Data Pipeline. It says: "AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources..."

How many TB of data? Why do you think AWS Data Pipeline rather than a simple (AWS CLI) upload to S3? I'm not aware of the Data Pipeline service being good in this area. Can you point to a web page showing this? – Jon Scott
@Jon Scott: I added the link. It says: "AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premises data sources." – user389955
Definitely go with the option of loading your data to S3, and don't use Data Pipeline. From there you have the flexibility to use Glue/Athena/Redshift load/EMR, etc. You can gzip your data before you load it and, if practical, organise it in an efficient and effective way (you may need professional help with that). You can either load your data manually over the internet (OK for maybe 1 TB compressed or so), or use Snowball if you have significantly more than that. – Jon Scott
@Jon Scott: Thanks for your answer. It looks like AWS Data Pipeline is only good for scheduled movement, like a daily sync. Internally it calls s3 cp or Redshift's COPY, the same as what I can do manually, so it will not make the movement faster, just "pipelined". – user389955
My issue with Data Pipeline is that it is a bit "flaky", unlike most AWS products. I've used it with a couple of clients and had to abandon it for other methods after a while. It can also be a lot more expensive if you use it continually with a lot of tables. – Jon Scott

1 Answer

1 vote

It comes down to network bandwidth versus the quantity of data.

The data needs to move from the current on-premises location to Amazon S3.

This can either be done via:

  • Network copy
  • AWS Snowball

You can use an online bandwidth calculator to estimate how long the copy would take over your network connection.
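Or do a quick back-of-the-envelope estimate yourself. The figures below (5 TB of data, 500 Mbps of usable bandwidth) are assumptions for illustration only:

    # Rough transfer-time estimate (assumed figures: 5 TB of data, 500 Mbps usable uplink)
    DATA_TB=5
    LINK_MBPS=500
    # 1 TB is roughly 8,000,000 megabits (decimal units), ignoring protocol overhead
    TRANSFER_SECONDS=$(( DATA_TB * 8000000 / LINK_MBPS ))
    echo "Estimated transfer time: $(( TRANSFER_SECONDS / 3600 )) hours"   # ~22 hours for these numbers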

Then, compare that to using AWS Snowball to copy the data.

Pick whichever one is cheaper/easier/faster.
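If the network copy wins, the transfer itself can be as simple as compressing the files and pushing them to S3 with the AWS CLI. This is only a sketch; the local path and bucket name below are placeholders:

    # Compress in place to shrink the transfer (originals are replaced by .csv.gz files)
    gzip /data/export/*.csv
    # Upload only the compressed files; sync can be re-run safely if the connection drops,
    # and the CLI handles multipart uploads for large objects automatically
    aws s3 sync /data/export/ s3://my-ingest-bucket/load/ --exclude "*" --include "*.csv.gz"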

Once the data is in Amazon S3, use the Amazon Redshift COPY command to load it.
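For example, from any PostgreSQL-compatible client such as psql pointed at the cluster endpoint. This is only a sketch: the endpoint, database, table, bucket prefix, and IAM role ARN are placeholders, and it assumes the files were uploaded gzipped as above:

    # PGPASSWORD can be supplied via the environment; psql will otherwise prompt
    psql "host=my-cluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=mydb user=loader" \
         -c "COPY public.my_table
             FROM 's3://my-ingest-bucket/load/'
             IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
             GZIP CSV;"

Giving COPY an S3 prefix rather than a single file lets Redshift load the objects in parallel across its slices, which is the fastest way to ingest once the data is in S3.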

If data is being continually added, you'll need to find a way to send continuous updates to Redshift. This might be easier via network copy.

There is no benefit in using Data Pipeline.