0 votes

I have around 500 GB of compressed data in Amazon S3 that I want to load into Amazon Redshift. For that, I have created an external table in AWS Athena and I am trying to load the data into an internal table in Amazon Redshift.

Loading this much data into Amazon Redshift takes more than an hour. The problem is that when I fire the query to load the data, it gets aborted after 1 hour. I tried 2-3 times, but it gets aborted after 1 hour each time. I am using the Aginity tool to fire the query; Aginity shows the query as currently running, with the loader spinning.

More Details: The Redshift cluster has 12 nodes with 2 TB of space per node, of which I have used 1.7 TB. The S3 files are not all the same size: one of them is 250 GB, while some are only a few MB.

I am using the command

create table table_name as select * from athena_schema.table_name

It stops after exactly 1 hour.

Note: I have set the current query timeout in Aginity to 90,000 seconds.
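
Since the abort happens at exactly 1 hour, a server-side timeout seems worth ruling out as well, not just the client-side one. A minimal diagnostic sketch against the standard Redshift system tables (max_execution_time is in milliseconds, so 3600000 would match a 1-hour cutoff):

-- session-level statement timeout (0 means disabled)
show statement_timeout;

-- per-queue execution limits configured in Workload Management (WLM);
-- service classes above 5 are the user-defined queues
select service_class, max_execution_time
from stv_wlm_service_class_config
where service_class > 5;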

Comments:

Why do it via Athena and not just run a COPY command? – Alex

Yes, I agree - a COPY command is likely faster and cheaper! Anyway: is it EXACTLY 1 hour? What size is your Redshift cluster? Are you sure you have enough space? Is your S3 data split into small files? What size? Are they all about the same size? What command are you running exactly? If you are using Spectrum, what SerDe are you using (perhaps show your table definition)? Please answer as many of the above as possible and add the detail to your question. – Jon Scott

@JonScott I have added more details to the question. Thanks. – RohanB

Are the S3 files gzipped? Does that query include the 250GB file? – Jon Scott

You need to split out your files more - make that 250GB file into 1000 smaller files. – Jon Scott

2 Answers

1 vote

I know this is an old thread, but for anyone coming here with the same issue: I realised that, at least in my case, the problem was the Aginity client. It is not related to Redshift or its Workload Manager, only to that particular third-party client. In summary, use a different client, such as SQL Workbench, and run the COPY command from there.

Hope this helps! Carlos C.

More information about my environment:

Redshift:

Cluster Type: Multi Node
Cluster: ds2.xlarge
Nodes: 4
Cluster Version: 1.0.4852

Client Environment:

Aginity Workbench for Redshift
Version 4.9.1.2686 (build 05/11/17)
Microsoft Windows NT 6.2.9200.0 (64-bit)

Network:

Connected to OpenVPN, via SSH port tunneling.
The connection is not being dropped; it remains active throughout. The issue only affects the COPY command.

Command:

copy tbl_XXXXXXX 
from 's3://***************'
iam_role 'arn:aws:iam::***************:role/***************';

S3 Structure:

120 files of 6.2 GB each, plus 20 files of 874 MB each.

Output:

ERROR: 57014: Query (22381) cancelled on user's request

Statistics:

Start:    ***************
End:      ***************
Duration: 3,600.2420863 seconds (almost exactly 1 hour)
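
For anyone debugging a similar cancellation, Redshift's query history can confirm the server-side view of the abort. A generic sketch (the query ID 22381 is the one from the error above; substitute your own):

-- was the query aborted on the server, and what was its text?
select query, starttime, endtime, aborted, trim(querytxt) as sql_text
from stl_query
where query = 22381;

-- any file-level load errors for the same COPY
select query, filename, err_reason
from stl_load_errors
where query = 22381;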
0 votes

I'm not sure if the following answer will solve your exact problem of the timeout at exactly 1 hour. But, based on my experience, the COPY command is the best and fastest way to load data into Redshift, so I feel the timeout issue shouldn't happen at all in your case.

The COPY command in Redshift can load data from Amazon S3, from Amazon EMR, or from remote hosts via SSH.

e.g. a simple copy (this one loads from an EMR cluster's output):

copy sales
from 'emr://j-SAMPLE2B500FC/myoutput/part-*'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;

e.g. using a manifest:

copy customer
from 's3://mybucket/cust.manifest' 
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
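
The manifest itself is a small JSON file in S3 listing the exact objects to load. A minimal sketch of what cust.manifest could contain (the bucket and file names here are placeholders):

{
  "entries": [
    {"url": "s3://mybucket/custdata.part001.gz", "mandatory": true},
    {"url": "s3://mybucket/custdata.part002.gz", "mandatory": true}
  ]
}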

PS: If you divide your data into multiple files (with or without a manifest), the load will be faster, as Redshift loads the files in parallel.
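
As a rule of thumb (an addition of mine, not from the original answer), split the input into a multiple of the cluster's slice count so every slice stays busy during the load. The slice count can be checked with:

-- number of slices available for parallel loading
select count(*) as slice_count from stv_slices;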