4
votes

What I'm really trying to accomplish is to change the data types of a couple of columns; in particular, a couple of columns that are stored as strings need to be dates. I was trying to use the CREATE TABLE / SELECT approach and hit similar issues: I/O errors or broken connections.
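
For context, the kind of conversion I'm after looks roughly like this (table and column names below are placeholders, and the 'YYYY-MM-DD' format is just an assumption; the point is rebuilding the table with the string columns cast to DATE):

    -- Sketch only: rebuild the table with a hypothetical string column
    -- 'day_str' cast to DATE; the old table would then be dropped and
    -- the new one renamed into place.
    CREATE TABLE actions_new AS
    SELECT weekday,
           TO_DATE(day_str, 'YYYY-MM-DD') AS day,
           month,
           year,
           guid
      FROM actions;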

I figured it might be more efficient to export and re-import the data than to abuse the cluster by reading and writing so much data around in a circle.

I've tried both Aginity Workbench and SQL Workbench J with large timeouts (10 hours) configured. SQL Workbench J managed to run for 2 hours and 12 minutes today before failing with the same error I keep seeing over and over.

An I/O error occurred while sending to the backend.

This is a fair chunk of data... 2,028,448,405 rows at present (I say "at present" because we're adding about 70 million rows per day). But I would expect that Redshift would handle this easily.

UNLOAD ('select weekday, day, month, year, guid,
         ...,
         colN
    from actions a
   where a.colN in (\'my\',\'penguin\',\'lolz\')')
TO 's3://penguin-lolz-bucket/all.csv'
CREDENTIALS 'aws_access_key_id=$id;aws_secret_access_key=$key'
ALLOWOVERWRITE;

Googling the error message has just led me to a ton of posts about timeout configurations or general connection issues.

UPDATE 1:

So I made several changes, and so far the query has been running for 3 hours without error:

  1. Removed the file name from the TO field
  2. Created a new bucket just for this process
  3. Added the GZIP and PARALLEL options to the query

This was based on the idea that I might be exceeding the capacity of the bucket and/or the maximum object size on a per-file basis.
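
For reference, the revised statement now looks roughly like this (the new bucket name and credentials are placeholders, and the column list is elided as before; PARALLEL is on by default, I'm just making it explicit):

    UNLOAD ('select weekday, day, month, year, guid,
             ...,
             colN
        from actions a
       where a.colN in (\'my\',\'penguin\',\'lolz\')')
    TO 's3://penguin-lolz-unload/'   -- prefix only, no file name
    CREDENTIALS 'aws_access_key_id=$id;aws_secret_access_key=$key'
    GZIP
    PARALLEL ON
    ALLOWOVERWRITE;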

UPDATE 2: The UNLOAD performs as expected now, after the changes I made in UPDATE 1.

PRO TIP: When doing a large unload like this, keep an eye on the status of your query in the AWS Redshift console (you'll find it in one of the tabs when you drill into your cluster details). I was using SQL Workbench J for this; after about 5 hours it still showed the query as running, but the AWS Redshift console showed it was complete, and it was.

1
Thanks for specifying that you're using Redshift. It means I can't really help you, but hopefully others will. I think there's more interest in Redshift over on the AWS forums than on Stack Overflow though. – Craig Ringer
Is there a SQL state thrown out along with the error message? – A Null Pointer
No, there's no SQL state included in the message. I'm using SQL Workbench J, which I'm new to. I'll check to see if there's a way to get more info out of the error messaging. Unfortunately it takes 2 hours or so to get the query to fail again, which makes for a pretty slow testing cycle. – Taylor
You can always use the STV_RECENTS table to see queries with a Running status, and check the STV_INFLIGHT table to see whether the query is actually running or waiting for one of the 15 slots. – androboy
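
(For anyone else monitoring a long UNLOAD, the system-table checks suggested above look roughly like this; the column lists are from memory, so verify them against the Redshift system-table docs.)

    -- Queries that Redshift currently reports as Running
    SELECT pid, user_name, starttime, duration, query
      FROM stv_recents
     WHERE status = 'Running';

    -- Queries actually executing (vs. waiting for a slot)
    SELECT query, pid, starttime, text
      FROM stv_inflight;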

1 Answer

1
votes

This is because your query takes a very long time and SQL Workbench gets disconnected.

You can write a small script in PHP or shell (using pg_connect, for example) and run your UNLOAD queries from it.

Make sure you run your script in the background. If you are running the script from your local PC, make sure it doesn't get disconnected from the network.

Another option: if you have an EC2 instance, run your script on EC2.