2
votes

I don't currently have enough data to test this scenario, so I need to know: does the Redshift UNLOAD command with PARALLEL ON write sorted data to multi-part files on S3 when the unload query includes an ORDER BY clause? I know that with PARALLEL OFF I can unload sorted data to S3 serially, with up to 6.2 GB in each part.

The Redshift documentation says this about the UNLOAD query:

A SELECT query. The results of the query are unloaded. In most cases, it is worthwhile to unload data in sorted order by specifying an ORDER BY clause in the query; this approach saves the time required to sort the data when it is reloaded.

Any links related to this topic would be helpful.

1 Answer

3
votes

After a lot of searching, I found my answer.

According to Redshift docs:

By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. To write data to a single file, specify PARALLEL OFF. UNLOAD writes the data serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. If the data size is greater than the maximum, UNLOAD creates additional files, up to 6.2 GB each.

So PARALLEL OFF is required if you need the unloaded data in absolute sorted order across files; the documentation only promises that ordering for serial unloads.
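
As a rough illustration, here is a minimal Python sketch that assembles such an UNLOAD statement as a string. The helper name `build_unload_sql`, the `events` table, the bucket prefix, and the IAM role ARN are all made up for the example; only the UNLOAD, TO, IAM_ROLE, and PARALLEL keywords come from Redshift itself. It does not connect to a cluster, it only shows the shape of the statement.

```python
def build_unload_sql(select_sql: str, s3_prefix: str, iam_role: str,
                     parallel: bool = False) -> str:
    """Build a Redshift UNLOAD statement (hypothetical helper, not an AWS API)."""
    # UNLOAD takes the query as a single-quoted string, so any single
    # quotes inside the query are doubled.
    quoted_query = select_sql.replace("'", "''")
    # PARALLEL OFF writes files serially, which is the mode the docs
    # describe as "sorted absolutely according to the ORDER BY clause".
    parallel_clause = "PARALLEL ON" if parallel else "PARALLEL OFF"
    return (
        f"UNLOAD ('{quoted_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"{parallel_clause};"
    )


# Placeholder table, bucket, and role; replace with your own values.
statement = build_unload_sql(
    "SELECT * FROM events WHERE region = 'eu' ORDER BY event_time",
    "s3://my-bucket/unload/events_",
    "arn:aws:iam::123456789012:role/MyRedshiftRole",
)
print(statement)
```

The resulting statement can be run from any SQL client connected to the cluster. If the result exceeds 6.2 GB, the quoted documentation says UNLOAD creates additional files of up to 6.2 GB each under the same prefix.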