Load Parquet files into Redshift

Question

I have a bunch of Parquet files on S3, i want to load them into redshift in most optimal way.

Each file is split into multiple chunks......what is the most optimal way to load data from S3 into Redshift?

Also, how do you create the target table definition in Redshift? Is there a way to infer schema from Parquet and create table programatically? I believe there is a way to do this using Redshift spectrum, but i want to know if this can be done in scripting.

Appreciate your help!

I am considering all AWS tools such as Glue, Lambda etc to do this the most optimal way(in terms of performance, security and cost).

if youre planning to use glue,then create a and use the glue metastore directly in spectrum.The glue crawler is helpful if you want to update the metastore peridically — theDbGuy

John Rotenstein John Rotenstein · Accepted Answer · 2018-09-06T01:30:39

The Amazon Redshift COPY command can natively load Parquet files by using the parameter:

FORMAT AS PARQUET

See: Amazon Redshift Can Now COPY from Parquet and ORC File Formats

The table must be pre-created; it cannot be created automatically.

Also note from COPY from Columnar Data Formats - Amazon Redshift:

COPY inserts values into the target table's columns in the same order as the columns occur in the columnar data files. The number of columns in the target table and the number of columns in the data file must match.

Load Parquet files into Redshift

2 Answers