
I have a dataset stored in S3 in Parquet format. I would like to know if I can load this data into Redshift using the COPY command. I have read that I can use Redshift Spectrum, where I can point to the schema stored in the Hive metastore and read the data from Redshift.

What would be useful for me is if I could query this Parquet data stored in S3 from Redshift, or load it directly into Redshift using the COPY command.

There are options where I can spin up a cluster and write the Parquet data from S3 into Redshift using JDBC, but the problem is that JDBC is too slow compared to the COPY command.


1 Answer


Using a Redshift external schema and external tables (Redshift Spectrum), you can query the Parquet data in S3 directly from Redshift and then load it into a local table.
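First, create an external schema that points to a data catalog database. A minimal sketch; the database name, IAM role ARN, and schema name are placeholders you need to replace with your own:

CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrumdb'                                       -- placeholder catalog database name
IAM_ROLE 'arn:aws:iam::123456789012:role/mySpectrumRole'    -- placeholder role with S3 and catalog access
CREATE EXTERNAL DATABASE IF NOT EXISTS;

Then define the external table over the Parquet files: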

CREATE EXTERNAL TABLE spectrum.LINEITEM_PART_PARQ (
 L_ORDERKEY BIGINT,
 L_PARTKEY BIGINT,
 L_SUPPKEY BIGINT,
 L_LINENUMBER INT,
 L_QUANTITY DECIMAL(12,2),
 L_EXTENDEDPRICE DECIMAL(12,2),
 L_DISCOUNT DECIMAL(12,2),
 L_TAX DECIMAL(12,2),
 L_RETURNFLAG VARCHAR(128),
 L_LINESTATUS VARCHAR(128),
 L_COMMITDATE VARCHAR(128),
 L_RECEIPTDATE VARCHAR(128),
 L_SHIPINSTRUCT VARCHAR(128),
 L_SHIPMODE VARCHAR(128),
 L_COMMENT VARCHAR(128))
PARTITIONED BY (L_SHIPDATE VARCHAR(128))
STORED AS PARQUET
LOCATION 's3://<your-bucket>/<xyz>/lineitem_partition/';
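Because the table is partitioned on L_SHIPDATE, each partition has to be registered before its files become visible to queries. A sketch, assuming the same bucket layout as above; the partition value and path are placeholders:

ALTER TABLE spectrum.LINEITEM_PART_PARQ
ADD PARTITION (L_SHIPDATE='1998-01-01')    -- placeholder partition value
LOCATION 's3://<your-bucket>/<xyz>/lineitem_partition/l_shipdate=1998-01-01/';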

Sources: https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/

http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html

http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-query-s3-data.html

From the external table, you can then insert the data into a regular Redshift table:

insert into local_table    -- local_table is a placeholder for your target Redshift table
select * from spectrum.LINEITEM_PART_PARQ;
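Alternatively, if your cluster runs a recent enough Redshift version, COPY can read Parquet from S3 directly, which addresses the original question without going through Spectrum. A minimal sketch; the table name, S3 path, and IAM role are placeholders, and the target table's columns must match the Parquet columns by position:

COPY local_table
FROM 's3://<your-bucket>/<xyz>/lineitem_partition/'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
FORMAT AS PARQUET;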