How to read parquet data with partitions from Aws S3 using presto?

Question

I have data stored in S3 in form of parquet files with partitions. I am trying to read this data using presto. I am able to read data if I give the complete location of parquet file with partition. Below is the query to read data from "section a":

presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (external_location = 's3://bucket/presto/section=a', format = 'PARQUET');

But my data is partitioned with different sections i.e. s3://bucket/presto folder contains multiple folders like "section=a", "section=b", etc.

I am trying to read the data with partitions as follows:

presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');

The table is being created but when I try to select the data the table is empty.

I am new to Presto, please help.

Thanks

Piotr Findeisen Piotr Findeisen · Accepted Answer · 2019-08-06T16:18:52

You create table correctly:

create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255))
WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');

However, in "Hive table format" the partitions are not auto-discovered. Instead, they need to be declared explicitly. There are some reasons for this:

explicit declaration of partitions allows you to publish a partition "atomically", once you're done writing
section=a, section=b is only the convention, the partition location may be different. In fact the partition can be located in some other S3 bucket, or different storage

To auto-discover partitions in the case like yours, you can use the system.sync_partition_metadata procedure that comes with Presto.

How to read parquet data with partitions from Aws S3 using presto?

1 Answers