
I need to read, using Presto, an entire dataset from S3 that sits in "bucket-a". But inside the bucket, the data was saved in sub-folders by year. So I have a bucket that looks like this:

bucket-a/2017/data

bucket-a/2018/more data

bucket-a/2019/more data

All of the above data belongs to the same table, but it was saved this way in S3. Notice that there is no data in bucket-a itself, only inside each year folder.

What I need to do is read all the data from the bucket as a single table, adding the year as a column or partition.

I tried it this way, but it didn't work:

CREATE TABLE hive.default.mytable (
  col1 int,
  col2 varchar,
  year int
)
WITH (
  format = 'json',
  partitioned_by = ARRAY['year'],
  external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)

and also

CREATE TABLE hive.default.mytable (
  col1 int,
  col2 varchar,
  year int
)
WITH (
  format = 'json',
  bucketed_by = ARRAY['year'],
  bucket_count = 3,
  external_location = 's3://bucket-a/' -- also tried 's3://bucket-a/year/'
)

None of the above worked.

I have seen people write to S3 with partitions using Presto, but what I'm trying to do is the opposite: read data from S3 that is already split into folders, as a single table.

Thanks.


1 Answer


If your folders followed the Hive partition folder naming convention (year=2019/), you could declare the table as partitioned and just use the system.sync_partition_metadata procedure in Presto.
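For example, assuming the folders were named s3://bucket-a/year=2017/ and so on, a sketch (reusing the partitioned table definition from your question, with external_location pointing at s3://bucket-a/) would be just one procedure call. This assumes your session is using the hive catalog; otherwise prefix the procedure with the catalog name (hive.system....):

-- discover partitions that exist on S3 but are missing from the metastore
CALL system.sync_partition_metadata('default', 'mytable', 'ADD');

If I recall correctly, the third argument can also be 'DROP' or 'FULL', depending on whether you want to remove stale partitions or both add and remove in one call.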

Since your folders do not follow the convention, you need to register each one individually as a partition using the system.register_partition procedure (available starting with Presto 330, which is about to be released). (The alternative to register_partition is to run an appropriate ADD PARTITION statement in the Hive CLI.)
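A sketch for your current layout, again reusing the partitioned table definition from the question: the argument order shown here (schema, table, partition columns, partition values, location) and the ability to pass an explicit location are what I'd expect from the procedure, but double-check the documentation of your exact Presto version:

-- register each existing folder as a partition, since 2017/, 2018/, 2019/
-- do not follow the year=2017 naming convention
CALL system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2017'], 's3://bucket-a/2017/');
CALL system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2018'], 's3://bucket-a/2018/');
CALL system.register_partition('default', 'mytable', ARRAY['year'], ARRAY['2019'], 's3://bucket-a/2019/');

The Hive CLI alternative would be along these lines, run against the default database (the exact URI scheme, s3:// vs s3a://, depends on your Hadoop/EMR setup):

ALTER TABLE mytable ADD PARTITION (year=2017) LOCATION 's3://bucket-a/2017/';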