4
votes

I have a Dataproc cluster with Presto installed as an optional component. My data is stored in Google Cloud Storage (GCS) and I'm able to query it with Presto. However, I haven't found a way to write query results back to GCS. I can write to HDFS if I log in to the master node and run Presto commands from there, but Presto doesn't recognize any GCS location.

How can I write the Presto query results to GCS?

1
Presto supports GCS natively since Presto 302 (prestosql.io/docs/current/release/release-302.html). What do you mean by "it doesn't recognize any GCS location"? - Piotr Findeisen
Dataproc Presto is PrestoDB, not PrestoSQL. - Dagang

1 Answer

2
votes

You need to create a Hive external table backed by GCS, for example:

gcloud dataproc jobs submit hive \
    --cluster <cluster> \
    --execute "
        CREATE EXTERNAL TABLE my_table(id INT, name STRING)
        STORED AS PARQUET
        LOCATION 'gs://<bucket>/<dir>/';"

then insert your Presto query result into the table; since the table's location is a GCS path, the Parquet files Presto writes will land in that bucket.
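For example, from the master node you can run the insert through the Presto CLI against the Hive catalog (a sketch: `my_source_table` is a hypothetical table standing in for whatever your query reads from):

```shell
# Run on the Dataproc master node; the Presto optional component
# installs the `presto` CLI there.
presto --catalog hive --schema default --execute "
    INSERT INTO my_table
    SELECT id, name
    FROM my_source_table;"
```

Because `my_table` is an external table whose `LOCATION` is `gs://<bucket>/<dir>/`, the inserted rows are written as Parquet files directly into that GCS directory.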