2
votes

I have a Google Dataproc cluster with Presto installed as an optional component. I created an external table in Hive, and its size is ~1 GB. While the table is queryable (for example, GROUP BY, DISTINCT, etc. succeed), I have problems performing a simple select * from tableA with both Hive and Presto:

  1. For Hive, if I log in to the master node of the cluster and run the query from the Hive command line, it succeeds. However, when I run the following command from my local machine:

gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;"

I get the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
ERROR: (gcloud.dataproc.jobs.submit.hive) Job [3e165c0edcda4e35ad0d5f62b77725bc] entered state [ERROR] while waiting for [DONE].

This happens even though I've updated the configurations in mapred-site.xml as follows (see the example after this list for one way such properties can be passed per job):

mapreduce.map.memory.mb=9000;
mapreduce.map.java.opts=-Xmx7000m;
mapreduce.reduce.memory.mb=9000;
mapreduce.reduce.java.opts=-Xmx7000m;
  2. For Presto, statements such as GROUP BY and DISTINCT similarly work. However, select * from tableA hangs forever at about RUNNING 60% until it times out, and I get the same issue whether I run it from my local machine or from the master node of the cluster.
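
For reference, here is one way those mapred properties can be passed per job, via gcloud's --properties flag (a sketch; I'm not claiming this is the only place they can be set):

gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION \
    --properties mapreduce.map.memory.mb=9000,mapreduce.map.java.opts=-Xmx7000m,mapreduce.reduce.memory.mb=9000,mapreduce.reduce.java.opts=-Xmx7000m \
    --execute "SELECT * FROM tableA;"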

I don't understand why such a small external table can cause these issues. Any help is appreciated, thank you!

2
Can you clarify how you set the Hive configuration parameters? Did you set properties on the Dataproc job, set the properties on cluster creation, or some other way? For Presto, which stage of the query was stalled? You can check by accessing the Presto UI at $MASTER_HOSTNAME:8060 and looking at the page for the hung query. - Jerry Ding
I edited the OP to specify the configuration parameter settings. For Presto, I'm aware of the web UI, but I'm not sure what specific info I should look for. Under the 'Stage Performance' tab there are stages 0 and 1, neither showing anything, with the message 'Operator graph will appear automatically when query completes.' Under the 'Live Plan' tab, there are two stages, both at 'RUNNING' status, but they stay like that forever until timeout. - user2830451
When you select * from tableA in Presto, how do you receive results? Where are they stored/displayed? - Piotr Findeisen
Yes, I receive some results; they are displayed directly in the command-line console. - user2830451
I'm not familiar with DataProc, but my guess is that the console output is designed for smaller debug output, not 1GB (or much larger in text format) of query output. The typical way to read large amounts of data for external processing is to access the files in GCS directly. You might run a Presto query that inserts the results into a TEXTFILE format table, then read those files externally. - David Phillips
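
A minimal sketch of that last suggestion, run from the master node (the target table name tableA_text is hypothetical; the resulting files land in the table's warehouse location, which on Dataproc is typically backed by GCS):

presto --catalog hive --schema default \
    --execute "CREATE TABLE tableA_text WITH (format = 'TEXTFILE') AS SELECT * FROM tableA"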

2 Answers

1
vote

The Presto CLI binary /usr/bin/presto specifies a JVM -Xmx argument inline (it uses some tricks to bootstrap itself as a Java binary); unfortunately, that -Xmx is not fetched from /opt/presto-server/etc/jvm.config the way the settings for the actual presto-server are.
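
You can confirm the inline setting before changing it (a quick sanity check; given the sed command below, this should print Xmx1G):

grep -ao "Xmx[0-9]*[GgMm]" /usr/bin/presto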

In your case, if you're selecting everything from a 1 GB Parquet table, you're probably actually dealing with something like 6 GB of uncompressed text, and you're trying to stream all of that to the console. This is also unlikely to work with Dataproc job submission, because the streamed output is designed to print human-readable amounts of data and will slow down considerably when dealing with non-human amounts of data.

If you still want to try doing that with the CLI, you can modify the JVM settings for the CLI on the master before starting it back up:

sudo sed -i "s/Xmx1G/Xmx5G/" /usr/bin/presto

You'd probably then want to pipe the output to a local file instead of streaming it to your console, because you won't be able to read 6 GB of text streaming through your screen.
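
For example (a sketch; the output path is arbitrary, and CSV is just one of the CLI's output formats):

presto --catalog hive --schema default \
    --execute "SELECT * FROM tableA" \
    --output-format CSV > /tmp/tableA.csv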

1
vote

I think the problem is that the output of gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;" went through the Dataproc server, which OOMed. To avoid that, you can query the data from the cluster directly without going through the server.

Try following the Presto CLI queries section of the Dataproc Presto tutorial; run these commands from your local machine:

gcloud compute ssh <master-node> \
    --project=${PROJECT} \
    --zone=${ZONE} \
    -- -D 1080 -N
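
# Note: the ssh command above blocks while it holds open a SOCKS proxy on
# localhost:1080, so run the CLI below from a second terminal.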
./presto-cli \
    --server <master-node>:8080 \
    --socks-proxy localhost:1080 \
    --catalog hive \
    --schema default
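
Once connected, you can also pass the query with --execute and redirect the output to a local file instead of reading it in the terminal, e.g. (a sketch; the output path is arbitrary):

./presto-cli \
    --server <master-node>:8080 \
    --socks-proxy localhost:1080 \
    --catalog hive \
    --schema default \
    --execute "SELECT * FROM tableA" > tableA.txt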