I have a Google Dataproc cluster with Presto installed as an optional component. I created an external table in Hive; its size is ~1 GB. While the table is queryable (for example, GROUP BY, DISTINCT, and similar statements succeed), I have problems performing a simple select * from tableA with both Hive and Presto:
- For Hive, if I log in to the master node of the cluster and run the query from the Hive command line, it succeeds. However, when I run the following command from my local machine:
gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;"
I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
ERROR: (gcloud.dataproc.jobs.submit.hive) Job [3e165c0edcda4e35ad0d5f62b77725bc] entered state [ERROR] while waiting for [DONE].
This happens even though I've updated the configuration in mapred-site.xml as follows:
mapreduce.map.memory.mb=9000;
mapreduce.map.java.opts=-Xmx7000m;
mapreduce.reduce.memory.mb=9000;
mapreduce.reduce.java.opts=-Xmx7000m;
- For Presto, statements such as GROUP BY and DISTINCT similarly work. However, select * from tableA hangs every time at about RUNNING 60% until it times out. I get the same issue whether I run it from my local machine or from the master node of the cluster.
I don't understand why such a small external table causes these issues. Any help is appreciated, thank you!
select * from tableA in Presto, how do you receive results? Where are they stored/displayed? - Piotr Findeisen
TEXTFILE format table, then read those files externally. - David Phillips
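The last comment above appears to suggest writing the query result into a TEXTFILE-format table instead of streaming every row back to the client. A minimal sketch of that idea follows, assuming the Presto Hive connector's `format` table property and a target table name `tableA_export` that is purely hypothetical. The catalog and schema that `tableA` resolves to depend on the cluster setup and are not stated in the question.

```shell
#!/usr/bin/env bash
# Sketch only: builds a Presto CTAS statement that copies a table into a
# TEXTFILE-format table. The target name "tableA_export" is hypothetical.

build_export_sql() {
  local src="$1"   # source table, e.g. tableA
  local dst="$2"   # hypothetical target table
  printf "CREATE TABLE %s WITH (format = 'TEXTFILE') AS SELECT * FROM %s" "$dst" "$src"
}

EXPORT_SQL="$(build_export_sql tableA tableA_export)"
echo "$EXPORT_SQL"

# Submitting it needs gcloud and a running cluster, so it is left commented out
# (assumes CLUSTER_NAME and REGION are set as in the question):
# gcloud dataproc jobs submit presto --cluster "$CLUSTER_NAME" --region "$REGION" --execute "$EXPORT_SQL"
```

If the statement succeeds, the resulting text files would sit under the new table's storage location, where they can be read outside Presto. Whether this avoids the hang described in the question is untested here.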