2
votes

When I run a large scan query in Presto on a 5-node cluster, it looks like only one node, the query coordinator, reads the data from the 5 HDFS nodes over the network.

All Presto processes are running on the data nodes.

Is there a way to have all 5 nodes read data from HDFS using short-circuit local reads?

Are Presto nodes doing any pre-aggregation?

2 Answers

4
votes

It is not clear from your question whether you have installed Presto workers on the same machines as your HDFS data nodes. If you have not, the installation instructions will help you do this.

Once you have Presto workers on all of your data nodes, Presto should automatically perform local reads when accessing data from the local DFS node. Presto prefers to schedule work on the same machine as the DFS node that holds the data, but if that machine is overloaded, it schedules the work on another machine, so you will typically get some remote reads. The majority of reads should be local, and you can verify this distribution using the com.facebook.presto.execution:name=NodeScheduler MBean on the coordinator.
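
To make the "prefer local, fall back when overloaded" behaviour concrete, here is a toy model in Python. It is not Presto's scheduler code: the function name `assign_splits`, the `max_per_worker` limit, and the least-loaded fallback are all assumptions made for illustration. It only shows why a mix of mostly local and some remote reads is expected.

```python
from collections import Counter

def assign_splits(splits, workers, max_per_worker):
    """Toy locality-aware assignment (hypothetical, not Presto's algorithm).

    splits: list of (split_id, set_of_hosts_holding_a_local_replica)
    workers: list of worker host names
    max_per_worker: soft limit above which a worker counts as overloaded
    """
    load = {w: 0 for w in workers}
    assignments = {}
    counts = Counter()
    for split_id, local_hosts in splits:
        # Workers that hold the data locally and still have spare capacity.
        candidates = [w for w in workers
                      if w in local_hosts and load[w] < max_per_worker]
        if candidates:
            chosen = min(candidates, key=lambda w: load[w])
        else:
            # Every local worker is overloaded: use the least loaded worker.
            chosen = min(workers, key=lambda w: load[w])
        load[chosen] += 1
        assignments[split_id] = chosen
        counts["local" if chosen in local_hosts else "remote"] += 1
    return assignments, counts

# Four splits that all live on n1, with room for only two per worker.
splits = [("s%d" % i, {"n1"}) for i in range(1, 5)]
assignments, counts = assign_splits(splits, ["n1", "n2", "n3"], max_per_worker=2)
print(assignments)
print(dict(counts))
```

In this run the first two splits stay on n1 and the other two spill over to n2 and n3, which is the kind of local/remote split the MBean counters let you observe on a real cluster.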

Presto always performs partial aggregation on the leaf worker nodes.
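
As a rough illustration of what partial aggregation means, the Python sketch below computes an average in two stages: each worker reduces its own rows to a small (sum, count) state per key, and a final stage merges those states. The function names and the sample data are made up for this example, and this is not how Presto implements it internally. The point is only that far less data has to cross the network than if every raw row were shipped to one node.

```python
from collections import defaultdict

def partial_aggregate(rows):
    """Run on each worker: reduce local rows to one (sum, count) pair per key."""
    acc = defaultdict(lambda: [0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return {key: tuple(state) for key, state in acc.items()}

def final_aggregate(partials):
    """Run in the final stage: merge partial states and finish the average."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}

# Rows as seen by two hypothetical leaf workers.
worker_a = [("x", 1), ("x", 3), ("y", 10)]
worker_b = [("x", 5), ("y", 20)]

partials = [partial_aggregate(worker_a), partial_aggregate(worker_b)]
print(final_aggregate(partials))  # {'x': 3.0, 'y': 15.0}
```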

1
votes

If you have Presto installed on all the nodes and want Presto workers to process local stripes, you need to set the "hive.force-local-scheduling" session flag to true. It is false by default in the Presto versions I have seen (0.153).

See more details at: https://github.com/prestodb/presto/issues/894

https://github.com/prestodb/presto/pull/1770