Apache Drill in Distributed Mode

Question

I started exploring Drill for our requirement to run SQL on semi-structured data. I have set up a 4-node Drill cluster with ZooKeeper, and I have a few questions about how it actually works.

I run Drill in distributed mode using the dfs storage plugin against the local file system, and I have a 1 GB JSON file on only one of the nodes (say n1). I am able to run the query by launching sqlline from any of the nodes (n1, n2, n3, n4), in spite of having the data only on n1. My questions are:

a. Is the query being executed on all the nodes? That is, will Drill parallelize the query execution by distributing the data to the other nodes n2, n3, and n4?

b. If not, will copying the same file to all the other nodes (n2, n3, n4) help in leveraging Drill's MPP architecture?

catpaws · Accepted Answer · 2015-07-22T21:52:09

Is the query being executed on all the nodes? Possibly. Each node has to be running Drill, and the data you are querying has to be on a distributed file system, such as HDFS, so that every node can reach it. Drill doesn't distribute the files itself.

The nodes that run the Drillbit service (where you installed Drill) participate in the query work. Only columns that appear in the query are loaded from the file. Drill tries to push any filter in your query down to the leaf nodes so that they do not send rows that fail the filter. Per the docs, Drill maximizes data locality during query execution without moving data over the network or between nodes. The Minor Fragments section of the docs covers parallelization: Drill parallelizes operations when the number of records in a fragment reaches 100,000.
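
To check this on your own cluster, something like the following can be run from sqlline. This is only a sketch: the file path is a placeholder rather than your actual file, and I am assuming that `planner.slice_target` is the option behind the 100,000-record threshold mentioned above.

```sql
-- List the Drillbits registered in ZooKeeper; all four nodes should appear.
SELECT * FROM sys.drillbits;

-- Show the current parallelization threshold
-- (assumed to be the 100,000-record figure mentioned above).
SELECT * FROM sys.options WHERE name = 'planner.slice_target';

-- Show the query plan without running the query.
-- The path below is a placeholder, not the real file location.
EXPLAIN PLAN FOR
SELECT COUNT(*) FROM dfs.`/path/to/data.json`;
```

If the plan shows only a single minor fragment for the scan, the query is most likely not being spread across the other nodes.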
