Understanding distributed join of Apache Ignite

Question

We are exploring the use of Apache Ignite in our project. We have dozens of Oracle tables, and we want to load each table into an Ignite cache and then join these caches. There are many joins between our tables, so there will be many distributed joins between caches.

The uncertain part is that it could be really hard to collocate our data using the affinity collocation feature described here: https://apacheignite.readme.io/docs/affinity-collocation

So I would like to ask: if the data in our caches is not collocated, does Ignite's distributed join still support this (we are using Ignite 1.7.0)? I imagine there would be a lot of data movement during the join (this would be very similar to SQL on Hadoop, like Hive or Spark SQL).

Also, I am wondering how the performance of non-collocated distributed joins compares with Spark SQL.

dmagda · Accepted Answer · 2016-12-10T05:30:42

I would add that using the distributed non-collocated mode for SQL queries doesn't mean that data will be moved around needlessly all the time. The engine will do its best to optimize the execution, and this may even result in no data movement at all. However, that depends on the type of query and on how the data is spread across the cluster.

In any case, my recommendation is to collocate as much data as you can, so that you can rely on the most performant collocated mode, and fall back to the non-collocated mode for the remaining scenarios.
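
To make the difference between the two modes concrete, here is a toy model in plain Java. It is only a sketch and not Ignite's API or its real join algorithm: the `Order` class, the `nodeFor` function (a stand-in for an affinity function) and the four-node cluster are all assumptions made up for illustration. It counts how many order rows would need their customer row fetched from another node, depending on whether orders are placed by `customerId` (collocated) or by their own key (non-collocated).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is not Ignite's API or its actual join algorithm.
final class CollocationSketch {
    // Hypothetical child-table row: its own key plus a foreign key to the parent.
    static final class Order {
        final int orderId;
        final int customerId;

        Order(int orderId, int customerId) {
            this.orderId = orderId;
            this.customerId = customerId;
        }
    }

    // Maps a key to a node index; a stand-in for a real affinity function.
    static int nodeFor(int key, int nodes) {
        return Math.floorMod(Integer.hashCode(key), nodes);
    }

    // Counts orders whose customer row lives on a different node than the order.
    static int remoteLookups(List<Order> orders, int nodes, boolean collocated) {
        int remote = 0;
        for (Order o : orders) {
            // Collocated: the order is placed by customerId (its affinity key).
            // Non-collocated: the order is placed by its own primary key.
            int orderNode = nodeFor(collocated ? o.customerId : o.orderId, nodes);
            int customerNode = nodeFor(o.customerId, nodes);
            if (orderNode != customerNode) {
                remote++;
            }
        }
        return remote;
    }

    // Builds 1000 sample orders, ten per customer.
    static List<Order> sampleOrders() {
        List<Order> orders = new ArrayList<>();
        for (int id = 0; id < 1000; id++) {
            orders.add(new Order(id, id / 10));
        }
        return orders;
    }

    public static void main(String[] args) {
        List<Order> orders = sampleOrders();
        System.out.println("collocated:     " + remoteLookups(orders, 4, true));
        System.out.println("non-collocated: " + remoteLookups(orders, 4, false));
    }
}
```

In this model the collocated layout never needs a remote lookup, because an order and its customer always hash to the same node. The non-collocated count depends entirely on how the keys happen to be distributed, which is the point made above about the query type and data spread.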

I do believe that non-collocated Ignite queries will still perform better than the Spark SQL engine, simply because Ignite allows you to index the data while Spark doesn't, and that is essential.
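
As a rough illustration of why an index matters, the following plain Java sketch compares how many rows are touched when finding one customer's orders with and without a prebuilt index. It is a toy model under stated assumptions: the names `buildIndex`, `findByScan` and `findByIndex` are hypothetical, and it does not measure Ignite or Spark or model how either engine actually executes a join.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: it does not benchmark Ignite or Spark.
final class IndexLookupSketch {
    // Builds a hash index from customer id to the positions of matching order rows.
    static Map<Integer, List<Integer>> buildIndex(int[] customerIdPerOrder) {
        Map<Integer, List<Integer>> index = new HashMap<>();
        for (int row = 0; row < customerIdPerOrder.length; row++) {
            index.computeIfAbsent(customerIdPerOrder[row], k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    // Full scan: returns {matches, rowsTouched}; every row is inspected.
    static int[] findByScan(int[] customerIdPerOrder, int wantedCustomer) {
        int matches = 0;
        int touched = 0;
        for (int customerId : customerIdPerOrder) {
            touched++;
            if (customerId == wantedCustomer) {
                matches++;
            }
        }
        return new int[] {matches, touched};
    }

    // Index lookup: returns {matches, rowsTouched}; only matching rows are visited.
    static int[] findByIndex(Map<Integer, List<Integer>> index, int wantedCustomer) {
        List<Integer> rows = index.getOrDefault(wantedCustomer, Collections.emptyList());
        return new int[] {rows.size(), rows.size()};
    }

    public static void main(String[] args) {
        int[] customerIdPerOrder = new int[1000];
        for (int i = 0; i < customerIdPerOrder.length; i++) {
            customerIdPerOrder[i] = i / 10; // ten orders per customer
        }
        int[] scan = findByScan(customerIdPerOrder, 7);
        int[] indexed = findByIndex(buildIndex(customerIdPerOrder), 7);
        System.out.println("scan touched " + scan[1] + " rows, index touched " + indexed[1]);
    }
}
```

Both lookups return the same matches, but the scan inspects every row while the index visits only the matching ones. The index still has to be built and maintained, which this sketch leaves out of the count.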
