
We need a memory cache for our Apache Spark database to improve performance. I did some research on Apache Ignite recently, and we decided to use Ignite as an external data source for Spark. Here is what I found and what I'm confused about right now:

  1. After digging into the code, I found that Spark SQL is translated into Ignite SQL, and the query is then sent to each Ignite node and executed there by the H2 engine. Does that mean all the data needs to be in the Ignite cache, and that data in HDFS will not be hit? Our data is too big to load entirely into memory; we can only cache part of it, maybe just some small tables, and the client should fall back to HDFS when a query misses the cache. My question is: with Ignite as an external data source for Spark, how can we scan data coming from both Ignite and HDFS in a single Spark SQL query? For example:

         SELECT person.name AS person, age, city.name AS city, country
         FROM person
         JOIN city ON person.city_id = city.id

     Here, city is fully in memory, while person is only partially in memory; the rest lives in HDFS and may not be cached in Ignite at all.

  2. Our Spark version is 3.0, but Ignite only supports Spark 2.4 right now, and I don't know what Apache Ignite's plan is for supporting Spark 3.0. What would you suggest for supporting Spark 3.0 in our system? Is it a good idea to re-implement everything that was done for 2.4? https://ignite.apache.org/docs/latest/extensions-and-integrations/ignite-for-spark/ignite-dataframe

Thanks for your kind suggestions :)


1 Answer

  1. You can run federated queries across Ignite and HDFS with Spark SQL: read each source into its own DataFrame and let Spark perform the join. Also, you can always enable Ignite native persistence and grow beyond the available memory capacity.
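
     A minimal sketch of that federated-join pattern, assuming a running Ignite cluster with the ignite-spark integration on the classpath (the config-file and HDFS paths below are hypothetical):

         import org.apache.spark.sql.SparkSession
         import org.apache.ignite.spark.IgniteDataFrameSettings._

         val spark = SparkSession.builder()
           .appName("federated-query")
           .getOrCreate()

         // Small, hot table read from the Ignite cache.
         val city = spark.read
           .format(FORMAT_IGNITE)
           .option(OPTION_CONFIG_FILE, "ignite-config.xml") // hypothetical path
           .option(OPTION_TABLE, "city")
           .load()

         // Large table read directly from HDFS (e.g. Parquet), not cached in Ignite.
         val person = spark.read.parquet("hdfs:///warehouse/person") // hypothetical path

         city.createOrReplaceTempView("city")
         person.createOrReplaceTempView("person")

         // Spark itself executes the join, so the two sources can be mixed freely;
         // only the per-source scans are pushed down to Ignite and HDFS.
         val result = spark.sql(
           """SELECT person.name AS person, age, city.name AS city, country
             |FROM person JOIN city ON person.city_id = city.id""".stripMargin)

     The trade-off is that a join executed in Spark cannot use Ignite's collocated SQL execution, so it only pushes down the scans, not the join itself.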

  2. The ticket for Spark 3.0 support has been reported in the Ignite JIRA. Vote it up!