
If I create an IgniteRDD out of a cache with 10M entries in my Spark job, will it load all 10M entries into my Spark context? Please find my code below for reference.

    SparkConf conf = new SparkConf().setAppName("IgniteSparkIntgr").setMaster("local");
    JavaSparkContext context = new JavaSparkContext(conf);

    JavaIgniteContext<Integer, Subscriber> igniteCxt =
        new JavaIgniteContext<Integer, Subscriber>(context, "example-ignite.xml");
    JavaIgniteRDD<Integer, Subscriber> cache = igniteCxt.fromCache("subscriberCache");

    DataFrame query_res = cache.sql("select id, lastName, company from Subscriber where id between ? and ?", 12, 15);
    DataFrame input = loadInput(context);
    DataFrame joined_df = input.join(query_res, input.col("id").equalTo(query_res.col("ID")));
    System.out.println(joined_df.count());

In the above code, subscriberCache has more than 10M entries. At any point in the above code, will all 10M Subscriber objects be loaded into the JVM, or does it only load the query output?

FYI: Ignite is running in a separate JVM.


1 Answer


The cache.sql(...) method queries data that is already in the Ignite in-memory cache, so the cache has to be populated before you run the query. You can use the IgniteRDD.saveValues(...) or IgniteRDD.savePairs(...) method for this. Each of them iterates through the partitions of a Spark RDD and writes the data that currently exists in Spark into Ignite.
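A minimal sketch of that loading step, assuming a hypothetical `loadSubscribers(...)` helper that produces a pair RDD keyed by subscriber id (the helper and its data source are not from the question):

```java
// Assumes the igniteCxt and context objects from the question already exist.
// loadSubscribers(...) is a hypothetical helper returning the data to cache.
JavaPairRDD<Integer, Subscriber> subscribers = loadSubscribers(context);

JavaIgniteRDD<Integer, Subscriber> cache = igniteCxt.fromCache("subscriberCache");

// savePairs iterates over each Spark partition and writes its key-value
// pairs into the "subscriberCache" Ignite cache.
cache.savePairs(subscribers);
```

After this, cache.sql(...) queries run against the populated cache rather than pulling entries into the Spark driver.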

Note that any transformations or joins you perform on the resulting DataFrame are executed locally on the driver. You should avoid this as much as possible to get the best performance from the Ignite SQL engine.
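One way to avoid the driver-side join, assuming the input data can also be kept in an Ignite cache with a SQL-annotated type (the `Input` table and its `value` column are illustrative, not from the question), is to express the join in a single SQL query so Ignite executes it:

```java
// Sketch: join pushed into Ignite SQL instead of DataFrame.join on the driver.
// Assumes both Subscriber and a hypothetical Input type are queryable in Ignite.
DataFrame joined = cache.sql(
    "select s.id, s.lastName, s.company, i.value " +
    "from Subscriber s join Input i on s.id = i.id " +
    "where s.id between ? and ?", 12, 15);
System.out.println(joined.count());
```

Only the joined result then crosses into the Spark driver, instead of the full right-hand DataFrame being joined locally.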