I am trying to improve the performance of an existing Spark DataFrame workload by putting Ignite on top of it. This is how we currently read the DataFrame:

   val df = sparksession.read.parquet(path).cache()

I managed to save and load a Spark DataFrame to and from Ignite by following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. This is how I read it with Ignite now:

  import org.apache.ignite.spark.IgniteDataFrameSettings._

  val df = spark.read
    .format(FORMAT_IGNITE)               // Data source.
    .option(OPTION_TABLE, "person")      // Table to read.
    .option(OPTION_CONFIG_FILE, CONFIG)  // Path to the Ignite config file.
    .load()
  df.createOrReplaceTempView("person")

SQL queries (like select a, b, c from table where x) on the Ignite DataFrame work, but they are much slower than Spark alone (i.e. without Ignite, querying the Spark DataFrame directly). A query often takes 5 to 30 seconds, and it is commonly 2 to 3 times slower than Spark alone. I noticed that a lot of data (100 MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" clause but a smaller result is processed faster. Overall, Ignite's DataFrame support seems to be a simple wrapper on top of Spark, so in most cases it is slower than Spark alone. Is my understanding correct?

Also, when I follow the code example, the cache created in Ignite automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark". So I couldn't change any cache configuration in XML (I need to specify the cache name in XML or code to configure it, and Ignite complains that the cache already exists). Is this expected?

Thanks

What exactly do you mean by "Spark alone"? Where is the data stored in this case, i.e. what are you comparing Ignite with? Also, please provide more details: the query you're executing, table sizes, etc. - Valentin Kulichenko
Thanks for the comment, updated question. - zfy
I'm still confused. Spark is not a data storage, so "query spark DF directly" doesn't make much sense to me. What is the data source in your non-Ignite case? - Valentin Kulichenko
Added code example to explain how spark and ignite DF is created - zfy
OK, so you're comparing storing data in Ignite vs Parquet. But it's still apples and oranges, since Parquet is just a data format, and Ignite is a complete storage system. How do you manage the Parquet files, and where do you keep them? How many nodes do you use? What is the Ignite configuration you're trying? What data do you store? - Stanislav Lukyanov

1 Answer

First of all, it doesn't seem that your test is fair. In the first case you prefetch the Parquet data, cache it locally in Spark, and only then execute the query. In the Ignite DF case you don't use caching, so the data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will go down significantly once some of the data needs to be fetched during execution.
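
To make the comparison like-for-like, one option is to time the same query against both sources with Spark-side caching turned off on both. The following is only a sketch: the Parquet path, the config file name, and the `x = 1` predicate are placeholders that do not come from the question, and `timed` is a hypothetical helper.

```scala
import org.apache.ignite.spark.IgniteDataFrameSettings._
import org.apache.spark.sql.SparkSession

object FairComparison {
  /** Runs `body` once and returns its result with the elapsed wall-clock milliseconds. */
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // Assumptions: paths and predicate are placeholders, not values from the question.
    val parquetPath  = "/data/person.parquet"
    val igniteConfig = "ignite-config.xml"
    val query        = "SELECT a, b, c FROM person WHERE x = 1"

    // Master URL is expected to come from spark-submit.
    val spark = SparkSession.builder().appName("ignite-vs-parquet").getOrCreate()

    // Parquet side, deliberately without .cache(), so data is read at query time.
    spark.read.parquet(parquetPath).createOrReplaceTempView("person")
    val (parquetRows, parquetMs) = timed(spark.sql(query).collect())

    // Ignite side: the same view name now points at the Ignite table.
    spark.read
      .format(FORMAT_IGNITE)
      .option(OPTION_TABLE, "person")
      .option(OPTION_CONFIG_FILE, igniteConfig)
      .load()
      .createOrReplaceTempView("person")
    val (igniteRows, igniteMs) = timed(spark.sql(query).collect())

    println(s"Parquet: ${parquetRows.length} rows in $parquetMs ms")
    println(s"Ignite:  ${igniteRows.length} rows in $igniteMs ms")
    spark.stop()
  }
}
```

A single run per source is noisy; repeating each query a few times and discarding the first run would give a steadier picture.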

However, with Ignite you can use indexing to improve performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
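
As a rough illustration, the index could be created from Scala by sending the DDL statement through any cache in the PUBLIC schema. This is a sketch under assumptions: the config file name, the `ddl_entry_point` cache name, and the index name are made up for the example, `createIndexSql` is a hypothetical helper, and `x` stands for whatever column appears in your "where" clause.

```scala
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery
import org.apache.ignite.configuration.CacheConfiguration

object CreatePersonIndex {
  /** Builds the DDL statement; the index naming scheme is an arbitrary choice. */
  def createIndexSql(table: String, column: String): String =
    s"CREATE INDEX IF NOT EXISTS ${table}_${column}_idx ON $table ($column)"

  def main(args: Array[String]): Unit = {
    // Assumption: "ignite-config.xml" is a placeholder for your own config.
    // Unless that config enables client mode, this starts a server node.
    val ignite = Ignition.start("ignite-config.xml")
    try {
      // Any cache can serve as an entry point for DDL; this one is a
      // hypothetical helper cache bound to the PUBLIC schema.
      val ccfg  = new CacheConfiguration[Any, Any]("ddl_entry_point").setSqlSchema("PUBLIC")
      val cache = ignite.getOrCreateCache(ccfg)

      // getAll forces the statement to run to completion.
      cache.query(new SqlFieldsQuery(createIndexSql("person", "x"))).getAll
    } finally {
      ignite.close()
    }
  }
}
```

The same statement can also be issued from any SQL client connected to the cluster; the Scala route is shown only because the rest of the code here is Scala.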