1 vote

I am loading data from HDFS into Ignite through Spark. The raw data is around 5 GB in Snappy-compressed Parquet format (around 0.5 billion rows).

I am using the Ignite DataFrame API for Spark to load the data (https://apacheignite-fs.readme.io/docs/ignite-data-frame).
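
The write itself looks roughly like this (a minimal sketch; the HDFS path, table name, and key column are placeholders rather than my actual schema):

import org.apache.spark.sql.SparkSession
import org.apache.ignite.spark.IgniteDataFrameSettings._

val spark = SparkSession.builder().appName("parquet-to-ignite").getOrCreate()

// Read the raw Parquet data from HDFS.
val df = spark.read.parquet("hdfs:///data/raw.parquet")

// Write it into an Ignite SQL table via the Ignite DataFrame API.
df.write
  .format(FORMAT_IGNITE)
  .option(OPTION_CONFIG_FILE, "/path/to/ignite-config.xml")
  .option(OPTION_TABLE, "my_table")
  .option(OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS, "id")
  .save()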

The Ignite cluster has 3 nodes, all running in server mode, each with 8 GB of durable memory, native persistence enabled, and the WAL disabled.
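
For context, the storage part of each node's configuration is along these lines (a sketch in code form; the region name and exact values are illustrative):

import org.apache.ignite.configuration.{DataRegionConfiguration, DataStorageConfiguration, IgniteConfiguration, WALMode}

val storageCfg = new DataStorageConfiguration()
storageCfg.setWalMode(WALMode.NONE) // WAL disabled

val regionCfg = new DataRegionConfiguration()
regionCfg.setName("durable-region")
regionCfg.setMaxSize(8L * 1024 * 1024 * 1024) // 8 GB durable memory
regionCfg.setPersistenceEnabled(true)         // native persistence on

storageCfg.setDefaultDataRegionConfiguration(regionCfg)

val igniteCfg = new IgniteConfiguration()
igniteCfg.setDataStorageConfiguration(storageCfg)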

While loading the data, ingestion runs fast as long as there is space in durable memory. Once the data no longer fits in memory, loading becomes very slow and throughput keeps degrading.

I have tried some suggested configuration changes (GC tuning, on-heap storage), but none improved the loading speed significantly.

Since the Ignite memory store doesn't compress the data, it demands far more storage than the source (loading 0.2 billion rows took almost 45 GB of space). I believe increasing the durable memory size shouldn't be the only solution.

Any suggestions or resources on where to start tuning the Ignite cluster for better performance would be appreciated. Thank you for your time and help.


2 Answers

1 vote

If RAM is a scarce resource, then work on native persistence optimizations; persistence is most likely your bottleneck. Fine-tune it for your specific use case, starting from the Ignite persistence tuning documentation.
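
For example, checkpointing behaviour is usually the first thing to look at during heavy ingest. A sketch of the relevant knobs (standard DataStorageConfiguration setters; the values here are illustrative assumptions, not recommendations):

import org.apache.ignite.configuration.DataStorageConfiguration

val storageCfg = new DataStorageConfiguration()
// Checkpoint less often and with more threads, so ingest isn't
// repeatedly stalled behind checkpoint I/O.
storageCfg.setCheckpointFrequency(300000L) // 5 minutes, in milliseconds
storageCfg.setCheckpointThreads(8)
// Slow writers down gradually instead of freezing them outright when
// the checkpoint page buffer fills up.
storageCfg.setWriteThrottlingEnabled(true)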

0 votes

GC tuning isn’t going to help, since durable storage is off-heap. Similarly, on-heap storage is an addition on top of the off-heap storage, so, if anything, enabling it is going to make things worse.

Instead, you need to configure the eviction policy. In addition to specifying a maximum memory size for your data region, you would say:

<property name="pageEvictionMode" value="RANDOM_2_LRU"/>
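
The same setting in programmatic form, for reference (a sketch; the region name and size are assumptions):

import org.apache.ignite.configuration.{DataPageEvictionMode, DataRegionConfiguration}

val regionCfg = new DataRegionConfiguration()
regionCfg.setName("default")
regionCfg.setMaxSize(8L * 1024 * 1024 * 1024) // the region's maximum size
regionCfg.setPageEvictionMode(DataPageEvictionMode.RANDOM_2_LRU)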

Having said that, you are copying the data to disk, so it is going to be quite a bit slower than a purely in-memory load.