2
votes

Some background: we are building a machine learning farm on top of an Ignite cluster. Part of the use case is generating training data sets, which are gigantic matrices (in theory up to billions of rows by thousands of columns), one row per data entry in the Ignite cache.

We are using SqlQuery to fetch the records matching a predicate locally on each node, then iterate over the records, generate vectors, and write them into external storage for further consumption. Each node exports its data independently, so for 32 nodes we end up with 32 exported data sets.
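The per-node export loop looks roughly like this (a sketch only; the cache name `trainingCache`, the value type `DataEntry`, and the vector-writing step are placeholders, not our actual code):

```java
import javax.cache.Cache;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.SqlQuery;

public class LocalExport {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Keep values as BinaryObject to avoid deserialization overhead.
        IgniteCache<Object, BinaryObject> cache =
            ignite.cache("trainingCache").withKeepBinary();

        // setLocal(true): run the predicate on this node's data only;
        // every node runs the same export independently, in parallel.
        SqlQuery<Object, BinaryObject> qry =
            new SqlQuery<Object, BinaryObject>("DataEntry", "limit ?")
                .setArgs(10_000_000)
                .setLocal(true);

        try (QueryCursor<Cache.Entry<Object, BinaryObject>> cur = cache.query(qry)) {
            for (Cache.Entry<Object, BinaryObject> e : cur) {
                // Generate a feature vector from e.getValue() and write it
                // to external storage (omitted).
            }
        }
    }
}
```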

The problem: small data set generations worked OK, but larger ones (queries expected to return 10M+ rows per node) basically kill the entire cluster, blowing up the nodes with OOMEs and GC hell. We looked at the "Performance and Debugging" section of the Ignite docs (https://apacheignite-sql.readme.io/docs/performance-and-debugging#result-set-lazy-load) and tried the lazy result set and page size settings. Nope.
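For reference, these are the knobs we tried, per that doc section (the query text and values here are illustrative, not our exact query):

```java
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class TriedSettings {
    static SqlFieldsQuery build() {
        return new SqlFieldsQuery("SELECT * FROM DataEntry LIMIT ?")
            .setArgs(10_000_000)
            .setLocal(true)
            .setLazy(true)      // supposed to avoid materializing the result set
            .setPageSize(1024); // rows fetched per page
    }
}
```

Neither setting changed the observed memory behavior for our local queries.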

Our investigations (profiling, memory dumps, debugger, etc.) suggest that the result sets of the queries are loaded into memory completely before our code gets to read the first row, even though we are iterating via a QueryCursor. Query#pageSize and SqlFieldsQuery#setLazy do not seem to have any effect on that whatsoever. Digging deeper, it turned out that H2 (which Ignite indexing uses) does not support server-side cursors in result sets at all, and can only be configured to buffer the records onto disk (even with SSDs, this is a performance non-starter). Reading through the Ignite source code suggests that Query#pageSize and SqlFieldsQuery#setLazy are only used in distributed queries / scan queries, and that Ignite still does full reads of result sets into memory on the nodes.

Sigh. These are remediations we thought of:

  • Literally run thousands of tiny Ignite nodes with lots of unused memory, as opposed to dozens of big ones. Looks really cost-ineffective.
  • Use h2.maxMemoryRows set to some small value. Seems like a silly solution (memory-to-memory in the same JVM through a buffer on disk? really?).
  • Ditch SqlQuery / SqlFieldsQuery, bypass H2, and use ScanQuery. This is a ton of work (we'd have to parse predicates and compile them into IgniteBiPredicates; plus these are full table scans, so no indexes / optimization, which sort of makes the whole ordeal pointless).
  • Persuade Thomas Mueller to implement server-side cursors in H2 somehow (see How to set h2 to stream resultset?)?
  • Allocate gigantic RAM drives on each Ignite node and buffer the H2 record sets there (alright, this got weird, I'd probably stop).

The question: is our assessment correct and query result streaming is not a thing for local queries? Are there any more sensible workarounds?

Would appreciate some insight from Ignite comrades if they read this. Thanks in advance!

UPDATE: in case it matters, the cluster is 8 nodes as follows:

Architecture: x86_64
CPU(s): 32
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU MHz: 2345.904
BogoMIPS: 4600.05
Hypervisor vendor: Xen
Virtualization type: full
RAM: 240Gb

These are dedicated EC2 instances with ephemeral volumes for the Ignite /data mount.

2
Show please query that leads to OOME. - Mitya XMitya
@MityaXMitya, the query is SqlQuery<Object,BinaryObject> with limit 10000000 as the predicate. Queries run with setLocal(true) on each node independently, in parallel (that seems to be important). The cache it runs against holds approx. 600M records per node. There are 8 identical nodes in the cluster; the specs are 32-CPU Xeon E5-2686 v4 @ 2.30GHz in EC2 (Xen hypervisor), dedicated. Can grab more specs if desired. - Borv
I allocated some time to build a bench test. Will post the git link here. - Borv
FYI, we did another test today after swapping SqlQuery for ScanQuery. Not seeing any OOMEs in the cluster, but we still see significant JVM pauses (over 15 sec, at a rate of 5-20 per minute across the cluster). These pauses result in the client being disconnected from the cluster. - Borv
What does your query do? Does it use grouping or ordering? Whether H2 has to load all results into the heap depends heavily on the query. Could you show your actual query? - Mitya XMitya

2 Answers

0
votes

lazy definitely does do something: for a SELECT * FROM cache query, it's the difference between cluster failure and normal operation. pageSize should work together with lazy.

Can you show your query? Also, I'm not sure that lazy really applies to local queries.

0
votes

Updating this after seemingly solving the issue.

Note: the question was about local queries.

It looks like using SqlFieldsQuery with setLazy(true) helped for local queries. A local SqlQuery seems to ignore the lazy flag.
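A minimal sketch of the combination that worked, assuming a table named DataEntry (the query text and field list are placeholders):

```java
import java.util.List;

import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.FieldsQueryCursor;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class LazyLocalQuery {
    static void run(IgniteCache<?, ?> cache) {
        // SqlFieldsQuery honors setLazy(); the key-value SqlQuery API
        // appeared to ignore it for local queries.
        SqlFieldsQuery qry = new SqlFieldsQuery("SELECT _key, _val FROM DataEntry")
            .setLocal(true)
            .setLazy(true);

        try (FieldsQueryCursor<List<?>> cur = cache.query(qry)) {
            for (List<?> row : cur) {
                // Process row by row without materializing the full result set.
            }
        }
    }
}
```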