Some background: we are building a machine learning farm on top of an Ignite cluster. Part of the use case is generating training data sets, which are gigantic matrices (in theory up to billions of rows by thousands of columns), one row per data entry in the Ignite cache.
We are using SqlQuery to fetch the records matching a predicate locally on each node, then iterate over the records, generate vectors, and write them into external storage for further consumption. Each node exports its data independently, so for 32 nodes we end up with 32 exported data sets.
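For context, the per-node export loop is essentially the following (cache name, value type, predicate, and exportRow are placeholders, not our actual code):

```java
import javax.cache.Cache;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.SqlQuery;

public class LocalExport {
    public static void main(String[] args) {
        Ignite ignite = Ignition.ignite();
        IgniteCache<Object, BinaryObject> cache =
            ignite.cache("trainingData").withKeepBinary();

        // Local-only query: each node exports its own partitions independently.
        SqlQuery<Object, BinaryObject> qry =
            new SqlQuery<Object, BinaryObject>("DataEntry", "label = ?")
                .setArgs("train");
        qry.setLocal(true);

        try (QueryCursor<Cache.Entry<Object, BinaryObject>> cursor = cache.query(qry)) {
            for (Cache.Entry<Object, BinaryObject> entry : cursor) {
                // Turn the record into a feature vector and append it
                // to the external store (placeholder).
                exportRow(entry.getValue());
            }
        }
    }

    static void exportRow(BinaryObject row) { /* write vector to external storage */ }
}
```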
The problem: small data set generations worked OK, but larger ones (queries expected to return 10M+ rows per node) basically kill the entire cluster, blowing nodes up with OOMEs and GC hell. We went through the "Performance and Debugging" section of the Ignite docs (https://apacheignite-sql.readme.io/docs/performance-and-debugging#result-set-lazy-load) and tried the lazy result set and page size settings. Nope.
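Concretely, the settings we tried looked roughly like this (a sketch; field list, cache, and values are illustrative):

```java
import java.util.List;
import org.apache.ignite.cache.query.FieldsQueryCursor;
import org.apache.ignite.cache.query.SqlFieldsQuery;

SqlFieldsQuery qry =
    new SqlFieldsQuery("SELECT id, features FROM DataEntry WHERE label = ?")
        .setArgs("train");
qry.setLazy(true);      // advertised to avoid materializing the whole result set
qry.setPageSize(512);   // default is 1024; smaller pages in hope of smaller buffers
qry.setLocal(true);

try (FieldsQueryCursor<List<?>> cursor = cache.query(qry)) {
    for (List<?> row : cursor) {
        // still OOMEs before the first row arrives on big result sets
    }
}
```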
Investigation (profiling, memory dumps, debugger, etc.) suggested that query result sets are loaded into memory completely before our code gets to read the first row, even though we are using a QueryCursor and iterating. Query#pageSize and SqlFieldsQuery#setLazy do not seem to have any effect on that whatsoever. Digging deeper, it turned out that H2 (which Ignite indexing uses) does not support server-side cursors in result sets at all, and can only be configured to buffer records onto disk (even with SSDs, a performance non-starter). Reading the Ignite source code suggested that Query#pageSize and SqlFieldsQuery#setLazy are only honored in distributed queries / scan queries, and even then Ignite does full reads of the result sets on the nodes into memory.
Sigh. These are remediations we thought of:
- Literally run thousands of tiny Ignite nodes with lots of unused memory, as opposed to dozens of big ones. Looks really cost-ineffective.
- Use h2.maxMemoryRows set to some small value. Seems like a silly solution (memory-to-memory in the same JVM through a buffer on disk? Really?).
- Ditch SqlQuery/SqlFieldsQuery, bypass H2, use ScanQuery. This is a ton of work (we would have to parse predicates and compile them into IgniteBiPredicates, plus these are full table scans, therefore no index / optimization, which sort of makes the whole ordeal pointless).
- Persuade Thomas Mueller to do server-side cursors in H2 somehow (see "How to set h2 to stream resultset?")?
- Allocate gigantic RAM drives on each Ignite node to buffer H2 record sets there (alright, this got weird, I'd probably stop).
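For reference, the ScanQuery route from the list above would look roughly like this; the hand-compiled predicate is the part that makes it a ton of work (the "label" field and filter value are placeholders):

```java
import javax.cache.Cache;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;
import org.apache.ignite.lang.IgniteBiPredicate;

// Every SQL predicate would have to be re-implemented as an IgniteBiPredicate,
// and the scan touches every record -- no index is consulted.
IgniteBiPredicate<Object, BinaryObject> filter =
    (key, val) -> "train".equals(val.<String>field("label"));

ScanQuery<Object, BinaryObject> scan = new ScanQuery<>(filter);
scan.setLocal(true);
scan.setPageSize(512);   // page size is honored for scan queries, per our reading of the source

try (QueryCursor<Cache.Entry<Object, BinaryObject>> cursor = cache.query(scan)) {
    for (Cache.Entry<Object, BinaryObject> e : cursor)
        exportRow(e.getValue());   // same export sink as before (placeholder)
}
```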
The question: is our assessment correct, and result set streaming is simply not a thing for local queries? Are there any more sensible workarounds?
Would appreciate some insight from Ignite comrades if they read this. Thanks in advance!
UPDATE: in case it matters, the cluster is 8 nodes as follows:
Architecture: x86_64
CPU(s): 32
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU MHz: 2345.904
BogoMIPS: 4600.05
Hypervisor vendor: Xen
Virtualization type: full
RAM: 240 GB
These are dedicated EC2 instances with ephemeral volumes for the Ignite /data mount.
Comments:

- SqlQuery<Object, BinaryObject> with "limit 10000000" as the predicate. Queries are run with setLocal(true) on each node independently, in parallel (that seems to be important). The cache it is going against holds approximately 600M records per node. There are 8 identical nodes in the cluster; the specs are 32-CPU Xeon E5-2686 v4 @ 2.30GHz EC2 instances (Xen hypervisor), dedicated. Can grab more specs if desired. - Borv
- Replaced SqlQuery with ScanQuery. Not seeing any OOMEs in the cluster, but still seeing significant JVM pauses (over 15 sec, at a rate of 5-20 per minute across the cluster). These pauses result in the client being disconnected from the cluster. - Borv
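On the client disconnects: one knob that may keep GC-paused clients in the topology is raising Ignite's failure-detection timeouts on server and client configurations. This is a sketch with illustrative values, and it only masks the pauses rather than fixing them:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

IgniteConfiguration cfg = new IgniteConfiguration();

// Tolerate longer stop-the-world pauses before a node or client is
// considered failed and dropped from the cluster (values in ms, illustrative).
cfg.setFailureDetectionTimeout(60_000);        // server-to-server
cfg.setClientFailureDetectionTimeout(60_000);  // server-to-client

Ignition.start(cfg);
```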