We are facing intermittent performance issues with ignite where response times become very high and we see below error in our logs. We have 10 indexed columns and I don't see any issues with indexes as all the columns in the "where" clause are indexed. Joins are happening on the fields with affinity colocation which means that joins are happening only on the data in a particular node and not across nodes.
[21:48:30,765][WARNING][jvm-pause-detector-worker][IgniteKernal%PincodeGrid] Possible too long JVM pause: 4939 milliseconds.
[21:48:30,783][WARNING][query-#120%PincodeGrid%][IgniteH2Indexing] Query execution is too long [time=5052 ms, sql='SELECT
Please let me know if you can provide any help on this.
Apache Ignite version : 2.7.5
Ignite persistence is enabled (true)
2 node cluster in partitioned mode
RAM - 150 GB per node
JVM xms and xmx 20G
Number of records - 160 million
JVM options -
/usr/java/jdk1.8.0_144/bin/java -XX:+AggressiveOpts -server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/etappdata/ignite/logs/PROD/etail-prod-ignite76-163/logs -XX:+ExitOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/etappdata/ignite/logs/PROD/etail-prod-ignite76-163/gc.log -XX:+PrintAdaptiveSizePolicy -XX:+UseTLAB -verbose:gc -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Addresses=true -Djava.net.preferIPv6Stack=false -Djava.net.preferIPv6Addresses=false -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8996 -Dcom.sun.management.jmxremote.rmi.port=8996 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=etail-prod-ignite76-163 -XX:MaxDirectMemorySize=4g -javaagent:/tmp/apminsight-javaagent-prod/apminsight-javaagent.jar -Dfile.encoding=UTF-8 -XX:+UseG1GC -DIGNITE_QUIET=false -DIGNITE_SUCCESS_FILE=/ignite/apache-ignite-2.7.5-bin/work/ignite_success_7d9ec20d-9728-475a-aa80-4355eb8eaf02 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=49112 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -DIGNITE_HOME=/ignite/apache-ignite-2.7.5-bin -DIGNITE_PROG_NAME=./bin/ignite.sh -cp /ignite/apache-ignite-2.7.5-bin/libs/:/ignite/apache-ignite-2.7.5-bin/libs/ignite-indexing/:/ignite/apache-ignite-2.7.5-bin/libs/ignite-spring/:/ignite/apache-ignite-2.7.5-bin/libs/licenses/ org.apache.ignite.startup.cmdline.CommandLineStartup config/config-cache.xml
Added additional details
[04:03:48,251][INFO][main][IgniteKernal%PincodeGrid] IgniteConfiguration [igniteInstanceName=PincodeGrid, pubPoolSize=30, svcPoolSize=30, callbackPoolSize=30, stripedPoolSize=30, sysPoolSize=30, mgmtPoolSize=4, igfsPoolSize=30, dataStreamerPoolSize=30, utilityCachePoolSize=30, utilityCacheKeepAliveTime=60000, p2pPoolSize=2, qryPoolSize=30, igniteHome=/ignite/apache-ignite-2.7.5-bin, igniteWorkDir=/ignite/apache-ignite-2.7.5-bin/work, mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@13221655, nodeId=6aee7bb4-2804-4396-a9ec-65abdc9483e3, marsh=BinaryMarshaller [], marshLocJobs=false, daemon=false, p2pEnabled=true, netTimeout=5000, sndRetryDelay=1000, sndRetryCnt=3, metricsHistSize=10000, metricsUpdateFreq=2000, metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=600000, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null], segPlc=STOP, segResolveAttempts=2, waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=10000, commSpi=TcpCommunicationSpi [connectGate=null, connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@34123d65, enableForcibleNodeKill=false, enableTroubleshootingLog=false, locAddr=null, locHost=null, locPort=47100, locPortRange=100, shmemPort=-1, directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=5000, maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=65536, msgQueueLimit=2048, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null, usePairedConnections=true, connectionsPerNode=10, tcpNoDelay=true, filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0, sockWriteTimeout=10000, boundTcpPort=-1, boundTcpShmemPort=-1, selectorsCnt=15, selectorSpins=0, addrRslvr=null, ctxInitLatch=java.util.concurrent.CountDownLatch@59474f18[Count = 1], stopping=false], evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@65fb9ffc, colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [], indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@3590fc5b, addrRslvr=null, encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@397fbdb, clientMode=false, rebalanceThreadPoolSize=16, txCfg=TransactionConfiguration [txSerEnabled=false, dfltIsolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0, txTimeoutOnPartitionMapExchange=0, pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null, txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true, discoStartupDelay=60000, deployMode=SHARED, p2pMissedCacheSize=100, locHost=null, timeSrvPortBase=31100, timeSrvPortRange=100, failureDetectionTimeout=80000, sysWorkerBlockedTimeout=30000, clientFailureDetectionTimeout=120000, metricsLogFreq=6000000, hadoopCfg=null, connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211, noDelay=true, directBuf=false, sndBufSize=32768, rcvBufSize=32768, idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0, selectorCnt=4, idleTimeout=7000, sslEnabled=false, sslClientAuth=false, sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=30, msgInterceptor=null], odbcCfg=null, warmupClos=null, atomicCfg=AtomicConfiguration [seqReserveSize=1000, cacheMode=PARTITIONED, backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null, platformCfg=null, binaryCfg=null, memCfg=null, pstCfg=null, dsCfg=DataStorageConfiguration [sysRegionInitSize=41943040, sysRegionMaxSize=104857600, pageSize=4096, concLvl=0, dfltDataRegConf=DataRegionConfiguration [name=Default_Region, maxSize=128849018880, initSize=112742891520, swapPath=null, pageEvictionMode=DISABLED, evictionThreshold=0.9, emptyPagesPoolSize=100, metricsEnabled=false, metricsSubIntervalCount=5, metricsRateTimeInterval=60000, persistenceEnabled=true, checkpointPageBufSize=4294967296], dataRegions=null, storagePath=/ignite/persistence, checkpointFreq=180000, lockWaitTime=10000, checkpointThreads=8, checkpointWriteOrder=SEQUENTIAL, walHistSize=20, maxWalArchiveSize=1073741824, walSegments=10, walSegmentSize=1073741824, walPath=/wal/pincode, walArchivePath=/wal/pincode/archive, metricsEnabled=false, walMode=BACKGROUND, walTlbSize=131072, walBuffSize=0, walFlushFreq=2000, walFsyncDelay=1000, walRecordIterBuffSize=67108864, alwaysWriteFullPages=false, fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@2d3379b4, metricsSubIntervalCnt=5, metricsRateTimeInterval=60000, walAutoArchiveAfterInactivity=-1, writeThrottlingEnabled=false, walCompactionEnabled=false, walCompactionLevel=1, checkpointReadLockTimeout=null], activeOnStart=true, autoActivation=true, longQryWarnTimeout=3000, sqlConnCfg=null, cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100, sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpenCursorsPerConn=128, threadPoolSize=30, idleTimeout=0, jdbcEnabled=true, odbcEnabled=true, thinCliEnabled=true, sslEnabled=false, useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, authEnabled=false, failureHnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], commFailureRslvr=null]
Edit -2 - GC logs
2020-12-01T22:49:31.729+0530: 15.630: [GC pause (Metadata GC Threshold) (young) (initial-mark) 15.630: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 0, predicted base time: 38.43 ms, remaining time: 161.57 ms, target pause time: 200.00 ms]
15.630: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 24 regions, survivors: 2 regions, predicted young region time: 356.58 ms]
15.630: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 24 regions, survivors: 2 regions, old: 0 regions, predicted pause time: 395.01 ms, target pause time: 200.00 ms]
15.657: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: concurrent cycle is about to start], 0.0274990 secs]
[Parallel Time: 15.8 ms, GC Workers: 21]
[GC Worker Start (ms): Min: 15630.2, Avg: 15630.5, Max: 15630.8, Diff: 0.7]
[Ext Root Scanning (ms): Min: 1.6, Avg: 3.4, Max: 11.4, Diff: 9.8, Sum: 71.8]
[Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Code Root Scanning (ms): Min: 0.0, Avg: 1.2, Max: 12.6, Diff: 12.6, Sum: 24.2]
[Object Copy (ms): Min: 0.0, Avg: 9.5, Max: 12.0, Diff: 11.9, Sum: 199.9]
[Termination (ms): Min: 0.0, Avg: 0.7, Max: 0.8, Diff: 0.7, Sum: 14.8]
[Termination Attempts: Min: 1, Avg: 2.2, Max: 4, Diff: 3, Sum: 47]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.2, Diff: 0.1, Sum: 1.0]
[GC Worker Total (ms): Min: 14.5, Avg: 14.8, Max: 15.2, Diff: 0.7, Sum: 311.8]
[GC Worker End (ms): Min: 15645.3, Avg: 15645.3, Max: 15645.4, Diff: 0.1]
[Code Root Fixup: 0.6 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.5 ms]
[Other: 10.5 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 8.3 ms]
[Ref Enq: 0.5 ms]
[Redirty Cards: 0.5 ms]
[Humongous Register: 0.0 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.2 ms]
[Eden: 192.0M(1008.0M)->0.0B(984.0M) Survivors: 16.0M->40.0M Heap: 198.0M(20.0G)->33.7M(20.0G)]
[Times: user=0.31 sys=0.00, real=0.03 secs]
2020-12-01T22:49:31.757+0530: 15.657: [GC concurrent-root-region-scan-start]
2020-12-01T22:49:31.764+0530: 15.664: [GC concurrent-root-region-scan-end, 0.0067826 secs]
2020-12-01T22:49:31.764+0530: 15.664: [GC concurrent-mark-start]
2020-12-01T22:49:31.765+0530: 15.666: [GC concurrent-mark-end, 0.0015043 secs]
2020-12-01T22:49:31.766+0530: 15.666: [GC remark 2020-12-01T22:49:31.766+0530: 15.666: [Finalize Marking, 0.0010641 secs] 2020-12-01T22:49:31.767+0530: 15.667: [GC ref-proc, 0.0100232 secs] 2020-12-01T22:49:31.777+0530: 15.677: [Unloading, 0.0072592 secs], 0.0191010 secs]
[Times: user=0.20 sys=0.00, real=0.02 secs]
2020-12-01T22:49:31.785+0530: 15.685: [GC cleanup 37M->37M(20G), 0.0085803 secs]
[Times: user=0.04 sys=0.00, real=0.01 secs]
2020-12-01T22:53:45.090+0530: 268.990: [GC pause (G1 Evacuation Pause) (young) 268.990: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 0, predicted base time: 30.72 ms, remaining time: 169.28 ms, target pause time: 200.00 ms]
268.990: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 123 regions, survivors: 5 regions, predicted young region time: 1342.47 ms]
268.990: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 123 regions, survivors: 5 regions, old: 0 regions, predicted pause time: 1373.18 ms, target pause time: 200.00 ms]
269.040: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: candidate old regions not available]
, 0.0494933 secs]
[Parallel Time: 31.8 ms, GC Workers: 21]
[GC Worker Start (ms): Min: 268991.8, Avg: 268992.1, Max: 268992.5, Diff: 0.7]
[Ext Root Scanning (ms): Min: 0.9, Avg: 1.9, Max: 5.7, Diff: 4.7, Sum: 39.6]
[Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.5]
[Code Root Scanning (ms): Min: 0.0, Avg: 1.0, Max: 7.0, Diff: 7.0, Sum: 20.2]
[Object Copy (ms): Min: 21.7, Avg: 28.1, Max: 29.1, Diff: 7.4, Sum: 591.0]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 21]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 1.8]