1
votes

We are facing intermittent performance issues with ignite where response times become very high and we see below error in our logs. We have 10 indexed columns and I don't see any issues with indexes as all the columns in the "where" clause are indexed. Joins are happening on the fields with affinity colocation which means that joins are happening only on the data in a particular node and not across nodes.

[21:48:30,765][WARNING][jvm-pause-detector-worker][IgniteKernal%PincodeGrid] Possible too long JVM pause: 4939 milliseconds.
[21:48:30,783][WARNING][query-#120%PincodeGrid%][IgniteH2Indexing] Query execution is too long [time=5052 ms, sql='SELECT

Please let me know if you can provide any help on this. 

  • Apache Ignite version : 2.7.5

  • Ignite persistence is enabled (true)

  • 2 node cluster in partitioned mode

  • RAM - 150 GB per node 

  • JVM xms and xmx 20G

  • Number of records - 160 million 

  • JVM options - 

    /usr/java/jdk1.8.0_144/bin/java -XX:+AggressiveOpts -server -Xms20g -Xmx20g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/etappdata/ignite/logs/PROD/etail-prod-ignite76-163/logs -XX:+ExitOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Xloggc:/etappdata/ignite/logs/PROD/etail-prod-ignite76-163/gc.log -XX:+PrintAdaptiveSizePolicy -XX:+UseTLAB -verbose:gc -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Addresses=true -Djava.net.preferIPv6Stack=false -Djava.net.preferIPv6Addresses=false -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8996 -Dcom.sun.management.jmxremote.rmi.port=8996 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false -Djava.rmi.server.hostname=etail-prod-ignite76-163 -XX:MaxDirectMemorySize=4g -javaagent:/tmp/apminsight-javaagent-prod/apminsight-javaagent.jar -Dfile.encoding=UTF-8 -XX:+UseG1GC -DIGNITE_QUIET=false -DIGNITE_SUCCESS_FILE=/ignite/apache-ignite-2.7.5-bin/work/ignite_success_7d9ec20d-9728-475a-aa80-4355eb8eaf02 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=49112 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -DIGNITE_HOME=/ignite/apache-ignite-2.7.5-bin -DIGNITE_PROG_NAME=./bin/ignite.sh -cp /ignite/apache-ignite-2.7.5-bin/libs/:/ignite/apache-ignite-2.7.5-bin/libs/ignite-indexing/:/ignite/apache-ignite-2.7.5-bin/libs/ignite-spring/:/ignite/apache-ignite-2.7.5-bin/libs/licenses/ org.apache.ignite.startup.cmdline.CommandLineStartup config/config-cache.xml

    Added additional details

    [04:03:48,251][INFO][main][IgniteKernal%PincodeGrid] IgniteConfiguration [igniteInstanceName=PincodeGrid, pubPoolSize=30, svcPoolSize=30, callbackPoolSize=30, stripedPoolSize=30, sysPoolSize=30, mgmtPoolSize=4, igfsPoolSize=30, dataStreamerPoolSize=30, utilityCachePoolSize=30, utilityCacheKeepAliveTime=60000, p2pPoolSize=2, qryPoolSize=30, igniteHome=/ignite/apache-ignite-2.7.5-bin, igniteWorkDir=/ignite/apache-ignite-2.7.5-bin/work, mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@13221655, nodeId=6aee7bb4-2804-4396-a9ec-65abdc9483e3, marsh=BinaryMarshaller [], marshLocJobs=false, daemon=false, p2pEnabled=true, netTimeout=5000, sndRetryDelay=1000, sndRetryCnt=3, metricsHistSize=10000, metricsUpdateFreq=2000, metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=600000, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null], segPlc=STOP, segResolveAttempts=2, waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=10000, commSpi=TcpCommunicationSpi [connectGate=null, connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@34123d65, enableForcibleNodeKill=false, enableTroubleshootingLog=false, locAddr=null, locHost=null, locPort=47100, locPortRange=100, shmemPort=-1, directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=5000, maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=65536, msgQueueLimit=2048, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null, usePairedConnections=true, connectionsPerNode=10, tcpNoDelay=true, filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0, sockWriteTimeout=10000, boundTcpPort=-1, boundTcpShmemPort=-1, selectorsCnt=15, selectorSpins=0, addrRslvr=null, ctxInitLatch=java.util.concurrent.CountDownLatch@59474f18[Count = 1], stopping=false], evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@65fb9ffc, colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [], indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@3590fc5b, addrRslvr=null, encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@397fbdb, clientMode=false, rebalanceThreadPoolSize=16, txCfg=TransactionConfiguration [txSerEnabled=false, dfltIsolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0, txTimeoutOnPartitionMapExchange=0, pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null, txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true, discoStartupDelay=60000, deployMode=SHARED, p2pMissedCacheSize=100, locHost=null, timeSrvPortBase=31100, timeSrvPortRange=100, failureDetectionTimeout=80000, sysWorkerBlockedTimeout=30000, clientFailureDetectionTimeout=120000, metricsLogFreq=6000000, hadoopCfg=null, connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211, noDelay=true, directBuf=false, sndBufSize=32768, rcvBufSize=32768, idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0, selectorCnt=4, idleTimeout=7000, sslEnabled=false, sslClientAuth=false, sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=30, msgInterceptor=null], odbcCfg=null, warmupClos=null, atomicCfg=AtomicConfiguration [seqReserveSize=1000, cacheMode=PARTITIONED, backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null, platformCfg=null, binaryCfg=null, memCfg=null, pstCfg=null, dsCfg=DataStorageConfiguration [sysRegionInitSize=41943040, sysRegionMaxSize=104857600, pageSize=4096, concLvl=0, dfltDataRegConf=DataRegionConfiguration [name=Default_Region, maxSize=128849018880, initSize=112742891520, swapPath=null, pageEvictionMode=DISABLED, evictionThreshold=0.9, emptyPagesPoolSize=100, metricsEnabled=false, metricsSubIntervalCount=5, metricsRateTimeInterval=60000, persistenceEnabled=true, checkpointPageBufSize=4294967296], dataRegions=null, storagePath=/ignite/persistence, checkpointFreq=180000, lockWaitTime=10000, checkpointThreads=8, checkpointWriteOrder=SEQUENTIAL, walHistSize=20, maxWalArchiveSize=1073741824, walSegments=10, walSegmentSize=1073741824, walPath=/wal/pincode, walArchivePath=/wal/pincode/archive, metricsEnabled=false, walMode=BACKGROUND, walTlbSize=131072, walBuffSize=0, walFlushFreq=2000, walFsyncDelay=1000, walRecordIterBuffSize=67108864, alwaysWriteFullPages=false, fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@2d3379b4, metricsSubIntervalCnt=5, metricsRateTimeInterval=60000, walAutoArchiveAfterInactivity=-1, writeThrottlingEnabled=false, walCompactionEnabled=false, walCompactionLevel=1, checkpointReadLockTimeout=null], activeOnStart=true, autoActivation=true, longQryWarnTimeout=3000, sqlConnCfg=null, cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100, sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpenCursorsPerConn=128, threadPoolSize=30, idleTimeout=0, jdbcEnabled=true, odbcEnabled=true, thinCliEnabled=true, sslEnabled=false, useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, authEnabled=false, failureHnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], commFailureRslvr=null]

Edit -2 - GC logs

2020-12-01T22:49:31.729+0530: 15.630: [GC pause (Metadata GC Threshold) (young) (initial-mark) 15.630: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 0, predicted base time: 38.43 ms, remaining time: 161.57 ms, target pause time: 200.00 ms]
15.630: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 24 regions, survivors: 2 regions, predicted young region time: 356.58 ms]
15.630: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 24 regions, survivors: 2 regions, old: 0 regions, predicted pause time: 395.01 ms, target pause time: 200.00 ms]
15.657: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: concurrent cycle is about to start], 0.0274990 secs]
[Parallel Time: 15.8 ms, GC Workers: 21]
[GC Worker Start (ms): Min: 15630.2, Avg: 15630.5, Max: 15630.8, Diff: 0.7]
[Ext Root Scanning (ms): Min: 1.6, Avg: 3.4, Max: 11.4, Diff: 9.8, Sum: 71.8]
[Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Code Root Scanning (ms): Min: 0.0, Avg: 1.2, Max: 12.6, Diff: 12.6, Sum: 24.2]
[Object Copy (ms): Min: 0.0, Avg: 9.5, Max: 12.0, Diff: 11.9, Sum: 199.9]
[Termination (ms): Min: 0.0, Avg: 0.7, Max: 0.8, Diff: 0.7, Sum: 14.8]
[Termination Attempts: Min: 1, Avg: 2.2, Max: 4, Diff: 3, Sum: 47]
[GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.2, Diff: 0.1, Sum: 1.0]
[GC Worker Total (ms): Min: 14.5, Avg: 14.8, Max: 15.2, Diff: 0.7, Sum: 311.8]
[GC Worker End (ms): Min: 15645.3, Avg: 15645.3, Max: 15645.4, Diff: 0.1]
[Code Root Fixup: 0.6 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.5 ms]
[Other: 10.5 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 8.3 ms]
[Ref Enq: 0.5 ms]
[Redirty Cards: 0.5 ms]
[Humongous Register: 0.0 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.2 ms]
[Eden: 192.0M(1008.0M)->0.0B(984.0M) Survivors: 16.0M->40.0M Heap: 198.0M(20.0G)->33.7M(20.0G)]
[Times: user=0.31 sys=0.00, real=0.03 secs]
2020-12-01T22:49:31.757+0530: 15.657: [GC concurrent-root-region-scan-start]
2020-12-01T22:49:31.764+0530: 15.664: [GC concurrent-root-region-scan-end, 0.0067826 secs]
2020-12-01T22:49:31.764+0530: 15.664: [GC concurrent-mark-start]
2020-12-01T22:49:31.765+0530: 15.666: [GC concurrent-mark-end, 0.0015043 secs]
2020-12-01T22:49:31.766+0530: 15.666: [GC remark 2020-12-01T22:49:31.766+0530: 15.666: [Finalize Marking, 0.0010641 secs] 2020-12-01T22:49:31.767+0530: 15.667: [GC ref-proc, 0.0100232 secs] 2020-12-01T22:49:31.777+0530: 15.677: [Unloading, 0.0072592 secs], 0.0191010 secs]
[Times: user=0.20 sys=0.00, real=0.02 secs]
2020-12-01T22:49:31.785+0530: 15.685: [GC cleanup 37M->37M(20G), 0.0085803 secs]
[Times: user=0.04 sys=0.00, real=0.01 secs]
2020-12-01T22:53:45.090+0530: 268.990: [GC pause (G1 Evacuation Pause) (young) 268.990: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 0, predicted base time: 30.72 ms, remaining time: 169.28 ms, target pause time: 200.00 ms]
268.990: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 123 regions, survivors: 5 regions, predicted young region time: 1342.47 ms]
268.990: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 123 regions, survivors: 5 regions, old: 0 regions, predicted pause time: 1373.18 ms, target pause time: 200.00 ms]
269.040: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: candidate old regions not available]
, 0.0494933 secs]
[Parallel Time: 31.8 ms, GC Workers: 21]
[GC Worker Start (ms): Min: 268991.8, Avg: 268992.1, Max: 268992.5, Diff: 0.7]
[Ext Root Scanning (ms): Min: 0.9, Avg: 1.9, Max: 5.7, Diff: 4.7, Sum: 39.6]
[Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]
[Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.5]
[Code Root Scanning (ms): Min: 0.0, Avg: 1.0, Max: 7.0, Diff: 7.0, Sum: 20.2]
[Object Copy (ms): Min: 21.7, Avg: 28.1, Max: 29.1, Diff: 7.4, Sum: 591.0]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 21]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 1.8]
1

1 Answers

1
votes

If you have 20G heap you may expect to have an eventual full GC taking 4 seconds. I can believe that. Why do you need so much heap with Apache Ignite? How much heap is used during normal course of operations? You may also take a heap dump and search for a memory leak.

Node that Apache Ignite does not store the data on heap by default so it can't be explained by amount of data alone.

I ran your GC log through gceasy.io and it have found several GCs spanning around 2 seconds. I'm not sure it explains your observed 4s pause but you may expect 2s pauses obviously, from GC, which is in the same ballpark.

gceasy.io chart

So you need to figure out why your JVM becomes slow sometimes. Maybe it's IO, virtualization pauses, etc, etc. Also, if your heap is never larger than 2G, maybe you should run with something like -Xmx4G?