I am trying to setup a simple two node ignite cluster with kubernetes. The same configuration works fine when running on VM directly.
Essentially I have two pods that are microservice written in Vertx with Ignite as embedded node , pod1 exposes 9090 via service1
and pod2 exposes 9092 via service2
Both the pods use ignite-service
to expose the Ignite discovery ports 47100 and 47500 and both pods implements KubernetesIPFinder
pod1 --> service1(9090, 10900) |
| --> ignite-service (47100/TCP,47500/TCP)
pod2 --> service2(9092, 10900) | ^
|
KubernetesIPFinder----------------------------
ns = ignite-ns
svc = ignite-service
ServiceAccount (ignite-account)
When both pods start I can see the discovery happening but the second pod always hangs with the below logs. I am not sure if this because of the way I configured the k8s objects or some resource contention in the k8s.
If changed the configuration to use thin clients for the pods then everything works fine. The pods are able to start ignite and expose the rest endpoints of the vertx application
[INFO ] 2020-08-26 16:09:08.969 [main] IgniteKernal%aztecCommunityUserIgnite - VM arguments: [-Xms1g, -Xmx1g, -XX:MaxGCPauseMillis=500, -XX:GCPauseIntervalMillis=30000, -XX:InitiatingHeapOccupancyPercent=60, -XX:G1ReservePercent=30, -XX:+HeapDumpOnOutOfMemoryError, -XX:+DisableExplicitGC, -Djava.net.preferIPv4Stack=true, -XX:+UseG1GC, -Xlog:gc*,safepoint,age*,ergo*:file=/app/aztec/logs/gc-%p-%t.log:tags,uptime,time,level:filecount=10,filesize=50m, -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true, -DIGNITE_LONG_OPERATIONS_DUMP_TIMEOUT=300000, -Dlog4j.configurationFile=file:///app/aztec/communityuser_service/conf/log4j2.xml, -DIGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN=true, -DIGNITE_NO_SHUTDOWN_HOOK=true, -DIGNITE_WAL_MMAP=false]
[INFO ] 2020-08-26 16:09:08.970 [main] IgniteKernal%aztecCommunityUserIgnite - System cache's DataRegion size is configured to 40 MB. Use DataStorageConfiguration.systemRegionInitialSize property to change the setting.
[INFO ] 2020-08-26 16:09:08.970 [main] IgniteKernal%aztecCommunityUserIgnite - Configured caches [in 'sysMemPlc' dataRegion: ['ignite-sys-cache']]
[INFO ] 2020-08-26 16:09:09.054 [main] IgnitePluginProcessor - Configured plugins:
[INFO ] 2020-08-26 16:09:09.054 [main] IgnitePluginProcessor - ^-- None
[INFO ] 2020-08-26 16:09:09.054 [main] IgnitePluginProcessor -
[INFO ] 2020-08-26 16:09:09.059 [main] FailureProcessor - Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
[WARN ] 2020-08-26 16:09:09.278 [main] TcpCommunicationSpi - Failure detection timeout will be ignored (one of SPI parameters has been set explicitly)
[INFO ] 2020-08-26 16:09:09.299 [main] TcpCommunicationSpi - Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
[WARN ] 2020-08-26 16:09:09.302 [main] TcpCommunicationSpi - Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[WARN ] 2020-08-26 16:09:09.312 [main] NoopCheckpointSpi - Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
[WARN ] 2020-08-26 16:09:09.337 [main] GridCollisionManager - Collision resolution is disabled (all jobs will be activated upon arrival).
[INFO ] 2020-08-26 16:09:09.341 [main] IgniteKernal%aztecCommunityUserIgnite - Security status [authentication=off, tls/ssl=off]
[INFO ] 2020-08-26 16:09:09.392 [main] TcpDiscoverySpi - Successfully bound to TCP port [port=47500, localHost=0.0.0.0/0.0.0.0, locNodeId=11e43ce8-b846-41ac-b688-9c6c34aebcf9]
[INFO ] 2020-08-26 16:09:09.421 [main] PdsFoldersResolver - Successfully created new persistent storage folder [/app/aztec/data/ignite/db/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c]
[INFO ] 2020-08-26 16:09:09.422 [main] PdsFoldersResolver - Consistent ID used for local node is [6cd407c6-0c86-4e57-9803-ab56bec5b16c] according to persistence data storage folders
[INFO ] 2020-08-26 16:09:09.423 [main] CacheObjectBinaryProcessorImpl - Resolved directory for serialized binary metadata: /app/aztec/data/ignite/binary_meta/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c
[INFO ] 2020-08-26 16:09:09.637 [main] FilePageStoreManager - Resolved page store work directory: /app/aztec/data/ignite/db/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c
[INFO ] 2020-08-26 16:09:09.637 [main] FileWriteAheadLogManager - Resolved write ahead log work directory: /app/aztec/data/ignite/db/wal/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c
[INFO ] 2020-08-26 16:09:09.638 [main] FileWriteAheadLogManager - Resolved write ahead log archive directory: /app/aztec/data/ignite/db/wal/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c
[INFO ] 2020-08-26 16:09:09.951 [main] FileHandleManagerImpl - Initialized write-ahead log manager [mode=BACKGROUND]
[WARN ] 2020-08-26 16:09:09.954 [main] GridCacheDatabaseSharedManager - DataRegionConfiguration.maxWalArchiveSize instead DataRegionConfiguration.walHistorySize would be used for removing old archive wal files
[INFO ] 2020-08-26 16:09:09.975 [main] GridCacheDatabaseSharedManager - Configured data regions initialized successfully [total=4]
[INFO ] 2020-08-26 16:09:09.993 [main] PartitionsEvictManager - Evict partition permits=2
[WARN ] 2020-08-26 16:09:10.029 [main] IgniteH2Indexing - Serialization of Java objects in H2 was enabled.
[INFO ] 2020-08-26 16:09:10.251 [main] ClientListenerProcessor - Client connector processor has started on TCP port 10900
[INFO ] 2020-08-26 16:09:10.324 [main] GridTcpRestProtocol - Command protocol successfully started [name=TCP binary, host=0.0.0.0/0.0.0.0, port=11211]
[INFO ] 2020-08-26 16:09:10.374 [main] IgniteKernal%aztecCommunityUserIgnite - Non-loopback local IPs: 172.17.239.163
[INFO ] 2020-08-26 16:09:10.375 [main] IgniteKernal%aztecCommunityUserIgnite - Enabled local MACs: 2255F14C9361
[INFO ] 2020-08-26 16:09:10.381 [main] GridCacheDatabaseSharedManager - Read checkpoint status [startMarker=null, endMarker=null]
[INFO ] 2020-08-26 16:09:10.388 [main] PageMemoryImpl - Started page memory [memoryAllocated=100.0 MiB, pages=24814, tableSize=1.9 MiB, checkpointBuffer=100.0 MiB]
[INFO ] 2020-08-26 16:09:10.391 [main] GridCacheDatabaseSharedManager - Checking memory state [lastValidPos=FileWALPointer [idx=0, fileOff=0, len=0], lastMarked=FileWALPointer [idx=0, fileOff=0, len=0], lastCheckpointId=00000000-0000-0000-0000-000000000000]
[INFO ] 2020-08-26 16:09:10.428 [main] GridCacheDatabaseSharedManager - Applying lost cache updates since last checkpoint record [lastMarked=FileWALPointer [idx=0, fileOff=0, len=0], lastCheckpointId=00000000-0000-0000-0000-000000000000]
[INFO ] 2020-08-26 16:09:10.430 [main] GridCacheDatabaseSharedManager - Finished applying WAL changes [updatesApplied=0, time=0 ms]
[INFO ] 2020-08-26 16:09:10.430 [main] GridCacheProcessor - Restoring partition state for local groups.
[INFO ] 2020-08-26 16:09:10.430 [main] GridCacheProcessor - Finished restoring partition state for local groups [groupsProcessed=0, partitionsProcessed=0, time=0ms]
[INFO ] 2020-08-26 16:09:10.483 [main] FilePageStoreManager - Cleanup cache stores [total=1, left=0, cleanFiles=false]
[INFO ] 2020-08-26 16:09:10.491 [main] PageMemoryImpl - Started page memory [memoryAllocated=100.0 MiB, pages=24814, tableSize=1.9 MiB, checkpointBuffer=100.0 MiB]
[INFO ] 2020-08-26 16:09:10.492 [main] PageMemoryImpl - Started page memory [memoryAllocated=100.0 MiB, pages=24814, tableSize=1.9 MiB, checkpointBuffer=100.0 MiB]
[INFO ] 2020-08-26 16:09:10.493 [main] PageMemoryImpl - Started page memory [memoryAllocated=100.0 MiB, pages=24814, tableSize=1.9 MiB, checkpointBuffer=100.0 MiB]
[INFO ] 2020-08-26 16:09:10.502 [main] GridCacheDatabaseSharedManager - Configured data regions started successfully [total=4]
[INFO ] 2020-08-26 16:09:10.503 [main] GridCacheDatabaseSharedManager - Starting binary memory restore for: [-2100569601]
[INFO ] 2020-08-26 16:09:10.518 [main] GridCacheDatabaseSharedManager - Read checkpoint status [startMarker=null, endMarker=null]
[INFO ] 2020-08-26 16:09:10.518 [main] GridCacheDatabaseSharedManager - Checking memory state [lastValidPos=FileWALPointer [idx=0, fileOff=0, len=0], lastMarked=FileWALPointer [idx=0, fileOff=0, len=0], lastCheckpointId=00000000-0000-0000-0000-000000000000]
[INFO ] 2020-08-26 16:09:10.522 [main] FileWriteAheadLogManager - Resuming logging to WAL segment [file=/app/aztec/data/ignite/db/wal/node00-6cd407c6-0c86-4e57-9803-ab56bec5b16c/0000000000000000.wal, offset=0, ver=2]
[INFO ] 2020-08-26 16:09:10.684 [main] GridCacheProcessor - Started cache in recovery mode [name=ignite-sys-cache, id=-2100569601, dataRegionName=sysMemPlc, mode=REPLICATED, atomicity=TRANSACTIONAL, backups=2147483647, mvcc=false]
[INFO ] 2020-08-26 16:09:10.689 [main] GridCacheDatabaseSharedManager - Binary recovery performed in 186 ms.
[INFO ] 2020-08-26 16:09:10.690 [main] GridCacheDatabaseSharedManager - Read checkpoint status [startMarker=null, endMarker=null]
[INFO ] 2020-08-26 16:09:10.690 [main] GridCacheDatabaseSharedManager - Applying lost cache updates since last checkpoint record [lastMarked=FileWALPointer [idx=0, fileOff=0, len=0], lastCheckpointId=00000000-0000-0000-0000-000000000000]
[INFO ] 2020-08-26 16:09:10.692 [main] GridCacheDatabaseSharedManager - Finished applying WAL changes [updatesApplied=0, time=0 ms]
[INFO ] 2020-08-26 16:09:10.692 [main] GridCacheProcessor - Restoring partition state for local groups.
[INFO ] 2020-08-26 16:09:10.703 [main] GridCacheProcessor - Finished restoring partition state for local groups [groupsProcessed=1, partitionsProcessed=0, time=10ms]
[INFO ] 2020-08-26 16:09:10.738 [main] TcpDiscoverySpi - Connection check threshold is calculated: 300000
[INFO ] 2020-08-26 16:11:18.387 [tcp-disco-srvr-[:47500]-#3%aztecCommunityUserIgnite%] TcpDiscoverySpi - TCP discovery accepted incoming connection [rmtAddr=/172.17.239.64, rmtPort=34837]
[INFO ] 2020-08-26 16:11:18.395 [tcp-disco-srvr-[:47500]-#3%aztecCommunityUserIgnite%] TcpDiscoverySpi - TCP discovery spawning a new thread for connection [rmtAddr=/172.17.239.64, rmtPort=34837]
[INFO ] 2020-08-26 16:11:18.396 [tcp-disco-sock-reader-[]-#4%aztecCommunityUserIgnite%] TcpDiscoverySpi - Started serving remote node connection [rmtAddr=/172.17.239.64:34837, rmtPort=34837]
[INFO ] 2020-08-26 16:11:18.399 [tcp-disco-sock-reader-[]-#4%aztecCommunityUserIgnite%] TcpDiscoverySpi - Received ping request from the remote node [rmtNodeId=f4df02cf-0700-4f31-93b0-9073c9394d2d, rmtAddr=/172.17.239.64:34837, rmtPort=34837]
[INFO ] 2020-08-26 16:11:18.400 [tcp-disco-sock-reader-[]-#4%aztecCommunityUserIgnite%] TcpDiscoverySpi - Finished writing ping response [rmtNodeId=f4df02cf-0700-4f31-93b0-9073c9394d2d, rmtAddr=/172.17.239.64:34837, rmtPort=34837]
[INFO ] 2020-08-26 16:11:18.400 [tcp-disco-sock-reader-[]-#4%aztecCommunityUserIgnite%] TcpDiscoverySpi - Finished serving remote node connection [rmtAddr=/172.17.239.64:34837, rmtPort=34837
[INFO ] 2020-08-26 16:13:25.749 [tcp-disco-srvr-[:47500]-#3%aztecCommunityUserIgnite%] TcpDiscoverySpi - TCP discovery accepted incoming connection [rmtAddr=/172.17.239.64, rmtPort=36858]
[INFO ] 2020-08-26 16:13:25.749 [tcp-disco-srvr-[:47500]-#3%aztecCommunityUserIgnite%] TcpDiscoverySpi - TCP discovery spawning a new thread for connection [rmtAddr=/172.17.239.64, rmtPort=36858]
[INFO ] 2020-08-26 16:13:25.750 [tcp-disco-sock-reader-[]-#5%aztecCommunityUserIgnite%] TcpDiscoverySpi - Started serving remote node connection [rmtAddr=/172.17.239.64:36858, rmtPort=36858]
[INFO ] 2020-08-26 16:13:25.752 [tcp-disco-sock-reader-[f4df02cf 172.17.239.64:36858]-#5%aztecCommunityUserIgnite%] TcpDiscoverySpi - Initialized connection with remote server node [nodeId=f4df02cf-0700-4f31-93b0-9073c9394d2d, rmtAddr=/172.17.239.64:36858]
[INFO ] 2020-08-26 16:13:25.772 [tcp-disco-msg-worker-[]-#2%aztecCommunityUserIgnite%] TcpDiscoverySpi - New next node [newNext=TcpDiscoveryNode [id=f4df02cf-0700-4f31-93b0-9073c9394d2d, consistentId=b003163e-ef90-450a-885c-6d7e9b0cbef4, addrs=ArrayList [127.0.0.1, 172.17.193.243], sockAddrs=HashSet [sit-aztec-authentication-service/192.168.164.225:47500, /127.0.0.1:47500, /172.17.193.243:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1598458405757, loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false]]
Update:
IgniteConfiguration:
[INFO ] 2020-08-26 16:25:03.364 [main] IgniteKernal%aztecAuthIgnite - IgniteConfiguration [igniteInstanceName=aztecAuthIgnite, pubPoolSize=8, svcPoolSize=8, callbackPoo
lSize=8, stripedPoolSize=8, sysPoolSize=8, mgmtPoolSize=4, igfsPoolSize=1, dataStreamerPoolSize=8, utilityCachePoolSize=8, utilityCacheKeepAliveTime=60000, p2pPoolSize=
2, qryPoolSize=8, sqlQryHistSize=1000, dfltQryTimeout=0, igniteHome=null, igniteWorkDir=/app/aztec/data/ignite, mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@d554c5f,
nodeId=3b17a57c-6ee6-4225-bc50-a762f6ec50af, marsh=BinaryMarshaller [], marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=150000, netCompressionLevel=1, s
ndRetryDelay=1000, sndRetryCnt=3, metricsHistSize=10000, metricsUpdateFreq=2000, metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, sockTimeo
ut=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=5, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null, sk
ipAddrsRandomization=false], segPlc=STOP, segResolveAttempts=2, waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=10000, commSpi=TcpCommunicationSpi [connect
Gate=null, connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@60c38c44, chConnPlc=null, enableForcibleNodeKill=false, enableTroub
leshootingLog=false, locAddr=null, locHost=null, locPort=47100, locPortRange=100, shmemPort=-1, directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=
5000, maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768, msgQueueLimit=0, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null, usePairedConnections
=false, connectionsPerNode=1, tcpNoDelay=true, filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0, sockWriteTimeout=2000, boundTcpPort=-1, boundTc
pShmemPort=-1, selectorsCnt=4, selectorSpins=0, addrRslvr=null, ctxInitLatch=java.util.concurrent.CountDownLatch@1ee2a1e2[Count = 1], stopping=false, metricsLsnr=null],
evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@59ae2de7, colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [], indexingSpi=org.apache.ignite.spi.
indexing.noop.NoopIndexingSpi@38bb9fad, addrRslvr=null, encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@11620476, clientMode=false, rebalanceThrea
dPoolSize=4, rebalanceTimeout=10000, rebalanceBatchesPrefetchCnt=3, rebalanceThrottle=0, rebalanceBatchSize=524288, txCfg=TransactionConfiguration [txSerEnabled=false,
dfltIsolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0, txTimeoutOnPartitionMapExchange=0, deadlockTimeout=10000, pessimisticTxLogSize=0, pessimist
icTxLogLinger=10000, tmLookupClsName=null, txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true, discoStartupDelay=60000, deployMode=SHARED, p2pMissed
CacheSize=100, locHost=null, timeSrvPortBase=31100, timeSrvPortRange=100, failureDetectionTimeout=300000, sysWorkerBlockedTimeout=null, clientFailureDetectionTimeout=30
000, metricsLogFreq=60000, hadoopCfg=null, connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211, noDelay=true, directBuf=false, sndBufSize=32768,
rcvBufSize=32768, idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0, selectorCnt=1, idleTimeout=7000, sslEnabled=false, sslClientAuth=false, sslCtxFa
ctory=null, sslFactory=null, portRange=100, threadPoolSize=8, msgInterceptor=null], odbcCfg=null, warmupClos=null, atomicCfg=AtomicConfiguration [seqReserveSize=1000, c
acheMode=PARTITIONED, backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null, platformCfg=null, binaryCfg=null, memCfg=null, pstCfg=null, dsCfg=DataStora
geConfiguration [sysRegionInitSize=41943040, sysRegionMaxSize=104857600, pageSize=4096, concLvl=0, dfltDataRegConf=DataRegionConfiguration [name=Default_Region, maxSize
=131072000, initSize=26214400, swapPath=null, pageEvictionMode=DISABLED, evictionThreshold=0.9, emptyPagesPoolSize=100, metricsEnabled=false, metricsSubIntervalCount=5,
metricsRateTimeInterval=60000, persistenceEnabled=true, checkpointPageBufSize=0, lazyMemoryAllocation=true], dataRegions=null, storagePath=db, checkpointFreq=60000, lo
ckWaitTime=10000, checkpointThreads=4, checkpointWriteOrder=SEQUENTIAL, walHistSize=20, maxWalArchiveSize=250000000, walSegments=4, walSegmentSize=67108864, walPath=db/wal, walArchivePath=db/wal, metricsEnabled=false, walMode=BACKGROUND, walTlbSize=131072, walBuffSize=33554432, walFlushFreq=5000, walFsyncDelay=1000, walRecordIterBuffSize=67108864, alwaysWriteFullPages=false, fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@25a02442, metricsSubIntervalCnt=5, metricsRateTimeInterval=60000, walAutoArchiveAfterInactivity=-1, writeThrottlingEnabled=true, walCompactionEnabled=false, walCompactionLevel=1, checkpointReadLockTimeout=null, walPageCompression=DISABLED, walPageCompressionLevel=null], activeOnStart=true, autoActivation=false, longQryWarnTimeout=3000, sqlConnCfg=null, cliConnCfg=ClientConnectorConfiguration [host=sit-aztec-authentication-service, port=10900, portRange=10, sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpenCursorsPerConn=64, threadPoolSize=8, idleTimeout=0, handshakeTimeout=10000, jdbcEnabled=true, odbcEnabled=true, thinCliEnabled=true, sslEnabled=false, useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null, thinCliCfg=ThinClientConfiguration [maxActiveTxPerConn=100]], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, authEnabled=false, failureHnd=null, commFailureRslvr=null]