0
votes

We are using Titan-1.0.0 supported by backed as cassandra 2.1.7. The cluster consists of 36 VMs each of 16GB RAM (6GB heapsize) and 16 CPU cores. We are using SSD disk for cassandra data and a normal HDD for commitlogs. The RF is 3 and reads/writes are being done at CF=2 each. The Java version is 1.8.0_45 There is heavy reads being served through this cluster. We are facing random outages with some servers at an average interval of around twice-thrice in day one of the server goes down. Let me share a particular instance of around 10-May-2017 04:00 AM, there were too many READ/MUTATION drops and we had to restart the server. Following are some of parameters that we have collected when the servers go down:

Can someone please please help us with some pointers, as to what is going wrong here and what can be done to stabilize the servers.

  • gc.log

    2017-05-11T21:09:44.018+0530: 1900952.543: Total time for which application threads were stopped: 0.0156063 seconds, Stopping threads took: 0.0125466 seconds 2017-05-11T21:09:45.021+0530: 1900953.546: Total time for which application threads were stopped: 0.0031342 seconds, Stopping threads took: 0.0006697 seconds 2017-05-11T21:09:46.025+0530: 1900954.550: Total time for which application threads were stopped: 0.0037100 seconds, Stopping threads took: 0.0011525 seconds 2017-05-11T21:09:47.031+0530: 1900955.556: Total time for which application threads were stopped: 0.0057972 seconds, Stopping threads took: 0.0030861 seconds 2017-05-11T21:09:48.034+0530: 1900956.559: Total time for which application threads were stopped: 0.0029592 seconds, Stopping threads took: 0.0007129 seconds 2017-05-11T21:09:49.061+0530: 1900957.586: Total time for which application threads were stopped: 0.0268731 seconds, Stopping threads took: 0.0242117 seconds 2017-05-11T21:10:03.095+0530: 1900971.620: Total time for which application threads were stopped: 0.0045643 seconds, Stopping threads took: 0.0009709 seconds 2017-05-11T21:10:04.099+0530: 1900972.624: Total time for which application threads were stopped: 0.0033764 seconds, Stopping threads took: 0.0007673 seconds 2017-05-11T21:10:07.104+0530: 1900975.629: Total time for which application threads were stopped: 0.0056445 seconds, Stopping threads took: 0.0019495 seconds 2017-05-11T21:10:07.447+0530: 1900975.972: Total time for which application threads were stopped: 0.0045721 seconds, Stopping threads took: 0.0009695 seconds {Heap before GC invocations=85258 (full 0):  garbage-first heap   total 6291456K, used 5059586K [0x0000000640000000, 0x0000000640206000, 0x00000007c0000000)  region size 2048K, 1844 young (3776512K), 105 survivors (215040K)  Metaspace       used 36253K, capacity 36654K, committed 36784K, reserved 1081344K   class space    used 3844K, capacity 3965K, committed 4016K, reserved 1048576K 2017-05-11T21:10:07.453+0530:
    1900975.978: [GC pause (GCLocker Initiated GC) (young) Desired survivor size 242221056 bytes, new threshold 15 (max 15)
    - age   1:  182659896 bytes,  182659896 total
    - age   2:   21418160 bytes,  204078056 total
    - age   3:     871016 bytes,  204949072 total
    - age   4:    3600512 bytes,  208549584 total
    - age   5:    1313096 bytes,  209862680 total
    - age   6:      21152 bytes,  209883832 total , 0.0972867 secs]    [Parallel Time: 83.6 ms, GC Workers: 13]
          [GC Worker Start (ms): Min: 1900975980.9, Avg: 1900975986.0, Max: 1900975989.4, Diff: 8.5]
          [Ext Root Scanning (ms): Min: 8.5, Avg: 11.9, Max: 17.4, Diff: 8.9, Sum: 154.3]
          [Update RS (ms): Min: 26.8, Avg: 30.8, Max: 34.2, Diff: 7.4, Sum: 399.8]
             [Processed Buffers: Min: 31, Avg: 87.3, Max: 144, Diff: 113, Sum: 1135]
          [Scan RS (ms): Min: 0.1, Avg: 2.8, Max: 6.2, Diff: 6.1, Sum: 36.6]
          [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 0.9]
          [Object Copy (ms): Min: 31.5, Avg: 32.3, Max: 33.2, Diff: 1.7, Sum: 419.9]
          [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
          [GC Worker Other (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.3, Sum: 3.1]
          [GC Worker Total (ms): Min: 74.7, Avg: 78.1, Max: 83.3, Diff: 8.7, Sum: 1014.7]
          [GC Worker End (ms): Min: 1900976063.9, Avg: 1900976064.1, Max: 1900976064.3, Diff: 0.3]    [Code Root Fixup: 0.1 ms]    [Code Root Purge: 0.0 ms]    [Clear CT: 1.4 ms]    [Other: 12.1 ms]
          [Choose CSet: 0.0 ms]
          [Ref Proc: 1.0 ms]
          [Ref Enq: 0.0 ms]
          [Redirty Cards: 1.6 ms]
          [Humongous Reclaim: 0.4 ms]
          [Free CSet: 4.5 ms]    [Eden: 3478.0M(3476.0M)->0.0B(252.0M) Survivors: 210.0M->54.0M Heap: 4941.5M(6144.0M)->1175.0M(6144.0M)] Heap after GC invocations=85259 (full 0):  garbage-first heap   total 6291456K, used 1203201K [0x0000000640000000, 0x0000000640206000, 0x00000007c0000000)   region size 2048K, 27 young (55296K), 27 survivors (55296K)  Metaspace       used 36253K, capacity 36654K, committed 36784K, reserved 1081344K   class space    used 3844K, capacity 3965K, committed 4016K, reserved 1048576K }  [Times: user=0.58 sys=0.21, real=0.09 secs]  2017-05-11T21:10:07.551+0530:
    1900976.076: Total time for which application threads were stopped: 0.1015506 seconds, Stopping threads took: 0.0004470 seconds 2017-05-11T21:10:08.557+0530: 1900977.082: Total time for which application threads were stopped: 0.0053236 seconds, Stopping threads took: 0.0010397 seconds {Heap before GC invocations=85259 (full 0):  garbage-first heap   total 6291456K, used 1461249K [0x0000000640000000, 0x0000000640206000, 0x00000007c0000000)   region size 2048K, 153 young (313344K), 27 survivors (55296K)  Metaspace      used 36253K, capacity 36654K, committed 36784K, reserved 1081344K   class space    used 3844K, capacity 3965K, committed 4016K, reserved 1048576K 2017-05-11T21:10:08.995+0530: 1900977.520: [GC pause (G1 Evacuation Pause) (mixed) Desired survivor size 20971520 bytes, new threshold 3 (max 15)
    - age   1:   19769440 bytes,   19769440 total
    - age   2:     801568 bytes,   20571008 total
    - age   3:   20679216 bytes,   41250224 total
    - age   4:     738736 bytes,   41988960 total
    - age   5:    3536096 bytes,   45525056 total
    - age   6:    1242648 bytes,   46767704 total
    - age   7:      19208 bytes,   46786912 total , 0.1879873 secs]    [Parallel Time: 175.3 ms, GC Workers: 13]
          [GC Worker Start (ms): Min: 1900977524.7, Avg: 1900977525.0, Max: 1900977525.3, Diff: 0.6]
          [Ext Root Scanning (ms): Min: 14.5, Avg: 14.8, Max: 15.3, Diff: 0.8, Sum: 192.9]
          [Update RS (ms): Min: 7.1, Avg: 7.9, Max: 13.3, Diff: 6.1, Sum: 103.2]
             [Processed Buffers: Min: 4, Avg: 37.0, Max: 93, Diff: 89, Sum: 481]
          [Scan RS (ms): Min: 55.7, Avg: 60.9, Max: 61.8, Diff: 6.1, Sum: 791.9]
          [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.7, Diff: 0.7, Sum: 0.9]
          [Object Copy (ms): Min: 90.3, Avg: 90.7, Max: 91.2, Diff: 0.9, Sum: 1179.5]
          [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
          [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.1, Sum: 0.8]
          [GC Worker Total (ms): Min: 174.3, Avg: 174.6, Max: 174.8, Diff: 0.6, Sum: 2269.2]
          [GC Worker End (ms): Min: 1900977699.6, Avg: 1900977699.6, Max: 1900977699.7, Diff: 0.1]    [Code Root Fixup: 0.2 ms]    [Code Root Purge: 0.0 ms]    [Clear CT: 0.9 ms]    [Other: 11.5 ms]
          [Choose CSet: 3.3 ms]
          [Ref Proc: 0.4 ms]
          [Ref Enq: 0.0 ms]
          [Redirty Cards: 1.1 ms]
          [Humongous Reclaim: 0.1 ms]
          [Free CSet: 3.6 ms]    [Eden: 252.0M(252.0M)->0.0B(3656.0M) Survivors: 54.0M->30.0M Heap: 1427.0M(6144.0M)->890.0M(6144.0M)] Heap after GC invocations=85260 (full 0):  garbage-first heap   total 6291456K, used 911310K [0x0000000640000000, 0x0000000640206000, 0x00000007c0000000)   region size 2048K, 15 young (30720K), 15 survivors (30720K)  Metaspace       used 36253K, capacity 36654K, committed 36784K, reserved 1081344K   class space    used 3844K, capacity 3965K, committed 4016K, reserved 1048576K }  [Times: user=1.59 sys=0.60, real=0.19 secs]

  • jstat

      S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
      0.00 100.00   2.29  37.85  98.56  95.74  79645 6351.963     0    0.000 6351.963
      0.00 100.00   4.63  37.85  98.56  95.74  79645 6351.963     0    0.000 6351.963
      0.00 100.00   7.81  37.85  98.56  95.74  79645 6351.963     0    0.000 6351.963
      0.00 100.00  10.46  37.85  98.56  95.74  79645 6351.963     0    0.000 6351.963
      0.00 100.00  13.17  37.81  98.56  95.74  79645 6351.963     0    0.000 6351.963
      0.00 100.00  15.30  38.11  98.56  95.74  79645 6351.963     0    0.000 6351.963

  • /var/log/cassandra/system.log
WARN  [SharedPool-Worker-1] 2017-05-11 21:08:49,621 SliceQueryFilter.java:319 - Read 286 live and 12978 tombstone cells in icmsgraph_10042017.edgestore for key: f800000000104600 (see tombstone_warn_threshold). 2147483647 columns were requested, slices=[b02ea000800000350080000002-b02ea000800000350080000003]
WARN  [SharedPool-Worker-1] 2017-05-11 21:08:52,647 SliceQueryFilter.java:319 - Read 286 live and 12978 tombstone cells in icmsgraph_10042017.edgestore for key: f800000000104600 (see tombstone_warn_threshold). 2147483647 columns were requested, slices=[b02ea000800000350080000002-b02ea000800000350080000003]
WARN  [GossipTasks:1] 2017-05-11 21:09:05,030 Gossiper.java:714 - Gossip stage has 1 pending tasks; skipping status check (no nodes will be marked down)
WARN  [GossipTasks:1] 2017-05-11 21:09:16,588 FailureDetector.java:249 - Not marking nodes down due to local pause of 16886451559 > 5000000000
INFO  [ScheduledTasks:1] 2017-05-11 21:09:16,954 MessagingService.java:888 - 1924 READ messages dropped in last 5000ms
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,153 StatusLogger.java:51 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,154 StatusLogger.java:66 - MutationStage                     0         0       32568360         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,155 StatusLogger.java:66 - RequestResponseStage              1         0    12112153129         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,155 StatusLogger.java:66 - ReadRepairStage                   0         0      593141839         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,155 StatusLogger.java:66 - CounterMutationStage              0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,155 StatusLogger.java:66 - ReadStage                        15         0    12168491480         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,155 StatusLogger.java:66 - MiscStage                         0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,156 StatusLogger.java:66 - HintedHandoff                     0         1           3531         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,156 StatusLogger.java:66 - GossipStage                       0         0        5798815         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,156 StatusLogger.java:66 - CacheCleanupExecutor              0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,156 StatusLogger.java:66 - InternalResponseStage             0         0            537         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,156 StatusLogger.java:66 - CommitLogArchiver                 0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,157 StatusLogger.java:66 - CompactionExecutor                0         0       29915673         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,157 StatusLogger.java:66 - ValidationExecutor                0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,157 StatusLogger.java:66 - MigrationStage                    0         0           3990         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,157 StatusLogger.java:66 - AntiEntropyStage                  0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,158 StatusLogger.java:66 - PendingRangeCalculator            0         0            255         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,159 StatusLogger.java:66 - Sampler                           0         0              0         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,159 StatusLogger.java:66 - MemtableFlushWriter               0         0           7705         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,159 StatusLogger.java:66 - MemtablePostFlush                 0         0          44884         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,159 StatusLogger.java:66 - MemtableReclaimMemory             0         0           7705         0                 0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:75 - CompactionManager                 0         0
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:87 - MessagingService                n/a      57/3
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:97 - Cache Type                     Size                 Capacity               KeysToSave
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:99 - KeyCache                  104857584                104857600                      all
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:105 - RowCache                          0                        0                      all
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,160 StatusLogger.java:112 - ColumnFamily                Memtable ops,data

INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.compaction_history                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.hints                              0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.IndexInfo                          0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.schema_columnfamilies                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.schema_triggers                    0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,162 StatusLogger.java:115 - system.size_estimates         762300,93404798
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.paxos                              0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.peer_events                        0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.range_xfers                        0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.compactions_in_progress                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.peers                       283,103428
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.schema_keyspaces                   0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.schema_usertypes                   0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.local                              0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.sstable_activity             315,63532
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.schema_columns                     0,0    
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,163 StatusLogger.java:115 - system.batchlog                           0,0   

INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.titan_ids                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.graphindex         1843,277464
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.system_properties_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.system_properties                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.edgestore_lock_              6,1191
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.systemlog          741,242777
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.txlog           741,242777
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.graphindex_lock_         1332,277240
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_10042017.edgestore      75171,16624309
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_05052017.system_properties_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,169 StatusLogger.java:115 - icmsgraph_05052017.system_properties                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.edgestore_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.txlog                  0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.edgestore                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.graphindex                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.titan_ids                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.systemlog                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:17,170 StatusLogger.java:115 - icmsgraph_05052017.graphindex_lock_                 0,0   
WARN  [SharedPool-Worker-24] 2017-05-11 21:09:17,200 SliceQueryFilter.java:319 - Read 286 live and 12978 tombstone cells in icmsgraph_10042017.edgestore for key: f800000000104600 (see tombstone_warn_threshold). 2147483647 columns were requested, slices=[b02ea000800000350080000002-b02ea000800000350080000003]
WARN  [SharedPool-Worker-18] 2017-05-11 21:09:17,201 SliceQueryFilter.java:319 - Read 286 live and 12978 tombstone cells in icmsgraph_10042017.edgestore for key: f800000000104600 (see tombstone_warn_threshold). 2147483647 columns were requested, slices=[b02ea000800000350080000002-b02ea000800000350080000003]
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,170 MessagingService.java:888 - 1 READ messages dropped in last 5000ms
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,170 StatusLogger.java:51 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - MutationStage                     0         0       32568376         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - RequestResponseStage              0         0    12112193109         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - ReadRepairStage                   0         0      593143878         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - CounterMutationStage              0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - ReadStage                         0         0    12168508581         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,171 StatusLogger.java:66 - MiscStage                         0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - HintedHandoff                     0         1           3531         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - GossipStage                       0         0        5798844         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - CacheCleanupExecutor              0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - InternalResponseStage             0         0            537         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - CommitLogArchiver                 0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - CompactionExecutor                0         0       29915673         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - ValidationExecutor                0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - MigrationStage                    0         0           3990         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - AntiEntropyStage                  0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - PendingRangeCalculator            0         0            255         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,172 StatusLogger.java:66 - Sampler                           0         0              0         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:66 - MemtableFlushWriter               0         0           7705         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:66 - MemtablePostFlush                 0         0          44884         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:66 - MemtableReclaimMemory             0         0           7705         0                 0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:75 - CompactionManager                 0         0     
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:87 - MessagingService                n/a       0/0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:97 - Cache Type                     Size                 Capacity               KeysToSave
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:99 - KeyCache                  104857584                104857600                      all   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,174 StatusLogger.java:105 - RowCache                          0                        0                      all   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,173 StatusLogger.java:99 - KeyCache                  104857584                104857600                      all   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,174 StatusLogger.java:105 - RowCache                          0                        0                      all   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,174 StatusLogger.java:112 - ColumnFamily                Memtable ops,data
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.titan_ids                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.graphindex         1844,277804
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.system_properties_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.system_properties                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.edgestore_lock_              6,1191
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.systemlog          741,242777
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.txlog           741,242777
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.graphindex_lock_         1332,277240
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,179 StatusLogger.java:115 - icmsgraph_10042017.edgestore      75176,16625645
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.system_properties_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.system_properties                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.edgestore_lock_                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.txlog                  0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.edgestore                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.graphindex                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.titan_ids                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.systemlog                 0,0   
INFO  [ScheduledTasks:1] 2017-05-11 21:09:22,180 StatusLogger.java:115 - icmsgraph_05052017.graphindex_lock_                 0,0   
WARN  [SharedPool-Worker-34] 2017-05-11 21:09:24,177 SliceQueryFilter.java:319 - Read 286 live and 12978 tombstone cells in icmsgraph_10042017.edgestore for key: f800000000104600 (see tombstone_warn_threshold). 2147483647 columns were requested, slices=[b02ea000800000350080000002-b02ea000800000350080000003]
  • ThreadDump Around 600 Thrift threads:

    "Thrift:20118" #124110 daemon prio=5 os_prio=0 tid=0x00007feab9a71460 nid=0xb1c6 runnable [0x00007fea6a578000]
       java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked  (a java.io.BufferedInputStream)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
        at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
        at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
        at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
        at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:205)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

72 MessagingService-Incoming threads ...


    "MessagingService-Incoming-/192.168.33.67" #124106 prio=5 os_prio=0 tid=0x00007feab98436b0 nid=0xb0ca runnable [0x00007fea7503a000]    java.lang.Thread.State: RUNNABLE     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)  at sun.nio.ch.IOUtil.read(IOUtil.java:197)  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        - locked  (a java.lang.Object)  at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:192)
        - locked  (a java.lang.Object)  at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
        - locked  (a sun.nio.ch.SocketAdaptor$SocketInputStream)    at net.jpountz.lz4.LZ4BlockInputStream.readFully(LZ4BlockInputStream.java:215)  at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:149)     at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:101)   at java.io.DataInputStream.readInt(DataInputStream.java:387)    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:169)   at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88)

  • sar -q
09:10:01  IST         5      1396     16.02      9.24      6.31         0
1

1 Answers

2
votes

I think the problem is cassandra's tombstone

In Cassandra whenever a delete, insert explicitly null or TTL expire Tombstone created. Tombstones are a mechanism which allows Cassandra to write fast but it has an operational price to pay.

First of all, tombstones are themselves records. They take up space and can substantially increase the amount of storage you require.

Secondly, Large Number of Tombstones Causes Latency and Heap Pressure

From your cassandra's system log it seems that about 13K tombstone is created only for a single vertex (Vertex Key ID : 0xf800000000104600). It's huge. You can see that when this warning appear on cassandra's system.log, Heap Pressure increases and so gc.log printed at the same time.

What should i do ?

Cassandra will fully drop those tombstones when a compaction triggers, only after local_delete_time + gc_grace_seconds as defined on the table the data belongs to.

Read More :
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html https://opencredo.com/cassandra-tombstones-common-issues/