
I started a repair of the entire cluster through OpsCenter, and one node went down with the error below.

ERROR [CompactionExecutor:530] 2016-03-04 18:25:39,893  CassandraDaemon.java:227 - Exception in thread Thread[CompactionExecutor:530,1,main]
java.lang.AssertionError: /data/cass_data/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-3046-Data.db
        at org.apache.cassandra.io.sstable.SSTableReader.getApproximateKeyCount(SSTableReader.java:268) ~[cassandra-all-2.1.11.908.jar:2.1.11.908]
        at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:151) ~[cassandra-all-2.1.11.908.jar:2.1.11.908]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[cassandra-all-2.1.11.908.jar:2.1.11.908]
        at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:73) ~[cassandra-all-2.1.11.908.jar:2.1.11.908]
        at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[cassandra-all-2.1.11.908.jar:2.1.11.908]
        at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:262) ~[cassandra-all-2.1.11.908.jar:2.1.11.908]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_60]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_60]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]

I killed the process and restarted the DSE service, but the Cassandra service fails to start with the error: Fatal exception during initialization.

Versions
dse 4.8.2
Cassandra 2.1.11

What could be the problem? How can running a repair stop the Cassandra service?

2 Answers

Your system keyspace is corrupted, and the first error suggests some table data is corrupt as well, so you may have a problem with your disks or file system. To get the node started again, either restore the system keyspace from a recent backup, or remove the system keyspace folder, make sure your tokens are set in cassandra.yaml, and then start the node; it will re-create the system keyspace.
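The "remove the system keyspace folder" step can be sketched as below; this is an illustrative utility, not part of Cassandra, and the path shown is only an example of the default data directory layout (adjust it to your data_file_directories setting):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class WipeSystemKeyspace {

    // Recursively delete a keyspace directory, deepest entries first so
    // directories are empty by the time they are removed.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir))
            return;
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder())
                 .forEach(p -> {
                     try { Files.delete(p); }
                     catch (IOException e) { throw new UncheckedIOException(e); }
                 });
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the keyspace directory explicitly, e.g.
        //   java WipeSystemKeyspace /data/cass_data/data/system
        // Stop the node first, and set initial_token in cassandra.yaml to the
        // node's current tokens before restarting.
        if (args.length == 1)
            deleteRecursively(Paths.get(args[0]));
    }
}
```

Only run this with the node stopped, and only after you have recorded the node's tokens so they can be set as initial_token in cassandra.yaml.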


The code being executed is straightforward:

    Keyspace keyspace;
    try
    {
        keyspace = Keyspace.open(Keyspace.SYSTEM_KS);
    }
    catch (AssertionError err)
    {
        // this happens when a user switches from OPP to RP.
        ConfigurationException ex = new ConfigurationException("Could not read system keyspace!");
        ex.initCause(err);
        throw ex;
    }

    ColumnFamilyStore cfs = keyspace.getColumnFamilyStore(LOCAL_CF);

    String req = "SELECT cluster_name FROM system.%s WHERE key='%s'";
    UntypedResultSet result = executeInternal(String.format(req, LOCAL_CF, LOCAL_KEY));

    if (result.isEmpty() || !result.one().has("cluster_name"))
    {
        // this is a brand new node
        if (!cfs.getSSTables().isEmpty())
            throw new ConfigurationException("Found system keyspace files, but they couldn't be loaded!");

        // no system files.  this is a new node.
        req = "INSERT INTO system.%s (key, cluster_name) VALUES ('%s', ?)";
        executeInternal(String.format(req, LOCAL_CF, LOCAL_KEY), DatabaseDescriptor.getClusterName());
        return;
    }

It is able to open the system keyspace, and then tries to read from system.local, but that fails. That means the system.local table/data is either missing or corrupt.

The sstable path is printed, so we know the file exists on disk. It has -ka- versioning, so we know it was written by 2.1. The next most likely explanation is that the file is corrupt. Testing for corruption seems like it should be easy, but in many versions of 2.1 the checksum file (-Digest.sha1) actually contains an Adler32 checksum, not a SHA-1 hash, and for compressed sstables (like the system.local tables) it's likely incorrect anyway. So checking for corruption is going to be hard.
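If you still want to sanity-check a file against its digest, a minimal sketch of the comparison looks like this. It assumes an uncompressed component file and that the digest file stores the Adler32 value as a decimal string (which is what the misleadingly named -Digest.sha1 file holds in these versions); as noted above, for compressed sstables the stored value may not match even when the data is fine:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.Adler32;

public class DigestCheck {

    // Compute the Adler32 checksum of the whole file -- the algorithm the
    // -Digest.sha1 file actually records, despite its name.
    static long adler32Of(Path file) throws IOException {
        Adler32 adler = new Adler32();
        adler.update(Files.readAllBytes(file));
        return adler.getValue();
    }

    // Compare the data file's checksum to the decimal value in the digest file.
    static boolean matchesDigest(Path dataFile, Path digestFile) throws IOException {
        long expected = Long.parseLong(new String(Files.readAllBytes(digestFile)).trim());
        return adler32Of(dataFile) == expected;
    }
}
```

A mismatch on an uncompressed component is a strong hint of corruption; a mismatch on a compressed one proves little, for the reason above.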

I believe you have two viable options:

1) You can try running scrub offline (sstablescrub). Keep in mind that it will write a root-owned commitlog segment, which you'll need to chown when it completes. If that doesn't work:

2) You can wipe the system keyspace, and re-join the node to the cluster (with or without replace_address).