I have a small Titan 0.5.0 cluster with 8 nodes. Every node runs Titan in Rexster 2.5.0 and Cassandra. They are all configured identically. Unfortunately, on nearly every startup one of them fails to start.
In most cases this is one of the seed nodes.

Using cassandra as the storage backend, I get the following in the Rexster/Titan log:

WARN  com.tinkerpop.rexster.config.GraphConfigurationContainer - Could 
  not open global configuration com.thinkaurelius.titan.core.TitanException:
  Could not open global configuration
 at com.thinkaurelius.titan.diskstorage.Backend.
   getStandaloneGlobalConfiguration(Backend.java: 405)
...
Caused by: com.thinkaurelius.titan.diskstorage.TemporaryBackendException: 
  Temporary failure in storage backend
 at com.thinkaurelius.titan.diskstorage.cassandra.astyanax.
   AstyanaxStoreManager.ensureColumnFamilyExists(AstyanaxStoreManager.java:446)
...
Caused by: com.netflix.astyanax.connectionpool.exceptions.BadRequestException: 
  BadRequestException: [host=192.168.0.10(192.168.0.10):9160, latency=496(496),
  attempts=1] InvalidRequestException(why:Cannot add already existing
  column family "system_properties" to keyspace "titan")
 at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(
   ThriftConverter.java:159)

Rexster fails to start and thus does not load the graph. However, the Cassandra node that Rexster failed to connect to seems to be fine: nodetool lists the node as part of the ring. If I fire requests against the remaining Rexster instances, everything seems to work.

I wiped all data before starting the nodes.

I switched to cassandrathrift, which resulted in a similar exception (the same TitanException, caused by a PermanentBackendException, caused by a TimeoutException). The storage timeout in Rexster is 30s. This may be too low since I currently start all nodes simultaneously, but it does not explain the issues with cassandra.

What is going wrong here?

edit:

I was misusing Titan. To avoid having to deal with index creation on startup, which happens quite often in my case, I created the index in the Rexster extension. I think this code got invoked multiple times: when I started multiple nodes simultaneously, it seems some of them tried to create the index at the same time.

Question: Is there any way the extension can create the indices safely? I created a separate thread for this: What are the methods to create indices?

I increased the storage timeout to 60s and retried the procedure after removing the index creation from the code. I still start up all nodes simultaneously. Again, one Rexstitan node (seed node #2) fails to start.

The Cassandra log does indeed contain an exception:

java.lang.IllegalArgumentException: Unknown keyspace/cf pair (titan.txlog)
    at org.apache.cassandra.db.Keyspace.getColumnFamilyStore(Keyspace.java:166)
    at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:326)
    at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:65)
    at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:47)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

which I can see on both seed nodes. While Rexster on one seed node does not seem to care, the other Rexster instance fails to start with

Caused by: com.netflix.astyanax.connectionpool.exceptions.BadRequestException: BadRequestException: [host=192.168.0.10(192.168.0.10):9160, latency=66(66), attempts=1]InvalidRequestException(why:Cannot add already existing column family "graphindex_lock_" to keyspace "titan")
    at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:159)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28)
    at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151)
    at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:119)
    at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:338)
    at com.netflix.astyanax.thrift.ThriftClusterImpl.executeSchemaChangeOperation(ThriftClusterImpl.java:146)
    at com.netflix.astyanax.thrift.ThriftClusterImpl.internalCreateColumnFamily(ThriftClusterImpl.java:240)

in rexstitan.log. This looks quite similar to the exceptions raised before.

Just to clarify: by "fail" I mean that Rexster starts and can be queried, but it failed to load the Titan graph "graph".

Maybe I have to reduce the cluster to a minimum to check whether this is related to cluster size.

edit #2:

It is not related to cluster size. And it's getting really annoying. Sometimes it is the BadRequestException above, sometimes it's a BadRequestException because a keyspace "titan" already exists. Or it is an IllegalArgumentException:

2646 [main] WARN  com.tinkerpop.rexster.config.GraphConfigurationContainer -
  Database has already been initialized but not frozen
  java.lang.IllegalArgumentException: Database has already been initialized but not frozen
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:93)
    at com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration.<init>(GraphDatabaseConfiguration.java:1294)
    at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:93)
    at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:73)
    at com.thinkaurelius.titan.tinkerpop.rexster.TitanGraphConfiguration.configureGraphInstance(TitanGraphConfiguration.java:33)
    at com.tinkerpop.rexster.config.GraphConfigurationContainer.getGraphFromConfiguration(GraphConfigurationContainer.java:124)
    at com.tinkerpop.rexster.config.GraphConfigurationContainer.<init>(GraphConfigurationContainer.java:54)
    at com.tinkerpop.rexster.server.XmlRexsterApplication.reconfigure(XmlRexsterApplication.java:99)
    at com.tinkerpop.rexster.server.XmlRexsterApplication.<init>(XmlRexsterApplication.java:47)
    at com.tinkerpop.rexster.Application.<init>(Application.java:97)
    at com.tinkerpop.rexster.Application.main(Application.java:189)

Is it not possible to start multiple nodes at once? Do they conflict? This is the only cause I can think of, because the exception differs from run to run and sometimes everything works fine.

From "I was misusing Titan." and "I created a separate thread for this:" I gather that the question should be deleted? Mind doing it? – Jacek Laskowski
To be honest, I would rather not delete it. First, there might be people running into the same issue. More importantly, the problem still exists, even if it is caused by a different exception now. – Sebastian Schlicht
Doesn't the other question cover the case? This one appears to be an initial thought that led to the other question, which is very likely going to grab people's attention. – Jacek Laskowski
Unfortunately it doesn't. It's just an additional question derived from what I tried in order to tackle the issue above. You are right, my attempt led to a different exception. Maybe I should open a new question for the second exception, but to me they seem very similar and might have the same cause. – Sebastian Schlicht

1 Answer

The problem is the simultaneous startup of the Titan nodes (version 0.5.0).
The more nodes you start up at once, the more likely the BadRequestExceptions become, since all the nodes try to create the same keyspace and column families in the Cassandra cluster concurrently.

To overcome this issue, you have to

  1. start Cassandra (all nodes at once is fine)
  2. start a single Titan node
  3. open the Rexster console on this node, create the schema and indices
  4. start the remaining Titan nodes
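
If you want to script the ordering of these steps, a minimal sketch could look like the following. It is only an illustration under assumptions: `staged_start`, `start_node` and `is_ready` are hypothetical names, not part of Titan or Rexster, and how you actually launch a node (SSH, init script, ...) is left to you. The port probe is meant for Cassandra's Thrift port (9160 in the log above). It is not a sufficient readiness check for Rexster, because, as the question notes, Rexster can be up and answering while the graph failed to load.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def staged_start(nodes, start_node, is_ready):
    """Start the first node alone, wait until it is ready, then start the rest.

    start_node and is_ready are hypothetical callbacks supplied by the caller.
    is_ready should only return True once the schema and indices exist
    (step 3), which may well be a manual confirmation.
    Returns the nodes in the order they were started.
    """
    if not nodes:
        return []
    first, rest = nodes[0], list(nodes[1:])
    start_node(first)
    if not is_ready(first):
        raise RuntimeError("first node %s did not become ready" % first)
    started = [first]
    for node in rest:
        start_node(node)
        started.append(node)
    return started
```

For step 1 you could call `wait_for_port(host, 9160)` for every Cassandra node before starting the first Titan node, then pass your own launcher and readiness check to `staged_start`.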