We have defined five indexes on Titan (Cassandra backend) in the following block of code:

 def mgmt = g.managementSystem;
 try {
     if (!mgmt.containsGraphIndex("byId")) {
         def key = mgmt.makePropertyKey('__id').dataType(String.class).make()
         mgmt.buildIndex("byId",Vertex.class).addKey(key).buildCompositeIndex()
     }
     if (!mgmt.containsGraphIndex("byType")) {
         def key = mgmt.makePropertyKey('__type').dataType(String.class).make()
         mgmt.buildIndex("byType",Vertex.class).addKey(key).buildCompositeIndex()
     }
     if (!mgmt.containsGraphIndex("lastName")) {
         def key = mgmt.makePropertyKey('lastName').dataType(String.class).make()
         mgmt.buildIndex('lastName',Vertex.class).addKey(key).buildMixedIndex(INDEX_NAME)
     }
     if (!mgmt.containsGraphIndex("firstName")) {
         def key = mgmt.makePropertyKey('firstName').dataType(String.class).make()
         mgmt.buildIndex('firstName',Vertex.class).addKey(key).buildMixedIndex(INDEX_NAME)
     }
     if (!mgmt.containsGraphIndex("vin")) {
         def key = mgmt.makePropertyKey('vin').dataType(String.class).make()
         mgmt.buildIndex('vin',Vertex.class).addKey(key).buildMixedIndex(INDEX_NAME)
     }
     mgmt.commit()
 } catch (Exception e) {
     System.err.println("An error occurred initializing indices")
     e.printStackTrace()
 }

We then execute the following query:

g.V.has('__id','49fb8bae5f994cf5825b849a5dd9b49a')

This produces the following warning:

"Query requires iterating over all vertices [{}]. For better performance, use indexes"

I'm confused because, according to the documentation, these indexes are set up correctly, but for some reason Titan is not using them.

The indexes are created before any data is in the graph, so reindexing is not necessary. Any help is greatly appreciated.

Update: I've managed to break this down into a very simple test. In our code we have developed a custom Gremlin step to use for the stated query:

Gremlin.defineStep('hasId', [Vertex,Pipe], { String id ->
    _().has('__id', id)
})

Then from our code we call:

g.V.hasId(id)

It appears that when we use the custom Gremlin step the query does not use the index, but when we use the vanilla Gremlin call the index is used.

A similar oddity appears to have been noted in this post: https://groups.google.com/forum/#!topic/aureliusgraphs/6DqMG13_4EQ

1 Answer

I would prefer to check for the existence of the property key rather than the index, which would mean adjusting your checks to:

if (!mgmt.containsRelationType("__id")) {
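
To make that check concrete, here is a minimal sketch of how the key check and the index check might be combined for the '__id' case. It assumes Titan 0.5.x. The in-memory backend and the `ensureIdIndex` helper are my own additions for illustration only (the question uses Cassandra), so treat this as a sketch rather than the definitive setup.

```groovy
import com.thinkaurelius.titan.core.TitanFactory
import com.tinkerpop.blueprints.Vertex

// Assumption: an in-memory Titan 0.5.x graph, used only to keep the sketch self-contained.
def g = TitanFactory.build().set('storage.backend', 'inmemory').open()

// Hypothetical helper: creates the '__id' key and the 'byId' index only if they are missing.
def ensureIdIndex = { graph ->
    def mgmt = graph.managementSystem
    try {
        // Reuse the property key if it already exists, otherwise create it.
        def key = mgmt.containsRelationType('__id') ?
                mgmt.getPropertyKey('__id') :
                mgmt.makePropertyKey('__id').dataType(String.class).make()
        if (!mgmt.containsGraphIndex('byId')) {
            mgmt.buildIndex('byId', Vertex.class).addKey(key).buildCompositeIndex()
        }
        mgmt.commit()
    } catch (Exception e) {
        // Do not leave the management transaction open on failure.
        mgmt.rollback()
        throw e
    }
}

ensureIdIndex(g)
ensureIdIndex(g)   // second call finds the key and the index and creates nothing
```

Checking the key separately means a rerun does not try to make a property key that already exists, which is what would happen if only the index check were skipped.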

I tried out your code in the Titan Gremlin Console and I'm not seeing an issue:

gremlin> g  = TitanFactory.open("conf/titan-cassandra.properties")
==>titangraph[cassandrathrift:[127.0.0.1]]
gremlin> mgmt = g.managementSystem
==>com.thinkaurelius.titan.graphdb.database.management.ManagementSystem@2227a6c1
gremlin> key = mgmt.makePropertyKey('__id').dataType(String.class).make()
==>__id
gremlin> mgmt.buildIndex("byId",Vertex.class).addKey(key).buildCompositeIndex()
==>com.thinkaurelius.titan.graphdb.database.management.TitanGraphIndexWrapper@6d4c273c
gremlin> mgmt.commit()
==>null
gremlin> mgmt = g.managementSystem
==>com.thinkaurelius.titan.graphdb.database.management.ManagementSystem@79d743e6
gremlin> mgmt.containsGraphIndex("byId")
==>true
gremlin> mgmt.rollback()
==>null
gremlin> v = g.addVertex()
==>v[256]
gremlin> v.setProperty("__id","123")
==>null
gremlin> g.commit()
==>null
gremlin> g.V
12:56:45 WARN  com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>v[256]
gremlin> g.V("__id","123")
==>v[256]
gremlin> g.V.has("__id","123")
==>v[256]

Note I'm not getting any ugly message about "...use indexes". Perhaps you can try my example here and see if that behaves as expected before going back to your code.

UPDATE: In answer to the updated question above regarding the custom step: as the post you found noted, Titan's query optimizer doesn't seem to be able to sort this one out. I think it's easy to see why in this example:

gremlin> g = TinkerGraphFactory.createTinkerGraph()
==>tinkergraph[vertices:6 edges:6]
gremlin> Gremlin.defineStep('hasName', [Vertex,Pipe], { n -> _().has('name',n) })
==>null
gremlin> g.V.hasName('marko')
==>v[1]
gremlin> g.V.hasName('marko').toString()
==>[GremlinStartPipe, GraphQueryPipe(vertex), [GremlinStartPipe, PropertyFilterPipe(name,EQUAL,marko)]]

The "compiled" Gremlin looks like that last line above. Note that the custom step compiles to an "inner" pipe with a new GremlinStartPipe. Compare that to the same query without the custom step:

gremlin> g.V.has('name','marko').toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), IdentityPipe]

Titan can optimize the "GraphQueryPipe" with the embedded has, but that doesn't seem to be the case with the custom step's signature. I think the workaround (at least for this particular scenario) is to write a function that returns the pipe.

gremlin> def hasName(g,n){g.V.has('name',n)}  
==>true
gremlin> hasName(g,'marko')
==>v[1]
gremlin> hasName(g,'marko').toString()
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), IdentityPipe]

Passing 'g' around kinda stinks. Perhaps write your DSL so that 'g' gets wrapped in a class that then lets you do:

with(g).hasName('marko')
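
A rough sketch of what such a wrapper might look like, assuming TinkerPop 2.x Gremlin-Groovy on the classpath. The `GraphDsl` class and the `with` factory method are hypothetical names made up for this example, not part of Gremlin or Titan. Because the wrapper builds the pipeline with the plain g.V.has(...) call, it should compile the same way as the function version above.

```groovy
import com.tinkerpop.blueprints.Graph
import com.tinkerpop.blueprints.impls.tg.TinkerGraphFactory
import com.tinkerpop.gremlin.groovy.Gremlin

// Needed outside the Gremlin Console so that g.V and friends are available.
Gremlin.load()

// Hypothetical wrapper: holds the graph so callers don't pass 'g' to every step.
class GraphDsl {
    Graph graph

    // Builds the pipeline with vanilla Gremlin, not a custom step.
    def hasName(String n) { graph.V.has('name', n) }
}

// Hypothetical entry point for the DSL.
def with(Graph graph) { new GraphDsl(graph: graph) }

def g = TinkerGraphFactory.createTinkerGraph()
println with(g).hasName('marko').toString()
println with(g).hasName('marko').toList()
```

Further domain steps would become additional methods on the wrapper class, each returning a pipeline built from the wrapped graph.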

A final thought would be to use Groovy meta-programming facilities:

gremlin> Graph.metaClass.hasName = { n -> delegate.V.has('name',n) }
==>groovysh_evaluate$_run_closure1@600b9d27
gremlin> g.hasName("marko").toString()                              
==>[GremlinStartPipe, GraphQueryPipe(has,vertex), IdentityPipe]
gremlin> g.hasName("marko")                                         
==>v[1]