The following query should return at most limit vertices with the label REPOSITORY that were last updated before minLastUpdated. Vertices of type FILE_UPLOAD should be excluded unless their NEEDS_UPDATE flag is set.

g.V()
    .hasLabel(VertexLabel.REPOSITORY.name())
    .has(PropertyKey.INDEXED_LABEL.name(), VertexLabel.REPOSITORY.name())
    .has(PropertyKey.LAST_UPDATED.name(), P.lt(minLastUpdated))
    .or(__.not(__.has(PropertyKey.TYPE.name(), RepositoryType.FILE_UPLOAD.name())),
        __.has(PropertyKey.NEEDS_UPDATE.name(), true))
    .limit(limit);

To avoid a full graph scan, I have created a composite index on each of the properties INDEXED_LABEL, TYPE and NEEDS_UPDATE, a composite index combining all three, and a mixed index on INDEXED_LABEL and LAST_UPDATED:

//By Label
mgmt.buildIndex("byIndexedLabel", Vertex.class)
    .addKey(indexedLabelKey)
    .buildCompositeIndex();

//By Type
mgmt.buildIndex("byType", Vertex.class)
    .addKey(typeKey)
    .buildCompositeIndex();

//By Needs Update
mgmt.buildIndex("byNeedsUpdate", Vertex.class)
    .addKey(needsUpdateKey)
    .buildCompositeIndex();

//Combination of the three
mgmt.buildIndex("byIndexedLabelTypeAndNeedsUpdate", Vertex.class)
    .addKey(indexedLabelKey)
    .addKey(typeKey)
    .addKey(needsUpdateKey)
    .buildCompositeIndex();

//Mixed Index
mgmt.buildIndex("repositoryByTypeAndLastUpdated", Vertex.class)
    .addKey(indexedLabelKey, Mapping.STRING.asParameter())
    .addKey(lastUpdatedKey)
    .indexOnly(repositoryLabel)
    .buildMixedIndex("search");

Yet when executing the query, I get this warning:

WARN  - StandardTitanTx$6: Query requires iterating over all vertices [()]. For better performance, use indexes

Sidenotes

  • The Vertex Labels are defined within the same transaction as the indexes, which means all indexes should be available immediately.
  • PropertyKey and VertexLabel are my own enums.
  • The keys used during index setup are all instances of com.thinkaurelius.titan.core.PropertyKey which I added earlier.
  • All properties have the data type String except for NEEDS_UPDATE, which is a Boolean.

Environment

  • Titan 1.0.0
  • TinkerPop 3.0.1
  • Elasticsearch 1.0.0
  • Berkeley Storage Backend

Thanks for any suggestions you might have.

1 Answer

Only PropertyKey.INDEXED_LABEL.name() and PropertyKey.LAST_UPDATED.name() are relevant; the other properties can't be used for the index lookup. That said, it makes sense to create a search (mixed) index because a) you have multiple properties and b) one of them has a range condition: P.lt(minLastUpdated). No other index can answer range queries, and covering multiple properties with a composite index is known to cause trouble sooner or later. Create a single index that covers both properties to get the best performance:

mgmt.buildIndex('repositoryByTypeAndLastUpdated', Vertex.class).
    addKey(indexedLabelKey, Mapping.STRING.asParameter()).
    addKey(lastUpdatedKey).indexOnly(repositoryLabel).buildMixedIndex("search")
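
To illustrate why the range condition matters, here is a minimal plain-Java analogy. It is only a sketch: the class and method names are made up for illustration, and the two maps are not how Titan stores its indexes. A hash map stands in for an equality-only (composite) lookup, and a sorted map stands in for the ordered lookup that a search backend provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Analogy only (hypothetical names): NOT Titan's actual index implementation.
class IndexAnalogy {
    // Equality-only lookup, comparable to a composite index.
    private final Map<Long, List<String>> exact = new HashMap<>();
    // Ordered lookup, comparable to the range support of a mixed index.
    private final NavigableMap<Long, List<String>> ordered = new TreeMap<>();

    void add(String id, long lastUpdated) {
        exact.computeIfAbsent(lastUpdated, k -> new ArrayList<>()).add(id);
        ordered.computeIfAbsent(lastUpdated, k -> new ArrayList<>()).add(id);
    }

    // Equality condition: a single hash lookup is enough.
    List<String> updatedAt(long timestamp) {
        return exact.getOrDefault(timestamp, new ArrayList<>());
    }

    // Range condition (lastUpdated < bound): needs the ordered structure,
    // which lets us visit only the entries strictly below the bound.
    List<String> updatedBefore(long bound) {
        List<String> ids = new ArrayList<>();
        for (List<String> bucket : ordered.headMap(bound, false).values()) {
            ids.addAll(bucket);
        }
        return ids;
    }

    public static void main(String[] args) {
        IndexAnalogy index = new IndexAnalogy();
        index.add("repo-a", 100L);
        index.add("repo-b", 200L);
        index.add("repo-c", 300L);
        System.out.println(index.updatedAt(200L));     // [repo-b]
        System.out.println(index.updatedBefore(300L)); // [repo-a, repo-b]
    }
}
```

With only the hash map, answering updatedBefore would mean checking every key, which is the same kind of full scan the warning complains about.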

UPDATE:

INDEXED_LABEL is actually not indexable, or rather shouldn't be indexed, as it only seems to be a copy of the vertex label stored as a property. What follows is a fully working example that doesn't give you any warning about full scans.

gremlin> graph = TitanFactory.open("conf/titan-berkeleyje-es.properties")
==>standardtitangraph[berkeleyje:/projects/aurelius/titan/conf/../db/berkeley]
gremlin> g = graph.traversal()
==>graphtraversalsource[standardtitangraph[berkeleyje:/projects/aurelius/titan/conf/../db/berkeley], standard]
gremlin> m = graph.openManagement()
==>com.thinkaurelius.titan.graphdb.database.management.ManagementSystem@10a0a1e
gremlin> repository = m.makeVertexLabel("repository").make()
==>repository
gremlin> lastUpdated = m.makePropertyKey("lastUpdated").dataType(Long.class).make()
==>lastUpdated
gremlin> needsUpdate = m.makePropertyKey("needsUpdate").dataType(Boolean.class).make()
==>needsUpdate
gremlin> type = m.makePropertyKey("type").dataType(String.class).make()
==>type
gremlin> m.buildIndex("repositoryByLastUpdated", Vertex.class).
gremlin>   addKey(lastUpdated).indexOnly(repository).buildMixedIndex("search")
==>repositoryByLastUpdated
gremlin> m.commit()
==>null

gremlin> g.V().has("repository", "lastUpdated", lt(System.currentTimeMillis())).
gremlin>   or(has("type", neq("FILE UPLOAD")), has("needsUpdate", true)).limit(10)
gremlin> 

There's no data in my graph, but the warning would be shown with or without data, so its absence means the index is being used.
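
One caveat when adapting the console query: has("type", neq(...)) is not quite the same filter as the not(has("type", ...)) used in the question. As far as I know, has() with a predicate only passes when the property exists, whereas not(has(...)) also passes vertices that have no type property at all. The sketch below models that difference in plain Java, with a vertex represented as a property map; the class and method names are hypothetical and this is not TinkerPop code.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of the two filters (hypothetical names, not TinkerPop's API).
class TypeFilterSketch {
    // Models not(has(key, value)): passes when the property is absent or differs.
    static boolean notHas(Map<String, Object> vertex, String key, Object value) {
        return !(vertex.containsKey(key) && vertex.get(key).equals(value));
    }

    // Models has(key, neq(value)): passes only when the property exists and differs.
    static boolean hasNeq(Map<String, Object> vertex, String key, Object value) {
        return vertex.containsKey(key) && !vertex.get(key).equals(value);
    }

    public static void main(String[] args) {
        // A vertex without any "type" property.
        Map<String, Object> untyped = new HashMap<>();
        untyped.put("lastUpdated", 1L);

        System.out.println(notHas(untyped, "type", "FILE_UPLOAD")); // true
        System.out.println(hasNeq(untyped, "type", "FILE_UPLOAD")); // false
    }
}
```

If every repository vertex is guaranteed to carry a type property, the two forms select the same vertices.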