I am using Rexster/Titan 0.4 over Cassandra. The vertex keys are indexed with a standard index, as below:
g.makeKey("domain").dataType(String.class).indexed("standard", Vertex.class).make();
I am not using uniqueness constraints, for performance and scalability reasons. There are roughly 10M vertices in the graph.
My requirement is to iterate over all vertices, identify any duplicates, and remove them. Is there a way to get a sorted list of vertices directly from the index that already exists, i.e. a direct query on the standard Titan index, similar to a "Direct Index Query"? That would let me partition the vertices into smaller batches and process each batch individually.
If that is not possible, what is the best way to achieve this? I don't want to use Titan-Hadoop or a similar solution just for finding and removing duplicates in the graph.
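To make the goal concrete, here is a minimal sketch in plain Java of the per-batch step I have in mind once a batch sorted by 'domain' is available. It contains no Titan calls; the Entry class and the duplicateIds method are hypothetical names used only for illustration, and fetching the sorted batch is exactly the part I don't know how to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SortedDuplicateScan {

    /** Hypothetical stand-in for a vertex: its id plus its 'domain' value. */
    static final class Entry {
        final long id;
        final String domain;

        Entry(long id, String domain) {
            this.id = id;
            this.domain = domain;
        }
    }

    /**
     * Expects entries already sorted by domain. Keeps the first entry of each
     * run of equal domains and returns the ids of the remaining ones, which
     * are the candidates for removal.
     */
    static List<Long> duplicateIds(List<Entry> sortedByDomain) {
        List<Long> duplicates = new ArrayList<>();
        String previous = null;
        for (Entry e : sortedByDomain) {
            if (e.domain.equals(previous)) {
                duplicates.add(e.id);   // same domain as the entry before it
            } else {
                previous = e.domain;    // a new run starts here
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Entry> batch = Arrays.asList(
                new Entry(1, "a.com"),
                new Entry(2, "a.com"),
                new Entry(3, "b.com"),
                new Entry(4, "c.com"),
                new Entry(5, "c.com"));
        System.out.println(duplicateIds(batch)); // prints [2, 5]
    }
}
```

Because the input is sorted, duplicates are always adjacent, so one pass with constant extra state per batch is enough. A run of equal domains could straddle two batches, so the last domain seen would have to be carried over from one batch to the next.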
I want to run the query below to get the first 1000 vertices in sorted order:
gremlin> g.V.has('domain').domain.order[0..1000]
WARN com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx - Query requires iterating over all vertices [(domain <> null)]. For better performance, use indexes
However, this query does not use the standard index created on 'domain'. It falls back to a full scan and fails with an out-of-memory exception on my graph of ~10M vertices.
How can I force Gremlin to use the index in this particular case?