
I am using Rexster/Titan 0.4 over Cassandra. The vertex keys are indexed with a standard index, created as below:

g.makeKey("domain").dataType(String.class).indexed("standard", Vertex.class).make();

I am not enforcing uniqueness on the key, for performance and scalability reasons. There are ~10M vertices in the graph.

My requirement is to iterate over all vertices, identify any duplicates, and remove them. Is there a way to get a sorted list of vertices directly from the index that is already present, similar to a "Direct Index Query" on the standard Titan index? That way I could partition the vertices into smaller batches and process them individually.

If that is not possible, what is the best way to achieve this? I don't want to use Titan-Hadoop or a similar solution just to find and remove duplicates in the graph.

I want to run the query below to get 1000 vertices in sorted order:

gremlin> g.V.has('domain').domain.order[0..1000]

WARN  com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx  - Query requires iterating over all vertices [(domain <> null)]. For better performance, use indexes

But this query does not use the standard index created on 'domain', and it fails with an out-of-memory exception.

How can I force Gremlin to use the index in this particular case?


1 Answer


The answer is the same as the one I provided in the comments of your previous question:

  1. Throw more memory at the problem (i.e., increase -Xmx for the Gremlin console or whatever application is running your query), which would be a short-term solution.
  2. Use titan-hadoop.
  3. Restructure your graph or queries in some way to allow the use of an index. This could mean giving up some performance on insert and using a uniqueness lock. Maybe you don't have to remove duplicates in your source data; perhaps you can dedup them in your Gremlin queries at traversal time. The point is that you'll need to be creative.
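To illustrate the third option, here is a rough sketch in Gremlin/Groovy. It assumes Titan 0.4's key-definition API and TinkerPop 2's closure-based dedup step; the 'domain' key is the one from your question, and you would need to adapt this to your own schema and verify it against your setup.

```groovy
// Option 3 (insert side): enforce uniqueness when defining the key.
// unique() adds a lock per write, trading insert performance for clean data.
g.makeKey("domain").
  dataType(String.class).
  indexed(Vertex.class).
  unique().
  make()

// Option 3 (query side): deduplicate at traversal time instead.
// dedup with a closure keeps only the first vertex seen per 'domain' value.
g.V.has('domain').dedup{it.domain}
```

Neither of these removes the existing duplicates for you; the first prevents new ones, and the second hides them from a traversal's results.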

Despite your reluctance to use titan-hadoop "just for finding/removing duplicates in graph", that's the exact use case it is good at. You have a batch process that must iterate all vertices, it can't fit in the memory you've allotted, and you don't want to use titan-hadoop. That's a bit like saying: "I have a nail and a hammer, but I don't want to use the hammer to bang in the nail." :)

How can I force gremlin to use index in this particular case?

There is no way in Gremlin to do this. In theory, there might be a way to read from Cassandra directly (bypassing Titan), decode the binary result, and somehow iterate and delete, but it's not known to me. Even if you figured it out, after many hours digging into the depths of Titan to see how the index data is stored, it would be a hack likely to break any time you upgrade Titan, as the core developers might close that avenue at any point because you are circumventing Titan in an unexpected way.

The best option is to simply use titan-hadoop to solve your problem. Unless your graph is completely static and no longer growing, you will reach a point where titan-hadoop is inevitable. How will you be sure that your graph is growing correctly when you have 100M+ edges? How will you gather global statistics about your data? How will you repair bad data that got into the database from a bug in your code? All of those things become issues when your graph reaches a certain scale and titan-hadoop is your only friend there at this time.