JanusGraph/Gremlin - Performance issue with repeat step applied to large data sets

Question

I'm experiencing issues querying a large graph involving repeat steps that aim at making "hops" across vertices and edges. My intention is to infer indirect relationships between objects. Consider the following:

John--livesIn-->Paris

Paris--isIn-->France

What I expect to come up with is that John is based in France. Simple enough, and this works great with a small data set.

The query that I use is the following, where I make no more than 2 hops:

g.V().has('name','John')
.emit(loops().is(lt(2)))
.repeat(__.bothE().bothV().simplePath())
.inE('isIn').outV().path()

This works as expected until I apply it to a graph of about 1000 vertices and 3000 edges. Then, after a few minutes, I get various kinds of errors (over the REST API) with no clear pattern:

Error: Error encountered evaluating script
Error: 504 Gateway Time-out
Error: Java heap space
Error

I suspect that I am doing something wrong in my query. For example, setting the number of "hops" to 1 (direct relationship) with .emit(loops().is(lt(1))), I would expect the results to be delivered swiftly since it would not go into the repeat loop. However, this triggers the same issue.

Many thanks for your help!

Olivier

bechbd · Accepted Answer · 2018-03-12T23:15:21

So it looks like you have a few things going on here. First, let me take a shot at answering your question; then let's look at why your traversal may be taking a long time to complete.

Based on your description of wanting to return John and France, the following traversal should get your data:

g.V().has('name','John').as('person')
.out('livesIn')
.out('isIn').as('country').select('person', 'country')

That will select all countries that a person named 'John' lives in.

Now to understand why your traversal was taking a long time. First, you are using several steps that are very memory and resource intensive, such as bothE and bothV. bothE follows every incident edge regardless of direction, and bothV then returns both endpoints of each edge, including the vertex you just came from. Since you know that the edge you are trying to traverse points outward in both cases, it is much quicker and less resource intensive to use out, which follows only outgoing edges with the specified label (if supplied) and lands you on the adjacent vertex.

Additionally, simplePath is another resource (specifically memory) intensive step: it must track the full path of each traverser so that it can drop the traverser as soon as the path contains a repeated object. This, combined with the extra traversers created by the use of loops, bothE and bothV, is likely the cause of the slow query. I suspect that the query above will perform significantly better.
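
One more thing worth checking, as far as I can tell from the traversal as written: repeat() has no times() or until() modulator, so the loop only stops once simplePath has run out of new paths to follow. emit(loops().is(lt(2))) controls which traversers are emitted, not how long the loop runs, which would explain why lowering the limit to 1 makes no difference. If you do want a generic "up to N hops" query rather than the fixed two-step one, here is a minimal sketch of a bounded version. It assumes the only relevant edge labels are livesIn and isIn and that they always point outward:

```groovy
// Sketch only: assumes the edge labels 'livesIn' and 'isIn' from the question
// and that both are traversed in the outgoing direction.
g.V().has('name','John').
  repeat(out('livesIn','isIn').simplePath()).  // follow only the needed labels, outward
    times(2).                                  // hard upper bound: stop after two hops
    emit().                                    // also return the intermediate one-hop results
  path()
```

For the example data in the question, this should return the one-hop path (John, Paris) and the two-hop path (John, Paris, France), and the loop cannot run past two iterations regardless of graph size.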

If you would like to see exactly what your query is doing, I would suggest taking a look at the explain and profile steps, which provide detailed information on your query's performance.
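
For instance, a sketch of both applied to the directed traversal from above, run from the Gremlin Console (output omitted because it depends on your graph and JanusGraph version):

```groovy
// explain() does not execute the traversal; it shows how the traversal is
// compiled and which strategies rewrite it.
g.V().has('name','John').out('livesIn').out('isIn').explain()

// profile() executes the traversal and reports per-step traverser counts
// and timings.
g.V().has('name','John').out('livesIn').out('isIn').profile()
```

Running profile() on the original repeat() traversal against a small sample graph should show at which step the traverser count starts to grow.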
