2
votes

Some background information: I have a "question and answer" site with lots of posts and users. The data are stored in an MSSQL database. I am interesting in how a graph database will help producing interesting results to users, like suggest similar users or suggest relevant posts.

I managed to import all posts and users (only their ids) and the "post to post" and "post to user" relation from a test database into ArangoDB database and ended up with 4 collections:

  1. posts: Document collection, 9833 entries
  2. users: Document collection, 1264 entries
  3. parentToChildPosts: Edge collection storing relation from a parent post to a child post, 4978 entries
  4. postToUsers: Edge collection storing relation from a post to its owner user, 9833 entries.

Then I built a graph "postGraph" with the 2 edges and created an AQL query trying to implement a simple "similar users" algorithm based on the "user posts" and "post post" relations. Here is the query searching top 10 similar users for user with _id "users/1":

FOR e in GRAPH_NEIGHBORS('postGraph', 'users/1', 
{ edgeCollectionRestriction: ['parentToChildPosts', 'postToUsers'], 
    vertexCollectionRestriction: 'users', 
    minDepth: 3, 
    maxDepth: 3 })
FILTER e.vertex._key != '1'
COLLECT userid = e.vertex._key INTO g
SORT LENGTH(g) DESC
LIMIT 10
RETURN userid

Visually, I started search from a user 'users/1', and find paths to other users following:

user => post => parent/child post => user.

So the depth of traversals is 3. Then I exclude the user himself (FILTER), group by user keys and count the number of results and grab the top 10.

I tested the query and it worked for most users but failed for some active users with the error:

[1909] too many iterations

The query runs quite fast (within one second) and gives the error. One of the failing active users has 727 posts (not so many in my opinion).

I also tried 2 other ways to implement the algorithm:

  1. Write 3 GRAPH_NEIGHBORS queries mimicking the 3 steps of one traversal: much slower. It takes around 3 seconds the first time running it.
  2. Use Join (without using any graph functions): quite fast.

From the documentation, ArangoDB suggests using graph functions for this kind of situation. So I am not in favor of using Join (gives me a feeling I can also do it in SQL, so why bother using graph database).

Any suggestion how to make the GRAPH_NEIGHBORS work quickly without any errors? Or any other suggestion how to build graph, like how large the graph can grow?

1

1 Answers

5
votes

Hi you are totally right, the GRAPH_NEIGHBORS function is built exactly for that. There is a parameter maxIterations limiting the amount of vertices that will be touched by one traversal (where GRAPH_NEIGHBORS is mapped onto) and your graph structure might hit the default value (10000) there. We implemented this value to prevent endless looping. So to fix this you simply increase this value. Also i have another performance hint for you: * If your Graph is only containing the two edgeCollections you do not have to set them for the restriction. So we can save a check there. (Will not be very significant)

Here is an updated query:

FOR e in GRAPH_NEIGHBORS('postGraph', 'users/1', 
  { maxIterations: 1000000, 
    vertexCollectionRestriction: 'users', 
    minDepth: 3, 
    maxDepth: 3 })
FILTER e.vertex._key != '1'
COLLECT userid = e.vertex._key INTO g
SORT LENGTH(g) DESC
LIMIT 10
RETURN userid

For further info please stick to the information in the AQL Traversal documentation. Everything that is available there is also available for GRAPH_NEIGHBORS https://docs.arangodb.com/Aql/GraphOperations.html