2
votes

My database has the following "schema":

  • users author posts
  • users like posts

My modest test database contains:

  • 162 users
  • 442 posts
  • 159 likes

Now I would like to query the most popular users, which is the users who have gathered the most likes across all their posts. I came up with the following query:

FOR u IN users
    LET nblikes = SUM(FOR post IN 1 OUTBOUND u isAuthor
        RETURN LENGTH(GRAPH_EDGES('my-graph', post, { edgeCollectionRestriction: 'likes' })))
    SORT nblikes DESC
    RETURN {
        "username": u.username,
        "nblikes": nblikes
    }

which executes in approximately 0.8s on my Mid-2014 MacBookPro (2.8GHz Core i7, 16GB RAM). 0.8s isn't shameful but on such a small dataset, I would have expected better considering that, AFAIK, that's all happening in memory.

So I would appreciate if some ArangoDB gurus out there could review my query and hint some potential performance issues. Thanks a lot!

1
How fast is your query if you use the suggested graph traversal, and how fast is it with a join? Thanks!CodeManX

1 Answers

3
votes

There are a few ways to make this query run faster.

What will improve it most is replacing the inner call to GRAPH_EDGES with another traversal of depth 1 to find the "likers" as shown below:

FOR u IN users 
  LET nblikes = SUM(
    FOR post IN 1 OUTBOUND u isAuthor 
      RETURN LENGTH(
        /* the following traversal replaces the call to GRAPH_EDGES */
        FOR liker IN 1 INBOUND post._id likes 
          RETURN 1
      )
  )
  SORT nblikes DESC
  RETURN { 
    username: u.username, 
    nblikes: nblikes
  }

The inner function call to GRAPH_EDGES is quite expensive, and getting rid of it will improve the query execution time much.

Another variant is to replace the (now) two traversals with a plain join like this:

FOR u IN users 
  LET nblikes = SUM(
    /* the following join between users, isAuthor and likes 
       replaces the traversal & GRAPH_EDGES calls */
    FOR a IN isAuthor 
      FILTER a._from == u._id 
      FOR l IN likes 
        FILTER l._to == a._to 
        RETURN 1
  ) 
  SORT nblikes DESC
  RETURN { 
    username: u.username, 
    nblikes: nblikes
  }

Both variants should be faster than the initial query, mainly because GRAPH_EDGES is expensive to call in a loop. As it is a stateless AQL function, it will need to set up its context repeatedly (as often as it's called from the inner loop). The traversals and the joins solution can keep a bit more context between calls so they are "cheaper".