I am one of the core developers of ArangoDB
and tried to optimize your query. As I do not have your dataset
I can only talk about my test dataset
and would be happy to hear if you can validate my results.
First if all I am running on ArangoDB
2.7 but in this particular case I do not expect a major performance difference to 2.6.
In my dataset
I could execute your query as it is in ~7sec.
First fix:
In your friends statement you use includeData: true
and only return the _id
. With includeData: false
GRAPH_NEIGHBORS
directly returns the _id
and we can also get rid of the subquery here
LET friends = GRAPH_NEIGHBORS('graph',
@user,
{"direction": "any",
"edgeExamples": {
name: "FRIENDS_WITH"
}})
This got it down to ~ 1.1 sec on my machine. So I expect that this will be close to the performance of Neo4J.
Why does this have a high impact?
Internally we first find the _id
value without actually loading the documents JSON. In your query you do not need any of this data, so we can safely continue with not opening it.
But now for the real improvement
Your query goes the "logical" way and first gets users neighbors, than finds their neighbors, counts how often a foaf
is found and sorts it.
This has to build up the complete foaf network in memory and sort it as a whole.
You can also do it in a different way:
1. Find all friends
of user (only _ids
)
2. Find all foaf
(complete document)
3. For each foaf
find all foaf_friends
(only _ids
)
4. Find the intersection of friends
and foaf_friends
and COUNT them
This query would like this:
LET fids = GRAPH_NEIGHBORS("graph",
@user,
{
"direction":"any",
"edgeExamples": {
"name": "FRIENDS_WITH"
}
}
)
FOR foaf IN GRAPH_NEIGHBORS("graph",
@user,
{
"minDepth": 2,
"maxDepth": 2,
"direction": "any",
"includeData": true,
"edgeExamples": {
"name": "FRIENDS_WITH"
}
}
)
LET commonIds = GRAPH_NEIGHBORS("graph",
foaf._id, {
"direction": "any",
"edgeExamples": {
"name": "FRIENDS_WITH"
}
}
)
LET common_friend_count = LENGTH(INTERSECTION(fids, commonIds))
SORT common_friend_count DESC
RETURN {user: foaf, common_friend_count: common_friend_count}
Which in my test graph was executed in ~ 0.024 sec
So this gave me a factor 250 faster execution time and I would expect this to be faster than your current query in Neo4j, but as I do not have your dataset
I can not verify it, it would be good if you could do it and tell me.
One last thing
With the edgeExamples: {name : "FRIENDS_WITH" }
it is the same as with includeData
, in this case we have to find the real edge and look into it. This could be avoided if you store your edges in separate collections based on their name. And then remove the edgeExamples as well. This will further increase the performance (especially if there are a lot of edges).
Future
Stay tuned for our next release, we are right now adding some more functionality to AQL which will make your case much easier to query and should give another performance boost.