
I have the following cypher query in Neo4j:

MATCH (n0:Company) 
WHERE n0.name = 'Google'
WITH n0
MATCH (n0) <- [r0:WORKED_AT] - (person:Person) 

WITH DISTINCT ID(person) as id,  
((CASE WHEN r0.count IS NOT NULL THEN toFloat(r0.count) ELSE 1.00 END) * 
(CASE WHEN r0.source = 'someSource' THEN 
  CASE TYPE(r0) 
    WHEN 'WORKED_AT' THEN 0.7 
    WHEN 'HAS_SKILL' THEN 0.3 
    WHEN 'HAS_INTEREST' THEN 0.6 
    WHEN 'LIVED_IN_COUNTRY' THEN 0.8 
    ELSE 0.5 END 
  ELSE 0.5 END)) as score 
WITH id, score 
ORDER BY score DESC 
LIMIT 20 
RETURN COLLECT({id: id, score: score}) as result

I'm trying to sort the result based on the type of the relationship and the .source property on the relationship. This query is built up from user input, so in the first part of the query the node label and relationship type can be anything: the label Company could be swapped out for Skill, WORKED_AT could be swapped out for HAS_SKILL, and so on.

I want to be able to control the factor by which the relationship.count value is multiplied, based on relationship.source and type(relationship).

This works fine. My concern is that when the result set gets large, this will probably be a very slow query, because it needs to do the calculation for every row and, once all calculations are done, sort the results based on them. Is there any way I can optimize this query so it will work with ANY size of result set?

My end goal with the query is to find all persons matching some criteria, sort the result based on my calculations in the Cypher query, then limit the result to get the top 20 results for a given user query.

Edit:

MATCH (n0:Company) 
WHERE n0.name = 'Google'
WITH n0
MATCH (n1:Skill) 
WHERE n1.name = 'java'
WITH n1, n0
MATCH (n0) <- [r0:WORKED_AT] - (person:Person) 
, (person) - [r1:HAS_SKILL] -> (n1) 

WITH DISTINCT ID(person) as id,  
((CASE WHEN r0.count IS NOT NULL THEN toFloat(r0.count) ELSE 1.00 END) * 
  (CASE WHEN r0.source = 'unreliable' THEN CASE TYPE(r0) 
    WHEN 'WORKED_AT' THEN 0.2
    WHEN 'HAS_SKILL' THEN 0.3 
    WHEN 'HAS_INTEREST' THEN 0.1 
    WHEN 'LIVED_IN_COUNTRY' THEN 0.3
    WHEN 'STUDIED_AT' THEN 0.2
  ELSE 0.3 END 
ELSE 0.3 END)) 
+  
((CASE WHEN r1.count IS NOT NULL THEN toFloat(r1.count) ELSE 1.00 END) * 
  (CASE WHEN r1.source = 'reliable' THEN CASE TYPE(r1) 
    WHEN 'WORKED_AT' THEN 0.8 
    WHEN 'HAS_SKILL' THEN 0.7 
    WHEN 'HAS_INTEREST' THEN 0.6
    WHEN 'LIVED_IN_COUNTRY' THEN 0.5 
    WHEN 'STUDIED_AT' THEN 0.8
  ELSE 0.7 END 
ELSE 0.7 END)) as score 

ORDER BY score DESC 
LIMIT 20 
RETURN COLLECT({id: id, score: score}) as result

I removed the extra WITH statement. I have also updated my query with a slightly more complex one; I was trying to keep things simple in my original question, but I think I left out some important information about the query.

In this query I can have multiple nodes to match against, so there can be multiple relationships (r0, r1 ... rn), up to a reasonable amount of course. Adding more nodes to match against narrows the number of persons returned, which is good, but it also adds more CASE WHEN statements. My modifiers are currently set up like this:

  var modifiers = {
    reliable: { 
      HAS_SKILL: 0.8, 
      WORKED_AT: 0.7, 
      HAS_INTEREST: 0.7, 
      LIVED_IN_COUNTRY: 0.6, 
      default: 0.8
    }, 

    unreliable: {
      HAS_SKILL: 0.2, 
      WORKED_AT: 0.4, 
      HAS_INTEREST: 0.1, 
      LIVED_IN_COUNTRY: 0.3, 
      default: 0.3  
    }, 

    ...

    default: 0.5

  };

I tried passing them in as parameters, but I don't think Neo4j supports "nested" parameters like this. I need a different modifier depending on whether the source is reliable or not, so a Skill from a reliable source gets a higher modifier than one from an unreliable source, and so on. If I can somehow get rid of the CASE statements, that would be very nice. The suggestion from JohnMark13 about hyperedges sounds interesting; I will look into that.

I have made a quick edit regarding size. Regarding modifiers, you might need to explain what is creating/running this query. The query as written is for [r0:WORKED_AT] and [r1:HAS_SKILL], so the CASE statements and all other modifiers are superfluous, as they will never match/be used. – JohnMark13
Yes, the results will be narrowed down when adding more matches; however, the results can still be in the millions if just matching against a country or something. Not sure I get what you mean by the modifiers not being used. This query works, but I'm not sure about scaling. Remember that I'm using CASE TYPE(r0) and then applying the modifiers. – Øyvind
1) You already know the type of r0, though (you are only matching one type in any given query). 2) If you are only matching by country (which suggests querying your whole graph and matching all jobs and skills that a person has ever had), I'd suggest you either put in some sort of guard that says "Hey, too many results", or run the query as some sort of map/reduce-style job that stores an indicative score against all your Person nodes, which you can then use to advise your query. – JohnMark13
Yes, I only query for one type in any given MATCH clause. The guard suggestion seems reasonable; I will look into that. I might be blind from staring at this Cypher query for hours, so I'll go eat something and then see if I understand your points better :) – Øyvind
Good luck. Like yesterday, I think this discussion-type question is good fodder for the Google Group, as it is unlikely to yield 42! – JohnMark13

1 Answer


tl;dr: How large is the dataset likely to get? I don't think there is any reason this should be slow even as your dataset grows beyond modest sizes. The query is always operating within a constrained subgraph with no complex traversals, and the aggregates are neither complex nor in themselves iterative. Can you put some numbers on "ANY size"?

As I understand it you are trying to score/boost your search results in a manner similar to a Lucene index.

I'll stick with the example you have given and assume that if a Person has worked at Google multiple times there will be a count property on the relationship rather than multiple relationships (if you do have multiple relationships, then with the current query I think the same Person will appear in the result collection twice).

Firstly, the MATCH looks fine, but (depending on version) you do not need the WHERE or the WITH (I do not believe there is a speed advantage or disadvantage either way):

MATCH (n0:Company {name: 'Google'})<-[r0:WORKED_AT]-(person:Person)

I think that you now have duplicate WITH statements, which again won't cause you any speed penalty, but the second WITH id, score doesn't add anything that I can see.

You say "switched out" so you know which relationship type you are querying for, could you then skip all of the case statement and pass the weight in as a parameter? i.e in the example query you know that the relationship type is WORKED_AT, so the case statement is a waste of processing power. The case statement only makes sense in the case where you are matching multiple relationship types (say (c)<-[:WORKED_AT|LIVED_AT]-(p)). This would change your query to be:

(CASE WHEN r0.count IS NOT NULL THEN toFloat(r0.count) ELSE 1.00 END) * 
(CASE WHEN r0.source = {sourceParam} THEN {scoreParam} ELSE 0.5 END) as score 
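
Applied to your edited two-relationship query, that might look something like the sketch below. The parameter names ({companyName}, {skillName}, {source0}, {weight0} and so on) are just placeholders for whatever your query builder passes in, and coalesce() stands in for the IS NOT NULL check:

MATCH (n0:Company {name: {companyName}})<-[r0:WORKED_AT]-(person:Person),
      (person)-[r1:HAS_SKILL]->(n1:Skill {name: {skillName}})
WITH DISTINCT ID(person) AS id,
     (COALESCE(toFloat(r0.count), 1.0) *
      CASE WHEN r0.source = {source0} THEN {weight0} ELSE {default0} END)
     +
     (COALESCE(toFloat(r1.count), 1.0) *
      CASE WHEN r1.source = {source1} THEN {weight1} ELSE {default1} END) AS score
ORDER BY score DESC
LIMIT 20
RETURN COLLECT({id: id, score: score}) AS result

Since your builder already knows which relationship type each MATCH uses, it can look {weight0} and {weight1} up in your modifiers object before the query is ever sent to Neo4j.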

That said, I think it would be more graphy not to have those multipliers coded into your query, but instead to store them on a new Source node. On this new node you can record the trust values (or whatever the weights represent) for that source. That introduces a new level of (in)direction to your graph, though, and would require breaking the WORKED_AT relationship out into a node that links a Person to a Company and adds a provenance (Source). (This is called a HyperEdge.) You gain the advantage of being able to traverse your graph by Source, and you can tweak the query results by modifying the weighting afforded to each source without forever fiddling with your query.
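
A minimal sketch of that hyperedge model, with an Employment node breaking out the WORKED_AT relationship (the Employment/AT/HAS_EMPLOYMENT/FROM_SOURCE names and the weight property on Source are purely illustrative, not a fixed schema):

MATCH (n0:Company {name: 'Google'})<-[:AT]-(e:Employment)<-[:HAS_EMPLOYMENT]-(person:Person),
      (e)-[:FROM_SOURCE]->(s:Source)
WITH DISTINCT ID(person) AS id,
     COALESCE(toFloat(e.count), 1.0) * s.weight AS score
ORDER BY score DESC
LIMIT 20
RETURN COLLECT({id: id, score: score}) AS result

Changing how much you trust a source then becomes a single SET on its Source node rather than an edit to every query.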

If you didn't want to do that, I would still introduce a Source node, but you could look it up by the source property (r0.source) in a separate match. This feels closer to a relational approach and is almost certainly slower, but it moves the scoring logic out of your query and into your data:

MATCH (n0:Company {name: 'Google'})<-[r0:WORKED_AT]-(person:Person)
OPTIONAL MATCH (s:Source{name:r0.source})
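
The rest of the query could then read the weight from that node instead of from the CASE statements, something along these lines (the weight property on Source and the 0.5 fallback are assumptions here, not part of your current model):

WITH DISTINCT ID(person) AS id,
     COALESCE(toFloat(r0.count), 1.0) * COALESCE(s.weight, 0.5) AS score
ORDER BY score DESC
LIMIT 20
RETURN COLLECT({id: id, score: score}) AS result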

A note on size

I believe that your graph could grow to hundreds of millions of people, but if you always start your query with the Company MATCH as above (and then the subsequent Skill match in the revision), then your operating set is not hundreds of millions any more. The graph should be super fast at finding people who worked for Google and used Java, no matter how large the set gets (this assumes an index on both Company.name and Skill.name).
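
Those indexes are cheap to add up front, for example (this is the older CREATE INDEX ON syntax; recent Neo4j versions use CREATE INDEX FOR ... ON instead, so adjust to your version):

// run each statement separately
CREATE INDEX ON :Company(name)
CREATE INDEX ON :Skill(name)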

A note on aggregates

If you want to perform aggregates over millions of nodes across many millions of relationships in a single query, in real time, you might be in trouble (it is memory intensive). Limiting the response doesn't add any benefit when you are working with aggregates, as all the aggregates still have to be calculated in order to determine how to order the results and which ones to return. (LIMIT, in relation to paging, might not do what you think it does in Neo4j anyway, as it operates on the whole dataset and then returns a narrow window.)

There is nothing wrong with running a periodic procedure to generate common scores and store them back into the graph, or with using other values to limit the dataset (I programmed Assembler 15 years ago; I'd like recruitment databases to forget that fact, but they keep calling). Then your common queries can all look super performant, as they hit your own mini index...
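
As a rough sketch of that precomputation idea (the cachedScore property and the flat 0.5 weight are invented for illustration), a batch job could write an indicative score onto each Person:

MATCH (person:Person)-[r:WORKED_AT|HAS_SKILL|HAS_INTEREST|LIVED_IN_COUNTRY]->()
WITH person, SUM(COALESCE(toFloat(r.count), 1.0) * 0.5) AS total
SET person.cachedScore = total

Your interactive queries can then order by that stored value instead of recalculating it per request, at the cost of the score only being as fresh as the last run.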

If it is a realistic use case that you want to score all of your Persons, for all of their jobs and all of their skills, then you will need more than Cypher to achieve it and it will not be real time. I wonder whether this is a case of premature optimisation or a real requirement?