Using Match with Multiple Clauses Causes Odd Results

Question

I am writing a Cypher query in Neo4j 2.0.4 that attempts to get the total number of inbound and outbound relationships for a selected node. I can do this easily when I only use this query one-node-at-a-time, like so:

MATCH (g1:someIndex{name:"name1"})
MATCH g1-[r1]-()
RETURN count(r1);
//Returns 305

MATCH (g2:someIndex{name:"name2"})
MATCH g2-[r2]-()
RETURN count(r2);
//Returns 2334

But when I try to run the query with 2 nodes together (i.e. get the total number of relationships for both g1 and g2), I seem to get a bizarre result.

MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);
//Returns 1423740

For some reason, the number is much much greater than the total of 305+2334.

It seems like other Neo4j users have run into strange issues when using multiple MATCH clauses, so I read through Michael Hunger's explanation at https://groups.google.com/d/msg/neo4j/7ePLU8y93h8/8jpuopsFEFsJ, which advised Neo4j users to pipe the results of one match using WITH to avoid "identifier uniqueness". However, when I run the following query, it simply times out:

MATCH (g1:gene{name:"SV422_HUMAN"}),(g2:gene{name:"BRCA1_HUMAN"})
MATCH g1-[r1]-()
WITH r1
MATCH g2-[r2]-()
RETURN count(r1)+count(r2);

I suspect this query doesn't return because there are a lot of records returned for r1. In this case, how would I run my "get-number-of-relationships" query on 2 nodes? Am I just using some incorrect syntax, or is there some fundamental issue with the logic of my "2 node at a time" query?

Nicole White · Accepted Answer · 2015-04-03T16:40:06

Your first problem is that you are returning a Cartesian product when you do this:

MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);

If there are 305 instances of r1 and 2334 instances of r2, you're returning (305 * 2334) == 711870 rows, and because you are summing this (count(r1)+count(r2)) you're getting a total of 711870 + 711870 == 1423740.

Your second problem is that you are not carrying over g2 in the WITH clause of this query:

MATCH (g1:gene{name:"SV422_HUMAN"}),(g2:gene{name:"BRCA1_HUMAN"})
MATCH g1-[r1]-()
WITH r1
MATCH g2-[r2]-()
RETURN count(r1)+count(r2);

You match on g2 in the first MATCH clause, but then you leave it behind when you carry over only r1 in the WITH clause at line 3. By line 4, g2 is no longer bound, so g2-[r2]-() treats g2 as a fresh identifier and matches every relationship in your graph, once for each r1 row. That is most likely why the query times out.
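As a sketch of how that second query could be repaired, the version below carries g2 through the WITH clause and collapses r1 to a single count before the second MATCH, so no Cartesian product builds up. It reuses the labels and names from your query; the alias c1 is just an illustrative name, and I have not run this against your data.

```cypher
// Find both nodes up front, as in the original query.
MATCH (g1:gene {name:"SV422_HUMAN"}), (g2:gene {name:"BRCA1_HUMAN"})
MATCH (g1)-[r1]-()
// Keep g2 bound and reduce all r1 rows to one row holding their count.
WITH g2, count(r1) AS c1
MATCH (g2)-[r2]-()
// c1 is a single value here, so it is added once to the r2 count.
RETURN c1 + count(r2);
```

Because the aggregation happens in the WITH, only one row enters the second MATCH, and count(r2) sees each of g2's relationships exactly once. One caveat: if g2 had no relationships at all, the second MATCH would produce no rows and the query would return nothing rather than c1.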

Let me walk through a solution with the movie dataset that ships with the Neo4j browser, as you have not provided sample data. Let's say I want to get the total count of relationships attached to Tom Hanks and Hugo Weaving.

As separate queries:

MATCH (:Person {name:'Tom Hanks'})-[r]-()
RETURN COUNT(r)

=> 13

MATCH (:Person {name:'Hugo Weaving'})-[r]-()
RETURN COUNT(r)

=> 5

If I try to do it your way, I'll get (13 * 5) * 2 == 130, which is incorrect:

MATCH (:Person {name:'Tom Hanks'})-[r1]-(),
      (:Person {name:'Hugo Weaving'})-[r2]-()
RETURN COUNT(r1) + COUNT(r2)

=> 130

Again, this is because I've matched on all combinations of r1 and r2, of which there are 65 (13 * 5 == 65), and then summed the two counts to arrive at a total of 130 (65 + 65 == 130).

The solution is to use DISTINCT:

MATCH (:Person {name:'Tom Hanks'})-[r1]-(), 
      (:Person {name:'Hugo Weaving'})-[r2]-()
RETURN COUNT(DISTINCT r1) + COUNT(DISTINCT r2)

=> 18

The DISTINCT modifier makes each COUNT tally every relationship only once, no matter how many rows of the Cartesian product it appears in.
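To see the effect for yourself, a query along these lines returns the raw row count next to the two distinct counts. This is only a sketch against the same movie dataset; the aliases rowCount, tomCount and hugoCount are names I made up for illustration.

```cypher
MATCH (:Person {name:'Tom Hanks'})-[r1]-(),
      (:Person {name:'Hugo Weaving'})-[r2]-()
RETURN COUNT(*) AS rowCount,             // every (r1, r2) combination
       COUNT(DISTINCT r1) AS tomCount,   // each of Tom Hanks' relationships once
       COUNT(DISTINCT r2) AS hugoCount   // each of Hugo Weaving's relationships once
```

Given the separate counts above, I would expect rowCount to be 65 while tomCount and hugoCount stay at 13 and 5, which makes it clear that the inflated total comes from the number of rows and not from the relationships themselves.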

You can also accomplish this with WITH if you prefer, by aggregating the first count before the second MATCH:

MATCH (:Person {name:'Tom Hanks'})-[r]-()
WITH COUNT(r) AS r1
MATCH (:Person {name:'Hugo Weaving'})-[r]-()
RETURN r1 + COUNT(r)

=> 18

TL;DR - Beware of Cartesian products. DISTINCT is your friend:

MATCH (:someIndex{name:"name1"})-[r1]-(), 
      (:someIndex{name:"name2"})-[r2]-()
RETURN COUNT(DISTINCT r1) + COUNT(DISTINCT r2);
