3
votes

I have a multigraph with multiple relationships between nodes. I am trying to write a Cypher query that returns nodes connected by two relationships with different properties:

Nodes with the label Mirna are connected to Gene nodes by the REGULATES relationship. I'd like to return all Mirna and Gene nodes that are connected by two REGULATES relationships, one with the source property first_db and one with second_db.

[Image: graph schema]

Here is what I tried: http://gist.neo4j.org/?4fddc897b30ef7aa4732

This works, but it is very slow for large data sets, I guess because I match too much at the beginning:

MATCH (m:Mirna)-[r:REGULATES]->(g:Gene)
WITH m,g, collect(r.source) AS source    
WHERE 'first_db' IN source AND 'second_db' IN source
RETURN m,g

This executes faster and gives the same results for toy data:

MATCH (m:Mirna)-[r:REGULATES { source: 'first_db' }]->(g:Gene),
      (m:Mirna)-[r2:REGULATES { source: 'second_db' }]->(g:Gene)
RETURN m,g,r,r2

But is this safe and does Cypher always understand that I want two relationships between the same nodes? Is there another more efficient/elegant way to query for multiple relationships?

1
Since you use the same identifiers 'm' and 'g' it should be safe, but maybe some minor adjustment would make the query faster (maybe drop the label on the second mention of an identifier, maybe express it as one pattern, a la (a)-[r1]-(b)-[r2]-(a), though I'm not sure about the difference without profiling). Is this a kind of certainty estimation? I.e. an "at least two authorities say that x, so probably it's true that x" kind of thing? - jjaderberg
One pattern seems to be easier, it didn't occur to me. Yes, it's kind of a certainty estimator. If two data sources agree that a Gene is regulated, I conclude that it's more true ;) - Martin Preusse
Is there a difference in performance as one pattern? In many cases the Cypher engine might refactor to whatever is best under the hood, so maybe there's no difference (though it looks better). Maybe in biology some things are 'more true', but in my field we are skeptical ;) I've made similar 'probable science' use of neo4j but with very different data; would love to grab a coffee and hear about your work sometime. - jjaderberg
The first step is to compare the data sources, which hasn't been addressed properly. I wouldn't use it as a 'strict' estimator. The problem with many data sets in biology: they are sparse and they are wrong :) So you cling to whatever makes it a tiny bit less wrong. In my case the interesting part is that there are many data sets and some (new) ones have a lower error rate than older ones. So you might be able to filter ... what are you working on? Coffee sounds great. - Martin Preusse

1 Answer

3
votes

Your first query does the filtering much too late, so the filter cannot be applied during pattern matching. That's why it's slower (besides being a global graph query). Filtering on r.source before the aggregation helps:

MATCH (m:Mirna)-[r:REGULATES]->(g:Gene)
WHERE r.source = 'first_db' OR r.source = 'second_db'
WITH m,g, collect(r.source) AS source    
WHERE 'first_db' IN source AND 'second_db' IN source
RETURN m,g

If there are no false positives you can also simplify it to this:

MATCH (m:Mirna)-[r:REGULATES]->(g:Gene)
WHERE r.source = 'first_db' OR r.source = 'second_db'
WITH m,g, count(DISTINCT r.source) AS source
WHERE source = 2
RETURN m,g
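
The single-pattern form suggested in the comments could be sketched roughly like this. It is only a sketch against the schema from the question (labels Mirna and Gene, a REGULATES relationship with a source property); the identifiers r1 and r2 are just illustrative names. Within one MATCH clause Cypher will not bind the same relationship to two different identifiers, so r1 and r2 are always two distinct relationships between the same m and g. Prefixing the query with PROFILE shows the execution plan, which is probably the only reliable way to tell whether this form is actually faster on your data:

```cypher
// Sketch: one pattern that leaves m via a 'first_db' relationship
// and returns to the same m via a 'second_db' relationship.
// PROFILE runs the query and reports the plan and database hits.
PROFILE
MATCH (m:Mirna)-[r1:REGULATES {source: 'first_db'}]->(g:Gene)
      <-[r2:REGULATES {source: 'second_db'}]-(m)
RETURN m, g, r1, r2
```

Running the two-pattern query from the question under PROFILE as well lets you compare both plans side by side; the planner may well produce the same plan for both.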