Cypher query optimisation - Utilising known properties of nodes

Question

Setup:

Neo4j and Cypher version 2.2.0. I'm querying Neo4j as an in-memory instance in Eclipse, created with new TestGraphDatabaseFactory().newImpermanentDatabase(). I'm using this approach as it seems faster than the embedded version and I assume it has the same functionality. My graph database is randomly generated programmatically with varying numbers of nodes.

Background:

I generate Cypher queries automatically. These queries are used to try to identify a single 'target' node. I can limit the possible matches of the queries by using known node properties. I only use a 'name' property in this case. If there is a known name for a node, I can use it to find the node id and use this in the START clause. As well as known names, I also know (for some nodes) names that are known not to belong to a node. I specify this in the WHERE clause.

The sorts of queries that I am running look like this...

START

nvari = node(5) 

MATCH

 (target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvari:C4)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION),
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)

WHERE   

NOT(nvarj.Name IN ['nf']) AND NOT(nvarm.Name IN ['nb','nj'])  

RETURN DISTINCT target

Another way to think about this (if it helps), is that this is an isomorphism testing problem where we have some information about how nodes in a query and target graph correspond to each other based on restrictions on labels.

Question:

With regards to optimisation:

1. Would it help to include relationship variables in the MATCH clause? I took them out because the node variables are sufficient to distinguish between relationships, but perhaps leaving them out slows the query down?
2. Should I restructure the query into MATCH/WHERE pairs, putting the WHERE conditions from my previous example first? My expectation is that they can limit possible bindings early on. For example...

START

nvari = node(5)

MATCH

(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION)

WHERE NOT(nvarj.Name IN ['nf'])

MATCH

(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)

WHERE NOT(nvarm.Name IN ['nb','nj'])

MATCH

(target:C5)-[:IN_LOCATION]->(nvara:LOCATION), (nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION), (nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION), (nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION), (nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION), (nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION), (nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION), (nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION), (nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION)

RETURN DISTINCT target

On the side:

(Less important, but still of interest.) If I make each relationship in a MATCH clause an OPTIONAL MATCH, except for relationships containing the target node, would Cypher essentially be finding a maximum common subgraph (MCS) between the query and the graph database, with the constraint that the MCS contains the target node?

Thanks a lot in advance! I hope I have made my requirements clear but I appreciate that this is not a typical use-case for Neo4j.

cybersam · Accepted Answer · 2015-04-12T03:46:28

I think querying with node properties is almost always preferable to using relationship properties (if you had a choice), as that opens up the possibility that indexing can help speed up the query.
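
As a rough illustration of the indexing point, and assuming known names are looked up by label and the Name property, a schema index could be declared in the Neo4j 2.x syntax and the known node fetched by an equality match instead of by id. The name 'na' below is a hypothetical placeholder, not a value from the question.

```cypher
// Neo4j 2.x schema index on the Name property of C4 nodes
// (label and property taken from the question)
CREATE INDEX ON :C4(Name);

// 'na' is a hypothetical known name, used only for illustration.
// An equality lookup on label + property is the kind of predicate
// an index can serve.
MATCH (nvari:C4 {Name: 'na'})
RETURN nvari
```

The "known not to be" conditions are inequality tests, so they are unlikely to benefit from such an index; it mainly helps when pinning the nodes whose names are known.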

As an aside, I would avoid using the IN operator if the collection of possible values only has a single element. For example, this snippet: NOT(nvarj.Name IN ['nf']), should be (nvarj.Name <> 'nf'). The current versions of Cypher might not use an index for the IN operator.
Restructuring a query to eliminate undesirable bindings earlier is exactly what you should be doing.
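
A minimal sketch of such a restructuring, using the identifiers from the question: it starts from the pinned node, binds nvarg, filters its neighbours by name, and only then expands towards the target. The ordering is an assumption about what prunes bindings soonest; the plan that Cypher actually chooses may differ.

```cypher
// nvari is pinned by id, as in the question's START clause
START nvari = node(5)
// Expand from the pinned node first, so nvarg is bound early
MATCH (nvari:C4)-[:IN_LOCATION]->(nvarg:LOCATION)
// Filter the other occupants of nvarg by their excluded names
MATCH (nvarj:C2)-[:IN_LOCATION]->(nvarg)
WHERE nvarj.Name <> 'nf'
MATCH (nvarm:C3)-[:IN_LOCATION]->(nvarg)
WHERE NOT (nvarm.Name IN ['nb', 'nj'])
// Walk outwards from nvarg towards the target
MATCH (nvarg)-[:CONNECTED]->(nvarh:LOCATION),
      (nvarg)-[:CONNECTED]->(nvarf:LOCATION),
      (nvare:LOCATION)-[:CONNECTED]->(nvarf),
      (nvare)-[:CONNECTED]->(nvark:LOCATION),
      (nvard:LOCATION)-[:CONNECTED]->(nvare),
      (nvard)-[:CONNECTED]->(nvarc:LOCATION),
      (nvara:LOCATION)-[:CONNECTED]->(nvarc),
      (nvara)-[:CONNECTED]->(nvarb:LOCATION),
      (target:C5)-[:IN_LOCATION]->(nvara)
RETURN DISTINCT target
```

One caveat: Cypher only requires relationships to be distinct within a single MATCH clause, so splitting a pattern across several MATCH clauses can admit bindings that the single-MATCH version would reject. The CONNECTED patterns are kept together in one MATCH here for that reason.
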
On the side question: first of all, you would need to keep using MATCH (rather than OPTIONAL MATCH) for at least the first relationship in your query (which binds target), or else your result would contain a lot of null rows, which is not very useful.

But, thinking this through, if all the other relationships were placed in separate OPTIONAL MATCH clauses, you'd essentially be saying that you want a match even if none of the optional matches succeeded. Therefore, the logical equivalent would be:
```
MATCH (target:C5)-[:IN_LOCATION]->(nvara:LOCATION)
RETURN DISTINCT target
```
I don't think this is a useful result.
