Create relation between large amount of nodes in Cypher

Question

Given the following graph from this question: Cypher 2 not using schema index with OR operator:

CREATE 
(:Application {Name: "Test Application", Aliases: ["Test", "App", "TestProject"]}),
(:Application {Name: "Another Application", Aliases: ["A-App", "XYZ", "XYProject"]}),
(:Application {Name: "Database X", Aliases: ["DB-App", "DB", "DB-Project"]}),
(:System {Name: "Server1", Application: "TestProject"}),
(:System {Name: "Server2", Application: "Test Application"}),
(:System {Name: "Server3", Application: "another App"}),
(:System {Name: "Server4", Application: "Some Database"}),
(:System {Name: "Server5", Application: "App"}),
(:System {Name: "Server6", Application: "App XY"}),
(:System {Name: "Server7", Application: "App DB"}),
(:System {Name: "Server8", Application: "Test"}),
(:System {Name: "Server9", Application: "TestProject"}),
(:System {Name: "Server10", Application: "test"}),
(:System {Name: "Server11", Application: "App XY"});

CREATE INDEX ON :Application(Name);
CREATE INDEX ON :Application(Aliases);

CREATE INDEX ON :System(Application);

But with 900 Application and 200,000 System nodes.

I added a new alias (e.g. "Test MiniApp") to one of the applications (it will eventually match ~27,000 new System nodes in the production database) and ran the following query:

MATCH (a:Application { Name: "Test Application"})
WITH a
MATCH (s:System)
WHERE s.Application IN (a.Aliases + a.Name)
AND NOT (a)-[:InstalledOn]->(s)
CREATE UNIQUE (a)-[:InstalledOn]->(s)

This query uses the schema index on the production database (tested with PROFILE) but simply runs too long, about 5 minutes. I wonder why it takes so long to create a relation for ~27k nodes that are found with an index.

Neo4j 2.1.6 runs with default settings on a Linux system (SLES 11) with 96 GB RAM.

EDIT: The above query returns just a single node of type Application and is only executed when an application is renamed and/or when an alias is added or removed. Since both entities come from external systems at any time, I cannot rely only on the case where a new system is related directly to an application, because the application may not exist at import time. So when someone adds a new alias, etc., to an application, I need to find all matching systems and create that relation.

How many applications match {Name: "Test Application"}? Just one? Seems here you're looking through most :InstalledOn relationships on most :Systems, which could be slow (200K of them). I'm afraid that since you're looking for things which are not connected to a particular node this inherently takes some time since clearly you can't exploit relationships to traverse to those nodes. — FrobberOfBits
Hi, there is just one Application node with this name and/or alias. — dna
You should probably create relationships from :Application to :System as needed to capture that relationship (where s.Application IN (a.Aliases + Name)) - checking a property against a long list of possibilities across 200k System nodes seems like something you don't want to recompute every time, even with an index. Maybe you don't even need the Application property on :System? — FrobberOfBits
Well, that is the use case: if the application is updated (e.g. name changed and/or alias added/removed), that query is executed. It's not for searching. The problem here is that both kinds of data come from external systems at any time. Handling a new or changed system is a simple search, because you're only looking through 900 applications, but when an application changes I need that query to create the relations. — dna
My point is that neo4j tends to make relationship traversal fast; matching property values is slower. If you can maintain the property value list, you could also maintain a list of rels that captures the same data. I don't know your model so I can't say for sure, but it's a frequently observed anti-pattern to try and join node populations by some value, when rels exist to do that (and faster). Again, you're trying to find stuff that's not connected too in this query; consider how to exploit rels in your model to simplify what you're asking. — FrobberOfBits

cybersam · Accepted Answer · 2015-03-02T16:46:19

The check for NOT (a)-[:InstalledOn]->(s) in the WHERE clause should not be necessary, since CREATE UNIQUE (a)-[:InstalledOn]->(s) will do the same check for you automatically. Essentially, you are doing the same check twice.

Does this speed up things?

MATCH (a:Application { Name: "Test Application"}), (s:System)
WHERE s.Application IN (a.Aliases + a.Name)
CREATE UNIQUE (a)-[:InstalledOn]->(s)
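
A possible variant of the same idea, offered only as a sketch: Neo4j 2.x also has MERGE, which, when both end nodes are already bound, creates the relationship only if it does not exist yet. This assumes MERGE is acceptable in place of CREATE UNIQUE in your setup; whether it is any faster on your data is untested, so compare both with PROFILE.

```cypher
// Sketch only: same match as above, with MERGE instead of CREATE UNIQUE.
// Both a and s are bound by the MATCH, so MERGE only adds the
// :InstalledOn relationship where it is missing.
MATCH (a:Application { Name: "Test Application" }), (s:System)
WHERE s.Application IN (a.Aliases + a.Name)
MERGE (a)-[:InstalledOn]->(s)
```

As with CREATE UNIQUE, running this again after the relationships exist should not create duplicates.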
