Let's say the nodes in my Neo4j graph have a property "name". I want to enforce that there is at most one node for a given name by merging all nodes that share the same name. More precisely: if there are three nodes where name is "dog", I want them to be replaced by just one node with name "dog", which:

  1. Gathers all properties from all the original three nodes.
  2. Has all arcs that were attached to the original three nodes.

The background for this is the following: in my graph, there are often several nodes with the same name which should be considered "equal" (although some have richer property information than others). Putting a.name = b.name in a WHERE clause is extremely slow.

EDIT: I forgot to mention that my Neo4j is of version 2.3.7 currently (I cannot update it).

SECOND EDIT: There is a known list of labels for the nodes and for the possible arcs. The type of the nodes is known.

THIRD EDIT: I want to call the above "node collapse" procedure from Java, so a mixture of Cypher queries and procedural code would also be a useful solution.

Are the labels of these nodes known? Are they the same for all of these nodes? And what about the relationship types for these nodes? - stdob--
What should happen if Node1 (name=A) and Node2 (name=A) have the same property with different values? - K.E.
@K.E. It does not really matter in my case. One could either drop one of the values or define a "property2" for a given "property". The most important point is to gather all the arcs, i.e. the new node should have all the incoming and outgoing arcs with the same labels as the replaced nodes. The main reason for my request is that I have a lot of arcs A -> B1, B2 -> C, where B1 and B2 are "really" the same node and the relation A -> B -> C is the one I want to find. - J Fabian Meier

2 Answers

5
votes

I made a test case with the following data:

CREATE (n1:TestX {name:'A', val1:1})
CREATE (n2:TestX {name:'B', val2:2})
CREATE (n3:TestX {name:'B', val3:3})
CREATE (n4:TestX {name:'B', val4:4})
CREATE (n5:TestX {name:'C', val5:5})

MATCH (n6:TestX {name:'A', val1:1}) MATCH (m7:TestX {name:'B', val2:2}) CREATE (n6)-[:TEST]->(m7)
MATCH (n8:TestX {name:'C', val5:5}) MATCH (m10:TestX {name:'B', val3:3}) CREATE (n8)<-[:TEST]-(m10)

This results in the following graph:

[image: graph of the test data]

Here the three B nodes are really the same node. And here is my solution:

//copy all properties
MATCH (n:TestX), (m:TestX)
WHERE n.name = m.name AND ID(n) < ID(m)
SET n += m;

//copy all outgoing relations
MATCH (n:TestX), (m:TestX)-[:TEST]->(endnode)
WHERE n.name = m.name AND ID(n) < ID(m)
WITH n, collect(endnode) AS endnodes
FOREACH (x IN endnodes | CREATE (n)-[:TEST]->(x));

//copy all incoming relations
MATCH (n:TestX), (m:TestX)<-[:TEST]-(endnode)
WHERE n.name = m.name AND ID(n) < ID(m)
WITH n, collect(endnode) AS endnodes
FOREACH (x IN endnodes | CREATE (n)<-[:TEST]-(x));

//delete duplicates
MATCH (n:TestX), (m:TestX)
WHERE n.name = m.name AND ID(n) < ID(m)
DETACH DELETE m;

The resulting output looks like this:

[image: resulting graph]

Note that you have to know the types of the relationships involved, because each type needs its own pair of copy statements.
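Since the question mentions a known list of relationship types and a Java caller, the per-type statements could be generated instead of written by hand. The following is only a minimal sketch: the class `CollapseQueries` and its method `build` are hypothetical names, the code only produces the Cypher strings, and running them in order through whatever Neo4j Java API is in use is left out. Label and type names are concatenated verbatim, so they are assumed to come from a trusted, fixed list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the Cypher statements of this answer for one
// node label and a known list of relationship types.
public class CollapseQueries {

    // Shared prefix: n is the node that is kept, m is a duplicate with a higher ID.
    private static String pair(String label, String mPattern) {
        return "MATCH (n:" + label + "), " + mPattern
             + " WHERE n.name = m.name AND ID(n) < ID(m) ";
    }

    public static List<String> build(String label, List<String> relTypes) {
        List<String> statements = new ArrayList<>();
        String m = "(m:" + label + ")";

        // 1. Copy all properties onto the kept node.
        statements.add(pair(label, m) + "SET n += m");

        for (String type : relTypes) {
            // 2. Copy outgoing relationships of this type.
            statements.add(pair(label, m + "-[:" + type + "]->(other)")
                + "WITH n, collect(other) AS others "
                + "FOREACH (x IN others | CREATE (n)-[:" + type + "]->(x))");
            // 3. Copy incoming relationships of this type.
            statements.add(pair(label, m + "<-[:" + type + "]-(other)")
                + "WITH n, collect(other) AS others "
                + "FOREACH (x IN others | CREATE (n)<-[:" + type + "]-(x))");
        }

        // 4. Delete the duplicates together with their old relationships.
        statements.add(pair(label, m) + "DETACH DELETE m");
        return statements;
    }

    public static void main(String[] args) {
        // Print the statements for the test case above (label TestX, type TEST).
        for (String statement : build("TestX", Arrays.asList("TEST"))) {
            System.out.println(statement + ";");
        }
    }
}
```

The order matters: the delete statement has to run last, after every copy statement has been executed.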

All the properties are copied from the nodes with "higher" IDs to the nodes with the "lower" IDs.
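Regarding the comment about equal property keys with different values: `SET n += m` keeps the properties of n that m does not have and overwrites those that both have with the value from m. The effect on a single pair of nodes can be mimicked with plain Java maps. This is only an illustration of the semantics, not code that talks to Neo4j; the class `PropertyMerge` and the property `source` are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of what `SET n += m` does to the properties of n.
public class PropertyMerge {

    // Returns the properties n would have after `SET n += m`:
    // keys only in n are kept, keys in m are added or overwritten.
    public static Map<String, Object> merge(Map<String, Object> kept,
                                            Map<String, Object> duplicate) {
        Map<String, Object> result = new LinkedHashMap<>(kept);
        result.putAll(duplicate);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> n = new LinkedHashMap<>();
        n.put("name", "B");
        n.put("val2", 2);
        n.put("source", "x"); // made-up conflicting property

        Map<String, Object> m = new LinkedHashMap<>();
        m.put("name", "B");
        m.put("val3", 3);
        m.put("source", "y");

        // prints {name=B, val2=2, source=y, val3=3}
        System.out.println(merge(n, m));
    }
}
```

With more than two duplicates per name, the query does not pin down which duplicate is applied last, so as far as I can tell the surviving value of a conflicting property should be treated as arbitrary.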

5
votes

I think you need something like a synonym node for each name.

1) Go through all nodes and create a node synonym:

MATCH (N)
WITH N
  MERGE (S:Synonym {name: N.name})
  MERGE (S)<-[:hasSynonym]-(N)
RETURN count(S);

2) Remove the synonyms with only one node:

MATCH (S:Synonym)
WITH S
MATCH (S)<-[:hasSynonym]-(N)
WITH S, count(N) as count
WITH S WHERE count = 1
DETACH DELETE S;

3) Transfer properties and relationships for the remaining synonyms (with APOC, which requires Neo4j 3.0 or later):

MATCH (S:Synonym)
WITH S
MATCH (S)<-[:hasSynonym]-(N)
WITH [S] + collect(N) as nodesForMerge
CALL apoc.refactor.mergeNodes( nodesForMerge ) YIELD node
RETURN count(node);

4) Remove Synonym label:

MATCH (S:Synonym)<-[:hasSynonym]-(N)
CALL apoc.create.removeLabels( [S], ['Synonym'] ) YIELD node
RETURN count(node);