
I imported a large set of nodes (more than 16,000), where each node contains information about a location (longitude/latitude geo-data). All nodes have the same label, and there are no relationships in this scenario. Now I want to identify, for each node, its nearest neighbour by distance and create a relationship between these two nodes.

This brute-force way worked well for sets of about 1,000 nodes: (1) First I created relationships between all pairs of nodes, each carrying the distance as a property. (2) Then I set the property "mindist=false" on every relationship. (3) After that I identified the nearest neighbour by comparing the distance values for each node's relationships and set "mindist" to "true" where the relationship represents the shortest distance. (4) Finally I deleted all relationships with "mindist=false".

(1)

MATCH (n1:XXX), (n2:XXX)
WHERE id(n1) <> id(n2)
WITH n1, n2, distance(n1.location, n2.location) AS dist
CREATE (n1)-[R:DISTANCE {dist: dist}]->(n2)
RETURN R

(2)

MATCH (n1:XXX)-[R:DISTANCE]->(n2:XXX)
SET R.mindist = false
RETURN R.mindist

(3)

MATCH (n1:XXX)-[R:DISTANCE]->(n2:XXX)
WITH n1, min(R.dist) AS mindist
MATCH (o1:XXX)-[r:DISTANCE]->(o2:XXX)
WHERE o1.name = n1.name AND r.dist = mindist
SET r.mindist = true
RETURN r

(4)

MATCH (n)-[R:DISTANCE]->()
WHERE R.mindist = false
DELETE R
RETURN n

With sets of about 16,000 nodes this solution didn't work (memory problems ...). I am sure there is a smarter way to solve this problem, but at this point I am still short on experience working with Neo4j/Cypher. ;-)
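The memory blow-up is easy to see outside Neo4j: steps (1)-(4) materialise every pairwise distance before picking the minimum, so the intermediate state grows quadratically with the node count. A minimal Python sketch of the same logic (hypothetical coordinates, with plain Euclidean distance standing in for Cypher's distance()):

```python
import math

def nearest_by_materialising(points):
    """Mirror steps (1)-(4): build every pairwise 'relationship',
    then keep only the one with the minimum distance per node."""
    # Step (1): one entry per ordered pair -- n*(n-1) of them.
    pairs = [(i, j, math.dist(points[i], points[j]))
             for i in range(len(points))
             for j in range(len(points)) if i != j]
    # Steps (3)+(4): keep, for each source node, the pair with min distance.
    best = {}
    for i, j, d in pairs:
        if i not in best or d < best[i][1]:
            best[i] = (j, d)
    return best, len(pairs)

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
best, n_pairs = nearest_by_materialising(points)
# 3 nodes already yield 6 pairs; 16,000 nodes would yield ~256 million.
```

The pairs list is exactly the set of DISTANCE relationships from step (1), and its quadratic size is what exhausts memory at 16,000 nodes.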

What should happen if there are two neighbors at the same distance? – Rajendra Kadam
In this theoretical case there should be two or more relationships. (But I am using real geo-data with more than 10 digits after the decimal point, so I think this scenario will not occur.) – CB_Dev
Do you mind how long it takes to run? – Pablissimo
For 1,090 nodes, step 1 completed after roughly 10 seconds. Steps 2-4 took many minutes (sorry, I don't know exactly) ... finally it worked, but not efficiently :-( – CB_Dev

2 Answers


You can find the closest neighbor for each node, one by one, in batches using APOC. (This is also a brute-force approach, but it runs faster.) It takes around 75 seconds for 7,322 nodes.

CALL apoc.periodic.iterate("
MATCH (n1:XXX)
RETURN n1
", "
MATCH (n2:XXX)
WHERE id(n1) <> id(n2)
WITH n1, n2, distance(n1.location, n2.location) AS dist ORDER BY dist LIMIT 1
CREATE (n1)-[r:DISTANCE {dist: dist}]->(n2)
", {batchSize: 1, parallel: true, concurrency: 10})

NOTE: batchSize should always be 1 in this query. You can vary concurrency for experimentation.
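The reason this uses far less memory than the question's approach: ORDER BY dist LIMIT 1 keeps only a running minimum per node instead of materialising all pairwise relationships first. The same idea in plain Python (hypothetical data, Euclidean distance standing in for distance()):

```python
import math

def nearest_streaming(points, i):
    """Equivalent of the inner query: scan every other node once,
    keeping only the current closest -- O(1) extra memory per node."""
    best_j, best_d = None, float("inf")
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)
        if d < best_d:
            best_j, best_d = j, d
    return best_j, best_d

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
```

Total work is still O(n²) across all nodes, but nothing quadratic is ever held in memory, which is why this completes where the four-step version runs out of heap.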


Our options within Cypher are, I think, limited to a naive O(n^2) brute-force check of the distance from every node to every other node. If you were to write some custom Java to do it (which you could expose as a Neo4j plugin), you could do the check much more quickly.
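To make "much quicker" concrete: the usual trick is a spatial bucketing structure, so each node is compared only against points in nearby cells rather than against every other point. This is a hypothetical sketch in Python rather than a Neo4j plugin, but the same idea carries over directly to Java: hash points into grid cells, then scan cells in expanding rings around each query point until no farther cell can possibly hold a closer point.

```python
import math
import random
from collections import defaultdict

def build_grid(points, cell):
    """Bucket point indices by integer grid cell."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(int(x // cell), int(y // cell))].append(i)
    return grid

def nearest(points, grid, cell, qi):
    """Nearest neighbour of points[qi], scanning grid cells in
    expanding square rings around the query point's own cell."""
    qx, qy = points[qi]
    cx, cy = int(qx // cell), int(qy // cell)
    rmax = max(max(abs(gx - cx), abs(gy - cy)) for gx, gy in grid)
    best_j, best_d = None, float("inf")
    for r in range(rmax + 1):
        # Every point in a ring-r cell is at least (r - 1) * cell away,
        # so once that lower bound exceeds best_d we can stop early.
        if best_j is not None and (r - 1) * cell > best_d:
            break
        for gx in range(cx - r, cx + r + 1):
            for gy in range(cy - r, cy + r + 1):
                if max(abs(gx - cx), abs(gy - cy)) != r:
                    continue  # only visit cells on the ring boundary
                for j in grid.get((gx, gy), ()):
                    if j != qi:
                        d = math.dist(points[qi], points[j])
                        if d < best_d:
                            best_j, best_d = j, d
    return best_j, best_d

random.seed(1)
pts = [(random.random() * 100, random.random() * 100) for _ in range(200)]
grid = build_grid(pts, 10.0)
```

With the cell size tuned to the data density, each query touches only a handful of cells, so the all-pairs scan drops to roughly linear total work for evenly distributed points.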

Still, you can do it with arbitrary numbers of nodes in the graph without blowing out the heap if you use APOC to split the query up into multiple transactions. Note: you'll need to add the APOC plugin to your install.

Let's first create 20,000 points of test data:

WITH range(1, 20000) as ids
WITH [x in ids | { id: x, loc: point({ x: rand() * 100, y: rand() * 100 }) }] as points
UNWIND points as pt
CREATE (p: Point { id: pt.id, location: pt.loc })

We'll probably want a couple of indexes too:

CREATE INDEX ON :Point(id)
CREATE INDEX ON :Point(location)

In general, the following query (don't run it yet...) would, for each Point node, build a list containing the ID of and distance to every other Point node in the graph, sort that list so the nearest one is at the top, take the first item from the list, and create the corresponding relationship.

MATCH (p: Point)
MATCH (other: Point) WHERE other.id <> p.id
WITH p, [x in collect(other) | { id: x.id, dist: distance(p.location, x.location) }] AS dists
WITH p, head(apoc.coll.sortMaps(dists, '^dist')) AS closest
MATCH (closestPoint: Point { id: closest.id })
MERGE (p)-[:CLOSEST_TO]->(closestPoint)

However, the first two lines there cause a cartesian product of nodes in the graph: for us, it's 400 million rows (20,000 * 20,000) that flow into the rest of the query all of which is happening in memory - hence the blow-up. Instead, let's use APOC and apoc.periodic.iterate to split the query in two:

CALL apoc.periodic.iterate(
"
    MATCH (p: Point)
    RETURN p
",
"
    MATCH (other: Point) WHERE other.id <> p.id
    WITH p, [x in collect(other) | { id: x.id, dist: distance(p.location, x.location) }] AS dists
    WITH p, head(apoc.coll.sortMaps(dists, '^dist')) AS closest
    MATCH (closestPoint: Point { id: closest.id })
    MERGE (p)-[:CLOSEST_TO]->(closestPoint)
", { batchSize: 100 })

The first query just returns all Point nodes. apoc.periodic.iterate will then take the 20,000 nodes from that query and split them up into batches of 100 before running the inner query on each of the nodes in each batch. We'll get a commit after each batch, and our memory usage is constrained to whatever it costs to run the inner query.
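The batching behaviour can be pictured as a simple chunked loop. This hypothetical Python sketch shows the shape of it, with a plain function call standing in for the inner query and a commit boundary per batch (the names here are illustrative, not APOC's actual implementation):

```python
def periodic_iterate(items, inner, batch_size=100):
    """Process items in fixed-size batches; each batch is one 'transaction',
    so peak memory is bounded by the cost of a single batch."""
    batches = 0
    for start in range(0, len(items), batch_size):
        for item in items[start:start + batch_size]:
            inner(item)
        batches += 1  # commit point in the real apoc.periodic.iterate
    return batches

seen = []
n_batches = periodic_iterate(list(range(20000)), seen.append, batch_size=100)
```

For 20,000 nodes and batchSize 100 that is 200 committed transactions, each holding at most 100 nodes' worth of intermediate state.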

It's not quick, but it does complete. On my machine it's processing about 12 nodes a second on a graph with 20,000 nodes, but the cost grows quadratically as the number of nodes in the graph increases. You'll rapidly hit the point where this approach just doesn't scale well enough.