0
votes

I have a Neo4J instance running with the Neo4J Spatial plugin. In it, I have a graph with around 3.5 k nodes each with the same label, we'll call Basket. Each Basket relates to a physical location in the same city, and the density of these baskets is very variable. I have calculated walking times between each Basket and all of its neighbours within 600m, and stored these as non-spatial (directed) relationships between nodes. Thus, some Baskets exist as what seems to be part of a big cluster, and others exist almost on their own, with only one or almost no relationships to other Baskets.

My users have a problem: they wish to begin in one place, and end in another place, visiting an arbitrary, user-defined, number of Baskets along the way. My program aims to provide a few route options for the user (as a sequence of nodes - I'll sort the actual how-to-walk-there part later), calculating the n-th number of shortest paths.

I've written a cypher query to do this, below.

start a = node(5955), b=node(6497) 
WITH a,b 
    MATCH p=((a)-[r:IS_WALKABLE_TO*4..5]->(b)) 
RETURN p

N.B. - nodes 5955 and 6497 are two nodes I picked about 2 miles apart, in this instance I decided to opt for between 4 and 5 baskets along the way.

However, I keep running into an out of memory exception, and so would like advice on how to reduce the memory demand for this problem to make it perform on an affordable server in an acceptable time of 1 to 6 seconds.

My understanding is that Neo4j would not perform a Cartesian Product to find the solution, but kind of "pick each node and sniff around from each one until it finds a suitable-sized connection" (please, forgive my phrasing!), so I'm confused about the heap memory error.

My thoughts for improving the program are to:

  1. Somehow restrict the path-finding part of the query to nodes within a bounding box, determined by the placing of the start and end node (i.e., add 500 m in each direction, then limit the query to these nodes). However, I can't find any documentation on how to do this - is it possible without having to create another spatial layer for each query?

  2. Re-write the query in a way which doesn't create a memory error - is this doable easily?

  3. Stop using Neo4J for this entirely and write an algorithm to do it manually using an alternative language. If so, what language would you recommend? C? C++ / C#? Or could I stick with Python / Ruby / Java / Go? (or, I was even thinking I might be able to do it in PHP quite effectively but I'm not sure if that was a moment of madness).

Any help and advice about how to tackle this much appreciated!

2
You forgot the colon in front of :IS_WALKABLE_TO so you consider ALL relationships between your nodes.Michael Hunger
Which Neo4j version do you use? Would you be able to share your database?Michael Hunger
I presume you get a few hundred million possible paths here, that's why it explodes. Can you run your query prefixed with EXPLAIN to see the potential cardinalities ?Michael Hunger
Thanks very much for this! I'm using version 2.3.1. I'm afraid I can't release the database at the moment, but I think you're right. Because of the clustering, the number of possible paths is easily in the order 10^6 to 10^7, which definitely isn't practical for a small server demanding reasonable execution time.Dom Weldon

2 Answers

1
votes

You might be better off refactoring this Cypher query into Java code into an unmanaged extension. Your java code might then use either Traversal API or GraphAlgoFactory.pathsWithLength()

1
votes

I think due to the densely connected shape of your graph you easily end up with hundreds of millions of possible path due to duplicate intermediate nodes.

You should add a LIMIT 100 to your query then it stops searching for paths.

One other idea is to rewrite your query to first find distinct starting points around a (and potentially b).

start a = node(5955), b=node(6497) 
MATCH (a)-[:IS_WALKABLE_TO]->(a1)-[:IS_WALKABLE_TO]->(a2)
WITH a, b, a2, collect(a1) as first
MATCH p = shortestPath((a2)-[:IS_WALKABLE_TO*..2]->(b)) 
RETURN count(*)

// or
UNWIND first as a1
RETURN [a,a1] + nodes(p) as path