2
votes

If I have a tree modelled in Neo4J, and I know all the indexed node properties of a possible subtree, how can I quickly do a match to confirm if that subtree exists?

For an easy to understand example, if my tree is a representation of all the directories and files on a disk, and what I want to confirm is the existence of a file, here's what I'm doing now to test if /a/b/c/d/e/f/g/h/i/j/k/file.txt exists in a graph where my filesystem nodes have label mylabel and have an indexed basename property:

MATCH (root:`mylabel` { basename: '/' })-[:contains]->(`1`:`mylabel` { basename: 'a' })-[:contains]->(`2`:`mylabel` { basename: 'b' })-[:contains]->(`3`:`mylabel` { basename: 'c' })-[:contains]->(`4`:`mylabel` { basename: 'd' })-[:contains]->(`5`:`mylabel` { basename: 'e' })-[:contains]->(`6`:`mylabel` { basename: 'f' })-[:contains]->(`7`:`mylabel` { basename: 'g' })-[:contains]->(`8`:`mylabel ` { basename: 'h' })-[:contains]->(`9`:`mylabel ` { basename: 'i' })-[:contains]->(`10`:`mylabel` { basename: 'j' })-[:contains]->(`11`:`mylabel` { basename: 'k' })-[:contains]->(leaf:`vdp`:`mylabel` { basename: 'file.txt' }) RETURN leaf

I also happen to have an absolute path property on the leaf node, but I can't just search for that because I want to use a similar query to do a merge if I'm adding a file, and in case the file has moved (by deleting the relationship of leaf to k, and adding a new relationship of leaf to some other directory node) without the path property getting updated. Also, I can have the same absolute path multiple times for different files, because it can be for different disks (this is modelled by the root node having a basename that includes the disk name, but that isn't seen in the leaf nodes path property).

The above query takes 2-7 seconds to run (uncached). Is there any way to do something like this quicker? When I specify every node in the path I hoped Neo4J could cope even with paths of length 12+.

1

1 Answers

1
votes

Some ideas here:

  • since you always start at the root node (and I assume there is only one) you don't really need indexes here. Instead assign the root node a label, e.g. Root and make sure it's the only node carrying that label
  • With that in place you can hint Cypher my matching the root node first and do the path matching after a with
  • I assume everything in a subfolder in your hierarchy has the mylabel label, so performance wise it's better to skip the check (since it's implicitly given)

.

MATCH (r:Root)
WITH r
MATCH (r)-[:contains]->(a { basename: 'a' })
-[:contains]->( { basename: 'b' })
-[:contains]->( { basename: 'c' })
-[:contains]->( { basename: 'd' })
-[:contains]->( { basename: 'e' })
-[:contains]->( { basename: 'f' })
-[:contains]->( { basename: 'g' })
-[:contains]->( { basename: 'h' })
-[:contains]->( { basename: 'i' })
-[:contains]->( { basename: 'j' })
-[:contains]->( { basename: 'k' })
-[:contains]->(leaf:`vdp`:`mylabel` { basename: 'file.txt' }) 
RETURN leaf

In case you don't know the depth of your leaf beforehand you can use variable path length matchers:

match (r:Root)
with r
match p=(r)-[:contains*]->()
with p,[x in tail(nodes(p))|x.basename] as names
where names= ["a","b","c","d","e","f","g","h","i","j","k","file.txt"]
return p, names

However this statement will fully traverse your while directory tree and does not an early skip approach. You can do that by using e.g. in Java using traversal API.