Gremlin - finding connected nodes with several boolean conditions on both nodes and edges properties

Question

I want to find the nodes that should be linked to a given node, where the link is defined by the following logic over the nodes' attributes and the attributes of the existing edges between them:

A) The pair has the same zip (node attribute) and name_similarity (edge attribute) > 0.3, OR

B) The pair has a different zip and name_similarity > 0.5, OR

C) The pair has an edge of type "external_info" with value = "connect",

D) AND the pair doesn't have an edge of type "external_info" with value = "disconnect".

In short: (A | B | C) & (~D)

I'm still a newbie to Gremlin, so I'm not sure how to combine several conditions on edges and nodes.

Below is the code for creating the graph, as well as the expected results for that graph:

# creating nodes

(g.addV('person').property('name', 'A').property('zip', '123').
addV('person').property('name', 'B').property('zip', '123').
addV('person').property('name', 'C').property('zip', '456').
addV('person').property('name', 'D').property('zip', '456').
addV('person').property('name', 'E').property('zip', '123').
addV('person').property('name', 'F').property('zip', '999').iterate())

node1 = g.V().has('name', 'A').next()
node2 = g.V().has('name', 'B').next()
node3 = g.V().has('name', 'C').next()
node4 = g.V().has('name', 'D').next()
node5 = g.V().has('name', 'E').next()
node6 = g.V().has('name', 'F').next()

# creating name similarity edges

g.V(node1).addE('name_similarity').from_(node1).to(node2).property('score', 1).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node3).property('score', 0.2).next() # under threshold
g.V(node1).addE('name_similarity').from_(node1).to(node4).property('score', 0.4).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node5).property('score', 1).next() # over threshold
g.V(node1).addE('name_similarity').from_(node1).to(node6).property('score', 0).next() # under threshold

# creating external output edges

g.V(node1).addE('external_info').from_(node1).to(node5).property('decision', 'connect').next() 
g.V(node1).addE('external_info').from_(node1).to(node6).property('decision', 'disconnect').next()

The expected output for input node A is nodes B (due to condition A), D (due to condition B), and F (due to condition C). Node E should not be linked, due to condition D.

I'm looking for a Gremlin query that will retrieve these results.

stephen mallette · Accepted Answer · 2020-08-07T12:17:17

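Something seemed wrong in your data given the output you expected, so I had to make two corrections:

Vertex D wouldn't appear in the results because its "score" (0.4) was below the 0.5 threshold for a different zip, so I raised it to 0.6.
The "external_info" edges seemed reversed (E had "connect" and F had "disconnect"), so I swapped them.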

Here's the data I used:

g.addV('person').property('name', 'A').property('zip', '123').
addV('person').property('name', 'B').property('zip', '123').
addV('person').property('name', 'C').property('zip', '456').
addV('person').property('name', 'D').property('zip', '456').
addV('person').property('name', 'E').property('zip', '123').
addV('person').property('name', 'F').property('zip', '999').iterate()
node1 = g.V().has('name', 'A').next()
node2 = g.V().has('name', 'B').next()
node3 = g.V().has('name', 'C').next()
node4 = g.V().has('name', 'D').next()
node5 = g.V().has('name', 'E').next()
node6 = g.V().has('name', 'F').next()
g.V(node1).addE('name_similarity').from(node1).to(node2).property('score', 1).next() 
g.V(node1).addE('name_similarity').from(node1).to(node3).property('score', 0.2).next() 
g.V(node1).addE('name_similarity').from(node1).to(node4).property('score', 0.6).next() 
g.V(node1).addE('name_similarity').from(node1).to(node5).property('score', 1).next() 
g.V(node1).addE('name_similarity').from(node1).to(node6).property('score', 0).next() 
g.V(node1).addE('external_info').from(node1).to(node6).property('decision', 'connect').next() 
g.V(node1).addE('external_info').from(node1).to(node5).property('decision', 'disconnect').next()

I went with the following approach:

gremlin> g.V().has('person','name','A').as('a').
......1>   V().as('b').
......2>   where('a',neq('b')).
......3>   or(where('a',eq('b')).                                                    // A
......4>        by('zip').
......5>      bothE('name_similarity').has('score',gt(0.3)).otherV().where(eq('a')), 
......6>      bothE('name_similarity').has('score',gt(0.5)).otherV().where(eq('a')), // B
......7>      bothE('external_info').                                                // C
......8>        has('decision','connect').otherV().where(eq('a'))).
......9>   filter(__.not(bothE('external_info').                                     // D
.....10>                 has('decision','disconnect').otherV().where(eq('a')))).
.....11>   select('a','b').
.....12>    by('name')
==>[a:A,b:B]
==>[a:A,b:D]
==>[a:A,b:F]
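
As a sanity check independent of any Gremlin server, the rule (A | B | C) & ~D can be evaluated in plain Python over the corrected data above. This is only a sketch: the dictionaries and the helper names (`edge_value`, `should_link`, `linked_to`) are illustrative assumptions rather than part of any Gremlin API, and it assumes at most one edge of each label per pair.

```python
# Corrected data from the answer, modelled as plain Python structures.
# The dict/tuple layout is an illustrative assumption, not a Gremlin API.
ZIP = {"A": "123", "B": "123", "C": "456", "D": "456", "E": "123", "F": "999"}

# name_similarity edges: (from, to) -> score
SCORE = {
    ("A", "B"): 1,
    ("A", "C"): 0.2,
    ("A", "D"): 0.6,
    ("A", "E"): 1,
    ("A", "F"): 0,
}

# external_info edges: (from, to) -> decision
DECISION = {("A", "F"): "connect", ("A", "E"): "disconnect"}


def edge_value(edges, a, b):
    """Look up an edge in either direction, mirroring bothE() in the query."""
    return edges.get((a, b), edges.get((b, a)))


def should_link(a, b):
    """Evaluate (A | B | C) & ~D for the pair (a, b)."""
    score = edge_value(SCORE, a, b)
    decision = edge_value(DECISION, a, b)
    same_zip = ZIP[a] == ZIP[b]
    cond_a = same_zip and score is not None and score > 0.3
    cond_b = (not same_zip) and score is not None and score > 0.5
    cond_c = decision == "connect"
    cond_d = decision == "disconnect"
    return (cond_a or cond_b or cond_c) and not cond_d


def linked_to(a):
    """Return the names of all other vertices that should be linked to a."""
    return sorted(b for b in ZIP if b != a and should_link(a, b))


print(linked_to("A"))  # ['B', 'D', 'F']
```

The result matches the three rows returned by the traversal: E satisfies A but is excluded by D, and F fails A and B but is admitted by C.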

I think this contains all the logic you were looking for. Note that branch B does not test for a different zip: a same-zip pair with a score above 0.5 already satisfies A, so the result is the same.

I didn't spend a lot of time optimizing it, as I don't think any optimization will get around the pain of the full graph scan of V().as('b'). Either your situation involves a relatively small graph (in-memory, perhaps) and this query will work, or you would need to find another method altogether. Perhaps you have ways to further limit "b", which might help?

If something along those lines is possible, I'd probably try to better define the directionality of the edge traversals to avoid bothE() and instead limit to outE() or inE(), which would get rid of otherV(). Hopefully you use a graph that allows for vertex-centric indices, which would speed up those edge lookups on "score" as well (not sure that would help much on "decision", as it has low selectivity).
