I'd like to create an edge list that shows connections and connection strength. This sample graph contains 4 people and information about their attendance at workshops A and B, including the day attended and the number of hours they stayed. I'd like to form connections through the workshop node, where I would consider two people to be connected if they attended the same workshop on the same day, and the connection strength would be the minimum number of hours spent at the workshop.

Here is the sample graph:

g.addV('person').property(id, '1').property('name', 'Alice').next()
g.addV('person').property(id, '2').property('name', 'Bob').next()
g.addV('person').property(id, '3').property('name', 'Carol').next()
g.addV('person').property(id, '4').property('name', 'David').next()
g.addV('workshop').property(id, '5').property('name', 'A').next()
g.addV('workshop').property(id, '6').property('name', 'B').next()

g.V('1').addE('attended').to(g.V('5')).property('hours', 2).property('day', 'Monday').next()
g.V('1').addE('attended').to(g.V('6')).property('hours', 2).property('day', 'Monday').next()
g.V('2').addE('attended').to(g.V('5')).property('hours', 5).property('day', 'Monday').next()
g.V('3').addE('attended').to(g.V('6')).property('hours', 5).property('day', 'Monday').next()
g.V('4').addE('attended').to(g.V('5')).property('hours', 4).property('day', 'Tuesday').next()
g.V('4').addE('attended').to(g.V('6')).property('hours', 4).property('day', 'Monday').next()
g.V('2').addE('attended').to(g.V('6')).property('hours', 1).property('day', 'Monday')

This would be step 1, showing the minimum hours at each workshop for each pair that attended it on the same day:

[image: minimum hours per workshop for each pair that attended on the same day]

Note that David doesn't have any connections through workshop A because he attended it on a different day than Alice and Bob.

We can then find the total strength of the relationship by adding up hours together across workshops for each pair (now Alice and Bob have 3 total hours together, which were across workshops A and B):

[image: total hours together for each pair, summed across workshops]
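As a cross-check that does not depend on any graph engine, the two steps above can be worked out in plain Python from the sample data. This is only a sketch of the expected result, not a Gremlin solution; the `attended` list and the variable names are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# (person, workshop, day, hours) rows copied from the sample graph above.
attended = [
    ("Alice", "A", "Monday", 2),
    ("Alice", "B", "Monday", 2),
    ("Bob", "A", "Monday", 5),
    ("Carol", "B", "Monday", 5),
    ("David", "A", "Tuesday", 4),
    ("David", "B", "Monday", 4),
    ("Bob", "B", "Monday", 1),
]

# Group attendees by (workshop, day): only people in the same group are connected.
sessions = defaultdict(list)
for person, workshop, day, hours in attended:
    sessions[(workshop, day)].append((person, hours))

# Step 1: minimum hours for each pair within each (workshop, day) group.
# Sorting by name gives every pair one canonical order (p1 < p2).
per_workshop = {}
for (workshop, day), people in sessions.items():
    for (p1, h1), (p2, h2) in combinations(sorted(people), 2):
        per_workshop[(p1, p2, workshop, day)] = min(h1, h2)

# Step 2: add up the per-workshop minimums for each pair.
totals = defaultdict(int)
for (p1, p2, _workshop, _day), hrs in per_workshop.items():
    totals[(p1, p2)] += hrs

for pair, hrs in sorted(totals.items()):
    print(pair, hrs)
```

Because each pair is stored in alphabetical order, Alice-Bob and Bob-Alice are the same key here, so every connection appears once.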

I'm struggling with how to write this in a Neptune graph using Gremlin. I'm more familiar with Cypher, and could find this type of edge list using something like this:

match (p:Person)-[a:ATTENDED]->(w:Workshop)<-[a2:ATTENDED]-(other:Person)
where a.day = a2.day
and p.name <> other.name
unwind [a.hours, a2.hours] as hrs
with p, w, other, a, min(hrs) as hrs
return p.name, other.name, sum(hrs) as total_hours

This is as far as I've gotten with Gremlin, but I'm not sure how to finish up the summarization:

g.V().
    hasLabel('person').as('p').
    outE().as('e').
    inV().as('ws').
    inE('attended').
    where(eq('e')).by('day').as('e2').
    otherV().
    where(neq('p')).as('other').
    select('p','e','other','e2','ws').
    by(valueMap('name','hours','day'))

Would anyone be able to help?

Just to make sure I understand what you are looking for: do you want the sum, difference or ... for each case where the same two people appear in name and other? - Kelvin Lawrence
I went ahead and created an answer. Please let me know if this is not quite what you were looking for. - Kelvin Lawrence

1 Answer

Given more time I am fairly sure the query can be simplified. However, given where you have got to so far, we can extract the details for each person:

g.V().
    hasLabel('person').as('p').
    outE().as('e').
    inV().as('ws').
    inE('attended').
    where(eq('e')).by('day').as('e2').
    otherV().
    where(neq('p')).as('other').
    select('p','e','other','e2','ws').
    by(valueMap('name','hours','day').
      by(unfold())).
    project('p1','p2','shared').
      by(select('p').select('name')).
      by(select('other').select('name')).
      by(union(select('e').select('hours'),
               select('e2').select('hours')).min())     

This gives us the hours each pair shared at each workshop, but not yet the grand total per pair:

==>[p1:Alice,p2:Bob,shared:2]
==>[p1:Alice,p2:Carol,shared:2]
==>[p1:Alice,p2:David,shared:2]
==>[p1:Alice,p2:Bob,shared:1]
==>[p1:Bob,p2:Alice,shared:2]
==>[p1:Bob,p2:Alice,shared:1]
==>[p1:Bob,p2:Carol,shared:1]
==>[p1:Bob,p2:David,shared:1]
==>[p1:Carol,p2:Alice,shared:2]
==>[p1:Carol,p2:David,shared:4]
==>[p1:Carol,p2:Bob,shared:1]
==>[p1:David,p2:Alice,shared:2]
==>[p1:David,p2:Carol,shared:4]
==>[p1:David,p2:Bob,shared:1]

All that is left is to produce the final results. One way to do this is to use a group step.

g.V().
    hasLabel('person').as('p').
    outE().as('e').
    inV().as('ws').
    inE('attended').
    where(eq('e')).by('day').as('e2').
    otherV().
    where(neq('p')).as('other').
    select('p','e','other','e2','ws').
    by(valueMap('name','hours','day').
      by(unfold())).
    project('p1','p2','shared').
      by(select('p').select('name')).
      by(select('other').select('name')).
      by(union(select('e').select('hours'),
               select('e2').select('hours')).min()).
    group().
      by(union(select('p1'),select('p2')).fold()).
      by(select('shared').sum())

==>[[Bob,Carol]:1,[David,Alice]:2,[Carol,Alice]:2,[Carol,Bob]:1,[Alice,Bob]:3,[Carol,David]:4,[Bob,Alice]:3,
[David,Bob]:1,[Bob,David]:1,[David,Carol]:4,[Alice,Carol]:2,[Alice,David]:2]    

Adding an unfold makes the results a little easier to read. I did not try to factor out duplicate pairs such as Bob-Alice and Alice-Bob. If you need to do that in the query, an order step could be added after the group is created, followed by a dedup.

g.V().
    hasLabel('person').as('p').
    outE().as('e').
    inV().as('ws').
    inE('attended').
    where(eq('e')).by('day').as('e2').
    otherV().
    where(neq('p')).as('other').
    select('p','e','other','e2','ws').
    by(valueMap('name','hours','day').
      by(unfold())).
    project('p1','p2','shared').
      by(select('p').select('name')).
      by(select('other').select('name')).
      by(union(select('e').select('hours'),
               select('e2').select('hours')).min()).
    group().
      by(union(select('p1'),select('p2')).fold()).
      by(select('shared').sum()).
    unfold()

==>[Bob, Carol]=1
==>[David, Alice]=2
==>[Carol, Alice]=2
==>[Carol, Bob]=1
==>[Alice, Bob]=3
==>[Carol, David]=4
==>[Bob, Alice]=3
==>[David, Bob]=1
==>[Bob, David]=1
==>[David, Carol]=4
==>[Alice, Carol]=2
==>[Alice, David]=2
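
If the duplicates do not have to be removed inside the query, they can also be collapsed on the client. A minimal Python sketch, assuming the unfolded results have been copied by hand into a list of `((p1, p2), total)` tuples; the `directed` name and that tuple shape are illustrative and not what any particular driver returns.

```python
# Directed results copied from the unfolded output above.
directed = [
    (("Bob", "Carol"), 1),
    (("David", "Alice"), 2),
    (("Carol", "Alice"), 2),
    (("Carol", "Bob"), 1),
    (("Alice", "Bob"), 3),
    (("Carol", "David"), 4),
    (("Bob", "Alice"), 3),
    (("David", "Bob"), 1),
    (("Bob", "David"), 1),
    (("David", "Carol"), 4),
    (("Alice", "Carol"), 2),
    (("Alice", "David"), 2),
]

undirected = {}
for (p1, p2), total in directed:
    # Sort the two names so Alice-Bob and Bob-Alice map to the same key.
    key = tuple(sorted((p1, p2)))
    # Both directions carry the same total, so keep the first one seen
    # instead of summing, which would double the hours.
    undirected.setdefault(key, total)

for pair, total in sorted(undirected.items()):
    print(pair, total)
```

The twelve directed rows reduce to six undirected pairs, with Alice and Bob still at 3 hours.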