
We have a Cosmos DB graph database like the one below, i.e. A, B, C, ... are nodes/vertices and the edges are as shown by the arrows.

[Image: dependency graph of vertices A, B, C, ... connected by directed edges]

Each node/vertex represents a value in a SQL table. The process and requirements are as follows.

  1. The user modifies the value of node A in the SQL table
  2. A Gremlin query passes A into the graph
  3. The graph returns the dependent vertices (D, K, M, P) in the order listed below
  4. The C# app calculates the values of the D, K, M, P nodes in that order and updates the SQL table:
  • D = A+B+C
  • K = F+E+D
  • M = J+K
  • P = L+M+N+O

I tried the following query, but it costs over 3,000 RUs, which is very expensive.

g.V("A").emit().repeat(__.in('depends')).until(__.inE().count().is(0))

We need some help to optimise the query. Thanks.

UPDATE ===========

OK, we can rebuild the graph in a single partition to reduce the RUs, but we have a scenario where multiple nodes are affected on the way up from A (highlighted in red in the picture below).

Can someone help with a query to get the results in the order A, D, K, O, M, P, please? The logic for the query is that all child nodes should be listed before their parents.

g.addV('ddn').property('pk', 'pk').property(id, 'A').property('formula', 'A').
addV('ddn').property('pk', 'pk').property(id, 'B').property('formula', 'B').
addV('ddn').property('pk', 'pk').property(id, 'C').property('formula', 'C').
addV('ddn').property('pk', 'pk').property(id, 'D').property('formula', 'A+B+C').property('requires', "'A','B','C'").
addV('ddn').property('pk', 'pk').property(id, 'E').property('formula', 'E').
addV('ddn').property('pk', 'pk').property(id, 'F').property('formula', 'F').
addV('ddn').property('pk', 'pk').property(id, 'G').property('formula', 'H+I').property('requires', "'H','I'").
addV('ddn').property('pk', 'pk').property(id, 'H').property('formula', 'H').
addV('ddn').property('pk', 'pk').property(id, 'I').property('formula', 'I').
addV('ddn').property('pk', 'pk').property(id, 'J').property('formula', 'F+G').property('requires', "'F','G'").
addV('ddn').property('pk', 'pk').property(id, 'K').property('formula', 'D+E+F').property('requires', "'D','E','F'").
addV('ddn').property('pk', 'pk').property(id, 'L').property('formula', 'L').
addV('ddn').property('pk', 'pk').property(id, 'M').property('formula', 'J+K').property('requires', "'J','K'").
addV('ddn').property('pk', 'pk').property(id, 'N').property('formula', 'N').
addV('ddn').property('pk', 'pk').property(id, 'O').property('formula', 'A+K').property('requires', "'A','K'").
addV('ddn').property('pk', 'pk').property(id, 'P').property('formula', 'L+M+N+O').property('requires', "'L','M','N','O'").
V('D').addE('needs').to(V('A')).
V('D').addE('needs').to(V('B')).
V('D').addE('needs').to(V('C')).
V('G').addE('needs').to(V('H')).
V('G').addE('needs').to(V('I')).
V('K').addE('needs').to(V('D')).
V('K').addE('needs').to(V('E')).
V('K').addE('needs').to(V('F')).
V('J').addE('needs').to(V('F')).
V('J').addE('needs').to(V('G')).
V('O').addE('needs').to(V('A')).
V('O').addE('needs').to(V('K')).
V('M').addE('needs').to(V('J')).
V('M').addE('needs').to(V('K')).
V('P').addE('needs').to(V('L')).
V('P').addE('needs').to(V('M')).
V('P').addE('needs').to(V('N')).
V('P').addE('needs').to(V('O'))
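
A couple of quick sanity checks against this sample graph (just sketches using the ids defined in the script above):

// vertices that directly depend on A - should return D and O in this sample graph
g.V('A').in('needs')

// the inputs P is built from, showing their formulas
g.V('P').out('needs').values('formula')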

[Image: the rebuilt graph, with the vertices affected by a change to A highlighted in red]

I'm assuming A, D, K, M, P have different partition keys? In that case, not sure there's much you can do, since this will inevitably generate a cross-partition query in Cosmos DB, and those are always expensive. - AlexDrenea

@AlexDrenea A, B, C, D, ..., K, M, P are partition keys, i.e. {pk='A', label='node'}, {pk='B', label='node'}. - Kaf

Yup, I thought so... Your Gremlin query looks good; I'm not sure there's much to do to optimize it, other than revisiting the data model to better suit this query pattern. - AlexDrenea

@AlexDrenea Any suggestions? We are thinking about creating a single hot partition using a common partition key with different labels, i.e. {pk='pk', label='A'}, {pk='pk', label='B'}. What do you think? - Kaf

Hot partitions are going to be really bad for writes, though (time-wise, not necessarily RUs). There's no harm in giving it a try, but keep in mind the 10 GB partition limit. Is there any other logical way of grouping the nodes so that you hit a bit of a middle ground? If your final operation for (A) ends up cross-partition, you could issue two separate calls to the DB, one for each partition, and sum them up on the client? - AlexDrenea

1 Answer


I think the answer boils down to being able to sort the vertices traversed by their path length.

gremlin> g.V("A").
......1>   emit().repeat(__.in('needs')).path().
......2>   group().
......3>     by(tail(local)).
......4>     by(count(local).fold()).
......5>   order(local).
......6>     by(select(values).tail(local)).
......7>   select(keys)
==>[v[A],v[D],v[K],v[M],v[O],v[P]]

I group() by the last element in the path() and transform each path in the group to its length with count(local). That allows me to order() the results by the longest path for each vertex.
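
As a side note, if your C# app only needs the vertex ids back in that order (rather than the full vertices), you could unfold the result at the end. This is an untested sketch built on the query above:

// same ordering as before, but return just the ids, child vertices first
g.V("A").
  emit().repeat(__.in('needs')).path().
  group().
    by(tail(local)).
    by(count(local).fold()).
  order(local).
    by(select(values).tail(local)).
  select(keys).unfold().id()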

Note that I don't think you need until(__.inE().count().is(0)), because you're just traversing to path exhaustion in either case. Also, take care with __.inE().count().is(0), as you end up counting all the edges just to detect a count of zero; most graphs should optimize that pattern away, but it's always best to be explicit in my opinion and write it as until(__.not(__.inE())). That said, you need to be sure of your graph structure when using repeat() - it takes just one edge of bad data to send your traversal into an infinite loop. Consider some kind of upper bound on your repeat() that makes sense for your data, so that the loop will terminate at some point.
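
For example, just a sketch (the depth of 10 is an arbitrary bound I'm assuming exceeds any dependency chain in your graph), times() caps the number of hops:

// emit every vertex on the way up, but stop after at most 10 hops
g.V("A").emit().repeat(__.in('needs')).times(10)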

Here is an alternative that might actually be better, since it doesn't hold all the counts in the Map after the group():

gremlin> g.V("A").
......1>   emit().repeat(__.in('needs')).path().
......2>   group().
......3>     by(tail(local)).
......4>     by(count(local).order(local).tail(local)).
......5>   order(local).
......6>     by(values).
......7>   select(keys)
==>[v[A],v[D],v[K],v[M],v[O],v[P]]