1
votes

Coalesce doesn't work as the first step in a traversal or if a traversal leading up to the coalesce step doesn't yield at least one result. Before you dismiss the question, please hear me out.

If I have a vertex with label = 'foo' and id = 'bar' in my graph database and I'd like to add a vertex with label = 'baz' and id = 'caz', the following Gremlin query works beautifully.

g.V('bar').coalesce(__.V('caz'), __.addV('baz').property('id', 'caz'))

If, however, I get rid of the first part of the query, the query fails.

g.coalesce(__.V('caz'), __.addV('baz').property('id', 'caz'))

Similarly, if I rework the query as follows, it also fails.

g.V('caz').coalesce(__.V('caz'), __.addV('baz').property('id', 'caz'))

For coalesce to work, it must have an input set of one or more elements. I understand why such an approach makes sense when the steps within a coalesce step are has and hasLabel for example; however, it makes no sense for V and addV. I'm guessing that the server implementation of coalesce has a check/return for a null or empty input step, which cancels processing on the step.

If this is a bug or improvement request with Gremlin in general, it would be awesome to have this addressed. If it's a Cosmos DB only issue, I'll log a call with Microsoft directly.

In the interim, I'm desperately looking for a solution to the challenge of only creating an element if it doesn't exist. I'm aware of using fold/unfold with coalesce; however, that kills my traversal context making previously defined aliases (using as('xyz')) unusable. Given the complexity of the queries we're writing, we can't afford to lose the context; we also can't afford the compute of folding just to unfold when processing data at scale.

Any advice on the above is gratefully received.

Warm regards, Seb

2 Answers

4
votes

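You can't start a traversal with just any step in the Gremlin language. There are specific start steps that trigger a traversal, and by "trigger" I mean that they place traversers in the pipeline for processing. There are really just a handful of start steps: V(), E(), inject(), addV() and addE(). coalesce() is not one of them, and when it appears mid-traversal it only runs once for each traverser that reaches it, so a preceding step that yields nothing leaves it with nothing to do.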
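
As a rough illustration of why an empty pipeline means coalesce() never fires, here is a toy model in Python. It is not TinkerPop code and not how any server implements the step; the names graph, lookup, add_vertex and coalesce are hypothetical stand-ins for V(), addV() and coalesce().

```python
# Toy model of Gremlin traversers; NOT TinkerPop's implementation.
# All names here (graph, lookup, add_vertex, coalesce) are hypothetical.

graph = {"bar": "foo"}  # vertex id -> label


def lookup(vertex_id):
    """Child traversal like __.V(id): yields the id if the vertex exists."""
    return lambda _traverser: [vertex_id] if vertex_id in graph else []


def add_vertex(label, vertex_id):
    """Child traversal like __.addV(label).property('id', id)."""
    def child(_traverser):
        graph[vertex_id] = label
        return [vertex_id]
    return child


def coalesce(incoming, *children):
    """Run once per incoming traverser; emit the first non-empty child result."""
    out = []
    for traverser in incoming:
        for child in children:
            results = child(traverser)
            if results:
                out.extend(results)
                break
    return out


# Like g.V('caz').coalesce(...): V('caz') yields nothing, so coalesce never runs.
empty_start = lookup("caz")(None)
result_empty = coalesce(empty_start, lookup("caz"), add_vertex("baz", "caz"))
created_after_empty = "caz" in graph

# Like g.V('bar').coalesce(...): one traverser flows in, so the vertex is added.
result_seeded = coalesce(lookup("bar")(None), lookup("caz"), add_vertex("baz", "caz"))
```

In this model the children are only evaluated inside the loop over incoming traversers, which is the behavior described above: no traverser in, no work done, regardless of what the child traversals would have produced.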

"I'm aware of using fold/unfold with coalesce; however, that kills my traversal context making previously defined aliases (using as('xyz')) unusable"

You typically shouldn't rely too heavily on as() if it can be avoided. Many traversals that make heavy use of as() can usually be rewritten in other forms. Since you haven't given more details on that, I can't address it further.

"we also can't afford the compute of folding just to unfold when processing data at scale."

I can't imagine fold() and unfold() carrying a ton of cost. In the worst case it creates a List with a single item in it and in the best case it creates an empty list. You'll probably have tons of other performance optimizations to sort out before something like that would become anything you would focus on for radical improvements.
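
To show what fold() buys you, here is another toy Python model, again a sketch rather than TinkerPop code, with fold, unfold, coalesce and get_or_create as hypothetical names. The point it tries to make is that fold() always emits exactly one traverser (a list that may be empty), so the coalesce() after it always has something to run on.

```python
# Toy model only; fold, unfold, coalesce and get_or_create are hypothetical
# stand-ins for the Gremlin steps of the same name, not real library calls.

def fold(incoming):
    """Collapse all incoming traversers into one traverser holding a list."""
    return [list(incoming)]


def unfold(traverser):
    """Child traversal: emit each item of the list carried by the traverser."""
    return list(traverser)


def coalesce(incoming, *children):
    """Run once per incoming traverser; emit the first non-empty child result."""
    out = []
    for traverser in incoming:
        for child in children:
            results = child(traverser)
            if results:
                out.extend(results)
                break
    return out


def get_or_create(existing_ids, vertex_id):
    """Model of V(id).fold().coalesce(unfold(), addV(...).property('id', id))."""
    found = [v for v in existing_ids if v == vertex_id]  # stands in for V(id)
    create = lambda _t: [vertex_id + " (created)"]       # stands in for addV(...)
    return coalesce(fold(found), unfold, create)


missing = get_or_create([], "caz")        # nothing found, so the add branch runs
present = get_or_create(["caz"], "caz")   # found, so unfold() returns it
```

The list that fold() builds here holds zero or one element, which is the cost being discussed above.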

All that said, I guess that you could do this:

gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.inject(1).coalesce(V().has('id','caz'),addV('baz').property('id','caz'))
==>v[0]
gremlin> g.inject(1).coalesce(V().has('id','caz'),addV('baz').property('id','caz'))
==>v[0]

You start the traversal with inject() and a throwaway value just to get something into the pipeline. I think I prefer the fold() and unfold() method myself, as I believe it's more readable. I would also be sure to validate that the graph I was using actually used an index for that embedded mid-traversal V() inside the coalesce(). I would hope all graphs are smart about such optimizations, but I can't say that with complete certainty. In that sense, fold() and unfold() work better, as they present a more platform-independent way to execute your query.

0
votes

After some digging, I realized that the issue is Gremlin language specific and not server implementation specific (as in, not a Cosmos DB issue). Accordingly, I've resorted to using two flavors of the "add if not exists" pattern.

For context, we use a Gremlin recipe provider pattern, which ensures that common conventions are maintained throughout the product for common tasks. Accordingly, when I have an element (edge or vertex) to create, I pass it to the recipe provider to return the traversal with addE/addV and property semantics generated. This issue stems from generating recipes that support the "add if not exists" pattern.

To solve the issue, I pass a boolean flag to the recipe provider that tells the provider whether to use fold/unfold semantics. That way, if the add recipe occurs at the beginning of the traversal, the app uses fold/unfold semantics; if not at the beginning, no fold/unfold. While it is very much putting lipstick on a pig as a workaround, most of the add recipes our app uses don't occur at the beginnings of traversals.

To provide an example, assuming I have three vertices using label vTest and IDs v1-id, v2-id, and v3-id, the Gremlin query generated by the Gremlin recipe provider will look like this:

g.V('v1-id')
  .has('partitionKey','v1')
  .fold()
  .coalesce(
    __.unfold(),
    __.addV('vTest')
      .property('id','v1-id')
      .property('partitionKey','v1')
  ).coalesce(
    __.V('v2-id')
      .has('partitionKey','v2'),
    __.addV('vTest')
      .property('id','v2-id')
      .property('partitionKey','v2')
  ).coalesce(
    __.V('v3-id')
      .has('partitionKey','v3'),
    __.addV('vTest')
      .property('id','v3-id')
      .property('partitionKey','v3')
  )

Because each part of the query is guaranteed to return one result, coalesce() works throughout. But, as I'm sure you'll agree, lipstick on a pig.

Unfortunately for us, all user registrations in our app will be affected by the fold() / unfold() approach because that process involves creating the first vertices. I certainly hope to see an update to Gremlin in future, either to coalesce or some other step to handle conditionals.