I use the Microsoft.Azure.Graphs library to connect to a Cosmos DB instance and query the graph database.

I'm trying to optimize my Gremlin queries so that they select only the properties I actually need. However, I don't know how to choose which properties to select from edges and vertices.

Let's say we start from this query:

gremlin> g.V().hasLabel('user').
   project('user', 'edges', 'relatedVertices')
     .by()
     .by(bothE().fold())
     .by(both().fold())

This will return something along the lines of:

{
    "user": {
        "id": "<userId>",
        "type": "vertex",
        "label": "user",
        "properties": [
            // all vertex properties
        ]
    },
    "edges": [{
        "id": "<edgeId>",
        "type": "edge",
        "label": "<edgeName>",
        "inV": "<relatedVertexId>",
        "inVLabel": "<relatedVertexLabel>",
        "outV": "<relatedVertexId>",
        "outVLabel": "<relatedVertexLabel>",
        "properties": [
            // edge properties, if any
        ]
    }],
    "relatedVertices": [{
        "id": "<vertexId>",
        "type": "vertex",
        "label": "<relatedVertexLabel>",
        "properties": [
            // all related vertex properties
        ]
    }]
}

Now let's say we take only a couple of properties from the root vertex, which we named "user":

gremlin> g.V().hasLabel('user').
   project('id', 'prop1', 'prop2', 'edges', 'relatedVertices')
     .by(id)
     .by('prop1')
     .by('prop2')
     .by(bothE().fold())
     .by(both().fold())

This gets us part of the way and yields something along the lines of:

{
    "id": "<userId>",
    "prop1": "value1",
    "prop2": "value2",
    "edges": [{
        "id": "<edgeId>",
        "type": "edge",
        "label": "<edgeName>",
        "inV": "<relatedVertexId>",
        "inVLabel": "<relatedVertexLabel>",
        "outV": "<relatedVertexId>",
        "outVLabel": "<relatedVertexLabel>",
        "properties": [
            // edge properties, if any
        ]
    }],
    "relatedVertices": [{
        "id": "<vertexId>",
        "type": "vertex",
        "label": "<relatedVertexLabel>",
        "properties": [
            // all related vertex properties
        ]
    }]
}

Is it possible to do something similar for the edges and related vertices? Say, something along the lines of:

gremlin> g.V().hasLabel('user').
   project('id', 'prop1', 'prop2', 'edges', 'relatedVertices')
     .by(id)
     .by('prop1')
     .by('prop2')
     .by(bothE().fold()
         .project('edgeId', 'edgeLabel', 'edgeInV', 'edgeOutV')
              .by(id)
              .by(label)
              .by(inV)
              .by(outV))
     .by(both().fold()
         .project('vertexId', 'someProp1', 'someProp2')
              .by(id)
              .by('someProp1')
              .by('someProp2'))

My aim is to get an output like this:

{
    "id": "<userId>",
    "prop1": "value1",
    "prop2": "value2",
    "edges": [{
        "edgeId": "<edgeId>",
        "edgeLabel": "<edgeName>",
        "edgeInV": "<relatedVertexId>",
        "edgeOutV": "<relatedVertexId>"
    }],
    "relatedVertices": [{
        "vertexId": "<vertexId>",
        "someProp1": "someValue1",
        "someProp2": "someValue2"
    }]
}

1 Answer

You were pretty close:

gremlin> g.V().hasLabel('person').
......1>   project('name','age','edges','relatedVertices').
......2>   by('name').
......3>   by('age').
......4>   by(bothE().
......5>      project('id','inV','outV').
......6>        by(id).
......7>        by(inV().id()).
......8>        by(outV().id()).
......9>      fold()).
.....10>   by(both().
.....11>      project('id','name').
.....12>        by(id).
.....13>        by('name').
.....14>      fold())
==>[name:marko,age:29,edges:[[id:9,inV:3,outV:1],[id:7,inV:2,outV:1],[id:8,inV:4,outV:1]],relatedVertices:[[id:3,name:lop],[id:2,name:vadas],[id:4,name:josh]]]
==>[name:vadas,age:27,edges:[[id:7,inV:2,outV:1]],relatedVertices:[[id:1,name:marko]]]
==>[name:josh,age:32,edges:[[id:10,inV:5,outV:4],[id:11,inV:3,outV:4],[id:8,inV:4,outV:1]],relatedVertices:[[id:5,name:ripple],[id:3,name:lop],[id:1,name:marko]]]
==>[name:peter,age:35,edges:[[id:12,inV:3,outV:6]],relatedVertices:[[id:3,name:lop]]]

Two points you should consider when writing Gremlin:

  1. The output of one step is the input of the next. If you don't clearly see what is coming out of a particular step, the steps that follow may not end up being right. In your example, in the by() for 'edges' you added the project() after the fold(), which was basically saying "Hey, Gremlin, project that List of edges for me." But in the by() modulators for that project() you treated the input not as a List but as individual edges, which likely led to an error. In Java, that error is: "java.util.ArrayList cannot be cast to org.apache.tinkerpop.gremlin.structure.Element". An error like that is a clue that somewhere in your Gremlin you are not properly following the outputs and inputs of your steps.
  2. fold() takes all the elements in the stream of the traversal and converts them to a List. So where you had many objects, you will now have one after the fold(). To process them as a stream again, you would need to unfold() them for steps to operate on them individually. In this case, we just needed to move the fold() to the end of the statement after doing the sub-project() for each edge/vertex. But why do we need fold() at all? The answer is that the traversal passed to the by() modulator is not iterated completely by the step that it modifies (in this case project()). The step only calls next() to get the first element in the stream - this is by design. Therefore, in cases where you want the entire stream of a by() to be processed you must reduce the stream to a single object. You might do that with fold(), but other examples include sum(), count(), mean(), etc.
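
If it helps to see both points outside of Gremlin, here is a rough toy model in plain Python. It is only an analogy: the helpers by, fold and project_edge are hypothetical stand-ins written for this sketch, not part of TinkerPop or Cosmos DB, and the edge data is made up. It assumes a by() that takes just the first result of its sub-traversal, as described above.

```python
# Toy model only: plain Python iterators standing in for a Gremlin stream.
# by, fold and project_edge are hypothetical helpers, not TinkerPop API.

def by(sub_traversal):
    """Mimic a by() modulator: take only the first result of the stream.

    Returning None for an empty stream is a simplification of this sketch.
    """
    return next(iter(sub_traversal), None)

def fold(stream):
    """Mimic fold(): collapse the whole stream into one list object."""
    yield list(stream)

def project_edge(edge):
    """Mimic project('id','inV','outV') applied to a single edge."""
    return {"id": edge["id"], "inV": edge["inV"], "outV": edge["outV"]}

# Made-up edges with an extra property we do not want in the output.
edges = [
    {"id": 9, "inV": 3, "outV": 1, "weight": 0.4},
    {"id": 7, "inV": 2, "outV": 1, "weight": 0.5},
    {"id": 8, "inV": 4, "outV": 1, "weight": 1.0},
]

# No fold(): by() only ever sees the first projected edge.
first_only = by(project_edge(e) for e in edges)

# project() per edge, then fold(): by() receives one list with every edge.
all_edges = by(fold(project_edge(e) for e in edges))

# fold() first, then project(): the projection is handed a list instead of
# an edge and fails, loosely analogous to the ArrayList cast error.
try:
    by(project_edge(folded) for folded in fold(iter(edges)))
    failed = False
except TypeError:
    failed = True
```

In this model first_only holds a single projected edge, all_edges holds all three, and failed ends up True, which matches the reasoning for moving fold() to the end of each by().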