I am using embedded JanusGraph in my Java backend. My code depends on a JanusGraph instance created with graph = JanusGraphFactory.open(conf)

AFAIK this connects directly to Cassandra and Elasticsearch and runs the JanusGraph query processing inside my backend application's JVM. But if I want to scale JanusGraph, I need to run separate JanusGraph servers on a cluster and connect to them as a client from my backend.

According to the remote JanusGraph example on GitHub, this is done by instantiating an EmptyGraph with EmptyGraph graph = EmptyGraph.instance(); which is not an instance of JanusGraph but of org.apache.tinkerpop.gremlin.structure.util.empty.EmptyGraph.

From that example I understand that I can only run Gremlin queries by submitting them to the JanusGraph server, and that I cannot use the management APIs directly; I would have to submit that code to the server as a string.

Finally, I understand that running JanusGraph Server separately is better for scalability, but I would lose direct access to the JanusGraph APIs in my code. Have I misunderstood something? What are the pros and cons of the remote deployment approach, and what would I lose compared with the embedded approach?

Edit:

According to this answer (please correct it if it is wrong):

Pros/Cons of connecting to the remote Gremlin Server

Pros

  • The server has much more control and all the queries are centralized.
  • Since everyone runs traversals/queries via the remote Gremlin Server, all of them are transactionally protected. By default, the remote Gremlin Server runs each traversal/query in a transaction.
  • Central strategy management
  • Central schema management

Cons

  • Manual transaction management is hard.
  • You have to write your code as a Groovy script string and send it to the remote server (Cluster submit) for transactional execution.
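
To make the last point concrete, here is a minimal sketch of what "code as a string" can look like. It only builds the script text and the parameter bindings in plain Java; nothing here talks to a server. It assumes the remote server binds the JanusGraph instance to `graph` and a traversal source to `g` (a common setup, but check your server configuration). The class name `RemoteScripts`, its method names, and the `person`/`name` schema are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds Groovy scripts to be sent to a remote Gremlin Server.
// Assumes the server exposes the bindings "graph" (JanusGraph) and "g" (traversal source).
class RemoteScripts {

    // Script that uses the JanusGraph management API on the server side.
    public static String createVertexLabel(String label) {
        // The label is spliced into the script text, so restrict it to a safe
        // character set instead of trusting arbitrary input.
        if (!label.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("unsafe label: " + label);
        }
        return String.join("\n",
            "mgmt = graph.openManagement()",
            "if (!mgmt.containsVertexLabel('" + label + "')) {",
            "  mgmt.makeVertexLabel('" + label + "').make()",
            "}",
            "mgmt.commit()");
    }

    // Script with explicit commit/rollback; "name" is supplied as a binding,
    // not concatenated into the script.
    public static final String ADD_PERSON_TX = String.join("\n",
        "try {",
        "  g.addV('person').property('name', name).iterate()",
        "  g.tx().commit()",
        "} catch (Exception e) {",
        "  g.tx().rollback()",
        "  throw e",
        "}");

    // Parameter bindings to send alongside ADD_PERSON_TX.
    public static Map<String, Object> personBindings(String name) {
        Map<String, Object> bindings = new LinkedHashMap<>();
        bindings.put("name", name);
        return bindings;
    }

    public static void main(String[] args) {
        System.out.println(createVertexLabel("person"));
        System.out.println(ADD_PERSON_TX);
        System.out.println(personBindings("alice"));
    }
}
```

With the TinkerPop Java driver, strings like these would typically be passed to `Client.submit(script, bindings)` on a client obtained from something like `Cluster.build("localhost").port(8182).create().connect()`; the host and port are placeholders. Passing values as bindings rather than splicing them into the script avoids injection and lets the server cache the compiled script.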
Hi, do you have any example of how to do transaction management with remote JanusGraph? - Serhii Zadorozhnyi

1 Answer

The pros and cons listed above are correct. In addition, here is what I have learned:

With the Gremlin Server approach, the architecture looks, from the user's point of view, like a web server (an additional cost) sitting in front of the storage system. Scaling these Gremlin Servers up and down has to be handled manually based on load; otherwise they become the bottleneck of the entire system.

In embedded mode, you have a storage system (say Cassandra) and your own application, which interacts with it via TinkerPop Gremlin. You don't have to maintain Gremlin Servers; your program (the client) talks directly to the storage backend.

Consider data loading via Apache Spark: once you run the job with more executors, the Gremlin Server must be able to handle the extra load.