2
votes

I'm working on a proof of concept for using GraphDBs (specifically Titan 0.4.4 on HBase). For that I created a relatively simple graph: I have companies that buy and sell products. A sale is associated with a department of the buyer:

buyer {name} --BUYS {departmentId}--> product {name} <--SELLS-- seller {name}

I filled the graph with 1000 buyers and 1000 sellers as well as 1 million products. Each product is sold by exactly one company and bought by 1 to 5 randomly chosen departments of 1 to 6 companies. So I end up with roughly 1 million vertices and 11.5 million edges.
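The edge count can be sanity-checked with a little arithmetic. This is only a sketch: it assumes the number of buying companies (1 to 6) and the number of departments per buying company (1 to 5) are drawn uniformly and independently, which the description above does not state. The class and method names are made up for illustration.

```java
// Back-of-the-envelope check of the graph size described above.
final class GraphSizeEstimate {
    // Mean of a uniform integer draw from lo..hi inclusive (assumed distribution).
    static double uniformMean(int lo, int hi) {
        return (lo + hi) / 2.0;
    }

    // One SELLS edge per product, plus one BUYS edge per (company, department) pair.
    static long expectedEdges(long products) {
        double buysPerProduct = uniformMean(1, 6) * uniformMean(1, 5); // 3.5 * 3.0
        return Math.round(products * (1 + buysPerProduct));
    }

    // Products plus the buyer and seller company vertices.
    static long vertices(long products, long buyers, long sellers) {
        return products + buyers + sellers;
    }

    public static void main(String[] args) {
        System.out.println(expectedEdges(1_000_000L));
        System.out.println(vertices(1_000_000L, 1000L, 1000L));
    }
}
```

Under those assumptions each product gets 10.5 BUYS edges on average plus one SELLS edge, which is consistent with the 11.5 million edges quoted.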

Now I want to query the graph for the following: return all products whose name contains a given substring P and that were bought by department D of the company with name C. I implemented it as follows:

Get start vertex:

start=graph.query().has("name", C).vertices()

which due to the vertex index I created returns relatively quickly (after warm-up < 1ms). Now to get the respective products I run this:

new GremlinPipeline(start).outE("BUYS").has("departmentId", D )
    .inV().has("name", new LowerCaseContainsPredicate(), P ).toList()
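The LowerCaseContainsPredicate used in the pipeline is not shown in the question. A minimal sketch of what such a predicate might look like follows; the ValuePredicate interface is a local stand-in so the example compiles on its own, and the real class would implement whatever predicate interface the has(key, predicate, value) step expects. Both names below are hypothetical.

```java
import java.util.Locale;

// Local stand-in for the predicate interface the pipeline step expects.
interface ValuePredicate {
    boolean evaluate(Object first, Object second);
}

// Hypothetical reconstruction: true when the stored value contains the
// search term, ignoring case.
final class LowerCaseContainsSketch implements ValuePredicate {
    @Override
    public boolean evaluate(Object first, Object second) {
        if (first == null || second == null) {
            return false; // a missing property or search term never matches
        }
        String stored = first.toString().toLowerCase(Locale.ROOT);
        String term = second.toString().toLowerCase(Locale.ROOT);
        return stored.contains(term);
    }
}
```

A predicate of this kind is evaluated in application code for each candidate vertex, so it is unlikely to benefit from an index; every product reached through the BUYS edges has to be loaded before its name can be checked.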

The response time against an HBase back-end (10 m3.xlarge nodes on AWS EMR) was incredibly long: a single consumer averaged about 1.5s per query, with each query returning around 260 product records (I picked random buyers, departments and product names and iterated 1000x). To take a possibly faulty HBase configuration out of the equation, I ran it against a local BerkeleyDB (4 CPUs, 8 GB VM memory, 16 GB total memory).

This sped up the queries of course, returning in ~160ms for a single-threaded application, which is not really a race car but still acceptable. Adding more parallel requests quickly degraded response times: 10 parallel requests came in at an average of 1.2s.
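For reference, a parallel measurement like the one described could be set up roughly as below. This is a sketch, not the code that produced the numbers above: the query is passed in as a plain Runnable, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelLatencySketch {
    // Runs the query from `threads` workers at once and returns the mean
    // latency per query in milliseconds.
    static double averageMillis(Runnable query, int threads, int queriesPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Each worker times its own queries and reports the summed nanoseconds.
            Callable<Long> worker = () -> {
                long nanos = 0L;
                for (int i = 0; i < queriesPerThread; i++) {
                    long start = System.nanoTime();
                    query.run();
                    nanos += System.nanoTime() - start;
                }
                return nanos;
            };
            List<Future<Long>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(worker));
            }
            long totalNanos = 0L;
            for (Future<Long> future : futures) {
                totalNanos += future.get();
            }
            return totalNanos / 1_000_000.0 / ((long) threads * queriesPerThread);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("benchmark run failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling averageMillis(query, 10, 100) would correspond to the "10 parallel requests" case, with the iteration count per thread chosen arbitrarily here.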

So I ran a comparison with Neo4j 2.1.3 using this Cypher query:

start buyer=node:company(name=C) match buyer-[:BUYS {departmentId:D}]->(product) 
where product.name =~ "(?i).*P.*" return product

which returns much faster (4-6x depending on number of parallel requests, ~ 50ms for single thread, ~270ms for 10 in parallel). Now there is a Blueprints implementation for Neo4j so I tried that as well to find out if it's the DB engine or the way I query that makes the difference.

Turns out that running Gremlin queries against Neo4j is about as slow as running them against Titan; in fact it's even slightly slower.

All tests were run as a Java application in a single VM with embedded graph DBs to avoid any networking impact. I also noticed that the very first query always takes a lot longer, so the application ran a random query first before starting the benchmark. The code is pure Java using the GremlinPipeline Java class; the snippet above is literally the Java code it is running (except for the predicate, which is only instantiated once). Cypher queries were run by using a parameterized constant query string and passing in the respective parameter map.
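The measurement procedure described here (one discarded warm-up query, then an average over many runs) can be sketched as follows. The harness is illustrative only; the query is abstracted as a Supplier so the example does not depend on any graph library, and the names are hypothetical.

```java
import java.util.function.Supplier;

final class QueryBenchmarkSketch {
    // Runs one discarded warm-up call, then returns the mean latency in
    // milliseconds over the given number of timed iterations.
    static <T> double averageMillis(Supplier<T> query, int iterations) {
        query.get(); // warm-up: the first query is known to be much slower
        long totalNanos = 0L;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            query.get();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / iterations;
    }
}
```

In the setup above the supplier would wrap the pipeline's toList() call, with fresh random values for C, D and P chosen on each iteration.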

This is the first time I'm using a graph DB so I'm wondering if I'm doing something fundamentally wrong with the Gremlin query or if Blueprints/Gremlin is just inherently slow.

Are you using binds to set your variables? If not, your Gremlin will need to be recompiled with every random change, which would explain why it is slow. - Pomme.Verte
To follow up on the comment above, how are you issuing your query exactly? Are you doing all this from the Gremlin Console? Are you issuing requests to Rexster (as Titan Server)? - stephen mallette
The queries are run from a Java application using the GremlinPipeline class from the gremlin-java Tinkerpop Maven artifact. It's running in a local install (no Rexster) and there are no bind variables. I'm not using Gremlin Groovy, nor am I pre-compiling the pipeline. I updated the question to give more details. - Volker Kueffel

2 Answers

1
votes

If you are using Gremlin2, you cannot mix Cypher and Gremlin, as Gremlin uses automatic indices and Cypher uses the new "schema indices." In Neo4j2+, automatic indices are deprecated/legacy. As such, in Gremlin3, Gremlin leverages the same indices as Cypher. Thus, if you want to compare performance, do your tests on two Neo4jGraphs: one that has its data populated with Gremlin and one that has its data populated with Cypher. Then you can see the actual speeds, as indices are being used properly.

Regarding Titan/HBase: be sure to turn on caching, or else you will always be going to disk (which can be hot, but not as fast as Titan's cache). Please read more here: http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/

0
votes

A partial answer to the question is that I was missing vertex-centric indices on Titan, as Daniel stated in a comment. In addition to creating the vertex index on the name, I needed to create an index for my BUYS labels:

graph.makeKey("name").dataType(String.class).indexed(Vertex.class).make();

TitanKey departmentIndex=graph.makeKey("departmentId")
                         .dataType(String.class).make();
graph.makeLabel("BUYS").sortKey(departmentIndex).make();

Neo4j with Cypher is still faster, and obviously this has no effect on the Gremlin queries on Neo4j, but that might just be an issue with the Gremlin implementation for Neo4j.
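To illustrate why a sort key on departmentId helps, here is a small in-memory analogy. It is not how Titan stores edges internally; it only contrasts inspecting every BUYS edge of a buyer with reading just the group of edges for one department. All class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class VertexCentricIndexSketch {
    // A BUYS edge reduced to the two fields the query cares about.
    static final class Buys {
        final String departmentId;
        final String productName;

        Buys(String departmentId, String productName) {
            this.departmentId = departmentId;
            this.productName = productName;
        }
    }

    // Without a sort key: every BUYS edge of the buyer is inspected.
    static List<String> scan(List<Buys> edges, String departmentId) {
        List<String> products = new ArrayList<>();
        for (Buys edge : edges) {
            if (edge.departmentId.equals(departmentId)) {
                products.add(edge.productName);
            }
        }
        return products;
    }

    // With a sort key: edges are kept ordered and grouped by departmentId.
    static Map<String, List<String>> groupByDepartment(List<Buys> edges) {
        Map<String, List<String>> byDepartment = new TreeMap<>();
        for (Buys edge : edges) {
            byDepartment.computeIfAbsent(edge.departmentId, k -> new ArrayList<>())
                        .add(edge.productName);
        }
        return byDepartment;
    }

    // Only the edges of the requested department are read.
    static List<String> lookup(Map<String, List<String>> byDepartment, String departmentId) {
        return byDepartment.getOrDefault(departmentId, Collections.emptyList());
    }
}
```

Both paths return the same products; the difference is how many edges have to be read to get there, which matters when a buyer has thousands of BUYS edges spread over a few departments.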