I'm working on a proof of concept for using GraphDBs (specifically Titan 0.4.4 on HBase). For that I created a relatively simple graph: I have companies that buy and sell products. A sale is associated with a department of the buyer:
buyer {name} --BUYS {departmentId}--> product {name} <--SELLS-- seller {name}
I filled the graph with 1000 buyers and 1000 sellers as well as 1 million products. Each product is sold by exactly one company and bought by 1 to 5 randomly chosen departments of 1 to 6 randomly chosen companies, so I end up with roughly 1 million vertices and 11.5 million edges.
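As a sanity check, the edge count follows from the expected values of those uniform random choices (the class and method names here are just for illustration):

```java
public class EdgeEstimate {
    // Expected number of edges for `products` products, given the random
    // ranges used to populate the graph (1-6 companies, 1-5 departments).
    static double expectedEdges(int products) {
        double avgCompanies = (1 + 6) / 2.0;   // 3.5 buying companies per product
        double avgDepartments = (1 + 5) / 2.0; // 3 buying departments per company
        double buys = products * avgCompanies * avgDepartments; // one BUYS edge per department
        double sells = products;               // exactly one SELLS edge per product
        return buys + sells;
    }

    public static void main(String[] args) {
        System.out.printf("~%.1f million edges%n", expectedEdges(1_000_000) / 1e6);
    }
}
```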
Now I want to query the graph for the following: return all products whose name matches a given sub-string S and that are bought by department D of the company named C. I implemented it like this:
Get start vertex:
start = graph.query().has("name", C).vertices()
which, thanks to the vertex index I created, returns quickly (< 1 ms after warm-up). To get the matching products I then run this:
new GremlinPipeline(start).outE("BUYS").has("departmentId", D)
    .inV().has("name", new LowerCaseContainsPredicate(), P).toList()
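For reference, the custom predicate does a case-insensitive sub-string match. In the benchmark it implements the Blueprints `Predicate` interface (`evaluate(Object, Object)`); only the matching logic is shown in this standalone sketch:

```java
// Sketch of the case-insensitive "contains" predicate used in the pipeline.
// In the real code this implements com.tinkerpop.blueprints.Predicate.
public class LowerCaseContainsPredicate {
    public boolean evaluate(Object first, Object second) {
        if (!(first instanceof String) || !(second instanceof String)) {
            return false;
        }
        // does the stored value contain the query sub-string, ignoring case?
        return ((String) first).toLowerCase()
                               .contains(((String) second).toLowerCase());
    }
}
```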
The response time for this against an HBase back-end (10 m3.xlarge nodes on AWS EMR) was incredibly long: a single consumer averaged about 1.5 s per query (I picked random buyers, departments and product names and iterated 1000 times), with each query returning around 260 product records. To rule out a faulty HBase configuration I also ran it against a local BerkeleyDB (4 CPUs, 8 GB VM memory, 16 GB total memory).
This sped up the queries, of course: ~160 ms for a single-threaded application, which is no race car but still acceptable. Adding parallel requests quickly degraded response times, though; 10 parallel requests averaged 1.2 s.
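The kind of harness used to get these numbers can be sketched as follows (a simplified illustration, not the exact benchmark code; the `Runnable` would wrap one random Gremlin query):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelBenchmark {
    // Average per-query latency in milliseconds with `threads` concurrent
    // clients, each running `iterations` queries.
    public static double averageLatencyMillis(Runnable query, int threads, int iterations) {
        query.run(); // warm-up: the very first query pays cache/JIT costs
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong totalNanos = new AtomicLong();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < iterations; i++) {
                    long start = System.nanoTime();
                    query.run();
                    totalNanos.addAndGet(System.nanoTime() - start);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return totalNanos.get() / 1e6 / (threads * (long) iterations);
    }
}
```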
So I ran a comparison with Neo4j 2.1.3 using this Cypher query:
start buyer=node:company(name=C) match buyer-[:BUYS {departmentId:D}]->(product)
where product.name =~ "(?i).*P.*" return product
which returns much faster (4-6x depending on the number of parallel requests: ~50 ms single-threaded, ~270 ms for 10 in parallel). Since there is a Blueprints implementation for Neo4j, I tried that as well to find out whether it's the DB engine or the way I query that makes the difference.
It turns out that running the Gremlin query against Neo4j is about as slow as running it against Titan - in fact it's even slightly slower.
All tests were run as a Java application in a single VM with embedded graph DBs to avoid any networking impact. Since the very first query always takes a lot longer, the application ran a random query before starting the benchmark. The code is pure Java using the GremlinPipeline Java class; the snippet above is literally the Java code being run (except that the predicate is instantiated only once). Cypher queries were run with a constant parameterized query string, passing in the respective parameter map.
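Building such a parameter map might look like the following sketch (the parameter names are assumptions; the regex mirrors the `(?i).*P.*` pattern above, with `Pattern.quote` keeping regex metacharacters in the sub-string literal):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class CypherParams {
    // Builds the parameter map for a parameterized form of the query, e.g.
    //   start buyer=node:company(name={company})
    //   match buyer-[:BUYS {departmentId:{dept}}]->(product)
    //   where product.name =~ {regex} return product
    public static Map<String, Object> forQuery(String company, Object dept, String sub) {
        Map<String, Object> params = new HashMap<>();
        params.put("company", company);
        params.put("dept", dept);
        // case-insensitive "contains": quote the sub-string so regex
        // metacharacters in product names are matched literally
        params.put("regex", "(?i).*" + Pattern.quote(sub) + ".*");
        return params;
    }
}
```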
This is the first time I'm using a graph DB, so I'm wondering whether I'm doing something fundamentally wrong in the Gremlin query or whether Blueprints/Gremlin is just inherently slow.