4
votes

I migrated a MySQL database to Neo4j and tested a simple request. I was very surprised to see that an equivalent request was 10 to 100 times longer in neo4j than in MySql. I'm working on Neo4j 2.0.1.

In the original MySql schema I have the three following tables :

  • countries : containing a 'code', a 'continent_id' and a 'selected' boolean,
  • cities : containing a 'country_code', a 'name' and a 'status' boolean,
  • theaters : containing a 'city_id' and a 'public' boolean,

with an indexes on each property. I want to display the theater count by city for a given continent with several conditions. The request is :

SELECT count(*) as nb, c.name 
FROM `cities` c LEFT JOIN theaters t ON c.id = t.city_id 
WHERE c.country_code IN 
  (SELECT code FROM countries WHERE selected is true AND continent_id = 4)
 AND c.status=1 AND t.public = 1 
GROUP BY c.name  ORDER BY nb DESC


The database schema in Neo4j is the following :

(:Continent)-[:Include]->(:Country{selected:bool})-[:Include]->(:City{name:string, status:bool})-[:Include]->(:Theater{public:bool})

There's also an index defined on each property. The Cypher request is :

MATCH (:Continent{code: 4})-[:Include]->(:Country{selected:true})-[:Include]->(city:City{status:true})-[:Include]->(:Theater{public: true})
RETURN city.name, count(*) AS nb ORDER BY nb DESC


There are about 70.000 cities and 140.000 theaters in each database.

On continent with id 4 the MySql request took about 0.02s whereas Neo4j took 0.4s. Moreover, if I introduce in the Cypher request a variable relation length between Country and City (...(:Country{selected:true})-[:Include*..3]->(city:City{status:true})...) as I want to be able to add intermediate levels like Regions, there the request take more than 2 seconds.

I'm aware that in this particular case there is no benefit to use Neo4j instead of MySql, but I would expect to see approximatively equivalent performance between the two technologies, and I want to take advantage of Neo4j geographical hierarchy capabilities.

Am I missing something or is this a limitation of Neo4j?

Thank you for your answers.

Edit : First you'll find database dump files here. The Neo4j server configuration is the one out of box. I work in a Ruby environment, and I use the neography gem. Also I run the Neo4J server separatly since I'm not on JRuby, so it sends the cypher request through the Rest API.

The database contains 244 countries, 69000 cities and 138,000 theaters. For continent_id 4, there are 46,982 cities (37,210 have the status boolean set to true) and 74,420 theaters.

The request returned 2256 rows. At the third run, it took 338 ms. Here is the request output with the profiling infos :

profile MATCH (:Continent{code: 4})-[:Include]->(country:Country{selected:true})-[:Include*..1]->(city:City{status:true})-[:Include]->(theater:Theater{public: true}) RETURN city.name, count(*) AS nb ORDER BY nb DESC;

==> ColumnFilter(symKeys=["city.name", "  INTERNAL_AGGREGATE85ca19f3-9421-4c18-a449-1097e3deede2"], returnItemNames=["city.name", "nb"], _rows=2256, _db_hits=0)
==> Sort(descr=["SortItem(Cached(  INTERNAL_AGGREGATE85ca19f3-9421-4c18-a449-1097e3deede2 of type Integer),false)"], _rows=2256, _db_hits=0)
==>   EagerAggregation(keys=["Cached(city.name of type Any)"], aggregates=["(  INTERNAL_AGGREGATE85ca19f3-9421-4c18-a449-1097e3deede2,CountStar())"], _rows=2256, _db_hits=0)
==>     Extract(symKeys=["city", "  UNNAMED27", "  UNNAMED7", "country", "  UNNAMED113", "theater", "  UNNAMED72"], exprKeys=["city.name"], _rows=2257, _db_hits=2257)
==>       Filter(pred="(hasLabel(theater:Theater(3)) AND Property(theater,public(5)) == true)", _rows=2257, _db_hits=2257)
==>         SimplePatternMatcher(g="(city)-['  UNNAMED113']-(theater)", _rows=2257, _db_hits=4514)
==>           Filter(pred="(((hasLabel(city:City(2)) AND hasLabel(city:City(2))) AND Property(city,status(4)) == true) AND Property(city,status(4)) == true)", _rows=2257, _db_hits=74420)
==>             TraversalMatcher(start={"label": "Continent", "query": "Literal(4)", "identifiers": ["  UNNAMED7"], "property": "code", "producer": "SchemaIndex"}, trail="(  UNNAMED7)-[  UNNAMED27:Include WHERE (((hasLabel(NodeIdentifier():Country(1)) AND hasLabel(NodeIdentifier():Country(1))) AND Property(NodeIdentifier(),selected(3)) == true) AND Property(NodeIdentifier(),selected(3)) == true) AND true]->(country)-[:Include*1..1]->(city)", _rows=37210, _db_hits=37432)
2

2 Answers

5
votes

You're right, I tried it for myself and only got it down to 100ms for the query.

 MATCH (:Continent{code: 4})-[:Include]->
       (country:Country{selected:true})-[:Include]->
       (city:City{status:true})-[:Include]->
       (theater:Theater{public: true}) 
 RETURN city.name, count(*) AS nb 
 ORDER BY nb DESC;

| "Forbach"                       | 1  |
| "Stuttgart"                     | 1  |
| "Mirepoix"                      | 1  |
| "Bonnieux"                      | 1  |
| "Saint Cyprien Plage"           | 1  |
| "Crissay sur Manse"             | 1  |
+--------------------------------------+
2256 rows
**85 ms**

Please note that cypher as of 2.0.x has not yet performance optimized, that work started in Neo4j 2.1 and will continue up until 2.3. There is also more performance work planned in the kernel that will also speed things up.

I implemented the solution in Java too and got it down to 19ms. It is of course not as pretty but that's where we aiming for with cypher too:

class City {
    Node city;
    int count = 1;

    public City(Node city) {
        this.city = city;
    }

    public void inc() { count++; }

    @Override
    public String toString() {
        return String.format("City{city=%s, count=%d}", city.getProperty("name"), count);
    }
}

private List<?> queryJava3() {
    long start = System.currentTimeMillis();
    Node continent = IteratorUtil.single(db.findNodesByLabelAndProperty(CONTINENT, "code", 4));
    Map<Node,City> result = new HashMap<>();
    for (Relationship rel1 : continent.getRelationships(Direction.OUTGOING,Include)) {
        Node country = rel1.getEndNode();
        if (!(country.hasLabel(COUNTRY) && (Boolean) country.getProperty("selected", false))) continue;
        for (Relationship rel2 : country.getRelationships(Direction.OUTGOING, Include)) {
            Node city = rel2.getEndNode();
            if (!(city.hasLabel(CITY) && (Boolean) city.getProperty("status", false))) continue;
            for (Relationship rel3 : city.getRelationships(Direction.OUTGOING, Include)) {
                Node theater = rel3.getEndNode();
                if (!(theater.hasLabel(THEATER) && (Boolean) theater.getProperty("public", false))) continue;
                City city1 = result.get(city);
                if (city1==null) result.put(city,new City(city));
                else city1.inc();
            }
        }
    }
    List<City> list = new ArrayList<>(result.values());
    Collections.sort(list, new Comparator<City>() {
        @Override
        public int compare(City o1, City o2) {
            return Integer.compare(o2.count,o1.count);
        }
    });
    output("java", start, list.iterator());
    return list;
}


java time = 19ms
first = City{city=Val de Meuse, count=1} total-count 22561
1
votes

How did you measure it? Was this the first or a subsequent run?

How many cities / theaters were returned by that query?

Would you be able to run it with http://localhost:7474/webadmin/#/console/ prepend a "profile " to your query and post the resulting query plan?

It might choose the wrong index by default.

Also note that with 2.0.1 Cypher does not yet have the maximum performance. We're currently working on that. So if you want to have ultimate performance you'd have to drop down to the lower level APIs.

Is there any chance to share your database with me, to see where the performance could be.

Just having a single "INCLUDE" relationship-type might make it more expensive than need be.

Can you please also post your neo4j configuration (conf/*) and potentially your graph.db/messages.log ?