1 vote

I am currently evaluating Neo4j with respect to inserting large numbers of nodes/relationships into the graph. This is not about initial inserts, which could be done with batch inserts. It is about inserts that are processed frequently at runtime in a Java application that uses Neo4j in embedded mode (currently version 1.8.1, as shipped with spring-data-neo4j 2.2.2.RELEASE).

These inserts usually follow a star schema: one single node (the root node of the imported dataset) has up to 1,000,000 (one million!) connected child nodes. The child nodes normally have relationships to further nodes as well, but those relationships are not covered by this test so far. The overall goal is to import that amount of data in at most five minutes!

To simulate this kind of insert I wrote a small JUnit test that uses the Neo4jTemplate to create the nodes and relationships. Each inserted leaf has an associated key for later processing:

@Test
@Transactional
@Rollback
public void generateUngroupedNode() {
    long numberOfLeafs = 1000000;
    Assert.assertTrue(this.template.transactionIsRunning());
    Node root = this.template.createNode(map(NAME, UNGROUPED));
    String groupingKey = null;
    for (long index = 0; index < numberOfLeafs; index++) {
        // Just a sample division of leafs into possible groups:
        // creates grouping keys so that each group contains 2 leafs
        if (index % 2 == 0) {
            groupingKey = UUID.randomUUID().toString();
        }
        Node leaf = this.template.createNode(map(GROUPING_KEY, groupingKey, NAME, LEAF));
        this.template.createRelationshipBetween(root, leaf, Relationships.LEAF.name(), map());
    }
}

For this test I use the gcr cache to avoid Garbage Collector issues:

cache_type=gcr
node_cache_array_fraction=7
relationship_cache_array_fraction=5
node_cache_size=400M
relationship_cache_size=200M
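
These settings live in the Neo4j configuration. Just to illustrate how they could reach an embedded instance, here is a minimal sketch that applies them programmatically via GraphDatabaseFactory (as also used in the answer below); in my application the database is actually wired up through Spring Data Neo4j, and the store path target/test.db is only a placeholder:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class EmbeddedWithGcrCache {
    public static void main(String[] args) {
        // Sketch: pass the same cache settings from neo4j.properties
        // directly to an embedded database built via the factory.
        GraphDatabaseService gdb = new GraphDatabaseFactory()
                .newEmbeddedDatabaseBuilder("target/test.db") // placeholder store path
                .setConfig("cache_type", "gcr")
                .setConfig("node_cache_array_fraction", "7")
                .setConfig("relationship_cache_array_fraction", "5")
                .setConfig("node_cache_size", "400M")
                .setConfig("relationship_cache_size", "200M")
                .newGraphDatabase();

        // ... run the import ...

        gdb.shutdown();
    }
}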

Additionally I set my MAVEN_OPTS to:

export MAVEN_OPTS="-Xmx4096m -Xms2046m -XX:PermSize=256m -XX:MaxPermSize=512m -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit"

Nevertheless, when running that test I always get a Java heap space error:

java.lang.OutOfMemoryError: Java heap space
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2427)
    at java.lang.Class.getMethod0(Class.java:2670)
    at java.lang.Class.getMethod(Class.java:1603)
    at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:896)
    at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:862)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:859)
    at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:423)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:685)
    at org.springframework.transaction.support.TransactionTemplate.<init>(TransactionTemplate.java:67)
    at org.springframework.data.neo4j.support.Neo4jTemplate.exec(Neo4jTemplate.java:403)
    at org.springframework.data.neo4j.support.Neo4jTemplate.createRelationshipBetween(Neo4jTemplate.java:367)

I did some tests with smaller amounts of data, which led to the following results for one node connected to:

  • 50000 leafs: 3035ms
  • 100000 leafs: 4290ms
  • 200000 leafs: 10268ms
  • 400000 leafs: 20913ms
  • 800000 leafs: Java heap space

Here is a screenshot of the system monitor during those operations:

System Monitor

To get a better impression of what exactly is running and what is stored in the heap, I ran JProfiler during the last test (800000 leafs). Here are some screenshots:

Heap usage:

HEAP

CPU usage:

CPU

The big question for me is: is Neo4j not designed for this kind of data volume? Or are there other ways to achieve these kinds of inserts (and later operations)? On the official Neo4j website and in various screencasts I found the information that Neo4j is able to run with billions of nodes and relationships (e.g. http://docs.neo4j.org/chunked/stable/capabilities-capacity.html). I didn't find any functionality like the flush() and clear() methods that are available e.g. in JPA to keep the heap clean manually.

It would be great to be able to use Neo4j with these amounts of data. Even with only 200000 leafs stored in the graph I noticed a performance improvement of a factor of 10 or more compared to an embedded classic RDBMS. I don't want to give up the nice way of modeling and querying data that Neo4j provides.

  • Don't use Spring Data Neo4j for highly performant inserts; it is not made for that. Use the Neo4j core API with large enough transactions (30-50k elements per tx). – Michael Hunger
  • template.createRelationshipBetween checks for duplicate relationships, so it is destined to be O(n) in the existing nodes. Also make sure to batch your tx. – Michael Hunger
  • Btw. what causes your heap space to explode is that you keep all transaction state of 1M nodes + 1M rels in your heap instead of partitioning it into suitable chunks. See also the explanation about transaction sizes here: jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series – Michael Hunger

2 Answers

3 votes

Using just the Neo4j core API, it takes between 18 and 26 seconds to create the children, without any optimizations, on my MacBook Air:

Output: import of 1000000 children took 26 seconds.

import java.io.File;
import java.io.IOException;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.kernel.impl.util.FileUtils;

public class CreateManyRelationships {

    public static final int COUNT = 1000 * 1000;
    public static final DynamicRelationshipType CHILD = DynamicRelationshipType.withName("CHILD");
    public static final File DIRECTORY = new File("target/test.db");

    public static void main(String[] args) throws IOException {
        FileUtils.deleteRecursively(DIRECTORY);
        GraphDatabaseService gdb = new GraphDatabaseFactory().newEmbeddedDatabase(DIRECTORY.getAbsolutePath());
        long time = System.currentTimeMillis();
        Transaction tx = gdb.beginTx();
        Node root = gdb.createNode();
        for (int i = 1; i <= COUNT; i++) {
            Node child = gdb.createNode();
            root.createRelationshipTo(child, CHILD);
            // commit every 50k elements to keep the transaction state small
            if (i % 50000 == 0) {
                tx.success();
                tx.finish();
                tx = gdb.beginTx();
            }
        }
        tx.success();
        tx.finish();
        time = System.currentTimeMillis() - time;
        System.out.println("import of " + COUNT + " children took " + time / 1000 + " seconds.");
        gdb.shutdown();
    }
}

And the Spring Data Neo4j docs state that it is not made for this type of task.

1 vote

If you are connecting 800K child nodes to one node, you are effectively creating a dense node, a.k.a. a key-value-like structure. Neo4j is currently not optimized to handle these structures effectively, as all connected relationships are loaded into memory upon traversal of a node. This will be addressed by Neo4j 2.1 with configurable optimizations, so that only parts of the relationships are loaded when touching these structures.

For the time being, I would recommend either putting these structures into an index instead and doing a lookup for the connected nodes, or balancing the dense structure along one value (e.g. building a subtree with, say, 100 subcategories along one of the properties on the relationships, such as time; see http://docs.neo4j.org/chunked/snapshot/cypher-cookbook-path-tree.html for instance). A rough sketch of the balancing idea follows below.
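
Here is a minimal sketch of the second option, spreading the leafs over intermediate bucket nodes instead of attaching them all to the root directly; the relationship names BUCKET and LEAF, the bucket count of 100 and the store path are just assumptions for the example, not anything prescribed by Neo4j:

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class BalancedInsertSketch {

    static final RelationshipType BUCKET = DynamicRelationshipType.withName("BUCKET");
    static final RelationshipType LEAF = DynamicRelationshipType.withName("LEAF");
    static final int BUCKET_COUNT = 100;       // assumed fan-out of the intermediate layer
    static final int LEAF_COUNT = 1000 * 1000;

    public static void main(String[] args) {
        GraphDatabaseService gdb = new GraphDatabaseFactory().newEmbeddedDatabase("target/test.db");
        Transaction tx = gdb.beginTx();

        // One intermediate bucket layer: root -> bucket -> leaf,
        // so no single node ends up with a million relationships.
        Node root = gdb.createNode();
        Node[] buckets = new Node[BUCKET_COUNT];
        for (int b = 0; b < BUCKET_COUNT; b++) {
            buckets[b] = gdb.createNode();
            root.createRelationshipTo(buckets[b], BUCKET);
        }

        for (int i = 0; i < LEAF_COUNT; i++) {
            Node leaf = gdb.createNode();
            // distribute leafs over the buckets, e.g. by hashing the grouping key;
            // here simply round-robin by index
            buckets[i % BUCKET_COUNT].createRelationshipTo(leaf, LEAF);
            if ((i + 1) % 50000 == 0) {        // keep transactions small, as in the other answer
                tx.success();
                tx.finish();
                tx = gdb.beginTx();
            }
        }
        tx.success();
        tx.finish();
        gdb.shutdown();
    }
}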

Would that help?