OK, first off, I'm using Neo4jClient for this and I've added an INDEX
to the DB using:
CREATE INDEX ON :MyClass(Id)
This is important for the way this works, as it makes inserting the data a lot quicker.
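For reference, the statement above is the Neo4j 3.x form. As far as I'm aware, newer servers (4.x onwards) use a different index syntax and 5.x no longer accepts the old one, so treat the following as a sketch of the equivalent rather than something tested against this setup:

```cypher
// Same index on the Id property of MyClass nodes, written in the newer syntax.
// Assumption: the target server is Neo4j 4.x or later.
CREATE INDEX FOR (n:MyClass) ON (n.Id)
```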
I have a class:
    public class MyClass
    {
        public int Id { get; set; }
        public string AValue { get; set; }
        // Ids of the other MyClass instances this one links to.
        public ICollection<int> LinkToIds { get; set; } = new List<int>();
    }
It has an Id, which I'll be keying off, and a string property - just because. The LinkToIds property is a collection of the Ids that this instance is linked to.
To generate my MyClass instances, I'm using this method, which builds them randomly:
    private static ICollection<MyClass> GenerateMyClass(int number = 50000)
    {
        var output = new List<MyClass>();
        Random r = new Random((int) DateTime.Now.Ticks);
        for (int i = 0; i < number; i++)
        {
            var mc = new MyClass { Id = i, AValue = $"Value_{i}" };
            // Next's upper bound is exclusive, so this gives 1 to 9 link attempts.
            var numberOfLinks = r.Next(1, 10);
            for (int j = 0; j < numberOfLinks; j++)
            {
                // Pick any Id in [0, number); skip duplicates and self-links.
                var link = r.Next(0, number);
                if (!mc.LinkToIds.Contains(link) && link != mc.Id)
                    mc.LinkToIds.Add(link);
            }
            output.Add(mc);
        }
        return output;
    }
Then I use another method to split this into smaller 'batches':
    private static ICollection<ICollection<MyClass>> GetBatches(ICollection<MyClass> toBatch, int sizeOfBatch)
    {
        var output = new List<ICollection<MyClass>>();
        if (toBatch.Count == 0) return output; // nothing to batch; also avoids dividing by zero
        if (sizeOfBatch > toBatch.Count) sizeOfBatch = toBatch.Count;
        // Round up so a final, partially filled batch is not dropped.
        var numBatches = (toBatch.Count + sizeOfBatch - 1) / sizeOfBatch;
        for (int i = 0; i < numBatches; i++)
        {
            output.Add(toBatch.Skip(i * sizeOfBatch).Take(sizeOfBatch).ToList());
        }
        return output;
    }
Then to actually add into the DB:
    void Main()
    {
        var gc = new GraphClient(new Uri("http://localhost:7474/db/data"), "neo4j", "neo");
        gc.Connect();
        var batches = GetBatches(GenerateMyClass(), 5000);
        var now = DateTime.Now;
        foreach (var batch in batches)
        {
            DateTime bstart = DateTime.Now;
            // One query per batch: each item in the batch becomes a row called 'node'.
            var query = gc.Cypher
                .Unwind(batch, "node")
                .Merge($"(n:{nameof(MyClass)} {{Id: node.Id}})")
                .Set("n = node")
                .With("n, node")
                .Unwind("node.LinkToIds", "linkTo")
                .Merge($"(n1:{nameof(MyClass)} {{Id: linkTo}})")
                .With("n, n1")
                .Merge("(n)-[:LINKED_TO]->(n1)");
            query.ExecuteWithoutResults();
            Console.WriteLine($"Batch took: {(DateTime.Now - bstart).TotalMilliseconds} ms");
        }
        Console.WriteLine($"Total took: {(DateTime.Now - now).TotalMilliseconds} ms");
    }
On my aging machine (5-6 years old now) it takes about 20s to insert 50,000 nodes and roughly 500,000 relationships.
Let's break down that important call to Neo4j above. The key thing is, as you rightly suggest, UNWIND - here I UNWIND a batch and give each 'row' in that collection the identifier node. I can then access its properties (node.Id) and use them to MERGE a node. In the first UNWIND, I always SET the newly created node (n) to be node, so all the properties (in this case just AValue) are set.
So up to the first With we have a new node created with a MyClass label and all its properties set. Note that this includes the LinkToIds array, which, if you were a tidy person, you might want to remove. I'll leave that to you.
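For anyone who does want to tidy up, a minimal sketch of that cleanup is below. It assumes it is run once, as a separate statement, after every batch has been loaded; that timing is a suggestion, not part of the original approach.

```cypher
// Drop the LinkToIds array from every MyClass node once the
// LINKED_TO relationships exist, since the relationships now carry that information.
MATCH (n:MyClass)
REMOVE n.LinkToIds
```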
In the second UNWIND we take advantage of the fact that the LinkToIds property is an array, and use it to create a 'placeholder' node that will be filled in later; then we create a relationship between n and the n1 placeholder. NB - if we've already created a node with the same Id as n1, we'll use that node, and when we reach that Id during the first UNWIND, we'll set all the properties of the placeholder.
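It may help to see the whole thing as plain Cypher. The following is roughly what the fluent chain should send for each batch; the parameter name $batch is a placeholder for illustration, as Neo4jClient generates its own parameter names, and older servers may show the {param} form instead of $param.

```cypher
// Approximate Cypher for one batch (parameter name is hypothetical).
UNWIND $batch AS node
MERGE (n:MyClass {Id: node.Id})   // find or create the node for this row
SET n = node                      // replace its properties with the row's values
WITH n, node
UNWIND node.LinkToIds AS linkTo
MERGE (n1:MyClass {Id: linkTo})   // find the target, or create a placeholder for it
WITH n, n1
MERGE (n)-[:LINKED_TO]->(n1)      // add the relationship only if it is missing
```

Because both node lookups are MERGEs on Id, the index on :MyClass(Id) is what keeps each batch fast.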
It's not the easiest thing to explain, but the best things to look at are MERGE and UNWIND in the Neo4j documentation.
Neo4j.Driver or can you use Neo4jClient? – Charlotte Skardon