A little context: I'm experimenting with Neo4j (as a newbie, but with experience in other database technologies) for possible use as a master data management system within our identity intelligence business. In particular I'm looking at building up a graph of places and identity attributes (e.g. email addresses, telephone numbers, electoral roll data, etc.), with relationships between these nodes that express something meaningful, for example where an email address has been used or where a telephone number is registered.
Desired system properties: I would like this system to have some specific properties that are valuable to us:
- Fast ingestion of information from a significant number of providers (100+); this precludes lengthy (multi-hour) ETL processes, though short ones are fine.
- Online at all times; this precludes use of the batch importer. We are most likely to use a fault-tolerant cluster, and sharding would be good :)
- Capacity to eventually ingest ~30G (billion) records per year (~1,000/second) and retain them, plus creation and retention of ~100G relationships per year. Right now we are ingesting ~1/10 of this load.
Where I'm stuck: I have been experimenting with a single node in Azure (32GB RAM, 4 cores, non-local disk) running Debian 8 and Neo4j 3.1.1. This happily ingests and relates back together the UK postal address file (PAF), around 29M records, in a few tens of minutes using either LOAD CSV or home-brew Java over Bolt. I have also ingested, but not yet related, a test set of email address data, around 20M records, and now need to build relationships between the two data sets based on matching postcodes, building numbers, and possibly other fields. This is where things get much slower when using Cypher; here's the fastest query I have been able to create so far:
UNWIND {list} AS i
MATCH(e:DDSEMAIL) WHERE ID(e) = i WITH e
MATCH(s:SUBBNAME) USING INDEX s:SUBBNAME(SBNA)
WHERE upper(e.Building) = s.SBNA WITH e,s
MATCH(m:MAINFILE)
WHERE trim(split(e.Postcode,' ')[0]) = m.OUTC AND
trim(split(e.Postcode,' ')[1]) = m.INCO AND
right('0000'+e.HouseNo,4) = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
CREATE (e)-[r:USED_AT]->(m)
RETURN COUNT(r);
Indexes are:
ON :DDSEMAIL(HouseNo) ONLINE
ON :DDSEMAIL(Postcode) ONLINE
ON :DDSEMAIL(Building) ONLINE
ON :MAINFILE(OUTC) ONLINE
ON :MAINFILE(INCO) ONLINE
ON :MAINFILE(BNUM) ONLINE
ON :SUBBNAME(SBNA) ONLINE
Please note that the {list} parameter is supplied over Bolt from a Java client that has already enumerated all ~20M DDSEMAIL node IDs and is batching them into transactions (typically 1,000 IDs at a time).
This takes 100-200 ms per ID; over a test run of 157,000 IDs it took 7.3 hours, indicating a full execution time of ~760 hours, or more than a month. The underlying machine appears CPU-bound (no significant I/O wait time).
Looking at the EXPLAIN output for this query, there are no full scans; it's all schema index matching (once I had included the explicit index hint), so I'm not sure where to look for more speed.
(edited to add this PROFILE output):
This shows that the match on both parts of the postcode is filtering a lot of rows (56k); it may be better to re-order these fields so that the filter receives a smaller input. (end of edit)
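For example, something along these lines (a sketch only, not yet profiled, using the same labels, properties and indexes as above) would drive the MAINFILE match from the postcode fields first and leave the building-name comparison until later:
UNWIND {list} AS i
MATCH (e:DDSEMAIL) WHERE ID(e) = i
WITH e,
     trim(split(e.Postcode,' ')[0]) AS outc,
     trim(split(e.Postcode,' ')[1]) AS inco,
     right('0000'+e.HouseNo,4) AS bnum
MATCH (m:MAINFILE) USING INDEX m:MAINFILE(OUTC)
WHERE m.OUTC = outc AND m.INCO = inco AND m.BNUM = bnum
MATCH (m)-[:IS_SUBBNAME]->(s:SUBBNAME)
WHERE s.SBNA = upper(e.Building)
CREATE (e)-[r:USED_AT]->(m)
RETURN COUNT(r);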
As a (very unfair) comparison, I pushed both sets of data from CSV files into a custom Bloom filter written in C#/.NET, which performs similar field reformatting to the above, concatenates the fields to generate textual keys, and matches those keys together. This matched all 20M email records against all 29M PAF records in under 5 minutes on a single core of my laptop, and was largely I/O bound.
Right now I'm considering using an external application or a user-defined procedure to perform the record matching, and only creating the relationships with Cypher, but it feels wrong to bypass a well-written query engine that ought to be able to do this much, much faster than it currently is.
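If I went that route, the Cypher side would reduce to something like this (just a sketch; the {pairs} parameter name and its email/main keys are placeholders for whatever node ID pairs the external matcher emits):
UNWIND {pairs} AS p
// p is a map like {email: <DDSEMAIL node id>, main: <MAINFILE node id>}
MATCH (e:DDSEMAIL) WHERE ID(e) = p.email
MATCH (m:MAINFILE) WHERE ID(m) = p.main
CREATE (e)-[:USED_AT]->(m);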
What should I be looking at to improve performance please?