
I work on a project that requires a lot of data to be piped into Neo4j via py2neo rapidly, and I've hit a snag attempting to use Python's multiprocessing library to speed up the process.

In general terms, what we're doing is twofold, with both parts running concurrently:

Part 1:

  • Create a relationship between a pair of nodes (A->B)
  • Create another node, C, which has one relationship to each of the previous pair (C->A, C->B)
  • Push all three nodes and their relationships into Neo4j
  • Push the ID of node C to a Kafka topic (a rough sketch of this flow follows)
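
In code, Part 1 looks roughly like the sketch below. This is a minimal illustration assuming py2neo v4+ and kafka-python; the connection details, labels, relationship types, and the topic name ("node-c-ids") are all placeholder assumptions, not names from the project:

    from py2neo import Graph, Node, Relationship
    from kafka import KafkaProducer

    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def create_triple(props_a, props_b, props_c):
        a = Node("Item", **props_a)
        b = Node("Item", **props_b)
        c = Node("Meta", **props_c)
        tx = graph.begin()
        # Creating the relationships also creates their unbound endpoint nodes.
        tx.create(Relationship(a, "RELATES_TO", b))  # A->B
        tx.create(Relationship(c, "POINTS_AT", a))   # C->A
        tx.create(Relationship(c, "POINTS_AT", b))   # C->B
        tx.commit()
        # Once committed, C is bound and exposes its internal Neo4j ID.
        producer.send("node-c-ids", str(c.identity).encode("utf-8"))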

Part 2:

  • Create N worker "threads" via Python's multiprocessing (which actually spawns processes, not threads), each of which opens its own Kafka consumer
  • The consumers read the topic and execute code that may update relationships/nodes or create additional ones (sketched below)
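
And Part 2, again as a minimal sketch assuming kafka-python; the topic, group id, and N = 4 are placeholders, and handle_update is a hypothetical stand-in for the real update/create logic:

    from multiprocessing import Process
    from kafka import KafkaConsumer

    def handle_update(node_id):
        pass  # hypothetical: the code that updates/creates nodes and relationships

    def worker():
        # Each process opens its own consumer; sharing a group id lets Kafka
        # balance the topic's partitions across the N processes.
        consumer = KafkaConsumer(
            "node-c-ids",
            bootstrap_servers="localhost:9092",
            group_id="graph-updaters",
        )
        for message in consumer:
            handle_update(int(message.value.decode("utf-8")))

    if __name__ == "__main__":
        workers = [Process(target=worker) for _ in range(4)]  # N = 4 here
        for p in workers:
            p.start()
        for p in workers:
            p.join()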

The issue is arising in Part 2, specifically when updating or creating relationships or nodes: py2neo throws a py2neo.database.TransientError, which according to the documentation means:

The database cannot service the request right now, retrying later might yield a successful outcome.

I've tried adding a safety check that retries the transaction X times, then re-pushes the message to the Kafka topic for another process to attempt after X failed tries (roughly the pattern sketched below), but the end results are still incorrect.
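
The safety check is shaped roughly like this (a sketch only; it imports the exception under the name given above, and MAX_ATTEMPTS, the backoff, and producer -- the KafkaProducer from the Part 1 sketch -- are placeholder assumptions):

    import time
    from py2neo.database import TransientError

    MAX_ATTEMPTS = 5

    def apply_with_retry(run_update, payload):
        for attempt in range(MAX_ATTEMPTS):
            try:
                run_update(payload)  # the update/create work for one message
                return True
            except TransientError:
                time.sleep(0.1 * 2 ** attempt)  # back off before retrying
        # Give up and re-push so another consumer picks up the work later.
        producer.send("node-c-ids", payload)
        return False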

I'm woefully ignorant when it comes to multithreading/multiprocessing, so I feel as though my initial research efforts have been severely off-base, and I've wasted a lot of cycles trying to remedy this -- any direction or insight is greatly appreciated.

Can provide additional info if necessary -- not sure what I might be missing.

Thanks


1 Answer


I think the issue is that you are trying to write to nodes/relationships that are locked by other transactions (Neo4j takes write locks on nodes and relationships while a write transaction touches them). This is independent of py2neo, and you would run into similar issues when doing parallel writes with any other library/framework. See here: https://neo4j.com/docs/java-reference/4.0/transaction-management/locking/

The py2neo error and docs do not help much in this case.

Retrying X times might work for small workloads where you can expect the write locks to be released quickly. However, if you are constantly writing to the same parts of the graph, the behaviour becomes unpredictable: some writes might get stuck in the retry loop forever.

Possible Solutions

  • split up the read/write tasks and have only one process writing to Neo4j (see the sketch after this list)
  • pre-process the data so that parallel writes to locked nodes are avoided
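
Here is a minimal sketch of the single-writer idea from the first bullet, assuming kafka-python and py2neo; the topic, group id, Cypher query, and process count are placeholder assumptions. The consumers only read and prepare changes; one dedicated process applies all writes, so no two transactions ever contend for the same locks:

    from multiprocessing import Process, Queue
    from kafka import KafkaConsumer
    from py2neo import Graph

    def consumer_proc(write_queue):
        consumer = KafkaConsumer("node-c-ids",
                                 bootstrap_servers="localhost:9092",
                                 group_id="graph-updaters")
        for message in consumer:
            # Do any CPU-bound preparation here, then hand off the write.
            write_queue.put(int(message.value.decode("utf-8")))

    def writer_proc(write_queue):
        graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
        while True:
            node_id = write_queue.get()
            # Hypothetical update; the point is that only this process writes.
            graph.run("MATCH (c) WHERE id(c) = $id SET c.processed = true",
                      id=node_id)

    if __name__ == "__main__":
        q = Queue()
        procs = [Process(target=consumer_proc, args=(q,)) for _ in range(4)]
        procs.append(Process(target=writer_proc, args=(q,)))
        for p in procs:
            p.start()
        for p in procs:
            p.join()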