arangodb: node merge history: best approach

Question

We are building a contact management application. Each contact is a node. If 2 or more contacts are discovered to be duplicates we want to provide the ability to merge them into a single node. Additionally, we want to maintain the pre-merge node states, so that we can undo the merge if required (*).

We propose to model this by creating a new node, linking the old nodes to it with a "merged_into" edge, and setting the status property of the old nodes to "removed".

Now we have two options:

1. We copy all the existing edges from the two merged nodes to the new node.
2. We don't.

Option 2 gives a simpler data structure; however, it makes all our queries much more complex, because we have to travel back through potentially multiple levels of merged nodes to fetch all the edges.

Option 1 would keep the queries the same, but will introduce a lot of extra edges.
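To make the cost of Option 1 concrete, here is a minimal in-memory sketch in Python. It is not ArangoDB code: the dict-based graph layout, the `merge` and `unmerge` helpers, and the `copied_for` marker on copied edges are all assumptions made for illustration. The marker is one possible way to find and delete the copies again when a merge is undone.

```python
def merge(graph, old_ids, new_id):
    """Option 1: create new_id and copy the old nodes' edges onto it."""
    old_ids = set(old_ids)
    graph["nodes"][new_id] = {"status": "active"}
    copies = []
    for edge in graph["edges"]:
        if edge["label"] == "merged_into":
            continue  # merge history itself is never copied forward
        src = new_id if edge["from"] in old_ids else edge["from"]
        dst = new_id if edge["to"] in old_ids else edge["to"]
        if (src, dst) == (edge["from"], edge["to"]) or src == dst:
            continue  # untouched by this merge, or would become a self-loop
        # Hypothetical marker so the copy can be found again on unmerge.
        copies.append({"from": src, "to": dst, "label": edge["label"],
                       "copied_for": new_id})
    graph["edges"].extend(copies)
    for old in sorted(old_ids):
        graph["nodes"][old]["status"] = "removed"
        graph["edges"].append({"from": old, "to": new_id, "label": "merged_into"})
    return len(copies)


def unmerge(graph, new_id):
    """Undo a merge, refusing if new_id has gained edges of its own."""
    def is_history(e):
        return e["label"] == "merged_into" and e["to"] == new_id

    def is_copy(e):
        return e.get("copied_for") == new_id

    for e in graph["edges"]:
        touches = new_id in (e["from"], e["to"])
        if touches and not (is_history(e) or is_copy(e)):
            raise ValueError("new edges exist on the merged node")
    restored = [e["from"] for e in graph["edges"] if is_history(e)]
    graph["edges"] = [e for e in graph["edges"]
                      if not (is_history(e) or is_copy(e))]
    for old in restored:
        graph["nodes"][old]["status"] = "active"
    del graph["nodes"][new_id]
    return restored


graph = {
    "nodes": {name: {"status": "active"} for name in ("a", "b", "yoga", "pilates")},
    "edges": [
        {"from": "a", "to": "yoga", "label": "attended_class"},
        {"from": "b", "to": "pilates", "label": "attended_class"},
    ],
}
copied = merge(graph, ["a", "b"], "ab")
```

In this toy graph the merge turns two edges into six: the two originals, two copies, and two "merged_into" edges. Queries starting from the merged node need no changes, which is the trade-off described above.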

We are also considering a third option: creating a copy of the full database with all the merged nodes collapsed, i.e. just a view of the current contacts. This would need to be kept in sync with the main database.

Would appreciate any advice/suggestions on the best way to handle this.

I'd also like to suggest a new "collapse" query feature, which would enable Option 2 to work more easily, something like this:

select out("attended_class") collapse("merged_into") from 10#12

This would collapse the specified edges until there are no further "merged_into" edges to follow, and thus retrieve all the edges attached to the previous (pre-merged) nodes.
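The intended semantics of such a "collapse" can be sketched in plain Python. This is only an illustration of the idea, not an existing query feature: the `collapse` and `out` helpers, the tuple-based edge lists, and the sample node names are all assumptions. It also assumes that "merged_into" edges point from the old node to the new one, so the history is walked in reverse from the current node.

```python
from collections import deque


def collapse(start, merged_into):
    """Return start plus every node merged into it, directly or transitively."""
    # Invert the edge list so we can walk from a merged node back to its sources.
    sources = {}
    for old, new in merged_into:
        sources.setdefault(new, []).append(old)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for old in sources.get(node, []):
            if old not in seen:
                seen.add(old)
                queue.append(old)
    return seen


def out(start, label, edges, merged_into):
    """Targets of `label` edges leaving start or any of its pre-merge nodes."""
    members = collapse(start, merged_into)
    return {dst for src, lbl, dst in edges if lbl == label and src in members}


# Two levels of merging: a and b became ab, then ab and c became abc.
merged_into = [("a", "ab"), ("b", "ab"), ("ab", "abc"), ("c", "abc")]
edges = [
    ("a", "attended_class", "yoga"),
    ("b", "attended_class", "pilates"),
    ("c", "attended_class", "yoga"),
    ("abc", "attended_class", "tai_chi"),
]
classes = out("abc", "attended_class", edges, merged_into)
```

Every query under Option 2 would need an equivalent of the `collapse` step before it looks at the real edges, which is where the extra query complexity comes from.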

(*) To keep things simple, we won't allow the unmerge operation after any edges have been defined on the new node.

Kind Regards

Swami Kevala

mchacki · Accepted Answer · 2015-08-04T15:01:25

I think this depends on how often you expect merges to happen. If they are rare, go with Option 1 and run a cron job that deletes leftovers from time to time.

If they are frequent, you should go with Option 2, because then merging/unmerging is much faster. You should still use a cron job that "cleans" your data and moves edges over to the merged nodes (as soon as it is clear that they will not be unmerged).

BTW: If the attributes of the nodes are identical (except for their _key) and do not have to be merged as well, you could probably get away with a simple trick:

Whenever node A should be merged with node B, add an edge in a collection merged connecting A and B, and mark B as "removed". Then modify your queries so that they first check whether the start node has edges in merged and, if so, add the connected nodes to the set of start vertices. If you want to undo the merge, simply delete this edge. Any edges added after the merge can be attached to either of the two nodes. This makes it possible to unmerge nodes even after edges have been added to them.
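A rough Python sketch of this trick, using in-memory sets instead of real collections. The helper names `start_vertices` and `attended`, the tuple layout, and the sample data are assumptions for illustration only; in ArangoDB the same idea would be expressed in the query itself. The merged edges are treated as undirected here, since the description only says that the edge connects A and B.

```python
def start_vertices(node, merged):
    """node plus everything linked to it through merged edges, in either direction."""
    group = {node}
    changed = True
    while changed:
        changed = False
        for a, b in merged:
            # Exactly one endpoint is already in the group: pull in the other.
            if (a in group) != (b in group):
                group.update((a, b))
                changed = True
    return group


def attended(node, merged, edges):
    """Run the original query from every start vertex instead of just one."""
    starts = start_vertices(node, merged)
    return {dst for src, dst in edges if src in starts}


merged = {("A", "B")}                      # A was merged with B
edges = [("A", "yoga"), ("B", "pilates")]  # ordinary edges stay where they are
while_merged = attended("A", merged, edges)
```

Undoing the merge is then a single `merged.discard(("A", "B"))`, after which `attended("A", merged, edges)` only sees A's own edges again. No edges are ever copied or moved, which is why this variant is cheap in both directions.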
