1
votes

I'm using Neo4j to persist an XSD in graph form. Every node in the graph has an attribute that is a list (array) of Strings, and my queries will be based on this list.

For example, to keep it simple, let's say every node in the graph has a list of alphabets as one of its attributes. My query needs to return all the nodes whose list contains 'C'.

My question is whether I should move the alphabets out of the attribute list into individual nodes attached as child nodes to every node. If I do that, my query changes to return all the nodes that have a child node with 'C' as its value.

Which of these two approaches is more efficient: keeping the list as an attribute, or having separate child nodes that each hold one value from the list?

In the real scenario, the list can contain thousands of entries. So if I go with the second approach and create a separate node for each list value, the tree will bloat in size.

But I need to know which of the two approaches is more READ-efficient.

2
Can you clarify what you mean by an "alphabet"? Is it a String of characters? Or is it just one character (from the English alphabet)? An illustration or more details would be helpful. - cybersam
Yeah, I have already seen the above link. But still I want to explore my options. - Piyush
I used alphabets to keep my example above simple. In the real use case it will be an array of Strings. - Piyush
Just to summarize my question: for search/read queries, is it better to have separate nodes instead of a list property? Which approach will be more efficient for fetching my results? - Piyush

2 Answers

0
votes

I'd say it depends upon the queries you plan on using.

If a lookup by an element is the primary use case (as in your example, finding all nodes containing 'C'), then separate nodes may be more efficient. The reason is that your query will not be a 'contains' type query but the reverse: it first matches the child node 'C' (your index or unique constraint will be used under the hood for a fast lookup), and then traverses the relationships from that node to all nodes associated with it. You get the relevant results without any extra filtering or property inspection.

As an example, assume you have :Holder nodes and :Letter nodes, where each :Letter node has a unique 'letter' property and each :Holder node has a :Contains relationship to some subset of the :Letter nodes.

Your lookup query for getting all :Holder nodes containing 'C' would look like:

MATCH (:Letter {letter: 'C'})<-[:Contains]-(h:Holder)
RETURN h

That's it. You match to the thing you want to find, then you find all the other nodes that contain it.
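For completeness, here is a rough sketch of how such a model might be set up. The constraint name `letter_unique` and the `name` property on :Holder are only illustrative, and the constraint syntax shown assumes a recent Neo4j release (older versions use a different form, noted in the comment).

```cypher
// Uniqueness constraint on :Letter(letter); this also backs the fast lookup.
// Assumes Neo4j 5 syntax. Older releases (3.x) use instead:
//   CREATE CONSTRAINT ON (l:Letter) ASSERT l.letter IS UNIQUE
CREATE CONSTRAINT letter_unique IF NOT EXISTS
FOR (l:Letter) REQUIRE l.letter IS UNIQUE;

// One shared :Letter node per distinct value, linked from each holder.
// The 'name' property is hypothetical, used only to identify the holder.
MERGE (c:Letter {letter: 'C'})
MERGE (h:Holder {name: 'example'})
MERGE (h)-[:Contains]->(c);
```

Note that each distinct value is stored once and shared by every :Holder that contains it, so the graph grows with the number of distinct values plus the number of relationships, not with one node per list entry per holder.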

The other option, using a list within a node, seems less performant to me, especially with thousands of entries (and thousands of nodes). To my knowledge, indexing does not cover the elements of a collection, so you will never be able to do a fast lookup by a collection element. The db will have to inspect the collection of every node to find those containing that element, which will only become slower as the collections grow and as the number of nodes grows.

An example of this usage, where :Holder nodes have a 'letters' collection, looks like this:

MATCH (h:Holder)
WHERE 'C' IN h.letters
RETURN h

And again, this is a simple looking query, but it will be a slow one that won't be able to take advantage of indexes or other means to speed it up.
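If you want to check this on your own data rather than take my word for it, you can prefix either query with PROFILE and compare the plans and db hits. The operator names in the comments are what I would expect to see, not something I have measured on your data, and they can vary by Neo4j version.

```cypher
// List-based model: I would expect a label scan over all :Holder nodes
// followed by a Filter that inspects each node's 'letters' list.
PROFILE
MATCH (h:Holder)
WHERE 'C' IN h.letters
RETURN h;

// Node-based model: I would expect an index seek on :Letter(letter)
// followed by an Expand over the :Contains relationships.
PROFILE
MATCH (:Letter {letter: 'C'})<-[:Contains]-(h:Holder)
RETURN h;
```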

That said, the other queries you plan to make should also be brought into consideration for a final decision.
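If you already have the data stored as lists and want to try both models side by side before deciding, a one-off conversion along these lines should work. This is only a sketch: it assumes the list property is called 'letters' as in the example above, and that the uniqueness constraint on :Letter(letter) already exists so the MERGE lookups stay fast.

```cypher
// For every holder, turn each list entry into a shared :Letter node
// and link the holder to it. MERGE avoids creating duplicates.
MATCH (h:Holder)
UNWIND h.letters AS letter
MERGE (l:Letter {letter: letter})
MERGE (h)-[:Contains]->(l);
```

With thousands of entries per node you would probably want to run this in batches rather than in a single transaction.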

0
votes

For your use case, it should be faster to keep all the strings in the same collection on the same node, as Neo4j would have to do less work.