5
votes

This is actually more of a Lucene question, but it's in the context of a neo4j database.

I have a database that's divided into 50 or so node types (i.e. "collections" or "tables" in other kinds of dbs). Each has a subset of properties that need to be indexed; some share the same name, some don't.

When searching, I always want to find nodes of a specific type, never across all nodes.

I can see three ways of organizing this:

  • One index per type, properties map naturally to index fields: index 'foo', 'id'='1234'.

  • A single global index, each field maps to a property name, to distinguish the type either include it as part of the value ('id'='foo:1234') or check the nodes once they're returned (I expect duplicates to be very rare).

  • A single index, type is part of the field name: 'foo.id'='1234'.
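A minimal sketch of how the three layouts differ, using plain Python dicts to stand in for Lucene indexes and documents (all names and values here are illustrative, not a Neo4j or Lucene API):

```python
# Option 1: one index per type; field names map directly to property names.
per_type = {
    "foo": {("id", "1234"): "node-1"},
    "bar": {("id", "1234"): "node-2"},
}

# Option 2: one global index; the type is encoded in the value.
global_by_value = {
    ("id", "foo:1234"): "node-1",
    ("id", "bar:1234"): "node-2",
}

# Option 3: one global index; the type is encoded in the field name.
global_by_field = {
    ("foo.id", "1234"): "node-1",
    ("bar.id", "1234"): "node-2",
}

# All three answer the same question -- "the foo node with id 1234":
assert per_type["foo"][("id", "1234")] == "node-1"
assert global_by_value[("id", "foo:1234")] == "node-1"
assert global_by_field[("foo.id", "1234")] == "node-1"
```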

Once created, the database is read-only.

Are there any benefits to one of those, in terms of convenience, size/cache efficiency, or performance?

As I understand it, for the first option neo4j will create a separate physical index for each type, which seems suboptimal. For the third, most Lucene docs would end up with only a small subset of the fields; I'm not sure whether that affects anything.

Having a separate index for each type seems to be more convenient and also quicker, as the overall size of each index will be smaller. But I may be missing something. – biziclop
@biziclop: It actually seemed like the least convenient to me, since I'd have to manage opening/closing the individual indices. My understanding is that the overall size will also be larger (see jpountz's answer). – Dmitri
@Dmitri Well, obviously the overall size will be larger; the question is: are searches for all types distributed evenly in time, or are some types searched a lot more often than others? Either way, what I'd do is implement the solution I find the most convenient and see if it performs well. If it does, you have your winner. – biziclop
I agree, I'm just trying to figure out what I find the most convenient :) – Dmitri

3 Answers

1
votes

A single index will be smaller than several little indexes, because some data, such as the term dictionary, will be shared. However, since a term dictionary lookup is an O(lg(n)) operation, a lookup in a bigger term dictionary might be a little slower. (If you have 50 indexes, merging them would only cost about 6 more comparisons per lookup, since 2^6 >= 50; you likely won't notice any difference.)
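The arithmetic behind that claim: merging 50 equal-sized term dictionaries into one deepens a binary search by at most ceil(log2(50)) steps.

```python
import math

# Extra binary-search comparisons when one term dictionary replaces 50:
extra_comparisons = math.ceil(math.log2(50))
assert 2 ** extra_comparisons >= 50
print(extra_comparisons)  # 6
```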

Another advantage of a smaller overall index is that it is more likely to fit in the OS cache, which makes queries run faster.

Instead of your options 2 and 3, I would index two different fields, id and type, and search for (id:ID AND type:TYPE), but I don't know whether that is possible with neo4j.
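A sketch of that two-field approach, with a plain Python list standing in for the index and a filter standing in for the Lucene boolean query (the docs and field values are made up for illustration):

```python
# Each entry stands in for a Lucene document with separate "id" and "type" fields.
docs = [
    {"type": "foo", "id": "1234", "node": "node-1"},
    {"type": "bar", "id": "1234", "node": "node-2"},
]

def query(type_value, id_value):
    # Equivalent in spirit to the Lucene query string "id:1234 AND type:foo".
    return [d["node"] for d in docs
            if d["type"] == type_value and d["id"] == id_value]

assert query("foo", "1234") == ["node-1"]
assert query("bar", "1234") == ["node-2"]
```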

2
votes

I came across this problem recently when I was building an ActiveRecord connection adapter for Neo4j over REST, to be used in a Rails project. Since both ActiveRecord and ActiveRelation are tightly coupled to SQL syntax, it was difficult to fit everything into NoSQL. This might not be the best solution, but here's how I solved it:

  1. Created an index named model_index which indexes nodes under two keys, type and model.
  2. Index lookups on the type key currently use just one value, model. This was introduced primarily to provide SHOW TABLES SQL functionality, which gets me a list of all models present in the graph.
  3. Index lookups on the model key use values corresponding to the different model names in my system. This is primarily for achieving DESC <TABLENAME> functionality.
  4. With each table creation, as in CREATE TABLE, a node is created whose properties store the table-definition attributes.
  5. The created node is indexed under model_index with type:model and model:<model-name>. This includes the newly created model in the list of 'tables' and also allows one to reach the model node directly via an index lookup on the model key.
  6. For each record created per model (type, in your case), an outgoing edge labeled instances is created from the model node to the new record: v[123] :=> [instances] :=> v[245], where v[123] represents the model node and v[245] represents a record of v[123]'s type.
  7. Now if you want to get all instances of a specified type, you can look up model_index with model:<model-name> to reach the model node, then fetch all adjacent nodes over outgoing edges labeled instances. Filtered lookups can be further achieved by applying filters and other, more complex traversals.
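The steps above can be sketched with plain Python structures standing in for the index, nodes, and edges (function and variable names here are illustrative, not Neo4j's API):

```python
# model_index maps key -> value -> list of nodes (steps 1-3).
model_index = {"type": {}, "model": {}}
edges = []  # (source-node-id, label, destination-node) triples

def create_model(name, definition):
    # Step 4: a node whose properties store the table definition.
    node = {"name": name, **definition}
    # Step 5: index it under type:model and model:<model-name>.
    model_index["type"].setdefault("model", []).append(node)
    model_index["model"].setdefault(name, []).append(node)
    return node

def create_record(model_node, props):
    # Step 6: an outgoing 'instances' edge from the model node to the record.
    record = dict(props)
    edges.append((id(model_node), "instances", record))
    return record

def instances_of(name):
    # Step 7: one index lookup, then a single-level traversal.
    model_node = model_index["model"][name][0]
    return [dst for src, label, dst in edges
            if src == id(model_node) and label == "instances"]

users = create_model("user", {"columns": ["id", "email"]})
create_record(users, {"id": 1, "email": "a@example.com"})

# SHOW TABLES: every node indexed under type:model.
assert [m["name"] for m in model_index["type"]["model"]] == ["user"]
# All instances of the 'user' model.
assert instances_of("user")[0]["id"] == 1
```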

The above solution keeps model_index small (only two entries per model), and achieves an effective record lookup via one index lookup and a single-level traversal.

Although in your case nodes of different types are not adjacent to each other, you could still determine the type of any arbitrary node by simply looking up its adjacent node along an incoming edge labeled instances. Further, I'm considering incorporating SpringDataGraph's pattern of storing a __type__ property on each instance node to avoid this adjacent-node lookup.

I'm currently translating AREL to Gremlin scripts for almost everything. You can find the source code for my AR adapter at https://github.com/yournextleap/activerecord-neo4j-adapter

Hope this helps, Cheers! :)

1
votes

spring-data-neo4j uses the first approach - it creates a different index for each type. So I guess that's a good option for the general scenario. But in your particular case it might be suboptimal, as you say. I'd run some benchmarks to measure the performance.

The other two, by the way, seem a bit artificial. You are possibly indexing completely unrelated information in the same index, which doesn't sound right.