5
votes

This is actually more of a Lucene question, but it's in the context of a neo4j database.

I have a database that's divided into 50 or so node types (i.e. "collections" or "tables" in other kinds of dbs). Each has a subset of properties that need to be indexed; some share the same name, some don't.

When searching, I always want to find nodes of a specific type, never across all nodes.

I can see three ways of organizing this:

  • One index per type, properties map naturally to index fields: index 'foo', 'id'='1234'.

  • A single global index, each field maps to a property name, to distinguish the type either include it as part of the value ('id'='foo:1234') or check the nodes once they're returned (I expect duplicates to be very rare).

  • A single index, type is part of the field name: 'foo.id'='1234'.
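A minimal sketch of how the three layouts differ, using plain Python dicts to stand in for Lucene indexes and documents (all names and values here are illustrative, not a Neo4j or Lucene API):

```python
# Option 1: one index per type; field names map directly to property names.
per_type = {
    "foo": {("id", "1234"): "node-1"},
    "bar": {("id", "1234"): "node-2"},
}

# Option 2: one global index; the type is encoded in the value.
global_by_value = {
    ("id", "foo:1234"): "node-1",
    ("id", "bar:1234"): "node-2",
}

# Option 3: one global index; the type is encoded in the field name.
global_by_field = {
    ("foo.id", "1234"): "node-1",
    ("bar.id", "1234"): "node-2",
}

# All three answer the same question -- "the foo node with id 1234":
assert per_type["foo"][("id", "1234")] == "node-1"
assert global_by_value[("id", "foo:1234")] == "node-1"
assert global_by_field[("foo.id", "1234")] == "node-1"
```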

Once created, the database is read-only.

Are there any benefits to one of those, in terms of convenience, size/cache efficiency, or performance?

As I understand it, for the first option neo4j will create a separate physical index for each type, which seems suboptimal. For the third, most Lucene docs would end up with only a small subset of the fields; I'm not sure whether that affects anything.

Having a separate index for each type seems to be more convenient and also quicker, as the overall size of each index will be smaller. But I may be missing something. – biziclop
@biziclop: It actually seemed like the least convenient to me, since I'd have to manage opening/closing the individual indices. My understanding is that the overall size will also be larger (see jpountz's answer). – Dmitri
@Dmitri Well, obviously the overall size will be larger; the question is: are searches for all types distributed evenly in time, or are some types searched a lot more often than others? Either way, what I'd do is implement the solution I find the most convenient and see if it performs well. If it does, you have your winner. – biziclop
I agree, I'm just trying to figure out what I find the most convenient :) – Dmitri

3 Answers

1
votes

A single index will be smaller than several little indexes, because some data, such as the term dictionary, will be shared. However, since a term dictionary lookup is an O(lg(n)) operation, a lookup in a bigger term dictionary might be a little slower. (If you have 50 indexes, merging them would only cost about 6 more comparisons per lookup, since 2^6 >= 50; you likely won't notice any difference.)
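The arithmetic behind that claim: merging 50 equal-sized term dictionaries into one deepens a binary search by at most ceil(log2(50)) steps.

```python
import math

# Extra binary-search comparisons when one term dictionary replaces 50:
extra_comparisons = math.ceil(math.log2(50))
assert 2 ** extra_comparisons >= 50
print(extra_comparisons)  # 6
```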

Another advantage of a smaller overall index is that it is more likely to fit in the OS cache, which makes queries run faster.

Instead of your options 2 and 3, I would index two different fields, id and type, and search for (id:ID AND type:TYPE), but I don't know whether that is possible with neo4j.
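A sketch of that two-field approach, with a plain Python list standing in for the index and a filter standing in for the Lucene boolean query (the docs and field values are made up for illustration):

```python
# Each entry stands in for a Lucene document with separate "id" and "type" fields.
docs = [
    {"type": "foo", "id": "1234", "node": "node-1"},
    {"type": "bar", "id": "1234", "node": "node-2"},
]

def query(type_value, id_value):
    # Equivalent in spirit to the Lucene query string "id:1234 AND type:foo".
    return [d["node"] for d in docs
            if d["type"] == type_value and d["id"] == id_value]

assert query("foo", "1234") == ["node-1"]
assert query("bar", "1234") == ["node-2"]
```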

2
votes

I came across this problem recently when I was building an ActiveRecord connection adapter for Neo4j over REST, to be used in a Rails project. Since both ActiveRecord and ActiveRelation are tightly coupled to SQL syntax, it was difficult to fit everything into NoSQL. This might not be the best solution, but here's how I solved it:

  1. Created an index named model_index which indexes nodes under two keys, type and model.
  2. Index lookups on the type key currently use just one value, model. This was introduced primarily to provide SHOW TABLES SQL functionality, which gets me a list of all models present in the graph.
  3. Index lookups on the model key use values corresponding to the different model names in my system. This is primarily for achieving DESC <TABLENAME> functionality.
  4. With each table creation, as in CREATE TABLE, a node is created whose properties store the table-definition attributes.
  5. The created node is indexed under model_index with type:model and model:<model-name>. This includes the newly created model in the list of 'tables' and also allows one to reach the model node directly via an index lookup on the model key.
  6. For each record created per model (type, in your case), an outgoing edge labeled instances is created from the model node to the new record: v[123] :=> [instances] :=> v[245], where v[123] represents the model node and v[245] represents a record of v[123]'s type.
  7. Now if you want to get all instances of a specified type, you can look up model_index with model:<model-name> to reach the model node, then fetch all adjacent nodes over outgoing edges labeled instances. Filtered lookups can be further achieved by applying filters and other, more complex traversals.
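The steps above can be sketched with plain Python structures standing in for the index, nodes, and edges (function and variable names here are illustrative, not Neo4j's API):

```python
# model_index maps key -> value -> list of nodes (steps 1-3).
model_index = {"type": {}, "model": {}}
edges = []  # (source-node-id, label, destination-node) triples

def create_model(name, definition):
    # Step 4: a node whose properties store the table definition.
    node = {"name": name, **definition}
    # Step 5: index it under type:model and model:<model-name>.
    model_index["type"].setdefault("model", []).append(node)
    model_index["model"].setdefault(name, []).append(node)
    return node

def create_record(model_node, props):
    # Step 6: an outgoing 'instances' edge from the model node to the record.
    record = dict(props)
    edges.append((id(model_node), "instances", record))
    return record

def instances_of(name):
    # Step 7: one index lookup, then a single-level traversal.
    model_node = model_index["model"][name][0]
    return [dst for src, label, dst in edges
            if src == id(model_node) and label == "instances"]

users = create_model("user", {"columns": ["id", "email"]})
create_record(users, {"id": 1, "email": "a@example.com"})

# SHOW TABLES: every node indexed under type:model.
assert [m["name"] for m in model_index["type"]["model"]] == ["user"]
# All instances of the 'user' model.
assert instances_of("user")[0]["id"] == 1
```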

The above solution keeps model_index small (only two entries per model), and achieves an effective record lookup via one index lookup and a single-level traversal.

Although in your case nodes of different types are not adjacent to each other, you could still determine the type of any arbitrary node by simply looking up its adjacent node along an incoming edge labeled instances. Further, I'm considering incorporating SpringDataGraph's pattern of storing a __type__ property on each instance node to avoid this adjacent-node lookup.

I'm currently translating AREL to Gremlin scripts for almost everything. You can find the source code for my AR adapter at https://github.com/yournextleap/activerecord-neo4j-adapter

Hope this helps, Cheers! :)

1
votes

spring-data-neo4j uses the first approach - it creates a different index for each type. So I guess that's a good option for the general scenario. But in your particular case it might be suboptimal, as you say. I'd run some benchmarks to measure the performance.

The other two, by the way, seem a bit artificial. You are possibly indexing completely unrelated information in the same index, which doesn't sound right.