
Within our application, we've been working with Lucene.Net to index large amounts of data. The fields themselves are configurable, so the name and type of each field can change with every rebuild. Within each document we can have multiple fields with the same name, and a varying number of numeric and text fields. Since we've put a lot of work into the current implementation, switching to a different search engine is not an option at the moment.

For the most part it works like a charm, but there is one difficulty we can't seem to get around.

Suppose we want to index document "X" containing:

Row A - Field1: 4 + Field2: a
Row B - Field1: 8 + Field2: b

The index we would make would contain 4 fields:

  • Document X:
    • Field1: 4 (Numeric)
    • Field2: a (Text)
    • Field1: 8 (Numeric)
    • Field2: b (Text)

(The row IDs are not important.)

A search for Field1:[3 TO 6] AND Field2:b hits this document, even though no single row satisfies both clauses.
The link between the fields belonging to the same row (e.g. between 4 and 'a') is gone.
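A minimal sketch (plain Python standing in for the index, with the hypothetical data from above) shows why the document-level match fires even though no single row satisfies both clauses:

```python
# Simulation of document-level matching: the document stores all row
# values as flat multi-valued fields, so the two query clauses can be
# satisfied by values coming from *different* rows.

doc_x = {
    "Field1": [4, 8],      # numeric values from rows A and B
    "Field2": ["a", "b"],  # text values from rows A and B
}

def matches(doc):
    # Field1:[3 TO 6] AND Field2:b, evaluated per document
    range_hit = any(3 <= v <= 6 for v in doc["Field1"])  # satisfied by row A's 4
    term_hit = "b" in doc["Field2"]                      # satisfied by row B's 'b'
    return range_hit and term_hit

print(matches(doc_x))  # True, although no single row contains both 4 and 'b'
```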

We could concatenate the values, e.g. 4_a, but that would break our numeric searches and would require clients to know which fields are concatenated in order to get proper results. It would also complicate our analyzers, since each field can have a different analyzer (mostly for language purposes).

Alternatively, we could create a separate document for each row with the same key and deduplicate the search results, but that doesn't sound like the way to go, does it? It would seriously multiply the number of documents: we would create between 20 and 100 documents for each document we create now. I haven't tested the performance or usability of this, as the current implementation doesn't let me try it out very easily :-)

Does anyone know how I can force a link between certain fields within Lucene.Net, but still keep a way to search for each field individually?

I find it hard to understand what you're actually searching for from the wording of your question. Are you looking up individual rows or aggregates of rows? Remember that Lucene is not a relational database, and trying to make it behave like one is usually not a good idea. – Jf Beaulac

2 Answers


I personally don't see why an increased number of documents would affect performance. At least in the Java version of Lucene, the bulk of the memory is used by the term cache, which is per term and has no relationship to the document count (provided the term count doesn't change). I can't comment on usability, though, as that is specific to your app.

The main point is that once you group the rows into documents, you lose the row-relationship information. You can fix that by adding extra fields (something like rowInfoA:4_a, rowInfoB:8_b), but this seems too cumbersome and will actually require far more memory. Yes, you can choose to store these auxiliary fields without indexing them, but (given the information available) I would still prefer the 1:1 row-to-document mapping.


One kludge is to add another field for linkages:

  • Document X:

    • Field1: 4 (Numeric)
    • Field2: a (Text)
    • Field1: 8 (Numeric)
    • Field2: b (Text)
    • Link: 4_a
    • Link: 8_b
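A quick sketch (plain Python, hypothetical data) of what this buys you: each row contributes one combined value to the Link field, so a row-level "both values in the same row" query can be answered exactly, while the original fields stay searchable on their own.

```python
# Simulation of the extra "Link" field: one concatenated value per row.

doc_x = {
    "Field1": [4, 8],
    "Field2": ["a", "b"],
    "Link": ["4_a", "8_b"],  # one linked value per row
}

def linked_match(doc, f1, f2):
    # Query the Link field instead of the two individual fields.
    return f"{f1}_{f2}" in doc["Link"]

print(linked_match(doc_x, 4, "a"))  # True  - row A really contains 4 and 'a'
print(linked_match(doc_x, 4, "b"))  # False - 4 and 'b' come from different rows
```

The obvious limitation is that the Link field only supports exact matches; a range query such as Field1:[3 TO 6] cannot be expressed against the concatenated values.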

Another kludge is to index each row as a separate document, with each row-document carrying a MyDocument field (e.g. MyDocument:X) identifying its parent document. This lets you filter or group by document later in your process.
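A sketch of that alternative (plain Python, hypothetical data): each row becomes its own document, so row-level queries are exact, and hits can be grouped back to the parent via the MyDocument key.

```python
# Simulation of the one-document-per-row approach: queries are now
# evaluated per row, and results can be grouped by the parent key.

rows = [
    {"MyDocument": "X", "Field1": 4, "Field2": "a"},
    {"MyDocument": "X", "Field1": 8, "Field2": "b"},
]

def search(docs, lo, hi, term):
    # Field1:[lo TO hi] AND Field2:term, evaluated per row-document
    return [d for d in docs if lo <= d["Field1"] <= hi and d["Field2"] == term]

print(search(rows, 3, 6, "b"))  # [] - the cross-row combination no longer matches

hits = search(rows, 3, 6, "a")                  # row A matches for real
parents = {d["MyDocument"] for d in hits}       # deduplicate by parent key
print(parents)                                  # {'X'}
```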