
Learning how to use Lucene!

I have an index in Lucene which is configured to store term vectors.

I also have a set of documents for which I have already constructed custom term vectors (for an unrelated purpose, without using Lucene).

Is there a way to insert them directly into the Lucene inverted index in lieu of the original contents of the documents?

I imagine one way to do this would be to generate bogus text from the term vector, repeating each term the appropriate number of times, and then to feed that bogus text in as the contents of the document. This seems silly, because ultimately Lucene will have to convert the bogus text back into a term vector in order to index it.
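For concreteness, the workaround described above might look roughly like the following. This is only a minimal sketch: the class name `BogusTextSketch`, the method `expand`, and the sample terms are hypothetical, and it assumes the custom term vector is available as a map from term to integer count.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene.
final class BogusTextSketch {

    // Expands a term -> count map into whitespace-separated text,
    // repeating each term as many times as its count.
    static String expand(Map<String, Integer> termCounts) {
        StringJoiner text = new StringJoiner(" ");
        for (Map.Entry<String, Integer> entry : termCounts.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                text.add(entry.getKey());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        vector.put("lucene", 3);
        vector.put("index", 1);
        System.out.println(expand(vector)); // prints "lucene lucene lucene index"
    }
}
```

The resulting string would presumably have to be indexed with an analyzer that only splits on whitespace, so the terms are not altered again. The term positions it produces are arbitrary, so phrase and proximity queries against such a field would not be meaningful.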

I don't understand whether the first index is already a Lucene index, or why you mention only term vectors. Can you explain the question better? Thanks. - Simona R.
My understanding is that Lucene maintains an inverted index mapping terms to documents (weighted by an appropriate score). One can insert documents into this index, whereupon the document contents are counted to produce a term vector (the forward index), which is then inverted and inserted into the inverted index. I have a case where I have already produced an approximate term vector for a document. I simply need to insert it into the Lucene inverted index so that this document can be found. - Shashir Reddy
I imagine one way to do this would be to take the term vector and generate bogus text out of the terms with the appropriate number of term occurrences. But that seems silly, because ultimately Lucene is simply converting this bogus text into a term vector anyway. - Shashir Reddy
I have the term vectors for these documents already because I have a custom search engine which already has an inverted index. - Shashir Reddy

1 Answer


I'm not entirely sure what you want to do with these term vectors ultimately (score? just retrieve?) but here's one strategy I might advocate for.

Instead of focusing on faking out the text attribute of term vectors, consider looking into payloads, which attach arbitrary metadata to each token. During analysis, text is converted to tokens, and a number of attributes are emitted for each token. There are standard attributes like position, term character offsets, and the term string itself. ALL of these can be part of the uninverted term vector. Another attribute is the payload, which is arbitrary metadata you can attach to a term.

You can store any token attribute uninverted as a "term vector" including payloads, which you can access at scoring time.

To do this you need to

  1. Configure your field to store term vectors, including term vector payloads (which in turn requires storing term vector positions)
  2. Customize analysis to emit payloads that correspond to your terms. You can read more here
  3. Use IndexReader.getTermVector to pull back the Terms for a document and field. From that you can get a TermsEnum, and from the TermsEnum a DocsAndPositionsEnum, which has an accessor (getPayload) for the current payload
  4. If you want to use this in scoring, consider a custom query or custom score query
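
As a rough illustration of step 2, one option (an assumption on my part, not the only way to emit payloads) is to serialize the existing term vector as `term|weight` tokens and let Lucene's DelimitedPayloadTokenFilter, whose default delimiter is `|`, split each token into a term and a float payload. The sketch below only builds that text and does not touch Lucene itself; the class name `PayloadTextSketch`, the method `encode`, and the sample weights are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene.
final class PayloadTextSketch {

    // Serializes a term -> weight map as "term<delimiter>weight" tokens
    // separated by single spaces, one token per term.
    static String encode(Map<String, Float> weights, char delimiter) {
        StringJoiner text = new StringJoiner(" ");
        for (Map.Entry<String, Float> entry : weights.entrySet()) {
            text.add(entry.getKey() + delimiter + Float.toString(entry.getValue()));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> vector = new LinkedHashMap<>();
        vector.put("lucene", 2.5f);
        vector.put("index", 0.75f);
        System.out.println(encode(vector, '|')); // prints "lucene|2.5 index|0.75"
    }
}
```

This assumes that no term contains whitespace or the delimiter character, and that the field's analysis chain is a whitespace tokenizer followed by the delimited payload filter with a float encoder. Each term then appears once in the field, with its weight carried in the payload rather than in the term frequency.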