I'm designing a Lucene search index that includes ranked tags for each document.

Example:

Document 1
tag: java, rank 1.2
tag: learning, rank 2.1
tag: bugs, rank 1.2
tag: architecture, rank 0.3

The tags come from an automated classification algorithm that also assigns a score to each tag.

How do I design the index so that I can query for a combination of tags and get the most relevant results back? For example, a search for java+learning.

I've initially created a FIELD for each tag and used the rank to boost the field for each document. Is this a good approach in terms of performance? What if I have 10,000 possible tags? Is it a good idea to have 10,000 FIELDS in Lucene?

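import org.apache.lucene.document.Field;

// One field per tag: the field name encodes the tag id, the value is a constant marker.
Field tagField = new Field(
        FIELD_TAG + tag.getId(),
        "y",
        Field.Store.NO,
        Field.Index.NOT_ANALYZED);

// Use the classifier's rank as the index-time boost of this field.
tagField.setBoost(tag.getRank());

luceneDoc.add(tagField);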

If I instead add all the tags to the same field, how can I take into account the rank?

1 Answer

I had this problem in my search too... Tell me if I'm wrong...

The ideal would be a single field, say "Tags", that contains the value "java learning bugs architecture" and is tokenized with a WhitespaceTokenizer:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WhitespaceTokenizerFactory

But if you do this you cannot boost each word individually; you can only boost the "Tags" field as a whole...
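
To make that limitation concrete, here is a small plain-Java sketch (no Lucene calls, so it only mimics whitespace tokenization). The class and method names are made up for illustration, and the tags are the ones from the question. It shows that once the tags are joined into one field value, the tokens that come out no longer carry their rank:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SingleTagsField {

    // Joins the tag names into the single whitespace-separated value
    // that would go into the hypothetical "Tags" field. The ranks are dropped here.
    public static String toFieldValue(Map<String, Float> rankedTags) {
        return String.join(" ", rankedTags.keySet());
    }

    // Mimics whitespace tokenization: split on runs of whitespace, nothing else.
    public static List<String> tokenize(String fieldValue) {
        String trimmed = fieldValue.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // Document 1 from the question; LinkedHashMap keeps insertion order.
        Map<String, Float> tags = new LinkedHashMap<>();
        tags.put("java", 1.2f);
        tags.put("learning", 2.1f);
        tags.put("bugs", 1.2f);
        tags.put("architecture", 0.3f);

        String value = toFieldValue(tags);
        System.out.println(value);           // java learning bugs architecture
        System.out.println(tokenize(value)); // [java, learning, bugs, architecture]
    }
}
```

Every token ends up in the same field, so a single field-level boost would apply to all of them equally.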

With this approach Lucene will not give good scoring when a user searches for "java bugs" or "architecture in java", but it will still return all documents that contain these words.

But you can do as you said, with a lot of tag fields and a boost on each one... Or you can create a new query parser (http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html) that extends edismax, for example, to make a field work the way you want.
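
Whichever way you go, it helps to pin down the ordering you expect first. Below is a rough plain-Java model of it, not Lucene's real scoring formula (which also mixes in term statistics and norms): a document's score is simply the sum of the ranks of the query tags it carries. The class name, the doc ids and the second and third documents are made up for illustration; only "doc1" uses the ranks from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RankedTagScoring {

    // Simplified model: sum the ranks of the query tags present on the document.
    // Tags that are missing from the document contribute 0.
    public static double score(Map<String, Double> docTags, List<String> queryTags) {
        double total = 0.0;
        for (String tag : queryTags) {
            total += docTags.getOrDefault(tag, 0.0);
        }
        return total;
    }

    // Returns the ids of documents carrying at least one query tag, best score first.
    public static List<String> search(Map<String, Map<String, Double>> index,
                                      List<String> queryTags) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Map<String, Double>> entry : index.entrySet()) {
            boolean matches = false;
            for (String tag : queryTags) {
                if (entry.getValue().containsKey(tag)) {
                    matches = true;
                    break;
                }
            }
            if (matches) {
                hits.add(entry.getKey());
            }
        }
        hits.sort(Comparator.comparingDouble(
                (String id) -> score(index.get(id), queryTags)).reversed());
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Double> doc1 = new HashMap<>(); // ranks from the question
        doc1.put("java", 1.2);
        doc1.put("learning", 2.1);
        doc1.put("bugs", 1.2);
        doc1.put("architecture", 0.3);

        Map<String, Double> doc2 = new HashMap<>(); // hypothetical
        doc2.put("java", 0.5);
        doc2.put("learning", 0.4);

        Map<String, Double> doc3 = new HashMap<>(); // hypothetical, no query tag
        doc3.put("bugs", 2.0);

        Map<String, Map<String, Double>> index = new LinkedHashMap<>();
        index.put("doc2", doc2);
        index.put("doc1", doc1);
        index.put("doc3", doc3);

        // Query java+learning: doc1 outranks doc2, doc3 is not returned.
        System.out.println(search(index, Arrays.asList("java", "learning")));
    }
}
```

If the per-tag fields with boosts (or a custom parser) give you roughly this ordering for java+learning, the index is doing what you asked for.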

Is that what you want?

Oh, one more thing: adding a lot of fields will make indexing slower and the index bigger, which is probably not good for searching either.