
I know that Solr can do free-text search, but what is the best practice for faceting on common terms inside Solr text fields?

For example, we have a large blob of text (a description of a property) which contains useful text to facet on like 'private garage', 'private garden', 'private parking', 'underground parking', 'hardwood floors', 'two floors', ... dozens more like these.

I would like to create a view which lets users see the number of properties with each of these terms and allow the users to drill down to the relevant properties.

One obvious solution is to pre-process the data, parse the text, and create the facets for each of these key phrases with a boolean yes/no value.
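As a sketch of that pre-processing approach (the phrase list and field names are illustrative, not anything Solr provides), each description could be scanned for a curated set of key phrases and turned into boolean facet fields before indexing:

```python
# Hypothetical pre-processing step: match a curated list of key phrases
# in a property description and emit boolean facet fields for indexing.
KEY_PHRASES = [
    "private garage", "private garden", "private parking",
    "underground parking", "hardwood floors", "two floors",
]

def extract_facets(description):
    """Return a dict mapping a facet field name to True for each phrase found."""
    text = description.lower()
    return {
        "has_" + phrase.replace(" ", "_"): True
        for phrase in KEY_PHRASES
        if phrase in text
    }

doc = "Spacious flat with hardwood floors and underground parking."
print(extract_facets(doc))
```

Each `has_*` field would then be a simple boolean in the Solr schema, which facets cheaply; the cost is that the phrase list has to be maintained by hand.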

Ideally I'd like to automate this, and I imagine Solr's free-text search engine might make it possible. For example, can I use it to remove stop words and collect counts of common phrases, which we can then present to the user?

If pre-processing is the only way, is there a common/best practice approach to this or any open source libraries which perform this function?

What is the best practice for counting and grouping common phrases from a text field in SOLR?


1 Answer


The problem is that faceting on text fields (non-string fields) with a custom analysis chain is rather expensive. You could try using shingles, i.e. breaking your input into a stream of overlapping bi-grams. If you are going to use Solr 4, make sure to set docValues=true on the text field definition. This may speed things up, or at least save you RAM.

The bi-gramming can be achieved using ShingleFilterFactory: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory
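A field type along these lines could be added to schema.xml; this is only a sketch, and the type and field names ("text_shingle", "description_shingles") are illustrative:

```xml
<!-- Sketch of a shingled field type; lower-cases, drops stop words,
     then emits overlapping bi-grams for faceting. -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="2"
            outputUnigrams="false"/>
  </analyzer>
</fieldType>

<field name="description_shingles" type="text_shingle"
       indexed="true" stored="false"/>
```

With outputUnigrams="false" the field contains only the bi-grams themselves, so facet counts reflect phrases rather than single words.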

Beware that it is still quite compute-intensive.

This may work if your data set is not too large (what counts as "too large" depends on your hardware and index) or if you can shard the data appropriately.
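Once the shingled field is in place, the phrase counts come from an ordinary facet request; something along these lines (the core and field names are illustrative):

```
http://localhost:8983/solr/collection1/select?q=*:*&rows=0
    &facet=true
    &facet.field=description_shingles
    &facet.limit=30
    &facet.mincount=5
```

facet.mincount trims rare bi-grams, leaving the common phrases ("private garage", "hardwood floors", ...) to present to the user, each of which can then become a drill-down filter query (fq) on the same field.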