I have two fields in a MySQL table. One is a VARCHAR and holds the "headline" of a classified ad (this is a classifieds website). The other is a TEXT field which contains the "text" of the classified.

Two Questions:
How should I determine how to index these two fields? (What field type, what classes to use, etc.)

Currently I have an "ad_id" as a unique identifier for each ad, example "bmw_m3_82398292".
How can I make Solr return this identifier whenever Solr finds a query match? (The first part of the identifier is actually the headline field's content; the second part is a randomly chosen number.)

Thanks

1 Answer

1. Schema

Your Solr schema is very much determined by your intended search behavior. In your schema.xml file, you'll see a bunch of choices like "text" and "string". They behave differently.

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

The string field type is a literal string match: the whole value is indexed as a single, unanalyzed token. It operates like = in a SQL statement.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

The text_ws field type tokenizes on whitespace and does nothing else. The text field type goes further: it adds filters for stop words, word delimiters, and lower-casing. In the example schema these filters are declared for both the Lucene index and the Solr query (only the index analyzer is excerpted here), so when you search a text field, Solr runs the query terms through the same filters to help find a match.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter ..... />
    <filter ..... />
    <filter ..... />
  </analyzer>
</fieldType>

When indexing things like news stories, for example, you probably want to search for company names and headlines differently.

<field name="headline" type="text" />
<field name="coname" type="string" indexed="true" multiValued="false" omitNorms="true" />

The above example would allow you to do a search like q=coname:Intel AND headline:(processor specifications) and retrieve only stories whose coname is exactly Intel and whose headline matches the given terms.
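Applied to the classifieds case in the question, the same idea might look like the schema.xml fragment below. This is only a sketch: the field names ad_id, headline, and text are assumptions taken from the question, and the stored settings are one possible choice. Declaring ad_id as a string keeps the identifier as a single exact-match token, and storing it lets Solr return it with each hit.

```xml
<!-- Fragment of schema.xml; field names are assumptions based on the question -->
<fields>
  <!-- Identifier such as bmw_m3_82398292: exact match, stored so it can be returned -->
  <field name="ad_id" type="string" indexed="true" stored="true" required="true"/>
  <!-- Analyzed fields for full-text search -->
  <field name="headline" type="text" indexed="true" stored="true"/>
  <!-- Searchable but not stored here, assuming the ad body is fetched from MySQL by ad_id -->
  <field name="text" type="text" indexed="true" stored="false"/>
</fields>

<!-- Tells Solr which field uniquely identifies a document -->
<uniqueKey>ad_id</uniqueKey>
```

If you would rather have Solr return the ad body as well, set stored="true" on the text field.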

If you want to search a range, use the field:[x TO y] syntax described at the end of this answer.

2. Result Fields

You can define a standard set of return fields in your request handler (in solrconfig.xml):

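<requestHandler name="mumble" class="solr.DisMaxRequestHandler">
  <!-- Default parameters are declared inside a "defaults" list -->
  <lst name="defaults">
    <!-- Fields returned when the request does not supply its own fl -->
    <str name="fl">category,coname,headline</str>
  </lst>
</requestHandler>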
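For the question's case, where only the ad identifier should come back, a handler along the following lines could work. Treat it as a sketch: the handler name ads and the field name ad_id are assumptions, and it uses solr.SearchHandler rather than the DisMax class shown above. The field must be declared stored="true" in the schema for Solr to return it.

```xml
<!-- Fragment of solrconfig.xml; the handler name "ads" is hypothetical -->
<requestHandler name="ads" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Return only the identifier (assumed field name) for each matching ad -->
    <str name="fl">ad_id</str>
    <str name="rows">10</str>
  </lst>
</requestHandler>
```

Because these are defaults, a request can still override them by passing its own fl or rows parameter.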

You may also specify the desired fields in your query string, using the fl parameter:

/select?indent=on&version=2.2&q=coname%3AIn*&start=0&rows=10&fl=coname%2Cid&qt=standard

You can also select ranges in your query terms using the field:[x TO *] syntax. If you wanted to select certain ads by their date, you might include

ad_date:[20100101 TO 20100201]

in your query terms. (There are many ways to search ranges; I'm presenting a method that stores dates as yyyymmdd integers instead of using a date field type.)
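For that range query to work, ad_date needs to be indexed as a numeric field. A minimal sketch, assuming a Solr 1.4-era schema where the example configuration defines a trie-based integer type called tint; the type name varies by Solr version (newer releases ship point-based types such as pint), so check the schema bundled with your release:

```xml
<!-- Fragment of schema.xml; type and field names are assumptions -->
<types>
  <!-- Trie-encoded integer type, which supports numeric range queries -->
  <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
             omitNorms="true" positionIncrementGap="0"/>
</types>

<fields>
  <!-- Date stored as a yyyymmdd integer, e.g. 20100115 -->
  <field name="ad_date" type="tint" indexed="true" stored="true"/>
</fields>
```

With the date written as a fixed-width yyyymmdd integer, numeric order matches chronological order, which is what makes a range like [20100101 TO 20100201] behave as expected.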