0
votes

I have an index with a field "Affiliation". Some example values are:

  • "Stanford University School of Medicine, Palo Alto, CA USA",
  • "Institute of Neurobiology, School of Medicine, Stanford University, Palo Alto, CA",
  • "School of Medicine, Harvard University, Boston MA",
  • "Brigham & Women's, Harvard University School of Medicine, Boston, MA"
  • "Harvard University, Cambridge MA"

and so on. The bottom line is that affiliations are written in multiple ways with no apparent consistency.

When I query the index on the affiliation field with, say, "School of Medicine, Stanford University, Palo Alto, CA" (using QueryParser) to find all Stanford-related documents, I get a lot of false positives, presumably because terms such as "School of Medicine" also occur in other affiliations. (Note: I cannot use a PhraseQuery because of the variability in how affiliations are written.)

I have tried the following:

  1. Used a SpanNearQuery built by splitting the search phrase on whitespace. (Here I get no results!)

  2. Tried boosting (using ^): split the phrase on commas and gave the last parts, such as "Palo Alto CA", a much higher boost than the initial parts. I still get lots of false positives.

Any suggestions on how to approach this? If SpanNearQuery is the way to go, any ideas why I get 0 results?

2 Answers

1
votes

Are you using OR search instead of AND?

You can set the default operator to AND with QueryParser.setDefaultOperator(), which should eliminate the false positives. However, you risk false negatives if the indexed value is "Stanford University School of Medicine, Palo Alto, CA" and you search for "Stanford University School of Medicine, Palo Alto, CA USA" (note the extra term USA in the query).

If your queries never contain terms that are missing from the indexed value, this should resolve your problem.
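To make the trade-off concrete, here is a minimal plain-Java sketch of OR versus AND matching over term sets. It is not Lucene code: the class name, the helper methods and the crude tokenizer are all made up for illustration, and a real analyzer may tokenize differently (for example by dropping "of" as a stopword). In Lucene itself the switch would be something along the lines of parser.setDefaultOperator(QueryParser.Operator.AND), depending on your version.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of OR vs AND term matching. Not Lucene code.
final class OperatorDemo {

    // Crude tokenizer (an assumption): lowercase, split on non-letter/digit runs.
    static Set<String> terms(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // OR semantics: a single shared term is enough to match.
    static boolean matchesAny(String query, String indexed) {
        Set<String> shared = terms(query);
        shared.retainAll(terms(indexed));
        return !shared.isEmpty();
    }

    // AND semantics: every query term must occur in the indexed value.
    static boolean matchesAll(String query, String indexed) {
        return terms(indexed).containsAll(terms(query));
    }

    public static void main(String[] args) {
        String query = "School of Medicine, Stanford University, Palo Alto, CA";
        String harvard = "School of Medicine, Harvard University, Boston MA";
        System.out.println("OR  matches Harvard: " + matchesAny(query, harvard));
        System.out.println("AND matches Harvard: " + matchesAll(query, harvard));

        // The false-negative case: the query has a term the indexed value lacks.
        String indexed = "Stanford University School of Medicine, Palo Alto, CA";
        String longer = "Stanford University School of Medicine, Palo Alto, CA USA";
        System.out.println("AND with extra USA: " + matchesAll(longer, indexed));
    }
}
```

With OR, the Harvard value matches because it shares "school", "of", "medicine" and "university" with the query. With AND it does not, but the extra "USA" in the longer query also blocks the legitimate Stanford match.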

0
votes

Here is how I did it:

  1. Added common terms such as "University", "School", "Medicine" and "Institute" to the stopword list.

  2. Built a BooleanQuery with one clause per remaining term and called setMinimumNumberShouldMatch() with 75% of the number of terms in the query string.
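The two steps above can be sketched in plain Java to show what the 75% rule does to the example affiliations. This simulates the matching logic only and is not Lucene code: the class and method names are hypothetical, "of" is added to the stopword list as an assumption, and rounding the threshold up is my choice rather than something the steps prescribe.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java simulation of "stopwords + minimum number should match".
final class MinShouldMatchDemo {

    // Hypothetical stopword list; "of" is an added assumption.
    static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("university", "school", "medicine", "institute", "of"));

    // Lowercase, split on non-letter/digit runs, drop stopwords.
    static Set<String> terms(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // True when at least `fraction` of the query terms occur in the indexed value.
    static boolean matches(String query, String indexed, double fraction) {
        Set<String> queryTerms = terms(query);
        if (queryTerms.isEmpty()) {
            return false;
        }
        Set<String> docTerms = terms(indexed);
        // Rounding up is an assumption; the required count is never below 1 here.
        int required = (int) Math.ceil(fraction * queryTerms.size());
        int hits = 0;
        for (String t : queryTerms) {
            if (docTerms.contains(t)) {
                hits++;
            }
        }
        return hits >= required;
    }

    public static void main(String[] args) {
        String query = "School of Medicine, Stanford University, Palo Alto, CA";
        String[] indexed = {
            "Stanford University School of Medicine, Palo Alto, CA USA",
            "Institute of Neurobiology, School of Medicine, Stanford University, Palo Alto, CA",
            "School of Medicine, Harvard University, Boston MA",
            "Harvard University, Cambridge MA"
        };
        for (String value : indexed) {
            System.out.println(matches(query, value, 0.75) + "  " + value);
        }
    }
}
```

After stopword removal the query reduces to "stanford", "palo", "alto" and "ca", so three of those four terms must be present. Both Stanford values pass and both Harvard values fail.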

Finally, loop through the collected hits and apply a string-similarity measure such as Jaro-Winkler or Levenshtein as a second-level filter. (This is slow but ensures precision.)
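A rough sketch of that second-level filter, using Levenshtein distance in plain Java with no external library. The class name, the normalization step and the threshold are all assumptions for illustration; tune the threshold against your own data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Second-pass filter over search hits using normalized Levenshtein similarity.
final class SimilarityFilter {

    // Classic dynamic-programming edit distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 for identical strings, lower as they diverge.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / longest;
    }

    // Keep only the hits whose similarity to the query reaches the threshold.
    static List<String> filter(String query, List<String> hits, double threshold) {
        String normalizedQuery = query.trim().toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        for (String hit : hits) {
            String normalizedHit = hit.trim().toLowerCase(Locale.ROOT);
            if (similarity(normalizedQuery, normalizedHit) >= threshold) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> hits = new ArrayList<>();
        hits.add("Stanford");
        hits.add("Stanfrod");
        hits.add("Harvard");
        // 0.7 is an arbitrary illustrative threshold.
        System.out.println(filter("stanford", hits, 0.7));
    }
}
```

One caveat: edit distance over whole strings is sensitive to word order, so two affiliations that list the same parts in a different order can score low. Comparing term by term, or sorting the terms before comparing, is one way around that.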

Hope this helps.