I need your help with index design for a real scenario. The question is a bit long, so I will try to explain it as concisely as possible.
We are building a search platform based on Elasticsearch to provide a site search experience for our customers. The documents in the index look something like this:
{ "Path":"http://www.foo.com/doc/abc/1", "Title":"Title 1", "Description":"The description of doc 1", ... }
{ "Path":"http://www.foo.com/doc/abc/2", "Title":"Title 2", "Description":"The description of doc 2", ... }
{ "Path":"http://www.foo.com/doc/abc/3", "Title":"Title 3", "Description":"The description of doc 3", ... }
...
For each query, the returned hits are sorted by relevance by default, but our customer also wants to boost specific documents for specific keywords.
They give us a boosting configuration XML like the following:
<boost>
<Keywords value="keyword1">
<Path rank="10000">http://www.foo.com/doc/abc/1</Path>
</Keywords>
<Keywords value="keyword2">
<Path rank="10000">http://www.foo.com/doc/abc/2</Path>
<Path rank="9900">http://www.foo.com/doc/abc/1</Path>
</Keywords>
<Keywords value="keyword3">
<Path rank="10000">http://www.foo.com/doc/abc/3</Path>
<Path rank="9900">http://www.foo.com/doc/abc/2</Path>
<Path rank="9800">http://www.foo.com/doc/abc/1</Path>
</Keywords>
</boost>
That means that if a user searches for "keyword1", the top hit should be the document whose Path field value is "http://www.foo.com/doc/abc/1", regardless of that document's relevance score. Similarly, for a search on "keyword3", the top 3 hits should be the documents whose Path values are "http://www.foo.com/doc/abc/3", "http://www.foo.com/doc/abc/2" and "http://www.foo.com/doc/abc/1", in that order.
To satisfy this requirement, my design is to first invert the original boosting XML into the following format, keyed by Path instead of by keyword:
<boost>
<Path value="http://www.foo.com/doc/abc/1">
<keywords>
<keyword value="keyword1" rank="10000" />
<keyword value="keyword2" rank="9900" />
<keyword value="keyword3" rank="9800" />
</keywords>
</Path>
<Path value="http://www.foo.com/doc/abc/2">
<keywords>
<keyword value="keyword2" rank="10000" />
<keyword value="keyword3" rank="9900" />
</keywords>
</Path>
<Path value="http://www.foo.com/doc/abc/3">
<keywords>
<keyword value="keyword3" rank="10000" />
</keywords>
</Path>
</boost>
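To make the inversion step concrete, here is a minimal Python sketch of how I would do it. It assumes the customer XML has exactly the shape shown above; the function name `invert_boost_config` is only illustrative, and it produces the per-Path keyword/rank lists directly as Python data rather than writing out the intermediate XML.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The customer's boosting configuration, copied from the example above.
SAMPLE_BOOST_XML = """
<boost>
  <Keywords value="keyword1">
    <Path rank="10000">http://www.foo.com/doc/abc/1</Path>
  </Keywords>
  <Keywords value="keyword2">
    <Path rank="10000">http://www.foo.com/doc/abc/2</Path>
    <Path rank="9900">http://www.foo.com/doc/abc/1</Path>
  </Keywords>
  <Keywords value="keyword3">
    <Path rank="10000">http://www.foo.com/doc/abc/3</Path>
    <Path rank="9900">http://www.foo.com/doc/abc/2</Path>
    <Path rank="9800">http://www.foo.com/doc/abc/1</Path>
  </Keywords>
</boost>
"""


def invert_boost_config(xml_text):
    """Turn keyword -> [(path, rank)] into path -> [{"keyword", "rank"}]."""
    root = ET.fromstring(xml_text)
    by_path = defaultdict(list)
    for keywords in root.findall("Keywords"):
        keyword = keywords.get("value")
        for path in keywords.findall("Path"):
            by_path[path.text.strip()].append(
                {"keyword": keyword, "rank": int(path.get("rank"))}
            )
    return dict(by_path)


inverted = invert_boost_config(SAMPLE_BOOST_XML)
```

Each value in `inverted` is the list I would attach to the document with that Path.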
Then I add a nested field "Boost", which contains an array of keyword/rank objects, to each Elasticsearch document, as in the following example:
{
"Boost": [
{ "keyword":"keyword1", "rank": 10000},
{ "keyword":"keyword2", "rank": 9900},
{ "keyword":"keyword3", "rank": 9800}
],
"Path":"http://www.foo.com/doc/abc/1",
"Title":"Title 1",
"Description":"The description of doc 1",
...
}
{
"Boost": [
{ "keyword":"keyword2", "rank": 10000},
{ "keyword":"keyword3", "rank": 9900}
],
"Path":"http://www.foo.com/doc/abc/2",
"Title":"Title 2",
"Description":"The description of doc 2",
...
}
{
"Boost": [
{ "keyword":"keyword3", "rank": 10000}
],
"Path":"http://www.foo.com/doc/abc/3",
"Title":"Title 3",
"Description":"The description of doc 3",
...
}
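For reference, this is roughly the mapping I have in mind for the "Boost" field, written as a Python dict that could be sent as the index-creation body. It is only a sketch: it assumes the typeless mapping format of Elasticsearch 7 or later, the `keyword` and `integer` field types are my own choice, and the mappings for Path, Title and Description are left out on purpose.

```python
# Sketch of an index-creation body covering only the "Boost" field.
# "nested" keeps each keyword/rank pair together as its own hidden document,
# so a query on keyword2 cannot accidentally pick up keyword1's rank.
BOOST_MAPPING = {
    "mappings": {
        "properties": {
            "Boost": {
                "type": "nested",
                "properties": {
                    # Exact-match field: not analyzed, compared as a whole string.
                    "keyword": {"type": "keyword"},
                    # Numeric rank taken from the customer's boosting XML.
                    "rank": {"type": "integer"},
                },
            }
        }
    }
}
```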
Then, at query time, I use a nested query to get the rank value of each matched document for the given search keyword, and a score script to adjust the relevance score by this rank value.
Since the rank values from the boosting XML are much larger than a normal relevance score (generally less than 5 in our index), the documents configured in the boosting XML for the given keyword should end up with the top adjusted scores, ordered by rank.
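Here is a rough sketch of the query body I am considering, built as a Python dict. Please treat it as an assumption-laden draft that I have not run against a real cluster: it uses `function_score` with `field_value_factor` in place of a score script, it assumes "Boost" is mapped as a nested field with "Boost.keyword" as an exact-match keyword field, it assumes the user's whole query string must equal a configured keyword, and the searched fields Title and Description are just my guess. The helper name `build_boosted_query` is illustrative.

```python
def build_boosted_query(user_query):
    """Build a search body that adds the configured rank to the text score."""
    # Ordinary full-text relevance on the document's own fields.
    text_clause = {
        "multi_match": {"query": user_query, "fields": ["Title", "Description"]}
    }
    # For nested Boost entries whose keyword equals the query string,
    # replace the score with the stored rank; "max" keeps the best entry.
    boost_clause = {
        "nested": {
            "path": "Boost",
            "score_mode": "max",
            "query": {
                "function_score": {
                    "query": {"term": {"Boost.keyword": user_query}},
                    "field_value_factor": {"field": "Boost.rank"},
                    "boost_mode": "replace",
                }
            },
        }
    }
    # Scores of matching "should" clauses are summed, so a boosted document
    # scores roughly rank + relevance and sorts above unboosted ones.
    return {
        "query": {
            "bool": {
                "should": [text_clause, boost_clause],
                "minimum_should_match": 1,
            }
        }
    }


body = build_boosted_query("keyword3")
```

Because both clauses are in "should", a boosted document is returned even if its text does not match the query, which is what "regardless of relevance" seems to require.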
Do you think this is a good design for Elasticsearch? Can you suggest any better approaches?
Thanks in advance!