My requirement is this:

If I pass multiple words to a search as a list, ES should return the documents that match any subset of those words, along with the words that matched, so that I can tell which document matched which subset.

Suppose I need to search for words such as Football, Cricket, Tennis, Golf, etc. in three documents.

I am going to store the text of each file in its own document. The mapping for the "mydocuments" index looks like this:

{
  "mydocuments" : {
    "mappings" : {
      "docs" : {
        "properties" : {
          "file_content" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

First document:

{ _id: 1, file_content: "I love tennis and cricket"}

Second document:

{ _id: 2, file_content: "tennis and football are very popular"}

Third document:

{ _id: 3, file_content: "football and cricket are originated in england"}

I should be able to search a single file or multiple files for football, tennis, cricket, golf, and get back something like this:

    "hits": {
        "total" : 3,
        "hits" : [
            {
                "_index" : "twitter",
                "_type" : "tweet",
                "_id" : "1",
                "_source" : {
                    "file_content" : ["football","cricket"],
                    "postDate" : "2009-11-15T14:12:12"
                }
            },
            {
                "_index" : "twitter",
                "_type" : "tweet",
                "_id" : "2",
                "_source" : {
                    "file_content" : ["football","tennis"],
                    "postDate" : "2009-11-15T14:12:12"
                }
            }
        ]
    }

Or, when searching multiple files, an array of such search results.

Any idea how we can do this using Elasticsearch?

If this really cannot be done with Elasticsearch, I am ready to evaluate other options (native Lucene, Solr).

EDIT

My bad, I probably did not provide enough details. @Andrew, what I meant by "file" is the text content of a file, stored as a string field (full text) in a document in ES. Assume one file corresponds to one document, with the text content in a field called "file_content".

I think you need to think about your own application and see what ES can give you and what you can do to arrange the results in the way you want in your own application. {football: yes, cricket: no, tennis: yes, golf: no} is about your application and ES cannot give you something like this. ES gives you JSON and this JSON has a certain, well-determined structure. Please read the documentation about ES first, and then come up with a meaningful question about ES. – Andrei Stefan
@AndreiStefan: I do find the question interesting and regard it as a generic ES question. (Disclaimer: I’m a Solr user.) The text from the files is tokenized, and I’d say the question boils down to: If I give a number of words to ES, how can I find out which of these words (or tokens generated from the words) are in which document? I don’t think the original poster needs exactly the given JSON structure, but merely data that can be derived from the JSON structure returned by ES. – BlueM
@BlueM, first of all he talks about "files". In ES we talk about documents. How he/she went from "file" to "document": no mention of this. Secondly, there is no mapping; you assume "tokenized". Ok, tokenized how? Does he already have a mapping, and if so, where is it? Can the poster clarify the JSON statement I made? Nothing so far. Thirdly, FYI, SO has some guidelines on how to ask a question. That being said, I do expect the poster to show what he tried, what he has so far and what doesn't work. I gave him/her important advice about reading the docs. – Andrei Stefan
@AndreiStefan: Sure – there are a lot of details missing in the question, and for simplicity’s sake I’ve made certain assumptions about the setup. But still: I’d be interested in the answer :-) – BlueM
@BlueM Ok. Assuming so many things qualifies this as a "Too broad" type of question for SO. I will wait for updates to the post. Until then I will not attempt an answer. – Andrei Stefan

1 Answer

The closest thing you can get to what you want is highlighting, which marks the searched terms where they occur in each matching document.

Sample query:

{
  "query": {
    "match": {
      "file_content": "football tennis cricket golf"
    }
  },
  "highlight": {
    "fields": {"file_content":{}}
  }
}

Result:


       "hits": {
          "total": 3,
          "max_score": 0.027847305,
          "hits": [
             {
                "_index": "test_highlight",
                "_type": "docs",
                "_id": "1",
                "_score": 0.027847305,
                "_source": {
                   "file_content": "I love tennis and cricket"
                },
                "highlight": {
                   "file_content": [
                      "I love <em>tennis</em> and <em>cricket</em>"
                   ]
                }
             },
             {
                "_index": "test_highlight",
                "_type": "docs",
                "_id": "2",
                "_score": 0.023869118,
                "_source": {
                   "file_content": "tennis and football are very popular"
                },
                "highlight": {
                   "file_content": [
                      "<em>tennis</em> and <em>football</em> are very popular"
                   ]
                }
             },
             {
                "_index": "test_highlight",
                "_type": "docs",
                "_id": "3",
                "_score": 0.023869118,
                "_source": {
                   "file_content": "football and cricket are originated in england"
                },
                "highlight": {
                   "file_content": [
                      "<em>football</em> and <em>cricket</em> are originated in england"
                   ]
                }
             }
          ]
       }

As you can see, the terms that were found are highlighted (surrounded by <em> tags) in a separate highlight section of each hit.
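
To get from the highlighted fragments to a "which document matched which words" structure, you can post-process the response in your own application. The following is only a minimal sketch in Python, assuming the default <em> highlight tags and a response shaped like the one above. The helper names build_query and matched_terms are made up for this illustration and are not part of any Elasticsearch client, and the response dict is a hand-trimmed copy of the result shown earlier rather than live output.

```python
import re

# Matches the default highlighter tags, <em>...</em>.
EM_TAG = re.compile(r"<em>(.*?)</em>")


def build_query(words, field="file_content"):
    """Build a match query with highlighting for a list of words (hypothetical helper)."""
    return {
        "query": {"match": {field: " ".join(words)}},
        "highlight": {"fields": {field: {}}},
    }


def matched_terms(response, field="file_content"):
    """Map each hit's _id to the sorted, lowercased terms highlighted in `field`."""
    result = {}
    for hit in response["hits"]["hits"]:
        fragments = hit.get("highlight", {}).get(field, [])
        terms = {m.lower() for frag in fragments for m in EM_TAG.findall(frag)}
        result[hit["_id"]] = sorted(terms)
    return result


# Trimmed copy of the response shown above (only the keys the helper reads).
response = {
    "hits": {
        "hits": [
            {"_id": "1", "highlight": {"file_content": [
                "I love <em>tennis</em> and <em>cricket</em>"]}},
            {"_id": "2", "highlight": {"file_content": [
                "<em>tennis</em> and <em>football</em> are very popular"]}},
            {"_id": "3", "highlight": {"file_content": [
                "<em>football</em> and <em>cricket</em> are originated in england"]}},
        ]
    }
}

print(build_query(["football", "tennis", "cricket", "golf"]))
print(matched_terms(response))
# {'1': ['cricket', 'tennis'], '2': ['football', 'tennis'], '3': ['cricket', 'football']}
```

Two caveats: the highlighter returns the text as it appears in the document, so with an analyzer that stems or otherwise normalizes tokens the extracted strings may differ from the words you searched for. Also, for long fields the highlighter returns only a limited number of fragments by default, so some matched words may be missing unless you adjust the highlighter's number_of_fragments setting.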