12
votes

Scenario

I have HTML documents, say emails. I want to store them in Elasticsearch and search only the plain text of the HTML emails.

Problem

Elasticsearch would index all the HTML tags and attributes too, and I don't want that. I want the word span to match only when it appears as plain text, not as an HTML element. For example, <span>span</span> should be a hit, but <span>some other content</span> should not.

Question

Would you recommend storing both an HTML-stripped field and the raw HTML field in a document? Or should I store the HTML document on S3 and keep only a stripped version in Elasticsearch? Does that even make sense?

I honestly don't know what happens when Elasticsearch indexes an HTML document, but I imagine it also indexes the divs, spans, and all the attributes. Those are things I never search for. So any suggestion for solving this would be great!

What am I doing now?

Right now, before I store a document in ES, I check whether the index for the document type exists. If not, I create it with a given mapping. The mapping looks like this:

{
    "analysis": {
        "analyzer": {
            "htmlStripAnalyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": "standard",
                "char_filter": [
                    "html_strip"
                ]
            }
        }
    },
    "mappings": {
        "${type}": {
            "dynamic_templates": [
                {
                    "_metadata": {
                        "path_match": "_metadata.*",
                        "mapping": {
                            "type": "keyword"
                        }
                    }
                }
            ],
            "properties": {
                "_tags": {
                    "type": "nested",
                    "dynamic": true
                }
            }
        }
    }
}

Warning: Ignore the existing mappings. They have nothing to do with my intentions; they are just there.

I replace ${type} with the document type, say emails. What would it look like to tell ES not to index the HTML markup?

2
You probably want to define a static mapping to index only the relevant fields of your HTML documents, and probably a way to retrieve your document (the path, or an id allowing you to compute said path). Here is a little more on mapping from Elasticsearch: elastic.co/guide/en/elasticsearch/reference/current/… – Adonis
@asettouf I edited the question. You can now see what I'm doing. – AmazingTurtle
I see. Basically, dynamic should be set to false (or strict), and you will have to state explicitly which parts of the document you want indexed. That means you will have to parse the HTML and then construct a JSON request (yourself or through the ES API) to feed ES only the relevant part. I can write a small example if you are interested. – Adonis
I'd love to see an example, because I still don't get what you mean. Just to mention, I'm a bloody beginner with ES. – AmazingTurtle
I don't get it. What's the connection between the mapping you posted and your desired outcome? I don't see the analyzer used anywhere; instead I see a dynamic template. What's your intent? – Andrei Stefan

2 Answers

19
votes

A complete test case:

DELETE /test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "htmlStripAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "html": {
          "type": "text",
          "analyzer": "htmlStripAnalyzer"
        }
      }
    }
  }
}

POST /test/test/1
{
  "html": "<td><tr>span<td></tr>"
}
POST /test/test/2
{
  "html": "<span>whatever</span>"
}
POST /test/test/3
{
  "html": "<html> <body> <h1 style=\"font-family: Arial\">Test</h1> <span>More test</span> </body> </html>"
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "span"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "body"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "more"
    }
  }
}
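
To see why the three searches above return what they do, here is a rough Python sketch of the analysis chain (strip markup, split into words, lowercase). It is only an approximation built on the standard library, not Elasticsearch's own html_strip implementation, and the names approximate_tokens and search are hypothetical helpers made up for this illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def approximate_tokens(html):
    """Rough stand-in for html_strip + standard tokenizer + lowercase."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Splitting on word characters is a simplification of the standard tokenizer.
    return [token.lower() for token in re.findall(r"\w+", text)]


# The same three documents as in the test case.
DOCS = {
    1: "<td><tr>span<td></tr>",
    2: "<span>whatever</span>",
    3: '<html> <body> <h1 style="font-family: Arial">Test</h1> '
       "<span>More test</span> </body> </html>",
}


def search(term):
    """Returns the ids of the documents whose visible text contains the term."""
    return sorted(
        doc_id for doc_id, html in DOCS.items()
        if term.lower() in approximate_tokens(html)
    )


if __name__ == "__main__":
    print(search("span"))  # [1]  only the literal text "span" matches
    print(search("body"))  # []   tag names are never indexed
    print(search("more"))  # [3]  lowercasing makes "More" match "more"
```

In other words, document 2 is not a hit for span even though it contains a span element, which is the behavior the question asks for.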

Update for Elasticsearch >=7 (removal of types)

DELETE /test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "htmlStripAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "html": {
        "type": "text",
        "analyzer": "htmlStripAnalyzer"
      }
    }
  }
}

POST /test/_doc/1
{
  "html": "<td><tr>span<td></tr>"
}
POST /test/_doc/2
{
  "html": "<span>whatever</span>"
}
POST /test/_doc/3
{
  "html": "<html> <body> <h1 style=\"font-family: Arial\">Test</h1> <span>More test</span> </body> </html>"
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "span"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "body"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "more"
    }
  }
}

1
vote

By default, Elasticsearch dynamically adds new fields if it finds any during indexing. The dynamic mapping documentation says:

When Elasticsearch encounters a previously unknown field in a document, it uses dynamic mapping to determine the datatype for the field and automatically adds the new field to the type mapping.

To disable this behavior (see the doc for more details), the simplest option is to set dynamic to false (new fields are ignored: they are neither added to the mapping nor indexed) or to strict (an exception is thrown and the document is rejected). In that case, you need to explicitly write the mapping for the tags you wish to keep inside your _tags section, and pre-parse the HTML document so that you feed only the tags you are interested in to Elasticsearch.

So let's say you have the following HTML document:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>A simple example</title>
</head>
<body>
  <div>
    <p><span class="ref">A sentence I want to reference from this HTML document</span></p>
    <p><span class="">Something less important</span></p>
  </div>
</body>
</html>

The first thing you want is a static mapping inside Elasticsearch. I would do the following (assuming ref is a string; on Elasticsearch 5 and later the string type is replaced by text and keyword, so text is used here):

PUT html
{
  "mappings": {
    "test": {
      "dynamic": "strict",
      "properties": {
        "ref": {
          "type": "text"
        }
      }
    }
  }
}

Now if you try adding a document this way, it will succeed:

PUT html/test/1
{
  "ref": "A sentence I want to reference from this HTML document"
}

But this won't succeed, because some_field is not declared in the strict mapping:

PUT html/test/2
{
  "ref": "A sentence I want to reference from this HTML document",
  "some_field": "Some field"
}

Now the only thing remaining is to parse the HTML to retrieve the "ref" field and build the index request shown above (use whatever language you like: Java, Python, ...).
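
As a rough illustration of that parsing step, here is a minimal Python sketch using only the standard library. It assumes the text of interest always sits in a span whose class list contains ref, as in the sample document; RefExtractor and build_document are hypothetical names chosen for this example, and the resulting dictionary is just the JSON body you would send with the PUT request.

```python
import json
from html.parser import HTMLParser

SAMPLE_HTML = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>A simple example</title>
</head>
<body>
  <div>
    <p><span class="ref">A sentence I want to reference from this HTML document</span></p>
    <p><span class="">Something less important</span></p>
  </div>
</body>
</html>
"""


class RefExtractor(HTMLParser):
    """Collects the text of every <span> whose class list contains "ref"."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.refs = []
        self._depth = 0    # > 0 while inside a matching span (tracks nested spans)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "span":
                self._depth += 1
        elif tag == "span" and "ref" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.refs.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def build_document(html):
    """Builds the JSON body for the index request from the first ref span."""
    parser = RefExtractor()
    parser.feed(html)
    parser.close()
    if not parser.refs:
        raise ValueError('no <span class="ref"> element found')
    return {"ref": parser.refs[0]}


if __name__ == "__main__":
    # The printed JSON is what would go into the body of PUT html/test/1.
    print(json.dumps(build_document(SAMPLE_HTML), indent=2))
```

Because the body contains only the ref key, it passes the strict mapping; anything else found in the HTML is simply never sent to Elasticsearch.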

Edit: Actually, to store the HTML without indexing it, you simply need to set index to no on the field in your mapping (see here; from Elasticsearch 5 onward the parameter takes the boolean false instead of "no"):

"_tags": {
  "type": "nested",
  "dynamic": true,
  "index": "no"
}