Storing HTML Documents in Elasticsearch

Question

Scenario

I have HTML documents, say emails. I want to store them in Elasticsearch and search only the plain text of the HTML emails.

Problem

Elasticsearch would index all the HTML tags and attributes, too. I don't want that. I want a search for span to match only when span appears as plain text, not as an HTML element. For example, <span>span</span> should be a hit, but <span>some other content</span> should not.

Question

Would you recommend storing both an HTML-stripped field and the raw HTML field in a document? Or should I store the HTML document on S3 and keep only a stripped version in Elasticsearch? Does that even make sense?

I honestly don't know what happens when Elasticsearch indexes an HTML document, but I imagine it also indexes divs, spans, and all the attributes. Those are things I never search for. So any suggestion for solving this problem would be great!

What am I doing now?

Right now, before I store a document in ES, I check whether the index exists for the document type. If not, I create the index with a given mapping. The mapping looks like this:

{
    "analysis": {
        "analyzer": {
            "htmlStripAnalyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": "standard",
                "char_filter": [
                    "html_strip"
                ]
            }
        }
    },
    "mappings": {
        "${type}": {
            "dynamic_templates": [
                {
                    "_metadata": {
                        "path_match": "_metadata.*",
                        "mapping": {
                            "type": "keyword"
                        }
                    }
                }
            ],
            "properties": {
                "_tags": {
                    "type": "nested",
                    "dynamic": true
                }
            }
        }
    }
}

Warning: Ignore the existing mappings. They have nothing to do with my intentions; they are just there.

I am replacing ${type} with the document type, let's say emails. What would it look like to tell ES not to index the HTML markup?

You probably want to define a static mapping to index only the relevant fields regarding your HTML documents, and probably a way to retrieve your document (the path, or an id allowing you to compute the said path). Here is a little more on mapping from Elasticsearch: elastic.co/guide/en/elasticsearch/reference/current/… — Adonis
@asettouf I edited the question. You may see what I'm doing now. — AmazingTurtle
I see, basically dynamic should be set to false (or strict), and you will have to provide explicitly what interests you to index in the document. That means that you will have to parse the HTML, and then construct a JSON query (yourself or through the ES API) to feed ES only the relevant part. I can write a small example if you are interested — Adonis
I'd love to see an example - because I still don't get what you mean. Just to mention, I'm a bloody beginner with ES. — AmazingTurtle
I don't get it. What's the connection between the mapping you posted and your desired outcome? I don't see the analyzer used anywhere, instead I see a dynamic template. What's your intent? — Andrei Stefan

Andrei Stefan · Accepted Answer · 2017-04-10T13:30:28

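A complete test case, using a custom analyzer whose html_strip character filter removes the markup before the text is tokenized: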

DELETE /test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "htmlStripAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "html": {
          "type": "text",
          "analyzer": "htmlStripAnalyzer"
        }
      }
    }
  }
}

POST /test/test/1
{
  "html": "<td><tr>span<td></tr>"
}
POST /test/test/2
{
  "html": "<span>whatever</span>"
}
POST /test/test/3
{
  "html": "<html> <body> <h1 style=\"font-family: Arial\">Test</h1> <span>More test</span> </body> </html>"
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "span"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "body"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "more"
    }
  }
}
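
To check what the analyzer actually puts into the index, the _analyze API can be run against the test index. This is a sketch rather than part of the original test case; the sample text is taken from document 3, and the request does not depend on whether the index was created with or without a mapping type.

```
# Ask the index to run htmlStripAnalyzer on a sample string
POST /test/_analyze
{
  "analyzer": "htmlStripAnalyzer",
  "text": "<span>More test</span>"
}
```

The response should list only the tokens more and test; the span tag is removed by html_strip before the tokenizer runs. That is why, in the three searches of the test case, span is expected to match only document 1 (where span is text content), body to match nothing, and more to match document 3.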

Update for Elasticsearch >=7 (removal of types)

DELETE /test
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "htmlStripAnalyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"],
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "html": {
        "type": "text",
        "analyzer": "htmlStripAnalyzer"
      }
    }
  }
}

POST /test/_doc/1
{
  "html": "<td><tr>span<td></tr>"
}
POST /test/_doc/2
{
  "html": "<span>whatever</span>"
}
POST /test/_doc/3
{
  "html": "<html> <body> <h1 style=\"font-family: Arial\">Test</h1> <span>More test</span> </body> </html>"
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "span"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "body"
    }
  }
}

POST /test/_search
{
  "query": {
    "match": {
      "html": "more"
    }
  }
}
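
On the question of keeping a separate stripped copy: analysis only affects the tokens written to the inverted index, while _source keeps the JSON exactly as it was submitted. A quick way to confirm this, assuming the Elasticsearch >=7 test index above with document 3 indexed, is to fetch the document back:

```
# Fetch document 3; the analyzer does not modify the stored _source
GET /test/_doc/3
```

The html value under _source should still contain the original markup, including the h1 and span tags, so a single field can serve both search on the plain text and retrieval of the raw HTML. Whether large HTML bodies belong in external storage such as S3 is a separate sizing decision that this test does not address.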