Scenario
I have HTML documents, say emails. I want to store them in Elasticsearch and search only the plain text of those HTML emails.
Problem
Elasticsearch would index all the HTML tags and attributes, too. I don't want that. I want a search for span to match only when it occurs as plain text, not as an HTML element. For example, <span>span</span> should be a hit, but <span>some other content</span> should not.
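The concern can be illustrated outside Elasticsearch with a rough sketch. The function `approximate_tokens` below is a hypothetical stand-in for a word tokenizer with lowercasing; it is not Elasticsearch's analyzer, only an approximation of how tag names end up as searchable terms when raw HTML is tokenized.

```python
import re

def approximate_tokens(text):
    """Rough stand-in for a word tokenizer plus lowercasing (not Elasticsearch itself)."""
    return re.findall(r"\w+", text.lower())

wanted_hit = "<span>span</span>"
unwanted_hit = "<span>some other content</span>"

# Tag names survive tokenization of raw HTML, so both documents
# contain the term "span" and both would match a search for it.
print(approximate_tokens(wanted_hit))
print(approximate_tokens(unwanted_hit))
```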
Question
Would you recommend storing both an HTML-stripped field and the raw HTML field in the same document? Or should I store the HTML document on S3 and keep only a stripped version in Elasticsearch? Does that even make sense?
I honestly don't know what happens when Elasticsearch indexes an HTML document, but I imagine it will also index the divs, the spans, and all their attributes. These are things I never search for. So any suggestion for solving this problem would be great!
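As a minimal sketch of the first option (one document carrying both the raw HTML and a stripped version), the text could be extracted on the client before indexing. The field names `body_html` and `body_text` and the helper `build_email_document` are hypothetical; only the Python standard library is used.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes; ignores tags, attributes, and script/style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with spaces and collapse runs of whitespace.
        # Note: a word split by an inline tag ("wo<b>rd</b>") becomes two words.
        return " ".join(" ".join(self._chunks).split())

def build_email_document(html):
    """Return a document with the raw HTML and its plain text side by side."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return {"body_html": html, "body_text": parser.text()}

print(build_email_document("<span>some other content</span>"))
```

Searches would then target only `body_text`. Whether the raw HTML stays in the document or moves to S3 is a storage decision; it does not change what the text field matches.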
What am I doing now?
Right now, before I store a document in ES, I check whether the index for the document type exists. If not, I create it with a given mapping. The mapping looks like this:
{
  "analysis": {
    "analyzer": {
      "htmlStripAnalyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": "standard",
        "char_filter": [
          "html_strip"
        ]
      }
    }
  },
  "mappings": {
    "${type}": {
      "dynamic_templates": [
        {
          "_metadata": {
            "path_match": "_metadata.*",
            "mapping": {
              "type": "keyword"
            }
          }
        }
      ],
      "properties": {
        "_tags": {
          "type": "nested",
          "dynamic": true
        }
      }
    }
  }
}
Warning: ignore the existing mappings. They have nothing to do with my question; they are just there.
I am replacing ${type} with the document type, let's say emails.
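One thing worth noting about the mapping in the question: it defines htmlStripAnalyzer, but no field refers to it, and an analyzer only takes effect on fields that name it. Below is a hedged sketch of an index body that attaches it to a text field. The field name `body`, the helper `build_index_body`, the `lowercase` token filter (swapped in for `standard`), and the nesting under `settings` are assumptions of this sketch, not part of the original setup. The typed layout follows the question; Elasticsearch versions without mapping types would place `properties` directly under `mappings`.

```python
import json

def build_index_body(doc_type, html_field="body"):
    """Build an index-creation body whose HTML field is analyzed with html_strip."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "htmlStripAnalyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],  # assumption, see lead-in
                        "char_filter": ["html_strip"],
                    }
                }
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    # Hypothetical field holding the raw HTML of the email.
                    html_field: {"type": "text", "analyzer": "htmlStripAnalyzer"}
                }
            }
        },
    }

print(json.dumps(build_index_body("emails"), indent=2))
```

With a mapping along these lines, the stored source would still contain the original HTML, while the indexed terms for that field come from the text after the character filter has removed the markup.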
What would it look like to tell ES not to index the HTML markup?
dynamic should be set to false (or strict), and you will have to declare explicitly which parts of the document you want indexed. That means you will have to parse the HTML and then construct a JSON document (yourself or through the ES API) that feeds ES only the relevant part. I can write a small example if you are interested. – Adonis
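A small sketch of what the comment describes might look like the following. With `dynamic` set to `strict`, Elasticsearch rejects documents that carry undeclared fields, so the client has to send only what the mapping declares. The field names `subject` and `body_text` and the helper `keep_declared_fields` are hypothetical, and the HTML is assumed to have been parsed into plain text already.

```python
# Explicit mapping: only the declared fields are accepted and indexed.
EMAIL_MAPPING = {
    "dynamic": "strict",
    "properties": {
        "subject": {"type": "text"},
        "body_text": {"type": "text"},
    },
}

def keep_declared_fields(document, mapping=EMAIL_MAPPING):
    """Drop every top-level field that the mapping does not declare."""
    declared = mapping["properties"]
    return {key: value for key, value in document.items() if key in declared}

parsed_email = {
    "subject": "Hi",
    "body_text": "span",
    "body_html": "<span>span</span>",  # raw markup, not declared in the mapping
}
print(keep_declared_fields(parsed_email))
```

With `dynamic` set to false instead, undeclared fields would be accepted and kept in the stored source but not indexed, so the filtering step becomes optional.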