I have been trying to figure out the best way to use actual regex patterns within an Elasticsearch 5.4 query. After reading about how the standard analyzer tokenizes each string field, I switched to the not-analyzed sub-field defined in my mappings (the standard .raw property). I have tried two variants of the same query; neither has been successful.

Query String filter:

GET /test-*/_search
{
"query": {
  "bool": {
    "must": [
      {
          "query_string":{
            "query": "URL.raw:/^(http|https)\\:\/\/.+(wp-content|wp-admin)/"
          }  
      }
    ]
  }
},
"sort": {
  "@timestamp": {
    "order": "desc"
  }
 }
}

Regexp filter:

GET /test-*/_search
{
 "query": {
  "bool": {
    "must": [
      {
        "regexp": {
          "URL.raw":{
            "value": "/^(http|https)\\:\/\/.+(wp-content|wp-admin)/"
          }
        }
      }
    ]
  }
 },
 "sort": {
  "@timestamp": {
    "order": "desc"
  }
 }
}

Both yield either no results or a parse exception:

{
  "error": {
    "root_cause": [
      {
        "type": "parse_exception",
        "reason": "parse_exception: Encountered \" \"^\" \"^ \"\" at line 1, column 8.\nWas expecting one of:\n    <BAREOPER> ...\n    \"(\" ...\n    \"*\" ...\n    <QUOTED> ...\n    <TERM> ...\n    <PREFIXTERM> ...\n    <WILDTERM> ...\n    <REGEXPTERM> ...\n    \"[\" ...\n    \"{\" ...\n    <NUMBER> ...\n    "
      },

Does Lucene require special escaping, or are certain characters disallowed? Any help or pointers would be much appreciated. Thanks!

Lucene regexps are anchored by default and ^ / $ are not special there. You do not need / regex delimiters and you do not need to escape /. Try the regexp_filter with "https?://.*wp-(content|admin).*" - Wiktor Stribiżew

1 Answer

Lucene regexps are anchored by default: the pattern must match the entire field value, and ^ / $ are not anchor operators there.

The regexp query takes the pattern as a plain string, so you do not need / regex delimiters, and you do not need to escape / either.

Use the following pattern:

"value": "https?://.*wp-(content|admin).*"

Note that I modified the groups a bit to make the pattern more linear and efficient.

Details:

  • https?:// - string starts with https:// or http://
  • .* - then has any 0+ chars
  • wp- - a wp- substring
  • (content|admin) - either content or admin substring
  • .* - then has any 0+ chars.
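
Putting it together, here is a minimal sketch of the regexp query from the question with the pattern above swapped in. The index pattern test-*, the URL.raw field and the @timestamp sort are all taken from the question. It assumes URL.raw really is indexed as a single untokenized term. Send this body to GET /test-*/_search:

```json
{
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "URL.raw": {
              "value": "https?://.*wp-(content|admin).*"
            }
          }
        }
      ]
    }
  },
  "sort": {
    "@timestamp": {
      "order": "desc"
    }
  }
}
```

Because the match is anchored at both ends, the trailing .* is what allows the URL to continue after wp-content or wp-admin; without it, only values ending exactly there would match.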