I'm using Elasticsearch 6.8, and trying to write a query in python notebook. Here is a mapping used for the index i'm working with:
{ "mapping": { "news": { "properties": { "dateCreated": { "type": "date", "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis" }, "itemId": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "market": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "timeWindow": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "title": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } } }
I'm trying to search for exact string like "[2020-08-16 10:00:00.0,2020-08-16 11:00:00.0]" in "timeWindow" field (which is a "text" type, not a "date" field), and also select by market="en-us" (market is a "text" field too). This string has spaces,colons,commas, a lot of whitecharacters, and I don't know how to make a right query.
At the moment I have this query:
res = es.search(index='my_index',
doc_type='news',
body={
'size': size,
'query':{
"bool":{
"must":[{
"simple_query_string": {
"query": "[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]",
"default_operator": "and",
"minimum_should_match":"100%"
}
},
{"match":{"market":"en-us"}}
]
}
}
})
The problem is that is doesn't match my "simple_query_string" for timeWindow string exactly (I understand that this string gets tokenized, splitted into parts like "2020","08","17","00","01", etc, and each token is analyzed separately), and I'm getting different values for timeWindow that I want to exclude, like
['[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]'
'[2020-08-17 00:05:00.0,2020-08-17 01:05:00.0]'
...
'[2020-08-17 00:50:00.0,2020-08-17 01:50:00.0]'
'[2020-08-17 00:55:00.0,2020-08-17 01:55:00.0]'
'[2020-08-17 01:00:00.0,2020-08-17 02:00:00.0]']
Is there a way to do what I want?
UPD (and answer): My current query uses "term" and "timeWindow.keyword", this combination allows me to do exact search for string with spaces and other whitecharacters:
res = es.search(index='msn_click_events', doc_type='news', body={
'size': size,
'query':{
"bool":{
"must":[{
"term": {
"timeWindow.keyword": tw
}
},
{"match":{"market":"en-us"}}
]
}
}
})
And this query selects only right timewindows values (string):
['[2020-08-17 00:00:00.0,2020-08-17 01:00:00.0]'
'[2020-08-17 01:00:00.0,2020-08-17 02:00:00.0]'
'[2020-08-17 02:00:00.0,2020-08-17 03:00:00.0]'
...
'[2020-08-17 22:00:00.0,2020-08-17 23:00:00.0]'
'[2020-08-17 23:00:00.0,2020-08-18 00:00:00.0]']