2
votes

I'd like to perform a DocumentDB query that looks like SELECT * FROM c where c.teams IN (@teamsList) AND CONTAINS(c.text, "some string")

The issue is the query above is computationally intensive and nearly exceeds our S3 collection limit (this query took 2400RU's and our data set is growing quickly, we will hit the scanning limit for contains soon).

I'm aware that Azure Search is a more efficient way or search indexable fields. My question is how do I efficiently merge the results of Azure Search with other query fields, in my example, restricting it by team list. We are interested in exposing a "query builder" (Similar example available here) where CONTAINS is a permitted operand on any field.

1
If a given query times out for exceeding your RU budget, it'll return with a continuation token that you can use to go back again for more data. Does that help? While Azure Search would be more efficient on your CONTAINS clause, I'm not sure if would be any better/worse on the IN clause. You might want to benchmark both for your dataset.Larry Maccherone
Regarding the first DocumentDB query, do you know how many documents match the query filter in the query? The IN clause will be filtered via the index. If you could post the activity ID, we can investigate why it takes 2400 RUs. Also if you need to return only a page of results for your query builder, you might be able to reduce the RUs significantly by limiting the number of results shown by default.Aravind Krishna R.
The query SELECT TOP 100 * FROM c where CONTAINS(c.details.title, "test") returns 13 results and consumes 2409.98 RUs in the Azure Portal. The activity id: 6d966ad4-a196-4255-a10b-334a753a89e0Valchris
@LarryMaccherone that might work, but the customer experience would be severely degraded as the response time would be very high correct?Valchris
Something else is going on. Only 13 results for 2400+ RUs is crazy. Let Aravind investigate why. Only then, should you consider what I'm proposing. If you only need the TOP n, then using TOP n is fine, but if you actually need them all but you just want to display the first page, then set maxItemCount to your page size plus some buffer. Then cache the entire collection client side, only displaying a page at a time.Larry Maccherone

1 Answers

0
votes

If you want to use DocumentDB for CONTAINS word searches and avoid scans (and not use Azure Search), you can do the following:

  1. You tokenize text into an array of words. You can do with with an off-the-shelf tokenizer like Lucene.NET. Let's say text is "This is a question"
  2. Store the words as an array like text_tokens. The content of text_words is ["this", "is", "question"] (canonicalized to lower case, and removed stop-words)
  3. Query the values in text_tokens using ARRAY_CONTAINS(c.text_tokens, "word"). This will use the index.