Counting string occurrences with ArangoDB AQL

Question

To count the number of objects containing a specific attribute value I can do something like:

FOR t IN thing
  COLLECT other = t.name = "Other" WITH COUNT INTO otherCount
  FILTER other != false
  RETURN otherCount

But how can I count three other occurrences within the same query, without resulting to subqueries running through the same dataset multiple times?

I've tried something like:

FOR t IN thing
  COLLECT 
    other = t.name = "Other",
    some = t.name = "Some",
    thing = t.name = "Thing"
  WITH COUNT INTO count
  RETURN {
   other, some, thing,
   count
  }

But I can't make sense of the results: I must be approaching this in the wrong way?

Do you actually want to count how often certain phrases occur in a larger string value of a single attribute? Your query doesn't look like it would do anything like it, I'm actually surprised that it is not invalid (i.e. raising a syntax error because of x = y = z) — CodeManX
Yes, I'd like to count how often different phrases occur as substrings of a single attribute. If exact values would suffice, then it would be enough to do somethinge like: FOR t IN thing COLLECT name = t.name WITH COUNT INTO count RETURN { name, count }. — Bjorn Thor Jonsson

CodeManX CodeManX · Accepted Answer · 2020-01-16T00:48:58

Split and count

You could split the string by the phrase and subtract 1 from the count. This works for any substring, which on the other hand means it does not respect word boundaries.

LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(SPLIT(t.name, "Some"))-1
  LET Other = LENGTH(SPLIT(t.name, "Other"))-1
  LET Thing = LENGTH(SPLIT(t.name, "Thing"))-1
  RETURN {
   Some, Other, Thing
}

Result:

[
  {
    "Some": 3,
    "Other": 2,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]

You can use SPLIT(LOWER(t.name), LOWER("...")) to make it case-insensitive.

COLLECT words

The TOKENS() function can be utilized to split the input into word arrays, which can then be grouped and counted. Note that I changed the input slightly. An input "SomeSome" will not be counted because "somesome" != "some" (this variant is word and not substring based).

LET things = [
    {name: "Here are SOME some and Some Other Things. More Other!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]
LET whitelist = TOKENS("Some Other Things", "text_en")

FOR t IN things
  LET whitelisted = (FOR w IN TOKENS(t.name, "text_en") FILTER w IN whitelist RETURN w)
  LET counts = MERGE(FOR w IN whitelisted
    COLLECT word = w WITH COUNT INTO count
    RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    some: counts.some || 0,
    other: counts.other || 0,
    things: counts.things ||0
  }

Result:

[
  {
    "name": "Here are SOME some and Some Other Things. More Other!",
    "some": 3,
    "other": 2,
    "things": 0
  },
  {
    "name": "There are no such substrings in here.",
    "some": 0,
    "other": 0,
    "things": 0
  },
  {
    "name": "some-Other-here-though!",
    "some": 1,
    "other": 1,
    "things": 0
  }
]

This does use a subquery for the COLLECT, otherwise it would count the total number of occurrences for the entire input.

The whitelist step is not strictly necessary, you could also let it count all words. For larger input strings it might save some memory to not do this for words you are not interested in anyway.

You might want to create a separate Analyzer with stemming disabled for the language if you want to match the words precisely. You can also turn off normalization ("accent": true, "case": "none"). An alternative would be to use REGEX_SPLIT() for typical whitespace and punctuation characters for a simpler tokenization, but that depends on your use case.

Other solutions

I don't think that it's possible to count each input object independently with COLLECT without subquery, unless you want a total count.

Splitting is a bit of a hack, but you could substitute SPLIT() by REGEX_SPLIT() and wrap the search phrases in \b to only match if word boundaries are on both sides. Then it should only match words (more or less):

LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bSome\\b"))-1
  LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bOther\\b"))-1
  LET Thing = LENGTH(REGEX_SPLIT(t.name, "\\bThings\\b"))-1
  RETURN {
   Some, Other, Thing
}

Result:

[
  {
    "Some": 1,
    "Other": 1,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]

A more elegant solution would be to utilize ArangoSearch for word counting, but it doesn't have a feature to let you retrieve how often a word occurs. It might keep track of that already internally (Analyzer feature "frequency"), but it's definitely not exposed at this point in time.

Counting string occurrences with ArangoDB AQL

1 Answers