Split and count
You could split the string by the phrase and subtract 1 from the count. This works for any substring, which on the other hand means it does not respect word boundaries.
LET things = [
{name: "Here are SomeSome and Some Other Things, brOther!"},
{name: "There are no such substrings in here."},
{name: "some-Other-here-though!"}
]
FOR t IN things
LET Some = LENGTH(SPLIT(t.name, "Some"))-1
LET Other = LENGTH(SPLIT(t.name, "Other"))-1
LET Thing = LENGTH(SPLIT(t.name, "Thing"))-1
RETURN {
Some, Other, Thing
}
Result:
[
{
"Some": 3,
"Other": 2,
"Thing": 1
},
{
"Some": 0,
"Other": 0,
"Thing": 0
},
{
"Some": 0,
"Other": 1,
"Thing": 0
}
]
You can use SPLIT(LOWER(t.name), LOWER("...")) to make it case-insensitive.
COLLECT words
The TOKENS() function can be utilized to split the input into word arrays, which can then be grouped and counted. Note that I changed the input slightly. An input "SomeSome" will not be counted because "somesome" != "some" (this variant is word and not substring based).
LET things = [
{name: "Here are SOME some and Some Other Things. More Other!"},
{name: "There are no such substrings in here."},
{name: "some-Other-here-though!"}
]
LET whitelist = TOKENS("Some Other Things", "text_en")
FOR t IN things
LET whitelisted = (FOR w IN TOKENS(t.name, "text_en") FILTER w IN whitelist RETURN w)
LET counts = MERGE(FOR w IN whitelisted
COLLECT word = w WITH COUNT INTO count
RETURN { [word]: count }
)
RETURN {
name: t.name,
some: counts.some || 0,
other: counts.other || 0,
things: counts.things ||0
}
Result:
[
{
"name": "Here are SOME some and Some Other Things. More Other!",
"some": 3,
"other": 2,
"things": 0
},
{
"name": "There are no such substrings in here.",
"some": 0,
"other": 0,
"things": 0
},
{
"name": "some-Other-here-though!",
"some": 1,
"other": 1,
"things": 0
}
]
This does use a subquery for the COLLECT, otherwise it would count the total number of occurrences for the entire input.
The whitelist step is not strictly necessary, you could also let it count all words. For larger input strings it might save some memory to not do this for words you are not interested in anyway.
You might want to create a separate Analyzer with stemming disabled for the language if you want to match the words precisely. You can also turn off normalization ("accent": true, "case": "none"). An alternative would be to use REGEX_SPLIT() for typical whitespace and punctuation characters for a simpler tokenization, but that depends on your use case.
Other solutions
I don't think that it's possible to count each input object independently with COLLECT without subquery, unless you want a total count.
Splitting is a bit of a hack, but you could substitute SPLIT() by REGEX_SPLIT() and wrap the search phrases in \b to only match if word boundaries are on both sides. Then it should only match words (more or less):
LET things = [
{name: "Here are SomeSome and Some Other Things, brOther!"},
{name: "There are no such substrings in here."},
{name: "some-Other-here-though!"}
]
FOR t IN things
LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bSome\\b"))-1
LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bOther\\b"))-1
LET Thing = LENGTH(REGEX_SPLIT(t.name, "\\bThings\\b"))-1
RETURN {
Some, Other, Thing
}
Result:
[
{
"Some": 1,
"Other": 1,
"Thing": 1
},
{
"Some": 0,
"Other": 0,
"Thing": 0
},
{
"Some": 0,
"Other": 1,
"Thing": 0
}
]
A more elegant solution would be to utilize ArangoSearch for word counting, but it doesn't have a feature to let you retrieve how often a word occurs. It might keep track of that already internally (Analyzer feature "frequency"), but it's definitely not exposed at this point in time.
x = y = z) - CodeManXFOR t IN thing COLLECT name = t.name WITH COUNT INTO count RETURN { name, count }. - Bjorn Thor Jonsson