You're asking for word counts "for any given document URI". But you are assuming that the solution involves indexes or lexicons, and that's not necessarily a good assumption. If you want something document-specific from a document-oriented database, it's often best to work on the document directly.
So let's focus on an efficient word-count solution for a single document, and go from there. OK?
Here's how we could get word counts for a single element, including any children. This could be the root of your document: doc($uri)/*.
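declare function local:word-count($root as element())
as map:map
{
  let $m := map:map()
  (: Keep only word tokens, skipping punctuation and whitespace,
     then bump the count for each word. cts:tokenize takes one string,
     so with several text nodes this relies on 1.0-ml function mapping. :)
  let $_ := cts:tokenize(
    $root//text())[. instance of cts:word]
    ! map:put($m, ., 1 + (map:get($m, .), 0)[1])
  return $m
};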
This produces a map, which I find more flexible than flat text. Each key is a word, and the value is the count. The variable $doc already contains your sample XML.
let $m := local:word-count($doc)
for $k in map:keys($m)
return text { $k, map:get($m, $k) }
inside 1
This 2
is 2
paragraph 1
highlighted 1
EXAMPLE 1
header 1
are 1
word 1
words 1
these 1
tag 1
And 1
a 2
Note that the order of the map keys is indeterminate. Add an order by clause if you like.
let $m := local:word-count($doc)
for $k in map:keys($m)
let $v := map:get($m, $k)
order by $v descending
return text { $k, $v }
If you want to query the entire database, Geert's solution using cts:words might look pretty good. It uses a lexicon for the word list, and some index lookups for word matching. But it will end up walking the XML for every matching document for every word-lexicon word: O(nm). To do that properly the code will have to do work similar to what local:word-count does, but for one word at a time. Many words will match the same documents: 'the' might be in A and B, and 'then' might also be in A and B. Despite using lexicons and indexes, usually this approach will be slower than simply applying local:word-count to the whole database.
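To make that cost concrete, here is a rough sketch of what a lexicon-driven count might look like. This is my own reconstruction for illustration, not necessarily Geert's code, and it assumes the database has the word lexicon and word searches enabled. The index finds the matching documents, but the per-word count still has to come from walking the text of each match.

```xquery
xquery version "1.0-ml";
(: Sketch only. Assumes the word lexicon and word searches are enabled. :)
for $word in cts:words()
(: Index lookup: which documents contain this word? :)
let $docs := cts:search(collection(), cts:word-query($word, "exact"))
(: No index holds occurrence counts, so tokenize every matching document. :)
let $count := count(
  $docs//text()
  ! cts:tokenize(.)[. instance of cts:word][string(.) eq $word])
order by $count descending
return text { $word, $count }
```

Every document is tokenized again for each distinct word it contains, which is where the O(nm) behavior comes from.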
If you want to query the entire database and are willing to change the XML, you could wrap every word in a word element (or whatever element name you prefer). Then create an element range index of type string on word. Now you can use cts:values and cts:frequency to pull the answer directly from the range index. This will be O(n) with a much lower cost than the cts:words approach, and probably faster than local:word-count, because it won't visit any documents at all. But the resulting XML is pretty clumsy.
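A minimal sketch of that range-index query might look like this. It assumes the hypothetical word element described above, with no namespace, and a string element range index on it whose collation matches the default. Note the item-frequency option: by default cts:frequency reports how many fragments contain a value, not how many times the value occurs.

```xquery
xquery version "1.0-ml";
(: Sketch only. Assumes a string element range index on <word>. :)
for $w in cts:values(
  cts:element-reference(xs:QName("word")),
  (),
  (: item-frequency counts occurrences; frequency-order sorts by count. :)
  ("item-frequency", "frequency-order"))
return text { $w, cts:frequency($w) }
```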
Let's go back and apply local:word-count to the whole database. Start by tweaking the code so that the caller supplies the map. That way we can build up a single map that has word counts for the whole database, and we only look at each document once.
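declare function local:word-count(
  $m as map:map,
  $root as element())
as map:map
{
  (: Same counting as before, but into the caller's map. :)
  let $_ := cts:tokenize(
    $root//text())[. instance of cts:word]
    ! map:put($m, ., 1 + (map:get($m, .), 0)[1])
  return $m
};
(: collection()/* is a sequence of root elements. With function mapping,
   the 1.0-ml default, local:word-count runs once per element. :)
let $m := map:map()
let $_ := local:word-count($m, collection()/*)
for $k in map:keys($m)
let $v := map:get($m, $k)
order by $v descending
return text { $k, $v }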
On my laptop this processed 151 documents in less than 100 ms. There were about 8,100 words and 925 distinct words. Getting the same results from cts:words and cts:search took just under 1 second. So local:word-count is more efficient, and probably efficient enough for this job.
Now that you can build a word-count map efficiently, what if you could save it? In essence, you'd build your own "index" of word counts. This is easy, because maps have an XML serialization.
(: Construct a map. :)
map:map()
(: The document constructor creates a document-node with XML inside. :)
! document { . }
(: Construct a map from the XML root element. :)
! map:map(*)
So you could call local:word-count on each new XML document as it's inserted or updated. Then store the word-count map in the document's properties. Do this using a CPF pipeline, or using your own code via RecordLoader, or in a REST upload endpoint, etc.
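As a rough sketch, the storage step could look something like this. The helper name local:save-word-counts, the word-counts wrapper element, and the example URI are all my own placeholders. The single-argument local:word-count is repeated so the module stands alone.

```xquery
xquery version "1.0-ml";
declare function local:word-count($root as element())
as map:map
{
  let $m := map:map()
  let $_ := $root//text()
    ! cts:tokenize(.)[. instance of cts:word]
    ! map:put($m, ., 1 + (map:get($m, .), 0)[1])
  return $m
};

(: Hypothetical helper: count the words in one document and store the
   serialized map in that document's properties. :)
declare function local:save-word-counts($uri as xs:string)
as empty-sequence()
{
  let $m := local:word-count(doc($uri)/*)
  return xdmp:document-set-property(
    $uri,
    (: document { $m }/* is the map's XML serialization. :)
    element word-counts { document { $m }/* })
};

(: Placeholder URI. Call this wherever documents are inserted or updated. :)
local:save-word-counts("/example.xml")
```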
When you want word counts for a single document, that's just a call to xdmp:document-properties or xdmp:document-get-properties, then call the map:map constructor on the right XML. If you want word counts for multiple documents, you can easily write XQuery to merge those maps into a single result.
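Here is one way the read-and-merge step might be sketched. The function name local:merge-counts and the URIs are placeholders of mine, and the code assumes each document's properties contain exactly one serialized word-count map.

```xquery
xquery version "1.0-ml";
(: Hypothetical helper: sum the counts from several maps into a new map. :)
declare function local:merge-counts($maps as map:map*)
as map:map
{
  let $total := map:map()
  let $_ :=
    for $m in $maps, $k in map:keys($m)
    return map:put(
      $total, $k, map:get($m, $k) + (map:get($total, $k), 0)[1])
  return $total
};

(: Placeholder URIs. Rebuild each stored map from its XML, then merge. :)
let $uris := ("/a.xml", "/b.xml")
let $maps := xdmp:document-properties($uris)//map:map ! map:map(.)
let $total := local:merge-counts($maps)
for $k in map:keys($total)
let $v := map:get($total, $k)
order by $v descending
return text { $k, $v }
```

This reads only properties fragments, so the documents themselves are never tokenized at query time.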