2
votes

I have 50 million documents in my MarkLogic database. I'd like to analyze the content to find out what the main categories of documents are.

Each of my documents is in a specific directory (e.g. "/books/") and belongs to a specific collection ("/type/books").

I'd like to generate a CSV with two columns: name_of_the_collection;count_distinct_value

Example:

Collection;count
books;437438
cars;46565
cats;457373

And the same for directories:

directory;count
/animals/cats/;437438
/animals/dogs;46565
/animals/cow;457373

I tried listing all the distinct directories/collections and counting the number of documents, but I was not able to combine the two.

Could you please help me?

Thanks, Romain.


2 Answers

7
votes

Given the name of a collection, xdmp:estimate(cts:search(doc(), cts:collection-query($collection))) will give you a count of the number of documents in that collection. Similarly, use cts:directory-query($directory) for a directory.
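As a minimal sketch of the two estimates above: the collection and directory names are placeholder values taken from the question, and the "infinity" depth is an assumption (use "1" to count only the direct children of the directory).

```xquery
xquery version "1.0-ml";

(: Placeholder values taken from the question; substitute your own. :)
let $collection := "/type/books"
let $directory := "/books/"
return (
  (: Index-based count of the documents in the collection :)
  $collection || ";" ||
    xdmp:estimate(cts:search(fn:doc(), cts:collection-query($collection))),
  (: "infinity" also counts documents in subdirectories (assumed intent) :)
  $directory || ";" ||
    xdmp:estimate(cts:search(fn:doc(), cts:directory-query($directory, "infinity")))
)
```

Each item in the result is one "name;count" line, matching the CSV layout asked for in the question.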

If you have the collection lexicon enabled you can get all the collection counts directly: cts:collections()!text{.||";"||cts:count(.)}
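Spelled out with a header row, the lexicon-based approach might look like the sketch below. It swaps in cts:frequency, which reports how often a value returned by a lexicon function occurs, and it assumes the collection lexicon is enabled. The header text is taken from the question.

```xquery
xquery version "1.0-ml";

(
  (: Header row, as in the question's example :)
  "Collection;count",
  (: One "collection;count" line per entry in the collection lexicon :)
  for $coll in cts:collections()
  return $coll || ";" || cts:frequency($coll)
)
```

Because the counts come from the lexicon, no documents need to be read, which matters with 50 million of them.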

For directories it is a little trickier, but if you have the URI lexicon enabled you can get the directory counts with a bit of work as well:

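(: Return the directory part of a URI: everything before the last "/".
   Root-level URIs map to "/", and URIs without any "/" map to "". :)
declare function local:basepath(
  $uri as xs:string
) as xs:string
{
  if (fn:contains($uri, "/"))
  then
    let $path := fn:replace($uri, "^(.*)/([^/]*)$", "$1")
    return if ($path = "") then "/" else $path
  else ""
};

(: Tally documents per directory by walking the URI lexicon. :)
let $map := map:map()
let $_ :=
  for $uri in cts:uris()
  let $dir := local:basepath($uri)
  return
    if (empty(map:get($map, $dir)))
    then map:put($map, $dir, 1)
    else map:put($map, $dir, map:get($map, $dir) + 1)
(: Emit one "directory;count" line per directory. :)
for $key in map:keys($map)
return ($key || ";" || map:get($map, $key))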
1
votes

Here's an example in XQuery:

for $coll in cts:collections()
  let $count := fn:count(cts:uris("",(),cts:collection-query($coll)))
  order by $count descending
  return fn:concat($coll,';',$count)