3
votes

Hi MarkLoggers out there,

I have again a question for you! I have a collection of documents containing postalcode information. 400.000 docs. The docs are ordered one zip code per doc, each doc contains 400 features , ordered in categories and variabeles like so:

<postcode id="9728" xmlns="http://www.nvsp.nl/p4">
<meta-data>
<!--
Generated by DIKW for NetwerkVSP ST!P
-->
<version>0.3</version>
<dateCreated>2014-06-28+02:00</dateCreated>
</meta-data>
<category name="Oplages">
<variable name="Oplage" updated="2014-08-12+02:00">
  <segment name="Bruto">1234</segment>
  <segment name="Stickers">234</segment>
  <segment name="Netto">1000</segment>
  <segment name="Aktief">J</segment>
</variable>
</category>
<category name="Automotive">
<variable name="Leaseauto">
<segment name="Leaseauto">2.68822210725987</segment>
</variable>
<variable name="Autotype">
<segment name="De Oudere Stadsrijder">4.61734781858941</segment>
<segment name="De Dure Tweedehandsrijder">6.02534919813761</segment>
<segment name="De Autoloze">41.187790998448</segment>
<segment name="De Leasende Veelrijder">0.608035868253147</segment>
<segment name="De Modale Middenklasser">13.1996896016555</segment>
<segment name="De Vermogende Autoliefhebber">4.45283669598206</segment>
<segment name="De Vermogende Kilometervreter">2.07690981203656</segment>
<segment name="De Doelmatige Budgetrijder">17.2048629073978</segment>
<segment name="De Doorsnee Nieuw Kopende Automob">10.1595102603897</segment>
</variable>
...
400 more cat/var/segment element
...
</postcode>

I need to find a subset of docs based on the id attribute in postcode element and return only specific elements.

Elements to return are in cat Oplages var Oplage and I need segments Bruto and Netto

Now we have a rest api extension that does that but not fast enough.

Example query:

xquery version "1.0-ml";
declare namespace html = "http://www.w3.org/1999/xhtml";
declare namespace p4ns       = "http://www.nvsp.nl/p4";
declare namespace wijkns     = "http://www.nvsp.nl/wijk";

let $segment := "Bruto"

let $zoeker0 := cts:search(fn:doc(), cts:element-attribute-range-query(xs:QName("p4ns:postcode"), xs:QName("id"), "=", ("2311","2312","2313"))) 
let $zoeker1 := cts:search(/p4ns:postcode, cts:element-attribute-range-query(xs:QName("p4ns:postcode"), xs:QName("id"), "=", ("2311","2312","2313"))) 
let $zoeker2 := cts:search(/p4ns:postcode, cts:element-attribute-value-query(xs:QName("p4ns:postcode"), xs:QName("id"), ("2311","2312","2313"))) 

let $inhoud1 := $zoeker0//p4ns:segment[@name=$segment]
let $inhoud2 := $zoeker1//p4ns:segment[@name=$segment]/text()

let $r1 := cts:search(/p4ns:postcode, cts:element-attribute-range-query(xs:QName("p4ns:segment"), xs:QName("name"), "=", $segment))

return $inhoud2

Now if I profile this test query the slow part is looking up the "Bruto" segment in de docs returned by the cts:search. I know I should avoid looking up elements in docs via xpath but I do not know how to combine the two bits hitting only indexes...

Profiler outcome:

.main:13:44 1446    27  7127    30  7938    @name = "Bruto"
.main:12:44 1446    27  6956    30  7793    @name = "Bruto"
.main:17:11 1   9.3     2431    9.4     2458    cts:search(fn:collection()/p4ns:postcode, cts:element-attribute-range-query(xs:QName("p4ns:segment"), fn:QName("", "name"), "=", $segment))
.main:10:16 1   7.2     1874    7.2     1885    cts:search(fn:collection()/p4ns:postcode, cts:element-attribute-value-query(xs:QName("p4ns:postcode"), fn:QName("", "id"), ("2311", "2312", "2313")))

Query result:

1234
4567
3456

NOW my question(s):

1) What does "@name = "Bruto"" mean and why is it slow?

2) Ideally I would combine the search of docs with looking up the segment element via xpath into one combination but if I put $zoeker into a cts:search it is unsearchable... What is the best approach to get my result back in one go?

thx in advance!

hugo

1

1 Answers

6
votes

I see two basic problems: too many trips to the database, and those trips bring back too much data that you don't really want. The goal is to minimize the number of database lookups, and make each lookup as precise as possible.

In this case the main way you are performing database lookups is cts:search. There are several of those: probably too many, and sometimes the results are never used. I think some of those are leftover experiments. When you profile it's important to profile clean code.

Next, most of the profiler time is in that @name=$segment XPath predicate. That's repeated too, and for no good reason. Get rid of the repetition and it will go faster.

However the other reason @name=$segment shows up is because MarkLogic indexes documents, not nodes. It indexes the names and values of nodes, but each index entry points to a document - or more specifically a fragment, but let's not go there. So when you have one document with tens or hundreds of index entries for segment/@name values, all those index entries point to the document root. When you ask for only the segments that match a particular name, the index lookup matches the entire document. So the evaluation has to walk each document tree. That can be expensive in CPU cycles, and that's what the profiler is showing you.

There's no cure for that without restructuring the document, or perhaps doing something clever with co-occurrences. However we can clean up your query, and convert it to a single XPath expression using full paths. Let's see if this is fast enough for your use-case.

declare namespace p4ns="http://www.nvsp.nl/p4" ;

(: These might be external parameters. :)
let $segment := "Bruto"
let $ids := ("2311","2312","2313")
return collection()/p4ns:postcode[
  @id = $ids]/p4ns:category/p4ns:variable/p4ns:segment[
  @name = $segment]/string()

If I insert your sample XML and change its id to 2313, that returns the single value 1234. Profiling it shows 33 expressions in less than 1-ms, with 66% of the time in the database lookup via XPath. However it still has to look at all the segment/@name values: in this case 14 of them, taking 10% of the time.

Note that I didn't use cts:search nor any of your range indexes. MarkLogic automatically indexes node value for XPath value-equality lookups. You only need range indexes for special operations: for example facets, sorting, and inequality lookups.

You could do a little better with this:

(collection()/p4ns:postcode[
  @id = $ids]/p4ns:category/p4ns:variable/p4ns:segment[
  @name = $segment])[1]/string()

Now we're telling the evaluator that there's only one match expected. So it'll stop after it finds Bruto, and that's early in the document. In this case it is the first one, but on average (...)[1] should cut the number of expressions in half. Other tree-pruning techniques should also help: for example maybe you can add the category and variable names to your inputs, and express them as XPath predicates.

This might be a good time for you to back up and look at the big picture. What is it you're trying to accomplish with this query? There may be a much more efficient way to reach your goal.

If this is your most common use case, then ideally you would restructure your documents so that every id-segment lookup becomes a computable doc($uri) call. I'm not sure that's a good idea in this particular case, but I don't have complete knowledge of your application.

Another approach is to use in-memory value indexes and https://docs.marklogic.com/cts:value-co-occurrences to avoid looking at the XML at all. However that's a complicated approach and I'm not going to explore it here.