Suppose I have an XML document stored in MarkLogic in the following format:

<document>
   <id>DocumentID</id>
   <questions>
       <question_item>
            <question>question1</question>
            <answer>answer1</answer>
       </question_item>
       <question_item>
            <important>high</important>
            <question>question2</question>
            <answer>answer2</answer>
       </question_item>
   </questions>
</document>

Basically, each document has a number of questions, and only some of them have an <important> element. I want to return all of the "important" questions in a flat format, with metadata taken from the document each one is pulled from (e.g., id).

The following XQuery seems to work and is reasonably fast:

for $x in cts:search(/document,
  cts:element-query(xs:QName("important"), cts:and-query(())),
  "unfiltered", 0.0)
return
  for $y in $x/questions/question_item
  return
    if ($y/important) then
      fn:concat($x/id, '|',
        $y/question, '|',
        $y/answer,
        $y/important)
    else ()

However, I usually find that for loops are not the fastest way to work in XQuery, and this solution seems relatively cumbersome. Is there a better way to return just the "important" nodes initially while still having access to the main document elements?

2 Answers

3
votes

Personally, I find conditional logic more cumbersome than for loops, but I think you can remove one of each for a simpler query. Instead of looping over the first sequence of documents, you can simply assign them to a variable, which will allow you to reference them. Then in your loop, use a predicate to constrain question_item to those with important elements, eliminating the need for the conditional:

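let $documents := cts:search(/document,
  cts:element-query(xs:QName("important"), cts:and-query(())),
  "unfiltered", 0.0)
for $y in $documents/questions/question_item[important]
(: $y is still a node inside its source document, so the document-level
   id is reachable through the ancestor axis :)
return fn:concat($y/ancestor::document/id, '|',
  $y/question, '|',
  $y/answer,
  $y/important)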
3
votes

As in the code sample, the optimal approach is to match documents first based on the indexes and then to extract values from the matched documents. A FLWOR expression with non-redundant XPaths is an efficient way to extract values from a document.

One possible improvement would be to take a more fine-grained approach in modelling the documents: that is, to put each question item in a separate document. That way, the search will retrieve only the question items that are important.
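As a rough illustration of that remodelling, the sketch below writes each question item out as its own document and copies the parent id into it so the metadata stays available. The "/question-items/" URI scheme and the choice to embed id inside each item are assumptions made for this example, not something taken from the question, and a real migration over a large database would need to be batched rather than run as one transaction.

```xquery
xquery version "1.0-ml";

(: Hypothetical one-off migration: one new document per question_item. :)
for $doc in /document
for $item at $i in $doc/questions/question_item
(: Assumed URI scheme: parent id plus the item's position in the parent. :)
let $uri := fn:concat("/question-items/", $doc/id, "-", $i, ".xml")
return
  xdmp:document-insert($uri,
    <question_item>
      <id>{fn:string($doc/id)}</id>
      {$item/*}
    </question_item>)
```

With documents shaped this way, the same element-query used in the question could be run against /question_item instead of /document, so only the important items would be read from disk.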

That change would become important if the documents are large. For maximum performance, you could then put range indexes on the question, answer, and important elements and get one tuple for each question item directly from the indexes.
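A minimal sketch of the index-only lookup might look like the following. It assumes that each question item is stored as its own document, and that string range indexes using the default collation have already been configured on the question, answer, and important elements; neither of these is set up in the original post, and the query will raise an error if the indexes are missing.

```xquery
xquery version "1.0-ml";

(: Assumption: element range indexes exist for all three QNames below. :)
let $refs := (
  cts:element-reference(xs:QName("question")),
  cts:element-reference(xs:QName("answer")),
  cts:element-reference(xs:QName("important"))
)
(: Each tuple is a json:array of values that co-occur in one fragment. :)
for $tuple in cts:value-tuples($refs)
return
  fn:string-join(
    for $value in json:array-values($tuple)
    return fn:string($value),
    '|')
```

Because the tuples are built from co-occurring index values, items without an important element should not produce a tuple here, and no documents need to be loaded to produce the output. If the document id is also needed, it would require its own range index and a fourth reference.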

If the specific list of question items is usually retrieved and updated together, however, that would argue against splitting out each question as a separate document.

Hoping that helps,