2
votes

I'm a newcomer to XML databases and, in particular, I am trying to learn how MarkLogic works. My apologies if these questions are too naive or obvious.

What I'd like to do is implement MongoDB-style document references in MarkLogic, since I think the pattern would apply very well there, MarkLogic being itself a document-oriented database.

This is what the MongoDB documentation has to say about manual and DBRef style document references:

http://docs.mongodb.org/manual/reference/database-references/

MongoDB recommends the use of manual document references.

Now, the most direct approach I can see is to define this piece of information as part of a schema definition, starting with the definition of an objectId, a book and a publisher:

<xs:simpleType name="objectId">
  <xs:restriction base="xs:string">
    <xs:length value="24"/>
    <xs:whiteSpace value="collapse"/>
  </xs:restriction>
</xs:simpleType>

<xs:element name="Publisher">
  <xs:complexType>
    <xs:attribute name="id" type="fbc:objectId" use="required"/>
    <xs:attribute name="name" type="xs:string" use="required"/>
    <xs:attribute name="location" type="xs:string" use="required"/>
  </xs:complexType>
</xs:element>

<xs:element name="Book">
  <xs:complexType>
    <xs:attribute name="Title" type="xs:string"/>
    <xs:attribute name="publisherId" type="fbc:objectId" use="required"/>
  </xs:complexType>
</xs:element>
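
To make the schema above concrete, here is a minimal Python sketch of two instance documents and of what the objectId type actually checks. The id value, names and helper functions are invented for illustration; this is not MarkLogic code. Note that the type only constrains the lexical form (24 characters after whitespace collapse); nothing in it verifies that a publisherId points at an existing Publisher.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical instance documents; all values are made up for this example.
PUBLISHER_XML = '<Publisher id="507f1f77bcf86cd799439011" name="Example Press" location="Boston"/>'
BOOK_XML = '<Book Title="Example Title" publisherId=" 507f1f77bcf86cd799439011 "/>'


def collapse(raw):
    """Approximate xs:whiteSpace="collapse": squeeze runs of XML whitespace, trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", raw).strip(" ")


def is_object_id(raw):
    """Approximate the objectId facets: collapse whitespace, then require length 24."""
    return len(collapse(raw)) == 24


def references(book, publisher):
    """True if the book's publisherId equals the publisher's id (a manual reference)."""
    return collapse(book.get("publisherId")) == collapse(publisher.get("id"))


publisher = ET.fromstring(PUBLISHER_XML)
book = ET.fromstring(BOOK_XML)
```

Here `is_object_id(book.get("publisherId"))` holds even though the attribute has surrounding spaces, because the check runs on the collapsed value.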

So three questions:

  1. Would this suffice to model the document reference between a book and its publisher? Is there a better approach for schema-based XML documents?

  2. Would this approach introduce difficulties when writing XQuery inside MarkLogic (or any other XML database such as eXist-db, Sedna or BaseX)?

  3. MarkLogic states that it can use "Modular documents", which hold a special type of document reference using XPointer and XInclude:

    http://docs.marklogic.com/guide/app-dev/mod-docs

Are there any advantages in using that approach instead of manual document references? Are there any working Java API examples of this feature?

I apologize in advance if these are too many questions but I believe they're all related to the overall question stated here. Thanks.

Update:

I think I will resort to some data denormalization wherever appropriate and use plain old document URI attributes to reference other documents where needed. Not the best approach, I guess, but I think it may be good enough down the road. I'll keep updating with my findings. Thanks!


3 Answers

2
votes

As David and WST have pointed out, MarkLogic emphasizes denormalization over joins. Storing data structure trees or structured textual content makes it possible to retrieve documents with high performance at scale.

That said, MarkLogic does support joins. You can use XInclude to aggregate or just use an element or attribute whose value is the document URI for a related document. (The linking approach is comparable to linking in HTML.) Such links can be resolved by XQuery on the server or resolved on the client by retrieving the related documents with a single query.
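
As a rough illustration of the URI-link approach, the following Python sketch uses an in-memory dictionary as a stand-in for the database. The URIs, the `publisherUri` attribute name and the helper functions are all assumptions made for this example, not a MarkLogic API. The point is only that resolving the link costs a second document lookup.

```python
import xml.etree.ElementTree as ET

# Stand-in for the database: document URI -> serialized XML (hypothetical content).
DOCUMENTS = {
    "/publishers/example-press.xml":
        '<Publisher name="Example Press" location="Boston"/>',
    "/books/example-title.xml":
        '<Book Title="Example Title" publisherUri="/publishers/example-press.xml"/>',
}


def load(uri):
    """Fetch and parse one document by its URI."""
    return ET.fromstring(DOCUMENTS[uri])


def book_with_publisher(book_uri):
    """Resolve the link: load the book, then follow its publisherUri attribute."""
    book = load(book_uri)
    publisher = load(book.get("publisherUri"))  # the second lookup is the "join"
    return book, publisher


book, publisher = book_with_publisher("/books/example-title.xml")
```

On a real server the same two steps would be expressed in XQuery; a client could instead collect the linked URIs and fetch the related documents in one request.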

1
votes

I think the simplest approach would be to do away with the ID-based association and store the publisher name and location directly in each book document. Otherwise you will still have to perform a join-like operation, which is more expensive.

MarkLogic performs best when all the data you need is already in the document. This usually means duplicating data. This strategy should work fine in other XQuery databases, but I can't say exactly how optimal it would be compared to MarkLogic.
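
A minimal sketch of what that duplication could look like, written in Python rather than XQuery. The element layout, sample values and the `denormalize` helper are hypothetical choices for this example: the publisher's name and location are copied into the book document, so reading them later needs no second lookup.

```python
import copy
import xml.etree.ElementTree as ET


def denormalize(book, publisher):
    """Return a copy of the book with the publisher's data embedded in it."""
    result = copy.deepcopy(book)
    result.attrib.pop("publisherId", None)  # the ID-based association is dropped
    embedded = ET.SubElement(result, "Publisher")
    embedded.set("name", publisher.get("name"))
    embedded.set("location", publisher.get("location"))
    return result


# Hypothetical normalized inputs.
book = ET.fromstring('<Book Title="Example Title" publisherId="507f1f77bcf86cd799439011"/>')
publisher = ET.fromstring(
    '<Publisher id="507f1f77bcf86cd799439011" name="Example Press" location="Boston"/>'
)

flat_book = denormalize(book, publisher)
# Everything needed is now in one document.
publisher_name = flat_book.find("Publisher").get("name")
```

The trade-off is the usual one: if a publisher's location changes, every book document that embeds it has to be updated.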

MarkLogic has a very extensive whitepaper explaining its indexing system and many other details. It's an excellent resource for understanding how to design optimal queries and data. This link maintains a copy of the latest version:

http://developer.marklogic.com/inside-marklogic

1
votes

XIncluded documents, a.k.a. "Modular Documents" in MarkLogic (http://docs.marklogic.com/guide/app-dev/mod-docs), can be stored after expansion (if you use CPF, the Content Processing Framework, it will actually store both the components and the final expanded document) or expanded on read.

If you expand on read, the crucial difference is that the search functions work on a per-document (or per-fragment) basis. A search across a modular document will not show up as a match on the master document but rather on the included document. I would guess that this is generally not what most search-based applications want. But if your app is not so heavily search-based, or you can take this into account, you could take advantage of it.
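
The search-granularity point can be mimicked outside MarkLogic. The sketch below uses Python's standard `xml.etree.ElementInclude` module and an in-memory dictionary as the "database"; the URIs, document contents and the naive `search` function are assumptions for illustration and say nothing about how MarkLogic's indexes work internally.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical stored documents: a master that XIncludes a publisher document.
DOCUMENTS = {
    "/books/master.xml": (
        '<Book xmlns:xi="http://www.w3.org/2001/XInclude" Title="Example Title">'
        '<xi:include href="/publishers/example-press.xml"/>'
        "</Book>"
    ),
    "/publishers/example-press.xml":
        '<Publisher name="Example Press" location="Boston"/>',
}


def matches(element, term):
    """True if any text or attribute value in the tree contains the term."""
    for node in element.iter():
        if term in (node.text or ""):
            return True
        if any(term in value for value in node.attrib.values()):
            return True
    return False


def search(term):
    """Naive per-document search over the documents as stored (unexpanded)."""
    return sorted(
        uri for uri, xml in DOCUMENTS.items() if matches(ET.fromstring(xml), term)
    )


def loader(href, parse, encoding=None):
    """Resolve an xi:include href against the in-memory store."""
    return ET.fromstring(DOCUMENTS[href])


def expand(uri):
    """Expand on read: replace xi:include elements with the referenced content."""
    root = ET.fromstring(DOCUMENTS[uri])
    ElementInclude.include(root, loader=loader)
    return root


hits = search("Boston")  # only the included document matches
expanded_matches = matches(expand("/books/master.xml"), "Boston")
```

Searching the stored documents for "Boston" finds only the publisher document, while the same term does match the master once it has been expanded, which is why storing the expanded form changes what a search returns.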

In general (i.e. without some other compelling rationale), I would suggest denormalizing your data so that it all fits into one atomic document.

-David