2
votes

Background: I have about 500k documents loaded in a MarkLogic production database. We have a processing pipeline that uses CPF for document ingestion, and I store the MIME type of each document.

I need a facet based on the file type. I cannot use the MIME type for faceting, since regular users do not understand MIME types and are more familiar with file types by name (Excel, Word, JPEG, etc.). My issue is that multiple MIME types can map to a single file type; for example, Excel has 10 or more MIME types, and I want to map all of them to Excel.

Here are the two possible implementations I came up with. I was wondering whether there are any other good ways to achieve this.

  1. I have a "transform" method that takes the facet produced from the MIME type and groups the MIME types that correspond to each file type. A configuration file (XML) holds all the mappings from MIME type to file type, so when a new MIME type is added to a file type, all I have to do is edit this configuration file.
    The disadvantage of this method is that, when running the search query, I need to expand the search string with a custom constraint so that the file type is translated into its individual MIME types.
  2. The second option is to add the file type during the ingestion process, which solves both my faceting and my search issue.
    The disadvantage of this option is that I need to add the file type to all the existing 500k documents. Before I do that, I either need to disable CPF or add logic so that CPF takes no action when it is triggered on these 500k documents. Since this is a production database, I am not free to disable CPF for new document ingestion.
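
To make option 1 concrete, the grouping step might look roughly like the sketch below. It is only an illustration under stated assumptions: the mapping configuration and its element names (mappings, file-type, mime-type) are hypothetical, and the un-namespaced facet-value elements are a simplified stand-in for whatever the MIME-type facet actually returns. Only the map: functions are MarkLogic built-ins.

```xquery
xquery version "1.0-ml";

(: Hypothetical mapping configuration; element names are illustrative only. :)
declare variable $mappings :=
  <mappings>
    <file-type label="Excel">
      <mime-type>application/vnd.ms-excel</mime-type>
      <mime-type>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime-type>
    </file-type>
    <file-type label="Word">
      <mime-type>application/msword</mime-type>
    </file-type>
  </mappings>;

(: Look up the file type for one MIME type; unmapped types fall back to "Other". :)
declare function local:file-type($mime as xs:string) as xs:string {
  ($mappings/file-type[mime-type = $mime]/@label/fn:string(), "Other")[1]
};

(: Sum per-MIME-type facet counts into per-file-type counts. :)
declare function local:group-facet($values as element(facet-value)*) as element(facet-value)* {
  let $totals := map:map()
  let $_ :=
    for $fv in $values
    let $label := local:file-type($fv/@name)
    return map:put($totals, $label,
      (map:get($totals, $label), 0)[1] + xs:integer($fv/@count))
  for $label in map:keys($totals)
  order by map:get($totals, $label) descending
  return <facet-value name="{$label}" count="{map:get($totals, $label)}"/>
};

(: Sample input standing in for the MIME-type facet; the counts are made up. :)
local:group-facet((
  <facet-value name="application/vnd.ms-excel" count="12"/>,
  <facet-value name="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" count="30"/>,
  <facet-value name="application/msword" count="7"/>,
  <facet-value name="image/jpeg" count="3"/>
))
```

Adding a new MIME type to a file type then only means adding one mime-type element to the configuration. The reverse lookup needed at search time (file type to its MIME types) could read the same configuration.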

I like the second option, but is there a better way to do this than changing my CPF code or disabling the triggers database for some time?

Also, which one (1 or 2) is the better way to do this? And are there better options than these?


3 Answers

3
votes

You accurately list the two options. Do it at query time, or do it at index time. And you accurately list the pros and cons. My advice: If it's fast enough for you to do it at query time, and if the maintenance of the more advanced code is acceptable, then do that. Else bake the knowledge into the documents.

1
votes

With just 500k documents I think query time will be fast enough. You could potentially leverage this grouping-constraint for this purpose: https://github.com/grtjn/ml-constraints#grouping-constraint

You can install and deploy that with MLPM. After that you could use something like this inside your search options:

<constraint name="Attachment-Type">
  <custom>
    <parse apply="parse-structured" ns="http://marklogic.com/grouping-constraint" at="/ext/mlpm_modules/ml-constraints/grouping-constraint.xqy"/>
    <start-facet apply="start" ns="http://marklogic.com/grouping-constraint" at="/ext/mlpm_modules/ml-constraints/grouping-constraint.xqy"/>
    <finish-facet apply="finish" ns="http://marklogic.com/grouping-constraint" at="/ext/mlpm_modules/ml-constraints/grouping-constraint.xqy"/>
    <facet-option>limit=5</facet-option>
    <facet-option>frequency-order</facet-option>
    <facet-option>descending</facet-option>
    <facet-option>any</facet-option>
  </custom>
  <annotation>
    <range type="xs:string" facet="true" collation="http://marklogic.com/collation//S1">
      <element ns="http://my-namespace.com" name="mime-type"/>
    </range>
    <config>
      <group label="Audio">
        <match pattern="audio/*"/>
      </group>
      <group label="Video">
        <match pattern="video/*"/>
        <match pattern="application/vnd.rn-realmedia"/>
      </group>
      <group label="Documents">
        <match pattern="application/msword"/>
        <match pattern="application/vnd.wordperfect"/>
        <match pattern="application/x-wordstar"/>
        <match pattern="application/pdf"/>
        <match pattern="application/postscript"/>
        <match pattern="application/rtf"/>
        <match pattern="application/x-xywrite"/>
        <match pattern="application/x-mass11"/>
      </group>
      <group label="Spreadsheets">
        <match pattern="application/vnd.ms-excel"/>
      </group>
      <group label="Presentations">
        <match pattern="application/vnd.ms-powerpoint"/>
      </group>
      <show-remainder label="Other"/>
    </config>
  </annotation>
</constraint>

However, performance degrades with the number of groups and patterns you provide, as well as with the total number of documents. Once the total grows beyond several million documents, grouping might take up a significant share of search resolution time. In that case you are better off calculating the groups upfront.

HTH!

0
votes

If you are worried about performance and prefer to calculate upfront, consider leveraging CPF instead of fighting it. Make sure your CPF pipelines can detect cases where no work has to be done (to prevent things from being added twice and such), and let them fill in only the missing parts. Once you are sure CPF is configured that way, and you have added the logic that adds the file type, just touch the files by re-inserting them with something along the lines of xdmp:document-insert($uri, doc($uri), xdmp:document-get-permissions($uri), ....).
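
A minimal sketch of such a touch, in case it helps. The two URIs are hypothetical placeholders; in practice the list would come from a URI lexicon or a query, and 500k documents would need to be processed in batches rather than in one transaction. Passing the existing collections and quality as the trailing arguments is my assumption about what you would want to preserve, not something the call above spells out.

```xquery
xquery version "1.0-ml";

(: Hypothetical URIs, used here only for illustration. :)
for $uri in ("/docs/example-1.xml", "/docs/example-2.xml")
where fn:doc-available($uri)
return
  (: Re-insert the document unchanged so that an update event fires for it,
     carrying over its current permissions, collections and quality. :)
  xdmp:document-insert(
    $uri,
    fn:doc($uri),
    xdmp:document-get-permissions($uri),
    xdmp:document-get-collections($uri),
    xdmp:document-get-quality($uri)
  )
```

Whether a re-insert like this actually restarts processing depends on how your domain and pipelines are configured, so try it on a handful of documents first.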

HTH!