FileType facet in MarkLogic

Question

Background: I have about 500k documents loaded in a MarkLogic production database. We have a processing pipeline that uses CPF for document ingestion. I store the MIME type of each document.

I need a facet based on the file type. I cannot use the MIME type for faceting, since regular users do not understand MIME types and are more familiar with file types such as Excel, Word, Jpeg, etc. My issue is that multiple MIME types can map to a single file type. For example, Excel has 10 or more MIME types, and I want to map all of them to Excel.

Below are the two possibilities I came up with for implementing this. I was wondering whether there are any other good ways to achieve it.

1. I have a "transform" method that takes the facet produced from the MIME type and groups together all the MIME types that correspond to one file type. A configuration file (XML) holds all the mappings from MIME type to file type, so when a new MIME type is added to a file type, all I have to do is edit this configuration file.
   The disadvantage of this method is that, when running the search query, I need to expand the search string with a custom constraint so that the file type is translated into each of its individual MIME types.
2. During the ingestion process, I add the FileType to each document, which solves both my faceting and my search issue.
   The disadvantage of this option is that I need to add FileType to all of the existing 500k documents, and before I do, I either need to disable CPF or add some logic that tells CPF to take no action when it is triggered on these 500k documents. Since this is a production database, I am not at liberty to disable CPF, because it is needed for new document ingestion.
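The two halves of option 1 (grouping MIME-type facet counts into file types, and expanding a selected file type back into its MIME types for the constraint) can be sketched in a language-neutral way. This is only an illustration in Python, not MarkLogic code: the dictionary stands in for the XML configuration file, and the function names, the sample mappings, and the "Other" bucket are all assumptions.

```python
from collections import Counter

# Hypothetical mapping, standing in for the XML configuration file.
MIME_TO_FILE_TYPE = {
    "application/vnd.ms-excel": "Excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "Excel",
    "application/msword": "Word",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "Word",
    "image/jpeg": "Jpeg",
}


def group_facet_counts(mime_counts, mapping=MIME_TO_FILE_TYPE, other="Other"):
    """Sum per-MIME-type facet counts into per-file-type counts.

    MIME types missing from the mapping are collected under `other`
    (an assumption; they could also be dropped or shown as-is).
    """
    grouped = Counter()
    for mime, count in mime_counts.items():
        grouped[mapping.get(mime, other)] += count
    return dict(grouped)


def expand_file_type(file_type, mapping=MIME_TO_FILE_TYPE):
    """Return every MIME type mapped to `file_type`, sorted for stable output.

    This is the list a custom constraint would have to OR together.
    """
    return sorted(mime for mime, ft in mapping.items() if ft == file_type)
```

Both functions read from the same mapping, so adding a new MIME type to a file type stays a one-line configuration change, which is the main appeal of this option.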

I like the second option, but is there a better way to do this than changing my CPF code or disabling the triggers database for some time?

Also, which of the two (1 or 2) is the better way to do this? And are there better options than these?

hunterhacker · Accepted Answer · 2016-07-17T18:11:18

You accurately list the two options. Do it at query time, or do it at index time. And you accurately list the pros and cons. My advice: If it's fast enough for you to do it at query time, and if the maintenance of the more advanced code is acceptable, then do that. Else bake the knowledge into the documents.
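For the index-time route, "baking the knowledge into the documents" amounts to deriving the file type once and storing it next to the MIME type. The sketch below is a Python illustration only: documents are modeled as plain dictionaries, the field names `mime_type` and `file_type` are hypothetical, and it says nothing about how CPF would react to the resulting updates.

```python
# Hypothetical mapping, standing in for the XML configuration file.
MIME_TO_FILE_TYPE = {
    "application/vnd.ms-excel": "Excel",
    "application/msword": "Word",
    "image/jpeg": "Jpeg",
}


def with_file_type(doc, mapping=MIME_TO_FILE_TYPE, other="Other"):
    """Return a copy of `doc` with 'file_type' derived from 'mime_type'."""
    annotated = dict(doc)
    annotated["file_type"] = mapping.get(doc.get("mime_type"), other)
    return annotated


def backfill(docs, mapping=MIME_TO_FILE_TYPE):
    """Annotate only documents that do not yet carry 'file_type'.

    Returns the resulting documents and the number that were changed,
    so the pass can be re-run safely on a partly migrated collection.
    """
    updated, changed = [], 0
    for doc in docs:
        if "file_type" in doc:
            updated.append(doc)  # already annotated, leave untouched
        else:
            updated.append(with_file_type(doc, mapping))
            changed += 1
    return updated, changed
```

Skipping documents that already have the field keeps the backfill idempotent, so it can run in batches alongside normal ingestion.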
