Background: I have about 500k documents loaded in a MarkLogic production database. We have a processing pipeline using CPF for document ingestion, and I store the MIME type of each document.
I need a facet based on the file type. I cannot use the MIME type for faceting, since regular users do not understand MIME types and are more familiar with file types (Excel, Word, JPEG, etc.). My issue is that multiple MIME types can map to a single file type; for example, Excel has 10 or more MIME types, and I want to map all of them to Excel.
Below are the two possibilities I came up with for implementing this. I was wondering if there are any other good ways to achieve it.
- The first option is a "transform" method, which takes the facet produced from the MIME type and groups together the MIME types that correspond to each file type. I have a configuration file (XML) that holds all the mappings from MIME type to file type. This way, when a new MIME type is added to a file type, all I have to do is edit this configuration file.
The disadvantage of this method is that, when running the search query, I need to expand the search string with a custom constraint, so that the file type is translated into each of its individual MIME types.
- The second option is to add the file type to each document during the ingestion process, which solves both my faceting and my search issue.
The disadvantage of this option is that I need to add the file type to all the existing 500k documents, and before I do that I either need to disable CPF or add some logic so that, when CPF is triggered on these 500k documents, it takes no action. Since this is a production database, I am not free to disable CPF for new document ingestion.
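To make the first option concrete, here is a rough sketch of the two pieces it needs, written in Python purely for illustration (it is not my actual MarkLogic code). The mapping entries, the "Other" bucket, and the function names `group_facet_counts` and `expand_file_type` are all made up for this example; in practice the mapping would be loaded from the XML configuration file.

```python
from collections import defaultdict

# Hypothetical subset of the mapping; the real one would come from the XML config file.
MIME_TO_FILE_TYPE = {
    "application/vnd.ms-excel": "Excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "Excel",
    "application/msword": "Word",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "Word",
    "image/jpeg": "Jpeg",
}


def group_facet_counts(mime_counts):
    """Collapse per-MIME-type facet counts into per-file-type counts.

    Summing is only valid because each document stores exactly one MIME type.
    Unmapped MIME types fall into an assumed "Other" bucket.
    """
    grouped = defaultdict(int)
    for mime_type, count in mime_counts.items():
        grouped[MIME_TO_FILE_TYPE.get(mime_type, "Other")] += count
    return dict(grouped)


def expand_file_type(file_type):
    """Return every MIME type mapped to a file type, for building an OR query."""
    return sorted(m for m, ft in MIME_TO_FILE_TYPE.items() if ft == file_type)
```

The first function is the "transform" applied to the facet results, and the second is the expansion the custom constraint would have to perform when a user selects a file type such as Excel.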
I like the second option, but is there a better way to do it than changing my CPF code or disabling the triggers database for some time?
Also, which one (1 or 2) is the better approach? And are there better options than these?