Document processing in Liferay portal

Question

I've been using Liferay a lot for the past 2 years, but I have never needed any extensive document management.

Now I have a portlet where users upload documents (MS Office OLE2 documents, ODS documents, PDFs, etc.) and I have to persist them with all available metadata.

I know how I would do that without Liferay: I'd probably use Apache Solr with Apache Tika (UpdateRichDocuments and ExtractingRequestHandler), or Apache Jackrabbit, which uses Apache Tika under the hood (org.apache.jackrabbit.extractor.*).

The problem is that when I look at the Liferay trunk, there are some key classes:

Hooks (JCRHook, FileSystemHook, CMISHook, S3Hook) that are called more or less directly from DLLocalServiceImpl.

Another alternative is DLAppLocalServiceImpl, which uses DLRepositoryLocalServiceImpl. The files are also persisted into the repository via Hooks, but a lot of additional work is done there.

There is no jackrabbit-text-extractors library in Liferay, so I suppose that if I wanted metadata to be extracted from PDF, DOC, and ODS documents, I would have a very hard time, because the DL service layer doesn't accept additional properties. I think I'd have to avoid the DL services and the JCR hook and access Jackrabbit directly. But then I would lose compatibility and the ability to migrate my repository, etc.

Could anybody please elaborate on this? Thank you.

David O'Meara · Accepted Answer · 2011-02-28T03:55:13

Solr for indexing, Jackrabbit for document storage. Managing the Liferay Document Library in code is fairly easy: just look at the DL*LocalServiceUtil classes, namely DLFolderLocalServiceUtil and DLFileEntryLocalServiceUtil. By default Liferay just creates a matching folder/file structure on the hard drive (with names changed). Since Liferay allows upload, download, and viewing out of the box via the control panel and various portlets, you'd only need to write code or use Jackrabbit if you wanted more than this.

I haven't used Jackrabbit with Liferay, but once it is configured, everything should be managed under the covers and you shouldn't need to worry about it on the front end.

When you say "with all metadata available", I'm not sure what is retained, but aside from renaming the file so that it can be tracked, there shouldn't be any other changes. It should be quick and easy to test by uploading a file of each type and checking the entries in the LIFERAY/data/document_library directory and its subdirectories. Again, this would be different if Jackrabbit is used.
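
If you'd rather not browse the directory by hand, the check above can be scripted. This is only a rough sketch using plain java.nio, not any Liferay API: the class name DocumentLibraryListing and the default path are made up for illustration, and it assumes the default file-system storage, so pass the real document_library path of your installation as the first argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of Liferay): lists what is stored on disk.
public class DocumentLibraryListing {

    /** One stored file: path relative to the root, plus its size in bytes. */
    public static final class Entry {
        public final String relativePath;
        public final long sizeBytes;

        Entry(String relativePath, long sizeBytes) {
            this.relativePath = relativePath;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Walks the directory tree under root and returns every regular file, sorted by path. */
    public static List<Entry> list(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            List<Path> files = paths
                    .filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
            List<Entry> entries = new ArrayList<>();
            for (Path file : files) {
                // Normalise separators so the output looks the same on Windows and Unix.
                String relative = root.relativize(file).toString().replace('\\', '/');
                entries.add(new Entry(relative, Files.size(file)));
            }
            return entries;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; replace with the path of your own installation.
        Path root = Paths.get(args.length > 0 ? args[0] : "data/document_library");
        if (!Files.isDirectory(root)) {
            System.err.println("Not a directory: " + root);
            return;
        }
        for (Entry entry : list(root)) {
            System.out.println(entry.sizeBytes + "\t" + entry.relativePath);
        }
    }
}
```

Run it once before and once after uploading a file of each type, then compare the two listings. If the stored size matches the size of the file you uploaded, the bytes were most likely kept as-is, which means any embedded metadata is still inside the file even though Liferay has not extracted it.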
