Extract text from large files

Question

I need to extract text from large files (up to 50 MB). The files may be in doc, ppt, xls, txt, or pdf format. So far I've used Apache POI (http://poi.apache.org/) for Microsoft Office documents and PDFBox to extract text from PDFs. However, extraction gets slow as the files get larger, especially for the following files. Results I've achieved so far:

1. PPTX - 45 MB - approx. 3 minutes
2. PDF - 62 MB - approx. 2 minutes
3. DOCX - 32 MB - approx. 15 seconds
4. XLS - 17 MB - approx. 10 seconds
5. XLSX - 7 MB - approx. 20 seconds

I need the process to be fast. Which APIs can I use to achieve this, and what best practices can help me enhance my application's performance?

As PDF is a format merely drawing letter groups at custom positions in a page, all these letter groups have to be found, sorted, and glued together before you get your text. This can take some time... depending on the PDF library used, though, there certainly are faster and slower solutions... — mkl
I'm certainly looking for something faster than 2 minutes for a 62 MB file. — Umar Iqbal

Joop Eggen · Accepted Answer · 2014-02-26T11:43:46

pptx, docx and xlsx files are zip archives with XML files inside (word/document.xml, xl/sharedStrings.xml, and so on). If you do not need the text in context, and therefore do not need a DOM (a model of the entire document), you can process these XML files yourself and parse them all sequentially.

For PDF you might try iText, reading the PDF sequentially. In fact, there are sample text extractors for several PDF libraries.

Extracting text from XML means reading the XML text sequentially and only paying attention to text parts between > and <.
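As a minimal sketch of that idea, the class below walks a docx or pptx as a plain zip stream and collects only the character data, using the StAX parser that ships with the JDK instead of building a DOM. The class name `ZipXmlText`, the method `extract`, and the entry-name prefix argument (for example `word/` or `ppt/slides/`) are hypothetical choices for illustration, and ending a line at every `p` element is a simplifying assumption about how paragraphs should be separated. It has not been measured against the timings in the question.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of POI or any other library.
final class ZipXmlText {

    /** Returns the character data of every .xml entry whose name starts with prefix. */
    static String extract(InputStream in, String prefix)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Documents are untrusted input: switch off DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

        StringBuilder text = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (!name.startsWith(prefix) || !name.endsWith(".xml")) {
                    continue; // skip styles, media, metadata, ...
                }
                // Shield the zip stream so the parser cannot close it between entries.
                InputStream entryData = new FilterInputStream(zip) {
                    @Override
                    public void close() {
                        // intentionally empty
                    }
                };
                XMLStreamReader xml = factory.createXMLStreamReader(entryData);
                while (xml.hasNext()) {
                    int event = xml.next();
                    if (event == XMLStreamConstants.CHARACTERS
                            || event == XMLStreamConstants.CDATA) {
                        text.append(xml.getText()); // the part between > and <
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && xml.getLocalName().equals("p")) {
                        text.append('\n'); // assumption: a closing p element ends a paragraph
                    }
                }
                xml.close();
            }
        }
        return text.toString();
    }
}
```

A call might look like `ZipXmlText.extract(new FileInputStream("big.docx"), "word/")`. Only one entry is decompressed at a time and no document model is kept, so memory use is driven mainly by the size of the extracted text.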

The hard part is xlsx, where cell values are shared: a cell holds an indirect reference (an index into the shared strings table) rather than the text itself. I would rather use a JDBC query, but that also takes time. There are several options: the ODBC-JDBC bridge, or a proper JDBC driver.
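The indirect reference can also be resolved by hand in two sequential passes: read xl/sharedStrings.xml into a list once, then stream each worksheet and look up every cell marked `t="s"` by index. The sketch below assumes the two XML parts are supplied as input streams (for instance from `java.util.zip.ZipFile.getInputStream`); the class and method names are hypothetical, and inline strings, formulas, and number or date formatting are deliberately ignored.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of POI or any other library.
final class XlsxText {

    private static XMLStreamReader open(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        return factory.createXMLStreamReader(in);
    }

    /** Pass 1: one list element per si item of xl/sharedStrings.xml, in file order. */
    static List<String> readSharedStrings(InputStream in) throws XMLStreamException {
        List<String> strings = new ArrayList<>();
        StringBuilder item = new StringBuilder();
        boolean inText = false;
        XMLStreamReader xml = open(in);
        while (xml.hasNext()) {
            switch (xml.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if (xml.getLocalName().equals("si")) {
                        item.setLength(0);
                    } else if (xml.getLocalName().equals("t")) {
                        inText = true; // rich-text items contain several t elements
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (inText) {
                        item.append(xml.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (xml.getLocalName().equals("t")) {
                        inText = false;
                    } else if (xml.getLocalName().equals("si")) {
                        strings.add(item.toString());
                    }
                    break;
                default:
                    break;
            }
        }
        return strings;
    }

    /** Pass 2: one output line per cell value; t="s" cells are looked up in shared. */
    static String sheetText(InputStream in, List<String> shared) throws XMLStreamException {
        StringBuilder out = new StringBuilder();
        StringBuilder value = new StringBuilder();
        boolean sharedRef = false;
        boolean inValue = false;
        XMLStreamReader xml = open(in);
        while (xml.hasNext()) {
            switch (xml.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if (xml.getLocalName().equals("c")) {
                        sharedRef = "s".equals(xml.getAttributeValue(null, "t"));
                    } else if (xml.getLocalName().equals("v")) {
                        inValue = true;
                        value.setLength(0);
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (inValue) {
                        value.append(xml.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (xml.getLocalName().equals("v")) {
                        inValue = false;
                        String raw = value.toString().trim();
                        out.append(sharedRef ? shared.get(Integer.parseInt(raw)) : raw);
                        out.append('\n');
                    }
                    break;
                default:
                    break;
            }
        }
        return out.toString();
    }
}
```

The shared strings list has to fit in memory, which is the main cost of this approach; the worksheets themselves are never held as a whole.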

Programming this yourself does take time, and it should be developed against small sample documents.
