I need to extract text from large files (up to 50MB). The files may be in doc, ppt, xls, txt or pdf format. So far I've used Apache POI (http://poi.apache.org/)
for Microsoft Office documents and PDFBox to extract text from PDFs. However, the extraction process gets slow as the files get large, especially with the following files. Results I've achieved so far:
1. PPTX - 45MB - approx. 3 minutes
2. PDF - 62MB - approx. 2 minutes
3. DOCX - 32MB - approx. 15 seconds
4. XLS - 17MB - approx. 10 seconds
5. XLSX - 7MB - approx. 20 seconds
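For context, timings like these can be taken with a small harness along the following lines. This is only a sketch, not my exact code: `ExtractionTimer`, `Timed` and `timeExtraction` are hypothetical names, and the lambda in `main` is a placeholder that just decodes plain text where the real POI or PDFBox call would go.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class ExtractionTimer {

    /** Result of one timed run: the extracted text and the elapsed wall-clock time. */
    public static final class Timed {
        public final String text;
        public final long elapsedMillis;

        Timed(String text, long elapsedMillis) {
            this.text = text;
            this.elapsedMillis = elapsedMillis;
        }
    }

    // Hypothetical helper: runs any extractor over the raw file bytes and times it.
    public static Timed timeExtraction(Function<byte[], String> extractor, byte[] input) {
        long start = System.nanoTime();
        String text = extractor.apply(input);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed(text, elapsedMillis);
    }

    public static void main(String[] args) {
        byte[] sample = "hello world".getBytes(StandardCharsets.UTF_8);
        // Placeholder extractor: a real run would call the POI or PDFBox extraction here.
        Timed result = timeExtraction(b -> new String(b, StandardCharsets.UTF_8), sample);
        System.out.println(result.text.length() + " chars in " + result.elapsedMillis + " ms");
    }
}
```

The figures above are single wall-clock runs per file, so they include file parsing as well as the text extraction itself.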
I need the process to be fast. Which APIs can I use to achieve this, and what best practices can help me enhance my application's performance?