0
votes

I have a list of .pdf, .ppt, .pptx, .xls, .xlsx, .doc and .docx files (a List<File>) and want to look for email addresses in these files. My problem is how to extract the plain text from those files in a smart way. Currently I am using Apache POI and I have a separate method for every file type. Is there a shorter, more elegant way of doing this? Is it also possible to process .odt, .odp and .ods files? How can I get the plain text from .pdf, .ppt, .pptx, .xls, .xlsx, .doc and .docx files into a String?

2
Did you try Apache Tika? - Gagravarr

2 Answers

1
votes

If the Apache library can convert the file to text, you can run a regex search on the resulting text. If you can use another Java library, you may be able to search the original document directly, or at least convert it to plain text first.
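
Once the text is in a String, the regex step could look roughly like the sketch below. The class name EmailFinder and the method findEmails are made up for illustration, and the pattern is a deliberately simple approximation of an email address, not a full RFC 5322 matcher, so treat it as a starting point only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of any library): collects strings that look
// like email addresses from text that has already been extracted.
final class EmailFinder {

    // Assumption: a simplified pattern (local part, '@', domain, dot, TLD of
    // two or more letters). It will miss some valid addresses and may accept
    // some invalid ones.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    private EmailFinder() {
    }

    // Returns every match in order of appearance, duplicates included.
    static List<String> findEmails(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

You would call EmailFinder.findEmails(text) once per document, passing in whatever plain text your extraction step produced for that file.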

The company I work for has libraries for two of these formats. With the Gnostice XtremeDocumentStudio (for Java) library, you can convert PDF and DOCX files to plain text.

DocumentConverter dc = new DocumentConverter();
dc.convertToFile("sample.pdf", "sample-pdf.txt");
dc.convertToFile("sample.docx", "sample-docx.txt");

With the Gnostice PDFOne (for Java) library, you can search a PDF directly using a regex (for this task you would use a different regex, one written for email addresses; link given above). This library works only with PDF files.

PdfDocument doc = new PdfDocument();
doc.load("Input_Docs\\input_doc.pdf");

// Obtain all website addresses in page 2
ArrayList lstSearchResults =
   (ArrayList) doc.search("http://{1}",  // regular expression
                          2, // page number
                          PdfSearchMode.REGEX,
                          PdfSearchOptions.NONE);
-1
votes

Did you try JOffice? It supports OpenOffice document formats (.odt, .ods) as well as Microsoft Office document formats.