java - How to extract text from .doc document using apache poi? - Stack Overflow

I used the two code snippets below to extract text from a .doc file:

HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
Range range = document.getRange();
int len = range.numParagraphs();
StringBuilder builder = new StringBuilder();

for (int i = 0; i < len; i++) {
    builder.append(range.getParagraph(i).text());
}

and

HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
WordExtractor wordExtractor = new WordExtractor(document);
String[] paragraphs = wordExtractor.getParagraphText();
StringBuilder builder = new StringBuilder();
for (String p : paragraphs) {
    builder.append(p);
}

However, both of them always output some strange characters, for example:

PAGEREF_Toc351848910\h10HYPERLINK\l _Toc351848911

CITATIONPla\l1033[HYPERLINK\l"Pla"13]

Where do these come from, and how can I remove them when extracting text from a .doc file?

Thanks in advance

  • 1
    The strange text you show is a table of contents entry, a TOC reference, and a citation. Sorry, I don't know how to remove them.
    – grahamj42
    Mar 23 2013 at 20:45
  • 1
    Have you tried using WordExtractor#stripFields(String) to remove them?
    – Gagravarr
    Mar 24 2013 at 21:09
  • It works. Thanks a lot.
    – thoitbk
    Mar 28 2013 at 17:55
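
To illustrate what the comments above are getting at: the stray text looks like Word field instructions (PAGEREF, HYPERLINK, CITATION). As far as I know, the raw text of a .doc file keeps each field wrapped in three control characters: 0x13 (field begin), 0x14 (separator between the instruction and its visible result) and 0x15 (field end). The POI call suggested above is WordExtractor.stripFields(String). The snippet below is only a plain-JDK sketch of the same idea, not POI's implementation; the class name FieldCodeSketch and the method stripFieldCodes are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: illustrates the idea of field stripping, it is not POI's code.
// The class and method names are hypothetical.
final class FieldCodeSketch {
    // Assumed Word field markers: begin (0x13), separator (0x14), end (0x15).
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    // Drops field instructions and markers, keeps the visible field results.
    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while still inside its instruction part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int inInstruction = 0; // open fields that are still in their instruction part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(true);
                inInstruction++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from instruction to result.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(false);
                    inInstruction--;
                }
            } else if (c == FIELD_END) {
                // A field may end without ever having had a result part.
                if (!openFields.isEmpty() && openFields.pop()) {
                    inInstruction--;
                }
            } else if (inInstruction == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With POI itself, you would presumably pass each extracted paragraph through WordExtractor.stripFields(p) before appending it to the StringBuilder, which is what ended up working here.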
0

I hope this may give you some insight.

    // Needs Apache POI (HWPFDocument, WordExtractor) and iText (Document, Paragraph, PdfWriter).
    private static void ConvertDoctoPdf(String src, String outputPdf) throws Exception {
        try {
            Document pdfdoc = new Document();
            HWPFDocument doc = new HWPFDocument(new FileInputStream(src));

            // Wrap the HWPFDocument in a WordExtractor to read its text.
            WordExtractor we = new WordExtractor(doc);

            OutputStream outputFile = new FileOutputStream(new File(outputPdf));

            // Create a PDF writer that writes the document to outputPdf.
            PdfWriter.getInstance(pdfdoc, outputFile);
            pdfdoc.open();

            Paragraph para = new Paragraph();

            // Collect all paragraphs of the Word document.
            String[] paragraphs = we.getParagraphText();
            for (int i = 0; i < paragraphs.length; i++) {
                // Append each extracted paragraph to the PDF paragraph.
                para.add(paragraphs[i]);
                //para.add(new Chunk(Chunk.NEWLINE));
            }

            // Print the extracted text.
            System.out.println(para);
            // Add the combined paragraph to the PDF document.
            pdfdoc.add(para);

            pdfdoc.close();
            we.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
  • This appears to be creating a PDF document - how is that in any way solving the Original Problem?
    – Gagravarr
    Feb 16 2017 at 11:56
  • System.out.println(para); prints the extracted paragraphs.
    – Om Prakash
    Feb 17 2017 at 4:30
