How to extract text from .doc document using apache poi?

java - How to extract text from .doc document using apache poi? - Stack Overflow

Asked 8 years, 10 months ago

Active 4 years, 11 months ago

Viewed 2k times

I used the two code snippets below to extract text from a .doc file:

    HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
    Range range = document.getRange();
    int len = range.numParagraphs();
    StringBuilder builder = new StringBuilder();

    for (int i = 0; i < len; i++) {
        builder.append(range.getParagraph(i).text());
    }

and

    HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
    WordExtractor wordExtractor = new WordExtractor(document);
    String[] paragraphs = wordExtractor.getParagraphText();
    StringBuilder builder = new StringBuilder();

    for (String p : paragraphs) {
        builder.append(p);
    }

However, both of them always output some strange characters, for example:

PAGEREF_Toc351848910\h10HYPERLINK\l _Toc351848911

CITATIONPla\l1033[HYPERLINK\l"Pla"13]

So I want to know where they come from and how to remove them when extracting text from a .doc file.

Thanks in advance

asked Mar 23 2013 at 17:57

thoitbk


The strange text you show is a table of contents entry, a TOC reference, and a citation. Sorry, I don't know how to remove them.
– grahamj42
Mar 23 2013 at 20:45
Have you tried using WordExtractor#stripFields(String) to remove them?
– Gagravarr
Mar 24 2013 at 21:09
It works. Thanks a lot.
– thoitbk
Mar 28 2013 at 17:55
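
For readers wondering why stripping fields helps: in the binary .doc format, a field such as PAGEREF, HYPERLINK or CITATION is stored inline in the text stream, wrapped in three control characters (0x13 begins the field, 0x14 separates the field instruction from its displayed result, 0x15 ends the field). The sketch below is a plain-Java illustration of that idea, not POI's actual implementation; the class name FieldStripSketch and the method stripFieldCodes are made up for this example. It drops the instruction part and keeps the displayed result.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration only; it is not part of Apache POI.
class FieldStripSketch {
    static final char FIELD_BEGIN = '\u0013';     // starts a field
    static final char FIELD_SEPARATOR = '\u0014'; // instruction | displayed result
    static final char FIELD_END = '\u0015';       // ends a field

    // Removes field instructions (e.g. " PAGEREF _Toc... ") and the three
    // marker characters, keeping only the text a reader would see.
    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while we are still in its instruction part.
        Deque<Boolean> inInstruction = new ArrayDeque<>();
        int hidden = 0; // how many open fields are currently in their instruction part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inInstruction.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                if (!inInstruction.isEmpty()) {
                    if (inInstruction.pop()) {
                        hidden--;
                    }
                    inInstruction.push(Boolean.FALSE); // now in the result part
                }
            } else if (c == FIELD_END) {
                if (!inInstruction.isEmpty() && inInstruction.pop()) {
                    hidden--; // field had no separator, so no visible result
                }
            } else if (hidden == 0) {
                out.append(c); // visible text: outside fields or inside a result
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See page " + FIELD_BEGIN + " PAGEREF _Toc351848910 \\h "
                + FIELD_SEPARATOR + "10" + FIELD_END + " for details";
        System.out.println(stripFieldCodes(raw)); // prints "See page 10 for details"
    }
}
```

With POI on the classpath there is no need for a hand-written helper: as suggested in the comment above, the static method WordExtractor.stripFields(String) can be applied to each paragraph before appending it, for example builder.append(WordExtractor.stripFields(p)) in the second snippet from the question.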

1 Answer

I hope this may give you some insight.

    // Imports are an assumption: iText 5 (com.itextpdf.text) plus Apache POI's HWPF module.
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import com.itextpdf.text.Document;
    import com.itextpdf.text.Paragraph;
    import com.itextpdf.text.pdf.PdfWriter;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    private static void ConvertDoctoPdf(String src, String outputPdf) throws Exception {
        try {
            Document pdfdoc = new Document();

            HWPFDocument doc = new HWPFDocument(new FileInputStream(src));

            // Create a WordExtractor to pull the text out of the HWPFDocument.
            WordExtractor we = new WordExtractor(doc);

            OutputStream outputFile = new FileOutputStream(new File(outputPdf));

            // Create a PDF writer that writes the document to the output file.
            PdfWriter.getInstance(pdfdoc, outputFile);

            pdfdoc.open();

            Paragraph para = new Paragraph();

            // Collect all paragraphs from the Word document.
            String[] paragraphs = we.getParagraphText();

            for (int i = 0; i < paragraphs.length; i++) {
                // Append each extracted paragraph to the single PDF paragraph.
                para.add(paragraphs[i]);
                //para.add(new Chunk(Chunk.NEWLINE));
            }
            // Print all extracted paragraphs together.
            System.out.println(para);
            // Add the combined paragraph to the PDF document.
            pdfdoc.add(para);

            pdfdoc.close();
            we.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

answered Feb 16 2017 at 10:31

Om Prakash

This appears to be creating a PDF document. How does that in any way solve the original problem?
– Gagravarr
Feb 16 2017 at 11:56
The line System.out.println(para); prints the extracted paragraphs.
– Om Prakash
Feb 17 2017 at 4:30
