How do I use Apache POI to read a .DOC file in Java to separate images from text?

Question

I need to read a Word .doc file from Java that contains both text and images. I need to tell the images and the text apart and write them out to two separate files.

I've recently heard about Apache POI. How can I use it to read Word .doc files?

Unknown Unknown · Accepted Answer · 2009-02-28T06:07:22

The examples and sample code on Apache's site are pretty good. I recommend you start there:

http://poi.apache.org/hwpf/quick-guide.html

To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the document's Range with getRange(), then get the paragraphs from that range. From each paragraph you can then read its text and other properties.
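The steps above can be sketched roughly as follows. This is a minimal, untested-against-your-files sketch, assuming a reasonably recent POI release with both the poi and poi-scratchpad jars on the classpath (HWPF lives in the scratchpad module). The class name DocTextDump, the clean() helper, and the two command-line arguments are made up for illustration and are not part of POI.

```java
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;

class DocTextDump {

    // Hypothetical helper (not a POI method): HWPF paragraph text still
    // contains control characters such as the trailing paragraph mark '\r'
    // and placeholders for embedded objects. Keep tabs, drop the rest.
    static String clean(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (c >= ' ' || c == '\t') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Usage (assumed): java DocTextDump input.doc output.txt
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0]);
             PrintWriter out = new PrintWriter(args[1], "UTF-8")) {
            HWPFDocument doc = new HWPFDocument(in);
            Range range = doc.getRange();
            // Walk the paragraphs in document order, one output line each.
            for (int i = 0; i < range.numParagraphs(); i++) {
                Paragraph paragraph = range.getParagraph(i);
                out.println(clean(paragraph.text()));
            }
        }
    }
}
```

Writing one paragraph per line keeps the text file easy to diff against the original; if you need formatting details, the Paragraph and CharacterRun objects expose them.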

See here for an example of extracting an image, and here for the latest revision as of this writing.
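For the image half of the question, something along these lines may work. It is a sketch under assumptions, not the linked example: it assumes the poi and poi-scratchpad jars are on the classpath and uses HWPFDocument.getPicturesTable() to list the embedded pictures. The class name DocImageDump, the imageFileName() helper, and the "image-N" naming scheme are invented here for illustration.

```java
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Picture;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class DocImageDump {

    // Hypothetical helper (not a POI method): builds an output file name,
    // falling back to ".bin" when no extension could be suggested.
    static String imageFileName(int index, String extension) {
        String ext = (extension == null || extension.isEmpty()) ? "bin" : extension;
        return "image-" + index + "." + ext;
    }

    // Usage (assumed): java DocImageDump input.doc outputDirectory
    public static void main(String[] args) throws IOException {
        Path outDir = Paths.get(args[1]);
        Files.createDirectories(outDir);
        try (InputStream in = new FileInputStream(args[0])) {
            HWPFDocument doc = new HWPFDocument(in);
            List<Picture> pictures = doc.getPicturesTable().getAllPictures();
            int index = 1;
            for (Picture picture : pictures) {
                // suggestFileExtension() guesses the type from the image data.
                String name = imageFileName(index++, picture.suggestFileExtension());
                Files.write(outDir.resolve(name), picture.getContent());
            }
        }
    }
}
```

Each picture ends up in its own file inside the output directory, so the images stay separate from whatever you do with the text.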

And of course, there are the Javadocs.

Note that, according to the POI site, "HWPF is still in early development."
