24
votes

I need to convert a Word document into one or more HTML files in Java. The function takes a Word document as input and produces HTML files based on the number of pages in the document, i.e. if the Word document has 3 pages, then 3 HTML files are generated, split at the page breaks.

I searched for open-source/non-commercial APIs that can convert doc to HTML, but found nothing. If anybody has done this kind of job before, please help.

Thanks

11
Here are some starting points for you. Good luck. On Microsoft's website, you can find documentation for the .doc format, and on the ECMA website, the .docx format. Microsoft has a category for Java on their OpenXML developer blog, including a post specifically about converting OpenXML to XHTML in Java. - lewinski
theserverside.com/news/thread.tss?thread_id=41942#216880 -- this has worked quite well for me earlier - anjanb
here's something used by someone who's been doing this for a while -- jroller.com/rickard/entry/word_to_html_in_java - anjanb

11 Answers

3
votes

We use tm-extractors (http://mvnrepository.com/artifact/org.textmining/tm-extractors), and fall back to the commercial Aspose (http://www.aspose.com/). Both have native Java APIs.

7
votes

I recommend JODConverter. It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today.

JODConverter has plenty of documentation, scripts, and tutorials to help you out.

4
votes

I've used the following approach successfully in production systems where the new MS Word XML format isn't available:

Spawn a process that does something similar to:

http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html

You'd probably want to start OpenOffice once when your program starts, and then call the Python script as many times as you need during the run (with some sort of check to ensure the ooffice process is still there).

The other option is to spawn the following sort of command every time you need to do the conversion:

ooffice -headless "macro://<path to ooffice vb macro to convert, with parameter pointing to file>"

I've used the macro approach multiple times and it works well (sorry, I don't have the macro code available).
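As a rough illustration of the "spawn a command per conversion" idea, here is a minimal sketch using `ProcessBuilder`. Note that it swaps the macro for the `--convert-to` option of a LibreOffice-style `soffice` binary, which is an assumption on my part (older OpenOffice builds use single-dash flags and may need the macro route instead). The class name, method names, and the timeout parameter are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: spawns a headless office process for each conversion.
final class HeadlessOfficeConverter {

    // Builds the command line. Kept separate so it can be inspected without running anything.
    // Assumes a LibreOffice-style binary that understands --headless and --convert-to.
    static List<String> buildCommand(String officeBinary, Path input, Path outputDir) {
        return List.of(
                officeBinary,
                "--headless",
                "--convert-to", "html",
                "--outdir", outputDir.toString(),
                input.toString());
    }

    // Runs the conversion and waits for it to finish.
    // Returns true only if the process exits with status 0 before the timeout.
    static boolean convert(String officeBinary, Path input, Path outputDir, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(officeBinary, input, outputDir))
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // avoid blocking on a full pipe
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();                            // give up on a hung conversion
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

Something like `HeadlessOfficeConverter.convert("soffice", Path.of("in.doc"), Path.of("out"), 60)` would then write the HTML into the `out` directory, provided the binary is on the PATH. The timeout is there because a hung office process is the usual failure mode with this approach.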

While there are mechanisms for doing it via MS Word, they're not easy from Java, and do require other support programs to drive MS Word via OLE.

I've used abiword before too, which works well for many documents, but does get confused with more complex documents (ooffice seems to handle everything I've thrown at it). Abiword has a slightly easier command line interface for conversion than ooffice.

2
votes

If it's a docx, you could use docx4j (ASL v2), which uses XSLT to create the HTML.

However, it will give you a single HTML for the whole document.

If you wanted an HTML per page, you could do something with the lastRenderedPageBreak tag that Word puts into the docx (assuming you used Word to create it).
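To make the lastRenderedPageBreak idea a bit more concrete, here is a minimal sketch that uses only the JDK (not docx4j). It reads `word/document.xml` out of the docx zip and groups paragraph text into pages, starting a new page at every paragraph that contains a `w:lastRenderedPageBreak`. The class and method names are hypothetical, and it deliberately simplifies: a break in the middle of a paragraph moves the whole paragraph to the next page, and paragraphs nested in text boxes are not treated specially.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper: groups the paragraphs of a docx by rendered page.
final class DocxPageSplitter {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Reads the main document part out of the docx (which is a zip archive).
    static String readDocumentXml(Path docx) {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("Not a docx: word/document.xml is missing");
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns one list of paragraph texts per page, in document order.
    static List<List<String>> splitIntoPages(String documentXml) {
        Document dom;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the w: namespace lookups below
            dom = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document.xml", e);
        }

        List<List<String>> pages = new ArrayList<>();
        pages.add(new ArrayList<>());
        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            boolean startsNewPage =
                    paragraph.getElementsByTagNameNS(W_NS, "lastRenderedPageBreak").getLength() > 0;
            // Open a new page, unless the current one is still empty.
            if (startsNewPage && !pages.get(pages.size() - 1).isEmpty()) {
                pages.add(new ArrayList<>());
            }
            pages.get(pages.size() - 1).add(textOf(paragraph));
        }
        return pages;
    }

    // Concatenates the text of all w:t elements inside a paragraph.
    static String textOf(Element paragraph) {
        StringBuilder text = new StringBuilder();
        NodeList runs = paragraph.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < runs.getLength(); i++) {
            text.append(runs.item(i).getTextContent());
        }
        return text.toString();
    }
}
```

Each inner list could then be fed to whatever produces the HTML, giving one file per page. Keep in mind the markers only reflect the last time Word laid out the document, so they can be stale or missing.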

2
votes

This is easier with the newer MS Word docx format, because the content is stored as XML. You can use an XSL stylesheet to transform the Word XML into HTML.
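For what it's worth, the XSL route can be sketched with nothing but the JDK's `javax.xml.transform`. The stylesheet below is a toy I made up for illustration: it only turns each top-level `w:p` into a `<p>` with its text, and ignores styles, tables, images, and everything else a real stylesheet would need. The class and method names are hypothetical, and the input is assumed to be the contents of `word/document.xml` from the docx.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies a tiny XSLT stylesheet to WordprocessingML.
final class WordXmlToHtml {

    // Toy stylesheet: every body-level paragraph becomes <p>text</p>.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\""
          + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
          + " xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\""
          + " exclude-result-prefixes=\"w\">"
          + "<xsl:output method=\"html\" indent=\"no\"/>"
          + "<xsl:template match=\"/\">"
          + "<html><body><xsl:apply-templates select=\"//w:body/w:p\"/></body></html>"
          + "</xsl:template>"
          + "<xsl:template match=\"w:p\">"
          + "<p><xsl:for-each select=\".//w:t\"><xsl:value-of select=\".\"/></xsl:for-each></p>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Transforms the XML of word/document.xml into an HTML string.
    static String toHtml(String documentXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter html = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(documentXml)),
                    new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

In practice you would load a much fuller stylesheet from a file instead of an inline string; the Java side stays the same.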

If, however, your Word document is in the older binary format, you can use the POI library (http://poi.apache.org/) to read it into Java objects, and from there convert it to HTML using a Java HTML library such as:

http://www.dom4j.org/dom4j-1.4/apidocs/org/dom4j/io/HTMLWriter.html

1
votes

I see this thread turns up in external links and has the occasional post, so I thought I'd post an update (hope no one minds). OpenOffice continues to evolve, and release 3.2 improves the Word import/export filters again. OpenOffice and Java both run on many platforms, so Java systems can use the OpenOffice UNO API directly to import, manipulate, and export documents in many formats (including Word and PDF), or use a library like JODReports or Docmosis to make this easier. Both have free/open options.

1
votes

I tried the approach from this site and it worked for me: http://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML

It only works with docx. It converts the document to HTML, including the images embedded in the Word document.

    // Needs java.io.*, XWPFDocument (Apache POI), and the XHTMLConverter, XHTMLOptions,
    // FileImageExtractor and FileURIResolver classes from xdocreport.

    // 1) Load the DOCX into an XWPFDocument; try-with-resources closes both streams
    File docx = new File("c:/document.docx");
    try (InputStream in = new FileInputStream(docx);
         OutputStream out = new FileOutputStream(new File("c:/document.html"))) {
        XWPFDocument document = new XWPFDocument(in);

        // 2) Prepare the XHTML options
        XHTMLOptions options = XHTMLOptions.create();

        // 3) Extract the embedded images into a folder named after the document
        //    (the original concatenated the InputStream itself into the path)
        File imageFolder = new File("target/images/" + docx.getName());
        options.setExtractor(new FileImageExtractor(imageFolder));

        // 4) Resolve image URIs in the generated HTML against that folder
        options.URIResolver(new FileURIResolver(imageFolder));

        // 5) Convert the document and write the HTML
        XHTMLConverter.getInstance().convert(document, out, options);
    }

I hope this solves your issue.

0
votes

You'd have to find the MS Word .doc specification (since the format is basically a binary dump of whatever is in Word at that point in time) and slowly go through it element by element, converting MS Word "objects/states" to their HTML equivalents. You might be able to find a script that does it for you, since this really isn't fun work and I'd advise against it (converting file formats, or even reading proprietary files on your own, is always hard and often incomplete). PS: just google doc2html.

0
votes

If you are targeting Word 2007 files using the OOXML format, then this article might help. There is also the Ooxml4j project, which is implementing an OOXML library for Java.

If you are targeting the binary files, though... that's another problem.

0
votes

    import officetools.OfficeFile; // package available at www.dancrintea.ro/doc-to-pdf/
    ...
    FileInputStream fis = new FileInputStream(new File("test.doc"));
    FileOutputStream fos = new FileOutputStream(new File("test.html"));
    OfficeFile f = new OfficeFile(fis,"localhost","8100", true);
    f.convert(fos,"html");

All possible conversions:

doc --> pdf, html, txt, rtf

xls --> pdf, html, csv

ppt --> pdf, swf

html --> pdf

0
votes

You can use Microsoft Office Online.

First, on the server side, request https://view.officeapps.live.com/op/view.aspx?src='your doc file online url'

Then use jsoup to parse the resulting HTML.

When accessed from a mobile device, the HTML will be wrapped in a frame.
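One detail that is easy to get wrong here is that the document URL has to be URL-encoded before it goes into the `src` parameter. A minimal sketch of just that step, using only the JDK (the class and method names are hypothetical, and the endpoint's behaviour is taken from this answer rather than verified). Fetching the page and parsing it with jsoup is left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the viewer URL for a publicly reachable document.
final class OfficeOnlineViewerUrl {
    static final String VIEWER = "https://view.officeapps.live.com/op/view.aspx?src=";

    // The document URL is percent-encoded so that its own '?', '&' and '/' characters
    // are not interpreted as part of the viewer's query string.
    static String forDocument(String documentUrl) {
        return VIEWER + URLEncoder.encode(documentUrl, StandardCharsets.UTF_8);
    }
}
```

For example, `OfficeOnlineViewerUrl.forDocument("https://example.com/test.doc")` gives a URL ending in `src=https%3A%2F%2Fexample.com%2Ftest.doc`. The document itself must be reachable from the public internet, since the conversion happens on Microsoft's side.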