0
votes

Apologies in advance if this question is not appropriate for this website.

I have written some documents in Microsoft Word which I need to also display on a website as HTML. To do this I need to enter the content of these documents into a database with HTML tags. So for example this is what I need to put in the database:

<h1>Document Title</h1>
<p>This is the introduction paragraph for the document</p>
<ol>
<li>This is a summary point</li>
</ol>

My problem is that saving the Word document as an HTML page adds so much extra markup (mainly presentational, with inline CSS) that it's hard for me to strip it down to its basic HTML structure, as in my example above.

So how does one keep offline and online content in sync? I wanted to avoid making two versions of the same document (one in Word and one in HTML) because keeping them in sync would be difficult.

Can MS Word be set up to save as HTML without any presentational formatting? Or is there a different piece of software I should be using?

4
What are your preferred programming languages? (JasonPlutext)

4 Answers

1
votes

If the number of documents is limited and you can convert them manually, free online services such as word2cleanhtml.com, www.textfixer.com or document.online-convert.com may help you.

But if you want to automate the process, note that a docx file is actually a zip archive containing all the elements of your document (images, tables, text, etc.). These items are organized into sub-folders, and most of them are in XML format. So you can use techniques like the one explained here to extract the desired content from a docx file.
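As a rough illustration of the zip-plus-XML idea, here is a minimal Python sketch using only the standard library. It reads `word/document.xml` from the archive and emits bare tags. The function name `docx_to_html` and the `STYLE_TO_TAG` mapping are made up for this example, and treating every numbered or bulleted paragraph as an `<ol>` item is a simplifying assumption (telling ordered from unordered lists would require reading `word/numbering.xml` as well). Tables, images and inline formatting are ignored.

```python
import zipfile
from html import escape
from xml.etree import ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical mapping from Word paragraph style IDs to HTML tags;
# adjust it to the styles your documents actually use.
STYLE_TO_TAG = {"Title": "h1", "Heading1": "h1", "Heading2": "h2"}


def docx_to_html(docx_file):
    """Return bare HTML for the paragraphs of a .docx (path or file object)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines, in_list = [], False
    for para in root.iter(W + "p"):
        # Concatenate all text runs in the paragraph.
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        if not text.strip():
            continue  # skip empty paragraphs

        # Assumption: any paragraph with numbering properties is a list item.
        is_item = para.find(f"{W}pPr/{W}numPr") is not None
        if is_item and not in_list:
            lines.append("<ol>")
        elif not is_item and in_list:
            lines.append("</ol>")
        in_list = is_item

        if is_item:
            tag = "li"
        else:
            style = para.find(f"{W}pPr/{W}pStyle")
            style_id = style.get(W + "val") if style is not None else None
            tag = STYLE_TO_TAG.get(style_id, "p")
        lines.append(f"<{tag}>{escape(text)}</{tag}>")

    if in_list:
        lines.append("</ol>")
    return "\n".join(lines)
```

Calling `docx_to_html("report.docx")` (the filename is a placeholder) returns a string you could store in the database. Style IDs vary between templates and Word language versions, so the mapping will likely need tuning.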

There are also some well-known commercial and open source libraries that let you manipulate or extract the contents of docx files. Apache POI and OpenOffice are examples of open source projects, and Aspose.Words for Java is a commercial product that is one of the best APIs available in this field.

1
votes

From experience, I would recommend sticking with Word's save-as-HTML approach. Removing the mso tags is a more tractable problem than the new issues that any alternative solution would introduce.
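To give a sense of what stripping the Word markup might involve, here is a minimal Python sketch built on the standard library's `html.parser`. It keeps a whitelist of structural tags, discards all attributes, and drops the contents of `<head>`, `<style>` and `<script>`. The names `WordHtmlCleaner`, `clean_word_html`, `KEEP` and `SKIP_CONTENT` are invented for this example, and the whitelist is an assumption you would adapt. Links lose their `href` because every attribute is removed, and Word's fake lists (paragraphs styled to look like list items) are not converted to real `<li>` elements.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist of tags to keep; any other tag is dropped, its text kept.
KEEP = {"h1", "h2", "h3", "p", "ol", "ul", "li",
        "table", "tr", "td", "th", "strong", "em", "b", "i"}
# Elements whose entire content is discarded.
SKIP_CONTENT = {"head", "style", "script", "title"}


class WordHtmlCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text arrives unescaped, so escape it again for output.
            self.out.append(escape(data, quote=False))


def clean_word_html(markup):
    """Return the markup reduced to whitelisted, attribute-free tags."""
    cleaner = WordHtmlCleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.out)
```

Word's conditional comments (`<!--[if gte mso 9]> ... <![endif]-->`) are parsed as ordinary comments and therefore vanish, since the class does not override `handle_comment`.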

There are lots of JavaScript rich-text editors, such as FCKEditor and TinyMCE, that strip Word tags. I would recommend looking into these. Are these plugins open source?

1
votes

Thank you for the replies. I tried the various online converters, but they never converted lists properly: numbered lists were put into <p> elements, which was wrong. In the end I found a much easier way to do it.

Copy and paste the entire Word document into Adobe Dreamweaver. Then go into code view and you will see that Dreamweaver has beautifully applied the correct, clean, HTML markup!

0
votes

If you use ColdFusion, you can use DocxExtractor: http://docxextractor.riaforge.org/

You have access to all the source, so it can be modified to produce the HTML formatting you need.

Disclaimer: I wrote it