4
votes

Scenario: I have about 14,000 Word documents that need to be converted from "Microsoft Word 97 - 2003 Document" to "Microsoft Word Document". In other words, they need to be upgraded to the 2010 format (.docx).

Question: Is there an easy way to do this using an API or something similar?

Note: I've only been able to find a Microsoft program that converts the documents to .docx, but they still open in compatibility mode. It would be nice if they could just be converted to the new format. I want the same functionality you get when you open an old document and Word gives you the option to convert it.

Edit: I just found http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word._document.convert.aspx and am looking into how to use it.

EDIT2: This is my current function for converting the documents:

Private Sub btnConvert_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnConvert.Click
    ' Stop here if the user cancels the folder picker.
    If FolderBrowserDialog1.ShowDialog() <> System.Windows.Forms.DialogResult.OK Then Return
    If Not String.IsNullOrEmpty(FolderBrowserDialog1.SelectedPath) Then
        lstFiles.Clear()

        DirSearch(FolderBrowserDialog1.SelectedPath)
        ThreadPool.SetMaxThreads(1, 1)
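        ' Match on the extension only; Contains would also drop any path with ".docx" in a folder name.
        lstFiles.RemoveAll(Function(y) y.EndsWith(".docx", StringComparison.OrdinalIgnoreCase))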
        TextBox1.Text += "Conversion started at " & DateTime.Now().ToString & Environment.NewLine
        For Each x In lstFiles
            ThreadPool.QueueUserWorkItem(New WaitCallback(AddressOf ConvertDoc), x)
        Next

    End If
End Sub
Private Sub ConvertDoc(ByVal path As String)
    Dim word As New Microsoft.Office.Interop.Word.Application
    Dim doc As Microsoft.Office.Interop.Word.Document = Nothing
    word.Visible = False

    Try
        Debug.Print(path)
        ' The optional arguments of Open can simply be omitted in VB.
        doc = word.Documents.Open(path)
        doc.Convert()

    Catch ex As Exception
        ' Report the failure instead of silently swallowing it.
        Debug.Print("Failed to convert " & path & ": " & ex.Message)
    Finally
        ' doc is still Nothing if Open threw, so guard the Close call.
        If doc IsNot Nothing Then doc.Close()
        word.Quit()
    End Try

End Sub

It lets me select a path and then finds all .doc files within the subfolders. That code isn't important; all the files for conversion are in lstFiles. The only problem at the moment is that it takes a really long time to convert even just 10 documents. Should I be using one Word application per document instead of reusing it? Any suggestions?

Also, Word opens after about 2 or 3 conversions and starts flashing, but the conversion keeps going.

EDIT3: Tweaked the code above a little bit and it runs cleaner. It still takes 1 min 10 sec to convert 8 files, though. With 14,000 documents to convert, this method would take roughly 34 hours at that rate.

EDIT4: Changed the code again. It uses a thread pool now and seems to run a bit faster. I still need to run it on a better computer to convert all the documents, or do them slowly, folder by folder. Can anyone think of any other way to optimize this?

I wondered about using threads, but when I ran your first version of the code, I saw it was using up 100% of both my cores with just a single thread, so I didn't think parallelization would help as much as a faster computer. What kind of computer are you using? - Sam Skuce
Windows XP x86, Intel Pentium Dual CPU @ 2.00 GHz, 3.25 GB of RAM. Work computers... - Gage
Other than mine being x64 Windows 7, we're pretty comparable. I wonder if the x86 version is just that much slower than x64, or if we're using a different version of the Office library. I'm using "Microsoft Office 12.0 Object Library" version 2.4.0.0 and "Microsoft Word 12.0 Object Library" version 8.4.0.0. Also, what's the average size of the Word documents you're converting? I think the largest in my sample set was 1 MB or so. - Sam Skuce
Another thought: the Office library actually loads a copy of winword.exe in the background to do its work. Maybe Windows 7 is just better at process startup and/or interprocess communication. Do you have any Windows 7 computer you can run it on? - Sam Skuce
@Sam Skuce, the documents range from 60 KB to 120 KB, so not big at all. I won't have a Windows 7 machine I can run it on for another couple of years (next refresh). I use the threading because it limits the number of conversions that will happen concurrently. - Gage

4 Answers

2
votes

I ran your code locally, with just some minor edits for improved tracing and timing, and it "only" took 13.73 seconds to do 12 files. At that rate, your 14,000 files would take about 4.5 hours. I'm running Visual Studio 2010 on Windows 7 x64 with a dual-core processor. Perhaps you can just use a faster computer?

Here's my full code. It is just a form with a single button, Button1, and a FolderBrowserDialog, FolderBrowserDialog1:

Imports System.IO

Public Class Form1

Dim lstFiles As List(Of String) = New List(Of String)

Private Sub DirSearch(path As String)


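    ' Collect the .doc files in this folder; the comparison ignores case so ".DOC" is found too.
    Dim docFiles = From file In Directory.GetFiles(path) Where file.EndsWith(".doc", StringComparison.OrdinalIgnoreCase) Select file

    lstFiles.AddRange(docFiles)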

    For Each subdir As String In Directory.GetDirectories(path)
        DirSearch(subdir)
    Next
End Sub

Private Sub Button1_Click(sender As System.Object, e As System.EventArgs) Handles Button1.Click
    FolderBrowserDialog1.ShowDialog()

    If Not String.IsNullOrEmpty(FolderBrowserDialog1.SelectedPath) Then
        lstFiles.Clear()

        DirSearch(FolderBrowserDialog1.SelectedPath)
        Dim word As New Microsoft.Office.Interop.Word.Application
        Dim doc As Microsoft.Office.Interop.Word.Document
        ' Keep Word hidden; this single instance is reused for every document.
        word.Visible = False
        ' Defensive: drop anything that is already a .docx file.
        lstFiles.RemoveAll(Function(y) y.EndsWith(".docx", StringComparison.OrdinalIgnoreCase))
        Dim startTime As DateTime = DateTime.Now
        Debug.Print("Timer started at " & DateTime.Now().ToString & Environment.NewLine)
        For Each x In lstFiles
            Debug.Print(x + Environment.NewLine)
            doc = word.Documents.Open(x)
            doc.Convert()
            doc.Close()
        Next
        word.Quit()
        Dim endTime As DateTime = DateTime.Now
        Debug.Print("Took " & endTime.Subtract(startTime).TotalSeconds & " seconds to process " & lstFiles.Count & " documents" & Environment.NewLine)
    End If

End Sub
End Class

2
votes

Use Word automation: open each document and save it with the WdSaveFormat enumeration value wdFormatDocumentDefault, which should be .docx.

http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.wdsaveformat%28v=office.14%29.aspx

Or try your hand at the Convert method you mentioned. Either way, it is 100% possible and should be fairly easy.

Edit: if the converter Daniel posted works, that's far easier and he deserves all the credit : )
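
The open-and-save approach described in this answer could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested implementation: the module name DocxConverter and the helpers ConvertToDocx and ConvertAll are hypothetical, the project is assumed to reference the Word interop assembly, and wdFormatXMLDocument is swapped in for wdFormatDocumentDefault so the target format is explicitly .docx rather than whatever the user's default is. The Convert call before saving is an assumption about what is needed to leave compatibility mode.

```vb
Imports System.Collections.Generic
Imports System.IO
Imports Word = Microsoft.Office.Interop.Word

Module DocxConverter

    ' Hypothetical helper: converts one .doc file to a .docx file next to the
    ' original and returns the new path. "app" is a Word instance that the
    ' caller creates once, reuses for every file, and quits at the end.
    Function ConvertToDocx(ByVal app As Word.Application, ByVal sourcePath As String) As String
        Dim targetPath As String = Path.ChangeExtension(sourcePath, ".docx")
        Dim doc As Word.Document = Nothing
        Try
            doc = app.Documents.Open(FileName:=sourcePath, AddToRecentFiles:=False)
            ' Assumption: upgrading the in-memory document first keeps the
            ' saved file from reopening in compatibility mode.
            doc.Convert()
            doc.SaveAs(FileName:=targetPath, FileFormat:=Word.WdSaveFormat.wdFormatXMLDocument)
        Finally
            ' The .docx has already been written, so discard any pending
            ' changes to avoid a save prompt on the original .doc.
            If doc IsNot Nothing Then doc.Close(SaveChanges:=Word.WdSaveOptions.wdDoNotSaveChanges)
        End Try
        Return targetPath
    End Function

    ' Hypothetical driver: one hidden Word instance for the whole batch.
    Sub ConvertAll(ByVal files As IEnumerable(Of String))
        Dim app As New Word.Application
        app.Visible = False
        Try
            For Each f As String In files
                ConvertToDocx(app, f)
            Next
        Finally
            app.Quit()
        End Try
    End Sub

End Module
```

Writing the .docx next to the original, rather than over it, leaves the .doc files untouched, so a failed run can simply be repeated.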

1
votes

You can use the free Office File Converter.

This page explains the settings:

http://technet.microsoft.com/en-us/library/cc179019.aspx

There is a file list setting.

1
votes

Try this:

using Word = Microsoft.Office.Interop.Word;

// filename and filepath are assumed to be defined by the caller.
Word.Application word = new Word.Application();
object missing = Type.Missing;
object source = filename;
object target = String.Format("{0}{1}", filepath, "convertedfile.docx");
// Without an explicit format, SaveAs keeps the old .doc format.
object format = Word.WdSaveFormat.wdFormatXMLDocument;
Word.Document doc = word.Documents.Open(ref source, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
doc.SaveAs(ref target, ref format, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing, ref missing);
// Release the document and the background winword.exe process.
doc.Close(ref missing, ref missing, ref missing);
word.Quit(ref missing, ref missing, ref missing);