3
votes

I want a user to be able to upload a Word document and have my program split it into separate Word documents. The problem is that the splitting will need to be manual, as the Word documents are not all formatted the same way. My initial thought is that, before uploading, the user tags each section with a beginning and end tag of some sort (maybe a comment), which my program can then parse to split the document into separate documents. This also needs to work for both .doc and .docx, so a common solution is desirable.

Ex. Input:

Doc1

Chapter 1

Blah Blah Blah

Chapter 2

Blah blah

/end Doc1

Ex. Output:

Doc1

Chapter 1

Blah Blah Blah

/end Doc1

Doc2

Chapter 2

Blah blah

/end Doc2

Any ideas? I have been struggling with this for a while.

5
My employer requires quite a few .doc and .docx modifications. The problem is that a solution covering both formats is very hard. For .docx, the best bet is to unzip the document like a zip file, and then you can work with the XML files inside. You might be able to use styles or custom styles to know where to split the document, or have the user add some custom tag. Unfortunately, a graceful solution is very difficult. - Tim C
Right, I definitely don't think an automated solution is feasible. The idea is that the user tags the volumes prior to upload and I then split based on those tags. - Holograham

5 Answers

4
votes

What you want to do is non-trivial! I have done my fair share of document manipulation. That said, if you are working with DOCX, it is not too bad these days thanks to the supporting libraries. See:

http://openxmldeveloper.org/

Older versions (.doc) are more difficult: you would need to source a library for them or, as suggested, use macros.

Is the "program" a web site? If so, make sure you do not use COM interop! Microsoft does not recommend or support automating Office from server-side code.
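
To give a feel for the DOCX side, here is a minimal sketch in Python using only the standard library. It relies on the fact that a .docx file is a zip archive whose main text lives in word/document.xml. The marker convention (a paragraph reading "Doc1" opens a section and "/end Doc1" closes it), the regular expressions, and the function names read_paragraphs and split_by_markers are all assumptions made for illustration. It only extracts and groups plain paragraph text; it does not handle .doc files and does not write the pieces back out as Word documents.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical marker convention chosen for this sketch
START = re.compile(r"^(Doc\s*\d+)$")
END = re.compile(r"^/end\s+Doc\s*\d+$")


def read_paragraphs(docx_file):
    """Return the plain text of each top-level body paragraph in a .docx."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.findall("w:body/w:p", NS):
        # A paragraph's text is spread over one or more w:t elements
        runs = p.iter("{%s}t" % W_NS)
        paragraphs.append("".join(t.text or "" for t in runs))
    return paragraphs


def split_by_markers(paragraphs):
    """Group paragraph texts by the start/end markers that enclose them."""
    sections = {}
    current = None
    for text in paragraphs:
        stripped = text.strip()
        if current is None:
            match = START.match(stripped)
            if match:
                # A start marker opens a new section
                current = match.group(1)
                sections[current] = []
        elif END.match(stripped):
            # An end marker closes the open section
            current = None
        else:
            sections[current].append(text)
    return sections
```

Calling split_by_markers(read_paragraphs("upload.docx")), where upload.docx is a placeholder file name, would return a dictionary mapping each marker name to the paragraphs between its tags. Text outside any tagged section is ignored, and paragraphs inside tables are skipped because only direct children of the body are read.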

0
votes

I'd say your best bet is to investigate VSTO or VBA macros to accomplish this. Both will give you full access to the object model in whatever version the document is.

0
votes

Something that may help is HTML Transit. It's incredibly old and incredibly expensive software, and from an initial search it may no longer be supported. It did, however, have the ability to take one Word document and split it into smaller pieces (converting it to HTML in the process). Something to look into, maybe. Google "HTML Transit" for more information and a free demo.

0
votes

I've had great success with Aspose.Words for document manipulation and generation.

0
votes

Here is a VBA macro that splits each file in a folder into subdocuments:

Sub UpdateDocuments()
    ' Opens every document in a chosen folder and splits it into subdocuments
    Dim strFolder As String, strFile As String, wdDoc As Document

    strFolder = GetFolder
    If strFolder = "" Then Exit Sub

    Application.ScreenUpdating = False
    strFile = Dir(strFolder & "\*.doc", vbNormal)
    While strFile <> ""
        Set wdDoc = Documents.Open(FileName:=strFolder & "\" & strFile, AddToRecentFiles:=False, Visible:=False)
        ' Call your other macro or insert its code here, e.g. BreakOnSection
        wdDoc.Activate

        ' Switch to outline view before creating the subdocuments
        wdDoc.ActiveWindow.View.Type = wdOutlineView
        ' Create subdocuments from the whole document body
        wdDoc.Subdocuments.AddFromRange Range:=wdDoc.Content
        ' The target folder must already exist
        wdDoc.SaveAs "C:\Data\Split\" & wdDoc.Name
        wdDoc.Close SaveChanges:=True

        strFile = Dir()
    Wend
    Set wdDoc = Nothing
    Application.ScreenUpdating = True
End Sub

Function GetFolder() As String
    ' Shows a folder picker and returns the chosen path, or "" if cancelled
    Dim oFolder As Object
    GetFolder = ""
    Set oFolder = CreateObject("Shell.Application").BrowseForFolder(0, "Choose a folder", 0)
    If (Not oFolder Is Nothing) Then GetFolder = oFolder.Items.Item.Path
    Set oFolder = Nothing
End Function