2
votes

I am trying to delete all embedded objects from Word and PowerPoint files using the Open XML SDK. I am new to Open XML and not sure whether I am doing this correctly. Below is the code I have. My intention is to remove every embedded object and every embedded image. Both snippets produce errors when executed.

Code that I tried to delete all embedded items in the document.

using (var wdDoc = WordprocessingDocument.Open(wordFilePath, true))
{
    var docPart = wdDoc.MainDocumentPart;
    var document = docPart.Document;
    var embeddedObjectsCount = docPart.EmbeddedObjectParts.Count();
    while (embeddedObjectsCount > 0)
    {
        docPart.DeletePart(docPart.EmbeddedObjectParts.FirstOrDefault());
        embeddedObjectsCount = docPart.EmbeddedObjectParts.Count();
    }
}

Code that I tried to delete all images in the document. (This works partially if I don't have any objects embedded in the document.)

using (var wdDoc = WordprocessingDocument.Open(wordFilePath, true))
{
    var docPart = wdDoc.MainDocumentPart;
    var document = docPart.Document;
    var imageObjectsCount = docPart.ImageParts.Count();
    while (imageObjectsCount > 0)
    {
        docPart.DeletePart(docPart.ImageParts.FirstOrDefault());
        imageObjectsCount = docPart.ImageParts.Count();
    }
}

When I run the above code, the file I use gets corrupted. I would like to know how to remove all embedded objects from a Word document without corrupting the file.

I haven't done anything on PowerPoint yet, but I expect it to be similar to the Word document case.

2
My code is partially the same. Are you closing the document after executing? - EasyE
I haven't fully understood the concepts of Open XML yet. The reference code available on MSDN does not show the closing part. Could you please explain? I think the using block will take care of closing the file. - Kannan Suresh
I am assuming that you are opening an existing Word document, since you are trying to delete objects that are already embedded, correct? - EasyE
Yes. I am passing the file path in the using statement. - Kannan Suresh
That block of code should not corrupt a file. I am guessing that the using block is not actually closing your file and that is why it becomes corrupt. A good place to start is identifying which version of Open XML you are using. MSDN is sometimes outdated and misleading; try openxmldeveloper.org, which has easily digestible articles that can point you in the right direction. Also make sure you check the compatibility of your Open XML version with the version of Office you are using. - EasyE

2 Answers

1
votes

I managed to find a solution to my problem. I had to dive into the concepts of the Open XML SDK to get there. However, I am not sure whether this is the optimal solution.

Goal

  1. Remove all embedded objects in PowerPoint and Word.

  2. Remove all images in PowerPoint and Word.

For Word

//using System.Linq;
//using DocumentFormat.OpenXml.Packaging;
//using Ovml = DocumentFormat.OpenXml.Vml.Office;
using (var wdDoc = WordprocessingDocument.Open(wordFilePath, true))
{
    var docPart = wdDoc.MainDocumentPart;
    var document = docPart.Document;
    //Take a snapshot with ToList() so that removing elements does not disturb the enumeration.
    var oleObjects = document.Body.Descendants<Ovml.OleObject>().ToList();
    //Determine whether there are any embedded objects in the document
    if (oleObjects.Any())
    {
        foreach (var oleObj in oleObjects)
        {
            oleObj.Remove(); //Remove each OLE object from the markup. This removes the object from view in Word.
        }
        //Delete the embedded objects. This removes the actual attached files from the document.
        docPart.DeleteParts(docPart.EmbeddedObjectParts);
        //Delete all pictures in the document.
        //Note: DeleteParts does not touch the markup, so elements that still reference
        //a deleted image by relationship id are not cleaned up here.
        docPart.DeleteParts(docPart.ImageParts);
    }
}

For PowerPoint

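//using System.Linq;
//using DocumentFormat.OpenXml.Packaging;
//using DocumentFormat.OpenXml.Presentation; (OleObject and GraphicFrame come from this namespace)
using (var ppt = PresentationDocument.Open(powerPointFilePath, true))
{
    //Each slide is stored as its own part under the presentation part.
    var slides = ppt.PresentationPart.SlideParts;
    foreach (var slide in slides)
    {
        //Remove OLE objects
        var oleObjectCount = slide.Slide.Descendants<OleObject>().Count();
        while (oleObjectCount > 0)
        {
            var oleObj = slide.Slide.Descendants<OleObject>().First();
            var oleObjGraphicFrame = oleObj.Ancestors<GraphicFrame>().FirstOrDefault();
            if (oleObjGraphicFrame != null)
            {
                //Removing the frame also removes everything nested inside it.
                oleObjGraphicFrame.Remove();
            }
            else
            {
                //No enclosing frame: remove the object itself so the loop always makes progress.
                oleObj.Remove();
            }
            oleObjectCount = slide.Slide.Descendants<OleObject>().Count();
        }
        //Delete embedded objects
        slide.DeleteParts(slide.EmbeddedObjectParts);
        //Delete all pictures (markup that still references them is not cleaned up here)
        slide.DeleteParts(slide.ImageParts);
    }
}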
0
votes

In my experience, the fastest way to "corrupt" an OpenXML document is to have a bad relation pointer. The fastest way to get a handle on what's behind those cryptic error messages is to go straight to the raw OpenXML markup.

To get an idea of what is happening:

  1. Make a copy of your file before running your code, call this A.docx
  2. Run your code and make a copy of your result, call this B.docx
  3. Rename A.docx and B.docx to A.zip and B.zip
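
If unzipping by hand gets tedious, the same before/after comparison can be sketched in C#, since a .docx file is an ordinary ZIP archive and can be read without renaming it. This is only a minimal sketch: the names `PackageDiff`, `ListEntries` and `RemovedEntries` are made up for illustration and are not part of the Open XML SDK. It relies only on `System.IO.Compression` (on .NET Framework this assumes references to the System.IO.Compression and System.IO.Compression.FileSystem assemblies).

```csharp
using System;
using System.Collections.Generic;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class PackageDiff
{
    // Returns the entry names (part paths) stored inside an Office file.
    public static SortedSet<string> ListEntries(string path)
    {
        using (ZipArchive archive = ZipFile.OpenRead(path))
        {
            return new SortedSet<string>(
                archive.Entries.Select(e => e.FullName),
                StringComparer.Ordinal);
        }
    }

    // Entries that exist in the "before" list but are missing from the "after" list.
    public static List<string> RemovedEntries(IEnumerable<string> before, IEnumerable<string> after)
    {
        return before.Except(after)
                     .OrderBy(name => name, StringComparer.Ordinal)
                     .ToList();
    }
}
```

Calling `PackageDiff.RemovedEntries(PackageDiff.ListEntries("A.docx"), PackageDiff.ListEntries("B.docx"))` should list the parts your code deleted; swapping the arguments lists anything that was added.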

Investigate the source file

First, inside of A.zip, open the file called [Content_Types].xml. Take note of the parts that you would like to remove. Think of this file as a declaration to the word processor of the types of files that it will encounter in the sub-directories.

Parts such as the document content (word/document.xml) or the footnotes part (word/footnotes.xml) have their own relationship parts, named [part name].rels and stored in a _rels folder next to the part (for example, word/_rels/document.xml.rels).

For example, document.xml.rels will hold relation information for things like charts, hyperlinks, and images in document.xml; likewise, footnotes.xml.rels holds information on things like hyperlinks in footnotes.xml.
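
To make the idea of a "bad relation pointer" concrete, here is a rough sketch that takes the XML text of a part (for example word/document.xml) and of its .rels part, and reports relationship ids that the markup uses but the .rels part does not define. The class and method names are hypothetical, and the two namespace URIs are the ones used by transitional (non-strict) Office files, which I am assuming is what you have.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class RelationshipCheck
{
    // Namespace of the <Relationship> elements inside a .rels part.
    static readonly XNamespace PackageRels =
        "http://schemas.openxmlformats.org/package/2006/relationships";

    // Namespace of the r:id, r:embed, r:link ... attributes in the part markup
    // (assumes a transitional, non-strict file).
    static readonly XNamespace DocRels =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Returns relationship ids referenced in partXml that relsXml does not define.
    public static List<string> FindDanglingIds(string partXml, string relsXml)
    {
        var defined = new HashSet<string>(
            XDocument.Parse(relsXml).Root
                .Elements(PackageRels + "Relationship")
                .Select(r => (string)r.Attribute("Id"))
                .Where(id => id != null));

        return XDocument.Parse(partXml)
            .Descendants()
            .SelectMany(e => e.Attributes())
            .Where(a => a.Name.Namespace == DocRels && a.Value.Length > 0)
            .Select(a => a.Value)
            .Where(id => !defined.Contains(id))
            .Distinct()
            .ToList();
    }
}
```

If this returns anything for B's document.xml, the markup still points at a part that was deleted, and that reference needs to be removed as well.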

Investigate the result file

Now open B.zip and compare the [Content_Types].xml files. Do you see a part there that you intended to delete? Is there a part missing that you did not intend to delete?

Inside of the word sub-directory in B.zip, do you see any embedded files that are not listed in the [Content_Types].xml file?
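
That last check can also be scripted. The sketch below is an assumption-laden illustration rather than a definitive tool: given the text of [Content_Types].xml and the list of entry names in the archive, it returns the entries that are covered neither by a Default (extension) declaration nor by an Override (part name) declaration. `ContentTypeCheck` and `FindUndeclaredParts` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class ContentTypeCheck
{
    static readonly XNamespace Ct =
        "http://schemas.openxmlformats.org/package/2006/content-types";

    // Returns archive entries that [Content_Types].xml does not declare.
    public static List<string> FindUndeclaredParts(string contentTypesXml, IEnumerable<string> entryNames)
    {
        XElement root = XDocument.Parse(contentTypesXml).Root;

        // <Default Extension="png" .../> covers every file with that extension.
        var extensions = new HashSet<string>(
            root.Elements(Ct + "Default")
                .Select(d => (string)d.Attribute("Extension"))
                .Where(x => x != null),
            StringComparer.OrdinalIgnoreCase);

        // <Override PartName="/word/document.xml" .../> covers one specific part.
        var overrides = new HashSet<string>(
            root.Elements(Ct + "Override")
                .Select(o => (string)o.Attribute("PartName"))
                .Where(x => x != null),
            StringComparer.OrdinalIgnoreCase);

        return entryNames
            // Skip the content types file itself and any directory entries.
            .Where(n => n != "[Content_Types].xml" && !n.EndsWith("/"))
            // ZIP entry names have no leading slash, part names do.
            .Where(n => !overrides.Contains("/" + n)
                     && !extensions.Contains(Path.GetExtension(n).TrimStart('.')))
            .ToList();
    }
}
```

The entry names can come from any ZIP reader; an empty result means every file in the archive has a declared content type.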

If you take a look at the raw markup and the error doesn't jump out at you, feel free to comment with some more details about your file structure and we can troubleshoot from there.