My goal is to extract embedded documents from a OneNote notebook programmatically. The embedded documents are likely to be Office documents, PDFs, and other arbitrary files. I do not have any difficulty getting a Base64 string for inline images, but I do have a problem getting a Base64 string for other file types.

I am using VS 2008 C#, OneNote 2007, Windows XP SP3.

I am using a sample .ONE file, which consists of a small amount of text, a PDF file, and one inline image. I am able to identify the ID of the containing page and the ID of the PDF. I have hard-coded the IDs into the following example.

        // ID of the opened section, returned by OpenHierarchy
        string strID;
        Microsoft.Office.Interop.OneNote.Application onApplication = new Microsoft.Office.Interop.OneNote.Application();
        onApplication.OpenHierarchy(@"D:\Projects\OneNote\test.one",
            System.String.Empty, out strID, Microsoft.Office.Interop.OneNote.CreateFileType.cftSection);


        string strXML1;
        onApplication.GetPageContent("{460ABC12-855F-09E4-3724-85E8DE17BD57}{1}{B0}", out strXML1, PageInfo.piAll);

        // Get page reference
        string strXML2;
        onApplication.GetPageContent("{4AA5B6DF-1C90-0B3D-3FFD-687B0AF4A632}{1}{B0}", out strXML2, PageInfo.piAll);

        //Get Hyperlink to embedded object
        string strHyperlink;
        onApplication.GetHyperlinkToObject("{4AA5B6DF-1C90-0B3D-3FFD-687B0AF4A632}{1}{B0}", "{23A17F23-F743-0C9B-082A-BC6BD5D9CA6E}{13}{B0}", out strHyperlink);

        //Condition to ensure that the ObjectID is good.
        if (!string.IsNullOrEmpty(strHyperlink))
        {
            //Get Base64 string.
            string strBase64;
            onApplication.GetBinaryPageContent("{4AA5B6DF-1C90-0B3D-3FFD-687B0AF4A632}{1}{B0}", "{23A17F23-F743-0C9B-082A-BC6BD5D9CA6E}{13}{B0}", out strBase64);
        }

The application returns a valid hyperlink whether I reference the PDF or the inline image, and it returns a valid Base64 string for the inline image. For the PDF, however, it returns error 0x8004200f, "The binary object does not exist." The same happens when I try a version containing an embedded Word document.

How can I get a Base64 string for the PDF? I am open to using http://onom.codeplex.com/, but I have not found a solution there.

By the way, I am aware that IDs may not be the same from one OneNote session to another. In my tests, I make sure the IDs are correct by manually viewing the XML in debug mode.

Here is a snippet of the XML written to strXML2.

The inline image

<![CDATA[Attachment_Test_01]]>
</one:T>
</one:OE>
</one:Title>
<one:Image format=\"jpg\" originalPageNumber=\"0\" lastModifiedTime=\"2013-06-10T18:39:46.000Z\" objectID=\"{1A32E30F-091E-4F03-8147-D00D0D16C6FD}{20}{B0}\">
<one:Position x=\"90.0\" y=\"104.400001525879\" z=\"3\"/>
<one:Size width=\"767.9999389648437\" height=\"576.0\"/>
<one:Data>/9j/4AAQSkZJRgABAQAAAQABAAD//gA7Q1JFQVRPUjogZ2QtanBlZyB2MS4wICh1c2luZyBJ (SNIP)

The embedded PDF

<![CDATA[4\r\n‘4]]>
</one:OCRText>
<one:OCRToken startPos=\"0\" region=\"0\" line=\"0\" x=\"564.631591796875\" y=\"250.1052703857422\" width=\"6.063148498535156\" height=\"5.30526351928711\"/>
<one:OCRToken startPos=\"3\" region=\"1\" line=\"1\" x=\"684.3789672851562\" y=\"462.3157653808594\" width=\"5.305229187011718\" height=\"6.821067810058594\"/>
</one:OCRData>
</one:Image>
<one:InsertedFile pathCache=\"C:\\TEST\\D62228.pdf\" pathSource=\"C:\\C++_Neural_Networks_And_Fuzzy_Logic.pdf\" preferredName=\"C++_Neural_Networks_And_Fuzzy_Logic.pdf\" lastModifiedTime=\"2013-06-10T18:39:43.000Z\" objectID=\"{23A17F23-F743-0C9B-082A-BC6BD5D9CA6E}{13}{B0}\">

Thank you.

1 Answer

The GetBinaryPageContent API can be used only to retrieve image and ink data. For embedded files, the pathCache attribute of the one:InsertedFile element points to a copy of the file stored in the OneNote cache folder. You can simply read that file from disk.
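
As a rough sketch of that approach (not tested against OneNote itself): parse the XML returned by GetPageContent, collect the pathCache of every InsertedFile element, and Base64-encode the file if that is the form you need. The class name EmbeddedFileReader and its two methods are hypothetical names made up for this illustration. The code assumes .NET 3.5 (LINQ to XML) and that the page XML is the raw string with its namespace declaration intact, not the escaped form shown in the debugger. Elements are matched by local name so the sketch does not depend on the exact OneNote schema namespace.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the OneNote API.
public static class EmbeddedFileReader
{
    // Returns (preferredName, pathCache) pairs for each InsertedFile
    // element that carries a non-empty pathCache attribute.
    public static List<KeyValuePair<string, string>> FindInsertedFiles(string pageXml)
    {
        XDocument doc = XDocument.Parse(pageXml);
        var files = new List<KeyValuePair<string, string>>();

        // Match on the local name to avoid hard-coding the schema namespace.
        foreach (XElement e in doc.Descendants()
                                  .Where(x => x.Name.LocalName == "InsertedFile"))
        {
            string cachePath = (string)e.Attribute("pathCache");
            string name = (string)e.Attribute("preferredName");

            if (string.IsNullOrEmpty(cachePath))
                continue; // nothing cached on disk for this element

            if (string.IsNullOrEmpty(name))
                name = Path.GetFileName(cachePath);

            files.Add(new KeyValuePair<string, string>(name, cachePath));
        }
        return files;
    }

    // Reads a cached file and returns its contents as a Base64 string.
    public static string ReadAsBase64(string cachePath)
    {
        return Convert.ToBase64String(File.ReadAllBytes(cachePath));
    }
}
```

With the variables from the question, usage would look something like `foreach (var f in EmbeddedFileReader.FindInsertedFiles(strXML2)) { string b64 = EmbeddedFileReader.ReadAsBase64(f.Value); }`. It is probably worth checking File.Exists first, since the cached copy may not be present on disk at the time you read the page.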