Extracting text from Lotus Notes XML rich text element

Question

I need to migrate the contents of a Lotus Notes database to SharePoint. The entire database has been exported to XML files (this requirement cannot be changed), and I have to parse these XML files and insert the data into SharePoint.

What's tripping me up are the elements that contain rich text. These XML elements hold an XML representation of the exact rich text format used in the Lotus Notes field, expressed in DXL as described in http://publib.boulder.ibm.com/infocenter/domhelp/v8r0/index.jsp?topic=%2Fcom.ibm.designer.domino.main.doc%2FH_PARAGRAPH_DEFINITIONS_ELEMENT_XML.html

I don't need to keep the actual formatting of the text (unless that is as easy as retrieving the plain text), but if I simply extract the value of the XML element containing the rich text (using LINQ to XML), I get the plain text without line breaks, which is not acceptable. Additionally, embedded images show up in the retrieved text as base64-encoded strings (that is how they are embedded in the XML).

Can anyone give me guidance on how to extract the text from the XML element, either as proper RTF that can be inserted into an RTF file or as plain text that keeps the correct line breaks and doesn't contain the embedded images?

zfr · Accepted Answer · 2013-11-20T01:24:35

The XML you are dealing with is DXL (Domino XML). A more elegant method would be to convert it to HTML with an XSL transformation. The XSLT stylesheet this requires is supplied with the PD4ML tool. From HTML, the document can then be converted to PDF, RTF, or an image with PD4ML (or probably to another format using another tool).
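If plain text with line breaks is all that is needed, a direct tree walk may be enough without any XSLT. The following is only a minimal sketch in Python (not the PD4ML approach above), and it rests on assumptions: that the export uses the usual DXL namespace `http://www.lotus.com/dxl`, that each `par` element is one paragraph, that `break` marks a line break inside a paragraph, and that images sit inside `picture` elements. The names `richtext_to_plain`, `SKIP`, and `SAMPLE` are hypothetical, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the export uses the standard DXL namespace.
DXL_NS = "{http://www.lotus.com/dxl}"

# Assumption: these subtrees carry no readable text
# (pictures hold base64 image data, pardef holds paragraph styling).
SKIP = {"picture", "pardef"}


def _local(tag):
    """Strip the '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def _collect(elem, out):
    """Append the readable text under elem to the list out."""
    name = _local(elem.tag)
    if name in SKIP:
        return
    if name == "break":
        out.append("\n")
        return
    if elem.text:
        out.append(elem.text)
    for child in elem:
        _collect(child, out)
        # Text following a child belongs to the parent, even if the
        # child itself was skipped.
        if child.tail:
            out.append(child.tail)


def richtext_to_plain(richtext):
    """Return one line per <par>, with <break/> as an extra newline."""
    paragraphs = []
    for par in richtext.iter(DXL_NS + "par"):
        parts = []
        _collect(par, parts)
        paragraphs.append("".join(parts).strip())
    return "\n".join(paragraphs)


# Made-up sample, shaped like a DXL rich text item.
SAMPLE = """<item name='Body' xmlns='http://www.lotus.com/dxl'><richtext>
<pardef id='1'/>
<par def='1'><run><font style='bold'/>Hello</run> world<break/>second line</par>
<par def='1'>See image: <picture width='1px' height='1px'><gif>R0lGODlhAQABAAAAACw=</gif></picture>done</par>
</richtext></item>"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(richtext_to_plain(root.find(DXL_NS + "richtext")))
```

The same traversal should carry over to LINQ to XML by walking the `par` descendants of the rich text element and handling `break` and `picture` nodes the same way. Other non-text elements in a real export (attachments, for example) may need to be added to the skip set.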
