10
votes

I'd like to take some RTF input and clean it to remove all RTF formatting except \ul \b \i to paste it into Word with minor format information.

The command used to paste into Word will be something like: oWord.ActiveDocument.ActiveWindow.Selection.PasteAndFormat(0) (with some RTF text already in the Clipboard)

{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Courier New;}}
{\colortbl ;\red255\green255\blue140;}
\viewkind4\uc1\pard\highlight1\lang3084\f0\fs18 The company is a global leader in responsible tourism and was \ul the first major hotel chain in North America\ulnone  to embrace environmental stewardship within its daily operations\highlight0\par

Do you have any idea on how I can clean up the RTF safely with some regular expressions or something? I am using VB.NET to do the processing but any .NET language sample will do.


4 Answers

6
votes

I would use a hidden RichTextBox: set the Rtf member, then retrieve the Text member to sanitize the RTF in a well-supported way. Then I would manually inject the desired formatting afterwards.

5
votes

I'd do something like the following:

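' Requires: Imports System.Text.RegularExpressions and System.Windows.Forms
Dim rtf As String = someRTFtext

' Turn the control words to keep into plain-text markers. The lookahead stops
' \b from matching \blue and \i from matching \insrsid; the lookbehind skips an
' escaped "\\b". The "off" forms (\ulnone, \b0, \i0) get their own markers.
rtf = Regex.Replace(rtf, "(?<!\\)\\(ulnone|ul|b0|b|i0|i)(?![a-z0-9]) ?", "[[$1]]")

' Let a RichTextBox parse the RTF and discard every other control word.
Dim unformattedText As String
Using RTFConvert As New RichTextBox()
    RTFConvert.Rtf = rtf
    unformattedText = RTFConvert.Text
End Using

' Escape RTF special characters, restore the markers as control words,
' and wrap the result in a minimal RTF document.
unformattedText = unformattedText.Replace("\", "\\").Replace("{", "\{").Replace("}", "\}")
unformattedText = Regex.Replace(unformattedText, "\[\[(ulnone|ul|b0|b|i0|i)\]\]", "\$1 ")
unformattedText = unformattedText.Replace(vbLf, "\par ")
Dim cleanRtf As String = "{\rtf1\ansi " & unformattedText & "}"

' Put the result on the clipboard as RTF so Word keeps the formatting.
Clipboard.SetText(cleanRtf, TextDataFormat.Rtf)

oWord.ActiveDocument.ActiveWindow.Selection.PasteAndFormat(0)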
2
votes

You can strip out the tags with regular expressions. Just make sure that your expressions will not filter tags that were actually text. If the text had "\b" in the body of text, it would appear as "\\b" in the RTF stream. In other words, you would match on "\b" but not "\\b".
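Here is a minimal sketch of that idea, assuming the header has already been removed and you only have the body text. The names ControlWordFilter, StripControlWords and the Keep list are made up for illustration. The trick is to let the regex consume an escaped \\, \{ or \} as one token, so the character after it can never be mistaken for a control word. It does not handle \'hh hex escapes or control symbols like \~, and it drops \par as well, so add "par" to the list if you need paragraph breaks.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ControlWordFilter
    ' Control words to keep (illustrative list); every other one is dropped.
    Private ReadOnly Keep As String() = {"ul", "ulnone", "b", "i"}

    ' Alternative 1 matches an escaped \, { or } so it is consumed as literal text.
    ' Alternative 2 matches a control word, its optional numeric parameter,
    ' and the single space that may terminate it.
    Private ReadOnly Token As New Regex("\\[\\{}]|\\(?<word>[a-z]+)(?<num>-?\d+)? ?")

    Function StripControlWords(rtfBody As String) As String
        Return Token.Replace(rtfBody,
            Function(m As Match) As String
                ' No "word" group means we matched an escaped literal: keep it.
                If Not m.Groups("word").Success Then Return m.Value
                ' Keep whitelisted control words (with parameter, so \b0 survives).
                If Array.IndexOf(Keep, m.Groups("word").Value) >= 0 Then Return m.Value
                Return String.Empty
            End Function)
    End Function

    Sub Main()
        ' Prints: see \\b and \b bold\b0 text
        Console.WriteLine(StripControlWords("\highlight1\f0 see \\b and \b bold\b0 text\par"))
    End Sub
End Module
```

Because the escaped pair is matched first, the "b" after "\\" stays plain text while the real \b is kept as formatting.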

You could probably take a shortcut and filter out the header RTF tags. Look for the first occurrence of "\viewkind4" in the input. Then read ahead to the first space character. You would remove all of the characters from the start of the text up to and including that space character. That would strip out the RTF header information (fonts, colors, etc.).
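The shortcut could look roughly like this. It is only a sketch: StripHeader is a made-up name, and it assumes the input actually contains "\viewkind4" (as in the sample from the question); if it does not, the input is returned unchanged. The document's closing brace is still left at the end, and the body still contains control words such as \highlight0 that you would need to filter separately.

```vbnet
Imports System

Module HeaderStripper
    ' Removes everything up to and including the first space after "\viewkind4".
    Function StripHeader(rtf As String) As String
        Dim marker As Integer = rtf.IndexOf("\viewkind4", StringComparison.Ordinal)
        If marker < 0 Then Return rtf ' marker not found: leave input alone

        Dim firstSpace As Integer = rtf.IndexOf(" "c, marker)
        If firstSpace < 0 Then Return rtf ' no body text after the header

        Return rtf.Substring(firstSpace + 1)
    End Function

    Sub Main()
        Dim sample As String =
            "{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Courier New;}}" & vbLf &
            "\viewkind4\uc1\pard\f0\fs18 The company was \ul first\ulnone .\par}"
        ' Prints: The company was \ul first\ulnone .\par}
        Console.WriteLine(StripHeader(sample))
    End Sub
End Module
```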

1
votes

Regex it. It won't parse absolutely everything correctly (tables, for example), but it does the job in most cases.

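// Needs System.Text.RegularExpressions. Removes innermost {\...} groups, leftover braces, and control words.
string unformatted = Regex.Replace(rtfString, @"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?", "");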

Magic =)