I am stuck a bit in decoding. I got a base64-encoded .rtf file.

A little part of this looks like this: Bek\u252\''fcld\u337\''3f

Which represents: Beküldő

But my output data after decoding is: Bekuld?

If I manually replace the characters it works.

Result := StringReplace(Result, 'U337\''3F', '''F5', [rfReplaceAll, rfIgnoreCase]);

Does anyone know a general solution for this? Some conversion or something?

Hi - welcome to Delphi on Stack Overflow. Please be specific with your question - the fragment you have shown is not the Base64-encoded version of what you say it represents. If you are not clear about the problem, it's difficult to know how best to help you. If your input is properly Base64 encoded, then you should be able to produce the binary equivalent of the input. If you want to treat it as a string, you need to be sure your decoding allows for that. UTF8 and UTF16LE are not the same binary inputs, for example. - Rob Lambden
RTF is not base64 encoded. The best way to handle this is to use an actual RTF parser, a (headless) RichEdit control, etc. to decode the RTF for you, and then you can extract the desired text from the output. - Remy Lebeau
@RemyLebeau He didn't say rtf was base-64 encoded. He wrote he had "got a base-64 encoded .rtf file". - Arnaud Bouchez
@RobLambden IIRC, RTF tries to be 7-bit ASCII only and escapes most non-ASCII characters, e.g. \'e9 instead of é. In practice there may be 8-bit WinAnsi content, depending on the RTF writer, but never UTF-16. RTF is a very complex beast. It defines \ansi, i.e. WinAnsi / code page 1252, by default. - Arnaud Bouchez

1 Answer

For instance, \u242 means Unicode character #242.

So you could search for \u in the RTF content (ignoring any \\ escaped sequence), then retrieve the following number, and use it as a character.

But RTF is a very complex beast.

Check what the RTF 1.5 specification says about encoding:

\uN This keyword represents a single Unicode character which has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number. This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered.
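As a rough illustration of the rule quoted above, here is a minimal sketch of a \uN decoder. The names DecodeRtfUnicodeEscapes and ReadInt are made up for this example, and it assumes a Unicode-enabled Delphi (2009 or later). It simplifies a lot: it only handles \uN, \ucN and \\, it counts a \'hh escape as one fallback character, and it leaves every other control word (including plain \'hh escapes and group scoping of \ucN) untouched. Negative values are mapped back by adding 65536, since the spec writes code points above 32767 as negative numbers.

```delphi
program RtfUnicodeSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

// Reads an optional '-' and decimal digits starting at S[P]; moves P past them.
function ReadInt(const S: UnicodeString; var P: Integer; out Value: Integer): Boolean;
var
  Negative: Boolean;
  Start: Integer;
begin
  Value := 0;
  Negative := (P <= Length(S)) and (S[P] = '-');
  if Negative then
    Inc(P);
  Start := P;
  while (P <= Length(S)) and (S[P] >= '0') and (S[P] <= '9') do
  begin
    Value := Value * 10 + (Ord(S[P]) - Ord('0'));
    Inc(P);
  end;
  Result := P > Start;
  if Negative then
    Value := -Value;
end;

// Hypothetical helper: replaces each \uN with the character it names and
// skips the ANSI fallback that follows it. Other RTF markup is copied as-is.
function DecodeRtfUnicodeEscapes(const S: UnicodeString): UnicodeString;
var
  I, J, K, N, SkipCount: Integer;
  Handled: Boolean;
begin
  Result := '';
  SkipCount := 1; // RTF default, equivalent to \uc1
  I := 1;
  while I <= Length(S) do
  begin
    Handled := False;
    if (S[I] = '\') and (I < Length(S)) then
    begin
      if S[I + 1] = '\' then
      begin
        // Escaped backslash: copy both so that "\\u252" is not misread as \u252
        Result := Result + '\\';
        Inc(I, 2);
        Handled := True;
      end
      else if S[I + 1] = 'u' then
      begin
        J := I + 2;
        if (J <= Length(S)) and (S[J] = 'c') then
        begin
          // \ucN: remember how many fallback characters follow each \uN
          Inc(J);
          if ReadInt(S, J, N) then
          begin
            SkipCount := N;
            Result := Result + Copy(S, I, J - I); // keep the control word itself
            I := J;
            Handled := True;
          end;
        end
        else if ReadInt(S, J, N) then
        begin
          if N < 0 then
            Inc(N, 65536); // values above 32767 are stored as negative numbers
          Result := Result + WideChar(Word(N));
          // One space directly after a control word is a delimiter, not text
          if (J <= Length(S)) and (S[J] = ' ') then
            Inc(J);
          // Skip the fallback; a \'hh escape counts as a single character
          for K := 1 to SkipCount do
            if (J < Length(S)) and (S[J] = '\') and (S[J + 1] = '''') then
              Inc(J, 4)
            else
              Inc(J);
          I := J;
          Handled := True;
        end;
      end;
    end;
    if not Handled then
    begin
      Result := Result + S[I];
      Inc(I);
    end;
  end;
end;

var
  Expected: UnicodeString;
begin
  // "Beküldő", built from code points so the source file stays pure ASCII
  Expected := UnicodeString('Bek') + WideChar(252) + 'ld' + WideChar(337);
  if DecodeRtfUnicodeEscapes('Bek\u252\''fcld\u337\''3f') = Expected then
    Writeln('decoded as expected')
  else
    Writeln('unexpected result');
end.
```

The doubled apostrophes in the test literal are just Delphi string escaping; the RTF itself contains \'fc and \'3f. Note that the \'3f fallback is a literal "?", which is why a reader that ignores \u337 ends up showing "Bekuld?"-style output.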

Perhaps the easiest approach under Windows/VCL is to use a hidden RichEdit control for the decoding.
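A hidden RichEdit could look roughly like the sketch below. RtfToPlainText is a made-up name, and the code assumes a recent Delphi with the VCL (namespaced units, TBytesStream). It is not tested against the asker's file; treat it as a starting point rather than a finished routine. The control is parented to the message-only window so it never becomes visible.

```delphi
uses
  Winapi.Windows, System.SysUtils, System.Classes, Vcl.ComCtrls;

// Hypothetical helper: lets a never-shown TRichEdit parse the RTF bytes
// and returns the text it ends up holding.
function RtfToPlainText(const RtfBytes: TBytes): string;
var
  Edit: TRichEdit;
  Stream: TBytesStream;
begin
  // HWND_MESSAGE parent: the control gets a window handle but is never displayed
  Edit := TRichEdit.CreateParented(HWND_MESSAGE);
  try
    Stream := TBytesStream.Create(RtfBytes);
    try
      Edit.PlainText := False;            // interpret the stream as RTF
      Edit.Lines.LoadFromStream(Stream);
    finally
      Stream.Free;
    end;
    Result := Edit.Text;                  // text content without RTF markup
  finally
    Edit.Free;
  end;
end;
```

Feeding the control the raw bytes from the base64 decoding step (for example from TNetEncoding.Base64.DecodeStringToBytes in System.NetEncoding, if your Delphi version has it) avoids an intermediate string conversion that could mangle the data before the parser sees it.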