
I am using MS Word to edit text that I convert to structured HTML using VBA.

The text is written out using Document.SaveAs2 with Encoding:=msoEncodingUTF8.

Today I found that the Trademark Symbol [Edit: inserted using the Insert Symbol capability; Insert Tab, Symbols group, Symbol button] was appearing in the text files as "(tm)".

Having discovered that Encoding:=65001 should also produce UTF-8, I tried it. In one case it seemed to work, but the result was not reproducible.

I also learned that, because Word is older than Unicode, it might use a private code page for certain characters. So I also entered the Unicode code point directly, followed by Alt+X; the TM symbol appeared correctly in the document but still failed to be written to the text file.

Whilst I have been able to work around the problem by replacing TM with the HTML "& trade ;" (extra spaces to prevent it getting rendered as the symbol!), I am concerned about the potential for other encoding failures.
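A minimal sketch of that workaround as a macro, assuming the symbol in the document really is the Unicode trademark sign (U+2122). The Sub name is hypothetical, and the replacement modifies the document itself, so it is probably best run on a copy or immediately before export:

```vba
' Hypothetical helper: replace every trademark sign (U+2122) in the
' document body with the HTML entity before the text is exported.
Sub ReplaceTrademarkWithEntity()
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = ChrW(&H2122)            ' the trademark sign
        .Replacement.Text = "&trade;"   ' HTML entity for the same symbol
        .MatchWildcards = False
        .Forward = True
        .Wrap = wdFindStop
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

This only covers the one symbol; other characters that fail to export would need their own replacements.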

Can anyone shed any light on the cause(s) of this issue or offer an effective resolution/mitigation?

System config: Word 2010; Windows 7 64 bit.

2
@Deduplicator Explanation of the edit would be helpful. And for whoever it may have been, a down-vote without comment is at best discourteous; it is also not at all helpful. Please explain yourself or undo the vote. - Julian Moore
When the summary is that obvious, I normally don't write it, but ok, here it is: "Corrected tags, removed fluff". - Deduplicator
Regarding discourtesy on the part of the downvoter: if he voted according to the post's value (as he sees it), that's commendable. And if he didn't see enough use in trying to help you improve your post, that's acceptable though regrettable; no discourtesy is involved. Just because he helps everyone by evaluating posts does not mean he is in any way obligated to help even more, or to help you personally (go to Meta Stack Overflow if you want to debate that, but please research the topic first). (As an aside, accusing him of discourtesy is very discourteous.) - Deduplicator
I'm not sure if this is the problem, especially since you don't explain how special symbols are inserted in the text, but... Many of the symbols Word inserts are not Unicode, but "normal" character codes formatted with a different font, such as WingDings. You need to make sure the symbols giving you problems are truly Unicode characters for that symbol. - Cindy Meister
@CindyMeister Thanks for considering the question. I believe my comment about double-checking by using the Word Unicode insertion facility addresses your point. I will also edit the question to clarify the insertion method used. - Julian Moore

2 Answers

1
votes

I recorded a macro to save some Chinese text that is clearly unsupported in the default code page on my system, which was Windows-1252. I saved in .txt format and Word asked for the encoding, for which I selected UTF-8. Here is the result:

ActiveDocument.SaveAs2 FileName:="The.txt", FileFormat:=wdFormatText, _
    LockComments:=False, Password:="", AddToRecentFiles:=True, WritePassword _
    :="", ReadOnlyRecommended:=False, EmbedTrueTypeFonts:=False, _
    SaveNativePictureFormat:=False, SaveFormsData:=False, SaveAsAOCELetter:= _
    False, Encoding:=65001, InsertLineBreaks:=False, AllowSubstitutions:= _
    False, LineEnding:=wdCRLF, CompatibilityMode:=0

It did save the file correctly in UTF-8. I edited the macro down to the following minimal code and it still worked.

ActiveDocument.SaveAs2 FileName:="test.txt", FileFormat:=wdFormatText, _
    AllowSubstitutions:=False, Encoding:=65001
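
To confirm that a given character actually reached the file, one option is to read the file back as UTF-8 and search for it. The following is a sketch only: the function name is hypothetical, it relies on the ADODB.Stream COM object being available on the machine, and it assumes the text file is not locked by Word at the time it is read.

```vba
' Hypothetical helper: returns True if the UTF-8 file at filePath
' contains the string needle.
Function Utf8FileContains(ByVal filePath As String, ByVal needle As String) As Boolean
    Dim stream As Object
    Dim content As String

    Set stream = CreateObject("ADODB.Stream")
    stream.Type = 2              ' 2 = adTypeText
    stream.Charset = "utf-8"     ' decode the file as UTF-8
    stream.Open
    stream.LoadFromFile filePath
    content = stream.ReadText    ' no argument: read the whole stream
    stream.Close

    ' Binary comparison, so the match is exact rather than case-insensitive
    Utf8FileContains = (InStr(1, content, needle, vbBinaryCompare) > 0)
End Function
```

For example, `Debug.Print Utf8FileContains("C:\temp\test.txt", ChrW(&H2122))` (the path is a placeholder) would report whether the trademark sign survived the save.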
0
votes

I wanted to save as filtered HTML in UTF-8, so I tried:

doc.SaveAs2 FileName:="file1.htm", FileFormat:=wdFormatFilteredHTML, AllowSubstitutions:=False, Encoding:=msoEncodingUTF8

I found that although the file did save in UTF-8 with FileFormat:=wdFormatText, doing the same with wdFormatFilteredHTML did not. What did work was:

doc.WebOptions.Encoding = msoEncodingUTF8
doc.SaveAs2 FileName:="file1.htm", FileFormat:=wdFormatFilteredHTML, AllowSubstitutions:=False