3
votes

I'm trying to extract the text from a large number of old Word files and put the text in a database. I'm doing this by parsing the text into sections, creating a text file for each section, and then doing a bulk insert.

Whoever wrote these documents originally used some strange characters. In Word the look similar to the normal extended ASCII characters. But when I start looking at the decimal or hex codes, they are not any known character.

strange characters and their ASCII equivalents

On the left is the character originally in the document and on the right is the appropriate ASCII character entered from the keyboard.

When I copy and paste them into vim, it looks like this:

vim's interpretation of the characters

Using vim to look at the decimal and hex codes, the file looks like this:

Original      True ASCII
Dec    Hex    Dec    Hex
61617  f0b1   177    00b1
61666  f0e2   174    00ae
 8220  201c    34    22
 8221  201d    34    22

I'm using some of the lines from code found here:

NextChar = ActiveDocument.Characters(idx)
Dim nBytes As Long
Dim abBuffer() As Byte

nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(NextChar), -1, vbNull, 0&, 0&, 0&)
ReDim abBuffer(4)
nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(NextChar), -1, ByVal VarPtr(abBuffer(0)), nBytes - 1, 0&, 0&)

Using this method, VBA returns the following decimal codes (some are multiple bytes, expressed using commas):

Original        True ASCII
Dec             Dec
40              194, 177
40              194, 174
226, 128, 156   34
226, 128, 157   34

I've also tried Asc and AscW. These work fine on the quote marks somehow, returning only the final byte. But since ActiveDocument.Characters returns a parenthesis for the other two, it just processes it like a parenthesis.

I have several questions regarding these various outputs as well as how to properly handle these characters.

  • Why does ActiveDocument.Characters return a parenthesis when reading the plus/minus and the registered trademark symbol?
  • Why is 194 placed in front when using those characters?
  • Most ASCII tables I've seen follow the encoding seen here. But those indicate that 177 and 174 should be dots and double arrows, respectively. Which does not match up to Word or vim. But this table seems to agree with Word and vim. Are there multiple ASCII encodings? I thought it was a standard.
  • What is the proper way to read in these special, multiple byte characters so that I can identify them and replace them with their ASCII equivalents?

Edit:

Just learned about using AltX in Word to change each character into it's unicode number. This works fine on the original quotation marks but it does nothing when I try it on the original plus/minus and trademark symbol. Not sure where these characters come from.


Edit 2:

I tried saving into a text file. The plus/minus and trademark symbol will not properly convert in Western European (Windows) encoding. UTF-8 is better but also has problems. Unicode will convert everything but it converts the problem characters to whatever the preceding character is.

Western European (Windows)

Unicode


Edit 3:

Link to test file


Edit 4:

I used the Open XML Productivity Tool and looked at the XML directly and possible code to create these weird symbols. This is what I've found:

<w:r w:rsidRPr="00EE7521">
 <w:rPr>
   <w:sz w:val="16" />
 </w:rPr>
 <w:sym w:font="Symbol" w:char="F0B1" />

And:

RunProperties runProperties1 = new RunProperties();
FontSize fontSize2 = new FontSize(){ Val = "16" };

runProperties1.Append(fontSize2);
SymbolChar symbolChar1 = new SymbolChar(){ Font = "Symbol", Char = "F0B1" };

Is there a way to detect and properly decode SymbolChar typed characters in VBA? Or any other language at this point.

1
The first two characters have hex codes corresponding to Unicode "Private Use Areas," which are undefined encodings left intentionally open so that applications to define them for their own purposes. E.G. unicode.org/cldr/utility/character.jsp?a=F0B1. Meanwhile, the quotes correspond to real Unicode characters. E.G. unicode.org/cldr/utility/character.jsp?a=201CMikegrann
As for identifying what these characters are defined to be, and converting them into Unicode equivalents... I don't know. But at least this explains why two of them work, and two of them don't.Mikegrann
@Mikegrann Okay so do you know of a function that can tell me if a character read is in the Private Use Area? Or a function that can read all of the bytes up to and including the Private Use Area? Because right now it seems as if ActiveDocument.Characters is seeing the character as unknown, but instead of throwing an error, it incorrectly returns 0x40.embedded.kyle
@Mikegrann No problem. Thanks for pointing me in the right direction.embedded.kyle
Looking into it myself, I put those characters into a test sheet and tried reading them in as byte arrays, which gave me the correct bytes for those characters (e.g. f0b1 becomes bytes 240, 177). Are you sure your VBA is correctly fetching the byte values of those characters? Because if you can get the correct byte values, you can check against the Private Use Areas manually (en.wikipedia.org/wiki/Private_Use_Areas)Mikegrann

1 Answers

3
votes

I just tested this simple macro and it successfully replaced instances of your first two symbols (f0b1, f0e2) with their ASCII equivalents on my test document. It just clears the first byte when detecting that the character is from the PUA (Private Use Area).

Private Sub Strip_PUA()
    For idx = 1 To ActiveDocument.Characters.Count
        Dim bArr() As Byte
        bArr = ActiveDocument.Characters(idx)

        If bArr(1) >= &HE0 And bArr(1) <= &HF8 Then
            bArr(1) = 0
            ActiveDocument.Characters(idx) = bArr
        End If
    Next
End Sub

You might have to tweak the bArr to something other than 1 based on endianness of the system and bytecount of the characters. This also relies on the fact that the characters you've encountered happen to have the correct ASCII byte already. That might not always be the case - you'll have to investigate to be sure.


EDIT: Reproduced from this Google Groups discussion.

Sub SymbolsUnprotect()
  Dim SelFont, SelCharNum

  Selection.Collapse (wdCollapseStart)
  Selection.Find.ClearFormatting
  With Selection.Find
    .Text = "[" & ChrW(61472) & "-" & ChrW(61695) & "]"
    .Replacement.Text = ""
    .Forward = True
    .Wrap = wdFindContinue
    .Format = False
    .MatchCase = False
    .MatchWholeWord = False
    .MatchAllWordForms = False
    .MatchSoundsLike = False
    .MatchWildcards = True
  End With
  While Selection.Find.Execute
    With Dialogs(wdDialogInsertSymbol)
      SelFont = .Font
      SelCharNum = .CharNum
    End With

    Selection.Font.Name = SelFont
    Selection.TypeText Text:=ChrW(SelCharNum)

    ' replace the last 2 lines with the following to
    ' protect symbols from decorative fonts:
    ' Selection.InsertSymbol _
    '   Font:=SelFont, _
    '   CharacterNumber:=SelCharNum, _
    '   Unicode:=True

  Wend
End Sub

This will "unprotect" symbols, which will show up as a "(" (decimal 40) character when protected - which is the default for symbols inserted through Word's Insert > Symbol Dialog. It will allow you to properly read the bytes of these characters as f0**, but won't be able to tell you exactly what those bytes correspond to for fonts like Symbol, which define their own mappings in the PUA. Look up these mappings for conversion into Unicode (linked to the specific character block containing the mappings of ± (Symbol 177) to (Unicode 177) and of ® (Symbol 226) to (Unicode 174)).