I'm trying to extract the text from a large number of old Word files and put the text in a database. I'm doing this by parsing the text into sections, creating a text file for each section, and then doing a bulk insert.
Whoever wrote these documents originally used some strange characters. In Word the look similar to the normal extended ASCII characters. But when I start looking at the decimal or hex codes, they are not any known character.
On the left is the character originally in the document and on the right is the appropriate ASCII character entered from the keyboard.
When I copy and paste them into vim, it looks like this:
Using vim to look at the decimal and hex codes, the file looks like this:
Original True ASCII
Dec Hex Dec Hex
61617 f0b1 177 00b1
61666 f0e2 174 00ae
8220 201c 34 22
8221 201d 34 22
I'm using some of the lines from code found here:
NextChar = ActiveDocument.Characters(idx)
Dim nBytes As Long
Dim abBuffer() As Byte
nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(NextChar), -1, vbNull, 0&, 0&, 0&)
ReDim abBuffer(4)
nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(NextChar), -1, ByVal VarPtr(abBuffer(0)), nBytes - 1, 0&, 0&)
Using this method, VBA returns the following decimal codes (some are multiple bytes, expressed using commas):
Original True ASCII
Dec Dec
40 194, 177
40 194, 174
226, 128, 156 34
226, 128, 157 34
I've also tried Asc
and AscW
. These work fine on the quote marks somehow, returning only the final byte. But since ActiveDocument.Characters
returns a parenthesis for the other two, it just processes it like a parenthesis.
I have several questions regarding these various outputs as well as how to properly handle these characters.
- Why does
ActiveDocument.Characters
return a parenthesis when reading the plus/minus and the registered trademark symbol? - Why is
194
placed in front when using those characters? - Most ASCII tables I've seen follow the encoding seen here. But those indicate that
177
and174
should be dots and double arrows, respectively. Which does not match up to Word or vim. But this table seems to agree with Word and vim. Are there multiple ASCII encodings? I thought it was a standard. - What is the proper way to read in these special, multiple byte characters so that I can identify them and replace them with their ASCII equivalents?
Edit:
Just learned about using AltX in Word to change each character into it's unicode number. This works fine on the original quotation marks but it does nothing when I try it on the original plus/minus and trademark symbol. Not sure where these characters come from.
Edit 2:
I tried saving into a text file. The plus/minus and trademark symbol will not properly convert in Western European (Windows) encoding. UTF-8 is better but also has problems. Unicode will convert everything but it converts the problem characters to whatever the preceding character is.
Edit 3:
Edit 4:
I used the Open XML Productivity Tool and looked at the XML directly and possible code to create these weird symbols. This is what I've found:
<w:r w:rsidRPr="00EE7521">
<w:rPr>
<w:sz w:val="16" />
</w:rPr>
<w:sym w:font="Symbol" w:char="F0B1" />
And:
RunProperties runProperties1 = new RunProperties();
FontSize fontSize2 = new FontSize(){ Val = "16" };
runProperties1.Append(fontSize2);
SymbolChar symbolChar1 = new SymbolChar(){ Font = "Symbol", Char = "F0B1" };
Is there a way to detect and properly decode SymbolChar
typed characters in VBA? Or any other language at this point.
ActiveDocument.Characters
is seeing the character as unknown, but instead of throwing an error, it incorrectly returns0x40
. – embedded.kyle