1
votes

I've a word document containing different fonts and different languages. One instance would be a text in english and a corresponding translation in ancient greek. For the ancient greek part a TrueType Font was used (https://fonts2u.com/greek-regular.font). Now this approach is highly unsuitable for sharing those files and I'd like to convert the ancient greek part into corresponding unicode characters.

I tried the python package python-docx to import the file. Although successfull at importing and viewing the file content, I couldn't find a way to select only the ancient greek characters and convert them to their corresponding unicode characters.

I was thinking about using the TrueType Font character map and find and replace those characters with the corresponding unicode characters. However viewing the contents I was unable to select only ancient greek characters.

Q: Is there a way using VBA, python or exporting the files in different encodings to "translate" the ancient greek characters to their corresponding unicode characters?

2

2 Answers

1
votes

wow, sounds awkward, compounded brokenness!

given that the font is using its own non-standard definition of character encoding, you might be easier off using an XML parser to work with the file directly. I'd do this mostly because that's the way that occurs to me to be easiest to select the relevant mis-encoded parts of the text.

something like:

  1. open the file (ElementTree in Python). note that a DOCX file is really a ZIP file containing a file called word/document.xml and other associated images/misc as appropriate
  2. use an xpath selector to get all instances of text that use the greek font
  3. use your remapping code to move from the broken greek encoding to use real Unicode characters
  4. save the file

you'd want to switch to a font that uses greek characters in their standard unicode code points, you could either do this in the raw XML, or maybe just reopen the file in Word and set a font everywhere

1
votes

using the python-docx package I'm searching and selecting the characters based on their font name

import docx
doc = docx.Document('greek_text.docx')
doc.paragraphs[3].runs[10].font.name

returns 'Greek' for example

for run in doc.paragraphs[3].runs:
    if run.font.name == "Greek":
        for char in run.text:
            print (char +" "+ str(hex(ord(char))))

g 0x67

u 0x75

n 0x6e

ยป 0xbb

returns the unicode character and corresponding hex value. That leaves the mapping of these values to the correct unicode values for their greek characters.