7
votes

I'm using Tesseract to OCR text from a screen-scraper application. The only font used is Segoe UI 8, rendered with CLEARTYPE QUALITY (see image below). At the moment Tesseract does a poor job: it confuses Z with 2, 0 with o, and so on.

I've tried scaling up the text image, with no improvement. Looking at eng.traineddata, I can see that Tesseract has not been trained on Segoe UI 8 CLEARTYPE QUALITY.

Question: How can I train Tesseract on a new font and specify that only that font should be used?

[image: sample of the Segoe UI 8 text]

1
Did you get a solution to this? – Pranav
@Pranav, no, but I have started a bounty. Did you find a solution? Please share if you did :-) – Vingtoft
@Vingtoft Yes. I trained the OCR engine on the different fonts, and it works perfectly for me. I converted the images into box files and then used the box files to train. – Pranav
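
For anyone unfamiliar with the box files mentioned above: a Tesseract box file is plain text with one line per glyph, in the form `symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image. Below is a minimal sketch of writing such lines from glyph rectangles given in the usual top-left screen coordinates. The function names and the `(symbol, x, y, width, height)` tuple layout are assumptions for illustration, not part of Tesseract and not necessarily how Pranav did it.

```python
def to_box_line(symbol, x, y, width, height, image_height, page=0):
    """Format one box-file line for a glyph.

    (x, y) is the glyph's top-left corner in screen coordinates
    (origin at top-left). Box files use a bottom-left origin, so the
    vertical coordinates are flipped using the image height.
    """
    left = x
    right = x + width
    top = image_height - y
    bottom = image_height - (y + height)
    return f"{symbol} {left} {bottom} {right} {top} {page}"


def make_box_file(glyphs, image_height):
    """Build box-file text from (symbol, x, y, width, height) tuples (hypothetical layout)."""
    lines = [to_box_line(s, x, y, w, h, image_height) for s, x, y, w, h in glyphs]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two glyphs on a 40-pixel-high training image.
    print(make_box_file([("Z", 10, 5, 8, 12), ("2", 20, 5, 8, 12)], image_height=40), end="")
```

The resulting text is saved next to the training image with a `.box` extension; the remaining training steps depend on the Tesseract version in use.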

1 Answer

2
votes

Please provide an example of your effort. My goal is to help you reach your goal, not to do the work for you.

This is quite a common problem, and lots of people have solved it, some more efficiently than others. You can use the tools they have created.

An example

There are several others; some of them handle only typefaces and are optimized for that, which might be more useful for you. For example:

There are other examples, but most of them use ImageMagick and other tools to improve the quality of the input data so that the OCR tool can do its best. Personally, I wrote efficient C# GDI transformations to manipulate the input data before running Tesseract on it.
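
To give a rough idea of what such preprocessing can look like, here is a minimal sketch in plain Python. It is not the C# GDI code mentioned above, and not what ImageMagick does internally. It assumes the image is already loaded as rows of `(r, g, b)` tuples, and the function names, scale factor, and threshold are all illustrative choices. Converting to grayscale removes the colored fringes that ClearType adds, and the threshold produces clean black-on-white glyphs.

```python
def to_gray(pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row]
            for row in pixels]


def upscale(gray, factor):
    """Nearest-neighbour upscale: repeat every pixel `factor` times in both directions."""
    out = []
    for row in gray:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def binarize(gray, threshold=128):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [[0 if value < threshold else 255 for value in row] for row in gray]


def preprocess(pixels, factor=3, threshold=128):
    """Grayscale, enlarge, then threshold an RGB pixel grid before OCR."""
    return binarize(upscale(to_gray(pixels), factor), threshold)


if __name__ == "__main__":
    # One white pixel, one black pixel, and one reddish ClearType-style fringe pixel.
    sample = [[(255, 255, 255), (0, 0, 0), (200, 80, 40)]]
    for row in preprocess(sample, factor=2):
        print(row)
```

In practice an imaging library would do the loading, and a smoother resampling filter is usually preferable to nearest-neighbour; the right threshold depends on the actual screenshots.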