Configuring Tesseract OCR to read words of same font size

Question

I am using Tesseract 3.05.01 for Windows to extract text from an image containing a few lines. The lines are surrounded by a rounded rectangle. [Image attached for reference].

Tesseract detects the rounded rectangle as "C" at the beginning and ">" at the end of the line.

This is what Tesseract returns:

The Richter scale is used for measuring the
magnitude of which natural phenomenon?

C Earthquake >
C Hurricane >
C Tsunami

I tried adding ">" to the blacklist, but the blacklisted symbol just gets replaced by a similar-looking one. If there were an option to extract only characters of similar size, I think it would skip the shapes.

Is there a way to detect only lines of similar font size/height? If not, please suggest another method to overcome this problem.

man zet · Accepted Answer · 2018-07-23T13:22:58

You could use a whitelist instead of a blacklist, listing every character you want recognized. In tesseract.js, for example, this is:

tessedit_char_whitelist: "abcdefghijklmnop ...."
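
Since the question uses the Windows build of Tesseract rather than tesseract.js, here is a rough Python sketch of passing the same variable on the command line through the `-c configvar=value` option. The helper names `build_whitelist` and `tesseract_args`, the file name `question.png`, and the extra punctuation characters are assumptions for illustration only. The script only builds the argument list and does not run Tesseract.

```python
import string


def build_whitelist(extra="?.,"):
    """Return the characters to allow: ASCII letters, digits, and some punctuation.

    The punctuation in `extra` is an assumption; adjust it to the text you expect.
    """
    return string.ascii_letters + string.digits + extra


def tesseract_args(image_path, whitelist):
    """Build an argument list for the tesseract CLI (nothing is executed here)."""
    return [
        "tesseract",
        image_path,  # hypothetical input image
        "stdout",    # write the recognized text to standard output
        "-c",
        "tessedit_char_whitelist=" + whitelist,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before running it,
    # for example with subprocess.run(args).
    print(tesseract_args("question.png", build_whitelist()))
```

Keep in mind that a whitelist made of letters still permits "C", so it would mainly help with the trailing ">" and may not remove the leading "C".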
