tesseract fails at simple number detection

Question

I want to perform OCR on images like this one:

6x6 matrix with numerical values

It is a table of numerical data with commas as decimal separators. It is not noisy, the contrast is good, and the text is black on a white background. As an additional preprocessing step, to get around issues with the frame borders, I cut out every cell, binarize it, pad it with a white border (to prevent edge issues), and pass only that single-cell image to tesseract. I also looked at the individual cell images to make sure the cutting process works as expected and does not produce artifacts. These are two examples of the input images for tesseract:

Single cell from above table. Content: 1,7

Single cell from above table. Content: 57

Unfortunately, tesseract is unable to parse these consistently. I have found no configuration where all 36 values are recognized correctly.

There are a couple of similar questions here on SO, and the usual answer is a suggestion for a specific combination of the --oem and --psm parameters. So I wrote a Python script with pytesseract that loops over all combinations of --oem from 0 to 3 and --psm from 0 to 13, as well as lang=eng and lang=deu. I ignored the combinations that throw errors.
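A sweep like the one described could be sketched as follows. This is not the original script: the helper names `build_configs` and `sweep` are made up for illustration, and the OCR function is passed in as a parameter so the loop itself does not depend on a working tesseract install.

```python
from itertools import product


def build_configs(oems=range(4), psms=range(14), langs=("eng", "deu")):
    """Return every (lang, config-string) pair in the sweep."""
    return [
        (lang, f"--oem {oem} --psm {psm}")
        for oem, psm, lang in product(oems, psms, langs)
    ]


def sweep(image, ocr, expected=None):
    """Call ocr(image, lang=..., config=...) for every combination.

    Combinations that raise are skipped, as in the question.
    Returns a dict mapping (lang, config) to the stripped OCR text.
    If `expected` is given, only exact matches are kept.
    """
    results = {}
    for lang, config in build_configs():
        try:
            text = ocr(image, lang=lang, config=config).strip()
        except Exception:
            continue  # this combination is not usable on this install
        if expected is None or text == expected:
            results[(lang, config)] = text
    return results


# Hypothetical usage, assuming pytesseract and Pillow are installed
# and "cell.png" is one of the padded cell images:
#   from PIL import Image
#   import pytesseract
#   hits = sweep(Image.open("cell.png"), pytesseract.image_to_string, expected="1,7")
```

Running `sweep` once per cell with that cell's known content, and intersecting the resulting key sets, would show whether any single configuration reads every cell correctly.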

Example 1: With --psm 13 --oem 3 the above "1,7" image is misidentified as "4,7", but the "57" image is correctly recognized as "57".

Example 2: With --psm 6 --oem 3 the above "1,7" image is correctly recognized as "1,7", but the "57" image is misidentified as "o/".

Any suggestions on what else might help improve the output quality of tesseract here?

My tesseract version:

tesseract v4.0.0.20190314
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
 Found AVX2
 Found AVX
 Found SSE

Ahx Ahx · Accepted Answer · 2021-01-24T11:49:26

Solution

1. Divide the image into separate rows
2. Apply division-normalization to each row
3. Set psm to 6 (Assume a single uniform block of text.)
4. Read

As a first attempt, we split the image height into 5 equal parts (the table actually has 6 rows, which is why this first split needs adjusting later).

In each iteration, we take one row, apply normalization, and read it.

The row indexes need to be set carefully.

import cv2
from pytesseract import image_to_string

img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]

start_index = 0
end_index = int(h/5)

Question: Why do we declare start and end indexes?

We want to read a single row in each iteration. Let's look at an example:

The current image height and width are 645 and 1597 pixels.

Divide the images based on indexes:

start-index	end-index
0	129
129	258 (129 + 129)
258	387 (258 + 129)
387	516 (387 + 129)

Let's see whether the resulting crops are readable:

start-index	end-index	image
0	129
129	258
258	387
387	516

No, they are not suitable for reading. A small adjustment helps, such as subtracting 20 from each row height:

start-index	end-index	image
0	109 (129 - 20)
109	218
218	327
327	436
436	545
545	654

Now they are suitable for reading.
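The adjusted bounds in the table above can be reproduced with a few lines of arithmetic. The helper name `row_bounds` and its parameters are hypothetical; the values 5 and 20 mirror the `int(h/5) - 20` expression used in the answer's code.

```python
def row_bounds(height, rows=6, divisor=5, trim=20):
    """Return (start, end) pixel rows using a step of height // divisor - trim."""
    step = height // divisor - trim
    return [(i * step, (i + 1) * step) for i in range(rows)]


for start, end in row_bounds(645):
    print(start, end)
# 0 109
# 109 218
# 218 327
# 327 436
# 436 545
# 545 654
```

The last end index (654) is larger than the image height (645). NumPy slicing clips an out-of-range end index, so the final crop is simply a few pixels shorter than the others.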

When we apply the division-normalization to each row:

start-index	end-index	image
0	109
109	218
218	327
327	436
436	545
545	654
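To see why dividing by a heavily blurred copy flattens the background, it helps to work through the per-pixel arithmetic. `cv2.divide(src, blur, scale=192)` computes `src * scale / blur` for each pixel and saturates the result to the 8-bit range. The sketch below mimics that in plain Python. The helper name `divide_pixel` is hypothetical and the pixel values are made up for illustration, not measured from the image.

```python
def divide_pixel(src, blur, scale=192):
    """Approximate one uint8 pixel of cv2.divide(src, blur, scale=scale)."""
    if blur == 0:
        return 0  # OpenCV yields 0 when an integer divisor is 0
    return max(0, min(255, round(src * scale / blur)))


# Bright background: the pixel equals its local average
print(divide_pixel(230, 230))  # 192
# Darker, shaded background: still equal to its local average
print(divide_pixel(180, 180))  # 192
# Text pixel: much darker than its local average
print(divide_pixel(40, 200))   # 38
```

Both background pixels end up at the same gray level (the `scale` value) regardless of local brightness, while the text pixel stays dark. That is the sense in which the step normalizes each row before it is passed to tesseract.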

Now when we read:

1,7 | 57 | 71 | 59 | .70 | 65

| 57 | 1,5 | 71 | 59 | 70 | 65

| 71 | 59 | 1,3 | 57 | 70 | 60

| 71 | 59 | 56 | 1,3 | 70 | 60

| 72 | 66 | 71 | 59 | 1,2 | 56

| 72 | 66 | 71 | 59 | 56 | 4,3
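Some of the lines above contain stray characters (for example `.70`), so a quick format check on the output can be useful. The following is a minimal sketch, assuming every cell should be an integer with an optional comma-separated decimal part; the name `check_row` and the pattern are illustrative assumptions, not part of the answer's code.

```python
import re

# Integer, optionally followed by a comma and more digits (e.g. "57" or "1,7")
CELL = re.compile(r"\d+(,\d+)?")


def check_row(line):
    """Split one OCR line on '|' and return (tokens, suspicious_tokens)."""
    tokens = [t.strip() for t in line.split("|") if t.strip()]
    suspicious = [t for t in tokens if not CELL.fullmatch(t)]
    return tokens, suspicious


tokens, suspicious = check_row("1,7 | 57 | 71 | 59 | .70 | 65")
print(tokens)      # ['1,7', '57', '71', '59', '.70', '65']
print(suspicious)  # ['.70']
```

A check like this only catches malformed tokens. A misread digit that still yields a well-formed number passes unnoticed, so the result still needs to be compared against the source image.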

Code:

import cv2
from pytesseract import image_to_string

img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]

# Row height: a fifth of the image height, trimmed by 20 px (109 for h = 645)
row_h = int(h/5) - 20
start_index = 0
end_index = row_h

for i in range(0, 6):
    # Crop one table row across the full width
    gry_crp = gry[start_index:end_index, 0:w]
    # Division normalization: divide by a heavily blurred copy of the row
    blr = cv2.GaussianBlur(gry_crp, (145, 145), 0)
    div = cv2.divide(gry_crp, blr, scale=192)
    # --psm 6: assume a single uniform block of text
    txt = image_to_string(div, config="--psm 6")
    print(txt)
    # Move on to the next row
    start_index = end_index
    end_index = start_index + row_h
