
I want to perform OCR on images like this one:

6x6 matrix with numerical values

It is a table of numerical data with commas as decimal separators. It is not noisy and the contrast is good: black text on a white background. As an additional preprocessing step, to get around issues with the frame borders, I cut out every cell, binarize it, pad it with a white border (to prevent edge issues) and pass only that single cell image to tesseract. I also looked at the individual cell images to make sure the cutting process works as expected and does not produce artifacts. These are two examples of the input images for tesseract:

Single cell from above table. Content: 1,7

Single cell from above table. Content: 57

Unfortunately, tesseract is unable to parse these consistently. I have found no configuration where all 36 values are recognized correctly.

There are a couple of similar questions here on SO, and the usual answer is a suggestion for a specific combination of the --oem and --psm parameters. So I wrote a Python script with pytesseract that loops over all combinations of --oem from 0 to 3 and --psm from 0 to 13, as well as lang=eng and lang=deu. I ignored the combinations that throw errors.
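The parameter sweep described above could look roughly like the following. This is only a sketch, not the original script: `build_configs` and `run_grid` are hypothetical names, and the OCR function is passed in as a callable (for example `pytesseract.image_to_string`, which accepts `lang` and `config` keyword arguments).

```python
from itertools import product


def build_configs(oems=range(0, 4), psms=range(0, 14), langs=("eng", "deu")):
    """Return every (lang, config) pair of the sweep: 2 * 4 * 14 = 112 entries."""
    return [(lang, f"--oem {oem} --psm {psm}")
            for lang, oem, psm in product(langs, oems, psms)]


def run_grid(image, ocr, configs=None):
    """Call ocr(image, lang=..., config=...) for each combination.

    Combinations that raise an error are skipped, as in the question.
    """
    results = {}
    for lang, config in (configs or build_configs()):
        try:
            results[(lang, config)] = ocr(image, lang=lang, config=config).strip()
        except Exception:
            continue  # ignore combinations that throw errors
    return results
```

With pytesseract installed, a call such as `run_grid(cell_image, pytesseract.image_to_string)` would return a dictionary mapping each working combination to the recognized text, which makes it easy to compare results across cells.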

Example 1: With --psm 13 --oem 3 the above "1,7" image is misidentified as "4,7", but the "57" image is correctly recognized as "57".

Example 2: With --psm 6 --oem 3 the above "1,7" image is correctly recognized as "1,7", but the "57" image is misidentified as "o/".

Any suggestions on what else might help improve the output quality of tesseract here?

My tesseract version:

tesseract v4.0.0.20190314
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
 Found AVX2
 Found AVX
 Found SSE

1 Answer


Solution


From the original image, we can see that the table consists of 6 separate rows.

In each iteration, we take one row, apply normalization, and read it.

To do that, the row indexes have to be set carefully.

import cv2
from pytesseract import image_to_string

img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]

start_index = 0
end_index = int(h/5)

Question: Why do we declare start and end indexes?

We want to read a single row in each iteration. Let's look at an example:

The current image height and width are 645 and 1597 pixels.

Divide the image into rows based on these indexes:

start-index   end-index
0             129
129           258 (129 + 129)
258           387 (258 + 129)
387           516 (387 + 129)

Let's see whether these crops are readable:

start-index   end-index   image
0             129         [image]
129           258         [image]
258           387         [image]
387           516         [image]

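No, they are not suitable for reading: a step of 129 pixels is too large, so the crops do not line up with the table rows. A small adjustment helps. Reducing the step by 20 pixels gives 109, which is close to the height of one of the six rows (645 / 6 = 107.5):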

start-index   end-index        image
0             109 (129 - 20)   [image]
109           218              [image]
218           327              [image]
327           436              [image]
436           545              [image]
545           654              [image]

Now they are suitable for reading.
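The index arithmetic behind both tables can be reproduced in a few lines. This is a minimal sketch; `row_bounds` is a hypothetical helper name, and the height of 645 pixels is the value stated above.

```python
def row_bounds(step, n_rows, start=0):
    """Return (start_index, end_index) pairs for n_rows consecutive strips of height step."""
    return [(start + i * step, start + (i + 1) * step) for i in range(n_rows)]


h = 645  # image height stated above

# First attempt: step of h/5 = 129 pixels.
naive = row_bounds(int(h / 5), 5)

# Adjusted attempt: step reduced by 20 pixels, six strips.
adjusted = row_bounds(int(h / 5) - 20, 6)

print(naive)
print(adjusted)
```

Note that the last adjusted strip ends at 654, slightly past the image height of 645. NumPy slicing clamps an end index that exceeds the array size, so the crop simply stops at the last pixel row.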


When we apply the division-normalization to each row:

start-index   end-index   image
0             109         [image]
109           218         [image]
218           327         [image]
327           436         [image]
436           545         [image]
545           654         [image]
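To make the division-normalization step more concrete, here is a toy illustration of the per-pixel arithmetic in plain Python. It assumes the formula documented for `cv2.divide` on 8-bit images, `saturate(src1 * scale / src2)` with a result of 0 where the divisor is 0. `divide_normalize` is a hypothetical name, and OpenCV's exact rounding may differ slightly from this sketch.

```python
def divide_normalize(pixels, background, scale=192):
    """Divide each pixel by its blurred background estimate and rescale.

    pixels and background are equal-length sequences of 0..255 values.
    """
    out = []
    for p, b in zip(pixels, background):
        value = round(p * scale / b) if b else 0  # a zero divisor yields 0
        out.append(max(0, min(255, value)))       # saturate to the 8-bit range
    return out


# A pixel equal to its background maps to the scale value (192),
# a pixel darker than its background maps to a proportionally darker value,
# and a pixel much brighter than its background saturates at 255.
print(divide_normalize([200, 100, 250], [200, 200, 100]))
```

The effect is that slow brightness variations captured by the blurred image are divided out, while text, which is much darker than its local background, stays dark.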

Now when we read:

1,7 | 57 | 71 | 59 | .70 | 65

| 57 | 1,5 | 71 | 59 | 70 | 65

| 71 | 59 | 1,3 | 57 | 70 | 60

| 71 | 59 | 56 | 1,3 | 70 | 60

| 72 | 66 | 71 | 59 | 1,2 | 56

| 72 | 66 | 71 | 59 | 56 | 4,3
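If the recognized rows need to be turned into numbers, the comma decimal separators have to be converted first. The following is a small sketch of such a post-processing step; it is not part of the code below, and `parse_row` is a hypothetical helper name.

```python
def parse_row(line):
    """Split an OCR'd row on '|' and convert each non-empty cell to a float.

    Commas are treated as decimal separators, as in the source table.
    """
    cells = [cell.strip() for cell in line.split("|")]
    return [float(cell.replace(",", ".")) for cell in cells if cell]


print(parse_row("| 57 | 1,5 | 71 | 59 | 70 | 65"))
print(parse_row("1,7 | 57 | 71 | 59 | .70 | 65"))
```

A cell such as `.70` in the first row still parses, but as 0.7, so parsed values would need a plausibility check before being trusted.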

Code:

import cv2
from pytesseract import image_to_string

img = cv2.imread("0VXIY.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
(h, w) = gry.shape[:2]

# Height of one row strip: the initial h/5 step reduced by 20 pixels.
row_height = int(h / 5) - 20
start_index = 0
end_index = row_height

for i in range(0, 6):
    # Crop the current row across the full image width.
    gry_crp = gry[start_index:end_index, 0:w]
    # Estimate the background with a heavy blur, then divide it out.
    blr = cv2.GaussianBlur(gry_crp, (145, 145), 0)
    div = cv2.divide(gry_crp, blr, scale=192)
    # --psm 6: treat the strip as a single uniform block of text.
    txt = image_to_string(div, config="--psm 6")
    print(txt)
    # Move on to the next row.
    start_index = end_index
    end_index = start_index + row_height