7
votes

I tried to force Tesseract to use only my word list when performing OCR. First, I copied the bazaar config file to /usr/share/tesseract-ocr/5/tessdata/configs/. This is my bazaar file:

load_system_dawg F
load_freq_dawg F
user_words_suffix user-words

Then I created eng.user-words in /usr/share/tesseract-ocr/5/tessdata. This is my user-words file:

Items
VAT
included
CASH

Then I ran OCR on this image with the command: tesseract -l eng --oem 2 test_small.jpg stdout bazaar

test_img

This is my result:

2 Item(s) (VAT includsd) 36,000
casH 40,000
CHANGE 4. 000

As you can see, "includsd" is not in my user-words file; it should be "included". I also get the same result when I omit the bazaar config from the command, so it looks like neither my bazaar config nor eng.user-words has any effect on the OCR output. How can I use the bazaar config and the user-words file to get the desired result?


2 Answers

0
votes

All you need to do is up-sample the image.

If you up-sample it by a factor of two, Tesseract reads:

2 Item(s) (VAT included) 36,000
CASH 40,000
CHANGE 4,000

Code:

import cv2
import pytesseract

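# Load the image (cv2.imread returns None if the file cannot be read)
img = cv2.imread("4nGXo.jpg")
if img is None:
    raise FileNotFoundError("could not read 4nGXo.jpg")

# Convert to grayscale
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Up-sample by a factor of two in both directions
gry = cv2.resize(gry, (0, 0), fx=2, fy=2)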

# OCR
print(pytesseract.image_to_string(gry))

# Display
cv2.imshow("", gry)
cv2.waitKey(0)

-1
votes

user_words_suffix does not seem to work with --oem 2. A workaround is to set user_words_file instead, giving it the path to your user-words file.
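As a rough sketch of that workaround from Python, the word list can be written to a file and its path passed through `-c user_words_file=...`. The helper names `build_config` and `ocr_with_word_list` are made up for illustration, and it is an assumption (not verified here) that setting the variable via `-c` behaves the same as setting it in a config file. Note too that Tesseract may treat the word list as a hint rather than a hard constraint.

```python
import shlex
from pathlib import Path


def build_config(words, words_path, oem=2):
    """Write the word list to words_path and return a Tesseract config string.

    Hypothetical helper: mirrors the bazaar config from the question, but uses
    user_words_file (an explicit path) instead of user_words_suffix.
    """
    path = Path(words_path).resolve()
    # One word per line, as in the eng.user-words file from the question.
    path.write_text("\n".join(words) + "\n", encoding="utf-8")
    options = [
        "load_system_dawg=F",
        "load_freq_dawg=F",
        f"user_words_file={path}",
    ]
    # Quote each option so a path containing spaces survives argument splitting.
    flags = " ".join("-c " + shlex.quote(opt) for opt in options)
    return f"--oem {oem} {flags}"


def ocr_with_word_list(image_path, words, words_path="eng.user-words"):
    """Run OCR on image_path with the word list applied (needs pytesseract)."""
    import pytesseract  # imported lazily so build_config works without it

    config = build_config(words, words_path)
    return pytesseract.image_to_string(image_path, lang="eng", config=config)
```

For the image in the question, this would be called as `ocr_with_word_list("test_small.jpg", ["Items", "VAT", "included", "CASH"])`. It can be combined with the up-sampling from the other answer if the dictionary alone does not fix the misreads.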