I am trying to force Tesseract to use only my word list when performing OCR.
First, I copied the bazaar config file to /usr/share/tesseract-ocr/5/tessdata/configs/. This is my bazaar file:
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
Then I created eng.user-words in /usr/share/tesseract-ocr/5/tessdata. This is my user-words file:
Items
VAT
included
CASH
Then I ran OCR on the image with this command:
tesseract -l eng --oem 2 test_small.jpg stdout bazaar
This is my result:
2 Item(s) (VAT includsd) 36,000
casH 40,000
CHANGE 4. 000
As you can see, includsd is not in my user-words file; it should be 'included'. Besides, I get the same result even when I leave the bazaar config out of the command. It looks like my bazaar config and my eng.user-words file have no effect on the OCR output. So how can I use the bazaar config and the user-words file to get the desired result?
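
For anyone who wants to reproduce this, here is a minimal Python sketch of the setup described above. It only writes the two files and assembles the same command line; it does not run Tesseract. The helper names write_setup and build_command are hypothetical, and the files go into a scratch directory rather than the real tessdata path (that substitution is an assumption made so the sketch runs without root).

```python
import tempfile
from pathlib import Path

# Contents of the bazaar config file, exactly as shown in the question.
BAZAAR_CONFIG = (
    "load_system_dawg F\n"
    "load_freq_dawg F\n"
    "user_words_suffix user-words\n"
)

# Contents of eng.user-words, one word per line.
USER_WORDS = ["Items", "VAT", "included", "CASH"]


def write_setup(tessdata_dir, lang="eng"):
    """Write configs/bazaar and <lang>.user-words under tessdata_dir.

    Hypothetical helper: tessdata_dir stands in for
    /usr/share/tesseract-ocr/5/tessdata from the question.
    """
    tessdata_dir = Path(tessdata_dir)
    configs_dir = tessdata_dir / "configs"
    configs_dir.mkdir(parents=True, exist_ok=True)

    config_path = configs_dir / "bazaar"
    config_path.write_text(BAZAAR_CONFIG)

    words_path = tessdata_dir / (lang + ".user-words")
    words_path.write_text("\n".join(USER_WORDS) + "\n")
    return config_path, words_path


def build_command(image, lang="eng", oem=2, config="bazaar"):
    """Assemble the same argument list as the command in the question."""
    return ["tesseract", "-l", lang, "--oem", str(oem), image, "stdout", config]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config_path, words_path = write_setup(tmp)
        print(config_path.read_text())
        print(words_path.read_text())
    print(" ".join(build_command("test_small.jpg")))
```

The argument list from build_command could be passed to subprocess.run on a machine where Tesseract is installed and the two files sit in the real tessdata directory.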