Issue training Tesseract OCR 4 - Empty shape table

Question

I am trying to train Tesseract 4 on my own pictures (to read 7-segment multimeter displays).

Please note that I am aware of the already trained data from Arthur Augusto at https://github.com/arturaugusto/display_ocr but I need to train Tesseract on my own data.

To train Tesseract, I followed several tutorials (such as https://robipritrznik.medium.com/recognizing-vehicle-license-plates-on-images-using-tesseract-4-ocr-with-custom-trained-models-4ba9861595e7 or https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/),

but I always run into a problem when running the shapeclustering command with my own data.

(With example data such as https://github.com/tesseract-ocr/tesseract/issues/1174#issuecomment-338448972, everything works fine.)

When I run the shapeclustering command, it produces this output (screenshot). My shape_table is then empty, so the training cannot be effective.

With the example data it works fine and the shape_table is properly filled.

I am guessing that I have an issue with box file generation. Here is my process to create the box file:

I use the

tesseract imageFileName.tif imageFileName  batch.nochop makebox

command to generate the box file, and then I edit it with jTessBoxEditor.

So I can't see what is wrong with my .box/.tif pair.
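
One way to rule out a malformed box file is to check each line mechanically before training. The following is a minimal Python sketch, assuming the usual Tesseract box format of `symbol left bottom right top page` with integer coordinates; the function name `check_box_lines` is hypothetical and not part of any Tesseract tool. It only detects structural problems and cannot tell whether the boxes match what Tesseract segments from the image.

```python
from typing import List, Tuple


def check_box_lines(lines: List[str]) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for malformed box-file lines.

    Assumes each non-empty line has the form:
        symbol left bottom right top page
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # blank lines are skipped, not reported
        # Split from the right so the five numeric fields are isolated.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            problems.append((number, "expected 6 fields"))
            continue
        symbol, *coords = parts
        try:
            left, bottom, right, top, page = (int(c) for c in coords)
        except ValueError:
            problems.append((number, "non-integer coordinate"))
            continue
        if not symbol:
            problems.append((number, "empty symbol"))
        elif left >= right or bottom >= top:
            problems.append((number, "zero or negative box size"))
    return problems
```

For example, reading `sev7.exp0.box` with `open(...).readlines()` and passing the result to `check_box_lines` gives an empty list when every line is structurally valid.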

Have a good day, and thanks for helping me.

Adrien

Here is my full batch script for training, run after generating and editing the box files.

set name=sev7.exp0
set shortName=sev7

echo Run Tesseract for Training.. 
tesseract.exe %name%.tif %name% nobatch box.train 
 
echo Compute the Character Set.. 
unicharset_extractor.exe %name%.box 

shapeclustering -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
mftraining -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
echo Clustering.. 
cntraining.exe %name%.tr
echo Rename Files.. 
rename normproto %shortName%.normproto 
rename inttemp %shortName%.inttemp 
rename pffmtable %shortName%.pffmtable 
rename shapetable %shortName%.shapetable
echo Create Tessdata.. 
combine_tessdata.exe %shortName%.
echo. & pause

Adrien p · Accepted Answer · 2020-12-09T19:55:34

OK, so I finally managed to train Tesseract.

The solution is to add a --psm parameter to the command

tesseract.exe %name%.tif %name% nobatch box.train

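so that it becomes (with %typeFile% set to the image extension and %psm% to the page segmentation mode)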

tesseract.exe %name%.%typeFile% %name%  --psm %psm% nobatch box.train

Note that the possible psm values are:

REM pagesegmode values are:

REM   0 = Orientation and script detection (OSD) only.
REM   1 = Automatic page segmentation with OSD.
REM   2 = Automatic page segmentation, but no OSD, or OCR
REM   3 = Fully automatic page segmentation, but no OSD. (Default)
REM   4 = Assume a single column of text of variable sizes.
REM   5 = Assume a single uniform block of vertically aligned text.
REM   6 = Assume a single uniform block of text.
REM   7 = Treat the image as a single text line.
REM   8 = Treat the image as a single word.
REM   9 = Treat the image as a single word in a circle.
REM   10 = Treat the image as a single character.
REM   11 = Sparse text. Find as much text as possible in no particular order.
REM   12 = Sparse text with OSD.
REM   13 = Raw line. Treat the image as a single text line bypassing hacks that are Tesseract-specific.

Found at https://github.com/tesseract-ocr/tesseract/issues/434
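
As an illustration, the fixed command can be assembled programmatically. This is a minimal Python sketch, not part of the original batch script: the function name `box_train_command` is hypothetical, and the value 7 (single text line) in the usage note is only an assumption about what might suit a one-line 7-segment display. The right mode depends on the images.

```python
from typing import List


def box_train_command(name: str, type_file: str, psm: int) -> List[str]:
    """Build the box.train command with an explicit page segmentation mode.

    Mirrors: tesseract %name%.%typeFile% %name% --psm %psm% nobatch box.train
    """
    # The list above documents modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return [
        "tesseract",
        f"{name}.{type_file}",  # input image, e.g. sev7.exp0.tif
        name,                   # output base name for the .tr file
        "--psm", str(psm),
        "nobatch",
        "box.train",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, for example with `cmd = box_train_command("sev7.exp0", "tif", 7)`.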
