Training tesseract 4 with images instead of font

Question

I have some questions about making tiff/box files for Tesseract 4. The "Training Tesseract 4.00" documentation says:

Making Box Files: As with base Tesseract, there is a choice between rendering synthetic training data from fonts, or labeling some pre-existing images (like ancient manuscripts for example).

But it does not explain how to train with pre-existing images.

I want to train Tesseract 4 (LSTM) for the Persian language. I have some images of ancient manuscripts and want to train with the images and their transcriptions instead of fonts, so I can't use the text2image command. I know that the old-format box files will not work for LSTM training.

How can I make tif/box files for Tesseract 4 LSTM training and label them, and how do the tesseract commands change?
Should I use other tools for generating box files (given that Persian is written right to left)?
Should I fine-tune or train from scratch?

Hello. I suggest you ask these four questions separately and provide some code. — prgrm

Raniem Raniem · Accepted Answer · 2018-08-23T12:58:39

I was struggling just like you until I found this GitHub repository: https://github.com/OCR-D/ocrd-train

It will make your life super easy. All you need to do is provide your images in tif format, each with a transcription text file that has the same base name as the image and the extension .gt.txt. It will take care of the rest for you. (You might need to update the Makefile to suit your local machine.)
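Before starting a training run, it can help to check that every image really has a matching transcription. Here is a minimal Python sketch of such a check, assuming the images use the lowercase .tif extension and sit in the same folder as their .gt.txt files. The function name find_unpaired is made up for illustration and is not part of ocrd-train.

```python
from pathlib import Path


def find_unpaired(ground_truth_dir):
    """Return (images_without_text, texts_without_image) as sorted base names.

    Assumes the convention described above: "name.tif" is paired with
    "name.gt.txt" in the same folder. This helper is hypothetical and
    is not part of ocrd-train.
    """
    root = Path(ground_truth_dir)
    # "page01.tif" -> "page01"
    image_names = {p.name[: -len(".tif")] for p in root.glob("*.tif")}
    # "page01.gt.txt" -> "page01"
    text_names = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    # Set differences give the files that lack a partner.
    return sorted(image_names - text_names), sorted(text_names - image_names)
```

Calling find_unpaired on your ground-truth folder (whatever folder your Makefile points at) returns two lists. If both are empty, every image has a transcription and vice versa.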

Whether to train from scratch or fine-tune depends on your language, your data, and the problem you are trying to solve. For me, fine-tuning is what I need, because I am happy with the current performance and only want to build on it.

All the useful details you might need can be found in this answer
