Tesseract 4: Why isn't my training data compiling?

Question

I am trying to train Tesseract 4 to recognise electronic circuit diagram symbols (resistors, capacitors, etc.) from images, but there seems to be no straightforward guide to training Tesseract, and the official documentation focuses on fonts rather than image data.

The reply on this post is the most helpful thing I've found so far, but when I follow its steps I get an error (shown further down).

What I've done so far:

Successfully compiled Tesseract 4.1.1 and the training tools on Ubuntu 16
Successfully cloned the tesstrain repo
Generated 4 TIFF images of components, named image0.tiff to image3.tiff
Generated 4 plain text files with matching names, image0.gt.txt to image3.gt.txt
Each text file contains the name of the component, e.g. resistor, capacitor, etc.
Moved these files into what I believed was the appropriate location (tesstrain/data)

Note: I know I need far more data than this; this is just a test to get everything working and successfully produce a .traineddata file.

When I run the command "make training MODEL_NAME=testModel_1" I get the following in my console:

@CKVM1:~/Downloads/tesstrain$ make training MODEL_NAME=testModel_1
find: ‘data/testModel_1-ground-truth’: No such file or directory
find: ‘data/testModel_1-ground-truth’: No such file or directory
Error: missing ground truth for training
Makefile:175: recipe for target 'data/testModel_1/list.train' failed
make: *** [data/testModel_1/list.train] Error 1

I believe the issue is that the instructions in the post I linked say to set the "START_MODEL" parameter, which, as far as I understand, uses whichever language you give it as a starting point to reduce training time. Since I'm using custom symbols and not actual letters, I don't see how that would benefit me. However, it seems that training expects a (more general?) ground truth file to already be present before it starts, and I am unsure how to provide that.

Any ideas on how to resolve this?

Eric Ihli · Accepted Answer · 2021-04-23T11:56:10

Make sure that your training data is in `tesstrain/data/testModel_1-ground-truth`.

You can look at what `make training` is doing at https://github.com/tesseract-ocr/tesstrain/blob/0d972f86f4aaf88fde77e3445ff607e68866c882/Makefile#L200

You'll see that it's looking for `*.gt.txt` files in `GROUND_TRUTH_DIR`.

$(ALL_GT): $(shell find $(GROUND_TRUTH_DIR) -name '*.gt.txt')
    @mkdir -p $(OUTPUT_DIR)
    find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs paste -s > "$@"

By default, `GROUND_TRUTH_DIR` is defined as `GROUND_TRUTH_DIR := $(OUTPUT_DIR)-ground-truth`.

And if we keep tracing the Makefile variables back...

# Name of the model to be built. Default: $(MODEL_NAME)
MODEL_NAME = foo

# Data directory for output files, proto model, start model, etc. Default: $(DATA_DIR)
DATA_DIR = data

# Output directory for generated files. Default: $(OUTPUT_DIR)
OUTPUT_DIR = $(DATA_DIR)/$(MODEL_NAME)
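
To make the chain concrete, here is a small shell sketch that mirrors those defaults by hand. It is only an illustration of how the values expand, not part of tesstrain; the only value taken from the question is MODEL_NAME=testModel_1.

```shell
# Mirror the Makefile defaults quoted above. Only MODEL_NAME is overridden,
# as in the question's `make training MODEL_NAME=testModel_1` call.
MODEL_NAME="testModel_1"
DATA_DIR="data"
OUTPUT_DIR="${DATA_DIR}/${MODEL_NAME}"
GROUND_TRUTH_DIR="${OUTPUT_DIR}-ground-truth"

echo "${GROUND_TRUTH_DIR}"   # prints data/testModel_1-ground-truth
```

That is the same path the two `find` errors in the question complain about.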

Judging by your error message, none of these variables have been changed from their defaults, which is good. The training step is simply complaining that there is no folder at `tesstrain/data/testModel_1-ground-truth`, which is where it expects the ground truth. Once your files are in that folder, everything should work.
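
A minimal sketch of the fix, assuming you run it from the root of the tesstrain checkout and that the image and .gt.txt files currently sit directly in data/ as described in the question. The helper name `stage_ground_truth` and the list of image extensions are my own assumptions for illustration; they do not come from tesstrain.

```shell
# Hypothetical helper (not part of tesstrain): create the ground-truth
# directory for a model and move each transcription plus its image into it.
stage_ground_truth() {
    model_name="$1"
    gt_dir="data/${model_name}-ground-truth"

    # This is the directory that `make training` searches by default.
    mkdir -p "${gt_dir}"

    for gt in data/*.gt.txt; do
        # Skip the literal pattern when no .gt.txt file matched the glob.
        if [ ! -e "${gt}" ]; then continue; fi
        base="${gt%.gt.txt}"
        mv "${gt}" "${gt_dir}/"
        # Assumed extensions; adjust to whatever your images actually use.
        for img in "${base}.tif" "${base}.tiff" "${base}.png"; do
            if [ -e "${img}" ]; then mv "${img}" "${gt_dir}/"; fi
        done
    done
}

stage_ground_truth "testModel_1"
# Afterwards, rerun: make training MODEL_NAME=testModel_1
```

Each image and its .gt.txt file keep the same base name, so the pairs stay together after the move.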
