20
votes

Is there any way to get Tesseract to match only user-specified words or patterns? The manual claims it is possible, yet I cannot find a single documented instance on the internet of somebody getting this working.

Here are several threads from people asking for help because it does not work; none has a confirmed resolution.

stackoverflow.com/questions/33429143/tesseract-user-pattern-is-not-applied

stackoverflow.com/questions/31874393/tesseract-ocr-force-pattern

stackoverflow.com/questions/26856349/provide-pattern-for-tesseract

stackoverflow.com/questions/22432194/tesseract-ocr-only-detect-user-words

stackoverflow.com/questions/17209919/tesseract-user-patterns

groups.google.com/forum/#!topic/tesseract-ocr/S9CIK3jOMWw

groups.google.com/forum/#!topic/tesseract-ocr/5vFqVcJmHnM

So can we conclude that this feature simply does not work? Is there an official statement to this effect?

1
A lot of the linked Tesseract documents appear to have moved. Here is a link to a manual on GitHub. – Evan
A year later, this still appears to be the case. – Slight
The link to the manual is dead. – Adelin
Repo admins say that user-patterns broke somewhere around v3.02. LSTM v4.0 probably has broken user-patterns as well as char-whitelisting: github.com/tesseract-ocr/tesseract/issues/960 – NightFury13

1 Answer

6
votes

There is now an example on the Tesseract doc site at https://tesseract-ocr.github.io/tessdoc/APIExample-user_patterns.html [Thanks @Ravi for the new link]

That test example does work for me in the oem=1 / LSTM mode of Tesseract 4.x.

I can't, however, get it to work for any other examples, or in any other modes.
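For anyone who wants to try reproducing this, here is a minimal sketch of how a patterns file and a Tesseract 4.x command line might be put together from Python. It assumes the `--oem`, `--psm` and `--user-patterns` command-line options of Tesseract 4.x; the file names, the page segmentation mode and the pattern itself (`\d` standing for a digit in the user-patterns syntax) are hypothetical placeholders, and given the reports above there is no guarantee the pattern will actually be honoured.

```python
def patterns_text(patterns):
    """Return the body of a user-patterns file: one pattern per line."""
    return "".join(p + "\n" for p in patterns)


def build_command(image, patterns_file, out_base="stdout", oem=1, psm=6):
    """Build an argv list for the tesseract CLI (not executed here).

    Assumption: Tesseract 4.x, where options follow the image and the
    output base. oem=1 selects the LSTM engine; psm=6 is only a guess
    at a suitable page segmentation mode.
    """
    return [
        "tesseract", str(image), out_base,
        "--oem", str(oem),
        "--psm", str(psm),
        "--user-patterns", str(patterns_file),
    ]


if __name__ == "__main__":
    # Hypothetical pattern: three digits, a hyphen, four digits.
    print(patterns_text([r"\d\d\d-\d\d\d\d"]), end="")
    print(" ".join(build_command("page.png", "eng.user-patterns")))
```

The resulting list can be passed to `subprocess.run` once the patterns text has been written to the file named on the command line.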

I have seen no official statement, and at the time of writing it does indeed seem that the feature is non-functional.