Does anyone know how Tesseract - OCR postprocessing / spellchecking works?

Question

I was using tesseract-ocr (pytesseract) for Spanish, and it achieves very high accuracy when you set the language to Spanish and, of course, the text is in Spanish. If you do not set the language to Spanish, it does not perform nearly as well. So I'm assuming that Tesseract applies several postprocessing models for spellchecking and improving accuracy. I was wondering if anybody knows which of those models (e.g. edit distance, noisy channel modeling) Tesseract applies. Thanks in advance!

Pytesseract is open source and on GitHub. Had you checked that, you would have read that it's a wrapper around Google Tesseract which is also open source and on GitHub. — Jongware
I already read their wiki on GitHub and did not find what I was looking for. Thank you anyway! — Tomas

user898678 · Accepted Answer · 2020-01-22T08:28:08

Your assumption is wrong: if you do not specify a language, Tesseract uses the English model as the default for OCR. That is why you got wrong results for Spanish input text. There is no spellchecking postprocessing.
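To make the default behavior concrete, here is a minimal sketch with pytesseract. The helper names `resolve_lang` and `ocr_image` are hypothetical and only for illustration; the sketch assumes that Tesseract, its `spa` traineddata file, pytesseract and Pillow are installed.

```python
from typing import Optional


def resolve_lang(lang: Optional[str] = None) -> str:
    """Hypothetical helper: make Tesseract's default explicit.

    When no language is given, Tesseract falls back to its English
    model ("eng"), so we return that instead of leaving it implicit.
    """
    return lang if lang else "eng"


def ocr_image(path: str, lang: Optional[str] = None) -> str:
    """Hypothetical helper: run OCR on an image file with an explicit language."""
    # Imported lazily so the module loads even where OCR is not set up.
    import pytesseract
    from PIL import Image

    # lang="spa" selects the Spanish model; without it the English model is used.
    return pytesseract.image_to_string(Image.open(path), lang=resolve_lang(lang))
```

For Spanish input you would call something like `ocr_image("scan.png", lang="spa")`, where `scan.png` is a placeholder file name. Calling it without `lang` runs the English model on the same image, which is the situation described in the question.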
