Good day,
I'm trying to configure SOLR to use the Tesseract OCR engine for text extraction from images, but have not had success yet.
SOLR extracts text fine from structured text documents (.xls, .pdf, .doc, etc.), but it does not call the Tesseract module for text recognition.
I'm using
- SOLR v.7.4.0
- Tesseract version 4.1.1
- TIKA version 1.18 (built into SOLR, no standalone version)
Tesseract is installed in the following directory:
/usr/share/tesseract/4/tessdata/
echo $TESSDATA_PREFIX -> /usr/share/tesseract/4/tessdata/
tesseract -v
tesseract 4.1.1-rc2-20-g01fb
leptonica-1.76.0
libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0
The command tesseract test.jpg test.txt
produces an accurate txt file with the OCRed content from test.jpg.
The solrconfig.xml, TesseractOCRConfig.properties, and ParseContent.xml files were modified to point to the Tesseract installation.
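For anyone who wants to reproduce the problem, here is a minimal sketch of how the image can be sent to SOLR's extracting request handler with extractOnly=true, so that the extracted text comes back in the response instead of being indexed. The base URL http://localhost:8983/solr, the core name "mycore", and the helper names build_extract_url and extract_only are assumptions for illustration only; they are not taken from my actual setup, and the handler is assumed to be registered at /update/extract.

```python
import urllib.parse
import urllib.request


def build_extract_url(solr_base, core):
    """Build the URL of the extracting request handler for a core.

    Assumes the handler is registered at /update/extract.
    extractOnly=true asks SOLR to return the extracted content
    in the response instead of indexing it.
    """
    query = urllib.parse.urlencode({"extractOnly": "true", "wt": "json"})
    return f"{solr_base}/{core}/update/extract?{query}"


def extract_only(solr_base, core, image_path, content_type="image/jpeg"):
    """POST the raw image bytes to SOLR and return the raw response text."""
    with open(image_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        build_extract_url(solr_base, core),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # "mycore" is a hypothetical core name; replace it with a real one.
    print(extract_only("http://localhost:8983/solr", "mycore", "test.jpg"))
```

If OCR were being invoked, I would expect the response to contain the recognized text from test.jpg; in my case the image text does not show up.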
Has anybody done such a configuration?