Good day, I'm trying to configure SOLR to use the Tesseract OCR engine for text extraction from images, but have not had success yet.

SOLR extracts text fine from structured text documents (.xls, .pdf, .doc, etc.), but it does not call the Tesseract module for text recognition.

I'm using:

  • SOLR 7.4.0
  • Tesseract 4.1.1
  • Tika 1.18 (built into SOLR, no standalone version)

Tesseract is installed in the following directory:

/usr/share/tesseract/4/tessdata/
echo $TESSDATA_PREFIX -> /usr/share/tesseract/4/tessdata/
tesseract -v
tesseract 4.1.1-rc2-20-g01fb
leptonica-1.76.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0

The command tesseract test.jpg test.txt produces an accurate text file with the OCRed content of test.jpg.
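
Since the command above only proves that Tesseract works for the logged-in user, a related check is whether the account that starts SOLR can also resolve the binary and sees the same TESSDATA_PREFIX. The following is a minimal sketch, not a definitive diagnostic: check_tool is a hypothetical helper name, and the account name solr in the comment is an assumption that should be replaced with the real one.

```shell
#!/bin/sh
# check_tool (hypothetical helper): report whether a command is on PATH
# for the user running this script.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $(command -v "$1")"
  else
    echo "missing: $1"
    return 1
  fi
}

# Is the tesseract binary visible to this user?
check_tool tesseract || true

# Is the data directory variable set in this environment?
echo "TESSDATA_PREFIX=${TESSDATA_PREFIX:-<unset>}"

# To repeat the PATH check as the account that starts SOLR
# ("solr" is an assumed account name):
#   sudo -u solr sh -c 'command -v tesseract'
```

If the binary is found for the interactive user but not for the SOLR account, the difference in PATH or environment is a likely place to look.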

The solrconfig.xml, TesseractOCRConfig.properties, and parseContext.xml files were modified to point to the Tesseract installation.

Has anybody done such a configuration?

1 Answer

Good day,

We solved the problem. Our installation used Tesseract 3.05, Tika 1.17, and SOLR 7.4. (We actually had Tika 1.17, not 1.18.)

Here is what was changed:

  1. Changed the setting from HOCR to TXT in the parseContext.xml file.
  2. Had to start SOLR as the root user.

Tesseract 4.1.1 is not compatible with Tika 1.17, so we will upgrade SOLR to version 7.7 and Tika to version 1.19, and then try to install Tesseract 4.1.1.
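
To confirm that OCR text actually comes back after changes like these, an image can be sent through SOLR's extracting handler without indexing it. This is a rough sketch under assumptions: the host, port, and core name (mycore) are placeholders, extract_url is a hypothetical helper, and the handler is assumed to be registered at /update/extract as in the stock configuration. The extractOnly and extractFormat parameters are standard options of that handler.

```shell
#!/bin/sh
# extract_url (hypothetical helper): build an extract-only request URL.
#   $1 = SOLR base URL (placeholder), $2 = core name (placeholder)
extract_url() {
  printf '%s/%s/update/extract?extractOnly=true&extractFormat=text\n' "$1" "$2"
}

# Print the URL that would be used.
extract_url "http://localhost:8983/solr" "mycore"

# With SOLR running and test.jpg in the current directory, the request
# could be sent like this (not executed here):
#   curl -s "$(extract_url http://localhost:8983/solr mycore)" -F "myfile=@test.jpg"
```

If the response contains only metadata and no recognized text, Tika is probably not invoking Tesseract for that file type.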