Environment & Setup

Operating System: SUSE Linux Enterprise Server 12 SP1

$ uname -a
Linux 3.12.62-60.64.8-default #1 SMP Tue Oct 18 12:21:38 UTC 2016 (42e0a66) x86_64 x86_64 x86_64 GNU/Linux

Since this environment is managed, I cannot update any system libraries such as glibc. The newest, and only officially supported, version of tesseract I found for "Suse 12 SP1 x86_64" is 3.02.

Installed Packages:

libgif4-4.1.6-34.1.1.x86_64.rpm
liblept3-1.69-16.1.x86_64.rpm
libtesseract3-3.02.02-3.2.1.x86_64.rpm
libwebp4-0.3.1-34.1.x86_64.rpm
tesseract-3.02.02-59.1.x86_64.rpm

tesseract version

$ tesseract -v
tesseract 3.02.02
    leptonica-1.69
        libgif 4.1.6 : libjpeg 8d : libpng 1.5.22 : libtiff 4.0.6 : zlib 1.2.8

Release details

$ zypper info tesseract
Information for package tesseract:
----------------------------------
Repository: @System
Name: tesseract
Version: 3.02.02-59.1
Arch: x86_64
Vendor: obs://build.opensuse.org/home:koprok
Support Level: unknown
Installed: Yes
Status: up-to-date
Installed Size: 3.8 MiB
Summary: Open Source OCR Engine
Description: […]

Traindata & Languages

Traindata has been manually downloaded from:

https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.eng.tar.gz/download
https://sourceforge.net/projects/tesseract-ocr-alt/files/tesseract-ocr-3.02.deu.tar.gz/download

The files have been extracted to /usr/share/tessdata/

$ ls -la /usr/share/tessdata/
drwxr-xr-x 1 root root      230 Dec 31 16:37 configs/
-rw-r--r-- 1 root root  2438081 Dec 30 15:31 deu.traineddata
-rw-r--r-- 1 root root   171918 Dec 30 20:16 eng.cube.bigrams
-rw-r--r-- 1 root root       38 Dec 30 20:16 eng.cube.fold
-rw-r--r-- 1 root root      181 Dec 30 20:16 eng.cube.lm
-rw-r--r-- 1 root root   857304 Dec 30 20:16 eng.cube.nn
-rw-r--r-- 1 root root      254 Dec 30 20:16 eng.cube.params
-rw-r--r-- 1 root root 13020078 Dec 30 20:16 eng.cube.size
-rw-r--r-- 1 root root  2444187 Dec 30 20:16 eng.cube.word-freq
-rw-r--r-- 1 root root      996 Dec 30 20:16 eng.tesseract_cube.nn
-rw-r--r-- 1 root root 21876572 Dec 30 20:16 eng.traineddata
drwxr-xr-x 1 root root       88 Dec 31 16:37 tessconfigs/

tesseract detects 'deu' and 'eng' as available languages

$ tesseract --list-langs
List of available languages (2):
deu
eng

Application & Problem

The software application is built on the Spring Boot framework. The code executing the tesseract command looks something like this:

Runtime.getRuntime().exec(new String[] { 
 "tesseract", 
 "--tessdata-dir", "/usr/share/tessdata", 
 "-l", lang.getISO3Language(), 
 inputTiff.toAbsolutePath().toString(), extractedcntPath });

The application logfile says

2016-12-30 20:30:02,320 [https-jsse-nio-8443-exec-7] WARN  PDFContentExtractor - read_params_file: parameter not found: II*

Executing tesseract with tessdata dir fails

$ tesseract --tessdata-dir /usr/share/tessdata -l deu inputPdf6632237754781472255.tiff out4
read_params_file: parameter not found: II*

Executing tesseract without the tessdata dir works well

$ tesseract -l deu input.tiff out5
Tesseract Open Source OCR Engine v3.02.02 with Leptonica

Questions & Ideas

  • Why does tesseract work well and detect the available languages without the --tessdata-dir parameter set?
  • Why does tesseract fail during initialization when the --tessdata-dir parameter is set?
  • Is there any difference between running tesseract with/without the --tessdata-dir parameter set?

What can I do to fix this problem?

  • Install a newer version of tesseract?
  • Compile a version from sources?
  • Use other traindata/tessdata?
  • Run tesseract without the --tessdata-dir param?

If anybody can help me get this issue solved in the upcoming week, it would make not only me happy, but our whole team.

Thank you very much in advance!
-Rüdiger

1 Answer

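The --tessdata-dir switch is not available until version 3.04. Version 3.02.02 does not recognize it, so the remaining arguments are most likely shifted and your TIFF file ends up being read as a config file; "II*" is the start of a little-endian TIFF header, which would explain the read_params_file message. Running the tesseract command without arguments will reveal which command options are supported by the installed version.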

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
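Given the version constraint above, one way to keep the Java call working on both old and new installations is to add --tessdata-dir only when the installed tesseract reports 3.04 or newer. The following is a minimal sketch, not a definitive implementation: the class name TesseractCommand and the methods supportsTessdataDir and buildCommand are hypothetical, and it assumes the version banner has the "tesseract 3.02.02" form shown in the question and that the caller has already captured that banner from `tesseract -v`.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class TesseractCommand {

    // Matches banners such as "tesseract 3.02.02" (assumed format).
    private static final Pattern VERSION =
            Pattern.compile("tesseract\\s+v?(\\d+)\\.(\\d+)");

    /** Returns true if the captured `tesseract -v` output reports 3.04 or newer. */
    static boolean supportsTessdataDir(String versionOutput) {
        Matcher m = VERSION.matcher(versionOutput);
        if (!m.find()) {
            return false; // unknown banner: be conservative and omit the switch
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        return major > 3 || (major == 3 && minor >= 4);
    }

    /** Builds the argument list: image and output base first, options after. */
    static List<String> buildCommand(String versionOutput, String tessdataDir,
                                     Locale lang, Path inputTiff, String outputBase) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(inputTiff.toAbsolutePath().toString());
        cmd.add(outputBase);
        if (supportsTessdataDir(versionOutput)) {
            cmd.add("--tessdata-dir");
            cmd.add(tessdataDir);
        }
        cmd.add("-l");
        cmd.add(lang.getISO3Language());
        return cmd;
    }
}
```

The resulting list can be passed to `new ProcessBuilder(cmd)` or converted with `cmd.toArray(new String[0])` for `Runtime.getRuntime().exec(...)`. On 3.02.02 the switch is simply left out, which corresponds to the invocation that already works in the question.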