1
votes

I need to integrate Tesseract OCR, which converts scanned images (for example, pages in a PDF) to text.

A TesseractOCRParser is already available in Tika,

but no method for invoking it is given.

When I try to build Tika with the Tesseract path set to my local installation, I get the following error:

Results :

Failed tests: testNoConfig(org.apache.tika.parser.ocr.TesseractOCRConfigTest): Invalid default tesseractPath value expected:<[]> but was:<[/home/serendio/tesseract-ocr/]>

Tests run: 569, Failures: 1, Errors: 0, Skipped: 7

Can anyone help me out?

Or is there another way to resolve this problem?

1
Do you have Tesseract installed? And how are you trying to call / use Tika? - Gagravarr
Yes, I have Tesseract on my machine. I am trying to build the Tika .jar for my system with the Tesseract path pointing to my local installation. The problem is that the Tika source does not build with that path set. - Kovalan R
Why are you trying to build Tika from source? To get started, you're much better off just downloading pre-built binaries, at least until you're used to it all - Gagravarr

1 Answer

3
votes

I think this can help: https://wiki.apache.org/tika/TikaOCR. I followed this guide and was able to extract the content easily. I simply installed Tesseract and then Tika.

Using Tika 1.9, I was easily able to:
- extract the content by directly calling a local Tika server
- extract the content in a custom application (you can use the tika-example project) with no effort

No modification was needed. Everything worked out of the box.
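
For the local-server route, a minimal client sketch might look like the following. It assumes a Tika server is already running on its default port 9998 (for example, started with `java -jar tika-server-1.9.jar`) and that Tesseract is installed on the server machine. The class name `TikaServerClientSketch` and the file name `scanned.pdf` are hypothetical, and the `java.net.http` client requires Java 11 or later.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical client: sends a document to a local Tika server and prints the extracted text.
public class TikaServerClientSketch {

    // Builds a PUT request carrying the raw document bytes.
    // "Accept: text/plain" asks the server for plain text instead of XHTML.
    static HttpRequest buildRequest(URI endpoint, byte[] document) {
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the file to extract is passed as the first argument.
        String file = args.length > 0 ? args[0] : "scanned.pdf";
        byte[] document = Files.readAllBytes(Paths.get(file));

        // Assumption: the server listens on localhost:9998 and exposes the /tika endpoint.
        HttpRequest request = buildRequest(URI.create("http://localhost:9998/tika"), document);

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Whether OCR text appears in the response depends on the server finding a working Tesseract installation; without one, only the text already embedded in the document would be returned.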
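
For the custom-application route, a rough sketch of calling Tika with OCR enabled could look like this. It assumes the pre-built `tika-parsers` 1.x jar is on the classpath, so nothing has to be built from source. The class name `OcrExtractSketch` and the command-line arguments are hypothetical. Setting the Tesseract path at runtime through `TesseractOCRConfig` leaves the bundled default untouched, which is presumably what the failing `testNoConfig` check in the question is verifying.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper: extracts text from a document, using OCR for embedded images.
public class OcrExtractSketch {

    static String extractText(InputStream in, String tesseractPath) throws Exception {
        Parser parser = new AutoDetectParser();
        // -1 disables the default limit on the number of characters collected.
        BodyContentHandler handler = new BodyContentHandler(-1);

        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        if (tesseractPath != null) {
            // Only needed when the tesseract binary is not on the system PATH.
            ocrConfig.setTesseractPath(tesseractPath);
        }

        // Scanned PDFs store pages as inline images, so ask the PDF parser to extract them.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, ocrConfig);
        context.set(PDFParserConfig.class, pdfConfig);
        // Registering the parser lets embedded images be handed on to the OCR parser.
        context.set(Parser.class, parser);

        parser.parse(in, handler, new Metadata(), context);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: args[0] is the document; args[1] is an optional Tesseract directory.
        String tesseractPath = args.length > 1 ? args[1] : null;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractText(in, tesseractPath));
        }
    }
}
```

If Tesseract cannot be found, Tika should simply skip the OCR step and return whatever text the other parsers extract.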