OCR Tesseract configuration

Question

I am working with Tesseract to extract vocabulary lists out of images.

The lists consist of two different languages. Unfortunately, there is only whitespace between lang1 and lang2 (maybe 3 or 4 blank characters).

Is there a way to define which string should be used to separate the two from each other?

The list could look like the following:

house, building    Haus, Gebäude
tree    Baum
...

I also have trouble getting a line break after each word pair.

Thanks!

Edit: I run this command

tesseract bilder/screenshot1.png output/screenshot1 -l swe+deu

to extract all entries from this picture

As you can see, there is no clear separator between the values. As output I get this

nej nein

jaha aha

Vad talar du för språk? Welche Sprachen sprichst du?
vad för welche, was für

tala (talar, talade, talat) sprechen

språk (-et, —, -en) Sprache

japanska japanisch

engelska englisch

Du då? Und du?

då da, dann, damals, als

bara nur

lite ein bisschen

verb (-et, —, en) Verb

position (—en, -er, -erna) Stellung, Position
OBS (= observera) NB, Achtung!

fråga (-n, -or, -orna) Frage

which is quite good. But I don't know how to separate each line into two strings, because there is no usable separator.

Please share what you have tried so far and what programming language you are using. Also sharing the image might help. — hcham1

cortex42 · Accepted Answer · 2016-10-21T12:02:18

You could use the Tesseract API and call the WordFontAttributes method of the ResultIterator class to determine whether each word is bold, then separate the two languages on that basis. This GitHub gist shows how the method is used.
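Once you have a bold flag per word, the splitting step itself is small. Here is a minimal Python sketch of that step only. It assumes you have already collected (word, is_bold) pairs for each line from the ResultIterator, and it assumes the left-hand language is printed in bold while the right-hand one is not, which may not match your image. The function name split_pair and the sample flags are made up for illustration.

```python
from typing import List, Tuple


def split_pair(words: List[Tuple[str, bool]]) -> Tuple[str, str]:
    """Split one OCR line into (left, right) at the first non-bold word.

    `words` is a list of (text, is_bold) pairs in reading order.
    Assumption: the left column is bold, the right column is not.
    """
    # Index of the first non-bold word; if every word is bold,
    # the right-hand side stays empty.
    cut = next(
        (i for i, (_, is_bold) in enumerate(words) if not is_bold),
        len(words),
    )
    left = " ".join(text for text, _ in words[:cut])
    right = " ".join(text for text, _ in words[cut:])
    return left, right


# Hypothetical input: the bold flags here are invented for illustration,
# they were not read from the real image.
line = [
    ("vad", True),
    ("för", True),
    ("welche,", False),
    ("was", False),
    ("für", False),
]
left, right = split_pair(line)
print(left + "\t" + right)
```

Writing each pair with a tab (or any other separator you choose) on its own line also gives you one word pair per line. If parts of the left column are not bold in your image, for example the inflection hints in parentheses, the rule for finding the cut point would need to be adjusted.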
