I am working with Tesseract to extract vocabulary lists out of images.
The lists consist out of 2 different languages. Unfortunately there is only whitespace between lang1 and lang2 (maybe 3 or 4 blank characters).
Is there a way to define, which string to take to separate the two from each other.
The list could look like the following:
house, building Haus, Gebäude tree Baum ...
Also I have problems to get a linebreak after each word-pair.
Thanks!
Edit: I run this command
tesseract bilder/screenshot1.png output/screenshot1 -l swe+deu
to extract all entries from this picture
As you can see, there is no clear separator between the values. As output I get this
nej nein
jaha aha
Vad talar du för språk? Welche Sprachen sprichst du?
vad för welche, was für
tala (talar, talade, talat) sprechen
språk (-et, —, -en) Sprache
japanska japanisch
engelska englisch
Du då? Und du?
då da, dann, damals, als
bara nur
lite ein bisschen
verb (-et, —, en) Verb
position (—en, -er, -erna) Stellung, Position
OBS (= observera) NB, Achtung!
fråga (-n, -or, -orna) Frage
which is quiet good. But I don't know how to seperate the string of each line in two strings because of the missing usable separator.