1
votes

I'm using OCR on historical newspapers that contain 6 columns per page. At present I use FineReader and define text blocks for each column. I'd like to use Tesseract. Tesseract gets the columns mostly right, but every few lines it reads into adjacent columns. I wonder if there's a way to set its parameters so that it will look quite rigidly for six columns.

Following suggestions on other questions, I've tried playing with --psm and hocr without great success.

Working with a JPG I've posted on GitHub, and converting it into a text-embedded PDF using this command:

    tesseract 1906-07-02-p4.jpg out -l eng+fra --psm 1 pdf

I get this result:

[image: screenshot of the resulting PDF]

Clearly the engine is making one block containing the indented lines, and another containing the flush lines.

Confirming this is the text output of the flush lines:


Grocery, Bar and Coffea shop of the trpops
stationed at the Citadel, Cairo.

to received tender for this service by 10 a.m.,
on Saturday, the 14th Jaly, 1906.

application in person to the Commandant,
Citadel, between the hours of 10 a.m. and
12 noon, daily.
—_—_——

Is there a way to constrain tesseract to certain column boundaries? (Obviously I could do this by cutting up the images but I'd like to avoid that work.)
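
The cutting-up can at least be scripted. Below is a minimal sketch, assuming the six columns are of roughly equal width and span the full page (real scans have margins and skew, so the boundaries would need adjusting). `column_boxes` and `split_columns` are hypothetical helper names, and the cropping relies on the third-party Pillow library:

```python
def column_boxes(width, height, n_columns=6, margin=0):
    """Return (left, upper, right, lower) crop boxes for equal-width columns.

    margin widens each box by that many pixels on both sides,
    clamped to the page edges, to tolerate slightly uneven columns.
    """
    boxes = []
    for i in range(n_columns):
        left = max(0, round(i * width / n_columns) - margin)
        right = min(width, round((i + 1) * width / n_columns) + margin)
        boxes.append((left, 0, right, height))
    return boxes


def split_columns(image_path, n_columns=6, margin=0):
    """Save each column as its own PNG and return the file names."""
    from PIL import Image  # Pillow (third-party), imported only when cropping

    page = Image.open(image_path)
    stem = image_path.rsplit(".", 1)[0]
    names = []
    boxes = column_boxes(*page.size, n_columns, margin)
    for i, box in enumerate(boxes, start=1):
        name = f"{stem}-col{i}.png"
        page.crop(box).save(name)
        names.append(name)
    return names
```

Each resulting file could then be passed to Tesseract on its own, for example with `--psm 6`, which treats the image as a single uniform block of text.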

1
Have you tried a different PSM mode? I think you should try --psm 6. - maulik kansara
--psm 6 is worse--it reads single lines across all six columns. :( - Will Hanley
Oh. If you have a fixed page design, then you can scan each column using UZN files with coordinates. - maulik kansara
For anyone who might be interested: you can make a big improvement in column recognition if you use a paint/photo editing program to draw straight black lines between every column on the source image. - Will Hanley
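
The UZN suggestion in the comments can be sketched as follows. As far as I know, Tesseract looks for a zone file named after the image (here `1906-07-02-p4.uzn`) in the same directory, expects one `left top width height label` line per zone, and only reads it in page segmentation modes that skip automatic column detection (such as `--psm 4`); this should be checked against the Tesseract version in use. The equal-width columns and the helper names `uzn_lines` and `write_uzn` are illustrative assumptions, not measurements from the scan:

```python
def uzn_lines(width, height, n_columns=6, label="Text"):
    """Build one 'left top width height label' line per equal-width column."""
    lines = []
    for i in range(n_columns):
        left = round(i * width / n_columns)
        right = round((i + 1) * width / n_columns)
        lines.append(f"{left} 0 {right - left} {height} {label}")
    return lines


def write_uzn(image_path, width, height, n_columns=6):
    """Write the zone file next to the image and return its path."""
    uzn_path = image_path.rsplit(".", 1)[0] + ".uzn"
    with open(uzn_path, "w", encoding="ascii") as f:
        f.write("\n".join(uzn_lines(width, height, n_columns)) + "\n")
    return uzn_path
```

With the page's pixel dimensions passed in, `write_uzn("1906-07-02-p4.jpg", width, height)` would produce the zone file, after which a run such as `tesseract 1906-07-02-p4.jpg out -l eng+fra --psm 4` should pick up the zones if the assumptions above hold.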

1 Answer

-1
votes

You can use

--psm 4 --oem 1

or --psm 4 --oem 3 to get better text and accuracy.