
I am trying to extract tables from old books using tesseract in R. Here is an example image.

The quality of the image is quite poor and the recognition rate was initially quite bad. However, I managed to improve it with gimp: rescaling, greyscale conversion, auto threshold for colours, and Gaussian blur and/or sharpen filters. I also gave Fred's ImageMagick textcleaner script a shot, and used ImageMagick to successfully remove the black lines. This is what I'm doing in R:

library(tesseract)
library(magick)

img <- image_read('img.png')
img_data <- ocr(img, engine = tesseract('eng', options = list(tessedit_char_whitelist = '0123456789.-',
                                                              tessedit_pageseg_mode = 'auto',
                                                              textord_tabfind_find_tables = '1',
                                                              textord_tablefind_recognize_tables = '1')))
cat(img_data)
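The preprocessing described above could also be sketched directly in R with magick rather than gimp. The parameter values here (scaling factor, thresholds, blur radius) are placeholders that would need tuning for these scans, and image_threshold() is only available in newer versions of magick:

library(magick)
library(tesseract)

img <- image_read('img.png')
img <- image_resize(img, '300%')                 # upscale the low-resolution scan
img <- image_convert(img, type = 'grayscale')    # drop colour information
img <- image_threshold(img, type = 'white', threshold = '60%')  # push light pixels to white
img <- image_threshold(img, type = 'black', threshold = '60%')  # push dark pixels to black
img <- image_blur(img, radius = 1, sigma = 0.5)  # mild smoothing before OCR

digits_only <- tesseract('eng', options = list(tessedit_char_whitelist = '0123456789.-'))
cat(ocr(img, engine = digits_only))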

Given that I only want to deal with digits, I set tessedit_char_whitelist and, while I get better results, they are still not reliable. What would you do in this case? Any other thoughts to improve accuracy before I try to train tesseract? I've never done it - let alone with digits only. Any idea/experience on how to do it? I've checked this out: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00 but I'm still a bit baffled.


1 Answer


I worked on a project that used Tesseract to read data fields off video frames and create an indexed spreadsheet from them. What I found to work well was to crop each text field (using ffmpeg) out of each image, process it (with ImageMagick, using techniques similar to the ones you mention), OCR it, and then have Python (something similar could be done in R) build a spreadsheet from the OCR results. The benefit of this method is that Tesseract only has to deal with small, single-line text images, which in my case seemed to improve results (with the -psm 7 option). The downside is that it's quite processing-intensive. Perhaps creating an image for each line of the page would help.
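In R, the same crop-then-OCR idea could be sketched with magick and tesseract. The cell geometries below are made-up placeholders; in practice they would come from the known row/column layout of your table, and tessedit_pageseg_mode = '7' is passed through the options list the same way the question already does (it corresponds to the -psm 7 command-line flag):

library(magick)
library(tesseract)

# Digit-only engine that treats each crop as a single line of text
digits <- tesseract('eng', options = list(tessedit_char_whitelist = '0123456789.-',
                                          tessedit_pageseg_mode = '7'))

page <- image_read('img.png')

# Hypothetical cell geometries ("widthxheight+x_offset+y_offset")
cells <- c('120x40+30+100', '120x40+150+100', '120x40+270+100')

values <- vapply(cells, function(geom) {
  cell <- image_crop(page, geom)
  trimws(ocr(cell, engine = digits))
}, character(1))

# Collect the recognised values into a one-row data frame
as.data.frame(t(values))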

I did find that training Tesseract for a new font/language helped my results immensely. It can be tedious and time-consuming, but it significantly improved my results, sometimes going from 0% correct to 100% correct. This site helped me understand the process; I just followed their steps and it worked. From my experience creating training images, it helped a lot to crop out single characters, with at least a dozen samples of each character to get a good training set. Also try to have a similar number of samples for each character: it seemed that if one character appeared far more often than the others, Tesseract would (incorrectly) return that character as a result more often.
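Purely to illustrate the point about balanced samples, here is a rough magick sketch (the folder layout and file naming are made up) for checking sample counts per character and tiling single-character crops into one training sheet; the accompanying box file still has to be created and corrected separately:

library(magick)

# Hypothetical folder of single-character crops named like "3_001.png",
# where the first character of the file name is the digit shown
files <- list.files('char_samples', pattern = '\\.png$', full.names = TRUE)
labels <- substr(basename(files), 1, 1)

# Check the samples are roughly balanced: over-represented characters
# tended to be returned (incorrectly) more often
print(table(labels))

# Tile the crops into a single sheet usable as a training page
chars <- do.call(c, lapply(files, image_read))
sheet <- image_append(chars, stack = FALSE)
image_write(sheet, 'training_sheet.png')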