
I am using the Tesseract OCR C++ library in Qt to extract text from a PNG image, using this code:

const char* lang = "eng";
QString filename = "D:/image.png";

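tesseract::TessBaseAPI tess;
// Init returns 0 on success; with NULL, Tesseract falls back to its default data path
if (tess.Init(NULL, lang, tesseract::OEM_DEFAULT) != 0)
{
    std::cout << "Cannot initialize Tesseract" << std::endl;
    return;
}
tess.SetPageSegMode(tesseract::PSM_AUTO);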

FILE* fin = fopen(filename.toStdString().c_str(), "rb");
if (fin == NULL)
{
    std::cout << "Cannot open " << filename.toStdString().c_str() << std::endl;
    return;
}
fclose(fin);

STRING text;
if (tess.ProcessPages(filename.toStdString().c_str(), NULL, 0, &text))
{
    // Show the result in the Qt plain-text widget
    ui->plainTextEdit->setPlainText(QString::fromUtf8(text.string()));
}

But the output is not accurate enough for the data in the table, and it contains strange characters. When I run the same image through an online OCR website, the result is 100% accurate. What is causing the wrong text: is it a problem with the library, or with my code? Or is there a better free library I could use for more accurate results?

I got the image from a PDF, using Ghostscript to render it at good quality, so the OCR library should be able to return the correct data.

The inaccuracy of the OCR does not depend on Qt; it depends on the class that does the recognition, so I consider the Qt tag irrelevant. – eyllanesc
Are you doing any preprocessing before processing your pages? If you look at their forums, several users mention that you should try black-and-white images (black text on a white background). Your text has a lot of fuzz around it, and you should try to preprocess that out. The online OCR most likely has something in place that automatically cleans up images and removes it. – AresCaelum
Yes, I use Ghostscript to get the image from the PDF file with these options: -dFirstPage=1 -dLastPage=1 -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300 -dUseCropBox – user7179690

2 Answers

0
votes

I am not experienced with C++, but I think your problem most likely relates to the line below:

tess.Init(NULL, lang, tesseract::OEM_DEFAULT);

It must point to the tessdata folder. Instead of NULL you can pass the folder name, for example "C:/tessdata/". Again, I am not experienced with C++, so you will have to decide whether to use a slash "/" or a backslash "\". This folder should contain the language file(s).
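To check this before calling Init, here is a minimal sketch that tests whether a folder contains the language file. The helper name hasLanguageData is hypothetical and not part of the Tesseract API; the only assumption taken from Tesseract itself is that language files are named "<lang>.traineddata".

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not part of the Tesseract API): reports whether
// `tessdataDir` holds "<lang>.traineddata", the file Tesseract loads for
// a language. Forward slashes also work on Windows; a backslash inside a
// C++ string literal would have to be written as "\\".
bool hasLanguageData(const std::string& tessdataDir, const std::string& lang)
{
    std::string path = tessdataDir;
    // Append a separator only if the folder name does not already end in one.
    if (!path.empty() && path.back() != '/' && path.back() != '\\')
        path += '/';
    path += lang + ".traineddata";

    // The file counts as present if it can be opened for reading.
    std::ifstream file(path, std::ios::binary);
    return file.is_open();
}
```

Calling hasLanguageData("C:/tessdata/", "eng") before tess.Init(...) makes it possible to report a clear error when the file is missing. Depending on the Tesseract version, Init may expect either this folder or its parent directory, so it is worth checking the documentation for the version in use.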

0
votes

As Eddge mentioned in his comment, you should apply some image preprocessing; there are a bunch of scripts for ImageMagick, and of course OpenCV will help a lot with this as well.
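To illustrate what converting to black and white means, here is a rough sketch in plain C++ rather than ImageMagick or OpenCV. It uses Otsu's method to pick a global threshold on an 8-bit grayscale buffer. The function names otsuThreshold and binarize are made up for this example, and loading the PNG into the buffer is left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Otsu's method: choose the grey level that maximizes the between-class
// variance of the two groups "pixels <= t" and "pixels > t".
int otsuThreshold(const std::vector<std::uint8_t>& gray)
{
    std::array<std::size_t, 256> hist{};
    for (std::uint8_t v : gray) ++hist[v];

    const double total = static_cast<double>(gray.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += i * static_cast<double>(hist[i]);

    double sumDark = 0.0, weightDark = 0.0, bestVar = -1.0;
    int best = 0;
    for (int t = 0; t < 256; ++t) {
        weightDark += static_cast<double>(hist[t]);
        if (weightDark == 0.0) continue;          // no dark pixels yet
        const double weightLight = total - weightDark;
        if (weightLight == 0.0) break;            // every pixel is already "dark"
        sumDark += t * static_cast<double>(hist[t]);
        const double meanDark = sumDark / weightDark;
        const double meanLight = (sumAll - sumDark) / weightLight;
        const double diff = meanDark - meanLight;
        const double var = weightDark * weightLight * diff * diff;
        if (var > bestVar) { bestVar = var; best = t; }
    }
    return best;
}

// Maps every pixel to pure black (0) or pure white (255).
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& gray)
{
    const int threshold = otsuThreshold(gray);
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i)
        out[i] = static_cast<std::uint8_t>(gray[i] > threshold ? 255 : 0);
    return out;
}
```

In practice OpenCV's cv::threshold with the cv::THRESH_OTSU flag does the same job. As far as I know, Tesseract also binarizes the image internally, so a global threshold like this mainly shows the idea. On a fuzzy scan, noise removal or an adaptive threshold is probably where the real improvement comes from.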

The next point could be the PSM (page segmentation mode), which by default should already satisfy your need to extract the information from the whole page.

Also, the result of the online OCR is not 100% accurate, as you claimed:

There is "1 S Days" instead of "15 Days"
There is "Mail: finance(a)" instead of "E Mail: finance@"
There is "TiA THE GREEN HOL1 5" instead of "T/A THE GREEN HOU 5"

etc.

Which Tesseract version are you using? I highly recommend using 3.05 (4.0 shows much better results, but it is not officially released yet).

Also the following link could help you with your results: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

P.S. I hope you are allowed to share such financial documents publicly ;)