Inaccurate Tesseract OCR output from a PNG image in Qt C++

Question

I am using the Tesseract OCR C++ library in Qt to extract text from a PNG image with this code:

const char* lang = "eng";
QString filename = "D:/image.png";

tesseract::TessBaseAPI tess;
tess.Init(NULL, lang, tesseract::OEM_DEFAULT);
tess.SetPageSegMode(tesseract::PSM_AUTO);

FILE* fin = fopen(filename.toStdString().c_str(), "rb");
if (fin == NULL)
{
    std::cout << "Cannot open " << filename.toStdString().c_str() << std::endl;
    return;
}
fclose(fin);

STRING text;
if (tess.ProcessPages(filename.toStdString().c_str(), NULL, 0, &text))
{
    // Show the result in the Qt plain-text widget.
    ui->plainTextEdit->setPlainText(QString::fromUtf8(text.string()));
}

But the output is not accurate enough for the data in the table, and it contains strange characters. When I run the same image through an online OCR website, the text comes out 100% accurate. What is causing the wrong text: is it a problem with the library, or with my code? Or is there a better free library I could use to get more accurate results?

I got the image from a PDF, using Ghostscript to render it at good quality, so the OCR library should be able to give me the correct data.

OCR accuracy does not depend on Qt; it depends on the class that does the recognition, so I consider the Qt tag irrelevant. — eyllanesc
Are you doing any preprocessing before processing your pages? On their forums, several users mention that you should try black-and-white images (black text on a white background). Your text has a lot of fuzz around it, so you should try to preprocess that out. The online OCR most likely has something in place that automatically edits the images and removes it. — AresCaelum
Yes, I use Ghostscript to get the image from the PDF file with these options: -dFirstPage=1 -dLastPage=1 -dBATCH -dNOPAUSE -sDEVICE=pnggray -r300 -dUseCropBox — user7179690
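
The black-and-white preprocessing suggested in the comments above could look roughly like the sketch below. It is only an illustration under assumptions: the comment does not name a method, so Otsu's global threshold is used here as one common choice, and the function names `otsuThreshold` and `binarize` are hypothetical helpers, not part of Tesseract or Qt. The input is assumed to be a flat buffer of 8-bit grayscale pixels, such as the pixel data of a pnggray image after decoding.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pick a global threshold with Otsu's method.
// Pixels with a value greater than the result are treated as background.
int otsuThreshold(const std::vector<std::uint8_t>& gray)
{
    // Histogram of the 256 possible gray levels.
    std::array<std::size_t, 256> hist{};
    for (std::uint8_t v : gray) ++hist[v];

    const double total = static_cast<double>(gray.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += i * static_cast<double>(hist[i]);

    double sumLow = 0.0, weightLow = 0.0, bestVar = -1.0;
    int best = 0;
    for (int t = 0; t < 256; ++t) {
        weightLow += static_cast<double>(hist[t]);
        sumLow += t * static_cast<double>(hist[t]);
        if (weightLow == 0.0) continue;           // no dark pixels yet
        const double weightHigh = total - weightLow;
        if (weightHigh == 0.0) break;             // no bright pixels left
        const double diff = sumLow / weightLow - (sumAll - sumLow) / weightHigh;
        // Between-class variance (up to a constant factor).
        const double between = weightLow * weightHigh * diff * diff;
        if (between > bestVar) { bestVar = between; best = t; }
    }
    return best;
}

// Hypothetical helper: map every pixel to pure black (0) or pure white (255).
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& gray)
{
    const int t = otsuThreshold(gray);
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i)
        out[i] = gray[i] > t ? 255 : 0;
    return out;
}
```

The binarized buffer would then have to be written back to an image, or handed to the OCR engine in memory, before recognition. Whether this actually improves the result for a given scan is something to verify by experiment.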

nes nes · Accepted Answer · 2017-07-28T12:14:13

I am not experienced with C++, but I think your problem most likely comes from the line below:

tess.Init(NULL, lang, tesseract::OEM_DEFAULT);

The first argument must point to the tessdata folder. Instead of NULL, you can pass the folder name, for example "C:/tessdata/". Again, I am not experienced with C++, so you will have to decide whether to use a slash "/" or a backslash "\". This folder should contain the language file(s).
