I currently have a strong interest in image processing and optical character recognition. After some basic recognition work and a few filters, I decided to try something more difficult.
I'm trying to read the value out of these captchas: http://img851.imageshack.us/img851/9579/57859946.png
I have written some filters for pre-processing:
- Replace Color (to white): remove the blue lines and remove the two lines that go through the text
- Threshold image (255)
Which outputs an image like this: http://img232.imageshack.us/img232/2325/00i3q45j1zt.png
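As a minimal sketch of the pre-processing steps above (in Python with plain nested lists for brevity, not the actual C# code): the `is_bluish` test, its `margin` value, and the reading of "Threshold (255)" as "every pixel that is not pure white becomes black" are assumptions made for illustration only.

```python
WHITE = (255, 255, 255)

def is_bluish(pixel, margin=40):
    """Hypothetical test for the blue lines: blue clearly dominates red and green."""
    r, g, b = pixel
    return b > r + margin and b > g + margin

def replace_color(image, predicate, new_color=WHITE):
    """Return a copy of the image where every pixel matching predicate is repainted."""
    return [[new_color if predicate(p) else p for p in row] for row in image]

def threshold(image, cutoff=255):
    """Binarize: gray level below cutoff becomes 0 (ink), everything else 255 (paper)."""
    return [[0 if sum(p) // 3 < cutoff else 255 for p in row] for row in image]

def preprocess(image):
    """Replace Color (blue to white), then Threshold."""
    return threshold(replace_color(image, is_bluish))
```

The two lines through the text would go through the same `replace_color` step with a predicate for their color. Every pixel where such a line overlaps a letter is also painted white, which is where the holes come from.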
As you can see, there are holes in some letters. I first thought it might be better to leave the lines through the letters in place, but that made it worse. I'm using the Tesseract OCR engine, and I trained it on the Elephant font (the font the captcha uses). I also tried other OCR engines such as GOCR, but that made everything worse. With Tesseract I now get a recognition rate of 20%. I'm coding in C# (.NET 4.0).
The captcha is generated by a software package named PHPCaptcha.
Now my question is: is there any algorithm or trick to fill in the holes in the letters? And is there any other way to get better recognition?
I'm excited to hear from you guys :)
Greetings,