
I currently have a strong interest in image processing and optical character recognition. After some basic recognition work and a few filters, I decided to start on something more difficult.

I'm trying to read the value out of these captchas: http://img851.imageshack.us/img851/9579/57859946.png

I have written some filters for pre-processing:

- Replace Color (to white): remove the blue lines and the two lines that go through the text
- Threshold image (255)

Which outputs an image like this: http://img232.imageshack.us/img232/2325/00i3q45j1zt.png

As you can see, there are holes in some letters. I first thought it might be better to leave the lines through the letters, but that made it worse. I'm using the Tesseract OCR engine and I trained it using the Elephant font (the font the captcha uses). I also tried other OCR engines like GOCR, but they made everything worse. With Tesseract I now have a recognition rate of 20%. I'm coding in C# (.NET 4.0).

The captcha is generated by a software package named PHPCaptcha.

Now my question is: is there any algorithm or trick to fill the holes in the letters? And is there any other way to get better recognition?

I'm excited to hear from you guys :)

Greetings,


1 Answer


Part 0 - Preface


i) Beforehand, you may want to read my OCR-related answer here, which may give you some tricks for using Tesseract.

ii) I assume you can simply convert everything to black and white (in your case, processing in color doesn't give you an edge).
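A minimal sketch of that black-and-white conversion, written in Python purely for illustration (the loop ports directly to C#). The function name `to_binary`, the threshold of 128 and the luma weights are my own assumptions, so tune them against your captchas. It expects the image as rows of (r, g, b) tuples, after the colored lines have already been replaced with white.

```python
def to_binary(pixels, threshold=128):
    """Convert rows of (r, g, b) tuples into rows of 0/1, where 1 means ink (dark)."""
    binary = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            # Rec. 601 luma weights; integer arithmetic avoids float rounding.
            luma = (299 * r + 587 * g + 114 * b) // 1000
            # Anything darker than the (assumed) threshold counts as ink.
            out_row.append(1 if luma < threshold else 0)
        binary.append(out_row)
    return binary
```

Working on a 0/1 bitmap like this also makes the later steps simpler, since every filter only has to ask "ink or background?".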


Part 1 - Preprocessing


To fill the holes left after you've removed the blue lines, you can dilate the image or apply a 'dilate-then-erode' operation (known as morphological closing). Here, dilation means growing every foreground pixel in all 8 directions, so each pixel becomes a bigger blob. Once you've dilated the pixels, check whether the characters are recognized or whether they are 'over-filled' (dilated too much). If the characters cannot be recognized or are dilated too much, apply an erosion operation to shrink them back. Of course there are more advanced synthesis algorithms, but I think you are better off starting with a simple image processing operation.
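Here is a rough sketch of dilation, erosion and dilate-then-erode on a 0/1 bitmap (1 = ink), in Python for illustration only; the same loops translate directly to C#. The function names and the fixed 3x3 neighbourhood are my assumptions, and for wider holes you would probably repeat the dilation more than once.

```python
def _neighbourhood(img, y, x):
    """Yield the 3x3 neighbourhood of (y, x), clipped to the image bounds."""
    h, w = len(img), len(img[0])
    for ny in range(max(0, y - 1), min(h, y + 2)):
        for nx in range(max(0, x - 1), min(w, x + 2)):
            yield img[ny][nx]

def dilate(img):
    """Set a pixel if it or any of its 8 neighbours is set."""
    return [[int(any(_neighbourhood(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]

def erode(img):
    """Keep a pixel only if it and all of its in-bounds neighbours are set."""
    return [[int(all(_neighbourhood(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]

def close_holes(img):
    """Dilate-then-erode: fills small holes while roughly keeping the outline."""
    return erode(dilate(img))
```

For example, a solid 5x5 block with a single missing pixel in the middle comes back from `close_holes` with the hole filled and the outline unchanged, as long as there is a margin of at least two background pixels around it. Keep some white border around the captcha for that reason.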


Part 2 - OCR/Tesseract


If you feed the whole image into Tesseract, it performs line analysis and other layout steps. Since characters in a captcha don't behave like normal text, doing line analysis or recognizing them as a group may degrade the recognition rate. So my suggestion is to recognize them character by character first.
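One simple way to get single characters is to cut the binarized image wherever a whole column is empty. The sketch below is in Python for illustration and assumes a 0/1 bitmap (1 = ink) in which the characters do not touch or overlap horizontally; if they do, you would need something smarter, such as connected-component labelling. The function names are hypothetical.

```python
def split_columns(img):
    """Return (start, end) column ranges, end exclusive, for each run of inked columns."""
    width = len(img[0])
    has_ink = [any(row[x] for row in img) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                    # a character begins here
        elif not ink and start is not None:
            spans.append((start, x))     # an empty column ends the character
            start = None
    if start is not None:
        spans.append((start, width))     # character runs to the right edge
    return spans

def crop_chars(img):
    """Cut the bitmap into one sub-bitmap per detected character."""
    return [[row[a:b] for row in img] for a, b in split_columns(img)]
```

Each crop can then be passed to the OCR engine on its own. If your Tesseract version supports page segmentation modes, mode 10 ("treat the image as a single character") is the one meant for this kind of input.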