2
votes

I want to turn scanned images into black-and-white images. The goal is to reduce the file size before the images are transferred over the internet for OCR.

The usual binarised (black-and-white) images created by scanners and general image-editing software are not good enough.

Lots of random black pixels are left behind, which are really just noise from the binarisation. This causes the OCR to try to recognise characters where there are none, or to insert full stops, colons, etc. after characters.

What can I use in OpenCV to binarise an image, keep lines, characters and dark areas solid, and reduce pixel noise in white areas?

I've toyed with cvThreshold and cvAdaptiveThreshold but results have not been great yet.

As an example, check out this original image and the desired result.

Your example appears to be trinary; I see at least one shade of gray in addition to the black and white. - Mark Ransom
@MarkRansom When I went back and looked at the images in IrfanView, I thought you were right and I must have saved the B&W image wrong. However, when looking at the images in GIMP, the pixels are just B&W. What are you using to view the image? In my case I trust GIMP over IrfanView. - Michael
I was looking at it in Chrome. Today in Firefox it looks OK, don't know what happened. - Mark Ransom

2 Answers

2
votes

You can try this; however, you will still need to adjust some parameters.

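// Written against the OpenCV 2.x API (IplImage, CV_* constants).
#include <opencv2/opencv.hpp>
using namespace cv;

#define ALPHA_SCALE 2
#define THRESHOLD_VAL 40
#define MAX_VAL_FOR_THRESHOLD 250
#define PIXEL_MISMATCH_COUNT 10 //9, 7

int main()
{
    Mat current_frame_t2;

    // Load the scan and display it.
    IplImage *img = cvLoadImage("Original.tiff", CV_LOAD_IMAGE_UNCHANGED);
    cvNamedWindow("My_Win", CV_WINDOW_AUTOSIZE);
    cvShowImage("My_Win", img);
    cvWaitKey(10);

    // Convert to grayscale. This assumes a 3-channel image; OpenCV loads colour images as BGR.
    Mat current_frame_t1(img);
    cvtColor(current_frame_t1, current_frame_t2, CV_BGR2GRAY);
    current_frame_t1.release();
    imshow("My_Win", current_frame_t2);
    cvWaitKey(10);

    // Equalise the histogram, then scale the intensities by ALPHA_SCALE.
    equalizeHist(current_frame_t2, current_frame_t1);
    current_frame_t2.release();
    convertScaleAbs(current_frame_t1, current_frame_t2, ALPHA_SCALE);

    // Binarise, then median-filter to remove isolated pixels.
    threshold(current_frame_t2, current_frame_t1, THRESHOLD_VAL, MAX_VAL_FOR_THRESHOLD, CV_THRESH_BINARY);
    // The kernel size must be odd and greater than 1; a size of 1 leaves the image unchanged.
    medianBlur(current_frame_t1, current_frame_t2, 3);
    imshow("My_Win", current_frame_t2);
    imwrite("outimg.tiff", current_frame_t2);
    cvWaitKey(0);

    cvReleaseImage(&img);
    return 0;
}
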
1
votes

You can use a connected-components labeling algorithm and delete the components that don't cover a reasonable number of pixels in the image.

One very simple way of implementing it in OpenCV is using contours:

1. Do the preliminary binarization for the OCR; that will give you a very noisy output.
2. Find all contours in that noisy image.
3. For each contour found:
  3.1. Fill the contour with a color different from the two used in the binarized image.
  3.2. Count the number of pixels filled with that color.
  3.3. If the number of pixels is smaller than a given threshold, fill the contour with the void (background) color of the binary image.

For reference: cv::findContours and cv::drawContours.

It's possible to optimize the loop by classifying more than one contour in 3.1 and doing the pixel count in a single pass for all those colors in 3.2. I didn't answer with the optimized version because you may have more than 254 different groups (256 gray levels minus the 2 colors already used by the binary image), and it's not so straightforward to take that into account.