Pattern Extraction Layer for Documents in OpenCV

Question

First we generate the binary image of the given image by thresholding it at 80% of its intensity and inverting the resulting image. In the binary image white pixels represent the characters, graphics, lines, etc. The first step in pattern extraction is to locate rectangular regions called ‘rect’. A rect is a rectangular region of loosely connected white pixels that encloses a certain logical part of the document. We considered simple 8-neighborhood connectivity and performed connected component (contour) analysis of the binary image leading to the segmentation of the textual components. For the next part of the algorithm we use the minimum bounding rectangle of contours. These rectangles were then sorted in top-to-bottom and left-to-right order, using 2D point information of the leftmost-topmost corner. Smaller connected patterns were discarded based on the assumption that they may have originated due to noise dependent on the image acquisition system and do not in any way contribute to the final layout. Also punctuation marks were neglected using a smaller size criterion, e.g. comma, full-stop etc. At this level we also segregate the fonts based on the height of the bounding rect using avgh (average height) as threshold. Two thresholds are used to classify fonts into three categories - small, normal and large.

equation http://a1.sphotos.ak.fbcdn.net/hphotos-ak-snc7/401374_144585198985889_100003032296249_180106_343820769_n.jpg

Can you help me translate this theory into OpenCV source code, or give me a related link for it? I'm currently working on document image analysis for my thesis.

Abid Rahman K · Accepted Answer · 2012-04-22T18:14:20

I know this is a late reply, but I think future visitors can get help from it.

Below is the answer as I understood it from the passage above (all code is in OpenCV-Python v2.4-beta):

I take this as the input image. It is a simple image, for the sake of understanding.

input image

First we generate the binary image of the given image by thresholding it at 80% of its intensity and inverting the resulting image.

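import cv2
import numpy as np

img = cv2.imread('doc4.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Threshold at 80% of the maximum intensity and invert, so text becomes white on black.
ret, thresh = cv2.threshold(gray, 0.8*gray.max(), 255, cv2.THRESH_BINARY_INV)
# .copy() because findContours modifies its input in OpenCV 2.4;
# [-2:] keeps the unpacking valid whether 2 or 3 values are returned.
contours, hier = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]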

Thresholded image :

threshold image
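To make the thresholding rule concrete, here is a minimal pure-Python sketch of what an inverted binary threshold at 80% of the peak intensity computes. The helper name `threshold_binary_inv` and the tiny 3x3 "page" are made up for illustration; in practice `cv2.threshold` does this for you.

```python
def threshold_binary_inv(gray, fraction=0.8, maxval=255):
    """Inverted binary threshold at `fraction` of the brightest pixel."""
    peak = max(max(row) for row in gray)
    t = fraction * peak
    # Pixels brighter than t (paper background) become 0;
    # everything else (ink) becomes maxval.
    return [[0 if p > t else maxval for p in row] for row in gray]


# Hypothetical grayscale patch: bright paper, one dark ink pixel,
# and one light-gray smudge that stays above the threshold.
page = [
    [250, 250, 250],
    [250,  30, 250],
    [250, 210, 250],
]
binary = threshold_binary_inv(page)
```

Only the dark pixel in the middle ends up white, which is why the characters show up as white blobs in the thresholded image.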

We considered simple 8-neighborhood connectivity and performed connected component (contour) analysis of the binary image leading to the segmentation of the textual components.

This is simply contour finding in OpenCV, which here does the job of connected-component labelling. It selects all white blobs (components) in the image.

contours, hier = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]

Contours :

contours
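If you want to see what "8-neighborhood connectivity" means without OpenCV, the following is a rough pure-Python sketch of connected-component labelling by flood fill. It is not how `cv2.findContours` works internally; the function name `label_components` and the test grid are my own assumptions. It returns one bounding box per blob in the same (x, y, w, h) layout that `cv2.boundingRect` uses.

```python
from collections import deque


def label_components(binary):
    """Return a bounding box (x, y, w, h) per 8-connected group of non-zero pixels."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from this unvisited white pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                # Visit all 8 neighbours, diagonals included.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right - left + 1, bottom - top + 1))
    return boxes


# Two blobs: a diagonal pair (joined only under 8-connectivity)
# and a vertical pair further right.
grid = [
    [255,   0,   0,   0, 0],
    [  0, 255,   0,   0, 0],
    [  0,   0,   0, 255, 0],
    [  0,   0,   0, 255, 0],
]
boxes = label_components(grid)
```

With 4-connectivity the diagonal pair would split into two components; with 8-connectivity it counts as one, which is the behaviour the passage asks for.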

For next part of algorithm we use the minimum bounding rectangle of contours.

Now we find the bounding rectangle around each detected contour, then discard contours with small areas to get rid of commas etc. See the statement:

Smaller connected patterns were discarded based on the assumption that they may have originated due to noise dependent on the image acquisition system and do not in any way contribute to the final layout. Also punctuation marks were neglected using a smaller size criterion, e.g. comma, full-stop etc.

We also find the average height, avgh.

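height = 0
num = 0
letters = []
ht = []
# Colour copy of the binary image to draw on (thresh2 was not defined
# in the original snippet; this definition is an assumption).
thresh2 = cv2.cvtColor(thresh, cv2.COLOR_GRAY2BGR)

for (i,cnt) in enumerate(contours):
    (x,y,w,h) = cv2.boundingRect(cnt)
    if w*h<200:
        # Small blob (noise, comma, full stop): paint it black.
        cv2.drawContours(thresh2,[cnt],0,(0,0,0),-1)
    else:
        # Keep it: draw a green box and record the contour and its height.
        cv2.rectangle(thresh2,(x,y),(x+w,y+h),(0,255,0),1)
        height = height + h
        num = num + 1
        letters.append(cnt)
        ht.append(h)

avgh = float(height)/num  # float() avoids integer division on Python 2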

So after this, all commas etc. are removed, and green rectangles are drawn around the selected contours:

bounding rect

At this level we also segregate the fonts based on the height of the bounding rect using avgh (average height) as threshold. Two thresholds are used to classify fonts into three categories - small, normal and large (as per given equations in passage).

The average height, avgh, obtained here is 40. So a letter is small if its height is less than 26.66 (i.e. 40x2/3), normal if 26.66 < height < 60, and large if height > 60. But in the given image all heights fall in the range (28,58), so all letters are normal and you can't see the difference.

So I made a small modification to make it easy to visualize: small if height <= 30, normal if 30 < height <= 50, and large if height > 50.

# Fill each letter with a colour by height class (colours are BGR).
for (cnt,h) in zip(letters,ht):
    print(h)
    if h<=30:
        cv2.drawContours(thresh2,[cnt],0,(255,0,0),-1)   # small: blue
    elif 30 < h <= 50:
        cv2.drawContours(thresh2,[cnt],0,(0,255,0),-1)   # normal: green
    else:
        cv2.drawContours(thresh2,[cnt],0,(0,0,255),-1)   # large: red
cv2.imshow('img',thresh2)
cv2.waitKey(0)
cv2.destroyAllWindows()

Now you get the result with letters categorized as small, normal, and large:

result
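For completeness, here is a small sketch of the classification with thresholds derived from avgh instead of the hard-coded 30 and 50. The factors 2/3 and 1.5 are inferred from the numbers above (26.66 and 60 for avgh = 40); the function name `classify_heights` and the handling of heights exactly on a threshold are my assumptions, since the equation image does not settle that here.

```python
def classify_heights(heights, low=2.0 / 3.0, high=1.5):
    """Label each height 'small', 'normal' or 'large' relative to the average."""
    avgh = float(sum(heights)) / len(heights)
    labels = []
    for h in heights:
        if h < low * avgh:
            labels.append('small')
        elif h > high * avgh:
            labels.append('large')
        else:
            # Heights exactly on a threshold count as normal (assumption).
            labels.append('normal')
    return avgh, labels


# Hypothetical bounding-rect heights; in the code above you would pass `ht`.
avgh, labels = classify_heights([20, 40, 40, 40, 100])
```

Here the average is 48, so the cut-offs are 32 and 72: the 20-pixel letter is small and the 100-pixel one is large.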

These rectangles were then sorted top-to-bottom and left-to-right order, using 2D point information of leftmost-topmost corner.

I have omitted this part. It is just sorting all the bounding rects with respect to their leftmost-topmost corner.
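A minimal sketch of that sorting step, assuming each rect is an (x, y, w, h) tuple as returned by `cv2.boundingRect`. The name `sort_rects` and the sample rects are hypothetical, and the passage does not say exactly how ties are broken, so this simply orders by the top edge first and the left edge second.

```python
def sort_rects(rects):
    """Sort (x, y, w, h) rects top-to-bottom, then left-to-right, by top-left corner."""
    return sorted(rects, key=lambda r: (r[1], r[0]))


# Hypothetical rects on two text lines (y = 10 and y = 40), given out of order.
rects = [(50, 10, 8, 12), (5, 40, 8, 12), (5, 10, 8, 12), (30, 40, 8, 12)]
ordered = sort_rects(rects)
```

With the code above you could build the input as `[cv2.boundingRect(c) for c in letters]`. Note that letters on the same text line rarely share exactly the same top edge, so a real document would probably need the rects grouped into rows first (for example by rounding y to a multiple of the line height) before sorting within each row.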
