I need the mighty help of Stack Overflow. I'm working on an app that has to analyze documents with OCR (I'm using Tesseract) and extract all the text it can get out of them. Here is an example of the kind of image:

Image including text to extract

Here is the preprocessing I do to get rid of all the lines. In the future I will probably also have to analyze each "rectangle" separately (feeding Tesseract a zone bounded by the detected lines), so I guess there are simpler methods than this one, but with those I wouldn't have the line coordinates.

package formRecog;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Point;
import org.opencv.core.Scalar;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;
import static org.opencv.core.Core.bitwise_not;
import org.opencv.core.MatOfPoint;


public class testMat {

    public static void main(String[] args) {

        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        Mat source  = Imgcodecs.imread("./image.png",Imgcodecs.CV_LOAD_IMAGE_ANYCOLOR);
        Mat destination  = new Mat(source.rows(), source.cols(), source.type());
        Imgproc.cvtColor(source, destination, Imgproc.COLOR_RGB2GRAY);  
        Imgcodecs.imwrite("gray.jpg", destination);

        Imgproc.GaussianBlur(destination, destination, new Size(3, 3), 0, 0, Core.BORDER_DEFAULT);  

        Imgproc.Canny(destination, destination, 30, 90);
        Imgcodecs.imwrite("postcanny.jpg", destination);

        Mat houghlines = new Mat(); 
        Imgproc.HoughLinesP(destination, houghlines, 1, Math.PI / 180,  250, 185,5);

        // Draw the lines
        Mat result = new Mat(source.rows(), source.cols(), source.type());
        for (int i = 0; i < houghlines.rows(); i++) {
            double[] val = houghlines.get(i, 0);
            Imgproc.line(destination, new Point(val[0], val[1]), new Point(val[2], val[3]), new Scalar(0, 0, 255), 5);
            Imgproc.line(result, new Point(val[0], val[1]), new Point(val[2], val[3]), new Scalar(0, 0, 255),5);
        }

        Imgcodecs.imwrite("lines.jpg", result);

        Mat contourImg = new Mat(source.rows(), source.cols(), source.type());
        List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
        Mat hierarchy = new Mat();
        //Point offset = new Point();

        Imgproc.findContours(destination, contours, hierarchy, Imgproc.RETR_LIST, Imgproc.CHAIN_APPROX_NONE );
        Imgproc.drawContours(contourImg, contours, -1, new Scalar(255, 0, 0),-1);

        Imgcodecs.imwrite("contour.jpg", contourImg);

        bitwise_not(destination,destination);


        Imgcodecs.imwrite("final.jpg", destination);

    }
}

Here is the final image:

Final image after processing

The problem is that Tesseract doesn't read anything from this:

11m ËEZË@ÜDS@ 7 C@mpû@ 515 îf@5@??ûäû ©©m@@@ @@ vësw??a? PF©@MÜGS @"@X@Ü©ÜÎÊQÜ©IÏÙ 1111 175515

is the first "line" I get.

I think this is because the letters aren't "filled" anymore, so Tesseract cannot read them. Tesseract gave me pretty good results previously, but my earlier method for deleting the lines wasn't good. I'd like to fill the letters with black, but

Imgproc.drawContours(contourImg, contours, -1, new Scalar(255, 0, 0),-1);

doesn't do anything, although I'm pretty sure findContours worked fine, because if I imwrite its result I get the very same image as before.

I searched for similar problems, like "cv2.drawContours will not draw filled contour" and "Contour shows dots rather than a curve when retrieving it from the list, but shows the curve otherwise", but didn't find anything I could use (maybe I didn't understand them).

Just so you know, I only started programming courses in September, so I'm pretty new to this (forgive me if there are some gruesome things written here), but I don't have a choice about the subject I'm working on :)

I hope I made myself clear enough and that my English isn't too bad.

My thanks.

EDIT: Thanks to Rick M., it's getting better: using CHAIN_APPROX_SIMPLE in findContours and iterating via ldx in drawContours did the trick. New final image

Is there a way to improve this result? I'm guessing Tesseract won't be able to digest this either? Thanks.

Uploading the post-Canny image: Image after Canny

Have you tried to draw the contours using contourIdx instead of -1? - Rick M.
Do you mean iterating over contourIdx to draw each contour separately? I just tried for (int ldx = 0; ldx < contours.size(); ++ldx) Imgproc.drawContours(contourImg, contours, ldx, new Scalar(255, 0, 0),-1); with no luck, but maybe I didn't get what you mean. - DSt
Yes, I meant this. What does contours.size() say? - Rick M.
System.out.println(contours.size()); prints: 5369 - DSt
And contourImg.type()? - Rick M.

1 Answer

The reason drawContours wasn't working as required is the CHAIN_APPROX_NONE flag, which stores absolutely all the contour points. Using CHAIN_APPROX_SIMPLE instead, which compresses horizontal, vertical, and diagonal segments and leaves only their end points, gives you finished contours. In that case you could also use Imgproc.drawContours(contourImg, contours, -1, new Scalar(255, 0, 0),-1); without the loop, and it should work fine.
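To make the difference between the two flags concrete, here is a minimal plain-Java sketch of the kind of compression CHAIN_APPROX_SIMPLE is described as doing: it drops every point that lies in the middle of a straight horizontal, vertical, or diagonal run and keeps only the points where the direction changes. This is not OpenCV's implementation; the class name `ContourCompressionSketch` and the helpers `compress` and `rectangleBoundary` are made up for illustration, and contours are assumed to be closed chains of 8-connected pixels stored as `{x, y}` arrays.

```java
import java.util.ArrayList;
import java.util.List;

class ContourCompressionSketch {

    // Keeps only the points where the step direction changes.
    // The contour is treated as a closed loop of neighbouring pixels.
    static List<int[]> compress(List<int[]> contour) {
        int n = contour.size();
        List<int[]> kept = new ArrayList<>();
        if (n < 3) {
            kept.addAll(contour);
            return kept;
        }
        for (int i = 0; i < n; i++) {
            int[] prev = contour.get((i + n - 1) % n);
            int[] cur = contour.get(i);
            int[] next = contour.get((i + 1) % n);
            int dxIn = cur[0] - prev[0], dyIn = cur[1] - prev[1];
            int dxOut = next[0] - cur[0], dyOut = next[1] - cur[1];
            // Same incoming and outgoing step: the point is inside a straight run.
            if (dxIn != dxOut || dyIn != dyOut) {
                kept.add(cur);
            }
        }
        return kept;
    }

    // Every boundary pixel of a w x h rectangle, walked clockwise from (0, 0).
    // This plays the role of a contour stored with all of its points.
    static List<int[]> rectangleBoundary(int w, int h) {
        List<int[]> pts = new ArrayList<>();
        for (int x = 0; x < w - 1; x++) pts.add(new int[] {x, 0});         // top edge
        for (int y = 0; y < h - 1; y++) pts.add(new int[] {w - 1, y});     // right edge
        for (int x = w - 1; x > 0; x--) pts.add(new int[] {x, h - 1});     // bottom edge
        for (int y = h - 1; y > 0; y--) pts.add(new int[] {0, y});         // left edge
        return pts;
    }

    public static void main(String[] args) {
        List<int[]> full = rectangleBoundary(5, 4);
        List<int[]> simple = compress(full);
        System.out.println("all points: " + full.size());
        System.out.println("end points only: " + simple.size());
    }
}
```

For the rectangle, only the four corners survive, which is the sort of reduction you would expect to see in contours.size() per-contour point counts after switching flags.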

As for the discussion in the comments: the Canny image looks nice, but if you zoom in you can see that the letters which findContours fails to detect have outlines that are not completely connected. I would suggest applying erosion with a small kernel (you will have to play with the parameters) to get better results.
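As a rough illustration of why erosion can help here, the following plain-Java sketch applies a 3x3 erosion (each pixel becomes the minimum of its neighbourhood) to a tiny image with a dark stroke on a white background, which is the polarity of final.jpg after bitwise_not. The one-pixel break in the stroke is closed and the stroke gets thicker. The class name `ErosionSketch` and its helpers are invented for this example. In OpenCV the corresponding step would presumably be Imgproc.erode with a kernel from Imgproc.getStructuringElement, and the kernel size is something to tune on the real image. Note that on the raw Canny output (white edges on black) erosion would thin the edges instead, so the polarity matters.

```java
class ErosionSketch {

    // 3x3 erosion on a grayscale image stored as int[row][col] with values 0..255.
    // Each output pixel is the minimum over its 3x3 neighbourhood;
    // neighbours that fall outside the image are simply skipped.
    static int[][] erode3x3(int[][] img) {
        int h = img.length;
        int w = img[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int min = 255;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int yy = y + dy;
                        int xx = x + dx;
                        if (yy >= 0 && yy < h && xx >= 0 && xx < w) {
                            min = Math.min(min, img[yy][xx]);
                        }
                    }
                }
                out[y][x] = min;
            }
        }
        return out;
    }

    // White 5x7 image with a dark horizontal stroke on row 2 (x = 1..5)
    // that has a one-pixel break at x = 3.
    static int[][] strokeWithGap() {
        int[][] img = new int[5][7];
        for (int[] row : img) {
            java.util.Arrays.fill(row, 255);
        }
        for (int x = 1; x <= 5; x++) {
            if (x != 3) {
                img[2][x] = 0;
            }
        }
        return img;
    }

    public static void main(String[] args) {
        int[][] before = strokeWithGap();
        int[][] after = erode3x3(before);
        System.out.println("gap pixel before: " + before[2][3]);
        System.out.println("gap pixel after:  " + after[2][3]);
    }
}
```

The trade-off is that a kernel large enough to bridge the gaps also thickens every stroke, so nearby letters can start to merge; that is why the kernel should stay small.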