Requiring the mighty help of Stack Overflow. I'm currently working on an app that has to analyze documents via OCR (I'm using Tesseract) and extract all the text I can get out of them. Here is an example of the type of image:
Image including text to extract
Here is what I do in preprocessing to get rid of all the lines. In the future I will probably also have to analyze each "rectangle" separately (feeding a zone delimited by the given lines to Tesseract), so I guess there are simpler methods than this, but then I wouldn't have the line coordinates.
package formRecog;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Point;
import org.opencv.core.Scalar;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;
import static org.opencv.core.Core.bitwise_not;
import org.opencv.core.MatOfPoint;

public class testMat {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        // Load the image and convert it to grayscale
        Mat source = Imgcodecs.imread("./image.png", Imgcodecs.CV_LOAD_IMAGE_ANYCOLOR);
        Mat destination = new Mat(source.rows(), source.cols(), source.type());
        Imgproc.cvtColor(source, destination, Imgproc.COLOR_RGB2GRAY);
        Imgcodecs.imwrite("gray.jpg", destination);

        // Blur slightly, then detect edges
        Imgproc.GaussianBlur(destination, destination, new Size(3, 3), 0, 0, Core.BORDER_DEFAULT);
        Imgproc.Canny(destination, destination, 30, 90);
        Imgcodecs.imwrite("postcanny.jpg", destination);

        // Detect long line segments in the edge image
        Mat houghlines = new Mat();
        Imgproc.HoughLinesP(destination, houghlines, 1, Math.PI / 180, 250, 185, 5);

        // Draw the lines: on the single-channel edge image only the first
        // Scalar component (0) is used, so the segments are painted black there;
        // they are also drawn on a separate result image for inspection
        Mat result = new Mat(source.rows(), source.cols(), source.type());
        for (int i = 0; i < houghlines.rows(); i++) {
            double[] val = houghlines.get(i, 0);
            Imgproc.line(destination, new Point(val[0], val[1]), new Point(val[2], val[3]), new Scalar(0, 0, 255), 5);
            Imgproc.line(result, new Point(val[0], val[1]), new Point(val[2], val[3]), new Scalar(0, 0, 255), 5);
        }
        Imgcodecs.imwrite("lines.jpg", result);

        // Find the contours and try to draw them filled
        Mat contourImg = new Mat(source.rows(), source.cols(), source.type());
        List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
        Mat hierarchy = new Mat();
        //Point offset = new Point();
        Imgproc.findContours(destination, contours, hierarchy, Imgproc.RETR_LIST, Imgproc.CHAIN_APPROX_NONE);
        Imgproc.drawContours(contourImg, contours, -1, new Scalar(255, 0, 0), -1);
        Imgcodecs.imwrite("contour.jpg", contourImg);

        // Invert the edge image and save it as the final result
        bitwise_not(destination, destination);
        Imgcodecs.imwrite("final.jpg", destination);
    }
}
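On the idea of feeding each "rectangle" to Tesseract separately: the segments returned by HoughLinesP already carry endpoint coordinates, so zones could in principle be derived from them. Below is a minimal plain-Java sketch of that idea, not something I have running in the app. `ZoneFinder`, `Zone`, `merge`, `zones`, `tol` and `mergeDist` are hypothetical names, and the sketch assumes the form is a regular grid in which every line spans the whole table, which is probably not true for real documents.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper (not part of OpenCV): derives rectangular zones from line segments. */
final class ZoneFinder {

    /** Axis-aligned zone in pixel coordinates. */
    static final class Zone {
        final int x, y, width, height;

        Zone(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        @Override
        public String toString() {
            return "Zone[x=" + x + ", y=" + y + ", w=" + width + ", h=" + height + "]";
        }
    }

    /** Sorts the positions and drops any position within mergeDist of the last one kept. */
    static List<Double> merge(List<Double> positions, double mergeDist) {
        List<Double> sorted = new ArrayList<>(positions);
        Collections.sort(sorted);
        List<Double> merged = new ArrayList<>();
        for (double p : sorted) {
            if (merged.isEmpty() || p - merged.get(merged.size() - 1) > mergeDist) {
                merged.add(p);
            }
        }
        return merged;
    }

    /**
     * Each segment is {x1, y1, x2, y2}, the layout of one HoughLinesP row.
     * Segments that are neither horizontal nor vertical (within tol) are ignored.
     * ASSUMPTION: every kept line spans the whole grid.
     */
    static List<Zone> zones(List<double[]> segments, double tol, double mergeDist) {
        List<Double> xs = new ArrayList<>();
        List<Double> ys = new ArrayList<>();
        for (double[] s : segments) {
            if (Math.abs(s[1] - s[3]) <= tol) {
                ys.add((s[1] + s[3]) / 2);   // horizontal line: remember its y
            } else if (Math.abs(s[0] - s[2]) <= tol) {
                xs.add((s[0] + s[2]) / 2);   // vertical line: remember its x
            }
        }
        xs = merge(xs, mergeDist);
        ys = merge(ys, mergeDist);

        // One zone between each pair of neighbouring lines, row by row
        List<Zone> result = new ArrayList<>();
        for (int r = 0; r + 1 < ys.size(); r++) {
            for (int c = 0; c + 1 < xs.size(); c++) {
                int x = (int) Math.round(xs.get(c));
                int y = (int) Math.round(ys.get(r));
                int w = (int) Math.round(xs.get(c + 1)) - x;
                int h = (int) Math.round(ys.get(r + 1)) - y;
                result.add(new Zone(x, y, w, h));
            }
        }
        return result;
    }
}
```

Each resulting zone could then be turned into an OpenCV `Rect` and cropped with `Mat.submat` before being passed to Tesseract; that part is not shown here. The merging step is there because a thick printed line tends to be detected as several nearly identical segments.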
Here is the final image
The problem is that Tesseract doesn't read anything useful from this:
11m ËEZË@ÜDS@ 7 C@mpû@ 515 îf@5@??ûäû ©©m@@@ @@ vësw??a? PF©@MÜGS @"@X@Ü©ÜÎÊQÜ©IÏÙ 1111 175515
That is the first "line" I get.
I think this is because the letters aren't "filled" anymore, so Tesseract cannot read them. Tesseract actually gave me pretty good results before, but my previous line-removal method wasn't good. I'd like to fill the letters with black, but
Imgproc.drawContours(contourImg, contours, -1, new Scalar(255, 0, 0),-1);
doesn't do anything, although I'm pretty sure findContours worked fine, because if I imwrite its result I get the very same image as before.
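To make clear what I mean by "filling", here is a minimal plain-Java sketch on a small boolean grid. It is not the drawContours route and not what my app does: it swaps in a different technique, a flood fill of the background started from the image border, after which everything the fill could not reach is treated as letter. `OutlineFill`, `parse` and `fill` are hypothetical names, and the grid is assumed to come from the edge image (non-zero pixels as true).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not part of OpenCV): fills closed outlines in a binary grid. */
final class OutlineFill {

    /** Builds a grid from text rows of equal length; '#' marks an edge pixel. */
    static boolean[][] parse(String... rows) {
        boolean[][] grid = new boolean[rows.length][rows[0].length()];
        for (int y = 0; y < rows.length; y++) {
            for (int x = 0; x < rows[y].length(); x++) {
                grid[y][x] = rows[y].charAt(x) == '#';
            }
        }
        return grid;
    }

    /**
     * Returns a new grid in which edge pixels and every pixel enclosed by
     * edges are true. Background is found with a 4-connected flood fill
     * seeded from all non-edge border pixels.
     */
    static boolean[][] fill(boolean[][] edges) {
        int h = edges.length, w = edges[0].length;
        boolean[][] outside = new boolean[h][w];
        Deque<int[]> queue = new ArrayDeque<>();

        // Seed the fill with every border pixel that is not an edge
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                boolean border = y == 0 || x == 0 || y == h - 1 || x == w - 1;
                if (border && !edges[y][x]) {
                    outside[y][x] = true;
                    queue.add(new int[] {y, x});
                }
            }
        }

        // Spread through non-edge pixels in the four axis directions
        int[][] steps = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        while (!queue.isEmpty()) {
            int[] p = queue.poll();
            for (int[] s : steps) {
                int ny = p[0] + s[0], nx = p[1] + s[1];
                if (ny >= 0 && ny < h && nx >= 0 && nx < w
                        && !edges[ny][nx] && !outside[ny][nx]) {
                    outside[ny][nx] = true;
                    queue.add(new int[] {ny, nx});
                }
            }
        }

        // Whatever the background fill could not reach is edge or interior
        boolean[][] filled = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                filled[y][x] = !outside[y][x];
            }
        }
        return filled;
    }

    public static void main(String[] args) {
        boolean[][] filled = fill(parse(
                ".......",
                ".#####.",
                ".#...#.",
                ".#####.",
                "......."));
        for (boolean[] row : filled) {
            StringBuilder line = new StringBuilder();
            for (boolean on : row) {
                line.append(on ? '#' : '.');
            }
            System.out.println(line);
        }
    }
}
```

Two limits of this sketch: an outline with a gap is not filled at all, because the background leaks in through the gap, and the holes of letters such as "o" get filled too, since they are enclosed by the outer outline.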
I searched for similar problems, like "cv2.drawContours will not draw filled contour" and "Contour shows dots rather than a curve when retrieving it from the list, but shows the curve otherwise", but didn't find anything I could use (or maybe I didn't understand it).
Just so you know, I started programming courses in September, so I'm pretty new to this (forgive me if there are some gruesome things written here), but I don't have a choice about the subject I'm working on :)
I hope I made myself clear enough and my English isn't too bad.
My thanks.
EDIT: Thanks to Rick.M, it's getting better: using CHAIN_APPROX_SIMPLE in findContours and iterating via idx in drawContours did the trick. New final
Is there a way to improve this result? I'm guessing Tesseract won't be able to read this either? Thanks.
Uploading the post-Canny image: Image after Canny