0
votes

I am learning Tesseract OCR and reading an article that is itself based on another article. From the first article:

First step is Adaptive Thresholding, which converts the image into binary images. Next step is connected component analysis which is used to extract character outlines. This method is very useful because it does the OCR of image with white text and black background. Tesseract was probably first to provide this kind of processing. Then after, the outlines are converted into Blobs. Blobs are organized into text lines, and the lines and regions are analyzed for some fixed area or equivalent text size.

Could anyone explain what a blob is?


2 Answers

1
votes

From https://tesseract-ocr.repairfaq.org/tess_glossary.html:

Blob

Isolated, small region of the scanned image. It's delineated by the outline. Tesseract 'juggles' the blobs to see if they can be split further into something that improved the confidence of recognition. Sometimes, blobs are 'combined' if that gives a better result. See pithsync.cpp, for example.

1
votes

Generally, a blob (also called a connected component) is a connected (i.e. unbroken) region of pixels in a binary image. In other words, it's a solid element in a binary image. Blob finders are a key step in any system that aims to extract or measure data from digital images.
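
To make the idea concrete, here is a minimal sketch of a blob finder in plain Python. It is not Tesseract's code, only an illustration of connected-component labeling on an already-binarized image. The names `find_blobs`, `bounding_box`, and `IMAGE` are made up for this example, and 8-connectivity (diagonal neighbors count as touching) is an assumption; some systems use 4-connectivity instead.

```python
from collections import deque


def find_blobs(image, foreground=1):
    """Return a list of blobs; each blob is a list of (row, col) pixels.

    Uses breadth-first flood fill with 8-connectivity (an assumption
    made for this sketch, not something taken from Tesseract).
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    blobs = []

    for r in range(rows):
        for c in range(cols):
            # Skip background pixels and pixels already assigned to a blob.
            if image[r][c] != foreground or seen[r][c]:
                continue

            # Start a new blob and grow it over all touching foreground pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and not seen[ny][nx]
                                and image[ny][nx] == foreground):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            blobs.append(pixels)

    return blobs


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a blob, all inclusive."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


# A tiny binary image: 1 = ink, 0 = background.
# It contains a "C"-like shape on the left and a vertical bar on the right.
IMAGE = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

blobs = find_blobs(IMAGE)
boxes = [bounding_box(b) for b in blobs]
print(len(blobs), boxes)  # 2 [(1, 1, 3, 2), (1, 5, 3, 5)]
```

Each of the two shapes comes out as one blob because its pixels touch each other but not the other shape. An OCR engine would then work with such blobs (for example via their bounding boxes) when grouping them into text lines.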