I am using Tesseract 4.0 to recognize English words, but it fails on this one image: no words are recognized at all.

Can anyone give me a tip? Thanks.

    import pytesseract

    r = pytesseract.image_to_string('6.jpg', lang='eng')
    print(r)

Fail image

Update:

I tried an online OCR website,

https://www.newocr.com/

and it works. Why?

How can I get Tesseract to recognize the text?

1 Answer

The problem is that pytesseract is not rotation-invariant, so you need to do some additional pre-processing (source).

  • My first idea is to rotate the image by a small angle:

        img = imutils.rotate_bound(cv2.imread("YD90o.png"), 4)

    Result: [image: rotated result]

  • Next, I apply an adaptive threshold to the rotated image and store the result in thr:

    [image: thresholded result]
  • To read the thresholded image with pytesseract, you need to pass an additional configuration option:

        pytesseract.image_to_string(thr, lang="eng", config="--psm 6")

    PSM (page segmentation mode) 6 means "Assume a single uniform block of text" (source).

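  • You want the last sentence in the image, so split the output into lines and take the last one:

        txt = pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
        # Drop the form feed that Tesseract appends, then split into lines.
        txt = txt.replace('\f', '').split('\n')
        # The output normally ends with a newline, so the last element is empty.
        print(txt[-2])

  • Result:

        Continue Setub ie Gene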
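
The steps above can be put together in one short script. This is only a sketch: the answer does not show how thr was produced, so the adaptive-threshold method and its parameters (block_size=21, c=9) are assumptions that will likely need tuning, and the function names preprocess, last_nonempty_line and read_last_line are made up for illustration. Unlike the snippet above, the helper skips blank lines rather than relying on the position of the trailing newline.

```python
def preprocess(path, angle=4, block_size=21, c=9):
    """Rotate, grayscale, and adaptively threshold the image at `path`.

    block_size and c are assumed values, not taken from the answer.
    """
    # Imported here so the text helper below works without OpenCV installed.
    import cv2
    import imutils

    img = cv2.imread(path)
    if img is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(path)

    img = imutils.rotate_bound(img, angle)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # block_size must be odd; c is subtracted from the local weighted mean.
    return cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY,
        block_size, c,
    )


def last_nonempty_line(text):
    """Return the last non-blank line of OCR output, or '' if there is none."""
    lines = [ln.strip() for ln in text.replace("\f", "").splitlines()]
    lines = [ln for ln in lines if ln]
    return lines[-1] if lines else ""


def read_last_line(path):
    """Run the whole pipeline and return the last recognized line."""
    import pytesseract

    thr = preprocess(path)
    text = pytesseract.image_to_string(thr, lang="eng", config="--psm 6")
    return last_nonempty_line(text)
```

Calling read_last_line("YD90o.png") runs the same rotate, threshold and OCR sequence as the answer. The recognized text depends on the Tesseract version and on the threshold parameters, so it may differ from the result shown above.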
      

The website might use a deep-learning method to detect the words in the image. However, when I use newocr.com, the result is:

oy Eee a
setuP me -
continve ae