1
votes

I have a multi-page .tif file and I am trying to extract text from it using Tesseract OCR, but I am getting this error:

TypeError: Unsupported image object

Code

from PIL import Image
import pytesseract

img = Image.open('Group 1/1_CHE_MDC_1.tif')
text = pytesseract.image_to_string(img.seek(0))  # OCR on 1st Page
text = ' '.join(text.split())
print(text)

Any idea why this is happening?

2 Answers

2
votes

Image.seek does not return a value (it returns None), so you're essentially running:

pytesseract.image_to_string(None)

Instead do:

img.seek(0)
text = pytesseract.image_to_string(img)
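
If you want the text from every page rather than just the first, one possible approach is to walk the frames with Pillow's ImageSequence.Iterator. This is only a minimal sketch: the helper name ocr_all_pages and its ocr parameter are made up for illustration, and the OCR function is passed in so that the page loop itself does not depend on Tesseract.

```python
from PIL import Image, ImageSequence


def ocr_all_pages(path, ocr):
    """Apply `ocr` to each page of a multi-page image and return one string per page.

    `ocr` is any callable that takes a PIL image and returns text,
    for example pytesseract.image_to_string (hypothetical helper, not part of either library).
    """
    texts = []
    with Image.open(path) as img:
        # The iterator seeks to each frame in turn and yields the image positioned on it
        for page in ImageSequence.Iterator(img):
            # Copy the current frame so the OCR function gets a standalone image
            raw = ocr(page.copy())
            # Collapse runs of whitespace, as the question's code does
            texts.append(' '.join(raw.split()))
    return texts
```

With pytesseract installed, calling ocr_all_pages('Group 1/1_CHE_MDC_1.tif', pytesseract.image_to_string) should give a list with one entry per page.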
1
votes

I had the same question. I tried the code below and it worked for me:

import glob
import pytesseract

# Only needed if the Tesseract executable is not on your PATH
pytesseract.pytesseract.tesseract_cmd = "Set your Tesseract-OCR .exe file path"

b = ''
# Use '*.jpg' instead of '*.tif' for JPEG images
for i in glob.glob('Fullpath of your image directory/*.tif'):
    # image_to_string also accepts a file path instead of an Image object
    b = b + pytesseract.image_to_string(i)
print(b)

Happy learning!