
(Here is the noise-removed image from which I am trying to extract text.) I am trying to detect the text in an image (a JPG file) using Tesseract-OCR and OpenCV in Python. The text in the image is Turkish, so I am using the Turkish trained data (tur) that is in my Tesseract-OCR folder. I applied dilation and erosion to remove noise before running Tesseract.

The problem is that, even though some characters in particular areas are detected, recognition is mostly unsuccessful and fails on Turkish characters. Do you know any method, or do you have any suggestions, for getting better results? Here is my code:

import cv2
import pytesseract
from PIL import Image

# Point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

# Raw string so the backslashes in the Windows path are not read as escape sequences
img = cv2.imread(r'C:\Users\gulsa\Desktop\Tesseract-OCR\alm98_2.jpg')

# Run OCR on the image with the Turkish language data
tex = pytesseract.image_to_string(Image.open('alm98_2.jpg'), lang='tur')
print(tex)

Thank you in advance!

Have you tried the things listed in the "Improve quality" section of the Tesseract FAQ? (github.com/tesseract-ocr/tesseract/wiki/ImproveQuality) - Dmitrii Z.
It depends on the image. Is the text handwritten? Is there noise in the image? Is the lighting good? - zindarod
I have applied binarization, dilation and erosion to remove noise, but the result is the same. The text is not handwritten; it is printed and legible, in clear white type on a black background. - Gülşah Ayhan
Can you post your image after you have applied all your preprocessing (binarization, dilation, etc.)? Also, by detection do you mean that Tesseract doesn't recognize a character as Turkish (but recognizes it as some other character), or that it doesn't see anything at all? - Dmitrii Z.
I have attached my image. It can see the characters, but it does not recognize most of them correctly. The output is full of mistakes. - Gülşah Ayhan

1 Answer


Here's what I get after running Tesseract on your image:

HerTürdenErutikyıdeplç'nTıkla!Sımsıkainlemereoyo AnındaCebirıdenIde!Iziemeklçin18YaşındanBüyükoin'ak Zorunludur.HerkamgoridenyüzleroevideoHighDefTvde!High DefTv,abonelik"servistir.Pakelhaîlaliktümvergilerdahilolamk ayda64TLyebtaIedimedig'süreoeherz—ıyyenileneoekîir.Servis ücreti,aboneoldugınuzoperaîöfündüzenleyecegifaîuralar karaliylaveyaönödemelihatlardanTL/Krmikîaridüsülerekîahsil edilecektir.Ipîaliğn:|PTALya24329z-ıgörder.Iptaledilendönem içinücretiadasiyapiin'azXeteriibakiyenizyokayükleme

This isn't a terrible result. It isn't a great one either, but the errors have nothing to do with the Turkish letters. You can get much better results if you can detect and separate the letters that are currently too close to each other.

[Image: example text in a clearer font with more space between characters]

For example, for this image I get perfect results (notice the better font and the extra space between characters):

Her Türden Erotik Video Için Tıkla!Sımsicak Binlerce Videoyu

If you're getting a lot of noisy characters that are definitely not in the Turkish alphabet (like fl or î symbols), you can make a blacklist.
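A blacklist can be passed to Tesseract through its `tessedit_char_blacklist` config variable. The sketch below is only one way to wire that up from pytesseract: the helper names are made up for illustration, the characters to block are assumptions (here the fl ligature and î), and depending on the Tesseract version and engine mode the setting may be ignored.

```python
def build_blacklist_config(chars):
    """Build a Tesseract config string that blacklists the given characters.

    Duplicates are removed, and whitespace is dropped because a space would
    split the option into separate arguments.
    """
    unique = dict.fromkeys(ch for ch in chars if not ch.isspace())
    return "-c tessedit_char_blacklist=" + "".join(unique)


def ocr_with_blacklist(image_path, chars, lang="tur"):
    """Run OCR on image_path while asking Tesseract not to emit `chars`.

    Hypothetical helper; pytesseract and Pillow are imported here so that
    build_blacklist_config can be used on its own.
    """
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang=lang,
        config=build_blacklist_config(chars),
    )


if __name__ == "__main__":
    # Illustrative choice of characters: the fl ligature and î
    print(build_blacklist_config("îﬂ"))
```

Calling `ocr_with_blacklist('alm98_2.jpg', 'îﬂ')` would then run the same OCR as in the question with those characters excluded, if the installed Tesseract honours the setting.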

Another option is to iterate through the Tesseract output character by character and correct the characters for which you can come up with a heuristic.
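As a rough sketch of that kind of post-processing, the code below flags letters that are outside the Turkish alphabet and applies a small replacement table. The table is an assumption based on the output above, where î seems to stand in for t (for example "faîura" and "îahsil"); the function names are hypothetical, and a real table would need to be built from your own results.

```python
# The 29 letters of the Turkish alphabet, lower and upper case
TURKISH_LETTERS = set(
    "abcçdefgğhıijklmnoöprsştuüvyz"
    "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
)

# Assumed corrections, guessed from the sample output; adjust for your data
REPLACEMENTS = {"î": "t", "ﬂ": "fl"}


def suspicious_letters(text):
    """Return the sorted letters in text that are not in the Turkish alphabet."""
    return sorted({ch for ch in text if ch.isalpha() and ch not in TURKISH_LETTERS})


def correct_text(text, replacements=REPLACEMENTS):
    """Walk the text character by character and apply the replacement table."""
    return "".join(replacements.get(ch, ch) for ch in text)


if __name__ == "__main__":
    sample = "faîuralar"
    print(suspicious_letters(sample))
    print(correct_text(sample))
```

Letters such as q, w and x are also outside the Turkish alphabet but do occur in brand names and loanwords, so `suspicious_letters` is better used to find candidates to review than to delete characters automatically.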

Edit: To be honest, when I try to read the text in your image, I cannot separate the words in a sentence. Maybe it is specific to the font you're using, but it definitely looks too hard to read for both human and machine.

Edit 2: Added an example image with more space between characters.