
I am working on OCR in Python with pytesseract. I want to read the text in an image, extract it, and store it in a .txt or .csv file. I want to process multiple images and, before storing each image's text, check whether that text already exists in the file.
Here is my code, which runs without any errors. The last lines are my attempt at the check, but it doesn't seem to work. Can anyone help me with this? Thanks in advance.

import cv2
import pytesseract,csv,re,os
from PIL import Image
from ast import literal_eval

img = pytesseract.image_to_string(Image.open("test1.png"), lang="eng")
print(img)

with open('C:\\Users\\Hasan\\Videos\\Captures\\saved.csv', "w") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(img)

string = open('C:\\Users\\Hasan\\Videos\\Captures\\saved.csv').read()
new_str = re.sub('[^a-zA-z0-9\n\.]', ' ', string)
open('C:\\Users\\Hasan\\Videos\\Captures\\saved.csv', "w").write(new_str)

# f = open("saved.csv", "r")
# read = f.readline()
# print("\n" + f.read())

with open('C:\\Users\\Hasan\\Videos\\Captures\\saved.csv') as sv:
    for line in sv:
        if img in line:
            print("Data already exists")
        else:
            print("file saved successfully")
This happens because the text in new_str is written to the file with a new row starting at every newline character. When you iterate over the file in the last step, line holds only one line of the text, whereas img contains the entire extracted text. - Satheesh K
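
The mismatch described in this comment can be reproduced without any OCR. This is only a small sketch: the value of text is a made-up stand-in for the pytesseract result. It also shows a second thing that is easy to miss: csv.writer.writerow() treats a bare string as a sequence of fields, one per character.

```python
import csv
import io

# Made-up stand-in for the multi-line string returned by the OCR step.
text = "Noisy image\nto test\n"

# A multi-line string is never a substring of a single line taken from it,
# so a per-line "text in line" check fails for every line.
lines = text.splitlines()
matches = [text in line for line in lines]
print(matches)

# writerow() iterates over its argument, so a bare string becomes
# one CSV field per character.
buf = io.StringIO()
csv.writer(buf).writerow("abc")
print(repr(buf.getvalue()))
```

Wrapping the string in a list, as in writerow([text]), writes it as a single field instead.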

1 Answer


Remove the '\n' characters when writing to the CSV file, and strip them from img as well before comparing.

import csv
import re

import pytesseract
from PIL import Image

img_path = "example_01.png"
out_csv_path = "saved.csv"

# Extract the text from the image.
img = pytesseract.image_to_string(Image.open(img_path), lang="eng")
print(img)

# Write the text as a single CSV field. It is wrapped in a list because
# writerow() would split a bare string into one field per character.
with open(out_csv_path, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow([img])

# Rewrite the file, keeping only letters, digits, dots and spaces.
# This also removes the newlines, so the text ends up on one line.
with open(out_csv_path) as infile:
    string = infile.read()
new_str = re.sub(r'[^a-zA-Z0-9. ]', '', string)
with open(out_csv_path, "w") as outfile:
    outfile.write(new_str)

# Clean img the same way, then compare it with each line of the file.
img = re.sub(r'[^a-zA-Z0-9. ]', '', img)
with open(out_csv_path, newline='') as sv:
    for line in sv:
        print("Line text is: {}\nExtracted Text is: {}".format(line, img))
        if img in line:
            print("Data already exists")
        else:
            print("file saved successfully")

Sample output:

Noisyimage
to test
Tesseract OCR
Line text is: Noisyimageto testTesseract OCR
Extracted Text is: Noisyimageto testTesseract OCR
Data already exists
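
Note that the code above writes the text to the file before checking it, so the check always finds a match. If the goal is to skip images whose text is already stored, one option is to check first and append only when the text is new. Below is a minimal sketch of that idea, assuming a plain text file with one entry per line. normalize and save_if_new are hypothetical helper names, not part of pytesseract, and the text argument stands for whatever pytesseract.image_to_string() returned.

```python
import re
from pathlib import Path


def normalize(text):
    """Keep only letters, digits, dots and spaces, so the text fits on one line."""
    return re.sub(r"[^a-zA-Z0-9. ]", "", text)


def save_if_new(text, store_path):
    """Append the normalized text as one line unless that line is already stored.

    Returns True if the text was written, False if it already existed.
    """
    entry = normalize(text)
    path = Path(store_path)
    # Load the stored entries (hypothetical format: one entry per line).
    existing = set(path.read_text().splitlines()) if path.exists() else set()
    if entry in existing:
        return False
    # Append mode keeps the entries from earlier images.
    with path.open("a") as f:
        f.write(entry + "\n")
    return True
```

For several images, you would call save_if_new(img, "saved.txt") once per image, right after extracting img. This sketch compares whole lines rather than using a substring test, so a short text is not reported as a duplicate just because it appears inside a longer stored entry.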