I have code that extracts text from scanned or regular PDF files using Tesseract OCR. However, I want it to process a whole folder of PDFs rather than a single PDF file, and then store the extracted text files in a folder of my choice.

See my code below:

from pdf2image import convert_from_path
from PIL import Image
import pytesseract

filePath = '/Users/CodingStark/scanned/scanned-file.pdf'
# Render each PDF page as an image at 500 DPI
pages = convert_from_path(filePath, 500)

image_counter = 1

# Save every page as a numbered JPEG
for page in pages:
    filename = "page_" + str(image_counter) + ".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1

filelimit = image_counter - 1

# Text file that collects the OCR output
outfile = "scanned-file.txt"
f = open(outfile, "a")

# Iterate from 1 to the total number of pages
for i in range(1, filelimit + 1):
    filename = "page_" + str(i) + ".jpg"
    # Recognize the text in the image as a string using pytesseract
    text = str(pytesseract.image_to_string(Image.open(filename)))
    # Rejoin words that were hyphenated across a line break
    text = text.replace('-\n', '')
    f.write(text)

# Close the file after writing all the text
f.close()

I want to automate my code so that it converts all the PDF files in the scanned folder and writes the extracted text files to a folder of my choice. Also, is there a way to delete all the JPG files once the code has finished, since they take up a lot of space? Thank you so much!

You need to write a shell script in bash or similar to do this, or a program in Python or Go. I used Go to do this with Tesseract OCR in a project. JPGs don't take up 'memory space'; they consume storage space. You can remove them when the task finishes. - gorlok
@gorlok Thanks, I will give it a try! - CodingStark

1 Answer

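Here is a loop that reads every PDF file from a directory:

import glob
import os

pdf_dir = "dir"  # folder that contains the PDF files

# glob matching is case-sensitive on Linux and macOS, so match the extension's real case
for pdf_file in glob.glob(os.path.join(pdf_dir, "*.pdf")):
    # put here what you want to do with each PDF file
    print(pdf_file)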
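
To tie the directory loop to the OCR code from the question, here is a minimal sketch of the whole job. It assumes pdf2image (with poppler) and pytesseract (with the tesseract binary) are installed. The function names clean_text, ocr_pdf and ocr_folder, and the optional render and ocr parameters, are made up for this example and are not part of either library. Since pytesseract.image_to_string accepts the PIL images that convert_from_path returns, the pages are passed straight to OCR and no JPG files are written, so there is nothing to delete afterwards.

```python
import glob
import os


def clean_text(text):
    """Rejoin words that were hyphenated across a line break."""
    return text.replace("-\n", "")


def ocr_pdf(pdf_path, dpi=500, render=None, ocr=None):
    """Return the OCR text of one PDF, keeping the page images in memory only."""
    if render is None:
        # Third-party import, done lazily; requires poppler to be installed
        from pdf2image import convert_from_path

        def render(path):
            return convert_from_path(path, dpi)
    if ocr is None:
        # Third-party import, done lazily; requires the tesseract binary
        import pytesseract
        ocr = pytesseract.image_to_string
    return "".join(clean_text(ocr(page)) for page in render(pdf_path))


def ocr_folder(pdf_dir, out_dir, dpi=500, render=None, ocr=None):
    """OCR every *.pdf in pdf_dir and write one .txt per PDF into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for pdf_path in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
        # scanned-file.pdf -> <out_dir>/scanned-file.txt
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        txt_path = os.path.join(out_dir, stem + ".txt")
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(ocr_pdf(pdf_path, dpi, render, ocr))
        written.append(txt_path)
    return written
```

You would call it as ocr_folder('/Users/CodingStark/scanned', '/path/to/text-output'), where the second path is a placeholder for whatever output folder you want. Rendering at 500 DPI keeps every page of a PDF in RAM at once, so a lower dpi may be needed for very long documents. If you keep the original approach of saving JPGs, calling os.remove(filename) after each page has been OCRed will delete them.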