0
votes

I have 100 scanned PDF files and I need to convert them into text files.

I first converted them into PNG files (see the script below). Now I need help converting these 100 PNG files into 100 text files.

library(pdftools)
library("tesseract")

#location
dest <- "P:\\TEST\\images to text"

#making loop for all files
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

#Convert files to png
sapply(myfiles, function(x)
  pdf_convert(x, format = "png", pages = NULL, 
              filenames = NULL, dpi = 600, opw = "", upw = "", verbose = TRUE))

#read files
cat(text)

I expect to have a text file for each png file:

From: file1.png, file2.png, file3.png...

To: file1.txt, file2.txt, file3.txt...

But the actual result is one text file containing all png files text.

1
There are several issues with your code. Your list.files pattern doesn’t list PDF files but all files with the string 'pdf' anywhere in their names. The comment above that line of code is completely wrong; it doesn’t explain at all what the line is doing. You also didn’t show us the crucial bit, namely how you are actually trying to OCR the files, and what fails. Instead, the code you’ve shown is, effectively, irrelevant to your question.
Konrad Rudolph
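To illustrate the point about the pattern: list.files treats `pattern` as a regular expression, so "pdf" matches anywhere in the file name. The file names below are made up for the demonstration, and grepl stands in for the matching that list.files does internally.

```r
# Hypothetical file names, only for illustration
files <- c("report.pdf", "pdf_notes.txt", "scan.pdf.png", "SCAN2.PDF")

# Unanchored pattern: matches every name that contains "pdf"
loose <- files[grepl("pdf", files)]

# Anchored pattern: matches only names that end in ".pdf" (any case)
strict <- files[grepl("\\.pdf$", files, ignore.case = TRUE)]

print(loose)
print(strict)
```

With list.files the equivalent call would be something like list.files(path = dest, pattern = "\\.pdf$", ignore.case = TRUE, full.names = TRUE).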

1 Answer

2
votes

I guess you left out the PNG-to-text step, but I assume you used library(tesseract).

You could do the following in your code:

library(pdftools)
library(tesseract)
eng <- tesseract("eng")
sapply(myfiles, function(x) {
  ## anchor the pattern so only the trailing ".pdf" extension is replaced
  png_file <- sub("\\.pdf$", ".png", x)
  txt_file <- sub("\\.pdf$", ".txt", x)
  ## note: pages = 1 converts only the first page of each PDF
  pdf_convert(x, format = "png", pages = 1, 
              filenames = png_file, dpi = 600, verbose = TRUE)

  text <- ocr(png_file, engine = eng)
  cat(text, file = txt_file)
  ## return the text string for convenience;
  ## we are mainly interested in the side effects anyway
  text
})
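If the PNG files already exist, as described in the question, the OCR step can also be run on its own. The following is only a sketch under a few assumptions: the helper names png_to_txt_path and ocr_png_dir are made up for this example, the images are assumed to sit in a single folder, and each text file is written next to its image. If pdf_convert produced one image per page, this yields one text file per page rather than one per PDF.

```r
# Hypothetical helper: derive the .txt path that sits next to a .png file
png_to_txt_path <- function(png_file) {
  sub("\\.png$", ".txt", png_file, ignore.case = TRUE)
}

# Hypothetical helper: OCR every PNG in `dir`, writing one text file per image.
# The default engine is only created if there is at least one image to process.
ocr_png_dir <- function(dir, engine = tesseract::tesseract("eng")) {
  png_files <- list.files(dir, pattern = "\\.png$",
                          full.names = TRUE, ignore.case = TRUE)
  for (png_file in png_files) {
    text <- tesseract::ocr(png_file, engine = engine)
    writeLines(text, png_to_txt_path(png_file))
  }
  # return the paths of the text files, invisibly
  invisible(png_to_txt_path(png_files))
}
```

Calling ocr_png_dir("P:\\TEST\\images to text") would then process the folder from the question. Writing inside the loop, with a file name derived from each image, is what keeps the output separate per file instead of ending up in a single text file.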