Extract Text from Multipage Attachment PDF Using Google Apps Script

Question

I have a PDF Gmail attachment with multiple scanned pages. When I use Google Apps Script to save the attachment's blob to a Drive file, open the PDF manually from Google Drive, and then select Open With Google Docs, all of the text from the PDF is displayed as a Google Doc. However, when I save the blob as a Google Doc with OCR, only the text from the image on the first page ends up in the Doc, whether I open it manually or read it from code.

The code to get the blob and process it is:

function getAttachments(desiredLabel, processedLabel, emailQuery){
    // Find emails
    var threads = GmailApp.search(emailQuery);
    if(threads.length > 0){
        // Iterate through the emails
        for(var i in threads){
            var mesgs = threads[i].getMessages();
            for(var j in mesgs){
                var processingMesg = mesgs[j];
                var attachments = processingMesg.getAttachments();
                var processedAttachments = 0;
                // Iterate through attachments
                for(var k in attachments){
                    var attachment = attachments[k];
                    var attachmentName = attachment.getName();
                    var attachmentType = attachment.getContentType();
                    // Process PDFs
                    if (attachmentType.includes('pdf')) {
                        processedAttachments += 1;
                        var pdfBlob = attachment.copyBlob();
                        var filename = attachmentName + " " + processedAttachments;
                        processPDF(pdfBlob, filename);
                    }
                }
            }
        }
    }
}


function processPDF(pdfBlob, filename){
  // Saves the blob as a PDF.
  // All pages are displayed if I click on it from Google Drive after running this script.
  let pdfFile = DriveApp.createFile(pdfBlob);
  pdfFile.setName(filename);
  // Saves the blob as an OCRed Doc.
  let resources = {
    title: filename,
    mimeType: "application/pdf"
  };
  let options = {
    ocr: true,
    ocrLanguage: "en"
  };
  let file = Drive.Files.insert(resources, pdfBlob, options);
  let fileID = file.getId();
  // Open the file to get the text.
  // Only the text of the image on the first page is available in the Doc.
  let doc = DocumentApp.openById(fileID);
  let docText = doc.getBody().getText();
}

If I try to open the PDF directly with DocumentApp, without OCR, I get "Exception: Invalid argument". For example:

DocumentApp.openById(pdfFile.getId());

How do I get the text from all of the pages of the PDF?

ziganotschka · Accepted Answer · 2020-07-03T09:05:32

DocumentApp.openById can only be used for Google Docs documents, which is why passing it the ID of a PDF throws Exception: Invalid argument.
pdfFile can only be "opened" with DriveApp: DriveApp.getFileById(pdfFile.getId());
Opening a file with DriveApp gives you a File object, so you can call the File class methods on it (for example getBlob() or getName()).
As for the OCR conversion, your code works correctly for me and converts all pages of a PDF document to a Google Doc, so the source of your error is likely the attachment itself or the way you retrieve the blob.
Keep in mind that OCR conversion is not good at preserving formatting, so a two-page PDF might be collapsed into a one-page Doc, depending on the formatting of the PDF.
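
Since the attachment or the blob retrieval is the likely culprit, it may help to inspect the blob before handing it to Drive.Files.insert. The following is only a rough sketch: the helper names hasPdfHeader and inspectPdfBlob are made up for illustration, and the check only confirms that the data looks like a PDF (content type, size, and the "%PDF-" signature). It does not count pages or validate the file.

```javascript
// Sketch only: hasPdfHeader and inspectPdfBlob are hypothetical helper names,
// not part of any Apps Script service.

/**
 * Returns true if the byte array begins with the PDF signature "%PDF-".
 * blob.getBytes() in Apps Script returns signed bytes, so each value is
 * masked with 0xff before it is compared.
 */
function hasPdfHeader(bytes) {
  var signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (!bytes || bytes.length < signature.length) {
    return false;
  }
  for (var i = 0; i < signature.length; i++) {
    if ((bytes[i] & 0xff) !== signature[i]) {
      return false;
    }
  }
  return true;
}

/**
 * Summarizes a blob so the result can be logged before OCR conversion.
 * Expects an object with getName(), getContentType() and getBytes(),
 * which Apps Script blobs and Gmail attachments provide.
 */
function inspectPdfBlob(blob) {
  var bytes = blob.getBytes();
  return {
    name: blob.getName(),
    contentType: blob.getContentType(),
    sizeInBytes: bytes.length,
    hasPdfHeader: hasPdfHeader(bytes)
  };
}

// Possible use at the top of processPDF (assumption, adjust as needed):
//   Logger.log(JSON.stringify(inspectPdfBlob(pdfBlob)));
```

If the logged size is much smaller than the file you see in Gmail, or the content type or header is not what you expect, the problem is probably in how the attachment is retrieved rather than in the OCR step.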
