Tesseract OCR: PDF as input

Question

I am building an OCR project using a .NET wrapper for Tesseract. The samples that ship with the wrapper don't show how to handle a PDF as input. Given a PDF as input, how do I produce a searchable PDF using C#?

I have used the Ghostscript library to convert the PDF to images and then fed them to Tesseract. That works well for extracting the text, but it doesn't preserve the layout of the original PDF; I only get plain text.

How can I get the text from the PDF while preserving the layout of the original?

This is a page from the PDF. I don't want only the text; I want the text positioned as it is in the original PDF. Sorry for my poor English.

You'd need a library to turn a PDF into an Image. And then use that same library to create the searchable PDF. — juharr
Which library is best for this job, and could you provide a sample showing how to do it? I want to keep the layout of the original PDF and add the text layer underneath it. @juharr — acrab
Removed unnecessary information, linked the outside link in-line and fixed grammar. This question requires 'what you've tried' (in terms of actual code) or it risks being downvoted into oblivion or closed. — Nathaniel Ford

Kostas Charitidis · Accepted Answer · 2019-10-11T10:52:47

For the record, here is a Python example that uses pytesseract and pdf2image to extract text from an image-only PDF. Note that it returns plain text only; it does not preserve the page layout.

import pdf2image
import pytesseract


def pdf_to_img(pdf_file):
    # Render each PDF page to a PIL image (pdf2image requires Poppler).
    return pdf2image.convert_from_path(pdf_file)


def ocr_core(image):
    # Run Tesseract on one page image and return the recognized text.
    return pytesseract.image_to_string(image)


def print_pages(pdf_file):
    # OCR every page in order and print its text.
    for img in pdf_to_img(pdf_file):
        print(ocr_core(img))


print_pages('sample.pdf')
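
Since the question asks for a searchable PDF rather than plain text, here is a minimal sketch of one way to get closer to that in Python. It relies on pytesseract's image_to_pdf_or_hocr, which returns PDF bytes containing the page image with an invisible text layer. The names save_searchable_pages, out_dir and ocr_to_pdf are hypothetical and only for illustration. The sketch writes one PDF per page; merging the pages into a single file would need a separate PDF library and is left out.

```python
from pathlib import Path


def save_searchable_pages(images, out_dir, ocr_to_pdf=None):
    """Write one searchable PDF per page image and return the file paths."""
    if ocr_to_pdf is None:
        # Imported lazily so the helper can be tried without Tesseract installed.
        import pytesseract

        def ocr_to_pdf(img):
            # PDF bytes: the page image plus an invisible, selectable text layer.
            return pytesseract.image_to_pdf_or_hocr(img, extension='pdf')

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    for pg, img in enumerate(images, start=1):
        path = out_dir / f'page_{pg:03d}.pdf'
        path.write_bytes(ocr_to_pdf(img))
        paths.append(path)
    return paths
```

Assuming the pdf_to_img helper from the example above, it could be called as save_searchable_pages(pdf_to_img('sample.pdf'), 'out'). The optional ocr_to_pdf argument is there only so a different OCR backend (or a stub) can be swapped in.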
