19
votes

I am building an OCR project and I am using a .Net wrapper for Tesseract. The samples that the wrapper have don't show how to deal with a PDF as input. Using a PDF as input how do I produce a searchable PDF using c#?

  • I have use ghostscript library to change Pdf to image then feed Tesseract with it and it's working great getting the text but i doesn't save the original shape of Pdf i only get text

how can i get text from Pdf with saving the shape of original Pdf

enter image description here

this is a page from pdf i don't want only text i want the text to be in the shapes like the original pdf and sorry for poor English

4
You'd need a library to turn a PDF into an Image. And then use that same library to create the searchable PDF.juharr
which library is the best for this job and could you provide me with a sample to how to do this .. and i want to save the shape of the original pdf and add under it the text layer @juharracrab
Removed unnecessary information, linked the outside link in-line and fixed grammar. This question requires 'what you've tried' (in terms of actual code) or it risks being downvoted into oblivion or closed.Nathaniel Ford

4 Answers

13
votes

Just for documentation reasons, here is an example of OCR using tesseract and pdf2image to extract text from an image pdf.

import pdf2image
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract


def pdf_to_img(pdf_file):
    return pdf2image.convert_from_path(pdf_file)


def ocr_core(file):
    text = pytesseract.image_to_string(file)
    return text


def print_pages(pdf_file):
    images = pdf_to_img(pdf_file)
    for pg, img in enumerate(images):
        print(ocr_core(img))


print_pages('sample.pdf')
6
votes

Tesseract supports the creation of sandwich since version 3.0. But 3.02 or 3.03 are recommended for this feature. Pdfsandwich is a script which does more or less what you want.

There is the online service www.sandwichpdf.com which does use tesseract for creating searchable PDFs. You might want to run a few tests before you start implementing your solution with tesseract. The results are ok, but there are commercial products which deliver better results. Disclosure: I am the creator of www.sandwichpdf.com.

1
votes

Use pdf2png.com, then upload your pdf, then it will make all png files of each page as <pdf_name>-<page_number>.png in .zip file,

Then, you can write simple python code as

#/usr/bin/python3
#coding:utf-8
import os
pdf_name = 'pdf_name'
language = 'language of tesseract'
for x in range(int('number of pdf_pages')):
    cmd = f'tesseract {pdf_mame}-{x}.png {x} -l {language}'
    os.system(cmd)

And then, read all the files such as from 1.txt to all the way up, and append to a single, file, it is as easy as that.

1
votes

There is a handy tool OCRmyPDF that will add a text layer to a scanned PDF making it searchable - which essentially automates the steps mentioned in previous answers.