
From a PDF file, I am successfully generating one PNG image for each page in the PDF.

The problem is that no matter what settings I use, Ghostscript messes up the font spacing on some pages, so that in some PNGs a single word looks like two or three words.

This is a problem because I use these files in Evernote, and the broken spacing ruins the search results. A search for "Providers" returns nothing because in the PNG file it appears as "Pro vid e rs" (and "Users" appears as "Use rs").

Dropbox link to a screenshot showing the original text of the source pdf on the left and generated png on the right --> http://dl.dropbox.com/u/13267240/ScreenClip.png

I am new to Ghostscript and am at a loss as to why this is happening.

Here is the command line I am using (in Python):

cmd = "gswin%sc " % (SYS_PROCESSOR_ARCH) + "-q -dNOPAUSE -dBATCH -dPDFFitPage=true -sDEVICE=png16m -r%s " % (PNG_RES) + "-sOutputFile=" + '"%s\%s-pg-%%d.%s" "%s"' % (outputdir, outputFileNamePrefix, suffix, pdfSourceFile)

Or, as evaluated at runtime:

gswin64c -q -dNOPAUSE -dBATCH -dPDFFitPage=true -sDEVICE=png16m -r300X300 -sOutputFile="C:\EPTK-TMP\02-01-Introduction-pg-%d.png" "C:\EPTK-TMP\02-01-Introduction.pdf"

How are you searching the text in the PNGs in Evernote? Is there some sort of optical character recognition happening? Is the aim just to have the PDF text in Evernote? - Brian L
Yes, Evernote does great OCR on images, to the point of producing search results equal to those from the original PDF. Where it stands apart is that, unlike a PDF search, which only searches the PDF's text, I can reliably search for characters appearing in any image embedded in the original PDF document (now part of the PNG). - user1956808

1 Answer


The font in your PDF sample is a sans-serif font (without the little ornamental strokes at the ends of lines), while the font in your PNG sample is a serif font (with those ornamental strokes).

So Ghostscript is, for some reason, not using the correct font during the PDF-to-PNG conversion. There are several possible reasons:

1) The fonts might not be embedded in the PDF file, so Ghostscript has to look for them elsewhere.

2) The fonts might also not be available on your system, so Ghostscript substitutes a default font. This changes how the letters look and probably also their widths, which produces the spacing issues you are seeing.

So the question is whether you are generating the PDF yourself. If so, you may need to change how it is generated so that the fonts are embedded and Ghostscript can use them. If not, you could try to find out which fonts the PDF files use and make sure they are available to Ghostscript on your system.
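One way to check which fonts a PDF uses, and whether they are embedded, is the pdffonts tool from Poppler/Xpdf. This is an assumption on my part: it is a separate install, not part of Ghostscript. The sketch below parses its table, assuming the usual column layout ending in "emb sub uni object ID", and reports fonts whose "emb" column is "no". The function names are hypothetical.

```python
import subprocess


def find_unembedded_fonts(pdffonts_output):
    """Return the names of fonts whose 'emb' column is 'no'.

    Assumes the usual pdffonts table: a header line, a line of dashes,
    then one row per font ending in: emb sub uni <object number> <generation>.
    """
    missing = []
    for line in pdffonts_output.splitlines()[2:]:  # skip header and dashes
        tokens = line.split()
        if len(tokens) < 8:
            continue  # not a complete font row
        # Count from the right, because the 'type' column may contain
        # spaces (for example "Type 1" or "CID TrueType").
        emb = tokens[-5]
        if emb == "no":
            missing.append(tokens[0])
    return missing


def list_unembedded_fonts(pdf_path):
    """Run pdffonts on pdf_path (assumes pdffonts is on the PATH)."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return find_unembedded_fonts(result.stdout)
```

Calling list_unembedded_fonts on the source PDF should give the font names that Ghostscript would need to find on the system instead.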

I'm not that familiar with Ghostscript, but perhaps the fonts are already on your system and Ghostscript is simply not finding them. In that case, check whether there is a command-line argument to point it to the correct font folder(s) on your system.
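For what it's worth, the Ghostscript documentation does describe a -sFONTPATH switch that takes a list of directories to scan for fonts. A minimal sketch of adding it to the command from the question, built as an argument list rather than one string, could look like this. The function name and the example font directory are illustrative assumptions, and whether this fixes the substitution depends on the right fonts actually being in that directory.

```python
import os
import subprocess


def build_gs_command(gs_exe, pdf_path, output_pattern, resolution=300, font_dirs=()):
    """Build a Ghostscript argument list for PDF-to-PNG conversion.

    font_dirs is an optional sequence of directories passed via -sFONTPATH,
    joined with the platform's path-list separator (';' on Windows).
    """
    cmd = [
        gs_exe,
        "-q",
        "-dNOPAUSE",
        "-dBATCH",
        "-dPDFFitPage=true",
        "-sDEVICE=png16m",
        "-r%d" % resolution,
    ]
    if font_dirs:
        cmd.append("-sFONTPATH=" + os.pathsep.join(font_dirs))
    # With an argument list no shell is involved, so a single %d is enough
    # for Ghostscript's per-page numbering.
    cmd.append("-sOutputFile=" + output_pattern)
    cmd.append(pdf_path)
    return cmd


def run_gs(cmd):
    """Run the command and raise CalledProcessError if Ghostscript fails."""
    subprocess.run(cmd, check=True)


# Example (paths are placeholders, not taken from a real system):
# run_gs(build_gs_command("gswin64c", r"C:\in.pdf", r"C:\out-pg-%d.png",
#                         font_dirs=[r"C:\Windows\Fonts"]))
```

Passing a list to subprocess.run also avoids the manual quoting of paths that the string version needs.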