We print PDF books generated by an HTML-to-PDF application.
There is a header and footer on each page, and we place content exactly, using production and translation restrictions (and layout variations) for different languages to ensure that the fixed content for each page fits.
So, for example, although our content is dynamic, a paragraph is expected to take approximately the same amount of space at the same place in the book. We sometimes change style and layout attributes for translations, but the same rules about similar sizes apply.
Each page has its header and footer, and the entire book is rendered as one long HTML page, using CSS page breaks to force each header onto a new page. To make that work, we control the fixed content height per page on the server side.
This works well, and we are very happy with the advantages that HTML affords us in presentation (designers rather than programmers can design pages, etc.). We are also heavily invested in this technology and in too deep to change direction now: we are using HTML-to-PDF and need to make it work as well as possible. That is not to say we could not mix technologies, but...
The problem is this: we now have some variable-sized content that we have no prior control over. To us it is text, so we control its formatting but not its quantity. We also have headings, which are different sizes.
We need a way to calculate page breaks, leaving as little white space as possible, and I would love to know how anyone else is dealing with this. I know this will not be an exact science, but I still need the best approach possible.
We have total control over the rendering/layout engine; it is always IE8-compatible, so different browsers need not be considered.
These are my thoughts; I would love to hear yours:
- Our current method: assign a number of lines per page (variable by font and font size to allow for different locales). Each block of content is costed as n lines, and this figure is used to calculate page breaks.
Pro: Simple.
Con: Inaccurate, since none of our fonts are monospaced, and it needs configuring for every locale.
- Render each contiguous run of free-flow content into a web page, in a div of the exact page width (fixed div), and let it flow to whatever vertical height it requires. Using an HTML-to-BMP solution, capture an image and use the height of the rendered image (edge-detected and cropped if required) to calculate the required number of pages.
Pro: Could be accurate, and not too expensive if free-flow content is kept contiguous.
Con: Incomplete solution. Once I know the required number of pages, how do I know where to break the HTML? Measuring each page using this method and edge detecting would be very expensive.
- On a font-by-font basis, knowing in advance the font sizes, padding and margins of text and headings, calculate width, line breaks and height character by character, using width data extracted from the font file.
Pro: Once all the data has been extracted, and margins added for differences in HTML rendering, this could likely be fairly accurate.
Con: Highly intricate and sensitive to style sheet changes.
- Could we use a WebBrowserControl to somehow measure the content?
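To make the first option in the list concrete, here is a minimal sketch of the line-cost idea, written in Python purely for illustration (our real feed is PHP). The names `Block`, `chars_per_line`, `extra_lines`, `line_cost` and `paginate` are hypothetical, and the average-characters-per-line figure is exactly the per-locale tuning value that makes this method inaccurate.

```python
import math
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    chars_per_line: int   # assumed average characters per line for this style/locale
    extra_lines: int = 0  # assumed margin/padding cost, expressed in whole lines


def line_cost(block: Block) -> int:
    """Estimate how many lines a block occupies (at least one line of text)."""
    text_lines = max(1, math.ceil(len(block.text) / block.chars_per_line))
    return text_lines + block.extra_lines


def paginate(blocks, lines_per_page):
    """Greedily fill pages, starting a new page when the next block does not fit."""
    pages, current, used = [], [], 0
    for block in blocks:
        cost = line_cost(block)
        # Break before the block if it would overflow a page that already has content.
        if current and used + cost > lines_per_page:
            pages.append(current)
            current, used = [], 0
        current.append(block)
        used += cost
    if current:
        pages.append(current)
    return pages
```

Note that a block costing more than `lines_per_page` ends up alone on a page and still overflows; splitting inside a block is not handled in this sketch.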
Love to hear your thoughts and suggestions.
EDIT....
Our PDF converter is Winnovative, which runs within a .NET Windows service; our HTML feed, however, is generated in PHP.
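For the font-metrics option in the list above, this is roughly what I have in mind, again as a Python sketch rather than real code from our system. It assumes a hypothetical `advances` table of per-character advance widths in 1/1000 em (the unit used by AFM-style metrics), extracted from the font beforehand, and it ignores kerning, hyphenation and any differences in how the actual renderer wraps text.

```python
def text_width(text, advances, font_size, default_advance=500):
    """Width in points; advances holds per-character widths in 1/1000 em."""
    units = sum(advances.get(ch, default_advance) for ch in text)
    return units * font_size / 1000.0


def count_lines(text, advances, font_size, max_width):
    """Greedy word wrap: count the lines needed to fit text into max_width points."""
    space = text_width(" ", advances, font_size)
    lines, used = 0, 0.0
    for word in text.split():
        width = text_width(word, advances, font_size)
        if lines == 0:
            lines, used = 1, width          # first word opens the first line
        elif used + space + width <= max_width:
            used += space + width           # word fits on the current line
        else:
            lines, used = lines + 1, width  # wrap to a new line
    return lines


def block_height(text, advances, font_size, max_width, line_height, margin=0.0):
    """Estimated height in points: lines * font size * line-height factor + margin."""
    lines = count_lines(text, advances, font_size, max_width)
    return lines * font_size * line_height + margin
```

A word wider than `max_width` simply sits alone on its line here. The estimated heights could then be summed per block and compared against the available content height of a page to decide where to break.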