extract PDF text by columns

Question

My question is:

How can I extract text from a PDF file that is laid out in columns so that the result is separated by those columns?

Background: I work on a project about text analysis (especially of scientific texts). These texts are sometimes published in multiple-column layouts, with each column given a separate page number. To order the extracted text by the page numbers used in the layout, it would be useful to extract the text by columns.

I use PDFBox and have tried / searched for several things:

I tried the getThreadBeads() method of the PDPage class -> result: a list of size 0
I tried grabbing the text with the getCharactersByArticle() method -> text not divided into columns
(I tried this with PDF files of published texts as well as with self-created .doc-based files; each has a multiple-column layout)

The thing is that PDFBox seems to divide the text by columns automatically: if I set setSortByPosition() of a PDFTextStripper to true, all characters of a page are output line by line without recognizing separate columns. But if I set setSortByPosition() to false, the stripper does this division.

To understand this I had a look at the PDFBox source code: the crucial method is writePage() of PDFTextStripper. This is evidently where spaces (which are not present in most PDFs) and line breaks are calculated. But I couldn't find how the stripper calculates the column breaks.

So the questions again:

How does PDFTextStripper calculate column breaks?
Are there methods in the PDFBox API to catch this / to extract the text by columns?
Is this possible with other PDF APIs?

thanks in advance

mkl · Accepted Answer · 2014-10-07T11:07:15

If I set setSortByPosition() of a PDFTextStripper to true, all characters of a page are output line by line without recognizing separate columns. But if I set setSortByPosition() to false, the stripper does this division.

[...] How does PDFTextStripper calculate column breaks?

It isn't.

By setting SortByPosition to false you tell PDFBox to not try to sort the text pieces from the page content stream but to instead accept them in the order they appear.

In your document the text pieces seem to be drawn in the reading order, i.e. column by column. This is not true for all documents, and to cope with other documents PDFBox offers the option of sorting the text pieces left-to-right, top-to-bottom.

Activating that option (setting SortByPosition to true) in your document returns the text without respect to the columns.
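To illustrate why sorting by position loses the columns, here is a minimal sketch in plain Java. It does not use PDFBox: the Fragment class, the coordinates, and the comparator are hypothetical stand-ins, and the sort is a simplified top-to-bottom, left-to-right ordering rather than the comparator PDFBox actually applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortOrderDemo {
    /** Hypothetical text fragment: a string plus the position of its top-left corner. */
    public static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Fragments in the order a two-column page might draw them: left column first, then right. */
    public static List<Fragment> contentStreamOrder() {
        return Arrays.asList(
                new Fragment("L1", 50, 100),
                new Fragment("L2", 50, 120),
                new Fragment("R1", 320, 100),
                new Fragment("R2", 320, 120));
    }

    /** Simplified position sort (an assumption): top-to-bottom first, then left-to-right. */
    public static List<Fragment> sortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    /** Joins the fragment texts with single spaces, in list order. */
    public static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> drawn = contentStreamOrder();
        System.out.println(join(drawn));                   // L1 L2 R1 R2
        System.out.println(join(sortedByPosition(drawn))); // L1 R1 L2 R2
    }
}
```

In drawing order the left column stays together; after the position sort, fragments on the same baseline from both columns end up next to each other.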

Are there methods in the PDFBox API to catch this / to extract the text by columns?

PDFBox does not analyze the page content to recognize columns. If you do the analysis yourself, though, it allows you to extract text column by column (via PDFTextStripperByArea) if you provide the column rectangles as regions.
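As a rough sketch of the "do the analysis" part, the following plain-Java helper computes one rectangle per column for the simplest case: equal-width columns between fixed margins. The class name, the method, and all layout values (page size, margin, gutter) are assumptions for illustration; real documents need measured values or an actual layout analysis. Coordinates are assumed to be in points, measured from the top-left corner of the page.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnRegions {
    /**
     * Splits the area inside the margins into equal-width columns.
     * All values are in points; the layout parameters are assumptions
     * that have to be measured for the actual document.
     */
    public static Map<String, Rectangle2D> columnRegions(
            double pageWidth, double pageHeight,
            double margin, double gutter, int columns) {
        double usableWidth = pageWidth - 2 * margin - gutter * (columns - 1);
        double columnWidth = usableWidth / columns;
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        for (int i = 0; i < columns; i++) {
            double x = margin + i * (columnWidth + gutter);
            regions.put("column" + (i + 1),
                    new Rectangle2D.Double(x, margin, columnWidth, pageHeight - 2 * margin));
        }
        return regions;
    }

    public static void main(String[] args) {
        // Assumed layout: A4 page (595 x 842 pt), two columns, 50 pt margins, 15 pt gutter.
        columnRegions(595, 842, 50, 15, 2)
                .forEach((name, rect) -> System.out.println(name + ": " + rect));
    }
}
```

Each rectangle can then be registered with PDFTextStripperByArea.addRegion(name, rectangle); after calling extractRegions(page), getTextForRegion(name) returns the text found in that column. The package of PDFTextStripperByArea depends on the PDFBox version (org.apache.pdfbox.util in 1.8.x, org.apache.pdfbox.text from 2.0 on).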
