My question is:
How can I extract text from a PDF file whose pages are laid out in columns, so that the extracted text is separated by those columns?
Background: I am working on a project on text analysis (especially of scientific texts). These texts are sometimes published in multi-column layouts in which each column has its own page number. To order the extracted text by these printed page numbers, it would be useful to extract the text column by column.
I use PDFBox and have tried / searched for several things:
- I tried the getThreadBeads() method of the PDPage class -> result: a list of size 0
- I tried grabbing the text with the getCharactersByArticle() method -> the text is not divided into columns
(I tried this with PDF files of published texts as well as with files I created myself from .doc documents; all of them have a multi-column layout.)
The thing is that PDFBox seems to divide the text by columns automatically:
If I call setSortByPosition(true) on a PDFTextStripper, the characters of a page are output line by line across the whole page, without recognizing separate columns.
But if I call setSortByPosition(false), the stripper does separate the columns.
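To illustrate the two behaviours, here is a toy model in plain Java. It is not PDFBox code: the Fragment class, the coordinates, and the method names are all hypothetical, and it assumes the text fragments are stored column by column in the file, which I have not verified for my PDFs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only, not PDFBox code. Fragment and all values are hypothetical.
class SortByPositionDemo {

    // A piece of text with the coordinates of its start (y grows downwards).
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins the fragments in the order given (models sorting switched off).
    static String inStoredOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    // Sorts top to bottom, then left to right, before joining
    // (models sorting switched on).
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator.<Fragment>comparingDouble(f -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStoredOrder(copy);
    }

    // Two columns (A on the left, B on the right), stored column by column.
    static List<Fragment> samplePage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("A1", 50f, 100f));
        page.add(new Fragment("A2", 50f, 120f));
        page.add(new Fragment("B1", 320f, 100f));
        page.add(new Fragment("B2", 320f, 120f));
        return page;
    }

    public static void main(String[] args) {
        List<Fragment> page = samplePage();
        System.out.println(inStoredOrder(page));    // prints "A1 A2 B1 B2"
        System.out.println(sortedByPosition(page)); // prints "A1 B1 A2 B2"
    }
}
```

In this model the columns only stay together because of the order in which the fragments were stored, not because any column was detected.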
So I had a look at the PDFBox source code:
The crucial method is the writePage() method of PDFTextStripper.
Here, spaces (which are not present in most PDFs) and line breaks are evidently calculated.
But I couldn't find where the stripper calculates the column breaks.
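For comparison, this is roughly the kind of column-break calculation I was expecting to find. It is only a sketch of my own idea, not something taken from PDFBox: the class name, the method, and the minGap threshold are hypothetical, and the input is assumed to be the left x-coordinate of every text line on a page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not taken from PDFBox.
class ColumnGapSketch {

    // Groups the left x-coordinates of text lines into columns: after sorting,
    // a new column starts whenever the jump to the next x exceeds minGap.
    static List<List<Float>> splitIntoColumns(float[] xs, float minGap) {
        float[] sorted = xs.clone();
        Arrays.sort(sorted);
        List<List<Float>> columns = new ArrayList<>();
        List<Float> current = new ArrayList<>();
        for (int i = 0; i < sorted.length; i++) {
            if (i > 0 && sorted[i] - sorted[i - 1] > minGap) {
                columns.add(current);
                current = new ArrayList<>();
            }
            current.add(sorted[i]);
        }
        if (!current.isEmpty()) {
            columns.add(current);
        }
        return columns;
    }

    public static void main(String[] args) {
        // Made-up line starts: three lines near x=50, two near x=320.
        float[] lineStarts = {50f, 52f, 320f, 50f, 321f};
        List<List<Float>> columns = splitIntoColumns(lineStarts, 100f);
        System.out.println(columns.size()); // prints 2
    }
}
```

A fixed threshold like this would probably break on indented paragraphs or on figures that span columns, so it is only meant to show what I mean by "calculating column breaks".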
So, my questions again:
- How does PDFTextStripper calculate column breaks?
- Are there methods in the PDFBox API to hook into this / to extract the text by columns?
- Is this possible with another PDF API?
Thanks in advance.