I am using PDFBox to get the font size from PDFs.

I have extended PDFTextStripper and overridden the writeString function, which gives me access to the TextPosition objects.

It works fine about half the time, but the rest of the time the font size comes back as -1. Why is that? This affects the rest of my algorithm.

I have tried the functions getHeight, getHeightDir and getFontSize. All of them give the same result.

Here is the writeString function:

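@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    // Inspect each character's position data (font size etc.).
    for (TextPosition text : textPositions) {
        getChar(text);
    }
    // Write the string once, not once per character.
    writeString(string);
}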

The getChar function processes the information.

How do I fix this? Thanks in advance.

EDIT: I'm using PDFBox 2.0.2. My application requires me to convert any given file to a PDF and then process it using PDFBox. This -1 problem happens with all spreadsheet files. I use Apache POI 3.15 to convert the document to PDF. It works fine for doc, docx, ppt, pptx, odt and odp.

Please share your PDF and mention what version you are using. - Tilman Hausherr
I'm using PDFBox 2.0.2. My application requires me to convert any given file to a pdf and then process it using PDFBox. This -1 problem happens to all Spreadsheet files. I use Apache POI 3.15 to convert the document to PDF. It works fine for doc, docx, ppt, pptx, odt, odp. - Sid Prasad
Current version is 2.0.6. I can only have a look at the problem if you share a PDF. If you don't, try the PrintTextLocations example and see what getFontSize() returns. - Tilman Hausherr
Please do as Tilman asks. Describing only vaguely how one could create the PDFs is not helpful. Please simply supply a sample PDF. - mkl
@SidPrasad Please be precise when asking your questions, and you can't just leave comments unanswered. Please respond, otherwise I will have to close this question. - ayush gupta

1 Answer

Since you have not shared a sample document, here is what I can infer from the question alone.

Assuming PDFBox works correctly, getFontSize returning -1 suggests that the font size was not set at the source, i.e., when the PDF was generated. If you observe that the characters for which getFontSize returns -1 all have the same size, that size could be treated as a default.
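
If a constant default turns out to be acceptable, one way to keep the rest of the algorithm working is to treat any non-positive value as "not set" and substitute a fallback. The following is only a sketch in plain Java with no PDFBox dependency: FontSizeFallback, orDefault and normalize are hypothetical names, and the default size is an assumption you would have to pick for your own documents. It works around the -1 values; it does not explain why they are reported.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox) for handling unset font sizes. */
final class FontSizeFallback {

    private FontSizeFallback() {
        // static helpers only
    }

    /** Returns rawSize if it is positive, otherwise defaultSize. */
    static float orDefault(float rawSize, float defaultSize) {
        // -1, 0 and NaN all fail this comparison and fall back to the default.
        return rawSize > 0f ? rawSize : defaultSize;
    }

    /** Applies orDefault to every reported size, preserving order. */
    static List<Float> normalize(List<Float> rawSizes, float defaultSize) {
        List<Float> result = new ArrayList<>(rawSizes.size());
        for (float raw : rawSizes) {
            result.add(orDefault(raw, defaultSize));
        }
        return result;
    }
}
```

In the code from the question, this would presumably be called inside getChar, for example as FontSizeFallback.orDefault(text.getFontSize(), 10f), where 10f is an arbitrary placeholder rather than a value taken from your PDFs.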

If this does not help, please provide a sample PDF, as others have asked in the comments, so that an actual solution can be found.