C# iTextSharp PDFReader Reads From Beginning of PDF Always

Question

I am using iTextSharp's PdfReader to read a PDF file that has 18 pages, but every time I increment the page number, extraction starts from the beginning of the PDF instead of reading just that particular page. If I set "x" to the pdfReader.NumberOfPages value, it only reads the last page. I would like to read each page individually and add its text to my list of strings, s. I am also going through a folder, reading each PDF file, but I am testing with just one at first.

List<string> s = new List<string>();
while (z < filePaths.Count())
{
    PdfReader pdfReader = new PdfReader(filePaths[z]); 
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    for (int x = 1; x <= pdfReader.NumberOfPages; x++)
    {
        string currentText = "";                                
        currentText = PdfTextExtractor.GetTextFromPage(pdfReader, x, strategy);                        
        s.Add(currentText);
    }
    z++;
    pdfReader.Close();
}

Does it always read only the first page, except for the last page, or does it read everything from the first page to the xth page each time? The underlying workhorse method ProcessContent<E>(int pageNumber, E renderListener) clearly should do what you intend... Which version of iTextSharp do you use? — Cee McSharpface
Using 5.5.10.0. It always starts at the first page and reads through the xth page. — AWooster
Just to make sure... do you expect s to contain all pages of all files, one page's worth of text per list item, when the outer loop is finished? — Cee McSharpface
Yes, I want to read the PDF page by page and insert each page of text as a list item. — AWooster

blagae · Accepted Answer · 2016-12-08T11:34:12

All previous answers are pretty close, i.e. you were correctly blaming it on some kind of state issue.

The missing piece is that the strategy object is what remembers its state. SimpleTextExtractionStrategy accumulates text as it goes, and calling GetTextFromPage does not flush what it already holds, so each call returns the text of every page processed so far plus the new one.

So the trick is to instantiate your strategy inside the loop:

for (int x = 1; x <= pdfReader.NumberOfPages; x++)
{
    // A fresh strategy per page, so no text carries over from earlier pages.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, x, strategy);
    s.Add(currentText);
}
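
For completeness, here is a rough sketch of how the whole folder loop from the question might look with that change applied. It assumes iTextSharp 5.5.x is referenced. The class name PdfPageText and the method name ReadPages are made up for illustration and are not part of the library. It also swaps the while loop and the z counter for a foreach, and closes the reader in a finally block so it is released even if extraction throws.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper class; the name is an assumption, not an iTextSharp type.
public static class PdfPageText
{
    // Returns one list item per page, across all files, in the order given.
    public static List<string> ReadPages(IEnumerable<string> filePaths)
    {
        var pages = new List<string>();

        foreach (string filePath in filePaths)
        {
            var pdfReader = new PdfReader(filePath);
            try
            {
                // Page numbers are 1-based, as in the question's loop.
                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    // New strategy for every page, so earlier text is not carried over.
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    pages.Add(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
                }
            }
            finally
            {
                // Release the file even if a page fails to parse.
                pdfReader.Close();
            }
        }

        return pages;
    }
}
```

Calling PdfPageText.ReadPages(filePaths) should then give the list the question calls s, with one entry per page.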
