2
votes

I am using iTextSharp's PdfReader to read a PDF file that has 18 pages, but every time I increment the page number, extraction starts from the beginning of the PDF instead of reading just that particular page. If I set x to pdfReader.NumberOfPages, it reads only the last page. I would like to read each page individually and add its text to my list of strings, s. I am also going through a folder and reading each PDF file, but I am testing with just one at first.

List<string> s = new List<string>();
int z = 0;
while (z < filePaths.Count())
{
    PdfReader pdfReader = new PdfReader(filePaths[z]);
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    for (int x = 1; x <= pdfReader.NumberOfPages; x++)
    {
        string currentText = "";
        currentText = PdfTextExtractor.GetTextFromPage(pdfReader, x, strategy);
        s.Add(currentText);
    }
    z++;
    pdfReader.Close();
}
3
Does it always read only the first page, except for the last page, or does it read everything from the first to the xth page each time? The underlying workhorse method ProcessContent<E>(int pageNumber, E renderListener) clearly should do what you intend... Which version of iTextSharp do you use? - Cee McSharpface
I am using 5.5.10.0. It always starts at the first page and reads until the xth page. - AWooster
Just to make sure: do you expect s to contain all pages of all files, one page's worth of text per list item, when the outer loop is finished? - Cee McSharpface
Yes, I want to read the PDF page by page and insert each page of text as a list item. - AWooster

3 Answers

5
votes

All previous answers are pretty close, i.e. you were correctly blaming it on some kind of state issue.

The only part that was missing is that it is the strategy variable that remembers its state. After a call to GetTextFromPage, the strategy object keeps the text it has already collected, so every later call that reuses the same object returns the earlier pages as well.

So the trick is to instantiate your strategy inside the loop:

for (int x = 1; x <= pdfReader.NumberOfPages; x++)
{
    // A fresh strategy per page, so no text carries over from earlier pages.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, x, strategy);
    s.Add(currentText);
}
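
To see the effect without needing a PDF at hand, here is a minimal sketch that imitates the behaviour described in this answer. AccumulatingStrategy and StrategyStateDemo are hypothetical stand-ins invented for illustration, not iTextSharp types; the sketch assumes only that the strategy appends to an internal buffer and never clears it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a text extraction strategy (not an iTextSharp type).
// It appends everything it is given to one buffer and never clears it.
public class AccumulatingStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();

    public void RenderText(string text)
    {
        buffer.Append(text);
    }

    public string GetResultantText()
    {
        return buffer.ToString();
    }
}

public static class StrategyStateDemo
{
    // Hypothetical stand-in for GetTextFromPage: feeds one page (1-based)
    // into the strategy and returns everything the strategy now holds.
    public static string GetTextFromPage(string[] pages, int pageNumber, AccumulatingStrategy strategy)
    {
        strategy.RenderText(pages[pageNumber - 1]);
        return strategy.GetResultantText();
    }

    // One strategy shared by all pages: each result also contains earlier pages.
    public static List<string> WithSharedStrategy(string[] pages)
    {
        var results = new List<string>();
        var strategy = new AccumulatingStrategy();
        for (int x = 1; x <= pages.Length; x++)
        {
            results.Add(GetTextFromPage(pages, x, strategy));
        }
        return results;
    }

    // A new strategy per page: each result contains only that page.
    public static List<string> WithFreshStrategy(string[] pages)
    {
        var results = new List<string>();
        for (int x = 1; x <= pages.Length; x++)
        {
            var strategy = new AccumulatingStrategy();
            results.Add(GetTextFromPage(pages, x, strategy));
        }
        return results;
    }
}
```

Calling StrategyStateDemo.WithSharedStrategy(new[] { "A", "B", "C" }) yields "A", "AB", "ABC", which matches the symptom reported in the question, while WithFreshStrategy on the same input yields "A", "B", "C".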
1
votes

I got it to work by removing the strategy argument from this call: PdfTextExtractor.GetTextFromPage(pdfReader, x, strategy). The unused strategy variable from the question is dropped as well.

static void Main(string[] args)
{
    List<string> filePaths = new List<string>();
    filePaths.Add("C:\\temp\\pe\\ACN-ONFBG-010-R-EN-ONT (1364).pdf");
    filePaths.Add("C:\\temp\\pe\\ACN-ONFBG-010-R-UN-NOR (1364).pdf");
    filePaths.Add("C:\\temp\\pe\\ACN-ONFBG-010-R-UN-SOU (1364).pdf");
    List<string> results = doit(filePaths);
    string stall = "stall";
}

// Returns one list item per page, across all files in filePaths.
private static List<string> doit(List<string> filePaths)
{
    List<string> s = new List<string>();
    int z = 0;
    while (z < filePaths.Count())
    {
        PdfReader pdfReader = new PdfReader(filePaths[z]);
        for (int x = 1; x <= pdfReader.NumberOfPages; x++)
        {
            // No strategy argument is passed, so nothing is shared between pages.
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, x);
            s.Add(currentText);
        }
        z++;
        pdfReader.Close();
    }
    return s;
}
0
votes

I suspect a reader state issue. Try opening the PdfReader once before the loop to get the page count, and store that count in a variable to use as the upper bound of the loop. Then, inside the loop, instantiate a new PdfReader for every page and dispose of it after each iteration.

EDIT: It turned out that the text extraction strategy is the culprit. It retains state somehow. Always instantiate a new SimpleTextExtractionStrategy before calling GetTextFromPage, or omit the strategy parameter; in that case a new instance of the default implementation of ITextExtractionStrategy will be created internally.