I need to extract text from a pdf file using itext7 or itextsharp and put html tag for bold around all the words using bold font

Question

I am using iText7 and want to extract all the text from a PDF, put the HTML bold tag (<b>...</b>) around every word that uses a bold font, and save the result in a text file. Any pointers? I can extract the text, and I can separately extract all the bold words, but I am not able to correlate the two. Here is the code snippet I am using to extract the text:

PdfDocument MyDocument = new PdfDocument(new PdfReader("C:\\MyTest.pdf"));
string MyText = PdfTextExtractor.GetTextFromPage(MyDocument.GetPage(1), new SimpleTextExtractionStrategy());

Here is the code I am using for extracting all the words using the bold font:

Rectangle MyRectangle = new Rectangle(0, 0, 50, 100);
CustomFontFilter fontFilter = new CustomFontFilter(MyRectangle);
FilteredEventListener listener = new FilteredEventListener();
LocationTextExtractionStrategy extractionStrategy = 
listener.AttachEventListener(new LocationTextExtractionStrategy(), fontFilter);
PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
parser.ProcessPageContent(MyDocument.GetPage(1));
String MyBoldTextList = extractionStrategy.GetResultantText();
//------
class CustomFontFilter : TextRegionEventFilter
{
    public CustomFontFilter(iText.Kernel.Geom.Rectangle filterRect) : base(filterRect){ }
    override public bool Accept(IEventData data, EventType type)
    {
        if (type == EventType.RENDER_TEXT){
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            PdfFont font = renderInfo.GetFont();
            if (font!=null)
                return font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold");
        }
        return false;
    }
}

The problem is that the PDF in question is a multi-column document. SimpleTextExtractionStrategy returns the text in perfect order, but LocationTextExtractionStrategy messes it up by jumping from one column to the next on each line. I cannot find a way to get the list of bold words using SimpleTextExtractionStrategy, and the list I get from LocationTextExtractionStrategy is not in the right order, so I am unable to correlate the two.

mkl · Accepted Answer · 2020-11-23T16:11:06

So to summarize:

You want to extract all the text from a PDF and put the HTML tag for bold (<b>...</b>) around all the text that uses bold fonts.
Your PDFs allow normal text extraction (without those <b> tags) using the SimpleTextExtractionStrategy. The LocationTextExtractionStrategy, on the other hand, cannot be used as it messes up the order of the multi-column text.
Bold text in your PDFs can properly be recognized by your CustomFontFilter, i.e. by the
```
font.GetFontProgram().GetFontNames().GetFontName().Contains("Bold")
```
condition.

Thus, one way to implement your task would be to extend the SimpleTextExtractionStrategy to check every chunk received using the CustomFontFilter condition and insert <b> and </b> tags where required.
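Stripped of the iText specifics, the tagging logic is a small state machine: open a tag when a bold run starts, close it when the run ends. The following is only a sketch of that idea on plain data. The `BoldTagSketch` class, the `TagBold` helper and the `(Text, Bold)` tuples are hypothetical and not part of iText; in the real strategy the bold flag comes from the font name check.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BoldTagSketch
{
    // Hypothetical helper (not an iText API): wraps each run of
    // consecutive bold chunks in <b>...</b>.
    public static string TagBold(IEnumerable<(string Text, bool Bold)> chunks)
    {
        var result = new StringBuilder();
        bool currentlyBold = false;

        foreach (var (text, bold) in chunks)
        {
            if (bold && !currentlyBold)
            {
                result.Append("<b>");   // a bold run starts with this chunk
            }
            else if (!bold && currentlyBold)
            {
                result.Append("</b>");  // the bold run ended before this chunk
            }
            currentlyBold = bold;
            result.Append(text);
        }

        // Close a bold run that is still open after the last chunk.
        if (currentlyBold)
        {
            result.Append("</b>");
        }
        return result.ToString();
    }
}
```

For the chunks ("Plain ", not bold), ("bold", bold), (" words", bold), (" and plain", not bold), (" end", bold), this returns `Plain <b>bold words</b> and plain<b> end</b>`.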

For example like this:

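using System.Reflection;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public class BoldTaggingSimpleTextExtractionStrategy : SimpleTextExtractionStrategy
{
    // TextRenderInfo has no public setter for its text, so the private field is accessed via reflection.
    FieldInfo textField = typeof(TextRenderInfo).GetField("text", BindingFlags.NonPublic | BindingFlags.Instance);
    bool currentlyBold = false;

    public override void EventOccurred(IEventData data, EventType type)
    {
        if (type.Equals(EventType.RENDER_TEXT))
        {
            TextRenderInfo renderInfo = (TextRenderInfo)data;
            string fontName = renderInfo.GetFont()?.GetFontProgram()?.GetFontNames()?.GetFontName();
            if (fontName != null && fontName.Contains("Bold"))
            {
                if (!currentlyBold)
                {
                    // A bold run starts: prepend the opening tag to this chunk's text.
                    textField.SetValue(renderInfo, "<b>" + renderInfo.GetText());
                    currentlyBold = true;
                }
            }
            else if (currentlyBold)
            {
                // The bold run ended with the previous chunk: close the tag before this chunk is appended.
                AppendTextChunk("</b>");
                currentlyBold = false;
            }
        }
        base.EventOccurred(data, type);
    }

    // Close a bold run that is still open when the extracted text ends.
    public override string GetResultantText()
    {
        string text = base.GetResultantText();
        return currentlyBold ? text + "</b>" : text;
    }
}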

As you can see, I used reflection here. I did so because (A) TextRenderInfo does not allow setting the text publicly and (B) AppendTextChunk must not be used before the first chunk has been processed by base.EventOccurred: there, the length of the StringBuilder containing the collected text chunks is used to check whether the chunk currently being processed is the first one, so if something is in that builder before at least one chunk has been processed, one gets a NullReferenceException. There are other work-arounds for that, but reflection costs only one more line of code here.
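To tie this back to the question, the strategy could be plugged into the same PdfTextExtractor call the question already uses and the result written to a text file. This is a minimal sketch: the input path is the one from the question, while the output path `C:\\MyTest.txt` and the `BoldTaggingExample` class name are made-up examples.

```csharp
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

public static class BoldTaggingExample
{
    public static void Main()
    {
        // Input path as in the question; the output path below is an assumption.
        PdfDocument pdfDocument = new PdfDocument(new PdfReader("C:\\MyTest.pdf"));
        try
        {
            // The strategy accumulates text and bold state, so use a fresh instance per page.
            string taggedText = PdfTextExtractor.GetTextFromPage(
                pdfDocument.GetPage(1), new BoldTaggingSimpleTextExtractionStrategy());

            File.WriteAllText("C:\\MyTest.txt", taggedText);
        }
        finally
        {
            pdfDocument.Close();
        }
    }
}
```

Because the check relies on the font name containing "Bold", it only catches bold text whose font is named that way, as is the case for the PDFs in the question.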
