How to change the output format of text extracted with Apache Tika?

Question

I used Apache Tika to extract text from a PDF using the following code:

`

Parser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(file);
ParseContext context = new ParseContext();
parser.parse(inputstream, handler, metadata, context);

`

The output is as follows:

`

<p>Level 1 

Level 2 

Level 3 

 Level 4 

 Level 5 

 Level 6 

  Level 7 

  Level 8 

  Level 9 

 Level 10 

 Level 11 

Level 12 

Level 13    </p>

`

Is there any way to configure the PDF parser so that each level in the output is enclosed in its own paragraph tag? For example:

<p>Level 1</p>
<p>Level 2</p>

Each level in the pdf can actually represent a sentence or paragraph.

Lil Bro · Accepted Answer · 2017-11-01T08:09:50

Try something like this:

// Get string data
String data = handler.toString();
// Remove tags or other things (depends on your needs)
data = data.replace("<p>","");
data = data.replace("</p>","");
// Now it looks like: String data ="Level 1 Level 2 Level 3 Level 4 Level 5 Level 6 ... ";
String newdata = "";
// Split the string at every position where a digit is followed by a blank space
for (String s: data.split("(?<=[0-9])(?= )")) {
    // append with desired strings
    s =  "<p>"+s+"</p>";
    // and store modified data
    newdata += s;
}

If you need to, you can also append "\n" after each "</p>" tag, or output the s strings one by one instead of collecting them.
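If the levels are real sentences or paragraphs that do not end in a digit, splitting on line breaks may be more robust than splitting on digits. Here is a minimal sketch, assuming the extracted text keeps each item on its own line as in the sample output. The class and method names (`ParagraphRewrap`, `rewrap`) are made up for illustration and are not part of Tika:

```java
public class ParagraphRewrap {

    // Hypothetical helper: takes the text of one <p> block and wraps
    // every non-empty line in its own <p>...</p> pair.
    static String rewrap(String xhtml) {
        // Drop the existing paragraph tags (adjust to your needs)
        String text = xhtml.replace("<p>", "").replace("</p>", "");
        StringBuilder out = new StringBuilder();
        // \R matches any line break (\n, \r\n, ...); requires Java 8+
        for (String line : text.split("\\R")) {
            String item = line.trim();
            if (item.isEmpty()) {
                continue; // skip blank lines between items
            }
            out.append("<p>").append(item).append("</p>\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for handler.toString(); shortened sample input
        String sample = "<p>Level 1 \n\nLevel 2 \n\n Level 10    </p>";
        System.out.print(rewrap(sample));
    }
}
```

In your code you would pass `handler.toString()` (or just the relevant part of it) to `rewrap`. This only post-processes the string; it does not change how the PDF parser itself groups paragraphs.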

I hope this was helpful. Good luck.
