I used Apache Tika to extract text from a PDF with the following code:
`
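// Imports assumed: org.apache.tika.parser.*, org.apache.tika.metadata.Metadata,
// org.apache.tika.sax.ToXMLContentHandler, org.xml.sax.ContentHandler, java.io.*
Parser parser = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
// try-with-resources closes the stream even if parsing fails
try (InputStream inputstream = new FileInputStream(file)) {
    parser.parse(inputstream, handler, metadata, context);
}
String xhtml = handler.toString(); // serialized XHTML output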
`
The output is as follows:
`
<p>Level 1
Level 2
Level 3
Level 4
Level 5
Level 6
Level 7
Level 8
Level 9
Level 10
Level 11
Level 12
Level 13 </p>
`
Is there any way to configure the PDF parser so that each "Level N" line in the output is enclosed in its own paragraph tag? For example:
<p>Level 1</p>
<p>Level 2</p>
In the actual PDF, each level can be a full sentence or a paragraph.
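
For reference, the only fallback I can think of is post-processing the XHTML string after parsing, rather than configuring the parser. Below is a minimal sketch of that idea, assuming every line inside a `<p>` element is one level. `ParagraphSplitter` and `splitParagraphs` are hypothetical names of my own, not part of Tika, and the regex only handles plain `<p>` tags without attributes or nested markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Tika API: rewrites each multi-line <p> element
// into one <p> element per non-blank line.
class ParagraphSplitter {
    // Matches one <p>...</p> element, including bodies that span several lines.
    private static final Pattern PARAGRAPH =
            Pattern.compile("<p>(.*?)</p>", Pattern.DOTALL);

    static String splitParagraphs(String xhtml) {
        Matcher m = PARAGRAPH.matcher(xhtml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            StringBuilder replacement = new StringBuilder();
            // \R matches any line break (\n, \r\n, ...)
            for (String line : m.group(1).split("\\R")) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip blank lines inside the paragraph
                }
                if (replacement.length() > 0) {
                    replacement.append('\n');
                }
                replacement.append("<p>").append(trimmed).append("</p>");
            }
            // quoteReplacement keeps '$' and '\' in the text from being
            // interpreted as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Shortened version of the output shown above
        System.out.println(splitParagraphs("<p>Level 1\nLevel 2\nLevel 3 </p>"));
        // prints:
        // <p>Level 1</p>
        // <p>Level 2</p>
        // <p>Level 3</p>
    }
}
```

This would break a real paragraph that wraps across several lines into one `<p>` per line, which is why I would prefer a parser-level setting if one exists.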