4
votes

We're building a PDF search engine with Solr and Lucene where users can search for text in PDFs. The index contains only PDFs.

On the search results page ("/browse") we want to append #page=X to each PDF link, where X is the page the text was found on. (Adobe Acrobat automatically scrolls to the given page if it is specified in the URL fragment.)

For example, if I search for foobar and there's a PDF document where foobar is on page 5, the link should be http://pdfserver/pdfs/pdf.pdf#page=5 (note the anchor at the end).

  1. Is this possible?
  2. How would we get this page number?
2
I don't think I understand what you're actually trying to achieve. Do you want to index PDF files and have any search you make return the page number of the matched text, or is it something else? – omu_negru
Exactly that. So if I search for "foobar" and there's a PDF document where "foobar" is on page 5, the link should be pdfserver/pdfs/pdf.pdf#page=5 – Simon Fredsted
Did you ever find a solution to this? Seems like a basic requirement when indexing a load of PDF files. – MrTelly
@MrTelly, I used the #search solution and URL-encoded the search term. – Simon Fredsted

2 Answers

1
votes

One easy-to-implement solution I found was to use the #search parameter that Adobe Reader supports when embedded in IE.

For example:

http://pdfserver/pdfs/pdf.pdf#search=foobar

Adobe Reader then searches the document for the term and jumps to the page where it is found.

One would need to URL-encode the search terms, of course.
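
As a minimal sketch of that encoding step, assuming the link is built server-side in Python: the helper name pdf_search_link is hypothetical, and the base URL is the example one from the question. It only builds the link; whether the viewer honours #search depends on the PDF plugin in use.

```python
from urllib.parse import quote


def pdf_search_link(pdf_url: str, term: str) -> str:
    """Return pdf_url with a #search fragment for the given term.

    pdf_search_link is a hypothetical helper, not part of Solr or Adobe Reader.
    """
    # quote() percent-encodes spaces and reserved characters;
    # safe="" makes it encode "/" as well.
    return f"{pdf_url}#search={quote(term, safe='')}"


print(pdf_search_link("http://pdfserver/pdfs/pdf.pdf", "foo bar"))
# prints http://pdfserver/pdfs/pdf.pdf#search=foo%20bar
```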

0
votes

Apache Tika can transform PDF files into structured data for you to feed into the Solr server.

My approach to your problem would be to index each PDF page as its own document, with extra fields for the chapter, the title of the text (or its absolute path, or both), and the page number. Using this data, you can then open the relevant document at the relevant page.
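
The per-page idea could be sketched roughly as follows, assuming the text of each page has already been extracted (for example with Tika) into a list of strings. The field names id, url, page, and text, and both helper functions, are hypothetical and would need to match your own Solr schema. Sending the documents to Solr is left out.

```python
def page_docs(pdf_url, page_texts):
    """Build one index document per page (field names are assumptions)."""
    docs = []
    # Pages are numbered from 1, matching the #page=N convention.
    for number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{pdf_url}#page={number}",  # unique key per page
            "url": pdf_url,
            "page": number,
            "text": text,
        })
    return docs


def page_link(doc):
    """Turn a matching page document into a link that opens at that page."""
    return f"{doc['url']}#page={doc['page']}"


# Usage: a five-page PDF where "foobar" appears on page 5.
docs = page_docs("http://pdfserver/pdfs/pdf.pdf", ["intro", "", "", "", "foobar here"])
hit = next(d for d in docs if "foobar" in d["text"])
print(page_link(hit))  # prints http://pdfserver/pdfs/pdf.pdf#page=5
```

Because every page is a separate document, a search hit already carries its page number, so the results page only has to call something like page_link on each hit.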

Read more about Tika here: http://tika.apache.org/