SOLR/LUCENE Experts, please help me design a simple keyword search from PDF index?

Question

I dabbled with Solr but couldn't figure out how to tailor it to my requirements.

What I have :

A bunch of PDF files. A set of keywords.

What I am trying to achieve :

Index the PDF files (Solr Cell, done). Search for a keyword (works OK). Tailor the output to return the names of the PDF files and an excerpt where the keyword occurred (no idea how to do this).

Tried manipulating ResponseHandler/Schema.xml/Solrconfig.xml to no avail.

Lucene/solr experts, do you think what I am trying to achieve is possible?

I put my existing code on GitHub at https://github.com/ThinkCode/solr_search. It is mostly Solr's default example with minor modifications to the fields (all the content is stored in one content field).

Notable changes in schema.xml:

<solrQueryParser defaultOperator="AND"/>

   <field name="id" type="string" indexed="true" stored="true" required="true" />

   <field name="content" type="text_general" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

   <dynamicField name="*" type="string"    indexed="true"  stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

<solrQueryParser defaultOperator="AND"/>

<copyField source="*" dest="content"/>

Current Output :

(query) http://localhost:8983/solr/select/?q=Java+Servlet&version=2.2&start=0&rows=10&indent=on

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">13</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">Java Servlet</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="content_type"><str>application/pdf</str></arr>
      <str name="id">tutorial.pdf</str>
      <str name="subject">Solr</str>
      <arr name="title"><str>Solr tutorial</str></arr>
    </doc>
  </result>
</response>

What I am looking for is 'extracted fragment (line) where the keyword was found'.

In the query above, I searched for 'Java Servlet' and it returned the document. I would like the surrounding context, 'Solr can run in any Java Servlet Container of your choice', to be returned in the output XML as well.

Yes, it's possible. Can you post what you have so far, or where concretely you're having trouble? — Mauricio Scheffer
I put the code on github @ github.com/ThinkCode/solr_search and the schema file is at github.com/ThinkCode/solr_search/blob/master/apachesolr330/… — ThinkCode
I don't mean to be rude, but you'll have to be much more specific than this... otherwise it's a "plz send me the codez / do my job for free" kind of question, which is not welcome on stackoverflow. — Mauricio Scheffer
I updated the question with a sample. I am not looking for someone who can do the job for me! I am looking for hints/leads which will help me research in the right direction. It's been less than a week since I stumbled upon Solr. Thanks! — ThinkCode

Mauricio Scheffer · Accepted Answer · 2011-08-03T00:07:20

To get snippets of text around the matched keywords, see http://wiki.apache.org/solr/HighlightingParameters
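
As a rough sketch of what that page describes, highlighting can be switched on per request or as handler defaults in solrconfig.xml. The fragment below assumes the default search handler is a solr.SearchHandler and that snippets should come from the content field defined in the question's schema. The handler name and the values chosen for hl.snippets and hl.fragsize are illustrative assumptions, not taken from the asker's configuration.

```xml
<!-- Assumed handler name; adjust to whatever handler serves /select in your solrconfig.xml -->
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <!-- Turn highlighting on for every query handled here -->
    <str name="hl">true</str>
    <!-- Field to take snippets from; it must be stored -->
    <str name="hl.fl">content</str>
    <!-- Illustrative values: up to 3 snippets of roughly 100 characters each -->
    <str name="hl.snippets">3</str>
    <str name="hl.fragsize">100</str>
  </lst>
</requestHandler>
```

The same parameters can be appended to the query instead, for example http://localhost:8983/solr/select/?q=Java+Servlet&hl=true&hl.fl=content. Highlighting only works on stored fields, which content is in the schema above. The snippets come back in a separate highlighting section of the response, keyed by document id, rather than inside each doc element.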

To get the filename of the indexed PDF as part of the response, simply add a field with that information (it should be a string field, non-indexed, stored). Of course, you have to populate this new field at index-time.
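
A minimal sketch of such a field in schema.xml, assuming a hypothetical field name filename (any name works as long as the indexing step uses the same one):

```xml
<!-- Hypothetical field for the original PDF file name:
     stored so it is returned in results, not indexed because it is never searched -->
<field name="filename" type="string" indexed="false" stored="true"/>
```

With Solr Cell, one way to populate it at index time is the literal.<fieldname> request parameter of the extracting handler, for example literal.filename=tutorial.pdf passed alongside literal.id. Note that the response shown in the question already carries the file name in the id field, so that alone may be enough.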
