4
votes

I am using Solr to index DOC, DOCX and PDF files. I enabled stored on the text field so that I could inspect what gets indexed. Here is the result from a sample DOC file:

, a mobile user interface (UI) software development company, based in Cambridge, UK. After integrating the company, Qualcomm re-branded their interface markup language and its accompanying integrated development environment (IDE) as HYPERLINK "http://en.wikipedia.org/w/index.php?title=UiOne&action=edit&redlink=1" *\o "UiOne (page does not exist)" uiOne** . In March 2009, Qualcomm informed their Cambridge engineering staff, mostly from the division working on HYPERLINK "http://en.wikipedia.org

The DOC file contains material from Wikipedia. I captured the full output at http://pastebin.com/8FL9eHJv

So Solr Cell/Tika inserts its own formatting, and that formatting shows up in the search output. How can I fix this so that the search results (text snippets) do not contain the formatting?

Googling around tells me that Tika has several output formats, so is that the right approach? Or is there a plugin that can filter the text before the results are rendered?

Relevant details: my configuration is close to stock. My upload command is a Python variation of:

curl "http://localhost:8983/solr/update/extract?literal.id=doc-qualcomm&commit=true" -F "[email protected]"
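For reference, a rough Python equivalent of that upload could look like the sketch below. It is not my exact script: the function names and the `application/msword` default are made up for illustration, and it sends the file as the raw request body instead of a multipart form, which I am assuming the extract handler accepts in the same way.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(solr_url, doc_id, payload, content_type="application/msword"):
    """Build a POST for Solr's /update/extract handler with the file as the raw body."""
    # Same query parameters as the curl command: a literal id plus an immediate commit.
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return Request(
        f"{solr_url}/update/extract?{params}",
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )


def upload_document(path, doc_id, solr_url="http://localhost:8983/solr"):
    """Read a local file (hypothetical path) and send it to Solr; returns the HTTP status."""
    with open(path, "rb") as fh:
        request = build_extract_request(solr_url, doc_id, fh.read())
    with urlopen(request) as response:
        return response.status
```

Calling `upload_document("some-file.doc", "doc-qualcomm")` would then issue the same kind of request as the curl command above.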

My schema.xml: http://pastebin.com/VLz2uuDQ

My solrconfig.xml: http://pastebin.com/X2J2jj64

1
Can you post your Solr config for the bit that talks to Tika? As you've spotted, Tika supports outputting as plain text, HTML and XHTML, so things may well depend on how you've configured Solr to talk to Tika. - Gagravarr
I edited my question to include those. But my configuration is close to stock; I just modified a few details in the schema.xml. - Jesvin Jose
What version of Solr are you using? And what version of Tika does that include? - Gagravarr

1 Answer

0
votes

Are you asking about the extra hyperlink items in the search results? If so, try updating the extract request handler in your solrconfig.xml to include:

<str name="captureAttr">false</str>
<str name="fmap.a">ignored_</str>
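If the HYPERLINK field codes still show up after that change, a possible fallback is to strip them on the client side, either before indexing or before displaying snippets. The sketch below is only an illustration: `strip_field_codes` is a made-up helper, not part of Solr or Tika, and the pattern assumes the codes look like the sample in the question (a quoted URL, optionally followed by `\o` and a quoted tooltip).

```python
import re

# Matches a Word HYPERLINK field code: HYPERLINK "url", optionally followed by
# \o "tooltip", plus any whitespace trailing the code. The shape is assumed
# from the sample output in the question.
FIELD_CODE = re.compile(r'HYPERLINK\s+"[^"]*"(?:\s*\\o\s+"[^"]*")?\s*')


def strip_field_codes(text):
    """Remove HYPERLINK field codes while keeping the visible link text."""
    return FIELD_CODE.sub("", text)
```

For example, `strip_field_codes('see HYPERLINK "http://example.org" the page')` leaves only the visible words. Other field codes (such as table-of-contents or page-reference fields) would need their own patterns.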