I am using Solr to index DOC, DOCX and PDF files. I enabled stored="true" on the text field so I could inspect what actually gets indexed. Here is the result from a sample DOC file:
, a mobile user interface (UI) software development company, based in Cambridge, UK. After integrating the company, Qualcomm re-branded their interface markup language and its accompanying integrated development environment (IDE) as HYPERLINK "http://en.wikipedia.org/w/index.php?title=UiOne&action=edit&redlink=1" *\o "UiOne (page does not exist)" uiOne** . In March 2009, Qualcomm informed their Cambridge engineering staff, mostly from the division working on HYPERLINK "http://en.wikipedia.org
The DOC file contains material from Wikipedia. I captured the full output at http://pastebin.com/8FL9eHJv
So it looks like Solr Cell/Tika inserts its own formatting (for example the HYPERLINK "..." fragments above), and that formatting shows up in the search output. How can I fix this so that the search results (text snippets) do not contain the formatting?
Googling around tells me that Tika has several output formats, so is switching the output format the right approach? Or is there a plugin that can filter the text before the results are rendered?
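To make the second option concrete, this is roughly the kind of filtering I have in mind, written as a standalone Python sketch rather than as a Solr or Tika feature. It assumes the residue always has the shape HYPERLINK "url" optionally followed by switches such as \o "tooltip"; the function name and the sample string are made up for illustration.

```python
import re

# Assumed shape of the residue: HYPERLINK "<url>" followed by zero or more
# switches of the form \x "<value>" (for example \o "<tooltip>").
FIELD_CODE = re.compile(r'HYPERLINK\s+"[^"]*"(?:\s*\\[a-z]\s+"[^"]*")*\s*')


def strip_hyperlink_fields(text):
    """Drop HYPERLINK field instructions and keep the visible link text."""
    return FIELD_CODE.sub("", text)


# Hypothetical sample modelled on the output above, not the real document.
sample = (
    'environment (IDE) as HYPERLINK "http://example.org/uiOne" '
    '\\o "UiOne (page does not exist)" uiOne .'
)
print(strip_hyperlink_fields(sample))
```

Something like this could run over the text before indexing or over the snippets before display, but I would prefer a fix inside the extraction step itself if one exists.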
Relevant details: my configuration is close to stock. My upload command is a Python variation of:
curl "http://localhost:8983/solr/update/extract?literal.id=doc-qualcomm&commit=true" -F "[email protected]"
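A minimal sketch of what such a Python variation can look like, using only the standard library. It is an illustration and not my exact script: the form field name "myfile", the helper names and the file path are placeholders, while the endpoint and the literal.id and commit parameters are the ones from the curl command.

```python
import os
import urllib.parse
import urllib.request
import uuid


def build_extract_url(base_url, doc_id, commit=True):
    """Build the /update/extract URL with the same parameters as the curl call."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return base_url.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def encode_multipart(field_name, filename, payload):
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def upload(base_url, doc_id, path, field_name="myfile"):
    """POST the file to Solr Cell; "myfile" is a placeholder field name."""
    with open(path, "rb") as fh:
        body, content_type = encode_multipart(field_name, os.path.basename(path), fh.read())
    request = urllib.request.Request(
        build_extract_url(base_url, doc_id),
        data=body,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Usage would be along the lines of upload("http://localhost:8983/solr", "doc-qualcomm", "/path/to/file.doc"), with the path replaced by the real document.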
My schema.xml: http://pastebin.com/VLz2uuDQ
My solrconfig.xml: http://pastebin.com/X2J2jj64