I need to index many different document types (xls, pptx, txt, csv, pdf) into a Solr core. The documents are all unstructured and unrelated to each other. I would like each one to be stored as something like:
{
'filename':'doc1',
'content': '<entire document text>'
}
Here filename is not a tag inside the actual document; it is assigned by the user. content is also not present in the document; it would hold the entire text of the indexed document.
I plan to do the processing with a Python script. There are tools to extract text from rich-text documents, but I would rather pass the files straight to Solr, have Solr ignore their internal tags (in PDFs, for example), and map the whole document to the content field of the schema above.
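To make this concrete, here is a rough sketch of the kind of request I have in mind. It assumes Solr's extracting request handler (Solr Cell) is enabled at /update/extract, that filename can double as the unique key, and that the schema has an ignored_* dynamic field to absorb extracted metadata. The base URL, the core name, and the helper names build_extract_url and index_file are placeholders of mine, not anything Solr provides.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_base, core, filename, content_field="content"):
    """Build a URL for Solr's extracting request handler (hypothetical helper)."""
    params = {
        "literal.id": filename,        # assumption: filename is also the unique key
        "literal.filename": filename,  # user-assigned value, not taken from the file
        "fmap.content": content_field, # map the extracted body text to our field
        "uprefix": "ignored_",         # assumption: an ignored_* dynamic field exists
        "commit": "true",
    }
    return f"{solr_base}/{core}/update/extract?{urlencode(params)}"


def index_file(path, url):
    """POST the raw file bytes to Solr and return the HTTP status code."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = Request(url, data=body,
                  headers={"Content-Type": "application/octet-stream"})
    with urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    # Placeholder host and core name; adjust to the actual installation.
    print(build_extract_url("http://localhost:8983/solr", "mycore", "doc1"))
```

The literal.* parameters set field values that do not come from the file, and fmap.content renames the field that receives the extracted body text.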
In summary: how do I create a schema with two fields that are not found in the target documents, index the entire document, and map it to one of those fields (text_en)? I'm fairly new to Solr, so my vocabulary may be imprecise; please ask for clarification if it's not clear what I'm trying to achieve.
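For the schema half of the question, the two fields could presumably be declared through Solr's Schema API with a request along these lines. This is only a sketch: it assumes a managed schema, treats text_en as the field type of content, and uses a placeholder host and core name. schema_payload and schema_request are hypothetical helper names.

```python
import json
from urllib.request import Request


def schema_payload(content_type="text_en"):
    """Schema API body that adds the two fields (hypothetical helper)."""
    return {
        "add-field": [
            {"name": "filename", "type": "string", "indexed": True, "stored": True},
            {"name": "content", "type": content_type, "indexed": True, "stored": True},
        ]
    }


def schema_request(solr_base, core):
    """Build (but do not send) the POST request for the core's schema endpoint."""
    return Request(
        f"{solr_base}/{core}/schema",
        data=json.dumps(schema_payload()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder host and core name; pass the request to urlopen() to apply it.
    req = schema_request("http://localhost:8983/solr", "mycore")
    print(req.full_url)
    print(json.dumps(schema_payload(), indent=2))
```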
text_en by default is a fieldType and not a field. - Binoy Dalal