0
votes

I have to put many different docs (xls, pptx, txt, csv, pdf) into a Solr core. All of the documents are unstructured and unrelated. I would like to do something like:

{
    'filename':'doc1',
    'content': entire doc
}

In this case, filename is not a tag inside the actual document but is assigned by the user, and content, which is also not in the actual document, would map to the entire indexed document.

I plan to do the processing through a Python script. While there are tools to extract text from rich-text documents, I would rather just pass the files to Solr and have Solr ignore their internal tags (in PDFs, for example) and map the whole doc to the content field of the above schema.

In summary, how do I create a schema with two fields not found in the target documents, index the entire document, and map it to one of the fields (text_en)? I'm somewhat new to Solr, so my vocabulary might be a little cloudy; please ask for clarification if you're not sure what I'm trying to achieve.

1
An alternative is to index the docs as-is into Solr and create a catch-all copyField into which you copy the contents of all the other fields. Also, on a separate note, text_en is by default a fieldType and not a field. - Binoy Dalal
Like <copyField source="*" dest="text" />? - O.D.P
I don't think a wildcard will work here. You will have to add all the fields separately. - Binoy Dalal
Turns out the wildcard works. I think the text field does the same. - O.D.P

1 Answer

0
votes

The Tika module in Solr has configuration options that should be able to do what you want:

literal.<fieldname>: Populates a field with the name supplied with the specified value for each document. The data can be multivalued if the field is multivalued.

So if you add a parameter named literal.filename=doc1, the value doc1 will be stored in the filename field of the document.

There's also an example to ignore all fields that are not present in the schema:

uprefix: Prefixes all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: uprefix=ignored_ would effectively ignore all unknown fields generated by Tika given the example schema contains <dynamicField name="ignored_*" type="ignored"/>

And to move a field into a different field name:

fmap.<source_field>: Maps (moves) one field name to another. The source_field must be a field in incoming documents, and the value is the Solr field to map to. Example: fmap.content=text causes the data in the content field generated by Tika to be moved to Solr's text field.
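
Since the question mentions a Python script, here is a minimal sketch of how the three parameters above might be combined in one request. It assumes the extracting handler is registered at /update/extract on the core, that the schema has the ignored_* dynamic field from the uprefix example, and that the file can be posted as the raw request body. The helper names (build_extract_url, index_file) and the core URL are hypothetical. If the schema defines a uniqueKey such as id, a literal.id parameter would probably be needed as well.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_core_url, filename, content_field="text"):
    """Build the /update/extract URL with the parameters discussed above.

    solr_core_url is assumed to look like http://localhost:8983/solr/mycore.
    """
    params = {
        "literal.filename": filename,   # user-assigned value, not taken from the file
        "uprefix": "ignored_",          # unknown Tika fields go to ignored_* and are dropped
        "fmap.content": content_field,  # move Tika's extracted body into this Solr field
        "commit": "true",               # make the document searchable right away
    }
    return f"{solr_core_url.rstrip('/')}/update/extract?{urlencode(params)}"


def index_file(solr_core_url, path, filename):
    """POST the raw bytes of one file to Solr and return the HTTP status code."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_url(solr_core_url, filename),
        data=body,
        # Generic content type; Tika is expected to detect the real format itself.
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urlopen(request) as response:
        return response.status
```

Calling index_file("http://localhost:8983/solr/mycore", "report.pdf", "doc1") would then send report.pdf with filename set to doc1. Pass content_field="content" (or whatever the schema calls the body field) if the target field is not named text.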

And a final hint - to fully customize the handling, but still use the available Solr server and Tika for processing documents (and then knead the data a bit more before indexing):

extractOnly: Default is false. If true, returns the extracted content from Tika without indexing the document. (See also extractFormat)
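
A rough sketch of the extractOnly route in Python, under the same assumptions as before (handler at /update/extract, file sent as the raw request body). The helper names are hypothetical, and the layout of the returned JSON is not assumed here, so the function simply returns the parsed response for inspection.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_only_url(solr_core_url):
    """Build a URL that asks Tika to extract content without indexing anything."""
    params = {
        "extractOnly": "true",    # return the extracted content, do not index
        "extractFormat": "text",  # plain text instead of the default XML markup
        "wt": "json",             # ask Solr for a JSON response
    }
    return f"{solr_core_url.rstrip('/')}/update/extract?{urlencode(params)}"


def extract_only(solr_core_url, path):
    """Send one file to Solr for extraction and return the parsed JSON response."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = Request(
        build_extract_only_url(solr_core_url),
        data=body,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.load(response)
```

The script can then clean up the extracted text and index it as an ordinary document with just the filename and content fields.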