We just upgraded from Solr 6.3 to 7.5. With no changes to the schema or config, we now get a 400 error on almost every PDF file we try to index. Solr 6.3 indexed these same files without problems. All other complex file types are indexed as before; only the PDF files fail.

Clue #1: Out of ~1900 PDF files, only 2 were successfully processed. Most of our PDFs have a subject and a title, but these 2 did not.

Clue #2: In the console log we see failure messages like this: RequestHandlerBase org.apache.solr.common.SolrException: undefined field: "pdf_docinfo_title"

I can't find a field with that name in the schema. A Google search for pdf_docinfo_title didn't turn up anything useful.

1 Answer

Since you don't have a field with that name, and no catch-all definition, Solr barfs when Tika hands it back a document with the field pdf_docinfo_title set.

Tika is usually upgraded between Solr versions. The older version of Tika bundled with 6.3 did not emit this field, while the version bundled with 7.5 does. It holds the document title of the PDF file.

Besides adding a field with that name to your schema, you can use the fmap parameter to map fields from Tika to a different field in your schema:

fmap.<source_field>

Maps (moves) one field name to another. The source_field must be a field in incoming documents, and the value is the Solr field to map to. Example: fmap.content=text causes the data in the content field generated by Tika to be moved to Solr’s text field.
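To make the fmap option concrete for this case, here is a minimal sketch that builds an extract request URL mapping pdf_docinfo_title onto another field. The core name mycore, the document id doc1, and the target field title are assumptions for illustration; substitute whatever your setup actually uses. It only assembles the URL and does not send the file.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, field_map):
    """Build an /update/extract URL that renames Tika fields via fmap.* parameters."""
    params = {"literal.id": doc_id, "commit": "true"}
    for source_field, target_field in field_map.items():
        # One fmap.<source_field>=<target_field> parameter per mapping.
        params["fmap." + source_field] = target_field
    return base_url + "/update/extract?" + urlencode(params)


url = build_extract_url(
    "http://localhost:8983/solr/mycore",  # hypothetical host and core name
    "doc1",                               # hypothetical document id
    {"pdf_docinfo_title": "title"},       # assumes a "title" field exists in the schema
)
print(url)
```

The PDF itself would then be posted to that URL as the request body, for example with curl or any HTTP client.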

You can also use the parameter uprefix to get the Tika module to prefix all unknown fields with a common prefix:

uprefix

Prefixes all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: uprefix=ignored_ would effectively ignore all unknown fields generated by Tika given the example schema contains <dynamicField name="ignored_*" type="ignored"/>
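
As a rough illustration of what uprefix does to the field names, the sketch below imitates the renaming step locally. This is not Solr's code; the function apply_uprefix, the schema field set, and the sample values are all made up for the example.

```python
def apply_uprefix(fields, schema_fields, uprefix):
    """Return a copy of fields where names missing from the schema get the prefix."""
    result = {}
    for name, value in fields.items():
        if name in schema_fields:
            result[name] = value            # known field: kept as-is
        else:
            result[uprefix + name] = value  # unknown field: prefixed
    return result


# Hypothetical schema and Tika output, for illustration only.
schema_fields = {"id", "content"}
tika_fields = {
    "id": "doc1",
    "content": "body text",
    "pdf_docinfo_title": "Annual report",
}

renamed = apply_uprefix(tika_fields, schema_fields, "ignored_")
print(renamed)
```

Here pdf_docinfo_title becomes ignored_pdf_docinfo_title, which the ignored_* dynamic field then matches, so the document is accepted instead of failing with an undefined field error.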