3
votes

I have a directory of pdf files: document.01.pdf, document.02.pdf, and so on. I am running Solr 6.6.2. I have run

solr create -c documents

to create a core called documents. I want to upload the pdf files to Solr and have it index the text that they contain, not just their metadata.

I understand that it's Tika's job to do the extracting, and that it's the job of the solr.extraction.ExtractingRequestHandler to call Tika. My solrconfig.xml (which is just the default created by solr create) contains the following section:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>

If I run

post -c documents path-to-pdf-directory

I end up with entries in the index that contain metadata about the PDF files and an id that's the full path to the file, but not the file content. What I want is these metadata fields plus an additional field called something like text or content to contain the text in the PDFs.

Following examples like the one here, I also tried commands like

curl 'http://localhost:8983/solr/documents/update/extract?literal.id=doc1&commit=true' -F "[email protected]"

but this does the same thing.

I've been searching all over for documentation on how to do this, but everything I find makes it sound like I'm doing everything right.

How do I do this? This seems like such basic functionality that the fact it isn't obvious makes me think I'm misunderstanding something fundamental.

2 Answers

2
votes

You are asking Solr to put all the extracted text in a field named _text_ (with both a leading and a trailing underscore) with this:

<str name="fmap.content">_text_</str>

If you don't see such a field after indexing, check that it is defined in schema.xml (or the managed schema) with the right indexed/stored attributes. It doesn't strictly need an explicit definition, since a matching dynamicField also works, but for a quick verification just define it.
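
One way to run that check without editing files by hand is Solr's Schema API. The following is only a sketch, assuming Solr is listening on localhost:8983 and the core is the documents core from the question. The field name pdf_text and the choice of the text_general type are hypothetical, picked for illustration; the helper function names are made up too.

```shell
#!/bin/sh
# Assumed values: adjust to your setup.
SOLR_URL="http://localhost:8983/solr"   # assumption: default local Solr
CORE="documents"                        # core name from the question

# Build the Schema API URL that describes a single field.
field_url() {
  printf '%s/%s/schema/fields/%s\n' "$SOLR_URL" "$CORE" "$1"
}

# Print a field's definition; look at "indexed" and "stored" in the reply.
show_field() {
  curl -s "$(field_url "$1")"
}

# Build the JSON body asking the Schema API to add an indexed, stored field.
# text_general is assumed to exist as a field type in the core's schema.
add_field_json() {
  printf '{"add-field":{"name":"%s","type":"text_general","indexed":true,"stored":true,"multiValued":true}}\n' "$1"
}

# Send the add-field request for the given (hypothetical) field name.
add_stored_field() {
  curl -s -X POST -H 'Content-type:application/json' \
       --data-binary "$(add_field_json "$1")" \
       "$SOLR_URL/$CORE/schema"
}

# Usage (needs a running Solr):
#   show_field _text_
#   add_stored_field pdf_text
```

If show_field reports "stored":false for the target field, the text can be searchable and still not appear in query results. In that case, mapping fmap.content to a stored field such as the hypothetical pdf_text above (in solrconfig.xml or as a request parameter) should make the content visible.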

2
votes

I changed the value of fmap.content for the ExtractingRequestHandler to text_en because text_en is listed as a field type in my managed schema and the text in my documents is in English.

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">text_en</str>
  </lst>
</requestHandler>

Now, when I run post, the contents of each document are indexed in a text_en field along with all the other metadata.
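
To confirm a change like this, a check along the following lines may help. It is a sketch only, assuming Solr on localhost:8983 and the documents core; the helper function names and the form field name myfile are arbitrary. The first helper asks for just the id and the content field of one document, and the second uses the handler's extractOnly parameter to return what Tika extracts from a file without indexing it.

```shell
#!/bin/sh
# Assumed values: adjust to your setup.
SOLR_URL="http://localhost:8983/solr"   # assumption: default local Solr
CORE="documents"                        # core name from the question

# Build a select URL returning only the id and the given content field.
select_url() {
  printf '%s/%s/select?q=*:*&rows=1&fl=id,%s\n' "$SOLR_URL" "$CORE" "$1"
}

# Build an extract URL that returns Tika's output as text, indexing nothing.
extract_only_url() {
  printf '%s/%s/update/extract?extractOnly=true&extractFormat=text\n' "$SOLR_URL" "$CORE"
}

# Query one indexed document and show the chosen content field.
check_indexed() {
  curl -s "$(select_url "$1")"
}

# Upload a local file and print what Tika extracted from it.
check_tika() {
  curl -s "$(extract_only_url)" -F "myfile=@$1"
}

# Usage (needs a running Solr):
#   check_indexed text_en
#   check_tika document.01.pdf
```

If check_tika prints the PDF text but check_indexed shows no content field, extraction is working and the issue is more likely in how the target field is mapped or stored.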