I have a directory of PDF files: document.01.pdf, document.02.pdf, and so on. I am running Solr 6.6.2. I have run

solr create -c documents

to create a core called documents. I want to upload the PDF files to Solr and have it index the text they contain, not just their metadata.
I understand that it's Tika's job to do the extracting, and that it's the job of solr.extraction.ExtractingRequestHandler to call Tika. My solrconfig.xml (which is just the default created by solr create) contains the following section:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>
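In case it's relevant: the managed-schema that solr create generated defines the _text_ field that fmap.content points at. As far as I can tell (quoting from memory, so it may differ slightly), it looks roughly like this, i.e. indexed but not stored:

<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>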
If I run

post -c documents path-to-pdf-directory

I end up with entries in the index that contain metadata about the PDF files and an id that's the full path to the file, but not the file content. What I want is those metadata fields plus an additional field, called something like text or content, that holds the text of the PDFs.
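To make the goal concrete, what I'd like a query result for one of these documents to look like is roughly the following (the field names here are just placeholders for whatever the correct ones turn out to be):

{
  "id": "/path/to/document.01.pdf",
  "title": "...",
  "last_modified": "...",
  "content": "the full extracted text of the PDF"
}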
Following examples like the one here, I also tried commands like

curl 'http://localhost:8983/solr/documents/update/extract?literal.id=doc1&commit=true' -F "myfile=@document.01.pdf"

but this does the same thing.
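From what I can tell, the same extract parameters can also be passed on the request itself, so (if I'm reading the docs right) the command above should be equivalent to something like this, with the lowercasing and content mapping made explicit:

curl 'http://localhost:8983/solr/documents/update/extract?literal.id=doc1&lowernames=true&fmap.content=_text_&commit=true' -F "myfile=@document.01.pdf"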
I've been searching all over for documentation on how to do this, but everything I find makes it sound like I'm already doing everything right. How do I get the extracted text into its own field? This seems like such basic functionality that the fact it isn't obvious makes me think I'm misunderstanding something fundamental.