I'm using the Solr PHP extension to interact with Apache Solr. I'm indexing data from a database, and I also want to index the contents of external files (such as PDF and PPTX files).
The indexing logic is as follows. Suppose schema.xml has the following fields defined:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="created" type="tlong" indexed="true" stored="true" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="filepath" type="text_general" indexed="false" stored="true"/>
<field name="filecontent" type="text_general" indexed="false" stored="true"/>
A database entry may or may not have a file associated with it.
Here is my indexing code:
// $post is a stdClass object holding the database row
$doc = new SolrInputDocument();
$doc->addField('id', $post->id);
$doc->addField('name', $post->name);
....
....
$res = $client->addDocument($doc);
$client->commit();
Next, I want to add the contents of the PDF file to the same Solr document as above.
This is the cURL code:
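$ch = curl_init('http://localhost:8010/solr/update/extract?');
curl_setopt($ch, CURLOPT_POST, 1);
// CURLFile replaces the '@' path prefix, which PHP 7 no longer supports
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => new CURLFile($post->filepath)));
// Return the response body instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);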
But I think I'm missing something. I have read the documentation, but I cannot figure out how to retrieve the contents of the file and then add them to the existing Solr document in the filecontent field.
EDIT #1:
If I set literal.id=xyz in the cURL request, it creates a new Solr document with id=xyz. I don't want a new Solr document created. I want the contents of the PDF to be indexed and stored as a field in the previously created Solr document.
$doc = new SolrInputDocument(); // The Solr document is created
$doc->addField('id', 98765); // The document created above is assigned id = 98765
....
....
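$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt($ch, CURLOPT_POST, 1);
// CURLFile replaces the '@' path prefix, which PHP 7 no longer supports
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => new CURLFile($post->filepath)));
// Return the response body instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);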
I want the above Solr document (id = 98765) to have a field in which the contents of the PDF are indexed and stored.
But the cURL request above creates a new, separate document (with id = 1) instead. I don't want that.