Retrieving document ID in apache solr clustering results

Question

I'm trying to cluster Solr search results using the Lingo clustering algorithm that ships with Solr 6. Clustering itself works, but I need the clustering results to include the document IDs (the ID field is called P_ID here). I've been working on this without any luck, so any help is greatly appreciated.
Here is the relevant part of the solrconfig.xml file:

  <lib dir="${solr.install.dir:../../..}/contrib/clustering/lib/" regex=".*\.jar" />

  <lib dir="${solr.install.dir:../../..}/dist/" regex="solr-clustering-\d.*\.jar" />

  <requestHandler name="/clustering"
                  startup="lazy"
                  enable="${solr.clustering.enabled:true}"
                  class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="clustering">true</bool>
      <bool name="clustering.results">true</bool>
      <bool name="carrot.produceSummary">true</bool>
      <!-- Logical field to physical field mapping. -->
      <str name="carrot.url">P_ID</str>
      <str name="carrot.title">input</str>
      <str name="carrot.snippet">input</str>

      <!-- Configure any other request handler parameters. We will cluster the
           top 100 search results so bump up the 'rows' parameter. -->
      <str name="defType">edismax</str>
      <str name="qf">
        input^1.4
      </str>
      <str name="q.alt">*:*</str>
      <str name="rows">100</str>
      <str name="fl">*</str>
    </lst>
    <arr name="last-components">
      <str>clustering</str>
    </arr>
  </requestHandler>

And here are the results that I get:

{

  "responseHeader":{
    "status":0,
    "QTime":24},
  "response":{"numFound":16,"start":0,"docs":[
      {
        "date":"2016-09-18 13:50:07.0",
        "input":"Text",
        "type":"q",
        "U_ID":2,
        "P_ID":1,
        "_version_":1548945773383647232},
      {
        "date":"2016-09-18 13:53:09.0",
        "input":"Text 2",
        "type":"q",
        "U_ID":10,
        "P_ID":2,
        "_version_":1548945773385744384},
      {
        "date":"2016-09-18 14:20:29.0",
        "input":"Text 3",
        "type":"q",
        "U_ID":12,
        "P_ID":3,
        "_version_":1548945773385744385},
      {
        "date":"2016-09-18 13:50:07.0",
        "input":"Text 4",
        "type":"q",
        "U_ID":3,
        "P_ID":4,
        "_version_":1548945773385744386}
      ]
  },
  "clusters":[{
      "labels":["label 1"],
      "score":6.723284893605449,
      "docs":["text ",
        "Text 2",
        "Text 4"
        ]},
    {
      "labels":["lable 2"],
      "score":10.22078770519469,
      "docs":["text 3",
        "Text 2"
        ]},
    {
      "labels":["label 3"],
      "score":8.32470981979922,
      "docs":["text 1",
        "text 3"
        ]}
    ]}

As you can see, the "clusters" section lists the clusters and their documents, but not the document IDs. I even tried changing the fl parameter to P_ID (the document ID), but that didn't work. The P_ID values show up in the "response" section but not in the "clusters" section.

What is the uniqueKey setting of your schema? If I understand the docs correctly, the clustering component will use that field as the field returned in the docs array. — MatsLindh
The unique key is P_ID (the document ID here) and it is already specified in the schema, but I don't understand why the clustering returns the documents rather than the P_IDs. — vmontazeri

vmontazeri · Accepted Answer · 2016-11-02T18:35:58

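OK, so I came up with a solution that satisfies what I needed. All I had to do was put the document ID field (P_ID) in the <uniqueKey></uniqueKey> element in the schema. As the comment above suggests, the clustering component appears to use the uniqueKey field for the entries in each cluster's docs array.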
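
For reference, a minimal sketch of what the relevant schema lines might look like (in schema.xml or managed-schema). The field attributes and the string type are assumptions for illustration, not my exact definition; keep whatever definition P_ID already has in your schema. Changing the uniqueKey typically requires reindexing the documents.

  <!-- Sketch only: the type and attributes of P_ID are assumed here -->
  <field name="P_ID" type="string" indexed="true" stored="true" required="true" multiValued="false" />

  <!-- Declare P_ID as the unique document key -->
  <uniqueKey>P_ID</uniqueKey>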
