I have a problem with Solr 5.3.1. My schema is rather simple. I have one uniqueKey, "id", which is a string: indexed, stored, required and not multivalued.

I first add documents with "content_type:document_unfinished" and then overwrite the same document, using the same id but with "content_type:document". The document is then in the index twice. Again, the only uniqueKey is "id", as a string. The id originally comes from an integer primary key in MySQL.

It also looks like I am not the only one this has happened to:

http://lucene.472066.n3.nabble.com/uniqueKey-not-enforced-td4015086.html

http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-td4129651.html

In my case not all the documents in the index are duplicated, just some. I initially assumed that a document gets overwritten on commit when the same uniqueKey already exists in the index, but that doesn't seem to work the way I expected. I do not want to simply update some fields in the document; I want to replace it completely, including all its children.

Some stats: there are around 350k documents in the index, mostly with childDocuments. The documents are distinguished by a "content_type" field. I used SolrJ to import them like this:

    HttpSolrServer server = new HttpSolrServer(url);
    // docs is a Collection<SolrInputDocument>
    server.add(docs);
    server.commit();

I always add the whole document again, with all its children. It's nothing overly fancy. I end up with duplicated documents for the same uniqueKey. Nothing else writes to the index. I run only Solr with the integrated Jetty, and I do not open the Lucene index "manually" from Java.

What I did then was to delete the document and insert it again. That seemed to work for a while, but then, under some conditions, it started to give this error message:

Parent query yields document which is not matched by parents filter

The document where that happens seems to be completely random. Only one pattern emerges: it is always a childDocument. I do not run anything special; I basically downloaded the Solr package from the website and run it with bin/solr start.

Anyone any ideas?

EDIT 1

I think I found the problem, which seems to be a bug? To reproduce the issue:

I downloaded Solr 5.3.1 to a Debian VM in VirtualBox and started it with bin/solr start. I then added a new core with the basic config set. I changed nothing in the basic config set, just copied it over and added the core.

This leads to two documents with the same id in the index:

    SolrClient solrClient = new HttpSolrClient("http://192.168.56.102:8983/solr/test1");
    SolrInputDocument inputDocument = new SolrInputDocument();
    inputDocument.setField("id", "1");
    inputDocument.setField("content_type_s", "doc_unfinished");
    solrClient.add(inputDocument);
    solrClient.commit();
    solrClient.close();

    solrClient = new HttpSolrClient("http://192.168.56.102:8983/solr/test1");
    inputDocument = new SolrInputDocument();
    inputDocument.setField("id", "1");
    inputDocument.setField("content_type_s", "doc");
    SolrInputDocument childDocument = new SolrInputDocument();
    childDocument.setField("id","1-1");
    childDocument.setField("content_type_s", "subdoc");
    inputDocument.addChildDocument(childDocument);
    solrClient.add(inputDocument);
    solrClient.commit();
    solrClient.close();

Searching with:

http://192.168.56.102:8983/solr/test1/select?q=*%3A*&wt=json&indent=true

leads to the following output:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "q": "*:*",
      "indent": "true",
      "wt": "json",
      "_": "1450078098465"
    }
  },
  "response": {
    "numFound": 3,
    "start": 0,
    "docs": [
      {
        "id": "1",
        "content_type_s": "doc_unfinished",
        "_version_": 1520517084715417600
      },
      {
        "id": "1-1",
        "content_type_s": "subdoc"
      },
      {
        "id": "1",
        "content_type_s": "doc",
        "_version_": 1520517084838101000
      }
    ]
  }
}
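
With only three documents the duplicate is easy to see, but in a larger result set it is less obvious. The following is a minimal sketch, using only the Java standard library, of how the ids taken from such a response could be checked for repeats. `DuplicateIdFinder` and `findDuplicates` are hypothetical names chosen for illustration and are not part of SolrJ; extracting the ids from the response is assumed to have happened already.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicateIdFinder {
    private DuplicateIdFinder() {}

    /** Returns every id that occurs more than once, mapped to its occurrence count. */
    static Map<String, Integer> findDuplicates(List<String> ids) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : ids) {
            // Count each occurrence of the id
            counts.merge(id, 1, Integer::sum);
        }
        // Keep only the ids that were seen at least twice
        counts.values().removeIf(count -> count < 2);
        return counts;
    }

    public static void main(String[] args) {
        // The ids in the order returned by the select query above
        System.out.println(findDuplicates(Arrays.asList("1", "1-1", "1"))); // prints {1=2}
    }
}
```

For the response shown above, only id "1" is reported, with a count of 2.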

What am I doing wrong?

I am currently facing a situation which seems to be very similar to yours: I also use SolrJ, I also use childDocuments, and I also just recently noticed that after a plain update of a document, the document exists twice in the index with the same unique key. I also figured that I could try to explicitly delete the document via solrClient.deleteById(id), which seems to solve the problem. However, since you state that this is not a real fix, I still worry. - SebastianRiemer
I've written a small Java test application which reproduces the issue as you describe it. It can be found here: github.com/sebastianriemer/SolrDuplicateTest I would be interested to know whether you get the same result as I do. I also wrote to the solr-user mailing list and will post the answers back here. - SebastianRiemer
I think I read somewhere that Solr treats documents with child documents differently from documents without. I cannot recall where I read it, but I started adding a subdocument right away, from the first version of each document on. Overwriting has worked for me since then, but I still consider it a "bug" that Solr does not treat unique keys as, well, unique. I am adding a few 10k documents to the index every day and it works. - tom_w
By the way, I'd suggest writing an answer to your own question with your solution and accepting it. As far as I know this is considered good practice and helps others who have the same problem. - SebastianRiemer

1 Answer


Thanks for your feedback! I am writing this as an answer since it would be too long for a comment. I actually got the same response from the mailing list:

Mikhail Khludnev:

Hello Sebastian,

Mixing standalone docs and blocks doesn't work. There are plenty of issues open.

On Wed, Mar 9, 2016 at 3:02 PM, Sebastian Riemer wrote:

Hi,

to actually describe my problem in short, instead of just linking to the test application, using SolrJ I do the following:

1) Create a new document as a parent and commit

    SolrInputDocument parentDoc = new SolrInputDocument();
    parentDoc.addField("id", "parent_1");
    parentDoc.addField("name_s", "Sarah Connor");
    parentDoc.addField("blockJoinId", "1");
    solrClient.add(parentDoc);
    solrClient.commit();

2) Create a new document with the same unique-id as in 1) with a child document appended

    SolrInputDocument parentDocUpdateing = new SolrInputDocument();
    parentDocUpdateing.addField("id", "parent_1");
    parentDocUpdateing.addField("name_s", "Sarah Connor");
    parentDocUpdateing.addField("blockJoinId", "1");

    SolrInputDocument childDoc = new SolrInputDocument();
    childDoc.addField("id", "child_1");
    childDoc.addField("name_s", "John Connor");
    childDoc.addField("blockJoinId", "1");

    parentDocUpdateing.addChildDocument(childDoc);
    solrClient.add(parentDocUpdateing);
    solrClient.commit();

3) This results in 2 documents with id="parent_1" in the Solr index

Is this normal behaviour? I thought the existing document should be updated instead of generating a new document with same id.

For a full working test application please see the original message.

Best regards, Sebastian

I think it is a known issue, and there are several tickets which relate to it. Still, I am glad that there is a way to deal with it: adding child docs right from the beginning, so that standalone documents and blocks are never mixed under the same id. The related tickets are https://issues.apache.org/jira/browse/SOLR-6096, https://issues.apache.org/jira/browse/SOLR-5211 and https://issues.apache.org/jira/browse/SOLR-7606.
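
If standalone documents are already in the index, another option might be to delete both the standalone document and any existing block before re-adding the parent. Solr stamps every document of a block with a `_root_` field holding the parent's id, so a delete query on `id` plus `_root_` should cover both cases. The sketch below only builds that query string with the Java standard library. It assumes the schema defines `_root_` as the stock config sets do, and `BlockDeleteQuery`, `quote` and `forParentId` are hypothetical helper names, not part of SolrJ.

```java
final class BlockDeleteQuery {
    private BlockDeleteQuery() {}

    /** Wraps the value in double quotes, escaping backslashes and double quotes inside it. */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Builds a query matching a standalone document with this id and every document of its block. */
    static String forParentId(String parentId) {
        String quoted = quote(parentId);
        return "id:" + quoted + " OR _root_:" + quoted;
    }

    public static void main(String[] args) {
        System.out.println(forParentId("parent_1")); // prints id:"parent_1" OR _root_:"parent_1"
    }
}
```

The resulting string could be passed to `solrClient.deleteByQuery(...)` before calling `solrClient.add(...)` with the new parent and its children. This has not been verified against Solr 5.3.1, so treat it as a starting point rather than a confirmed fix.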