5
votes

This week I had an issue with a Solr index: http://lucene.472066.n3.nabble.com/corrupted-index-in-slave-td4054769.html

Today that error started to happen constantly, on almost every request, so I created a JIRA issue because I thought it was a bug: https://issues.apache.org/jira/browse/SOLR-4707

As you can read there, in the end it was caused by a failure in the Solr master-slave replication. Now I don't know whether we should think about migrating to SolrCloud, since master-slave replication does not seem to fit our requirements:

  • index size: ~20 million documents, ~9GB
  • ~1200 updates/min
  • ~10000 queries/min (distributed over 2 slaves), using MoreLikeThis, RealTimeGet, TermVectorComponent and SearchHandler

I would be grateful if anyone could help me answer these questions:

  • Would it be advisable to migrate to SolrCloud? Would it have an impact on replication performance?
  • In that case, which would perform better: keeping a full copy of the index on every server, or splitting it across shard servers?
  • How many shards and replicas would you advise to ensure high availability?

Kind Regards,

Victor

1
If you can wait a bit, Solr 5 will come out within the next year, and it has a whole slew of positive changes that further support SolrCloud. IMO SolrCloud in 4.x still requires a lot of maintenance, so if you can wait, I would just wait. Also, deciding how to shard sucks. - Xinzz
I solved the problem thanks to this article: searchhub.org/2013/08/23/… After reading it, I understood that the soft commit time was misconfigured for our requirements (index-heavy, query-heavy): we had too many soft commits, but we didn't need the data to be available in real time. Therefore, as the article suggests, I set the soft commit interval quite long and the hard commit interval to a small value, in my case 15 seconds. - vruizext
Also, optimizing the indexing process by sending "bulk" update messages containing several items, rather than one request for every item being indexed, and choosing a better strategy for caching query results helped to reduce the load on the Solr servers and improved the overall quality of the service. - vruizext
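For readers wondering what the "bulk" updates mentioned above might look like, here is a minimal sketch. It assumes the JSON update format and a hypothetical endpoint such as http://localhost:8983/solr/collection1/update; the helper names (`batches`, `build_update_request`, `index_all`) and the batch size of 500 are illustrative, not taken from the original setup.

```python
import json
import urllib.request
from typing import Dict, Iterable, Iterator, List


def batches(docs: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Group documents into lists of at most `size` items."""
    batch: List[Dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the last, possibly smaller, batch.
        yield batch


def build_update_request(update_url: str, batch: List[Dict]) -> urllib.request.Request:
    """Build one POST request carrying a whole batch as a JSON array."""
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_all(update_url: str, docs: Iterable[Dict], size: int = 500) -> int:
    """Send all documents in batches; returns the number of requests sent.

    No explicit commit is issued here: the assumption is that the
    autoCommit / autoSoftCommit settings on the server take care of it.
    """
    sent = 0
    for batch in batches(docs, size):
        with urllib.request.urlopen(build_update_request(update_url, batch)) as resp:
            resp.read()
        sent += 1
    return sent
```

With this shape, one minute of traffic at ~1200 updates becomes a handful of requests instead of 1200; the right batch size depends on document size and would need to be measured.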

1 Answer

3
votes

Well, the answer to all your questions depends on what exactly you want from SolrCloud.

  • Yes, it would be advisable to move over to SolrCloud, as it provides high availability, scalability and near-real-time search, plus automated hot replication. These features come at the cost of a slight performance degradation (you won't even notice it in a well-configured cluster).
  • I would suggest using shared configuration and letting Solr maintain the index data for you (I am sure you will bring a smile to the TechOps people if you do). This will reduce human error and resource requirements as well.
  • The answer to your last question depends entirely on your cloud deployment. You should start with a 2-shard, 2-replica configuration and create a test deployment to check that it serves your needs. If not, try different combinations of shard and replica counts until you get what you want (I know it's a pain!).
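As a rough illustration of the 2-shard, 2-replica starting point, the sketch below builds a Collections API CREATE call. The base URL and the collection name `articles` are hypothetical, and it assumes that "2 replicas" means two copies of each shard (`replicationFactor=2`), i.e. four cores in total.

```python
from urllib.parse import urlencode


def create_collection_url(base_url: str, name: str,
                          num_shards: int = 2,
                          replication_factor: int = 2) -> str:
    """Build the Collections API URL that creates a SolrCloud collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


def total_cores(num_shards: int, replication_factor: int) -> int:
    """Number of cores the cluster must host for this layout."""
    return num_shards * replication_factor


# Hypothetical test deployment: 2 shards x 2 copies each.
url = create_collection_url("http://localhost:8983/solr", "articles")
print(url)
print(total_cores(2, 2))
```

Trying another layout is then just a matter of changing the two numbers and re-running your load test against the new collection.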

Finally, don't forget to estimate your future growth (how much data you will add to your cluster in the next couple of years) and keep it in mind when you decide on shards and replicas.