2
votes

I want to establish a SolrCloud clsuter for over 10 millions of news articles. After reading this article: Shards and Indexing Data in SolrCloud, I have a plan as follows:

  1. Add prefix ED2001! to document ID where ED means some newspaper source and 2001 is the year part in published date of news article, i.e. I want to put all news articles of specific news paper source published in specific year to a shard.
  2. Create collection with router.name set to compositeID.
  3. Add documents?
  4. Query Collection?

Practically, I got some questions:

  1. How to add doucments based on this plan? Do I have to specify special parameters when updating the collection/core?
  2. Is this called "custom sharding"? If not, what is "custom sharding"?
  3. Is auto sharding a better choice for my case since there's a shard-splitting feature for auto sharding when the shard is too big?
  4. Can I query without _router_ parameter?

EDIT @ 2015/9/2:

  1. This is how I think SolrCloud will do: "The amount of news articles of specific newspaper source of specific year tends to be around a fix number, e.g. Every year ED has around 80,000 articles, so each shard's size won't increase dramatically. For the next year's news articles of ED, I only have to add prefix 'ED2016!' to document ID, SolrCloud will create a new shard for me (which contains all ED2016 articles), and later the Leader will spread the replica of this new shard to other nodes (per replica per node other than leader?)". Am I right? If yes, it seems no need for shard-splitting.
2

2 Answers

5
votes

Answer-1: If have the schema (structure) of the document then you can provide the same in schema.xml configuration or you can use Solr's schema-less mode for indexing the document. The schema-less mode will automatically identify the fields in your document and index them. The configuration of schema-less mode is little different then schema based configuration mode in solr. Afterwards, you need to send the documents to solr for indexing using curl or solrj java api. Essentially, solr provides rest end points for all the different operations. You can write the client in any language which suits you better.

Answer-2: What you have mentioned in your plan, use of compositeId, is called custom sharding. Because you are deciding to which shard a particular document should go.

Answer-3: I would suggest to go with auto-sharding feature if are not certain how much data you need to index at present and in future. As the index size increases you can split the shards and scale the solr horizontally.

Answer-4: I went through the solr documentation, did not find anywhere mentioning _route_ as mandatory parameter. But in some situations, this may improve query performance because it overcomes network latency when querying all the shards.

Answer-5: The meaning of auto-sharding is routing the document to a shards, based on the hash range assigned while creating the shards. It does not create the new shards automatically, just by specifying a new prefix for compositeId. So once the index grows large enough in size, you might need to split it. Check here for more.

3
votes

This is actually a guide to answer my own question:

I kinda understand some concepts:

  1. "custom sharding" IS NOT "custom hashing".
  2. Solr averagely splits hash values as default hashing behavior.
  3. compositeId router applies "custom hashing" cause it changes default hashing behavior by prefixing shard_key/num-of-bits.
  4. Implicit router applies "custom sharding" since we need to manually specify which shards our docs will be sent to.
  5. compositeId router still is auto sharding since it's Solr who see the shard_key prefix and route the docs to specific shards.
  6. compositeId router needs to specify numShards parameter (possbily because Solr needs to distribute various hash value space ranges for each of the shard).

So obviously my strategy doesn't work since I need to always add in new year's news articles to Solr and there's no way I can predict how many shards in advance. So to speak, Implicit router seems a possible choice for me (We create shard we need and add docs to shard we intend to).