15
votes

Solr 1.4 Enterprise Search Server recommends doing large updates on a copy of the core and then swapping it in for the main core. I am following these steps (also sketched as curl commands after the list):

  1. Create prep core: http://localhost:8983/solr/admin/cores?action=CREATE&name=prep&instanceDir=main
  2. Perform index update, then commit/optimize on prep core.
  3. Swap main and prep core: http://localhost:8983/solr/admin/cores?action=SWAP&core=main&other=prep
  4. Unload prep core: http://localhost:8983/solr/admin/cores?action=UNLOAD&core=prep
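
The same cycle as curl commands, for reference (a sketch; the host and core names match the URLs above, and updates.xml just stands in for whatever update payload gets posted):

    # 1. create the prep core, reusing main's instanceDir
    curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=prep&instanceDir=main'

    # 2. send the updates to prep, then commit and optimize
    curl 'http://localhost:8983/solr/prep/update' -H 'Content-type:text/xml' --data-binary @updates.xml
    curl 'http://localhost:8983/solr/prep/update' -H 'Content-type:text/xml' --data-binary '<commit/>'
    curl 'http://localhost:8983/solr/prep/update' -H 'Content-type:text/xml' --data-binary '<optimize/>'

    # 3. swap prep in for main
    curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=main&other=prep'

    # 4. unload the old core (named prep again after the swap)
    curl 'http://localhost:8983/solr/admin/cores?action=UNLOAD&core=prep'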

The problem I am having is that the core created in step 1 doesn't have any data in it. If I were going to do a full index of everything and the kitchen sink, that would be fine, but if I just want to update a (large) subset of the documents, that's obviously not going to work.

(I could merge the cores, but part of what I'm trying to do is get rid of any deleted documents without trying to make a list of them.)

Is there some flag to the CREATE action that I'm missing? The Solr Wiki page for CoreAdmin is a little sparse on details.

Possible Solution: Replication

Someone on solr-user suggested using replication. Using it in this scenario would (to my understanding) require the following steps:

  1. Create a new PREP core based off the config of the MAIN core
  2. Change the config of the MAIN core to be a master (see the solrconfig.xml sketch after this list)
  3. Change the config of the PREP core to be a slave
  4. Cause/wait for a sync?
  5. Change the config of the PREP core to no longer be a slave
  6. Perform index update, then commit/optimize on PREP core.
  7. Swap PREP and MAIN cores
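
For steps 2 and 3, the relevant pieces of solrconfig.xml would look roughly like this (a sketch using the stock Solr 1.4 ReplicationHandler; the masterUrl and pollInterval values are only examples):

    <!-- MAIN core: act as a replication master -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">optimize</str>
      </lst>
    </requestHandler>

    <!-- PREP core: act as a slave that pulls from MAIN -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://localhost:8983/solr/main/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>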

A simpler replication-based setup would be to configure a permanent PREP core that is always the master. The MAIN core (on as many servers as needed) could then be a slave of the PREP core. Indexing could happen on the PREP core as fast or as slow as necessary.
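
In that setup each MAIN slave just points its masterUrl at the PREP core, and replication can then be watched or forced over HTTP (the slave host name below is made up):

    # check replication status on one of the MAIN slaves
    curl 'http://search1:8983/solr/main/replication?command=details'

    # pull from the PREP master immediately instead of waiting for the next poll
    curl 'http://search1:8983/solr/main/replication?command=fetchindex'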

Possible Solution: Permanent PREP core and double-updating

Another idea I came up with was this (also involving a permanent PREP core; a curl sketch follows the list):

  1. Perform index update, then commit/optimize on PREP core.
  2. Swap PREP and MAIN cores.
  3. Re-perform index update, then commit/optimize on what is now the PREP core. It now has the same data as the MAIN core (in theory) and will be around, ready for the next index operation.
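
A rough curl version of that sequence (updates.xml again stands in for the real update payload):

    # 1. index into the permanent prep core, then commit/optimize
    curl 'http://localhost:8983/solr/prep/update' -H 'Content-type:text/xml' --data-binary @updates.xml
    curl 'http://localhost:8983/solr/prep/update' -H 'Content-type:text/xml' --data-binary '<optimize/>'

    # 2. swap prep in for main
    curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=main&other=prep'

    # 3. replay the same update on what is now the prep core
    curl 'http://localhost:8983/solr/prep/update' -H 'Content-type:text/xml' --data-binary @updates.xml
    curl 'http://localhost:8983/solr/prep/update' -H 'Content-type:text/xml' --data-binary '<commit/>'
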
I think that procedure is intended for reindexing everything. What are you using to index? DIH or a custom process? – Mauricio Scheffer
Have you tried just updating the documents on the same core? Does it really perform so badly? – Mauricio Scheffer
Well, have you tried? You might be complicating things unnecessarily... – Mauricio Scheffer
We tried and performance was within tolerance. We don't have "a lot" of documents - on the order of 100k. Thanks for the advice. I was just surprised that the book recommended something that was so hard to implement. – stannius

1 Answer

3
votes

I came up with this idea of a clone operation that does a filesystem copy of the index and config data, and then CREATEs a new core from the copy. There are some locking issues, and you have to have filesystem access to the indexes, but it did work. It gives you a nice copy whose config files you can muck around with.
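
A minimal sketch of that clone idea, assuming file-level access to the Solr home (the paths are made up, and the copy should happen while the main core is quiet or right after a commit, because of the locking issues mentioned):

    # copy main's instance directory (conf + data) to a new location
    cp -r /var/solr/cores/main /var/solr/cores/prep

    # drop any copied write lock (whether one exists depends on the configured lockType)
    rm -f /var/solr/cores/prep/data/index/*.lock

    # register the copy as a new core
    curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=prep&instanceDir=/var/solr/cores/prep'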

The more I think about it, you could CREATE a new core and then do this:

Force a fetchindex on the slave from the master:

    http://slave_host:port/solr/replication?command=fetchindex

It is possible to pass an extra attribute 'masterUrl' or other attributes like 'compression' (or any other parameter which is specified in the <lst name="slave"> tag) to do a one-time replication from a master. This obviates the need for hardcoding the master in the slave.

And populate the new one from the production one, then apply your updates, and then swap back!
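
Put together, that might look something like this (a sketch; it assumes the /replication handler is configured on both cores, the masterUrl value may need URL-encoding, and updates.xml is a placeholder for the partial update):

    # create an empty prep core from main's config
    curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=prep&instanceDir=main'

    # one-time pull of main's index into prep, passing masterUrl on the fly
    curl 'http://localhost:8983/solr/prep/replication?command=fetchindex&masterUrl=http://localhost:8983/solr/main/replication'

    # apply the partial update to prep, then commit/optimize
    curl 'http://localhost:8983/solr/prep/update' -H 'Content-type:text/xml' --data-binary @updates.xml
    curl 'http://localhost:8983/solr/prep/update' -H 'Content-type:text/xml' --data-binary '<optimize/>'

    # swap prep in for main
    curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=main&other=prep'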