7 votes

I have two Jackrabbit instances containing the same content. Rebuilding the Lucene index is slow (30+ hours), and the downtime needed in the cluster is risky. Is it possible to instead re-index just one Jackrabbit instance and then copy the Lucene index from that instance to the other?

Naively copying the Lucene index files beneath the workspace directory doesn't work. The issue appears to be that the content is indexed by document number, which maps to a UUID, which in turn maps to the JCR path of the indexed node; but these UUIDs are not stable for a given path between Jackrabbit instances. (Both are actually Day CQ publisher instances populated by replication from a CQ author instance.)
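To make the failure mode concrete, here is a toy illustration (plain Python, not the actual Jackrabbit storage format) of the document number -> UUID -> path chain described above. Because each instance assigns its own UUIDs to the same paths, an index copied from instance A refers to UUIDs that instance B has never seen:

```python
# Toy model of why copying the Lucene index between Jackrabbit
# instances fails: the index refers to nodes by UUID, and the UUIDs
# assigned to the same JCR path differ between instances.
import uuid

paths = ["/content/page1", "/content/page2"]

# Each instance mints its own UUIDs for the same set of paths.
instance_a = {p: str(uuid.uuid4()) for p in paths}
instance_b = {p: str(uuid.uuid4()) for p in paths}

# Instance A's Lucene index maps document numbers -> UUIDs.
index_a = {doc_num: instance_a[p] for doc_num, p in enumerate(paths)}

# Resolving A's index against B's UUID -> path table finds nothing:
uuid_to_path_b = {u: p for p, u in instance_b.items()}
resolved = [uuid_to_path_b.get(u) for u in index_a.values()]
print(resolved)  # every lookup misses: [None, None]
```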

I've managed to find the UUID-to-path mapping in the repository under /jcr:system/jcr:versionStorage/ but I can't see an easy way to copy this between repositories along with the Lucene index. And then I can't find the UUID->document ID mapping anywhere in the files - is this part of the Lucene index too?

Thanks for any help. I'm leaning towards just re-indexing the second instance separately and accepting the downtime, but any ideas to reduce the risk or the elapsed time of reindexing the cluster would be appreciated!


In the end we're going the re-index-them-both route: we've managed to repurpose a test instance as an extra live instance that we can drop into the farm temporarily whilst we take the other two out in turn to re-index. However, I'd still be interested in hearing better ways to do this!

1
Please take a look at this post, though maybe you've already seen it. stackoverflow.com/questions/670182/… – Mike Perrenoud
Thanks. No, I don't think any of those are relevant for me: it's the embedded search engine, so I can't switch to Solr, and the other answers discuss copying the index files, which isn't enough in my case. I'd need to either copy the node path data along with the index and then rebuild the path -> UUID -> document number mapping on the target, or somehow transform the copied index on the source system to use the target system's document numbers. – Rup

1 Answer

2 votes

That seems like a scary idea, honestly. I'm not sure there is any way to guarantee that you've got the same underlying data, even with identical content and hardware configuration.

If your performance numbers look like ours, the time to copy the entire repository is less than the time it takes to reindex. Have you considered reindexing just one repository, taking a backup/copy of it, and then configuring that backup/copy as your second instance?