2
votes

I'm trying to understand the difference between the datacenter replication implemented in Cassandra and in Couchbase. It looks like in Cassandra, if I have two datacenters (DCs), all my data is replicated in both, whereas in Couchbase the two DCs hold different data and replication of a subset of the data between DCs can be configured manually. Is that correct?

How can a client know where the data is located in Couchbase? If I query DC1 for data held in DC2, what happens?

In Couchbase, how is the whole system aware of where the data is replicated?

Thank you in advance!

3 Answers

4
votes

Couchbase Cross Datacenter Replication (XDCR) replicates all the data from a source bucket to a destination bucket (continuously).

If you have bucket A in the New York datacenter and bucket B in the San Francisco datacenter, and you configure XDCR from bucket A to bucket B, all of the data in bucket A replicates to bucket B. You cannot configure any additional filter. However, this replication goes in one direction only. So, if you are also writing data directly into bucket B, you will not have all the data in both datacenters. If you want all the data in both datacenters, you would also configure XDCR from bucket B to bucket A. This is referred to as bi-directional replication in the manual. In this two-cluster configuration, it gives you all the data in both datacenters.
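To make the difference between one-way and bi-directional replication concrete, here is a toy model in Python. It is only a sketch: the buckets are plain dictionaries, `replicate` is a hypothetical helper and not a Couchbase API, and conflict resolution is ignored entirely.

```python
def replicate(source: dict, destination: dict) -> None:
    """Toy one-way replication: copy every document from source to destination."""
    destination.update(source)


# Each datacenter receives its own local writes.
bucket_a = {"user::1": "written in New York"}
bucket_b = {"user::2": "written in San Francisco"}

# Uni-directional replication, A -> B only.
replicate(bucket_a, bucket_b)
# bucket_b now holds both documents; bucket_a still holds only its own.
after_one_way = (sorted(bucket_a), sorted(bucket_b))

# Adding B -> A makes the setup bi-directional.
replicate(bucket_b, bucket_a)
# Both buckets now hold both documents.
after_two_way = (sorted(bucket_a), sorted(bucket_b))
```

After the first call only bucket B is complete; after the second call both buckets contain the same keys.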

Couchbase Client SDKs are configured to talk to a single cluster. This means the client must know which cluster to connect to. If you have different data stored in your New York cluster than in your San Francisco cluster, your application must contain the logic to know where to look for the data.

For high-availability use cases, typically bi-directional replication is set up between the regions, and applications are designed to prefer a cluster. An application deployed closer to New York might prefer the New York cluster. As long as there are no problems it reads and writes to that cluster. If there is some problem, say the New York datacenter is down, the application could continue operation by switching to the San Francisco datacenter. But again, all of this logic would be in your application.
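The "prefer one cluster, fall back to the other" logic could look roughly like the sketch below. Everything here is an assumption for illustration: `ClusterUnavailable`, `read_with_failover`, and the per-cluster `get` callables are hypothetical stand-ins for whatever your SDK connection objects and error types actually are.

```python
from typing import Callable, Sequence, Tuple


class ClusterUnavailable(Exception):
    """Hypothetical error raised when a cluster cannot be reached."""


def read_with_failover(
    key: str,
    clusters: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, get) pair in preference order; return (name, value)."""
    last_error = None
    for name, get in clusters:
        try:
            return name, get(key)
        except ClusterUnavailable as err:
            # This cluster is down, so move on to the next preferred one.
            last_error = err
    raise ClusterUnavailable("no cluster reachable") from last_error


# Stand-in clients: New York is down, San Francisco answers.
def new_york_get(key: str) -> str:
    raise ClusterUnavailable("New York datacenter is down")


def san_francisco_get(key: str) -> str:
    return "value-for-" + key


served_by, value = read_with_failover(
    "user::1",
    [("new-york", new_york_get), ("san-francisco", san_francisco_get)],
)
```

The preference order is just the order of the list, so an application deployed near San Francisco would pass the same pairs reversed.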

The "smart cluster map" mentioned by Robin is used to find data within a single cluster. It's important to understand that it will not locate data stored in different regions.

2
votes

Please note that in more recent versions of Couchbase (4.0 and higher), XDCR does allow filtering. A simple regex on the key names lets a selected subset of data be replicated between datacenters. See more at http://developer.couchbase.com/documentation/server/4.0/xdcr/xdcr-filtering-setup.html
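As a rough illustration of what key-based filtering means, the sketch below selects keys with Python's `re` module. The `user::` key prefix and the `select_for_replication` helper are made up for the example, and the regex dialect Couchbase accepts may differ from Python's, so check the linked documentation before reusing a pattern.

```python
import re
from typing import Iterable, List


def select_for_replication(keys: Iterable[str], pattern: str) -> List[str]:
    """Return only the keys whose names match the filter expression."""
    regex = re.compile(pattern)
    return [key for key in keys if regex.search(key)]


all_keys = ["user::1", "order::7", "user::2"]
# Only documents whose key starts with "user::" would be replicated.
replicated = select_for_replication(all_keys, r"^user::")
```

Here `replicated` contains `user::1` and `user::2`, while `order::7` stays in the source datacenter only.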

1
votes

In Couchbase, Cross Datacenter Replication (XDCR) works bucket to bucket. Couchbase allows two types of replication: bi-directional and uni-directional. If you use bi-directional replication, both datasets are going to be the same. If you use uni-directional replication, one dataset could in theory end up larger than the other, but generally that is not the case.

The Client SDKs know where data is located in Couchbase because they use a smart Cluster Map. This Cluster Map keeps track of which node holds each piece of data at all times, so requests are sent to the correct node.
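Conceptually, the lookup works by hashing the document key to a partition (a vBucket) and then looking up which node owns that partition. The sketch below is a simplified model, not the SDK's real code: the exact hash computation, the round-robin assignment in `build_cluster_map`, and the node names are all assumptions for illustration.

```python
import zlib
from typing import List

NUM_VBUCKETS = 1024  # partition count used here for illustration


def build_cluster_map(nodes: List[str]) -> List[str]:
    """Assign each vBucket to a node (round-robin, purely for the example)."""
    return [nodes[i % len(nodes)] for i in range(NUM_VBUCKETS)]


def node_for_key(key: str, cluster_map: List[str]) -> str:
    """Hash the key to a vBucket, then return the node that owns it."""
    vbucket = zlib.crc32(key.encode("utf-8")) % len(cluster_map)
    return cluster_map[vbucket]


cluster_map = build_cluster_map(["node-a", "node-b", "node-c"])
owner = node_for_key("user::1", cluster_map)
```

Because the hash is deterministic, every client holding the same map sends a given key to the same node without asking a central coordinator. When nodes are added or removed, the cluster publishes an updated map and the clients route with the new one.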