I recently "inherited" a Carbon/Graphite setup from a colleague which I have to redesign. The current setup is:
- Datacenter 1 (DC1): 2 servers (server-DC1-1 and server-DC1-2), each with 1 carbon-relay and 4 carbon-caches
- Datacenter 2 (DC2): 2 servers (server-DC2-1 and server-DC2-2), each with 1 carbon-relay and 4 carbon-caches
All 4 carbon-relays are configured with a REPLICATION_FACTOR of 2, consistent hashing, and all 16 carbon-caches (2 DCs * 2 servers * 4 caches) as destinations. As a result, some metrics exist on only 1 server (both replicas were probably hashed to different caches on the same server). With over 1 million metrics, this problem affects about 8% of them.
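To illustrate what I suspect is happening, here is a minimal toy simulation. It is not carbon's actual hash ring: the ring construction, the `points_per_node` value and the metric names are all assumptions made up for this sketch. It only shows that when replicas are chosen as "the next distinct cache on the ring" without regard to the server, both copies of a metric can end up on the same machine. The exact fraction depends on the ring implementation, so the number it prints is not expected to match the 8% above.

```python
import bisect
import hashlib

def ring_position(key):
    """Map a string to a 32-bit position on the ring (toy hash, an assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

def build_ring(nodes, points_per_node=100):
    """Place several virtual points per (server, instance) node on the ring."""
    ring = []
    for server, instance in nodes:
        for i in range(points_per_node):
            position = ring_position(f"{server}:{instance}:{i}")
            ring.append((position, (server, instance)))
    ring.sort()
    return ring

def pick_replicas(ring, metric, replication_factor=2):
    """Walk clockwise from the metric's position and take the first distinct
    nodes. Note: distinct *caches*, not distinct *servers*."""
    start = bisect.bisect_left(ring, (ring_position(metric),))
    chosen = []
    for offset in range(len(ring)):
        node = ring[(start + offset) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == replication_factor:
            break
    return chosen

# The 16 caches of the current setup: 2 DCs * 2 servers * 4 caches.
NODES = [
    (f"server-DC{dc}-{srv}", inst)
    for dc in (1, 2)
    for srv in (1, 2)
    for inst in "abcd"
]

def same_server_fraction(metric_count=2000):
    """Fraction of (hypothetical) metrics whose two replicas share a server."""
    ring = build_ring(NODES)
    same = 0
    for n in range(metric_count):
        first, second = pick_replicas(ring, f"hosts.client{n}.cpu.load")
        if first[0] == second[0]:
            same += 1
    return same / metric_count

if __name__ == "__main__":
    print(f"{same_server_fraction():.1%} of metrics have both replicas on one server")
```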
What I would like to do is a multi-tiered setup with redundancy, so that I mirror all metrics across the datacenters and inside the datacenter I use consistent hashing to distribute the metrics evenly across 2 servers.
For this I need help with the configuration (mainly) of the relays. Here is a picture of what I have in mind:
The clients would send their data to the tier1relays in their respective Datacenters ("loadbalancing" would occur on client side, so that for example all clients with an even number in the hostname would send to tier1relay-DC1-1 and clients with an odd number would send to tier1relay-DC1-2).
The tier2relays use consistent hashing to distribute the data in the datacenter evenly across the 2 servers. For example the "pseudo" configuration for tier2relay-DC1-1 would look like this:
- RELAY_METHOD = consistent-hashing
- DESTINATIONS = server-DC1-1:cache-DC1-1-a, server-DC1-1:cache-DC1-1-b, (...), server-DC1-2:cache-DC1-2-d
What I would like to know: how do I tell tier1relay-DC1-1 and tier1relay-DC1-2 to send all metrics to the tier2relays in both DC1 and DC2 (replicating the metrics across the DCs), while doing some kind of "load balancing" between tier2relay-DC1-1 and tier2relay-DC1-2?
On another note: I would also like to know what happens inside the carbon-relay if I use consistent hashing but one or more of the destinations are unreachable (server down). Do the metrics get hashed again (against the reachable caches), or are they simply dropped for the time being? (Or, to ask the same question from a different angle: when a relay receives a metric, does it hash the metric against the list of all configured destinations or against the currently available destinations?)
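To make the two alternatives concrete, here is a small toy model of both behaviours. It is not taken from carbon's source and makes no claim about which one carbon-relay implements. The function names `route_fixed` and `route_rehash`, the ring construction and the destination names are hypothetical and exist only for this sketch.

```python
import bisect
import hashlib

def _position(key):
    """Map a string to a 32-bit ring position (toy hash, an assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

def owner(metric, destinations, points_per_node=100):
    """Return the destination that owns the metric on a toy hash ring."""
    ring = sorted(
        (_position(f"{dest}:{i}"), dest)
        for dest in destinations
        for i in range(points_per_node)
    )
    index = bisect.bisect_left(ring, (_position(metric),)) % len(ring)
    return ring[index][1]

def route_fixed(metric, configured, available):
    """Alternative A: hash against all configured destinations.
    Returns None when the chosen destination is unreachable."""
    target = owner(metric, configured)
    return target if target in available else None

def route_rehash(metric, configured, available):
    """Alternative B: hash only against the currently reachable destinations."""
    live = [dest for dest in configured if dest in available]
    return owner(metric, live) if live else None

if __name__ == "__main__":
    configured = ["server-DC1-1:a", "server-DC1-1:b", "server-DC1-2:a", "server-DC1-2:b"]
    available = set(configured) - {"server-DC1-2:a"}  # pretend one cache is down
    for n in range(5):
        metric = f"hosts.client{n}.cpu.load"
        print(metric,
              route_fixed(metric, configured, available),
              route_rehash(metric, configured, available))
```

Under alternative A, a metric owned by the unreachable cache has nowhere to go until that cache returns; under alternative B it moves to another cache, which means its data would end up split across two places.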