
Below are my assumptions/questions. Please point out if there is something wrong in my understanding.

From reading the documentation, I understand that:

  1. ZooKeeper writes go to the leader and are replicated to the followers. A read request can be served by a follower itself, and hence a read can be stale.
  2. Why can't we use ZooKeeper as a cache system?
  3. As a write request is always made/redirected to the leader, node creation is consistent. When two clients send a write request for the same node name, one of them will ALWAYS get an error (NodeExistsException).
  4. If the above is true, can we use ZooKeeper to keep track of duplicate requests by creating a znode with the requestId? (See the sketch after this list.)
  5. For generating a sequence number in a distributed system, we can use sequential node creation.
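
To make points 4 and 5 concrete, here is a rough sketch of what I have in mind (a minimal example, assuming an already-connected ZooKeeper handle and pre-created /requests and /sequence parent znodes; the paths and class name are just placeholders):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class RequestDeduplicator {

        private final ZooKeeper zk;

        public RequestDeduplicator(ZooKeeper zk) {
            this.zk = zk;
        }

        // Point 4: returns true if this server is the first to see the requestId,
        // false if some other server has already created the znode for it.
        public boolean tryClaim(String requestId) throws KeeperException, InterruptedException {
            try {
                // All writes go through the leader, so only one create() can succeed.
                zk.create("/requests/" + requestId,
                          new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.PERSISTENT);
                return true;
            } catch (KeeperException.NodeExistsException e) {
                // Duplicate request: another client already claimed this id.
                return false;
            }
        }

        // Point 5: ZooKeeper appends a monotonically increasing suffix to
        // sequential znodes, which can serve as a distributed sequence number.
        public String nextSequenceNode() throws KeeperException, InterruptedException {
            return zk.create("/sequence/seq-",
                             new byte[0],
                             ZooDefs.Ids.OPEN_ACL_UNSAFE,
                             CreateMode.PERSISTENT_SEQUENTIAL);
        }
    }

The idea is that only one of the concurrent create() calls can succeed, so the server that gets NodeExistsException knows it is handling a duplicate.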
... node creation is consistent ... What definition of "consistent" are you using? – inquisitive
Currently we are trying to block duplicate requests using a MySQL unique key, but I guess MySQL is not built for this purpose. Instead of MySQL I want to block them using ZooKeeper. For every request id I will create a znode; in case of a duplicate request I will get the error "NodeExistsException". I was asking about this type of consistency. – munish
I think MySQL will be better suited for this. How do you check if a request is a duplicate? What aspect of the request do you check? Id? Some value? Can you give an example of a duplicate request? – inquisitive
Suppose there is an API which does a refund against a requestId (param) after doing some validation. If the user clicks twice unknowingly and the hits go to two different servers (distributed/cluster), then there is a chance that the refund is done twice. It can be prevented with a MySQL unique key, but I think MySQL is not built for that. Including MySQL means dev + devops + DBA; including ZooKeeper only involves dev + devops, and the DBA is out of the picture. I am still checking what the BEST options are for this kind of check in a distributed environment. – munish

1 Answer


Based on the information available in the question and the comments, it appears that the basic question is: in a stateless multi-server architecture, how best to prevent duplicate processing, where the data in question is "has this refund been processed?"

This qualifies as "primarily opinion-based". There are multiple ways to do this and no single way is the best. You can do it with MySQL and you can do it with ZooKeeper.

Now comes pure opinion and speculation:

To process a refund, there must be some database somewhere. Why not just check against it? The duplicate-request scenario that you are guarding against seems like a rare occurrence - it won't be happening hundreds of times per second. If so, this scenario does not warrant a high-performance implementation. A simple database lookup should be fine.

Your workload seems to be a 1:1 ratio of read:write: every time a refund is processed, you check whether it has already been processed, and if not, you process it and make an entry for it. ZooKeeper itself says it works best for something like a 10:1 ratio of read:write. While there is no such metric available for MySQL, it does not need to make certain* guarantees that ZooKeeper makes for write activities, so I expect it to be better for write-intensive loads. (* Guarantees like sequentiality, broadcast, consensus, etc.)

Just a nitpick, but your data is a linear list of hundreds (thousands? millions?) of transaction ids. This is exactly what MySQL (or any database) and its primary key are built for. ZooKeeper is made for more complex/powerful hierarchical data, which you do not need.
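
For comparison, the primary-key approach boils down to something like this (a minimal sketch; the table, column and class names are invented, and it assumes an open JDBC connection to MySQL):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.SQLIntegrityConstraintViolationException;

    public class RefundDeduplicator {

        // Assumes a table like:
        //   CREATE TABLE processed_refunds (request_id VARCHAR(64) PRIMARY KEY);
        public boolean tryClaim(Connection conn, String requestId) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO processed_refunds (request_id) VALUES (?)")) {
                ps.setString(1, requestId);
                ps.executeUpdate();
                return true;   // first insert wins: process the refund
            } catch (SQLIntegrityConstraintViolationException e) {
                // The primary key rejected the second insert: duplicate request.
                return false;
            }
        }
    }

The database does the coordination for you: whichever server inserts first wins, and every other insert of the same request_id fails with a constraint violation.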