https://sre.google/sre-book/managing-critical-state/

Groups of processes may want to reliably agree on questions such as:

  • Which process is the leader of a group of processes?
  • What is the set of processes in a group?
  • Has a message been successfully committed to a distributed queue?
  • Does a process hold a lease or not?
  • What is a value in a datastore for a given key?

The distributed consensus problem deals with reaching agreement among a group of processes connected by an unreliable communications network.

For instance, several processes in a distributed system may need to be able to form a consistent view of a critical piece of configuration, whether or not a distributed lock is held, or if a message on a queue has been processed.

for leader election, critical shared state, or distributed locking

coordination failure

a problem that appears to be caused by a full partition may instead be the result of:

  • A very slow network
  • Some, but not all, messages being dropped
  • Throttle occurring in one direction, but not the other direction

split brain

  • pair of file servers, leader and follower
  • monitor each other via hertbeats
    • cant contact → STONITH to shut partner node down and take mastership of files
  • STONITH commnds may not be delivered leading to corruption and unavailability bc both nodes r expected to be active or dead
  • The problem here is that the system is trying to solve a leader election problem using simple timeouts. Leader election is a reformulation of the distributed asynchronous consensus problem, which cannot be solved correctly by using heartbeats.

failover

The problem of determining a consistent view of group membership across a group of processes is another instance of the distributed consensus problem.

many distributed systems problems turn out to be different versions of distributed consensus, including master election, group membership, all kinds of distributed locking and leasing, reliable distributed queuing and messaging, and maintenance of any kind of critical shared state that must be viewed consistently across a group of processes.