System Design: Replication

As application traffic increases, the architecture of the system may need to scale to handle the increased load. There are two main approaches to doing so: 

  1. Vertical scaling, or scaling up: the entire system is moved to a more powerful "machine". The system may be using an architecture where CPU, RAM and storage all live on one machine (shared-memory architecture), or one where multiple machines, each with its own CPU and RAM, share a single storage layer (shared-disk architecture). 
  2. Horizontal scaling, scaling out or shared-nothing architecture: multiple machines (nodes), each with its own CPU, RAM and storage, are networked together by software to handle the incoming load, either by breaking the database into chunks stored on each node (partitioning) or by keeping copies of the database on multiple nodes (replication). 

Replication

Replication means keeping a copy of the same data on multiple devices. This is done to guarantee high availability, reduce latency, allow disconnected operation and improve scalability (high read throughput). However, this raises a question: how do we ensure that data written to the database is recorded by every "machine" (replica)? The main approaches are single-leader replication (leaders and followers), multi-leader replication and leaderless replication. 

Leaders and Followers

In this setup, one replica is designated as the leader/master and all writes are made to this replica. The other replicas are read-only followers/slaves that apply the leader's changes. This keeps the data consistent, since it can only be modified through the leader. 

Synchronous vs Asynchronous Replication

Replication can be done synchronously: when a user makes a write request to the leader, the leader forwards the write to the followers and waits for them to confirm it before responding to the user. With asynchronous replication, the leader forwards the write to the followers but does not wait for any response before replying to the user. Synchronous replication gives strong durability but introduces high latency; asynchronous replication has lower latency but risks data loss if the leader fails. Most systems today use semi-synchronous replication, where the leader replicates synchronously to one follower and asynchronously to the remaining followers, balancing durability against latency. A rough sketch of the difference is shown below. 
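
As an illustration, here is a minimal Python sketch of semi-synchronous replication, assuming invented Leader and Follower classes where replication happens through a direct apply() call; a real database ships a replication log over the network rather than calling methods on its followers.

  import threading

  class Follower:
      def __init__(self):
          self.data = {}

      def apply(self, key, value):
          # In a real system this would be a change record arriving over the network.
          self.data[key] = value

  class Leader:
      def __init__(self, sync_follower, async_followers):
          self.sync_follower = sync_follower        # confirmed before the client is answered
          self.async_followers = async_followers    # replicated in the background
          self.data = {}

      def write(self, key, value):
          self.data[key] = value
          # Synchronous part: wait for this one follower to apply the write.
          self.sync_follower.apply(key, value)
          # Asynchronous part: fire and forget; the client does not wait for these.
          for follower in self.async_followers:
              threading.Thread(target=follower.apply, args=(key, value)).start()
          return "ok"   # acknowledged only after the synchronous follower has confirmed

  leader = Leader(Follower(), [Follower(), Follower()])
  leader.write("user:42", {"name": "Ada"})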

Setting Up New Followers

From a high-level overview, setting up a new follower usually follows these steps (a small sketch of the catch-up phase follows the list): 
  1. Take a consistent snapshot of the leader's database at a specific point in time.
  2. Copy the snapshot to the new follower node.
  3. The follower connects to the leader and requests all data changes that have occurred since the snapshot was taken.
  4. The follower processes the backlog of data changes until it has caught up to the leader.
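
A minimal sketch of the catch-up phase (steps 3 and 4), assuming the leader's changes are available as a simple list of (sequence_number, key, value) records; this record format is invented for the example.

  def bootstrap_follower(snapshot, leader_log, snapshot_position):
      # snapshot: a copy of the leader's data at a known point in time (steps 1 and 2)
      # leader_log: list of (sequence_number, key, value) change records (hypothetical format)
      # snapshot_position: the sequence number the snapshot corresponds to
      follower_data = dict(snapshot)
      for sequence_number, key, value in leader_log:
          if sequence_number > snapshot_position:   # step 3: only changes after the snapshot
              follower_data[key] = value            # step 4: apply the backlog in order
      return follower_data

  log = [(1, "a", "old"), (2, "b", "x"), (3, "a", "new")]
  print(bootstrap_follower({"a": "old"}, log, snapshot_position=1))   # {'a': 'new', 'b': 'x'}
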
That is, generally, how new followers are created. However, how do we handle failures during replication? 

Handling Node Outages

If a follower fails, catching up usually means the follower checks its log of data changes, finds the last transaction it processed before the failure, connects to the leader and requests all data changes since that transaction. If the leader fails, on the other hand, a follower typically has to be promoted to be the new leader (failover). This involves determining that the leader has failed, choosing the replica with the most up-to-date data and promoting it to leader (as sketched below), and finally reconfiguring the system to use the new leader. Failover can introduce challenges such as data loss, split brain (two nodes both believing they are the leader) and choosing the right timeout before declaring the leader dead.
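
A minimal sketch of the "choose the most up-to-date replica" step, assuming each follower reports how far it has got in the leader's replication log; the log_position field is hypothetical.

  def pick_new_leader(followers):
      # followers: hypothetical list of dicts like {"id": "replica-2", "log_position": 1042}
      reachable = [f for f in followers if f.get("log_position") is not None]
      if not reachable:
          raise RuntimeError("no follower available for promotion")
      # The follower with the highest log position has the least data loss.
      return max(reachable, key=lambda f: f["log_position"])

  # Example: replica-3 wins because it has applied the most of the old leader's log.
  print(pick_new_leader([
      {"id": "replica-1", "log_position": 990},
      {"id": "replica-2", "log_position": 1042},
      {"id": "replica-3", "log_position": 1077},
  ]))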

Implementation of Replication Logs 

Databases use different methods to implement replication logs, including: 
  • Statement-Based Replication: The leader logs and sends every write statement (e.g., INSERT, UPDATE, DELETE) to its followers. This can be problematic if statements are nondeterministic (e.g., they call NOW() or RAND()) or rely on side effects (see the sketch after this list).
  • Write-Ahead Log (WAL) Shipping: The leader sends its low-level, byte-by-byte log of changes to disk blocks. Followers build an exact copy of the leader's data structures. This method (used by PostgreSQL and Oracle) tightly couples replication to the storage engine, making it difficult to run different database versions or storage engines on leaders and followers.
  • Logical (Row-Based) Replication: Changes are logged at a higher abstraction level, describing row modifications (e.g., primary key, new column values). This approach (used by MySQL's binlog) is more backward-compatible and easier for external applications to parse, which is useful for change data capture.
  • Trigger-Based Replication: Custom application code (triggers) is registered to automatically log data changes into a separate table, which an external process can then use to replicate changes.
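
To make the difference between statement-based and logical replication concrete, here is a small sketch; the change-record layout is invented for illustration and does not follow any particular database's log format.

  import time
  import uuid

  # Statement-based replication ships the SQL text itself. If it contains a
  # nondeterministic function, each replica evaluates it to a different value.
  statement = "UPDATE sessions SET last_seen = NOW() WHERE user_id = 42"

  # Logical (row-based) replication ships the effect of the statement instead:
  # the row that changed, identified by its primary key, with the new values
  # already resolved, so every replica applies exactly the same data.
  row_change_event = {
      "table": "sessions",
      "primary_key": {"user_id": 42},
      "new_values": {"last_seen": time.time()},   # evaluated once, on the leader
      "txid": str(uuid.uuid4()),                  # illustrative transaction id
  }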

Problems with Replication Lag 

Asynchronous replication can lead to replication lag, where the delay between a write on the leader and its reflection on a follower can range from fractions of a second to minutes or hours. This lag can cause inconsistencies that impact applications:
  • Read-After-Write Consistency (or Read-Your-Writes Consistency): A user may write data but then read from a stale follower and not see their own update. Solutions include always reading the user's own data from the leader, or routing reads based on the timestamp of the user's last write (a routing sketch follows this list).
  • Monotonic Reads: A user might see newer data from one replica, then older data from a more lagging replica, making time appear to go backward. Ensuring that a user's reads always go to the same replica can prevent this.
  • Consistent Prefix Reads: If writes have a causal order, a reader should see them applied in that order. This can be a problem in partitioned databases where partitions operate independently. Explicitly tracking causal dependencies or ensuring related writes go to the same partition can help. Designing systems that need stronger guarantees than eventual consistency requires careful consideration, often involving transactions or consensus mechanisms.
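
A minimal sketch of one way to provide read-after-write consistency by routing reads, assuming a hypothetical routing layer that remembers when each user last wrote; the ReadRouter class and the 60-second window are illustrative choices, not a prescribed mechanism.

  import random
  import time

  class ReadRouter:
      def __init__(self, leader, followers, recent_write_window=60.0):
          self.leader = leader
          self.followers = followers
          self.recent_write_window = recent_write_window
          self.last_write_at = {}   # user_id -> timestamp of that user's last write

      def record_write(self, user_id):
          self.last_write_at[user_id] = time.time()

      def replica_for_read(self, user_id):
          wrote_recently = (
              time.time() - self.last_write_at.get(user_id, 0.0)
              < self.recent_write_window
          )
          # Recent writers read from the leader so they always see their own
          # updates; everyone else can read from any follower.
          return self.leader if wrote_recently else random.choice(self.followers)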

Multi-Leader Replication

In multi-leader replication, multiple nodes can accept writes from clients, and these leaders then replicate changes to each other and to followers.
  • Use Cases: It's often used for multi-datacenter operation to reduce write latency for users in different geographical locations and improve fault tolerance against datacenter outages. It can also allow clients to continue working offline.
  • Handling Write Conflicts: The main challenge is resolving conflicts that occur when the same data is concurrently modified on different leaders. Solutions include:
    • Convergent Conflict Resolution: Ensuring all replicas eventually arrive at the same final value. This can involve picking the "last write wins" (LWW) value based on timestamps, giving each replica a unique ID and letting the write from the higher-numbered replica win, or merging values (e.g., concatenating lists); see the sketch after this list.
    • Custom Conflict Resolution Logic: Application-specific code can be used to resolve conflicts either when they are detected during replication (on write) or when conflicting versions are read (on read).
  • Replication Topologies: Different communication paths are possible, such as all-to-all, circular, or star topologies. In all-to-all topologies, writes may arrive out of order at different replicas, leading to inconsistencies if not properly handled (e.g., using version vectors).
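
A minimal sketch of two convergent strategies, last write wins and merging, assuming conflicting versions arrive as (timestamp, value) pairs; the format is invented for the example.

  def last_write_wins(versions):
      # Pick the version with the highest timestamp and discard the rest.
      # Simple, but concurrent writes that lose the comparison are silently dropped.
      return max(versions, key=lambda v: v[0])[1]

  def merge_values(versions):
      # Keep all concurrently written values by merging them instead,
      # here by taking the union, as one might do for a shopping cart.
      merged = set()
      for _, value in versions:
          merged |= set(value)
      return merged

  # Two leaders accepted conflicting writes to the same cart.
  conflicting = [(1700000001.5, ["milk", "bread"]), (1700000002.1, ["milk", "eggs"])]
  print(last_write_wins(conflicting))   # ['milk', 'eggs']  (the earlier write is lost)
  print(merge_values(conflicting))      # {'milk', 'bread', 'eggs'}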

Leaderless Replication

This approach abandons the concept of a single leader, allowing any replica to accept writes directly from clients. It was inspired by Amazon Dynamo and is used by databases like Riak, Cassandra, and Voldemort.
  • Writing to the Database When a Node Is Down: Clients send write requests to several replicas in parallel. A write is considered successful if a configurable number of replicas (w) acknowledge it, even if some nodes are unavailable.
  • Reading: Clients send read requests to several replicas in parallel. If different responses are received, version numbers are used to determine the most up-to-date value.
  • Read Repair and Anti-Entropy: Mechanisms to ensure all data is eventually copied to every replica:
    • Read Repair: A client detects stale values during a read and writes the newer value back to the outdated replica.
    • Anti-Entropy: A background process copies missing data from one replica to another.
  • Quorums for Reading and Writing: If there are n replicas, every write must be confirmed by at least w replicas and every read must query at least r replicas. The condition w + r > n ensures that a read will typically return the most recent value, because the set of nodes written to and the set of nodes read from must overlap. n, w and r are configurable parameters (a quorum sketch follows this list).
  • Limitations of Quorum Consistency: Even with w + r > n, edge cases (like sloppy quorums, concurrent writes, or timing issues) can lead to stale values being returned. Leaderless systems generally provide eventual consistency and often do not guarantee read-your-writes, monotonic reads, or consistent prefix reads.
  • Sloppy Quorums and Hinted Handoff: Sloppy quorums allow writes to succeed even if the designated "home" nodes are unavailable, by writing to other reachable nodes. Hinted handoff delivers these writes to the correct "home" nodes once they become available.
  • Detecting Concurrent Writes: Conflicts are inherent. Strategies include:
    • Last Write Wins (LWW): Uses timestamps to impose an arbitrary order on concurrent writes, which can lead to data loss due to clock skew.
    • Happens-Before Relationship and Concurrency: Operations are concurrent if neither causally precedes the other.
    • Merging Concurrently Written Values (Siblings): Clients are responsible for merging conflicting versions (siblings) in application code, for example, by taking the union of items in a shopping cart.
    • Version Vectors: A mechanism to track causal dependencies and identify which values to overwrite and which to keep as siblings when there are multiple replicas.
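
A minimal sketch of quorum writes and reads with version-based read repair, assuming each replica is just a dictionary of key -> (version, value) and an unreachable node is modelled as None; the representation is invented for the example.

  N, W, R = 3, 2, 2   # n replicas, write quorum, read quorum (W + R > N)

  def quorum_write(replicas, key, version, value):
      # Attempt the write on every replica; succeed only if at least W acknowledge.
      acks = 0
      for replica in replicas:
          if replica is not None:                 # None models an unreachable node
              replica[key] = (version, value)
              acks += 1
      return acks >= W

  def quorum_read(replicas, key):
      # Read from every reachable replica, return the newest value, and
      # repair any replica that returned a stale version (read repair).
      responses = [(r, r.get(key)) for r in replicas if r is not None]
      if len(responses) < R:
          raise RuntimeError("not enough replicas responded")
      newest = max((v for _, v in responses if v is not None), key=lambda v: v[0])
      for replica, value in responses:
          if value is None or value[0] < newest[0]:
              replica[key] = newest               # read repair of the stale replica
      return newest[1]

  # One replica is down during the write, but W + R > N still finds the new value.
  a, b, c = {}, {}, None
  quorum_write([a, b, c], "cart:42", version=1, value=["milk"])
  c = {}                                          # the node comes back, but empty (stale)
  print(quorum_read([a, b, c], "cart:42"))        # ['milk'], and c gets repaired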

Reference

Kleppmann, Martin. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O'Reilly Media, 2017.

This blog post is a summary of my personal notes and understanding from reading "Designing Data-Intensive Applications" by Martin Kleppmann. All credit for the original ideas belongs to the author. 
