<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://mubashirusman.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mubashirusman.github.io/" rel="alternate" type="text/html" /><updated>2026-04-06T11:25:05+00:00</updated><id>https://mubashirusman.github.io/feed.xml</id><title type="html">Mubashir Usman’s Blog</title><subtitle>My learnings as a site reliability engineer and building reliable systems for production. I sometimes post my thoughts on navigating life in general.</subtitle><entry><title type="html">Notes on System Design from First Principles</title><link href="https://mubashirusman.github.io/distributed-systems/2026/04/01/system-design-first-principle.html" rel="alternate" type="text/html" title="Notes on System Design from First Principles" /><published>2026-04-01T13:09:00+00:00</published><updated>2026-04-01T13:09:00+00:00</updated><id>https://mubashirusman.github.io/distributed-systems/2026/04/01/system-design-first-principle</id><content type="html" xml:base="https://mubashirusman.github.io/distributed-systems/2026/04/01/system-design-first-principle.html"><![CDATA[<h1 id="system-design-from-first-principles">System Design from First Principles</h1>

<h2 id="part-12-physics-of-data">Part 1–2 Physics of Data</h2>

<p>Data fetch time depends on the distance it has to travel: the more frequently data needs to be accessed, the closer it should be to the CPU.</p>

<p><strong>Little’s law</strong> - <code class="language-plaintext highlighter-rouge">L = λW</code></p>
<ul>
  <li>L = average number of requests in the system</li>
  <li>λ = rate of requests coming in</li>
  <li>W = average time a request spends in the system. So the load on a system is proportional to the time taken to process a request.</li>
</ul>
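<p>Little’s law is just a multiplication, which makes it easy to sanity-check capacity numbers. A minimal Python sketch (the function name is mine):</p>

```python
def littles_law_L(arrival_rate, avg_time_in_system):
    """L = lambda * W: average number of requests concurrently in the system."""
    return arrival_rate * avg_time_in_system

# At 200 requests/second with an average of 0.25 s spent per request,
# the system holds 50 requests in flight at any moment.
print(littles_law_L(200, 0.25))
```

<p>Note the practical reading: halving W (faster processing) halves the load L for the same traffic.</p>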

<p>Average latency does not give a true picture of delays, so we use percentiles: the 99th percentile (P99) is the latency below which 99% of requests complete; the remaining 1% are the requests that faced the highest delay.</p>
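<p>A small Python sketch shows how the mean hides the tail (the sample latencies are made up; this uses the nearest-rank percentile method):</p>

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-indexed rank
    return ordered[rank - 1]

latencies = [11, 12, 12, 13, 13, 14, 15, 16, 250, 900]  # ms
print(sum(latencies) / len(latencies))  # mean = 125.6 ms, misleading
print(percentile(latencies, 50))        # P50 = 13 ms, the typical request
print(percentile(latencies, 99))        # P99 = 900 ms, the slow tail
```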

<p>For reliability, instead of uptime we can use the ratio of successful responses to total responses. Uptime can be misleading for a distributed service, since the service will remain available in some parts of the world at all times.</p>

<p>A system should be run below the capacity it can handle, for example at 70% CPU usage; the remaining headroom absorbs unexpected load spikes and protects the system from cascading failure. In fact it’s a trade-off between wasting resources and avoiding an exponential increase in latency in the face of high traffic.</p>

<p><strong>Amdahl’s law</strong>: the speedup of a task is limited by its serial fraction (the part that cannot be parallelized).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     1
------------
    (1-s)
s + -----
      n
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">s</code> is the fraction of the task that must be done serially, and <code class="language-plaintext highlighter-rouge">n</code> is the number of processors. E.g. if code is 95% parallelizable and 5% serial, then as <code class="language-plaintext highlighter-rouge">n</code> grows large the (1-s)/n term approaches zero and the speedup approaches 1/0.05 = 20; even with infinitely many processors we cannot achieve more than a 20× speedup.</p>
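<p>The formula is easy to explore numerically; a quick Python sketch of the diminishing returns:</p>

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Maximum speedup when only (1 - s) of the work can be parallelized."""
    s = serial_fraction
    return 1 / (s + (1 - s) / n_processors)

# With 5% serial work, adding processors quickly stops helping:
for n in (10, 100, 10_000, 1_000_000):
    print(n, round(amdahl_speedup(0.05, n), 1))
# the speedup creeps toward, but never exceeds, 1 / 0.05 = 20
```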

<h3 id="four-golden-signals">Four golden signals</h3>
<ul>
  <li>Latency: the time to service a request; we should track P50, P90, P99</li>
  <li>Traffic: the demand on the system, i.e. the number of requests coming in</li>
  <li>Errors: the ratio of failed requests to total requests</li>
  <li>Saturation: how full the system is, e.g. the database connection pool, CPU usage, etc.</li>
</ul>

<hr />

<h2 id="part-3-communication">Part 3 Communication</h2>

<h3 id="data-access-latency">Data access latency</h3>
<p>This table shows CPU access times for different storage media.</p>

<table>
  <thead>
    <tr>
      <th>Storage</th>
      <th>Scaled Latency</th>
      <th>Actual Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>L1 Cache</td>
      <td>0.5 seconds</td>
      <td>1-4 nano sec</td>
    </tr>
    <tr>
      <td>RAM</td>
      <td>2 minutes</td>
      <td>100+ nano sec</td>
    </tr>
    <tr>
      <td>SSD</td>
      <td>2 days</td>
      <td>25-100 micro sec</td>
    </tr>
    <tr>
      <td>HDD</td>
      <td>5 months</td>
      <td>5-10 mili sec</td>
    </tr>
  </tbody>
</table>

<p>A network call is expensive because:</p>
<blockquote>
  <p>(1) it needs a DNS query, (2) a TCP 3-way handshake, (3) a TLS handshake, (4) computing encryption keys, (5) invisible timeouts</p>
</blockquote>

<h3 id="serialization-latency">Serialization Latency</h3>
<p>Serialization is expensive too.
Constructing JSON for the network from in-memory Java objects (serialization) is CPU intensive (it requires string manipulation). Instead of JSON, we can use other formats to serialize data.</p>

<table>
  <thead>
    <tr>
      <th>JSON</th>
      <th>Protobuf OR Flat buf</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>{“id”: 5, “status”: “active”}</td>
      <td>bytes</td>
    </tr>
    <tr>
      <td>CPU needs to do json parsing</td>
      <td>CPU copies bytes so no parsing is needed</td>
    </tr>
    <tr>
      <td>Not good for high performance</td>
      <td>Good for high performance</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p>FlatBuffers &gt; Protocol Buffers &gt; JSON</p>
</blockquote>
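<p>To see why binary formats win, compare a JSON encoding with a fixed binary layout. This sketch uses Python’s <code class="language-plaintext highlighter-rouge">struct</code> module as a stand-in for a schema-based format like Protobuf (the field layout <code class="language-plaintext highlighter-rouge">&lt;IB</code> is my own choice):</p>

```python
import json
import struct

record = {"id": 5, "status": 1}  # status encoded as an integer: 1 = active

as_json = json.dumps(record).encode()  # 22 bytes of text that must be parsed
as_binary = struct.pack("<IB", 5, 1)   # fixed layout: 4-byte id + 1-byte status

print(len(as_json), len(as_binary))    # 22 vs 5 bytes

# Decoding the binary form is a plain memory copy into two integers,
# with no character-by-character parsing.
item_id, status = struct.unpack("<IB", as_binary)
```

<p>With a known schema, both sides agree on the layout up front, so the CPU skips the parsing step entirely.</p>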

<h3 id="network-latency">Network Latency</h3>
<p>The speed of light in fiber optic is a hard limit on data transfer speed. The distance from London to San Francisco is about 8,500 km, and one round trip takes roughly 85ms. So with HTTP/1.1 and TLS 1.2 the total time for one HTTP request is:
TCP handshake [85ms] + TLS handshake [85ms] + HTTP request [85ms] = 255ms</p>

<blockquote>
  <p>one RTT 10ms per 1000KM</p>
</blockquote>
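<p>The rule of thumb follows from the speed of light in fiber, roughly 200,000 km/s (about 2/3 of c); a quick Python check:</p>

```python
def min_rtt_ms(distance_km, fiber_km_per_s=200_000):
    """Lower bound on round-trip time given the speed of light in fiber."""
    return 2 * distance_km / fiber_km_per_s * 1000

print(min_rtt_ms(1000))  # ~10 ms, the rule of thumb above
print(min_rtt_ms(8500))  # ~85 ms for London <-> San Francisco
```

<p>This is a floor, not an estimate: routing detours, queuing, and handshakes only add to it.</p>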

<h3 id="quic">QUIC</h3>
<p>TCP should be reserved for rare, long-lived connections. QUIC (built on UDP) combines the TCP and TLS handshakes, so the latency from London to San Francisco becomes: combined handshake [85ms] + HTTP [85ms] = 170ms. This is 1-RTT setup instead of the 2-RTT with TCP. For returning users this reduces to 85ms, i.e. 0-RTT setup. Using a CDN, the edge server can keep an open TCP connection with the root server, and this latency can be reduced further.</p>

<h3 id="http2">HTTP2</h3>
<p>HTTP/2 sends all requests over a single TCP connection (multiplexing, connection pooling), and if combined with protobuf the serialization and deserialization cost can be saved as well.</p>

<p>Apache Arrow defines a standard memory layout and achieves zero-copy deserialization: the data on the network, on disk, and in RAM is identical.</p>

<blockquote>
  <p>Good rules of thumb for API design: <strong>Batching</strong>: send one request containing many little things; <strong>Data locality</strong>: if two services communicate too much, consider making them one; <strong>Coarse-grained APIs</strong>: avoid chatty interfaces</p>
</blockquote>

<hr />

<h2 id="part-4-anatomy-of-a-request">Part 4 Anatomy of a Request</h2>

<ul>
  <li>URL in browser memory -&gt; syscall connect() -&gt; context switch to TCP/IP stack -&gt;</li>
  <li>MTU limit on packet is 1500 Bytes and TCP header comes in</li>
  <li>The destination IP address must be resolved, so DNS comes in; Geo-DNS returns an IP according to the location of the user</li>
  <li>Ethernet header</li>
  <li>After leaving ISP, BGP comes in and it decides on which path to choose for the given destination</li>
  <li>Anycast helps BGP by announcing the same IP from different locations</li>
  <li>An edge server comes in and TLS is terminated there; since the TCP handshake is expensive, the edge server reuses an existing warm TCP connection to the root server</li>
  <li>Firewall for DPI, packet inspection is expensive so keep it as close to edge as possible</li>
  <li>Load balancer, layer 7</li>
  <li>API gateway: checks that the user exists and the JWT token is valid, rate limiting, protocol translation from REST to gRPC</li>
</ul>

<h2 id="part-5-persistence">Part 5 Persistence</h2>

<h3 id="fundamental-challenge">Fundamental challenge</h3>
<p><strong>Persistence is important</strong> — you cannot afford to lose data. But disk is <em>slow</em>.
The goal: data that survives power outages <em>and</em> systems that feel fast.</p>

<h4 id="latency-in-human-scale">Latency in Human Scale</h4>

<p>To make latency intuitive, imagine scaling nanoseconds to human time:</p>

<table>
  <thead>
    <tr>
      <th>Storage</th>
      <th>Scaled Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>L1 Cache</td>
      <td>0.5 seconds</td>
    </tr>
    <tr>
      <td>RAM</td>
      <td>2 minutes</td>
    </tr>
    <tr>
      <td>SSD</td>
      <td>2 days</td>
    </tr>
    <tr>
      <td>HDD</td>
      <td>5 months</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p>We want persistence <em>and</em> speed. These two goals conflict — and the rest of this lecture is about how to reconcile them.</p>
</blockquote>

<hr />

<h3 id="databases--the-os-lie">Databases &amp; The OS Lie</h3>

<h4 id="buffered-io">Buffered I/O</h4>

<p>When a process calls <code class="language-plaintext highlighter-rouge">write()</code>, the OS does <strong>not</strong> immediately write to disk. Data goes to the <strong>page cache</strong> (in RAM), and <code class="language-plaintext highlighter-rouge">write()</code> returns immediately. This is the “OS lie” — your data isn’t on disk yet.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Process          Page Cache (RAM)         Disk
   │                    │                   │
   │──── write() ──────►│                   │
   │◄─── returns ───────│                   │
   │                    │── (eventually) ──►│
</code></pre></div></div>

<h4 id="fsync--safe-but-slow">fsync() — Safe but Slow</h4>

<p><code class="language-plaintext highlighter-rouge">fsync()</code> blocks until writes actually reach disk and are confirmed. It is the antidote to the OS lie, but expensive — the process cannot proceed until the disk acknowledges.</p>

<blockquote>
  <p><strong>Trade-off:</strong> Buffered I/O (fast, unsafe) vs. <code class="language-plaintext highlighter-rouge">fsync()</code> (safe, slow). The choice depends on how much data loss your use-case can tolerate.</p>
</blockquote>
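<p>In Python, this trade-off is visible directly in the system calls. A minimal sketch of a durable append (the file path and payload are illustrative):</p>

```python
import os

def durable_append(path, payload: bytes):
    """Append bytes and force them to disk before returning: safe but slow."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)   # lands in the page cache: the OS lie
        os.fsync(fd)            # block until the disk acknowledges
    finally:
        os.close(fd)

durable_append("/tmp/ledger.log", b"credit account 42: +100\n")
```

<p>Dropping the <code class="language-plaintext highlighter-rouge">os.fsync()</code> call gives you the fast, unsafe buffered behaviour described above.</p>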

<hr />

<h3 id="write-ahead-log-wal">Write-Ahead Log (WAL)</h3>

<p>Instead of writing directly to tables (random I/O), databases <strong>append every write to the end of a log file</strong> first — the Write-Ahead Log.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DB Write ──► Append to WAL ──► ACK returned to client
                  │
                  │ (later, when idle)
                  ▼
            Update index/tables
</code></pre></div></div>

<p>WAL converts random writes into <strong>sequential writes</strong>, which are dramatically faster on both HDDs and SSDs.</p>

<blockquote>
  <p><strong>Sequential writes &gt; random writes.</strong>
On HDDs, the read/write head needs to physically move — appending keeps the head still. SSDs benefit from sequential patterns too.</p>
</blockquote>
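<p>A toy version of the write path in Python (class and record formats are mine): every write is a sequential append plus fsync, and the in-memory state can be rebuilt by replaying the log.</p>

```python
import os
import tempfile

class WriteAheadLog:
    """Minimal WAL sketch: ACK a write only after a sequential, durable append."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "ab")

    def append(self, record: bytes):
        self.f.write(record + b"\n")   # sequential: always at the end of the file
        self.f.flush()
        os.fsync(self.f.fileno())      # durable before we ACK the client

    def replay(self):
        """On restart, rebuild in-memory state by re-reading the log."""
        with open(self.path, "rb") as f:
            return [line.rstrip(b"\n") for line in f]

wal = WriteAheadLog(tempfile.mktemp(suffix=".wal"))
wal.append(b"SET user:1 alice")
wal.append(b"SET user:2 bob")
print(wal.replay())  # the index/tables can be updated later, when idle
```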

<hr />

<h3 id="ssd-internals--write-amplification">SSD Internals — Write Amplification</h3>

<p>A common misconception: <strong>an SSD is not fast RAM</strong>. You cannot overwrite or delete a single byte in place on an SSD; you can only erase in large chunks (~2 MB blocks).</p>

<h4 id="the-read-modify-erase-write-dance">The Read-Modify-Erase-Write Dance</h4>

<p>To change <strong>1 byte</strong> on an SSD:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Read 2 MB chunk → RAM
2. Modify 1 byte in RAM
3. Erase the entire 2 MB block on SSD
4. Write the full 2 MB back to SSD
</code></pre></div></div>

<p>Goal: update 1 byte → actually moved 2 MB. This is <strong>write amplification</strong>.</p>

<h4 id="flash-translation-layer-ftl">Flash Translation Layer (FTL)</h4>

<p>The <strong>FTL</strong> is a small orchestrator embedded in every SSD. It manages physical erase blocks, tracks logical-to-physical block mappings, and handles wear leveling — making the drive appear as a simple byte-addressable device and hiding all the complexity above.</p>

<hr />

<h3 id="b-tree--shallow-and-fat">B-Tree — Shallow and Fat</h3>

<p>Without an index, every query is a full table scan. The <strong>B-Tree</strong> solves this: optimized for reads, keeping itself shallow by making each node very wide (many children).</p>

<h4 id="exponential-growth-fanout--500">Exponential Growth (fanout = 500)</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Layer 1 (Root):  1 node
Layer 2:         500 nodes
Layer 3:         250,000 nodes
Layer 4:         125,000,000 nodes → 62.5 billion keys indexed
</code></pre></div></div>

<p><strong>62.5 billion rows indexed with only 4 disk reads (~30–40 ms per lookup).</strong></p>
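<p>The depth claim is easy to verify with integer arithmetic; a quick check of the fanout math above:</p>

```python
def btree_depth(n_keys, fanout=500):
    """Smallest number of levels whose reach (fanout ** depth) covers n_keys."""
    depth, reach = 0, 1
    while reach < n_keys:
        reach *= fanout
        depth += 1
    return depth

print(500 ** 4)                     # 62,500,000,000 keys reachable in 4 levels
print(btree_depth(62_500_000_000))  # 4 disk reads per lookup
```

<p>Each extra level multiplies the reach by the fanout, which is why wide, shallow trees beat binary trees on disk.</p>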

<hr />

<h3 id="lsm-tree--log-structured-merge-tree">LSM Tree — Log-Structured Merge Tree</h3>

<p>The B-Tree is optimized for reads. The <strong>LSM Tree</strong> makes the opposite bet: optimize for writes. Used by <strong>Cassandra</strong>, <strong>RocksDB</strong>, and other NoSQL engines.</p>

<p>The key rule: <em>only ever do sequential writes. Never do random writes on disk.</em></p>

<h4 id="write-path">Write Path</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Write arrives
     │
     ▼
MemTable (sorted list in RAM)  ──also writes──►  WAL on disk (durable)
     │
     │ (when MemTable is full → flush)
     ▼
SSTable on disk (immutable, sorted, sequential write)
</code></pre></div></div>

<h4 id="bloom-filter">Bloom Filter</h4>

<p>Each SSTable has an associated <strong>Bloom filter</strong> — a probabilistic data structure that answers:
<em>“Is this key in this SSTable?”</em></p>

<ul>
  <li><strong>“Definitely not”</strong> → skip the SSTable entirely (no disk read needed)</li>
  <li><strong>“Maybe yes”</strong> → go check</li>
</ul>

<p>If the Bloom filter says no, the SSTable is not even touched. This dramatically reduces unnecessary disk reads.</p>
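<p>A toy Bloom filter in Python shows the two possible answers (bit-array size and hash scheme are arbitrary choices of mine, not what Cassandra or RocksDB use):</p>

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'definitely not' or 'maybe yes', never a false negative."""

    def __init__(self, size_bits=1024, n_hashes=3):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive n_hashes positions by salting a cryptographic hash.
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True: "maybe yes", go check the SSTable
print(bf.might_contain("user:999"))  # almost certainly False: skip the SSTable entirely
```

<p>The filter lives in RAM, so the "definitely not" answer costs no disk I/O at all.</p>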

<hr />

<h3 id="the-rum-conjecture">The RUM Conjecture</h3>

<p>A fundamental trade-off in data structure design — you can optimise for any <strong>two</strong> of the three, but never all three simultaneously:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              R (Read)
             /        \
            /          \
           /  Pick Two  \
          /     Only     \
         /                \
U (Update) ────────────── M (Memory)

B-Tree:   optimises R + M
LSM Tree: optimises U + R
</code></pre></div></div>

<table>
  <thead>
    <tr>
      <th>Optimise for</th>
      <th>Use</th>
      <th>Example workload</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>R + M</td>
      <td><strong>B-Tree</strong></td>
      <td>Banking, relational DBs (read-heavy)</td>
    </tr>
    <tr>
      <td>U + R</td>
      <td><strong>LSM Tree</strong></td>
      <td>Logs, event streams, Cassandra (write-heavy)</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>Rule of thumb:</strong> If your application is read-heavy, use a B-Tree. If it will be write-heavy, use an LSM Tree.</p>
</blockquote>

<hr />

<h3 id="the-invisible-enemy--bit-rot">The Invisible Enemy — Bit Rot</h3>

<p>Even at rest, data can silently corrupt. Cosmic rays, voltage fluctuations, and magnetic interference can flip bits without the OS noticing. <strong>Do not trust hardware.</strong></p>

<p><strong>Solution:</strong> Checksums (e.g. SHA-256). Compute a hash on write; recompute and verify on read. ZFS does this automatically for every block.</p>
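<p>The checksum idea takes only a few lines; a sketch of verify-on-read (the block format is my own, not ZFS’s):</p>

```python
import hashlib

def write_block(data: bytes) -> dict:
    """Store the data together with a checksum computed at write time."""
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}

def read_block(block: dict) -> bytes:
    """Recompute and verify on every read; detect silent corruption."""
    if hashlib.sha256(block["data"]).hexdigest() != block["checksum"]:
        raise IOError("checksum mismatch: block is corrupted")
    return block["data"]

block = write_block(b"important record")
print(read_block(block))  # verifies, then returns the data

# Simulate bit rot: flip one bit in the stored data.
block["data"] = bytes([block["data"][0] ^ 0x01]) + block["data"][1:]
# read_block(block) would now raise IOError instead of returning bad data
```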

<h4 id="disk-failure-rates">Disk Failure Rates</h4>

<p>With a fleet of 10,000 disks, roughly <strong>one disk fails every day</strong>. At scale, disk failures are expected daily events, not exceptional ones. Distributed systems must treat failure as the norm.</p>

<h4 id="2003--google-file-system-gfs">2003 — Google File System (GFS)</h4>

<p>GFS splits data into chunks, storing <strong>3 copies across 3 machines on 3 different racks</strong>. If one rack goes down, data survives on the other two.</p>

<p>But this introduces a <strong>consistency problem</strong>: all three replicas might have slightly different data at any moment. Consensus protocols like <strong>Raft</strong> solve this.</p>

<hr />

<h3 id="the-big-trade-off-spectrum">The Big Trade-Off Spectrum</h3>

<p>Every persistence decision sits on a spectrum between maximum speed and maximum durability:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>High risk / High speed ◄─────────────────────────────► Low risk / Low speed

RAM /                     fsync() to                Distributed replication,
Buffered I/O              SSD                       S3 / Multi-region
</code></pre></div></div>

<h4 id="practical-decision-framework">Practical Decision Framework</h4>

<p><strong>Is it a like on a TikTok post?</strong></p>
<ul>
  <li>It’s okay if people see the count update with a 2–3 second delay.</li>
  <li>Use an <strong>LSM tree</strong> with buffered I/O. Optimise for throughput.</li>
</ul>

<p><strong>Is it a bank transfer?</strong></p>
<ul>
  <li>Data loss is unacceptable.</li>
  <li>Use <strong><code class="language-plaintext highlighter-rouge">fsync()</code></strong> and wait for <em>all</em> replicas to reply before returning success. Optimise for durability.</li>
</ul>

<hr />]]></content><author><name></name></author><category term="distributed-systems" /><summary type="html"><![CDATA[System Design from First Principles]]></summary></entry><entry><title type="html">Distributed Systems For Fun And Profit</title><link href="https://mubashirusman.github.io/2026/03/14/distributed-systems-for-fun-and-profit.html" rel="alternate" type="text/html" title="Distributed Systems For Fun And Profit" /><published>2026-03-14T00:00:00+00:00</published><updated>2026-03-14T00:00:00+00:00</updated><id>https://mubashirusman.github.io/2026/03/14/distributed-systems-for-fun-and-profit</id><content type="html" xml:base="https://mubashirusman.github.io/2026/03/14/distributed-systems-for-fun-and-profit.html"><![CDATA[<h1 id="notes-from-distributed-systems-for-fun-and-profit">Notes from Distributed Systems for Fun and Profit</h1>

<p>When I started learning about distributed systems I took notes mostly in my notebook, but it takes longer to go back and reference them. So when I decided to read this book, I decided to keep my notes here. In big systems, things like fault tolerance, leader election, failure detection, coordination, consistency, and availability come up very often, and the strategy we choose to deal with them depends on the kind of system we are after.</p>

<h2 id="chapter-1-basics">Chapter 1: Basics</h2>

<p>A distributed system is a way of doing a task with many computers instead of one. To do this we have a constraint: use commodity hardware instead of relying on the most expensive hardware. There is a fundamental reason for this: as the number of nodes grows, the performance difference between high-end and commodity hardware decreases. So everything happens as scale grows, and the goal is scalability. Scalability can be defined as the ability of a system to grow with the amount of work. This can be in terms of data size, the computers-to-administrators ratio, keeping latency low as nodes grow, etc.</p>

<p>Scalable systems have two properties: 1. Performance (latency) 2. Availability (fault tolerance)</p>

<ul>
  <li>Performance means achieving a short response time OR high throughput OR low utilization of resources. <br />
Of these three, latency is the most interesting, as it has little to do with financial limitations and more with physical ones. The speed of light and the speeds at which hardware components can work are hard limits. E.g. the time between a write being initiated and the confirmation response being received.</li>
  <li>Availability is the proportion of time the system is functioning properly. If a user cannot load the page, the system is not available.<br />
Availability can be measured in terms of uptime: 90% availability allows more than a month of downtime per year, 99.9% allows under 9 hours, and 99.99% under an hour. Availability is affected by many factors beyond the uptime of a service: hard disks catching fire, a star falling from the sky, or the company going bankrupt. The best we can do is design for fault tolerance. Fault tolerance = the ability to behave well when a fault occurs.</li>
</ul>
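<p>The uptime figures above come from simple arithmetic; a quick Python check:</p>

```python
def downtime_hours_per_year(availability_pct):
    """Allowed downtime per year for a given availability percentage."""
    hours_per_year = 365 * 24  # 8760, ignoring leap years
    return hours_per_year * (1 - availability_pct / 100)

for a in (90, 99, 99.9, 99.99):
    print(f"{a}%: {downtime_hours_per_year(a):.2f} hours/year")
# 90% allows ~876 hours (over a month); 99.9% allows ~8.8 hours;
# 99.99% allows under an hour per year
```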

<p>The hindrance to these good things: an increased number of nodes increases the probability of one node failing (fault tolerance), and more nodes may mean more communication (thus reduced performance, higher latency). Our system design options live within these physical constraints. Both performance and availability are defined by external guarantees such as SLAs.</p>

<p>Making appropriate <strong>abstractions</strong> for complex systems makes them more manageable and understandable. In this regard, <strong>models</strong> help us concretely define what the properties of our system will be. Good abstractions remove irrelevant details. Some models are: failure modes (crash/Byzantine), system model (synchronous/asynchronous), consistency (weak/strong).
A system with weaker guarantees can be more performant and more available, yet harder to reason about, at the same time.<br />
Some failure types, such as network latency and network partitions, mean that a system has to make a hard choice: is it worth it to stay available and provide loose guarantees, or to reject requests and play safe?</p>

<p>Design techniques: Partition and replicate</p>
<ol>
  <li>Partitioning divides data across multiple nodes, each partition holding a subset of the data. This improves performance by limiting the amount of data a single node works with, and increases availability by allowing partitions to fail independently.</li>
  <li>Replication copies the same data onto multiple nodes, and it is the main tool we have to fight latency. It improves performance by making additional bandwidth and compute available, and improves availability by increasing the number of nodes that must fail before the system becomes unavailable.<br />
Replicate to reduce the risk of a single point of failure; replicate data to a local cache to reduce latency, or onto multiple machines to increase throughput. The downside of replication is that the copies need to be kept in sync, which means replication must follow some consistency model.
Stronger consistency lets you program against the system as if it were not replicated, while weaker consistency models expose some of the underlying details but can offer lower latency and higher availability.</li>
</ol>

<hr />
<h2 id="chapter-2-up-and-down-the-level-of-abstraction">Chapter 2: Up and Down the Level of Abstraction</h2>

<p>The fundamental tension is between how we want the system to behave, i.e. as a single unit, and how the system actually is: distributed. So we create abstractions; we assume that two nodes are equal even when they are not, which makes things easier and more manageable. <strong>Impossibility results</strong> tell us that within our assumptions some things cannot be done. In a distributed system, programs run concurrently on independent nodes, there is an unreliable network between them, and they have no shared memory or shared clock. This means that the knowledge in a particular node is local, any information about global state is normally out of date, clocks are not synchronized, and nodes can fail and recover from failure independently. A robust system is one that makes few or no assumptions. We can also build a system with strong assumptions, e.g. that nodes never fail, so the system need not handle node failure, though this is an unrealistic assumption.</p>
<ul>
  <li>Nodes can fail by <strong>crashing</strong> or in an arbitrary way other than crashing (in <strong>Byzantine fashion</strong>). We usually consider only crash failures, because we cannot account for the infinite number of other possible failures when designing our algorithm. For example, a hacker could break into a machine, but we don’t take such failures into account.</li>
  <li>Communication links can be assumed to be <strong>unreliable</strong> and subject to <strong>message loss</strong> and delays. A <strong>network partition</strong> occurs when the network fails between nodes but the nodes themselves remain operational. These are sufficient assumptions without going into the details of individual network links or counting the distance between nodes (in a local network).</li>
  <li>Timing assumptions matter because nodes have their own clocks and are at some distance from each other. In a synchronous system there exists an <strong>upper bound on message transmission delay</strong>; in an asynchronous system processes execute independently without any upper bound. Synchronous means that processes execute in lock-step and messages sent will be received within a maximum delay; asynchronous assumes we cannot rely on timing at all. Assumptions about execution speeds and maximum message delays can help rule out failure scenarios, as if they never happened. Real-world systems mostly run within the upper bounds, but there are certainly times when there are delays and message loss.</li>
</ul>

<h3 id="the-consensus-problem">The consensus problem</h3>
<p>Some computers are in consensus if they agree on some value. Properties of consensus: 1. Agreement: every correct process must agree on the same value 2. Integrity: every correct process decides at most one value 3. Termination: all processes eventually reach a decision 4. Validity: if all processes propose the same value, then they all decide that value. Solving consensus helps us solve more advanced problems such as atomic commit and atomic broadcast.</p>
<h4 id="flp-impossibility-result">FLP Impossibility result</h4>
<p>Assuming the asynchronous system model, it states that we can <strong>not</strong> guarantee consensus in an asynchronous system where even one process can fail by crashing, even if messages are never lost. The reason is that message delivery can be delayed so that some process remains undecided for an arbitrary amount of time.</p>
<h4 id="the-cap-theorem">The CAP theorem</h4>
<p>Consistency in CAP means that all nodes have the same copy of the data, or else the system refuses to answer. Availability means the system keeps giving answers even in the face of node failures. Partition tolerance means the system continues to operate even in the face of network division. Only two can be satisfied simultaneously.
Picking CA: strict quorum protocols such as two-phase commit. It cannot tolerate any node failure and gives strong consistency, but it cannot differentiate between a network partition and a node failure. Common in traditional relational databases using two-phase commit.
Picking CP: majority quorum protocols where the minority partition becomes unavailable, like Paxos. It can tolerate <code class="language-plaintext highlighter-rouge">n</code> node failures out of <code class="language-plaintext highlighter-rouge">2n+1</code> nodes, meaning it works as long as <code class="language-plaintext highlighter-rouge">n+1</code> stay up. The minority partition does <strong>NOT</strong> accept writes; only the majority partition can.
Picking AP: protocols that involve conflict resolution, like DynamoDB.
In the face of a partition, the CAP theorem reduces to a <strong>choice between Consistency and Availability</strong>.
Four conclusions from the CAP theorem:</p>
<ul>
  <li>early system designs did not incorporate network partitions (mostly CA), but today we cannot ignore partitions, as systems are spread across different geographic regions.</li>
  <li>there is a fundamental tension between strong consistency and high availability when a network partition occurs.
Strong consistency guarantees require us to give up availability during a network partition: if we do not give up availability and two nodes cannot communicate, we get divergence. We can overcome this in two ways: 1. not have partitions 2. weaken the guarantees</li>
  <li>there is a tension between strong consistency and performance in normal operation: all nodes must agree on the same result before moving on, and this introduces latency. If we can <strong>relax guarantees</strong> then we can have <strong>lower latency</strong> and <strong>higher availability</strong>; the fewer nodes involved in an operation, the less time we wait for the result. The trade-off is that we allow some anomalies to occur, which means you may read stale data.</li>
  <li>consistency and availability are not binary choices, unless we restrict ourselves to <strong>strong consistency</strong>. CAP consistency != ACID consistency. Consistency is a broader term, and strong consistency is just one form of it.</li>
</ul>

<h3 id="consistency-models">Consistency models</h3>
<p>Consistency models can be divided into two categories: 1. Strong consistency models 2. Weak consistency models</p>

<ul>
  <li>Strong consistency models: 1. <em>linearizable consistency</em>, in which all operations appear to execute atomically in the same order as the actual time ordering of operations, 2. <em>sequential consistency</em>, the same as linearizability except that operations may be executed in an order different from the one in which they were received.</li>
  <li>Weak consistency models: 1. Client-centric models involve the notion of a client or session in some way, for example forwarding a client to the same replica after they update something, so that they don’t see their own older data. 2. Eventual consistency, where all nodes will agree on the same value after an undefined amount of time. “Eventually” is a very weak guarantee, so we should at least be able to bound how long “eventually” is.</li>
</ul>

<hr />
<h2 id="chapter-3-time-and-order">Chapter 3: Time and Order</h2>
<p>Time is used beyond distributed systems; our personal computers use it as well, e.g. to track how long a DNS query is cacheable, or to check whether a certificate is still valid. Time helps keep track of the order in which events occurred, and we care a lot about order since it is easier for our brains to reason about, so time is an important property.</p>

<p>There are two types of clocks: 1. <em>Physical clock</em>: to count the number of seconds elasped 2. <em>Logical clocks</em>: count events such as messages sent</p>

<p>Physical clocks are made of a quartz crystal that vibrates at some frequency; we count the number of cycles and map them to seconds. The frequency at which it oscillates depends on temperature, so it is very precise but not perfect. Most quartz clocks deviate by 20 to 50 ppm, where 1 ppm works out to roughly 32 seconds per year. Atomic clocks are more accurate than quartz.</p>

<h3 id="definition-of-time">Definition of time</h3>
<p>How is time defined? Time is affected by how fast the Earth rotates. GMT (Greenwich Mean Time, solar time) is literally defined by astronomical observation of the sun’s position as seen from Greenwich in South East London. Atomic clocks measure time by the frequency at which the caesium atom resonates: 1 second = 9,192,631,770 periods of the caesium atom’s radiation, so 1 day = 24 × 60 × 60 × 9,192,631,770 periods, cool. We want to use atomic time but also stay consistent with the Earth’s time, so we apply corrections to atomic time to account for the Earth’s rotation, and that gives us UTC. All time zones are some offset from UTC, e.g. the US east coast is UTC-5 and Pakistan is UTC+5, so the time difference between Pakistan and the US east coast is 10 hours. Unix time counts the number of seconds elapsed since January 1st, 1970. Software simply ignores whether a leap second has to be added or removed, so we usually don’t care; but in distributed systems we cannot ignore the shift of a second.</p>

<p><em>Total order</em> and <em>partial order</em> are mathematical relations. A total order is one where any two elements are comparable; a partial order requires only some pairs of elements in a set to be comparable. So every total order is a partial order, but not vice versa. Time is essentially a form of order, and a timestamp really represents the state of a system at that instant.</p>

<h3 id="characteristics-of-time">Characteristics of time</h3>
<p>We can attach timestamps to unordered events to make them <em>ordered</em>. Timestamps are comparable values and can be <em>interpreted</em> by humans to understand when something happened. Durations can be used by algorithms to make judgements about a system: for example, time spent waiting can provide a clue about whether the system is partitioned or just experiencing high latency. Distributed systems do not execute in any single predetermined order, and imposing an order is one way to reduce the number of possible executions.</p>

<h3 id="global-vs-logical-clocks">Global vs Logical clocks</h3>
<p>In the context of how we know time, it is easier to picture a total order than a partial order. But assuming that things happened strictly one after the other is a strong assumption, and strong assumptions can lead to a fragile system. The more temporal nondeterminism we can tolerate, the closer we are to the true nature of a distributed system. Does time progress at the same rate everywhere? There are three answers: with a <strong>global clock</strong>: yes; with <strong>local clocks</strong>: no, but; with <strong>no clocks</strong>: no!
Clocks are important for assigning a total order to operations. A synchronous system has a global clock; in an asynchronous system there is no clock. A global clock is a source of total order, but it is limited by, among other things, the accuracy of a clock synchronisation protocol such as NTP. Cassandra (originally from Facebook) is an example of a system that uses timestamps to resolve conflicts between writes: the write with the newest timestamp wins. This means that if clocks drift, new writes may be overwritten by old ones.<br />
Under the no-clock assumption we get a notion of logical time, using <strong>counters</strong> as timestamps. <em>We can order events between different machines using counters and communication, and find out whether something happened before, after, or concurrently with something else.</em> In a partial order not every pair of elements is comparable: if event x happens on machine A and event y happens on machine B and there was no communication between them, then we cannot say that x happened before y (x -&gt; y) or y happened before x (y -&gt; x). All of this is in the absence of a global clock. What Lamport clocks guarantee is this: if A -&gt; B then counter(A) &lt; counter(B), but not the other way around. This is a partial order.</p>

<p>Time can define order across systems without communication, and sometimes correctness depends on correct event ordering (such as serializability in a distributed database). Only a global clock can order events across two machines without communication; without one, the machines need to exchange messages. Time can also be used to define boundary conditions for algorithms via timeouts, for example to distinguish between high latency and a server being down. The algorithms that do this are called failure detectors.</p>

<h3 id="vector-clocks">Vector clocks</h3>
<p>How can we order events without synchronizing physical clocks? Enter Lamport clocks. Lamport clocks and vector clocks use <strong>counters</strong> plus communication to define order, and are replacements for physical clocks. The counter is comparable across machines.</p>

<p>This is how Lamport clock works:</p>
<ul>
  <li>If a process does work, increment the counter</li>
  <li>If process sends a message, include the counter</li>
  <li>When a message is received, set the counter to <code class="language-plaintext highlighter-rouge">max(local_counter, received_counter)+1</code></li>
</ul>
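<p>A minimal sketch of these three rules in Python (the class and method names are my own, not from any particular library):</p>

```python
class LamportClock:
    """Minimal sketch of the three Lamport-clock rules above."""
    def __init__(self):
        self.counter = 0

    def tick(self):
        # Rule 1: a process doing work increments its counter.
        self.counter += 1

    def send(self):
        # Rule 2: a sent message carries the current counter.
        return self.counter

    def receive(self, received_counter):
        # Rule 3: on receipt, counter = max(local, received) + 1.
        self.counter = max(self.counter, received_counter) + 1
```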

<p>So a Lamport clock allows counters to be compared across systems, with one caution: if counter(A) &lt; counter(B) then either A happened before B or A is incomparable with B. Comparing Lamport timestamps across systems that never communicate with each other may lead us to assume some event happened before another when in reality they happened concurrently. So you cannot say anything meaningful about events on two independent systems that are not causally related.</p>

<p>A <strong>Vector Clock</strong> maintains an array of N logical clocks, one for each node. Each node increments its own counter instead of incrementing a common counter. Rules are</p>
<ul>
  <li>Whenever a process does work, increment the logical clock value of the node in the vector</li>
  <li>Whenever process sends message, include the full vector of logical clocks</li>
  <li>When a message is received:<br />
<em>update each element in the vector to max(local, received)</em> AND <em>increment the logical clock value representing the current node in the vector</em></li>
</ul>
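<p>The same rules, sketched for a fixed, known set of nodes (again, illustrative names, and I assume sending itself counts as an event):</p>

```python
class VectorClock:
    """Sketch of the vector-clock rules above for a fixed set of nodes."""
    def __init__(self, node, nodes):
        self.node = node
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        # Local work: increment only this node's entry in the vector.
        self.clock[self.node] += 1

    def send(self):
        # Sending counts as a local event here (an assumption of this
        # sketch); the full vector is included in the message.
        self.tick()
        return dict(self.clock)

    def receive(self, received):
        # Element-wise max with the received vector, then increment own entry.
        for n in self.clock:
            self.clock[n] = max(self.clock[n], received[n])
        self.clock[self.node] += 1

def happened_before(a, b):
    """Partial order: a -> b iff a <= b element-wise and a != b."""
    return all(a[n] <= b[n] for n in a) and a != b
```

Two vectors that are incomparable in both directions denote concurrent events, which is exactly the information a single Lamport counter cannot give you.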

<p>The problem with vector clocks is that they require one entry per node and thus can become very large for big systems. This problem can be countered by techniques such as periodic garbage collection or by reducing accuracy by limiting the size.</p>

<h3 id="failure-detectors">Failure detectors</h3>
<p>The amount of time spent waiting can provide clues about whether a system is partitioned or merely experiencing high latency. We do not need a global clock with perfect accuracy here; a reliable-enough local clock is sufficient. In the absence of a response from a remote machine, we can assume the node has failed after some reasonable amount of time. But what counts as a <em>reasonable</em> time? Instead of specifying concrete values, it is better to abstract away the exact timing assumptions.</p>

<p>This is where <strong>failure detectors</strong> come in. They are based on <strong>heartbeat messages</strong> and <strong>timers</strong>. A timeout-based failure detector risks being too aggressive (too quick to declare failure) or too conservative (taking too long to detect a failure).
Failure detectors are characterized by two properties: <em>completeness and accuracy</em>.</p>
<ul>
  <li>Strong completeness: every crashed process is eventually suspected by every correct process</li>
  <li>Weak completeness: every crashed process is eventually suspected by some correct process</li>
  <li>Strong accuracy: No correct process is suspected ever</li>
  <li>Weak accuracy: Some correct process is never suspected</li>
</ul>

<p>Completeness is easier to achieve than accuracy. In fact, weak completeness can be transformed into strong completeness by broadcasting information about suspected processes. But avoiding incorrectly suspecting a non-faulty process is hard unless you have a hard bound on message delay, which is only possible in the synchronous system model. Therefore, in systems where <strong>hard bounds are not set</strong> on message delays, failure detectors can <strong>only be eventually accurate</strong>.</p>

<p>The image below is taken from Chandra et al. (1996) paper.
<img src="/assets/Chandra-et-al.png" alt="Chandra et al." /></p>

<p>This diagram shows that some problems cannot be solved without strong assumptions about time bounds (i.e. without failure detectors): without them, it is not possible to tell whether a remote node has crashed or is simply experiencing high latency.</p>

<h3 id="implementing-a-failure-detector">Implementing a failure detector</h3>
<p>Conceptually there is not much to a simple failure detector: it declares failure when a timeout expires. The most <strong>interesting part is how judgments are made about whether a remote node has failed</strong>. Ideally a failure detector would adjust to network conditions instead of hard-coding timeout values. For example, Cassandra uses an accrual failure detector, which outputs a suspicion level (a value between 0 and 1) rather than a binary “up” or “down” judgment. The tradeoff between accurate detection and early detection is then left to the application.</p>
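<p>A toy accrual-style detector can be sketched as follows; the window size and the suspicion formula here are illustrative assumptions of mine, not Cassandra’s actual Phi Accrual implementation:</p>

```python
import statistics

class AccrualFailureDetector:
    """Toy accrual-style detector: instead of a binary up/down answer it
    returns a suspicion level that grows the longer a heartbeat is overdue,
    relative to the inter-arrival times observed so far."""
    def __init__(self, window=100):
        self.intervals = []        # recent heartbeat inter-arrival times
        self.last_heartbeat = None
        self.window = window

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.window:]
        self.last_heartbeat = now

    def suspicion(self, now):
        """0.0 = certainly alive; approaching 1.0 = almost certainly failed."""
        if len(self.intervals) < 2:
            return 0.0  # not enough history to judge yet
        mean = statistics.mean(self.intervals)
        overdue = (now - self.last_heartbeat) / mean
        # Ramp from 0 to 1 once the node is more than one mean interval late.
        return min(1.0, max(0.0, (overdue - 1.0) / 4.0))
```

The application then picks its own threshold on the suspicion level, trading early detection against false positives.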

<p>When is order/synchronicity really needed? It depends on the system under consideration. In many cases we want the responses from a database to represent all of the available information with no inconsistency. In other cases it is acceptable to give an answer that represents only the best known estimate, based on a subset of the total information. In particular, during a network partition one may want to answer queries with only part of the system accessible. For example, is the Twitter follower count for some user X, or X+1? Are movies A, B and C the absolute best answers for some query? A cheaper, mostly correct “best effort” answer can be acceptable.</p>

<hr />

<h2 id="chapter-4-replication">Chapter 4: Replication</h2>

<p>The replication problem provides context for many other sub-problems of distributed systems: leader election, consensus, and failure detection.</p>

<table>
  <thead>
    <tr>
      <th>Synchronous</th>
      <th>Asynchronous</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Client waits; all nodes must receive the update and acknowledge it to the master</td>
      <td>Response is sent back to the client immediately</td>
    </tr>
  </tbody>
</table>

<h3 id="primarybackup-replication">Primary/Backup replication</h3>

<p><strong>Provides Weak Consistency and not partition tolerant</strong></p>

<p>In this scheme the master receives all updates, and a log of operations is shipped to the replicas. Two variants:</p>
<ul>
  <li><strong>Asynchronous</strong> Pri/Backup replication: can work with one message <em>update</em></li>
  <li><strong>Synchronous</strong> Pri/Backup replication: requires two messages <em>update + acknowledge receipt</em></li>
</ul>

<p>MySQL uses the asynchronous variant by default. Any asynchronous replication algorithm can only provide weak durability guarantees; in MySQL replication this shows up as replication lag, where replicas run at least one operation behind the master. If the master fails, the updates that have not yet been sent to the backups are lost.</p>

<p>The synchronous variant of primary/backup replication ensures that writes have been stored on other nodes before returning back to the client - client waits, but this too can only provide weak guarantees.</p>
<blockquote>
  <p>For example: a primary receives a write and sends it to the replicas; a backup persists and ACKs the write; then the primary fails before sending the ACK to the client. Now the client assumes the write failed even though the backups have already applied it.</p>
</blockquote>

<p><strong>Primary/backup or log shipping schemes only offer best-effort guarantees.</strong> They are susceptible to failed updates and split brain: for example, if the backup takes over during a temporary network issue, there will be two active primaries at the same time.</p>

<p>P/B has following properties:</p>
<ul>
  <li>Single, static master</li>
  <li>Replicated log, slaves are not involved in executing operations</li>
  <li>No bounds on replication delay</li>
  <li>Manual/ad-hoc failover, not fault tolerant</li>
  <li>Not partition tolerant</li>
</ul>

<h3 id="2-phase-commit">2 Phase Commit</h3>

<p><strong>Provides Strong Consistency but not partition tolerant</strong> also <strong>NO AUTOMATIC RECOVERY</strong> also <strong>Prevents divergence</strong></p>

<p>To prevent failures from causing consistency guarantees to be violated, we need another layer of messaging, leading us to 2PC. MySQL Cluster provides synchronous replication using 2PC.</p>
<ul>
  <li>First phase, <em>voting</em>: the primary sends the update to the participants; each participant decides whether to commit or abort, and if committing, stores the update in a temporary area (the write-ahead log). Until the second phase completes, the update is considered temporary.</li>
  <li>Second phase, <em>decision</em>: the primary decides the outcome and informs every participant, which then makes the update permanent from the temporary area.</li>
</ul>

<p>Having a second, <em>decision</em> phase before making a commit permanent allows the system to roll back the update, which is not possible in P/B replication. 2PC is prone to blocking even if a single node fails. It assumes stable storage: data on each node is never lost and no node crashes forever. The major tasks in 2PC are <strong>ensuring writes are durable on disk</strong> and <strong>making the right recovery decisions, i.e. learning the outcome of a round and then applying or rolling back the changes locally</strong>.</p>
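<p>The two phases can be sketched as follows. The <code class="language-plaintext highlighter-rouge">Participant</code> interface is hypothetical, and durable logging and crash recovery (the hard parts) are deliberately omitted:</p>

```python
class Participant:
    """Toy participant; a real one persists the tentative update in a WAL."""
    def __init__(self, will_vote_yes=True):
        self.will_vote_yes = will_vote_yes
        self.tentative = None   # temporary area (stands in for the WAL)
        self.state = None       # permanent state

    def prepare(self, update):
        # Phase 1: vote, storing the update only tentatively.
        if self.will_vote_yes:
            self.tentative = update
            return True
        return False

    def commit(self):
        self.state = self.tentative     # make the update permanent

    def abort(self):
        self.tentative = None           # discard the tentative update

def two_phase_commit(update, participants):
    """Coordinator: commit only on a unanimous yes vote."""
    # Phase 1 (voting).
    votes = [p.prepare(update) for p in participants]
    if all(votes):
        # Phase 2 (decision): commit everywhere.
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```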

<blockquote>
  <p>From the CAP theorem, 2PC is CA: it is not partition tolerant. There is also no safe way of promoting a new primary if one fails; manual intervention is needed. It is latency sensitive, since it is an N-of-N write approach. It is <strong>consistent</strong> and NOT susceptible to split brain.</p>
</blockquote>

<p>2PC has these properties:</p>
<ul>
  <li>Unanimous vote: commit or abort</li>
  <li>Static master</li>
  <li>2PC cannot survive simultaneous failure of the coordinator and a node during a commit</li>
  <li>Not partition tolerant, tail latency sensitive</li>
</ul>

<h3 id="consensus-algorithms">Consensus algorithms</h3>

<p><strong>Provides fault tolerance and single copy consistency</strong></p>

<p>Consensus means a <strong>majority</strong> of nodes agreeing on <strong>one result</strong>.
Partition tolerant consensus algorithms are <strong>fault-tolerant</strong> algorithms that maintain single-copy consistency. Paxos is the best-known partition tolerant consensus algorithm.</p>

<h4 id="a-network-partition">A network partition</h4>
<p>A network partition is the failure of a network link to one or several nodes; the nodes themselves stay active. Partitions are tricky because it is not possible to distinguish between an unreachable node and a failed node. If it is a partition, the system is divided in two and nodes remain active on both sides.</p>

<p>A system of three nodes, with a failure and a network partition: <img src="/assets/partition.png" alt="partition image" /></p>

<p>A system designed for single-copy consistency must be able to break symmetry; otherwise a partition results in two EQUAL systems. In other words, it must ensure that only one side of the partition remains active. This is essential for preserving single-copy consistency.</p>

<h4 id="majority-decisions">Majority decisions</h4>

<p>Requiring only a <strong>majority</strong> of nodes - instead of all nodes - to agree on updates allows some nodes to be unavailable, unreachable, or down. As long as <strong>(N/2 + 1)-of-N</strong> nodes are up and accessible, the system will continue to operate (here N/2 is integer division). Partition tolerant consensus algorithms use an odd number of nodes. A majority can also tolerate <strong>disagreement</strong>. Consensus algorithms for replication generally opt for <strong>distinct roles</strong> for each node - leader and follower - and all updates must pass through the leader. Having roles does not prevent the system from recovering from a failure, via a <strong>leader election phase</strong>. Each period of normal operation is called an <strong>epoch</strong> (in Raft, a <em>term</em>), during which only one node is designated as the leader. Epochs act as logical clocks, allowing nodes to identify when an outdated node starts communicating: nodes that were partitioned or out of operation will have a smaller epoch number than the current one, and their commands are ignored.</p>
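<p>The quorum arithmetic above is easy to check (a small sketch, using integer division as stated):</p>

```python
def quorum_size(n: int) -> int:
    """Majority quorum: (N/2 + 1)-of-N, with integer division."""
    return n // 2 + 1

def max_failures_tolerated(n: int) -> int:
    """Failures the cluster survives while a quorum stays reachable."""
    return n - quorum_size(n)
```

This also shows why odd cluster sizes are preferred: 3 nodes tolerate 1 failure and 5 tolerate 2, while adding a 4th node still only tolerates 1.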

<h5 id="working-of-raft"><strong>Working of RAFT</strong></h5>
<p>During normal operation the leader sends a heartbeat (at a <strong>heartbeat interval</strong>), which lets the followers detect that the leader has failed or become partitioned. When a follower notices the leader is non-responsive - specifically, the follower whose <strong>election timeout</strong> expires first - it switches to an intermediate state (called “candidate” in Raft), increments the term/epoch value by one, initiates a leader election, and competes to become the new leader. To be elected leader, a node must receive a majority of the votes. Raft has seen adoption in <em>etcd</em>, a system inspired by ZooKeeper.</p>
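<p>A toy sketch of this election trigger, with made-up names and omitting Raft’s log-comparison rules for granting votes:</p>

```python
import random

class RaftNode:
    """Toy sketch of Raft's election trigger, not a full implementation."""
    def __init__(self, name):
        self.name = name
        self.term = 0
        self.state = "follower"
        # Randomized timeout so that usually one node times out first
        # and elections rarely collide (values are illustrative).
        self.election_timeout_ms = random.uniform(150, 300)
        self.voted_in_term = set()

    def grant_vote(self, term):
        # Vote at most once per term, and only for a newer term.
        if term > self.term and term not in self.voted_in_term:
            self.term = term
            self.voted_in_term.add(term)
            return True
        return False

    def on_election_timeout(self, cluster):
        # Become candidate, bump the term, and ask everyone for a vote.
        self.state = "candidate"
        self.term += 1
        votes = 1  # votes for itself
        for peer in cluster:
            if peer is not self and peer.grant_vote(self.term):
                votes += 1
        if votes > len(cluster) // 2:  # majority wins
            self.state = "leader"
```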

<h5 id="working-of-paxos"><strong>Working of Paxos</strong></h5>
<p><strong>In Paxos</strong> in some cases - such as if two proposers are active at the same time (dueling); if messages are lost; or if a majority of the nodes have failed - then no proposal is accepted by a majority. But this is acceptable, since the decision rule for what value to propose converges towards a single value. According to the FLP impossibility result, this is the best we can do: algorithms that solve the consensus problem must either give up safety or liveness when the guarantees regarding bounds on message delivery do not hold. Paxos gives up liveness.</p>

<p>A Consensus based fault tolerant algorithm such as Paxos has following:</p>
<ul>
  <li>Majority vote</li>
  <li>Dynamic master</li>
  <li>Paxos is less sensitive to tail latency.</li>
  <li>Robust to n/2-1 simultaneous failures as part of protocol</li>
</ul>

<p>Paxos is one of the most important algorithms when writing <strong>strongly consistent partition tolerant replicated systems</strong>. It is used in many of Google’s systems, including the Chubby lock manager used by BigTable/Megastore, the Google File System as well as Spanner. The implementation issues of Paxos mostly relate to the fact that Paxos is described in terms of a single round of consensus decision making, but an actual working implementation usually wants to run multiple rounds of consensus efficiently.</p>

<p>Paxos defines three roles: <strong>proposers</strong>, <strong>acceptors</strong> and <strong>learners</strong>. Paxos nodes can take multiple roles, even all of them. Paxos nodes must know how many nodes constitute a majority. Paxos runs on an unreliable network where messages can be lost, and Paxos nodes are persistent - they cannot forget what they accepted. A Paxos run aims at reaching a <strong>single consensus</strong>: once a consensus is reached, it cannot progress to another consensus.</p>
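<p>The acceptor’s side of a single-decree run can be sketched like this - a simplified illustration that ignores stable storage and learners:</p>

```python
class Acceptor:
    """Sketch of Paxos acceptor logic for one consensus instance.
    A real acceptor must persist this state to stable storage."""
    def __init__(self):
        self.promised_id = 0
        self.accepted_id = None
        self.accepted_value = None

    def prepare(self, proposal_id):
        # PROMISE to ignore anything lower, piggybacking any value
        # this acceptor has already accepted.
        if proposal_id > self.promised_id:
            self.promised_id = proposal_id
            return ("PROMISE", self.accepted_id, self.accepted_value)
        return ("REJECT",)

    def accept_request(self, proposal_id, value):
        # Accept unless a higher ID has been promised in the meantime.
        if proposal_id >= self.promised_id:
            self.promised_id = proposal_id
            self.accepted_id = proposal_id
            self.accepted_value = value
            return ("ACCEPT", value)
        return ("REJECT",)
```

The piggybacked value in the PROMISE reply is what forces a later, higher-ID proposer to adopt the already-accepted value instead of its own.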

<blockquote>
  <p>If a majority of acceptors have promised to ignore anything lower than an ID, any lower ID will be ignored. E.g.: if proposer A sends a PREPARE for ID=4 and gets a PROMISE from a majority of acceptors for ID=4, it then sends an ACCEPT-REQUEST with <em>ID, value</em>, and the acceptors accept it and reply with ACCEPT <em>value</em>. Now if another proposer comes in and sends a PREPARE with a higher ID=5, it will get a PROMISE from the acceptors, but with the previously accepted value <em>piggybacked</em> on the reply.</p>
</blockquote>

<p>WHAT IF the <strong>proposer fails in the PREPARE phase</strong>: the acceptors that sent a PROMISE will wait, and upon receiving no response from the proposer, another proposer will send a PREPARE message with its own, higher ID. So Paxos makes progress.</p>

<p>WHAT IF the <strong>proposer fails after sending ACCEPT-REQUEST and before getting ACCEPT</strong>: another proposer will come up with a higher ID and send a PREPARE; the acceptors will reply with a PROMISE that piggybacks the value they have already accepted. The new proposer then adopts that value and gives up on its own.</p>

<p>The following picture illustrates a consensus run in Paxos. <img src="/assets/paxos-algorithm.png" alt="Paxos algorithm." /></p>

<h5 id="zab-zookeeper-atomic-broadcast"><strong>ZAB: Zookeeper atomic broadcast</strong></h5>
<p>ZAB is used in Apache Zookeeper, which provides coordination primitives for distributed systems and is itself used by Kafka. Technically, atomic broadcast is a different problem from pure consensus, but it still falls under the category of partition tolerant algorithms that ensure strong consistency.</p>

<hr />]]></content><author><name></name></author><summary type="html"><![CDATA[Notes from Distributed Systems for Fun and Profit]]></summary></entry><entry><title type="html">Apache Zookeeper</title><link href="https://mubashirusman.github.io/2026/03/12/apache-zookeeper.html" rel="alternate" type="text/html" title="Apache Zookeeper" /><published>2026-03-12T00:00:00+00:00</published><updated>2026-03-12T00:00:00+00:00</updated><id>https://mubashirusman.github.io/2026/03/12/apache-zookeeper</id><content type="html" xml:base="https://mubashirusman.github.io/2026/03/12/apache-zookeeper.html"><![CDATA[<p>This is my attempt to understand Apache Zookeeper. I’m writing this post for my future reference, so I can come back to it after 6 months.</p>

<h2 id="introduction">Introduction</h2>

<p>Zookeeper is a system that solves the distributed coordination problem; it’s a coordination kernel for distributed systems. It provides primitives - not concrete implementations - on top of which you can build distributed configuration, distributed locks, group membership, leader election, and the like.</p>

<p>It has nodes which are arranged hierarchically, like folders in a filesystem. It is non-blocking, and each client’s operations are performed in FIFO order. Nodes can be regular or ephemeral (ephemeral nodes are deleted when the client session ends), and a node can be created as sequential, in which case its name gets a monotonically increasing number appended. The beautiful thing about Zookeeper is how other abstractions can be built on top of these primitives.</p>

<h2 id="consistency-guarantees">Consistency guarantees</h2>
<ul>
  <li>It provides strong consistency for writes through linearizability (single global order). This means all clients will observe the state changes in the same order.</li>
  <li>Zookeeper also provides FIFO ordering per client session. Requests from a single client are processed in the order they are sent.</li>
  <li>Reads are not guaranteed to see the latest writes unless explicitly synchronized, so clients can see stale data if they read from a follower that has not yet synced with the leader.</li>
</ul>

<h2 id="fault-tolerance">Fault Tolerance</h2>
<p>Zookeeper can tolerate partial failure, but it chooses consistency (correctness) over availability. From the CAP theorem it is CP: Zookeeper becomes unavailable if a quorum is not reachable, but it never compromises consistency.</p>

<h2 id="what-zookeeper-does-and-what-it-does-not-do">What Zookeeper does and what it does not do</h2>

<table>
  <thead>
    <tr>
      <th>Guarantee Type</th>
      <th>Does</th>
      <th>Does NOT</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Write consistency</td>
      <td>Linearizable writes with a global order</td>
      <td>High throughput or horizontal write scaling</td>
    </tr>
    <tr>
      <td>Read consistency</td>
      <td>consistency within sessions</td>
      <td>Linearizable reads without explicit sync</td>
    </tr>
    <tr>
      <td>Availability</td>
      <td>when quorum is met</td>
      <td>available if quorum is not met</td>
    </tr>
    <tr>
      <td>Data</td>
      <td>small metadata</td>
      <td>large payload like a database</td>
    </tr>
  </tbody>
</table>

<h2 id="data-model">Data Model</h2>

<p>The data model provides just enough structure to store the data required for coordination. It’s a hierarchical namespace resembling a filesystem, where clients find information via parent-child relationships. Each node in this structure is called a <strong><em>znode</em></strong> and stores coordination metadata (with a default maximum size of 1MB).</p>

<p>This is the stat structure of a znode:</p>
<ul>
  <li><strong>czxid</strong> = transaction ID when the znode was created</li>
  <li><strong>mzxid</strong> tracks the most recent modification</li>
  <li><strong>Version numbers</strong> provide concurrency control, allowing clients to perform conditional updates</li>
  <li><strong>ephemeralOwner</strong> identifies the session that created an ephemeral znode</li>
  <li><strong>numChildren</strong> counts immediate children</li>
</ul>
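<p>The version number enables optimistic concurrency control, similar in spirit to ZooKeeper’s conditional <code class="language-plaintext highlighter-rouge">setData(path, data, version)</code>, which fails if the supplied version is stale. A toy in-memory sketch (my own class, not the real client API):</p>

```python
class Znode:
    """Toy in-memory znode showing version-based conditional updates."""
    def __init__(self, data=b""):
        self.data = data
        self.version = 0   # bumped on every successful write

    def set_data(self, data, expected_version):
        # Fail if another client modified the znode in between,
        # mirroring ZooKeeper's bad-version error.
        if expected_version != self.version:
            raise ValueError("bad version: concurrent update detected")
        self.data = data
        self.version += 1
```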

<h2 id="znode-types">Znode types</h2>
<p><strong>Persistent znodes</strong> remain in the system until deleted and are often used to store configuration or long-lived coordination state. 
<strong>Ephemeral znodes</strong> are tied to a client session and are deleted when the session ends, which is useful for membership tracking. 
<strong>Sequential znodes</strong> include a monotonically increasing sequence number in their name, generated by ZooKeeper rather than clients to ensure global uniqueness. They are useful for distributed queues and leader election.</p>
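<p>The sequence suffix is what makes recipes like leader election work: sort candidates by their ZooKeeper-assigned number, the lowest is the leader, and each client watches the name just below its own. A small sketch with made-up node names:</p>

```python
def election_order(znode_names):
    """Given sequential znode names like 'n-0000000003', return the
    leader and a map of each znode to the znode it should watch."""
    ranked = sorted(znode_names, key=lambda n: int(n.rsplit("-", 1)[1]))
    leader = ranked[0]  # lowest sequence number wins
    # Each client watches only its immediate predecessor, avoiding a
    # herd of watchers all waking up when the leader znode disappears.
    watches = {ranked[i]: ranked[i - 1] for i in range(1, len(ranked))}
    return leader, watches
```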

<p>A combination of ephemeral and sequential znodes provides a primitive for leader election: each client creates an ephemeral sequential znode and watches the znode with the next-lower sequence number; when that znode disappears, the client holding the lowest remaining sequence number becomes the leader.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This is my attempt to understand Apache Zookeeper, I will write this post for my future reference like coming back to it after 6 months.]]></summary></entry><entry><title type="html">Useful blogs and articles for reference</title><link href="https://mubashirusman.github.io/career/sre/2026/03/07/sre-resources.html" rel="alternate" type="text/html" title="Useful blogs and articles for reference" /><published>2026-03-07T13:30:00+00:00</published><updated>2026-03-07T13:30:00+00:00</updated><id>https://mubashirusman.github.io/career/sre/2026/03/07/sre-resources</id><content type="html" xml:base="https://mubashirusman.github.io/career/sre/2026/03/07/sre-resources.html"><![CDATA[<p>I often read something online and want to revisit it later. Sometimes I bookmark it, but that hasn’t proved very useful: I need to remember why I was reading it, which is sometimes obvious but often not, and I can’t quickly search my bookmarks. So I will link some useful resources here.</p>

<h2 id="sre-interview-prep">SRE Interview Prep</h2>
<ul>
  <li>A comprehensive SRE preparation guide <a href="https://underpaid.medium.com/i-received-sre-offers-from-facebook-and-google-without-a-university-degree-here-is-how-224f06b49e7d">Underpaid’s Google SRE preparation guide</a></li>
  <li>Interview questions <a href="https://syedali.net/engineer-interview-questions/">syedali sre questions</a></li>
  <li><a href="https://zedas.fr/posts/linux-explained-7-memory-management/">Linux memory management</a></li>
  <li><a href="https://linux-audit.com/understanding-memory-information-on-linux-systems/">Understanding memory information in Linux</a></li>
  <li><a href="https://catonmat.net/my-job-interview-at-google">SRE Job Interview</a></li>
  <li>Github repository with many resources <a href="https://github.com/mxssl/sre-interview-prep-guide?tab=readme-ov-file">SRE Interview prep guide</a></li>
  <li>Meta Production engineer <a href="https://www.quora.com/profile/Rishi-Shah-11/answers">Rishi Shah posts</a></li>
  <li>Facebook production engineer <a href="https://azalio.wordpress.com/2016/05/29/facebook-production-engineer/">Interview content</a></li>
  <li>Getting a job as SRE <a href="https://fabrizio2210.medium.com/how-i-get-a-job-at-google-as-sre-83d44aef7859">SRE at Google</a></li>
  <li>Linux Performance tools <a href="https://netflixtechblog.com/netflix-at-velocity-2015-linux-performance-tools-51964ddb81cf">Netflix blog</a></li>
  <li><a href="https://www.brendangregg.com/Articles/Netflix_Linux_Perf_Analysis_60s.pdf">Linux performance analysis in 600 seconds</a></li>
  <li>Short reads about SRE <a href="https://s905060.gitbooks.io/site-reliability-engineer-handbook/content/">SRE Handbook</a></li>
  <li>Google onsite interview <a href="https://bala-krishnan.com/posts/5-google-sre-onsite/">Systems SRE role</a></li>
  <li>I got an offer <a href="https://igotanoffer.com/blogs/tech/google-site-reliability-engineer-interview#linux">SRE Interview questions</a></li>
  <li>Interview topics <a href="https://blog.ndk.name/preparing-for-the-sre-technical-interview/">Technical interview</a></li>
  <li>Page Cache <a href="https://biriukov.dev/docs/page-cache/0-linux-page-cache-for-sre/">Deep dives into page cache</a></li>
  <li><a href="https://itnext.io/from-rss-to-wss-navigating-the-depths-of-kubernetes-memory-metrics-4d7d77d8fdcb">From RSS to WSS, Kubernetes memory metrics</a></li>
  <li>Memory management <a href="https://landley.net/writing/memory-faq.txt">Deep dive in memory management</a></li>
  <li>Bash <a href="https://www.shellscript.sh/">The Shell Scripting Tutorial</a></li>
  <li>HTTP <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Overview">Overview of HTTP</a></li>
  <li><a href="https://danluu.com/navigate-url/">What happens when you type a URL</a></li>
  <li>Cgroups <a href="https://andreaskaris.github.io/blog/linux/cgroups/">Guide to cgroups</a></li>
  <li><a href="https://fabiokung.com/2014/03/13/memory-inside-linux-containers/">Memory inside Linux containers</a></li>
  <li>Terminals <a href="https://kevroletin.github.io/terminal/2021/12/11/how-terminal-works-in.html">How terminal works</a></li>
  <li>Boot process <a href="https://opensource.com/article/17/2/linux-boot-and-startup">Linux boot process</a></li>
  <li>Sockets <a href="https://ops.tips/blog/how-linux-creates-sockets/#where-to-look-for-the-list-of-sockets-in-my-system">Deep dive in socket system call</a></li>
  <li><a href="https://manybutfinite.com/post/anatomy-of-a-program-in-memory/">Anatomy of a program in memory</a></li>
  <li><a href="https://linuxhint.com/understanding_numa_architecture/?source=post_page-----9492b0f267e7---------------------------------------">Understanding NUMA</a></li>
  <li><a href="https://medium.com/swlh/linux-basics-static-libraries-vs-dynamic-libraries-a7bcf8157779">Libraries in Linux</a></li>
</ul>

<h2 id="coding">Coding</h2>
<ul>
  <li><a href="https://www.techinterviewhandbook.org/algorithms/string/">String cheatsheet for coding problems</a></li>
  <li><a href="https://leetcode.com/studyplan/top-interview-150/">Leetcode 150</a></li>
  <li><a href="https://www.pythonmorsels.com/pointers/">Python variables and objects</a></li>
  <li><a href="https://training.talkpython.fm/courses/python-concurrency-deep-dive">Python async techniques</a></li>
  <li><a href="https://runestone.academy/ns/books/published/pythonds3/index.html">Problem solving with DSA</a></li>
  <li><a href="https://pgexercises.com/gettingstarted.html">POSTGRES Exercises</a></li>
  <li><a href="https://training.talkpython.fm/courses/python-jumpstart-project-based-course">Build example apps</a></li>
  <li><a href="https://go.dev/tour/list">Tour of Go</a></li>
  <li><a href="https://medium.com/coderbyte/learn-by-doing-the-8-best-interactive-coding-websites-4c902915287c">Some websites to practice programming</a></li>
</ul>

<h2 id="system-design">System Design</h2>
<ul>
  <li>Hello Interview <a href="https://www.hellointerview.com/learn/system-design/in-a-hurry/introduction">System design in a hurry</a></li>
  <li>Distributed Systems Theory for engineers (!scientists) <a href="https://www.the-paper-trail.org/post/2014-08-09-distributed-systems-theory-for-the-distributed-systems-engineer/">Paper Trail</a></li>
  <li>Github system design <a href="https://github.com/karanpratapsingh/system-design">Big tutorial for refreshing design and components</a></li>
  <li>Blog <a href="https://jg.gg/2016/07/31/architecture-and-systems-design-interview/">Architecture and System Design</a></li>
</ul>

<h2 id="pratice-linux">Pratice Linux</h2>
<ul>
  <li>Challenge games <a href="https://overthewire.org/wargames/bandit/">OverTheWire Bandit</a></li>
  <li>Hands on Problems <a href="https://sadservers.com/">Sad servers</a></li>
</ul>

<h2 id="tutorials">Tutorials</h2>
<ul>
  <li>Tutorial Shell <a href="https://www.redhat.com/en/blog/linux-shell-redirection-pipelining">Shell redirection</a></li>
  <li>Tutorial <a href="https://copyconstruct.medium.com/socat-29453e9fc8a6">socat</a></li>
  <li>Tutorial <a href="https://www.linuxhowtos.org/Network/netstat.htm">netstat</a></li>
  <li>Tutorial GDB <a href="https://developers.redhat.com/articles/the-gdb-developers-gnu-debugger-tutorial-part-1-getting-started-with-the-debugger#">Getting started GDB</a></li>
  <li>Tutorial <a href="https://ruslanspivak.com/lsbaws-part1/">Build a web server</a></li>
  <li>Tutorial <a href="https://progbook.org/httpserv.html#http">Writing http server</a></li>
  <li>Tutorial Lsof <a href="https://www.akadia.com/services/lsof_quickstart.txt">A Quick Start for Lsof</a></li>
  <li>Tutorial Strace <a href="https://medium.com/@adminstoolbox/debugging-using-strace-efda7d65be1d">Debugging using strace</a></li>
  <li>Tutorial Sockets <a href="https://www.digitalocean.com/community/tutorials/understanding-sockets">Understanding Sockets</a></li>
  <li>Tutorial /proc <a href="https://tanelpoder.com/2013/02/21/peeking-into-linux-kernel-land-using-proc-filesystem-for-quickndirty-troubleshooting/">proc for troubleshooting</a></li>
  <li>Tutorial container <a href="https://www.redhat.com/en/topics/containers/whats-a-linux-container">What is a Linux container?</a></li>
  <li>Tutorial <a href="https://medium.com/@jyjimmylee/how-does-memory-mapping-mmap-work-c8a6a550ba0d">how mmap works</a></li>
</ul>

<h2 id="nice-blogs">Nice Blogs</h2>
<ul>
  <li><a href="https://www.siddharthkannan.in/">Siddharth K resume</a></li>
  <li><a href="https://mattrighetti.com/2023/10/25/i-rewrote-my-cv-in-typst">CV in typst</a></li>
  <li><a href="https://trstringer.com/slo-adding-nines/">Short post about adding 9s to SLOs</a></li>
  <li><a href="https://blog.alicegoldfuss.com/how-to-get-into-sre/">How to be an SRE</a></li>
  <li><a href="https://devopsbootcamp.osuosl.org/start-here.html">SRE bootcamp</a></li>
  <li><a href="https://rednafi.com/misc/tinkering-with-unix-domain-socket/">Unix Sockets</a></li>
  <li><a href="https://blog.relyabilit.ie/sre-in-the-real-world/">Real world SRE</a></li>
  <li><a href="https://boringtechnology.club/">Boring technology</a></li>
  <li><a href="https://brooker.co.za/blog/">Technical Topics</a></li>
</ul>

<h2 id="nice-books">Nice Books</h2>
<ul>
  <li><a href="https://book.systemsapproach.org/">Computer networks: A systems approach</a></li>
</ul>]]></content><author><name></name></author><category term="career" /><category term="sre" /><summary type="html"><![CDATA[I often read something online and want to revisit it in future, sometimes I bookmark it but that also hasn’t proved to be very useful. The problem with bookmarking is that I need to remember what I was reading there, sometimes it obvious but many times its not. Also I can’t quickly search in bookmarks. So I will link some useful resources here.]]></summary></entry><entry><title type="html">About SWIM protocol</title><link href="https://mubashirusman.github.io/distributed-systems/system-design/2026/03/07/membership-protocols.html" rel="alternate" type="text/html" title="About SWIM protocol" /><published>2026-03-07T13:30:00+00:00</published><updated>2026-03-07T13:30:00+00:00</updated><id>https://mubashirusman.github.io/distributed-systems/system-design/2026/03/07/membership-protocols</id><content type="html" xml:base="https://mubashirusman.github.io/distributed-systems/system-design/2026/03/07/membership-protocols.html"><![CDATA[<p>SWIM (Scalable, Weakly-Consistent, Infection-Style, Processes Group Membership Protocol) is a membership protocol, which is used in distributed systems to answer this: <em>who are my peers?</em> That means it should do the failure detection and update the peers to only keep the healthy nodes.<br />
The scalable in the name implies that it can handle an increased system size without degrading performance. We build distributed systems in large environments because scalability is needed, which means thousands of machines could be in the cluster.<br />
Gossip protocols work like how people gossip in a society: someone talks to only a few people to share information, those few people talk to others, and soon the whole society knows about it. That’s how nodes communicate, each sending messages to only a subset of its peers; Infection-Style in the name implies it’s a gossip protocol.<br />
Weakly consistent means that after some amount of time all replicas will agree on the same value, where <em>some</em> is an undefined amount of time.</p>

<h2 id="detecting-failure">Detecting Failure</h2>
<ul>
  <li><code class="language-plaintext highlighter-rouge">T</code> is the protocol period</li>
  <li><code class="language-plaintext highlighter-rouge">k</code> is the number of nodes in the failure-detection group.
A node <code class="language-plaintext highlighter-rouge">A</code> sends a <code class="language-plaintext highlighter-rouge">PING</code> message to node <code class="language-plaintext highlighter-rouge">B</code>. If the node replies with <code class="language-plaintext highlighter-rouge">ACK</code>, no further action is needed; but if node <code class="language-plaintext highlighter-rouge">B</code> does NOT reply before the timeout (less than <code class="language-plaintext highlighter-rouge">T</code>), then <code class="language-plaintext highlighter-rouge">A</code> marks node <code class="language-plaintext highlighter-rouge">B</code> suspicious, selects some arbitrary <code class="language-plaintext highlighter-rouge">k</code> peers, and asks them to ping node <code class="language-plaintext highlighter-rouge">B</code> on behalf of <code class="language-plaintext highlighter-rouge">A</code>. This is an <code class="language-plaintext highlighter-rouge">indirect ping</code>. If none of the <code class="language-plaintext highlighter-rouge">k</code> nodes receives an <code class="language-plaintext highlighter-rouge">ACK</code>, the node is marked <code class="language-plaintext highlighter-rouge">dead</code>. This keeps the number of messages sent at <code class="language-plaintext highlighter-rouge">O(N)</code>.</li>
</ul>
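<p>A minimal Python sketch of this probe logic. The function and parameter names (<code class="language-plaintext highlighter-rouge">probe</code>, <code class="language-plaintext highlighter-rouge">direct_ping</code>, <code class="language-plaintext highlighter-rouge">indirect_ping</code>) are illustrative, not from any real SWIM implementation, and timeouts are abstracted into the callables:</p>

```python
import random

def probe(target, peers, direct_ping, indirect_ping, k=3):
    """Classify `target` using SWIM-style direct and indirect pings.

    `direct_ping(target)` and `indirect_ping(helper, target)` are
    callables that return True when an ACK arrives before the timeout.
    """
    # First try a direct PING within the protocol period T.
    if direct_ping(target):
        return "alive"
    # No ACK: pick k arbitrary peers and ask them to ping on our behalf.
    helpers = random.sample(peers, min(k, len(peers)))
    if any(indirect_ping(h, target) for h in helpers):
        return "alive"
    # Neither we nor any helper heard back: mark the node dead.
    return "dead"
```

A node is only marked dead when both the direct and all indirect probes fail, which protects against a flaky link between two particular nodes.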

<h2 id="information-dissenminating">Information dissemination</h2>
<ul>
  <li><code class="language-plaintext highlighter-rouge">JOINED</code> is sent by a node <code class="language-plaintext highlighter-rouge">P</code> when it joins, to inform the network of its membership.</li>
  <li><code class="language-plaintext highlighter-rouge">FAILED</code> is sent to peers when a node failure is detected by the above process.
These messages are sent piggybacked on the <code class="language-plaintext highlighter-rouge">PING/ACK</code> messages to the peers, which makes communication efficient: the information spreads through the group in <code class="language-plaintext highlighter-rouge">O(log(N))</code> time.</li>
</ul>]]></content><author><name></name></author><category term="distributed-systems" /><category term="system-design" /><summary type="html"><![CDATA[SWIM (Scalable, Weakly-Consistent, Infection-Style, Processes Group Membership Protocol) is a membership protocol, which is used in distributed systems to answer this: who are my peers? That means it should do the failure detection and update the peers to only keep the healthy nodes. The scaleable in the name implies that it can handle increased size of the system without degrading performance. We build distributed systems in large environments, because scalability is needed. This means thousands of machines could be the in the cluster. Gossip protocols work like how people gossip in a society, talking to only few people to share information and then those few people talk to others and then the whole society knows about it. That’s hownodes communicate with subset of their total peers to send messages, Infection-Style in the name implies its a gossip protocol. Weekly consistent means that after some amount of time, all replicas will agree on the same value, where some is undefined amount of time.]]></summary></entry><entry><title type="html">Concurrency Concepts And HTTP Server</title><link href="https://mubashirusman.github.io/programming/python/2026/03/05/multithreading-concurrency-concepts.html" rel="alternate" type="text/html" title="Concurrency Concepts And HTTP Server" /><published>2026-03-05T00:30:20+00:00</published><updated>2026-03-05T00:30:20+00:00</updated><id>https://mubashirusman.github.io/programming/python/2026/03/05/multithreading-concurrency-concepts</id><content type="html" xml:base="https://mubashirusman.github.io/programming/python/2026/03/05/multithreading-concurrency-concepts.html"><![CDATA[<h2 id="concurrency">Concurrency</h2>
<p>It’s a way of fragmenting code so that individual fragments can be run independently to reach the same result.<br />
E.g. to take the average of x1, x2, x3…xn, we can divide the numbers into two segments, compute the partial sums s1 = sum(x1, x2…xm) and s2 = sum(xm+1, xm+2…xn) along with the counts c1 and c2, and then combine them as (s1+s2)/(c1+c2). The fragments can run on different CPU cores at the same time (parallel execution) or share the same CPU (time-sliced execution).</p>
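<p>A small sketch of this idea with <code class="language-plaintext highlighter-rouge">concurrent.futures</code>, splitting the list into two fragments whose partial sums are computed independently (the names <code class="language-plaintext highlighter-rouge">partial</code> and <code class="language-plaintext highlighter-rouge">average</code> are made up for the example):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def partial(nums):
    # each fragment returns its (sum, count)
    return sum(nums), len(nums)

def average(nums, workers=2):
    # split the data in two; each half is an independent fragment
    mid = len(nums) // 2
    chunks = [nums[:mid], nums[mid:]]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(partial, chunks))
    # combine: (s1 + s2) / (c1 + c2)
    total = sum(s for s, _ in results)
    count = sum(c for _, c in results)
    return total / count
```

Whether the two fragments actually run in parallel or are time-sliced is up to the scheduler; the result is the same either way.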

<h2 id="http-server">HTTP Server</h2>
<p>An HTTP server does the following:</p>
<ol>
  <li>Create a TCP/stream socket</li>
  <li>Bind a name (address) to this socket object</li>
  <li>Start listening, which is to wait for incoming connections</li>
  <li>When a connection comes, accept the connection and start sharing HTTP messages</li>
  <li>Close the connection</li>
  <li>Repeat from step 3</li>
</ol>

<p>Before writing any code, we need to know what a socket is: it’s an endpoint for communication between processes, on the same machine or across a network. There are <em>client</em> sockets and <em>server</em> sockets. A client socket sends a request and receives a reply; after it exchanges some messages, the client socket is destroyed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import socket

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(('127.0.0.1', 8765))
serversocket.listen(5)  # max number of queued connection requests (backlog)
</code></pre></div></div>
</code></pre></div></div>
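<p>Putting steps 1–5 together, a minimal sketch that serves exactly one connection. The function name <code class="language-plaintext highlighter-rouge">serve_once</code>, the port, and the response body are arbitrary choices for illustration:</p>

```python
import socket

def serve_once(port=8765):
    # steps 1-3: create a TCP socket, bind an address, start listening
    serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    serversocket.bind(('127.0.0.1', port))
    serversocket.listen(5)
    # step 4: accept a connection and exchange HTTP messages
    clientsocket, address = serversocket.accept()
    request = clientsocket.recv(1024)
    clientsocket.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    # step 5: close the connection (a real server loops back to accept)
    clientsocket.close()
    serversocket.close()
    return request
```

A real server would wrap steps 4–5 in a loop instead of returning after the first client.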
<p>Now we can enter the mainloop for accepting connections, we can get the client socket from the <code class="language-plaintext highlighter-rouge">accept</code> and fulfill the request.</p>]]></content><author><name></name></author><category term="programming" /><category term="python" /><summary type="html"><![CDATA[Concurrency Its a way of fragmanting code so that individual fragmants can be run independently to reach the same result. E.g for taking an average of x1,x2,x3….xn, we can do it by dividing all numbers in two segments like s1 = sum(x1, x2, x3….xm)/c1 and s2 = sum(xm1, xm2…xn)/c2 and then doing something like (s1+s2)/(c1+c2). This can be done by running fragmants on different cpu cores at the same time (parallel execution) or by sharing the same cpu (time-sliced execution).]]></summary></entry><entry><title type="html">Python programming tips using standard library</title><link href="https://mubashirusman.github.io/programming/python/2025/12/25/python-dataclasses.html" rel="alternate" type="text/html" title="Python programming tips using standard library" /><published>2025-12-25T00:30:20+00:00</published><updated>2025-12-25T00:30:20+00:00</updated><id>https://mubashirusman.github.io/programming/python/2025/12/25/python-dataclasses</id><content type="html" xml:base="https://mubashirusman.github.io/programming/python/2025/12/25/python-dataclasses.html"><![CDATA[<h1 id="python-tips-from-standard-library-for-my-reference">Python tips from standard library for my reference</h1>

<h2 id="dataclasses">Dataclasses</h2>
<p>Dataclasses is essentially a code generator. It helps us avoid writing boilerplate and repetitive code. Following are important use cases.</p>
<ol>
  <li>We don’t need to write an <code class="language-plaintext highlighter-rouge">__init__</code> method if the class is decorated with <code class="language-plaintext highlighter-rouge">@dataclass</code>. For example:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass
class Circle:
    x: int = 0
    y: int = 0
    radius: int = 1
</code></pre></div>    </div>
    <p>Even though we defined class-level variables, they will act as if they were instance variables.</p>
  </li>
  <li>
    <p>It also implements the <code class="language-plaintext highlighter-rouge">__repr__</code> method automatically. And if instance variables need to be made immutable, we can pass the <code class="language-plaintext highlighter-rouge">frozen</code> parameter to dataclass. This will also implement the <code class="language-plaintext highlighter-rouge">__hash__</code> method and make the objects hashable, which means we can use them as keys in a dictionary.
<code class="language-plaintext highlighter-rouge">@dataclass(frozen=True)</code></p>
  </li>
  <li>We can set <code class="language-plaintext highlighter-rouge">order=True</code> to generate the less-than/greater-than comparison methods (equality comes from <code class="language-plaintext highlighter-rouge">eq=True</code>, which is on by default).</li>
</ol>
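<p>A small illustration of points 2 and 3 above. The <code class="language-plaintext highlighter-rouge">Point</code> class is a made-up example:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Point:
    x: int = 0
    y: int = 0

# frozen=True makes instances immutable and hashable, so they work as dict keys
lookup = {Point(0, 0): "origin"}

# order=True generates comparisons based on field order (x first, then y)
assert Point(0, 0) < Point(1, 2)
assert lookup[Point(0, 0)] == "origin"
```

Two separately created but equal instances hash the same, which is exactly what dictionary keys require.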

<h2 id="type-hinting">Type Hinting</h2>
<ol>
  <li>Even though CPython completely ignores the variable types set by type hinting, type hinting is useful in many other ways: for documentation, and libraries such as Pydantic and dataclasses rely on type hints.</li>
</ol>

<p>Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def func(a: dict, b: list, c: bool = True) -&gt; str:
    return f"a= {a}, b={b}"
</code></pre></div></div>

<ol>
  <li>As a parameter in Python can accept arguments of different types, we can type-hint it like this:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Union
def funcm(a: Union[str, int], b: int) -&gt; Union[str, int]:
    return a*b
</code></pre></div>    </div>
    <p>OR, using the <code class="language-plaintext highlighter-rouge">|</code> syntax available from Python 3.10:</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def funcm(a: str | int, b: int) -&gt; str | int:
    return a*b
</code></pre></div>    </div>
  </li>
  <li>If an argument could be passed as a specific type or could be None (note that this does NOT make the argument optional), one can use <code class="language-plaintext highlighter-rouge">Optional</code> to specify this:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Optional
def funco(a: Optional[int]) -&gt; None:
    pass
</code></pre></div>    </div>
  </li>
  <li>For containers like lists/tuples/dicts, we can use generics from the typing module:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import List
def funcg(a: List[float]) -&gt; List[int]:
    return [int(i) for i in a]
</code></pre></div>    </div>
  </li>
  <li>And for functions and generators:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Callable, Any, Sequence, Iterator
def funcf(func: Callable[[Any], Any], sequence: Sequence[Any]) -&gt; Iterator[str]:
    for i in sequence:
        yield str(func(i))
</code></pre></div>    </div>
  </li>
</ol>

<p>NOTE: From Python 3.9 onwards, many generics in the <code class="language-plaintext highlighter-rouge">typing</code> module are being deprecated in favour of their counterparts in other modules like <code class="language-plaintext highlighter-rouge">collections.abc</code>.</p>

<h2 id="threading">Threading</h2>
<p>In Python, threads are not run in parallel; they run one at a time due to the global interpreter lock (GIL). But they can still be helpful for I/O-bound tasks that spend most of their time waiting, because even a single CPU can do other work instead of idling until a slow task finishes.
<code class="language-plaintext highlighter-rouge">threading</code> allows us to <code class="language-plaintext highlighter-rouge">start</code> up as many threads as we want and then <code class="language-plaintext highlighter-rouge">join</code> them later.</p>

<p>Example: downloading files from an API:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from threading import Thread

threads = []
urls = [url1, url2, url3, url4, url5]

for url in urls:
    t = Thread(target=download_file, args=(url,))
    t.start()  # start() actually starts the target function
    threads.append(t)

[t.join() for t in threads]  # join() returns None
</code></pre></div></div>

<p>The threading module does not help with managing a pool of threads, like how many threads we want to create etc., so it is better to use something that does that automatically. <code class="language-plaintext highlighter-rouge">concurrent.futures</code> helps with this. It also provides a context manager for cleanup of resources afterwards.
<code class="language-plaintext highlighter-rouge">concurrent.futures.ThreadPoolExecutor.submit</code> creates a thread and gives back a future that will hold the thread’s result, and <code class="language-plaintext highlighter-rouge">concurrent.futures.as_completed</code> yields the futures as they complete, so <code class="language-plaintext highlighter-rouge">result()</code> can collect each outcome.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># set max_workers to the number of threads needed
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = []
    for url in urls:
        future = executor.submit(download_file, url)
        futures.append(future)

    for future in concurrent.futures.as_completed(futures):
        try:
            url, status_code = future.result()
        except Exception as err:
            print(f"Task failed.. {err}")
</code></pre></div></div>
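<p>A self-contained, runnable version of the pattern above, with a stand-in <code class="language-plaintext highlighter-rouge">work</code> function in place of a real download (the function and its fake status code are made up for the example):</p>

```python
import concurrent.futures

def work(n):
    # stand-in for an I/O-bound task such as downloading a file;
    # returns the input plus a fake "status code"
    return n, 200

results = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(work, n) for n in range(5)]
    for future in concurrent.futures.as_completed(futures):
        try:
            n, status_code = future.result()
            results[n] = status_code
        except Exception as err:
            print(f"Task failed.. {err}")
```

Note that <code class="language-plaintext highlighter-rouge">as_completed</code> yields futures in completion order, not submission order, which is why the results are collected into a dict keyed by the task’s own identifier.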

<p>Be aware if threads are interacting with the same resources. In that case, use a lock to block other threads and do your thing, but the downside is that the program will be running in sequential manner within that lock period :).</p>]]></content><author><name></name></author><category term="programming" /><category term="python" /><summary type="html"><![CDATA[Python tips from standard library for my reference]]></summary></entry><entry><title type="html">Is Instagram Needed?</title><link href="https://mubashirusman.github.io/social/media/2025/12/24/is-instagram-needed.html" rel="alternate" type="text/html" title="Is Instagram Needed?" /><published>2025-12-24T23:30:20+00:00</published><updated>2025-12-24T23:30:20+00:00</updated><id>https://mubashirusman.github.io/social/media/2025/12/24/is-instagram-needed</id><content type="html" xml:base="https://mubashirusman.github.io/social/media/2025/12/24/is-instagram-needed.html"><![CDATA[<p>Every time I open instagram, the algorithm servers me with the warmth of an amusing reel or some’s friend’s full-page picture. To the user it is a sense of “missed connectedness” that needs to replenished. But the question is, do I need to know what my friends did on the weekend, or even more fundamental one, am I even friends with them? These simple questions are difficult to answer because of how our lives have been shaped by the Instagram and Facebook. We know as a fact that the people who we are connected with do not need Instagram story to keep us “in touch”. In fact, the more closely we know someone the less we would need Instagram or anything similar. So the Instagram’s slogan, “bringing you closer to the people you love” is superficial and misleading to say the least. It has been found out in many studies that people feel insecure about their body and have low self esteem, this is particularly more prevalent among teenagers. 
What I have realized is this: our brain feels rewarded by seeing other people’s likes on the picture we post, and the reward is enough to post again. This sense of appreciation encourages the user to post more to get the same feeling, but each time the impact is smaller, so the user needs to post even more to reach the same level; it eventually turns into an addiction. This is how social media influencers are born. Once a ‘normal user’ starts posting, they get incentivized by the platform and eventually start posting consistently. Scrolling Instagram’s feed feels worse because of the unintended comparisons our mind makes with the lives of celebrities. This puts one at a crossroads about whether to use the platform at all.
Excluding celebrity content, Instagram posts can be divided into two categories.</p>
<ol>
  <li>To show excitement
This is the type where a person is truly excited about what’s happening in their life and shares their life’s moments online. This excitement is subjective; what someone achieved over a long time could be something someone else was born with, so it is left to the audience of the post to judge. Anything posted out of excitement falls here. For example, I once excitedly shared a post on Facebook about getting good grades in college.</li>
  <li>To brag
This is the type where the poster wants to boast that something has happened in their life and now it’s time to “show off”. We as social creatures want to be seen when we do something. But to do so on Instagram, you have to post pictures ONLY, and edited ones at that. For example, a picture of a train plus a filter will make it look like it came straight out of a movie scene.
Instagram is designed in a way that leaves very little room for the caption, which in turn discourages a user from writing beyond a couple of words. Most of the audience will either not comment, or even if they do, there is nothing more to say than to praise the poster/cameraman. And the poster expects the same: to be praised for their high-resolution image. Practically there is no room for discussion. In my humble opinion, all posts on Instagram should be auto-captioned as <strong>boasting edited pictures</strong>. I allege that most of the posts on Facebook/Instagram are in this category.</li>
</ol>]]></content><author><name></name></author><category term="social" /><category term="media" /><summary type="html"><![CDATA[Every time I open instagram, the algorithm servers me with the warmth of an amusing reel or some’s friend’s full-page picture. To the user it is a sense of “missed connectedness” that needs to replenished. But the question is, do I need to know what my friends did on the weekend, or even more fundamental one, am I even friends with them? These simple questions are difficult to answer because of how our lives have been shaped by the Instagram and Facebook. We know as a fact that the people who we are connected with do not need Instagram story to keep us “in touch”. In fact, the more closely we know someone the less we would need Instagram or anything similar. So the Instagram’s slogan, “bringing you closer to the people you love” is superficial and misleading to say the least. It has been found out in many studies that people feel insecure about their body and have low self esteem, this is particularly more prevalent among teenagers. From what I realized is this, our brain feels rewarded by seeing the likes of other people on the picture we post, the reward is enough to post again. This sense of appreciation encourages the user to post more to get the same feeling, but this time its impact will be less so the user will need to post even more to reach the same level, it eventually turns into an addiction. This is how social media influencers are born. Once a ‘normal user’ starts posting, it gets incentivized by the platform and eventually starts posting consistently. It feels worse after scrolling the instagram’s feed because of the unintended comparison by seeing the lives of celebrities our mind makes. This puts one on the crossroads to even use the platform. Excluding the celebrity’s content, the instagram posts can be divided in two categories. 
To show excitement This is the type where a person is truly excited about what’s happening in their life and shared their life’s moments online. This excitement is subjective, what someone has achieved in a long time could be someone was born with, thus this is left to the audience of the posts to judge. Anything posted out of excitement would fall into this. For example, I once excitedly shared the post of me getting good grades in college on Facebook. To make brag This is the type where the poster wants to boast that something has happened in their life, now its the time to “show off”. We as social creatures want to be seen when we do something. But to do so on instagram, you have to post pictures ONLY and edited ones. For example, a picture of a train + filter will render it completely out of a movie scene. Instagram is designed in a way that the platform leaves very little room for the caption, which in turn discourages a user to write beyond couple of words. Most audience will either not comment or even if they do, there is nothing more to say than praise the poster/cameramen. And the poster expects the same, to be praised for their high resolution image. Practically there is no room for discussion. In my humble opinion, all posts on instagram should be auto-captioned as boasting edited pictures. I allege that most of the posts on Facebook/Instagram are in this category.]]></summary></entry><entry><title type="html">What it takes to have minimum reliability?</title><link href="https://mubashirusman.github.io/sre/2025/12/24/what-is-minimum-reliability.html" rel="alternate" type="text/html" title="What it takes to have minimum reliability?" 
/><published>2025-12-24T23:30:20+00:00</published><updated>2025-12-24T23:30:20+00:00</updated><id>https://mubashirusman.github.io/sre/2025/12/24/what-is-minimum-reliability</id><content type="html" xml:base="https://mubashirusman.github.io/sre/2025/12/24/what-is-minimum-reliability.html"><![CDATA[<p><strong>This architecture is for web-applications and for my own reference.</strong>
Some sane choices to make:</p>
<ol>
  <li>Have <strong>public and private subnets</strong> for the infrastructure. This means that the application server, database, container registry, object storage,
logging, and monitoring should be inside the <em>private network</em>. A firewall should block any
access from outside to this subnet; in AWS, security groups can serve this purpose.
A load balancer should be in the public subnet.</li>
  <li>The <strong>database</strong> should be backed up regularly, and for this it is important not to let a
single database instance be overloaded. Instead there should be a secondary instance
to take backups from. These backups should be tested by restoring them.</li>
  <li><strong>Logging</strong> is important for two reasons: to debug after an incident, and to keep track of service events in order to improve the service.
A centralized logging solution like the ELK stack should be set up, and applications should be configured to ship their logs to it.</li>
  <li>A <strong>continuous integration and delivery</strong> pipeline is the backbone for quickly testing, releasing to production, and rolling back.
For each service, separate branches should be configured to keep the production code separate from test environments.
Once code is tested, it should run in a <em>before-production</em> environment. Make sure that ONLY ONE change is there, and until this
change is released, the before-production environment should remain occupied; this ensures changes are tested
and keeps the history clean. If the deployment here is unsuccessful, it is time to go back to testing.</li>
  <li><strong>Infrastructure automation</strong> is critical. In a cloud environment, Terraform is my favorite and also an industry standard.
It lets you define your <em>infrastructure as code</em>. Terraform is also declarative in nature, which means you
define the end state you want and it takes the current state into account, instead of you spelling out how to get there (an imperative definition). Ideally applications should be able to run on stateless servers, which effectively
means that we can deploy identical servers, and as many of them as we want. This is the benefit of immutable deployments/containers.
With respect to Terraform, one should templatize the code with variables and modules so it can be reused across applications, and set up a remote state. Lastly, remember that manually deploying infrastructure does not scale, for a lot of reasons.</li>
  <li><strong>Configuration</strong></li>
</ol>]]></content><author><name></name></author><category term="sre" /><summary type="html"><![CDATA[This architecture is for web-applications and for my own reference. Some sane choices to make: Have public and private subnets for the infrastructure, This means that application server, database, container registry, object storage, logging, monitoring should be inside the private network. A firewall should block any access from outside to this subnet. In AWS, there is a concept of security groups that can work here. A load balancer should be in the public subnet. Database should be backed up regularly, and for this its important to not let the single instance of the database to be overloaded. Instead there should be a secondary instance to take backups from. These backups should be tested by restoring. Logging is important for two reasons: to debug after an incident, to keep track of service events and improving it. A centralized logging solution like ELK stack should be setup. Applications should be configured to collect their logs. A continous integration and delivery pipeline is the backbone for quickly testing, releasing in production and rollbacks. For one service, separate branches should be configured to keep the production code separate from test environments. Once code is tested, it should be run in a before-production environment, make sure that ONLY ONE change is here, and until this change is released, before-production environment should remain occupied, this will ensure changes to be tested and keep the history clean. If the deployment here is unseccessful, its time to go back to testing. Infrastructure automation is critical. In a cloud environment, Terraform is my favorite and also an industry standard. It lets you define your infrastructure as code. 
Also Terraform is declarative in nature, which means it lets you define what you want at the end and takes into the account the current state, instead of how you should go about to achieve that (imperative definition). Ideally applications should be able to run on stateless servers which effectively means that we can deploy identical servers and as many of them as we want. This is the benefit of immutable deployments/containers. With respect to Terraform, one should templatize the code as variables and modules to reuse for multiple applications and setup a remote state. Lastly, remember manually deploying infrastructure does not scale for a lot of reasons. Configuration]]></summary></entry><entry><title type="html">What Programming Language Should You Learn?</title><link href="https://mubashirusman.github.io/programming/2024/02/10/what-programming-language.html" rel="alternate" type="text/html" title="What Programming Language Should You Learn?" /><published>2024-02-10T08:20:47+00:00</published><updated>2024-02-10T08:20:47+00:00</updated><id>https://mubashirusman.github.io/programming/2024/02/10/what-programming-language</id><content type="html" xml:base="https://mubashirusman.github.io/programming/2024/02/10/what-programming-language.html"><![CDATA[<p>Programming languages vary in how much abstraction they offer to the programmer. 
This ranges from no abstraction, referred to as low level like C, to high abstraction 
which are called high level languages like Python.
Understanding the spectrum of programming languages is crucial for any developer.
Low-level languages like assembly give you direct control over hardware, while 
high-level languages like Python abstract away the complexity.</p>
<h4 id="factors-to-consider">Factors to Consider:</h4>
<ol>
  <li>Your career goals and target industry</li>
  <li>The type of projects you want to work on</li>
  <li>Learning curve and time investment</li>
  <li>Community support and job market demand</li>
</ol>

<p>Beginners are often engaged in a discussion of which language to learn. The short answer is to pick any of the popular languages, like C, JavaScript, Python, Java, etc. If you did not like that answer, then you should do some research on the history of their creation. They all have interesting features. You can build interesting things with all of them, but some have great libraries for one thing that others lack. For instance, Python has a rich ecosystem for machine learning and data science, while Java has comprehensive tools for server-side programming. At the start, you should choose yours and start building projects; rest assured, you won’t be at a disadvantage for learning one over the other.</p>]]></content><author><name></name></author><category term="programming" /><summary type="html"><![CDATA[Programming languages vary in how much abstraction they offer to the programmer. This ranges from no abstraction, referred to as low level like C, to high abstraction which are called high level languages like Python. Understanding the spectrum of programming languages is crucial for any developer. Low-level languages like assembly give you direct control over hardware, while high-level languages like Python abstract away the complexity. Factors to Consider: Your career goals and target industry The type of projects you want to work on Learning curve and time investment Community support and job market demand]]></summary></entry></feed>