<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://mubashirusman.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mubashirusman.github.io/" rel="alternate" type="text/html" /><updated>2026-04-06T11:25:05+00:00</updated><id>https://mubashirusman.github.io/feed.xml</id><title type="html">Mubashir Usman’s Blog</title><subtitle>My learnings as a site reliability engineer and building reliable systems for production. I sometimes post my thoughts on navigating life in general.</subtitle><entry><title type="html">Notes on System Design from First Principles</title><link href="https://mubashirusman.github.io/distributed-systems/2026/04/01/system-design-first-principle.html" rel="alternate" type="text/html" title="Notes on System Design from First Principles" /><published>2026-04-01T13:09:00+00:00</published><updated>2026-04-01T13:09:00+00:00</updated><id>https://mubashirusman.github.io/distributed-systems/2026/04/01/system-design-first-principle</id><content type="html" xml:base="https://mubashirusman.github.io/distributed-systems/2026/04/01/system-design-first-principle.html"><![CDATA[<h1 id="system-design-from-first-principles">System Design from First Principles</h1>

<h2 id="part-12-physics-of-data">Part 1–2 Physics of Data</h2>

<p>Data fetch time depends on the distance it has to travel: the more frequently data needs to be accessed, the closer it should be to the CPU.</p>

<p><strong>Little’s law</strong> - <code class="language-plaintext highlighter-rouge">L = λW</code></p>
<ul>
  <li>L = average number of requests in the system</li>
  <li>λ = rate of requests coming in</li>
  <li>W = average time a request spends in the system. So the load on a system is proportional to the time taken to process a request.</li>
</ul>
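<p>Little’s law is just a multiplication, which makes it easy to sanity-check capacity numbers. A minimal Python sketch (the function name is mine):</p>

```python
def littles_law_L(arrival_rate, avg_time_in_system):
    """L = lambda * W: average number of requests concurrently in the system."""
    return arrival_rate * avg_time_in_system

# At 200 requests/second with an average of 0.25 s spent per request,
# the system holds 50 requests in flight at any moment.
print(littles_law_L(200, 0.25))
```

<p>Note the practical reading: halving W (faster processing) halves the load L for the same traffic.</p>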

<p>Average latency does not give a true picture of delays, so we use percentiles: the 99th percentile (P99) is the latency below which 99% of requests complete; the remaining 1% are the requests that faced the highest delay.</p>
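<p>A small Python sketch shows how the mean hides the tail (the sample latencies are made up; this uses the nearest-rank percentile method):</p>

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-indexed rank
    return ordered[rank - 1]

latencies = [11, 12, 12, 13, 13, 14, 15, 16, 250, 900]  # ms
print(sum(latencies) / len(latencies))  # mean = 125.6 ms, misleading
print(percentile(latencies, 50))        # P50 = 13 ms, the typical request
print(percentile(latencies, 99))        # P99 = 900 ms, the slow tail
```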

<p>For reliability, instead of uptime we can use the ratio of successful responses to total responses. Uptime can be misleading for a distributed service, since the service will remain available in some parts of the world at all times.</p>

<p>A system should be run below the capacity it can handle, for example at 70% CPU usage; the remaining headroom absorbs unexpected load spikes and protects the system from cascading failure. In fact it’s a trade-off between wasting resources and avoiding an exponential increase in latency in the face of high traffic.</p>

<p><strong>Amdahl’s law</strong>: the speedup of a task is limited by its serial fraction (the part that cannot be parallelized).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     1
------------
    (1-s)
s + -----
      n
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">s</code> is the fraction of the task that must be done serially, and <code class="language-plaintext highlighter-rouge">n</code> is the number of processors. E.g. if code is 95% parallelizable and 5% serial, then as <code class="language-plaintext highlighter-rouge">n</code> grows large the (1-s)/n term approaches zero and the speedup approaches 1/0.05 = 20; even with infinitely many processors we cannot achieve more than a 20× speedup.</p>
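<p>The formula is easy to explore numerically; a quick Python sketch of the diminishing returns:</p>

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Maximum speedup when only (1 - s) of the work can be parallelized."""
    s = serial_fraction
    return 1 / (s + (1 - s) / n_processors)

# With 5% serial work, adding processors quickly stops helping:
for n in (10, 100, 10_000, 1_000_000):
    print(n, round(amdahl_speedup(0.05, n), 1))
# the speedup creeps toward, but never exceeds, 1 / 0.05 = 20
```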

<h3 id="four-golden-signals">Four golden signals</h3>
<ul>
  <li>Latency: the time to service a request; we should track P50, P90, P99</li>
  <li>Traffic: the demand on the system, i.e. the number of requests coming in</li>
  <li>Errors: the ratio of failed requests to total requests</li>
  <li>Saturation: how full the system is, e.g. the database connection pool, CPU usage, etc.</li>
</ul>

<hr />

<h2 id="part-3-communication">Part 3 Communication</h2>

<h3 id="data-access-latency">Data access latency</h3>
<p>This table shows CPU access times for different storage media.</p>

<table>
  <thead>
    <tr>
      <th>Storage</th>
      <th>Scaled Latency</th>
      <th>Actual Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>L1 Cache</td>
      <td>0.5 seconds</td>
      <td>1-4 nano sec</td>
    </tr>
    <tr>
      <td>RAM</td>
      <td>2 minutes</td>
      <td>100+ nano sec</td>
    </tr>
    <tr>
      <td>SSD</td>
      <td>2 days</td>
      <td>25-100 micro sec</td>
    </tr>
    <tr>
      <td>HDD</td>
      <td>5 months</td>
      <td>5-10 mili sec</td>
    </tr>
  </tbody>
</table>

<p>A network call is expensive because:</p>
<blockquote>
  <p>(1) it needs a DNS query, (2) a TCP 3-way handshake, (3) a TLS handshake, (4) computing encryption keys, (5) invisible timeouts</p>
</blockquote>

<h3 id="serialization-latency">Serialization Latency</h3>
<p>Serialization is expensive too.
Constructing JSON for the network from in-memory Java objects (serialization) is CPU intensive (it requires string manipulation). Instead of JSON, we can use other formats to serialize data.</p>

<table>
  <thead>
    <tr>
      <th>JSON</th>
      <th>Protobuf OR Flat buf</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>{“id”: 5, “status”: “active”}</td>
      <td>bytes</td>
    </tr>
    <tr>
      <td>CPU needs to do json parsing</td>
      <td>CPU copies bytes so no parsing is needed</td>
    </tr>
    <tr>
      <td>Not good for high performance</td>
      <td>Good for high performance</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p>FlatBuffers &gt; Protocol Buffers &gt; JSON</p>
</blockquote>
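<p>To see why binary formats win, compare a JSON encoding with a fixed binary layout. This sketch uses Python’s <code class="language-plaintext highlighter-rouge">struct</code> module as a stand-in for a schema-based format like Protobuf (the field layout <code class="language-plaintext highlighter-rouge">&lt;IB</code> is my own choice):</p>

```python
import json
import struct

record = {"id": 5, "status": 1}  # status encoded as an integer: 1 = active

as_json = json.dumps(record).encode()  # 22 bytes of text that must be parsed
as_binary = struct.pack("<IB", 5, 1)   # fixed layout: 4-byte id + 1-byte status

print(len(as_json), len(as_binary))    # 22 vs 5 bytes

# Decoding the binary form is a plain memory copy into two integers,
# with no character-by-character parsing.
item_id, status = struct.unpack("<IB", as_binary)
```

<p>With a known schema, both sides agree on the layout up front, so the CPU skips the parsing step entirely.</p>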

<h3 id="network-latency">Network Latency</h3>
<p>The speed of light in fiber optic is a hard limit on data transfer speed. The distance from London to San Francisco is about 8,500 km, and one round trip takes roughly 85ms. So with HTTP/1.1 and TLS 1.2 the total time for one HTTP request is:
TCP handshake [85ms] + TLS handshake [85ms] + HTTP request [85ms] = 255ms</p>

<blockquote>
  <p>one RTT 10ms per 1000KM</p>
</blockquote>
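<p>The rule of thumb follows from the speed of light in fiber, roughly 200,000 km/s (about 2/3 of c); a quick Python check:</p>

```python
def min_rtt_ms(distance_km, fiber_km_per_s=200_000):
    """Lower bound on round-trip time given the speed of light in fiber."""
    return 2 * distance_km / fiber_km_per_s * 1000

print(min_rtt_ms(1000))  # ~10 ms, the rule of thumb above
print(min_rtt_ms(8500))  # ~85 ms for London <-> San Francisco
```

<p>This is a floor, not an estimate: routing detours, queuing, and handshakes only add to it.</p>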

<h3 id="quic">QUIC</h3>
<p>TCP should be reserved for rare, long-lived connections. QUIC (built on UDP) combines the TCP and TLS handshakes, so the latency from London to San Francisco becomes: combined handshake [85ms] + HTTP [85ms] = 170ms. This is 1-RTT setup instead of the 2-RTT with TCP. For returning users this reduces to 85ms, i.e. 0-RTT setup. Using a CDN, the edge server can keep an open TCP connection with the root server, and this latency can be reduced further.</p>

<h3 id="http2">HTTP2</h3>
<p>HTTP/2 sends all requests over a single TCP connection (multiplexing, connection pooling), and if combined with protobuf the serialization and deserialization cost can be saved as well.</p>

<p>Apache Arrow defines a standard memory layout and achieves zero-copy deserialization: the data on the network, on disk, and in RAM is identical.</p>

<blockquote>
  <p>Good rules of thumb for API design: <strong>Batching</strong>: send one request containing many little things; <strong>Data locality</strong>: if two services communicate too much, consider making them one; <strong>Coarse-grained APIs</strong>: avoid chatty interfaces</p>
</blockquote>

<hr />

<h2 id="part-4-anatomy-of-a-request">Part 4 Anatomy of a Request</h2>

<ul>
  <li>URL in browser memory -&gt; syscall connect() -&gt; context switch to TCP/IP stack -&gt;</li>
  <li>MTU limit on packet is 1500 Bytes and TCP header comes in</li>
  <li>The destination IP address must be resolved, so DNS comes in; Geo-DNS returns an IP according to the location of the user</li>
  <li>Ethernet header</li>
  <li>After leaving ISP, BGP comes in and it decides on which path to choose for the given destination</li>
  <li>Anycast helps BGP by announcing the same IP from different locations</li>
  <li>An edge server comes in and TLS is terminated there; since the TCP handshake is expensive, the edge server reuses an existing warm TCP connection to the root server</li>
  <li>Firewall for DPI, packet inspection is expensive so keep it as close to edge as possible</li>
  <li>Load balancer, layer 7</li>
  <li>API gateway: checks that the user exists and the JWT token is valid, rate limiting, protocol translation from REST to gRPC</li>
</ul>

<h2 id="part-5-persistence">Part 5 Persistence</h2>

<h3 id="fundamental-challenge">Fundamental challenge</h3>
<p><strong>Persistence is important</strong> — you cannot afford to lose data. But disk is <em>slow</em>.
The goal: data that survives power outages <em>and</em> systems that feel fast.</p>

<h4 id="latency-in-human-scale">Latency in Human Scale</h4>

<p>To make latency intuitive, imagine scaling nanoseconds to human time:</p>

<table>
  <thead>
    <tr>
      <th>Storage</th>
      <th>Scaled Latency</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>L1 Cache</td>
      <td>0.5 seconds</td>
    </tr>
    <tr>
      <td>RAM</td>
      <td>2 minutes</td>
    </tr>
    <tr>
      <td>SSD</td>
      <td>2 days</td>
    </tr>
    <tr>
      <td>HDD</td>
      <td>5 months</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p>We want persistence <em>and</em> speed. These two goals conflict — and the rest of this lecture is about how to reconcile them.</p>
</blockquote>

<hr />

<h3 id="databases--the-os-lie">Databases &amp; The OS Lie</h3>

<h4 id="buffered-io">Buffered I/O</h4>

<p>When a process calls <code class="language-plaintext highlighter-rouge">write()</code>, the OS does <strong>not</strong> immediately write to disk. Data goes to the <strong>page cache</strong> (in RAM), and <code class="language-plaintext highlighter-rouge">write()</code> returns immediately. This is the “OS lie” — your data isn’t on disk yet.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Process          Page Cache (RAM)         Disk
   │                    │                   │
   │──── write() ──────►│                   │
   │◄─── returns ───────│                   │
   │                    │── (eventually) ──►│
</code></pre></div></div>

<h4 id="fsync--safe-but-slow">fsync() — Safe but Slow</h4>

<p><code class="language-plaintext highlighter-rouge">fsync()</code> blocks until writes actually reach disk and are confirmed. It is the antidote to the OS lie, but expensive — the process cannot proceed until the disk acknowledges.</p>

<blockquote>
  <p><strong>Trade-off:</strong> Buffered I/O (fast, unsafe) vs. <code class="language-plaintext highlighter-rouge">fsync()</code> (safe, slow). The choice depends on how much data loss your use-case can tolerate.</p>
</blockquote>
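<p>In Python, this trade-off is visible directly in the system calls. A minimal sketch of a durable append (the file path and payload are illustrative):</p>

```python
import os

def durable_append(path, payload: bytes):
    """Append bytes and force them to disk before returning: safe but slow."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)   # lands in the page cache: the OS lie
        os.fsync(fd)            # block until the disk acknowledges
    finally:
        os.close(fd)

durable_append("/tmp/ledger.log", b"credit account 42: +100\n")
```

<p>Dropping the <code class="language-plaintext highlighter-rouge">os.fsync()</code> call gives you the fast, unsafe buffered behaviour described above.</p>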

<hr />

<h3 id="write-ahead-log-wal">Write-Ahead Log (WAL)</h3>

<p>Instead of writing directly to tables (random I/O), databases <strong>append every write to the end of a log file</strong> first — the Write-Ahead Log.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DB Write ──► Append to WAL ──► ACK returned to client
                  │
                  │ (later, when idle)
                  ▼
            Update index/tables
</code></pre></div></div>

<p>WAL converts random writes into <strong>sequential writes</strong>, which are dramatically faster on both HDDs and SSDs.</p>

<blockquote>
  <p><strong>Sequential writes &gt; random writes.</strong>
On HDDs, the read/write head needs to physically move — appending keeps the head still. SSDs benefit from sequential patterns too.</p>
</blockquote>
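<p>A toy version of the write path in Python (class and record formats are mine): every write is a sequential append plus fsync, and the in-memory state can be rebuilt by replaying the log.</p>

```python
import os
import tempfile

class WriteAheadLog:
    """Minimal WAL sketch: ACK a write only after a sequential, durable append."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "ab")

    def append(self, record: bytes):
        self.f.write(record + b"\n")   # sequential: always at the end of the file
        self.f.flush()
        os.fsync(self.f.fileno())      # durable before we ACK the client

    def replay(self):
        """On restart, rebuild in-memory state by re-reading the log."""
        with open(self.path, "rb") as f:
            return [line.rstrip(b"\n") for line in f]

wal = WriteAheadLog(tempfile.mktemp(suffix=".wal"))
wal.append(b"SET user:1 alice")
wal.append(b"SET user:2 bob")
print(wal.replay())  # the index/tables can be updated later, when idle
```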

<hr />

<h3 id="ssd-internals--write-amplification">SSD Internals — Write Amplification</h3>

<p>A common misconception: <strong>an SSD is not fast RAM</strong>. You cannot overwrite or delete a single byte in place on an SSD; you can only erase in large chunks (~2 MB blocks).</p>

<h4 id="the-read-modify-erase-write-dance">The Read-Modify-Erase-Write Dance</h4>

<p>To change <strong>1 byte</strong> on an SSD:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. Read 2 MB chunk → RAM
2. Modify 1 byte in RAM
3. Erase the entire 2 MB block on SSD
4. Write the full 2 MB back to SSD
</code></pre></div></div>

<p>Goal: update 1 byte → actually moved 2 MB. This is <strong>write amplification</strong>.</p>

<h4 id="flash-translation-layer-ftl">Flash Translation Layer (FTL)</h4>

<p>The <strong>FTL</strong> is a small orchestrator embedded in every SSD. It manages physical erase blocks, tracks logical-to-physical block mappings, and handles wear leveling — making the drive appear as a simple byte-addressable device and hiding all the complexity above.</p>

<hr />

<h3 id="b-tree--shallow-and-fat">B-Tree — Shallow and Fat</h3>

<p>Without an index, every query is a full table scan. The <strong>B-Tree</strong> solves this: optimized for reads, keeping itself shallow by making each node very wide (many children).</p>

<h4 id="exponential-growth-fanout--500">Exponential Growth (fanout = 500)</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Layer 1 (Root):  1 node
Layer 2:         500 nodes
Layer 3:         250,000 nodes
Layer 4:         125,000,000 nodes → 62.5 billion keys indexed
</code></pre></div></div>

<p><strong>62.5 billion rows indexed with only 4 disk reads (~30–40 ms per lookup).</strong></p>
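<p>The depth claim is easy to verify with integer arithmetic; a quick check of the fanout math above:</p>

```python
def btree_depth(n_keys, fanout=500):
    """Smallest number of levels whose reach (fanout ** depth) covers n_keys."""
    depth, reach = 0, 1
    while reach < n_keys:
        reach *= fanout
        depth += 1
    return depth

print(500 ** 4)                     # 62,500,000,000 keys reachable in 4 levels
print(btree_depth(62_500_000_000))  # 4 disk reads per lookup
```

<p>Each extra level multiplies the reach by the fanout, which is why wide, shallow trees beat binary trees on disk.</p>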

<hr />

<h3 id="lsm-tree--log-structured-merge-tree">LSM Tree — Log-Structured Merge Tree</h3>

<p>The B-Tree is optimized for reads. The <strong>LSM Tree</strong> makes the opposite bet: optimize for writes. Used by <strong>Cassandra</strong>, <strong>RocksDB</strong>, and other NoSQL engines.</p>

<p>The key rule: <em>only ever do sequential writes. Never do random writes on disk.</em></p>

<h4 id="write-path">Write Path</h4>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Write arrives
     │
     ▼
MemTable (sorted list in RAM)  ──also writes──►  WAL on disk (durable)
     │
     │ (when MemTable is full → flush)
     ▼
SSTable on disk (immutable, sorted, sequential write)
</code></pre></div></div>

<h4 id="bloom-filter">Bloom Filter</h4>

<p>Each SSTable has an associated <strong>Bloom filter</strong> — a probabilistic data structure that answers:
<em>“Is this key in this SSTable?”</em></p>

<ul>
  <li><strong>“Definitely not”</strong> → skip the SSTable entirely (no disk read needed)</li>
  <li><strong>“Maybe yes”</strong> → go check</li>
</ul>

<p>If the Bloom filter says no, the SSTable is not even touched. This dramatically reduces unnecessary disk reads.</p>
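<p>A toy Bloom filter in Python shows the two possible answers (bit-array size and hash scheme are arbitrary choices of mine, not what Cassandra or RocksDB use):</p>

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'definitely not' or 'maybe yes', never a false negative."""

    def __init__(self, size_bits=1024, n_hashes=3):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        # Derive n_hashes positions by salting a cryptographic hash.
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True: "maybe yes", go check the SSTable
print(bf.might_contain("user:999"))  # almost certainly False: skip the SSTable entirely
```

<p>The filter lives in RAM, so the "definitely not" answer costs no disk I/O at all.</p>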

<hr />

<h3 id="the-rum-conjecture">The RUM Conjecture</h3>

<p>A fundamental trade-off in data structure design — you can optimise for any <strong>two</strong> of the three, but never all three simultaneously:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              R (Read)
             /        \
            /          \
           /  Pick Two  \
          /     Only     \
         /                \
U (Update) ────────────── M (Memory)

B-Tree:   optimises R + M
LSM Tree: optimises U + R
</code></pre></div></div>

<table>
  <thead>
    <tr>
      <th>Optimise for</th>
      <th>Use</th>
      <th>Example workload</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>R + M</td>
      <td><strong>B-Tree</strong></td>
      <td>Banking, relational DBs (read-heavy)</td>
    </tr>
    <tr>
      <td>U + R</td>
      <td><strong>LSM Tree</strong></td>
      <td>Logs, event streams, Cassandra (write-heavy)</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>Rule of thumb:</strong> If your application is read-heavy, use a B-Tree. If it will be write-heavy, use an LSM Tree.</p>
</blockquote>

<hr />

<h3 id="the-invisible-enemy--bit-rot">The Invisible Enemy — Bit Rot</h3>

<p>Even at rest, data can silently corrupt. Cosmic rays, voltage fluctuations, and magnetic interference can flip bits without the OS noticing. <strong>Do not trust hardware.</strong></p>

<p><strong>Solution:</strong> Checksums (e.g. SHA-256). Compute a hash on write; recompute and verify on read. ZFS does this automatically for every block.</p>
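<p>The checksum idea takes only a few lines; a sketch of verify-on-read (the block format is my own, not ZFS’s):</p>

```python
import hashlib

def write_block(data: bytes) -> dict:
    """Store the data together with a checksum computed at write time."""
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}

def read_block(block: dict) -> bytes:
    """Recompute and verify on every read; detect silent corruption."""
    if hashlib.sha256(block["data"]).hexdigest() != block["checksum"]:
        raise IOError("checksum mismatch: block is corrupted")
    return block["data"]

block = write_block(b"important record")
print(read_block(block))  # verifies, then returns the data

# Simulate bit rot: flip one bit in the stored data.
block["data"] = bytes([block["data"][0] ^ 0x01]) + block["data"][1:]
# read_block(block) would now raise IOError instead of returning bad data
```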

<h4 id="disk-failure-rates">Disk Failure Rates</h4>

<p>With a fleet of 10,000 disks, roughly <strong>one disk fails every day</strong>. At scale, disk failures are expected daily events, not exceptional ones. Distributed systems must treat failure as the norm.</p>

<h4 id="2003--google-file-system-gfs">2003 — Google File System (GFS)</h4>

<p>GFS splits data into chunks, storing <strong>3 copies across 3 machines on 3 different racks</strong>. If one rack goes down, data survives on the other two.</p>

<p>But this introduces a <strong>consistency problem</strong>: all three replicas might have slightly different data at any moment. Consensus protocols like <strong>Raft</strong> solve this.</p>

<hr />

<h3 id="the-big-trade-off-spectrum">The Big Trade-Off Spectrum</h3>

<p>Every persistence decision sits on a spectrum between maximum speed and maximum durability:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>High risk / High speed ◄─────────────────────────────► Low risk / Low speed

RAM /                     fsync() to                Distributed replication,
Buffered I/O              SSD                       S3 / Multi-region
</code></pre></div></div>

<h4 id="practical-decision-framework">Practical Decision Framework</h4>

<p><strong>Is it a like on a TikTok post?</strong></p>
<ul>
  <li>It’s okay if people see the count update with a 2–3 second delay.</li>
  <li>Use an <strong>LSM tree</strong> with buffered I/O. Optimise for throughput.</li>
</ul>

<p><strong>Is it a bank transfer?</strong></p>
<ul>
  <li>Data loss is unacceptable.</li>
  <li>Use <strong><code class="language-plaintext highlighter-rouge">fsync()</code></strong> and wait for <em>all</em> replicas to reply before returning success. Optimise for durability.</li>
</ul>

<hr />]]></content><author><name></name></author><category term="distributed-systems" /><summary type="html"><![CDATA[System Design from First Principles]]></summary></entry><entry><title type="html">Distributed Systems For Fun And Profit</title><link href="https://mubashirusman.github.io/2026/03/14/distributed-systems-for-fun-and-profit.html" rel="alternate" type="text/html" title="Distributed Systems For Fun And Profit" /><published>2026-03-14T00:00:00+00:00</published><updated>2026-03-14T00:00:00+00:00</updated><id>https://mubashirusman.github.io/2026/03/14/distributed-systems-for-fun-and-profit</id><content type="html" xml:base="https://mubashirusman.github.io/2026/03/14/distributed-systems-for-fun-and-profit.html"><![CDATA[<h1 id="notes-from-distributed-systems-for-fun-and-profit">Notes from Distributed Systems for Fun and Profit</h1>

<p>When I started learning about distributed systems I took notes mostly in my notebook, but it takes longer to go back and reference them. So when I decided to read this book, I decided to keep my notes here. In big systems, things like fault tolerance, leader election, failure detection, coordination, consistency, and availability come up very often, and the strategy we choose to deal with them depends on the kind of system we are after.</p>

<h2 id="chapter-1-basics">Chapter 1: Basics</h2>

<p>A distributed system is a way of doing a task with many computers instead of one. To do this we have a constraint: use commodity hardware instead of relying on the most expensive hardware. There is a fundamental reason for this: as the number of nodes grows, the performance difference between high-end and commodity hardware decreases. So everything happens as scale grows, and the goal is scalability. Scalability can be defined as the ability of a system to grow with the amount of work. This can be in terms of data size, the computers-to-administrators ratio, keeping latency low as nodes grow, etc.</p>

<p>Scalable systems have two properties: 1. Performance (latency) 2. Availability (fault tolerance)</p>

<ul>
  <li>Performance means achieving a short response time OR high throughput OR low utilization of resources. <br />
Of these three, latency is the most interesting, as it has little to do with financial limitations and more with physical ones. The speed of light and the speeds at which hardware components can work are hard limits. E.g. the time between a write being initiated and the confirmation response being received.</li>
  <li>Availability is the proportion of time the system is functioning properly. If a user cannot load the page, the system is not available.<br />
Availability can be measured in terms of uptime: 90% availability allows more than a month of downtime per year, 99.9% allows under 9 hours, and 99.99% under an hour. Availability is affected by many factors beyond the uptime of a service: hard disks catching fire, a star falling from the sky, or the company going bankrupt. The best we can do is design for fault tolerance. Fault tolerance = the ability to behave well when a fault occurs.</li>
</ul>
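<p>The uptime figures above come from simple arithmetic; a quick Python check:</p>

```python
def downtime_hours_per_year(availability_pct):
    """Allowed downtime per year for a given availability percentage."""
    hours_per_year = 365 * 24  # 8760, ignoring leap years
    return hours_per_year * (1 - availability_pct / 100)

for a in (90, 99, 99.9, 99.99):
    print(f"{a}%: {downtime_hours_per_year(a):.2f} hours/year")
# 90% allows ~876 hours (over a month); 99.9% allows ~8.8 hours;
# 99.99% allows under an hour per year
```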

<p>The hindrance to these good things: an increased number of nodes increases the probability of one node failing (fault tolerance), and more nodes may mean more communication (thus reduced performance, higher latency). Our system design options live within these physical constraints. Both performance and availability are defined by external guarantees such as SLAs.</p>

<p>Making appropriate <strong>abstractions</strong> for complex systems makes them more manageable and understandable. In this regard, <strong>models</strong> help us concretely define what the properties of our system will be. Good abstractions remove irrelevant details. Some models are: failure modes (crash/Byzantine), system model (synchronous/asynchronous), consistency (weak/strong).
A system with weaker guarantees can be more performant and more available, yet harder to reason about, at the same time.<br />
Some failure types, such as network latency and network partitions, mean that a system has to make a hard choice: is it worth it to stay available and provide loose guarantees, or to reject requests and play safe?</p>

<p>Design techniques: Partition and replicate</p>
<ol>
  <li>Partitioning divides data across multiple nodes, each partition holding a subset of the data. This improves performance by limiting the amount of data a single node works with, and increases availability by allowing partitions to fail independently.</li>
  <li>Replication copies the same data onto multiple nodes, and it is the main tool we have to fight latency. It improves performance by making additional bandwidth and compute available, and improves availability by increasing the number of nodes that must fail before the system becomes unavailable.<br />
Replicate to reduce the risk of a single point of failure; replicate data to a local cache to reduce latency, or onto multiple machines to increase throughput. The downside of replication is that the copies need to be kept in sync, which means replication must follow some consistency model.
Stronger consistency lets you program against the system as if it were not replicated, while weaker consistency models expose some of the underlying details but can offer lower latency and higher availability.</li>
</ol>

<hr />
<h2 id="chapter-2-up-and-down-the-level-of-abstraction">Chapter 2: Up and Down the Level of Abstraction</h2>

<p>The fundamental tension is between how we want the system to behave, i.e. as a single unit, and how the system actually is: distributed. So we create abstractions; we assume that two nodes are equal even when they are not, which makes things easier and more manageable. <strong>Impossibility results</strong> tell us that within our assumptions some things cannot be done. In a distributed system, programs run concurrently on independent nodes, there is an unreliable network between them, and they have no shared memory or shared clock. This means that the knowledge in a particular node is local, any information about global state is normally out of date, clocks are not synchronized, and nodes can fail and recover from failure independently. A robust system is one that makes few or no assumptions. We can also build a system with strong assumptions, e.g. that nodes never fail, so the system need not handle node failure, though this is an unrealistic assumption.</p>
<ul>
  <li>Nodes can fail by <strong>crashing</strong> or in an arbitrary way other than crashing (in <strong>Byzantine fashion</strong>). We usually consider only crash failures, because we cannot account for the infinite number of other possible failures when designing our algorithm. For example, a hacker could break into a machine, but we don’t take such failures into account.</li>
  <li>Communication links can be assumed to be <strong>unreliable</strong> and subject to <strong>message loss</strong> and delays. A <strong>network partition</strong> occurs when the network fails between nodes but the nodes themselves remain operational. These are sufficient assumptions without going into the details of individual network links or counting the distance between nodes (in a local network).</li>
  <li>Timing assumptions matter because nodes have their own clocks and are at some distance from each other. In a synchronous system there exists an <strong>upper bound on message transmission delay</strong>; in an asynchronous system processes execute independently without any upper bound. Synchronous means that processes execute in lock-step and messages sent will be received within a maximum delay; asynchronous assumes we cannot rely on timing at all. Assumptions about execution speeds and maximum message delays can help rule out failure scenarios, as if they never happened. Real-world systems mostly run within the upper bounds, but there are certainly times when there are delays and message loss.</li>
</ul>

<h3 id="the-consensus-problem">The consensus problem</h3>
<p>Some computers are in consensus if they agree on some value. Properties of consensus: 1. Agreement: every correct process must agree on the same value 2. Integrity: every correct process decides at most one value 3. Termination: all processes eventually reach a decision 4. Validity: if all processes propose the same value, then they all decide that value. Solving consensus helps us solve more advanced problems such as atomic commit and atomic broadcast.</p>
<h4 id="flp-impossibility-result">FLP Impossibility result</h4>
<p>Assuming the asynchronous system model, it states that we can <strong>not</strong> guarantee consensus in an asynchronous system where even one process can fail by crashing, even if messages are never lost. The reason is that message delivery can be delayed so that some process remains undecided for an arbitrary amount of time.</p>
<h4 id="the-cap-theorem">The CAP theorem</h4>
<p>Consistency in CAP means that all nodes have the same copy of the data, or else the system refuses to answer. Availability means the system keeps giving answers even in the face of node failures. Partition tolerance means the system continues to operate even in the face of network division. Only two can be satisfied simultaneously.
Picking CA: strict quorum protocols such as two-phase commit. It cannot tolerate any node failure and gives strong consistency, but it cannot differentiate between a network partition and a node failure. Common in traditional relational databases using two-phase commit.
Picking CP: majority quorum protocols where the minority partition becomes unavailable, like Paxos. It can tolerate <code class="language-plaintext highlighter-rouge">n</code> node failures out of <code class="language-plaintext highlighter-rouge">2n+1</code> nodes, meaning it works as long as <code class="language-plaintext highlighter-rouge">n+1</code> stay up. The minority partition does <strong>NOT</strong> accept writes; only the majority partition can.
Picking AP: protocols that involve conflict resolution, like DynamoDB.
In the face of a partition, the CAP theorem reduces to a <strong>choice between Consistency and Availability</strong>.
Four conclusions from the CAP theorem:</p>
<ul>
  <li>early system designs did not incorporate network partitions (mostly CA), but today we cannot ignore partitions, as systems are spread across different geographic regions.</li>
  <li>there is a fundamental tension between strong consistency and high availability when a network partition occurs.
Strong consistency guarantees require us to give up availability during a network partition: if we do not give up availability and two nodes cannot communicate, we get divergence. We can overcome this in two ways: 1. not have partitions 2. weaken the guarantees</li>
  <li>there is a tension between strong consistency and performance in normal operation: all nodes must agree on the same result before moving on, and this introduces latency. If we can <strong>relax guarantees</strong> then we can have <strong>lower latency</strong> and <strong>higher availability</strong>; the fewer nodes involved in an operation, the less time we wait for the result. The trade-off is that we allow some anomalies to occur, which means you may read stale data.</li>
  <li>consistency and availability are not binary choices, unless we restrict ourselves to <strong>strong consistency</strong>. CAP consistency != ACID consistency. Consistency is a broader term, and strong consistency is just one form of it.</li>
</ul>

<h3 id="consistency-models">Consistency models</h3>
<p>Consistency models can be divided into two categories: 1. Strong consistency models 2. Weak consistency models</p>

<ul>
  <li>Strong consistency models: 1. <em>linearizable consistency</em>, in which all operations appear to execute atomically in the same order as the actual time ordering of operations, 2. <em>sequential consistency</em>, the same as linearizability except that operations may be executed in an order different from the one in which they were received.</li>
  <li>Weak consistency models: 1. Client-centric models involve the notion of a client or session in some way, for example forwarding a client to the same replica after they update something, so that they don’t see their own older data. 2. Eventual consistency, where all nodes will agree on the same value after an undefined amount of time. “Eventually” is a very weak guarantee, so we should at least be able to bound how long “eventually” is.</li>
</ul>

<hr />
<h2 id="chapter-3-time-and-order">Chapter 3: Time and Order</h2>
<p>Time is used beyond distributed systems; our personal computers use it as well, e.g. to track how long a DNS query is cacheable, or to check whether a certificate is still valid. Time helps keep track of the order in which events occurred, and we care a lot about order since it is easier for our brains to reason about, so time is an important property.</p>

<p>There are two types of clocks: 1. <em>Physical clock</em>: to count the number of seconds elasped 2. <em>Logical clocks</em>: count events such as messages sent</p>

<p>Physical clocks are made of a quartz crystal that vibrates at some frequency; we count the number of cycles and map them to seconds. The frequency at which it oscillates depends on temperature, so it is very precise but not perfect. Most quartz clocks deviate by 20 to 50 ppm, where 1 ppm works out to roughly 32 seconds per year. Atomic clocks are more accurate than quartz.</p>

<h3 id="definition-of-time">Definition of time</h3>
<p>How is time defined? Time is affected by how fast the Earth rotates. GMT (Greenwich Mean Time, solar time) is literally defined by astronomical observation of the sun’s position as seen from Greenwich in South East London. Atomic clocks measure time by the frequency at which the caesium atom resonates: 1 second = 9,192,631,770 periods of the caesium atom’s radiation, so 1 day = 24 × 60 × 60 × 9,192,631,770 periods, cool. We want to use atomic time but also stay consistent with the Earth’s time, so we apply corrections to atomic time to account for the Earth’s rotation, and that gives us UTC. All time zones are some offset from UTC, e.g. the US east coast is UTC-5 and Pakistan is UTC+5, so the time difference between Pakistan and the US east coast is 10 hours. Unix time counts the number of seconds elapsed since January 1st, 1970. Software simply ignores whether a leap second has to be added or removed, so we usually don’t care; but in distributed systems we cannot ignore the shift of a second.</p>

<p><em>Total order</em> and <em>partial order</em> are mathematical relations. A total order is one where any two elements are comparable; a partial order requires only some pairs of elements in a set to be comparable. So every total order is a partial order, but not vice versa. Time is essentially a form of order, and a timestamp really represents the state of a system at that instant.</p>

<h3 id="characteristics-of-time">Characteristics of time</h3>
<p>We can attach timestamps to unordered events to make them <em>ordered</em>. Timestamps are comparable values and can be <em>interpreted</em> by humans to understand when something happened. Durations can be used by algorithms to make judgements about a system: for example, time spent waiting can provide a clue about whether the system is partitioned or just experiencing high latency. Distributed systems do not execute in any single predetermined order, and imposing an order is one way to reduce the number of possible executions.</p>

<h3 id="global-vs-logical-clocks">Global vs Logical clocks</h3>
<p>In the context of how we know time, it is easier to picture a total order than a partial order. But assuming that things happened strictly one after the other is a strong assumption, and strong assumptions can lead to a fragile system. The more temporal nondeterminism we can tolerate, the closer we are to the true nature of a distributed system. Does time progress at the same rate everywhere? There are three answers: with a <strong>global clock</strong>: yes; with <strong>local clocks</strong>: no, but; with <strong>no clocks</strong>: no!
Clocks are important for assigning a total order to operations. A synchronous system has a global clock; in an asynchronous system there is no clock. A global clock is a source of total order, but it is limited by, among other things, the accuracy of a clock synchronisation protocol such as NTP. Cassandra (originally from Facebook) is an example of a system that uses timestamps to resolve conflicts between writes: the write with the newest timestamp wins. This means that if clocks drift, new writes may be overwritten by old ones.<br />
Under the no-clock assumption we get a notion of logical time, using <strong>counters</strong> as timestamps. <em>We can order events between different machines using counters and communication, and find out whether something happened before, after, or concurrently with something else.</em> In a partial order not every pair of elements is comparable: if event x happens on machine A and event y happens on machine B and there was no communication between them, then we cannot say that x happened before y (x -&gt; y) or y happened before x (y -&gt; x). All of this is in the absence of a global clock. What Lamport clocks guarantee is this: if A -&gt; B then counter(A) &lt; counter(B), but not the other way around. This is a partial order.</p>

<p>Time can define order across systems without communication, and sometimes correctness depends on correct event ordering (such as serializability in a distributed database). Only a global clock can order events across two machines without communication; without one, the machines need to exchange messages. Time can also be used to define boundary conditions for algorithms via timeouts, for example to distinguish between high latency and a server being down. The algorithms that do this are called failure detectors.</p>

<h3 id="vector-clocks">Vector clocks</h3>
<p>How can we order events without synchronizing physical clocks? Enter Lamport clocks. Lamport clocks and vector clocks use <strong>counters</strong> plus communication to define order, and are replacements for physical clocks. The counter is comparable across machines.</p>

<p>This is how Lamport clock works:</p>
<ul>
  <li>If a process does work, increment the counter</li>
  <li>If process sends a message, include the counter</li>
  <li>When a message is received, set the counter to <code class="language-plaintext highlighter-rouge">max(local_counter, received_counter)+1</code></li>
</ul>
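<p>A minimal sketch of these three rules in Python (the class and method names are my own, not from any particular library):</p>

```python
class LamportClock:
    """Minimal sketch of the three Lamport-clock rules above."""
    def __init__(self):
        self.counter = 0

    def tick(self):
        # Rule 1: a process doing work increments its counter.
        self.counter += 1

    def send(self):
        # Rule 2: a sent message carries the current counter.
        return self.counter

    def receive(self, received_counter):
        # Rule 3: on receipt, counter = max(local, received) + 1.
        self.counter = max(self.counter, received_counter) + 1
```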

<p>So a Lamport clock allows counters to be compared across systems, with one caution: if counter(A) &lt; counter(B) then either A happened before B or A is incomparable with B. Comparing Lamport timestamps across systems that never communicate with each other may lead us to assume some event happened before another when in reality they happened concurrently. So you cannot say anything meaningful about events on two independent systems that are not causally related.</p>

<p>A <strong>Vector Clock</strong> maintains an array of N logical clocks, one for each node. Each node increments its own counter instead of incrementing a common counter. Rules are</p>
<ul>
  <li>Whenever a process does work, increment the logical clock value of the node in the vector</li>
  <li>Whenever process sends message, include the full vector of logical clocks</li>
  <li>When a message is received:<br />
<em>update each element in the vector to max(local, received)</em> AND <em>increment the logical clock value representing the current node in the vector</em></li>
</ul>
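<p>The same rules, sketched for a fixed, known set of nodes (again, illustrative names, and I assume sending itself counts as an event):</p>

```python
class VectorClock:
    """Sketch of the vector-clock rules above for a fixed set of nodes."""
    def __init__(self, node, nodes):
        self.node = node
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        # Local work: increment only this node's entry in the vector.
        self.clock[self.node] += 1

    def send(self):
        # Sending counts as a local event here (an assumption of this
        # sketch); the full vector is included in the message.
        self.tick()
        return dict(self.clock)

    def receive(self, received):
        # Element-wise max with the received vector, then increment own entry.
        for n in self.clock:
            self.clock[n] = max(self.clock[n], received[n])
        self.clock[self.node] += 1

def happened_before(a, b):
    """Partial order: a -> b iff a <= b element-wise and a != b."""
    return all(a[n] <= b[n] for n in a) and a != b
```

Two vectors that are incomparable in both directions denote concurrent events, which is exactly the information a single Lamport counter cannot give you.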

<p>The problem with vector clocks is that they require one entry per node and thus can become very large for big systems. This problem can be countered by techniques such as periodic garbage collection or by reducing accuracy by limiting the size.</p>

<h3 id="failure-detectors">Failure detectors</h3>
<p>The amount of time spent waiting can provide clues about whether a system is partitioned or merely experiencing high latency. We do not need a global clock with perfect accuracy here; a reliable-enough local clock is sufficient. In the absence of a response from a remote machine, we can assume the node has failed after some reasonable amount of time. But what counts as a <em>reasonable</em> time? Instead of specifying concrete values, it is better to abstract away the exact timing assumptions.</p>

<p>This is where <strong>failure detectors</strong> come in. They are based on <strong>heartbeat messages</strong> and <strong>timers</strong>. A timeout-based failure detector risks being too aggressive (too quick to declare failure) or too conservative (taking too long to detect a failure).
Failure detectors are characterized by two properties: <em>completeness and accuracy</em>.</p>
<ul>
  <li>Strong completeness: every crashed process is eventually suspected by every correct process</li>
  <li>Weak completeness: every crashed process is eventually suspected by some correct process</li>
  <li>Strong accuracy: No correct process is suspected ever</li>
  <li>Weak accuracy: Some correct process is never suspected</li>
</ul>

<p>Completeness is easier to achieve than accuracy. In fact, weak completeness can be transformed into strong completeness by broadcasting information about suspected processes. But avoiding incorrectly suspecting a non-faulty process is hard unless you have a hard bound on message delay, which is only possible in the synchronous system model. Therefore, in systems where <strong>hard bounds are not set</strong> on message delays, failure detectors can <strong>only be eventually accurate</strong>.</p>

<p>The image below is taken from Chandra et al. (1996) paper.
<img src="/assets/Chandra-et-al.png" alt="Chandra et al." /></p>

<p>This diagram shows that some problems cannot be solved without strong assumptions about time bounds (i.e. without failure detectors): without them, it is not possible to tell whether a remote node has crashed or is simply experiencing high latency.</p>

<h3 id="implementing-a-failure-detector">Implementing a failure detector</h3>
<p>Conceptually there is not much to a simple failure detector: it declares failure when a timeout expires. The most <strong>interesting part is how judgments are made about whether a remote node has failed</strong>. Ideally a failure detector would adjust to network conditions instead of hard-coding timeout values. For example, Cassandra uses an accrual failure detector, which outputs a suspicion level (a value between 0 and 1) rather than a binary “up” or “down” judgment. The tradeoff between accurate detection and early detection is then left to the application.</p>
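<p>A toy accrual-style detector can be sketched as follows; the window size and the suspicion formula here are illustrative assumptions of mine, not Cassandra’s actual Phi Accrual implementation:</p>

```python
import statistics

class AccrualFailureDetector:
    """Toy accrual-style detector: instead of a binary up/down answer it
    returns a suspicion level that grows the longer a heartbeat is overdue,
    relative to the inter-arrival times observed so far."""
    def __init__(self, window=100):
        self.intervals = []        # recent heartbeat inter-arrival times
        self.last_heartbeat = None
        self.window = window

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.window:]
        self.last_heartbeat = now

    def suspicion(self, now):
        """0.0 = certainly alive; approaching 1.0 = almost certainly failed."""
        if len(self.intervals) < 2:
            return 0.0  # not enough history to judge yet
        mean = statistics.mean(self.intervals)
        overdue = (now - self.last_heartbeat) / mean
        # Ramp from 0 to 1 once the node is more than one mean interval late.
        return min(1.0, max(0.0, (overdue - 1.0) / 4.0))
```

The application then picks its own threshold on the suspicion level, trading early detection against false positives.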

<p>When is order/synchronicity really needed? It depends on the system under consideration. In many cases we want the responses from a database to represent all of the available information with no inconsistency. In other cases it is acceptable to give an answer that represents only the best known estimate, based on a subset of the total information. In particular, during a network partition one may want to answer queries with only part of the system accessible. For example, is the Twitter follower count for some user X, or X+1? Are movies A, B and C the absolute best answers for some query? A cheaper, mostly correct “best effort” answer can be acceptable.</p>

<hr />

<h2 id="chapter-4-replication">Chapter 4: Replication</h2>

<p>The replication problem provides context for many other sub-problems of distributed systems: leader election, consensus, and failure detection.</p>

<table>
  <thead>
    <tr>
      <th>Synchronous</th>
      <th>Asynchronous</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Client waits; all nodes must receive the update and acknowledge it to the master</td>
      <td>Response is sent back to the client immediately</td>
    </tr>
  </tbody>
</table>

<h3 id="primarybackup-replication">Primary/Backup replication</h3>

<p><strong>Provides Weak Consistency and not partition tolerant</strong></p>

<p>In this scheme the master receives all updates, and a log of operations is shipped to the replicas. Two variants:</p>
<ul>
  <li><strong>Asynchronous</strong> Pri/Backup replication: can work with one message <em>update</em></li>
  <li><strong>Synchronous</strong> Pri/Backup replication: requires two messages <em>update + acknowledge receipt</em></li>
</ul>

<p>MySQL uses the asynchronous variant by default. Any asynchronous replication algorithm can only provide weak durability guarantees; in MySQL replication this shows up as replication lag, where replicas run at least one operation behind the master. If the master fails, the updates that have not yet been sent to the backups are lost.</p>

<p>The synchronous variant of primary/backup replication ensures that writes have been stored on other nodes before returning back to the client - client waits, but this too can only provide weak guarantees.</p>
<blockquote>
  <p>For example: a primary receives a write and sends it to the replicas; a backup persists and ACKs the write; then the primary fails before sending the ACK to the client. Now the client assumes the write failed even though the backups have already applied it.</p>
</blockquote>

<p><strong>Primary/backup or log shipping schemes only offer best-effort guarantees.</strong> They are susceptible to failed updates and split brain: for example, if the backup takes over during a temporary network issue, there will be two active primaries at the same time.</p>

<p>P/B has following properties:</p>
<ul>
  <li>Single, static master</li>
  <li>Replicated log, slaves are not involved in executing operations</li>
  <li>No bounds on replication delay</li>
  <li>Manual/ad-hoc failover, not fault tolerant</li>
  <li>Not partition tolerant</li>
</ul>

<h3 id="2-phase-commit">2 Phase Commit</h3>

<p><strong>Provides Strong Consistency but not partition tolerant</strong> also <strong>NO AUTOMATIC RECOVERY</strong> also <strong>Prevents divergence</strong></p>

<p>To prevent failures from causing consistency guarantees to be violated, we need another layer of messaging, leading us to 2PC. MySQL Cluster provides synchronous replication using 2PC.</p>
<ul>
  <li>First phase, <em>voting</em>: the primary sends the update to the participants; each participant decides whether to commit or abort, and if committing, stores the update in a temporary area (the write-ahead log). Until the second phase completes, the update is considered temporary.</li>
  <li>Second phase, <em>decision</em>: the primary decides the outcome and informs every participant, which then makes the update permanent from the temporary area.</li>
</ul>

<p>Having a second, <em>decision</em> phase before making a commit permanent allows the system to roll back the update, which is not possible in P/B replication. 2PC is prone to blocking even if a single node fails. It assumes stable storage: data on each node is never lost and no node crashes forever. The major tasks in 2PC are <strong>ensuring writes are durable on disk</strong> and <strong>making the right recovery decisions, i.e. learning the outcome of a round and then applying or rolling back the changes locally</strong>.</p>
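<p>The two phases can be sketched as follows. The <code class="language-plaintext highlighter-rouge">Participant</code> interface is hypothetical, and durable logging and crash recovery (the hard parts) are deliberately omitted:</p>

```python
class Participant:
    """Toy participant; a real one persists the tentative update in a WAL."""
    def __init__(self, will_vote_yes=True):
        self.will_vote_yes = will_vote_yes
        self.tentative = None   # temporary area (stands in for the WAL)
        self.state = None       # permanent state

    def prepare(self, update):
        # Phase 1: vote, storing the update only tentatively.
        if self.will_vote_yes:
            self.tentative = update
            return True
        return False

    def commit(self):
        self.state = self.tentative     # make the update permanent

    def abort(self):
        self.tentative = None           # discard the tentative update

def two_phase_commit(update, participants):
    """Coordinator: commit only on a unanimous yes vote."""
    # Phase 1 (voting).
    votes = [p.prepare(update) for p in participants]
    if all(votes):
        # Phase 2 (decision): commit everywhere.
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```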

<blockquote>
  <p>From the CAP theorem, 2PC is CA: it is not partition tolerant. There is also no safe way of promoting a new primary if one fails; manual intervention is needed. It is latency sensitive, since it is an N-of-N write approach. It is <strong>consistent</strong> and NOT susceptible to split brain.</p>
</blockquote>

<p>2PC has these properties:</p>
<ul>
  <li>Unanimous vote: commit or abort</li>
  <li>Static master</li>
  <li>2PC cannot survive simultaneous failure of the coordinator and a node during a commit</li>
  <li>Not partition tolerant, tail latency sensitive</li>
</ul>

<h3 id="consensus-algorithms">Consensus algorithms</h3>

<p><strong>Provides fault tolerance and single copy consistency</strong></p>

<p>Consensus means a <strong>majority</strong> of nodes agreeing on <strong>one result</strong>.
Partition tolerant consensus algorithms are <strong>fault-tolerant</strong> algorithms that maintain single-copy consistency. Paxos is the best-known partition tolerant consensus algorithm.</p>

<h4 id="a-network-partition">A network partition</h4>
<p>A network partition is the failure of a network link to one or several nodes; the nodes themselves stay active. Partitions are tricky because it is not possible to distinguish between an unreachable node and a failed node. If it is a partition, the system is divided in two and nodes remain active on both sides.</p>

<p>A system of three nodes, with a failure and a network partition: <img src="/assets/partition.png" alt="partition image" /></p>

<p>A system designed for single-copy consistency must be able to break symmetry; otherwise a partition results in two EQUAL systems. In other words, it must ensure that only one side of the partition remains active. This is essential for preserving single-copy consistency.</p>

<h4 id="majority-decisions">Majority decisions</h4>

<p>Requiring only a <strong>majority</strong> of nodes - instead of all nodes - to agree on updates allows some nodes to be unavailable, unreachable, or down. As long as <strong>(N/2 + 1)-of-N</strong> nodes are up and accessible, the system will continue to operate (here N/2 is integer division). Partition tolerant consensus algorithms use an odd number of nodes. A majority can also tolerate <strong>disagreement</strong>. Consensus algorithms for replication generally opt for <strong>distinct roles</strong> for each node - leader and follower - and all updates must pass through the leader. Having roles does not prevent the system from recovering from a failure, via a <strong>leader election phase</strong>. Each period of normal operation is called an <strong>epoch</strong> (in Raft, a <em>term</em>), during which only one node is designated as the leader. Epochs act as logical clocks, allowing nodes to identify when an outdated node starts communicating: nodes that were partitioned or out of operation will have a smaller epoch number than the current one, and their commands are ignored.</p>
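<p>The quorum arithmetic above is easy to check (a small sketch, using integer division as stated):</p>

```python
def quorum_size(n: int) -> int:
    """Majority quorum: (N/2 + 1)-of-N, with integer division."""
    return n // 2 + 1

def max_failures_tolerated(n: int) -> int:
    """Failures the cluster survives while a quorum stays reachable."""
    return n - quorum_size(n)
```

This also shows why odd cluster sizes are preferred: 3 nodes tolerate 1 failure and 5 tolerate 2, while adding a 4th node still only tolerates 1.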

<h5 id="working-of-raft"><strong>Working of RAFT</strong></h5>
<p>During normal operation the leader sends a heartbeat (at a <strong>heartbeat interval</strong>), which lets the followers detect that the leader has failed or become partitioned. When a follower notices the leader is non-responsive - specifically, the follower whose <strong>election timeout</strong> expires first - it switches to an intermediate state (called “candidate” in Raft), increments the term/epoch value by one, initiates a leader election, and competes to become the new leader. To be elected leader, a node must receive a majority of the votes. Raft has seen adoption in <em>etcd</em>, a system inspired by ZooKeeper.</p>
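<p>A toy sketch of this election trigger, with made-up names and omitting Raft’s log-comparison rules for granting votes:</p>

```python
import random

class RaftNode:
    """Toy sketch of Raft's election trigger, not a full implementation."""
    def __init__(self, name):
        self.name = name
        self.term = 0
        self.state = "follower"
        # Randomized timeout so that usually one node times out first
        # and elections rarely collide (values are illustrative).
        self.election_timeout_ms = random.uniform(150, 300)
        self.voted_in_term = set()

    def grant_vote(self, term):
        # Vote at most once per term, and only for a newer term.
        if term > self.term and term not in self.voted_in_term:
            self.term = term
            self.voted_in_term.add(term)
            return True
        return False

    def on_election_timeout(self, cluster):
        # Become candidate, bump the term, and ask everyone for a vote.
        self.state = "candidate"
        self.term += 1
        votes = 1  # votes for itself
        for peer in cluster:
            if peer is not self and peer.grant_vote(self.term):
                votes += 1
        if votes > len(cluster) // 2:  # majority wins
            self.state = "leader"
```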

<h5 id="working-of-paxos"><strong>Working of Paxos</strong></h5>
<p><strong>In Paxos</strong> in some cases - such as if two proposers are active at the same time (dueling); if messages are lost; or if a majority of the nodes have failed - then no proposal is accepted by a majority. But this is acceptable, since the decision rule for what value to propose converges towards a single value. According to the FLP impossibility result, this is the best we can do: algorithms that solve the consensus problem must either give up safety or liveness when the guarantees regarding bounds on message delivery do not hold. Paxos gives up liveness.</p>

<p>A Consensus based fault tolerant algorithm such as Paxos has following:</p>
<ul>
  <li>Majority vote</li>
  <li>Dynamic master</li>
  <li>Paxos is less sensitive to tail latency.</li>
  <li>Robust to n/2-1 simultaneous failures as part of protocol</li>
</ul>

<p>Paxos is one of the most important algorithms when writing <strong>strongly consistent partition tolerant replicated systems</strong>. It is used in many of Google’s systems, including the Chubby lock manager used by BigTable/Megastore, the Google File System as well as Spanner. The implementation issues of Paxos mostly relate to the fact that Paxos is described in terms of a single round of consensus decision making, but an actual working implementation usually wants to run multiple rounds of consensus efficiently.</p>

<p>Paxos defines three roles: <strong>proposers</strong>, <strong>acceptors</strong> and <strong>learners</strong>. Paxos nodes can take multiple roles, even all of them. Paxos nodes must know how many nodes constitute a majority. Paxos runs on an unreliable network where messages can be lost, and Paxos nodes are persistent - they cannot forget what they accepted. A Paxos run aims at reaching a <strong>single consensus</strong>: once a consensus is reached, it cannot progress to another consensus.</p>
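<p>The acceptor’s side of a single-decree run can be sketched like this - a simplified illustration that ignores stable storage and learners:</p>

```python
class Acceptor:
    """Sketch of Paxos acceptor logic for one consensus instance.
    A real acceptor must persist this state to stable storage."""
    def __init__(self):
        self.promised_id = 0
        self.accepted_id = None
        self.accepted_value = None

    def prepare(self, proposal_id):
        # PROMISE to ignore anything lower, piggybacking any value
        # this acceptor has already accepted.
        if proposal_id > self.promised_id:
            self.promised_id = proposal_id
            return ("PROMISE", self.accepted_id, self.accepted_value)
        return ("REJECT",)

    def accept_request(self, proposal_id, value):
        # Accept unless a higher ID has been promised in the meantime.
        if proposal_id >= self.promised_id:
            self.promised_id = proposal_id
            self.accepted_id = proposal_id
            self.accepted_value = value
            return ("ACCEPT", value)
        return ("REJECT",)
```

The piggybacked value in the PROMISE reply is what forces a later, higher-ID proposer to adopt the already-accepted value instead of its own.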

<blockquote>
  <p>If a majority of acceptors have promised to ignore anything lower than an ID, any lower ID will be ignored. E.g.: if proposer A sends a PREPARE for ID=4 and gets a PROMISE from a majority of acceptors for ID=4, it then sends an ACCEPT-REQUEST with <em>ID, value</em>, and the acceptors accept it and reply with ACCEPT <em>value</em>. Now if another proposer comes in and sends a PREPARE with a higher ID=5, it will get a PROMISE from the acceptors, but with the previously accepted value <em>piggybacked</em> on the reply.</p>
</blockquote>

<p>WHAT IF the <strong>proposer fails in the PREPARE phase</strong>: the acceptors that sent a PROMISE will wait, and upon receiving no response from the proposer, another proposer will send a PREPARE message with its own, higher ID. So Paxos makes progress.</p>

<p>WHAT IF the <strong>proposer fails after sending ACCEPT-REQUEST and before getting ACCEPT</strong>: another proposer will come up with a higher ID and send a PREPARE; the acceptors will reply with a PROMISE that piggybacks the value they have already accepted. The new proposer then adopts that value and gives up on its own.</p>

<p>The following picture illustrates a consensus run in Paxos. <img src="/assets/paxos-algorithm.png" alt="Paxos algorithm." /></p>

<h5 id="zab-zookeeper-atomic-broadcast"><strong>ZAB: Zookeeper atomic broadcast</strong></h5>
<p>ZAB is used in Apache Zookeeper, which provides coordination primitives for distributed systems and is itself used by Kafka. Technically, atomic broadcast is a different problem from pure consensus, but it still falls under the category of partition tolerant algorithms that ensure strong consistency.</p>

<hr />]]></content><author><name></name></author><summary type="html"><![CDATA[Notes from Distributed Systems for Fun and Profit]]></summary></entry><entry><title type="html">Apache Zookeeper</title><link href="https://mubashirusman.github.io/2026/03/12/apache-zookeeper.html" rel="alternate" type="text/html" title="Apache Zookeeper" /><published>2026-03-12T00:00:00+00:00</published><updated>2026-03-12T00:00:00+00:00</updated><id>https://mubashirusman.github.io/2026/03/12/apache-zookeeper</id><content type="html" xml:base="https://mubashirusman.github.io/2026/03/12/apache-zookeeper.html"><![CDATA[<p>This is my attempt to understand Apache Zookeeper. I’m writing this post for my future reference, so I can come back to it after 6 months.</p>

<h2 id="introduction">Introduction</h2>

<p>Zookeeper is a system that solves the distributed coordination problem; it’s a coordination kernel for distributed systems. It provides primitives - not concrete implementations - on top of which you can build distributed configuration, distributed locks, group membership, leader election, and the like.</p>

<p>It has nodes which are arranged hierarchically, like folders in a filesystem. It is non-blocking, and each client’s operations are performed in FIFO order. Nodes can be regular or ephemeral (ephemeral nodes are deleted when the client session ends), and a node can be created as sequential, in which case its name gets a monotonically increasing number appended. The beautiful thing about Zookeeper is how other abstractions can be built on top of these primitives.</p>

<h2 id="consistency-guarantees">Consistency guarantees</h2>
<ul>
  <li>It provides strong consistency for writes through linearizability (single global order). This means all clients will observe the state changes in the same order.</li>
  <li>Zookeeper also provides FIFO ordering per client session. Requests from a single client are processed in the order they are sent.</li>
  <li>Reads are not guaranteed to see the latest writes unless explicitly synchronized, so clients can see stale data if they read from a follower that has not yet synced with the leader.</li>
</ul>

<h2 id="fault-tolerance">Fault Tolerance</h2>
<p>Zookeeper can tolerate partial failure, but it chooses consistency (correctness) over availability. From the CAP theorem it is CP: Zookeeper becomes unavailable if a quorum is not reachable, but it never compromises consistency.</p>

<h2 id="what-zookeeper-does-and-what-it-does-not-do">What Zookeeper does and what it does not do</h2>

<table>
  <thead>
    <tr>
      <th>Guarantee Type</th>
      <th>Does</th>
      <th>Does NOT</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Write consistency</td>
      <td>Linearizable writes with a global order</td>
      <td>High throughput or horizontal write scaling</td>
    </tr>
    <tr>
      <td>Read consistency</td>
      <td>consistency within sessions</td>
      <td>Linearizable reads without explicit sync</td>
    </tr>
    <tr>
      <td>Availability</td>
      <td>when quorum is met</td>
      <td>available if quorum is not met</td>
    </tr>
    <tr>
      <td>Data</td>
      <td>small metadata</td>
      <td>large payload like a database</td>
    </tr>
  </tbody>
</table>

<h2 id="data-model">Data Model</h2>

<p>The data model provides just enough structure to store the data required for coordination. It’s a hierarchical namespace resembling a filesystem, where clients find information via parent-child relationships. Each node in this structure is called a <strong><em>znode</em></strong> and stores coordination metadata (with a default maximum size of 1MB).</p>

<p>This is the stat structure of a znode:</p>
<ul>
  <li><strong>czxid</strong> = transaction ID when the znode was created</li>
  <li><strong>mzxid</strong> tracks the most recent modification</li>
  <li><strong>Version numbers</strong> provide concurrency control, allowing clients to perform conditional updates</li>
  <li><strong>ephemeralOwner</strong> identifies the session that created an ephemeral znode</li>
  <li><strong>numChildren</strong> counts immediate children</li>
</ul>
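<p>The version number enables optimistic concurrency control, similar in spirit to ZooKeeper’s conditional <code class="language-plaintext highlighter-rouge">setData(path, data, version)</code>, which fails if the supplied version is stale. A toy in-memory sketch (my own class, not the real client API):</p>

```python
class Znode:
    """Toy in-memory znode showing version-based conditional updates."""
    def __init__(self, data=b""):
        self.data = data
        self.version = 0   # bumped on every successful write

    def set_data(self, data, expected_version):
        # Fail if another client modified the znode in between,
        # mirroring ZooKeeper's bad-version error.
        if expected_version != self.version:
            raise ValueError("bad version: concurrent update detected")
        self.data = data
        self.version += 1
```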

<h2 id="znode-types">Znode types</h2>
<p><strong>Persistent znodes</strong> remain in the system until deleted and are often used to store configuration or long-lived coordination state. 
<strong>Ephemeral znodes</strong> are tied to a client session and are deleted when the session ends, which is useful for membership tracking. 
<strong>Sequential znodes</strong> include a monotonically increasing sequence number in their name, generated by ZooKeeper rather than clients to ensure global uniqueness. They are useful for distributed queues and leader election.</p>
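<p>The sequence suffix is what makes recipes like leader election work: sort candidates by their ZooKeeper-assigned number, the lowest is the leader, and each client watches the name just below its own. A small sketch with made-up node names:</p>

```python
def election_order(znode_names):
    """Given sequential znode names like 'n-0000000003', return the
    leader and a map of each znode to the znode it should watch."""
    ranked = sorted(znode_names, key=lambda n: int(n.rsplit("-", 1)[1]))
    leader = ranked[0]  # lowest sequence number wins
    # Each client watches only its immediate predecessor, avoiding a
    # herd of watchers all waking up when the leader znode disappears.
    watches = {ranked[i]: ranked[i - 1] for i in range(1, len(ranked))}
    return leader, watches
```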

<p>A combination of ephemeral and sequential znodes provides a primitive for leader election: each client creates an ephemeral sequential znode and watches the znode with the next-lower sequence number; when that znode disappears, the client holding the lowest remaining sequence number becomes the leader.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This is my attempt to understand Apache Zookeeper, I will write this post for my future reference like coming back to it after 6 months.]]></summary></entry><entry><title type="html">Useful blogs and articles for reference</title><link href="https://mubashirusman.github.io/career/sre/2026/03/07/sre-resources.html" rel="alternate" type="text/html" title="Useful blogs and articles for reference" /><published>2026-03-07T13:30:00+00:00</published><updated>2026-03-07T13:30:00+00:00</updated><id>https://mubashirusman.github.io/career/sre/2026/03/07/sre-resources</id><content type="html" xml:base="https://mubashirusman.github.io/career/sre/2026/03/07/sre-resources.html"><![CDATA[<p>I often read something online and want to revisit it later. Sometimes I bookmark it, but that hasn’t proved very useful: I need to remember why I was reading it, which is sometimes obvious but often not, and I can’t quickly search my bookmarks. So I will link some useful resources here.</p>

<h2 id="sre-interview-prep">SRE Interview Prep</h2>
<ul>
  <li>A comprehensive SRE preparation guide <a href="https://underpaid.medium.com/i-received-sre-offers-from-facebook-and-google-without-a-university-degree-here-is-how-224f06b49e7d">Underpaid’s Google SRE preparation guide</a></li>
  <li>Interview questions <a href="https://syedali.net/engineer-interview-questions/">syedali sre questions</a></li>
  <li><a href="https://zedas.fr/posts/linux-explained-7-memory-management/">Linux memory management</a></li>
  <li><a href="https://linux-audit.com/understanding-memory-information-on-linux-systems/">Understanding memory information in Linux</a></li>
  <li><a href="https://catonmat.net/my-job-interview-at-google">SRE Job Interview</a></li>
  <li>Github repository with many resources <a href="https://github.com/mxssl/sre-interview-prep-guide?tab=readme-ov-file">SRE Interview prep guide</a></li>
  <li>Meta Production engineer <a href="https://www.quora.com/profile/Rishi-Shah-11/answers">Rishi Shah posts</a></li>
  <li>Facebook production engineer <a href="https://azalio.wordpress.com/2016/05/29/facebook-production-engineer/">Interview content</a></li>
  <li>Getting a job as SRE <a href="https://fabrizio2210.medium.com/how-i-get-a-job-at-google-as-sre-83d44aef7859">SRE at Google</a></li>
  <li>Linux Performance tools <a href="https://netflixtechblog.com/netflix-at-velocity-2015-linux-performance-tools-51964ddb81cf">Netflix blog</a></li>
  <li><a href="https://www.brendangregg.com/Articles/Netflix_Linux_Perf_Analysis_60s.pdf">Linux performance analysis in 600 seconds</a></li>
  <li>Short reads about SRE <a href="https://s905060.gitbooks.io/site-reliability-engineer-handbook/content/">SRE Handbook</a></li>
  <li>Google onsite interview <a href="https://bala-krishnan.com/posts/5-google-sre-onsite/">Systems SRE role</a></li>
  <li>I got an offer <a href="https://igotanoffer.com/blogs/tech/google-site-reliability-engineer-interview#linux">SRE Interview questions</a></li>
  <li>Interview topics <a href="https://blog.ndk.name/preparing-for-the-sre-technical-interview/">Technical interview</a></li>
  <li>Page Cache <a href="https://biriukov.dev/docs/page-cache/0-linux-page-cache-for-sre/">Deep dives into page cache</a></li>
  <li><a href="https://itnext.io/from-rss-to-wss-navigating-the-depths-of-kubernetes-memory-metrics-4d7d77d8fdcb">From RSS to WSS, Kubernetes memory metrics</a></li>
  <li>Memory management <a href="https://landley.net/writing/memory-faq.txt">Deep dive in memory management</a></li>
  <li>Bash <a href="https://www.shellscript.sh/">The Shell Scripting Tutorial</a></li>
  <li>HTTP <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Overview">Overview of HTTP</a></li>
  <li><a href="https://danluu.com/navigate-url/">What happens when you type a URL</a></li>
  <li>Cgroups <a href="https://andreaskaris.github.io/blog/linux/cgroups/">Guide to cgroups</a></li>
  <li><a href="https://fabiokung.com/2014/03/13/memory-inside-linux-containers/">Memory inside Linux containers</a></li>
  <li>Terminals <a href="https://kevroletin.github.io/terminal/2021/12/11/how-terminal-works-in.html">How terminal works</a></li>
  <li>Boot process <a href="https://opensource.com/article/17/2/linux-boot-and-startup">Linux boot process</a></li>
  <li>Sockets <a href="https://ops.tips/blog/how-linux-creates-sockets/#where-to-look-for-the-list-of-sockets-in-my-system">Deep dive in socket system call</a></li>
  <li><a href="https://manybutfinite.com/post/anatomy-of-a-program-in-memory/">Anatomy of a program in memory</a></li>
  <li><a href="https://linuxhint.com/understanding_numa_architecture/?source=post_page-----9492b0f267e7---------------------------------------">Understanding NUMA</a></li>
  <li><a href="https://medium.com/swlh/linux-basics-static-libraries-vs-dynamic-libraries-a7bcf8157779">Libraries in Linux</a></li>
</ul>

<h2 id="coding">Coding</h2>
<ul>
  <li><a href="https://www.techinterviewhandbook.org/algorithms/string/">String cheatsheet for coding problems</a></li>
  <li><a href="https://leetcode.com/studyplan/top-interview-150/">Leetcode 150</a></li>
  <li><a href="https://www.pythonmorsels.com/pointers/">Python variables and objects</a></li>
  <li><a href="https://training.talkpython.fm/courses/python-concurrency-deep-dive">Python async techniques</a></li>
  <li><a href="https://runestone.academy/ns/books/published/pythonds3/index.html">Problem solving with DSA</a></li>
  <li><a href="https://pgexercises.com/gettingstarted.html">POSTGRES Exercises</a></li>
  <li><a href="https://training.talkpython.fm/courses/python-jumpstart-project-based-course">Build example apps</a></li>
  <li><a href="https://go.dev/tour/list">Tour of Go</a></li>
  <li><a href="https://medium.com/coderbyte/learn-by-doing-the-8-best-interactive-coding-websites-4c902915287c">Some websites to practice programming</a></li>
</ul>

<h2 id="system-design">System Design</h2>
<ul>
  <li>Hello Interview <a href="https://www.hellointerview.com/learn/system-design/in-a-hurry/introduction">System design in a hurry</a></li>
  <li>Distributed Systems Theory for engineers (!scientists) <a href="https://www.the-paper-trail.org/post/2014-08-09-distributed-systems-theory-for-the-distributed-systems-engineer/">Paper Trail</a></li>
  <li>Github system design <a href="https://github.com/karanpratapsingh/system-design">Big tutorial for refreshing design and components</a></li>
  <li>Blog <a href="https://jg.gg/2016/07/31/architecture-and-systems-design-interview/">Architecture and System Design</a></li>
</ul>

<h2 id="pratice-linux">Pratice Linux</h2>
<ul>
  <li>Challenge games <a href="https://overthewire.org/wargames/bandit/">OverTheWire Bandit</a></li>
  <li>Hands on Problems <a href="https://sadservers.com/">Sad servers</a></li>
</ul>

<h2 id="tutorials">Tutorials</h2>
<ul>
  <li>Tutorial Shell <a href="https://www.redhat.com/en/blog/linux-shell-redirection-pipelining">Shell redirection</a></li>
  <li>Tutorial <a href="https://copyconstruct.medium.com/socat-29453e9fc8a6">socat</a></li>
  <li>Tutorial <a href="https://www.linuxhowtos.org/Network/netstat.htm">netstat</a></li>
  <li>Tutorial GDB <a href="https://developers.redhat.com/articles/the-gdb-developers-gnu-debugger-tutorial-part-1-getting-started-with-the-debugger#">Getting started GDB</a></li>
  <li>Tutorial <a href="https://ruslanspivak.com/lsbaws-part1/">Build a web server</a></li>
  <li>Tutorial <a href="https://progbook.org/httpserv.html#http">Writing http server</a></li>
  <li>Tutorial Lsof <a href="https://www.akadia.com/services/lsof_quickstart.txt">A Quick Start for Lsof</a></li>
  <li>Tutorial Strace <a href="https://medium.com/@adminstoolbox/debugging-using-strace-efda7d65be1d">Debugging using strace</a></li>
  <li>Tutorial Sockets <a href="https://www.digitalocean.com/community/tutorials/understanding-sockets">Understanding Sockets</a></li>
  <li>Tutorial /proc <a href="https://tanelpoder.com/2013/02/21/peeking-into-linux-kernel-land-using-proc-filesystem-for-quickndirty-troubleshooting/">proc for troubleshooting</a></li>
  <li>Tutorial container <a href="https://www.redhat.com/en/topics/containers/whats-a-linux-container">What is a Linux container?</a></li>
  <li>Tutorial <a href="https://medium.com/@jyjimmylee/how-does-memory-mapping-mmap-work-c8a6a550ba0d">how mmap works</a></li>
</ul>

<h2 id="nice-blogs">Nice Blogs</h2>
<ul>
  <li><a href="https://www.siddharthkannan.in/">Siddharth K resume</a></li>
  <li><a href="https://mattrighetti.com/2023/10/25/i-rewrote-my-cv-in-typst">CV in typst</a></li>
  <li><a href="https://trstringer.com/slo-adding-nines/">Short post about adding 9s to SLOs</a></li>
  <li><a href="https://blog.alicegoldfuss.com/how-to-get-into-sre/">How to be an SRE</a></li>
  <li><a href="https://devopsbootcamp.osuosl.org/start-here.html">SRE bootcamp</a></li>
  <li><a href="https://rednafi.com/misc/tinkering-with-unix-domain-socket/">Unix Sockets</a></li>
  <li><a href="https://blog.relyabilit.ie/sre-in-the-real-world/">Real world SRE</a></li>
  <li><a href="https://boringtechnology.club/">Boring technology</a></li>
  <li><a href="https://brooker.co.za/blog/">Technical Topics</a></li>
</ul>

<h2 id="nice-books">Nice Books</h2>
<ul>
  <li><a href="https://book.systemsapproach.org/">Computer networks: A systems approach</a></li>
</ul>]]></content><author><name></name></author><category term="career" /><category term="sre" /><summary type="html"><![CDATA[I often read something online and want to revisit it in future, sometimes I bookmark it but that also hasn’t proved to be very useful. The problem with bookmarking is that I need to remember what I was reading there, sometimes it obvious but many times its not. Also I can’t quickly search in bookmarks. So I will link some useful resources here.]]></summary></entry><entry><title type="html">About SWIM protocol</title><link href="https://mubashirusman.github.io/distributed-systems/system-design/2026/03/07/membership-protocols.html" rel="alternate" type="text/html" title="About SWIM protocol" /><published>2026-03-07T13:30:00+00:00</published><updated>2026-03-07T13:30:00+00:00</updated><id>https://mubashirusman.github.io/distributed-systems/system-design/2026/03/07/membership-protocols</id><content type="html" xml:base="https://mubashirusman.github.io/distributed-systems/system-design/2026/03/07/membership-protocols.html"><![CDATA[<p>SWIM (Scalable, Weakly-Consistent, Infection-Style, Processes Group Membership Protocol) is a membership protocol, which is used in distributed systems to answer this: <em>who are my peers?</em> That means it should do the failure detection and update the peers to only keep the healthy nodes.<br />
The scalable in the name implies that it can handle an increased system size without degrading performance. We build distributed systems in large environments because scalability is needed, which means thousands of machines could be in the cluster.<br />
Gossip protocols work like how people gossip in a society: someone talks to only a few people to share information, those few people talk to others, and soon the whole society knows about it. That’s how nodes communicate, each sending messages to only a subset of its peers; Infection-Style in the name implies it’s a gossip protocol.<br />
Weakly consistent means that after some amount of time all replicas will agree on the same value, where <em>some</em> is an undefined amount of time.</p>

<h2 id="detecting-failure">Detecting Failure</h2>
<ul>
  <li><code class="language-plaintext highlighter-rouge">T</code> is the protocol period</li>
  <li><code class="language-plaintext highlighter-rouge">k</code> is the number of nodes in the failure-detection group.
A node <code class="language-plaintext highlighter-rouge">A</code> sends a <code class="language-plaintext highlighter-rouge">PING</code> message to node <code class="language-plaintext highlighter-rouge">B</code>. If the node replies with <code class="language-plaintext highlighter-rouge">ACK</code>, no further action is needed; but if node <code class="language-plaintext highlighter-rouge">B</code> does NOT reply before the timeout (less than <code class="language-plaintext highlighter-rouge">T</code>), then <code class="language-plaintext highlighter-rouge">A</code> marks node <code class="language-plaintext highlighter-rouge">B</code> suspicious, selects some arbitrary <code class="language-plaintext highlighter-rouge">k</code> peers, and asks them to ping node <code class="language-plaintext highlighter-rouge">B</code> on behalf of <code class="language-plaintext highlighter-rouge">A</code>. This is an <code class="language-plaintext highlighter-rouge">indirect ping</code>. If none of the <code class="language-plaintext highlighter-rouge">k</code> nodes receives an <code class="language-plaintext highlighter-rouge">ACK</code>, the node is marked <code class="language-plaintext highlighter-rouge">dead</code>. This keeps the number of messages sent at <code class="language-plaintext highlighter-rouge">O(N)</code>.</li>
</ul>
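<p>A minimal Python sketch of this probe logic. The function and parameter names (<code class="language-plaintext highlighter-rouge">probe</code>, <code class="language-plaintext highlighter-rouge">direct_ping</code>, <code class="language-plaintext highlighter-rouge">indirect_ping</code>) are illustrative, not from any real SWIM implementation, and timeouts are abstracted into the callables:</p>

```python
import random

def probe(target, peers, direct_ping, indirect_ping, k=3):
    """Classify `target` using SWIM-style direct and indirect pings.

    `direct_ping(target)` and `indirect_ping(helper, target)` are
    callables that return True when an ACK arrives before the timeout.
    """
    # First try a direct PING within the protocol period T.
    if direct_ping(target):
        return "alive"
    # No ACK: pick k arbitrary peers and ask them to ping on our behalf.
    helpers = random.sample(peers, min(k, len(peers)))
    if any(indirect_ping(h, target) for h in helpers):
        return "alive"
    # Neither we nor any helper heard back: mark the node dead.
    return "dead"
```

A node is only marked dead when both the direct and all indirect probes fail, which protects against a flaky link between two particular nodes.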

<h2 id="information-dissenminating">Information dissemination</h2>
<ul>
  <li><code class="language-plaintext highlighter-rouge">JOINED</code> is sent by a node <code class="language-plaintext highlighter-rouge">P</code> when it joins, to inform the network of its membership.</li>
  <li><code class="language-plaintext highlighter-rouge">FAILED</code> is sent to peers when a node failure is detected by the above process.
These messages are sent piggybacked on the <code class="language-plaintext highlighter-rouge">PING/ACK</code> messages to the peers, which makes communication efficient: the information spreads through the group in <code class="language-plaintext highlighter-rouge">O(log(N))</code> time.</li>
</ul>]]></content><author><name></name></author><category term="distributed-systems" /><category term="system-design" /><summary type="html"><![CDATA[SWIM (Scalable, Weakly-Consistent, Infection-Style, Processes Group Membership Protocol) is a membership protocol, which is used in distributed systems to answer this: who are my peers? That means it should do the failure detection and update the peers to only keep the healthy nodes. The scaleable in the name implies that it can handle increased size of the system without degrading performance. We build distributed systems in large environments, because scalability is needed. This means thousands of machines could be the in the cluster. Gossip protocols work like how people gossip in a society, talking to only few people to share information and then those few people talk to others and then the whole society knows about it. That’s hownodes communicate with subset of their total peers to send messages, Infection-Style in the name implies its a gossip protocol. Weekly consistent means that after some amount of time, all replicas will agree on the same value, where some is undefined amount of time.]]></summary></entry><entry><title type="html">Concurrency Concepts And HTTP Server</title><link href="https://mubashirusman.github.io/programming/python/2026/03/05/multithreading-concurrency-concepts.html" rel="alternate" type="text/html" title="Concurrency Concepts And HTTP Server" /><published>2026-03-05T00:30:20+00:00</published><updated>2026-03-05T00:30:20+00:00</updated><id>https://mubashirusman.github.io/programming/python/2026/03/05/multithreading-concurrency-concepts</id><content type="html" xml:base="https://mubashirusman.github.io/programming/python/2026/03/05/multithreading-concurrency-concepts.html"><![CDATA[<h2 id="concurrency">Concurrency</h2>
<p>It’s a way of fragmenting code so that individual fragments can be run independently to reach the same result.<br />
E.g. to take the average of x1, x2, x3…xn, we can divide the numbers into two segments, compute the partial sums s1 = sum(x1, x2…xm) and s2 = sum(xm+1, xm+2…xn) along with the counts c1 and c2, and then combine them as (s1+s2)/(c1+c2). The fragments can run on different CPU cores at the same time (parallel execution) or share the same CPU (time-sliced execution).</p>
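<p>A small sketch of this idea with <code class="language-plaintext highlighter-rouge">concurrent.futures</code>, splitting the list into two fragments whose partial sums are computed independently (the names <code class="language-plaintext highlighter-rouge">partial</code> and <code class="language-plaintext highlighter-rouge">average</code> are made up for the example):</p>

```python
from concurrent.futures import ThreadPoolExecutor

def partial(nums):
    # each fragment returns its (sum, count)
    return sum(nums), len(nums)

def average(nums, workers=2):
    # split the data in two; each half is an independent fragment
    mid = len(nums) // 2
    chunks = [nums[:mid], nums[mid:]]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(partial, chunks))
    # combine: (s1 + s2) / (c1 + c2)
    total = sum(s for s, _ in results)
    count = sum(c for _, c in results)
    return total / count
```

Whether the two fragments actually run in parallel or are time-sliced is up to the scheduler; the result is the same either way.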

<h2 id="http-server">HTTP Server</h2>
<p>An HTTP server does the following:</p>
<ol>
  <li>Create a TCP/stream socket</li>
  <li>Bind a name (address) to this socket object</li>
  <li>Start listening, which is to wait for incoming connections</li>
  <li>When a connection comes, accept the connection and start sharing HTTP messages</li>
  <li>Close the connection</li>
  <li>Repeat from step 3</li>
</ol>

<p>Before writing any code, we need to know what a socket is: it’s an endpoint for communication between processes, on the same machine or across a network. There are <em>client</em> sockets and <em>server</em> sockets. A client socket sends a request and receives a reply; after it exchanges some messages, the client socket is destroyed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import socket

serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(('127.0.0.1', 8765))
serversocket.listen(5)  # max number of queued connection requests (backlog)
</code></pre></div></div>
</code></pre></div></div>
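<p>Putting steps 1–5 together, a minimal sketch that serves exactly one connection. The function name <code class="language-plaintext highlighter-rouge">serve_once</code>, the port, and the response body are arbitrary choices for illustration:</p>

```python
import socket

def serve_once(port=8765):
    # steps 1-3: create a TCP socket, bind an address, start listening
    serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    serversocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    serversocket.bind(('127.0.0.1', port))
    serversocket.listen(5)
    # step 4: accept a connection and exchange HTTP messages
    clientsocket, address = serversocket.accept()
    request = clientsocket.recv(1024)
    clientsocket.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    # step 5: close the connection (a real server loops back to accept)
    clientsocket.close()
    serversocket.close()
    return request
```

A real server would wrap steps 4–5 in a loop instead of returning after the first client.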
<p>Now we can enter the mainloop for accepting connections, we can get the client socket from the <code class="language-plaintext highlighter-rouge">accept</code> and fulfill the request.</p>]]></content><author><name></name></author><category term="programming" /><category term="python" /><summary type="html"><![CDATA[Concurrency Its a way of fragmanting code so that individual fragmants can be run independently to reach the same result. E.g for taking an average of x1,x2,x3….xn, we can do it by dividing all numbers in two segments like s1 = sum(x1, x2, x3….xm)/c1 and s2 = sum(xm1, xm2…xn)/c2 and then doing something like (s1+s2)/(c1+c2). This can be done by running fragmants on different cpu cores at the same time (parallel execution) or by sharing the same cpu (time-sliced execution).]]></summary></entry><entry><title type="html">Python programming tips using standard library</title><link href="https://mubashirusman.github.io/programming/python/2025/12/25/python-dataclasses.html" rel="alternate" type="text/html" title="Python programming tips using standard library" /><published>2025-12-25T00:30:20+00:00</published><updated>2025-12-25T00:30:20+00:00</updated><id>https://mubashirusman.github.io/programming/python/2025/12/25/python-dataclasses</id><content type="html" xml:base="https://mubashirusman.github.io/programming/python/2025/12/25/python-dataclasses.html"><![CDATA[<h1 id="python-tips-from-standard-library-for-my-reference">Python tips from standard library for my reference</h1>

<h2 id="dataclasses">Dataclasses</h2>
<p>Dataclasses is essentially a code generator. It helps us avoid writing boilerplate and repetitive code. Following are important use cases.</p>
<ol>
  <li>We don’t need to write an <code class="language-plaintext highlighter-rouge">__init__</code> method if the class is decorated with <code class="language-plaintext highlighter-rouge">@dataclass</code>. For example:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass
class Circle:
    x: int = 0
    y: int = 0
    radius: int = 1
</code></pre></div>    </div>
    <p>Even though we defined class-level variables, they will act as if they were instance variables.</p>
  </li>
  <li>
    <p>It also implements the <code class="language-plaintext highlighter-rouge">__repr__</code> method automatically. And if instance variables need to be made immutable, we can pass the <code class="language-plaintext highlighter-rouge">frozen</code> parameter to dataclass. This will also implement the <code class="language-plaintext highlighter-rouge">__hash__</code> method and make the objects hashable, which means we can use them as keys in a dictionary.
<code class="language-plaintext highlighter-rouge">@dataclass(frozen=True)</code></p>
  </li>
  <li>We can set <code class="language-plaintext highlighter-rouge">order=True</code> to generate the less-than/greater-than comparison methods (equality comes from <code class="language-plaintext highlighter-rouge">eq=True</code>, which is on by default).</li>
</ol>
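<p>A small illustration of points 2 and 3 above. The <code class="language-plaintext highlighter-rouge">Point</code> class is a made-up example:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Point:
    x: int = 0
    y: int = 0

# frozen=True makes instances immutable and hashable, so they work as dict keys
lookup = {Point(0, 0): "origin"}

# order=True generates comparisons based on field order (x first, then y)
assert Point(0, 0) < Point(1, 2)
assert lookup[Point(0, 0)] == "origin"
```

Two separately created but equal instances hash the same, which is exactly what dictionary keys require.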

<h2 id="type-hinting">Type Hinting</h2>
<ol>
  <li>Even though CPython completely ignores the variable types set by type hinting, type hinting is useful in many other ways: for documentation, and libraries such as Pydantic and dataclasses rely on type hints.</li>
</ol>

<p>Example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def func(a: dict, b: list, c: bool = True) -&gt; str:
    return f"a= {a}, b={b}"
</code></pre></div></div>

<ol>
  <li>As a parameter in Python can accept arguments of different types, we can type-hint it like this:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Union
def funcm(a: Union[str, int], b: int) -&gt; Union[str, int]:
    return a*b
</code></pre></div>    </div>
    <p>OR, using the <code class="language-plaintext highlighter-rouge">|</code> syntax available from Python 3.10:</p>
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def funcm(a: str | int, b: int) -&gt; str | int:
    return a*b
</code></pre></div>    </div>
  </li>
  <li>If an argument could be passed as a specific type or could be None (note that this does NOT make the argument optional), one can use <code class="language-plaintext highlighter-rouge">Optional</code> to specify this:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Optional
def funco(a: Optional[int]) -&gt; None:
    pass
</code></pre></div>    </div>
  </li>
  <li>For containers like lists/tuples/dicts, we can use generics from the typing module:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import List
def funcg(a: List[float]) -&gt; List[int]:
    return [int(i) for i in a]
</code></pre></div>    </div>
  </li>
  <li>And for functions and generators:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Callable, Any, Sequence, Iterator
def funcf(func: Callable[[Any], Any], sequence: Sequence[Any]) -&gt; Iterator[str]:
    for i in sequence:
        yield str(func(i))
</code></pre></div>    </div>
  </li>
</ol>

<p>NOTE: From Python 3.9 onwards, many generics in the <code class="language-plaintext highlighter-rouge">typing</code> module are being deprecated in favour of their counterparts in other modules like <code class="language-plaintext highlighter-rouge">collections.abc</code>.</p>

<h2 id="threading">Threading</h2>
<p>In Python, threads are not run in parallel; they run one at a time due to the global interpreter lock (GIL). But they can still be helpful for I/O-bound tasks that spend most of their time waiting, because even a single CPU can do other work instead of idling until a slow task finishes.
<code class="language-plaintext highlighter-rouge">threading</code> allows us to <code class="language-plaintext highlighter-rouge">start</code> up as many threads as we want and then <code class="language-plaintext highlighter-rouge">join</code> them later.</p>

<p>Example: downloading files from an API:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from threading import Thread

threads = []
urls = [url1, url2, url3, url4, url5]

for url in urls:
    t = Thread(target=download_file, args=(url,))
    t.start()  # start() actually starts the target function
    threads.append(t)

[t.join() for t in threads]  # join() returns None
</code></pre></div></div>

<p>The threading module does not help with managing a pool of threads, like how many threads we want to create etc., so it is better to use something that does that automatically. <code class="language-plaintext highlighter-rouge">concurrent.futures</code> helps with this. It also provides a context manager for cleanup of resources afterwards.
<code class="language-plaintext highlighter-rouge">concurrent.futures.ThreadPoolExecutor.submit</code> creates a thread and gives back a future that will hold the thread’s result, and <code class="language-plaintext highlighter-rouge">concurrent.futures.as_completed</code> yields the futures as they complete, so <code class="language-plaintext highlighter-rouge">result()</code> can collect each outcome.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># set max_workers to the number of threads needed
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = []
    for url in urls:
        future = executor.submit(download_file, url)
        futures.append(future)

    for future in concurrent.futures.as_completed(futures):
        try:
            url, status_code = future.result()
        except Exception as err:
            print(f"Task failed.. {err}")
</code></pre></div></div>
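<p>A self-contained, runnable version of the pattern above, with a stand-in <code class="language-plaintext highlighter-rouge">work</code> function in place of a real download (the function and its fake status code are made up for the example):</p>

```python
import concurrent.futures

def work(n):
    # stand-in for an I/O-bound task such as downloading a file;
    # returns the input plus a fake "status code"
    return n, 200

results = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(work, n) for n in range(5)]
    for future in concurrent.futures.as_completed(futures):
        try:
            n, status_code = future.result()
            results[n] = status_code
        except Exception as err:
            print(f"Task failed.. {err}")
```

Note that <code class="language-plaintext highlighter-rouge">as_completed</code> yields futures in completion order, not submission order, which is why the results are collected into a dict keyed by the task’s own identifier.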

<p>Be aware if threads are interacting with the same resources. In that case, use a lock to block other threads and do your thing, but the downside is that the program will be running in sequential manner within that lock period :).</p>]]></content><author><name></name></author><category term="programming" /><category term="python" /><summary type="html"><![CDATA[Python tips from standard library for my reference]]></summary></entry><entry><title type="html">Is Instagram Needed?</title><link href="https://mubashirusman.github.io/social/media/2025/12/24/is-instagram-needed.html" rel="alternate" type="text/html" title="Is Instagram Needed?" /><published>2025-12-24T23:30:20+00:00</published><updated>2025-12-24T23:30:20+00:00</updated><id>https://mubashirusman.github.io/social/media/2025/12/24/is-instagram-needed</id><content type="html" xml:base="https://mubashirusman.github.io/social/media/2025/12/24/is-instagram-needed.html"><![CDATA[<p>Every time I open instagram, the algorithm servers me with the warmth of an amusing reel or some’s friend’s full-page picture. To the user it is a sense of “missed connectedness” that needs to replenished. But the question is, do I need to know what my friends did on the weekend, or even more fundamental one, am I even friends with them? These simple questions are difficult to answer because of how our lives have been shaped by the Instagram and Facebook. We know as a fact that the people who we are connected with do not need Instagram story to keep us “in touch”. In fact, the more closely we know someone the less we would need Instagram or anything similar. So the Instagram’s slogan, “bringing you closer to the people you love” is superficial and misleading to say the least. It has been found out in many studies that people feel insecure about their body and have low self esteem, this is particularly more prevalent among teenagers. 
What I have realized is this: our brain feels rewarded by seeing other people’s likes on the picture we post, and the reward is enough to post again. This sense of appreciation encourages the user to post more to get the same feeling, but each time the impact is smaller, so the user needs to post even more to reach the same level; it eventually turns into an addiction. This is how social media influencers are born. Once a ‘normal user’ starts posting, they get incentivized by the platform and eventually start posting consistently. Scrolling Instagram’s feed feels worse because of the unintended comparisons our mind makes with the lives of celebrities. This puts one at a crossroads about whether to use the platform at all.
Excluding celebrity content, Instagram posts can be divided into two categories.</p>
<ol>
  <li>To show excitement
This is the type where a person is truly excited about what’s happening in their life and shares their life’s moments online. This excitement is subjective; what someone achieved over a long time could be something someone else was born with, so it is left to the audience of the post to judge. Anything posted out of excitement falls here. For example, I once excitedly shared a post on Facebook about getting good grades in college.</li>
  <li>To brag
This is the type where the poster wants to boast that something has happened in their life and now it’s time to “show off”. We as social creatures want to be seen when we do something. But to do so on Instagram, you have to post pictures ONLY, and edited ones at that. For example, a picture of a train plus a filter will make it look like it came straight out of a movie scene.
Instagram is designed in a way that leaves very little room for the caption, which in turn discourages a user from writing beyond a couple of words. Most of the audience will either not comment, or even if they do, there is nothing more to say than to praise the poster/cameraman. And the poster expects the same: to be praised for their high-resolution image. Practically there is no room for discussion. In my humble opinion, all posts on Instagram should be auto-captioned as <strong>boasting edited pictures</strong>. I allege that most of the posts on Facebook/Instagram are in this category.</li>
</ol>]]></content><author><name></name></author><category term="social" /><category term="media" /><summary type="html"><![CDATA[Every time I open instagram, the algorithm servers me with the warmth of an amusing reel or some’s friend’s full-page picture. To the user it is a sense of “missed connectedness” that needs to replenished. But the question is, do I need to know what my friends did on the weekend, or even more fundamental one, am I even friends with them? These simple questions are difficult to answer because of how our lives have been shaped by the Instagram and Facebook. We know as a fact that the people who we are connected with do not need Instagram story to keep us “in touch”. In fact, the more closely we know someone the less we would need Instagram or anything similar. So the Instagram’s slogan, “bringing you closer to the people you love” is superficial and misleading to say the least. It has been found out in many studies that people feel insecure about their body and have low self esteem, this is particularly more prevalent among teenagers. From what I realized is this, our brain feels rewarded by seeing the likes of other people on the picture we post, the reward is enough to post again. This sense of appreciation encourages the user to post more to get the same feeling, but this time its impact will be less so the user will need to post even more to reach the same level, it eventually turns into an addiction. This is how social media influencers are born. Once a ‘normal user’ starts posting, it gets incentivized by the platform and eventually starts posting consistently. It feels worse after scrolling the instagram’s feed because of the unintended comparison by seeing the lives of celebrities our mind makes. This puts one on the crossroads to even use the platform. Excluding the celebrity’s content, the instagram posts can be divided in two categories. 
To show excitement This is the type where a person is truly excited about what’s happening in their life and shared their life’s moments online. This excitement is subjective, what someone has achieved in a long time could be someone was born with, thus this is left to the audience of the posts to judge. Anything posted out of excitement would fall into this. For example, I once excitedly shared the post of me getting good grades in college on Facebook. To make brag This is the type where the poster wants to boast that something has happened in their life, now its the time to “show off”. We as social creatures want to be seen when we do something. But to do so on instagram, you have to post pictures ONLY and edited ones. For example, a picture of a train + filter will render it completely out of a movie scene. Instagram is designed in a way that the platform leaves very little room for the caption, which in turn discourages a user to write beyond couple of words. Most audience will either not comment or even if they do, there is nothing more to say than praise the poster/cameramen. And the poster expects the same, to be praised for their high resolution image. Practically there is no room for discussion. In my humble opinion, all posts on instagram should be auto-captioned as boasting edited pictures. I allege that most of the posts on Facebook/Instagram are in this category.]]></summary></entry><entry><title type="html">What it takes to have minimum reliability?</title><link href="https://mubashirusman.github.io/sre/2025/12/24/what-is-minimum-reliability.html" rel="alternate" type="text/html" title="What it takes to have minimum reliability?" 
/><published>2025-12-24T23:30:20+00:00</published><updated>2025-12-24T23:30:20+00:00</updated><id>https://mubashirusman.github.io/sre/2025/12/24/what-is-minimum-reliability</id><content type="html" xml:base="https://mubashirusman.github.io/sre/2025/12/24/what-is-minimum-reliability.html"><![CDATA[<p><strong>This architecture is for web-applications and for my own reference.</strong>
Some sane choices to make:</p>
<ol>
  <li>Have <strong>public and private subnets</strong> for the infrastructure. This means that the application server, database, container registry, object storage,
logging, and monitoring should be inside the <em>private network</em>. A firewall should block any
access from outside to this subnet; in AWS, security groups can serve this purpose.
A load balancer should be in the public subnet.</li>
  <li>The <strong>database</strong> should be backed up regularly, and for this it is important not to let a
single database instance be overloaded. Instead there should be a secondary instance
to take backups from. These backups should be tested by restoring them.</li>
  <li><strong>Logging</strong> is important for two reasons: to debug after an incident, and to keep track of service events in order to improve the service.
A centralized logging solution like the ELK stack should be set up, and applications should be configured to ship their logs to it.</li>
  <li>A <strong>continuous integration and delivery</strong> pipeline is the backbone for quickly testing, releasing to production, and rolling back.
For each service, separate branches should be configured to keep the production code separate from test environments.
Once code is tested, it should run in a <em>before-production</em> environment. Make sure that ONLY ONE change is there, and until this
change is released, the before-production environment should remain occupied; this ensures changes are tested
and keeps the history clean. If the deployment here is unsuccessful, it is time to go back to testing.</li>
  <li><strong>Infrastructure automation</strong> is critical. In a cloud environment, Terraform is my favorite and also an industry standard.
It lets you define your <em>infrastructure as code</em>. Terraform is also declarative in nature, which means you
define the end state you want and it takes the current state into account, instead of you spelling out how to get there (an imperative definition). Ideally applications should be able to run on stateless servers, which effectively
means that we can deploy identical servers, and as many of them as we want. This is the benefit of immutable deployments/containers.
With respect to Terraform, one should templatize the code with variables and modules so it can be reused across applications, and set up a remote state. Lastly, remember that manually deploying infrastructure does not scale, for a lot of reasons.</li>
  <li><strong>Configuration</strong></li>
</ol>]]></content><author><name></name></author><category term="sre" /><summary type="html"><![CDATA[This architecture is for web-applications and for my own reference. Some sane choices to make: Have public and private subnets for the infrastructure, This means that application server, database, container registry, object storage, logging, monitoring should be inside the private network. A firewall should block any access from outside to this subnet. In AWS, there is a concept of security groups that can work here. A load balancer should be in the public subnet. Database should be backed up regularly, and for this its important to not let the single instance of the database to be overloaded. Instead there should be a secondary instance to take backups from. These backups should be tested by restoring. Logging is important for two reasons: to debug after an incident, to keep track of service events and improving it. A centralized logging solution like ELK stack should be setup. Applications should be configured to collect their logs. A continous integration and delivery pipeline is the backbone for quickly testing, releasing in production and rollbacks. For one service, separate branches should be configured to keep the production code separate from test environments. Once code is tested, it should be run in a before-production environment, make sure that ONLY ONE change is here, and until this change is released, before-production environment should remain occupied, this will ensure changes to be tested and keep the history clean. If the deployment here is unseccessful, its time to go back to testing. Infrastructure automation is critical. In a cloud environment, Terraform is my favorite and also an industry standard. It lets you define your infrastructure as code. 
Also Terraform is declarative in nature, which means it lets you define what you want at the end and takes into the account the current state, instead of how you should go about to achieve that (imperative definition). Ideally applications should be able to run on stateless servers which effectively means that we can deploy identical servers and as many of them as we want. This is the benefit of immutable deployments/containers. With respect to Terraform, one should templatize the code as variables and modules to reuse for multiple applications and setup a remote state. Lastly, remember manually deploying infrastructure does not scale for a lot of reasons. Configuration]]></summary></entry><entry><title type="html">What Programming Language Should You Learn?</title><link href="https://mubashirusman.github.io/programming/2024/02/10/what-programming-language.html" rel="alternate" type="text/html" title="What Programming Language Should You Learn?" /><published>2024-02-10T08:20:47+00:00</published><updated>2024-02-10T08:20:47+00:00</updated><id>https://mubashirusman.github.io/programming/2024/02/10/what-programming-language</id><content type="html" xml:base="https://mubashirusman.github.io/programming/2024/02/10/what-programming-language.html"><![CDATA[<p>Programming languages vary in how much abstraction they offer to the programmer. 
This ranges from no abstraction, referred to as low level like C, to high abstraction 
which are called high level languages like Python.
Understanding the spectrum of programming languages is crucial for any developer.
Low-level languages like assembly give you direct control over hardware, while 
high-level languages like Python abstract away the complexity.</p>
<h4 id="factors-to-consider">Factors to Consider:</h4>
<ol>
  <li>Your career goals and target industry</li>
  <li>The type of projects you want to work on</li>
  <li>Learning curve and time investment</li>
  <li>Community support and job market demand</li>
</ol>

<p>Beginners are often engaged in a discussion of which language to learn. The short answer is to pick any of the popular languages, like C, JavaScript, Python, Java, etc. If you did not like that answer, then you should do some research on the history of their creation. They all have interesting features. You can build interesting things with all of them, but some have great libraries for one thing that others lack. For instance, Python has a rich ecosystem for machine learning and data science, while Java has comprehensive tools for server-side programming. At the start, you should choose yours and start building projects; rest assured, you won’t be at a disadvantage for learning one over the other.</p>]]></content><author><name></name></author><category term="programming" /><summary type="html"><![CDATA[Programming languages vary in how much abstraction they offer to the programmer. This ranges from no abstraction, referred to as low level like C, to high abstraction which are called high level languages like Python. Understanding the spectrum of programming languages is crucial for any developer. Low-level languages like assembly give you direct control over hardware, while high-level languages like Python abstract away the complexity. Factors to Consider: Your career goals and target industry The type of projects you want to work on Learning curve and time investment Community support and job market demand]]></summary></entry></feed>