Apache Zookeeper

This is my attempt to understand Apache Zookeeper, I will write this post for my future reference like coming back to it after 6 months.

Introduction

Zookeeper is a system that solves distributed co-ordination problem, its a coordination kernal for distributed systems. It provides primitives - not the concrete implementation - to implement distributed configuration, distributed lock, group membership, leader election and things like that.

It has nodes which are arranged hierarchically like folders are in a filesystem. It is non-blocking, client operations are performed in FIFO manner, nodes can be normal or ephemeral (means they are deleted after some time). When node is created it has name and a number which is monotonically increasing. The beautiful thing about Zookeeper is that how other things can be built on top of these primitives.

Consistency guarantees

It provides strong consistency for writes through linearizability (single global order). This means all clients will observe the state changes in the same order.
Zookeeper also provides FIFO ordering per client session. Requests from a single client are processed in the order they are sent.
All reads are not guaranteed to see latest writes unless explicitly synchronized. So clients can see stale data if they read from a follower which has not synced with the leader.

Fault Tolerance

Zookeeper can tolerate partial failure, but it chooses correctness over availability. It chooses consistency over availability. From CAP theorem, its CP, this means zookeeper becomes unavailable if quorum is not reachable but does not compromise consistency.

What Zookeeper does and what it does not do

Guarantee Type	Does	Does NOT
Write consistency	Linearizable writes with a global order	High throughput or horizontal write scaling
Read consistency	consistency within sessions	Linearizable reads without explicit sync
Availability	when quorum is met	available if quorum is not met
Data	small metadata	large payload like a database

Data Model

Data model provides a structure just enough to store data required for co-ordination. Its a heirarchical namespace resembling a filesystem where clients find information using parent child relationship. Each node in this structure is called znode and stores co-ordination metadata (default size is 1MB).

This is the stat structure of a znode:

czxid = transaction ID when the znode was created
mzxid tracks the most recent modification
Version numbers provide concurrency control, allowing clients to perform conditional updates
ephemeralOwner identify the session that created ephemeral znode
numChildren counts immediate children

Znode types

Persistent znodes remain in the system until deleted and are often used to store configuration or long-lived coordination state. Ephemeral znodes are tied to a client session and are deleted when session ends, useful in membership tracking. Sequential znodes include a monotonically increasing sequence number in their name, generated by ZooKeeper rather than clients to ensure global uniqueness. Useful for distributed queues and leader election.

A combination of ephemeral and sequential znode provides a primitive for leader election, clients creates znodes and watch immediate lower number and if its removed, then immediate next sequence number might become a leader.