Apache Zookeeper
This is my attempt to understand Apache Zookeeper, I will write this post for my future reference like coming back to it after 6 months.
Introduction
Zookeeper is a system that solves distributed co-ordination problem, its a coordination kernal for distributed systems. It provides primitives - not the concrete implementation - to implement distributed configuration, distributed lock, group membership, leader election and things like that.
It has nodes which are arranged hierarchically like folders are in a filesystem. It is non-blocking, client operations are performed in FIFO manner, nodes can be normal or ephemeral (means they are deleted after some time). When node is created it has name and a number which is monotonically increasing. The beautiful thing about Zookeeper is that how other things can be built on top of these primitives.
Consistency guarantees
- It provides strong consistency for writes through linearizability (single global order). This means all clients will observe the state changes in the same order.
- Zookeeper also provides FIFO ordering per client session. Requests from a single client are processed in the order they are sent.
- All reads are not guaranteed to see latest writes unless explicitly synchronized. So clients can see stale data if they read from a follower which has not synced with the leader.
Fault Tolerance
Zookeeper can tolerate partial failure, but it chooses correctness over availability. It chooses consistency over availability. From CAP theorem, its CP, this means zookeeper becomes unavailable if quorum is not reachable but does not compromise consistency.
What Zookeeper does and what it does not do
| Guarantee Type | Does | Does NOT |
|---|---|---|
| Write consistency | Linearizable writes with a global order | High throughput or horizontal write scaling |
| Read consistency | consistency within sessions | Linearizable reads without explicit sync |
| Availability | when quorum is met | available if quorum is not met |
| Data | small metadata | large payload like a database |
Data Model
Data model provides a structure just enough to store data required for co-ordination. Its a heirarchical namespace resembling a filesystem where clients find information using parent child relationship. Each node in this structure is called znode and stores co-ordination metadata (default size is 1MB).
This is the stat structure of a znode:
- czxid = transaction ID when the znode was created
- mzxid tracks the most recent modification
- Version numbers provide concurrency control, allowing clients to perform conditional updates
- ephemeralOwner identify the session that created ephemeral znode
- numChildren counts immediate children
Znode types
Persistent znodes remain in the system until deleted and are often used to store configuration or long-lived coordination state. Ephemeral znodes are tied to a client session and are deleted when session ends, useful in membership tracking. Sequential znodes include a monotonically increasing sequence number in their name, generated by ZooKeeper rather than clients to ensure global uniqueness. Useful for distributed queues and leader election.
A combination of ephemeral and sequential znode provides a primitive for leader election, clients creates znodes and watch immediate lower number and if its removed, then immediate next sequence number might become a leader.