🤿 Databases and Distributed Systems Deep Dive¶
2020 became the year I finally learnt databases and distributed systems properly—by ignoring theory and diving into implementation.
During my Master's in Computer Science, my favourite courses were Systems Architecture (operating systems, compilers) and Architecture and Hardware. I loved understanding how things work at a fundamental level—how assembly becomes machine code, how virtual memory works, how caches affect performance.
Naturally, I was excited for Database Theory. Databases are systems, after all. Surely we'd learn about B-trees, query execution, storage engines?
We didn't.
We spent weeks on Boyce-Codd normal forms, entity-relationship diagrams, and normalisation theory. Every lecture was about what databases should do in theory, never how they actually work in practice. For someone who wanted to understand implementation, it was deeply frustrating.
I passed the course and promptly ignored databases for years, treating them as black boxes. Need to store data? Use PostgreSQL. Need scalability? Add MongoDB. Need performance? ¯\_(ツ)_/¯ probably cache something?
But after too many "where is my data physically stored?" and "how does this query actually execute?" frustrations, I decided to learn databases properly. As a PhD researcher in data-intensive science dealing with gigabytes to terabytes of astronomical data, understanding what happens "under the hood" wasn't just curiosity—it was becoming essential for my work.
The Turning Point: Andy Pavlo¶
What really sparked things was discovering Andy Pavlo's database courses from Carnegie Mellon University: Intro to Database Systems (CMU 15-445) and Advanced Database Systems (CMU 15-721).
His passion for databases and focus on implementation are exactly what I'd been missing. Within the first lecture of the intro course, he's talking about how PostgreSQL stores data on disk, how buffer pools work, and why database design decisions matter for real-world performance.
I binged these lectures like a Netflix series. With my systems background and previous (failed) database education, I could follow the concepts whilst appreciating the implementation details I'd been craving.
What Makes These Courses Special¶
Implementation Over Theory¶
Andy doesn't start with normalisation forms. He starts with: "Here's how a database stores data on disk. Here's why that matters."
Lecture 1 topics:
- Disk-oriented architecture
- Page layouts and slotted pages
- Tuple storage (row vs column stores)
- Buffer pool management
This is the stuff that matters when you're debugging why your query is slow or why your database is using 50GB of disk for 10GB of data.
Real Database Systems¶
The courses reference real systems constantly:
- PostgreSQL's MVCC implementation
- MySQL's InnoDB storage engine
- SQLite's write-ahead logging
- RocksDB's LSM trees
You're not learning abstract theory—you're learning how PostgreSQL decides whether to use an index scan or sequential scan, and why it makes that choice.
The BusTub Project¶
The intro course includes a hands-on project: implementing components of a working database system called BusTub.
You implement:
Project 1: Buffer Pool Manager
// Managing pages of data in memory
class BufferPoolManager {
  Page* FetchPage(page_id_t page_id);
  bool UnpinPage(page_id_t page_id, bool is_dirty);
  bool FlushPage(page_id_t page_id);
  Page* NewPage(page_id_t* page_id);
  bool DeletePage(page_id_t page_id);
};
This teaches you how databases manage memory, handle page replacement (LRU), and decide what to keep in RAM versus flush to disk.
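To make the page-replacement side concrete, here is a minimal LRU replacer sketched in Rust (the language I later use for my own rewrite). The FrameId alias and the pin/unpin naming loosely mirror the BusTub interface, but this is my own simplification rather than the course code:

use std::collections::VecDeque;

type FrameId = usize;

// Tracks which unpinned frames are candidates for eviction, in LRU order.
struct LruReplacer {
    queue: VecDeque<FrameId>, // front = least recently used
}

impl LruReplacer {
    fn new() -> Self {
        Self { queue: VecDeque::new() }
    }

    // A page was unpinned: its frame becomes evictable.
    fn unpin(&mut self, frame: FrameId) {
        if !self.queue.contains(&frame) {
            self.queue.push_back(frame);
        }
    }

    // A page was pinned: its frame must not be evicted while in use.
    fn pin(&mut self, frame: FrameId) {
        self.queue.retain(|&f| f != frame);
    }

    // Pick the least recently used evictable frame, if any.
    fn victim(&mut self) -> Option<FrameId> {
        self.queue.pop_front()
    }
}

The buffer pool calls pin whenever it hands a page out, unpin when a page's pin count drops to zero, and victim when it needs to reclaim a frame.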
Project 2: Hash Table
Implementing extensible hashing for indexing:
template <typename KeyType, typename ValueType>
class ExtendibleHashTable {
  bool Insert(const KeyType &key, const ValueType &value);
  bool Find(const KeyType &key, std::vector<ValueType> &result);
  bool Remove(const KeyType &key, const ValueType &value);
};
This crystallised why hash indexes are O(1) for point queries but useless for range scans.
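A toy Rust comparison makes the point. The hash map answers point lookups directly, but for a range predicate it can only scan every entry, whereas an ordered structure (std's BTreeMap standing in for a B+ tree) walks just the matching keys:

use std::collections::{BTreeMap, HashMap};

fn main() {
    let pairs = [(1u32, "Alice"), (5, "Bob"), (9, "Carol"), (42, "Dave")];
    let hash_idx: HashMap<u32, &str> = pairs.into_iter().collect();
    let tree_idx: BTreeMap<u32, &str> = pairs.into_iter().collect();

    // Point query: one expected O(1) lookup in the hash index.
    assert_eq!(hash_idx.get(&5), Some(&"Bob"));

    // Range query on the hash index: keys aren't ordered, so scan everything.
    let mut from_hash: Vec<(u32, &str)> = hash_idx
        .iter()
        .filter(|&(&k, _)| k >= 2 && k <= 9)
        .map(|(&k, &v)| (k, v))
        .collect();
    from_hash.sort();

    // Range query on the ordered index: visit only the keys in [2, 9].
    let from_tree: Vec<(u32, &str)> =
        tree_idx.range(2..=9).map(|(&k, &v)| (k, v)).collect();

    assert_eq!(from_hash, from_tree);
}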
Project 3: Query Execution
Building query operators (sequential scan, nested loop join, hash join):
class HashJoinExecutor : public AbstractExecutor {
  void Init() override;
  bool Next(Tuple *tuple, RID *rid) override;

 private:
  void BuildHashTable();
  std::unordered_map<Value, std::vector<Tuple>> hash_table_;
};
Implementing a hash join makes you understand exactly why JOIN performance depends on which table ends up on the build side, and why a nested loop join on an unindexed column hurts so much.
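As a rough illustration, here is the build/probe skeleton of a hash join in Rust, with made-up Employee and Department types rather than BusTub's tuple machinery:

use std::collections::HashMap;

struct Department { id: u32, name: String }
struct Employee { name: String, dept_id: u32 }

// Join employees to departments on dept_id = id.
fn hash_join(departments: &[Department], employees: &[Employee]) -> Vec<(String, String)> {
    // Build phase: hash the smaller relation on the join key.
    let mut build: HashMap<u32, &Department> = HashMap::new();
    for d in departments {
        build.insert(d.id, d);
    }
    // Probe phase: one expected O(1) lookup per employee row.
    let mut out = Vec::new();
    for e in employees {
        if let Some(d) = build.get(&e.dept_id) {
            out.push((e.name.clone(), d.name.clone()));
        }
    }
    out
}

Building on the smaller relation keeps the hash table compact and in memory; probing with the larger one is then a single pass over its rows.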
Project 4: Concurrency Control
Implementing lock manager and deadlock detection for MVCC transactions.
This project alone taught me more about database transactions than years of reading about ACID properties in textbooks.
Key Concepts That Finally Made Sense¶
1. Storage Models¶
Row Store vs Column Store
I'd always heard "use a column store for analytics" without understanding why.
Row store (traditional RDBMS):
Page 1: [id=1, name="Alice", age=30, salary=50000]
[id=2, name="Bob", age=25, salary=45000]
Column store:
Page 1: [id: 1, 2, 3, ...]
Page 2: [name: "Alice", "Bob", "Charlie", ...]
Page 3: [age: 30, 25, 35, ...]
Page 4: [salary: 50000, 45000, 60000, ...]
For SELECT AVG(salary) FROM employees, the column store reads only the salary page. The row store must read every page, pulling all columns into memory just to discard most of them.
Aha moment: Column stores aren't magic—they're just optimised for different access patterns.
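A minimal Rust sketch of the two layouts (hypothetical employee fields, not any particular engine's page format) shows the difference in access pattern:

// Row layout: every field of a row sits together in memory/on a page.
struct EmployeeRow { id: u32, name: String, age: u32, salary: f64 }

// Column layout: each attribute is stored contiguously on its own.
struct EmployeeColumns {
    ids: Vec<u32>,
    names: Vec<String>,
    ages: Vec<u32>,
    salaries: Vec<f64>,
}

// AVG(salary) over rows: touches id, name and age too, just to skip them.
fn avg_salary_rows(rows: &[EmployeeRow]) -> f64 {
    rows.iter().map(|r| r.salary).sum::<f64>() / rows.len() as f64
}

// AVG(salary) over columns: reads only the salary vector.
fn avg_salary_columns(cols: &EmployeeColumns) -> f64 {
    cols.salaries.iter().sum::<f64>() / cols.salaries.len() as f64
}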
2. Index Structures¶
B+ Trees
Every database course mentions B-trees. Few explain why they're perfect for databases:
              [10 | 20 | 30]
             /    |     |    \
     [1,5]  [11,15]  [21,25]  [31,35]
       |       |        |        |
    [data]  [data]   [data]   [data]
- All data at leaf level (uniform lookup time)
- Leaves linked (efficient range scans)
- High fanout = shallow tree = fewer disk reads
- Balanced = predictable performance
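The linked leaves are what make range scans cheap: once a lookup lands on the first qualifying leaf, the scan just follows sibling pointers. A toy Rust sketch of the leaf level, using vector indices in place of pointers (the tree descent that finds the starting leaf is elided):

// A leaf node in a simplified B+ tree: sorted keys plus a link to the right sibling.
struct Leaf {
    keys: Vec<u32>,
    next: Option<usize>, // index of the right sibling in `leaves`
}

// Range scan: start at the leaf containing `low`, follow sibling links until `high`.
fn range_scan(leaves: &[Leaf], start_leaf: usize, low: u32, high: u32) -> Vec<u32> {
    let mut out = Vec::new();
    let mut current = Some(start_leaf);
    while let Some(i) = current {
        for &k in &leaves[i].keys {
            if k > high {
                return out; // keys are sorted, so we can stop early
            }
            if k >= low {
                out.push(k);
            }
        }
        current = leaves[i].next; // hop to the right sibling
    }
    out
}

With the leaves from the diagram above, a scan for keys 5 to 25 follows sibling pointers from the first leaf and stops as soon as it sees a key past 25, without ever going back through the root.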
LSM Trees (Log-Structured Merge Trees)
Used by RocksDB, Cassandra, LevelDB:
MemTable (in memory)
↓ (flush when full)
SSTable L0
↓ (compact)
SSTable L1
↓ (compact)
SSTable L2
Write-optimised: all writes append to memory, then flush. Reads must check multiple levels (slower), but writes are incredibly fast.
Trade-off: B+ trees favour reads. LSM trees favour writes. Choose based on your workload.
3. Query Execution¶
Query Plan Visualisation
Understanding EXPLAIN output finally made sense when I understood executors:
EXPLAIN SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000;
HashJoin (cost=12.5..25.3 rows=100)
  Hash Cond: (e.dept_id = d.id)
  ->  Seq Scan on employees e (cost=0.0..10.0 rows=500)
        Filter: (salary > 50000)
  ->  Hash on departments d (cost=5.0..5.0 rows=200)
Reading this plan, I now understand:
1. Scan employees, filter high earners (predicate pushdown)
2. Build hash table on departments (smaller table)
3. Probe hash table for each employee (O(1) per probe)
Optimiser Decisions
Why does the database sometimes ignore your index?
-- Index on age exists, but not used!
SELECT * FROM users WHERE age > 18;
If 99% of users are over 18, sequential scan of the table is faster than:
1. Index scan to find 99% of row IDs
2. Random disk seeks to fetch each row
The optimiser knows this from table statistics. This blew my mind.
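The decision comes down to simple arithmetic over those statistics. Here is a back-of-the-envelope sketch with made-up cost constants, loosely in the spirit of a cost-based planner rather than PostgreSQL's actual formulas:

// Crude planner arithmetic: compare estimated page reads for the two plans.
fn choose_plan(table_pages: f64, table_rows: f64, selectivity: f64) -> &'static str {
    // Sequential scan: read every page once, cheaply (sequential I/O).
    let seq_cost = table_pages;

    // Index scan: walk the index, then pay a random page read per matching row.
    let random_penalty = 4.0; // random I/O costed higher than sequential
    let index_cost = selectivity * table_rows * random_penalty;

    if index_cost < seq_cost { "Index Scan" } else { "Seq Scan" }
}

fn main() {
    // age > 18 matches 99% of a 1M-row, 10k-page table: seq scan wins.
    println!("{}", choose_plan(10_000.0, 1_000_000.0, 0.99)); // Seq Scan
    // age > 95 matches 0.1% of rows: the index is worth the random reads.
    println!("{}", choose_plan(10_000.0, 1_000_000.0, 0.001)); // Index Scan
}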
4. Concurrency Control¶
MVCC (Multi-Version Concurrency Control)
How PostgreSQL lets readers and writers avoid blocking each other:
Transaction T1: UPDATE users SET name = 'Alice' WHERE id = 1;
Physical storage:
[id=1, name='Bob', xmin=100, xmax=200] <- Old version
[id=1, name='Alice', xmin=200, xmax=∞] <- New version
- T1 (transaction 200) sees the new version
- T2 (transaction 150) sees the old version
- No locks required for reads!
Trade-off: requires VACUUM to clean up old versions. This explains why PostgreSQL needs regular maintenance.
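The visibility rule itself is tiny. A simplified Rust sketch, ignoring snapshots, aborted transactions and all the hint-bit bookkeeping a real engine needs:

struct TupleVersion {
    xmin: u64,         // transaction that created this version
    xmax: Option<u64>, // transaction that replaced/deleted it, if any
    name: &'static str,
}

// A version is visible to `txid` if it was created at or before txid and
// not yet replaced at or before txid (a crude approximation of the real rules).
fn visible(v: &TupleVersion, txid: u64) -> bool {
    v.xmin <= txid && v.xmax.map_or(true, |xmax| xmax > txid)
}

fn main() {
    let versions = [
        TupleVersion { xmin: 100, xmax: Some(200), name: "Bob" }, // old version
        TupleVersion { xmin: 200, xmax: None, name: "Alice" },    // new version
    ];
    for txid in [150u64, 200] {
        let seen = versions.iter().find(|&v| visible(v, txid)).unwrap();
        println!("transaction {txid} sees {}", seen.name); // 150 sees Bob, 200 sees Alice
    }
}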
Two-Phase Locking vs Optimistic Concurrency
2PL (pessimistic):
BEGIN
LOCK (record)
-- Do work
COMMIT
UNLOCK
OCC (optimistic):
BEGIN
-- Read and track versions
-- Do work
VALIDATE (no conflicts?)
Yes: COMMIT
No: ABORT and RETRY
OCC wins when conflicts are rare (typically read-heavy workloads); 2PL wins when conflicts are frequent (write-heavy, high-contention workloads). Databases must choose.
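To see what "validate" means in practice, here is a stripped-down optimistic scheme in Rust using per-key version counters. Real engines also have to make the validate-and-write step atomic, which this sketch glosses over:

use std::collections::HashMap;

// Each record carries a version counter that bumps on every committed write.
struct Store {
    versions: HashMap<String, u64>,
    values: HashMap<String, String>,
}

// An optimistic transaction remembers the version of everything it read.
struct OccTxn {
    read_set: HashMap<String, u64>,
    write_set: HashMap<String, String>,
}

impl OccTxn {
    fn read(&mut self, store: &Store, key: &str) -> Option<String> {
        let version = *store.versions.get(key).unwrap_or(&0);
        self.read_set.insert(key.to_string(), version);
        store.values.get(key).cloned()
    }

    fn write(&mut self, key: &str, value: &str) {
        self.write_set.insert(key.to_string(), value.to_string());
    }

    // Validation: if anything we read has since changed, abort so the caller can retry.
    fn commit(self, store: &mut Store) -> bool {
        for (key, seen) in &self.read_set {
            if *store.versions.get(key).unwrap_or(&0) != *seen {
                return false; // conflict detected
            }
        }
        for (key, value) in self.write_set {
            *store.versions.entry(key.clone()).or_insert(0) += 1;
            store.values.insert(key, value);
        }
        true
    }
}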
Distributed Systems: MIT 6.824¶
Whilst watching Andy's database lectures, I discovered MIT's 6.824 Distributed Systems course.
The timing was perfect—many topics overlapped:
Topics covered:
- MapReduce and distributed computation
- Raft consensus algorithm
- Fault tolerance and replication
- Distributed transactions (2PC, 3PC)
- Consistency models
The Labs¶
The course includes legendary labs implementing distributed systems in Go:
Lab 1: MapReduce
Implement a coordinator and workers:
type Coordinator struct {
    tasks   []Task
    workers map[string]Worker
    mutex   sync.Mutex
}

func (c *Coordinator) AssignTask(args *Args, reply *Reply) error {
    c.mutex.Lock()
    defer c.mutex.Unlock()
    // Find an unassigned task and hand it to this worker
    for i := range c.tasks {
        if c.tasks[i].Status == Pending {
            c.tasks[i].Status = Running
            c.tasks[i].Worker = args.WorkerID
            reply.Task = c.tasks[i]
            return nil
        }
    }
    // No pending task right now
    return nil
}
This taught me about coordinating distributed work, handling worker failures, and the challenges of distributed state.
Lab 2: Raft
Implement the Raft consensus algorithm for replicated state machines:
type Raft struct {
    mu          sync.Mutex
    peers       []*labrpc.ClientEnd
    me          int
    currentTerm int
    votedFor    int
    log         []LogEntry
    state       State // Follower, Candidate, or Leader
}

func (rf *Raft) Start(command interface{}) (int, int, bool) {
    // Leader replicates command to followers
    // Returns index, term, isLeader
}
Implementing leader election, log replication, and handling network partitions gave me a visceral understanding of distributed consensus.
Lab 3: Fault-Tolerant Key/Value Service
Build a linearisable key-value store on top of Raft:
type KVServer struct {
    mu      sync.Mutex
    rf      *Raft
    applyCh chan ApplyMsg
    data    map[string]string
    applied map[int64]Response // Dedup cache
}

func (kv *KVServer) Get(args *GetArgs, reply *GetReply) {
    // Submit to Raft, wait for commit
    op := Op{Type: "Get", Key: args.Key}
    index, _, isLeader := kv.rf.Start(op)
    if !isLeader {
        reply.Err = ErrWrongLeader
        return
    }
    // Wait for the entry at `index` to be committed and applied,
    // then fill in reply with the result
}
This crystallised how distributed databases like CockroachDB and TiDB work under the hood.
Lab 4: Sharded Key/Value Service
Implement sharding with shard migration:
type ShardController struct {
    configs []Config // Configuration history
    // Each config defines shard -> replica group mapping
}

// Rebalance shards across groups
func (sc *ShardController) Join(args *JoinArgs, reply *JoinReply) {
    // Add new replica group
    // Rebalance shards
}
Understanding shard rebalancing explained why MongoDB migrations can be slow and why consistent hashing matters.
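Consistent hashing is the usual fix: with a naive hash(key) % n_groups assignment, adding a group reshuffles almost every key, whereas on a hash ring each key belongs to the next group clockwise and only one arc of keys has to move. A minimal Rust sketch, without the virtual nodes real systems add for balance:

use std::collections::BTreeMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(item: &T) -> u64 {
    let mut h = DefaultHasher::new();
    item.hash(&mut h);
    h.finish()
}

// Nodes placed on a ring keyed by their hash.
struct Ring {
    nodes: BTreeMap<u64, String>,
}

impl Ring {
    fn add_node(&mut self, name: &str) {
        self.nodes.insert(hash_of(&name), name.to_string());
    }

    // A key belongs to the first node clockwise from its hash (wrapping around).
    fn node_for(&self, key: &str) -> Option<&String> {
        let h = hash_of(&key);
        self.nodes
            .range(h..)
            .next()
            .or_else(|| self.nodes.iter().next())
            .map(|(_, name)| name)
    }
}

Going from four to five groups under the modulo scheme remaps roughly 80% of keys; on the ring, only the keys falling into the new node's arc (about a fifth of them) have to migrate.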
Key Distributed Systems Insights¶
CAP Theorem (finally understood)
You can't have all three:
- **C**onsistency: All nodes see the same data
- **A**vailability: Every request gets a response
- **P**artition tolerance: System works despite network failures
Network partitions will happen, so you choose CP or AP:
- CP (PostgreSQL, traditional RDBMS): Sacrifice availability during partitions
- AP (Cassandra, DynamoDB): Sacrifice consistency for availability
Aha moment: This isn't a design choice—it's physics. Network delays and partitions are unavoidable.
Consistency Models
Linearisability (strongest):
[W(x=1)] → [R(x)=1] # Reads always see latest write
Sequential Consistency:
All processes see same order, may not be real-time
Eventual Consistency (weakest):
Given enough time, all replicas converge
Understanding these models explained why:
- DynamoDB sometimes returns stale data
- PostgreSQL replication has lag
- Spanner uses atomic clocks for global consistency
Designing Data-Intensive Applications¶
Alongside the courses, I read Martin Kleppmann's Designing Data-Intensive Applications. This book ties together databases and distributed systems beautifully.
What Makes DDIA Essential¶
Part 1: Foundations of Data Systems
Covers storage engines, encoding, and replication with production examples:
- How PostgreSQL implements replication
- How Kafka achieves high throughput
- How Elasticsearch handles full-text search
Part 2: Distributed Data
Explains partitioning, transactions, and consistency models:
- Why hash partitioning vs range partitioning matters
- How distributed transactions work (2PC, sagas)
- Trade-offs between consistency models
Part 3: Derived Data
Batch processing, stream processing, and the future of data systems:
- MapReduce and Spark
- Stream processing with Kafka and Flink
- Event sourcing and CQRS patterns
Concepts That Clicked¶
The Log is Everything
A surprising insight: almost every data system is built on append-only logs:
- Databases: Write-ahead log (WAL)
- Kafka: Distributed commit log
- Event sourcing: Event log as source of truth
Traditional view:
Database is the source of truth
Reality:
Log is the source of truth
Database is a materialized view of the log
Change Data Capture (CDC)
Reading the database's log to stream changes elsewhere:
PostgreSQL WAL → Debezium → Kafka → Elasticsearch
                                  → Data warehouse
                                  → Cache invalidation
This pattern powers modern data architectures. Understanding logs made this click.
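Mechanically, the pattern is just "replay the log into another system". A toy Rust consumer that keeps a cache in sync with a change stream (a hypothetical Change type standing in for Debezium events read off Kafka):

use std::collections::HashMap;

// One change event as it might arrive from a CDC stream.
enum Change {
    Upsert { key: String, value: String },
    Delete { key: String },
}

// The downstream cache is nothing more than a materialised view of the log.
fn apply(cache: &mut HashMap<String, String>, log: impl IntoIterator<Item = Change>) {
    for change in log {
        match change {
            Change::Upsert { key, value } => {
                cache.insert(key, value);
            }
            Change::Delete { key } => {
                cache.remove(&key);
            }
        }
    }
}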
Batch vs Stream Processing
Batch (MapReduce, Spark):
Process bounded historical datasets: high latency, high throughput
Stream (Flink, Kafka Streams):
Process unbounded real-time data: low latency, results as events arrive
Not competitors—complementary. Use both in different parts of your system.
Practical Implementation Experience¶
Building a Simple Storage Engine¶
To cement my understanding, I implemented a basic LSM tree storage engine:
pub struct LSMTree {
    memtable: MemTable,
    immutable_memtables: Vec<MemTable>,
    sstables: Vec<SSTable>,
}

impl LSMTree {
    pub fn put(&mut self, key: Vec<u8>, value: Vec<u8>) -> Result<()> {
        self.memtable.insert(key, value);
        if self.memtable.size() > MEMTABLE_SIZE_LIMIT {
            self.flush()?;
        }
        Ok(())
    }

    pub fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>> {
        // Check memtable
        if let Some(value) = self.memtable.get(key) {
            return Ok(Some(value.clone()));
        }
        // Check immutable memtables
        for memtable in &self.immutable_memtables {
            if let Some(value) = memtable.get(key) {
                return Ok(Some(value.clone()));
            }
        }
        // Check SSTables (newest to oldest)
        for sstable in self.sstables.iter().rev() {
            if let Some(value) = sstable.get(key)? {
                return Ok(Some(value));
            }
        }
        Ok(None)
    }

    fn flush(&mut self) -> Result<()> {
        let memtable = std::mem::replace(&mut self.memtable, MemTable::new());
        let sstable = SSTable::from_memtable(&memtable)?;
        self.sstables.push(sstable);
        Ok(())
    }
}
This implementation taught me:
- Write amplification: Data gets written multiple times during compaction
- Read amplification: Must check multiple levels for reads
- Space amplification: Old versions of data consume space until compaction
These trade-offs aren't academic—they're fundamental constraints.
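A back-of-the-envelope calculation shows how quickly these add up for a levelled LSM tree. The fanout of 10 and four levels below are textbook assumptions, not measurements of any particular engine:

fn main() {
    let fanout = 10.0;
    let levels = 4;

    // WAL write + memtable flush to L0, then compaction rewrites data roughly
    // `fanout` times as it moves down each subsequent level.
    let write_amplification = 1.0 + 1.0 + fanout * (levels as f64 - 1.0);
    println!("~{write_amplification}x write amplification"); // ~32x

    // A point read may have to consult the memtable plus every level.
    println!("up to {} places checked per read", 1 + levels); // 5
}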
Implementing Raft¶
Working through the Raft labs gave me an appreciation for how hard distributed consensus is:
func (rf *Raft) sendAppendEntries(server int, args *AppendEntriesArgs) {
    reply := &AppendEntriesReply{}
    ok := rf.peers[server].Call("Raft.AppendEntries", args, reply)
    if !ok {
        return // RPC failed, follower may be down
    }

    rf.mu.Lock()
    defer rf.mu.Unlock()

    if reply.Term > rf.currentTerm {
        // We're outdated, step down
        rf.currentTerm = reply.Term
        rf.state = Follower
        return
    }

    if reply.Success {
        // Follower accepted entries, update tracking
        rf.matchIndex[server] = args.PrevLogIndex + len(args.Entries)
        rf.nextIndex[server] = rf.matchIndex[server] + 1
        // Check if we can advance commit index
        rf.updateCommitIndex()
    } else {
        // Follower rejected, back up and retry
        rf.nextIndex[server]--
    }
}
Debugging distributed systems is hard:
- Network delays cause spurious leader elections
- Race conditions manifest non-deterministically
- Logs from multiple nodes must be correlated
I gained massive respect for engineers building distributed databases.
Resources That Made the Difference¶
Online Courses¶
Must-watch:
1. CMU 15-445: Database Systems - Start here
2. CMU 15-721: Advanced Database Systems - After 445
3. MIT 6.824: Distributed Systems - Complement to databases
Also excellent:
- Stanford CS245: Database System Implementation
- CMU 15-799: Special Topics in Database Systems - Cutting-edge research
Books¶
Essential reading:
1. Designing Data-Intensive Applications by Martin Kleppmann
2. Database Internals by Alex Petrov
3. Distributed Systems by Maarten van Steen
Also valuable:
- Database System Concepts (Silberschatz) - Good theory foundation
- Transaction Processing (Gray & Reuter) - Deep dive on ACID
- Distributed Algorithms (Lynch) - Formal distributed systems
Papers¶
Reading database papers became accessible after the courses:
Classic papers:
- ARIES: A Transaction Recovery Method (IBM, 1992)
- The Part-Time Parliament (Lamport's Paxos, 1998)
- In Search of an Understandable Consensus Algorithm (Raft, 2014)
Modern systems:
- Spanner: Google's Globally-Distributed Database
- Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
- CockroachDB: The Resilient Geo-Distributed SQL Database
Community Resources¶
Database Internals Reading Group
I joined the DDIA Reading Group, which meets monthly to discuss papers and implementations. Discussing concepts with others solidified my understanding.
Discord/Slack Communities:
- CMU Database Group Discord
- Distributed Systems Discord
These communities helped when I got stuck on labs or concepts.
The Rust Implementation Journey¶
My goal now is to reimplement the CMU database projects in Rust. Why Rust?
- Memory safety without garbage collection - Critical for database performance
- Excellent concurrency primitives - Essential for concurrent transactions
- Strong type system - Catches bugs at compile time
- Growing database ecosystem - TiKV, Databend, RisingWave all use Rust
Current Progress¶
I'm working through:
Phase 1: Storage Layer ✓
- Buffer pool manager
- Page layouts
- Disk manager

Phase 2: Access Methods (in progress)
- B+ tree index
- Hash table index
- Table heap

Phase 3: Query Execution (planned)
- Iterator model executors
- Volcano-style processing
- Query optimiser

Phase 4: Concurrency (planned)
- Lock manager
- MVCC implementation
- Deadlock detection
Follow my journey in the upcoming post: "Learning Rust Through Database Implementation" (coming soon).
Code Snippets¶
Here's my Rust buffer pool manager:
pub struct BufferPoolManager {
    pool: Vec<Page>,
    replacer: Box<dyn Replacer>,
    free_list: Vec<FrameId>,
    page_table: HashMap<PageId, FrameId>,
    disk_manager: DiskManager,
}

impl BufferPoolManager {
    pub fn fetch_page(&mut self, page_id: PageId) -> Option<&mut Page> {
        // Check if page already in pool
        if let Some(&frame_id) = self.page_table.get(&page_id) {
            self.replacer.pin(frame_id);
            return Some(&mut self.pool[frame_id]);
        }
        // Get victim frame
        let frame_id = self.get_victim_frame()?;
        // Evict old page: write it back if dirty, then drop its mapping
        let page = &mut self.pool[frame_id];
        if page.is_dirty() {
            self.disk_manager.write_page(page.id(), page.data()).ok()?;
        }
        self.page_table.remove(&page.id());
        // Load new page from disk into the freed frame
        self.disk_manager.read_page(page_id, page.data_mut()).ok()?;
        page.set_id(page_id);
        page.set_pin_count(1);
        self.page_table.insert(page_id, frame_id);
        Some(page)
    }
}
Implementing this in Rust forced me to think about:
- Ownership and borrowing for concurrent access
- Pin counts and reference counting
- Dirty page tracking
- Frame eviction policies
Lessons Learnt¶
1. Implementation Beats Theory¶
Reading about B-trees didn't teach me why they're ubiquitous. Implementing one did.
Before: "B-trees are balanced trees used in databases" After: "B-trees minimise disk I/O by maximising fanout, which is critical because disk seeks dominate query latency"
The difference is understanding versus knowledge.
2. Trade-offs Are Everywhere¶
There are no silver bullets in databases:
- Fast writes? Sacrifice read performance (LSM trees)
- Strong consistency? Sacrifice availability (CAP theorem)
- Flexible schema? Sacrifice query optimisation (NoSQL)
Understanding trade-offs lets you choose the right tool.
3. Distributed Systems Are Hard¶
Implementing Raft taught me:
- Network delays and partitions are unavoidable
- Debugging distributed systems requires systematic logging
- Correctness requires formal reasoning (TLA+)
I have immense respect for database engineers now.
4. Real Systems Are Messy¶
PostgreSQL's codebase is roughly 1.3 million lines. It handles:
- Crash recovery edge cases
- Upgrade compatibility
- Performance regressions
- Platform-specific quirks
Reading academic papers gives clean algorithms. Building systems reveals the complexity.
What's Next¶
My database and distributed systems education continues:
Short-term goals:
- Complete Rust database implementation
- Contribute to open-source databases (PostgreSQL, TiKV)
- Read more research papers

Medium-term goals:
- Build a distributed key-value store
- Implement a query optimiser
- Explore column stores and OLAP systems

Long-term goals:
- Contribute to database research
- Apply knowledge to astronomical data processing
- Teach others what I've learnt
Conclusion¶
2020 was indeed my year for databases and distributed systems. But learning isn't finished—it never is in this field.
What changed wasn't just my technical knowledge. It was my approach to learning:
Old approach: Read theory, memorise concepts, forget details
New approach: Watch lectures, implement systems, struggle with edge cases
The struggle is where learning happens.
If you're frustrated with traditional database courses, try Andy Pavlo's lectures. If you want to understand distributed systems, work through MIT 6.824 labs. If you want everything to click together, read DDIA.
Most importantly: build something. Implement a storage engine. Write a Raft library. Break things and debug them.
That's how you learn databases properly.
References and Resources¶
Courses:
- CMU 15-445: Database Systems
- CMU 15-721: Advanced Database Systems
- MIT 6.824: Distributed Systems
Books:
- Designing Data-Intensive Applications by Martin Kleppmann
- Database Internals by Alex Petrov
My implementations:
- GitHub: My Rust database implementation (coming soon)
- GitHub: MIT 6.824 labs in Go (coming soon)
Communities:
- CMU Database Group Discord
- DDIA Reading Group