🤿 Databases and Distributed Systems Deep Dive

2020 became the year I finally learnt databases and distributed systems properly—by ignoring theory and diving into implementation.

During my Master's in Computer Science, my favourite courses were Systems Architecture (operating systems, compilers) and Architecture and Hardware. I loved understanding how things work at a fundamental level—how assembly becomes machine code, how virtual memory works, how caches affect performance.

Naturally, I was excited for Database Theory. Databases are systems, after all. Surely we'd learn about B-trees, query execution, storage engines?

We didn't.

We spent weeks on Boyce-Codd normal forms, entity-relationship diagrams, and normalisation theory. Every lecture was about what databases should do in theory, never how they actually work in practice. For someone who wanted to understand implementation, it was deeply frustrating.

I passed the course and promptly ignored databases for years, treating them as black boxes. Need to store data? Use PostgreSQL. Need scalability? Add MongoDB. Need performance? ¯\_(ツ)_/¯ probably cache something?

But after too many "where is my data physically stored?" and "how does this query actually execute?" frustrations, I decided to learn databases properly. As a PhD researcher in data-intensive science dealing with gigabytes to terabytes of astronomical data, understanding what happens "under the hood" wasn't just curiosity—it was becoming essential for my work.

The Turning Point: Andy Pavlo

What really sparked things was discovering Andy Pavlo's database courses from Carnegie Mellon University:

  1. CMU 15-445: Introduction to Database Systems
  2. CMU 15-721: Advanced Database Systems

His passion for databases and his focus on implementation were exactly what I'd been missing. Within the first lecture of the intro course, he's talking about how PostgreSQL stores data on disk, how buffer pools work, and why database design decisions matter for real-world performance.

I binged these lectures like a Netflix series. With my systems background and previous (failed) database education, I could follow the concepts whilst appreciating the implementation details I'd been craving.

What Makes These Courses Special

Implementation Over Theory

Andy doesn't start with normalisation forms. He starts with: "Here's how a database stores data on disk. Here's why that matters."

Lecture 1 topics:

  • Disk-oriented architecture
  • Page layouts and slotted pages
  • Tuple storage (row vs column stores)
  • Buffer pool management

This is the stuff that matters when you're debugging why your query is slow or why your database is using 50GB of disk for 10GB of data.

Real Database Systems

The courses reference real systems constantly:

  • PostgreSQL's MVCC implementation
  • MySQL's InnoDB storage engine
  • SQLite's write-ahead logging
  • RocksDB's LSM trees

You're not learning abstract theory—you're learning how PostgreSQL decides whether to use an index scan or sequential scan, and why it makes that choice.

The BusTub Project

The intro course includes a hands-on project: implementing components of a working database system called BusTub.

You implement:

Project 1: Buffer Pool Manager

// Managing pages of data in memory
class BufferPoolManager {
  Page* FetchPage(page_id_t page_id);
  bool UnpinPage(page_id_t page_id, bool is_dirty);
  bool FlushPage(page_id_t page_id);
  Page* NewPage(page_id_t* page_id);
  bool DeletePage(page_id_t page_id);
};

This teaches you how databases manage memory, handle page replacement (LRU), and decide what to keep in RAM versus flush to disk.
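
To make the page-replacement part concrete, here's a minimal LRU replacer sketch in Rust rather than the course's C++; the FrameId alias and method names are my own simplification, not BusTub's actual interface.

use std::collections::VecDeque;

type FrameId = usize;

// Tracks unpinned frames in least-recently-used order.
struct LruReplacer {
    // Front = least recently used, back = most recently used.
    queue: VecDeque<FrameId>,
}

impl LruReplacer {
    fn new() -> Self {
        LruReplacer { queue: VecDeque::new() }
    }

    // A frame becomes an eviction candidate once its pin count drops to zero.
    fn unpin(&mut self, frame_id: FrameId) {
        if !self.queue.contains(&frame_id) {
            self.queue.push_back(frame_id);
        }
    }

    // A pinned frame must not be evicted, so remove it from the candidates.
    fn pin(&mut self, frame_id: FrameId) {
        self.queue.retain(|&f| f != frame_id);
    }

    // Evict the least recently used unpinned frame, if any.
    fn victim(&mut self) -> Option<FrameId> {
        self.queue.pop_front()
    }
}

The buffer pool pins a frame on every fetch, unpins it when the caller finishes, and asks for a victim whenever it needs to bring in a page and the free list is empty.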

Project 2: Hash Table

Implementing extensible hashing for indexing:

template <typename KeyType, typename ValueType>
class ExtendibleHashTable {
  bool Insert(const KeyType &key, const ValueType &value);
  bool Find(const KeyType &key, std::vector<ValueType> &result);
  bool Remove(const KeyType &key, const ValueType &value);
};

This crystallised why hash indexes are O(1) for point queries but useless for range scans.
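
A tiny standard-library analogy (not database code) shows the difference: a hash map answers point lookups in O(1) but keeps keys in no useful order, while an ordered map can walk a key range directly.

use std::collections::{BTreeMap, HashMap};

fn main() {
    let mut hash_index: HashMap<u32, &str> = HashMap::new();
    let mut tree_index: BTreeMap<u32, &str> = BTreeMap::new();
    for (id, name) in [(1, "Alice"), (5, "Bob"), (9, "Carol")] {
        hash_index.insert(id, name);
        tree_index.insert(id, name);
    }

    // Point query: both work, the hash index in O(1).
    assert_eq!(hash_index.get(&5), Some(&"Bob"));

    // Range query: the ordered index walks keys in order...
    let in_range: Vec<_> = tree_index.range(2..=9).collect();
    assert_eq!(in_range.len(), 2);

    // ...while the hash index scatters keys, so "WHERE id BETWEEN 2 AND 9"
    // would have to probe every possible key or fall back to a full scan.
}

A disk-based hash index has exactly the same blind spot, which is one reason B+ trees remain the default index type.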

Project 3: Query Execution

Building query operators (sequential scan, nested loop join, hash join):

class HashJoinExecutor : public AbstractExecutor {
  void Init() override;
  bool Next(Tuple *tuple, RID *rid) override;

private:
  void BuildHashTable();
  std::unordered_map<Value, std::vector<Tuple>> hash_table_;
};

Implementing a hash join makes you understand exactly why JOIN performance depends on which table ends up on the build side and which on the probe side, and why indexes on join columns matter so much for the other join strategies (nested loop and merge joins).
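
Here's the core of a hash join as a rough Rust sketch (tuples reduced to key/payload pairs; a real executor deals with pages, schemas, and spilling the hash table to disk):

use std::collections::HashMap;

type Key = i64;

// Build a hash table on the smaller input, then stream the larger input past it.
fn hash_join(
    build_side: &[(Key, String)], // ideally the smaller table (e.g. departments)
    probe_side: &[(Key, String)], // the larger table (e.g. employees)
) -> Vec<(String, String)> {
    // Build phase: the whole build side must fit in the hash table.
    let mut table: HashMap<Key, Vec<&String>> = HashMap::new();
    for (key, value) in build_side {
        table.entry(*key).or_default().push(value);
    }

    // Probe phase: one O(1) lookup per probe-side tuple.
    let mut result = Vec::new();
    for (key, value) in probe_side {
        if let Some(matches) = table.get(key) {
            for m in matches {
                result.push((value.clone(), (*m).clone()));
            }
        }
    }
    result
}

Swap the two arguments and the hash table's memory footprint changes from the size of departments to the size of employees, which is exactly the asymmetry the optimiser's cost model reasons about when it picks a build side.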

Project 4: Concurrency Control

Implementing a lock manager and deadlock detection for MVCC transactions.

This project alone taught me more about database transactions than years of reading about ACID properties in textbooks.

Key Concepts That Finally Made Sense

1. Storage Models

Row Store vs Column Store

I'd always heard "use a column store for analytics" without understanding why.

Row store (traditional RDBMS):

Page 1: [id=1, name="Alice", age=30, salary=50000]
        [id=2, name="Bob", age=25, salary=45000]

Column store:

Page 1: [id: 1, 2, 3, ...]
Page 2: [name: "Alice", "Bob", "Charlie", ...]
Page 3: [age: 30, 25, 35, ...]
Page 4: [salary: 50000, 45000, 60000, ...]

For SELECT AVG(salary) FROM employees, the column store reads only the salary page. The row store must read every page, pulling all columns into memory just to discard most of them.

Aha moment: Column stores aren't magic—they're just optimised for different access patterns.

2. Index Structures

B+ Trees

Every database course mentions B-trees. Few explain why they're perfect for databases:

              [10 | 20 | 30]
             /    |    |    \
         [1,5]  [11,15] [21,25] [31,35]
           |      |       |       |
        [data] [data]  [data]  [data]

  • All data at leaf level (uniform lookup time)
  • Leaves linked (efficient range scans)
  • High fanout = shallow tree = fewer disk reads
  • Balanced = predictable performance
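
A back-of-the-envelope calculation makes the fanout point concrete (8 KB pages and roughly 16 bytes per key-plus-pointer entry are assumed numbers, not anything specific to one system):

Fanout ≈ 8 KB page / 16 B per entry ≈ 500 children per node
2 levels: ~500 × 500       ≈ 250 thousand keys
3 levels: ~500 × 500 × 500 ≈ 125 million keys

Any point lookup in that 125-million-key tree touches at most three or four pages, and the upper levels are almost always already in the buffer pool.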

LSM Trees (Log-Structured Merge Trees)

Used by RocksDB, Cassandra, LevelDB:

                MemTable (in memory)
                    ↓ (flush when full)
                  SSTable L0
                    ↓ (compact)
                  SSTable L1
                    ↓ (compact)
                  SSTable L2

Write-optimised: all writes append to memory, then flush. Reads must check multiple levels (slower), but writes are incredibly fast.

Trade-off: B+ trees favour reads. LSM trees favour writes. Choose based on your workload.

3. Query Execution

Query Plan Visualisation

EXPLAIN output finally made sense once I understood executors:

EXPLAIN SELECT e.name, d.dept_name
FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE e.salary > 50000;

HashJoin (cost=12.5..25.3 rows=100)
  Hash Cond: (e.dept_id = d.id)
  -> Seq Scan on employees e (cost=0.0..10.0 rows=500)
       Filter: (salary > 50000)
  -> Hash on departments d (cost=5.0..5.0 rows=200)

Reading this plan, I now understand:

  1. Scan employees, filter high earners (predicate pushdown)
  2. Build hash table on departments (smaller table)
  3. Probe hash table for each employee (O(1) per probe)

Optimiser Decisions

Why does the database sometimes ignore your index?

-- Index on age exists, but not used!
SELECT * FROM users WHERE age > 18;

If 99% of users are over 18, sequential scan of the table is faster than:

  1. Index scan to find 99% of row IDs
  2. Random disk seeks to fetch each row

The optimiser knows this from table statistics. This blew my mind.
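
The arithmetic behind that decision looks roughly like this (made-up numbers for a 1-million-row table at ~100 rows per page, just to show the shape of the cost model):

Sequential scan: read ~10,000 pages in order, filter rows in memory
Index scan:      read the index to find ~990,000 matching row IDs,
                 then perform up to ~990,000 random heap fetches

With 99% of rows matching, the "clever" index plan does orders of magnitude more random I/O than reading the table once, which is why the row-count estimates gathered by ANALYZE matter so much.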

4. Concurrency Control

MVCC (Multi-Version Concurrency Control)

How PostgreSQL keeps readers and writers from blocking each other:

Transaction T1: UPDATE users SET name = 'Alice' WHERE id = 1;

Physical storage:
[id=1, name='Bob', xmin=100, xmax=200]    <- Old version
[id=1, name='Alice', xmin=200, xmax=∞]   <- New version

  • T1 (transaction 200) sees the new version
  • T2 (transaction 150) sees the old version
  • No locks required for reads!

Trade-off: requires VACUUM to clean up old versions. This explains why PostgreSQL needs regular maintenance.
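
A stripped-down version of that visibility rule, as a Rust sketch (real PostgreSQL also has to consider command IDs, hint bits, and in-progress or aborted transactions, all of which are ignored here):

type TxnId = u64;

// One physical row version, matching the xmin/xmax layout shown above.
struct TupleVersion {
    xmin: TxnId,         // transaction that created this version
    xmax: Option<TxnId>, // transaction that deleted or replaced it, if any
}

// Can a transaction reading at `snapshot_txn` see this version?
fn visible(version: &TupleVersion, snapshot_txn: TxnId) -> bool {
    // The creating transaction must be visible to our snapshot...
    let created = version.xmin <= snapshot_txn;
    // ...and the version must not have been deleted before our snapshot.
    let not_deleted = match version.xmax {
        None => true,
        Some(xmax) => xmax > snapshot_txn,
    };
    created && not_deleted
}

Applied to the example above, transaction 150 sees the Bob version (xmin=100 ≤ 150, xmax=200 > 150) while transaction 200 sees only the Alice version, and neither takes a read lock.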

Two-Phase Locking vs Optimistic Concurrency

2PL (pessimistic):

BEGIN
LOCK (record)
-- Do work
COMMIT
UNLOCK

OCC (optimistic):

BEGIN
-- Read and track versions
-- Do work
VALIDATE (no conflicts?)
  Yes: COMMIT
  No: ABORT and RETRY

OCC wins for read-heavy workloads. 2PL wins for write-heavy. Databases must choose.
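
As a toy sketch of the OCC validation step, assuming every record carries a version counter (none of this corresponds to a specific database's API):

use std::collections::HashMap;

type Key = String;
type Version = u64;

struct Record {
    version: Version,
    value: i64,
}

// What an optimistic transaction remembers about its reads and intended writes.
struct OccTxn {
    read_set: HashMap<Key, Version>, // version observed at read time
    write_set: HashMap<Key, i64>,    // values to install at commit
}

// Validate-and-commit: succeeds only if nothing we read has changed since.
fn try_commit(db: &mut HashMap<Key, Record>, txn: &OccTxn) -> bool {
    for (key, seen_version) in &txn.read_set {
        match db.get(key) {
            Some(record) if record.version == *seen_version => {}
            _ => return false, // conflict: someone committed in between, abort and retry
        }
    }
    for (key, value) in &txn.write_set {
        let record = db.entry(key.clone()).or_insert(Record { version: 0, value: 0 });
        record.version += 1;
        record.value = *value;
    }
    true
}

In a real engine the validation and write phases run inside one short critical section, so two transactions cannot validate against each other's uncommitted writes.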

Distributed Systems: MIT 6.824

Whilst watching Andy's database lectures, I discovered MIT's 6.824 Distributed Systems course.

The timing was perfect—many topics overlapped:

Topics covered:

  • MapReduce and distributed computation
  • Raft consensus algorithm
  • Fault tolerance and replication
  • Distributed transactions (2PC, 3PC)
  • Consistency models

The Labs

The course includes legendary labs implementing distributed systems in Go:

Lab 1: MapReduce

Implement a coordinator and workers:

type Coordinator struct {
    tasks     []Task
    workers   map[string]Worker
    mutex     sync.Mutex
}

func (c *Coordinator) AssignTask(args *Args, reply *Reply) error {
    c.mutex.Lock()
    defer c.mutex.Unlock()

    // Find an unassigned task and hand it to the requesting worker
    for i := range c.tasks {
        if c.tasks[i].Status == Pending {
            c.tasks[i].Status = Running // mutate the slice element, not a copy
            c.tasks[i].Worker = args.WorkerID
            reply.Task = c.tasks[i]
            return nil
        }
    }
    // No pending task right now; the worker should retry later
    return nil
}

This taught me about coordinating distributed work, handling worker failures, and the challenges of distributed state.

Lab 2: Raft

Implement the Raft consensus algorithm for replicated state machines:

type Raft struct {
    mu        sync.Mutex
    peers     []*labrpc.ClientEnd
    me        int

    currentTerm int
    votedFor    int
    log         []LogEntry

    state       State  // Follower, Candidate, or Leader
}

func (rf *Raft) Start(command interface{}) (index int, term int, isLeader bool) {
    // If this node is the leader, append the command to its log and
    // replicate it to followers; returns the log index, the current term,
    // and whether this node believes it is the leader
    return
}

Implementing leader election, log replication, and handling network partitions gave me a visceral understanding of distributed consensus.

Lab 3: Fault-Tolerant Key/Value Service

Build a linearisable key-value store on top of Raft:

type KVServer struct {
    mu      sync.Mutex
    rf      *Raft
    applyCh chan ApplyMsg

    data    map[string]string
    applied map[int64]Response  // Dedup cache
}

func (kv *KVServer) Get(args *GetArgs, reply *GetReply) {
    // Submit the read through Raft so it is linearisable; only the leader accepts it
    op := Op{Type: "Get", Key: args.Key}
    index, _, isLeader := kv.rf.Start(op)

    if !isLeader {
        reply.Err = ErrWrongLeader
        return
    }

    // Wait for the entry at this log index to arrive on applyCh,
    // then read the value from kv.data and return it
    _ = index
}

This crystallised how distributed databases like CockroachDB and TiDB work under the hood.

Lab 4: Sharded Key/Value Service

Implement sharding with shard migration:

type ShardController struct {
    configs []Config  // Configuration history

    // Each config defines shard -> replica group mapping
}

// Rebalance shards across groups
func (sc *ShardController) Join(args *JoinArgs, reply *JoinReply) {
    // Add new replica group
    // Rebalance shards
}

Understanding shard rebalancing explained why MongoDB migrations can be slow and why consistent hashing matters.
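
A minimal consistent-hashing ring in Rust shows why adding a node only moves a small slice of the keys (virtual nodes, replication, and a proper hash function are left out; DefaultHasher is just what the standard library provides):

use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

// Nodes live at positions on a ring of u64 hash values.
struct Ring {
    nodes: BTreeMap<u64, String>,
}

impl Ring {
    fn new() -> Self {
        Ring { nodes: BTreeMap::new() }
    }

    fn add_node(&mut self, name: &str) {
        self.nodes.insert(hash_of(&name), name.to_string());
    }

    // A key belongs to the first node clockwise from its hash position.
    fn node_for(&self, key: &str) -> Option<&String> {
        let h = hash_of(&key);
        self.nodes
            .range(h..)
            .next()
            .or_else(|| self.nodes.iter().next()) // wrap around the ring
            .map(|(_, node)| node)
    }
}

When a node joins, only the keys between its ring position and its predecessor's move to it; everything else stays put, which is what keeps shard migration tractable.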

Key Distributed Systems Insights

CAP Theorem (finally understood)

You can't have all three:

  • **C**onsistency: All nodes see the same data
  • **A**vailability: Every request gets a response
  • **P**artition tolerance: System works despite network failures

Network partitions will happen, so you choose CP or AP:

  • CP (PostgreSQL, traditional RDBMS): Sacrifice availability during partitions
  • AP (Cassandra, DynamoDB): Sacrifice consistency for availability

Aha moment: Partition tolerance isn't optional, it's physics. Network delays and partitions are unavoidable, so the real choice is only between consistency and availability while a partition lasts.

Consistency Models

Linearisability (strongest):
  [W(x=1)] → [R(x)=1]  # Reads always see latest write

Sequential Consistency:
  All processes see same order, may not be real-time

Eventual Consistency (weakest):
  Given enough time, all replicas converge

Understanding these models explained why:

  • DynamoDB sometimes returns stale data
  • PostgreSQL replication has lag
  • Spanner uses atomic clocks for global consistency

Designing Data-Intensive Applications

Alongside the courses, I read Martin Kleppmann's Designing Data-Intensive Applications. This book ties together databases and distributed systems beautifully.

What Makes DDIA Essential

Part 1: Foundations of Data Systems

Covers storage engines, encoding, and replication with production examples:

  • How PostgreSQL implements replication
  • How Kafka achieves high throughput
  • How Elasticsearch handles full-text search

Part 2: Distributed Data

Explains partitioning, transactions, and consistency models:

  • Why hash partitioning vs range partitioning matters
  • How distributed transactions work (2PC, sagas)
  • Trade-offs between consistency models

Part 3: Derived Data

Batch processing, stream processing, and the future of data systems:

  • MapReduce and Spark
  • Stream processing with Kafka and Flink
  • Event sourcing and CQRS patterns

Concepts That Clicked

The Log is Everything

A surprising insight: almost every data system is built on append-only logs:

  • Databases: Write-ahead log (WAL)
  • Kafka: Distributed commit log
  • Event sourcing: Event log as source of truth

Traditional view:
  Database is the source of truth

Reality:
  Log is the source of truth
  Database is a materialised view of the log
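
That framing is easy to see in code. A toy Rust sketch (a real system would persist the log, checkpoint the state, and stream new entries incrementally):

use std::collections::HashMap;

// The append-only log is the authoritative record of what happened.
enum LogEntry {
    Put { key: String, value: String },
    Delete { key: String },
}

// The "database" is just the state you get by replaying the log.
fn materialise(log: &[LogEntry]) -> HashMap<String, String> {
    let mut state = HashMap::new();
    for entry in log {
        match entry {
            LogEntry::Put { key, value } => {
                state.insert(key.clone(), value.clone());
            }
            LogEntry::Delete { key } => {
                state.remove(key);
            }
        }
    }
    state
}

Crash recovery, replication, and CDC all reduce to replaying some suffix of this log into a different downstream view.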

Change Data Capture (CDC)

Reading the database's log to stream changes elsewhere:

PostgreSQL WAL → Debezium → Kafka → Elasticsearch
                                   → Data warehouse
                                   → Cache invalidation

This pattern powers modern data architectures. Understanding logs made this click.

Batch vs Stream Processing

Batch (MapReduce, Spark):
  Process bounded, historical datasets; high latency, high throughput

Stream (Flink, Kafka Streams):
  Process unbounded, real-time data; low latency, results as events arrive

Not competitors—complementary. Use both in different parts of your system.

Practical Implementation Experience

Building a Simple Storage Engine

To cement my understanding, I implemented a basic LSM tree storage engine:

pub struct LSMTree {
    memtable: MemTable,
    immutable_memtables: Vec<MemTable>,
    sstables: Vec<SSTable>,
}

impl LSMTree {
    pub fn put(&mut self, key: Vec<u8>, value: Vec<u8>) -> Result<()> {
        self.memtable.insert(key, value);

        if self.memtable.size() > MEMTABLE_SIZE_LIMIT {
            self.flush()?;
        }
        Ok(())
    }

    pub fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>> {
        // Check memtable
        if let Some(value) = self.memtable.get(key) {
            return Ok(Some(value.clone()));
        }

        // Check immutable memtables
        for memtable in &self.immutable_memtables {
            if let Some(value) = memtable.get(key) {
                return Ok(Some(value.clone()));
            }
        }

        // Check SSTables (newest to oldest)
        for sstable in self.sstables.iter().rev() {
            if let Some(value) = sstable.get(key)? {
                return Ok(Some(value));
            }
        }

        Ok(None)
    }

    fn flush(&mut self) -> Result<()> {
        let memtable = std::mem::replace(
            &mut self.memtable,
            MemTable::new()
        );

        let sstable = SSTable::from_memtable(&memtable)?;
        self.sstables.push(sstable);

        Ok(())
    }
}

This implementation taught me:

  • Write amplification: Data gets written multiple times during compaction
  • Read amplification: Must check multiple levels for reads
  • Space amplification: Old versions of data consume space until compaction

These trade-offs aren't academic—they're fundamental constraints.
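
To put rough numbers on the write-amplification point for a levelled LSM tree like RocksDB's default (a size ratio of 10 between levels and 5 levels are assumed figures, not properties of my toy implementation):

Each byte is rewritten roughly size-ratio times at each level it passes through:

  write amplification ≈ size ratio × number of levels ≈ 10 × 5 = 50

A single logical write can therefore turn into tens of physical writes over its lifetime, which is why compaction strategy and tuning matter so much for LSM-based stores.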

Implementing Raft

Working through the Raft labs gave me an appreciation for how hard distributed consensus is:

func (rf *Raft) sendAppendEntries(server int, args *AppendEntriesArgs) {
    reply := &AppendEntriesReply{}
    ok := rf.peers[server].Call("Raft.AppendEntries", args, reply)

    if !ok {
        return  // RPC failed, follower may be down
    }

    rf.mu.Lock()
    defer rf.mu.Unlock()

    if reply.Term > rf.currentTerm {
        // We're outdated, step down
        rf.currentTerm = reply.Term
        rf.state = Follower
        return
    }

    if reply.Success {
        // Follower accepted entries, update tracking
        rf.matchIndex[server] = args.PrevLogIndex + len(args.Entries)
        rf.nextIndex[server] = rf.matchIndex[server] + 1

        // Check if we can advance commit index
        rf.updateCommitIndex()
    } else {
        // Follower rejected, back up and retry
        rf.nextIndex[server]--
    }
}

Debugging distributed systems is hard:

  • Network delays cause spurious leader elections
  • Race conditions manifest non-deterministically
  • Logs from multiple nodes must be correlated

I gained massive respect for engineers building distributed databases.

Resources That Made the Difference

Online Courses

Must-watch:

  1. CMU 15-445: Database Systems - Start here
  2. CMU 15-721: Advanced Database Systems - After 445
  3. MIT 6.824: Distributed Systems - Complement to databases

Also excellent:

  • Stanford CS245: Database System Implementation
  • CMU 15-799: Special Topics in Database Systems - Cutting-edge research

Books

Essential reading:

  1. Designing Data-Intensive Applications by Martin Kleppmann
  2. Database Internals by Alex Petrov
  3. Distributed Systems by Maarten van Steen

Also valuable:

  • Database System Concepts (Silberschatz) - Good theory foundation
  • Transaction Processing (Gray & Reuter) - Deep dive on ACID
  • Distributed Algorithms (Lynch) - Formal distributed systems

Papers

Reading database papers became accessible after the courses:

Classic papers:

  • ARIES: A Transaction Recovery Method (IBM, 1992)
  • The Part-Time Parliament (Lamport's Paxos, 1998)
  • In Search of an Understandable Consensus Algorithm (Raft, 2014)

Modern systems:

  • Spanner: Google's Globally-Distributed Database
  • Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
  • CockroachDB: The Resilient Geo-Distributed SQL Database

Community Resources

Database Internals Reading Group

I joined the DDIA Reading Group which meets monthly to discuss papers and implementations. Discussing concepts with others solidified my understanding.

Discord/Slack Communities:

  • CMU Database Group Discord
  • Distributed Systems Discord

These communities helped when I got stuck on labs or concepts.

The Rust Implementation Journey

My goal now is to reimplement the CMU database projects in Rust. Why Rust?

  1. Memory safety without garbage collection - Critical for database performance
  2. Excellent concurrency primitives - Essential for concurrent transactions
  3. Strong type system - Catches bugs at compile time
  4. Growing database ecosystem - TiKV, Databend, RisingWave all use Rust

Current Progress

I'm working through:

Phase 1: Storage Layer ✓

  • Buffer pool manager
  • Page layouts
  • Disk manager

Phase 2: Access Methods (in progress)

  • B+ tree index
  • Hash table index
  • Table heap

Phase 3: Query Execution (planned)

  • Iterator model executors
  • Volcano-style processing
  • Query optimiser

Phase 4: Concurrency (planned)

  • Lock manager
  • MVCC implementation
  • Deadlock detection

Follow my journey in the upcoming post: "Learning Rust Through Database Implementation" (coming soon).

Code Snippets

Here's my Rust buffer pool manager:

pub struct BufferPoolManager {
    pool: Vec<Page>,
    replacer: Box<dyn Replacer>,
    free_list: Vec<FrameId>,
    page_table: HashMap<PageId, FrameId>,
    disk_manager: DiskManager, // reads and writes pages on disk
}

impl BufferPoolManager {
    pub fn fetch_page(&mut self, page_id: PageId) -> Option<&mut Page> {
        // Check if the page is already in the pool
        if let Some(&frame_id) = self.page_table.get(&page_id) {
            self.replacer.pin(frame_id);
            return Some(&mut self.pool[frame_id]);
        }

        // Get a victim frame (from the free list or the replacer)
        let frame_id = self.get_victim_frame()?;

        // Write the evicted page back to disk if it has unflushed changes
        let page = &mut self.pool[frame_id];
        if page.is_dirty() {
            self.disk_manager.write_page(page.id(), page.data()).ok()?;
        }
        self.page_table.remove(&page.id());

        // Load the requested page from disk into the freed frame
        self.disk_manager.read_page(page_id, page.data_mut()).ok()?;
        page.set_id(page_id);
        page.set_pin_count(1);

        self.page_table.insert(page_id, frame_id);
        Some(page)
    }
}

Implementing this in Rust forced me to think about:

  • Ownership and borrowing for concurrent access
  • Pin counts and reference counting
  • Dirty page tracking
  • Frame eviction policies

Lessons Learnt

1. Implementation Beats Theory

Reading about B-trees didn't teach me why they're ubiquitous. Implementing one did.

Before: "B-trees are balanced trees used in databases" After: "B-trees minimise disk I/O by maximising fanout, which is critical because disk seeks dominate query latency"

The difference is understanding versus knowledge.

2. Trade-offs Are Everywhere

There are no silver bullets in databases:

  • Fast writes? Sacrifice read performance (LSM trees)
  • Strong consistency? Sacrifice availability (CAP theorem)
  • Flexible schema? Sacrifice query optimisation (NoSQL)

Understanding trade-offs lets you choose the right tool.

3. Distributed Systems Are Hard

Implementing Raft taught me:

  • Network delays and partitions are unavoidable
  • Debugging distributed systems requires systematic logging
  • Correctness requires formal reasoning (TLA+)

I have immense respect for database engineers now.

4. Real Systems Are Messy

PostgreSQL's codebase is 1.3 million lines. It handles:

  • Crash recovery edge cases
  • Upgrade compatibility
  • Performance regressions
  • Platform-specific quirks

Reading academic papers gives clean algorithms. Building systems reveals the complexity.

What's Next

My database and distributed systems education continues:

Short-term goals:

  • Complete Rust database implementation
  • Contribute to open-source databases (PostgreSQL, TiKV)
  • Read more research papers

Medium-term goals:

  • Build a distributed key-value store
  • Implement a query optimiser
  • Explore column stores and OLAP systems

Long-term goals:

  • Contribute to database research
  • Apply knowledge to astronomical data processing
  • Teach others what I've learnt

Conclusion

2020 was indeed my year for databases and distributed systems. But learning isn't finished—it never is in this field.

What changed wasn't just my technical knowledge. It was my approach to learning:

Old approach: Read theory, memorise concepts, forget details
New approach: Watch lectures, implement systems, struggle with edge cases

The struggle is where learning happens.

If you're frustrated with traditional database courses, try Andy Pavlo's lectures. If you want to understand distributed systems, work through MIT 6.824 labs. If you want everything to click together, read DDIA.

Most importantly: build something. Implement a storage engine. Write a Raft library. Break things and debug them.

That's how you learn databases properly.

References and Resources

Courses:

  • CMU 15-445: Database Systems
  • CMU 15-721: Advanced Database Systems
  • MIT 6.824: Distributed Systems

Books:

  • Designing Data-Intensive Applications by Martin Kleppmann
  • Database Internals by Alex Petrov

My implementations:

  • GitHub: My Rust database implementation (coming soon)
  • GitHub: MIT 6.824 labs in Go (coming soon)

Communities:

  • CMU Database Group Discord
  • DDIA Reading Group


