🐇 2020: The Year of the Rabbit... hole¶
Lockdown gave me time. My PhD gave me problems. Together, they sent me down rabbit holes I didn't know existed.
2020 was supposed to be the year I finished my PhD in data-intensive astronomy. Then COVID happened. Conferences cancelled. Lab access restricted. Fieldwork postponed indefinitely.
What I gained instead was something unexpected: uninterrupted time to go deep on the technologies underpinning my research. No commute, no in-person meetings, no "quick chats" that derail your afternoon. Just me, my astronomical datasets, and an ever-growing list of technologies I needed to understand properly.
What started as "I should probably learn how Spark works" became a year-long journey through databases, distributed systems, embedded machine learning, compression algorithms, and systems programming. Each topic opened doors to three more. Each rabbit hole connected to others in surprising ways.
This is the story of 2020: the year I accidentally built a foundation in modern data infrastructure whilst the world was on pause.
The Context: Data-Intensive Astronomy¶
My PhD focuses on processing astronomical survey data—specifically, detecting transient events in massive datasets from telescopes like LSST (Legacy Survey of Space and Time).
The scale:
- 15 TB of data per night from LSST when operational
- Real-time processing requirements (detect supernovae within 60 seconds)
- Petabyte-scale archives requiring efficient storage and retrieval
- Machine learning inference on resource-constrained edge devices
This isn't "download a CSV and run scikit-learn" data science. This is:
# NOT this
df = pd.read_csv('data.csv')
model.fit(df)
# THIS
cluster = Spark("100-node cluster")
df = cluster.read.parquet("hdfs://petabyte-archive/")
distributed_model.train_on_cluster(df)
deploy_to_edge_device(optimized_model)
Every technology I explored in 2020 connected back to this problem space.
Rabbit Hole #1: Apache Arrow and the Data Format Revolution¶
It started innocently: "Why is reading Parquet files so much faster than CSV?"
This led me to Apache Arrow, and everything clicked.
The Problem¶
Moving data between systems is expensive:
# Reading data in Pandas
pandas_df = pd.read_csv('data.csv') # Parse, allocate, copy
# Pass to Spark
spark_df = spark.createDataFrame(pandas_df) # Serialize, copy, deserialize
# Pass to TensorFlow
tf_dataset = tf.data.Dataset.from_tensor_slices(pandas_df.values) # Copy again
Every handoff requires serialization, deserialization, and memory copies. With gigabytes of data, this dominates runtime.
The Arrow Solution¶
Arrow defines a columnar in-memory format that everyone can use:
Traditional:
Pandas → serialize → Spark → deserialize → process
(copies everywhere, CPU time wasted on conversions)
With Arrow:
Pandas → Arrow buffer → Spark reads same buffer
(zero-copy, instant handoff)
Real example from my work:
import pyarrow as pa
import pyarrow.parquet as pq
# Read Parquet with Arrow (columnar, memory-mapped)
table = pq.read_table('astronomical_sources.parquet', memory_map=True)
# Convert to Pandas (zero-copy where the column types allow it)
df = table.to_pandas(zero_copy_only=True)
# Pass to Spark via Arrow (needs spark.sql.execution.arrow.pyspark.enabled=true)
spark_df = spark.createDataFrame(df)
# Read specific columns only (columnar storage FTW)
magnitude_data = pq.read_table(
'sources.parquet',
columns=['magnitude', 'magnitude_error']
)
Performance impact:
- CSV read: 45 seconds for 10 GB
- Parquet with Arrow: 3 seconds for the same data
- Memory usage: 60% reduction (columnar compression)
Why This Mattered for Astronomy¶
Astronomical data comes in FITS format (Flexible Image Transport System), which is:
- Row-oriented (slow for columnar analytics)
- Uncompressed or poorly compressed
- Difficult to query efficiently
Converting to Arrow/Parquet unlocked:
# Before: Load entire 50GB FITS file to check one column
data = fits.open('huge_survey.fits')[1].data
magnitudes = data['MAG_AUTO'] # Still loaded everything
# After: Read only needed columns
magnitudes = pq.read_table(
'survey.parquet',
columns=['MAG_AUTO']
).to_pandas() # Loaded only 500MB
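The conversion itself is short: astropy reads the FITS table, pyarrow writes Parquet. A minimal sketch (the file names and HDU index are placeholders for whatever your survey file looks like):
from astropy.table import Table
import pyarrow as pa
import pyarrow.parquet as pq
# Read the FITS binary table (HDU 1 here), hop through pandas into Arrow
catalogue = Table.read('huge_survey.fits', hdu=1)
table = pa.Table.from_pandas(catalogue.to_pandas())
# Write columnar, compressed Parquet
pq.write_table(table, 'survey.parquet', compression='zstd')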
This rabbit hole connected to compression (Parquet's encoding schemes), distributed systems (Spark's Parquet reader), and databases (columnar storage engines).
Rabbit Hole #2: Databases and Distributed Systems¶
You can read the full story in my database deep dive, but here's how it connected to my PhD work.
The Trigger¶
I needed to:
1. Store billions of astronomical sources
2. Query them efficiently (cross-matching, nearest neighbour searches)
3. Handle updates (new observations of known sources)
4. Scale horizontally (data keeps growing)
PostgreSQL with PostGIS was struggling. I needed to understand why.
What I Learned¶
Storage engines matter:
-- Why is this slow?
SELECT * FROM sources WHERE q3c_radial_query(ra, dec, 180.0, 45.0, 0.1);
-- Because B-tree index on (ra, dec) doesn't help spatial queries!
-- Need specialized index:
CREATE INDEX sources_q3c_idx ON sources (q3c_ang2ipix(ra, dec));
Understanding B-trees, LSM trees, and spatial indexes (R-trees, k-d trees) explained performance characteristics I'd observed but never understood.
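The same indexing ideas carry over to in-memory work. For quick cross-matching outside the database I reach for a k-d tree: build it once over one catalogue, then answer nearest-neighbour queries in roughly logarithmic time instead of scanning everything. A rough sketch with scipy (it treats RA/Dec as flat Euclidean coordinates, which is only sensible for small fields away from the poles, and the catalogues here are made-up):
import numpy as np
from scipy.spatial import cKDTree
# Two catalogues as (ra, dec) arrays in degrees (placeholder data)
catalogue_a = np.random.uniform([150.0, 1.0], [151.0, 2.0], size=(100_000, 2))
catalogue_b = np.random.uniform([150.0, 1.0], [151.0, 2.0], size=(5_000, 2))
tree = cKDTree(catalogue_a)                        # build once
distances, indices = tree.query(catalogue_b, k=1)  # nearest neighbour per source
match_radius = 1.0 / 3600.0                        # 1 arcsecond, in degrees
matched = distances < match_radius
print(f"{matched.sum()} of {len(catalogue_b)} sources matched")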
Distributed databases for scale:
For petabyte-scale data, single-node PostgreSQL wasn't enough. I explored:
- CockroachDB: Distributed SQL with Raft consensus
- TiDB: MySQL-compatible distributed database
- Cassandra: Wide-column store for time-series data
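To see what "wide-column for time-series" buys you, I sketched how an observations table might be laid out in Cassandra: partition by source, cluster by time, so every observation of a source lives in one partition and time-range scans stay on one node. A sketch using the DataStax Python driver (the keyspace, table, and column names are my own illustration, not a production schema):
from datetime import datetime
from cassandra.cluster import Cluster
session = Cluster(['127.0.0.1']).connect('astro')
# One partition per source, rows clustered by observation time
session.execute("""
    CREATE TABLE IF NOT EXISTS observations (
        source_id bigint,
        observation_time timestamp,
        magnitude double,
        magnitude_error double,
        PRIMARY KEY (source_id, observation_time)
    ) WITH CLUSTERING ORDER BY (observation_time DESC)
""")
# A time-range query hits exactly one partition
rows = session.execute(
    "SELECT observation_time, magnitude FROM observations "
    "WHERE source_id = %s AND observation_time > %s",
    (4242, datetime(2020, 1, 1)),
)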
Query optimization:
-- Slow (250 seconds)
SELECT source_id, magnitude
FROM observations o
JOIN sources s ON o.source_id = s.id
WHERE o.observation_time > '2020-01-01';
-- Fast (12 seconds) - predicate pushdown
SELECT source_id, magnitude
FROM (
SELECT * FROM observations
WHERE observation_time > '2020-01-01'
) o
JOIN sources s ON o.source_id = s.id;
Understanding query plans and optimizer decisions made me write better queries.
This connected to Spark (distributed query engine), Arrow (columnar format), and compression (reduce I/O).
Rabbit Hole #3: Spark 3.0 and Distributed Computing¶
Spark 3.0 was released in June 2020, bringing major improvements for my use case.
Adaptive Query Execution (AQE)¶
Before AQE:
# Spark guesses partition sizes at planning time
df1 = spark.read.parquet("sources.parquet") # 50GB
df2 = spark.read.parquet("observations.parquet") # 500GB
# Join plan might be suboptimal if sizes were misestimated
result = df1.join(df2, "source_id")
With AQE (opt-in via a config flag in 3.0):
# Spark adapts plan during execution based on actual data sizes
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Automatically:
# - Converts sort-merge join to broadcast join if one side is small
# - Coalesces small partitions to reduce overhead
# - Optimizes skew joins
Real impact on my workload:
- Cross-matching astronomical catalogues: 40% faster
- Aggregate queries on partitioned data: 60% faster
- Skewed joins (some sources have many observations): 3x faster
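Most of that came from a handful of configuration knobs layered on top of spark.sql.adaptive.enabled. A sketch of the ones I set explicitly (the values are illustrative and the defaults vary between Spark versions):
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Merge tiny post-shuffle partitions up to a target size
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
# Split pathologically large partitions in skewed joins
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Join sides smaller than this get broadcast instead of shuffled
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100MB")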
Dynamic Partition Pruning¶
# Query: Find observations of bright sources
bright_sources = spark.sql("""
SELECT source_id FROM sources WHERE magnitude < 15
""")
observations = spark.sql("""
SELECT * FROM observations WHERE source_id IN (
SELECT source_id FROM sources WHERE magnitude < 15
)
""")
# Spark 3.0 automatically prunes partitions of observations
# Only reads partitions containing bright sources
# Reduced data scanned: 1TB → 50GB
This was transformative for astronomical queries where we often filter by source properties then look up observations.
Integration with Arrow¶
# Pandas UDFs with Arrow (much faster than row-based UDFs)
from pyspark.sql.functions import pandas_udf
import pandas as pd
@pandas_udf('double')
def calculate_distance_modulus(apparent_mag: pd.Series,
absolute_mag: pd.Series) -> pd.Series:
return apparent_mag - absolute_mag
# Spark handles Arrow serialization internally
df.select(
calculate_distance_modulus('mag_apparent', 'mag_absolute')
).show()
# 10x faster than row-at-a-time UDFs
This rabbit hole connected to databases (query optimization), Arrow (zero-copy), and machine learning (feature engineering at scale).
Rabbit Hole #4: Horovod and Distributed Deep Learning¶
Training neural networks on astronomical images required distributed training.
The Problem¶
# Single-GPU training on 100K images
model = create_resnet50()
model.fit(train_dataset, epochs=50)
# Time: 18 hours
With 1M images, this becomes a week. Not feasible for iteration.
Horovod Solution¶
Horovod uses Ring-AllReduce for efficient distributed training:
import tensorflow as tf
import horovod.tensorflow.keras as hvd
# Initialize Horovod
hvd.init()
# Pin each process to one GPU (one worker process per GPU)
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
# Build model
model = create_resnet50()
# Horovod: scale the learning rate by the number of workers
opt = tf.optimizers.Adam(0.001 * hvd.size())
# Horovod: wrap optimizer with DistributedOptimizer (ring-allreduce on gradients)
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt,
              loss='sparse_categorical_crossentropy',  # loss/metrics are placeholders
              metrics=['accuracy'])
# Broadcast initial weights from rank 0 so all workers start identically
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# Train (only rank 0 prints progress)
model.fit(train_dataset, epochs=50, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
Results with 8 GPUs:
- Training time: 18 hours → 2.5 hours
- Near-linear scaling (7.2x speedup on 8 GPUs)
- Gradient synchronization overhead: ~5%
Integration with Spark¶
from sparkdl import HorovodRunner  # ships with the Databricks ML runtime
def train_hvd(learning_rate):
import horovod.tensorflow as hvd
# Training code here
# Run distributed training on Spark cluster
hr = HorovodRunner(np=8)
hr.run(train_hvd, learning_rate=0.001)
This let me use the same Spark cluster for data preprocessing and model training.
This connected to Spark (distributed execution), TensorFlow (deep learning), and compression (storing model checkpoints efficiently).
Rabbit Hole #5: Embedded Systems and TinyML¶
The realization: astronomical surveys will generate data faster than we can transmit it to data centres. We need to run inference at the telescope.
The Constraint¶
Telescope edge device:
- ARM Cortex-M4 processor (168 MHz)
- 192 KB RAM
- 1 MB Flash storage
- No internet connectivity
- Power budget: 1W
Running a ResNet50 on this? Impossible. But detecting "interesting" events to prioritize transmission? Feasible with TinyML.
TensorFlow Lite Micro¶
// Model: Simple CNN for transient detection (10 KB)
#include <cstring>
#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "model_data.h"
// Allocate memory for inference
constexpr int kTensorArenaSize = 60 * 1024;
uint8_t tensor_arena[kTensorArenaSize];
// Load model and wire up the interpreter
const tflite::Model* model = tflite::GetModel(model_data);
tflite::AllOpsResolver resolver;
tflite::MicroInterpreter interpreter(
    model, resolver, tensor_arena, kTensorArenaSize);
interpreter.AllocateTensors();
// Run inference on image patch
TfLiteTensor* input = interpreter.input(0);
// Copy image data to input tensor
std::memcpy(input->data.f, image_data, input->bytes);
interpreter.Invoke();
TfLiteTensor* output = interpreter.output(0);
float transient_probability = output->data.f[0];
if (transient_probability > 0.8) {
transmit_full_resolution_image();
} else {
discard_image();
}
Model optimization techniques:
# Quantization: FP32 → INT8 (post-training, calibrated on sample data)
import numpy as np
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# calibration_patches: a small sample of training image patches (placeholder)
def representative_data_gen():
    for patch in calibration_patches:
        yield [patch[np.newaxis, ...].astype(np.float32)]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()
# Result:
# Model size: 2.4 MB → 610 KB
# Inference time: 450ms → 85ms (on Cortex-M4)
# Accuracy: 94.2% → 93.8% (acceptable trade-off)
Real Hardware Testing¶
I got my hands on several development boards:
- Arduino Nano 33 BLE Sense: Cortex-M4, 1MB Flash, 256KB RAM
- STM32 Nucleo-F767ZI: Cortex-M7, 2MB Flash, 512KB RAM
- Raspberry Pi Pico: Dual-core Cortex-M0+, 264KB RAM
Results:
- Simple event detection: 60ms latency (acceptable)
- More complex models required quantization-aware training (see the sketch below)
- RAM was the limiting factor more than compute
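Quantization-aware training is what recovered the accuracy that post-training INT8 quantization cost on those deeper models: the tensorflow_model_optimization toolkit inserts fake-quantization ops so the network learns to live with INT8 before conversion. A minimal sketch, assuming the same Keras model and datasets as earlier:
import tensorflow as tf
import tensorflow_model_optimization as tfmot
# Wrap the trained float model with fake-quantization ops
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
# A few fine-tuning epochs are usually enough
qat_model.fit(train_dataset, epochs=3, validation_data=val_dataset)
# Convert as before; the converter reuses the learned quantization ranges
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()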
This rabbit hole connected to compression (model compression), TensorFlow (quantization), and Rust (embedded systems programming).
Rabbit Hole #6: Rust for Systems Programming¶
After writing C++ for embedded systems and Python for data processing, I wanted something better.
Why Rust Appealed¶
// C++ - undefined behaviour waiting to happen
void process_data(const std::vector<float>& data) {
for (int i = 0; i <= data.size(); i++) { // Off-by-one!
std::cout << data[i] << std::endl; // Segfault or garbage
}
}
// Rust - the same off-by-one can't become silent corruption
fn process_data(data: &[f32]) {
    for i in 0..=data.len() { // Off-by-one!
        println!("{}", data[i]); // PANIC: index out of bounds (caught at runtime, never UB)
    }
}
Rust can't catch every off-by-one at compile time, but the failure mode is a clean, immediate panic instead of silent memory corruption, and the borrow checker does catch whole classes of pointer and lifetime bugs before the code ever runs.
My First Rust Project: FITS File Reader¶
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
pub struct FitsReader {
file: File,
headers: Vec<FitsHeader>,
}
impl FitsReader {
pub fn new(path: &str) -> Result<Self, FitsError> {
let mut file = File::open(path)?;
let headers = Self::read_headers(&mut file)?;
Ok(FitsReader { file, headers })
}
pub fn read_data(&mut self, hdu: usize) -> Result<Vec<f32>, FitsError> {
let header = &self.headers[hdu];
let data_size = header.naxis1 * header.naxis2;
self.file.seek(SeekFrom::Start(header.data_offset))?;
let mut buffer = vec![0u8; data_size * 4];
self.file.read_exact(&mut buffer)?;
// Convert big-endian bytes to f32
let data: Vec<f32> = buffer
.chunks_exact(4)
.map(|chunk| f32::from_be_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
.collect();
Ok(data)
}
}
What I learned:
- Ownership and borrowing: No more "who owns this pointer?"
- Error handling: Result<T, E> forced me to handle errors
- Zero-cost abstractions: Iterator chains compile to tight loops
- Fearless concurrency: The type system prevents data races
Performance Comparison¶
Reading 1GB FITS file and computing statistics:
Python (astropy): 8.5 seconds
C++ (cfitsio): 1.2 seconds
Rust (my impl): 1.1 seconds
Memory usage:
Python: 2.1 GB
C++: 1.0 GB
Rust: 1.0 GB
Rust matched C++ performance with none of the memory safety headaches.
Rewriting Database Projects in Rust¶
As mentioned in my database deep dive, I'm reimplementing CMU's BusTub projects in Rust.
// Example: Buffer pool manager with safe concurrency
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};
pub struct BufferPoolManager {
    pool: Vec<Arc<RwLock<Page>>>,
    page_table: Arc<Mutex<HashMap<PageId, FrameId>>>,
    free_list: Arc<Mutex<Vec<FrameId>>>,
}
impl BufferPoolManager {
    pub fn fetch_page(&self, page_id: PageId) -> Option<Arc<RwLock<Page>>> {
        // Lock only the page table, not the entire buffer pool
        let page_table = self.page_table.lock().unwrap();
        if let Some(&frame_id) = page_table.get(&page_id) {
            return Some(Arc::clone(&self.pool[frame_id]));
        }
        // Page not in pool, need to load it
        drop(page_table); // Release lock before I/O
        // Get a free frame, read the page from disk, update the page table
        // (eviction and disk I/O elided in this sketch)
        // Rust ensures no data races even with multiple threads
        None
    }
}
The type system rules out data races at compile time (deadlocks and logic races are still your problem). For a database buffer pool, that guarantee is huge.
This connected to embedded systems (Rust on bare metal), databases (storage engine implementations), and compression (writing codecs).
Rabbit Hole #7: TensorFlow 2.x and Modern Deep Learning¶
TensorFlow 2.0 was a complete redesign. I needed to relearn everything.
Eager Execution Changed Everything¶
# TensorFlow 1.x - define graph, then run
graph = tf.Graph()
with graph.as_default():
x = tf.placeholder(tf.float32, shape=[None, 784])
W = tf.Variable(tf.zeros([784, 10]))
y = tf.matmul(x, W)
with tf.Session(graph=graph) as sess:
sess.run(tf.global_variables_initializer())
result = sess.run(y, feed_dict={x: data})
# TensorFlow 2.x - eager by default
x = tf.constant(data, dtype=tf.float32)
W = tf.Variable(tf.zeros([784, 10]))
y = tf.matmul(x, W) # Executes immediately
print(y.numpy()) # No session needed!
Debugging became sane:
# Can use normal Python debugging
@tf.function
def train_step(images, labels):
with tf.GradientTape() as tape:
predictions = model(images, training=True)
loss = loss_fn(labels, predictions)
# tf.print works inside a tf.function; to step through with pdb,
# temporarily run eagerly via tf.config.run_functions_eagerly(True)
tf.print("Loss:", loss)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
return loss
Keras Integration¶
# Building models became elegant
model = tf.keras.Sequential([
tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(64, 64, 1)),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Conv2D(64, 3, activation='relu'),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(10, activation='softmax')
])
# Training is simple
model.compile(
optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
model.fit(train_dataset, epochs=10, validation_data=val_dataset)
Custom Training Loops¶
For my astronomical data, I needed custom training logic:
@tf.function
def train_step(images, labels, sample_weights):
with tf.GradientTape() as tape:
predictions = model(images, training=True)
# Custom loss: weight rare classes higher
base_loss = loss_fn(labels, predictions)
weighted_loss = base_loss * sample_weights
loss = tf.reduce_mean(weighted_loss)
# Add regularization
l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in model.trainable_variables])
loss = loss + 0.0001 * l2_loss
gradients = tape.gradient(loss, model.trainable_variables)
# Gradient clipping for stability
gradients, _ = tf.clip_by_global_norm(gradients, 1.0)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
return loss
# Training loop
for epoch in range(num_epochs):
for images, labels, weights in train_dataset:
loss = train_step(images, labels, weights)
# Evaluate on validation set
val_accuracy = evaluate(model, val_dataset)
print(f"Epoch {epoch}: loss={loss:.4f}, val_acc={val_accuracy:.4f}")
Mixed Precision Training¶
For faster training on modern GPUs:
# Enable mixed precision
policy = tf.keras.mixed_precision.Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(policy)
# Model automatically uses float16 for computation, float32 for variables
model = create_model()
# Optimizer with loss scaling to stop small float16 gradients underflowing to zero
# (model.compile wraps the optimizer automatically under the global policy;
#  wrap it manually when writing a custom training loop)
optimizer = tf.keras.optimizers.Adam()
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(optimizer)
# Training is 2-3x faster with minimal accuracy loss
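model.fit handles the loss scaling for you once the policy is set; in a custom training loop the LossScaleOptimizer has to be used explicitly. A minimal sketch of the pattern, assuming the same model, loss_fn, and wrapped optimizer as above:
@tf.function
def mixed_precision_train_step(images, labels):
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = loss_fn(labels, predictions)
        # Scale the loss up so small float16 gradients don't underflow to zero
        scaled_loss = optimizer.get_scaled_loss(loss)
    scaled_gradients = tape.gradient(scaled_loss, model.trainable_variables)
    # Undo the scaling before applying the update
    gradients = optimizer.get_unscaled_gradients(scaled_gradients)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss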
This connected to distributed training (Horovod), embedded ML (model optimization), and compression (model quantization).
Rabbit Hole #8: Compression and Codecs¶
Astronomical data is huge and redundant. Compression became critical.
The Problem¶
Single night of LSST observations:
- 3,200 exposures
- 3.2 GB per exposure (189 CCDs × 16 megapixels × 2 bytes)
- Total: 10.24 TB per night
- Uncompressed archive after 10 years: 37 petabytes
Clearly unsustainable without compression.
Lossless Compression for Science¶
Astronomical images require lossless compression (can't lose real signals).
FITS compression schemes:
from astropy.io import fits
# Rice compression (good for integer data)
hdu = fits.CompImageHDU(image_data, compression_type='RICE_1')
# GZIP (universal but slower)
hdu = fits.CompImageHDU(image_data, compression_type='GZIP_1')
# Results on typical astronomical image:
# Uncompressed: 32 MB
# Rice: 18 MB (1.8x compression)
# GZIP: 21 MB (1.5x compression, but slower)
Modern codecs perform better:
import blosc
import zstandard as zstd
# Blosc (optimized for numerical data)
compressed = blosc.compress(image_data.tobytes(), typesize=2, cname='lz4')
# Compression: 2.3x, Speed: 1.2 GB/s
# Zstandard (better ratio, slightly slower)
compressor = zstd.ZstdCompressor(level=3)
compressed = compressor.compress(image_data.tobytes())
# Compression: 2.8x, Speed: 600 MB/s
Columnar Format Compression¶
Parquet's encoding schemes:
# Example: Magnitude column with limited precision
magnitudes = [18.234, 18.237, 18.241, 18.239, ...]
# Dictionary encoding (values repeat)
dictionary = [18.234, 18.237, 18.239, 18.241]
encoded = [0, 1, 3, 2, ...] # Indices into dictionary
# Delta encoding (values are sequential)
base = 18.234
deltas = [0, 3, 7, 5, ...] # In units of 0.001
# Bit packing (deltas fit in 8 bits)
packed = pack_into_bytes(deltas, bit_width=8)
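You rarely hand-roll these encodings: Parquet writers apply dictionary encoding, bit packing, and page compression for you, and pyarrow exposes the knobs. A sketch (the sizes are illustrative, and df is the catalogue DataFrame from the next example):
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.Table.from_pandas(df)
pq.write_table(
    table,
    'catalogue.parquet',
    compression='zstd',          # page-level compression codec
    use_dictionary=True,         # dictionary-encode columns (the default)
    data_page_size=1024 * 1024,  # smaller pages → finer-grained column reads
)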
Real results:
# Astronomical catalogue (100M sources)
df = pd.DataFrame({
'source_id': np.arange(100_000_000),
'ra': ra_coordinates,
'dec': dec_coordinates,
'magnitude': magnitudes,
})
# Uncompressed CSV: 12 GB
# Parquet (no compression): 4.2 GB (columnar layout)
# Parquet (Snappy): 1.8 GB (2.3x over columnar)
# Parquet (Zstd): 1.4 GB (3.0x over columnar)
# Read time:
# CSV: 180 seconds
# Parquet (Snappy): 8 seconds
# Parquet (Zstd): 12 seconds (better compression, slightly slower read)
Video Codecs for Time-Series Data¶
Astronomical images are really time-series: same sky region observed repeatedly.
Insight: Video codecs exploit temporal redundancy!
import av
# Treat image sequence as video
container = av.open('observations.mp4', 'w')
stream = container.add_stream('h264', rate=1)  # 1 fps
stream.pix_fmt = 'gray'
stream.width = 4096
stream.height = 4096
stream.options = {'qp': '0'}  # constant QP 0 = lossless x264
for image in observation_sequence:
    frame = av.VideoFrame.from_ndarray(image, format='gray')
    for packet in stream.encode(frame):
        container.mux(packet)
# Flush any frames still buffered in the encoder
for packet in stream.encode():
    container.mux(packet)
container.close()
# Results:
# 1000 images × 32 MB = 32 GB uncompressed
# FITS Rice: 18 GB
# H.264 lossless: 8 GB (4x compression!)
Caveat: Decoding is serial (can't jump to frame 500 without decoding 0-499). Trade-offs!
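The usual mitigation is a shorter keyframe interval, so random access only has to decode back to the nearest keyframe, at some cost in compression ratio. With PyAV that is just another encoder option ('g' is x264/FFmpeg's GOP size; the value here is illustrative):
# Revisiting the stream setup above: force a keyframe every 50 frames
stream.options = {'qp': '0', 'g': '50'}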
Writing a Custom Codec in Rust¶
For fun, I implemented a simple codec optimized for astronomical data:
pub fn compress_astronomical_image(image: &[i16]) -> Vec<u8> {
    let mut output = Vec::new();
    // Store the first pixel verbatim so the delta chain can be reconstructed
    output.extend_from_slice(&image[0].to_be_bytes());
    // 1. Delta encoding (nearby pixels are similar)
    let deltas: Vec<i16> = image
        .windows(2)
        .map(|w| w[1] - w[0])
        .collect();
// 2. Bit packing (most deltas fit in 8 bits)
let mut bit_writer = BitWriter::new(&mut output);
for &delta in &deltas {
if delta.abs() < 128 {
bit_writer.write_bit(0); // Small delta
bit_writer.write_bits(delta as u8, 8);
} else {
bit_writer.write_bit(1); // Large delta
bit_writer.write_bits(delta as u16, 16);
}
}
// 3. Run-length encoding on bit-packed data
run_length_encode(&mut output);
output
}
Results (compared to gzip):
- Compression ratio: 2.1x vs 1.8x (gzip)
- Compression speed: 450 MB/s vs 80 MB/s (gzip)
- Decompression speed: 680 MB/s vs 220 MB/s (gzip)
Not production-ready, but a great learning experience.
This connected to Arrow (columnar compression), databases (storage compression), and Rust (writing performance-critical code).
How It All Connected¶
The beautiful part: every rabbit hole connected to the others.
Apache Arrow ←→ Databases (columnar storage)
↓ ↓
Spark 3.0 ←→ Distributed Systems
↓ ↓
Horovod ←→ TensorFlow 2.x
↓ ↓
TinyML ←→ Compression
↓ ↓
Rust (implementing everything)
Concrete example:
My astronomical processing pipeline uses:
- Arrow/Parquet: Store astronomical catalogues in columnar format
- Spark: Distributed processing for cross-matching catalogues
- TensorFlow: Train transient detection model on Spark cluster via Horovod
- Compression: Zstd-compressed Parquet files reduce storage by 3x
- TinyML: Quantized model deployed to edge device
- Rust: Custom FITS reader and compression codec
All these pieces working together.
What Stuck and What Didn't¶
What Became Essential¶
Arrow and Parquet: Now my default for any structured data. The performance gains are too significant to ignore.
Spark 3.0: Adaptive query execution made a real difference. I use Spark for any data processing that doesn't fit in memory.
TensorFlow 2.x: Eager execution and Keras integration made deep learning pleasant. I'm productive in TF2 in a way I never was with TF1.
Database fundamentals: Understanding storage engines, query optimization, and indexing changed how I design data pipelines.
Rust: Still using it for performance-critical components. The learning curve was steep, but the compiler is genuinely helpful.
What I Explored But Didn't Adopt¶
TinyML on the edge: Promising but not quite ready for production. Inference latency was acceptable, but deploying updates to remote devices remained challenging.
Custom compression codecs: Existing codecs (Zstd, Blosc) are good enough. Writing custom codecs is fun but rarely necessary.
Horovod for small-scale training: For 2-4 GPUs, tf.distribute.MirroredStrategy is simpler. Horovod shines at 8+ GPUs.
Lessons Learned¶
1. Deep Beats Broad¶
I learned more by going deep on interconnected topics than I would have by skimming many unrelated areas.
Depth: Understanding Arrow led to understanding columnar storage, which led to understanding database indexes, which led to understanding query optimization.
Breadth: Watching 10 YouTube videos on "10 Python libraries you should know" teaches you nothing.
2. Build Things¶
Reading about databases didn't teach me databases. Implementing a storage engine did.
Reading about compression didn't teach me compression. Writing a codec did.
Reading about Rust didn't teach me Rust. Porting C++ code to Rust did.
3. Context Matters¶
Every technology I learned connected to my PhD work. I wasn't learning Rust abstractly—I was learning Rust to write faster FITS file readers.
This gave me:
- Motivation: Solve real problems, not toy examples
- Feedback: See if it actually helps my research
- Retention: Used the knowledge immediately
4. Rabbit Holes Are Good¶
Following curiosity is valuable. "Why is Parquet fast?" led to a year of learning that made me a better researcher and engineer.
Don't fight the rabbit holes. Embrace them.
The COVID Factor¶
Let's be honest: lockdown played a huge role.
Time gained:
- No commute: +2 hours per day
- No in-person meetings: +3-4 hours per week
- No conferences: +2 weeks total
- No fieldwork: +1 month
That's roughly 500 extra hours in 2020.
But it wasn't just time—it was uninterrupted time. Deep work on complex topics requires long, focused sessions. Lockdown provided that.
The trade-off:
- Lost: In-person collaboration, serendipitous conversations, conference networking
- Gained: Depth of understanding in core technologies
Would I trade it? No. The foundations I built in 2020 continue to compound.
Looking Back¶
2020 was the rabbit hole year. Not by plan, but by circumstance.
COVID gave me time. My PhD gave me problems. Curiosity gave me direction.
The result: a year of learning that fundamentally changed how I approach data-intensive computing.
Every rabbit hole was worth it.
References and Resources¶
My other 2020 posts:
- Databases and Distributed Systems Deep Dive - Full story of learning databases
- Git in the Habit - Development workflow improvements

Technologies explored:
- Apache Arrow
- Apache Spark 3.0
- Horovod
- TensorFlow Lite for Microcontrollers
- Blosc compression
- Zstandard

Books that helped:
- Designing Data-Intensive Applications by Martin Kleppmann
- Programming Rust by Jim Blandy and Jason Orendorff
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron

My code (coming soon):
- Rust FITS reader
- Custom astronomical compression codec
- TinyML transient detection model