Every idea that powers Postgres and Kafka — shown in shell script you can read without a compiler. Seven chapters, one concept each. Each opens with the theory, shows the exact implementation, then explains what a production database does differently.
Chapter 1 — The append-only log
Why the simplest possible write strategy turns out to be one of the most powerful ideas in database design.
For this project append-only was the default and obvious choice for simplicity's sake
Instead of finding where a key lives and overwriting it, an append-only log just adds a new record to the end of a file. Every write is a sequential disk write — no seeking, no locking of individual records, no risk of partially overwriting data.
Benefits of append-only design
- Data integrity - Existing records are not modified, audit-only logs provide a natural audit trail and make it easier to maintain data consistency acrosss distributed-systems
- Perfomance - sequential writes are faster than random access patterns
- Simplicity - Append-only simplifies system design by eliminating update and delete operations
The tradeoff is reads: to find the latest value for a key, you scan from the end. This database stores every write as a CSV line with a timestamp and SHA256 checksum. The record format is: key,"value",timestamp,checksum.
db_set() {
_db_validate_key_value "$1" "$2" || return 1
_db_check_flock || return 1
local record
record=$(_db_format_record "$1" "$2")
(
_db_lock_exclusive
_db_write_record "$record" # append — never overwrites
) 200>"${db}.lock"
}
_db_format_record() {
local key="$1" value="$2"
local timestamp payload checksum
timestamp=$(_db_timestamp) # ISO-8601
payload="$key,\"$value\",$timestamp" # 3-field CSV payload
checksum=$(_db_checksum "$payload") # SHA256
echo "$payload,$checksum" # 4-field final record
}
Every write calls _db_format_record to build a 4-field CSV line, then appends it. No record is ever modified in place, old versions stay in the file forever until compaction removes them.
To retrieve a value, db_get uses awk to filter lines by key, then takes the last one, the most recent write wins.
_db_checksum() {
local payload="$1"
echo -n "$payload" | sha256sum | cut -d' ' -f1
}
_db_timestamp() {
date -Iseconds
}
Chapter 2 — The write-ahead log
How databases survive crashes mid-write — and why writing everything twice is worth the cost.
The problem: if a process dies mid-write, the main database file could be partially written and corrupt. The WAL solves this by recording your intent before acting on it. On startup, any data in the WAL that isn't in the main file gets replayed.
Additionally, WAL significantly reduces disk writes. Only the WAL file needs to be flushed to the db after a transaction is committed. WAL also makes it possible to support on-line backup and point-in-time recovery.
This database writes every record to db.wal first, then fsyncs to make it durable. The main db file is only updated when you call db_sync. On db_init, leftover transaction files are cleaned up.
db_sync appends WAL to main file, then truncates.
_db_append_wal() {
local record="$1"
echo "$record" >> "$db.wal" # write to WAL first...
_db_fsync_wal # ...then fsync to disk
if _db_replica_enabled; then
_db_replica_append "$record" # fan out to replica
fi
}
_db_fsync_wal() {
if [ "${DB_NO_FSYNC:-0}" = "1" ]; then
return 0 # skip for benchmarks (trade durability for speed)
fi
sync "$db.wal" 2>/dev/null || true
}
_db_sync_wal() {
[ ! -s "$db.wal" ] && return 0
cat "$db.wal" >> "$db" # flush WAL → main database file
: > "$db.wal" # truncate WAL
sync
}
db_init() {
# roll back leftover transaction files from a previous crash
if [ -f "${db}.tx.wal" ] || [ -f "${db}.tx.snapshot" ]; then
db_log_warn "Found leftover tx files; rolling back incomplete transaction"
rm -f "${db}.tx.wal" "${db}.tx.snapshot"
fi
# WAL itself is replayed implicitly: _db_read_all_lines reads both
# $db and $db.wal on every query, so nothing explicit is needed
}
wal_sync_method. WAL is also the basis for streaming replication — standbys replay the same WAL stream..db + .db-wal). WAL is appended first, then checkpointed to the main database file. Multiple readers can proceed concurrently with one writer during checkpointing.Chapter 3 — Compaction
Logs grow forever. Compaction merges them down — keeping only the latest value per key, discarding tombstones and expired records.
An append-only log accumulates duplicate keys and tombstones (deletion markers). Over time it balloons. Compaction scans the full log, keeps only the most recent record per key, and discards deleted keys entirely.
This is also where expired TTL records get purged. The compacted output is written to a temp file first, then atomically renamed over the original — so a crash during compaction can never corrupt the database. The WAL is cleared afterwards because all surviving records are now in the main file.
mv. O(n) time and O(n) space.
db_compact() {
local tmp
tmp=$(mktemp -p "$(dirname "$db")")
( _db_lock_exclusive
declare -A latest
# pass 1: build key → latest-record map
while IFS= read -r line; do
key="${line%%,*}"
latest["$key"]="$line" # later writes silently overwrite earlier ones
done < <(_db_read_all_lines)
# pass 2: write survivors — skip tombstones and expired TTL keys
for key in "${!latest[@]}"; do
value=$(_db_extract_value "${latest[$key]%,*}")
if [ "$value" != "__deleted__" ] && ! _db_is_expired "$key"; then
echo "${latest[$key]}" >> "$tmp"
fi
done
mv "$tmp" "$db" # atomic rename — crash-safe
: > "$db.wal" # clear WAL (all data now in main file)
) 200>"${db}.lock"
}
Chapter 4 — Checksums & data integrity
How every record is fingerprinted, and why returning a silently corrupt value is worse than crashing.
Every record stores a SHA256 checksum of its payload. On every read, the checksum is recomputed and compared. A mismatch means the record was corrupted — by a bad disk sector, a partial write, a bug, or manual editing. The database returns an error rather than a silently wrong value.
The checksum covers the 3-field payload (key,"value",timestamp) and is stored as the 4th field. This means the timestamp is part of what's checksummed — tampering with the timestamp is detectable.
sha256sum. Verified on every read. Strong but slower than a CRC.
_db_checksum() {
local payload="$1"
echo -n "$payload" | sha256sum | cut -d' ' -f1
}
_db_format_record() {
local key="$1" value="$2"
local timestamp payload checksum
timestamp=$(_db_timestamp)
payload="$key,\"$value\",$timestamp" # what we checksum
checksum=$(_db_checksum "$payload") # SHA256 of 3-field payload
echo "$payload,$checksum" # 4-field record on disk
}
# result: mykey,"hello",2026-06-12T02:56:00+00:00,a3f9...
# inside db_get, after reading the matching line:
local payload checksum computed value
payload="${line%,*}" # strip last field (checksum)
checksum="${line##*,}" # extract last field
computed=$(_db_checksum "$payload")
if [ "$checksum" != "$computed" ]; then
echo "Error: checksum mismatch" >&2
return 1 # fail loudly — never return corrupt data
fi
initdb --data-checksums). Verified on read. Background scrubber thread can scan the whole heap. Off by default because the CPU cost is non-trivial.Chapter 5 — Concurrency & file locking
How multiple readers and writers coexist without corrupting each other — using the POSIX advisory lock that ships with every Linux system.
This database uses flock — POSIX advisory file locks — to coordinate access. Writers acquire an exclusive lock on a .lock file. Multiple readers can hold shared locks simultaneously. A writer blocks until all current readers finish; readers block while a writer holds the lock.
The key idiom is the subshell: ( ... ) 200>"${db}.lock". The lock file is opened on file descriptor 200, flock locks that fd, and when the subshell exits the fd is automatically closed — releasing the lock. No explicit unlock needed.
_db_lock_exclusive() {
if _db_tx_active; then return 0; fi # tx already holds the lock
local timeout="${DB_LOCK_TIMEOUT:-10}"
flock -w "$timeout" -x 200 # exclusive write lock on fd 200
}
_db_lock_shared() {
if _db_tx_active; then return 0; fi
local timeout="${DB_LOCK_TIMEOUT:-10}"
flock -w "$timeout" -s 200 # shared read lock on fd 200
}
# The subshell pattern — lock is held for duration of the block:
( _db_lock_exclusive; _db_write_record "$record" ) 200>"${db}.lock"
( _db_lock_shared; result=$(_db_read) ) 200>"${db}.lock"
# 5 workers, 20 increments each = exactly 100 total
db_set "counter" "0"
for ((w = 0; w < 5; w++)); do
( for ((i = 0; i < 20; i++)); do
db_incr "counter" >/dev/null # atomic read-modify-write
done ) &
done
wait
result=$(db_get "counter")
# result must equal 100 — never 99, never 101
xmin/xmax transaction IDs). Readers see a snapshot of committed data at transaction start — no file locks needed. Readers never block writers; writers never block readers.flock, but the semantics are identical.Chapter 6 — Transactions & snapshot isolation
Copy-on-write snapshots give each transaction a consistent view of the world — the same idea behind Postgres's MVCC, built entirely from files.
When you call db_begin, the current database and WAL are concatenated into a .tx.snapshot file. All reads during the transaction read from that frozen snapshot. Writes go to a separate .tx.wal.
On db_commit, the transaction WAL is appended to the main WAL under an exclusive lock. On db_rollback, both files are deleted — the world is exactly as if nothing happened. Nested transactions are supported: the inner commit only merges at the outermost level.
db_begin() {
if [ "$db_tx_level" -eq 0 ]; then
# freeze current state into an immutable snapshot file
cat "$db" "$db.wal" > "$(_db_tx_snapshot)" 2>/dev/null
: > "$(_db_tx_wal)" # fresh tx WAL
fi
db_tx_level=$(( db_tx_level + 1 )) # supports nesting
}
db_commit() {
db_tx_level=$(( db_tx_level - 1 ))
[ "$db_tx_level" -gt 0 ] && return 0 # inner commit — wait for outer
( _db_lock_exclusive
cat "$(_db_tx_wal)" >> "$db.wal" # merge tx WAL into main WAL
rm -f "$(_db_tx_wal)" "$(_db_tx_snapshot)"
) 200>"${db}.lock"
}
db_rollback() {
rm -f "$(_db_tx_wal)" "$(_db_tx_snapshot)" # discard everything
db_tx_level=0 # abort all levels
}
# reads inside a transaction go to the snapshot, not the live files:
_db_read_key_lines_for_query() {
if _db_tx_active; then
awk -F, -v k="$1" '$1==k' "$(_db_tx_snapshot)" "$(_db_tx_wal)"
else
_db_read_key_lines "$1"
fi
}
xmin/xmax transaction IDs on every row version. Each transaction sees only rows where xmin <= txid < xmax. No snapshot file copy — just compare IDs. O(1) transaction start.Chapter 7 — LSM trees & log rotation
When one log grows too large, segment it. Once you have multiple segment files, you have the skeleton of an LSM tree — and Cassandra starts making sense.
db_truncate rotates the main database file when it exceeds a size threshold — compressing the old segment with gzip and renaming it db.1.gz, then db.2.gz, and so on (up to three segments). Reads scan all segments in order from newest to oldest.
This is a primitive LSM (Log-Structured Merge) tree: multiple immutable segment files, newest checked first. The big missing piece is sorted keys within each segment — that's what SSTables add, enabling binary search instead of full scans. But the multi-level structure is already here.
db_truncate() {
local max_bytes="${1:-10485760}" # default 10 MB
local file_size
file_size=$(stat -c %s "$db")
[ "$file_size" -le "$max_bytes" ] && return 0 # not large enough yet
# rotate: db.2 → db.3, db.1 → db.2, db → db.1.gz
[ -f "${db}.2" ] && mv "${db}.2" "${db}.3"
[ -f "${db}.2.gz" ] && mv "${db}.2.gz" "${db}.3.gz"
[ -f "${db}.1" ] && mv "${db}.1" "${db}.2"
[ -f "${db}.1.gz" ] && mv "${db}.1.gz" "${db}.2.gz"
mv "$db" "${db}.1"
gzip -f "${db}.1" # compress old segment
: > "$db" # fresh active file
}
_db_segment_files() {
# yield older segments first so newer files win on key conflicts
for seg in "${db}.3" "${db}.2" "${db}.1"; do
[ -f "$seg" ] && echo "$seg"
[ -f "$seg.gz" ] && echo "$seg.gz"
done
}
_db_read_key_lines() {
local key="$1"
# scan old segments, then active file and WAL
for seg in $(_db_segment_files); do
_db_cat_file "$seg" | awk -F, -v k="$key" '$1 == k'
done
awk -F, -v k="$key" '$1 == k' "$db" "$db.wal" 2>/dev/null
}
db.1.gz segments. Compaction merges SSTables. Bloom filters and partition indexes eliminate unnecessary scans.Source
The complete repo is available at github.com/DarynOngera/relic.