Every idea that powers Postgres and Kafka — shown in shell script you can read without a compiler. Seven chapters, one concept each. Each opens with the theory, shows the exact implementation, then explains what a production database does differently.

Chapter 1 — The append-only log

Why the simplest possible write strategy turns out to be one of the most powerful ideas in database design.

"The simplest possible database can be implemented with two Bash functions... db_set appends a key-value pair to the end of a file."
— Designing Data-Intensive Applications, Chapter 3

For this project append-only was the default and obvious choice for simplicity's sake

Instead of finding where a key lives and overwriting it, an append-only log just adds a new record to the end of a file. Every write is a sequential disk write — no seeking, no locking of individual records, no risk of partially overwriting data.

Benefits of append-only design

  • Data integrity - Existing records are not modified, audit-only logs provide a natural audit trail and make it easier to maintain data consistency acrosss distributed-systems
  • Perfomance - sequential writes are faster than random access patterns
  • Simplicity - Append-only simplifies system design by eliminating update and delete operations

The tradeoff is reads: to find the latest value for a key, you scan from the end. This database stores every write as a CSV line with a timestamp and SHA256 checksum. The record format is: key,"value",timestamp,checksum.

This database
Sequential append to file, scan from end on read. O(n) reads, O(1) writes.
Production
Same idea — Kafka is literally this. Postgres adds a heap file with page layout for O(1) reads.
bash src/db_utils.sh + src/db_ops.sh
db_set() {
  _db_validate_key_value "$1" "$2" || return 1
  _db_check_flock || return 1

  local record
  record=$(_db_format_record "$1" "$2")

  (
    _db_lock_exclusive
    _db_write_record "$record"  # append — never overwrites
  ) 200>"${db}.lock"
}

_db_format_record() {
  local key="$1" value="$2"
  local timestamp payload checksum
  timestamp=$(_db_timestamp)         # ISO-8601
  payload="$key,\"$value\",$timestamp"  # 3-field CSV payload
  checksum=$(_db_checksum "$payload") # SHA256
  echo "$payload,$checksum"            # 4-field final record
}

Every write calls _db_format_record to build a 4-field CSV line, then appends it. No record is ever modified in place, old versions stay in the file forever until compaction removes them.

To retrieve a value, db_get uses awk to filter lines by key, then takes the last one, the most recent write wins.

bash src/db_utils.sh
_db_checksum() {
  local payload="$1"
  echo -n "$payload" | sha256sum | cut -d' ' -f1
}

_db_timestamp() {
  date -Iseconds
}
Apache Kafka direct equivalent
Kafka's entire architecture is an append-only log. Producers append, consumers read by offset. No deletes, no in-place updates — just sequential writes to segment files.
LevelDB / RocksDB evolved form
Writes go to an in-memory memtable first, then flush to immutable SSTable files on disk. Same append-only principle, but sorted keys within segments enable binary search reads.
PostgreSQL heap + WAL hybrid
Postgres uses an append-only WAL for durability, but also updates its heap files in-place for fast reads. Two separate write paths, two separate structures.
1 of 7

Chapter 2 — The write-ahead log

How databases survive crashes mid-write — and why writing everything twice is worth the cost.

"The WAL is an append-only file to which every modification must be written before it can be applied to the data files. If the database crashes, the WAL can be replayed to bring the database back up to date."
— Designing Data-Intensive Applications, Chapter 3

The problem: if a process dies mid-write, the main database file could be partially written and corrupt. The WAL solves this by recording your intent before acting on it. On startup, any data in the WAL that isn't in the main file gets replayed.

Additionally, WAL significantly reduces disk writes. Only the WAL file needs to be flushed to the db after a transaction is committed. WAL also makes it possible to support on-line backup and point-in-time recovery.

This database writes every record to db.wal first, then fsyncs to make it durable. The main db file is only updated when you call db_sync. On db_init, leftover transaction files are cleaned up.

This database
Flat text WAL. Every write fsyncs individually. db_sync appends WAL to main file, then truncates.
Production
Binary WAL format, group commits, configurable fsync policy. Postgres WAL has its own streaming replication protocol built on top.
bash src/db_storage.sh
_db_append_wal() {
  local record="$1"
  echo "$record" >> "$db.wal"  # write to WAL first...
  _db_fsync_wal                  # ...then fsync to disk
  if _db_replica_enabled; then
    _db_replica_append "$record"  # fan out to replica
  fi
}

_db_fsync_wal() {
  if [ "${DB_NO_FSYNC:-0}" = "1" ]; then
    return 0  # skip for benchmarks (trade durability for speed)
  fi
  sync "$db.wal" 2>/dev/null || true
}

_db_sync_wal() {
  [ ! -s "$db.wal" ] && return 0
  cat "$db.wal" >> "$db"  # flush WAL → main database file
  : > "$db.wal"            # truncate WAL
  sync
}
bash src/db_maint.sh
db_init() {
  # roll back leftover transaction files from a previous crash
  if [ -f "${db}.tx.wal" ] || [ -f "${db}.tx.snapshot" ]; then
    db_log_warn "Found leftover tx files; rolling back incomplete transaction"
    rm -f "${db}.tx.wal" "${db}.tx.snapshot"
  fi

  # WAL itself is replayed implicitly: _db_read_all_lines reads both
  # $db and $db.wal on every query, so nothing explicit is needed
}
PostgreSQL WAL direct equivalent
Binary WAL records with LSN (Log Sequence Numbers). Configurable wal_sync_method. WAL is also the basis for streaming replication — standbys replay the same WAL stream.
SQLite WAL mode direct equivalent
Same two-file pattern (.db + .db-wal). WAL is appended first, then checkpointed to the main database file. Multiple readers can proceed concurrently with one writer during checkpointing.
MySQL InnoDB redo log direct equivalent
Circular redo log that captures every change before it hits the InnoDB tablespace. Crash recovery replays the redo log forward, then applies the undo log to roll back uncommitted transactions.
2 of 7

Chapter 3 — Compaction

Logs grow forever. Compaction merges them down — keeping only the latest value per key, discarding tombstones and expired records.

"We can then perform compaction... throwing away duplicate keys in the log, and keeping only the most recent update for each key."
— Designing Data-Intensive Applications, Chapter 3

An append-only log accumulates duplicate keys and tombstones (deletion markers). Over time it balloons. Compaction scans the full log, keeps only the most recent record per key, and discards deleted keys entirely.

This is also where expired TTL records get purged. The compacted output is written to a temp file first, then atomically renamed over the original — so a crash during compaction can never corrupt the database. The WAL is cleared afterwards because all surviving records are now in the main file.

This database
Full scan into a Bash associative array, write survivors to tmpfile, atomic mv. O(n) time and O(n) space.
Production
RocksDB runs multi-level compaction in background threads — never blocking reads or writes. Tiered strategies optimize for write vs read amplification.
bash src/db_maint.sh
db_compact() {
  local tmp
  tmp=$(mktemp -p "$(dirname "$db")")

  ( _db_lock_exclusive
    declare -A latest

    # pass 1: build key → latest-record map
    while IFS= read -r line; do
      key="${line%%,*}"
      latest["$key"]="$line"  # later writes silently overwrite earlier ones
    done < <(_db_read_all_lines)

    # pass 2: write survivors — skip tombstones and expired TTL keys
    for key in "${!latest[@]}"; do
      value=$(_db_extract_value "${latest[$key]%,*}")
      if [ "$value" != "__deleted__" ] && ! _db_is_expired "$key"; then
        echo "${latest[$key]}" >> "$tmp"
      fi
    done

    mv "$tmp" "$db"  # atomic rename — crash-safe
    : > "$db.wal"      # clear WAL (all data now in main file)
  ) 200>"${db}.lock"
}
RocksDB leveled compaction evolved form
Writes go to Level 0 (fresh SSTables). Background threads merge them into Level 1, 2... Each level is 10× the size of the one above. Compaction at each level keeps the read amplification bounded.
Apache Cassandra compaction strategies evolved form
SizeTiered, Leveled, TimeWindow — each optimized for a different workload. The tradeoff between read amplification (more segments to scan) and write amplification (more bytes rewritten) is identical to what this Bash code faces.
PostgreSQL VACUUM related concept
Postgres uses MVCC (multiple row versions in the heap) instead of an append-only log. VACUUM reclaims space from dead tuples — the same concept of "garbage collect old versions" just applied to a different storage structure.
3 of 7

Chapter 4 — Checksums & data integrity

How every record is fingerprinted, and why returning a silently corrupt value is worse than crashing.

"Checksums detect corruption of data at rest and in transit... if a checksum mismatch is detected, the data is considered lost."
— Designing Data-Intensive Applications, Chapter 1

Every record stores a SHA256 checksum of its payload. On every read, the checksum is recomputed and compared. A mismatch means the record was corrupted — by a bad disk sector, a partial write, a bug, or manual editing. The database returns an error rather than a silently wrong value.

The checksum covers the 3-field payload (key,"value",timestamp) and is stored as the 4th field. This means the timestamp is part of what's checksummed — tampering with the timestamp is detectable.

This database
SHA256 per record via sha256sum. Verified on every read. Strong but slower than a CRC.
Production
CRC32 or xxHash per block — much faster. Verified on read or by background scrubber threads. ZFS checksums at the block device level.
bash src/db_utils.sh
_db_checksum() {
  local payload="$1"
  echo -n "$payload" | sha256sum | cut -d' ' -f1
}

_db_format_record() {
  local key="$1" value="$2"
  local timestamp payload checksum
  timestamp=$(_db_timestamp)
  payload="$key,\"$value\",$timestamp"   # what we checksum
  checksum=$(_db_checksum "$payload")   # SHA256 of 3-field payload
  echo "$payload,$checksum"              # 4-field record on disk
}
# result: mykey,"hello",2026-06-12T02:56:00+00:00,a3f9...
bash src/db_ops.sh
# inside db_get, after reading the matching line:
local payload checksum computed value
payload="${line%,*}"     # strip last field (checksum)
checksum="${line##*,}"  # extract last field
computed=$(_db_checksum "$payload")

if [ "$checksum" != "$computed" ]; then
  echo "Error: checksum mismatch" >&2
  return 1   # fail loudly — never return corrupt data
fi
Apache Kafka CRC32 per message direct equivalent
Every Kafka message carries a CRC32C checksum. Brokers verify on write; consumers verify on read. Same pattern as this database, just using a faster (non-cryptographic) checksum algorithm.
ZFS end-to-end checksums layer below
ZFS checksums every block and can auto-heal from a RAID mirror. Production databases often rely on ZFS or similar for this rather than implementing per-record checksums — though some do both.
PostgreSQL page checksums related concept
Optional per-page checksums (enabled with initdb --data-checksums). Verified on read. Background scrubber thread can scan the whole heap. Off by default because the CPU cost is non-trivial.
4 of 7

Chapter 5 — Concurrency & file locking

How multiple readers and writers coexist without corrupting each other — using the POSIX advisory lock that ships with every Linux system.

"The database must ensure that the data remains consistent even if multiple clients are reading and writing concurrently."
— Designing Data-Intensive Applications, Chapter 7

This database uses flock — POSIX advisory file locks — to coordinate access. Writers acquire an exclusive lock on a .lock file. Multiple readers can hold shared locks simultaneously. A writer blocks until all current readers finish; readers block while a writer holds the lock.

The key idiom is the subshell: ( ... ) 200>"${db}.lock". The lock file is opened on file descriptor 200, flock locks that fd, and when the subshell exits the fd is automatically closed — releasing the lock. No explicit unlock needed.

This database
File-level reader/writer lock. All readers block writers. Readers never block each other. Configurable timeout.
Production
MVCC (row-level snapshots). Readers never block writers and writers never block readers. Much higher concurrency at the cost of a more complex implementation.
bash src/db_lock.sh
_db_lock_exclusive() {
  if _db_tx_active; then return 0; fi  # tx already holds the lock
  local timeout="${DB_LOCK_TIMEOUT:-10}"
  flock -w "$timeout" -x 200  # exclusive write lock on fd 200
}

_db_lock_shared() {
  if _db_tx_active; then return 0; fi
  local timeout="${DB_LOCK_TIMEOUT:-10}"
  flock -w "$timeout" -s 200  # shared read lock on fd 200
}

# The subshell pattern — lock is held for duration of the block:
( _db_lock_exclusive; _db_write_record "$record" ) 200>"${db}.lock"
( _db_lock_shared;    result=$(_db_read)         ) 200>"${db}.lock"
bash tests/integration.bats
# 5 workers, 20 increments each = exactly 100 total
db_set "counter" "0"
for ((w = 0; w < 5; w++)); do
  ( for ((i = 0; i < 20; i++)); do
      db_incr "counter" >/dev/null  # atomic read-modify-write
    done ) &
done
wait
result=$(db_get "counter")
# result must equal 100 — never 99, never 101
PostgreSQL MVCC evolved form
Postgres stores multiple versions of each row (with xmin/xmax transaction IDs). Readers see a snapshot of committed data at transaction start — no file locks needed. Readers never block writers; writers never block readers.
SQLite WAL mode direct equivalent
SQLite's WAL mode uses the same reader/writer pattern — multiple readers, one writer. The locking is implemented with shared-memory rather than flock, but the semantics are identical.
Redis single-threaded model alternative approach
Redis avoids the concurrency problem entirely — all commands run on one thread. No locks needed. This database's subshell lock is conceptually similar: one writer at a time, serialized.
5 of 7

Chapter 6 — Transactions & snapshot isolation

Copy-on-write snapshots give each transaction a consistent view of the world — the same idea behind Postgres's MVCC, built entirely from files.

"Each transaction reads from a consistent snapshot of the database — that is, the transaction sees all the data that was committed at the start of the transaction, and none of the data that was committed later."
— Designing Data-Intensive Applications, Chapter 7

When you call db_begin, the current database and WAL are concatenated into a .tx.snapshot file. All reads during the transaction read from that frozen snapshot. Writes go to a separate .tx.wal.

On db_commit, the transaction WAL is appended to the main WAL under an exclusive lock. On db_rollback, both files are deleted — the world is exactly as if nothing happened. Nested transactions are supported: the inner commit only merges at the outermost level.

This database
Copy-on-write snapshot file. Simple and correct. Reads are isolated; writes are serialized on commit. O(n) snapshot creation.
Production
MVCC stores multiple row versions in-place. No snapshot file copy needed — just record the transaction ID at start and filter rows by it on read. O(1) snapshot creation.
bash src/db_tx.sh
db_begin() {
  if [ "$db_tx_level" -eq 0 ]; then
    # freeze current state into an immutable snapshot file
    cat "$db" "$db.wal" > "$(_db_tx_snapshot)" 2>/dev/null
    : > "$(_db_tx_wal)"  # fresh tx WAL
  fi
  db_tx_level=$(( db_tx_level + 1 ))  # supports nesting
}

db_commit() {
  db_tx_level=$(( db_tx_level - 1 ))
  [ "$db_tx_level" -gt 0 ] && return 0  # inner commit — wait for outer

  ( _db_lock_exclusive
    cat "$(_db_tx_wal)" >> "$db.wal"     # merge tx WAL into main WAL
    rm -f "$(_db_tx_wal)" "$(_db_tx_snapshot)"
  ) 200>"${db}.lock"
}

db_rollback() {
  rm -f "$(_db_tx_wal)" "$(_db_tx_snapshot)"  # discard everything
  db_tx_level=0  # abort all levels
}

# reads inside a transaction go to the snapshot, not the live files:
_db_read_key_lines_for_query() {
  if _db_tx_active; then
    awk -F, -v k="$1" '$1==k' "$(_db_tx_snapshot)" "$(_db_tx_wal)"
  else
    _db_read_key_lines "$1"
  fi
}
PostgreSQL MVCC same concept, better implementation
Postgres stores xmin/xmax transaction IDs on every row version. Each transaction sees only rows where xmin <= txid < xmax. No snapshot file copy — just compare IDs. O(1) transaction start.
SQLite BEGIN DEFERRED / IMMEDIATE / EXCLUSIVE direct equivalent
SQLite's WAL mode uses a similar snapshot: readers read from a consistent point in the WAL by remembering their start position. Writes go to new WAL pages. Rollback discards the new pages.
6 of 7

Chapter 7 — LSM trees & log rotation

When one log grows too large, segment it. Once you have multiple segment files, you have the skeleton of an LSM tree — and Cassandra starts making sense.

"We can make a simple change to the format of our log segment files: we require that the sequence of key-value pairs is sorted by key. We call this format a Sorted String Table, or SSTable."
— Designing Data-Intensive Applications, Chapter 3

db_truncate rotates the main database file when it exceeds a size threshold — compressing the old segment with gzip and renaming it db.1.gz, then db.2.gz, and so on (up to three segments). Reads scan all segments in order from newest to oldest.

This is a primitive LSM (Log-Structured Merge) tree: multiple immutable segment files, newest checked first. The big missing piece is sorted keys within each segment — that's what SSTables add, enabling binary search instead of full scans. But the multi-level structure is already here.

This database
Up to 3 gzip-compressed segments + active file. Reads scan all segments. O(segments × n) worst case.
Production
Sorted segments (SSTables) with Bloom filters. Binary search per segment. Reads touch O(log n) entries per segment. Background compaction merges segments continuously.
bash src/db_maint.sh
db_truncate() {
  local max_bytes="${1:-10485760}"  # default 10 MB
  local file_size
  file_size=$(stat -c %s "$db")
  [ "$file_size" -le "$max_bytes" ] && return 0  # not large enough yet

  # rotate: db.2 → db.3, db.1 → db.2, db → db.1.gz
  [ -f "${db}.2"   ] && mv "${db}.2"    "${db}.3"
  [ -f "${db}.2.gz" ] && mv "${db}.2.gz" "${db}.3.gz"
  [ -f "${db}.1"   ] && mv "${db}.1"    "${db}.2"
  [ -f "${db}.1.gz" ] && mv "${db}.1.gz" "${db}.2.gz"
  mv "$db" "${db}.1"
  gzip -f "${db}.1"  # compress old segment
  : > "$db"          # fresh active file
}
bash src/db_storage.sh
_db_segment_files() {
  # yield older segments first so newer files win on key conflicts
  for seg in "${db}.3" "${db}.2" "${db}.1"; do
    [ -f "$seg"    ] && echo "$seg"
    [ -f "$seg.gz" ] && echo "$seg.gz"
  done
}

_db_read_key_lines() {
  local key="$1"
  # scan old segments, then active file and WAL
  for seg in $(_db_segment_files); do
    _db_cat_file "$seg" | awk -F, -v k="$key" '$1 == k'
  done
  awk -F, -v k="$key" '$1 == k' "$db" "$db.wal" 2>/dev/null
}
LevelDB / RocksDB SSTables evolved form
Segment files with keys sorted at write time. Binary search locates any key without a full scan. Bloom filters let the engine skip segments entirely if a key can't be in them. Tiered compaction merges segments across levels.
Apache Cassandra SSTable segments evolved form
Each memtable flush produces an immutable SSTable file — identical in concept to these db.1.gz segments. Compaction merges SSTables. Bloom filters and partition indexes eliminate unnecessary scans.
The missing piece: Bloom filters what comes next
A Bloom filter is a probabilistic data structure that answers "is this key definitely not in this segment?" with zero false negatives. Production LSM engines attach one to every SSTable so reads skip non-matching segments without decompressing them.
Complete

Source

The complete repo is available at github.com/DarynOngera/relic.