
How to read any legacy codebase. The archaeology playbook.


Somewhere on a hard drive sits a folder of low-resolution scans of Russian typewritten pages from the 1950s. The pages describe PP-BESM, the first high-level programming language compiler ever built in the Soviet Union, designed by Andrey Ershov. A developer who goes by xavxav is rebuilding it. Not emulating it. Rebuilding it, line by line, from the scans. The repo is real, the VM runs, the PP-3 phase has an initial pass. You can clone it.

That project is the extreme version of every "I cannot read this codebase" problem you will ever have at work. Same shape, more dust. The PP-BESM author published a writeup last month that, once you strip the Cold War aesthetic, reads like the cleanest manual on legacy codebase archaeology I have read in years.

This article is that manual, generalized, with the techniques you can apply this week on whatever inherited PHP, COBOL, Perl, or Java 6 repo is currently your problem.

TL;DR

Stage — what you do — why
1. Boundaries — map inputs, outputs, side effects — you cannot understand the inside until you know the outside
2. Harness — build a way to run the code in isolation — the loop is the whole game
3. Bisection — narrow the search to the load-bearing 10 percent — most code is glue
4. Naming — rename systematically as you understand — you are leaving notes for future you
5. Types — add types where there are none, even loose ones — types are documentation that runs
6. Tests as ground truth — write tests that lock in observed behavior — refactoring without tests is fiction
7. Document negotiations — comment the why, never the what — the why is what time erases

The order matters. Skipping ahead is how teams spend six months on "modernization" and end up with a worse version of the same system.

1. Boundaries before internals

The first move on any unfamiliar codebase is not to read the code. The first move is to draw the boundary.

For a web service: what HTTP routes exist, what does each one return, what database tables get touched, what external APIs get called, what writes to disk, what fires events. For a CLI: what arguments does it accept, what files does it read, what does it write, what is the exit code matrix. For a library: what is the public API, what does it depend on, what does it monkey-patch.

You can do this without understanding a single function inside the code. The tools:

# HTTP routes for a Node service
# (grep's --include takes one glob at a time; brace patterns like
# "*.{js,ts}" are not expanded, so pass the flag once per extension)
grep -rE "router\.(get|post|put|delete)|app\.(get|post)" \
  --include="*.js" --include="*.ts" src/

# Database tables touched
grep -rE "FROM|UPDATE|INSERT INTO|DELETE FROM" \
  --include="*.sql" --include="*.js" --include="*.ts" --include="*.py" .

# External API calls
grep -rE "axios|fetch\(|http\.request" --include="*.js" --include="*.ts" src/

# Files read or written
grep -rE "fs\.(read|write)|open\(" \
  --include="*.js" --include="*.ts" --include="*.py" .

Write the answers down. This is your map. You cannot understand the internals until you know where the doors are.
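For the library case, the boundary question can also be answered statically. A minimal sketch in Python, assuming the modules you care about parse cleanly (the script name is mine, not from any project mentioned here):

```python
# list_public_api.py - dump a Python module's public surface without
# importing it. Purely static, so it is safe to run on code you do not
# yet trust enough to execute.
import ast
import sys
from pathlib import Path

def public_api(source: str) -> list[str]:
    """Top-level function and class names that are not underscore-private."""
    names = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if not node.name.startswith("_"):
                names.append(node.name)
    return names

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, public_api(Path(path).read_text()))
```

Run it over the package directory and you have the "what is the public API" answer on paper before reading a single function body.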

For the PP-BESM project, the boundary was the BESM machine model. You cannot read a 1955 compiler without knowing the instruction set of the machine it targets. xavxav reconstructed that from a separate set of documents before touching the compiler source. Same pattern, smaller stakes.

2. Build a harness, even a bad one

The highest payoff move on a legacy codebase, by a wide margin, is to get any version of the code running in isolation, with one input and one observable output, before you try to understand any of it.

For a web service, that means a docker-compose that spins up the app and its database with a single command, with one curl that exercises one route. For a CLI, that means a one-liner that runs the binary with a representative input and pipes the output somewhere you can read it. For a library, that means a five line consumer that imports the library and calls the one function you care about.

If this is impossible, the rest of the audit will also be impossible. Spend a day building the harness. It is the loop.

# A minimal harness for a legacy Python script
mkdir -p harness
cat > harness/run.sh <<'EOF'
#!/bin/bash
cd "$(dirname "$0")/.."
python3 ./scary_script.py --input fixtures/sample.csv > /tmp/out.txt
diff /tmp/out.txt fixtures/expected.txt
EOF
chmod +x harness/run.sh

You now have a one-command loop. Every change you make from here on can be tested against harness/run.sh. The harness is your safety net.
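If the team is happier in pytest than in shell, the same loop ports over directly. A sketch, reusing the illustrative file names from the shell harness above:

```python
# harness/test_golden.py - golden-file test around a legacy script.
# The script and fixture names mirror the shell harness and are
# placeholders for whatever you actually inherited.
import subprocess
import sys
from pathlib import Path

def run_legacy(script: Path, *args: str) -> str:
    """Run a legacy Python script in a subprocess and capture its stdout."""
    result = subprocess.run(
        [sys.executable, str(script), *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def test_output_matches_golden_file():
    out = run_legacy(Path("scary_script.py"), "--input", "fixtures/sample.csv")
    assert out == Path("fixtures/expected.txt").read_text()
```

The subprocess boundary is deliberate: you are pinning what the script emits, not how it is structured inside, which is exactly the contract a harness should hold.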

xavxav's harness for PP-BESM is the BESM virtual machine he built. Every change to the compiler can be tested by running a tiny Soviet-era program inside the VM and watching the result. The VM is more important than any single piece of the compiler source.

3. Bisection beats reading top to bottom

The instinct on a new codebase is to read the entry point and follow the call graph. This is wrong almost every time. Most legacy code is glue. The interesting logic, the part that actually does the work, lives in 10 to 20 percent of the files. The other 80 to 90 percent shuffles data between the interesting parts.

The fastest way to find the interesting parts is bisection.

# What touched the database in the last year?
git log --since="1 year ago" --name-only --pretty=format: \
  | grep -E "schema|migration|model" | sort -u

# Where do the longest files live? long usually means interesting
find . -name "*.py" -not -path "*/node_modules/*" \
  -exec wc -l {} \; | sort -rn | head -20

# What gets imported the most? heavily imported usually means load bearing
grep -rE "^import|^from" --include="*.py" . | awk '{print $2}' \
  | sort | uniq -c | sort -rn | head -20

Each of those commands narrows the search. The longest file is often the dumping ground. The most imported module is often the actual brain of the system. The files that show up in every migration are the ones the schema can't live without.
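Once grep has given you a first pass, the import-count heuristic is worth doing properly. A sketch with Python's ast module, which survives the odd formatting and multi-line imports that trip up regexes:

```python
# count_imports.py - rank modules by how often the rest of the tree
# imports them. Heavily imported usually means load bearing.
import ast
from collections import Counter
from pathlib import Path

def import_counts(root: Path) -> Counter:
    counts: Counter = Counter()
    for path in root.rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(errors="ignore"))
        except SyntaxError:
            continue  # legacy trees often contain dead, unparsable files
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                counts.update(alias.name for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                counts[node.module] += 1
    return counts

if __name__ == "__main__":
    for module, n in import_counts(Path(".")).most_common(20):
        print(f"{n:5d}  {module}")
```

The top of that list is your reading order.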

For PP-BESM the bisection target was PP-3, the last compiler phase. xavxav knew the early phases were better documented in the existing literature. The interesting unknown was the last phase. He focused there first.

4. Naming as you go

Every time you understand a function, rename it. Every time you understand a variable, rename it. Do this in a branch, and commit often.

The temptation is to read the whole codebase first and rename later. This is wrong. You will forget what you understood. You will lose hours of context. The rename is the note you are leaving for future you and the next person.

// before
function process(x, y) {
  const r = x.filter(z => z.s > y).map(z => z.id)
  return db.query(r)
}

// after, you understood this is fetching active user ids over a score threshold
function fetchActiveUserIdsAboveScore(users, threshold) {
  const qualifyingIds = users
    .filter(user => user.score > threshold)
    .map(user => user.id)
  return db.query(qualifyingIds)
}

A good rule: if you cannot rename a function meaningfully, you do not understand it yet. Keep reading. Once you can rename it, do it immediately, then commit with a message that captures what you learned.

xavxav's rename pass on PP-BESM was a translation pass, but the principle is the same. Russian identifiers became English identifiers. Cryptic three letter mnemonics became words. The code became readable because someone took the time to make it readable.

5. Types as living documentation

If the codebase is dynamically typed, add types. If the types are wrong, fix them. Even loose types beat no types, because types are the documentation that runs.

// before, no types
function calculate(data, config) {
  return data.items.reduce((acc, item) => {
    return acc + item.price * (config.taxRate + 1)
  }, 0)
}

// after, types you can refactor against
type LineItem = { price: number; quantity: number; }
type TaxConfig = { taxRate: number; }
type Order = { items: LineItem[]; }

function calculateTotalWithTax(order: Order, config: TaxConfig): number {
  return order.items.reduce((acc, item) => {
    return acc + item.price * (config.taxRate + 1)
  }, 0)
}

For Python, add type hints. For PHP, use PHPStan or Psalm. For old JavaScript, migrate file by file to TypeScript with allowJs: true. The types do not need to be perfect on day one. They need to exist.
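In Python, the same migration looks like this. A sketch with dataclasses and hints that mirrors the TypeScript example above; run mypy or pyright over it to get the "documentation that runs" effect:

```python
# after: the untyped calculate() rewritten with hints a checker can verify.
from dataclasses import dataclass

@dataclass
class LineItem:
    price: float
    quantity: int

@dataclass
class TaxConfig:
    tax_rate: float

@dataclass
class Order:
    items: list[LineItem]

def calculate_total_with_tax(order: Order, config: TaxConfig) -> float:
    # Same arithmetic as the TypeScript version: price only, quantity
    # unused for now, because we are documenting behavior, not fixing it.
    return sum(item.price * (config.tax_rate + 1) for item in order.items)
```

Passing a dict with a misspelled key now fails the type check instead of failing in production.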

The reason this matters more than people think: types compile. Comments do not. A wrong comment lives forever. A wrong type breaks the build. Types are the only documentation format that the compiler keeps honest.

6. Tests as ground truth, even for behavior you do not love

Before you refactor anything, write tests that lock in the observed behavior, including the parts that look like bugs.

This is the most counterintuitive rule on the list. Junior engineers want to fix the bugs immediately. The right move is to write a test that proves the bug exists first, then keep that test passing while you refactor, then change the test deliberately at the end if the bug should be fixed.

# pin the current behavior, even if it is wrong
def test_calculate_returns_negative_for_empty_orders():
    """
    BUG-LIKE: empty orders currently return -1 instead of 0.
    Some downstream system depends on this. Do not change without
    coordinating with the billing team.
    """
    result = calculate([], TaxConfig(rate=0.1))
    assert result == -1

The test does two things. It tells future you that the behavior is intentional, not an accident. It also acts as the alarm if a "small refactor" breaks the contract.

xavxav's tests for PP-BESM are not unit tests in the modern sense. They are small Soviet-era programs run through the VM with their expected output captured. Same idea, smaller scope. Pin the behavior, refactor against the pin, change the pin deliberately.

7. Comment the negotiations, never the obvious

Your future maintainer can read the code. They cannot read your decision tree. The comments that survive a decade are the ones that capture why a particular choice was made, especially when the choice looks weird.

Bad comment: // increment counter. The code already says that.

Good comment:

// We round down because the billing team expects integer cents only.
// Historical: float cents caused the May 2023 reconciliation incident.

The good comment is a note from one engineer to another about a constraint that is not visible in the code. The constraint is real. The constraint will outlive the engineer who introduced it. The comment is the only place it lives.

Run this drill on your legacy codebase: find every place where the code looks slightly odd. A magic number, a hardcoded check, a try/except that swallows a specific exception, a special case for one customer ID. Each one of those is a negotiation that someone made with reality. If the comment is missing, add it once you figure out the negotiation.

# bad
TIMEOUT = 47

# good
# Set to 47 seconds because their auth gateway has a 50 second hard limit
# and we observed 1-2 second jitter from our load balancer. See incident
# 2024-03-15. Do not raise without coordinating with the partner team.
TIMEOUT = 47

Stitching the playbook together

The seven stages are not parallel. They build on each other. The boundary work tells you where to put the harness. The harness lets you bisect. The bisection tells you what to name. The names tell you what to type. The types tell you what to test. The tests give you the safety to comment confidently.

The same loop runs at every scale. xavxav is running it on a 70-year-old compiler with the source on paper. You can run it on a 12-year-old Rails app with the source on GitHub. The shape is identical.

A practical first week, if you are inheriting a legacy codebase tomorrow:

Day 1: Boundaries. Draw the map. Do not read internals.
Day 2: Harness. Get any version running with one command.
Day 3: Bisection. Find the 10 percent that does the work.
Day 4: Naming + types. Make the 10 percent readable.
Day 5: Tests. Pin the observed behavior before refactoring.

Week 2 onward: refactor against the pins, comment the negotiations.

By the end of week one you will know more about the codebase than the engineer who wrote it, because the engineer who wrote it never had the map. They built the system one room at a time. You can read the whole architecture in a week because the map is part of the work.

The honest take

Most engineers will tell you they hate legacy codebases. They say this because the only legacy codebases they have seen are the ones nobody bothered to read. A codebase that someone has actually understood, mapped, harnessed, and pinned behavior on, is a perfectly pleasant place to work. The unpleasantness is not in the age of the code, it is in the absence of the archaeology.

The PP-BESM project will probably never have a million users. It will not show up in your dependency tree. It will not raise a Series A. The project still ranks among the most interesting software writing happening in 2026, because the goal is preservation rather than growth, and because the technique generalizes. The output is not a product. The output is a playbook.

That playbook works on the codebase that sits in your own repo right now, the one with a legacy/ directory nobody touches. Spend a week on it. The legacy directory will become an asset instead of a liability.

Question for the comments: what is the oldest piece of code you have ever read seriously, and which of the seven stages did you skip?


GDS K S · thegdsks.com · follow on X @thegdsks

Every codebase ends up as archaeology eventually. The question is whether anyone bothers to dig.

This article was syndicated from Dev.to. The canonical copy lives at https://dev.to/thegdsks/how-to-read-any-legacy-codebase-the-archaeology-playbook-19bh.