Cleaning Up Legacy Code with Codex CLI: A One-Night Playbook
Author: CodeGateway team · Tested in May 2026
TL;DR: Every engineer has one: a project they haven't touched in three years. Test coverage at 12%. The original author left two jobs ago. There's a # rewrite-someday marker from 2022 still sitting at line 47. Every time you think about it, the same dilemma: don't touch it and the next sprint slips; touch it and the next two weeks slip. Codex CLI — OpenAI's command-line coding agent paired with the GPT-5.3 Codex model — drops "touching it" to one evening. This isn't an "AI fixes everything" pitch. It's a real playbook that splits "legacy code cleanup" into four concrete stages: diagnose, set up, slice into multi-agent tasks, and a postmortem of the four most common ways the run dies. Every step has copy-paste-ready commands. A typical mid-sized Python project goes through it in an evening for an estimated $1–3.

Table of Contents
- The three flavors of legacy code
- Diagnose first: figure out what to actually change
- Codex CLI in 5 steps
- Slice the refactor into 9 schedulable tasks
- A typical evening's cost receipt
- Four ways the run goes sideways
- FAQ
- Further reading
The three flavors of legacy code
Before you start the cleanup, get specific about what kind of legacy you're dealing with. Most projects you don't want to touch fall into one of three patterns — each gets a different diagnostic, different priorities, different Codex tasks.
| Flavor | Telltale signs | Damage | Suitable Codex tasks |
|---|---|---|---|
| Spaghetti | 800-line functions, 6-level nesting, variables named tmp1/foo/bar | Fixing A breaks B; regression-test blind spots | Function splitting, renaming, extracting common logic |
| Fossil | Dependencies 5 years stale; Python 3.6; jQuery 1.x | Security holes, CVE warnings, new hires can't onboard | Dependency batch upgrade + API adaptation |
| Amnesia | No docs, no tests, original author gone, comments say "see Slack" | No one dares change it; institutional knowledge lives in ex-coworkers' heads | Auto-generate docs + unit tests |
Real-world projects usually have all three layered together — which is why the urge to "just rewrite" feels so strong. But greenfield rewrites fail more than half the time (industry data: >60% of "rewrite from scratch" projects miss schedule or get abandoned). Codex CLI's play isn't to bulldoze and rebuild — it's surgery on the existing structure.
Diagnose first: figure out what to actually change
Don't open Codex and tell it "refactor this project." That instruction's cost balloons and its results are mediocre — the AI doesn't know your priorities and can't tell historical baggage from necessary complexity.
Spend 15 minutes on diagnostics; it saves hours of rework downstream. Three actions:
1. Run a quantifiable health check
```bash
# Python projects:
pip install radon vulture pylint
radon cc -a -s .                 # cyclomatic complexity
radon mi -s .                    # maintainability index
vulture .                        # dead-code scan
pylint --output-format=json .    # comprehensive report

# JS / TS projects:
npx complexity-report-html ./src # complexity
npx ts-prune                     # dead code
npx depcheck                     # unused deps
```

Save the output as legacy-audit.txt. This is the input material for Codex's later decisions.
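To get everything into one file, a minimal sketch for the Python case (assuming the tools installed above; the `|| true` is there because vulture exits non-zero whenever it finds dead code):

```bash
# Collect the three Python reports into one audit file
{
  echo "== cyclomatic complexity =="; radon cc -a -s . || true
  echo "== maintainability index =="; radon mi -s .    || true
  echo "== dead code ==";             vulture .        || true
} > legacy-audit.txt 2>&1
```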
2. Write a "touch / don't touch" list
```bash
# Create refactor-scope.md, answer 4 questions:
# 1. Which modules have cyclomatic complexity > 15? (must touch)
# 2. Which functions are > 100 lines? (split)
# 3. Which deps have CVE warnings? (upgrade)
# 4. Which code did vulture flag as dead? (delete)
```

Be careful with #4 — in dynamic languages, "dead code" is often called via reflection. Tag what you're keeping with # vulture: ignore.
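Before deleting anything vulture flagged, a quick grep for dynamic-access patterns is cheap insurance; names resolved only at runtime never show up as static references (the path here is illustrative):

```bash
# Check whether a "dead" name might be reached dynamically before deleting it
grep -rn "getattr\|globals()\|importlib" --include="*.py" src/
```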
3. Back up + branch
```bash
git checkout -b refactor/legacy-cleanup-2026q2
git commit -am "checkpoint: pre-refactor baseline"
```

Codex CLI's execution is destructive. Working directly on main doubles the rollback cost.
Codex CLI in 5 steps
Step 1: Install the CLI (~30 s)
```bash
npm install -g @openai/codex
codex --version
```

Requires Node.js 18+. On Windows, use WSL2.
Step 2: Get a CodeGateway API key (~2 min)
Direct OpenAI works too. We use CodeGateway here for three reasons:
- Email signup, no international credit card needed.
- New accounts get a $2 starter credit — at GPT-5.3 Codex's input price of $2 per million tokens × the 1.5x starter markup, that's about 670K tokens (the arithmetic is spelled out after this list), enough to run diagnostics plus one full refactor pass.
- Long jobs are more stable than going direct — full-codebase analyses on 100K-LOC repos run 8–12 minutes; we've rarely had one drop with the gateway in front (more on dropouts in the postmortem section below).
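The 670K figure is simple division; worth re-running against whatever prices apply when you read this:

```bash
# $2 credit ÷ ($2 per M input tokens × 1.5 markup) ≈ 0.67M ≈ 670K tokens
echo "scale=2; 2 / (2 * 1.5)" | bc   # prints .66, i.e. about two-thirds of a million
```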
Sign up at https://www.codegateway.dev → Dashboard → API Keys → Create Key. Name it after where it lives, e.g. codex-laptop.
Step 3: Configure environment (~1 min)
```bash
# in ~/.zshrc or ~/.bashrc
export OPENAI_BASE_URL="https://api.codegateway.dev/v1"
export OPENAI_API_KEY="sk-cg-xxxxxxxxxxxxxxxxxxxxxx"
```

Then source your rc file so the variables take effect.
Step 4: Verify the connection (~30 s)
```bash
curl https://api.codegateway.dev/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" | grep gpt-5.3-codex
```

If you see gpt-5.3-codex in the output, you're set.
Step 5: First instruction (~30 s)
```bash
cd legacy-project/
codex "Summarize the architecture of this project and list the 5 most refactor-worthy spots."
```

Codex will grep / read files / run commands and come back with an architecture overview and a priority list. This single command typically costs $0.05–$0.15. Treat it as the project's "first handshake."
Slice the refactor into 9 schedulable tasks
Telling Codex to "refactor the whole project" creates two problems: the context window gets blown out, and you pay a large bill for results that are hard to verify. The right move is to split the work into 9 manageable tasks and run them in dependency order.
Task decomposition template
Save the following 9 tasks as refactor-tasks.md:
1. ARCHITECTURE: Output the current architecture (Mermaid), find refactor entry points
2. DEAD-CODE: Delete what vulture flagged (preserve `# vulture: ignore` comments)
3. RENAME: Rename tmp1/tmp2/foo/bar-style variables to semantic names
4. SPLIT: Split 100+ line functions into single-responsibility small functions
5. TYPES: Add type annotations to all public APIs (Python: typing; TS: explicit interfaces)
6. DEPS: Upgrade CVE-flagged deps to latest minor; run full test suite once
7. TESTS: Backfill unit tests for modules with cyclomatic complexity > 15
8. DOCS: Add docstrings to public APIs; generate the README "API Reference" section
9. REVIEW: Run full tests + lint, output diff summary + risk highlights

Run sub-agents concurrently
Codex CLI supports /agents for sub-agent scheduling. Task 1 (architecture) must run first; task 9 (review) must run last. The middle 7 have partial parallelism (DEAD-CODE and RENAME don't conflict → parallel; TESTS must follow SPLIT → serial).
```bash
codex << 'EOF'
Run in this order:
1. First run ARCHITECTURE, write the result to docs/architecture.md
2. Then run DEAD-CODE, RENAME, TYPES concurrently (3 sub-agents)
3. After those finish, run SPLIT → TESTS serially
4. Run DEPS in parallel with SPLIT (no conflict)
5. Then DOCS
6. Finally REVIEW
After each task, write a commit with subject "refactor: <task-name>".
EOF
```

The main agent handles scheduling; sub-agents do the work. Codex commits per-task automatically — making it easy to step through reviews and roll back individual changes.
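One way to step through the result afterwards, task by task (`<bad-sha>` is a placeholder for whichever commit misses):

```bash
# Walk the per-task commits oldest-first, inspect, and undo a single task if needed
git log --oneline --reverse main..HEAD
git show <bad-sha>                # one task's full diff
git revert --no-edit <bad-sha>    # roll back just that task
```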
Picking models for sub-agents (cost-aware)
- ARCHITECTURE / REVIEW: architectural reasoning + holistic judgment → `gpt-5.3-codex` (default driver).
- RENAME / DEAD-CODE / DOCS: mechanical, rule-driven work → switch to `gpt-5-mini` or cheaper. Sub-agent cost drops to about a third.
- TESTS / SPLIT / TYPES: needs context but is repetitive → keep on `gpt-5.3-codex`.
Save these rules to .codex/policy.md. Codex CLI loads it on startup; sub-agent dispatch picks the model automatically — no per-command overrides.
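A sketch of what that file might contain; the exact format is an assumption here, treating the policy as free-form instructions Codex reads on startup:

```bash
mkdir -p .codex
cat > .codex/policy.md << 'EOF'
# Sub-agent model policy
- ARCHITECTURE, REVIEW: gpt-5.3-codex
- RENAME, DEAD-CODE, DOCS: gpt-5-mini
- TESTS, SPLIT, TYPES: gpt-5.3-codex

# Hard rules
- Never modify tests/conftest.py, scripts/deploy*.sh, .github/**
- One commit per task, subject "refactor: <task-name>"
EOF
```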
A typical evening's cost receipt
Data note: the numbers below are an estimated cost structure for a typical mid-sized project (~30K LOC Python, ~150 files, ~12K test LOC), not a single project's actual bill. Real Codex CLI consumption swings significantly with project size, code density, and iteration count. Numbers below assume GPT-5.3 Codex pricing × the CodeGateway 1.5x starter markup.
Token consumption per task (estimated)
| Task | Input tokens (K) | Output tokens (K) | Estimated cost |
|---|---|---|---|
| ARCHITECTURE | 80 | 8 | $0.32 |
| DEAD-CODE | 60 | 4 | $0.20 |
| RENAME | 120 | 30 | $0.65 |
| SPLIT | 180 | 60 | $1.10 |
| TYPES | 100 | 25 | $0.55 |
| DEPS | 40 | 5 | $0.16 |
| TESTS | 200 | 80 | $1.30 |
| DOCS | 90 | 35 | $0.50 |
| REVIEW | 70 | 12 | $0.30 |
Total estimate for one evening
- Total tokens: ~940K input + ~259K output ≈ 1.2M tokens
- Direct OpenAI: ~$3.50 (at published GPT-5.3 Codex rates)
- 1.5x starter markup (new CodeGateway account): ~$5.30
- 1.2x floor markup (90-day spend > $500): ~$4.20

The $2 starter credit covers the first 4–5 tasks — enough to validate whether this flow fits your project. Top up $5–10 to finish the rest.
Time
API calls total roughly 30–50 minutes serially; concurrency compresses to 12–20 minutes. Add your review time per commit, and you're at 60–90 minutes end-to-end.
Three cost-compression levers
- Prompt cache: keep `architecture.md` as a long, repeated system prompt. Cache hits drop input pricing to ~10%. Across one evening that's 30–40% off.
- mini model: send mechanical tasks (RENAME / DEAD-CODE / DOCS) to `gpt-5-mini`. About $0.7–$1.0 saved.
- Slice tasks finer: a failed task retries the whole task. At "one file per task" granularity, the cost of any single failure is minimized (see the sketch below).
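The "one file per task" version of, say, the RENAME pass; a sketch, with an illustrative prompt:

```bash
# Per-file granularity: a failed file costs one file's tokens, not the whole task
for f in $(git ls-files src/ | grep '\.py$'); do
  codex "In $f only: rename tmp1/tmp2-style variables to semantic names. Touch no other file." \
    || echo "FAILED: $f" >> rename-failures.txt
done
```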
Four ways the run goes sideways
The most common failures aren't model failures — they're workflow details.
1. Mid-flight connection drops
Six minutes in, Codex has just finished file 12 of 16. File 13 throws ECONNRESET. Home broadband and corporate proxies typically set NAT idle timeouts at 30–60 seconds; an AI inference pause is long enough for the middlebox to evict the connection.
Fix: route through CodeGateway — Cloudflare's edge keeps long-lived connections warm. We've run 100K-LOC full-codebase analyses for 12+ minutes without a single drop. Same root cause and fix as documented for Claude Code timeouts (the logic transfers directly to Codex).
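If you'd rather stay on a direct connection, a blunt retry wrapper also limits the damage; a sketch, assuming the task prompt is safe to re-run:

```bash
# Retry a long-running task up to 3 times on transient network failure
for attempt in 1 2 3; do
  codex "Run the TESTS task from refactor-tasks.md" && break
  echo "attempt $attempt failed; retrying in 30s" >&2
  sleep 30
done
```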
2. Sub-agent goes off-script and modifies the wrong files
You ask Codex to refactor src/. It also "helpfully" changes tests/conftest.py — breaking the test infrastructure, after which everything fails downstream.
Fix: declare a firewall up front.
codex "Refactor src/, but DO NOT modify any of:
tests/conftest.py
scripts/deploy*.sh
.github/**
If you find changes there are needed, stop and report — I'll decide."Bake this into your task template. After a few runs it's reflex.
3. RENAME also rewrites string literals
You rename user_id → accountId for variables. Codex also rewrites the literal string "user_id" in log messages — and now every grep/dashboard query against logs is broken.
Fix: scope it explicitly.
codex "Rename user_id → accountId, but ONLY in variable names, function parameters, and class attributes.
DO NOT change string literals (logs, SQL, error messages).
DO NOT change keys in JSON / YAML / config files."4. TESTS task generates "passing" tests for buggy code
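After the rename commit lands, a one-line spot check that no quoted literals slipped through (the literal matches the example above):

```bash
# Flag any added/removed line that still touches the quoted string "user_id"
git diff HEAD~1 | grep -E '^[+-].*"user_id"' && echo "string literals touched, review" || echo "clean"
```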
4. TESTS task generates "passing" tests for buggy code

Codex backfills tests for a function with a known bug — and reads the current implementation as the spec. The test asserts the buggy behavior. Now when someone fixes the bug, the test fails — and the path of least resistance is to revert the fix, so the bug survives.
Fix: make Codex test the spec, not the implementation.
codex "Write unit tests for src/billing/calculate.py.
The spec lives at docs/billing-spec.md.
DO NOT infer tests from the current implementation — that freezes bugs into assertions.
If the spec is unclear, mark `# clarify-with-pm:` and don't guess."If there's no spec, the TESTS task should run after writing the spec. Don't try to skip it.
FAQ
Q: OpenAI direct works for Codex CLI. Why use CodeGateway?
A: Direct works fine. CodeGateway solves three concrete problems: (1) network instability for long tasks (Cloudflare edge keepalive); (2) no international credit card (email signup + multiple payment methods); (3) one key calls Claude / OpenAI / Google models — convenient when you want to use different upstreams for refactoring vs. test generation.
Q: Can I push the cost lower? $1–3 an evening still feels high.
A: Three paths: (1) prompt-cache hits drop input cost to ~10% — load the system prompt once, reuse across tasks; (2) route mechanical tasks to gpt-5-mini (cuts cost roughly in half on those); (3) slice tasks finer so retries don't replay full tasks.
Q: Can the Codex-generated code be merged directly?
A: No. Always review. Codex writes commits, but each commit needs human eyes. Two specific danger signs: (1) it changed the test to make the test pass (rather than the source); (2) it introduced magic numbers / hardcoded values instead of abstracting. A pre-merge checklist as a Claude Code Skill is documented in our configuration guide.
Q: Can it run in CI?
A: Yes. `codex exec` is the non-interactive mode for CI. But don't auto-merge — let CI run a review and humans decide on the merge. Use a dedicated CI key (named ci-<repo>) and set RPM and monthly budget caps on it.
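A minimal CI step, assuming `codex exec` accepts a prompt argument the way the interactive mode does:

```bash
# Non-interactive review in CI: comment, never auto-merge
export OPENAI_BASE_URL="https://api.codegateway.dev/v1"
export OPENAI_API_KEY="$CI_CODEX_KEY"   # dedicated key, e.g. ci-myrepo
codex exec "Review the diff between origin/main and HEAD. List risky changes. Do not modify files." \
  > codex-review.txt
```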
Q: Codex CLI vs Claude Code — which to pick?
A: Not mutually exclusive. Codex shines on OpenAI Responses API workflows (mainstream Python / TS, organizations already on OpenAI). Claude Code is stronger on long context, multi-agent orchestration, and its skill system. The same CodeGateway key works for both — pick per task. Detailed comparison post coming this Friday.
Q: My legacy project has commercially sensitive code. What now?
A: Codex CLI sends prompts upstream — so sensitive sections need redaction or placeholder substitution first. CodeGateway as a gateway doesn't persist conversation bodies (only metadata), but data retention upstream is governed by OpenAI's own policy. For sensitive workloads, review OpenAI's Enterprise / Zero-Retention terms.
Q: When shouldn't I run Codex on the cleanup?
A: Three cases: (1) projects without git history (no rollback); (2) critical business code with zero tests (no way to verify changes); (3) compliance / medical / financial core transaction code (every line needs human review). Backfill tests first, then bring in Codex.
Q: How do I prevent the codebase from "going legacy" again?
A: Two actions: (1) write the post-refactor conventions into the project README + .codex/policy.md, so the next person has a baseline; (2) add quality gates to CI (cyclomatic complexity ceiling, coverage floor) — the next person tempted to add a 600-line function gets blocked. A concrete gate is sketched below.
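For Python projects, one concrete pairing: xenon (radon's CI gating companion) for the complexity ceiling, pytest-cov for the coverage floor. Thresholds here are illustrative:

```bash
pip install xenon pytest-cov
xenon --max-absolute C --max-average B src/   # fail the build if any block is worse than grade C
pytest --cov=src --cov-fail-under=70          # fail below 70% line coverage
```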
Further reading
- Codex CLI setup tutorial — 5-step quickstart (CodeGateway docs)
- Claude Code connection timeout troubleshooting — root causes and fixes for long-task drops (transfers to Codex)
- The complete Claude Code configuration guide — Skills / Hooks / multi-key governance (complements Codex workflows)
- An honest receipt: 16 blog hero images for $0.92 in an hour — same "useful problem + honest receipt" approach applied to image generation
- Top-up and billing guide — tier markup, $2 starter, Stripe / Alipay / WeChat
- Tier markup explainer — 90-day rolling window, floor 1.2x
- OpenAI — Codex CLI on GitHub
- OpenAI — Responses API reference
- Task templates: open-sourced at Whitedit/code-gateway-cookbook (image-gen first; codex templates next)
Legacy code didn't get that way overnight, but it can start moving overnight. Diagnose first, take small steps, commit at a granularity you can roll back from — once the workflow is smooth, your attention shifts back to the question that actually matters: what should change here, and why.
