Cleaning Up Legacy Code with Codex CLI: A One-Night Playbook
Author: CodeGateway team · Tested in May 2026
TL;DR: Every engineer has one: a project they haven't touched in three years. Test coverage at 12%. The original author left two jobs ago. There's a # rewrite-someday marker from 2022 still sitting at line 47. Every time you think about it, the same dilemma: don't touch it and the next sprint slips; touch it and the next two weeks slip. Codex CLI — OpenAI's command-line coding agent paired with the GPT-5.3 Codex model — drops "touching it" to one evening. This isn't an "AI fixes everything" pitch. It's a real playbook that splits "legacy code cleanup" into four concrete stages: diagnose, set up, slice into multi-agent tasks, and a postmortem of the four most common ways the run dies. Every step has copy-paste-ready commands. A typical mid-sized Python project goes through it in an evening for an estimated $1–3.

Table of Contents
- The three flavors of legacy code
- Diagnose first: figure out what to actually change
- Codex CLI in 5 steps
- Slice the refactor into 9 schedulable tasks
- A typical evening's cost receipt
- Four ways the run goes sideways
- FAQ
- Further reading
The three flavors of legacy code
Before you start the cleanup, get specific about what kind of legacy you're dealing with. Most projects you don't want to touch fall into one of three patterns — each gets a different diagnostic, different priorities, different Codex tasks.
| Flavor | Telltale signs | Damage | Suitable Codex tasks |
|---|---|---|---|
| Spaghetti | 800-line functions, 6-level nesting, variables named tmp1/foo/bar | Fixing A breaks B; regression-test blind spots | Function splitting, renaming, extracting common logic |
| Fossil | Dependencies 5 years stale; Python 3.6; jQuery 1.x | Security holes, CVE warnings, new hires can't onboard | Dependency batch upgrade + API adaptation |
| Amnesia | No docs, no tests, original author gone, comments say "see Slack" | No one dares change it; institutional knowledge lives in ex-coworkers' heads | Auto-generate docs + unit tests |
Real-world projects usually have all three layered together — which is why the urge to "just rewrite" feels so strong. But greenfield rewrites fail more than half the time (industry data: >60% of "rewrite from scratch" projects miss schedule or get abandoned). Codex CLI's play isn't to bulldoze and rebuild — it's surgery on the existing structure.
Diagnose first: figure out what to actually change
Don't open Codex and tell it "refactor this project." That instruction's cost balloons and its results are mediocre — the AI doesn't know your priorities and can't tell historical baggage from necessary complexity.
Spend 15 minutes on diagnostics; it saves hours of rework downstream. Three actions:
1. Run a quantifiable health check
```bash
# Python projects:
pip install radon vulture pylint
radon cc -a -s .                 # cyclomatic complexity
radon mi -s .                    # maintainability index
vulture .                        # dead-code scan
pylint --output-format=json .    # comprehensive report

# JS / TS projects:
npx complexity-report-html ./src # complexity
npx ts-prune                     # dead code
npx depcheck                     # unused deps
```

Save the output as legacy-audit.txt. This is the input material for Codex's later decisions.
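To get everything into one file, a minimal sketch for the Python case (assuming the tools installed above; the `|| true` is there because vulture exits non-zero whenever it finds dead code):

```bash
# Collect the three Python reports into one audit file
{
  echo "== cyclomatic complexity =="; radon cc -a -s . || true
  echo "== maintainability index =="; radon mi -s .    || true
  echo "== dead code ==";             vulture .        || true
} > legacy-audit.txt 2>&1
```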
2. Write a "touch / don't touch" list
```bash
# Create refactor-scope.md, answer 4 questions:
# 1. Which modules have cyclomatic complexity > 15? (must touch)
# 2. Which functions are > 100 lines? (split)
# 3. Which deps have CVE warnings? (upgrade)
# 4. Which code did vulture flag as dead? (delete)
```

Be careful with #4 — in dynamic languages, "dead code" is often called via reflection. Tag what you're keeping with # vulture: ignore.
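Before deleting anything vulture flagged, a quick grep for dynamic-access patterns is cheap insurance; names resolved only at runtime never show up as static references (the path here is illustrative):

```bash
# Check whether a "dead" name might be reached dynamically before deleting it
grep -rn "getattr\|globals()\|importlib" --include="*.py" src/
```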
3. Back up + branch
```bash
git checkout -b refactor/legacy-cleanup-2026q2
git commit -am "checkpoint: pre-refactor baseline"
```

Codex CLI's execution is destructive. Working directly on main doubles the rollback cost.
Codex CLI in 5 steps
Step 1: Install the CLI (~30 s)
```bash
npm install -g @openai/codex
codex --version
```

Requires Node.js 18+. On Windows, use WSL2.
Step 2: Get a CodeGateway API key (~2 min)
Direct OpenAI works too. We use CodeGateway here for three reasons:
- Email signup, no international credit card needed.
- New accounts get a $2 starter credit — at GPT-5.3 Codex's input price of $2 per million tokens × the 1.5x starter markup, that's about 670K tokens (the arithmetic is spelled out after this list), enough to run diagnostics plus one full refactor pass.
- Long jobs are more stable than going direct — full-codebase analyses on 100K-LOC repos run 8–12 minutes; we've rarely had one drop with the gateway in front (more on dropouts in the postmortem section below).
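The 670K figure is simple division; worth re-running against whatever prices apply when you read this:

```bash
# $2 credit ÷ ($2 per M input tokens × 1.5 markup) ≈ 0.67M ≈ 670K tokens
echo "scale=2; 2 / (2 * 1.5)" | bc   # prints .66, i.e. about two-thirds of a million
```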
Sign up at https://www.codegateway.dev → Dashboard → API Keys → Create Key. Name it after where it lives, e.g. codex-laptop.
Step 3: Configure environment (~1 min)
```bash
# in ~/.zshrc or ~/.bashrc
export OPENAI_BASE_URL="https://api.codegateway.dev/v1"
export OPENAI_API_KEY="sk-cg-xxxxxxxxxxxxxxxxxxxxxx"
```

Then source your rc file so the variables take effect.
Step 4: Verify the connection (~30 s)
```bash
curl https://api.codegateway.dev/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" | grep gpt-5.3-codex
```

If you see gpt-5.3-codex in the output, you're set.
Step 5: First instruction (~30 s)
```bash
cd legacy-project/
codex "Summarize the architecture of this project and list the 5 most refactor-worthy spots."
```

Codex will grep / read files / run commands and come back with an architecture overview and a priority list. This single command typically costs $0.05–$0.15. Treat it as the project's "first handshake."
Slice the refactor into 9 schedulable tasks
Telling Codex to "refactor the whole project" creates two problems: the context window gets blown out, and you pay a large bill for results that are hard to verify. The right move is to split the work into 9 manageable tasks and run them in dependency order.
Task decomposition template
Save the following 9 tasks as refactor-tasks.md:
1. ARCHITECTURE: Output the current architecture (Mermaid), find refactor entry points
2. DEAD-CODE: Delete what vulture flagged (preserve `# vulture: ignore` comments)
3. RENAME: Rename tmp1/tmp2/foo/bar-style variables to semantic names
4. SPLIT: Split 100+ line functions into single-responsibility small functions
5. TYPES: Add type annotations to all public APIs (Python: typing; TS: explicit interfaces)
6. DEPS: Upgrade CVE-flagged deps to latest minor; run full test suite once
7. TESTS: Backfill unit tests for modules with cyclomatic complexity > 15
8. DOCS: Add docstrings to public APIs; generate the README "API Reference" section
9. REVIEW: Run full tests + lint, output diff summary + risk highlights

Run sub-agents concurrently
Codex CLI supports /agents for sub-agent scheduling. Task 1 (architecture) must run first; task 9 (review) must run last. The middle 7 have partial parallelism (DEAD-CODE and RENAME don't conflict → parallel; TESTS must follow SPLIT → serial).
```bash
codex << 'EOF'
Run in this order:
1. First run ARCHITECTURE, write the result to docs/architecture.md
2. Then run DEAD-CODE, RENAME, TYPES concurrently (3 sub-agents)
3. After those finish, run SPLIT → TESTS serially
4. Run DEPS in parallel with SPLIT (no conflict)
5. Then DOCS
6. Finally REVIEW
After each task, write a commit with subject "refactor: <task-name>".
EOF
```

The main agent handles scheduling; sub-agents do the work. Codex commits per-task automatically — making it easy to step through reviews and roll back individual changes.
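One way to step through the result afterwards, task by task (`<bad-sha>` is a placeholder for whichever commit misses):

```bash
# Walk the per-task commits oldest-first, inspect, and undo a single task if needed
git log --oneline --reverse main..HEAD
git show <bad-sha>                # one task's full diff
git revert --no-edit <bad-sha>    # roll back just that task
```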
Picking models for sub-agents (cost-aware)
- ARCHITECTURE / REVIEW: architectural reasoning + holistic judgment → `gpt-5.3-codex` (default driver).
- RENAME / DEAD-CODE / DOCS: mechanical, rule-driven work → switch to `gpt-5-mini` or cheaper. Sub-agent cost drops to about a third.
- TESTS / SPLIT / TYPES: needs context but is repetitive → keep on `gpt-5.3-codex`.
Save these rules to .codex/policy.md. Codex CLI loads it on startup; sub-agent dispatch picks the model automatically — no per-command overrides.
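A sketch of what that file might contain; the exact format is an assumption here, treating the policy as free-form instructions Codex reads on startup:

```bash
mkdir -p .codex
cat > .codex/policy.md << 'EOF'
# Sub-agent model policy
- ARCHITECTURE, REVIEW: gpt-5.3-codex
- RENAME, DEAD-CODE, DOCS: gpt-5-mini
- TESTS, SPLIT, TYPES: gpt-5.3-codex

# Hard rules
- Never modify tests/conftest.py, scripts/deploy*.sh, .github/**
- One commit per task, subject "refactor: <task-name>"
EOF
```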
A typical evening's cost receipt
Data note: the numbers below are an estimated cost structure for a typical mid-sized project (~30K LOC Python, ~150 files, ~12K test LOC), not a single project's actual bill. Real Codex CLI consumption swings significantly with project size, code density, and iteration count. Numbers below assume GPT-5.3 Codex pricing × the CodeGateway 1.5x starter markup.
Token consumption per task (estimated)
| Task | Input tokens (K) | Output tokens (K) | Estimated cost |
|---|---|---|---|
| ARCHITECTURE | 80 | 8 | $0.32 |
| DEAD-CODE | 60 | 4 | $0.20 |
| RENAME | 120 | 30 | $0.65 |
| SPLIT | 180 | 60 | $1.10 |
| TYPES | 100 | 25 | $0.55 |
| DEPS | 40 | 5 | $0.16 |
| TESTS | 200 | 80 | $1.30 |
| DOCS | 90 | 35 | $0.50 |
| REVIEW | 70 | 12 | $0.30 |
Total estimate for one evening
- Total tokens: ~940K input + ~259K output ≈ 1.2M tokens
- Direct OpenAI: ~$3.50 (at published GPT-5.3 Codex rates)
- 1.5x starter markup (new CodeGateway account): ~$5.30
- 1.2x floor markup (90-day spend > $500): ~$4.20

The $2 starter credit covers the first 4–5 tasks — enough to validate whether this flow fits your project. Top up $5–10 to finish the rest.
Time
API calls total roughly 30–50 minutes serially; concurrency compresses to 12–20 minutes. Add your review time per commit, and you're at 60–90 minutes end-to-end.
Three cost-compression levers
- Prompt cache: keep `architecture.md` as a long, repeated system prompt. Cache hits drop input pricing to ~10%. Across one evening that's 30–40% off.
- mini model: send mechanical tasks (RENAME / DEAD-CODE / DOCS) to `gpt-5-mini`. About $0.7–$1.0 saved.
- Slice tasks finer: a failed task retries the whole task. At "one file per task" granularity, the cost of any single failure is minimized (see the sketch below).
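The "one file per task" version of, say, the RENAME pass; a sketch, with an illustrative prompt:

```bash
# Per-file granularity: a failed file costs one file's tokens, not the whole task
for f in $(git ls-files src/ | grep '\.py$'); do
  codex "In $f only: rename tmp1/tmp2-style variables to semantic names. Touch no other file." \
    || echo "FAILED: $f" >> rename-failures.txt
done
```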
Four ways the run goes sideways
The most common failures aren't model failures — they're workflow details.
1. Mid-flight connection drops
Six minutes in, Codex has just finished file 12 of 16. File 13 throws ECONNRESET. Home broadband and corporate proxies typically set NAT idle timeouts at 30–60 seconds; an AI inference pause is long enough for the middlebox to evict the connection.
Fix: route through CodeGateway — Cloudflare's edge keeps long-lived connections warm. We've run 100K-LOC full-codebase analyses for 12+ minutes without a single drop. Same root cause and fix as documented for Claude Code timeouts (the logic transfers directly to Codex).
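If you'd rather stay on a direct connection, a blunt retry wrapper also limits the damage; a sketch, assuming the task prompt is safe to re-run:

```bash
# Retry a long-running task up to 3 times on transient network failure
for attempt in 1 2 3; do
  codex "Run the TESTS task from refactor-tasks.md" && break
  echo "attempt $attempt failed; retrying in 30s" >&2
  sleep 30
done
```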
2. Sub-agent goes off-script and modifies the wrong files
You ask Codex to refactor src/. It also "helpfully" changes tests/conftest.py — breaking the test infrastructure, after which everything fails downstream.
Fix: declare a firewall up front.
codex "Refactor src/, but DO NOT modify any of:
tests/conftest.py
scripts/deploy*.sh
.github/**
If you find changes there are needed, stop and report — I'll decide."Bake this into your task template. After a few runs it's reflex.
3. RENAME also rewrites string literals
You rename user_id → accountId for variables. Codex also rewrites the literal string "user_id" in log messages — and now every grep/dashboard query against logs is broken.
Fix: scope it explicitly.
codex "Rename user_id → accountId, but ONLY in variable names, function parameters, and class attributes.
DO NOT change string literals (logs, SQL, error messages).
DO NOT change keys in JSON / YAML / config files."4. TESTS task generates "passing" tests for buggy code
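After the rename commit lands, a one-line spot check that no quoted literals slipped through (the literal matches the example above):

```bash
# Flag any added/removed line that still touches the quoted string "user_id"
git diff HEAD~1 | grep -E '^[+-].*"user_id"' && echo "string literals touched, review" || echo "clean"
```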
4. TESTS task generates "passing" tests for buggy code

Codex backfills tests for a function with a known bug — and reads the current implementation as the spec. The test asserts the buggy behavior. Now when someone fixes the bug, the test fails — and the path of least resistance is to revert the fix, so the bug survives.
Fix: make Codex test the spec, not the implementation.
codex "Write unit tests for src/billing/calculate.py.
The spec lives at docs/billing-spec.md.
DO NOT infer tests from the current implementation — that freezes bugs into assertions.
If the spec is unclear, mark `# clarify-with-pm:` and don't guess."If there's no spec, the TESTS task should run after writing the spec. Don't try to skip it.
FAQ
Q: OpenAI direct works for Codex CLI. Why use CodeGateway?
A: Direct works fine. CodeGateway solves three concrete problems: (1) network instability for long tasks (Cloudflare edge keepalive); (2) no international credit card (email signup + multiple payment methods); (3) one key calls Claude / OpenAI / Google models — convenient when you want to use different upstreams for refactoring vs. test generation.
Q: Can I push the cost lower? $1–3 an evening still feels high.
A: Three paths: (1) prompt-cache hits drop input cost to ~10% — load the system prompt once, reuse across tasks; (2) route mechanical tasks to gpt-5-mini (cuts cost roughly in half on those); (3) slice tasks finer so retries don't replay full tasks.
Q: Can the Codex-generated code be merged directly?
A: No. Always review. Codex writes commits, but each commit needs human eyes. Two specific danger signs: (1) it changed the test to make the test pass (rather than the source); (2) it introduced magic numbers / hardcoded values instead of abstracting. A pre-merge checklist as a Claude Code Skill is documented in our configuration guide.
Q: Can it run in CI?
A: Yes. `codex exec` is the non-interactive mode for CI. But don't auto-merge — let CI run a review and humans decide on the merge. Use a dedicated CI key (named ci-<repo>) and set RPM and monthly budget caps on it.
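A minimal CI step, assuming `codex exec` accepts a prompt argument the way the interactive mode does:

```bash
# Non-interactive review in CI: comment, never auto-merge
export OPENAI_BASE_URL="https://api.codegateway.dev/v1"
export OPENAI_API_KEY="$CI_CODEX_KEY"   # dedicated key, e.g. ci-myrepo
codex exec "Review the diff between origin/main and HEAD. List risky changes. Do not modify files." \
  > codex-review.txt
```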
Q: Codex CLI vs Claude Code — which to pick?
A: Not mutually exclusive. Codex shines on OpenAI Responses API workflows (mainstream Python / TS, organizations already on OpenAI). Claude Code is stronger on long context, multi-agent orchestration, and its skill system. The same CodeGateway key works for both — pick per task. Detailed comparison post coming this Friday.
Q: My legacy project has commercially sensitive code. What now?
A: Codex CLI sends prompts upstream — so sensitive sections need redaction or placeholder substitution first. CodeGateway as a gateway doesn't persist conversation bodies (only metadata), but data retention upstream is governed by OpenAI's own policy. For sensitive workloads, review OpenAI's Enterprise / Zero-Retention terms.
Q: When shouldn't I run Codex on the cleanup?
A: Three cases: (1) projects without git history (no rollback); (2) critical business code with zero tests (no way to verify changes); (3) compliance / medical / financial core transaction code (every line needs human review). Backfill tests first, then bring in Codex.
Q: How do I prevent the codebase from "going legacy" again?
A: Two actions: (1) write the post-refactor conventions into the project README + .codex/policy.md, so the next person has a baseline; (2) add quality gates to CI (cyclomatic complexity ceiling, coverage floor) — the next person tempted to add a 600-line function gets blocked. A concrete gate is sketched below.
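For Python projects, one concrete pairing: xenon (radon's CI gating companion) for the complexity ceiling, pytest-cov for the coverage floor. Thresholds here are illustrative:

```bash
pip install xenon pytest-cov
xenon --max-absolute C --max-average B src/   # fail the build if any block is worse than grade C
pytest --cov=src --cov-fail-under=70          # fail below 70% line coverage
```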
Further reading
- Codex CLI setup tutorial — 5-step quickstart (CodeGateway docs)
- Claude Code connection timeout troubleshooting — root causes and fixes for long-task drops (transfers to Codex)
- The complete Claude Code configuration guide — Skills / Hooks / multi-key governance (complements Codex workflows)
- An honest receipt: 16 blog hero images for $0.92 in an hour — same "useful problem + honest receipt" approach applied to image generation
- Top-up and billing guide — tier markup, $2 starter, Stripe / Alipay / WeChat
- Tier markup explainer — 90-day rolling window, floor 1.2x
- OpenAI — Codex CLI on GitHub
- OpenAI — Responses API reference
- Task templates: open-sourced at Whitedit/code-gateway-cookbook (image-gen first; codex templates next)
Legacy code didn't get that way overnight, but it can start moving overnight. Diagnose first, take small steps, commit at a granularity you can roll back from — once the workflow is smooth, your attention shifts back to the question that actually matters: what should change here, and why.
