TL;DR — A phase-aware router looks at your last message + recent tool-call history, classifies the coding phase (plan / implement / debug / test / refactor / document / small_edit) in <10ms, then routes to the cheapest model that's actually good at that phase. Opus 4.7 earns its $75/M output only on plan and hard-debug calls; the other 80% of requests go to Sonnet 4.6 / DeepSeek V3 / Haiku 4.5 / Gemini Flash at 5–100× lower cost.
The problem: one-model agents waste tokens
Most coding agents default to a single model — Claude Opus 4.7 or GPT-5.2 — and use it for every call in the agent loop. Planning a feature? Opus. Writing a docstring? Opus. Fixing a typo in a variable name? Opus. Adding a null check? Opus.
This is absurd because coding is phased work:
- Planning (5% of calls, high reasoning) — "How should I structure the retry logic?"
- Implementation (55% of calls, pattern-heavy) — "Write a function that does X"
- Debugging (10% of calls, medium reasoning) — "Why is this failing?"
- Testing (15% of calls, template-heavy) — "Write pytest cases for this"
- Refactoring (8% of calls, medium reasoning) — "Simplify this method"
- Documentation (5% of calls, template-heavy) — "Add a docstring"
- Small edits (2% of calls, trivial) — "Rename this variable"
Each phase has radically different compute requirements. Opus is 100× overkill for docstring generation. A router that recognizes this and routes accordingly is the product wedge.
How phase detection works
CodeRouter's phase detector runs in four layers, each progressively more expensive and more accurate:
Layer 1: Regex on the last user message (~2ms)
Patterns we match against the most recent user message:
const PHASE_PATTERNS = [
// Debug — user describes a failure
{ phase: "debug", re: /\b(why (is|does)|not working|crash|fails?|error|exception|traceback|fix the)\b/i },
// Test — writing/running tests
{ phase: "test", re: /\b(unit test|write a test|pytest|jest|assert|coverage)\b/i },
// Plan — architecture / design
{ phase: "plan", re: /\b(plan|design|architect|break.?down|how should (i|we)|approach)\b/i },
// Refactor
{ phase: "refactor", re: /\b(refactor|clean.?up|simplify|extract)\b/i },
// Document
{ phase: "document", re: /\b(document|docstring|README|explain)\b/i },
// Small edit — last resort
{ phase: "small_edit", re: /\b(format|lint|fix typo|rename variable)\b/i },
// Implement — broad default
{ phase: "implement", re: /\b(write|create|add|implement|build)\b/i },
];
Simple, fast, transparent. Covers about 70% of requests with 80%+ confidence.
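The matching logic itself is a first-pattern-wins scan down the table. A minimal sketch, using a trimmed two-entry copy of PHASE_PATTERNS so the snippet stands alone (the flat 0.8 confidence and the `classifyByRegex` name are illustrative assumptions, not CodeRouter internals):

```typescript
// Trimmed copy of PHASE_PATTERNS: order matters, first match wins.
const SAMPLE_PATTERNS = [
  { phase: "debug", re: /\b(why (is|does)|not working|fails?|error|traceback)\b/i },
  { phase: "implement", re: /\b(write|create|add|implement|build)\b/i },
];

function classifyByRegex(
  message: string,
  patterns = SAMPLE_PATTERNS,
): { phase: string; confidence: number } | null {
  for (const { phase, re } of patterns) {
    if (re.test(message)) return { phase, confidence: 0.8 }; // flat confidence is an assumption
  }
  return null; // ambiguous -> fall through to Layer 2
}
```

Because debug patterns sit above the broad implement catch-all, "why is this failing" routes to debug even though it contains no explicit "debug" keyword.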
Layer 2: Tool-call history inspection (~3ms)
When Layer 1 is ambiguous, we look at the recent tool messages in the conversation history:
- Last tool_result contains `error|traceback|FAIL` → phase=debug (confidence 0.95)
- Last tool_result has `PASS|failing|Ran N tests` → phase=test (confidence 0.85)
- Multiple recent `file_read` calls without an explicit question → phase=plan or review
- Recent `Write`/`Edit` tool calls to a `test_*.py` or `*.test.ts` path → phase=test (confidence 0.90)
- Recent `Write` to a `.md` path → phase=document
This is huge for agentic workflows. When Claude Code or Aider just ran a failing test, we don't need the user to say "debug this" — the tool output is the signal.
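A sketch of what the Layer-2 scan could look like. The message shape, the five-message window, and the exact regexes are illustrative assumptions; the signals and confidences mirror the list above:

```typescript
type ToolMsg = { role: string; name: string; path?: string; content: string };

function classifyByToolHistory(
  history: ToolMsg[],
): { phase: string; confidence: number } | null {
  const last = history[history.length - 1];
  // \bFAIL(ED)?\b avoids matching "failing", which is a test signal below.
  if (last && /error|traceback|\bFAIL(ED)?\b/i.test(last.content))
    return { phase: "debug", confidence: 0.95 };
  if (last && /\bPASS\b|failing|Ran \d+ tests/.test(last.content))
    return { phase: "test", confidence: 0.85 };

  const recent = history.slice(-5); // window size is an assumption
  if (recent.some(m => /^(Write|Edit)$/.test(m.name) && !!m.path &&
      /(^|\/)test_.*\.py$|\.test\.ts$/.test(m.path)))
    return { phase: "test", confidence: 0.9 };
  if (recent.some(m => m.name === "Write" && m.path?.endsWith(".md")))
    return { phase: "document", confidence: 0.8 };
  if (recent.filter(m => m.name === "file_read").length >= 3)
    return { phase: "plan", confidence: 0.7 };
  return null;
}
```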
Layer 3: Agent fingerprinting (~2ms)
We also fingerprint the agent itself from the system prompt and tools array:
- Cursor → usually implement phase (Composer), sometimes plan
- Aider `--architect` mode → plan phase for architect calls, implement for editor
- Claude Code with `<plan_mode>` marker → plan, obviously
- OpenClaw's tool schema (contains `create_plan`, `implement_step`) → phase depends on which tool is being invoked
- Windsurf Cascade → typically implement
When we identify a specific agent mode, the phase hint is near-certain (confidence 0.95).
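In code, fingerprinting is a handful of signature checks against the system prompt and tool names. The signature strings below are illustrative examples, not an exhaustive fingerprint database:

```typescript
type Fingerprint = { agent: string; phase: string; confidence: number };

function fingerprintAgent(systemPrompt: string, toolNames: string[]): Fingerprint | null {
  if (systemPrompt.includes("<plan_mode>"))
    return { agent: "claude-code", phase: "plan", confidence: 0.95 };
  if (toolNames.includes("create_plan") && toolNames.includes("implement_step"))
    // real phase depends on which of the two tools is being invoked
    return { agent: "openclaw", phase: "plan", confidence: 0.95 };
  if (/aider/i.test(systemPrompt) && /architect/i.test(systemPrompt))
    return { agent: "aider", phase: "plan", confidence: 0.95 };
  if (/cursor/i.test(systemPrompt))
    return { agent: "cursor", phase: "implement", confidence: 0.7 };
  return null; // unknown agent -> no hint
}
```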
Layer 4: LLM fallback (~150ms, rare)
If Layers 1–3 all return confidence <0.6, we dispatch a 50-token prompt to Claude Haiku 4.5 with the message body + one-shot example. Cost: ~$0.0001. Accuracy: ~92%. Used on maybe 5% of requests.
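The fallback is just a tiny classification prompt plus a one-word-reply parser. A sketch under stated assumptions: the prompt wording, reply format, and `implement` safe-default are guesses, and the actual network call to Haiku is omitted:

```typescript
const PHASES = ["plan", "implement", "debug", "test", "refactor", "document", "small_edit"];

function buildFallbackPrompt(userMessage: string): string {
  return [
    "Classify the coding phase of this request as one of:",
    PHASES.join(", "),
    'Example: "why does parse() return null" -> debug',
    `Request: "${userMessage.slice(0, 400)}"`, // truncation keeps the call ~50 tokens
    "Answer with the phase only.",
  ].join("\n");
}

function parseFallbackReply(reply: string): { phase: string; confidence: number } {
  const word = reply.trim().toLowerCase().replace(/[^a-z_]/g, "");
  const valid = PHASES.includes(word);
  // 0.92 mirrors the measured accuracy above; 0.5 flags a garbled reply
  return { phase: valid ? word : "implement", confidence: valid ? 0.92 : 0.5 };
}
```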
Phase → model mapping
Once we know the phase, we look up the phase → model preference table:
const PHASE_MODEL_PREFERENCE = {
plan: ["claude-opus-4.7", "gpt-5.2", "claude-sonnet-4.6", "gemini-3-pro"],
implement: ["claude-sonnet-4.6", "deepseek-chat", "gemini-3-pro", "gpt-5.2"],
debug: ["claude-sonnet-4.6", "deepseek-reasoner", "gpt-5.2", "claude-opus-4.7"],
test: ["deepseek-chat", "kimi-k2.5", "claude-sonnet-4.6", "gemini-3-flash"],
refactor: ["claude-sonnet-4.6", "deepseek-chat", "gemini-2.5-pro"],
document: ["claude-haiku-4.5", "gemini-3-flash", "gpt-5-mini"],
small_edit: ["gpt-5-mini", "gemini-3-flash", "gemini-2.5-flash", "claude-haiku-4.5"],
};
Then we rank the candidate list by quality/cost ratio, after filtering it by:
- Models supported by the user's plan
- Models that can satisfy request features (tool use, vision)
- Per-model scores on the specific phase's capability dimension (`code_plan`, `code_implement`, etc.)
Winner: highest capability / cost ratio.
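A minimal sketch of that filter-then-rank step. The model metadata, capability scores, and prices here are made-up placeholders, not CodeRouter's real table:

```typescript
type ModelInfo = { costPerMTok: number; caps: Record<string, number>; features: string[] };

// Placeholder metadata for two candidates.
const MODELS: Record<string, ModelInfo> = {
  "claude-sonnet-4.6": { costPerMTok: 6.6, caps: { code_debug: 0.92 }, features: ["tools", "vision"] },
  "deepseek-reasoner": { costPerMTok: 0.8, caps: { code_debug: 0.85 }, features: ["tools"] },
};

function pickModel(
  candidates: string[],     // ordered preference list for the phase
  capDim: string,           // e.g. "code_debug"
  allowed: Set<string>,     // models on the user's plan
  required: string[],       // request features, e.g. ["tools"]
): string | null {
  const viable = candidates.filter(
    (m) => allowed.has(m) && MODELS[m] && required.every((f) => MODELS[m].features.includes(f)),
  );
  let best: string | null = null;
  let bestRatio = -Infinity;
  for (const m of viable) {
    const ratio = (MODELS[m].caps[capDim] ?? 0) / MODELS[m].costPerMTok; // capability per dollar
    if (ratio > bestRatio) { bestRatio = ratio; best = m; }
  }
  return best;
}
```

With these placeholder numbers, a debug call that only needs tools goes to the cheaper model (0.85/0.8 beats 0.92/6.6), but adding a vision requirement flips the pick.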
Response header transparency
Every routed response carries these headers so you can verify the decision:
X-CodeRouter-Phase: debug
X-CodeRouter-Phase-Confidence: 0.92
X-CodeRouter-Agent: cursor
X-CodeRouter-Model: claude-sonnet-4.6
X-CodeRouter-Cost: 0.000412
X-CodeRouter-Cost-If-Opus: 0.006180
X-CodeRouter-Savings: 93.3%
No black box. If you think the router picked wrong, the phase + confidence tell you why, and you can open a ticket with the exact request hash.
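The savings header is derivable from the two cost headers, so you can audit it yourself. A sketch (one-decimal rounding is an assumption, but it reproduces the example response above):

```typescript
// X-CodeRouter-Savings = 1 - actual cost / counterfactual Opus cost
function savingsHeader(cost: number, costIfOpus: number): string {
  const pct = (1 - cost / costIfOpus) * 100;
  return `${pct.toFixed(1)}%`;
}
```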
Blended cost math (why this works)
Assuming a realistic coding-agent workload distribution and aggressive routing:
| Phase | % of calls | Model choice | Cost (blended per 1M) |
|---|---|---|---|
| Plan | 5% | Opus 50% / Sonnet 50% | $19.80 |
| Implement | 55% | DeepSeek V3 85% / Sonnet 15% | $1.26 |
| Debug | 10% | Sonnet 50% / DeepSeek V3 50% | $3.46 |
| Test | 15% | DeepSeek V3 100% | $0.32 |
| Small edit | 10% | Flash 100% | $1.25 |
| Document | 5% | Flash 100% | $1.25 |
| Weighted average | | | ~$2.27/M |

(The 8% refactor share from the distribution above is folded into the small-edit row here, hence 10% rather than 2%.)
Versus Opus-for-everything at $33/M, that's a 93% reduction in provider cost. Real user savings land at 70–90% after the platform fee + overage markup.
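You can check the arithmetic directly from the table rows (the per-row blended costs are rounded, so the exact sum lands a hair under the quoted ~$2.27/M):

```typescript
// [share of calls, blended $/M] for each row of the table above.
const WORKLOAD: [number, number][] = [
  [0.05, 19.80], // plan
  [0.55, 1.26],  // implement
  [0.10, 3.46],  // debug
  [0.15, 0.32],  // test
  [0.10, 1.25],  // small edit
  [0.05, 1.25],  // document
];

const weighted = WORKLOAD.reduce((sum, [share, cost]) => sum + share * cost, 0);
// ≈ 2.26 with the rounded rows; the article quotes ~$2.27/M
const reduction = (1 - weighted / 33) * 100; // vs Opus-for-everything at $33/M
// ≈ 93%, matching the claim below
```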
When phase-aware routing doesn't help
Being honest: the routing wedge has limits.
- If your entire workload is frontier-reasoning (research scientists, novel algorithm design): everything really does need Opus. Routing gains are 0–10%.
- If your workload is entirely trivial (autocomplete, syntax fixes): you're already at Flash territory and routing can't go lower.
- Very long conversations with dense tool use: phase signal gets noisy across turns. Accuracy drops from 85% to ~70%. Still better than one-model baseline, but less dramatic.
The sweet spot is diverse coding workloads: Cursor / Aider / Claude Code / Copilot users doing a mix of features, bug fixes, tests, docs, and small edits every day. That's where routing produces 70–90% cost reduction with equivalent output quality.