
Phase-Aware LLM Routing Explained (2026): Plan → Opus, Test → DeepSeek

2026-04-20 · 6 min read · CodeRouter Team

TL;DR — A phase-aware router looks at your last message + recent tool-call history, classifies the coding phase (plan / implement / debug / test / refactor / document / small_edit) in <10ms, then routes to the cheapest model that's actually good at that phase. Opus 4.7 earns its $75/M output only on plan and hard-debug calls; the other 80% of requests go to Sonnet 4.6 / DeepSeek V3 / Haiku 4.5 / Gemini Flash at 5–100× lower cost.

The problem: one-model agents waste tokens

Most coding agents default to a single model — Claude Opus 4.7 or GPT-5.2 — and use it for every call in the agent loop. Planning a feature? Opus. Writing a docstring? Opus. Fixing a typo in a variable name? Opus. Adding a null check? Opus.

This is absurd because coding is phased work:

  1. Planning (5% of calls, high reasoning) — "How should I structure the retry logic?"
  2. Implementation (55% of calls, pattern-heavy) — "Write a function that does X"
  3. Debugging (10% of calls, medium reasoning) — "Why is this failing?"
  4. Testing (15% of calls, template-heavy) — "Write pytest cases for this"
  5. Refactoring (8% of calls, medium reasoning) — "Simplify this method"
  6. Documentation (5% of calls, template-heavy) — "Add a docstring"
  7. Small edits (2% of calls, trivial) — "Rename this variable"

Each phase has radically different compute requirements. Opus is 100× overkill for docstring generation. A router that recognizes this and routes accordingly is the product wedge.

How phase detection works

CodeRouter's phase detector runs in four layers, each progressively more expensive and more accurate:

Layer 1: Regex on the last user message (~2ms)

Patterns we match against the most recent user message:

PHASE_PATTERNS = [
  // Debug — user describes a failure
  { phase: "debug",   re: /\b(why (is|does)|not working|crash|fails?|error|exception|traceback|fix the)\b/i },
  // Test — writing/running tests
  { phase: "test",    re: /\b(unit test|write a test|pytest|jest|assert|coverage)\b/i },
  // Plan — architecture / design
  { phase: "plan",    re: /\b(plan|design|architect|break.?down|how should (i|we)|approach)\b/i },
  // Refactor
  { phase: "refactor", re: /\b(refactor|clean.?up|simplify|extract)\b/i },
  // Document
  { phase: "document", re: /\b(document|docstring|README|explain)\b/i },
  // Small edit — trivial mechanical changes
  { phase: "small_edit", re: /\b(format|lint|fix typo|rename variable)\b/i },
  // Implement — broad default, checked last
  { phase: "implement", re: /\b(write|create|add|implement|build)\b/i },
];

Simple, fast, transparent. Covers about 70% of requests with 80%+ confidence.
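In TypeScript, first-match classification over that table takes only a few lines. The `detectPhaseByRegex` helper below is an illustrative sketch, not CodeRouter's actual code; the pattern list is the one from the article, ordered from most to least specific:

```typescript
type Phase =
  | "plan" | "implement" | "debug" | "test"
  | "refactor" | "document" | "small_edit";

// Ordered most-specific to least-specific; first match wins.
const PHASE_PATTERNS: { phase: Phase; re: RegExp }[] = [
  { phase: "debug",      re: /\b(why (is|does)|not working|crash|fails?|error|exception|traceback|fix the)\b/i },
  { phase: "test",       re: /\b(unit test|write a test|pytest|jest|assert|coverage)\b/i },
  { phase: "plan",       re: /\b(plan|design|architect|break.?down|how should (i|we)|approach)\b/i },
  { phase: "refactor",   re: /\b(refactor|clean.?up|simplify|extract)\b/i },
  { phase: "document",   re: /\b(document|docstring|README|explain)\b/i },
  { phase: "small_edit", re: /\b(format|lint|fix typo|rename variable)\b/i },
  { phase: "implement",  re: /\b(write|create|add|implement|build)\b/i },
];

// Returns the first matching phase, or null when nothing fires
// (in which case Layer 2 takes over).
function detectPhaseByRegex(message: string): Phase | null {
  for (const { phase, re } of PHASE_PATTERNS) {
    if (re.test(message)) return phase;
  }
  return null;
}
```

Note the ordering does real work: "write pytest cases" matches both the `test` and `implement` patterns, and the scan resolves it to `test` because that entry comes first.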

Layer 2: Tool-call history inspection (~3ms)

When Layer 1 is ambiguous, we look at the recent tool messages in the conversation history.
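A sketch of that inspection, assuming OpenAI-style chat messages where tool results carry role `"tool"` (the failure-signal regexes and the three-message window are illustrative assumptions, not CodeRouter's real heuristics):

```typescript
interface ChatMessage { role: "user" | "assistant" | "tool"; content: string; }

// Hypothetical Layer-2 heuristic: scan the last few tool results for
// failure signals (test output, stack traces) before falling back further.
function phaseFromToolHistory(messages: ChatMessage[]): "debug" | "test" | null {
  const recentTools = messages.filter(m => m.role === "tool").slice(-3);
  for (const msg of recentTools.reverse()) {
    // A failing test run or a traceback is the strongest debug signal.
    if (/\b(FAILED|AssertionError|Traceback|[1-9]\d* failed)\b/.test(msg.content)) {
      return "debug";
    }
    // A clean passing run suggests the agent is in a test-writing loop.
    if (/\b\d+ passed\b/.test(msg.content)) return "test";
  }
  return null;
}
```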

This is huge for agentic workflows. When Claude Code or Aider just ran a failing test, we don't need the user to say "debug this" — the tool output is the signal.

Layer 3: Agent fingerprinting (~2ms)

We also fingerprint the agent itself from the system prompt and tools array.
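A minimal fingerprinting sketch; the marker strings and tool names below are assumptions for illustration, not the real detection table:

```typescript
// Illustrative fingerprints only; real agents are identified by whatever
// stable markers their system prompts and tool schemas actually expose.
const AGENT_FINGERPRINTS = [
  { agent: "claude-code", marker: /Claude Code/i },
  { agent: "aider",       marker: /aider/i },
  { agent: "cursor",      marker: /Cursor/i },
];

function fingerprintAgent(systemPrompt: string, toolNames: string[]): string | null {
  for (const { agent, marker } of AGENT_FINGERPRINTS) {
    if (marker.test(systemPrompt)) return agent;
  }
  // Tool names can identify an agent even when the prompt is generic
  // ("run_terminal_cmd" is a hypothetical example of such a tell).
  if (toolNames.includes("run_terminal_cmd")) return "cursor";
  return null;
}
```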

When we identify a specific agent mode, the phase hint is near-certain (confidence 0.95).

Layer 4: LLM fallback (~150ms, rare)

If Layers 1–3 all return confidence <0.6, we dispatch a 50-token prompt to Claude Haiku 4.5 with the message body + one-shot example. Cost: ~$0.0001. Accuracy: ~92%. Used on maybe 5% of requests.
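The fallback prompt can stay tiny. This sketch only shows how such a ~50-token, one-shot classification prompt might be assembled; the exact wording is an assumption:

```typescript
const PHASES = [
  "plan", "implement", "debug", "test", "refactor", "document", "small_edit",
];

// Build a compact one-shot classification prompt for the fallback model.
// Truncating the message keeps the call cheap regardless of input size.
function buildFallbackPrompt(userMessage: string): string {
  return [
    `Classify the coding phase. Answer with one word from: ${PHASES.join(", ")}.`,
    `Example: "why does login 500?" -> debug`,
    `Message: "${userMessage.slice(0, 400)}"`,
    `Phase:`,
  ].join("\n");
}
```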

Phase → model mapping

Once we know the phase, we look up the phase → model preference table:

const PHASE_MODEL_PREFERENCE = {
  plan:       ["claude-opus-4.7", "gpt-5.2", "claude-sonnet-4.6", "gemini-3-pro"],
  implement:  ["claude-sonnet-4.6", "deepseek-chat", "gemini-3-pro", "gpt-5.2"],
  debug:      ["claude-sonnet-4.6", "deepseek-reasoner", "gpt-5.2", "claude-opus-4.7"],
  test:       ["deepseek-chat", "kimi-k2.5", "claude-sonnet-4.6", "gemini-3-flash"],
  refactor:   ["claude-sonnet-4.6", "deepseek-chat", "gemini-2.5-pro"],
  document:   ["claude-haiku-4.5", "gemini-3-flash", "gpt-5-mini"],
  small_edit: ["gpt-5-mini", "gemini-3-flash", "gemini-2.5-flash", "claude-haiku-4.5"],
};

Then we rank the candidates in that list by their quality/cost ratio. Winner: the model with the highest capability-to-cost ratio.
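As a sketch of that ranking step (the capability scores and prices below are made-up illustrations, not CodeRouter's real model metadata):

```typescript
interface ModelMeta { name: string; capability: number; costPer1M: number; }

// Illustrative metadata only; real scores and prices will differ.
const MODELS: Record<string, ModelMeta> = {
  "claude-opus-4.7":   { name: "claude-opus-4.7",   capability: 98, costPer1M: 33.0 },
  "claude-sonnet-4.6": { name: "claude-sonnet-4.6", capability: 92, costPer1M: 6.0  },
  "deepseek-chat":     { name: "deepseek-chat",     capability: 85, costPer1M: 0.32 },
};

// Pick the candidate with the best capability-to-cost ratio.
// Assumes at least one candidate is present in MODELS.
function pickModel(candidates: string[]): string {
  const known = candidates.filter(c => c in MODELS);
  return known.reduce((best, c) =>
    MODELS[c].capability / MODELS[c].costPer1M >
    MODELS[best].capability / MODELS[best].costPer1M ? c : best
  );
}
```

With these illustrative numbers, a slightly weaker but 100x cheaper model wins on ratio, which is exactly the behavior the phase table is designed to exploit.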

Response header transparency

Every routed response carries these headers so you can verify the decision:

X-CodeRouter-Phase: debug
X-CodeRouter-Phase-Confidence: 0.92
X-CodeRouter-Agent: cursor
X-CodeRouter-Model: claude-sonnet-4.6
X-CodeRouter-Cost: 0.000412
X-CodeRouter-Cost-If-Opus: 0.006180
X-CodeRouter-Savings: 93.3%
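Client-side, the decision can be read straight off a fetch response. This helper is a sketch using the standard `Headers` API and the header names above:

```typescript
// Parse CodeRouter's routing-decision headers from an HTTP response.
// Header names are from the article; the shape of the return value is ours.
function parseRoutingDecision(headers: Headers) {
  return {
    phase: headers.get("X-CodeRouter-Phase"),
    confidence: Number(headers.get("X-CodeRouter-Phase-Confidence")),
    model: headers.get("X-CodeRouter-Model"),
    // parseFloat tolerates the trailing "%" in X-CodeRouter-Savings.
    savingsPct: parseFloat(headers.get("X-CodeRouter-Savings") ?? "0"),
  };
}
```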

No black box. If you think the router picked wrong, the phase + confidence tell you why, and you can open a ticket with the exact request hash.

Blended cost math (why this works)

Assuming a realistic coding-agent workload distribution and aggressive routing:

| Phase | % of calls | Model choice | Blended cost (per 1M tokens) |
|---|---|---|---|
| Plan | 5% | Opus 50% / Sonnet 50% | $19.80 |
| Implement | 55% | DeepSeek V3 85% / Sonnet 15% | $1.26 |
| Debug | 10% | Sonnet 50% / DeepSeek V3 50% | $3.46 |
| Test | 15% | DeepSeek V3 100% | $0.32 |
| Small edit | 10% | Flash 100% | $1.25 |
| Document | 5% | Flash 100% | $1.25 |
| **Weighted average** | | | **~$2.27/M** |
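The weighted average falls straight out of the table:

```typescript
// Reproduce the blended-cost arithmetic from the table above.
const workload: Array<[share: number, costPer1M: number]> = [
  [0.05, 19.80], // plan: Opus 50% / Sonnet 50%
  [0.55, 1.26],  // implement: DeepSeek V3 85% / Sonnet 15%
  [0.10, 3.46],  // debug: Sonnet 50% / DeepSeek V3 50%
  [0.15, 0.32],  // test: DeepSeek V3 100%
  [0.10, 1.25],  // small edit: Flash 100%
  [0.05, 1.25],  // document: Flash 100%
];

const blended = workload.reduce((sum, [share, cost]) => sum + share * cost, 0);
// blended ≈ 2.26, vs Opus-for-everything at $33/M
const savings = 1 - blended / 33;
```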

Versus Opus-for-everything at $33/M, that's a 93% reduction in provider cost. Real user savings land at 70–90% after the platform fee + overage markup.

When phase-aware routing doesn't help

To be honest, the routing wedge has limits. A workload that is almost all planning and hard debugging still lands on the frontier models, so there is little cheap traffic left to route away.

The sweet spot is diverse coding workloads: Cursor / Aider / Claude Code / Copilot users doing a mix of features, bug fixes, tests, docs, and small edits every day. That's where routing produces 70–90% cost reduction with equivalent output quality.
