
We Run Coding Agents on 7 Different LLMs Per Session — Here's the 30-Day Production Data

2026-04-27 · 12 min read · CodeRouter Team
Tags: phase aware llm routing · coding agent multi model routing · claude code router production · llm router architecture · ai coding cost optimization · deepseek opus routing · phase detection coding agent · session sticky llm routing · prompt cache affinity routing · anthropic api gateway

TL;DR — A typical Claude Code session in our production traffic touches 7 different upstream LLMs as it moves through plan → implement → debug → test → refactor → document → small-edit phases. Each phase has radically different reasoning requirements; routing each phase to the cheapest capable model rather than the strongest model overall cuts cost by 78% vs all-Opus while maintaining indistinguishable output quality. This post explains the phase detection algorithm (runs in <8ms), the per-phase model assignments after a month of calibration, the pitfalls we hit, and the production data from 287M tokens served. The router is transparent about its routing decisions via response headers — every call returns X-CodeRouter-Phase and X-CodeRouter-Model so you can audit the choice.

The setup

We've been running phase-aware routing in production for 30 days at CodeRouter — an Anthropic-API-compatible gateway that fronts Claude Code, Cursor, Aider, Cline, and similar tools. Total served: 287M tokens across 5,957 requests in the last billing period, on a mix of Anthropic, OpenAI, DeepSeek, Moonshot, Google, and Z.AI backends. This post is what we learned.

The interesting question is not "which is the best model" — that question is dated the day a new model ships. The interesting question is: for any given coding agent call, what model gives the best ratio of (capability / cost)? Phase-aware routing is our answer.

Why one-model agents waste tokens

The default for most coding agents is to pick one model and use it for every call. Claude Code defaults to Claude Opus 4.7 ($15/$75 per 1M tokens). Cursor's "auto" defaults to Sonnet 4.6. Aider can be configured but usually gets pinned to one model.

This is wasteful because coding work is not homogeneous. A single 30-minute Claude Code session contains operations with capability requirements that differ by 1-2 orders of magnitude:

| Operation type | Capability needed | Cost per 1M tokens (Opus 4.7, in/out) | Cost per 1M tokens (DeepSeek V4-Flash, in/out) | Cost ratio |
|---|---|---:|---:|---:|
| Architectural design | High reasoning | $15-75 | n/a (don't use) | — |
| Write a new function from spec | Pattern-following | $15-75 | $0.14-0.28 | 107× |
| Generate pytest cases | Template-heavy | $15-75 | $0.14-0.28 | 107× |
| Add a docstring | Trivial completion | $15-75 | $0.14-0.28 | 107× |
| Rename a variable | Mechanical | $15-75 | $0.14-0.28 | 107× |

Sending "add a docstring to this function" to Opus 4.7 isn't just expensive — it's insulting to Opus. Any 7B model finetuned on code can do this perfectly. We pay 100× too much for nothing.

The phase-aware routing thesis: detect what phase the agent is in, route accordingly.

Phase detection in <8ms (without an LLM call)

Phase detection has to be fast — sub-10ms — because it runs on every routing decision. Calling another LLM to classify is too expensive (latency + cost). Our detector is a 3-layer cascade:

Layer 1: Agent fingerprinting (<1ms)

The system prompt and tool shape uniquely identify the calling agent in 80%+ of cases:

Claude Code:    system contains "I'm Claude" + specific tool names
                (View, Read, Edit, Bash, etc.)
Cursor:         system mentions "cursor" or "Cursor IDE" + tools 
                like "edit_file", "search_files"
Aider:          system contains "/architect" or "/edit" prefix
Cline:          system contains "Cline" + tool format hints
OpenClaw:       system signature, plus tools include 'task_complete'

Knowing the agent narrows the phase space dramatically. Claude Code in plan mode has a specific tag (<plan_mode>); Aider's /architect command has a specific prefix. Each agent has its own conventions for signaling intent.
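To make Layer 1 concrete, here is a minimal sketch of what the fingerprinting could look like in TypeScript. The string signatures mirror the list above; the exact matchers, tool names, and fallthrough behavior are simplified assumptions, not our production code.

type Agent = "claude_code" | "cursor" | "aider" | "cline" | "openclaw" | "unknown";

interface FingerprintInput {
  system: string;       // concatenated system prompt text
  toolNames: string[];  // tool names declared on the request
}

// Cheap string signatures first, then tool shape. Real signatures drift between
// agent versions, so production matchers are broader than this sketch.
function fingerprintAgent({ system, toolNames }: FingerprintInput): Agent {
  const tools = new Set(toolNames);
  if (system.includes("I'm Claude") && tools.has("Edit") && tools.has("Bash")) return "claude_code";
  if (/cursor/i.test(system) && (tools.has("edit_file") || tools.has("search_files"))) return "cursor";
  if (system.includes("/architect") || system.includes("/edit")) return "aider";
  if (system.includes("Cline")) return "cline";
  if (tools.has("task_complete")) return "openclaw";
  return "unknown";  // fall through to Layer 2
}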

Layer 2: Message + tool-history pattern matching (<3ms)

For requests not pinned to a specific agent mode, regex + keyword matching on the most recent user message + last 5 tool-call results:

const PHASE_SIGNALS = {
  plan:        /how should|architect|design|approach|plan/i,
  debug:       /error|exception|stack trace|why is|failing/i,
  test:        /pytest|jest|test case|unit test|spec/i,
  refactor:    /refactor|simplify|extract|rename|clean up/i,
  document:    /docstring|comment|readme|explain/i,
  small_edit:  /^(rename|change|update) (this|the)/i,
};

Combined with simple heuristics (message length, code-fence presence, tool-call recency), this gets us to ~85% phase confidence in <3ms.
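As a rough illustration of how the regex hits and the heuristics combine, here is a sketch of a Layer 2 scorer. The base score and the nudges are invented for illustration (our real weights came out of the calibration month), and we're assuming implement acts as the default phase when no signal fires.

type Phase = "plan" | "implement" | "debug" | "test" | "refactor" | "document" | "small_edit";

interface PhaseGuess { phase: Phase; confidence: number; }

// signals is the PHASE_SIGNALS map shown above.
function classifyLayer2(
  userMessage: string,
  recentToolCalls: string[],
  signals: Record<string, RegExp>,
): PhaseGuess {
  let best: PhaseGuess = { phase: "implement", confidence: 0.5 };  // default when nothing matches
  for (const [phase, pattern] of Object.entries(signals)) {
    if (!pattern.test(userMessage)) continue;
    let confidence = 0.7;                                   // base score for a keyword hit
    if (userMessage.length < 200) confidence += 0.1;        // short prompts are usually literal asks
    if (userMessage.includes("```")) confidence += 0.05;    // a code fence means a concrete target
    if (phase === "test" && recentToolCalls.some(t => /pytest|jest/i.test(t))) confidence += 0.1;
    if (confidence > best.confidence) best = { phase: phase as Phase, confidence: Math.min(confidence, 0.99) };
  }
  return best;
}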

Layer 3: LLM fallback classifier (~800ms, async)

For the remaining 15% where layers 1+2 are uncertain, a tiny model (Claude Haiku 4.5 or DeepSeek V4-Flash, depending on availability) does a focused classification call. We cache the result for 10 minutes keyed by user prompt hash, so the second occurrence of "what phase is this prompt" is instant.

Layer 3 only fires for non-streaming requests because the ~800ms latency would show up as added time-to-first-token in streaming. For streaming requests, we accept slightly less accurate routing in exchange for instant TTFT.
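The call pattern for Layer 3 is mostly about the cache. A sketch, with the actual classification call stubbed out (the prompt-hash key and 10-minute TTL come from the description above; everything else is illustrative):

import { createHash } from "node:crypto";

type Phase = "plan" | "implement" | "debug" | "test" | "refactor" | "document" | "small_edit";

const TTL_MS = 10 * 60 * 1000;  // 10-minute cache
const phaseCache = new Map<string, { phase: Phase; expiresAt: number }>();

// Stub: in production this is a small classification prompt sent to Haiku 4.5
// or V4-Flash, returning a single phase label.
async function llmClassify(userMessage: string): Promise<Phase> {
  throw new Error("wire up the small-model classification call here");
}

async function classifyLayer3(userMessage: string): Promise<Phase> {
  const key = createHash("sha256").update(userMessage).digest("hex");
  const hit = phaseCache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.phase;  // second occurrence is instant

  const phase = await llmClassify(userMessage);
  phaseCache.set(key, { phase, expiresAt: Date.now() + TTL_MS });
  return phase;
}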

The full cascade resolves in <8ms p95 in production. Detection accuracy measured against manual labeling: 87% exact-phase match, 96% within-one-phase (i.e., classifying "implement" as "small_edit" counts as close-enough). That's good enough — within-one-phase errors don't materially change the routed model.

The phase × model assignment table

This is what we route to in production today (April 2026), after iterating for a month:

| Phase | Primary model | First fallback | Second fallback | Why this assignment |
|---|---|---|---|---|
| plan | claude-opus-4.7 | deepseek-v4-pro | gpt-5.5 | Plan mode rewards reasoning depth. Opus 4.7's SWE-Bench Pro 64.3% is the empirical leader; V4-Pro at 80.6% Verified is the cheapest near-Opus alternative |
| implement | deepseek-v4-pro | claude-sonnet-4.6 | kimi-k2.6 | V4-Pro promoted to #1 after independent benchmarks confirmed Opus-tier quality at 1/10 the cost. Sonnet 4.6 fallback for tool-call reliability |
| debug | claude-opus-4.7 | gpt-5.5 | deepseek-v4-pro | Opus's edge-case sensitivity is real. GPT-5.5 wins Terminal-Bench 2.0 specifically for agent-loop debugging |
| test | deepseek-v4-flash | deepseek-v4-pro | kimi-k2.6 | Test generation is template-heavy. V4-Flash at $0.14/M is the bargain. Pro for harder integration tests |
| refactor | deepseek-v4-pro | kimi-k2.6 | claude-sonnet-4.6 | Refactoring needs context window (V4-Pro: 1M, K2.6: 256K) and structural reasoning. K2.6's agent-swarm training shows up here |
| document | claude-haiku-4.5 | gemini-3-flash | gpt-5-mini | Documentation is closest to "predict the next obvious sentence". Tiniest models suffice |
| small_edit | gpt-5-mini | gemini-3-flash | claude-haiku-4.5 | Renames, single-line changes. Latency matters more than capability |

What's striking is how different the assignments are by phase. There is no "best model" — there's "best model for this phase, given current pricing and our calibration data."
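In config terms the table is just a phase-keyed priority list of model IDs. A sketch of that shape (model IDs copied from the table above; the real config schema is simplified here):

type Phase = "plan" | "implement" | "debug" | "test" | "refactor" | "document" | "small_edit";

// Primary first, then fallbacks, tried in order when a provider errors or rate-limits.
const PHASE_ROUTES: Record<Phase, string[]> = {
  plan:       ["claude-opus-4.7", "deepseek-v4-pro", "gpt-5.5"],
  implement:  ["deepseek-v4-pro", "claude-sonnet-4.6", "kimi-k2.6"],
  debug:      ["claude-opus-4.7", "gpt-5.5", "deepseek-v4-pro"],
  test:       ["deepseek-v4-flash", "deepseek-v4-pro", "kimi-k2.6"],
  refactor:   ["deepseek-v4-pro", "kimi-k2.6", "claude-sonnet-4.6"],
  document:   ["claude-haiku-4.5", "gemini-3-flash", "gpt-5-mini"],
  small_edit: ["gpt-5-mini", "gemini-3-flash", "claude-haiku-4.5"],
};

Keeping the assignment as data rather than logic means a pricing change or a new model release is a one-line config edit, not a router rewrite.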

30-day production results

Aggregated across 287M tokens served in the last billing period:

Total requests:                    5,957
Total tokens:                      287.6M  (281.2M input / 6.4M output)
Actual cost:                       $1,068
What it would have cost on Opus:   $4,992
Savings:                           $3,924  (78.6%)
Cache hit rate:                    30.8%   (rising; 70%+ on Anthropic routes after the affinity fix below)
Fallback engagements:              1       (0.02% of requests)

The 78.6% reduction comes from routing intelligence, not caching. Cache savings are a separate $10 (small because we route most volume to V4-Flash, where input is already $0.14/M and the cache discount is small in absolute terms).

A few specific observations:

Implementation is the volume driver. ~50% of requests were classified as implement. After promoting V4-Pro to primary for this phase (April 27), our V4-Pro share of total traffic went from ~5% to a projected 25-30%. This is the single biggest cost lever in the whole router.

Plan and debug stayed expensive on purpose. Opus 4.7 at $15/$75 is non-negotiable for these phases — we tried routing plan to Sonnet 4.6 in week 2 and saw user complaints about "shallower" architectural suggestions. Reverted within 24 hours. Users notice quality differences sharply on plan/debug; on implement/test/docs the difference is basically invisible.

Document and small-edit phases are nearly free. Combined ~7% of requests, total cost <$5 in the period. These are the easiest to optimize because the models are 100-200× cheaper and the work is genuinely simpler.

Fallback fires almost never. 1 fallback in 5,957 requests means our primary picks are good. When fallback does fire, it's almost always a transient provider 503, resolved in milliseconds by retrying against the next model in the list.

The pitfalls (what we tried that didn't work)

Phase-aware routing isn't free engineering — we hit real walls. Three are worth documenting because they're not in any other writeup:

Pitfall 1: Aggressive routing fragments prompt cache

Our first naive implementation classified and routed each request independently. That meant turn 1 might go to Sonnet 4.6, turn 2 to V4-Pro, turn 3 back to Sonnet, and so on. Cache hit rate dropped to <1% because Anthropic's prompt cache is scoped per model and per prefix, so every model switch starts from a cold cache.

Fix: session stickiness. Once a session lands on a model in turn 1, subsequent turns prefer the same model unless there's a strong reason to switch (confidence ≥0.85 + phase change to a higher tier). After this fix, cache hit rate climbed to ~26% within 72 hours.
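A sketch of the stickiness rule: keep the session's model unless we're confident the session has escalated into a phase that needs a stronger tier. The ≥0.85 threshold is the one mentioned above; the tier ordering below is an illustrative assumption.

type Phase = "plan" | "implement" | "debug" | "test" | "refactor" | "document" | "small_edit";

// Illustrative capability tiers: higher means "needs a stronger model".
const PHASE_TIER: Record<Phase, number> = {
  small_edit: 0, document: 0, test: 1, implement: 1, refactor: 1, debug: 2, plan: 2,
};

interface SessionState { model: string; phase: Phase; }

function pickModel(
  session: SessionState | undefined,
  detected: { phase: Phase; confidence: number },
  preferredForPhase: string,
): string {
  if (!session) return preferredForPhase;  // turn 1: route normally
  const escalating = PHASE_TIER[detected.phase] > PHASE_TIER[session.phase];
  if (detected.confidence >= 0.85 && escalating) return preferredForPhase;
  return session.model;  // otherwise stay put and keep the prompt cache warm
}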

Pitfall 2: Round-robin multi-account further fragments cache

To handle Anthropic's per-key rate limits (60 req/min default), we added pool-based multi-account rotation across 3 Anthropic API keys. Round-robin across the 3 keys meant each key saw only a third of the traffic, so each key's cache stayed mostly cold, dividing the effective hit rate by roughly 3.

Fix: affinity routing. Hash (provider, agent_type, system_prompt[0:1024]) and consistently route the same hash to the same backend key. Same-agent traffic concentrates on one key; that key's cache stays continuously hot. Critical insight: the LLM API is stateless from the user's perspective, but the cache is per-API-key from the provider's perspective. Affinity makes them line up. Cache hit rate jumped from 26% → 70%+ on Anthropic routes after this fix.
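The affinity key itself is small. A sketch (the hash choice is incidental, any stable hash works; the key pool values and the example system prompt are placeholders):

import { createHash } from "node:crypto";

// Stable hash over the attributes that determine the cacheable prefix.
function affinityKeyIndex(provider: string, agentType: string, systemPrompt: string, poolSize: number): number {
  const digest = createHash("sha256")
    .update(`${provider}|${agentType}|${systemPrompt.slice(0, 1024)}`)
    .digest();
  // First 4 bytes as an unsigned int, mapped onto the key pool.
  return digest.readUInt32BE(0) % poolSize;
}

// All Claude Code traffic with the same system-prompt prefix lands on the same key,
// so that key's prompt cache stays continuously hot.
const keyPool = ["ANTHROPIC_KEY_A", "ANTHROPIC_KEY_B", "ANTHROPIC_KEY_C"];  // placeholders
const chosenKey = keyPool[affinityKeyIndex("anthropic", "claude_code", "You are Claude Code, ...", keyPool.length)];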

Pitfall 3: Chinese-LLM thinking modes break multi-turn

Three different Chinese providers (DeepSeek's deepseek-reasoner, Moonshot's Kimi K2.6, Z.AI's GLM-5.1) emit a reasoning_content field that must be round-tripped to subsequent calls — a requirement the standard OpenAI/Anthropic message schemas don't accommodate. Our gateway's translator strips it, so multi-turn agents get a 400 on turn 2.

This isn't a routing-algorithm problem; it's a protocol-compatibility problem that bites anyone integrating Chinese LLMs through OpenAI-compatible endpoints. We wrote it up in detail here. Short version: pass extra_body: {thinking: {type: "disabled"}} to Moonshot/Z.AI; remove *-reasoner model IDs from auto-routing pools entirely.
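For concreteness, this is roughly the shape of a raw request with thinking disabled against an OpenAI-compatible endpoint. The base URL, key, and exact field placement are placeholders and assumptions; the point is that whatever extra_body carries ends up at the top level of the JSON body, so check the provider's docs for the authoritative field names.

// Sketch: raw OpenAI-compatible chat completion with the provider's thinking mode
// disabled, so no reasoning_content has to be round-tripped on later turns.
const resp = await fetch("https://api.example-provider.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.PROVIDER_API_KEY}`,
  },
  body: JSON.stringify({
    model: "kimi-k2.6",                    // or glm-5.1, per the routing table above
    messages: [{ role: "user", content: "Rename foo to bar in utils.ts" }],
    thinking: { type: "disabled" },        // the field extra_body injects
  }),
});
console.log((await resp.json()).choices?.[0]?.message?.content);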

Anti-pattern: "but isn't this over-engineering?"

We've heard this question a few times. Honest answer:

For a solo developer running occasional Claude Code sessions: yes, possibly over-engineering. Just use Claude Pro at $20/mo and don't think about it. The savings aren't worth the gateway complexity.

For a 5+ person engineering team running coding agents 8 hours/day: absolutely worth it. The team's monthly Claude Code bill goes from $1500-3000 to $200-400. That's the difference between a budgeted line item and an "uh, can we have a meeting about AI spend" Slack message from finance.

For an AI product builder integrating coding capabilities: the routing infrastructure is the moat. Whatever model is cheapest-and-best in 6 months will be different from today; the router that absorbs that change without a code rewrite is what scales.

The "over-engineering" critique usually comes from people who haven't yet hit the bill where it stops being theoretical. Once monthly LLM spend crosses ~$500, the cost of not having a router exceeds the cost of building one.

What we'd build next (open backlog)

A few pieces we have prototyped but not shipped:

A note on transparency

Every CodeRouter response includes headers that show the routing decision:

X-CodeRouter-Model: deepseek-v4-pro
X-CodeRouter-Phase: implement
X-CodeRouter-Phase-Confidence: 0.92
X-CodeRouter-Agent: claude_code
X-CodeRouter-Cost: 0.000123
X-CodeRouter-Cost-If-Opus: 0.012874
X-CodeRouter-Classifier: L1
X-CodeRouter-Cache-Read: 1842

This is deliberate. A router that hides its decisions is asking for trust it doesn't deserve. Show the user what was picked, why, and what it cost. Users who want to audit can; users who don't can ignore.
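A minimal audit is just reading those headers off a normal call through the gateway. A sketch (gateway URL and key are placeholders; the request body is the standard Anthropic Messages format the gateway accepts):

// Sketch: one call through the gateway, then log what the router actually did.
const resp = await fetch("https://gateway.example.com/v1/messages", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": process.env.CODEROUTER_API_KEY ?? "",
    "anthropic-version": "2023-06-01",
  },
  body: JSON.stringify({
    model: "claude-opus-4.7",              // what the agent asked for; headers show what actually ran
    max_tokens: 256,
    messages: [{ role: "user", content: "Add a docstring to parse_config()" }],
  }),
});

for (const name of [
  "X-CodeRouter-Model",
  "X-CodeRouter-Phase",
  "X-CodeRouter-Phase-Confidence",
  "X-CodeRouter-Cost",
  "X-CodeRouter-Cost-If-Opus",
]) {
  console.log(`${name}: ${resp.headers.get(name)}`);  // header lookup is case-insensitive
}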

Summary

Phase-aware routing is one of those ideas that sounds obvious in hindsight: agents do different kinds of work; each kind has different model requirements; route accordingly. But it has real engineering depth — the detection has to be fast, the routing has to play nice with cache, the fallbacks have to be robust, and you'll discover provider-protocol surprises that aren't documented anywhere.

After 30 days running it on real Claude Code / Cursor / Aider traffic, the data is clear: 78% cost reduction with no measurable quality loss for routine coding work. The reasoning-heavy phases (plan, debug) stay on Opus where it matters; the implementation grind moves to V4-Pro / V4-Flash where it doesn't matter that the model is "less smart" overall.

If you're building a coding agent, an API gateway, or an AI product that uses LLMs for software engineering: the optimization isn't picking the best model. It's not picking the best model for the 70% of work that doesn't need it.

