
We Run Coding Agents on 7 Different LLMs Per Session — Here's the 30-Day Production Data

2026-04-27 · 12 min read · CodeRouter Team
Tags: phase aware llm routing · coding agent multi model routing · claude code router production · llm router architecture · ai coding cost optimization · deepseek opus routing · phase detection coding agent · session sticky llm routing · prompt cache affinity routing · anthropic api gateway

TL;DR — A typical Claude Code session in our production traffic touches 7 different upstream LLMs as it moves through plan → implement → debug → test → refactor → document → small-edit phases. Each phase has radically different reasoning requirements; routing each phase to the cheapest capable model rather than the strongest model overall cuts cost by 78% vs all-Opus while maintaining indistinguishable output quality. This post explains the phase detection algorithm (runs in <8ms), the per-phase model assignments after a month of calibration, the pitfalls we hit, and the production data from 287M tokens served. The router is transparent about its routing decisions via response headers — every call returns X-CodeRouter-Phase and X-CodeRouter-Model so you can audit the choice.

The setup

We've been running phase-aware routing in production for 30 days at CodeRouter — an Anthropic-API-compatible gateway that fronts Claude Code, Cursor, Aider, Cline, and similar tools. Total served: 287M tokens across 5,957 requests in the last billing period, on a mix of Anthropic, OpenAI, DeepSeek, Moonshot, Google, and Z.AI backends. This post is what we learned.

The interesting question is not "which is the best model" — that question is dated the day a new model ships. The interesting question is: for any given coding agent call, what model gives the best ratio of (capability / cost)? Phase-aware routing is our answer.

Why one-model agents waste tokens

The default for most coding agents is to pick one model and use it for every call. Claude Code defaults to Claude Opus 4.7 ($15/$75 per 1M tokens). Cursor's "auto" defaults to Sonnet 4.6. Aider can be configured but usually gets pinned to one model.

This is wasteful because coding work is not homogeneous. A single 30-minute Claude Code session contains operations with capability requirements that differ by 1-2 orders of magnitude:

| Operation type | Capability needed | Cost per 1M tokens (Opus 4.7, in/out) | Cost per 1M tokens (DeepSeek V4-Flash, in/out) | Cost ratio |
|---|---|---:|---:|---:|
| Architectural design | High reasoning | $15-75 | n/a (don't use) | — |
| Write a new function from spec | Pattern-following | $15-75 | $0.14-0.28 | 107× |
| Generate pytest cases | Template-heavy | $15-75 | $0.14-0.28 | 107× |
| Add a docstring | Trivial completion | $15-75 | $0.14-0.28 | 107× |
| Rename a variable | Mechanical | $15-75 | $0.14-0.28 | 107× |

Sending "add a docstring to this function" to Opus 4.7 isn't just expensive — it's insulting to Opus. Any 7B model finetuned on code can do this perfectly. We pay 100× too much for nothing.

The phase-aware routing thesis: detect what phase the agent is in, route accordingly.

Phase detection in <8ms (without an LLM call)

Phase detection has to be fast — sub-10ms — because it runs on every routing decision. Calling another LLM to classify is too expensive (latency + cost). Our detector is a 3-layer cascade:

Layer 1: Agent fingerprinting (<1ms)

The system prompt and tool shape uniquely identify the calling agent in 80%+ of cases:

Claude Code:    system contains "I'm Claude" + specific tool names
                (View, Read, Edit, Bash, etc.)
Cursor:         system mentions "cursor" or "Cursor IDE" + tools 
                like "edit_file", "search_files"
Aider:          system contains "/architect" or "/edit" prefix
Cline:          system contains "Cline" + tool format hints
OpenClaw:       system signature, plus tools include 'task_complete'

Knowing the agent narrows the phase space dramatically. Claude Code in plan mode has a specific tag (<plan_mode>); Aider's /architect command has a specific prefix. Each agent has its own conventions for signaling intent.
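To make Layer 1 concrete, here is a minimal sketch of what the fingerprinting could look like in TypeScript. The string signatures mirror the list above; the exact matchers, tool names, and fallthrough behavior are simplified assumptions, not our production code.

type Agent = "claude_code" | "cursor" | "aider" | "cline" | "openclaw" | "unknown";

interface FingerprintInput {
  system: string;       // concatenated system prompt text
  toolNames: string[];  // tool names declared on the request
}

// Cheap string signatures first, then tool shape. Real signatures drift between
// agent versions, so production matchers are broader than this sketch.
function fingerprintAgent({ system, toolNames }: FingerprintInput): Agent {
  const tools = new Set(toolNames);
  if (system.includes("I'm Claude") && tools.has("Edit") && tools.has("Bash")) return "claude_code";
  if (/cursor/i.test(system) && (tools.has("edit_file") || tools.has("search_files"))) return "cursor";
  if (system.includes("/architect") || system.includes("/edit")) return "aider";
  if (system.includes("Cline")) return "cline";
  if (tools.has("task_complete")) return "openclaw";
  return "unknown";  // fall through to Layer 2
}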

Layer 2: Message + tool-history pattern matching (<3ms)

For requests not pinned to a specific agent mode, regex + keyword matching on the most recent user message + last 5 tool-call results:

const PHASE_SIGNALS = {
  plan:        /how should|architect|design|approach|plan/i,
  debug:       /error|exception|stack trace|why is|failing/i,
  test:        /pytest|jest|test case|unit test|spec/i,
  refactor:    /refactor|simplify|extract|rename|clean up/i,
  document:    /docstring|comment|readme|explain/i,
  small_edit:  /^(rename|change|update) (this|the)/i,
};

Combined with simple heuristics (message length, code-fence presence, tool-call recency), this gets us to ~85% phase confidence in <3ms.
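As a rough illustration of how the regex hits and the heuristics combine, here is a sketch of a Layer 2 scorer. The base score and the nudges are invented for illustration (our real weights came out of the calibration month), and we're assuming implement acts as the default phase when no signal fires.

type Phase = "plan" | "implement" | "debug" | "test" | "refactor" | "document" | "small_edit";

interface PhaseGuess { phase: Phase; confidence: number; }

// signals is the PHASE_SIGNALS map shown above.
function classifyLayer2(
  userMessage: string,
  recentToolCalls: string[],
  signals: Record<string, RegExp>,
): PhaseGuess {
  let best: PhaseGuess = { phase: "implement", confidence: 0.5 };  // default when nothing matches
  for (const [phase, pattern] of Object.entries(signals)) {
    if (!pattern.test(userMessage)) continue;
    let confidence = 0.7;                                   // base score for a keyword hit
    if (userMessage.length < 200) confidence += 0.1;        // short prompts are usually literal asks
    if (userMessage.includes("```")) confidence += 0.05;    // a code fence means a concrete target
    if (phase === "test" && recentToolCalls.some(t => /pytest|jest/i.test(t))) confidence += 0.1;
    if (confidence > best.confidence) best = { phase: phase as Phase, confidence: Math.min(confidence, 0.99) };
  }
  return best;
}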

Layer 3: LLM fallback classifier (~800ms, async)

For the remaining 15% where layers 1+2 are uncertain, a tiny model (Claude Haiku 4.5 or DeepSeek V4-Flash, depending on availability) does a focused classification call. We cache the result for 10 minutes keyed by user prompt hash, so the second occurrence of "what phase is this prompt" is instant.

Layer 3 only fires for non-streaming requests because the ~800ms latency would show up as added time-to-first-token in streaming. For streaming requests, we accept slightly less accurate routing in exchange for instant TTFT.
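The call pattern for Layer 3 is mostly about the cache. A sketch, with the actual classification call stubbed out (the prompt-hash key and 10-minute TTL come from the description above; everything else is illustrative):

import { createHash } from "node:crypto";

type Phase = "plan" | "implement" | "debug" | "test" | "refactor" | "document" | "small_edit";

const TTL_MS = 10 * 60 * 1000;  // 10-minute cache
const phaseCache = new Map<string, { phase: Phase; expiresAt: number }>();

// Stub: in production this is a small classification prompt sent to Haiku 4.5
// or V4-Flash, returning a single phase label.
async function llmClassify(userMessage: string): Promise<Phase> {
  throw new Error("wire up the small-model classification call here");
}

async function classifyLayer3(userMessage: string): Promise<Phase> {
  const key = createHash("sha256").update(userMessage).digest("hex");
  const hit = phaseCache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.phase;  // second occurrence is instant

  const phase = await llmClassify(userMessage);
  phaseCache.set(key, { phase, expiresAt: Date.now() + TTL_MS });
  return phase;
}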

The full cascade resolves in <8ms p95 in production. Detection accuracy measured against manual labeling: 87% exact-phase match, 96% within-one-phase (i.e., classifying "implement" as "small_edit" counts as close-enough). That's good enough — within-one-phase errors don't materially change the routed model.

The phase × model assignment table

This is what we route to in production today (April 2026), after iterating for a month:

| Phase | Primary model | First fallback | Second fallback | Why this assignment |
|---|---|---|---|---|
| plan | claude-opus-4.7 | deepseek-v4-pro | gpt-5.5 | Plan mode rewards reasoning depth. Opus 4.7's SWE-Bench Pro 64.3% is the empirical leader; V4-Pro at 80.6% Verified is the cheapest near-Opus alternative |
| implement | deepseek-v4-pro | claude-sonnet-4.6 | kimi-k2.6 | V4-Pro promoted to #1 after independent benchmarks confirmed Opus-tier quality at 1/10 the cost. Sonnet 4.6 fallback for tool-call reliability |
| debug | claude-opus-4.7 | gpt-5.5 | deepseek-v4-pro | Opus's edge-case sensitivity is real. GPT-5.5 wins Terminal-Bench 2.0 specifically for agent-loop debugging |
| test | deepseek-v4-flash | deepseek-v4-pro | kimi-k2.6 | Test generation is template-heavy. V4-Flash at $0.14/M is the bargain. Pro for harder integration tests |
| refactor | deepseek-v4-pro | kimi-k2.6 | claude-sonnet-4.6 | Refactoring needs context window (V4-Pro: 1M, K2.6: 256K) and structural reasoning. K2.6's agent-swarm training shows up here |
| document | claude-haiku-4.5 | gemini-3-flash | gpt-5-mini | Documentation is closest to "predict the next obvious sentence". Tiniest models suffice |
| small_edit | gpt-5-mini | gemini-3-flash | claude-haiku-4.5 | Renames, single-line changes. Latency matters more than capability |

What's striking is how different the assignments are by phase. There is no "best model" — there's "best model for this phase, given current pricing and our calibration data."
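In config terms the table is just a phase-keyed priority list of model IDs. A sketch of that shape (model IDs copied from the table above; the real config schema is simplified here):

type Phase = "plan" | "implement" | "debug" | "test" | "refactor" | "document" | "small_edit";

// Primary first, then fallbacks, tried in order when a provider errors or rate-limits.
const PHASE_ROUTES: Record<Phase, string[]> = {
  plan:       ["claude-opus-4.7", "deepseek-v4-pro", "gpt-5.5"],
  implement:  ["deepseek-v4-pro", "claude-sonnet-4.6", "kimi-k2.6"],
  debug:      ["claude-opus-4.7", "gpt-5.5", "deepseek-v4-pro"],
  test:       ["deepseek-v4-flash", "deepseek-v4-pro", "kimi-k2.6"],
  refactor:   ["deepseek-v4-pro", "kimi-k2.6", "claude-sonnet-4.6"],
  document:   ["claude-haiku-4.5", "gemini-3-flash", "gpt-5-mini"],
  small_edit: ["gpt-5-mini", "gemini-3-flash", "claude-haiku-4.5"],
};

Keeping the assignment as data rather than logic means a pricing change or a new model release is a one-line config edit, not a router rewrite.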

30-day production results

Aggregated across 287M tokens served in the last billing period:

Total requests:                    5,957
Total tokens:                      287.6M  (281.2M input / 6.4M output)
Actual cost:                       $1,068
What it would have cost on Opus:   $4,992
Savings:                           $3,924  (78.6%)
Cache hit rate:                    30.8%   (rising; 70%+ on Anthropic routes after the affinity fix below)
Fallback engagements:              1       (0.02% of requests)

The 78.6% reduction comes from routing intelligence, not caching. Cache savings are a separate $10 (small because we route most volume to V4-Flash, where input is already $0.14/M and the cache discount is small in absolute terms).

A few specific observations:

Implementation is the volume driver. ~50% of requests were classified as implement. After promoting V4-Pro to primary for this phase (April 27), our V4-Pro share of total traffic went from ~5% to a projected 25-30%. This is the single biggest cost lever in the whole router.

Plan and debug stayed expensive on purpose. Opus 4.7 at $15/$75 is non-negotiable for these phases — we tried routing plan to Sonnet 4.6 in week 2 and saw user complaints about "shallower" architectural suggestions. Reverted within 24 hours. Users notice quality differences sharply on plan/debug; on implement/test/docs the difference is basically invisible.

Document and small-edit phases are nearly free. Combined ~7% of requests, total cost <$5 in the period. These are the easiest to optimize because the models are 100-200× cheaper and the work is genuinely simpler.

Fallback fires almost never. 1 fallback in 5,957 requests means our primary picks are good. When fallback does fire, it's almost always a transient provider 503, resolved in milliseconds by retrying against the next model in the list.

The pitfalls (what we tried that didn't work)

Phase-aware routing isn't free engineering — we hit real walls. Three are worth documenting because they're not in any other writeup:

Pitfall 1: Aggressive routing fragments prompt cache

Our first naive implementation classified and routed each request independently. That meant turn 1 might go to Sonnet 4.6, turn 2 to V4-Pro, turn 3 back to Sonnet, and so on. Cache hit rate dropped to <1% because Anthropic's prompt cache is scoped per model and per prefix, so every model switch starts from a cold cache.

Fix: session stickiness. Once a session lands on a model in turn 1, subsequent turns prefer the same model unless there's a strong reason to switch (confidence ≥0.85 + phase change to a higher tier). After this fix, cache hit rate climbed to ~26% within 72 hours.
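A sketch of the stickiness rule: keep the session's model unless we're confident the session has escalated into a phase that needs a stronger tier. The ≥0.85 threshold is the one mentioned above; the tier ordering below is an illustrative assumption.

type Phase = "plan" | "implement" | "debug" | "test" | "refactor" | "document" | "small_edit";

// Illustrative capability tiers: higher means "needs a stronger model".
const PHASE_TIER: Record<Phase, number> = {
  small_edit: 0, document: 0, test: 1, implement: 1, refactor: 1, debug: 2, plan: 2,
};

interface SessionState { model: string; phase: Phase; }

function pickModel(
  session: SessionState | undefined,
  detected: { phase: Phase; confidence: number },
  preferredForPhase: string,
): string {
  if (!session) return preferredForPhase;  // turn 1: route normally
  const escalating = PHASE_TIER[detected.phase] > PHASE_TIER[session.phase];
  if (detected.confidence >= 0.85 && escalating) return preferredForPhase;
  return session.model;  // otherwise stay put and keep the prompt cache warm
}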

Pitfall 2: Round-robin multi-account further fragments cache

To handle Anthropic's per-key rate limits (60 req/min default), we added pool-based multi-account rotation across 3 Anthropic API keys. Round-robin across the 3 keys meant each key saw only a third of the traffic, so each key's cache stayed mostly cold, dividing the effective hit rate by roughly 3.

Fix: affinity routing. Hash (provider, agent_type, system_prompt[0:1024]) and consistently route the same hash to the same backend key. Same-agent traffic concentrates on one key; that key's cache stays continuously hot. Critical insight: the LLM API is stateless from the user's perspective, but the cache is per-API-key from the provider's perspective. Affinity makes them line up. Cache hit rate jumped from 26% → 70%+ on Anthropic routes after this fix.
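The affinity key itself is small. A sketch (the hash choice is incidental, any stable hash works; the key pool values and the example system prompt are placeholders):

import { createHash } from "node:crypto";

// Stable hash over the attributes that determine the cacheable prefix.
function affinityKeyIndex(provider: string, agentType: string, systemPrompt: string, poolSize: number): number {
  const digest = createHash("sha256")
    .update(`${provider}|${agentType}|${systemPrompt.slice(0, 1024)}`)
    .digest();
  // First 4 bytes as an unsigned int, mapped onto the key pool.
  return digest.readUInt32BE(0) % poolSize;
}

// All Claude Code traffic with the same system-prompt prefix lands on the same key,
// so that key's prompt cache stays continuously hot.
const keyPool = ["ANTHROPIC_KEY_A", "ANTHROPIC_KEY_B", "ANTHROPIC_KEY_C"];  // placeholders
const chosenKey = keyPool[affinityKeyIndex("anthropic", "claude_code", "You are Claude Code, ...", keyPool.length)];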

Pitfall 3: Chinese-LLM thinking modes break multi-turn

Three different Chinese providers (DeepSeek's deepseek-reasoner, Moonshot's Kimi K2.6, Z.AI's GLM-5.1) emit a reasoning_content field that must be round-tripped to subsequent calls — a requirement the standard OpenAI/Anthropic message schemas don't accommodate. Our gateway's translator strips it, so multi-turn agents get a 400 on turn 2.

This isn't a routing-algorithm problem; it's a protocol-compatibility problem that bites anyone integrating Chinese LLMs through OpenAI-compatible endpoints. We wrote it up in detail here. Short version: pass extra_body: {thinking: {type: "disabled"}} to Moonshot/Z.AI; remove *-reasoner model IDs from auto-routing pools entirely.
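For concreteness, this is roughly the shape of a raw request with thinking disabled against an OpenAI-compatible endpoint. The base URL, key, and exact field placement are placeholders and assumptions; the point is that whatever extra_body carries ends up at the top level of the JSON body, so check the provider's docs for the authoritative field names.

// Sketch: raw OpenAI-compatible chat completion with the provider's thinking mode
// disabled, so no reasoning_content has to be round-tripped on later turns.
const resp = await fetch("https://api.example-provider.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.PROVIDER_API_KEY}`,
  },
  body: JSON.stringify({
    model: "kimi-k2.6",                    // or glm-5.1, per the routing table above
    messages: [{ role: "user", content: "Rename foo to bar in utils.ts" }],
    thinking: { type: "disabled" },        // the field extra_body injects
  }),
});
console.log((await resp.json()).choices?.[0]?.message?.content);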

Anti-pattern: "but isn't this over-engineering?"

We've heard this question a few times. Honest answer:

For a solo developer running occasional Claude Code sessions: yes, possibly over-engineering. Just use Claude Pro at $20/mo and don't think about it. The savings aren't worth the gateway complexity.

For a 5+ person engineering team running coding agents 8 hours/day: absolutely worth it. The team's monthly Claude Code bill goes from $1500-3000 to $200-400. That's the difference between a budgeted line item and an "uh, can we have a meeting about AI spend" Slack message from finance.

For an AI product builder integrating coding capabilities: the routing infrastructure is the moat. Whatever model is cheapest-and-best in 6 months will be different from today; the router that absorbs that change without a code rewrite is what scales.

The "over-engineering" critique usually comes from people who haven't yet hit the bill where it stops being theoretical. Once monthly LLM spend crosses ~$500, the cost of not having a router exceeds the cost of building one.

What we'd build next (open backlog)

A few pieces we have prototyped but not shipped:

A note on transparency

Every CodeRouter response includes headers that show the routing decision:

X-CodeRouter-Model: deepseek-v4-pro
X-CodeRouter-Phase: implement
X-CodeRouter-Phase-Confidence: 0.92
X-CodeRouter-Agent: claude_code
X-CodeRouter-Cost: 0.000123
X-CodeRouter-Cost-If-Opus: 0.012874
X-CodeRouter-Classifier: L1
X-CodeRouter-Cache-Read: 1842

This is deliberate. A router that hides its decisions is asking for trust it doesn't deserve. Show the user what was picked, why, and what it cost. Users who want to audit can; users who don't can ignore.
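A minimal audit is just reading those headers off a normal call through the gateway. A sketch (gateway URL and key are placeholders; the request body is the standard Anthropic Messages format the gateway accepts):

// Sketch: one call through the gateway, then log what the router actually did.
const resp = await fetch("https://gateway.example.com/v1/messages", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "x-api-key": process.env.CODEROUTER_API_KEY ?? "",
    "anthropic-version": "2023-06-01",
  },
  body: JSON.stringify({
    model: "claude-opus-4.7",              // what the agent asked for; headers show what actually ran
    max_tokens: 256,
    messages: [{ role: "user", content: "Add a docstring to parse_config()" }],
  }),
});

for (const name of [
  "X-CodeRouter-Model",
  "X-CodeRouter-Phase",
  "X-CodeRouter-Phase-Confidence",
  "X-CodeRouter-Cost",
  "X-CodeRouter-Cost-If-Opus",
]) {
  console.log(`${name}: ${resp.headers.get(name)}`);  // header lookup is case-insensitive
}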

Summary

Phase-aware routing is one of those ideas that sounds obvious in hindsight: agents do different kinds of work; each kind has different model requirements; route accordingly. But it has real engineering depth — the detection has to be fast, the routing has to play nice with cache, the fallbacks have to be robust, and you'll discover provider-protocol surprises that aren't documented anywhere.

After 30 days running it on real Claude Code / Cursor / Aider traffic, the data is clear: 78% cost reduction with no measurable quality loss for routine coding work. The reasoning-heavy phases (plan, debug) stay on Opus where it matters; the implementation grind moves to V4-Pro / V4-Flash where it doesn't matter that the model is "less smart" overall.

If you're building a coding agent, an API gateway, or an AI product that uses LLMs for software engineering: the optimization isn't picking the best model. It's not picking the best model for the 70% of work that doesn't need it.

