How do I reduce my Cursor API bill?

Cursor Pro's $20/mo covers 500 fast requests. Past that you pay OpenAI or Anthropic per-call rates directly, which is why heavy users end up at $50-$200/mo. The fix is to point Cursor's Custom API at CodeRouter and set the model to 'auto'. CodeRouter detects which coding phase each request belongs to (planning, implementation, debugging, test generation, docs) and routes it to the cheapest capable model for that phase: Opus only for planning, DeepSeek V3 for test generation, Haiku for docstrings. Same Cursor IDE, same keyboard shortcuts, 70-90% lower monthly bill. Setup takes 2 minutes: change base_url to https://www.coderouter.io/api/v1 and paste your cr_ API key.

What is the cheapest API for Claude Code / Aider / Copilot?

There isn't a single 'cheapest API': the cheapest model depends on what the coding agent is doing. For planning and architecture, you still want Claude Opus 4.7 or Sonnet 4.6. For implementation, DeepSeek V3.2 ($0.28 input / $0.42 output per 1M tokens) and Qwen 3 Coder are 30-50x cheaper than Opus with near-equivalent code quality. For test generation, DeepSeek V3 or Haiku 4.5 is 15-50x cheaper. For docstrings and simple formatting, Haiku 4.5 ($1/$5) or Gemini 2.5 Flash ($0.30/M output) is 15-250x cheaper. CodeRouter is the gateway that picks per request automatically: point a single base_url at https://www.coderouter.io/api/v1 from Claude Code, Aider, Copilot (via LiteLLM), Cursor, Windsurf, or any OpenAI-compatible agent.

Does DeepSeek V3 work as well as Claude Sonnet for coding?

For the implementation and test generation phases, yes: DeepSeek V3.2 matches Claude Sonnet 4.6 on HumanEval, MBPP, and LiveCodeBench within 1-3 points. For multi-file refactoring and architecture planning, Sonnet still has an edge on long-context reasoning. DeepSeek V3.2 costs $0.28 input / $0.42 output per 1M tokens, vs Sonnet's $3/$15, which is roughly 11x cheaper on input and 36x cheaper on output. The right answer for most coding agents is not 'pick one forever' but 'use DeepSeek V3 for the implement/test phases and Sonnet for the plan/refactor phases.' That's phase-aware routing in practice: CodeRouter decides per request in ~10ms.
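The ratio between the two price pairs quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function names are illustrative, and the 3M-input / 1M-output workload is a hypothetical mix chosen for the example, not a measurement.

```python
# Per-1M-token prices (input, output) in USD, as quoted in the answer above.
SONNET = (3.00, 15.00)
DEEPSEEK = (0.28, 0.42)


def price_ratio(expensive, cheap):
    """Return (input_ratio, output_ratio) between two price pairs."""
    return expensive[0] / cheap[0], expensive[1] / cheap[1]


def blended_cost(prices, input_tokens, output_tokens):
    """USD cost for a given token volume at per-1M-token prices."""
    return (prices[0] * input_tokens + prices[1] * output_tokens) / 1_000_000


input_ratio, output_ratio = price_ratio(SONNET, DEEPSEEK)
print(f"input: {input_ratio:.1f}x, output: {output_ratio:.1f}x")

# Hypothetical workload: 3M input tokens and 1M output tokens.
print(f"Sonnet:   ${blended_cost(SONNET, 3_000_000, 1_000_000):.2f}")
print(f"DeepSeek: ${blended_cost(DEEPSEEK, 3_000_000, 1_000_000):.2f}")
```

The multiple is larger on output than on input, so the effective saving depends on how output-heavy the workload is.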

What is phase-aware LLM routing?

Phase-aware LLM routing classifies each coding-agent request by the phase of software work it represents (planning, implementation, debugging, testing, refactoring, or documentation) and routes it to the cheapest model that can handle that specific phase. A 'write unit tests for this function' request goes to DeepSeek V3 ($0.42/M output). A 'refactor this multi-file feature and plan the migration' request goes to Claude Opus 4.7 ($75/M output). This is different from picking one model for everything, and different from OpenRouter-style model selection (which still requires you to choose manually). CodeRouter's classifier runs in ~10ms on the server, so the agent never notices the extra hop.
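To make the classify-then-route idea concrete, here is a toy sketch. It is not CodeRouter's classifier, which this page does not describe. The keyword lists, the `classify_phase` and `route` names, and the model identifier strings are all hypothetical placeholders.

```python
# Toy keyword-based phase classifier. It only illustrates the
# classify-then-route shape, not a production classifier.
PHASE_KEYWORDS = {
    "testing": ("unit test", "write tests", "test case"),
    "planning": ("plan", "architecture", "design", "migration"),
    "debugging": ("traceback", "stack trace", "bug", "error"),
    "documentation": ("docstring", "readme", "document"),
}

# Hypothetical phase -> model table, following the examples in the text.
PHASE_MODELS = {
    "planning": "claude-opus",
    "testing": "deepseek-v3",
    "documentation": "claude-haiku",
}
DEFAULT_PHASE = "implementation"
DEFAULT_MODEL = "deepseek-v3"


def classify_phase(prompt: str) -> str:
    """Return the first phase whose keywords appear in the prompt."""
    text = prompt.lower()
    for phase, keywords in PHASE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return phase
    return DEFAULT_PHASE


def route(prompt: str) -> str:
    """Pick a model name for the prompt based on its phase."""
    return PHASE_MODELS.get(classify_phase(prompt), DEFAULT_MODEL)


print(route("Write unit tests for this function"))
print(route("Refactor this multi-file feature and plan the migration"))
```

A keyword match is the simplest possible classifier. The point of the sketch is the separation between deciding the phase and looking up a model for it, which lets either half change independently.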

CodeRouter vs OpenRouter — which saves more money on coding?

OpenRouter is a model marketplace — it gives you access to 300+ models behind one API key, but you still pick which model to send each request to. Most Cursor / Aider / Claude Code users default to the premium model (Opus, GPT-5) for everything and end up paying full price. CodeRouter is a phase-aware router — set model to 'auto' and we pick the cheapest capable model per request based on the coding phase. CodeRouter also adds things OpenRouter doesn't: coding-specific capability scores per model (implementation, debug, test, refactor), per-end-user attribution for SaaS agent builders, and built-in quota + top-up billing. For pure coding workloads, typical CodeRouter savings are 70-90% vs picking one model on OpenRouter.

Will CodeRouter break my Cursor / Aider / Claude Code agent?

No. CodeRouter exposes a standard OpenAI-compatible chat completions endpoint (POST /api/v1/chat/completions) with the same request and response format your agent already uses — including streaming, tool use, and function calling. We implement the same JSON schema and stream format, so Cursor, Aider, Claude Code, Cline, Continue.dev, Windsurf, OpenClaw, and any LiteLLM-wrapped client work unmodified. If a routed model fails, the fallback chain tries up to 2 alternates automatically (on 429, 500-504, timeouts, missing keys). You can also pin an explicit model instead of 'auto' any time.
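The fallback behavior described above can be sketched as a small retry loop. This is an illustration under assumptions, not CodeRouter's implementation: `ProviderError`, `call_with_fallback`, and `fake_send` are hypothetical names, `send` stands in for whatever performs the real request, and the missing-key case is left out for brevity.

```python
# Status codes the text lists as triggering a fallback (429 and 500-504).
RETRYABLE_STATUSES = {429, 500, 501, 502, 503, 504}
MAX_ALTERNATES = 2


class ProviderError(Exception):
    """Hypothetical error carrying the upstream HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


def call_with_fallback(send, primary: str, alternates: list[str]):
    """Try the primary model, then up to MAX_ALTERNATES alternates.

    `send(model)` is a caller-supplied function that performs the request
    and raises ProviderError or TimeoutError on failure.
    """
    last_error = None
    for model in [primary, *alternates[:MAX_ALTERNATES]]:
        try:
            return model, send(model)
        except TimeoutError as exc:
            last_error = exc
        except ProviderError as exc:
            if exc.status not in RETRYABLE_STATUSES:
                raise  # e.g. a 400 is a request problem; do not mask it
            last_error = exc
    raise last_error


def fake_send(model: str) -> str:
    """Stand-in transport: the first model is rate limited."""
    if model == "model-a":
        raise ProviderError(429)
    return f"ok from {model}"


print(call_with_fallback(fake_send, "model-a", ["model-b", "model-c"]))
```

Non-retryable errors are re-raised immediately in this sketch, since retrying a malformed request on another model would only hide the underlying problem.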

How do I set up CodeRouter with Cursor in 2 minutes?

1) Sign up free at https://www.coderouter.io/login and copy your API key (starts with cr_).
2) In Cursor, open Settings -> Models -> OpenAI API Key. Under 'Override OpenAI Base URL', paste https://www.coderouter.io/api/v1, then paste your cr_ key in the API Key field.
3) Add 'auto' to the Custom Models list and select it as your active model.

That's it: phase-aware routing is live. Aider users set the OPENAI_API_BASE and OPENAI_API_KEY env vars to the same values. Claude Code users set ANTHROPIC_BASE_URL to https://www.coderouter.io/api/v1 and ANTHROPIC_API_KEY to the cr_ key. Full guide at https://www.coderouter.io/setup.

Fix: DeepSeek 400 Error — "reasoning_content in thinking mode must be passed back"

TL;DR — DeepSeek's thinking-mode models (like DeepSeek V4 with thinking enabled) return each response with a reasoning_content field next to content. Once the model makes a tool call, the API requires that reasoning_content be sent back on every assistant message in the conversation history. Most OpenAI-compatible clients silently strip unknown fields when they rebuild history — the next request then fails with HTTP 400: The "reasoning_content" in the thinking mode must be passed back to the API. Fix it by preserving reasoning_content in replayed assistant messages, routing through a proxy that round-trips it for you, or disabling thinking mode for tool-heavy workloads.

The error

You call a DeepSeek thinking model through the OpenAI-compatible endpoint, the first turn works, the model makes a tool call — and the second request fails:

HTTP 400
{
  "error": {
    "message": "The \"reasoning_content\" in the thinking mode must be passed back to the API...",
    "type": "invalid_request_error"
  }
}

The confusing part: your code didn't change between turn one and turn two. The conversation state did.

Why it happens

In thinking mode, DeepSeek returns the model's chain-of-thought in a separate reasoning_content field, at the same level as content:

{
  "role": "assistant",
  "content": "...",
  "reasoning_content": "Let me check the file first...",
  "tool_calls": [ ... ]
}

Per DeepSeek's thinking-mode documentation, when the model performs a tool call, the intermediate assistant message's reasoning_content must participate in context concatenation — i.e., you must send it back verbatim in subsequent turns.

The problem is that reasoning_content is not part of the standard OpenAI chat schema. Most SDKs and agent frameworks rebuild message history through a converter that keeps only the fields it knows (role, content, tool_calls, tool_call_id). The reasoning field gets dropped, and DeepSeek rejects the replayed history.
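The stripping step can be reproduced without any network calls. The sketch below uses a hypothetical allow-list converter, `to_openai_message`, as a stand-in for what an SDK or agent framework might do internally; it is not taken from any specific library, and the `list_files` tool is made up for the example.

```python
# Fields a typical OpenAI-schema converter knows about (per the text above).
KNOWN_FIELDS = ("role", "content", "tool_calls", "tool_call_id")


def to_openai_message(raw: dict) -> dict:
    """Hypothetical converter: keep only fields in the standard schema."""
    return {key: raw[key] for key in KNOWN_FIELDS if key in raw}


raw_assistant = {
    "role": "assistant",
    "content": "",
    "reasoning_content": "Let me check the file first...",
    "tool_calls": [{"id": "call_1", "type": "function",
                    "function": {"name": "list_files", "arguments": "{}"}}],
}

replayed = to_openai_message(raw_assistant)
print("reasoning_content" in raw_assistant)  # True
print("reasoning_content" in replayed)       # False: silently dropped
```

Nothing in the converter raises or warns, which is why the problem only surfaces as a 400 on the following request.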

This bites real tools, not just hand-rolled scripts — the same 400 has been reported in opencode, claude-code-router, and n8n's AI agent nodes, all for the same reason: the history converter strips the field.

Fix 1: Preserve `reasoning_content` in your client

If you control the request-building code, keep the field on every assistant message you replay:

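# When appending the model's response to history, do NOT rebuild the
# message from scratch. Carry the raw fields through.
msg = response.choices[0].message
assistant_turn = {
    "role": "assistant",
    "content": msg.content,
    # Not part of the OpenAI schema, so read it defensively.
    "reasoning_content": getattr(msg, "reasoning_content", None),
}
# Attach tool_calls only when present: some OpenAI-compatible endpoints
# reject an empty tool_calls list.
if msg.tool_calls:
    assistant_turn["tool_calls"] = [tc.model_dump() for tc in msg.tool_calls]
history.append(assistant_turn)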

Two details that matter:

Every assistant message in the history needs the field once a tool call has occurred in the conversation — not just the latest one.
Pass it back unmodified. Truncating or summarizing the reasoning also triggers rejection on some model versions.

If you use the OpenAI Python SDK, the extra field survives as long as you don't round-trip messages through strict Pydantic models that drop unknown keys.

Fix 2: Let the router handle the round-trip

If you'd rather not patch every client, put a router between your agent and DeepSeek. CodeRouter stores the raw provider response per turn and re-attaches reasoning_content when your client's replayed history is missing it, so unmodified OpenAI-compatible clients (Cursor with a custom base URL, Aider, plain SDK code) work with DeepSeek thinking models out of the box:

from openai import OpenAI

client = OpenAI(
    base_url="https://www.coderouter.io/api/v1",
    api_key="<your coderouter key>",
)

This is also the practical answer for tools you can't patch (closed-source IDE plugins, hosted agents).
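As an illustration of how a router-side round-trip could work in principle, one option is to cache reasoning keyed by tool-call ID and patch replayed messages that lack it. This is a guess at the general shape, not CodeRouter's actual implementation; `ReasoningCache` and its methods are hypothetical names.

```python
class ReasoningCache:
    """Remember reasoning_content per tool-call id and restore it later."""

    def __init__(self):
        self._by_call_id: dict[str, str] = {}

    def record(self, assistant_message: dict) -> None:
        """Store the reasoning from a fresh provider response."""
        reasoning = assistant_message.get("reasoning_content")
        if reasoning is None:
            return
        for call in assistant_message.get("tool_calls") or []:
            self._by_call_id[call["id"]] = reasoning

    def restore(self, messages: list[dict]) -> list[dict]:
        """Return a copy of the history with missing reasoning re-attached."""
        patched = []
        for message in messages:
            message = dict(message)  # shallow copy; leave caller's data alone
            needs_patch = (
                message.get("role") == "assistant"
                and "reasoning_content" not in message
            )
            if needs_patch:
                for call in message.get("tool_calls") or []:
                    if call["id"] in self._by_call_id:
                        message["reasoning_content"] = self._by_call_id[call["id"]]
                        break
            patched.append(message)
        return patched


cache = ReasoningCache()
cache.record({
    "role": "assistant", "content": "",
    "reasoning_content": "Let me check the file first...",
    "tool_calls": [{"id": "call_1"}],
})

# History as a client that strips unknown fields would replay it.
stripped = [
    {"role": "user", "content": "What files changed?"},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "src/app.py"},
]
print(cache.restore(stripped)[1]["reasoning_content"])
```

This sketch only covers assistant messages that carry tool calls. Assistant messages without tool calls would need some other lookup key, which is left out here.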

Fix 3: Disable thinking mode for tool-heavy workloads

If the reasoning tokens aren't buying you quality on your workload, turn thinking off and the constraint disappears entirely — assistant messages then carry no reasoning_content and standard OpenAI clients replay history cleanly. On the DeepSeek API this is controlled per-request (see their thinking-mode docs for the current parameter shape; older releases used separate -reasoner model names).

For coding agents specifically, a common pattern is: thinking mode for the planning step, non-thinking for mechanical multi-tool execution loops. That is exactly the phase-aware split CodeRouter automates.

How to verify the fix

Send a two-turn conversation where turn one forces a tool call, then check that turn two succeeds:

# Turn 2 request body must contain BOTH fields on the assistant message:
"messages": [
  {"role": "user", "content": "What files changed?"},
  {"role": "assistant", "content": "", "reasoning_content": "...", "tool_calls": [...]},
  {"role": "tool", "tool_call_id": "call_1", "content": "src/app.py"}
]

If the request logs show the assistant message without reasoning_content, your converter is still stripping it.
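That log check can be automated before a request is sent. The helper below is a sketch over plain message dicts; `missing_reasoning_indexes` is a hypothetical name, and the rule it encodes (every assistant message needs the field once any tool call appears in the history) follows the description in this article rather than DeepSeek's actual validation logic.

```python
def missing_reasoning_indexes(messages: list[dict]) -> list[int]:
    """Return indexes of assistant messages lacking reasoning_content.

    Returns an empty list when no assistant message has made a tool call,
    since the round-trip requirement described above does not apply yet.
    """
    has_tool_call = any(
        m.get("role") == "assistant" and m.get("tool_calls") for m in messages
    )
    if not has_tool_call:
        return []
    return [
        index
        for index, m in enumerate(messages)
        if m.get("role") == "assistant" and m.get("reasoning_content") is None
    ]


good = [
    {"role": "user", "content": "What files changed?"},
    {"role": "assistant", "content": "", "reasoning_content": "...",
     "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "src/app.py"},
]
# Simulate a converter that strips the field from every message.
bad = [{k: v for k, v in m.items() if k != "reasoning_content"} for m in good]

print(missing_reasoning_indexes(good))  # []
print(missing_reasoning_indexes(bad))   # [1]
```

Running a check like this on the outgoing payload points at the exact message the converter damaged, instead of leaving you to infer it from the 400 response.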

FAQ

Does this affect non-thinking DeepSeek models?

No. Only thinking-mode responses carry reasoning_content, and only they enforce the round-trip requirement. Standard chat models replay fine with the plain OpenAI schema.

Why does it only fail after tool calls?

Without tool calls there's usually no multi-turn assistant history to replay inside one logical exchange. The requirement is documented specifically for the tool-call case: the intermediate assistant turn (the one that decided to call the tool) must keep its reasoning attached when you send the tool result back.

Do OpenRouter or other gateways fix this automatically?

Not reliably — a gateway that just proxies your request forwards whatever history your client built, so if the client stripped the field, the 400 still happens. The fix has to happen where history is rebuilt: in your client (Fix 1) or in a router that reconstructs it (Fix 2).

Sources: DeepSeek Thinking Mode docs, opencode issue #24104, claude-code-router issue #1378.

