
The reasoning_content Trap: Why DeepSeek, Kimi, and GLM Break Your Multi-Turn Agent (and How to Fix It)

2026-04-27·9 min read·CodeRouter Team

TL;DR — Three of the most popular Chinese LLMs (DeepSeek's deepseek-reasoner, Moonshot's Kimi K2.6, and Z.AI's GLM-5.1) have thinking mode enabled by default at the API level. This mode emits a non-standard reasoning_content field that the model demands back on every subsequent turn. OpenAI and Anthropic specs have no such field; clients like Claude Code / Cursor / Aider don't know it exists; your multi-turn agent will deterministically 400 on turn 2. Fix is a one-line extra_body parameter per provider, but you have to know it exists. We hit this in production — twice — building CodeRouter. This post saves you the pain.

The error you'll Google for

Here's what landed in our production error logs from a real Claude Code session:

DeepSeek API error 400: {
  "error": {
    "message": "The `reasoning_content` in the thinking mode 
                must be passed back to the API.",
    "type": "invalid_request_error"
  }
}

And another, from Moonshot:

Moonshot API error 400: {
  "error": {
    "message": "thinking is enabled but reasoning_content is missing 
                in assistant tool call message at index 63",
    "type": "invalid_request_error"
  }
}

And the worst part: the first turn always works. The error only fires on turn 2+. So you ship the integration after a smoke test, watch one round-trip succeed, declare victory, and find out three days later when an actual user has a real multi-turn conversation.

We hit this twice in production. This is the article we wish someone had written.

The two API conventions

Western LLM APIs (OpenAI, Anthropic, Google) are stateless. Every request includes the full conversation history. The server holds nothing between calls. When the model's "thinking" or "reasoning" output is enabled (Claude's extended thinking, OpenAI's o1 series), the trace appears in the response but the client does not need to round-trip it — the next request just sends user/assistant messages with content and optionally tool_calls. That's it.

Chinese LLM APIs that expose thinking mode (DeepSeek's reasoner family, Moonshot's K2 series, Z.AI's GLM-5+) work differently. The thinking output appears in a top-level reasoning_content field on the assistant turn, and the client is required to echo this field back in subsequent requests. If you don't, you get the 400 above.

Why? It's a side effect of how these models were trained. Their reasoning trace is part of the model's working memory across turns. Sending it back tells the model "here's what you were thinking last time — continue from there." Strip it out, and the model literally can't reconcile its own state.

This makes some technical sense. It also breaks every client built against the OpenAI or Anthropic spec because those specs have no reasoning_content field to round-trip.
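The round-trip contract is easiest to see in code. The sketch below is illustrative (the types are not from any official SDK): a thinking-mode assistant turn carries `reasoning_content` alongside `content`, and the provider requires that field back verbatim on the next request.

```typescript
// Sketch of the round-trip contract (types are illustrative, not an official SDK).
type ChatMessage = {
  role: "user" | "assistant" | "tool";
  content: string | null;
  reasoning_content?: string; // non-standard field emitted by thinking mode
};

// Append an assistant response to history WITHOUT stripping reasoning_content,
// so the next request satisfies the provider's echo requirement.
function appendAssistantTurn(history: ChatMessage[], response: ChatMessage): ChatMessage[] {
  return [...history, { ...response }]; // reasoning_content survives the spread
}

const history: ChatMessage[] = [{ role: "user", content: "Fix the failing test." }];
const next = appendAssistantTurn(history, {
  role: "assistant",
  content: "Patched the assertion.",
  reasoning_content: "The fixture returns 2, so the expected value must be 2...",
});
// next[1] still carries reasoning_content for the turn-2 request
```

Most OpenAI-spec clients do the opposite: they rebuild assistant turns from `content` and `tool_calls` only, silently dropping the field — which is exactly the turn-2 400.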

The three traps we hit

Trap 1: deepseek-reasoner (DeepSeek)

This is the one with the cleanest semantics: it's a separate model ID specifically marked as the thinking variant of DeepSeek V4 Flash. The user (or routing system) has to pick deepseek-reasoner to get into trouble. If you stick with deepseek-chat, you're safe.

We added it to our routing's debug phase fallback list because thinking mode genuinely helps reasoning-heavy debugging. Then we watched users hit the multi-turn 400 within hours. Removed it from auto-routing entirely. Fix: don't put *-reasoner / *-thinking model IDs in auto-routing pools unless you've implemented round-trip support.
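That fix is cheap to enforce mechanically. A minimal sketch (the pattern and pool contents are illustrative, not our actual registry):

```typescript
// Hypothetical guard: keep thinking-variant model IDs out of auto-routing pools.
const THINKING_ID_PATTERN = /-(reasoner|thinking)$/;

function safeForAutoRouting(modelId: string): boolean {
  return !THINKING_ID_PATTERN.test(modelId);
}

const pool = ["deepseek-chat", "deepseek-reasoner", "kimi-k2.6"];
const autoRoutable = pool.filter(safeForAutoRouting);
// "deepseek-reasoner" is excluded; it stays in the registry for explicit selection only
```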

Trap 2: Kimi K2.6 (Moonshot)

This is sneakier. K2.6 is a single model ID — kimi-k2.6 — but Moonshot has thinking mode enabled by default for it. You don't pick a thinking variant; you just use K2.6 normally and discover it's emitting reasoning_content on every turn.

Fix per Moonshot docs: pass extra_body: {thinking: {type: "disabled"}} on every request. K2.6's published 58.6% SWE-Bench Pro score is non-thinking-mode anyway, so disabling it doesn't lose capability for typical coding workflows.
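At the raw HTTP level, "extra body" just means extra keys merged into the /chat/completions JSON payload. A minimal sketch of the outbound body (endpoint and parameter per Moonshot's docs; the message content is illustrative):

```typescript
// The disable switch rides inside the normal request body — no special SDK needed.
const body = JSON.stringify({
  model: "kimi-k2.6",
  messages: [{ role: "user", content: "Summarize this diff." }],
  thinking: { type: "disabled" }, // Moonshot's documented disable parameter
});

// Then POST it as usual:
// await fetch("https://api.moonshot.ai/v1/chat/completions", {
//   method: "POST",
//   headers: {
//     "Content-Type": "application/json",
//     Authorization: `Bearer ${process.env.MOONSHOT_API_KEY}`,
//   },
//   body,
// });
```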

Trap 3: GLM-5.1 (Z.AI / Zhipu)

Same pattern as K2.6. Single model ID glm-5.1, thinking mode on by default per Z.AI docs. We caught this proactively after K2.6 because we audited every Chinese provider in our routing — but if we hadn't, GLM-5.1 traffic would have hit the same 400 once volume grew.

Fix is identical syntax to Moonshot: extra_body: {thinking: {type: "disabled"}}. The vLLM / open-source GLM uses a different syntax (chat_template_kwargs.enable_thinking: false), but for Z.AI's hosted API the Moonshot-style thinking: {type: "disabled"} works.

The fix (TypeScript / OpenAI-compatible adapter)

If you're building an API gateway with a generic OpenAI-compatible adapter, you need provider-specific extra body. Here's the pattern (extracted from our production code):

// Per-provider config with extraBody override
const CHINESE_PROVIDERS = {
  deepseek: {
    baseUrl: "https://api.deepseek.com/v1",
    // No extraBody — deepseek-chat is non-thinking by default
  },
  moonshot: {
    baseUrl: "https://api.moonshot.ai/v1",
    extraBody: { thinking: { type: "disabled" } },
  },
  zhipu: {
    baseUrl: "https://open.bigmodel.cn/api/paas/v4",
    extraBody: { thinking: { type: "disabled" } },
  },
  qwen: {
    baseUrl: "https://dashscope.aliyuncs.com/compatible-mode/v1",
    // Hybrid mode, default disabled — no action
  },
};

// Adapter spreads extraBody into every outbound request
class OpenAICompatibleAdapter {
  constructor(
    private apiKey: string,
    private baseUrl: string,
    private extraBody: Record<string, unknown> = {},
  ) {}

  async chat(request: { messages: unknown[] }, modelId: string): Promise<Response> {
    return fetch(`${this.baseUrl}/chat/completions`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        model: modelId,
        messages: request.messages,
        // ... other standard params ...
        ...this.extraBody,  // ← provider-specific override
      }),
    });
  }
}

The key insight: don't try to be clever about parsing/handling reasoning_content. Just disable thinking mode at the API boundary for all auto-routed traffic. If a power user explicitly wants thinking (e.g., via a "Direct" or pass-through mode), they can opt in via headers and own the round-trip themselves.
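To make the config-to-request flow concrete, here's a self-contained sketch of how a per-provider map feeds the outbound body (names mirror the snippet above but are illustrative):

```typescript
// Per-provider config: only providers with thinking-on-by-default need extraBody.
const providers: Record<string, { baseUrl: string; extraBody?: Record<string, unknown> }> = {
  deepseek: { baseUrl: "https://api.deepseek.com/v1" },
  moonshot: {
    baseUrl: "https://api.moonshot.ai/v1",
    extraBody: { thinking: { type: "disabled" } },
  },
};

// Merge the provider's extraBody into every outbound request body.
function buildBody(provider: string, model: string, messages: unknown[]) {
  const cfg = providers[provider];
  return { model, messages, ...(cfg.extraBody ?? {}) };
}

const moonshotBody = buildBody("moonshot", "kimi-k2.6", [{ role: "user", content: "hi" }]);
const deepseekBody = buildBody("deepseek", "deepseek-chat", [{ role: "user", content: "hi" }]);
// moonshotBody carries thinking: { type: "disabled" }; deepseekBody carries nothing extra
```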

Cross-provider audit checklist

For anyone routing to Chinese LLMs in 2026, here's the audit we now run on every new provider:

  1. Does any model on this provider default to thinking mode at the API level?

    • Check official docs for thinking, reasoning_mode, enable_thinking parameters.
    • If any default to on → add extraBody to disable.
  2. Does the provider have a separate thinking-variant model ID (e.g., *-reasoner, *-thinking)?

    • Yes → don't put it in auto-routing pools without round-trip support.
    • Yes → keep it in the registry only for explicit user selection.
  3. What's the disable parameter syntax for this provider?

    • Moonshot, Z.AI: {thinking: {type: "disabled"}}
    • vLLM-hosted: {chat_template_kwargs: {enable_thinking: false}}
    • Qwen / DashScope: {enable_thinking: false}
    • Inconsistent across providers — read each one's docs.
  4. Are you stripping client-supplied thinking parameters in your translator?

    • Anthropic's thinking: {type: "enabled", budget_tokens: N} shouldn't forward to a Chinese provider.
    • Most translators do strip it as a side effect of not having that field in their internal request shape, but verify.
  5. Test with multi-turn conversations, not single requests.

    • Single-turn tests will pass. Always.
    • Reproduce the bug with at least 3 turns including a tool call.
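Point 5 is the one that would have caught both of our production incidents. A sketch of such a regression test — `callProvider` is a stand-in for your adapter, and the names are illustrative:

```typescript
// Replay a multi-turn conversation (with a tool call) and fail on any non-2xx.
type FakeResponse = { status: number };

function assertMultiTurnWorks(callProvider: (messages: object[]) => FakeResponse): void {
  const messages: object[] = [
    { role: "user", content: "List files in src/" },
    // Turn 1 response: a tool call (shape per the OpenAI spec)
    {
      role: "assistant",
      content: null,
      tool_calls: [
        { id: "call_1", type: "function", function: { name: "ls", arguments: '{"path":"src/"}' } },
      ],
    },
    { role: "tool", tool_call_id: "call_1", content: "index.ts" },
    { role: "user", content: "Now open index.ts" },
  ];
  // Turn 2 is where thinking-mode providers 400 if reasoning_content is missing
  const res = callProvider(messages);
  if (res.status !== 200) throw new Error(`multi-turn regression: got ${res.status}`);
}
```

Run it against the live provider in CI, not against a mock — the whole point is that mocks don't enforce the echo requirement.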

State of the union (April 2026)

| Provider | Default thinking? | Action needed |
|---|---|---|
| Anthropic | Off (extended thinking is opt-in) | None |
| OpenAI | Off (o1/o3-mini reasoning is per-model-ID) | None |
| Google Gemini | Off (thinking is implicit, no client param) | None |
| DeepSeek chat (V4-Flash) | Off | None |
| DeepSeek reasoner (V4-Flash thinking) | On (it's the thinking variant) | Don't auto-route |
| Moonshot Kimi K2.6 | On by default | extraBody: {thinking: {type: "disabled"}} |
| Z.AI GLM-5.1 | On by default | Same as Moonshot |
| Alibaba Qwen | Off (hybrid mode, opt-in) | None |
| Doubao (ByteDance) | Per-model-ID thinking variant | Don't auto-route the *-thinking variants |

This will keep changing. Every new Chinese model release seems to ship with thinking mode by default — it's becoming the marketing differentiator. Watch for new entrants.

Why this matters more in 2026

Chinese LLMs are no longer "the cheap option." DeepSeek V4-Pro scores 80.6% on SWE-Bench Verified — within 0.2 percentage points of Claude Opus 4.6's 80.8% — at less than 10% of the price. Kimi K2.6 ties GPT-5.5 on SWE-Bench Pro for ~$0.60/M input. GLM-5.1 (open-weight under MIT) self-reports a leading SWE-Bench Pro score.

Real adoption is following: these models are increasingly wired into clients like Claude Code, Cursor, and Aider through OpenAI- and Anthropic-compatible endpoints.

Every one of these integration paths hits the reasoning_content trap unless explicitly handled. Most don't handle it. You will see this error in production.

What we did differently the second time

The first time we hit this (DeepSeek reasoner), we found out from a user complaint after silent failures had been happening for hours. Embarrassing.

The second time (Moonshot K2.6), we found out from a different user — but at least we recognized the pattern and shipped a fix in 30 minutes.

The third time (Z.AI GLM-5.1) — we never let it become a user complaint because we proactively audited every Chinese provider after fix #2 and caught it before the bug compounded.

That audit is now embedded in our team's checklist for adding any new Chinese LLM provider. We've shared the checklist above. Please use it.

Summary

If you're integrating Chinese LLMs into a multi-turn agent in 2026:

  1. Always disable thinking mode by default in auto-routing paths. The capability is rarely worth the protocol complexity.
  2. Per-provider extra body parameters, not a single magic flag — three providers, three syntaxes, no standard.
  3. Test with multi-turn conversations including tool calls, never just single-turn smoke tests.
  4. Audit every new Chinese provider you add with the checklist above. The bug class is universal; only the syntax varies.
  5. If you want thinking mode, build a stateful side-channel that captures reasoning_content from each response and re-injects it on the next request. Or use the model only in single-turn workflows.
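If you do go the point-5 route, the side-channel reduces to "capture on response, re-inject on request." A minimal sketch, with illustrative names (your adapter would key the cache however your conversation IDs work):

```typescript
// Hypothetical side-channel: cache reasoning_content per conversation and
// re-attach it to assistant messages, in order, before each outbound request.
class ReasoningCache {
  private traces = new Map<string, string[]>(); // conversationId -> per-turn traces

  capture(conversationId: string, reasoning: string | undefined): void {
    if (reasoning === undefined) return;
    const list = this.traces.get(conversationId) ?? [];
    list.push(reasoning);
    this.traces.set(conversationId, list);
  }

  // Walk the history and restore the i-th trace onto the i-th assistant turn.
  inject(conversationId: string, messages: any[]): any[] {
    const list = this.traces.get(conversationId) ?? [];
    let i = 0;
    return messages.map((m) =>
      m.role === "assistant" && i < list.length
        ? { ...m, reasoning_content: list[i++] }
        : m,
    );
  }
}
```

Note this is stateful: lose the cache (restart, failover) and turn N+1 fails anyway, which is one more reason we prefer disabling thinking at the boundary.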

We deliberately defer the side-channel work at CodeRouter — for our coding-agent use case, disabling thinking gives 95% of the value with 5% of the complexity. Your tradeoff may differ.

