
The reasoning_content Trap: Why DeepSeek, Kimi, and GLM Break Your Multi-Turn Agent (and How to Fix It)

2026-04-27·9 min read·CodeRouter Team

TL;DR — Three of the most popular Chinese LLMs (DeepSeek's deepseek-reasoner, Moonshot's Kimi K2.6, and Z.AI's GLM-5.1) have thinking mode enabled by default at the API level. This mode emits a non-standard reasoning_content field that the model demands back on every subsequent turn. OpenAI and Anthropic specs have no such field; clients like Claude Code / Cursor / Aider don't know it exists; your multi-turn agent will deterministically 400 on turn 2. Fix is a one-line extra_body parameter per provider, but you have to know it exists. We hit this in production — twice — building CodeRouter. This post saves you the pain.

The error you'll Google for

Here's what landed in our production error logs from a real Claude Code session:

DeepSeek API error 400: {
  "error": {
    "message": "The `reasoning_content` in the thinking mode 
                must be passed back to the API.",
    "type": "invalid_request_error"
  }
}

And another, from Moonshot:

Moonshot API error 400: {
  "error": {
    "message": "thinking is enabled but reasoning_content is missing 
                in assistant tool call message at index 63",
    "type": "invalid_request_error"
  }
}

And the worst part: the first turn always works. The error only fires on turn 2+. So you ship the integration after a smoke test, watch one round-trip succeed, declare victory, and find out three days later when an actual user has a real multi-turn conversation.

We hit this twice in production. This is the article we wish someone had written.

The two API conventions

Western LLM APIs (OpenAI, Anthropic, Google) are stateless. Every request includes the full conversation history. The server holds nothing between calls. When the model's "thinking" or "reasoning" output is enabled (Claude's extended thinking, OpenAI's o1 series), the trace appears in the response but the client does not need to round-trip it — the next request just sends user/assistant messages with content and optionally tool_calls. That's it.

Chinese LLM APIs that expose thinking mode (DeepSeek's reasoner family, Moonshot's K2 series, Z.AI's GLM-5+) work differently. The thinking output appears in a top-level reasoning_content field on the assistant turn, and the client is required to echo this field back in subsequent requests. If you don't, you get the 400 above.

Why? It's a side effect of how these models were trained. Their reasoning trace is part of the model's working memory across turns. Sending it back tells the model "here's what you were thinking last time — continue from there." Strip it out, and the model literally can't reconcile its own state.

This makes some technical sense. It also breaks every client built against the OpenAI or Anthropic spec because those specs have no reasoning_content field to round-trip.
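The round-trip contract is easiest to see in code. The sketch below is illustrative (the types are not from any official SDK): a thinking-mode assistant turn carries `reasoning_content` alongside `content`, and the provider requires that field back verbatim on the next request.

```typescript
// Sketch of the round-trip contract (types are illustrative, not an official SDK).
type ChatMessage = {
  role: "user" | "assistant" | "tool";
  content: string | null;
  reasoning_content?: string; // non-standard field emitted by thinking mode
};

// Append an assistant response to history WITHOUT stripping reasoning_content,
// so the next request satisfies the provider's echo requirement.
function appendAssistantTurn(history: ChatMessage[], response: ChatMessage): ChatMessage[] {
  return [...history, { ...response }]; // reasoning_content survives the spread
}

const history: ChatMessage[] = [{ role: "user", content: "Fix the failing test." }];
const next = appendAssistantTurn(history, {
  role: "assistant",
  content: "Patched the assertion.",
  reasoning_content: "The fixture returns 2, so the expected value must be 2...",
});
// next[1] still carries reasoning_content for the turn-2 request
```

Most OpenAI-spec clients do the opposite: they rebuild assistant turns from `content` and `tool_calls` only, silently dropping the field — which is exactly the turn-2 400.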

The three traps we hit

Trap 1: deepseek-reasoner (DeepSeek)

This is the one with the cleanest semantics: it's a separate model ID specifically marked as the thinking variant of DeepSeek V4 Flash. The user (or routing system) has to pick deepseek-reasoner to get into trouble. If you stick with deepseek-chat, you're safe.

We added it to our routing's debug phase fallback list because thinking mode genuinely helps reasoning-heavy debugging. Then we watched users hit the multi-turn 400 within hours. Removed it from auto-routing entirely. Fix: don't put *-reasoner / *-thinking model IDs in auto-routing pools unless you've implemented round-trip support.
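That fix is cheap to enforce mechanically. A minimal sketch (the pattern and pool contents are illustrative, not our actual registry):

```typescript
// Hypothetical guard: keep thinking-variant model IDs out of auto-routing pools.
const THINKING_ID_PATTERN = /-(reasoner|thinking)$/;

function safeForAutoRouting(modelId: string): boolean {
  return !THINKING_ID_PATTERN.test(modelId);
}

const pool = ["deepseek-chat", "deepseek-reasoner", "kimi-k2.6"];
const autoRoutable = pool.filter(safeForAutoRouting);
// "deepseek-reasoner" is excluded; it stays in the registry for explicit selection only
```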

Trap 2: Kimi K2.6 (Moonshot)

This is sneakier. K2.6 is a single model ID — kimi-k2.6 — but Moonshot has thinking mode enabled by default for it. You don't pick a thinking variant; you just use K2.6 normally and discover it's emitting reasoning_content on every turn.

Fix per Moonshot docs: pass extra_body: {thinking: {type: "disabled"}} on every request. K2.6's published 58.6% SWE-Bench Pro score is non-thinking-mode anyway, so disabling it doesn't lose capability for typical coding workflows.
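At the raw HTTP level, "extra body" just means extra keys merged into the /chat/completions JSON payload. A minimal sketch of the outbound body (endpoint and parameter per Moonshot's docs; the message content is illustrative):

```typescript
// The disable switch rides inside the normal request body — no special SDK needed.
const body = JSON.stringify({
  model: "kimi-k2.6",
  messages: [{ role: "user", content: "Summarize this diff." }],
  thinking: { type: "disabled" }, // Moonshot's documented disable parameter
});

// Then POST it as usual:
// await fetch("https://api.moonshot.ai/v1/chat/completions", {
//   method: "POST",
//   headers: {
//     "Content-Type": "application/json",
//     Authorization: `Bearer ${process.env.MOONSHOT_API_KEY}`,
//   },
//   body,
// });
```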

Trap 3: GLM-5.1 (Z.AI / Zhipu)

Same pattern as K2.6. Single model ID glm-5.1, thinking mode on by default per Z.AI docs. We caught this proactively after K2.6 because we audited every Chinese provider in our routing — but if we hadn't, GLM-5.1 traffic would have hit the same 400 once volume grew.

Fix is identical syntax to Moonshot: extra_body: {thinking: {type: "disabled"}}. The vLLM / open-source GLM uses a different syntax (chat_template_kwargs.enable_thinking: false), but for Z.AI's hosted API the Moonshot-style thinking: {type: "disabled"} works.

The fix (TypeScript / OpenAI-compatible adapter)

If you're building an API gateway with a generic OpenAI-compatible adapter, you need provider-specific extra body. Here's the pattern (extracted from our production code):

// Per-provider config with extraBody override
const CHINESE_PROVIDERS = {
  deepseek: {
    baseUrl: "https://api.deepseek.com/v1",
    // No extraBody — deepseek-chat is non-thinking by default
  },
  moonshot: {
    baseUrl: "https://api.moonshot.ai/v1",
    extraBody: { thinking: { type: "disabled" } },
  },
  zhipu: {
    baseUrl: "https://open.bigmodel.cn/api/paas/v4",
    extraBody: { thinking: { type: "disabled" } },
  },
  qwen: {
    baseUrl: "https://dashscope.aliyuncs.com/compatible-mode/v1",
    // Hybrid mode, default disabled — no action
  },
};

// Adapter spreads extraBody into every outbound request
class OpenAICompatibleAdapter {
  constructor(
    private apiKey: string,
    private baseUrl: string,
    private extraBody: Record<string, unknown> = {},
  ) {}

  async chat(request: { messages: unknown[] }, modelId: string): Promise<Response> {
    return fetch(`${this.baseUrl}/chat/completions`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        model: modelId,
        messages: request.messages,
        // ... other standard params ...
        ...this.extraBody,  // ← provider-specific override
      }),
    });
  }
}

The key insight: don't try to be clever about parsing/handling reasoning_content. Just disable thinking mode at the API boundary for all auto-routed traffic. If a power user explicitly wants thinking (e.g., via a "Direct" or pass-through mode), they can opt in via headers and own the round-trip themselves.
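To make the config-to-request flow concrete, here's a self-contained sketch of how a per-provider map feeds the outbound body (names mirror the snippet above but are illustrative):

```typescript
// Per-provider config: only providers with thinking-on-by-default need extraBody.
const providers: Record<string, { baseUrl: string; extraBody?: Record<string, unknown> }> = {
  deepseek: { baseUrl: "https://api.deepseek.com/v1" },
  moonshot: {
    baseUrl: "https://api.moonshot.ai/v1",
    extraBody: { thinking: { type: "disabled" } },
  },
};

// Merge the provider's extraBody into every outbound request body.
function buildBody(provider: string, model: string, messages: unknown[]) {
  const cfg = providers[provider];
  return { model, messages, ...(cfg.extraBody ?? {}) };
}

const moonshotBody = buildBody("moonshot", "kimi-k2.6", [{ role: "user", content: "hi" }]);
const deepseekBody = buildBody("deepseek", "deepseek-chat", [{ role: "user", content: "hi" }]);
// moonshotBody carries thinking: { type: "disabled" }; deepseekBody carries nothing extra
```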

Cross-provider audit checklist

For anyone routing to Chinese LLMs in 2026, here's the audit we now run on every new provider:

  1. Does any model on this provider default to thinking mode at the API level?

    • Check official docs for thinking, reasoning_mode, enable_thinking parameters.
    • If any default to on → add extraBody to disable.
  2. Does the provider have a separate thinking-variant model ID (e.g., *-reasoner, *-thinking)?

    • Yes → don't put it in auto-routing pools without round-trip support.
    • Yes → keep it in the registry only for explicit user selection.
  3. What's the disable parameter syntax for this provider?

    • Moonshot, Z.AI: {thinking: {type: "disabled"}}
    • vLLM-hosted: {chat_template_kwargs: {enable_thinking: false}}
    • Qwen / DashScope: {enable_thinking: false}
    • Inconsistent across providers — read each one's docs.
  4. Are you stripping client-supplied thinking parameters in your translator?

    • Anthropic's thinking: {type: "enabled", budget_tokens: N} shouldn't forward to a Chinese provider.
    • Most translators do strip it as a side effect of not having that field in their internal request shape, but verify.
  5. Test with multi-turn conversations, not single requests.

    • Single-turn tests will pass. Always.
    • Reproduce the bug with at least 3 turns including a tool call.
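Point 5 is the one that would have caught both of our production incidents. A sketch of such a regression test — `callProvider` is a stand-in for your adapter, and the names are illustrative:

```typescript
// Replay a multi-turn conversation (with a tool call) and fail on any non-2xx.
type FakeResponse = { status: number };

function assertMultiTurnWorks(callProvider: (messages: object[]) => FakeResponse): void {
  const messages: object[] = [
    { role: "user", content: "List files in src/" },
    // Turn 1 response: a tool call (shape per the OpenAI spec)
    {
      role: "assistant",
      content: null,
      tool_calls: [
        { id: "call_1", type: "function", function: { name: "ls", arguments: '{"path":"src/"}' } },
      ],
    },
    { role: "tool", tool_call_id: "call_1", content: "index.ts" },
    { role: "user", content: "Now open index.ts" },
  ];
  // Turn 2 is where thinking-mode providers 400 if reasoning_content is missing
  const res = callProvider(messages);
  if (res.status !== 200) throw new Error(`multi-turn regression: got ${res.status}`);
}
```

Run it against the live provider in CI, not against a mock — the whole point is that mocks don't enforce the echo requirement.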

State of the union (April 2026)

| Provider | Default thinking? | Action needed |
|---|---|---|
| Anthropic | Off (extended thinking is opt-in) | None |
| OpenAI | Off (o1/o3-mini reasoning is per-model-ID) | None |
| Google Gemini | Off (thinking is implicit, no client param) | None |
| DeepSeek chat (V4-Flash) | Off | None |
| DeepSeek reasoner (V4-Flash thinking) | On (it's the thinking variant) | Don't auto-route |
| Moonshot Kimi K2.6 | On by default | extraBody: {thinking: {type: "disabled"}} |
| Z.AI GLM-5.1 | On by default | Same as Moonshot |
| Alibaba Qwen | Off (hybrid mode, opt-in) | None |
| Doubao (ByteDance) | Per-model-ID thinking variant | Don't auto-route the *-thinking variants |

This will keep changing. Every new Chinese model release seems to ship with thinking mode by default — it's becoming the marketing differentiator. Watch for new entrants.

Why this matters more in 2026

Chinese LLMs are no longer "the cheap option." DeepSeek V4-Pro scores 80.6% on SWE-Bench Verified — within 0.2 percentage points of Claude Opus 4.6's 80.8% — at less than 10% of the price. Kimi K2.6 ties GPT-5.5 on SWE-Bench Pro for ~$0.60/M input. GLM-5.1 (open-weight under MIT) self-reports a leading SWE-Bench Pro score.

Real adoption is following: these models are increasingly wired into clients like Claude Code, Cursor, and Aider through OpenAI- and Anthropic-compatible endpoints.

Every one of these integration paths hits the reasoning_content trap unless explicitly handled. Most don't handle it. You will see this error in production.

What we did differently the second time

The first time we hit this (DeepSeek reasoner), we found out from a user complaint after silent failures had been happening for hours. Embarrassing.

The second time (Moonshot K2.6), we found out from a different user — but at least we recognized the pattern and shipped a fix in 30 minutes.

The third time (Z.AI GLM-5.1) — we never let it become a user complaint because we proactively audited every Chinese provider after fix #2 and caught it before the bug compounded.

That audit is now embedded in our team's checklist for adding any new Chinese LLM provider. We've shared the checklist above. Please use it.

Summary

If you're integrating Chinese LLMs into a multi-turn agent in 2026:

  1. Always disable thinking mode by default in auto-routing paths. The capability is rarely worth the protocol complexity.
  2. Per-provider extra body parameters, not a single magic flag — three providers, three syntaxes, no standard.
  3. Test with multi-turn conversations including tool calls, never just single-turn smoke tests.
  4. Audit every new Chinese provider you add with the checklist above. The bug class is universal; only the syntax varies.
  5. If you want thinking mode, build a stateful side-channel that captures reasoning_content from each response and re-injects it on the next request. Or use the model only in single-turn workflows.
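If you do go the point-5 route, the side-channel reduces to "capture on response, re-inject on request." A minimal sketch, with illustrative names (your adapter would key the cache however your conversation IDs work):

```typescript
// Hypothetical side-channel: cache reasoning_content per conversation and
// re-attach it to assistant messages, in order, before each outbound request.
class ReasoningCache {
  private traces = new Map<string, string[]>(); // conversationId -> per-turn traces

  capture(conversationId: string, reasoning: string | undefined): void {
    if (reasoning === undefined) return;
    const list = this.traces.get(conversationId) ?? [];
    list.push(reasoning);
    this.traces.set(conversationId, list);
  }

  // Walk the history and restore the i-th trace onto the i-th assistant turn.
  inject(conversationId: string, messages: any[]): any[] {
    const list = this.traces.get(conversationId) ?? [];
    let i = 0;
    return messages.map((m) =>
      m.role === "assistant" && i < list.length
        ? { ...m, reasoning_content: list[i++] }
        : m,
    );
  }
}
```

Note this is stateful: lose the cache (restart, failover) and turn N+1 fails anyway, which is one more reason we prefer disabling thinking at the boundary.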

We deliberately defer the side-channel work at CodeRouter — for our coding-agent use case, disabling thinking gives 95% of the value with 5% of the complexity. Your tradeoff may differ.

