TL;DR — Moonshot's Kimi K2.6 launched April 20, 2026. It scores 58.6% on SWE-Bench Pro (tied with GPT-5.5), leads Humanity's Last Exam with tools at 54.0 (beats Opus 4.6 and GPT-5.4), and costs $0.60 / $4.00 per 1M — roughly 8× cheaper than GPT-5.5 and 20× cheaper than Opus 4.7 on a typical input/output mix. It's open-weight (huggingface.co/moonshotai) and ships as a 1T-parameter MoE (32B active per token). The purpose-built angle: long-horizon agentic coding — it scales to 300 sub-agents and 4,000 coordinated steps without drifting. If you run coding agents with deep tool chains, this is the new cost/quality sweet spot.
What actually shipped
Moonshot released K2.6 on April 20, 2026 — four days after Opus 4.7 and three days before GPT-5.5. The stats that matter:
- Architecture: MoE, 1T total params / 32B active (back-of-envelope compute math below), 384 experts (8 selected + 1 shared), 61 layers, 160K vocab, 15.5T training tokens.
- Context: 256K window (vs K2.5's 128K).
- Max output: 65,536 tokens per response — larger than Claude/OpenAI flagships.
- Pricing:
- Moonshot API: $0.60 / $4.00 per 1M input/output
- Cache hit: $0.16 per 1M input (73% off)
- OpenRouter: $0.60 / $2.80
- Cloudflare Workers AI: also available
- Open-weight on Hugging Face for self-hosting
- Agentic specialization: natively trained to coordinate 300 sub-agents for 4,000 steps on long-horizon tasks.
The last point is unusual. Most model releases pitch reasoning benchmarks; Moonshot specifically targeted "coding agents that don't lose the plot after 50 steps."
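That 32B-active figure in the architecture bullet above is also the economic story: per-token compute scales with active params, not total. A back-of-envelope sketch (the ~2-FLOPs-per-active-parameter rule of thumb is a standard estimate, not a Moonshot number):

```python
# Back-of-envelope MoE compute math for K2.6's published shape.
# Assumption: forward-pass cost ~ 2 FLOPs per *active* parameter per token.
TOTAL_PARAMS = 1.0e12   # 1T total
ACTIVE_PARAMS = 32e9    # 32B active per token (8 routed + 1 shared expert)

active_share = ACTIVE_PARAMS / TOTAL_PARAMS
flops_per_token = 2 * ACTIVE_PARAMS

print(f"Active share per token:  {active_share:.1%}")      # 3.2%
print(f"Forward FLOPs per token: ~{flops_per_token:.1e}")  # ~6.4e10
# Per-token compute is that of a ~32B dense model, which is the
# economic basis for pricing a 1T-param model at $0.60/M input.
```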
Benchmark numbers
| Benchmark | Kimi K2.6 | GPT-5.5 | Claude Opus 4.7 | GPT-5.4 | DeepSeek V4-Pro |
|---|---:|---:|---:|---:|---:|
| SWE-Bench Pro | 58.6% | 58.6% | 64.3% | 57.7% | ~55% |
| HLE (Humanity's Last Exam) w/ tools | 54.0 | — | 53.0* | 52.1 | — |
| AIME 2026 | 96.4% | — | — | 99.2% | — |
| GPQA-Diamond | 90.5% | — | — | 92.8% | — |
| Input $/M | $0.60 | $5.00 | $15.00 | $2.50 | $1.74 |
| Output $/M | $4.00 | $30.00 | $75.00 | $15.00 | $3.48 |
| Context | 256K | 1M | 200K | 1.05M | 1M |
*Opus 4.6 was the benchmark reference — Opus 4.7 is incrementally higher but not dramatically so on HLE.
The headline: 58.6% on SWE-Bench Pro at $0.60/$4.00. GPT-5.5 hits the same number at $5/$30 — roughly an 8× blended-price difference for the same score on the one coding benchmark that tracks real GitHub-issue patches.
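The multiplier falls straight out of the list prices; here's the arithmetic, assuming the same 70/30 input/output split used in the switch-math table below:

```python
def blended(price_in, price_out, in_frac=0.7):
    """$ per 1M tokens at a given input/output mix."""
    return in_frac * price_in + (1 - in_frac) * price_out

k26 = blended(0.60, 4.00)     # $1.62 / 1M
gpt55 = blended(5.00, 30.00)  # $12.50 / 1M
print(f"K2.6 blended:    ${k26:.2f}/M")
print(f"GPT-5.5 blended: ${gpt55:.2f}/M")
print(f"Multiplier:      {gpt55 / k26:.1f}x")  # ~7.7x
```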
Where K2.6 wins
- Cost-per-correct-patch on SWE-Bench Pro is the lowest of any frontier model right now. If your workload is implementation-heavy (writing code to spec + fixing bugs), K2.6 delivers GPT-5.5-equivalent quality for roughly an eighth of the cost.
- Long-horizon agent loops. The "300 sub-agents / 4000 steps" design target shows up in practice as much lower context-drift than models that merely pattern-match through long tool chains. For multi-hour Claude-Code-style sessions or Aider architect-mode runs, K2.6 holds coherence better than models with similar benchmark scores.
- 256K context + huge max output. Most models cap output at 8K–16K. K2.6's 64K ceiling (65,536 tokens) matters for two workflows: generating entire test suites in one call, and multi-file refactors where the model outputs the full updated content of 8+ files.
- Open weights. If self-hosting is viable for you, Moonshot publishes weights on Hugging Face. Your cost floor becomes GPU time.
- Cache discount is aggressive: $0.16/M on cache hit = 73% off. For agentic sessions that re-send the same system prompt and tool schemas on every call, this is meaningful (quick math after this list).
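The cache math from that last bullet, sketched out (the hit rates are illustrative assumptions for agent loops, not Moonshot figures):

```python
# Effective input price at a given prompt-cache hit rate.
BASE_IN = 0.60    # $/1M input, cache miss
CACHED_IN = 0.16  # $/1M input, cache hit

for hit_rate in (0.0, 0.6, 0.8, 0.9):
    eff = hit_rate * CACHED_IN + (1 - hit_rate) * BASE_IN
    print(f"hit rate {hit_rate:.0%}: effective input ${eff:.2f}/M")
# At an 80% hit rate (plausible for loops that re-send the same
# system prompt and tool schemas) input drops to ~$0.25/M.
```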
Where K2.6 doesn't win
- Pure math / reasoning benchmarks. GPT-5.4 still leads AIME 2026 (99.2% vs 96.4%) and GPQA-Diamond (92.8% vs 90.5%). If you're building a model-powered math tutor or doing formal-methods work, K2.6 isn't the first pick.
- SWE-Bench Pro real-world edits. Opus 4.7 still tops the chart at 64.3%. On sensitive codebase edits where "don't break 40 callers" matters, Opus has a ~6-point edge.
- Tool-call reliability. Our production routing sees Chinese-provider models (DeepSeek, Kimi, Qwen, GLM) with slightly higher tool-schema retry rates than Anthropic/OpenAI. The gap is narrowing — K2.6 is visibly better than K2.5 — but for apps that absolutely require structured-output reliability, Anthropic is still the floor.
- Vision tasks. K2.6 is text + tools. No image input.
The "should I switch" math
Scenario: solo engineer using Claude Code all day, ~15M tokens/month.
| Setup | Monthly bill (70/30 input/output) |
|---|---:|
| 100% Claude Opus 4.7 (default) | $495 |
| 100% GPT-5.5 | $188 |
| 100% DeepSeek V4-Pro | $34 |
| 100% Kimi K2.6 | $24 |
| Phase-routed (Opus for plan, GPT-5.5 for agent loops, V4-Flash for impl, K2.6 for long refactors) | ~$15–20 |
K2.6 alone saves 95% vs default Opus. Phase-routing saves another 30% on top by using V4-Flash for the truly-routine 60% of calls.
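The table is reproducible from list prices alone; a minimal sketch you can rerun with your own volume and input/output mix:

```python
# Monthly bill for a given token volume at a 70/30 input/output split.
PRICES = {  # $/1M tokens: (input, output)
    "Opus 4.7": (15.00, 75.00),
    "GPT-5.5": (5.00, 30.00),
    "DeepSeek V4-Pro": (1.74, 3.48),
    "Kimi K2.6": (0.60, 4.00),
}

def monthly_bill(model, total_m_tokens=15, in_frac=0.7):
    p_in, p_out = PRICES[model]
    return total_m_tokens * (in_frac * p_in + (1 - in_frac) * p_out)

for model in PRICES:
    print(f"{model:>16}: ${monthly_bill(model):,.2f}")
# Opus 4.7: $495.00, GPT-5.5: $187.50, V4-Pro: $33.93, K2.6: $24.30
```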
The long-horizon coding angle
What actually matters for coding agents isn't raw benchmark scores — it's step stability. Most benchmarks evaluate a single turn in isolation. Real coding sessions run for hundreds of turns, each shipping 50K+ tokens of accumulated context, tool schemas, and prior outputs.
Models that score well on SWE-Bench in isolation can drift hard over 50+ turns: they forget the architectural decision from turn 7, they repeat a rejected approach from turn 23, they lose track of which file they've already edited. Moonshot trained K2.6 specifically on long trajectories, and early third-party eval confirms the claim — K2.6's coherence at turn 100 is visibly better than GPT-5.4's or Sonnet 4.6's.
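To see why turn-100 coherence and cache pricing both matter, here's a toy model of context accumulation over a session (the system-prompt size and per-turn growth are illustrative assumptions, not measurements):

```python
# Toy model: accumulated context across a long agent session.
SYSTEM_TOKENS = 10_000   # system prompt + tool schemas (assumed)
PER_TURN_GROWTH = 2_000  # tool output + response appended per turn (assumed)

total_input = 0
for turn in range(1, 101):
    context = SYSTEM_TOKENS + PER_TURN_GROWTH * (turn - 1)
    total_input += context  # every turn re-sends the whole context
    if turn in (10, 50, 100):
        print(f"turn {turn:>3}: context {context:,} tokens, "
              f"cumulative input {total_input:,}")
# Context passes 50K tokens around turn 21 and still fits the 256K
# window at turn 100 (208K); the session ships ~10.9M cumulative input
# tokens, which is why cache pricing dominates agent cost.
```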
For workflows like:
- Claude Code in Plan mode → implement 20+ files
- Aider's `/architect` mode over large refactors
- Cline/Continue running autonomous multi-step tasks
- OpenClaw multi-file coordinated changes
...this translates to fewer "agent got confused and I had to restart" moments, and those restarts are what actually dominate lost productivity.
Using K2.6 today
Direct via Moonshot:
```bash
export OPENAI_API_BASE="https://api.moonshot.cn/v1"
export OPENAI_API_KEY="sk-..."
# In your agent: model: "kimi-k2.6"
```
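The same setup from Python, if your agent uses the OpenAI SDK directly (a minimal sketch: Moonshot's endpoint is OpenAI-compatible, and the model id is the one from the comment above):

```python
from openai import OpenAI

# Moonshot exposes an OpenAI-compatible endpoint, so the stock SDK works.
client = OpenAI(
    base_url="https://api.moonshot.cn/v1",
    api_key="sk-...",  # your Moonshot key
)

resp = client.chat.completions.create(
    model="kimi-k2.6",
    messages=[
        {"role": "system", "content": "You are a careful coding agent."},
        {"role": "user", "content": "Refactor this module and output every changed file in full."},
    ],
    max_tokens=65_536,  # K2.6's output ceiling; most models cap far lower
)
print(resp.choices[0].message.content)
```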
Via CodeRouter — works with any OpenAI- or Anthropic-compatible agent behind a single API key; the router auto-picks K2.6 for phases where it's optimal:
```bash
export ANTHROPIC_BASE_URL="https://api.coderouter.io/v1"
export ANTHROPIC_AUTH_TOKEN="cr_..."
# Model: "auto" — K2.6 is in the candidate pool for implement/refactor/test
```
Self-hosting: weights on huggingface.co/moonshotai. Running the full 1T-param MoE takes ~16×80GB H100 or equivalent — not trivial, but viable for teams with inference infrastructure.
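That GPU estimate is straightforward weight arithmetic. A rough sketch (the precision options are our assumptions; Moonshot's reference deployment may differ):

```python
# Weight memory for a 1T-param model at common precisions.
PARAMS = 1.0e12
GIB = 1024**3

for name, bytes_per_param in (("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)):
    weights_gib = PARAMS * bytes_per_param / GIB
    h100s = weights_gib / 80  # 80 GB per H100, ignoring KV cache + overhead
    print(f"{name}: {weights_gib:,.0f} GiB weights ~= {h100s:.1f}x H100-80GB")
# FP8 needs ~931 GiB for weights alone; 16x 80GB = 1,280 GiB leaves
# headroom for KV cache and activations, matching the estimate above.
```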
Where K2.6 fits in the phase router
CodeRouter's phase preference map as of April 23, 2026:
- plan: Opus 4.7 → GPT-5.5 → V4-Pro → Sonnet 4.6
- implement: Sonnet 4.6 → V4-Pro → K2.6 → V4-Flash → GPT-5.4
- debug: Opus 4.7 → GPT-5.5 → Sonnet 4.6 → V4-Pro → K2.6
- test: V4-Flash → V4-Pro → K2.6 → Sonnet 4.6
- refactor: V4-Pro → K2.6 → Sonnet 4.6 → V4-Flash
- document: Haiku 4.5 → Gemini 3 Flash → GPT-5 Mini
K2.6's sweet spots are implement/test/refactor — the phases where its agentic stability pays off and the cost savings vs flagships are dramatic.
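If you run your own router, that preference map is just ordered fallback lists. A minimal sketch (the model ids and the `pick` helper are hypothetical, not CodeRouter's actual API):

```python
# Phase -> ordered model preferences, mirroring the list above.
PHASE_PREFS = {
    "plan":      ["opus-4.7", "gpt-5.5", "v4-pro", "sonnet-4.6"],
    "implement": ["sonnet-4.6", "v4-pro", "kimi-k2.6", "v4-flash", "gpt-5.4"],
    "debug":     ["opus-4.7", "gpt-5.5", "sonnet-4.6", "v4-pro", "kimi-k2.6"],
    "test":      ["v4-flash", "v4-pro", "kimi-k2.6", "sonnet-4.6"],
    "refactor":  ["v4-pro", "kimi-k2.6", "sonnet-4.6", "v4-flash"],
    "document":  ["haiku-4.5", "gemini-3-flash", "gpt-5-mini"],
}

def pick(phase: str, available: set[str]) -> str:
    """Return the first preferred model that's currently available."""
    for model in PHASE_PREFS[phase]:
        if model in available:
            return model
    raise LookupError(f"no candidate available for phase {phase!r}")

# e.g. if Opus is rate-limited, debug falls through to GPT-5.5:
print(pick("debug", {"gpt-5.5", "kimi-k2.6", "v4-pro"}))  # gpt-5.5
```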
The answer
Kimi K2.6 is the most dramatic cost/capability shift of the April 2026 model wave. At $0.60/$4.00, with a SWE-Bench Pro score tied with GPT-5.5 and explicit long-horizon training, it's the model to default to for implementation-heavy coding agents unless you specifically need Anthropic's tool-call reliability floor or OpenAI's math edge.
Pragmatic take: add K2.6 to your router's candidate pool, monitor pass rates on your actual workload for a week, and you'll likely find it replaces 40–50% of your prior Sonnet or V4-Pro calls without any observable quality change. That's a cost cut on the order of 60% for that slice of the work.