TL;DR — DeepSeek V4 launched as two sibling models, not one: V4 Pro ($1.74/$3.48, 81% SWE-bench Verified, near-Opus) and V4 Flash ($0.14/$0.28, same 1M context, roughly 20× cheaper than Sonnet 4.6). The old
deepseek-chat and deepseek-reasoner API aliases now auto-map to V4 Flash. The decision is NOT "which one" — it's "Flash for 70% of calls, Pro when reasoning matters." A phase-aware router hits both.
## What DeepSeek actually shipped
As of April 2026, DeepSeek's API serves two V4 models:
| Model | Input $/M | Output $/M | Cache hit $/M | Context | Max output |
|---|---:|---:|---:|---:|---:|
| deepseek-v4-flash | $0.14 | $0.28 | $0.028 | 1M | 384K |
| deepseek-v4-pro | $1.74 | $3.48 | $0.145 | 1M | 384K |
Both use an MoE architecture (V4-Pro: 1.6T total params / 49B activated per token). Both support tool calls, JSON output, streaming, and chat-prefix completion. Flash is non-thinking; Pro runs extended reasoning.
Important back-compat: the old model IDs deepseek-chat and deepseek-reasoner now route to V4-Flash (non-thinking mode) and V4-Flash (thinking mode) respectively. Clients pinned to the old IDs get a free upgrade — 1M context and cheaper pricing, no code change. DeepSeek's docs mark the legacy aliases for eventual deprecation.
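If you're pinned to a legacy alias, nothing changes on your side. Here's a minimal sketch against DeepSeek's OpenAI-compatible endpoint; the only assumptions beyond the mapping described above are that your key lives in `DEEPSEEK_API_KEY` and that you're using the standard `openai` Python SDK.

```python
import os
from openai import OpenAI

# Point the standard OpenAI SDK at DeepSeek's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.deepseek.com/v1",
    api_key=os.environ["DEEPSEEK_API_KEY"],  # assumption: key exported in your shell
)

resp = client.chat.completions.create(
    model="deepseek-chat",  # legacy alias, now served by V4-Flash per the mapping above
    messages=[{"role": "user", "content": "Write a docstring for a binary search function."}],
)
print(resp.choices[0].message.content)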
## Benchmark separation
| Benchmark | V4-Flash (est.) | V4-Pro | Claude Opus 4.7 | GPT-5.5 |
|---|---:|---:|---:|---:|
| SWE-bench Verified | ~73% | 81% | 82% | — |
| HumanEval | ~85% | 90% | ~88% | — |
| SWE-Bench Pro | — | ~55% | 64.3% | 58.6% |
The 8-point SWE-bench Verified gap between Flash and Pro is the main differentiator. Pro is in Opus territory on verified code-edit benchmarks. Flash is still competitive with Sonnet-class models at 20× less cost.
## When V4 Flash is enough
- Implementation phase. Writing new code from a clear spec. Flash's 5/5 on code_implement in CodeRouter's scoring tracks with independent benchmarks.
- Test generation. Deterministic, bounded scope — Flash is an absolute bargain here at $0.14 per million input.
- Small edits / refactors under 100 files. 1M context means even mid-size refactors fit in a single call.
- Docs + docstring generation. Wasted money to send this to Pro.
- Classifier / routing calls. At $0.14 input, this is the cheapest model capable of reliable structured outputs (see the sketch after this list).
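To make that last bullet concrete, here's a hedged sketch of a structured-output classification call on Flash. The model ID follows the table above, the phase labels and prompt are placeholders for whatever taxonomy your agent uses, and JSON-object mode is assumed to behave like the standard OpenAI-compatible `response_format`.

```python
import json
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com/v1",
                api_key=os.environ["DEEPSEEK_API_KEY"])

# Cheap classifier call: Flash decides which phase a request belongs to.
# Phase labels are illustrative, not a fixed taxonomy.
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": ("Classify the user's request as one of: plan, implement, test, debug, docs. "
                     "Reply as JSON: {\"phase\": \"...\"}")},
        {"role": "user", "content": "Why does the worker pool deadlock under load?"},
    ],
)
print(json.loads(resp.choices[0].message.content))  # e.g. {"phase": "debug"}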
## When V4 Pro earns its 12× markup
- Plan mode / architecture design. 81% SWE-bench Verified = comparable to Opus on structural reasoning. At $1.74/M input, still a fraction of Opus 4.7's $15/M.
- Complex debugging where you need the model to hold 10+ files in its head. The thinking-mode budget pays off.
- Multi-file refactors with subtle cross-module invariants. Pro's deeper reasoning catches what Flash misses.
- Hard code review — spotting race conditions, subtle null-handling bugs, ordering issues.
## The routing math
Say a heavy coding day runs 1M tokens (roughly a morning of agentic work with Claude Code or Cursor):
| Setup | Cost for 1M tokens (70/30 in/out) |
|---|---:|
| 100% Claude Opus 4.7 | $33.00 |
| 100% Claude Sonnet 4.6 | $6.60 |
| 100% V4-Pro | $2.26 |
| 100% V4-Flash | $0.18 |
| Phase-routed (Pro for plan/debug, Flash for impl/test, Sonnet fallback) | ~$0.80 |
The 100%-Flash scenario is tempting but breaks on hard reasoning tasks. The 100%-Pro scenario overpays roughly 12× on routine work. Phase-routing uses each model where it's best.
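If you want to check the table's arithmetic yourself, the blended numbers fall out of a one-liner. A small sketch, assuming the same 70/30 input/output split; the Opus and Sonnet rates ($15/$75 and $3/$15 per million) are the list prices the table implies, and the phase-routed row is omitted because it depends on your actual traffic mix.

```python
# Blended cost per 1M tokens at a 70/30 input/output split.
# Prices are $/M tokens; Opus/Sonnet rates are the list prices implied by the table.
PRICES = {
    "claude-opus-4.7":   (15.00, 75.00),
    "claude-sonnet-4.6": (3.00, 15.00),
    "deepseek-v4-pro":   (1.74, 3.48),
    "deepseek-v4-flash": (0.14, 0.28),
}

def blended_cost(input_price: float, output_price: float,
                 total_tokens_m: float = 1.0, input_share: float = 0.7) -> float:
    """Dollar cost of total_tokens_m million tokens at the given split."""
    return total_tokens_m * (input_share * input_price + (1 - input_share) * output_price)

for model, (inp, out) in PRICES.items():
    print(f"{model:20s} ${blended_cost(inp, out):6.2f}")
# claude-opus-4.7      $ 33.00
# claude-sonnet-4.6    $  6.60
# deepseek-v4-pro      $  2.26
# deepseek-v4-flash    $  0.18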
"Should I just pin V4-Flash everywhere?"
Short answer: no, and here's why.
Flash is excellent at what it's built for — a fast, cheap, high-context generalist for writing code against a clear spec. But when the spec is ambiguous, when there are competing constraints to weigh, when a bug has multiple plausible causes and you need reasoning to disambiguate — Flash will confidently generate the wrong answer. The ~5-point HumanEval gap and 8-point SWE-bench Verified gap show up exactly in these hard cases.
The right posture: default to Flash, escalate to Pro for phases that specifically reward reasoning. A phase-aware router does this automatically by reading signals from your prompt (plan-mode tags, ambiguity markers, multi-step debug chains).
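For illustration only, here is a deliberately naive version of that escalation rule. The signal keywords and model IDs are placeholders, not CodeRouter's actual heuristics.

```python
# Naive phase router: default to Flash, escalate to Pro when the prompt
# carries signals that reward extended reasoning. Keywords and model IDs
# are illustrative placeholders, not a real router's logic.
ESCALATION_SIGNALS = (
    "plan", "architecture", "design doc",      # plan mode
    "race condition", "intermittent", "why",   # ambiguous / multi-cause debugging
    "refactor across", "invariant",            # cross-module refactors
)

def pick_model(prompt: str) -> str:
    text = prompt.lower()
    if any(signal in text for signal in ESCALATION_SIGNALS):
        return "deepseek-v4-pro"    # reasoning-heavy phase
    return "deepseek-v4-flash"      # default: cheap, fast, 1M context

print(pick_model("Implement the CSV exporter per the spec"))  # deepseek-v4-flash
print(pick_model("Why does this test fail intermittently?"))  # deepseek-v4-pro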
## Integrating V4 with your coding agent
If you're using any agent that speaks OpenAI-compatible or Anthropic-compatible APIs, the clean pattern is:
```bash
# Hit DeepSeek direct (cheapest, V4 Pro manual selection)
export OPENAI_API_BASE="https://api.deepseek.com/v1"
export OPENAI_API_KEY="sk-..."
# Then in your agent: model: "deepseek-v4-pro"
```
But pinning to one provider defeats the main win. With CodeRouter, the same agent hits all providers via one API key:
```bash
export ANTHROPIC_BASE_URL="https://api.coderouter.io/v1"
export ANTHROPIC_AUTH_TOKEN="cr_..."
# Model: "auto" — router picks V4-Pro, V4-Flash, Opus, Sonnet, GPT-5.5, etc. per phase
```
Full agent-by-agent guides: Claude Code setup · Aider cost optimization · Cut Cursor bill.
## Cache hits — don't skip this
V4's cache discounts are aggressive:
- V4-Flash cache hit: $0.028/M (80% off normal input)
- V4-Pro cache hit: $0.145/M (~92% off)
For coding agents, each tool call ships a growing context. If your gateway (or DeepSeek directly) properly marks prompts for caching, your actual bill on a multi-step session lands 60–80% below the nominal per-token math. CodeRouter handles prompt-cache markers automatically for Anthropic, OpenAI, and DeepSeek.
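A rough sketch of that math, assuming (as an illustration, not a measurement) that 80% of session input is served from cache; your agent's real hit rate depends on how it builds context.

```python
# Effective input cost for a multi-step agent session with prompt caching on V4-Flash.
# The 80% cache-hit fraction is an illustrative assumption, not a measured value.
FLASH_INPUT, FLASH_CACHE_HIT = 0.14, 0.028   # $/M tokens, from the pricing table
input_tokens_m = 1.0                          # 1M input tokens over the session
cache_hit_fraction = 0.80                     # share of input served from cache

nominal = input_tokens_m * FLASH_INPUT
effective = input_tokens_m * (
    cache_hit_fraction * FLASH_CACHE_HIT
    + (1 - cache_hit_fraction) * FLASH_INPUT
)
print(f"nominal ${nominal:.3f} vs effective ${effective:.3f}")
# nominal $0.140 vs effective $0.050 -> ~64% below the nominal input math,
# inside the 60–80% band quoted above
```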
## The answer
DeepSeek V4 isn't "cheaper DeepSeek." It's two products: a bargain-basement generalist (Flash, $0.14/$0.28, 1M context) and a near-Opus reasoner (Pro, $1.74/$3.48, same context). Pinning either one wastes something: money on routine calls or capability on hard ones.
Pragmatic take: Let your agent call Flash for most of the work and escalate to Pro when the phase specifically rewards reasoning. Phase-aware routing does this automatically. Pair with Kimi K2.6 for long-horizon agentic refactors and your typical monthly bill lands in the $20–60 range instead of the $500+ you pay pinning to Claude or GPT.