Your AI subscription is subsidized.
The subsidy is being pulled.
The Subsidy Era — AI subscriptions, coming due.
@kamilkrauspe
A heavy day can break the economics of a $200 plan.
$200
One heavy agentic-coding day burns more inference than the monthly plan covers.
The rest is paid by someone — until it gets metered.
Four weeks. Five moves.
One direction.
March
Google
Antigravity restructured around AI Credits; effective token allowance on AI Pro cut sharply.
April 2
OpenAI
Codex billing moved from credits-per-message to credits-per-token.
April 21
Anthropic
Claude Code briefly restricted to Max plans. Reversed within a day.
April 23
Anthropic
Postmortem: “compute is a constraint across the entire industry.”
April 27
GitHub
All Copilot plans move to AI Credits. Input, output, and cached tokens are all metered.
Agentic work doesn’t have normal SaaS variance.
Workload · tokens per session · approx. cost
Inline autocomplete (one keystroke) · < 1K · < $0.01
Single chat prompt (no project context) · < 10K · < $0.05
Targeted file edit (short multi-step) · < 100K · < $0.30
Interactive coding (30–90 min session) · < 10M · < $10
Long autonomous run (1–4 hours) · < 100M · < $80
Multi-agent / background (production fleets) · up to 1B · $500+
Token scale: 1K → 1M → 1B · a ~100,000× spread per session.
52.5M tokens in 38 minutes: a Cursor agent replaying a 120K context window in a tool loop.
Sonnet 4.6 list rates. x-axis logarithmic. Estimates.
A real heavy user costs more to serve than the plan covers.
What you pay
$200
flat subscription, per month
What heavy use actually costs
$1,000–$4,400
inference at list rates, per month
The plan covers about 5–20% of the inference.
Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7. Estimates.
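A minimal sketch of that coverage math, assuming Sonnet-class list rates of roughly $3/M for uncached input and a tenth of that for cache reads; the actual Sonnet 4.6 / Opus 4.7 rates are not quoted here, so treat the output as illustrative:

```python
# Back-of-envelope: what a heavy power-user costs to serve each month.
# ASSUMED rates below; heavier Opus-class pricing pushes the result
# toward the top of the slide's $1,000–$4,400 range.

TOKENS_PER_ACTIVE_DAY = 100e6   # slide's heavy user: 100M tokens/active day
ACTIVE_DAYS_PER_MONTH = 20      # assumption: ~20 working days
CACHE_HIT_RATE = 0.90           # slide: ~90% of input tokens hit the cache
INPUT_RATE = 3.00               # $/M uncached input tokens (assumed)
CACHE_READ_RATE = 0.30          # $/M cached input tokens (assumed)
PLAN_PRICE = 200                # the flat monthly subscription

monthly_m_tokens = TOKENS_PER_ACTIVE_DAY * ACTIVE_DAYS_PER_MONTH / 1e6
blended_rate = CACHE_HIT_RATE * CACHE_READ_RATE + (1 - CACHE_HIT_RATE) * INPUT_RATE
inference_cost = monthly_m_tokens * blended_rate

print(f"inference cost: ${inference_cost:,.0f}/month")       # ~$1,140
print(f"plan covers:    {PLAN_PRICE / inference_cost:.0%}")  # ~18%
```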
It’s not only about the money.
Heavy use overshoots the plan in dollars.
It also overshoots it in capacity.
4×
Tokens per request, heavy users (P90), year over year.
Pricing changes aren’t only cost recovery.
They’re also a way of rationing a scarce resource.
Datadog State of AI Engineering, 2026.
The cost moves at three layers.
1
Inference
The unit cost of generating tokens.
2
Harness
The loop around the model.
3
Use
The choices people make in the loop.
The first two improve over time. The third needs discipline.
Layer 1 — Inference
Cheaper tokens come from throughput.
$/M tokens = GPU rent ($/hour) ÷ throughput (M tokens/hour)
Example: $3/hour ÷ 18M tokens/hour ≈ $0.17/M tokens.
Self-hosted GPT-OSS-120B on H100 80GB. Single-config example.
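The division above as a function, using the slide's single-config numbers; the doubled-throughput call is hypothetical, just to show the lever:

```python
# $/M tokens = GPU rent per hour ÷ (tokens per hour / 1M).
# Real throughput varies widely with batch size, context length,
# and quantization; this is the slide's single-config example.

def cost_per_m_tokens(gpu_rent_per_hour: float, tokens_per_hour: float) -> float:
    """Unit inference cost for a self-hosted deployment."""
    return gpu_rent_per_hour / (tokens_per_hour / 1e6)

print(f"${cost_per_m_tokens(3.00, 18e6):.2f}/M tokens")  # $0.17, the slide's example
print(f"${cost_per_m_tokens(3.00, 36e6):.2f}/M tokens")  # $0.08 if throughput doubles
```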
Attention & Memory
MLA (KV cache 10× smaller)
Sparse attention (NSA, DSA, TriAttention)
KV eviction (LMCache, Dynamo, TurboQuant)
Model Strategy
MoE (DeepSeek 671B/37B, GPT-OSS 117B/5.1B)
Hybrid SSM (Nemotron-H, Qwen3, Hunyuan-TurboS)
MXFP4 / NVFP4 quantization
Serving
Speculative decoding (2–4× decode)
Disaggregated prefill/decode
MoD / MoR (dynamic compute)
Push throughput up, and unit cost falls.
Layer 2 — Harness
Every turn carries the previous turn.
Claude Code, Cursor, Codex, Copilot: every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything that came before it.
Turn 1 — system prompt + tools + user ask. ~30K tokens.
Turn 2 — Turn 1 + grep results. ~32K.
Turn 3 — Turn 2 + file contents. ~35K.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
Chart: cached vs. uncached tokens, turn 1 → turn 50 · ≈3.5M cumulative tokens · ~90% cached.
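A toy simulation of that accumulation. The 3.3%-per-turn growth rate is an assumption, tuned so a 30K turn-1 context lands near the slide's ~150K turn-50 and ≈3.5M cumulative figures:

```python
# Toy model of a 50-turn agent loop: the full running context is re-sent
# (and billed) on every call; only the newly appended part misses the cache.

context = 30_000.0   # turn 1: system prompt + tools + user ask
prev = 0.0
total = uncached = 0.0

for turn in range(1, 51):
    total += context              # whole context billed this turn
    uncached += context - prev    # delta since last turn: grep results, files, diffs
    prev = context
    context *= 1.033              # ASSUMED compounding growth per turn

print(f"turn-50 context:  {prev / 1e3:.0f}K tokens")    # ~147K
print(f"cumulative input: {total / 1e6:.1f}M tokens")   # ~3.7M
# Pure prefix caching gives ~96% here; output tokens and TTL misses
# pull real sessions nearer the slide's ~90%.
print(f"cached share:     {1 - uncached / total:.0%}")
```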
Layer 2 — Harness
How harnesses get cheaper.
Wins on this layer compound. A 3× edit-format gain and a 10× cache gain don’t add — they multiply.
Reduce the loop
Programmatic tool calling
Anthropic case: 150K → 2K context.
JIT MCP discovery
Cursor: −46.9% on MCP runs.
Scoped subagents
Summaries to the planner, not raw transcripts.
Protect the cache
Cache miss penalty
Each miss costs 10–20× a hit.
Stable prefixes
Anything dynamic in the prompt prefix kills the cache.
TTL-aware sessions
choose the 5-minute vs. 1-hour TTL by session length.
Shape the work
Edit format matters
Aider: 3× tokens, whole-file vs. diff.
70/20/10 routing
Triage Haiku → escalate Sonnet → reserve Opus (sketched after this list).
Workflow tools, not API endpoints
One right-shaped tool beats five raw ones.
This is context engineering in practice.
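One way the 70/20/10 rule above looks in code. The tier prices are assumed list rates, and the complexity score stands in for whatever triage signal a harness actually has:

```python
# Route each task to the cheapest tier that can handle it; escalate rarely.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    cost_per_m: float  # $/M tokens, ASSUMED list rates

HAIKU  = Tier("haiku",  1.00)   # ~70% of tasks: lookups, small edits
SONNET = Tier("sonnet", 3.00)   # ~20%: multi-file changes
OPUS   = Tier("opus",  15.00)   # ~10%: architecture, hard debugging

def route(complexity: float) -> Tier:
    """complexity in [0, 1], from a hypothetical triage step."""
    if complexity < 0.70:
        return HAIKU
    if complexity < 0.90:
        return SONNET
    return OPUS

blended = 0.7 * HAIKU.cost_per_m + 0.2 * SONNET.cost_per_m + 0.1 * OPUS.cost_per_m
print(f"blended ${blended:.2f}/M vs all-Opus ${OPUS.cost_per_m:.2f}/M "
      f"({OPUS.cost_per_m / blended:.1f}x cheaper)")   # ~5.4x
```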
Layer 3 — Use
Same spend. Different productivity.
Team A
Spend: $5,000/month
$35
cost per merged PR
Team B
Spend: $5,000/month
$120
cost per merged PR
The variance is not in price. It’s in which model people reach for, when they escalate, and when they stop.
Specify
Outcome-first prompts
describe what good looks like; drop process scaffolding.
Literal instruction
the model won’t generalize what you don’t say.
Explicit stop conditions
tell the agent when it’s done, so it runs the fewest useful tool loops.
Control
Subagent context isolation
fresh context windows; only summaries return.
Task budgets
advisory token ceiling; the model paces itself.
Escalation rules
small model first; reasoning model only when needed.
Measure
Cost per merged PR
the highest spender can be the lowest cost-per-PR (see the sketch below).
AI drafts, humans gate
generate freely; verify before commit.
Experienced-user playbooks
senior users converge on different patterns.
This layer does not get cheaper on its own.
Vantage agentic-coding cost analysis, April 2026. ~10× cost-per-developer variance within a single team is typical.
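The denominator in code. The merged-PR counts below are back-solved from the slide's $35 and $120 figures, not reported numbers:

```python
# Same spend, different denominators: cost per merged PR.

def cost_per_merged_pr(monthly_spend: float, merged_prs: int) -> float:
    return monthly_spend / merged_prs

print(f"Team A: ${cost_per_merged_pr(5_000, 143):.0f}/PR")  # ≈ $35,  ~143 merges
print(f"Team B: ${cost_per_merged_pr(5_000, 42):.0f}/PR")   # ≈ $119, ~42 merges
```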
Layer 1 sets the rate card.
Layer 2 sets the loop.
Layer 3 sets the denominator.
The subsidies are ending.
What looks like a loss is a gift: knowing what the work cost all along, and choosing what is worth running.
@kamilkrauspe