AI subscriptions are subsidized.
The subsidy is being pulled back.
What’s being unwound.
What it means for the people who pay.
What to do.
The flat-rate era is ending.
The metered era is starting.
@kamilkrauspe
One heavy day of agentic coding
burns through
$200
— more than the whole monthly plan covers.
The rest is paid by someone — and that is what is being unwound.
Four weeks. Five moves.
One direction.
March
Google
Antigravity restructured around AI Credits; AI Pro's effective token allowance cut sharply (~97%, per independent developer reports).
April 2
OpenAI
Codex billing moved from credits-per-message to credits-per-token. New $100 Pro tier between Plus and the $200 Pro.
April 21
Anthropic
Claude Code briefly restricted to Max plans on the pricing page. Reversed within a day.
April 23
Anthropic
Postmortem for Claude Code quality issues. Statement to Fortune: “compute is a constraint across the entire industry.”
April 27
GitHub
All Copilot plans move to AI-Credit billing on June 1. Input, output, and cached tokens all metered.
Sources: blog.google · tpsreport.news · openai · anthropic · github · April 2026.
Variance per session, 2026.
Per-session token consumption across six common workloads.
Inline autocomplete · one keystroke · < 1K tokens · < $0.01
Single chat prompt · no project context · < 10K · < $0.05
Targeted file edit · short multi-step · < 100K · < $0.30
Interactive coding · 30–90 min session · < 10M · < $10
Long autonomous run · 1–4 hours · < 100M · < $80
Multi-agent / background · production fleets · up to 1B · $500+
Log scale, 1K → 1M → 1B: a ~100,000× spread per session.
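The table's falling blended $/M can be sanity-checked with simple arithmetic, since cache reads dominate at scale. A minimal sketch with assumed Sonnet-class list rates (illustrative, not quoted prices):

```python
# Rough per-session cost model. Rates are assumptions for illustration:
# ~$3/M uncached input, ~$0.30/M cache reads, ~$15/M output tokens.
RATE = {"uncached": 3.00, "cached": 0.30, "output": 15.00}  # $ per 1M tokens

def session_cost(tokens: dict) -> float:
    """tokens maps each token class to a raw count for one session."""
    return sum(tokens[k] / 1e6 * RATE[k] for k in tokens)

# A long autonomous run: 100M tokens, ~90% of them cache reads.
run = {"cached": 90e6, "uncached": 8e6, "output": 2e6}
print(f"${session_cost(run):.0f}")  # → $81, near the table's < $80 bound
```

The same 100M tokens with no cache hits would cost several times more, which is why the blended rate falls as sessions grow.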
Documented heavy-use sessions
52.5M
tokens in 38 minutes — Cursor agent replaying a 120K context window in a tool loop.
$300 / day
Jason Calacanis running 8 AI agents, “a fraction of their potential capacity” — All-In Podcast, Feb 2026.
$1,000 / wk
one Replit Agent 3 user's weekly spike, vs. that user's typical $180–200 / month baseline — Q1 2026.
Sonnet 4.6 list rates. Cache reads dominate at scale. Opus ~1.5–1.7× higher. Ranges are estimates; x-axis is logarithmic.
A real heavy user
costs more to serve
than the plan covers.
What you pay
$200
flat subscription, per month
What heavy use actually costs
$1,000–$4,400
inference at list rates, per month
The plan covers about 5–20% of the inference.
Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7. Estimates.
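The 5–20% coverage figure follows from the footnote's assumptions. A back-of-envelope sketch (100M tokens per active day, ~22 active days a month, ~90% cache hit rate, assumed Sonnet-class list rates; every number here is illustrative):

```python
# Monthly inference cost for a heavy user vs. a $200 flat plan.
# Rates are assumed Sonnet-class list prices, not quotes.
UNCACHED, CACHED = 3.00, 0.30   # $ per 1M input tokens
TOKENS_PER_DAY = 100e6
ACTIVE_DAYS = 22
CACHE_HIT = 0.90

daily = (TOKENS_PER_DAY * CACHE_HIT * CACHED
         + TOKENS_PER_DAY * (1 - CACHE_HIT) * UNCACHED) / 1e6
monthly = daily * ACTIVE_DAYS
print(f"${daily:.0f}/day, ${monthly:.0f}/month, "
      f"plan covers {200 / monthly:.0%}")
# → $57/day, $1254/month, plan covers 16%
```

At Opus-class rates (~1.5–1.7× higher), the same usage lands toward the top of the $1,000–$4,400 range, and plan coverage drops toward the 5% end.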
It’s not only about the money.
Heavy use overshoots the plan in dollars.
It also overshoots it in capacity.
4×
Tokens per request, heavy users (P90), year over year.*
Subscription changes aren’t only cost recovery.
They’re also rationing on a scarce resource.
* Source: Datadog State of AI Engineering, 2026.
So what can we do about it?
The same lever shows up at three layers.
Cost, throughput, and capacity move together.
1
Inference
The unit of inference: architecture, runtime, cache mechanics.
Improves through: Labs and serving providers competing on $/M.
Your hand: Which model you reach for; prefix discipline.
2
Harness
The system around the model: context, tools, orchestration.
Improves through: Tool builders competing; engineering teams iterating.
Your hand: Context engineering; harness, agent, and tool design.
3
Use
The choices made turn-by-turn: which model, when to escalate, when to stop.
Improves through: You, deliberately.
How it's used: All of it.
As prices become visible, layer 3 — the one only you can move — matters most.
Layer 1 — Inference
What does a million tokens cost to make?
Self-hosted GPT-OSS-120B on a rented H100 80GB.
cost ÷ throughput ≈ $0.17 per million tokens
A simplified single-configuration example. Real costs vary.
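The division itself is trivial; the hard part is the throughput number. A sketch with assumed figures (H100 rental ≈ $2.50/hr, aggregate throughput ≈ 4,000 tokens/s across batched requests; both are illustrative, not benchmarks):

```python
# $/M tokens for self-hosted serving = hourly GPU cost / hourly token output.
GPU_PER_HOUR = 2.50      # assumed H100 80GB rental rate, $/hr
TOKENS_PER_SEC = 4000    # assumed aggregate throughput across batched requests

tokens_per_hour = TOKENS_PER_SEC * 3600           # 14.4M tokens/hr
cost_per_m = GPU_PER_HOUR / (tokens_per_hour / 1e6)
print(f"${cost_per_m:.2f} per million tokens")    # → $0.17 per million tokens
```

Doubling throughput halves the cost, which is why every technique below attacks the denominator.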
Throughput is the denominator.
Push it up, the cost falls. So how?
COMPRESS
MLA · MXFP4/NVFP4 quantization · KV eviction …
SKIP
sparse attention · MoE · MoD/MoR …
RESHAPE
speculative decoding · disaggregation · Hybrid SSM …
Layer 2 — Harness
Every turn carries the previous turn.
In Claude Code, Cursor, Codex, Copilot — every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything before.
Turn 1 — system prompt + tools + MCP defs + user ask: fix failing tests. ~30K tokens. Returns: grep "test_users".
Turn 2 — Turn 1 + grep call + ~50 hits. ~32K. Returns: read_file("test_users.py").
Turn 3 — Turn 2 + ~300 lines of file content. ~35K. Returns: edit proposal.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
≈3.5M cumulative tokens from turn 1 to turn 50, ~90% of them cached.
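The loop's growth can be sketched numerically. A toy model with assumed numbers (~30K base context, ~2.4K tokens appended per turn, 50 turns, full resend each turn); it lands in the same several-million cumulative range as the session above:

```python
# Toy model of an agent loop: each turn resends everything before it.
BASE = 30_000     # system prompt + tools + MCP defs (assumed)
GROWTH = 2_400    # avg tokens appended per turn: tool results, edits (assumed)
TURNS = 50

cumulative = 0
for turn in range(1, TURNS + 1):
    context = BASE + GROWTH * (turn - 1)  # full history carried into this turn
    cumulative += context

final = BASE + GROWTH * (TURNS - 1)
print(f"final context ~{final/1e3:.0f}K, cumulative ~{cumulative/1e6:.1f}M")
# → final context ~148K, cumulative ~4.4M
```

Linear context growth makes cumulative tokens quadratic in turn count; that is the mechanism behind the hockey stick, and why caching and context pruning matter so much.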
Layer 2 — Harness
How harnesses get cheaper.
Wins on this layer compound. A 3× edit-format gain and a 10× cache gain don’t add — they multiply.
Reduce the loop
Programmatic tool calling
Anthropic case: 150K → 2K context.
JIT MCP discovery
Cursor: −46.9% on MCP runs.
Scoped subagents
Summaries to the planner, not raw transcripts.
Protect the cache
Cache miss penalty
Each miss costs 10–20× a hit.
Stable prefixes
Anything dynamic kills cache.
TTL-aware sessions
5-min vs. 1-hour, by session length.
Shape the work
Edit format matters
Aider: 3× tokens, whole-file vs. diff.
70/20/10 routing
Triage Haiku → escalate Sonnet → reserve Opus.
Workflow tools, not API endpoints
One right-shaped tool beats five raw ones.
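The 70/20/10 split can be expressed as a one-line blended rate. A sketch with assumed blended $/M for each tier (illustrative, not quoted prices), showing why triage dominates cost:

```python
# Blended $/M under 70/20/10 routing vs. sending everything to the top tier.
# Rates are assumed blended input+output prices, for illustration only.
RATES = {"haiku": 1.00, "sonnet": 6.00, "opus": 30.00}  # $ per 1M tokens
SPLIT = {"haiku": 0.70, "sonnet": 0.20, "opus": 0.10}   # share of traffic

blended = sum(RATES[m] * SPLIT[m] for m in RATES)
print(f"blended ${blended:.2f}/M vs. all-opus ${RATES['opus']:.2f}/M "
      f"({RATES['opus'] / blended:.1f}x)")
# → blended $4.90/M vs. all-opus $30.00/M (6.1x)
```

Under these assumptions the routed blend is ~6× cheaper than defaulting everything to the largest model, before any cache or harness wins multiply on top.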
Context engineering, prompt engineering, harness engineering — in practice. …
Layer 3 — Users
Same spend.
Different productivity.
Team A
Spend: $5,000/month
$35
cost per merged PR
Team B
Spend: $5,000/month
$120
cost per merged PR
The variance is not in price. It’s in which model people reach for, when they escalate, and when they stop.
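The headline metric is just spend divided by output; the slide's figures imply very different throughput. A sketch (PR counts rounded from the numbers above):

```python
# Cost per merged PR: same spend, different throughput.
def cost_per_pr(monthly_spend: float, merged_prs: int) -> float:
    return monthly_spend / merged_prs

# Merged-PR counts implied by the slide's figures (rounded):
print(round(5000 / 35))    # Team A: ~143 merged PRs/month
print(round(5000 / 120))   # Team B: ~42 merged PRs/month
```

Same $5,000, roughly 3.4× the shipped work: the gap lives entirely in the practices below.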
SPECIFY
— outcome-first prompts · literal instructions · explicit stop conditions …
CONTROL
— subagent isolation · task budgets · escalation rules …
MEASURE
— cost per merged PR · AI drafts, humans gate · experienced-user playbooks …
This is the layer that does not get cheaper on its own.
Source: Vantage agentic-coding cost analysis, April 2026. ~10× cost-per-developer variance within a team is typical.
The subsidies are ending.
What looks like a loss is also a gift —
A gift of knowing
what anything was
costing all along.
And of choosing, again,
what is worth running.
(Larger reverberations too — through organizations, through societies, through what we have been calling “cheap”…)