AI subscriptions are subsidized.
The subsidy is being pulled back.

What’s being unwound.
What it means for the people who pay.
What to do.

The flat-rate era is ending.
The metered era is starting.
@kamilkrauspe
One heavy day of agentic coding
burns through
$200
— more than the whole monthly plan covers.
The rest is paid by someone — and that is what is being unwound.

Four weeks. Five moves.
One direction.

March
Google
Antigravity restructured to AI Credits. AI Pro effective token allowance cut sharply (a ~97% reduction, per independent developer reports).
April 2
OpenAI
Codex billing moved from credits-per-message to credits-per-token. New $100 Pro tier between Plus and the $200 Pro.
April 21
Anthropic
Claude Code briefly restricted to Max plans on the pricing page. Reversed within a day.
April 23
Anthropic
Postmortem for Claude Code quality issues. Statement to Fortune: “compute is a constraint across the entire industry.”
April 27
GitHub
All Copilot plans move to AI-Credit billing on June 1. Input, output, and cached tokens all metered.
Sources: blog.google · tpsreport.news · openai · anthropic · github · April 2026.

Variance per request, 2026.

Per-session token consumption across six common workloads.

Inline autocomplete (one keystroke): < 1K tokens · < $0.01
Single chat prompt (no project context): < 10K · < $0.05
Targeted file edit (short multi-step): < 100K · < $0.30
Interactive coding (30–90 min session): < 10M · < $10
Long autonomous run (1–4 hours): < 100M · < $80
Multi-agent / background (production fleets): up to 1B · $500+
~100,000× spread per session
Documented heavy-use sessions
52.5M
tokens in 38 minutes — Cursor agent replaying a 120K context window in a tool loop.
$300 / day
Jason Calacanis running 8 AI agents, “a fraction of their potential capacity” — All-In Podcast, Feb 2026.
$1,000 / wk
single Replit Agent 3 user spike, vs. their typical $180–200 / month baseline — Q1 2026.
Sonnet 4.6 list rates. Cache reads dominate at scale. Opus ~1.5–1.7× higher. Ranges are estimates; x-axis is logarithmic.
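The dollar column follows from the token column once rates are fixed. A minimal sketch, assuming illustrative Sonnet-class rates; the per-million prices below are assumptions for illustration, not confirmed list prices:

```python
# Rough per-session cost model. The rates are illustrative assumptions.
RATE_UNCACHED = 3.00   # $/M tokens, uncached input (assumed)
RATE_CACHED   = 0.30   # $/M tokens, cache reads (assumed ~10x cheaper)

def session_cost(total_tokens: int, cache_hit: float = 0.9) -> float:
    """Dollar cost of one session, given total tokens and a cache hit rate."""
    cached = total_tokens * cache_hit
    uncached = total_tokens - cached
    return (cached * RATE_CACHED + uncached * RATE_UNCACHED) / 1_000_000

# The spread across the table's workloads:
print(f"chat prompt (10K):     ${session_cost(10_000):.4f}")
print(f"interactive (10M):     ${session_cost(10_000_000):.2f}")   # ~$5.70
print(f"autonomous run (100M): ${session_cost(100_000_000):.2f}")  # ~$57
```

Under these assumed rates each row lands inside the table's bracket; output tokens (priced higher) and cache misses push real sessions toward the upper bound.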

A real heavy user
costs more to serve
than the plan covers.

What you pay
$200
flat subscription, per month
What heavy use actually costs
$1,000–$4,400
inference at list rates, per month
The plan covers about 5–20% of the inference.
Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7. Estimates.
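A back-of-envelope for the $1,000–$4,400 range, using the slide's inputs (100M tokens per active day, ~90% cache hit, Opus ~1.5–1.7× Sonnet). The active-day count, output share, and exact per-million rates are assumptions for illustration:

```python
# Back-of-envelope for the monthly heavy-use figure. All splits below
# (active days, output share, exact rates) are assumptions.
ACTIVE_DAYS    = 22            # working days per month (assumed)
TOKENS_PER_DAY = 100_000_000   # heavy power-user, per the slide
CACHE_HIT      = 0.90

# Assumed Sonnet-class list rates, $/M tokens:
IN_UNCACHED, IN_CACHED, OUT = 3.00, 0.30, 15.00
OUTPUT_SHARE = 0.05            # assumed fraction of tokens that are output

cached   = TOKENS_PER_DAY * CACHE_HIT
output   = TOKENS_PER_DAY * OUTPUT_SHARE
uncached = TOKENS_PER_DAY - cached - output

daily = (cached * IN_CACHED + uncached * IN_UNCACHED + output * OUT) / 1e6
sonnet_month = daily * ACTIVE_DAYS        # ~$2,574
opus_month   = sonnet_month * 1.7         # slide: Opus ~1.5-1.7x higher
print(f"~${sonnet_month:,.0f} to ${opus_month:,.0f} per month")
```

A higher cache hit rate and a smaller output share pull the Sonnet figure toward the $1,000 low end; the Opus multiplier reaches the ~$4,400 top. Against a $200 plan, that is the 5–20% coverage the slide describes.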

It’s not only about the money.

Heavy use overshoots the plan in dollars.
It also overshoots it in capacity.

Tokens per request, heavy users (P90), year over year.*
Subscription changes aren’t only cost recovery.
They’re also rationing on a scarce resource.
* Source: Datadog State of AI Engineering, 2026.

So what can we do about it?

The same lever shows up at three layers.
Cost, throughput, and capacity move together.

1
Inference
The unit of inference: architecture, runtime, cache mechanics.
Improves through: Labs and serving providers competing on $/M.
Your hand: Which model you reach for; prefix discipline.
2
Harness
The system around the model: context, tools, orchestration.
Improves through: Tool builders competing; engineering teams iterating.
Your hand: Context engineering; harness, agent, and tool design.
3
Use
The choices made turn-by-turn: which model, when to escalate, when to stop.
Improves through: You, deliberately.
How it's used: All of it.
As prices become visible, layer 3 — the one only you can move — matters most.
Layer 1 — Inference

What does a million tokens cost to make?

Self-hosted GPT-OSS-120B on a rented H100 80GB.
Cost
$3 / hour
rental list rate
Throughput
18M tokens / hour
vLLM saturated, output tokens
cost ÷ throughput = $3 / 18M tokens ≈ $0.17 per million tokens
A simplified single-configuration example. Real costs vary.
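The arithmetic made explicit, using the slide's single-configuration numbers:

```python
# The slide's unit economics: cost / throughput = $/M tokens.
cost_per_hour   = 3.0    # H100 80GB rental, $/hr (slide's list rate)
tokens_per_hour = 18e6   # saturated vLLM output throughput (slide)

dollars_per_million = cost_per_hour / (tokens_per_hour / 1e6)
print(f"${dollars_per_million:.2f} / M tokens")   # ~$0.17

# Throughput is the denominator: doubling it halves unit cost.
print(f"${cost_per_hour / (2 * tokens_per_hour / 1e6):.3f} / M at 2x throughput")
```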
Throughput is the denominator.
Push it up, the cost falls. So how?
COMPRESS
MLA · MXFP4/NVFP4 quantization · KV eviction …
SKIP
sparse attention · MoE · MoD/MoR …
RESHAPE
speculative decoding · disaggregation · Hybrid SSM …
Layer 2 — Harness

Every turn carries the previous turn.

In Claude Code, Cursor, Codex, Copilot — every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything before.

Turn 1 — system prompt + tools + MCP defs + user ask: fix failing tests. ~30K tokens. Returns: grep "test_users".
Turn 2 — Turn 1 + grep call + ~50 hits. ~32K. Returns: read_file("test_users.py").
Turn 3 — Turn 2 + ~300 lines of file content. ~35K. Returns: edit proposal.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
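A sketch of the loop's billing under prefix caching: the replayed prefix bills as cache reads, and only each turn's new suffix is full price. The per-turn growth here is a linear assumption (30K at turn 1, ~150K by turn 50), so the totals are illustrative and will not exactly match the chart:

```python
# Sketch of agent-loop billing: each turn resends the whole prior context.
# With prefix caching, the replayed prefix bills as cache reads; only each
# turn's new suffix is uncached. Linear growth is an assumption.
def loop_tokens(turns: int = 50, start: int = 30_000, growth: int = 2_450):
    cached = uncached = 0
    context = start                    # context entering the current turn
    for turn in range(1, turns + 1):
        new = start if turn == 1 else growth   # fresh tokens this turn
        uncached += new                        # new suffix: full price
        cached += context - new                # replayed prefix: cache read
        context += growth
    return cached, uncached

cached, uncached = loop_tokens()
total = cached + uncached
print(f"total {total/1e6:.1f}M billed tokens, {cached/total:.0%} as cache reads")
```

Even this toy run bills several million tokens for a single 50-turn session, almost all of them replays. Cheaper, yes; gone, no.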
[Chart: cumulative billed tokens, turn 1 to turn 50: ≈3.5M tokens total, ~90% cached vs. uncached.]
Caching makes replayed tokens cheaper. It does not make them disappear.
Layer 2 — Harness

How harnesses get cheaper.

Wins on this layer compound. A 3× edit-format gain and a 10× cache gain don’t add — they multiply.

Reduce the loop
Programmatic tool calling: Anthropic case, 150K → 2K context.
JIT MCP discovery: Cursor, −46.9% on MCP runs.
Scoped subagents: summaries to the planner, not raw transcripts.
Protect the cache
Cache miss penalty: each miss costs 10–20× a hit.
Stable prefixes: anything dynamic kills the cache.
TTL-aware sessions: 5-min vs. 1-hour, chosen by session length.
Shape the work
Edit format matters: Aider measured 3× tokens, whole-file vs. diff.
70/20/10 routing: triage with Haiku → escalate to Sonnet → reserve Opus.
Workflow tools, not API endpoints: one right-shaped tool beats five raw ones.
Context engineering, prompt engineering, harness engineering — in practice. …
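The 70/20/10 routing idea above can be sketched as a tiny router. The tier rates, difficulty thresholds, and the scoring itself are assumptions for illustration:

```python
# Sketch of 70/20/10 routing: triage to a cheap model, escalate
# mid-difficulty work, reserve the frontier model for the hardest 10%.
# Rates and thresholds below are illustrative assumptions.
RATES = {"haiku": 1.0, "sonnet": 3.0, "opus": 15.0}   # $/M input (assumed)

def route(difficulty: float) -> str:
    """Pick a tier from a 0..1 difficulty score (thresholds assumed)."""
    if difficulty < 0.7:
        return "haiku"
    if difficulty < 0.9:
        return "sonnet"
    return "opus"

# Blended $/M under a 70/20/10 split vs. sending everything to the top tier:
blended = 0.70 * RATES["haiku"] + 0.20 * RATES["sonnet"] + 0.10 * RATES["opus"]
print(f"blended ${blended:.2f}/M vs. ${RATES['opus']:.2f}/M all-opus")
```

Under these assumed rates the blend comes out roughly 5× cheaper than routing everything to the top tier, with the hard 10% still getting the strongest model.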
Layer 3 — Users

Same spend.
Different productivity.

Team A
Spend: $5,000/month
$35
cost per merged PR
Team B
Spend: $5,000/month
$120
cost per merged PR
The variance is not in price. It’s in which model people reach for, when they escalate, and when they stop.
SPECIFY  — outcome-first prompts  ·  literal instructions  ·  explicit stop conditions …
CONTROL  — subagent isolation  ·  task budgets  ·  escalation rules …
MEASURE  — cost per merged PR  ·  AI drafts, humans gate  ·  experienced-user playbooks …
This is the layer that does not get cheaper on its own.
Source: Vantage agentic-coding cost analysis, April 2026. ~10× cost-per-developer variance within a team is typical.
The subsidies are ending.
What looks like a loss is also a gift.

A gift of knowing
what anything was
costing all along.
And of choosing, again,
what is worth running.

(Larger reverberations too — through organizations, through societies, through what we have been calling “cheap”…)