Same plan.
Same user.
100,000× the bill.
The Subsidy Era — AI subscriptions, coming due.
@kamilkrauspe
A heavy day can break the economics
of a $200 plan.
$200
One heavy agentic-coding day burns more inference
than the monthly plan covers.
The rest is paid by someone — until it gets metered.
Four weeks. Five moves.
One direction.
March
Google
Antigravity restructured to AI Credits; AI Pro's effective token allowance cut sharply.
April 2
OpenAI
Codex billing moved from credits-per-message to credits-per-token.
April 21
Anthropic
Claude Code briefly restricted to Max plans. Reversed within a day.
April 23
Anthropic
Postmortem: “compute is a constraint across the entire industry.”
April 27
GitHub
All Copilot plans move to AI Credits. Input, output, cached tokens all metered.
Agentic work doesn’t have
normal SaaS variance.
Inline autocomplete (one keystroke) · < 1K tokens · < $0.01
Single chat prompt (no project context) · < 10K tokens · < $0.05
Targeted file edit (short multi-step) · < 100K tokens · < $0.30
Interactive coding (30–90 min session) · < 10M tokens · < $10
Long autonomous run (1–4 hours) · < 100M tokens · < $80
Multi-agent / background (production fleets) · up to 1B tokens · $500+
From 1K to 1B tokens: a ~100,000× spread.
52.5M tokens · 38 minutes
Cursor agent replaying a 120K context window in a tool loop.
Sonnet 4.6 list rates. x-axis logarithmic. Estimates.
A real heavy user
costs more to serve
than the plan covers.
What you pay
$200
flat subscription, per month
What heavy use actually costs
$1,000–$4,400
inference at list rates, per month
The plan covers about 5–20% of the inference.
Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7. Estimates.
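The $1,000–$4,400 band is reproducible with back-of-envelope arithmetic. The rates below are illustrative placeholders, not published Sonnet 4.6 pricing, and the 2% output share and 20 active days are assumptions:

```python
# Back-of-envelope for "what heavy use actually costs".
# All $/M rates here are assumed, not quoted from any provider.

def monthly_inference_cost(
    tokens_per_day: float,      # total tokens processed per active day
    cache_hit: float,           # fraction of input served from prompt cache
    output_share: float,        # fraction of tokens that are model output
    active_days: int,
    input_rate: float = 3.00,   # assumed $/M fresh input tokens
    cached_rate: float = 0.30,  # assumed $/M cached input tokens
    output_rate: float = 15.00, # assumed $/M output tokens
) -> float:
    out = tokens_per_day * output_share
    inp = tokens_per_day - out
    daily = (
        inp * cache_hit * cached_rate / 1e6
        + inp * (1 - cache_hit) * input_rate / 1e6
        + out * output_rate / 1e6
    )
    return daily * active_days

# 100M tokens/active day, ~90% cache hit, 2% output, 20 active days
heavy = monthly_inference_cost(100e6, 0.90, 0.02, 20)
```

Even with a 90% cache hit rate, this lands well above $1,000/month, i.e. several times the $200 plan.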
It’s not only about the money.
Heavy use overshoots the plan in dollars.
It also overshoots it in capacity.
4×
Tokens per request, heavy users (P90), year over year.
Pricing changes aren’t only cost recovery.
They’re also rationing on a scarce resource.
Datadog State of AI Engineering, 2026.
The cost moves at three layers.
1
Inference
The unit cost of generating tokens.
2
Harness
The loop around the model.
3
Use
The choices people make in the loop.
The first two improve over time. The third needs discipline.
Layer 1 — Inference
Throughput is the denominator.
GPU rent
÷
tokens per hour
=
$ per million tokens
Example: $3/hour ÷ 18M tokens/hour ≈ $0.17/MT.
Self-hosted GPT-OSS-120B on H100 80GB. Single-config example.
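The formula above is one division; a minimal sketch, using the slide's own single-config numbers:

```python
# Layer 1 in one line: unit cost = GPU rent / throughput.

def dollars_per_million_tokens(gpu_rate_per_hour: float,
                               tokens_per_hour: float) -> float:
    return gpu_rate_per_hour / tokens_per_hour * 1e6

# the slide's example: $3/hour, 18M tokens/hour
unit_cost = dollars_per_million_tokens(3.0, 18e6)   # ≈ $0.17/MT
```

Doubling throughput at the same rent halves the unit cost, which is why the three levers below all attack the denominator.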
More throughput
vLLM · SGLang · TensorRT-LLM
Less active compute
MoE · quantization · distillation
Less memory pressure
KV compression · MLA · paged attention
Push throughput up, unit cost falls.
Layer 2 — Harness
Every turn carries the previous turn.
Claude Code, Cursor, Codex, Copilot: every coding agent is a loop. A 50-turn session is 50 calls, each resending everything before it.
Turn 1 — system prompt + tools + user ask. ~30K tokens.
Turn 2 — Turn 1 + grep results. ~32K.
Turn 3 — Turn 2 + file contents. ~35K.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
Chart: cached vs uncached tokens per turn, turn 1 → turn 50. Cumulative ≈3.5M tokens, ~90% cached.
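A sketch makes the accumulation concrete. Linear context growth from ~30K to ~150K is an assumption (the ≈3.5M cumulative figure above implies slower early growth), so treat the totals as order-of-magnitude:

```python
# Each turn resends the whole running context, so cumulative tokens
# grow roughly quadratically even when each turn adds only a little.

def session_tokens(turns: int, start_ctx: int, end_ctx: int,
                   cache_hit: float = 0.90) -> tuple[int, float]:
    step = (end_ctx - start_ctx) / (turns - 1)
    contexts = [start_ctx + round(i * step) for i in range(turns)]
    cumulative = sum(contexts)           # tokens billed across the session
    uncached = cumulative * (1 - cache_hit)
    return cumulative, uncached

# 50 turns, context growing ~30K → ~150K as in the walkthrough above
total, fresh = session_tokens(50, 30_000, 150_000)
# total lands in the millions; only ~10% of it is full-price input
```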
Layer 2 — Harness
How harnesses get cheaper.
Wins on this layer compound.
Reduce the loop
Programmatic tool calling
JIT context discovery
Scoped subagents
Preserve the cache
Stable prefixes
TTL-aware sessions
Dynamic content outside cache
Shape the work
Diff/edit formats
Model routing
Workflow-shaped tools
Anthropic case: 150K → 2K. Cursor: −46.9% on MCP runs. Aider: whole-file edits cost ~3× diff output.
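Stable prefixes are the highest-leverage item to check first. A hypothetical comparison, with cache reads assumed at 10% of the fresh-input rate (actual discounts and hit behavior vary by provider):

```python
# One dynamic value (a timestamp, a request id) at the top of the
# system prompt invalidates the cached prefix for every later turn.
# Both rates below are assumed, not any provider's published pricing.

INPUT_RATE = 3.00    # assumed $/M fresh input tokens
CACHED_RATE = 0.30   # assumed $/M cached input tokens (10% of fresh)

def session_input_cost(context_per_turn: int, turns: int,
                       prefix_stable: bool) -> float:
    total = context_per_turn * turns
    hit_rate = 0.90 if prefix_stable else 0.0   # busted prefix → no hits
    cached = total * hit_rate
    fresh = total - cached
    return (cached * CACHED_RATE + fresh * INPUT_RATE) / 1e6

stable = session_input_cost(100_000, 50, prefix_stable=True)    # $2.85
busted = session_input_cost(100_000, 50, prefix_stable=False)   # $15.00
```

Same session, same model, roughly 5× the input bill from one unstable prefix.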
This is context engineering in practice.
Layer 3 — Use
Same spend.
Different productivity.
Team A
Spend: $5,000/month
$35
cost per merged PR
Team B
Spend: $5,000/month
$120
cost per merged PR
The variance is not only in price. It’s in which model people reach for, when they escalate, and when they stop.
Specify
·
Control
·
Measure
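"Measure" can start as a single division. The PR counts below are backed out from the slide's $35 and $120 figures, not taken from the Vantage data:

```python
# Cost per merged PR: the per-outcome metric behind the Team A/B split.
# PR counts are illustrative, reverse-engineered from the slide's figures.

def cost_per_merged_pr(monthly_spend: float, merged_prs: int) -> float:
    return monthly_spend / merged_prs

team_a = cost_per_merged_pr(5_000, 143)   # ≈ $35 per merged PR
team_b = cost_per_merged_pr(5_000, 42)    # ≈ $119 per merged PR
```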
This is the layer
that does not get cheaper on its own.
Vantage agentic-coding cost analysis, April 2026. ~10× cost-per-developer variance within a single team is typical.
Layer 1 sets the rate card.
Layer 2 sets the loop.
Layer 3 sets the denominator.
What was free was never free.