Same plan.
Same user.
100,000× the bill.

The Subsidy Era — AI subscriptions, coming due.

@kamilkrauspe
A heavy day can break the economics
of a $200 plan.
One heavy agentic-coding day burns more inference
than the monthly plan covers.
The rest is paid by someone — until it gets metered.

Four weeks. Five moves.
One direction.

March · Google: Antigravity restructured to AI Credits; AI Pro effective token usage cut sharply.
April 2 · OpenAI: Codex billing moved from credits-per-message to credits-per-token.
April 21 · Anthropic: Claude Code briefly restricted to Max plans. Reversed within a day.
April 23 · Anthropic postmortem: “compute is a constraint across the entire industry.”
April 27 · GitHub: all Copilot plans move to AI Credits. Input, output, cached tokens all metered.

Agentic work doesn’t have
normal SaaS variance.

Inline autocomplete (one keystroke): < 1K tokens · < $0.01
Single chat prompt (no project context): < 10K · < $0.05
Targeted file edit (short multi-step): < 100K · < $0.30
Interactive coding (30–90 min session): < 10M · < $10
Long autonomous run (1–4 hours): < 100M · < $80
Multi-agent / background (production fleets): up to 1B · $500+
~100,000× spread
52.5M tokens · 38 minutes
Cursor agent replaying a 120K context window in a tool loop.
Sonnet 4.6 list rates. x-axis logarithmic. Estimates.

A real heavy user
costs more to serve
than the plan covers.

What you pay
$200
flat subscription, per month
What heavy use actually costs
$1,000–$4,400
inference at list rates, per month
The plan covers about 5–20% of the inference.
Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7. Estimates.
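A back-of-envelope sketch of where that range comes from. All rates and the input/output mix are illustrative assumptions (Sonnet-class pricing assumed at $3 / $0.30 / $15 per million input / cache-read / output tokens), not quoted prices:

```python
# Back-of-envelope: heavy user at 100M tokens per active day, ~90% cache hit.
# RATE values are illustrative assumptions (USD per million tokens), not quotes.
RATE = {"input": 3.00, "cache_read": 0.30, "output": 15.00}

def monthly_cost(tokens_per_day=100e6, active_days=20,
                 input_share=0.95, cache_hit=0.90, rate=RATE):
    inp = tokens_per_day * input_share          # tokens read into context
    out = tokens_per_day * (1 - input_share)    # tokens the model generates
    daily = (inp * cache_hit * rate["cache_read"]
             + inp * (1 - cache_hit) * rate["input"]
             + out * rate["output"]) / 1e6
    return daily * active_days

print(round(monthly_cost()))  # → 2583, inside the $1,000–$4,400 band
```

Varying the cache hit rate, the output share, and the model tier is what spreads the single estimate into a range.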

It’s not only about the money.

Heavy use overshoots the plan in dollars.
It also overshoots it in capacity.

Tokens per request, heavy users (P90), year over year.
Pricing changes aren’t only cost recovery.
They’re also rationing of a scarce resource.
Datadog State of AI Engineering, 2026.

The cost moves at three layers.

1
Inference
The unit cost of generating tokens.
2
Harness
The loop around the model.
3
Use
The choices people make in the loop.
The first two improve over time. The third needs discipline.
Layer 1 — Inference

Throughput is the denominator.

GPU rent ÷ tokens per hour = $ per million tokens
Example: $3/hour ÷ 18M tokens/hour ≈ $0.17/MT.
Self-hosted GPT-OSS-120B on H100 80GB. Single-config example.
More throughput
vLLM · SGLang · TensorRT-LLM
Less active compute
MoE · quantization · distillation
Less memory pressure
KV compression · MLA · paged attention
Push throughput up, unit cost falls.
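The formula above, as code. Numbers are the slide’s single-config example:

```python
# Unit cost of inference: GPU rent divided by throughput.
def dollars_per_million_tokens(gpu_rent_per_hour: float, tokens_per_hour: float) -> float:
    return gpu_rent_per_hour / (tokens_per_hour / 1e6)

# The slide's example: $3/hour for the GPU, 18M tokens/hour sustained.
print(round(dollars_per_million_tokens(3.0, 18e6), 2))  # → 0.17
```

Every item on the list above attacks one of the two terms: serving stacks raise tokens per hour; MoE, quantization, and KV tricks let the same GPU hour push more tokens through.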
Layer 2 — Harness

Every turn carries the previous turn.

Claude Code, Cursor, Codex, Copilot: every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything before it.

Turn 1 — system prompt + tools + user ask. ~30K tokens.
Turn 2 — Turn 1 + grep results. ~32K.
Turn 3 — Turn 2 + file contents. ~35K.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
[Chart: cached vs. uncached tokens across turns 1–50 · ≈3.5M tokens cumulative · ~90% cached]
Caching makes replayed tokens cheaper. It does not make them disappear.
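The turn-by-turn growth can be sketched directly. The 30K and 150K endpoints are the slide’s; the geometric growth curve between them is an assumption:

```python
# Each call replays the full running context, so session cost is the SUM of
# per-turn contexts, not the final context size.
def cumulative_tokens(turns=50, start=30_000, end=150_000):
    total = 0.0
    for t in range(turns):
        # geometric interpolation between the first and last turn's context
        context = start * (end / start) ** (t / (turns - 1))
        total += context
    return total

print(f"{cumulative_tokens() / 1e6:.1f}M tokens replayed")  # → 3.7M
```

Roughly 25× the final 150K context, which is why per-session cost surprises people who only look at the context window.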
Layer 2 — Harness

How harnesses get cheaper.

Wins on this layer compound.

Reduce the loop
Programmatic tool calling
JIT context discovery
Scoped subagents
Preserve the cache
Stable prefixes
TTL-aware sessions
Dynamic content outside cache
Shape the work
Diff/edit formats
Model routing
Workflow-shaped tools
Anthropic case: 150K → 2K. Cursor: −46.9% on MCP runs. Aider: whole-file edits cost ~3× diff output.
This is context engineering in practice.
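The cache items are worth making concrete. A hedged sketch of the input-side cost of a ~3.5M-token session: the 10× cache-read discount is an assumption modeled on typical prompt-caching price lists, and cache-write surcharges are ignored for simplicity:

```python
# Input-side cost of a session: tokens that hit the cache bill at a discount,
# tokens that miss bill at the full input rate.
INPUT_RATE, CACHE_READ_RATE = 3.00, 0.30  # assumed USD per million tokens

def session_input_cost(total_tokens: float, cache_hit: float) -> float:
    cached = total_tokens * cache_hit
    missed = total_tokens - cached
    return (cached * CACHE_READ_RATE + missed * INPUT_RATE) / 1e6

print(round(session_input_cost(3.5e6, 0.90), 2))  # warm cache: about $2
print(round(session_input_cost(3.5e6, 0.00), 2))  # cold cache: $10.50
```

Roughly a 5× gap on the input side alone. In prefix-based caching schemes, a change early in the prompt invalidates everything after it, which is why stable prefixes and TTL-aware sessions top the list.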
Layer 3 — Use

Same spend.
Different productivity.

Team A
Spend: $5,000/month
$35
cost per merged PR
Team B
Spend: $5,000/month
$120
cost per merged PR
The variance is not only in price. It’s in which model people reach for, when they escalate, and when they stop.
Specify · Control · Measure
This is the layer
that does not get cheaper on its own.
Vantage agentic-coding cost analysis, April 2026. ~10× cost-per-developer variance within a single team is typical.

Layer 1 sets the rate card. Layer 2 sets the loop. Layer 3 sets the denominator.

What was free was never free.