Your AI subscription is subsidized.
The subsidy is being pulled.

The Subsidy Era — AI subscriptions, coming due.

A heavy day can break the economics of a $200 plan.

One heavy agentic-coding day burns more inference than the monthly plan covers.
The rest is paid by someone — until it gets metered.

Four weeks. Five moves.
One direction.

March, Google: Antigravity restructured to AI Credits; AI Pro effective token usage cut sharply.
April 2, OpenAI: Codex billing moved from credits-per-message to credits-per-token.
April 21, Anthropic: Claude Code briefly restricted to Max plans. Reversed within a day.
April 23, Anthropic: postmortem notes “compute is a constraint across the entire industry.”
April 27, GitHub: all Copilot plans move to AI Credits. Input, output, and cached tokens all metered.

Agentic work doesn’t have normal SaaS variance.

Inline autocomplete (one keystroke): < 1K tokens, < $0.01
Single chat prompt (no project context): < 10K tokens, < $0.05
Targeted file edit (short multi-step): < 100K tokens, < $0.30
Interactive coding (30–90 min session): < 10M tokens, < $10
Long autonomous run (1–4 hours): < 100M tokens, < $80
Multi-agent / background (production fleets): up to 1B tokens, $500+
That is a ~100,000× spread in tokens per session.
Callout: 52.5M tokens in 38 minutes, a Cursor agent replaying a 120K context window in a tool loop.
Sonnet 4.6 list rates. Estimates.

A real heavy user costs more to serve than the plan covers.

What you pay: $200 flat subscription, per month.
What heavy use actually costs: $1,000–$4,400 in inference at list rates, per month.
The plan covers about 5–20% of the inference.
Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7. Estimates.
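
A back-of-envelope sketch of that band. The 22 active days, the 95/5 input/output split, and the per-million rates ($3 uncached input, $0.30 cache read, $15 output) are assumptions layered on the slide’s inputs, not published pricing:

```python
# Rough monthly inference cost for the heavy user described above.
# All rates and splits below are assumptions, not published pricing.
TOKENS_PER_DAY = 100e6     # slide: tokens per active day
ACTIVE_DAYS = 22           # assumption: active days per month
CACHE_HIT = 0.90           # slide: ~90% cache hit
OUTPUT_SHARE = 0.05        # assumption: share of tokens that are output

INPUT_RATE = 3.00 / 1e6    # $/token, uncached input (assumed)
CACHED_RATE = 0.30 / 1e6   # $/token, cache read (assumed)
OUTPUT_RATE = 15.00 / 1e6  # $/token, output (assumed)

monthly = TOKENS_PER_DAY * ACTIVE_DAYS
out_tokens = monthly * OUTPUT_SHARE
in_tokens = monthly - out_tokens

cost = (in_tokens * CACHE_HIT * CACHED_RATE          # cached replay
        + in_tokens * (1 - CACHE_HIT) * INPUT_RATE   # fresh input
        + out_tokens * OUTPUT_RATE)                  # generated output
print(f"{monthly / 1e9:.1f}B tokens/month -> ~${cost:,.0f} at list rates")
```

Under these assumptions it prints ~$2,800/month, inside the $1,000–$4,400 band.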

It’s not only about the money.

Heavy use overshoots the plan in dollars.
It also overshoots it in capacity.

Chart: tokens per request, heavy users (P90), year over year.
Pricing changes aren’t only cost recovery.
They’re also rationing of a scarce resource.
Datadog State of AI Engineering, 2026.

The cost moves at three layers.

1. Inference — the unit cost of generating tokens.
2. Harness — the loop around the model.
3. Use — the choices people make in the loop.
The first two improve over time. The third needs discipline.
Layer 1 — Inference

Cheaper tokens come from throughput.

$/M tokens = GPU rent ÷ tokens per hour
Example: $3/hour ÷ 18M tokens/hour ≈ $0.17/MT.
Self-hosted GPT-OSS-120B on H100 80GB. Single-config example.
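
The same arithmetic in code; the tripled throughput in the second call is purely illustrative of what the techniques below buy, not a measured gain:

```python
def dollars_per_million(gpu_rent_per_hour: float, tokens_per_hour: float) -> float:
    """$/M tokens = hourly GPU rent / hourly token throughput."""
    return gpu_rent_per_hour / tokens_per_hour * 1e6

print(dollars_per_million(3.0, 18e6))      # slide config: ~$0.17/M tokens
print(dollars_per_million(3.0, 3 * 18e6))  # if throughput tripled: ~$0.06/M
```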
Attention & Memory
MLA (KV cache 10× smaller)
Sparse attention (NSA, DSA, TriAttention)
KV eviction (LMCache, Dynamo, TurboQuant)
Model Strategy
MoE (DeepSeek 671B/37B, GPT-OSS 117B/5.1B)
Hybrid SSM (Nemotron-H, Qwen3, Hunyuan-TurboS)
MXFP4 / NVFP4 quantization
Serving
Speculative decoding (2–4× decode)
Disaggregated prefill/decode
MoD / MoR (dynamic compute)
Push throughput up, unit cost falls.
Layer 2 — Harness

Every turn carries the previous turn.

In Claude Code, Cursor, Codex, Copilot — every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything before.

Turn 1 — system prompt + tools + user ask. ~30K tokens.
Turn 2 — Turn 1 + grep results. ~32K.
Turn 3 — Turn 2 + file contents. ~35K.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
Chart: cumulative tokens from turn 1 to turn 50, ≈3.5M total, ~90% of them cached.
Caching makes replayed tokens cheaper. It does not make them disappear.
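
A small sketch of why the cumulative number lands in the millions. The quadratic growth curve is an assumed shape fitted to the slide’s endpoints, nothing more:

```python
# Cumulative tokens across a 50-turn loop where every turn replays
# the full running context. Endpoints from the slide (~30K -> ~150K);
# the quadratic growth shape is an assumption, not measured data.
TURNS, START, END, CACHE_HIT = 50, 30_000, 150_000, 0.90

contexts = [START + (END - START) * (i / (TURNS - 1)) ** 2 for i in range(TURNS)]

total = sum(contexts)        # every turn re-sends everything before it
cached = total * CACHE_HIT   # slide: ~90% of replayed tokens are cache reads
print(f"cumulative ~{total / 1e6:.1f}M tokens, ~{cached / 1e6:.1f}M as cache reads")
```

That reproduces the slide’s ≈3.5M estimate without any single request exceeding 150K tokens.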
Layer 2 — Harness

How harnesses get cheaper.

Wins on this layer compound. A 3× edit-format gain and a 10× cache gain don’t add — they multiply.

Reduce the loop
Programmatic tool calling: Anthropic case, 150K → 2K context.
JIT MCP discovery: Cursor, −46.9% on MCP runs.
Scoped subagents: summaries to the planner, not raw transcripts.
Protect the cache
Cache miss penalty: each miss costs 10–20× a hit (see the sketch below).
Stable prefixes: anything dynamic kills the cache.
TTL-aware sessions: 5-minute vs. 1-hour TTL, chosen by session length.
Shape the work
Edit format matters: Aider, 3× the tokens for whole-file vs. diff edits.
70/20/10 routing: triage on Haiku → escalate to Sonnet → reserve Opus.
Workflow tools, not API endpoints: one right-shaped tool beats five raw ones.
This is context engineering in practice.
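
A minimal sketch of that cache-miss penalty, assuming a 10× spread between uncached input and cache-read rates (illustrative list prices, not any provider’s published ones):

```python
# Effective input price as a function of cache hit rate.
# $3.00/M uncached vs. $0.30/M cached is an assumed 10x spread.
INPUT, CACHED = 3.00, 0.30  # $/M tokens

def effective_input_rate(hit_rate: float) -> float:
    """Blended $/M for input tokens at a given cache hit rate."""
    return hit_rate * CACHED + (1 - hit_rate) * INPUT

for hit in (0.0, 0.5, 0.9, 0.99):
    print(f"cache hit {hit:4.0%} -> ${effective_input_rate(hit):.2f}/M input")
```

At 90% hits the blended rate is $0.57/M; a dynamic prefix that breaks caching pushes it back toward $3.00/M, roughly a 5× jump on every subsequent turn.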
Layer 3 — Use

Same spend. Different productivity.

Team A: $5,000/month spend, $35 per merged PR.
Team B: $5,000/month spend, $120 per merged PR.
The variance is not in price. It’s in which model people reach for, when they escalate, and when they stop.
Specify
Outcome-first prompts: describe what good looks like; drop process scaffolding.
Literal instruction: the model won’t generalize what you don’t say.
Explicit stop conditions: the fewest useful tool loops.
Control
Subagent context isolation: fresh context windows; only summaries return.
Task budgets: an advisory token ceiling; the model paces itself.
Escalation rules: small model first; reasoning model only when needed (sketched below).
Measure
Cost per merged PR: the highest spender can be the lowest cost-per-PR.
AI drafts, humans gate: generate freely; verify before commit.
Experienced-user playbooks: senior users converge on different patterns.
This layer does not get cheaper on its own.
Vantage agentic-coding cost analysis, April 2026. ~10× cost-per-developer variance within a single team is typical.
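
The escalation rule and the metric, together, as a sketch. The complexity score, the thresholds, and the model tiers are illustrative assumptions, not anyone’s published policy:

```python
# Layer-3 discipline as code: a 70/20/10 escalation rule plus the
# metric that makes it visible. Assumes some upstream complexity
# score in [0, 1]; thresholds and tier names are illustrative.
def pick_model(task_complexity: float) -> str:
    """Small model first; escalate only when the task demands it."""
    if task_complexity < 0.7:
        return "haiku"   # triage tier: most tasks stop here
    if task_complexity < 0.9:
        return "sonnet"  # escalation tier: harder, multi-step work
    return "opus"        # reserve tier: the hardest fraction

def cost_per_merged_pr(monthly_spend: float, merged_prs: int) -> float:
    """The denominator that separates Team A from Team B."""
    return monthly_spend / merged_prs

print(f"Team A: ${cost_per_merged_pr(5_000, 143):.0f}/PR")  # ~$35
print(f"Team B: ${cost_per_merged_pr(5_000, 42):.0f}/PR")   # ~$120
```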

Layer 1 sets the rate card. Layer 2 sets the loop. Layer 3 sets the denominator.

The subsidies are ending.
What looks like a loss is a gift —
knowing what the work cost all along,
and choosing what is worth running.
@kamilkrauspe