AI subscriptions, coming due.

The Subsidy Era.

What’s being unwound.
What it means for the people who pay.
What to do.

The flat-rate era is ending.
The metered era is starting.
Six moves. Five vendors. Four weeks.
Why $200 a month is smaller than what one heavy day costs.
The three layers where teams are bringing the cost down.
Same spend, very different output — the variance most teams don’t measure.
One heavy day of agentic coding burns through $200 — more than the whole monthly plan covers.
The rest is being paid by someone — and that is what is being unwound.

Six moves. Five vendors.
About four weeks.

March · Google · Antigravity restructured to AI Credits; AI Pro weekly token usage cut ~97%.
April 2 · OpenAI · Codex billing for paid plans moved from credits-per-message to credits-per-token; a new $100 Pro tier slots between Plus and the $200 Pro.
April 9 · Amazon · Jassy shareholder letter: AWS still has “capacity constraints that yield unserved demand” despite +3.9 GW added in 2025.
April 21 · Anthropic · Claude Code briefly restricted to Max plans on the pricing page; reversed within a day.
April 23 · Anthropic · Postmortem for Claude Code quality issues; statement to Fortune: “compute is a constraint across the entire industry.”
April 27 · GitHub · All Copilot plans move to AI-Credit billing on June 1; input, output, and cached tokens all metered.

Variance per request, 2026.

Per-session token consumption across six common workloads.

Inline autocomplete · one keystroke · < 1K tokens · < $0.01
Single chat prompt · no project context · < 10K · < $0.05
Targeted file edit · short multi-step · < 100K · < $0.30
Interactive coding · 30–90 min session · < 10M · < $10
Long autonomous run · 1–4 hours · < 100M · < $80
Multi-agent / background · production fleets · up to 1B · $500+
100,000× spread
Documented heavy-use sessions
52.5M tokens in 38 minutes: a Cursor agent replaying a 120K context window in a tool loop.
$300 / day: Jason Calacanis running 8 AI agents, “a fraction of their potential capacity” (All-In Podcast, Feb 2026).
$1,000 / wk: a single Replit Agent 3 user spike, vs. their typical $180–200 / month baseline (Q1 2026).
Sonnet 4.6 list rates. Cache reads dominate at scale (especially in agent loops). Opus ~1.5–1.7× higher. Ranges are estimates.
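The arithmetic behind these ranges can be sketched in a few lines. Every rate and ratio below is an illustrative assumption (roughly Sonnet-class list pricing), not a vendor-published figure:

```python
# Per-session cost model for the workloads above. Assumed rates, $/M:
# fresh input 3.00, cache reads 0.30, output 15.00. Long loops assumed
# ~95% cache hits and ~2% output share.
RATE_FRESH = 3.00 / 1e6
RATE_CACHED = 0.30 / 1e6
RATE_OUTPUT = 15.00 / 1e6

def session_cost(total_tokens: int, cache_hit: float = 0.95,
                 output_share: float = 0.02) -> float:
    """Estimate session cost from total tokens, cache-hit rate, output share."""
    output = total_tokens * output_share
    non_output = total_tokens - output
    fresh = non_output * (1 - cache_hit)   # cache misses: full-price writes
    cached = non_output * cache_hit        # replays: cheap cache reads
    return fresh * RATE_FRESH + cached * RATE_CACHED + output * RATE_OUTPUT

for name, tokens in [("targeted file edit", 100_000),
                     ("interactive coding", 10_000_000),
                     ("long autonomous run", 100_000_000)]:
    print(f"{name}: ~${session_cost(tokens):,.2f}")
```

Under these assumptions the three sessions land near $0.07, $7, and $73, inside the table’s ranges. The spread comes almost entirely from total tokens, not the rate card.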

A real heavy user costs more to serve than the plan covers.

What you pay: $200, flat subscription, per month.
What heavy use actually costs: $1,000–$4,400, inference at list rates, per month.
The plan covers about 5–20% of the inference. Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7 across the range. Figures are estimates.
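The monthly gap is one multiplication away. The effective blended rates below are assumptions, spanning heavy caching at Sonnet-class pricing to lighter caching at Opus-class pricing:

```python
# 100M tokens per active day × 22 active days ≈ 2.2B tokens / month.
tokens_per_month = 100_000_000 * 22

# Assumed effective blended rates, $ per million tokens:
for label, rate in [("Sonnet-class, heavy caching", 0.45),
                    ("Opus-class, lighter caching", 2.00)]:
    monthly = tokens_per_month / 1e6 * rate
    print(f"{label}: ~${monthly:,.0f} / month vs. a $200 plan")
```

That is the $1,000–$4,400 band above: 5–20× the subscription price.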

It isn’t only about the money.

Heavy use overshoots the plan in dollars.
It also overshoots it in capacity.

Chart: tokens per request, heavy users (P90), year over year.
Pricing changes aren’t only cost recovery.
They’re also rationing on a scarce resource.
Datadog State of AI Engineering, 2026.

Where could the subsidy be coming from?

1 · Investor capital
OpenAI projects $665B cumulative cash burn for 2026–30 while raising at an $852B post-money valuation.
Whether that capital flows to subscription subsidy, capex, R&D, or talent is not publicly disclosed.
2 · Hyperscaler compute
Hyperscalers with equity in labs can price compute below market — protecting the equity by supporting the burn.
Amazon’s $16.8B Q1 gain on Anthropic and the multi-GW Broadcom–Google TPU deal fit the shape. Pricing undisclosed.
3 · Cross-subsidy
In a flat plan, light users consume far less than they pay for — covering much of the heavy minority’s overuse.
Anthropic flagged this in July 2025 (<5% drive most use). Agentic usage has grown sharply since.
4 · PAYG → flat-rate subsidy
If pay-as-you-go (PAYG) prices sit above serving cost, the margin could subsidize flat-rate plans, ration scarce capacity, or both.
No vendor discloses unit economics by billing model.
The subsidy gap is real, and it is being closed.

So what can we do about it?

The same lever shows up at three layers.
Cost, throughput, and capacity move together.

1 · Inference
The unit of inference: architecture, runtime, cache mechanics.
Improves through: labs competing on capability and $/M.
Your hand: which model you reach for; prefix discipline.

2 · Harness
The system around the model: context, tools, orchestration.
Improves through: tool builders competing; engineering teams iterating.
Your hand: context engineering; choice of harness; agent and tool design.

3 · Use
The choices made turn-by-turn: which model, when to escalate, when to stop.
Improves through: you, deliberately.
Your hand: all of it.
As prices become visible, layer 3 — the one only you can move — matters most.
Layer 1 — Inference

What does a million tokens cost to make?

Self-hosted GPT-OSS-120B on an H100 80GB:
Rent: $3 / hour (rental list rate).
Throughput: 18M tokens / hour (vLLM saturated, output tokens).
Rent ÷ throughput ≈ $0.17 per million tokens.
A simplified single-configuration example. Real costs vary.
Throughput is the denominator.
Push it up, the cost falls.
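The whole calculation is one division; the numbers are the single configuration above:

```python
# Self-hosted cost per million output tokens:
# hourly rental rate ÷ millions of tokens produced per hour.
def cost_per_million(rent_per_hour: float, tokens_per_hour: float) -> float:
    return rent_per_hour / (tokens_per_hour / 1e6)

print(cost_per_million(3.0, 18e6))  # H100 at $3/hr, 18M tok/hr → ~$0.17
print(cost_per_million(3.0, 36e6))  # double the throughput → ~$0.08
```

Every technique on the next slide is an attack on that denominator.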
Layer 1 — Inference

How throughput keeps growing.

Techniques in use, by category. Each one pushes more useful tokens through the same hardware.

Attention
MLA: KV cache 10× smaller.
Sparse attention: skip irrelevant tokens (NSA, DSA, TriAttention).
MQA / GQA: fewer KV heads, denser batches.
Sliding-window attention: attend only over recent tokens.
FlashAttention 3/4: kernel-level wins.
Linear / sub-quadratic: Mamba, RWKV, RetNet.

Serving runtime
vLLM, SGLang, TensorRT-LLM: 20–40% throughput swings.
Paged attention: virtual KV memory.
Speculative decoding: 2–4× decode.
KV eviction: LMCache, Dynamo, TurboQuant.
Disaggregated decode: phases optimized separately.

Model strategy
Mixture-of-Experts: DeepSeek 671B/37B, GPT-OSS 117B/5.1B.
Distillation: capability per parameter.
Hybrid SSM: Nemotron-H, Qwen3, Hunyuan-TurboS.
Quantization: MXFP4, NVFP4, INT4; bandwidth wins.
MoD / MoR: dynamic compute.
$0.17 per million tokens is not where it ends. The work of pushing it lower is what this menu describes.
Layer 2 — Harness

Every turn carries the previous turn.

In Claude Code, Cursor, Codex, Copilot — every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything before.

Turn 1 — system prompt + tools + MCP defs + user ask: fix failing tests. ~30K tokens. Returns: grep "test_users".
Turn 2 — Turn 1 + grep call + ~50 hits. ~32K. Returns: read_file("test_users.py").
Turn 3 — Turn 2 + ~300 lines of file content. ~35K. Returns: edit proposal.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
Across the session: ≈3.5M cumulative tokens, ~90% of them cached replays.
Caching makes replayed tokens cheaper. It does not make them disappear.
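A toy model of that accumulation, assuming linear per-turn growth and illustrative cache pricing. Every number here is an assumption, and the totals land near, not exactly on, the figures above:

```python
# 50-turn agent loop: each turn replays the full conversation so far.
# Illustrative rates, $/M tokens: fresh input 3.00, cache read 0.30.
FRESH, CACHED = 3.00 / 1e6, 0.30 / 1e6
GROWTH = 2_450                    # assumed tokens appended per turn

context, cumulative, cost = 30_000, 0, 0.0   # turn 1 starts at ~30K
for turn in range(1, 51):
    replayed = context - GROWTH if turn > 1 else 0   # prior turns: cache reads
    fresh = context - replayed                       # new tokens: fresh writes
    cost += replayed * CACHED + fresh * FRESH
    cumulative += context
    context += GROWTH

print(f"final context: ~{(context - GROWTH) / 1e3:.0f}K tokens")
print(f"cumulative: ~{cumulative / 1e6:.1f}M tokens, ~${cost:.2f} with caching")
print(f"without caching: ~${cumulative * FRESH:.2f}")
```

Caching cuts the bill roughly 8× here, but the millions of replayed tokens are still processed and metered on every turn.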
Layer 2 — Harness

How harnesses get cheaper.

Wins on this layer compound. A 3× edit-format gain and a 10× cache gain don't add — they multiply.

Reduce the loop
Programmatic tool calling: the model writes code that calls tools; intermediate results stay outside context. (Anthropic case: 150K → 2K.)
JIT MCP discovery: Cursor, −46.9% on MCP runs.
Sub-agents with scoped contexts: summaries go to the planner, not raw transcripts.
Tools return content, not IDs: avoid the lookup-by-handle round-trip.

Cache discipline
Cache miss penalty: each miss forces a fresh write — 10–20× the cost of a hit.
5-min vs. 1-hour TTL: pick by session length; idle time drains short TTLs.
Stable prefixes: anything that varies kills the cache; keep dynamic content out of the cached portion.

Tool & format design
Edit format matters: Aider, 3× output tokens for whole-file vs. diff.
70/20/10 routing: triage with Haiku, escalate to Sonnet, reserve Opus (sketched below).
Error messages as instructions: “Date format must be YYYY-MM-DD” beats a stack trace.
Reasoning vs. instant mode: thinking tokens count; choose deliberately.
Workflow tools, not API endpoints: one tool with the right shape beats five raw ones.
The harness is where the biggest cost wins are reported.
This is context engineering in practice.
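Of these, the 70/20/10 routing split is the easiest to sketch. The model tiers, thresholds, and escalation signals below are illustrative assumptions; a real harness routes on richer signals (test failures, confidence, tool-error streaks):

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    failed_attempts: int = 0     # cheap-tier attempts that didn't stick
    cross_cutting: bool = False  # touches many files or subsystems

def route(task: Task) -> str:
    """70/20/10 triage: cheap by default, escalate only on signals."""
    if task.failed_attempts >= 2 or task.cross_cutting:
        return "opus"    # the reserved ~10%
    if task.failed_attempts == 1 or len(task.prompt) > 2_000:
        return "sonnet"  # the escalated ~20%
    return "haiku"       # the default ~70%

print(route(Task("rename this variable")))                        # haiku
print(route(Task("fix the flaky auth test", failed_attempts=1)))  # sonnet
print(route(Task("migrate the schema", cross_cutting=True)))      # opus
```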
Layer 3 — Users

Same spend, different productivity.

Team A: $5,000 / month spend · $35 cost per merged PR.
Team B: $5,000 / month spend · $120 cost per merged PR.
The variance is not in the price. It’s in which model people reach for, when they escalate, and when they stop.
This is the layer that does not get cheaper on its own.
Source: Vantage agentic-coding cost analysis, April 2026. Across customer fleets, ~10× cost-per-developer variance within the same team is typical.
Layer 3 — Users

How users get more from the same spend.

The variance closes through three disciplines — how teams specify, how they structure work, how they measure outcomes.

Prompting
Outcome-first prompts: describe outcomes; drop process scaffolding.
Literal instruction: the model won’t infer what you don’t write.
Explicit stop conditions: fewest useful tool loops; correctness outranks brevity.

Workflow
Subagent context isolation: fresh context windows; only a summary returns to the parent.
Task budgets: an advisory token ceiling; the model paces itself.
Parallel worktrees: 3–5 concurrent sessions; throughput over per-session cost.

Measurement
Cost per merged pull request: the highest spender can be the lowest cost-per-PR (sketched below).
Vibe-then-verify: AI generates; humans gate before commit.
Tenure playbooks: experienced users converge on different patterns.
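The denominator metric is a single division. The PR counts below are back-calculated from the Team A / Team B slide and purely illustrative:

```python
# Cost per merged PR: monthly AI spend ÷ merged pull requests.
def cost_per_merged_pr(monthly_spend: float, merged_prs: int) -> float:
    return monthly_spend / merged_prs if merged_prs else float("inf")

print(f"Team A: ${cost_per_merged_pr(5_000, 143):.0f} per merged PR")  # ~$35
print(f"Team B: ${cost_per_merged_pr(5_000, 42):.0f} per merged PR")   # ~$119
```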
Layer 1 sets the rate card.
Layer 2 sets the loop.
Layer 3 sets the denominator.
Sources: OpenAI prompt guidance (developers.openai.com); Anthropic Claude 4.7 migration & Claude Code docs (platform.claude.com, code.claude.com); Vantage agentic-coding cost analysis; Sonar State of Code 2026; Anthropic Economic Index, March 2026 report.

What if you paid by the token?

A $200 subscription is 1% of a US senior engineer’s monthly cost and 3% of a Slovak senior engineer’s. Heavy actual use at PAYG rates tells a different story.

Region (senior engineer, fully loaded) · $/month · Sonnet PAYG · Opus PAYG
United States · $20,000 · ~10% · ~17%
Germany · $13,000 · ~15% · ~27%
Slovakia · $7,000 · ~30% · ~50%
The substitution math holds at the flat-plan level. It tightens fast outside top-tier US compensation as soon as you pay by the token.
Engineer compensation: fully-loaded employer cost (base + bonus + equity + benefits + payroll overhead). US (Bay Area / NYC big-tech tier), Germany (Berlin / Munich senior), Slovakia (Bratislava senior). Sources: Levels.fyi, ERI SalaryExpert. AI burn: Sonnet 4.6 / Opus 4.7 list rates × 75M tokens/active day × 90% cache hit × 22 working days.
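The footnote’s burn math, reproduced as a sketch. Rates are illustrative Sonnet-class list prices; the output share is an assumption chosen to land near the table:

```python
# 75M tokens / active day × 22 working days × 90% cache hit.
# Assumed rates, $/M: fresh input 3.00, cache read 0.30, output 15.00.
def monthly_burn(tokens_per_day=75e6, days=22, cache_hit=0.90,
                 output_share=0.045):
    per_m = (output_share * 15.00
             + (1 - output_share) * (cache_hit * 0.30 + (1 - cache_hit) * 3.00))
    return tokens_per_day * days / 1e6 * per_m

burn = monthly_burn()
print(f"~${burn:,.0f} / month at PAYG")   # ≈ $2,000, Sonnet-class
for region, comp in [("US", 20_000), ("Germany", 13_000), ("Slovakia", 7_000)]:
    print(f"{region}: {burn / comp:.0%} of fully-loaded monthly cost")
```

Opus-class rates at ~1.7× this reproduce the right-hand column: ~17%, ~27%, ~50%.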

What we called ‘cheap’ was hiding everywhere.

Per-token prices fell ~280× at fixed quality.
Enterprise GenAI spend tripled to $37B in 2025.
Cheaper tokens — more than offset by bigger workloads.

Enterprise GenAI spend: $2.3B (2023) → $11.5B (2024) → $37B (2025), against the ~280× per-token price decline over the same period.
FinOps practitioners actively managing AI spend: 31% (2024) → 63% (2025) → 98% (2026).
State of FinOps, FinOps Foundation.
Frontier capability rose over the same period. Cheaper tokens, also smarter ones.
Bigger workloads more than offset cheaper tokens — reasoning models, multi-agent systems, longer-running loops.
Visible prices may rise even as inference cost falls. The subsidy was holding visible price below cost.
What was always there is becoming visible.
Sources: Menlo Ventures, 2025 State of Generative AI in the Enterprise (Dec 2025); Stanford HAI, 2025 AI Index Report (Apr 2025); FinOps Foundation, State of FinOps 2026.
The subsidies are ending.
What looks like a loss is also a gift.

A gift of knowing what anything was costing all along.
And of choosing, again, what is worth running.

(Larger reverberations too — through organizations, through societies, through what we have been calling “cheap”…)
@kamilkrauspe