AI subscriptions, coming due.
The Subsidy Era.
What’s being unwound. What it means for the people who pay. What to do.
The flat-rate era is ending. The metered era is starting.
Six moves. Five vendors. Four weeks.
Why $200 a month is smaller than what one heavy day costs.
The three layers where teams are bringing the cost down.
Same spend, very different output — the variance most teams don’t measure.
@kamilkrauspe
One heavy day of agentic coding burns through more than the whole monthly plan covers.
The rest is being paid by someone — and that is what is being unwound.
Six moves. Five vendors. About four weeks.
March
Google
Antigravity restructured to AI Credits. AI Pro weekly token usage cut ~97%.
April 2
OpenAI
Codex billing for paid plans moved from credits-per-message to credits-per-token. New $100 Pro tier between Plus and the $200 Pro.
April 9
Amazon
Jassy shareholder letter: AWS still has “capacity constraints that yield unserved demand” despite +3.9 GW added in 2025.
April 21
Anthropic
Claude Code briefly restricted to Max plans on the pricing page. Reversed within a day.
April 23
Anthropic
Postmortem for Claude Code quality issues. Statement to Fortune: “compute is a constraint across the entire industry.”
April 27
GitHub
All Copilot plans move to AI-Credit billing on June 1. Input, output, and cached tokens all metered.
Variance per request, 2026.
Per-session token consumption across six common workloads.
Inline autocomplete (one keystroke): < 1K tokens, < $0.01
Single chat prompt (no project context): < 10K tokens, < $0.05
Targeted file edit (short multi-step): < 100K tokens, < $0.30
Interactive coding (30–90 min session): < 10M tokens, < $10
Long autonomous run (1–4 hours): < 100M tokens, < $80
Multi-agent / background (production fleets): up to 1B tokens, $500+
A 100,000× spread from lightest to heaviest workload.
Documented heavy-use sessions
52.5M tokens in 38 minutes — Cursor agent replaying a 120K context window in a tool loop.
$300 / day — Jason Calacanis running 8 AI agents, “a fraction of their potential capacity” — All-In Podcast, Feb 2026.
$1,000 / wk — single Replit Agent 3 user spike, vs. their typical $180–200 / month baseline — Q1 2026.
Sonnet 4.6 list rates. Cache reads dominate at scale (especially in agent loops). Opus ~1.5–1.7× higher. Ranges are estimates; x-axis is logarithmic.
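The note that cache reads dominate at scale can be made concrete. A minimal sketch, assuming illustrative Sonnet-class rates (cache read $0.30/M, fresh input $3.00/M, output $15.00/M) and an assumed 90/8/2 split between cache reads, fresh input, and output; all numbers are estimates, not any vendor's actual pricing:

```python
# Estimate the cost of a heavy agent session where cache reads dominate.
# Rates ($ per million tokens) and the token split are illustrative assumptions.
RATES = {"cache_read": 0.30, "fresh_input": 3.00, "output": 15.00}

def session_cost(total_tokens: float, split: dict) -> float:
    """Blend per-category rates over the session's token mix."""
    millions = total_tokens / 1e6
    return sum(millions * share * RATES[kind] for kind, share in split.items())

# A 100M-token autonomous run: ~90% cache reads, 8% fresh input, 2% output.
cost = session_cost(100e6, {"cache_read": 0.90, "fresh_input": 0.08, "output": 0.02})
print(f"${cost:.0f}")  # roughly $80 at these assumed rates
```

Under these assumptions the blend lands near the table's "< $80" figure for a long autonomous run; shift the mix toward output tokens and the cost rises quickly.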
A real heavy user costs more to serve than the plan covers.
What you pay: $200 flat subscription, per month.
What heavy use actually costs: $1,000–$4,400 inference at list rates, per month.
The plan covers about 5–20% of the inference. Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7 across the range. Figures are estimates.
It isn’t only about the money.
Heavy use overshoots the plan in dollars. It also overshoots it in capacity.
4× tokens per request for heavy users (P90), year over year.
Pricing changes aren’t only cost recovery. They’re also rationing of a scarce resource.
Datadog State of AI Engineering, 2026.
Where could the subsidy be coming from?
1 · Investor capital
OpenAI projects $665B cumulative cash burn 2026–30 against active fundraising at $852B post-money. Whether that capital flows to subscription subsidy, capex, R&D, or talent is not publicly disclosed.
2 · Below-market compute
Hyperscalers with equity in labs can price compute below market — protecting the equity by supporting the burn. Amazon’s $16.8B Q1 gain on Anthropic and the multi-GW Broadcom–Google TPU deal fit the shape. Pricing undisclosed.
3 · Light-user cross-subsidy
In a flat plan, light users consume far less than they pay for — covering much of the heavy minority’s overuse. Anthropic flagged this in July 2025 (<5% drive most use). Agentic usage has grown sharply since.
4 · PAYG → flat-rate subsidy
If PAYG margins exceed serving cost, they could subsidize flat-rate plans, ration scarce capacity, or both. No vendor discloses unit economics by billing model.
The subsidy gap is real, and it is being closed.
So what can we do about it?
The same lever shows up at three layers. Cost, throughput, and capacity move together.
1 · Inference
The unit of inference: architecture, runtime, cache mechanics.
Improves through: labs competing on capability and $/M.
Your hand: which model you reach for; prefix discipline.
2 · Harness
The system around the model: context, tools, orchestration.
Improves through: tool builders competing; engineering teams iterating.
Your hand: context engineering; which harness; agent and tool design.
3 · Use
The choices made turn-by-turn: which model, when to escalate, when to stop.
Improves through: you, deliberately.
Your hand: all of it.
As prices become visible, layer 3 — the one only you can move — matters most.
Layer 1 — Inference
What does a million tokens cost to make?
Self-hosted GPT-OSS-120B on an H100 80GB:
rent / throughput = ~$0.17 per million tokens
A simplified single-configuration example. Real costs vary.
Throughput is the denominator. Push it up, the cost falls.
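The arithmetic behind the ~$0.17 figure, as a sketch with assumed numbers (roughly $2.50/hour H100 rental and ~4,000 tokens/second of aggregate batched throughput; both vary widely in practice):

```python
def cost_per_million(rent_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens on rented hardware."""
    tokens_per_hour = tokens_per_second * 3600
    return rent_per_hour / tokens_per_hour * 1_000_000

# Assumed: $2.50/hr H100 rental, ~4,000 tok/s aggregate batched throughput.
print(f"${cost_per_million(2.50, 4000):.2f} per million tokens")  # $0.17

# Throughput is the denominator: double it and the cost halves.
print(f"${cost_per_million(2.50, 8000):.2f} per million tokens")  # $0.09
```

Every technique on the next slide works by raising that denominator on fixed hardware.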
Layer 1 — Inference
How throughput keeps growing.
Techniques in use, by category. Each one pushes more useful tokens through the same hardware.
Attention
MLA: KV cache 10× smaller.
Sparse attention: skip irrelevant tokens. NSA, DSA, TriAttention.
MQA / GQA: fewer KV heads, denser batches.
Sliding-window attention: attention only over recent tokens.
FlashAttention 3/4: kernel-level wins.
Linear / sub-quadratic: Mamba, RWKV, RetNet.
Serving runtime
vLLM, SGLang, TensorRT-LLM: 20–40% throughput swings.
Paged attention: virtual KV memory.
Speculative decoding: 2–4× decode.
KV eviction: LMCache, Dynamo, TurboQuant.
Disaggregated decode: prefill and decode phases optimized separately.
Model strategy
Mixture-of-Experts: DeepSeek 671B/37B, GPT-OSS 117B/5.1B.
Distillation: capability per parameter.
Hybrid SSM: Nemotron-H, Qwen3, Hunyuan-TurboS.
Quantization: MXFP4, NVFP4, INT4. Bandwidth wins.
MoD / MoR: dynamic compute.
Layer 2 — Harness
Every turn carries the previous turn.
In Claude Code, Cursor, Codex, Copilot — every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything before.
Turn 1 — system prompt + tools + MCP defs + user ask: fix failing tests. ~30K tokens. Returns: grep "test_users".
Turn 2 — Turn 1 + grep call + ~50 hits. ~32K. Returns: read_file("test_users.py").
Turn 3 — Turn 2 + ~300 lines of file content. ~35K. Returns: edit proposal.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
(Chart: cached vs. uncached context tokens per turn, turn 1 → turn 50.)
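The accumulation above can be sketched numerically. Assuming a ~30K starting context and ~2.5K of new tool output per turn (both illustrative figures, matching the slide's example):

```python
def session_totals(turns: int = 50, base: int = 30_000, per_turn_growth: int = 2_450):
    """Each turn resends everything before it, so per-turn context grows
    linearly while cumulative billed tokens grow quadratically."""
    contexts = [base + per_turn_growth * t for t in range(turns)]
    return contexts[-1], sum(contexts)

final_context, cumulative = session_totals()
print(f"turn 50 context: ~{final_context:,} tokens")   # ~150K
print(f"cumulative: ~{cumulative / 1e6:.1f}M tokens")  # several million
```

With prompt caching, most of that cumulative volume is billed at cache-read rather than fresh-input rates, which is why cache discipline dominates the economics of the loop.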
Layer 2 — Harness
How harnesses get cheaper.
Wins on this layer compound. A 3× edit-format gain and a 10× cache gain don't add — they multiply.
Reduce the loop
Programmatic tool calling: model writes code that calls tools; intermediate results stay outside context. (Anthropic case: 150K → 2K.)
JIT MCP discovery: Cursor reports −46.9% on MCP runs.
Sub-agents with scoped contexts: summaries to the planner, not raw transcripts.
Tools return content, not IDs: avoid the lookup-by-handle round-trip.
Cache discipline
Cache miss penalty: each miss forces a fresh write — 10–20× the cost of a hit.
5-min vs. 1-hour TTL: pick by session length; idle time drains short TTLs.
Stable prefixes: anything that varies kills the cache; keep dynamic content out of the cached portion.
Tool & format design
Edit format matters: Aider measured 3× output tokens for whole-file vs. diff edits.
70/20/10 routing: triage with Haiku, escalate to Sonnet, reserve Opus.
Error messages as instructions: “Date format must be YYYY-MM-DD” beats a stack trace.
Reasoning vs. instant mode: thinking tokens count; choose deliberately.
Workflow tools, not API endpoints: one tool with the right shape beats five raw ones.
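The cache-discipline point can be quantified. A sketch assuming illustrative rates where a cache read costs $0.30/M and a cache write $3.75/M (a 12.5× miss penalty, within the 10–20× range above; not any vendor's quoted prices):

```python
CACHE_READ = 0.30   # $/M tokens, assumed illustrative rate
CACHE_WRITE = 3.75  # $/M tokens, assumed illustrative rate

def effective_input_rate(hit_rate: float) -> float:
    """Blended $/M for prompt input at a given cache hit rate."""
    return hit_rate * CACHE_READ + (1 - hit_rate) * CACHE_WRITE

for hit in (0.50, 0.90, 0.99):
    print(f"hit rate {hit:.0%}: ${effective_input_rate(hit):.2f}/M")
```

At 90% hits the blended rate is about $0.65/M against $3.75/M fully uncached, roughly a 6× saving; dropping from 90% to 50% hits roughly triples the blended rate, which is why unstable prefixes are expensive.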
Layer 3 — Users
Same spend, different productivity.
Team A: $5,000/month spend, $35 cost per merged PR.
Team B: $5,000/month spend, $120 cost per merged PR.
The variance is not in the price. It's in which model people reach for, when they escalate, and when they stop.
This is the layer that does not get cheaper on its own.
Source: Vantage agentic-coding cost analysis, April 2026. Across customer fleets, ~10× cost-per-developer variance within the same team is typical.
Layer 3 — Users
How users get more from the same spend.
The variance closes through three disciplines — how teams specify, how they structure work, how they measure outcomes.
Prompting
Outcome-first prompts
Describe outcomes; drop process scaffolding.
Literal instruction
Won’t infer what you don’t write.
Explicit stop conditions
Fewest useful tool loops; correctness outranks brevity.
Workflow
Subagent context isolation
Fresh context windows; only summary returns to parent.
Task budgets
Advisory token ceiling; model paces itself.
Parallel worktrees
3–5 concurrent sessions; throughput over per-session cost.
Measurement
Cost per merged pull request
The highest spender can be the lowest cost-per-PR.
Vibe-then-verify
AI generates; humans gate before commit.
Tenure playbooks
Experienced users converge on different patterns.
Sources: OpenAI prompt guidance (developers.openai.com); Anthropic Claude 4.7 migration & Claude Code docs (platform.claude.com, code.claude.com); Vantage agentic-coding cost analysis; Sonar State of Code 2026; Anthropic Economic Index, March 2026 report.
What if you paid by the token?
A $200 subscription is 1% of a US senior engineer's cost and 3% of a Slovak senior engineer's. Actual heavy use at PAYG rates tells a different story.
Region (senior engineer, fully loaded) · $/month · Sonnet PAYG · Opus PAYG
United States · $20,000 · ~10% · ~17%
Germany · $13,000 · ~15% · ~27%
Slovakia · $7,000 · ~30% · ~50%
The substitution math holds at the flat-plan level. It tightens fast outside top-tier US compensation as soon as you pay by the token.
Engineer compensation: fully-loaded employer cost (base + bonus + equity + benefits + payroll overhead). US (Bay Area / NYC big-tech tier), Germany (Berlin / Munich senior), Slovakia (Bratislava senior). Sources: Levels.fyi, ERI SalaryExpert. AI burn: Sonnet 4.6 / Opus 4.7 list rates × 75M tokens/active day × 90% cache hit × 22 working days.
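The table's percentages can be reproduced approximately from the footnote's assumptions (75M tokens per active day, 22 working days, 90% cache hit), plus an assumed blended rate of about $1.20 per million tokens for a Sonnet-class mix (an estimate, not a quoted price):

```python
def monthly_burn(tokens_per_day: float, working_days: int, blended_rate_per_m: float) -> float:
    """Monthly PAYG inference spend at a blended $/M rate."""
    return tokens_per_day * working_days / 1e6 * blended_rate_per_m

def comp_share(burn: float, monthly_comp: float) -> float:
    """AI burn as a fraction of fully loaded monthly compensation."""
    return burn / monthly_comp

burn = monthly_burn(75e6, 22, 1.20)  # ~$1,980/month at the assumed blend
for region, comp in [("US", 20_000), ("Germany", 13_000), ("Slovakia", 7_000)]:
    print(f"{region}: {comp_share(burn, comp):.0%} of fully loaded comp")
```

At these assumptions the US share lands near the table's ~10% Sonnet figure; the Opus column follows from the higher rate multiplier.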
What we called ‘cheap’was hiding everywhere.
Per-token prices fell ~280× at fixed quality. Enterprise GenAI spend tripled to $37B in 2025. Cheaper tokens — more than offset by bigger workloads.
~280× per-token price decline over the same period.
Enterprise GenAI spend: $2.3B (2023) → $11.5B (2024) → $37B (2025).
FinOps practitioners actively managing AI spend — State of FinOps, FinOps Foundation.
Frontier capability rose over the same period. Cheaper tokens, also smarter ones.
Bigger workloads more than offset cheaper tokens — reasoning models, multi-agent systems, longer-running loops.
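Taking the slide's own numbers at face value, the implied growth in token volume follows directly from spend = volume × price. A sketch (illustrative arithmetic on the stated figures, not independently sourced data):

```python
# spend = volume x price, so volume growth = spend growth x price decline.
spend_2023, spend_2025 = 2.3e9, 37e9  # enterprise GenAI spend, from the slide
price_decline = 280                   # ~280x per-token price drop, same period

spend_growth = spend_2025 / spend_2023        # ~16x
volume_growth = spend_growth * price_decline  # ~4,500x
print(f"implied token volume growth: ~{volume_growth:,.0f}x")
```

An implied ~4,500× growth in token volume is the mechanism by which bigger workloads more than offset cheaper tokens.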
Visible prices may rise even as inference cost falls. The subsidy was holding visible price below cost.
What was always there is becoming visible.
Sources: Menlo Ventures, 2025 State of Generative AI in the Enterprise (Dec 2025); Stanford HAI, 2025 AI Index Report (Apr 2025); FinOps Foundation, State of FinOps 2026.
The subsidies are ending. What looks like a loss is also a gift —
A gift of knowing what anything was costing all along.
And of choosing, again, what is worth running.
(Larger reverberations too — through organizations, through societies, through what we have been calling “cheap”…)