AI subscriptions, coming due.
The Subsidy Era.
What’s being unwound. What it means for the people who pay. What to do.
The flat-rate era is ending. The metered era is starting.
Six moves. Five vendors. Four weeks.
Why $200 a month is smaller than what one heavy day costs.
The three layers where teams are bringing the cost down.
Same spend, very different output — the variance most teams don’t measure.
@kamilkrauspe
One heavy day of agentic coding burns through more than the whole monthly plan covers.
The rest is being paid by someone — and that is what is being unwound.
Six moves. Five vendors. About four weeks.
March
Google
Antigravity restructured to AI Credits. AI Pro weekly token usage cut ~97%.
April 2
OpenAI
Codex billing for paid plans moved from credits-per-message to credits-per-token. New $100 Pro tier between Plus and the $200 Pro.
April 9
Amazon
Jassy shareholder letter: AWS still has “capacity constraints that yield unserved demand” despite +3.9 GW added in 2025.
April 21
Anthropic
Claude Code briefly restricted to Max plans on the pricing page. Reversed within a day.
April 23
Anthropic
Postmortem for Claude Code quality issues. Statement to Fortune: “compute is a constraint across the entire industry.”
April 27
GitHub
All Copilot plans move to AI-Credit billing on June 1. Input, output, and cached tokens all metered.
Variance per request, 2026.
Per-session token consumption across six common workloads.
Inline autocomplete (one keystroke): < 1K tokens, < $0.01
Single chat prompt (no project context): < 10K tokens, < $0.05
Targeted file edit (short multi-step): < 100K tokens, < $0.30
Interactive coding (30–90 min session): < 10M tokens, < $10
Long autonomous run (1–4 hours): < 100M tokens, < $80
Multi-agent / background (production fleets): up to 1B tokens, $500+
A 100,000× spread from lightest to heaviest workload.
Documented heavy-use sessions
52.5M tokens in 38 minutes — Cursor agent replaying a 120K context window in a tool loop.
$300 / day — Jason Calacanis running 8 AI agents, “a fraction of their potential capacity” — All-In Podcast, Feb 2026.
$1,000 / wk — single Replit Agent 3 user spike, vs. their typical $180–200 / month baseline — Q1 2026.
Sonnet 4.6 list rates. Cache reads dominate at scale (especially in agent loops). Opus ~1.5–1.7× higher. Ranges are estimates; x-axis is logarithmic.
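The note that cache reads dominate at scale can be made concrete. A minimal sketch, assuming illustrative Sonnet-class rates (cache read $0.30/M, fresh input $3.00/M, output $15.00/M) and an assumed 90/8/2 split between cache reads, fresh input, and output; all numbers are estimates, not any vendor's actual pricing:

```python
# Estimate the cost of a heavy agent session where cache reads dominate.
# Rates ($ per million tokens) and the token split are illustrative assumptions.
RATES = {"cache_read": 0.30, "fresh_input": 3.00, "output": 15.00}

def session_cost(total_tokens: float, split: dict) -> float:
    """Blend per-category rates over the session's token mix."""
    millions = total_tokens / 1e6
    return sum(millions * share * RATES[kind] for kind, share in split.items())

# A 100M-token autonomous run: ~90% cache reads, 8% fresh input, 2% output.
cost = session_cost(100e6, {"cache_read": 0.90, "fresh_input": 0.08, "output": 0.02})
print(f"${cost:.0f}")  # roughly $80 at these assumed rates
```

Under these assumptions the blend lands near the table's "< $80" figure for a long autonomous run; shift the mix toward output tokens and the cost rises quickly.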
A real heavy user costs more to serve than the plan covers.
What you pay: $200 flat subscription, per month.
What heavy use actually costs: $1,000–$4,400 inference at list rates, per month.
The plan covers about 5–20% of the inference. Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7 across the range. Figures are estimates.
It isn’t only about the money.
Heavy use overshoots the plan in dollars. It also overshoots it in capacity.
4× tokens per request for heavy users (P90), year over year.
Pricing changes aren’t only cost recovery. They’re also rationing of a scarce resource.
Datadog State of AI Engineering, 2026.
Where could the subsidy be coming from?
1 · Investor capital
OpenAI projects $665B cumulative cash burn 2026–30 against active fundraising at $852B post-money. Whether that capital flows to subscription subsidy, capex, R&D, or talent is not publicly disclosed.
2 · Below-market compute
Hyperscalers with equity in labs can price compute below market — protecting the equity by supporting the burn. Amazon’s $16.8B Q1 gain on Anthropic and the multi-GW Broadcom–Google TPU deal fit the shape. Pricing undisclosed.
3 · Light-user cross-subsidy
In a flat plan, light users consume far less than they pay for — covering much of the heavy minority’s overuse. Anthropic flagged this in July 2025 (<5% drive most use). Agentic usage has grown sharply since.
4 · PAYG → flat-rate subsidy
If PAYG margins exceed serving cost, they could subsidize flat-rate plans, ration scarce capacity, or both. No vendor discloses unit economics by billing model.
The subsidy gap is real, and it is being closed.
So what can we do about it?
The same lever shows up at three layers. Cost, throughput, and capacity move together.
1 · Inference
The unit of inference: architecture, runtime, cache mechanics.
Improves through: labs competing on capability and $/M.
Your hand: which model you reach for; prefix discipline.
2 · Harness
The system around the model: context, tools, orchestration.
Improves through: tool builders competing; engineering teams iterating.
Your hand: context engineering; which harness; agent and tool design.
3 · Use
The choices made turn-by-turn: which model, when to escalate, when to stop.
Improves through: you, deliberately.
Your hand: all of it.
As prices become visible, layer 3 — the one only you can move — matters most.
Layer 1 — Inference
What does a million tokens cost to make?
Self-hosted GPT-OSS-120B on an H100 80GB:
rent / throughput = ~$0.17 per million tokens
A simplified single-configuration example. Real costs vary.
Throughput is the denominator. Push it up, the cost falls.
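The arithmetic behind the ~$0.17 figure, as a sketch with assumed numbers (roughly $2.50/hour H100 rental and ~4,000 tokens/second of aggregate batched throughput; both vary widely in practice):

```python
def cost_per_million(rent_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens on rented hardware."""
    tokens_per_hour = tokens_per_second * 3600
    return rent_per_hour / tokens_per_hour * 1_000_000

# Assumed: $2.50/hr H100 rental, ~4,000 tok/s aggregate batched throughput.
print(f"${cost_per_million(2.50, 4000):.2f} per million tokens")  # $0.17

# Throughput is the denominator: double it and the cost halves.
print(f"${cost_per_million(2.50, 8000):.2f} per million tokens")  # $0.09
```

Every technique on the next slide works by raising that denominator on fixed hardware.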
Layer 1 — Inference
How throughput keeps growing.
Techniques in use, by category. Each one pushes more useful tokens through the same hardware.
Attention
MLA: KV cache 10× smaller.
Sparse attention: skip irrelevant tokens. NSA, DSA, TriAttention.
MQA / GQA: fewer KV heads, denser batches.
Sliding-window attention: attention only over recent tokens.
FlashAttention 3/4: kernel-level wins.
Linear / sub-quadratic: Mamba, RWKV, RetNet.
Serving runtime
vLLM, SGLang, TensorRT-LLM: 20–40% throughput swings.
Paged attention: virtual KV memory.
Speculative decoding: 2–4× decode.
KV eviction: LMCache, Dynamo, TurboQuant.
Disaggregated decode: prefill and decode phases optimized separately.
Model strategy
Mixture-of-Experts: DeepSeek 671B/37B, GPT-OSS 117B/5.1B.
Distillation: capability per parameter.
Hybrid SSM: Nemotron-H, Qwen3, Hunyuan-TurboS.
Quantization: MXFP4, NVFP4, INT4. Bandwidth wins.
MoD / MoR: dynamic compute.
Layer 2 — Harness
Every turn carries the previous turn.
In Claude Code, Cursor, Codex, Copilot — every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything before.
Turn 1 — system prompt + tools + MCP defs + user ask: fix failing tests. ~30K tokens. Returns: grep "test_users".
Turn 2 — Turn 1 + grep call + ~50 hits. ~32K. Returns: read_file("test_users.py").
Turn 3 — Turn 2 + ~300 lines of file content. ~35K. Returns: edit proposal.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
(Chart: cached vs. uncached context tokens per turn, turn 1 → turn 50.)
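The accumulation above can be sketched numerically. Assuming a ~30K starting context and ~2.5K of new tool output per turn (both illustrative figures, matching the slide's example):

```python
def session_totals(turns: int = 50, base: int = 30_000, per_turn_growth: int = 2_450):
    """Each turn resends everything before it, so per-turn context grows
    linearly while cumulative billed tokens grow quadratically."""
    contexts = [base + per_turn_growth * t for t in range(turns)]
    return contexts[-1], sum(contexts)

final_context, cumulative = session_totals()
print(f"turn 50 context: ~{final_context:,} tokens")   # ~150K
print(f"cumulative: ~{cumulative / 1e6:.1f}M tokens")  # several million
```

With prompt caching, most of that cumulative volume is billed at cache-read rather than fresh-input rates, which is why cache discipline dominates the economics of the loop.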
Layer 2 — Harness
How harnesses get cheaper.
Wins on this layer compound. A 3× edit-format gain and a 10× cache gain don't add — they multiply.
Reduce the loop
Programmatic tool calling: model writes code that calls tools; intermediate results stay outside context. (Anthropic case: 150K → 2K.)
JIT MCP discovery: Cursor reports −46.9% on MCP runs.
Sub-agents with scoped contexts: summaries to the planner, not raw transcripts.
Tools return content, not IDs: avoid the lookup-by-handle round-trip.
Cache discipline
Cache miss penalty: each miss forces a fresh write — 10–20× the cost of a hit.
5-min vs. 1-hour TTL: pick by session length; idle time drains short TTLs.
Stable prefixes: anything that varies kills the cache; keep dynamic content out of the cached portion.
Tool & format design
Edit format matters: Aider measured 3× output tokens for whole-file vs. diff edits.
70/20/10 routing: triage with Haiku, escalate to Sonnet, reserve Opus.
Error messages as instructions: “Date format must be YYYY-MM-DD” beats a stack trace.
Reasoning vs. instant mode: thinking tokens count; choose deliberately.
Workflow tools, not API endpoints: one tool with the right shape beats five raw ones.
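The cache-discipline point can be quantified. A sketch assuming illustrative rates where a cache read costs $0.30/M and a cache write $3.75/M (a 12.5× miss penalty, within the 10–20× range above; not any vendor's quoted prices):

```python
CACHE_READ = 0.30   # $/M tokens, assumed illustrative rate
CACHE_WRITE = 3.75  # $/M tokens, assumed illustrative rate

def effective_input_rate(hit_rate: float) -> float:
    """Blended $/M for prompt input at a given cache hit rate."""
    return hit_rate * CACHE_READ + (1 - hit_rate) * CACHE_WRITE

for hit in (0.50, 0.90, 0.99):
    print(f"hit rate {hit:.0%}: ${effective_input_rate(hit):.2f}/M")
```

At 90% hits the blended rate is about $0.65/M against $3.75/M fully uncached, roughly a 6× saving; dropping from 90% to 50% hits roughly triples the blended rate, which is why unstable prefixes are expensive.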
Layer 3 — Users
Same spend, different productivity.
Team A: $5,000/month spend, $35 cost per merged PR.
Team B: $5,000/month spend, $120 cost per merged PR.
The variance is not in the price. It's in which model people reach for, when they escalate, and when they stop.
This is the layer that does not get cheaper on its own.
Source: Vantage agentic-coding cost analysis, April 2026. Across customer fleets, ~10× cost-per-developer variance within the same team is typical.
Layer 3 — Users
How users get more from the same spend.
The variance closes through three disciplines — how teams specify, how they structure work, how they measure outcomes.
Prompting
Outcome-first prompts
Describe outcomes; drop process scaffolding.
Literal instruction
Won’t infer what you don’t write.
Explicit stop conditions
Fewest useful tool loops; correctness outranks brevity.
Workflow
Subagent context isolation
Fresh context windows; only summary returns to parent.
Task budgets
Advisory token ceiling; model paces itself.
Parallel worktrees
3–5 concurrent sessions; throughput over per-session cost.
Measurement
Cost per merged pull request
The highest spender can be the lowest cost-per-PR.
Vibe-then-verify
AI generates; humans gate before commit.
Tenure playbooks
Experienced users converge on different patterns.
Sources: OpenAI prompt guidance (developers.openai.com); Anthropic Claude 4.7 migration & Claude Code docs (platform.claude.com, code.claude.com); Vantage agentic-coding cost analysis; Sonar State of Code 2026; Anthropic Economic Index, March 2026 report.
What if you paid by the token?
A $200 subscription is 1% of a US senior engineer's cost and 3% of a Slovak senior engineer's. Actual heavy use at PAYG rates tells a different story.
Region (senior engineer, fully loaded) · $/month · Sonnet PAYG · Opus PAYG
United States · $20,000 · ~10% · ~17%
Germany · $13,000 · ~15% · ~27%
Slovakia · $7,000 · ~30% · ~50%
The substitution math holds at the flat-plan level. It tightens fast outside top-tier US compensation as soon as you pay by the token.
Engineer compensation: fully-loaded employer cost (base + bonus + equity + benefits + payroll overhead). US (Bay Area / NYC big-tech tier), Germany (Berlin / Munich senior), Slovakia (Bratislava senior). Sources: Levels.fyi, ERI SalaryExpert. AI burn: Sonnet 4.6 / Opus 4.7 list rates × 75M tokens/active day × 90% cache hit × 22 working days.
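The table's percentages can be reproduced approximately from the footnote's assumptions (75M tokens per active day, 22 working days, 90% cache hit), plus an assumed blended rate of about $1.20 per million tokens for a Sonnet-class mix (an estimate, not a quoted price):

```python
def monthly_burn(tokens_per_day: float, working_days: int, blended_rate_per_m: float) -> float:
    """Monthly PAYG inference spend at a blended $/M rate."""
    return tokens_per_day * working_days / 1e6 * blended_rate_per_m

def comp_share(burn: float, monthly_comp: float) -> float:
    """AI burn as a fraction of fully loaded monthly compensation."""
    return burn / monthly_comp

burn = monthly_burn(75e6, 22, 1.20)  # ~$1,980/month at the assumed blend
for region, comp in [("US", 20_000), ("Germany", 13_000), ("Slovakia", 7_000)]:
    print(f"{region}: {comp_share(burn, comp):.0%} of fully loaded comp")
```

At these assumptions the US share lands near the table's ~10% Sonnet figure; the Opus column follows from the higher rate multiplier.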
What we called ‘cheap’was hiding everywhere.
Per-token prices fell ~280× at fixed quality. Enterprise GenAI spend tripled to $37B in 2025. Cheaper tokens — more than offset by bigger workloads.
~280× per-token price decline over the same period.
Enterprise GenAI spend: $2.3B (2023) → $11.5B (2024) → $37B (2025).
FinOps practitioners actively managing AI spend — State of FinOps, FinOps Foundation.
Frontier capability rose over the same period. Cheaper tokens, also smarter ones.
Bigger workloads more than offset cheaper tokens — reasoning models, multi-agent systems, longer-running loops.
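Taking the slide's own numbers at face value, the implied growth in token volume follows directly from spend = volume × price. A sketch (illustrative arithmetic on the stated figures, not independently sourced data):

```python
# spend = volume x price, so volume growth = spend growth x price decline.
spend_2023, spend_2025 = 2.3e9, 37e9  # enterprise GenAI spend, from the slide
price_decline = 280                   # ~280x per-token price drop, same period

spend_growth = spend_2025 / spend_2023        # ~16x
volume_growth = spend_growth * price_decline  # ~4,500x
print(f"implied token volume growth: ~{volume_growth:,.0f}x")
```

An implied ~4,500× growth in token volume is the mechanism by which bigger workloads more than offset cheaper tokens.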
Visible prices may rise even as inference cost falls. The subsidy was holding visible price below cost.
What was always there is becoming visible.
Sources: Menlo Ventures, 2025 State of Generative AI in the Enterprise (Dec 2025); Stanford HAI, 2025 AI Index Report (Apr 2025); FinOps Foundation, State of FinOps 2026.
The subsidies are ending. What looks like a loss is also a gift —
A gift of knowing what anything was costing all along.
And of choosing, again, what is worth running.
(Larger reverberations too — through organizations, through societies, through what we have been calling “cheap”…)