AI subscriptions are subsidized.
The subsidy is being pulled back.

What’s being unwound.
What it means for the people who pay.
What to do.

The flat-rate era is ending.
The metered era is starting.
@kamilkrauspe
One heavy day of agentic coding
burns through
$200
— more than the whole monthly plan covers.
The rest is paid by someone — and that is what is being unwound.

Four weeks. Five moves.
One direction.

March
Google
Antigravity restructured to AI Credits. AI Pro effective token allowance cut sharply (a ~97% reduction, per independent developer reports).
April 2
OpenAI
Codex billing moved from credits-per-message to credits-per-token. New $100 Pro tier between Plus and the $200 Pro.
April 21
Anthropic
Claude Code briefly restricted to Max plans on the pricing page. Reversed within a day.
April 23
Anthropic
Postmortem for Claude Code quality issues. Statement to Fortune: “compute is a constraint across the entire industry.”
April 27
GitHub
All Copilot plans move to AI-Credit billing on June 1. Input, output, and cached tokens all metered.
Sources: blog.google · tpsreport.news · openai · anthropic · github · April 2026.

Variance per request, 2026.

Per-session token consumption across six common workloads.

Inline autocomplete (one keystroke): < 1K tokens · < $0.01
Single chat prompt (no project context): < 10K · < $0.05
Targeted file edit (short multi-step): < 100K · < $0.30
Interactive coding (30–90 min session): < 10M · < $10
Long autonomous run (1–4 hours): < 100M · < $80
Multi-agent / background (production fleets): up to 1B · $500+
~100,000× spread per session
Documented heavy-use sessions
52.5M
tokens in 38 minutes — Cursor agent replaying a 120K context window in a tool loop.
$300 / day
Jason Calacanis running 8 AI agents, “a fraction of their potential capacity” — All-In Podcast, Feb 2026.
$1,000 / wk
single Replit Agent 3 user spike, vs. their typical $180–200 / month baseline — Q1 2026.
Sonnet 4.6 list rates. Cache reads dominate at scale. Opus ~1.5–1.7× higher. Ranges are estimates; x-axis is logarithmic.
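The dollar column follows from the token column once rates are fixed. A minimal sketch, assuming illustrative Sonnet-class rates; the per-million prices below are assumptions for illustration, not confirmed list prices:

```python
# Rough per-session cost model. The rates are illustrative assumptions.
RATE_UNCACHED = 3.00   # $/M tokens, uncached input (assumed)
RATE_CACHED   = 0.30   # $/M tokens, cache reads (assumed ~10x cheaper)

def session_cost(total_tokens: int, cache_hit: float = 0.9) -> float:
    """Dollar cost of one session, given total tokens and a cache hit rate."""
    cached = total_tokens * cache_hit
    uncached = total_tokens - cached
    return (cached * RATE_CACHED + uncached * RATE_UNCACHED) / 1_000_000

# The spread across the table's workloads:
print(f"chat prompt (10K):     ${session_cost(10_000):.4f}")
print(f"interactive (10M):     ${session_cost(10_000_000):.2f}")   # ~$5.70
print(f"autonomous run (100M): ${session_cost(100_000_000):.2f}")  # ~$57
```

Under these assumed rates each row lands inside the table's bracket; output tokens (priced higher) and cache misses push real sessions toward the upper bound.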

A real heavy user
costs more to serve
than the plan covers.

What you pay
$200
flat subscription, per month
What heavy use actually costs
$1,000–$4,400
inference at list rates, per month
The plan covers about 5–20% of the inference.
Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7. Estimates.
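A back-of-envelope for the $1,000–$4,400 range, using the slide's inputs (100M tokens per active day, ~90% cache hit, Opus ~1.5–1.7× Sonnet). The active-day count, output share, and exact per-million rates are assumptions for illustration:

```python
# Back-of-envelope for the monthly heavy-use figure. All splits below
# (active days, output share, exact rates) are assumptions.
ACTIVE_DAYS    = 22            # working days per month (assumed)
TOKENS_PER_DAY = 100_000_000   # heavy power-user, per the slide
CACHE_HIT      = 0.90

# Assumed Sonnet-class list rates, $/M tokens:
IN_UNCACHED, IN_CACHED, OUT = 3.00, 0.30, 15.00
OUTPUT_SHARE = 0.05            # assumed fraction of tokens that are output

cached   = TOKENS_PER_DAY * CACHE_HIT
output   = TOKENS_PER_DAY * OUTPUT_SHARE
uncached = TOKENS_PER_DAY - cached - output

daily = (cached * IN_CACHED + uncached * IN_UNCACHED + output * OUT) / 1e6
sonnet_month = daily * ACTIVE_DAYS        # ~$2,574
opus_month   = sonnet_month * 1.7         # slide: Opus ~1.5-1.7x higher
print(f"~${sonnet_month:,.0f} to ${opus_month:,.0f} per month")
```

A higher cache hit rate and a smaller output share pull the Sonnet figure toward the $1,000 low end; the Opus multiplier reaches the ~$4,400 top. Against a $200 plan, that is the 5–20% coverage the slide describes.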

It’s not only about the money.

Heavy use overshoots the plan in dollars.
It also overshoots it in capacity.

Tokens per request, heavy users (P90), year over year.*
Subscription changes aren’t only cost recovery.
They’re also rationing on a scarce resource.
* Source: Datadog State of AI Engineering, 2026.

So what can we do about it?

The same lever shows up at three layers.
Cost, throughput, and capacity move together.

1
Inference
The unit of inference: architecture, runtime, cache mechanics.
Improves through: Labs and serving providers competing on $/M.
Your hand: Which model you reach for; prefix discipline.
2
Harness
The system around the model: context, tools, orchestration.
Improves through: Tool builders competing; engineering teams iterating.
Your hand: Context engineering; harness, agent, and tool design.
3
Use
The choices made turn-by-turn: which model, when to escalate, when to stop.
Improves through: You, deliberately.
How it's used: All of it.
As prices become visible, layer 3 — the one only you can move — matters most.
Layer 1 — Inference

What does a million tokens cost to make?

Self-hosted GPT-OSS-120B on a rented H100 80GB.
Cost
$3 / hour
rental list rate
Throughput
18M tokens / hour
vLLM saturated, output tokens
cost ÷ throughput = $3 / 18M tokens ≈ $0.17 per million tokens
A simplified single-configuration example. Real costs vary.
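The arithmetic made explicit, using the slide's single-configuration numbers:

```python
# The slide's unit economics: cost / throughput = $/M tokens.
cost_per_hour   = 3.0    # H100 80GB rental, $/hr (slide's list rate)
tokens_per_hour = 18e6   # saturated vLLM output throughput (slide)

dollars_per_million = cost_per_hour / (tokens_per_hour / 1e6)
print(f"${dollars_per_million:.2f} / M tokens")   # ~$0.17

# Throughput is the denominator: doubling it halves unit cost.
print(f"${cost_per_hour / (2 * tokens_per_hour / 1e6):.3f} / M at 2x throughput")
```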
Throughput is the denominator.
Push it up, the cost falls. So how?
COMPRESS
MLA · MXFP4/NVFP4 quantization · KV eviction …
SKIP
sparse attention · MoE · MoD/MoR …
RESHAPE
speculative decoding · disaggregation · Hybrid SSM …
Layer 2 — Harness

Every turn carries the previous turn.

In Claude Code, Cursor, Codex, Copilot — every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything before.

Turn 1 — system prompt + tools + MCP defs + user ask: fix failing tests. ~30K tokens. Returns: grep "test_users".
Turn 2 — Turn 1 + grep call + ~50 hits. ~32K. Returns: read_file("test_users.py").
Turn 3 — Turn 2 + ~300 lines of file content. ~35K. Returns: edit proposal.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
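A sketch of the loop's billing under prefix caching: the replayed prefix bills as cache reads, and only each turn's new suffix is full price. The per-turn growth here is a linear assumption (30K at turn 1, ~150K by turn 50), so the totals are illustrative and will not exactly match the chart:

```python
# Sketch of agent-loop billing: each turn resends the whole prior context.
# With prefix caching, the replayed prefix bills as cache reads; only each
# turn's new suffix is uncached. Linear growth is an assumption.
def loop_tokens(turns: int = 50, start: int = 30_000, growth: int = 2_450):
    cached = uncached = 0
    context = start                    # context entering the current turn
    for turn in range(1, turns + 1):
        new = start if turn == 1 else growth   # fresh tokens this turn
        uncached += new                        # new suffix: full price
        cached += context - new                # replayed prefix: cache read
        context += growth
    return cached, uncached

cached, uncached = loop_tokens()
total = cached + uncached
print(f"total {total/1e6:.1f}M billed tokens, {cached/total:.0%} as cache reads")
```

Even this toy run bills several million tokens for a single 50-turn session, almost all of them replays. Cheaper, yes; gone, no.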
[Chart: cumulative billed tokens, turn 1 to turn 50: ≈3.5M tokens total, ~90% cached vs. uncached.]
Caching makes replayed tokens cheaper. It does not make them disappear.
Layer 2 — Harness

How harnesses get cheaper.

Wins on this layer compound. A 3× edit-format gain and a 10× cache gain don’t add — they multiply.

Reduce the loop
Programmatic tool calling: Anthropic case, 150K → 2K context.
JIT MCP discovery: Cursor, −46.9% on MCP runs.
Scoped subagents: summaries to the planner, not raw transcripts.
Protect the cache
Cache miss penalty: each miss costs 10–20× a hit.
Stable prefixes: anything dynamic kills the cache.
TTL-aware sessions: 5-min vs. 1-hour, chosen by session length.
Shape the work
Edit format matters: Aider measured 3× tokens, whole-file vs. diff.
70/20/10 routing: triage with Haiku → escalate to Sonnet → reserve Opus.
Workflow tools, not API endpoints: one right-shaped tool beats five raw ones.
Context engineering, prompt engineering, harness engineering — in practice. …
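The 70/20/10 routing idea above can be sketched as a tiny router. The tier rates, difficulty thresholds, and the scoring itself are assumptions for illustration:

```python
# Sketch of 70/20/10 routing: triage to a cheap model, escalate
# mid-difficulty work, reserve the frontier model for the hardest 10%.
# Rates and thresholds below are illustrative assumptions.
RATES = {"haiku": 1.0, "sonnet": 3.0, "opus": 15.0}   # $/M input (assumed)

def route(difficulty: float) -> str:
    """Pick a tier from a 0..1 difficulty score (thresholds assumed)."""
    if difficulty < 0.7:
        return "haiku"
    if difficulty < 0.9:
        return "sonnet"
    return "opus"

# Blended $/M under a 70/20/10 split vs. sending everything to the top tier:
blended = 0.70 * RATES["haiku"] + 0.20 * RATES["sonnet"] + 0.10 * RATES["opus"]
print(f"blended ${blended:.2f}/M vs. ${RATES['opus']:.2f}/M all-opus")
```

Under these assumed rates the blend comes out roughly 5× cheaper than routing everything to the top tier, with the hard 10% still getting the strongest model.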
Layer 3 — Users

Same spend.
Different productivity.

Team A
Spend: $5,000/month
$35
cost per merged PR
Team B
Spend: $5,000/month
$120
cost per merged PR
The variance is not in price. It’s in which model people reach for, when they escalate, and when they stop.
SPECIFY  — outcome-first prompts  ·  literal instructions  ·  explicit stop conditions …
CONTROL  — subagent isolation  ·  task budgets  ·  escalation rules …
MEASURE  — cost per merged PR  ·  AI drafts, humans gate  ·  experienced-user playbooks …
This is the layer that does not get cheaper on its own.
Source: Vantage agentic-coding cost analysis, April 2026. ~10× cost-per-developer variance within a team is typical.
The subsidies are ending.
What looks like a loss is also a gift.

A gift of knowing
what anything was
costing all along.
And of choosing, again,
what is worth running.

(Larger reverberations too — through organizations, through societies, through what we have been calling “cheap”…)