AI subscriptions, coming due.

The Subsidy Era.

What’s being unwound.
What it means for the people who pay.
What to do.

The flat-rate era is ending.
The metered era is starting.
Six moves. Five vendors. Four weeks.
Why $200 a month is smaller than what one heavy day costs.
The three layers where teams are bringing the cost down.
Same spend, very different output — the variance most teams don’t measure.
One heavy day of agentic coding burns through $200 — more than the whole monthly plan covers.
The rest is being paid by someone — and that is what is being unwound.

Six moves. Five vendors.
About four weeks.

March · Google · Antigravity restructured to AI Credits; AI Pro weekly token usage cut ~97%.
April 2 · OpenAI · Codex billing for paid plans moved from credits-per-message to credits-per-token; a new $100 Pro tier slots between Plus and the $200 Pro.
April 9 · Amazon · Jassy shareholder letter: AWS still has “capacity constraints that yield unserved demand” despite +3.9 GW added in 2025.
April 21 · Anthropic · Claude Code briefly restricted to Max plans on the pricing page; reversed within a day.
April 23 · Anthropic · Postmortem for Claude Code quality issues; statement to Fortune: “compute is a constraint across the entire industry.”
April 27 · GitHub · All Copilot plans move to AI-Credit billing on June 1; input, output, and cached tokens all metered.

Variance per request, 2026.

Per-session token consumption across six common workloads.

Inline autocomplete · one keystroke · < 1K tokens · < $0.01
Single chat prompt · no project context · < 10K · < $0.05
Targeted file edit · short multi-step · < 100K · < $0.30
Interactive coding · 30–90 min session · < 10M · < $10
Long autonomous run · 1–4 hours · < 100M · < $80
Multi-agent / background · production fleets · up to 1B · $500+
100,000× spread
Documented heavy-use sessions
52.5M tokens in 38 minutes: a Cursor agent replaying a 120K context window in a tool loop.
$300 / day: Jason Calacanis running 8 AI agents, “a fraction of their potential capacity” (All-In Podcast, Feb 2026).
$1,000 / wk: a single Replit Agent 3 user spike, vs. their typical $180–200 / month baseline (Q1 2026).
Sonnet 4.6 list rates. Cache reads dominate at scale (especially in agent loops). Opus ~1.5–1.7× higher. Ranges are estimates.
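The arithmetic behind these ranges can be sketched in a few lines. Every rate and ratio below is an illustrative assumption (roughly Sonnet-class list pricing), not a vendor-published figure:

```python
# Per-session cost model for the workloads above. Assumed rates, $/M:
# fresh input 3.00, cache reads 0.30, output 15.00. Long loops assumed
# ~95% cache hits and ~2% output share.
RATE_FRESH = 3.00 / 1e6
RATE_CACHED = 0.30 / 1e6
RATE_OUTPUT = 15.00 / 1e6

def session_cost(total_tokens: int, cache_hit: float = 0.95,
                 output_share: float = 0.02) -> float:
    """Estimate session cost from total tokens, cache-hit rate, output share."""
    output = total_tokens * output_share
    non_output = total_tokens - output
    fresh = non_output * (1 - cache_hit)   # cache misses: full-price writes
    cached = non_output * cache_hit        # replays: cheap cache reads
    return fresh * RATE_FRESH + cached * RATE_CACHED + output * RATE_OUTPUT

for name, tokens in [("targeted file edit", 100_000),
                     ("interactive coding", 10_000_000),
                     ("long autonomous run", 100_000_000)]:
    print(f"{name}: ~${session_cost(tokens):,.2f}")
```

Under these assumptions the three sessions land near $0.07, $7, and $73, inside the table’s ranges. The spread comes almost entirely from total tokens, not the rate card.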

A real heavy user costs more to serve than the plan covers.

What you pay: $200, flat subscription, per month.
What heavy use actually costs: $1,000–$4,400, inference at list rates, per month.
The plan covers about 5–20% of the inference. Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7 across the range. Figures are estimates.
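The monthly gap is one multiplication away. The effective blended rates below are assumptions, spanning heavy caching at Sonnet-class pricing to lighter caching at Opus-class pricing:

```python
# 100M tokens per active day × 22 active days ≈ 2.2B tokens / month.
tokens_per_month = 100_000_000 * 22

# Assumed effective blended rates, $ per million tokens:
for label, rate in [("Sonnet-class, heavy caching", 0.45),
                    ("Opus-class, lighter caching", 2.00)]:
    monthly = tokens_per_month / 1e6 * rate
    print(f"{label}: ~${monthly:,.0f} / month vs. a $200 plan")
```

That is the $1,000–$4,400 band above: 5–20× the subscription price.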

It isn’t only about the money.

Heavy use overshoots the plan in dollars.
It also overshoots it in capacity.

Chart: tokens per request, heavy users (P90), year over year.
Pricing changes aren’t only cost recovery.
They’re also rationing on a scarce resource.
Datadog State of AI Engineering, 2026.

Where could the subsidy be coming from?

1 · Investor capital
OpenAI projects $665B cumulative cash burn for 2026–30 while raising at an $852B post-money valuation.
Whether that capital flows to subscription subsidy, capex, R&D, or talent is not publicly disclosed.
2 · Hyperscaler compute
Hyperscalers with equity in labs can price compute below market — protecting the equity by supporting the burn.
Amazon’s $16.8B Q1 gain on Anthropic and the multi-GW Broadcom–Google TPU deal fit the shape. Pricing undisclosed.
3 · Cross-subsidy
In a flat plan, light users consume far less than they pay for — covering much of the heavy minority’s overuse.
Anthropic flagged this in July 2025 (<5% drive most use). Agentic usage has grown sharply since.
4 · PAYG → flat-rate subsidy
If pay-as-you-go (PAYG) prices sit above serving cost, the margin could subsidize flat-rate plans, ration scarce capacity, or both.
No vendor discloses unit economics by billing model.
The subsidy gap is real, and it is being closed.

So what can we do about it?

The same lever shows up at three layers.
Cost, throughput, and capacity move together.

1 · Inference
The unit of inference: architecture, runtime, cache mechanics.
Improves through: labs competing on capability and $/M.
Your hand: which model you reach for; prefix discipline.

2 · Harness
The system around the model: context, tools, orchestration.
Improves through: tool builders competing; engineering teams iterating.
Your hand: context engineering; choice of harness; agent and tool design.

3 · Use
The choices made turn-by-turn: which model, when to escalate, when to stop.
Improves through: you, deliberately.
Your hand: all of it.
As prices become visible, layer 3 — the one only you can move — matters most.
Layer 1 — Inference

What does a million tokens cost to make?

Self-hosted GPT-OSS-120B on an H100 80GB:
Rent: $3 / hour (rental list rate).
Throughput: 18M tokens / hour (vLLM saturated, output tokens).
Rent ÷ throughput ≈ $0.17 per million tokens.
A simplified single-configuration example. Real costs vary.
Throughput is the denominator.
Push it up, the cost falls.
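The whole calculation is one division; the numbers are the single configuration above:

```python
# Self-hosted cost per million output tokens:
# hourly rental rate ÷ millions of tokens produced per hour.
def cost_per_million(rent_per_hour: float, tokens_per_hour: float) -> float:
    return rent_per_hour / (tokens_per_hour / 1e6)

print(cost_per_million(3.0, 18e6))  # H100 at $3/hr, 18M tok/hr → ~$0.17
print(cost_per_million(3.0, 36e6))  # double the throughput → ~$0.08
```

Every technique on the next slide is an attack on that denominator.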
Layer 1 — Inference

How throughput keeps growing.

Techniques in use, by category. Each one pushes more useful tokens through the same hardware.

Attention
MLA: KV cache 10× smaller.
Sparse attention: skip irrelevant tokens (NSA, DSA, TriAttention).
MQA / GQA: fewer KV heads, denser batches.
Sliding-window attention: attend only over recent tokens.
FlashAttention 3/4: kernel-level wins.
Linear / sub-quadratic: Mamba, RWKV, RetNet.

Serving runtime
vLLM, SGLang, TensorRT-LLM: 20–40% throughput swings.
Paged attention: virtual KV memory.
Speculative decoding: 2–4× decode.
KV eviction: LMCache, Dynamo, TurboQuant.
Disaggregated decode: phases optimized separately.

Model strategy
Mixture-of-Experts: DeepSeek 671B/37B, GPT-OSS 117B/5.1B.
Distillation: capability per parameter.
Hybrid SSM: Nemotron-H, Qwen3, Hunyuan-TurboS.
Quantization: MXFP4, NVFP4, INT4; bandwidth wins.
MoD / MoR: dynamic compute.
$0.17 per million tokens is not where it ends. The work of pushing it lower is what this menu describes.
Layer 2 — Harness

Every turn carries the previous turn.

In Claude Code, Cursor, Codex, Copilot — every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything before.

Turn 1 — system prompt + tools + MCP defs + user ask: fix failing tests. ~30K tokens. Returns: grep "test_users".
Turn 2 — Turn 1 + grep call + ~50 hits. ~32K. Returns: read_file("test_users.py").
Turn 3 — Turn 2 + ~300 lines of file content. ~35K. Returns: edit proposal.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
Across the session: ≈3.5M cumulative tokens, ~90% of them cached replays.
Caching makes replayed tokens cheaper. It does not make them disappear.
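A toy model of that accumulation, assuming linear per-turn growth and illustrative cache pricing. Every number here is an assumption, and the totals land near, not exactly on, the figures above:

```python
# 50-turn agent loop: each turn replays the full conversation so far.
# Illustrative rates, $/M tokens: fresh input 3.00, cache read 0.30.
FRESH, CACHED = 3.00 / 1e6, 0.30 / 1e6
GROWTH = 2_450                    # assumed tokens appended per turn

context, cumulative, cost = 30_000, 0, 0.0   # turn 1 starts at ~30K
for turn in range(1, 51):
    replayed = context - GROWTH if turn > 1 else 0   # prior turns: cache reads
    fresh = context - replayed                       # new tokens: fresh writes
    cost += replayed * CACHED + fresh * FRESH
    cumulative += context
    context += GROWTH

print(f"final context: ~{(context - GROWTH) / 1e3:.0f}K tokens")
print(f"cumulative: ~{cumulative / 1e6:.1f}M tokens, ~${cost:.2f} with caching")
print(f"without caching: ~${cumulative * FRESH:.2f}")
```

Caching cuts the bill roughly 8× here, but the millions of replayed tokens are still processed and metered on every turn.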
Layer 2 — Harness

How harnesses get cheaper.

Wins on this layer compound. A 3× edit-format gain and a 10× cache gain don't add — they multiply.

Reduce the loop
Programmatic tool calling: the model writes code that calls tools; intermediate results stay outside context. (Anthropic case: 150K → 2K.)
JIT MCP discovery: Cursor, −46.9% on MCP runs.
Sub-agents with scoped contexts: summaries go to the planner, not raw transcripts.
Tools return content, not IDs: avoid the lookup-by-handle round-trip.

Cache discipline
Cache miss penalty: each miss forces a fresh write — 10–20× the cost of a hit.
5-min vs. 1-hour TTL: pick by session length; idle time drains short TTLs.
Stable prefixes: anything that varies kills the cache; keep dynamic content out of the cached portion.

Tool & format design
Edit format matters: Aider, 3× output tokens for whole-file vs. diff.
70/20/10 routing: triage with Haiku, escalate to Sonnet, reserve Opus (sketched below).
Error messages as instructions: “Date format must be YYYY-MM-DD” beats a stack trace.
Reasoning vs. instant mode: thinking tokens count; choose deliberately.
Workflow tools, not API endpoints: one tool with the right shape beats five raw ones.
The harness is where the biggest cost wins are reported.
This is context engineering in practice.
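Of these, the 70/20/10 routing split is the easiest to sketch. The model tiers, thresholds, and escalation signals below are illustrative assumptions; a real harness routes on richer signals (test failures, confidence, tool-error streaks):

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    failed_attempts: int = 0     # cheap-tier attempts that didn't stick
    cross_cutting: bool = False  # touches many files or subsystems

def route(task: Task) -> str:
    """70/20/10 triage: cheap by default, escalate only on signals."""
    if task.failed_attempts >= 2 or task.cross_cutting:
        return "opus"    # the reserved ~10%
    if task.failed_attempts == 1 or len(task.prompt) > 2_000:
        return "sonnet"  # the escalated ~20%
    return "haiku"       # the default ~70%

print(route(Task("rename this variable")))                        # haiku
print(route(Task("fix the flaky auth test", failed_attempts=1)))  # sonnet
print(route(Task("migrate the schema", cross_cutting=True)))      # opus
```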
Layer 3 — Users

Same spend, different productivity.

Team A: $5,000 / month spend · $35 cost per merged PR.
Team B: $5,000 / month spend · $120 cost per merged PR.
The variance is not in the price. It’s in which model people reach for, when they escalate, and when they stop.
This is the layer that does not get cheaper on its own.
Source: Vantage agentic-coding cost analysis, April 2026. Across customer fleets, ~10× cost-per-developer variance within the same team is typical.
Layer 3 — Users

How users get more from the same spend.

The variance closes through three disciplines — how teams specify, how they structure work, how they measure outcomes.

Prompting
Outcome-first prompts: describe outcomes; drop process scaffolding.
Literal instruction: the model won’t infer what you don’t write.
Explicit stop conditions: fewest useful tool loops; correctness outranks brevity.

Workflow
Subagent context isolation: fresh context windows; only a summary returns to the parent.
Task budgets: an advisory token ceiling; the model paces itself.
Parallel worktrees: 3–5 concurrent sessions; throughput over per-session cost.

Measurement
Cost per merged pull request: the highest spender can be the lowest cost-per-PR (sketched below).
Vibe-then-verify: AI generates; humans gate before commit.
Tenure playbooks: experienced users converge on different patterns.
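The denominator metric is a single division. The PR counts below are back-calculated from the Team A / Team B slide and purely illustrative:

```python
# Cost per merged PR: monthly AI spend ÷ merged pull requests.
def cost_per_merged_pr(monthly_spend: float, merged_prs: int) -> float:
    return monthly_spend / merged_prs if merged_prs else float("inf")

print(f"Team A: ${cost_per_merged_pr(5_000, 143):.0f} per merged PR")  # ~$35
print(f"Team B: ${cost_per_merged_pr(5_000, 42):.0f} per merged PR")   # ~$119
```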
Layer 1 sets the rate card.
Layer 2 sets the loop.
Layer 3 sets the denominator.
Sources: OpenAI prompt guidance (developers.openai.com); Anthropic Claude 4.7 migration & Claude Code docs (platform.claude.com, code.claude.com); Vantage agentic-coding cost analysis; Sonar State of Code 2026; Anthropic Economic Index, March 2026 report.

What if you paid by the token?

A $200 subscription is 1% of a US senior engineer’s monthly cost and 3% of a Slovak senior engineer’s. Heavy actual use at PAYG rates tells a different story.

Region (senior engineer, fully loaded) · $/month · Sonnet PAYG · Opus PAYG
United States · $20,000 · ~10% · ~17%
Germany · $13,000 · ~15% · ~27%
Slovakia · $7,000 · ~30% · ~50%
The substitution math holds at the flat-plan level. It tightens fast outside top-tier US compensation as soon as you pay by the token.
Engineer compensation: fully-loaded employer cost (base + bonus + equity + benefits + payroll overhead). US (Bay Area / NYC big-tech tier), Germany (Berlin / Munich senior), Slovakia (Bratislava senior). Sources: Levels.fyi, ERI SalaryExpert. AI burn: Sonnet 4.6 / Opus 4.7 list rates × 75M tokens/active day × 90% cache hit × 22 working days.
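The footnote’s burn math, reproduced as a sketch. Rates are illustrative Sonnet-class list prices; the output share is an assumption chosen to land near the table:

```python
# 75M tokens / active day × 22 working days × 90% cache hit.
# Assumed rates, $/M: fresh input 3.00, cache read 0.30, output 15.00.
def monthly_burn(tokens_per_day=75e6, days=22, cache_hit=0.90,
                 output_share=0.045):
    per_m = (output_share * 15.00
             + (1 - output_share) * (cache_hit * 0.30 + (1 - cache_hit) * 3.00))
    return tokens_per_day * days / 1e6 * per_m

burn = monthly_burn()
print(f"~${burn:,.0f} / month at PAYG")   # ≈ $2,000, Sonnet-class
for region, comp in [("US", 20_000), ("Germany", 13_000), ("Slovakia", 7_000)]:
    print(f"{region}: {burn / comp:.0%} of fully-loaded monthly cost")
```

Opus-class rates at ~1.7× this reproduce the right-hand column: ~17%, ~27%, ~50%.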

What we called ‘cheap’ was hiding everywhere.

Per-token prices fell ~280× at fixed quality.
Enterprise GenAI spend tripled to $37B in 2025.
Cheaper tokens — more than offset by bigger workloads.

Enterprise GenAI spend: $2.3B (2023) → $11.5B (2024) → $37B (2025), against the ~280× per-token price decline over the same period.
FinOps practitioners actively managing AI spend: 31% (2024) → 63% (2025) → 98% (2026).
State of FinOps, FinOps Foundation.
Frontier capability rose over the same period. Cheaper tokens, also smarter ones.
Bigger workloads more than offset cheaper tokens — reasoning models, multi-agent systems, longer-running loops.
Visible prices may rise even as inference cost falls. The subsidy was holding visible price below cost.
What was always there is becoming visible.
Sources: Menlo Ventures, 2025 State of Generative AI in the Enterprise (Dec 2025); Stanford HAI, 2025 AI Index Report (Apr 2025); FinOps Foundation, State of FinOps 2026.
The subsidies are ending.
What looks like a loss is also a gift.

A gift of knowing what anything was costing all along.
And of choosing, again, what is worth running.

(Larger reverberations too — through organizations, through societies, through what we have been calling “cheap”…)
@kamilkrauspe