Per-workload token consumption across six common shapes.
Workload | Scope | Tokens | Cost
Inline autocomplete | one keystroke | < 1K | < $0.01
Single chat prompt | no project context | < 10K | < $0.05
Targeted file edit | short multi-step | < 100K | < $0.30
Interactive coding | 30–90 min session | < 10M | < $10
Long autonomous run | 1–4 hours | < 100M | < $80
Multi-agent / background | production fleets | up to 1B | $500+
[Chart: log scale from 1K to 1B tokens; a ~100,000× spread, per workload.]
Documented heavy-use sessions
52.5M tokens
in 38 minutes — Cursor agent replaying a 120K context window in a tool loop.
$300 / day per agent
Jason Calacanis on All-In, at ~10–20% capacity utilization.
$1,000 / wk
single Replit Agent 3 user spike, vs. their typical $180–200 / month baseline.
Sonnet 4.6 list-rate estimates. Sources: Cursor support forum
(April 2026), All-In Podcast (Feb 2026), Replit user reports
via The Register (Sept 2025).
I run engineering across two continents — and the variance between
developers dwarfs the variance between models.
@kamilkrauspe
A real heavy user costs more to serve than the plan covers.
What you pay: $200 / month, flat subscription.
What heavy use actually costs: $1,000–$4,400 / month, inference at list rates.
The plan covers 5–20% of the inference. Someone is paying the rest.
Heavy power user at 100M tokens/active day. ~93% cache reads · ~2% output. Range spans Sonnet-default (75% Sonnet / 20% Opus / 5% Haiku) to Opus-loyal (95% Opus / 5% Sonnet). ~22 active days/month. Estimates at list rates.
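The arithmetic behind that range can be sketched in a few lines. Only the token mix, model mixes, and active-day count come from the note above; everything else, in particular the per-million rates, is a placeholder assumption to be replaced with current list prices, so the printed totals are illustrative rather than a reproduction of the figure.

```python
# Sketch of the heavy-use cost model. Token mix, model mixes, and active days
# follow the note above; the $/M-token rates are PLACEHOLDER assumptions,
# not quoted list prices.

TOKENS_PER_ACTIVE_DAY = 100_000_000
ACTIVE_DAYS_PER_MONTH = 22

# ~93% cache reads, ~2% output; the residual ~5% is treated here as
# uncached input (an assumption).
TOKEN_MIX = {"cache_read": 0.93, "output": 0.02, "input": 0.05}

# Placeholder $/M-token rates per model and token kind.
RATES = {
    "sonnet": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
    "opus": {"input": 15.00, "output": 75.00, "cache_read": 1.50},
    "haiku": {"input": 0.80, "output": 4.00, "cache_read": 0.08},
}


def monthly_cost(model_mix: dict) -> float:
    """Monthly inference cost at the assumed rates for a given model mix."""
    per_day = 0.0
    for model, share in model_mix.items():
        for kind, frac in TOKEN_MIX.items():
            millions = TOKENS_PER_ACTIVE_DAY * share * frac / 1_000_000
            per_day += millions * RATES[model][kind]
    return per_day * ACTIVE_DAYS_PER_MONTH


sonnet_default = {"sonnet": 0.75, "opus": 0.20, "haiku": 0.05}
opus_loyal = {"opus": 0.95, "sonnet": 0.05}

print(f"Sonnet-default mix: ${monthly_cost(sonnet_default):,.0f}/month")
print(f"Opus-loyal mix:     ${monthly_cost(opus_loyal):,.0f}/month")
```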
It’s not only about the money.
Heavy use overshoots the plan in dollars. It also overshoots it in capacity.
4×
Tokens per request, heavy users (P90), year over year.*
Subscription changes aren’t only cost recovery.
They’re also rationing of a scarce resource.
* Source: Datadog State of AI Engineering, 2026.
So what can we do about it?
The same lever shows up at three layers. Cost, throughput, and capacity move together.
1 · Inference
The unit of inference: architecture, runtime, cache mechanics.
Improves through: labs and serving providers competing on $/M.
Your hand: which model you reach for; prefix discipline (see the sketch after this list).

2 · Harness
The system around the model: context, tools, orchestration.
Improves through: tool builders competing; engineering teams iterating.
Your hand: context engineering; harness, agent, and tool design.

3 · Use
The choices made turn-by-turn: which model, when to escalate, when to stop.
Improves through: you, deliberately.
How it’s used: all of it.
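One concrete example of the hand you have at layer 1 is the prefix discipline mentioned in the first card. The sketch below is provider-agnostic and its message shapes and names are hypothetical; it only shows the ordering habit that keeps prompt caches warm.

```python
# Illustrative sketch of prefix discipline for prompt caching. No specific
# provider API is assumed. Prompt caches match on an identical leading prefix,
# so the expensive, stable material goes first and stays byte-for-byte
# identical across turns, while volatile per-turn content is appended last.

PROJECT_CONTEXT = "...large, rarely-changing repo and style context..."  # placeholder

STABLE_PREFIX = [
    {"role": "system", "content": "You are the repo assistant."},
    {"role": "user", "content": PROJECT_CONTEXT},
]


def build_request(history: list, new_message: str) -> list:
    """Stable prefix first (eligible for cache reads), then history, then the new turn."""
    return [*STABLE_PREFIX, *history, {"role": "user", "content": new_message}]


# Editing or reordering anything inside STABLE_PREFIX between turns breaks the
# match and turns cheap cache reads back into full-price input tokens.
request = build_request(history=[], new_message="Rename the config loader and update call sites.")
```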
Only you can move layer 3.
That is where your work begins.
Layer 1 — Inference
What does a million tokens cost to make?
Self-hosted GPT-OSS-120B on a rented H100 80GB.
Cost: $3 / hour (rental list rate)
Throughput: 18M tokens / hour (vLLM saturated, output tokens)
cost ÷ throughput ≈ $0.17 per million tokens
A simplified single-configuration example. Real costs vary.
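Spelled out with units, the division uses only the two figures above; swapping in a different GPU price or throughput changes the answer directly.

```python
# Cost per million output tokens from the two figures above: a rented H100 at
# $3/hour and ~18M output tokens/hour with vLLM saturated.
gpu_rental_per_hour = 3.00           # $/hour, rental list rate
output_tokens_per_hour = 18_000_000  # tokens/hour at saturation

cost_per_million = gpu_rental_per_hour / (output_tokens_per_hour / 1_000_000)
print(f"${cost_per_million:.2f} per million output tokens")  # ~$0.17
```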
Throughput is the denominator. Push it up, the cost falls. So how?
plan first · reset early · escalate deliberately …
MEASURE: cost per PR · accepted vs. retained · repo rules …
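One way to make the MEASURE line concrete is below; the record shape, PR identifiers, and dollar amounts are invented for illustration, and you would substitute whatever your own token-usage export provides.

```python
# Hypothetical sketch: cost per merged PR and the share of spend that landed.
from collections import defaultdict

usage = [
    # (pull request, inference dollars, merged?)
    ("repo#101", 18.40, True),
    ("repo#102", 4.10, False),   # abandoned: spend with no retained change
    ("repo#103", 61.75, True),
]

spend = defaultdict(float)
merged = set()
for pr, dollars, was_merged in usage:
    spend[pr] += dollars
    if was_merged:
        merged.add(pr)

total = sum(spend.values())
print(f"cost per merged PR: ${total / max(len(merged), 1):.2f}")
print(f"share of spend retained: {sum(spend[p] for p in merged) / total:.0%}")
```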
This layer does not get cheaper on its own.
Sources: Vantage, ‘AI Costs Are Cloud Costs Now’ (Apr 29,
2026) and ‘The Hidden Cost Driver in Agentic Coding’ (Apr 15,
2026). Vantage documents up to ~10× difference in token costs
between two engineers on the same team.
The subsidies are ending. What looks like a loss is also a gift —
A gift of knowing what everything was costing all along. And of choosing, again, what is worth running.
(Larger reverberations too — through organizations, through societies, through what we have been calling “cheap”…)
Written from where engineering leadership, agentic coding, and enterprise constraints meet.