Your AI subscription is subsidized.
The subsidy is being pulled.
The Subsidy Era — AI subscriptions, coming due.
@kamilkrauspe
A heavy day can break the economics of a $200 plan.
$200
One heavy agentic-coding day burns more inference than the monthly plan covers.
The rest is paid by someone — until it gets metered.
Four weeks. Five moves.
One direction.
March
Google
Antigravity restructured around AI Credits; effective token allowance on AI Pro cut sharply.
April 2
OpenAI
Codex billing moved from credits-per-message to credits-per-token.
April 21
Anthropic
Claude Code briefly restricted to Max plans. Reversed within a day.
April 23
Anthropic
Postmortem: “compute is a constraint across the entire industry.”
April 27
GitHub
All Copilot plans move to AI Credits. Input, output, and cached tokens are all metered.
Agentic work doesn’t have normal SaaS variance.
Workload · tokens per session · approx. cost
Inline autocomplete (one keystroke) · < 1K · < $0.01
Single chat prompt (no project context) · < 10K · < $0.05
Targeted file edit (short multi-step) · < 100K · < $0.30
Interactive coding (30–90 min session) · < 10M · < $10
Long autonomous run (1–4 hours) · < 100M · < $80
Multi-agent / background (production fleets) · up to 1B · $500+
Token scale: 1K → 1M → 1B · a ~100,000× spread per session.
52.5M tokens in 38 minutes: a Cursor agent replaying a 120K context window in a tool loop.
Sonnet 4.6 list rates. x-axis logarithmic. Estimates.
A real heavy user costs more to serve than the plan covers.
What you pay
$200
flat subscription, per month
What heavy use actually costs
$1,000–$4,400
inference at list rates, per month
The plan covers about 5–20% of the inference.
Someone is paying the rest.
Heavy power-user at 100M tokens/active day. Cache hit ~90%. Sonnet 4.6 to Opus 4.7. Estimates.
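A minimal sketch of that coverage math, assuming Sonnet-class list rates of roughly $3/M for uncached input and a tenth of that for cache reads; the actual Sonnet 4.6 / Opus 4.7 rates are not quoted here, so treat the output as illustrative:

```python
# Back-of-envelope: what a heavy power-user costs to serve each month.
# ASSUMED rates below; heavier Opus-class pricing pushes the result
# toward the top of the slide's $1,000–$4,400 range.

TOKENS_PER_ACTIVE_DAY = 100e6   # slide's heavy user: 100M tokens/active day
ACTIVE_DAYS_PER_MONTH = 20      # assumption: ~20 working days
CACHE_HIT_RATE = 0.90           # slide: ~90% of input tokens hit the cache
INPUT_RATE = 3.00               # $/M uncached input tokens (assumed)
CACHE_READ_RATE = 0.30          # $/M cached input tokens (assumed)
PLAN_PRICE = 200                # the flat monthly subscription

monthly_m_tokens = TOKENS_PER_ACTIVE_DAY * ACTIVE_DAYS_PER_MONTH / 1e6
blended_rate = CACHE_HIT_RATE * CACHE_READ_RATE + (1 - CACHE_HIT_RATE) * INPUT_RATE
inference_cost = monthly_m_tokens * blended_rate

print(f"inference cost: ${inference_cost:,.0f}/month")       # ~$1,140
print(f"plan covers:    {PLAN_PRICE / inference_cost:.0%}")  # ~18%
```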
It’s not only about the money.
Heavy use overshoots the plan in dollars.
It also overshoots it in capacity.
4×
Tokens per request, heavy users (P90), year over year.
Pricing changes aren’t only cost recovery.
They’re also a way of rationing a scarce resource.
Datadog State of AI Engineering, 2026.
The cost moves at three layers.
1
Inference
The unit cost of generating tokens.
2
Harness
The loop around the model.
3
Use
The choices people make in the loop.
The first two improve over time. The third needs discipline.
Layer 1 — Inference
Cheaper tokens come from throughput.
$/M tokens = GPU rent ($/hour) ÷ throughput (M tokens/hour)
Example: $3/hour ÷ 18M tokens/hour ≈ $0.17/M tokens.
Self-hosted GPT-OSS-120B on H100 80GB. Single-config example.
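The division above as a function, using the slide's single-config numbers; the doubled-throughput call is hypothetical, just to show the lever:

```python
# $/M tokens = GPU rent per hour ÷ (tokens per hour / 1M).
# Real throughput varies widely with batch size, context length,
# and quantization; this is the slide's single-config example.

def cost_per_m_tokens(gpu_rent_per_hour: float, tokens_per_hour: float) -> float:
    """Unit inference cost for a self-hosted deployment."""
    return gpu_rent_per_hour / (tokens_per_hour / 1e6)

print(f"${cost_per_m_tokens(3.00, 18e6):.2f}/M tokens")  # $0.17, the slide's example
print(f"${cost_per_m_tokens(3.00, 36e6):.2f}/M tokens")  # $0.08 if throughput doubles
```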
Attention & Memory
MLA (KV cache 10× smaller)
Sparse attention (NSA, DSA, TriAttention)
KV eviction (LMCache, Dynamo, TurboQuant)
Model Strategy
MoE (DeepSeek 671B/37B, GPT-OSS 117B/5.1B)
Hybrid SSM (Nemotron-H, Qwen3, Hunyuan-TurboS)
MXFP4 / NVFP4 quantization
Serving
Speculative decoding (2–4× decode)
Disaggregated prefill/decode
MoD / MoR (dynamic compute)
Push throughput up, and unit cost falls.
Layer 2 — Harness
Every turn carries the previous turn.
Claude Code, Cursor, Codex, Copilot: every coding agent is a loop. A 50-turn session is 50 calls, each carrying everything that came before it.
Turn 1 — system prompt + tools + user ask. ~30K tokens.
Turn 2 — Turn 1 + grep results. ~32K.
Turn 3 — Turn 2 + file contents. ~35K.
…by Turn 50, running context ~150K. Cumulative across turns: several million.
Chart: cached vs. uncached tokens, turn 1 → turn 50 · ≈3.5M cumulative tokens · ~90% cached.
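A toy simulation of that accumulation. The 3.3%-per-turn growth rate is an assumption, tuned so a 30K turn-1 context lands near the slide's ~150K turn-50 and ≈3.5M cumulative figures:

```python
# Toy model of a 50-turn agent loop: the full running context is re-sent
# (and billed) on every call; only the newly appended part misses the cache.

context = 30_000.0   # turn 1: system prompt + tools + user ask
prev = 0.0
total = uncached = 0.0

for turn in range(1, 51):
    total += context              # whole context billed this turn
    uncached += context - prev    # delta since last turn: grep results, files, diffs
    prev = context
    context *= 1.033              # ASSUMED compounding growth per turn

print(f"turn-50 context:  {prev / 1e3:.0f}K tokens")    # ~147K
print(f"cumulative input: {total / 1e6:.1f}M tokens")   # ~3.7M
# Pure prefix caching gives ~96% here; output tokens and TTL misses
# pull real sessions nearer the slide's ~90%.
print(f"cached share:     {1 - uncached / total:.0%}")
```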
Layer 2 — Harness
How harnesses get cheaper.
Wins on this layer compound. A 3× edit-format gain and a 10× cache gain don’t add — they multiply.
Reduce the loop
Programmatic tool calling
Anthropic case: 150K → 2K context.
JIT MCP discovery
Cursor: −46.9% on MCP runs.
Scoped subagents
Summaries to the planner, not raw transcripts.
Protect the cache
Cache miss penalty
Each miss costs 10–20× a hit.
Stable prefixes
Anything dynamic in the prompt prefix kills the cache.
TTL-aware sessions
choose the 5-minute vs. 1-hour TTL by session length.
Shape the work
Edit format matters
Aider: 3× tokens, whole-file vs. diff.
70/20/10 routing
Triage Haiku → escalate Sonnet → reserve Opus (sketched after this list).
Workflow tools, not API endpoints
One right-shaped tool beats five raw ones.
This is context engineering in practice.
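One way the 70/20/10 rule above looks in code. The tier prices are assumed list rates, and the complexity score stands in for whatever triage signal a harness actually has:

```python
# Route each task to the cheapest tier that can handle it; escalate rarely.
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    cost_per_m: float  # $/M tokens, ASSUMED list rates

HAIKU  = Tier("haiku",  1.00)   # ~70% of tasks: lookups, small edits
SONNET = Tier("sonnet", 3.00)   # ~20%: multi-file changes
OPUS   = Tier("opus",  15.00)   # ~10%: architecture, hard debugging

def route(complexity: float) -> Tier:
    """complexity in [0, 1], from a hypothetical triage step."""
    if complexity < 0.70:
        return HAIKU
    if complexity < 0.90:
        return SONNET
    return OPUS

blended = 0.7 * HAIKU.cost_per_m + 0.2 * SONNET.cost_per_m + 0.1 * OPUS.cost_per_m
print(f"blended ${blended:.2f}/M vs all-Opus ${OPUS.cost_per_m:.2f}/M "
      f"({OPUS.cost_per_m / blended:.1f}x cheaper)")   # ~5.4x
```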
Layer 3 — Use
Same spend. Different productivity.
Team A
Spend: $5,000/month
$35
cost per merged PR
Team B
Spend: $5,000/month
$120
cost per merged PR
The variance is not in price. It’s in which model people reach for, when they escalate, and when they stop.
Specify
Outcome-first prompts
describe what good looks like; drop process scaffolding.
Literal instruction
the model won’t generalize what you don’t say.
Explicit stop conditions
tell the agent when it’s done, so it runs the fewest useful tool loops.
Control
Subagent context isolation
fresh context windows; only summaries return.
Task budgets
advisory token ceiling; the model paces itself.
Escalation rules
small model first; reasoning model only when needed.
Measure
Cost per merged PR
the highest spender can be the lowest cost-per-PR (see the sketch below).
AI drafts, humans gate
generate freely; verify before commit.
Experienced-user playbooks
senior users converge on different patterns.
This layer does not get cheaper on its own.
Vantage agentic-coding cost analysis, April 2026. ~10× cost-per-developer variance within a single team is typical.
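The denominator in code. The merged-PR counts below are back-solved from the slide's $35 and $120 figures, not reported numbers:

```python
# Same spend, different denominators: cost per merged PR.

def cost_per_merged_pr(monthly_spend: float, merged_prs: int) -> float:
    return monthly_spend / merged_prs

print(f"Team A: ${cost_per_merged_pr(5_000, 143):.0f}/PR")  # ≈ $35,  ~143 merges
print(f"Team B: ${cost_per_merged_pr(5_000, 42):.0f}/PR")   # ≈ $119, ~42 merges
```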
Layer 1 sets the rate card.
Layer 2 sets the loop.
Layer 3 sets the denominator.
The subsidies are ending.
What looks like a loss is a gift: knowing what the work cost all along, and choosing what is worth running.
@kamilkrauspe