Per-workload token consumption across six common shapes.
Workload | Scope | Tokens | Cost
Inline autocomplete | one keystroke | < 1K | < $0.01
Single chat prompt | no project context | < 10K | < $0.05
Targeted file edit | short multi-step | < 100K | < $0.30
Interactive coding | 30–90 min session | < 10M | < $10
Long autonomous run | 1–4 hours | < 100M | < $80
Multi-agent / background | production fleets | up to 1B | $500+
[Chart: log scale from 1K to 1B tokens; a ~100,000× spread, per workload.]
Documented heavy-use sessions
52.5M tokens
in 38 minutes — Cursor agent replaying a 120K context window in a tool loop.
$300 / day per agent
Jason Calacanis on All-In, at ~10–20% capacity utilization.
$1,000 / wk
single Replit Agent 3 user spike, vs. their typical $180–200 / month baseline.
Sonnet 4.6 list-rate estimates. Sources: Cursor support forum
(April 2026), All-In Podcast (Feb 2026), Replit user reports
via The Register (Sept 2025).
I run engineering across two continents — and the variance between
developers dwarfs the variance between models.
@kamilkrauspe
A real heavy user costs more to serve than the plan covers.
What you pay: $200 / month, flat subscription.
What heavy use actually costs: $1,000–$4,400 / month, inference at list rates.
The plan covers 5–20% of the inference. Someone is paying the rest.
Heavy power user at 100M tokens/active day. ~93% cache reads · ~2% output. Range spans Sonnet-default (75% Sonnet / 20% Opus / 5% Haiku) to Opus-loyal (95% Opus / 5% Sonnet). ~22 active days/month. Estimates at list rates.
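The arithmetic behind that range can be sketched in a few lines. Only the token mix, model mixes, and active-day count come from the note above; everything else, in particular the per-million rates, is a placeholder assumption to be replaced with current list prices, so the printed totals are illustrative rather than a reproduction of the figure.

```python
# Sketch of the heavy-use cost model. Token mix, model mixes, and active days
# follow the note above; the $/M-token rates are PLACEHOLDER assumptions,
# not quoted list prices.

TOKENS_PER_ACTIVE_DAY = 100_000_000
ACTIVE_DAYS_PER_MONTH = 22

# ~93% cache reads, ~2% output; the residual ~5% is treated here as
# uncached input (an assumption).
TOKEN_MIX = {"cache_read": 0.93, "output": 0.02, "input": 0.05}

# Placeholder $/M-token rates per model and token kind.
RATES = {
    "sonnet": {"input": 3.00, "output": 15.00, "cache_read": 0.30},
    "opus": {"input": 15.00, "output": 75.00, "cache_read": 1.50},
    "haiku": {"input": 0.80, "output": 4.00, "cache_read": 0.08},
}


def monthly_cost(model_mix: dict) -> float:
    """Monthly inference cost at the assumed rates for a given model mix."""
    per_day = 0.0
    for model, share in model_mix.items():
        for kind, frac in TOKEN_MIX.items():
            millions = TOKENS_PER_ACTIVE_DAY * share * frac / 1_000_000
            per_day += millions * RATES[model][kind]
    return per_day * ACTIVE_DAYS_PER_MONTH


sonnet_default = {"sonnet": 0.75, "opus": 0.20, "haiku": 0.05}
opus_loyal = {"opus": 0.95, "sonnet": 0.05}

print(f"Sonnet-default mix: ${monthly_cost(sonnet_default):,.0f}/month")
print(f"Opus-loyal mix:     ${monthly_cost(opus_loyal):,.0f}/month")
```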
It’s not only about the money.
Heavy use overshoots the plan in dollars. It also overshoots it in capacity.
4×
Tokens per request, heavy users (P90), year over year.*
Subscription changes aren’t only cost recovery.
They’re also rationing of a scarce resource.
* Source: Datadog State of AI Engineering, 2026.
So what can we do about it?
The same lever shows up at three layers. Cost, throughput, and capacity move together.
1 · Inference
The unit of inference: architecture, runtime, cache mechanics.
Improves through: labs and serving providers competing on $/M.
Your hand: which model you reach for; prefix discipline (see the sketch after this list).

2 · Harness
The system around the model: context, tools, orchestration.
Improves through: tool builders competing; engineering teams iterating.
Your hand: context engineering; harness, agent, and tool design.

3 · Use
The choices made turn-by-turn: which model, when to escalate, when to stop.
Improves through: you, deliberately.
How it’s used: all of it.
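One concrete example of the hand you have at layer 1 is the prefix discipline mentioned in the first card. The sketch below is provider-agnostic and its message shapes and names are hypothetical; it only shows the ordering habit that keeps prompt caches warm.

```python
# Illustrative sketch of prefix discipline for prompt caching. No specific
# provider API is assumed. Prompt caches match on an identical leading prefix,
# so the expensive, stable material goes first and stays byte-for-byte
# identical across turns, while volatile per-turn content is appended last.

PROJECT_CONTEXT = "...large, rarely-changing repo and style context..."  # placeholder

STABLE_PREFIX = [
    {"role": "system", "content": "You are the repo assistant."},
    {"role": "user", "content": PROJECT_CONTEXT},
]


def build_request(history: list, new_message: str) -> list:
    """Stable prefix first (eligible for cache reads), then history, then the new turn."""
    return [*STABLE_PREFIX, *history, {"role": "user", "content": new_message}]


# Editing or reordering anything inside STABLE_PREFIX between turns breaks the
# match and turns cheap cache reads back into full-price input tokens.
request = build_request(history=[], new_message="Rename the config loader and update call sites.")
```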
Only you can move layer 3.
That is where your work begins.
Layer 1 — Inference
What does a million tokens cost to make?
Self-hosted GPT-OSS-120B on a rented H100 80GB.
Cost: $3 / hour (rental list rate)
Throughput: 18M tokens / hour (vLLM saturated, output tokens)
cost ÷ throughput ≈ $0.17 per million tokens
A simplified single-configuration example. Real costs vary.
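Spelled out with units, the division uses only the two figures above; swapping in a different GPU price or throughput changes the answer directly.

```python
# Cost per million output tokens from the two figures above: a rented H100 at
# $3/hour and ~18M output tokens/hour with vLLM saturated.
gpu_rental_per_hour = 3.00           # $/hour, rental list rate
output_tokens_per_hour = 18_000_000  # tokens/hour at saturation

cost_per_million = gpu_rental_per_hour / (output_tokens_per_hour / 1_000_000)
print(f"${cost_per_million:.2f} per million output tokens")  # ~$0.17
```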
Throughput is the denominator. Push it up, the cost falls. So how?
plan first · reset early · escalate deliberately …
MEASURE: cost per PR · accepted vs. retained · repo rules …
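One way to make the MEASURE line concrete is below; the record shape, PR identifiers, and dollar amounts are invented for illustration, and you would substitute whatever your own token-usage export provides.

```python
# Hypothetical sketch: cost per merged PR and the share of spend that landed.
from collections import defaultdict

usage = [
    # (pull request, inference dollars, merged?)
    ("repo#101", 18.40, True),
    ("repo#102", 4.10, False),   # abandoned: spend with no retained change
    ("repo#103", 61.75, True),
]

spend = defaultdict(float)
merged = set()
for pr, dollars, was_merged in usage:
    spend[pr] += dollars
    if was_merged:
        merged.add(pr)

total = sum(spend.values())
print(f"cost per merged PR: ${total / max(len(merged), 1):.2f}")
print(f"share of spend retained: {sum(spend[p] for p in merged) / total:.0%}")
```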
This layer does not get cheaper on its own.
Sources: Vantage, ‘AI Costs Are Cloud Costs Now’ (Apr 29,
2026) and ‘The Hidden Cost Driver in Agentic Coding’ (Apr 15,
2026). Vantage documents up to ~10× difference in token costs
between two engineers on the same team.
The subsidies are ending. What looks like a loss is also a gift —
A gift of knowing what everything was costing all along. And of choosing, again, what is worth running.
(Larger reverberations too — through organizations, through societies, through what we have been calling “cheap”…)
Written from where engineering leadership, agentic coding, and enterprise constraints meet.