Model choice, escalation, when to delegate vs. supervise.
Two have engineering vocabularies. The third shows up as variance inside the others.
@kamilkrauspe
ALT · CAND-14
The seam
Three places the cost lives.
1 · MODEL · labs and serving providers
2 · HARNESS · users and tool builders
3 · HUMAN · still emerging
Two have engineering vocabularies. One does not.
V3 + ALT-B · beat 6
Mechanics of $/M tokens
A $/M-tokens number is mechanical. Rent ÷ throughput.
$/M tokens = GPU rent per hour ÷ (tokens/sec × 3,600) × 1,000,000
The 2026 architectural fronts mostly raise the denominator.
Mixture-of-Experts
DeepSeek V3.2 carries 671B total, activates 37B per token. Fewer FLOPs → more tokens/sec.
KV / latent attention
Compress the cache. ~10× less HBM at long context. More concurrent requests fit per GPU.
Quantization (NVFP4)
Lower-precision weights and KV. Less memory and bandwidth per token.
User-facing lever: prompt caching. Anthropic reads at 10% of input price; one varying byte at the prefix kills the chain.
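The caching lever reduces to one line of arithmetic. A sketch, assuming cached reads bill at 10% of list input price (as stated above) and ignoring cache-write surcharges and output tokens:

```python
def effective_input_multiplier(hit_rate: float, read_discount: float = 0.10) -> float:
    """Blended multiplier on list input price: hits pay the discount, misses pay full."""
    return hit_rate * read_discount + (1.0 - hit_rate)

# A broken prefix (one varying byte) drops hit_rate toward 0 and the blend toward 1.0x.
for hit_rate in (0.85, 0.70, 0.00):
    print(f"hit rate {hit_rate:.0%} -> {effective_input_multiplier(hit_rate):.3f}x list input price")
```

At 85% the blend is 0.235×; at 70% it is 0.37×: the full-price share doubles while the blended input bill rises about 1.6×.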
Sources: Hugging Face (DeepSeek, GPT-OSS); NVIDIA NVFP4; Anthropic pricing.
ALT · CAND-06
A formula
Where a price per million tokens actually comes from.
rent ÷ throughput
Anything that moves throughput moves the price. The 2026 architectural fronts are mostly versions of this.
ALT · CAND-08
A model and its weights
DeepSeek V3.2 carries 671B parameters and activates a fraction of them per token.
total parameters · 671B · most of the model is dormant on most tokens
active per token · 37B · what each forward pass actually computes
Mixture-of-Experts. One of several reasons "$/M tokens" can fall.
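The sparsity ratio behind that claim, computed from the model-card figures cited below:

```python
total_params = 671e9    # DeepSeek V3.2, total parameters
active_params = 37e9    # activated per token (Mixture-of-Experts routing)

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of weights compute on any given token")  # ~5.5%
```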
Source: Hugging Face · DeepSeek V3.2 model card.
V3 · beat 7
Where a $/M number comes from
Where a $/M-tokens number actually comes from.
A single open-weights model on a single GPU, walked through five lines of arithmetic.
Step 1 · GPT-OSS-120B at MXFP4 on 1× H100 80GB.
Step 2 · Hardware rent: ~$3 / hour on commodity rental.
Step 3 · Saturated throughput: ~5,000 output tokens/sec.
Step 4 · 5,000 × 3,600 = 18M tokens/hour.
Step 5 · $3 ÷ 18 = ~$0.17 / M output tokens.
Illustrative, not normative
Moves with prompt mix, cache hit rate, prefill-vs-decode, on-demand vs reserved capacity, and runtime engine. Shows the physics. Not a market-pricing claim.
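The five steps as one calculation; the rent and throughput figures are the slide's illustrative assumptions, not market quotes:

```python
rent_per_hour = 3.00       # assumed: ~$3/hr for 1x H100 80GB, commodity rental
tokens_per_sec = 5_000     # assumed: saturated output throughput at MXFP4

tokens_per_hour = tokens_per_sec * 3_600             # 18,000,000
cost_per_m_tokens = rent_per_hour / (tokens_per_hour / 1_000_000)
print(f"~${cost_per_m_tokens:.2f} / M output tokens")  # ~$0.17
```

Every term is linear: halve the rent or double the throughput and the quote halves.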
ALT-B · beat 7
A worked example · GPT-OSS-120B
Where a $/M-tokens number actually comes from.
A worked example on one open-weights model and one commodity GPU.
Step 1 · 1× H100 80GB on a typical GPU rental market.
Step 2 · Hardware rent: ~$3 / hour on commodity rental.
Step 3 · Saturated throughput at MXFP4: ~5,000 tokens/second.
Step 4 · Tokens/hour = 5,000 × 3,600 = 18,000,000.
Step 5 · Cost / M output tokens = $3 ÷ 18 = ~$0.17 / M.
Illustrative on one model and one config. Cache, prompt mix, runtime engine, and procurement model all move the answer. Not a market-price comparison.
Source: OpenAI GPT-OSS-120B model card; commodity GPU rental rates (2026 Q1).
ALT · CAND-07
One model · one calculation
GPT-OSS-120B at MXFP4 on 1× H100.
Rent · 1× H100 80GB rental: ~$3 / hour
Speed · Saturated throughput: ~5,000 tokens/second
Per hr · 5,000 × 3,600 = 18,000,000 tokens / hour
Per M · $3 ÷ 18 = ~$0.17 / M tokens
Move any term. The price moves.
V3 + ALT-B · beat 8
Where the multiplier lives
Model-side gains lower the unit cost. Harness-side gains compound.
Agentic loops are recursive. So is the bill.
[diagram: turn 1 → turn 50 · 50× turns · per-turn deltas compound]
Pattern | Evidence
Just-in-time context | Cursor MCP dynamic discovery: 46.9% reduction in agent tokens.
Scoped sub-agents | ~7× more tokens per run, earned on parallelizable work.
Cache-breakpoint discipline | Falling from 85% to 70% hit rate doubles the full-price token share; at 10%-of-list read pricing, the input bill rises ~1.6×.
Compaction before exhaustion | Anthropic April postmortem: botched compaction → quality regression.
A 10% per-turn improvement, applied 50 times, is not a 10% improvement.
Harness gains compound. So do harness regressions.
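A toy model of the compounding claim. The assumption (mine, not the source's) is that a per-turn delta multiplies through the loop, e.g. a leaner context carried forward shrinks every later turn's input too:

```python
def compounded_ratio(per_turn_factor: float, turns: int) -> float:
    # Cost relative to baseline after `turns` multiplicative per-turn deltas.
    return per_turn_factor ** turns

print(f"10% better per turn, 50 turns: {compounded_ratio(0.90, 50):.2%} of baseline cost")
print(f"10% worse per turn, 50 turns:  {compounded_ratio(1.10, 50):.0f}x baseline cost")
```

The asymmetry is the point: the same exponent that makes gains compound makes regressions compound.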
Sources: Cursor engineering blog; Anthropic April 2026 postmortem; Anthropic pricing docs.
ALT · CAND-09
Total agent token consumption, before and after switching MCP tools from a static load to dynamic context discovery.
46.9% reduction
That is not a model improvement. It is a harness improvement.
Source: Cursor engineering blog · "Dynamic context discovery."
V3 · beat 9
Two teams · one tool · 3.4× variance
One AI tool, one $5K seat budget, a 3.4× gap in cost per merged PR.
The visible artifact of the harness/user seam.
Team A · $5,000 / month · $35 / merged PR
Team B · $5,000 / month · $120 / merged PR
The variance lives in three user habits: model selection, reasoning concision, and escalation thresholds.
The user layer has no dashboard.
It shows up as variance inside the harness layer's numbers.
Source: Vantage agentic-coding cost analysis · April 15, 2026.
ALT-B + ALT · CAND-10
Two teams · two bills
Identical setup, a $35 vs $120 outcome per merged PR.
Team A · $35 / merged PR
Team B · $120 / merged PR
The variance is not in the price. It is in how people use the same tool.
Source: Vantage · April 15, 2026.
V3 · beat 10
A note on falling prices
Won't falling prices make this moot?
They have fallen. That has not reduced total spend.
[chart: per-token price (Epoch AI · fixed-quality) vs. enterprise GenAI total spend (CloudZero)]
Per-token prices fell ~1,000×. Total enterprise GenAI spend rose 220%. Adoption ate the savings.
Planning around 2026-Q1 inference prices is
the same mistake teams made with cloud spend in 2017.
Sources: Epoch AI inference-pricing dataset; CloudZero State of AI Costs; CloudZero Cloud Unit Economics 2026.
ALT · CAND-11
A line going down · a line going up
Won't falling prices make this moot?
[chart: per-token price vs. enterprise GenAI total spend]
They have. It hasn't reduced total spend.
Sources: Epoch AI · CloudZero.
V3 · beat 11
Labor-substitution · regional table
The substitution math depends entirely on which cell you're in.
AI subscription, AI PAYG (avg vs. heavy), and engineer cost — by region. The numbers diverge sharply.
Average dev = ~2M tokens/active day. Heavy power-user = 50–100M, midpointed at 75M (Redelinghuys forensic dataset, Anthropic enterprise capacity guidance).
Region (senior eng FL) | $/month | $200 plan | Avg dev (2M) | Heavy · Sonnet | Heavy · Opus
United States — major | $20,000 | 1.0% | 0.3% | 10.5% | 17.6%
Western Europe (UK, DE) | $11,000 | 1.8% | 0.5% | 19.2% | 31.9%
Slovakia / CEE | $7,000 | 2.9% | 0.8% | 30.1% | 50.2%
The question is not "does AI replace engineers."
It is: at what subsidy level, in what region, for which seniority tier, and for what kind of work?
AI is absorbing pieces of the role,
while the human stays accountable for the whole.
Sources: Levels.fyi; ERI SalaryExpert; Anthropic + OpenAI pricing; Sonar State of Code 2026; Faros AI; Anthropic Economic Index.
ALT-B · beat 11
Heavy power-user · 75M tokens/day
Heavy power-user. Sonnet 4.6 and Opus 4.7 PAYG at 75M tokens/day.
The substitution math depends on which cell you're in.
Average dev ~2M/day. Heavy 50–100M/day midpointed at 75M. 85% cache hit, 22 working days.
Region | $/month | $200 plan | Avg dev | Heavy · Sonnet | Heavy · Opus
US — major | $20,000 | 1.0% | 0.3% | 10.5% | 17.6%
W. Europe (UK, DE) | $11,000 | 1.8% | 0.5% | 19.2% | 31.9%
Slovakia / CEE | $7,000 | 2.9% | 0.8% | 30.1% | 50.2%
The average developer is fully self-funded by the $200 plan. The heavy user is heavily subsidized — at Opus rates, by 12–23×.
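The table's cells reduce to one formula. Hedge: the blended $/M rates below (~$1.28/M Sonnet, ~$2.13/M Opus, 85% cache hit baked in) are back-solved from the table itself, not published list prices:

```python
def pct_of_salary(tokens_per_day_m: float, blended_rate: float,
                  monthly_salary: float, working_days: int = 22) -> float:
    """Monthly PAYG spend as a percentage of a senior engineer's monthly cost."""
    monthly_cost = tokens_per_day_m * working_days * blended_rate
    return 100.0 * monthly_cost / monthly_salary

# Heavy power-user at the 75M tokens/day midpoint; rates are assumptions.
for region, salary in [("US (major)", 20_000), ("W. Europe", 11_000), ("Slovakia / CEE", 7_000)]:
    sonnet = pct_of_salary(75, 1.277, salary)
    opus = pct_of_salary(75, 2.13, salary)
    print(f"{region}: Sonnet {sonnet:.1f}% · Opus {opus:.1f}%")
```

Only the denominator changes across rows, which is why the cheapest-labor region shows the highest substitution ratio.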
Sources: Levels.fyi; ERI SalaryExpert; Anthropic + OpenAI pricing; Redelinghuys forensic dataset.
ALT · CAND-12
The Slovakia row
Heavy power-user. Sonnet 4.6 PAYG at 75M tokens/day.
Region (senior eng FL) | Heavy · Sonnet | Heavy · Opus
United States — major ($20K/mo) | 10.5% | 17.6%
Western Europe ($11K/mo) | 19.2% | 31.9%
Slovakia / CEE ($7K/mo) | 30.1% | 50.2%
The substitution math depends entirely on which cell you're in.