AI token economics: what buyers keep getting wrong

Introduction

On our wall at High Peak Studio, the top row of the dashboard is not “active users” or “queries served.” It is three lines that move together: tokens per accepted user story, tokens per pull request merged, and tokens per support ticket resolved. We cross‑correlate those token averages with development progress every week before we scale spend. When the slope of progress flattens while token consumption rises, we don’t argue about “adoption.” We change the plan, the routing rules, or the model mix. This is how we run.

The operational rhythm is simple: gauges before throttle. Every pilot begins instrumented at the task level, every model call is labeled with a work unit, and every increase in concurrency is gated behind real conversions in throughput or quality. We track token cost per plan generated, per test suite stabilized, and per deployment unlocked; the board only approves a broader rollout when those ratios hold for four consecutive sprints. That discipline is dull from the outside, but it is the difference between a showcase demo and an operating capability.

Enterprise headlines have made the cost pattern visible. Per‑token math does not care about org charts; if you scale the agent count faster than outcomes, the bill outpaces headcount efficiency. The problem is literal, not philosophical, which is why even hyperscalers have flagged the per‑token versus per‑employee tradeoff in public reporting. Cost curves have improved dramatically: token prices have fallen by orders of magnitude since early GPT‑4. Yet sustained returns remain rare, with one analysis estimating token costs down roughly 280× and only about 5% of enterprises seeing durable gains at scale. The gap between cheaper tokens and expensive outcomes is an operator’s problem, not a technology problem.

Stop buying AI capacity you haven’t instrumented to outcomes.

From tokens to timelines

Start with the unit you actually ship: a user story accepted, a defect closed, a customer email resolved, a release cut. We anchor every workflow to that artifact and use it to compute cost per outcome. The formula we use on day one is trivial by design:

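Cost per outcome = (tokens per call × price per token × calls per task × retry multiplier × tasks per interval) ÷ outcomes achieved per interval, where the retry multiplier is 1 when nothing is retried. Tokens per task is the product of the first, third, and fourth terms.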

If tokens per task are stable but outcomes per interval slip, it is a queuing or orchestration problem. If tokens per task jump after a model upgrade, it is a routing and prompt design problem. If both move together, you have a planning problem: you are asking the wrong model to do the wrong job. Operators who separate these signals can fix the right layer in the stack.
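A minimal sketch of that day‑one arithmetic and the three diagnoses, assuming retries are counted as extra calls (so a task with zero retries still has a cost). The class and function names, the sample fields, and the 10% noise band are illustrative assumptions, not the studio's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class IntervalStats:
    """Aggregates for one workflow over one interval (e.g. a sprint)."""
    tasks_attempted: int      # tasks sent through the workflow
    tokens_per_call: float    # average tokens (in + out) per model call
    calls_per_task: float     # average first-attempt calls per task
    retries_per_task: float   # average extra calls caused by failures
    price_per_token: float    # currency units per token
    outcomes: int             # accepted stories, merged PRs, closed tickets


def tokens_per_task(s: IntervalStats) -> float:
    # Assumption of this sketch: a retry costs the same as a normal call.
    return s.tokens_per_call * (s.calls_per_task + s.retries_per_task)


def cost_per_outcome(s: IntervalStats) -> float:
    if s.outcomes == 0:
        return float("inf")  # spend with nothing shipped
    total_cost = tokens_per_task(s) * s.price_per_token * s.tasks_attempted
    return total_cost / s.outcomes


def diagnose(prev: IntervalStats, curr: IntervalStats, tol: float = 0.10) -> str:
    """Map the two signals onto the layer to fix; tol is a hypothetical noise band."""
    tokens_up = tokens_per_task(curr) > tokens_per_task(prev) * (1 + tol)
    outcomes_down = curr.outcomes < prev.outcomes * (1 - tol)
    if tokens_up and outcomes_down:
        return "planning"
    if tokens_up:
        return "routing/prompt design"
    if outcomes_down:
        return "queuing/orchestration"
    return "stable"
```

Comparing two consecutive intervals with `diagnose` gives a first guess at which layer to inspect; it does not replace looking at the traces.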

For enterprise buyers, the near‑term economics were summarized cleanly in a May 2026 framework on enterprise inference economics. It treats inference as a real cost center with practical controls around throughput, quality, and concurrency, not as a background utility that scales without friction. When we brought that discipline into our developer workflows, the conversation with product owners changed: no more abstract ROI slides. We showed the cost per merged PR dropping as we tuned prompt scaffolds and split tasks across model tiers, then we increased concurrency only after the curve held across two releases.

We also tie planning calendars to token budgets. A quarterly roadmap that depends on unconstrained model calls is not a roadmap. We set sprint‑level token envelopes per epic, establish guardrails at the orchestrator, and escalate only if the throughput per token is improving. When a team requests more capacity, we ask for the expected delta in outcomes. If it is not explicit, we wait.

Schedule and spend move together, or they don’t move at all.

Tiering models to tasks

The most common error we see is buyers picking a “house model” and pointing everything at it. The result is either overspend or performance complaints that trigger endless prompt tweaking. Models are tools. The job decides the model, not the other way around.

Our default tiering pattern looks like this:

Planner/architect work (heavy reasoning). Use a high‑capability model with strong planning and tool‑use skills. We bound this to a small number of calls per epic—e.g., one to three session plans per feature, each capped at 8–12k tokens. This model writes the plan, enumerates risks, and generates scaffolds—not bulk content.

Code generation and refactoring (structured synthesis). Use a mid‑tier model optimized for code generation with reliable function‑calling and constrained output adherence. We limit these calls by file or function: 1–3k tokens per file, retries capped at one. We stage diffs and run tests automatically; failed tests trigger either a second mid‑tier pass or an escalated review, not an automatic jump to the top model.

Bulk classification, routing, and enrichment (narrow decisions). Use the smallest viable model or even rules when the schema is stable. Typical caps: 50–300 tokens per item, high concurrency, dead‑letter queues for edge cases. These tasks do not justify premium reasoning.

We implement this as a routing tree inside our orchestrator. Inputs are labeled by task type, complexity, and risk flag. The router selects a model and prompt scaffold, enforces a token ceiling, and records outcome metrics. If a task times out or violates the ceiling, it is either downgraded in complexity (smaller chunk) or escalated (one heavy‑model pass) based on a learned policy. The critical point: escalation is explicit and rare. A small percentage of tasks deserve premium reasoning; the rest deserve predictability.
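A rough sketch of such a router, under stated assumptions: the tier names are placeholders, the ceilings reuse the upper caps quoted above, and a fixed rule (split unless the task is flagged high risk) stands in for the learned policy, which the text does not specify.

```python
from dataclasses import dataclass

# Hypothetical tier names; ceilings mirror the upper caps quoted above.
ROUTES = {
    "plan":     {"model": "top-tier",   "token_ceiling": 12_000},
    "codegen":  {"model": "mid-tier",   "token_ceiling": 3_000},
    "classify": {"model": "small-tier", "token_ceiling": 300},
}


@dataclass
class Task:
    task_type: str           # "plan", "codegen", or "classify"
    estimated_tokens: int
    high_risk: bool = False


@dataclass
class Decision:
    action: str              # "run", "split", or "escalate"
    model: str
    token_ceiling: int


def route(task: Task) -> Decision:
    cfg = ROUTES[task.task_type]
    if task.estimated_tokens <= cfg["token_ceiling"]:
        return Decision("run", cfg["model"], cfg["token_ceiling"])
    # Over the ceiling: escalation is the explicit, rare branch.
    if task.high_risk:
        top = ROUTES["plan"]
        return Decision("escalate", top["model"], top["token_ceiling"])
    # Default: downgrade complexity by asking the caller for smaller chunks.
    return Decision("split", cfg["model"], cfg["token_ceiling"])
```

The useful property is that every path out of `route` carries a ceiling, so nothing reaches a model unbounded.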

The reflex to aim everything at the “smartest” model is understandable—nobody wants to ship the wrong answer. But the price jump between tiers is real and compounds at scale. We use Anthropic’s premium tier sparingly; planner calls may land on a top‑capability model for initial architecture, but production bulk work does not. You do not reach for Claude Opus 4.5 to sweep the floor. One escalated plan, many economical executions.

Under the hood, the instrumentation is boring and necessary:

Every route has a token ceiling and a rollback rule.
Every route logs tokens, latency, pass/fail tests, and human correction time.
Every route exposes a per‑task cost and per‑task variance in the dashboard.
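One way to picture the per‑route log and the dashboard numbers derived from it is the sketch below. The record fields follow the list above; the type and function names are hypothetical, and it assumes one logged call per task for simplicity.

```python
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass
class RouteCall:
    route_id: str
    tokens: int                # tokens in + tokens out
    latency_s: float
    tests_passed: bool
    human_fix_minutes: float


def route_summary(calls: list[RouteCall], price_per_token: float) -> dict:
    """Per-task cost, its variance, and quality signals for one route."""
    costs = [c.tokens * price_per_token for c in calls]
    return {
        "cost_mean": mean(costs),
        "cost_variance": pvariance(costs),  # population variance over logged calls
        "pass_rate": sum(c.tests_passed for c in calls) / len(calls),
        "human_fix_minutes_mean": mean(c.human_fix_minutes for c in calls),
    }
```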

Teams see the same facts product sees. When a mid‑tier model’s pass rate rises, we push more traffic to it. When a small model drifts and human correction time rises, we retrain prompts or swap the model. The architecture rewards discipline, not preference.

tiering is not a nice‑to‑have; it is your margin.

Go measured before broadband

Pilots that “just work” in a demo room can consume a year’s AI budget by April when they aren’t instrumented. The pattern is now public enough to be teachable. One high‑profile operator reportedly exhausted its 2026 AI allocation within the first four months as pilot enthusiasm outran consumption modeling. In parallel, according to public reporting, a major platform provider reversed course on an internal code assistant and terminated licenses by June 30, citing costs and low adoption relative to spend. These are not “anti‑AI” stories; they are governance stories. If the system cannot show cost per outcome, leaders will pull the brake.

The cost structure that creates these reversals is straightforward. If you scale the number of agents or copilots based on seats instead of measured throughput, your per‑employee cost curve decouples from value. Even companies with advantaged infrastructure economics have flagged the issue: per‑token costs accumulate faster than per‑employee productivity if requests are unbounded. The fix is not ideological; it is operational: quotas, queues, and per‑task ceilings before company‑wide access.

At High Peak Studio, we won’t push a pilot to 100% of a team until it holds three conditions for at least two sprints:

1) The cost per outcome is below the human‑only baseline and stable within ±10%.

2) The variance in outcome quality is trending down across the last two releases.

3) The tail of expensive retries is shrinking and bounded by an explicit ceiling.
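The three conditions can be expressed as a single gate check. The sketch below is one interpretation, not the studio's code: "stable within ±10%" is read as every recent sprint lying within 10% of the recent mean, and "trending down" as strictly decreasing values. All function and parameter names are assumptions.

```python
def within_band(values: list[float], band: float = 0.10) -> bool:
    """True if every value lies within ±band of the series mean."""
    avg = sum(values) / len(values)
    return all(abs(v - avg) <= band * avg for v in values)


def trending_down(values: list[float]) -> bool:
    """True if there are at least two points and each is lower than the last."""
    return len(values) >= 2 and all(b < a for a, b in zip(values, values[1:]))


def ready_for_full_rollout(
    sprint_costs: list[float],       # cost per outcome, one entry per sprint
    human_baseline: float,           # human-only cost per outcome
    quality_variance: list[float],   # one entry per release, oldest first
    retry_tail: list[float],         # e.g. worst-case retry tokens per sprint
    retry_ceiling: float,
    min_sprints: int = 2,
) -> bool:
    recent = sprint_costs[-min_sprints:]
    cost_ok = (
        len(recent) == min_sprints
        and max(recent) < human_baseline
        and within_band(recent)
    )
    # Latest two releases plus the one before them, when available.
    quality_ok = trending_down(quality_variance[-3:])
    tail = retry_tail[-min_sprints:]
    tail_ok = trending_down(tail) and max(tail) <= retry_ceiling
    return cost_ok and quality_ok and tail_ok
```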

We also use staged concurrency. Ten users get 100% access. The next 50 get a rate‑limited tier. Everyone else sees a request pool. Access rises when the dashboards show a real change in outcomes. This slows the applause but keeps the budget intact.

Measured rollouts extend budgets and reveal real ROI.

Name the subsidy you’re standing on

For the past two years, the market was built on coupons. Providers ate a meaningful share of inference costs to build volume, comfort, and switching costs. Estimates of the effective subsidy vary by product and period, but the directional point is uncontroversial: unit prices were often below sustainable economics to drive adoption, and that era is repricing now. CFOs who modeled pilots on “temporary price sheets” are discovering that fragile unit economics break when coupons expire, a point echoed in a May 2026 cautionary note for finance leaders.

The repricing pressure has two sources. First, demand is real and rising. Second, the infrastructure revenue required to meet it is not trivial; one forward view, the Georgetown KGI/Bain analysis, places a multi‑hundred‑billion‑dollar gap between demand and captured infrastructure revenue by 2030, which will translate into price discipline as capacity and capital chase each other. Pair this with field reports that, despite massive token price drops, only a small fraction of enterprises are realizing durable, scaled returns, and you get a simple directive: build cost consciousness into the system, not the slideware.

Practically, we name the subsidy on every dashboard. Each route displays two token prices: book price and effective price (after credits or promotions). We compute “subsidy exposure” as the percentage of our throughput that would flip from green to red if book price applied tomorrow. If that number rises above 20%, the router automatically experiments with a cheaper mix and throttles bulk tasks until we can reclaim margin.
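A minimal sketch of the exposure calculation, assuming "green" means cost per outcome at or under a per‑route budget and that throughput is counted in outcomes. The field names and the budget threshold are hypothetical; only the 20% trigger comes from the text.

```python
from dataclasses import dataclass


@dataclass
class RouteEconomics:
    route_id: str
    tokens_per_outcome: float
    outcomes: int                # throughput in the interval
    book_price: float            # vendor list price per token
    effective_price: float       # price per token after credits or promotions
    budget_per_outcome: float    # hypothetical green/red threshold


def subsidy_exposure(routes: list[RouteEconomics]) -> float:
    """Share of throughput that is green at effective price but red at book price."""
    total = sum(r.outcomes for r in routes)
    if total == 0:
        return 0.0
    flipped = sum(
        r.outcomes
        for r in routes
        if r.tokens_per_outcome * r.effective_price <= r.budget_per_outcome
        and r.tokens_per_outcome * r.book_price > r.budget_per_outcome
    )
    return flipped / total


def should_throttle_bulk(routes: list[RouteEconomics], threshold: float = 0.20) -> bool:
    return subsidy_exposure(routes) > threshold
```

Routes that are already red at the effective price are deliberately not counted here; they are a cost problem today, not a subsidy problem.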

Industry voices have put the current state bluntly: unit economics are fragile without controls, as one operator‑investor argues. None of this is doom. Repricing is a feature of a normalizing market. Real value, real efficiency, and real go‑to‑market gains are ahead for teams that treat cost as a design input. But if your plan only works with coupons, you don’t have a plan.

Model your stack at book price and survive it.

Echoes of dot‑com, drained of drama

You do not have to hype or fear AI to recognize the pattern: cheap capital, visible winners, and fast followership create a rush of pilots that try to buy time with spend. Markets have seen this before. A sober historical account out of MIT argues that speculative investment can accelerate capability build‑out but also overshoots productive use when cash outpaces real adoption paths. Another lens from policy and consulting quantifies the mismatch between projected demand and the revenue that can sustainably support the infrastructure by 2030 (the KGI/Bain gap). Together, overenthusiastic spend and a real capacity build create a simple operator mandate: make your own unit economics work now, or the market will do it for you later.

At High Peak Studio, we treat dot‑com history as a governance case study, not a mood. The early web sorted itself through instrumentation and category formation: logs, A/B tests, CDN economics, and standardized stacks. AI will normalize the same way. The noise on social timelines will not set your margins; your routing rules will. When budgets compress, the pilots with task‑level instrumentation, model tiering, and sober subsidy management keep going. Everything else waits for the next steering committee.

FOMO doesn’t ship; disciplined unit economics do.

AI token economics in practice

Here is how we build cost into the system without slowing teams down. We start with a naming convention, then we wire data exhaust to a dashboard, then we make the router act on it.

1) Naming and tagging. Every call carries: project_id, epic_id, task_type, model_id, route_id, token_in, token_out, retry_count, test_pass, human_fix_minutes.

2) Book price vs. effective price. We log the vendor’s stated price per token and the actual effective price after any credits. Two columns, no arguments. Dashboards show cost at both rates.

3) Token ceilings per route. We set hard ceilings for planning, codegen, and routing tasks. Examples we’ve used: 8–12k tokens for planner sessions per epic (max three per feature), 1–3k tokens for codegen per file (one retry), 50–300 tokens for routing/classify per item.

4) Outcome anchors. We attach tokens to outcomes: accepted user stories, merged PRs, closed tickets, or customer messages resolved. We don’t debate “adoption”; we measure conversions.

5) Concurrency gates. Access scales with outcomes. Ten users get full access; the next fifty have rate limits; the rest share a pool. Gates lift when outcome curves hold at target cost.

6) Escalation rules. Routes only escalate to a higher‑tier model when objective criteria are met (e.g., two consecutive failed tests, confidence below threshold, or high business risk tag).

7) Downgrade rules. If human fix time is low but tokens are high, we trial a smaller model with prompt tweaks for a slice of traffic. We do not argue feelings; we run the slice.

8) Planning cadence. We sync the budget and roadmap every sprint: token envelopes per epic, expected conversion to outcomes, and a release‑level review of subsidy exposure at book price.
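Steps 6 and 7 reduce to a few predicates. The sketch below assumes a per‑route state record and a deterministic hash to pick the trial slice; the thresholds, field names, and the 10% slice size are illustrative assumptions rather than values from our routes.

```python
import zlib
from dataclasses import dataclass


@dataclass
class RouteState:
    consecutive_test_failures: int
    confidence: float            # model or verifier confidence, 0.0 to 1.0
    high_business_risk: bool
    avg_tokens: float
    avg_human_fix_minutes: float


def should_escalate(s: RouteState, min_confidence: float = 0.6) -> bool:
    """Escalate only on objective criteria (step 6)."""
    return (
        s.consecutive_test_failures >= 2
        or s.confidence < min_confidence
        or s.high_business_risk
    )


def should_trial_downgrade(
    s: RouteState, token_threshold: float, fix_minutes_threshold: float
) -> bool:
    """Tokens high but human fix time low: candidate for a smaller model (step 7)."""
    return (
        s.avg_tokens > token_threshold
        and s.avg_human_fix_minutes < fix_minutes_threshold
    )


def in_trial_slice(task_id: str, slice_pct: int = 10) -> bool:
    """Deterministically assign roughly slice_pct% of tasks to the trial."""
    return zlib.crc32(task_id.encode("utf-8")) % 100 < slice_pct
```

Hashing the task ID keeps a given task in the same arm across retries, which makes the slice comparison cleaner than random assignment per call.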

This gets real when you translate it into the board view:

Plan generation: 2–3 planner sessions per feature; 8–12k tokens each; expected output is one architecture doc and a test scaffold; cost per plan compared to saved engineer hours.
Codegen/refactor: per‑file caps; tests must pass; retries capped; per‑PR token cost displayed next to human correction minutes.
Routing/classification: caps at sub‑300 tokens; accuracy targets baked in; drift triggers retrain or model swap.

We track three health metrics across all routes:

Token efficiency: tokens per outcome at book and effective price.
Quality stability: pass rate and variance across sprints.
Human correction drag: minutes of human fix per 1k tokens.
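As a rough illustration, the three metrics can be computed from per‑sprint aggregates as follows. The function names and return shapes are assumptions made for this sketch.

```python
from statistics import mean, pvariance


def token_efficiency(
    tokens: int, outcomes: int, book_price: float, effective_price: float
) -> dict:
    """Tokens per outcome, priced at both book and effective rates."""
    per_outcome = tokens / outcomes
    return {
        "tokens_per_outcome": per_outcome,
        "cost_at_book": per_outcome * book_price,
        "cost_at_effective": per_outcome * effective_price,
    }


def quality_stability(sprint_pass_rates: list[float]) -> dict:
    """Mean pass rate and its variance across sprints."""
    return {
        "mean_pass_rate": mean(sprint_pass_rates),
        "variance": pvariance(sprint_pass_rates),
    }


def human_correction_drag(fix_minutes: float, tokens: int) -> float:
    """Minutes of human fixing per 1k tokens consumed."""
    return fix_minutes / (tokens / 1000)
```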

Where do most teams wobble? In two places. First, they optimize prompts for accuracy while ignoring token drift. Ask a model to “think step by step” without a ceiling and it will—at your expense. Second, they conflate “time saved” with “outcomes delivered.” If AI writes more code that engineers don’t merge, your costs rise and velocity does not.

We use two small practices to keep the system honest:

Token budget annotations in tickets. Each story carries a token envelope. Engineers know what the router will approve. Exceed it and you need a reason.
Red/green per route. We display a red/green light next to each route in the IDE plugin and in the project dashboard. Green means the route is in budget and stable. Red means either too expensive or too unstable. It is visible, and that visibility changes behavior.
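The light itself is a small function. This sketch assumes stability is judged by pass‑rate variance against a per‑route limit, which is one plausible reading; the names and thresholds are hypothetical.

```python
def route_light(
    tokens_used: int,
    token_envelope: int,
    pass_rate_variance: float,
    max_variance: float,
) -> tuple[str, list[str]]:
    """Return the light color plus the reasons, so red is never unexplained."""
    reasons = []
    if tokens_used > token_envelope:
        reasons.append("over budget")
    if pass_rate_variance > max_variance:
        reasons.append("unstable")
    return ("red" if reasons else "green", reasons)
```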

This is the dull work most organizations skip. It is the work that makes AI a capability rather than a cost.

If it isn’t tagged, budgeted, and routed, it will sprawl.

What I’d do this quarter

If I were walking into a new enterprise today, here is the five‑step operator playbook we run at High Peak Studio. Ninety days is enough to see the shape of ROI and the contour of your risks.

1) Instrument first. Before any expansion, tag every model call with task metadata (project, epic, task type, model, tokens in/out, retries, pass/fail, human fix time). Stand up a dashboard that shows tokens per accepted story, per merged PR, and per resolved ticket at book and effective prices. This is week one work.

2) Tier models. Adopt the three‑tier pattern: heavy reasoning for planner/architect work (capped sessions), mid‑tier for code generation and refactoring (per‑file caps, test‑gated), and smallest viable for routing/classification (tight caps, high concurrency). Allow explicit, rare escalation. Do not aim bulk work at top‑capability models; you are burning margin for the feeling of safety.

3) Set per‑task token budgets. Define envelopes per task type and make them visible in tickets and IDEs. Examples: 8–12k per planner session (≤3 per feature), 1–3k per file for codegen (≤1 retry), 50–300 per item for routing/classify. Deny requests that exceed envelopes unless business risk warrants escalation. Budgets change behavior.

4) Monitor subsidy exposure. In every dashboard, show book price and effective price and compute the percentage of work that flips out of margin at book price. If >20%, experiment with routing rules and model mix until you can absorb repricing. The subsidy era is repricing. Treat it as training wheels you’re taking off, a framing that many have made explicit and that finance leaders have been warned about.

5) Review monthly. Put a 60‑minute review on the calendar with product, engineering, and finance. Pull three charts: cost per outcome over time, quality stability, and human correction drag. If cost is rising faster than quality is stabilizing, pause expansion. If subsidy exposure is high, rehearse book‑price survival. If pilots are burning budget faster than outcomes scale, you are not alone. Several public cases have shown how quickly enthusiasm outpaces modeling, from budgets consumed early to license cancellations under cost pressure. But you do not have to repeat it. Use the gates.

If you want a north star, remember the cockpit. You do not throttle up because the plane can go fast. You throttle up because the gauges say you have the altitude, fuel, and heading to do it safely. AI will be no different. Teams that wire in AI token economics now—tokens to outcomes, tiered models, measured rollouts, and honest subsidy math—will own their category when the coupons end and the market asks for proof.

Gauges first, then throttle.
