Why are my AI token costs so unpredictable?

Because spend is measured after the fact, per month, with no link to outcomes. Costs become predictable when every job is forecast before it runs, capped by a budget, and recorded against the work it produced.

Which AI tasks can run on local models?

Routine drafting, triage, classification, summarization, and formatting typically run well on local models via Ollama at zero marginal cost. Deep reasoning, architecture, and high-stakes writing are usually worth a frontier model like Claude. The right setup routes per task instead of sending everything to one model.

What is pre-run cost forecasting for AI agents?

Estimating a job's token and dollar cost before any model call is made, and gating the job on a budget. Big jobs get a risk-based forecast so the expensive surprise is caught before the spend, not on the invoice.

A conversation with the Action Plan team

How do I reduce AI token costs?

You

Our AI spend doubled last quarter and nobody can tell me which work it actually paid for. Every tool just says "usage." How do I get this under control?

Lexi Orchestrator

Token costs fall when you stop paying frontier-model prices for commodity work. Forecast each job before it runs, set a budget, route routine tasks to cheaper or local models, cache repeated context, and measure cost per outcome — not per month. Spend becomes a dial, not a surprise.

Miles Financial Expert

The reason your bill is unpredictable is structural, not behavioral. Three things compound:

Everything runs on the most expensive model — drafting a routine reply costs the same per token as designing an architecture.
Cost is discovered after the fact — the invoice is the first time anyone sees the number.
Spend isn't tied to outcomes — "$900 of usage" tells you nothing about which work was worth it.

You wouldn't run payroll this way. Your AI workforce deserves the same discipline.

Miles Financial Expert

The fix, wherever you apply it:

Forecast before you run. Estimate big jobs up front; gate them on a budget.
Budget per job, not per month. A monthly cap punishes you in week four; a per-job cap stops waste at the source.
Right-size the model per task. Triage, drafts, and summaries on local models (Ollama — zero marginal cost); deep reasoning on a frontier model like Claude.
Cache what repeats. Re-sending the same context every call is the silent budget killer.
Measure cost per outcome. "This report cost $0.40" is actionable. "March was $1,800" is not.

Lexi Orchestrator

Action Plan does all five automatically. Before a big job runs, you see a risk-based forecast and a budget gate. Every model call — local or API — lands in one ledger, tied to the work it produced, and the router sends each task to the cheapest model that can do it well:

Draft 12 customer follow-ups Ledger

FCST Forecast: $0.31 (routed: local model, frontier review pass)
RUN Drafting on local Ollama model — $0.00 marginal
RUN Quality review pass on frontier model — $0.24

Actual $0.24 vs forecast $0.31 · under budget · saved vs all-frontier: $1.87

The ledger also amortizes subscriptions and shows the counterfactual — what the same work would have cost without routing — so "what did local models save us" is a number, not a feeling.

Miles Financial Expert

Honest caveat: routing isn't magic. Some work genuinely needs the expensive model, and a local model doing a bad job costs more in rework than it saves in tokens. That's why the routing decision is visible and overridable — and why I'd rather show you the forecast and let you call it than promise a percentage.

📌 Pinned by Miles — the short version

The AI cost-control checklist

Forecast before you run — estimate and gate big jobs up front.
Budget per job — not per month.
Right-size the model per task — local models (Ollama) for routine work; frontier models (Claude, OpenAI) for hard reasoning.
Cache repeated context — the silent budget killer.
Measure cost per outcome — tie every dollar to the work it produced.

Which tasks belong on which model?

Local (zero marginal cost): triage, classification, routine drafts, summaries, formatting.
Mid-tier API: bulk content, structured extraction, code review passes.
Frontier (Claude / OpenAI): architecture, complex reasoning, high-stakes writing, final review.

What is pre-run cost forecasting?

Estimating a job's token and dollar cost before any model call is made, and gating the job on a budget — so the expensive surprise is caught before the spend, not on the invoice.

Action Plan is in private preview. When your invite opens, onboarding starts from exactly this conversation.

← Previous problemAutomate my business — no black box Next problem →One agent is easy. A fleet is chaos.