A conversation with the Action Plan team
How do I reduce AI token costs?
Our AI spend doubled last quarter and nobody can tell me which work it actually paid for. Every tool just says "usage." How do I get this under control?
Token costs fall when you stop paying frontier-model prices for commodity work. Forecast each job before it runs, set a budget, route routine tasks to cheaper or local models, cache repeated context, and measure cost per outcome — not per month. Spend becomes a dial, not a surprise.
The reason your bill is unpredictable is structural, not behavioral. Three things compound:
- Everything runs on the most expensive model — drafting a routine reply costs the same per token as designing an architecture.
- Cost is discovered after the fact — the invoice is the first time anyone sees the number.
- Spend isn't tied to outcomes — "$900 of usage" tells you nothing about which work was worth it.
You wouldn't run payroll this way. Your AI workforce deserves the same discipline.
The fix, wherever you apply it:
- Forecast before you run. Estimate big jobs up front; gate them on a budget.
- Budget per job, not per month. A monthly cap punishes you in week four; a per-job cap stops waste at the source.
- Right-size the model per task. Triage, drafts, and summaries go to local models (run through Ollama, at zero marginal cost per token); deep reasoning goes to a frontier model like Claude.
- Cache what repeats. Re-sending the same context every call is the silent budget killer.
- Measure cost per outcome. "This report cost $0.40" is actionable. "March was $1,800" is not.
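To put a rough number on the caching point, here is a minimal arithmetic sketch. It assumes a provider that bills re-read cached context at a discounted rate; the function name, the prices, and the `cached_read_factor` discount are all illustrative assumptions, and the sketch ignores any surcharge a provider may apply when the cache is first written.

```python
def input_cost_usd(calls, shared_tokens, unique_tokens, usd_per_mtok,
                   cached_read_factor=None):
    """Input-token cost for `calls` requests that all include the same shared context.

    cached_read_factor: hypothetical fraction of the normal price charged for
    re-reading cached context (None means no caching at all).
    """
    if calls <= 0:
        return 0.0
    if cached_read_factor is None:
        # No caching: the shared context is billed in full on every call.
        shared_billed = calls * shared_tokens
    else:
        # Caching: full price once, then a discounted rate on each later call.
        shared_billed = shared_tokens + (calls - 1) * shared_tokens * cached_read_factor
    total_tokens = shared_billed + calls * unique_tokens
    return total_tokens * usd_per_mtok / 1_000_000


# Illustrative numbers only: 50 calls, 20k tokens of shared context, 500 unique tokens each.
without_cache = input_cost_usd(50, 20_000, 500, usd_per_mtok=3.0)
with_cache = input_cost_usd(50, 20_000, 500, usd_per_mtok=3.0, cached_read_factor=0.1)
print(f"without cache: ${without_cache:.3f}, with cache: ${with_cache:.3f}")
```

With these made-up inputs the shared context dominates the bill, which is why re-sending it on every call adds up so quickly.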
Action Plan does all five automatically. Before a big job runs, you see a risk-based forecast and a budget gate. Every model call — local or API — lands in one ledger, tied to the work it produced, and the router sends each task to the cheapest model that can do it well:
- FCST Forecast: $0.31 (routed: local model, frontier review pass)
- RUN Drafting on local Ollama model — $0.00 marginal
- RUN Quality review pass on frontier model — $0.24
The ledger also amortizes subscriptions and shows the counterfactual — what the same work would have cost without routing — so "what did local models save us" is a number, not a feeling.
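As a rough illustration of the idea (not Action Plan's actual ledger), a cost-per-outcome report with a counterfactual could be sketched like this. The `LedgerEntry` fields, the token counts, and the single blended frontier price are assumptions made for the example; subscription amortization is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    job: str           # the piece of work this call contributed to
    model: str         # label only, e.g. "local" or "frontier"
    tokens: int        # total tokens processed by the call
    actual_usd: float  # what the call really cost


def cost_per_outcome(entries, frontier_usd_per_mtok):
    """Group spend by job and add the all-frontier counterfactual for each."""
    report = defaultdict(lambda: {"actual": 0.0, "counterfactual": 0.0})
    for e in entries:
        report[e.job]["actual"] += e.actual_usd
        # Counterfactual: the same tokens priced as if sent to the frontier model.
        report[e.job]["counterfactual"] += e.tokens * frontier_usd_per_mtok / 1_000_000
    for row in report.values():
        row["saved"] = row["counterfactual"] - row["actual"]
    return dict(report)


# Made-up token counts and a made-up frontier price of $15 per million tokens.
entries = [
    LedgerEntry("weekly report", "local", tokens=40_000, actual_usd=0.00),
    LedgerEntry("weekly report", "frontier", tokens=16_000, actual_usd=0.24),
]
report = cost_per_outcome(entries, frontier_usd_per_mtok=15.0)
print(report["weekly report"])
```

Keying every entry on the job rather than the month is what turns "usage" into a per-outcome figure.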
Honest caveat: routing isn't magic. Some work genuinely needs the expensive model, and a local model doing a bad job costs more in rework than it saves in tokens. That's why the routing decision is visible and overridable — and why I'd rather show you the forecast and let you call it than promise a percentage.
The AI cost-control checklist
- Forecast before you run — estimate and gate big jobs up front.
- Budget per job — not per month.
- Right-size the model per task: local models (run through Ollama) for routine work; frontier models (Claude, OpenAI's models) for hard reasoning.
- Cache repeated context — the silent budget killer.
- Measure cost per outcome — tie every dollar to the work it produced.
Which tasks belong on which model?
- Local (zero marginal cost): triage, classification, routine drafts, summaries, formatting.
- Mid-tier API: bulk content, structured extraction, code review passes.
- Frontier (Claude or OpenAI's models): architecture, complex reasoning, high-stakes writing, final review.
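The tiers above can be sketched as a simple lookup with a manual override. This is a minimal illustration, not how Action Plan's router is built: the task-type keys, the `route` function, and the choice to send unknown task types to the frontier tier are all assumptions.

```python
TIERS = ("local", "mid", "frontier")

# Task types taken from the list above; the key spellings are hypothetical.
TIER_BY_TASK = {
    "triage": "local",
    "classification": "local",
    "routine_draft": "local",
    "summary": "local",
    "formatting": "local",
    "bulk_content": "mid",
    "structured_extraction": "mid",
    "code_review": "mid",
    "architecture": "frontier",
    "complex_reasoning": "frontier",
    "high_stakes_writing": "frontier",
    "final_review": "frontier",
}


def route(task_type, override=None):
    """Return the tier for a task; an explicit override always wins."""
    if override is not None:
        if override not in TIERS:
            raise ValueError(f"unknown tier: {override!r}")
        return override
    # Assumption: unrecognized work goes to the frontier tier, trading cost for quality.
    return TIER_BY_TASK.get(task_type, "frontier")


print(route("summary"))                       # routine work stays local
print(route("summary", override="frontier"))  # a human can still overrule the table
```

Defaulting unknown work to the expensive tier is a deliberate bias here: a bad local result can cost more in rework than it saves in tokens.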
What is pre-run cost forecasting?
Pre-run cost forecasting means estimating a job's token and dollar cost before any model call is made, then gating the job on a budget, so the expensive surprise is caught before the spend, not on the invoice.
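Here is a minimal sketch of a forecast with a budget gate, under stated assumptions: the four-characters-per-token heuristic is a crude stand-in for a real tokenizer, the prices are placeholders, and `forecast_usd`, `run_with_budget`, and `BudgetExceeded` are hypothetical names rather than part of any product or library.

```python
class BudgetExceeded(Exception):
    """Raised when a forecast is higher than the job's budget."""


def estimate_tokens(text):
    # Crude heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)


def forecast_usd(prompt, expected_output_tokens, usd_per_mtok_in, usd_per_mtok_out):
    """Estimate the dollar cost of one job before any model call is made."""
    input_tokens = estimate_tokens(prompt)
    return (input_tokens * usd_per_mtok_in
            + expected_output_tokens * usd_per_mtok_out) / 1_000_000


def run_with_budget(job, prompt, budget_usd, **forecast_args):
    """Run `job(prompt)` only if the forecast fits inside the budget."""
    estimate = forecast_usd(prompt, **forecast_args)
    if estimate > budget_usd:
        # The gate: refuse before spending anything.
        raise BudgetExceeded(
            f"forecast ${estimate:.4f} exceeds budget ${budget_usd:.4f}"
        )
    return job(prompt)
```

The important property is the ordering: the estimate and the comparison both happen before `job` is called, so an over-budget job costs nothing.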
Action Plan is in private preview. When your invite opens, onboarding starts from exactly this conversation.
Got it. You're on the early-access list, and I've saved your goal. When your invite opens, this is the first thing we'll work on together. Watch your inbox.