Cache the stable prefix
Most agent calls repeat a large, unchanging head — the system prompt, the tool list, the retrieved context. Prompt caching bills that repeated prefix at a fraction of the fresh-input price on a cache hit.
How: Put everything stable first (frozen system prompt, deterministically-ordered tools), set a cache breakpoint at the end of it, and keep volatile content (timestamps, the per-run question) after the breakpoint. Verify hits with the cache-read token count — if it stays zero, a silent invalidator is changing the prefix.
Cache reads bill at roughly a tenth of base input — about 90% off the cached prefix. Writes cost a small premium, so the win compounds as a prefix is reused across runs.
≈90% off the cached prefix · estimateRight-size the model to the turn
Not every turn needs the frontier model. Routing easy, well-scoped work to a smaller, cheaper tier and reserving the top tier for genuinely hard reasoning is the single biggest structural lever on cost.
How: Default to the latest Claude, then tier deliberately: a small model (e.g. Haiku) for classification, extraction, and routing; a mid model (Sonnet) for most work; the top model (Opus) only where the reasoning truly demands it. Tune the effort level too — lower effort means fewer, more-consolidated tokens.
Tiers differ by multiples per token, and output bills several times more than input — so trimming both the tier and the output length compounds.
multiples per tier · estimateBatch the non-interactive work
Anything that does not need an answer this second — overnight enrichment, bulk classification, scheduled reports — can run through the asynchronous Batch API instead of the live endpoint.
How: Collect the requests, submit them as one batch keyed by a custom id, and poll for results (most finish within the hour, all within a day). Reserve the live, latency-sensitive path for the turns a human is actually waiting on.
The Batch API runs at roughly half the standard per-token price — a flat 50% off for work that can tolerate the delay.
≈50% off · async only · estimateRetrieve, don’t stuff
Pasting an entire corpus into every prompt pays for the whole thing on every call. Retrieval fetches only the few snippets a turn actually needs, so input tokens scale with relevance, not with the size of your knowledge base.
How: Index your reference material once (embeddings + a vector store), retrieve the top-k relevant chunks per turn, and pass only those into the prompt. Combine with caching: cache the stable instructions, vary only the retrieved snippets after the breakpoint.
Input tokens per call drop from "the whole corpus" to "the handful that matter" — often the difference between an unviable and a profitable run.
tokens scale with relevance · estimateCap the output, scope the effort
Output tokens bill several times more than input, so a verbose answer is disproportionately expensive. Asking for exactly the shape you need — and no more — is a direct margin lever.
How: Set a deliberate max-output ceiling, ask for structured or short outputs, and pick the lowest effort level that still passes your quality bar. Strip dead instructions from the system prompt; they cost input tokens on every single call.
Because output is the priciest token class, shaving a long answer down to the necessary payload moves the per-run cost more than an equivalent cut to input.
output ≈ priciest tokens · estimateDon’t pay for the same answer twice
Deterministic or repeated requests don’t need a fresh model call each time. A cheap guard or a memoized result short-circuits the call entirely — the cheapest token is the one you never send.
How: Memoize deterministic results, dedupe identical in-flight requests, and gate the model behind cheap rules (a regex, a lookup, a cache) that resolve the easy cases before any inference. Count tokens with the provider’s token counter — never a third-party estimator — so your budgets are accurate.
Eliminating redundant calls removes their cost outright, not just discounts it — the highest-leverage saving when traffic is repetitive.
the cheapest call is none · estimate