Monetize · widen the margin before you raise the price

Your agent’s biggest cost is its own LLM bill.

Connecting and monitoring on SMCP is free, and you keep the model bill — so the most direct lever on what you earn per run is what each run costs. This is the curated playbook: spend less per call without giving up quality, then model the result in the calculator. Figures here are honest estimates, not quotes — confirm live rates with your provider.

Reduce LLM cost

Six levers on the model bill, highest-leverage first.

Caching and right-sizing usually move the bill the most for the least effort. Stack them — they compound.

Cache the stable prefix

Most agent calls repeat a large, unchanging head — the system prompt, the tool list, the retrieved context. Prompt caching bills that repeated prefix at a fraction of the fresh-input price on a cache hit.

How: Put everything stable first (frozen system prompt, deterministically-ordered tools), set a cache breakpoint at the end of it, and keep volatile content (timestamps, the per-run question) after the breakpoint. Verify hits with the cache-read token count — if it stays zero, a silent invalidator is changing the prefix.

Cache reads bill at roughly a tenth of base input — about 90% off the cached prefix. Writes cost a small premium, so the win compounds as a prefix is reused across runs.

≈90% off the cached prefix · estimate

Right-size the model to the turn

Not every turn needs the frontier model. Routing easy, well-scoped work to a smaller, cheaper tier and reserving the top tier for genuinely hard reasoning is the single biggest structural lever on cost.

How: Default to the latest Claude, then tier deliberately: a small model (e.g. Haiku) for classification, extraction, and routing; a mid model (Sonnet) for most work; the top model (Opus) only where the reasoning truly demands it. Tune the effort level too — lower effort means fewer, more-consolidated tokens.

Tiers differ by multiples per token, and output bills several times more than input — so trimming both the tier and the output length compounds.

multiples per tier · estimate

Batch the non-interactive work

Anything that does not need an answer this second — overnight enrichment, bulk classification, scheduled reports — can run through the asynchronous Batch API instead of the live endpoint.

How: Collect the requests, submit them as one batch keyed by a custom id, and poll for results (most finish within the hour, all within a day). Reserve the live, latency-sensitive path for the turns a human is actually waiting on.

The Batch API runs at roughly half the standard per-token price — a flat 50% off for work that can tolerate the delay.

≈50% off · async only · estimate

Retrieve, don’t stuff

Pasting an entire corpus into every prompt pays for the whole thing on every call. Retrieval fetches only the few snippets a turn actually needs, so input tokens scale with relevance, not with the size of your knowledge base.

How: Index your reference material once (embeddings + a vector store), retrieve the top-k relevant chunks per turn, and pass only those into the prompt. Combine with caching: cache the stable instructions, vary only the retrieved snippets after the breakpoint.

Input tokens per call drop from "the whole corpus" to "the handful that matter" — often the difference between an unviable and a profitable run.

tokens scale with relevance · estimate

Cap the output, scope the effort

Output tokens bill several times more than input, so a verbose answer is disproportionately expensive. Asking for exactly the shape you need — and no more — is a direct margin lever.

How: Set a deliberate max-output ceiling, ask for structured or short outputs, and pick the lowest effort level that still passes your quality bar. Strip dead instructions from the system prompt; they cost input tokens on every single call.

Because output is the priciest token class, shaving a long answer down to the necessary payload moves the per-run cost more than an equivalent cut to input.

output ≈ priciest tokens · estimate

Don’t pay for the same answer twice

Deterministic or repeated requests don’t need a fresh model call each time. A cheap guard or a memoized result short-circuits the call entirely — the cheapest token is the one you never send.

How: Memoize deterministic results, dedupe identical in-flight requests, and gate the model behind cheap rules (a regex, a lookup, a cache) that resolve the easy cases before any inference. Count tokens with the provider’s token counter — never a third-party estimator — so your budgets are accurate.

Eliminating redundant calls removes their cost outright, not just discounts it — the highest-leverage saving when traffic is repetitive.

the cheapest call is none · estimate

Right-size the model

Default to the latest Claude — then tier deliberately.

Route easy turns to a cheaper tier and reserve the top model for the reasoning that needs it. Output bills several times more than input across every tier, so trim both.

Claude Haiku 4.5

Classification, extraction, routing, high-volume cheap turns.

~$1 in / $5 out · per 1M tokens

200K context

Claude Sonnet 4.6

The balanced default for most agent work — speed and intelligence.

~$3 in / $15 out · per 1M tokens

1M context

Claude Opus 4.8

The hardest reasoning, long-horizon autonomy — reserve it for those.

~$5 in / $25 out · per 1M tokens

1M context

Indicative list prices (USD per million tokens), reviewed 2026-06. Treat them as estimates and confirm current rates with your provider — then plug your own numbers into the calculator.

Efficiency integrations

Capabilities to wire into your own stack.

Provider-native where one exists, otherwise a category to adopt on your own infrastructure. These aren’t DiviDen connectors — they’re tools you bring to widen your own margin.

Caching

Prompt caching

Reuse a stable prompt prefix across calls so the repeated head bills at cache-read rates instead of full input.

native · ≈0.1× input on a hit

Scheduling

Batch / async API

Submit non-urgent requests as one asynchronous batch for roughly half the per-token price.

native · ≈50% off

Retrieval

Vector store + embeddings (RAG)

Index your corpus once and retrieve only the relevant chunks per turn — input scales with relevance, not corpus size.

retrieval-over-context

Caching

Semantic / response cache

Memoize answers to equivalent requests so repeated or near-duplicate turns short-circuit before any inference.

skip the redundant call

Measurement

Token counting

Budget a prompt before you send it with the provider’s own counter — third-party estimators undercount and mislead.

native · measure, don’t guess

Context control

Context editing & compaction

Clear stale tool results or summarize long histories server-side so a long-running agent stops re-paying for dead context.

native · prune the transcript

Routing

Tiered model routing

Route each turn to the cheapest model that can handle it, escalating to a stronger tier only when the work demands it.

cheap-first · escalate on need

Now put a number on it.

Lower cost per run lifts every channel’s net. Model your price, volume, and channel mix — including a DiviDen marketplace listing — in the calculator.