back to guides

Cost Control and Optimization: Budgets, Limits, and Policies

A guide to controlling spend on ChatBotKit, covering model economics, context-management options, plan and platform limits, sub-account isolation, and usage policies that throttle or pause runaway agents.

Overview

The cost of running an AI agent is dominated by tokens. Everything else - datasets, skillsets, files, conversation rows - carries a real cost in database rows and blob storage, but that cost is small next to tokens and scales far more gently. So cost control on ChatBotKit is, at its core, token control: deciding which model burns tokens, how many tokens each interaction is allowed to carry, and what happens when consumption crosses a line.

That control is layered. A single setting rarely protects you on its own, because the failure modes differ. A well-chosen model still loops. A sensible per-interaction budget still runs millions of times under an unexpected traffic spike. A monthly plan cap still lets a single agent exhaust the whole account in an afternoon. Each layer addresses a failure the layer above it cannot see.

This guide works through the layers from the inside out: the per-token economics of model choice, the per-interaction controls that bound a single completion, the account and platform budgets that cap aggregate spend, the sub-account boundaries that isolate risky agents, and finally the usage policies that watch consumption in real time and act on it automatically. Read top to bottom for the full mental model, or jump to the layer you need to tighten.

Layer 1: Model Economics

Every completion is priced in tokens, and the model you pick sets the exchange rate.

Credit Tokens, Not Raw Tokens

ChatBotKit bills in credit tokens, a calibrated unit rather than a raw provider token. Each model carries a tokenRatio that is tuned so that roughly one million credit tokens corresponds to a comparable amount of real cost of goods regardless of which model produced them. A cheaper model consumes credit tokens slowly; a frontier model consumes them faster for the same raw token count. This is what lets a single account budget cover a mix of models without you having to reconcile half a dozen provider price sheets.

Because tokens are calibrated to real cost, they are the primary lever. Structural counts - how many bots, datasets, or files you have - do carry a storage cost, but it is small and grows slowly, so they are sized mainly for packaging and abuse control. Optimizing them trims the edges. Optimizing tokens moves the bill.

How a Model Is Priced

The pricing parameters on each model (covered in full in the Language Models manual) come down to three values:

  • tokenRatio - the base cost multiplier applied to token usage. Higher means more expensive per token.
  • inputTokenRatio - an optional separate multiplier for input (prompt) tokens.
  • outputTokenRatio - an optional separate multiplier for output (generated) tokens. When both input and output ratios are present, they override the base ratio, reflecting that most providers charge more to generate than to read.

The practical consequence: input is usually cheaper than output, and the input side is where bloat accumulates silently. A long system prompt, a large retrieved context, and a deep conversation history all ride along on every single turn. The model you choose sets the rate; the next layer controls how much rides along.

Choosing for Cost

Match the model to the job rather than defaulting to the most capable one everywhere.

WorkloadLean toward
High-volume classification, routing, extractionA smaller, lower-ratio model
Customer-facing conversation with nuanceA mid-tier balanced model
Multi-step reasoning, code, long-document analysisA frontier model, used deliberately

A frequent and effective pattern is to route: a cheap model handles triage and the easy majority of turns, and only escalates to an expensive model when the task genuinely needs it. The expensive model's per-token rate stops mattering when it runs on a small fraction of traffic.

Optimize the model last, not first. Premature cost optimization is a common trap. Reaching for the cheapest model before the agent works tends to send you chasing prompt and behavior problems that are really model-capability problems. Start with a capable model - not necessarily the most expensive, but one comfortably strong enough that the agent's quality is never in question. Lock in the instructions, tools, context strategy, and the other parameters against that baseline. Once the agent behaves the way you want, step down to a more cost-efficient model and confirm it holds up. Now a regression is unambiguous: the only thing that changed was the model, so any drop in quality is the model's doing and you know exactly how much capability the savings cost you. Tuning a weak agent and a cheap model at the same time hides which one is at fault.

A custom model is rarely the cheaper option. ChatBotKit can bring your own model and keys to the platform, and it is tempting to assume that paying a provider directly undercuts the built-in models. It usually does not. The platform's own models are served at negotiated rates and their usage is heavily optimized on your behalf, so for most workloads the out-of-the-box models are the cost-efficient choice. Reach for a custom model when the reason is control rather than price: when you need the model to stay within your own provider account for privacy or data-residency reasons, when contractual or compliance terms require a specific provider, or when you want a more esoteric or self-hosted model that ChatBotKit does not offer out of the box. With your own key you also take on the model's availability and operational limits, which is a reason to choose it deliberately and not as a default cost play.

Layer 2: Per-Interaction Controls

Model choice sets the unit price. The interaction options set the quantity - how many tokens a single completion is permitted to carry. This is the layer with the highest leverage, because it multiplies across every turn of every conversation.

Token Limits

Each model exposes a context budget through three related values:

  • maxTokens - the total context window, the combined ceiling for input plus output.
  • maxInputTokens - the share available for the prompt, system instructions, retrieved context, and conversation history. This is typically around three-quarters of the window.
  • maxOutputTokens - the share reserved for the response, typically the remaining quarter.

The relationship is simply maxTokens = maxInputTokens + maxOutputTokens. Lowering maxTokens on a model configuration directly caps how much any one interaction can cost. If a bot never needs to read a hundred thousand tokens of history to answer a support question, do not give it a hundred-thousand-token budget - the budget is a ceiling that bloat will rise to fill.

Interaction Max Messages

interactionMaxMessages limits how many conversation messages are included in each model interaction. It is the most direct control over history-driven cost. A long conversation does not have to carry its entire transcript on every turn; capping the message count keeps the input side from growing without bound as the conversation extends.

Lower values (in the range of a few to a dozen messages) keep interactions focused, deterministic, and cheap. Higher values (fifty to a hundred) give the model more context awareness at a proportionally higher per-turn cost and some loss of response consistency. For most production workloads, a modest cap is both cheaper and more reliable than the full history.

Context Management: Truncate vs Compact

When a conversation grows past the configured thresholds, something has to give. ChatBotKit offers two Threshold Strategies, selectable in the Language Model settings:

  • Truncate prioritizes the latest turns and drops the oldest once the threshold is reached. It is the cheapest and simplest strategy. The trade-off is hard forgetting: anything that scrolled out of the window is gone, which is fine for stateless or short-lived interactions and poor for conversations that reference earlier context.

  • Compact rolls earlier turns into checkpoint-style summaries instead of discarding them outright. It preserves continuity across long conversations while keeping token pressure down, at the cost of the occasional summarization pass. It suits multi-step assistants, operations workflows, and support sessions that extend over many turns and need to stay grounded in what came before.

These strategies work in concert with Max Tokens and Interaction Max Messages. Together they let you tune the trade-off between memory depth and efficiency: truncate with a tight message cap for cheap, focused bots; compact with a more generous budget for assistants that must remember. The conversation compaction announcement walks through the reasoning in more depth.

Response behavior is a quality lever with a cost dividend. Parameters like temperature, frequencyPenalty, and presencePenalty shape output quality rather than cost directly. They matter to cost only indirectly: a poorly tuned model that produces rambling or repetitive output generates more output tokens, and output tokens are the expensive side. Tightening response behavior for the task at hand is a quality decision that pays a modest cost dividend.

Layer 3: Account Limits

The per-interaction layer bounds one completion. It does nothing about a million completions. Account limits cap aggregate consumption over a billing period.

Plan Limits

Every account inherits a set of limits from its plan. These are defined per tier and cover the metrics that actually drive cost and abuse:

  • tokens - the monthly credit-token budget, the primary cost lever and the main reason accounts upgrade.
  • conversations and messages - volume caps on conversational activity.
  • image, video, audio - caps on generative media, which are expensive and disabled entirely on lower tiers.
  • fetch and email - caps on outbound fetches and email sends.
  • rate.* - per-minute rate limits on records, abilities, conversations, messages, and polls, which catch bursts rather than monthly totals.

When an account reaches its token budget for the period, operations that would consume more are rejected at the API with a limits-reached error. The agent stops producing completions until the window rolls over or the plan is upgraded. This is the backstop that keeps a busy month from becoming an unbounded one.

Structural limits - bots, datasets, skillsets, abilities, files, records - also live here, but as packaging and abuse controls rather than cost levers. Their storage cost is real but small and slow-growing, so tightening them yields modest savings. Keep your attention on the token, rate, and media caps, where the money actually is.

Monitoring Before You Hit the Wall

Limits enforce hard stops, and a hard stop is a blunt experience for an end user mid-conversation. Watch consumption through the bot usage statistics and account analytics so you can upgrade, rebalance, or tighten policies before the wall arrives rather than discovering it from a wave of failed completions. The account-level notifications for nearly-exceeded and exceeded limits exist for exactly this reason.

Layer 4: Platform Limits - The Global Safety Net

Plan limits protect your bill. They do not, on their own, protect every account from a single catastrophic bug. For that, ChatBotKit imposes a platform-wide limit underneath all of them.

The platform tracks total token consumption across every account against a fixed monthly ceiling. Once that ceiling is reached, non-exempt requests are rejected - across the whole platform, without exception.

This is not a billing mechanism. It is a circuit breaker. Its job is to make sure that no single runaway agent, compromised key, or pathological loop anywhere on the platform can run up unbounded global cost that would affect everyone. Your account limits are bound to your plan and agreement; the platform limit sits below them as a hard ceiling that you never configure and that protects all accounts collectively.

Two design choices are worth knowing. The platform counter is tracked in lockstep with per-account usage, so accounting is real-time rather than batched. And the check fails open: if usage cannot be read during an infrastructure incident, the platform is assumed to be within budget, so a transient outage does not itself become a platform-wide one. The safety net is deliberately built so that its own failure does not take everyone down with it.

Layer 5: Sub-Account Isolation

Plan and platform limits operate on a whole account. When you run many independent workloads - separate customers, separate teams, or a few experimental agents you do not fully trust - a single account-wide budget is too coarse. One workload can quietly consume the budget that the others depend on.

Sub-accounts (partner users) solve this by giving each workload its own walled garden. Each sub-account is a fully isolated environment with its own bots, datasets, conversations, integrations, and settings, while inheriting billing and subscription from the parent. The Partner Users and Resource Limits manuals cover the full surface; here the relevant part is the limits object.

Per-Sub-Account Limits

When you create or update a sub-account, you can attach a limits object that constrains exactly what that tenant can consume:

The rules are straightforward:

  • Every field is optional. Omit one and the sub-account inherits the parent's default for it.
  • All values are non-negative integers. Set a value to 0 to completely restrict that resource.
  • Limits are enforced at the API level. When a sub-account hits one, its operations that would exceed it are rejected - the same hard-stop behavior as account limits, scoped to the tenant.

Isolating Risky Agents

This is where sub-accounts become a cost-control instrument rather than just a multi-tenancy feature. Give a risky or experimental agent its own sub-account with a deliberately small token allotment. If that agent misbehaves - loops, gets prompt-injected into chattiness, or simply turns out to be more expensive than expected - it exhausts its own token budget and then stops, because there are no tokens left for it to consume. The blast radius is the allotment you granted it. Nothing it does can reach into the budget that your production agents rely on.

A small token ceiling on a sub-account turns "this agent is a little risky" into "this agent can cost at most this much, and then it halts." That is a cleaner containment story than trying to reason about a risky agent's behavior inside a shared account.

Layer 6: Usage Policies - Active Enforcement

Every layer so far is a static ceiling: a budget that, once spent, stops things until it resets. Static ceilings catch sustained overspend. They are slow to catch a sudden burst - an agent that starts looping at 2 a.m. can consume a great deal before it hits a monthly cap, and you find out from the invoice.

Usage policies close that gap. A usage policy watches consumption in real time and acts the moment a threshold is crossed within a window you define. They are the dynamic counterpart to the static budgets above. The usage policies announcement introduces the feature; the Policies manual covers the API.

How a Usage Policy Works

A usage policy watches a single metric against a threshold over a rolling window, and fires one or more actions when the threshold is crossed:

FieldMeaning
metricWhat to count: tokens, messages, or conversations
thresholdThe count that trips the policy
windowInSecondsThe rolling window the count is measured over
actions.blockPause the bot for durationInSeconds
actions.emailNotify the owner, or an explicit list of recipients

A policy must define at least one action, and the two combine freely. Usage is counted as it is recorded, so every token, message, and conversation a bot consumes is accounted for the moment it happens - the window is measured against live consumption, not a periodic sweep.

The Two Actions

Block is the hard enforcement. It places a temporary soft-lock on the bot: while the block is present the engine refuses to run completions for that bot, and the block auto-expires when its duration elapses, so a time-limited cool-down needs no separate unblock step. This is what actually stops spend in its tracks. A block can also be lifted early if you want the bot back sooner, which resets the policy's window so it does not immediately re-trip.

Email is the soft signal. It alerts the account owner, or an explicit recipient list you specify, that a threshold was crossed. Alerts are deduplicated to once per window, so a sustained breach sends a single heads-up rather than an email on every recorded event.

Use them together for the common case: notify and block, so a runaway agent both pauses itself and tells you it did.

A Worked Example

Suppose you want any agent that burns more than a million tokens in a five-minute window to pause itself for five hours and email you. The policy is:

A million tokens in five minutes is far above any legitimate single-bot pace, so this never touches a healthy agent. The moment a loop or an abusive session pushes past it, the bot blocks for five hours, an email goes out once, and the spend stops climbing - long before any monthly budget would have noticed.

Policy Scope

Usage policies apply at two scopes, and both are evaluated together on every usage event:

  • Bot-level - attach the policy to a specific bot to govern that bot alone. This is the right scope for a tight, agent-specific cap like the example above.
  • Account-level (global) - leave the policy unattached to any bot and it covers every bot in the account from a single rule. This is the right scope for a broad backstop that no bot should ever cross.

A bot's own policies and the account-wide ones are both checked on each event, so a global ceiling and a per-bot cap coexist cleanly. Policies can also be associated with a blueprint for organizational grouping, the same way other resources are organized, while enforcement itself resolves to the bot and the account.

A good arrangement is layered, mirroring the rest of this guide: a generous account-wide monthly-scale policy for billing peace of mind, a tighter per-hour account policy to catch abuse, and per-bot policies on anything sensitive or experimental. Each catches a failure the others would miss.

Putting It Together

No single layer is sufficient, and that is the point. Each one fails differently, so each one is backed by the next.

A practical baseline for a production account:

  • Build on a capable model, then step down to the cheapest one that still does the job - and route to an expensive one only for the turns that need it.
  • Cap each interaction with a sensible maxTokens and interactionMaxMessages, and choose Compact over Truncate only where continuity genuinely matters.
  • Rely on plan limits as the monthly backstop, and monitor usage so you upgrade or tighten before hitting them rather than after.
  • Isolate anything risky in a sub-account with a small token allotment, so its worst case is bounded by what you granted it.
  • Add usage policies - a broad account-wide ceiling plus tighter per-bot caps on sensitive agents - so a sudden burst pauses itself in minutes instead of surfacing on an invoice.

A misconfigured prompt, a traffic spike, or an agent stuck in a loop can quietly burn through tokens. With these layers in place, the burn is caught - at the model rate, at the interaction size, at the account budget, at the tenant boundary, and at the live threshold - and the most expensive failure mode, the one nobody noticed until the bill arrived, stops being possible.