By Otso Veisterä · May 2026 · Last updated May 6, 2026
Cut Your LLM API Costs by 70% Without Losing Quality
Your LLM API bill is probably your second-biggest line item after payroll. And unlike headcount, it scales with every user, every conversation, every query. Here's how to fix that without degrading the experience your users depend on.
Your token bill is growing faster than your revenue
Here's the math that keeps engineering leaders up at night. Take a B2B product with 10,000 daily active users, each sending 8 messages a day through an AI assistant. Each API call includes a system prompt, conversation history, and some retrieved context — roughly 3,200 input tokens per call. The model generates about 400 tokens in response.
That's 256 million input tokens per day. At Claude Sonnet 4.6 pricing ($3 per million input tokens, $15 per million output), your monthly bill looks like this:
Input: 7,680M tokens × $3/MTok = $23,040/month
Output: 960M tokens × $15/MTok = $14,400/month
Total: $37,440/month for 10K users, or roughly $3.74 per user per month in API costs alone
Switch to GPT-5.4 ($2.50 input / $15 output) and you're at $33,600. Use a cheaper model and you trade off capability. The cost floor is high regardless of provider.
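If you want to plug in your own numbers, here's the same arithmetic as a small TypeScript helper. The traffic figures and prices below are the assumptions from the worked example above; swap in yours.

// Monthly API cost estimate. All inputs below are the assumptions from the
// worked example; replace them with your own traffic and pricing.
interface CostModel {
  dailyActiveUsers: number;
  messagesPerUserPerDay: number;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
  inputPricePerMTok: number;  // USD per million input tokens
  outputPricePerMTok: number; // USD per million output tokens
}

function monthlyCost(m: CostModel, daysPerMonth = 30): number {
  const callsPerDay = m.dailyActiveUsers * m.messagesPerUserPerDay;
  const inputTokens = callsPerDay * m.inputTokensPerCall * daysPerMonth;
  const outputTokens = callsPerDay * m.outputTokensPerCall * daysPerMonth;
  return (
    (inputTokens / 1_000_000) * m.inputPricePerMTok +
    (outputTokens / 1_000_000) * m.outputPricePerMTok
  );
}

// 10K DAU, 8 messages/day, 3,200 in / 400 out, at $3 / $15 per MTok => 37440
console.log(
  monthlyCost({
    dailyActiveUsers: 10_000,
    messagesPerUserPerDay: 8,
    inputTokensPerCall: 3_200,
    outputTokensPerCall: 400,
    inputPricePerMTok: 3,
    outputPricePerMTok: 15,
  })
);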
And this gets worse, not better. Conversation history grows linearly with every turn. Agent loops can re-process the same context dozens of times. Retrieval pulls in far more context than the model actually attends to. System prompts balloon as product requirements accumulate. Most teams have zero visibility into where their tokens are actually going.
The five places your tokens are hiding
Before you can cut costs, you need to see where the tokens go. In our experience working with teams running production LLM apps, the waste falls into five buckets. Most teams are leaking tokens in all five.
1. Bloated system prompts
System prompts are the silent cost multiplier. They ship with every single API call, and they tend to grow over time as teams append instructions, guardrails, formatting rules, and persona descriptions. We routinely see system prompts consuming 30-50% of the total input tokens on a request — often 1,000 to 3,000 tokens of instructions that the model has already internalized from fine-tuning or that could be expressed in a fraction of the space.
The fix: Audit your system prompt quarterly. Strip redundant instructions. Move few-shot examples into retrieval rather than hardcoding them. Measure whether each line actually changes model behavior — if removing it produces the same output on your eval set, it's dead weight you're paying for on every call.
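One way to make that audit concrete is an ablation pass over your eval suite. The sketch below is a rough outline, assuming you already have an eval harness you can call with an arbitrary system prompt; runEvalSuite here is a placeholder for it.

// Ablation pass over a system prompt: drop one line at a time and check
// whether the eval pass rate moves. Lines whose removal doesn't hurt the
// evals are candidates to cut.
async function findDeadWeight(
  systemPrompt: string,
  runEvalSuite: (prompt: string) => Promise<number>, // your existing eval harness, returns pass rate 0-1
  tolerance = 0.005
): Promise<string[]> {
  const lines = systemPrompt.split("\n").filter((l) => l.trim().length > 0);
  const baseline = await runEvalSuite(systemPrompt);
  const deadWeight: string[] = [];

  for (let i = 0; i < lines.length; i++) {
    const ablated = lines.filter((_, j) => j !== i).join("\n");
    const score = await runEvalSuite(ablated);
    // Removing this line didn't change eval results: it's dead weight.
    if (score >= baseline - tolerance) deadWeight.push(lines[i]);
  }
  return deadWeight;
}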
2. Conversation history that never stops growing
Every turn in a conversation adds to the context window. By turn 10, the accumulated history often dwarfs the actual user message. By turn 20, you're sending tens of thousands of tokens of prior conversation just so the model remembers what happened three messages ago. In agent loops, this is even worse — the agent re-processes its entire reasoning chain on every iteration.
The fix: Implement a sliding window that keeps the last N turns verbatim and summarizes earlier context. Or compress the full history so the model retains the semantic content without the token overhead. The key insight: older turns contain context the model needs, but not at full fidelity.
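Here's a minimal sketch of the sliding-window approach. The summarize function is a placeholder for however you compress older turns, whether a cheap model call or a compression API.

// Sliding-window history: keep the last N turns verbatim and collapse
// everything older into a single summary turn.
type Turn = { role: "user" | "assistant"; content: string };

async function windowedHistory(
  history: Turn[],
  keepLastN: number,
  summarize: (turns: Turn[]) => Promise<string> // placeholder: cheap model call or compression API
): Promise<Turn[]> {
  if (history.length <= keepLastN) return history;

  const older = history.slice(0, history.length - keepLastN);
  const recent = history.slice(history.length - keepLastN);
  const summary = await summarize(older);

  return [
    // One compact turn carries the gist of everything before the window.
    { role: "user", content: `Summary of the earlier conversation:\n${summary}` },
    ...recent,
  ];
}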
3. RAG context that over-fetches by 3-5x
This is the biggest one for retrieval-augmented applications. Teams set top-k conservatively (usually 5-10 chunks) because missing a relevant chunk is worse than including an irrelevant one. Each chunk carries boilerplate, navigation elements, headers, and formatting noise that embedding similarity doesn't filter out. The result: 2,500-5,000 tokens of retrieved context per query, and the model meaningfully attends to maybe 30-40% of it.
The fix: Rerank after retrieval to drop chunks that aren't actually useful. Strip boilerplate and formatting from chunks at indexing time. And compress the remaining context — this is where purpose-built compression delivers the highest ROI. We go deeper on this in our RAG cost analysis.
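As a sketch of the rerank-and-trim step, the code below scores each retrieved chunk against the query, drops low-relevance chunks, and caps the total context budget. scoreRelevance is a placeholder for your reranker (a cross-encoder, an LLM judge, or similar), and the thresholds are illustrative.

// Post-retrieval trimming: rerank chunks against the query, drop anything
// below a relevance threshold, and cap the total context budget.
interface Chunk {
  text: string;
}

async function trimContext(
  query: string,
  chunks: Chunk[],
  scoreRelevance: (query: string, chunk: string) => Promise<number>, // your reranker
  minScore = 0.5,     // illustrative threshold
  tokenBudget = 1500  // illustrative cap
): Promise<string> {
  const scored = await Promise.all(
    chunks.map(async (c) => ({ text: c.text, score: await scoreRelevance(query, c.text) }))
  );

  let used = 0;
  const kept: string[] = [];
  for (const c of scored.sort((a, b) => b.score - a.score)) {
    const tokens = Math.ceil(c.text.length / 4); // rough estimate: ~4 chars per token
    if (c.score < minScore || used + tokens > tokenBudget) continue;
    kept.push(c.text);
    used += tokens;
  }
  return kept.join("\n\n");
}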
4. Tool definitions and schemas
If you're building agents, every tool definition and JSON schema gets injected into the system prompt. Ten tools with detailed parameter descriptions can easily add 2,000+ tokens to every call. Most agent frameworks include all tool definitions on every turn, even when only one or two tools are relevant to the current step.
The fix: Dynamically select which tools to include based on the current agent state. Trim descriptions to the minimum the model needs. Consider a two-stage approach: first call to select relevant tools, second call with only those tool definitions included.
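A rough sketch of the two-stage approach: a cheap routing pass picks the relevant tools, and only those definitions go into the expensive call. pickRelevantTools is a placeholder for whatever router you use, whether a small model call or an embedding match over tool descriptions.

// Two-stage tool selection: a cheap routing pass picks the tools relevant to
// this step, and only those definitions go into the expensive call.
interface ToolDef {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
}

async function selectTools(
  userMessage: string,
  allTools: ToolDef[],
  pickRelevantTools: (msg: string, toolNames: string[]) => Promise<string[]>, // your router
  maxTools = 3
): Promise<ToolDef[]> {
  const relevant = await pickRelevantTools(
    userMessage,
    allTools.map((t) => t.name)
  );
  // Include only the definitions the router asked for, capped to keep schemas small.
  return allTools.filter((t) => relevant.includes(t.name)).slice(0, maxTools);
}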
5. Output tokens you don't need
Output tokens are 3-6x more expensive than input tokens across most providers. Claude Sonnet 4.6 charges $15/MTok output vs. $3 input — a 5x multiplier. Yet most teams don't constrain output length at all. The model generates preambles, restates the question, adds disclaimers, and pads responses with filler that users skip.
The fix: Set max_tokens aggressively. Add explicit instructions like "Answer in under 100 words" or "Return only the JSON object" to your prompts. For structured outputs, use the provider's JSON mode to eliminate natural language wrapping entirely.
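In practice that can be as simple as a hard max_tokens cap plus an explicit brevity instruction. A minimal sketch with the Anthropic SDK follows; the cap of 300 is illustrative, so tune it to the largest valid response you expect.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Constrain output on both ends: a hard cap via max_tokens plus an explicit
// instruction so the model doesn't spend the budget on preamble.
async function extractJson(document: string) {
  return client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 300, // hard ceiling; tune to the largest valid response you expect
    system: "Return only the JSON object. No preamble, no explanation, no code fences.",
    messages: [{ role: "user", content: document }],
  });
}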
Step zero: see where the tokens go
You can't optimize what you don't measure. Before applying any of the fixes above, instrument your LLM calls to track token usage by category. Here's a minimal setup for a Node/TypeScript app:
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function trackedCompletion(
  systemPrompt: string,
  messages: Anthropic.MessageParam[],
  ragContext?: string
) {
  // Tag token sources for visibility
  const systemTokens = estimateTokens(systemPrompt);
  const historyTokens = messages.reduce(
    (sum, m) => sum + estimateTokens(
      typeof m.content === "string"
        ? m.content
        : JSON.stringify(m.content)
    ), 0
  );
  const ragTokens = ragContext ? estimateTokens(ragContext) : 0;

  const response = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: systemPrompt,
    messages: ragContext
      ? [...messages, { role: "user" as const, content: ragContext }]
      : messages,
  });

  // Log the breakdown
  console.log({
    system_tokens: systemTokens,
    history_tokens: historyTokens,
    rag_tokens: ragTokens,
    input_tokens: response.usage.input_tokens,
    output_tokens: response.usage.output_tokens,
    input_cost: (response.usage.input_tokens / 1_000_000) * 3,
    output_cost: (response.usage.output_tokens / 1_000_000) * 15,
  });

  return response;
}

// Rough estimator — 1 token ≈ 4 chars for English
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

Run this for a day and you'll have a clear picture of which bucket dominates your spend. In most apps we've seen, it's a combination of system prompts and RAG context — the two categories where compression has the highest impact.
Why compression is the highest-leverage move
Each of the fixes above helps. But compression compounds with all of them, and it delivers benefits beyond the token bill.
If you reduce input tokens by 60%, you don't just save 60% on input costs. You also cut latency — fewer tokens means faster time-to-first-token, because the model has less to process before it starts generating. You reduce KV cache pressure, which matters at scale when you're competing for GPU memory on shared infrastructure. And counterintuitively, you often improve output quality. Less noise in the context means the model's attention is focused on the tokens that actually matter.
This is the insight that's easy to miss if you only think about compression as a cost tool. It's a quality tool too. When your system prompt is 1,200 tokens of carefully written instructions and 800 tokens of accumulated cruft, removing the cruft doesn't just save money — it gives the model a cleaner signal.
What "without losing quality" actually means
Every compression approach has a quality envelope — a range where the output is indistinguishable from uncompressed, and a point where degradation starts. Simple truncation hits that point almost immediately. Summarization preserves broad meaning but loses specific details the model needs for precise answers.
Learned compression — models trained specifically to remove redundancy while preserving the semantic content LLMs attend to — operates in a much wider envelope. On standard benchmarks, good compression at 60% reduction preserves or slightly improves accuracy, because it's removing tokens the model was ignoring anyway.
The way to verify this for your use case: run your existing eval suite on compressed inputs. If your evals pass, compression is safe. If they don't, you've dialed compression too aggressively for that specific content type. This isn't guesswork — it's measurable.
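A simple way to wire that up is a pass-rate comparison: run each eval case on the raw input and on the compressed input, and fail the gate if the compressed pass rate drops by more than your tolerance. In the sketch below, compress and runCase are placeholders for your compression call and your existing eval runner.

// Quality gate: run each eval case on raw and compressed input and compare
// pass rates. Fail the gate if compression costs more than `maxDrop`.
interface EvalCase {
  input: string;
  expected: string;
}

async function compressionIsSafe(
  cases: EvalCase[],
  compress: (text: string) => Promise<string>,                    // your compression call
  runCase: (input: string, expected: string) => Promise<boolean>, // your eval runner
  maxDrop = 0.01 // tolerate at most a 1-point drop in pass rate
): Promise<boolean> {
  let basePass = 0;
  let compressedPass = 0;
  for (const c of cases) {
    if (await runCase(c.input, c.expected)) basePass++;
    if (await runCase(await compress(c.input), c.expected)) compressedPass++;
  }
  return compressedPass / cases.length >= basePass / cases.length - maxDrop;
}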
How we built this at The Token Company
This is exactly the problem we built The Token Company to solve. Our compression models (bear-1, bear-1.1, bear-1.2) are trained specifically for LLM input — they understand what language models attend to and what they don't.
The integration is a single API call before your existing LLM pipeline. Send your text in, get compressed text back, pass it to your model. No infrastructure changes, no prompt rewriting, no fine-tuning.
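To show the shape of that integration, here's an illustrative sketch. The endpoint URL, request fields, response field, and environment variable are placeholders rather than the real API; the quickstart has the actual request format.

// Illustrative only: the endpoint, request body, response field, and env var
// below are placeholders, not the real API. See the quickstart for the
// actual request format.
async function compressInput(text: string): Promise<string> {
  const res = await fetch("https://api.example.com/v1/compress", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.TOKEN_COMPANY_API_KEY}`,
    },
    body: JSON.stringify({ text }),
  });
  const data = await res.json();
  return data.compressed_text; // placeholder field name
}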
~66% token reduction · <100ms per 10K tokens · +1.1% accuracy vs. uncompressed
Three things to do this week
Monday: Instrument your token usage. Add logging that breaks down input tokens by source (system prompt, history, RAG context, tools). You need the data before you can prioritize.
Wednesday: Attack your biggest bucket. If system prompts dominate, audit and trim them. If it's RAG context, try reducing top-k or adding a reranker. If it's conversation history, implement a sliding window. Pick the single highest-impact change and ship it.
Friday: Test compression on your actual traffic. The fastest path from here to 60%+ input reduction is a purpose-built compression layer — and you can have it running in production by end of week.
Start compressing prompts in 5 minutes
Create an account, grab your API key, and add one line to your pipeline. Read the quickstart or jump straight in.
About The Token Company — We build learned compression models that reduce LLM input tokens by up to 66% while preserving or improving output quality. Our API sits between your application and your LLM provider, compressing input in real time with sub-100ms overhead. Used by teams processing billions of tokens per month.