
By Otso Veisterä · May 2026 · Last updated May 6, 2026

Why Your RAG App's Token Bill Is So High (And How to Fix It)

TL;DR

Most RAG applications retrieve 3-5x more context than the model meaningfully uses. Embedding similarity pulls in chunks that are topically related but full of boilerplate, formatting, and noise. You're paying for tokens the model ignores. Compression at the context layer — after retrieval, before the LLM call — is the highest-leverage fix.

The retrieval cost nobody warned you about

A standard RAG setup: top-k of 5 with an average chunk size of 500 tokens gives you 2,500 tokens of retrieved context per query. Add a system prompt (800 tokens) and the user's question (100 tokens), and you're at 3,400 input tokens per call. The retrieved context is 74% of your input.

Now scale it. At 100,000 queries per day on Claude Sonnet 4.6 ($3/MTok input):

100K queries × 3,400 tokens = 340M tokens/day

340M × 30 days = 10.2B tokens/month

10,200 MTok × $3 = $30,600/month on input alone

Of which ~$22,500 is just the retrieved context

That $22,500 per month is paying for chunks that include navigation breadcrumbs, repeated headers, HTML artifacts, copyright footers, and boilerplate paragraphs that appeared in every page of the source docs. The model reads all of it. It attends to almost none of it.
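
If you want to sanity-check these figures against your own traffic, the arithmetic fits in a few lines. A minimal sketch, using the volume and pricing assumptions from this post; swap in your own numbers:

estimate-rag-input-cost.ts
// Back-of-envelope input cost for a RAG app.
// Every constant below is an assumption from this post, not a measured value.
const QUERIES_PER_DAY = 100_000;
const CHUNKS_PER_QUERY = 5;
const TOKENS_PER_CHUNK = 500;
const SYSTEM_PROMPT_TOKENS = 800;
const QUESTION_TOKENS = 100;
const INPUT_PRICE_PER_MTOK = 3; // USD per million input tokens

const contextTokens = CHUNKS_PER_QUERY * TOKENS_PER_CHUNK;                  // 2,500
const inputTokens = contextTokens + SYSTEM_PROMPT_TOKENS + QUESTION_TOKENS; // 3,400
const monthlyTokens = inputTokens * QUERIES_PER_DAY * 30;                   // 10.2B
const monthlyCost = (monthlyTokens / 1_000_000) * INPUT_PRICE_PER_MTOK;     // $30,600
const contextShare = contextTokens / inputTokens;                           // ~74%

console.log(`Monthly input cost: $${Math.round(monthlyCost).toLocaleString()}`);
console.log(`Of which retrieved context: $${Math.round(monthlyCost * contextShare).toLocaleString()}`);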

Why retrieval pulls more than the model needs

The root problem is a mismatch between what retrieval optimizes for and what the LLM needs.

Embedding similarity is not the same as usefulness. Your embedding model finds chunks that are semantically close to the query. But "semantically close" includes chunks that discuss the same topic without containing the specific answer, or chunks where the relevant sentence is buried in 400 tokens of surrounding context.

Chunks carry structural noise. Unless you've done aggressive preprocessing, your chunks contain headers, footers, navigation elements, table-of-contents fragments, and formatting markup. These tokens look benign but they add up fast — often 15-25% of a chunk is non-content tokens.

The model ignores most of what you send. Research on long-context LLM behavior (the "lost in the middle" paper by Liu et al.) shows that models disproportionately attend to the beginning and end of their context. Chunks in the middle of your retrieved context contribute less to the answer than their token cost suggests.

Top-k is set defensively, not optimally. Nobody wants to be the engineer who shipped a RAG app that misses answers because k was too low. So teams set k=5 or k=10 as a safe default and never revisit it. For most queries, k=2 or k=3 would produce identical answers at a fraction of the cost.
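
If you suspect your k is defensive rather than optimal, the cheapest way to find out is to sweep it against an eval set you already trust. A minimal sketch; the retrieve, answer, score, and countTokens hooks are placeholders for your own retriever, generation call, eval metric, and tokenizer:

sweep-top-k.ts
type EvalCase = { query: string; expected: string };

type Hooks = {
  retrieve: (query: string, k: number) => Promise<string[]>;
  answer: (query: string, chunks: string[]) => Promise<string>;
  score: (answer: string, expected: string) => number; // 0..1, your eval metric
  countTokens: (text: string) => number;
};

async function sweepTopK(cases: EvalCase[], ks: number[], hooks: Hooks) {
  for (const k of ks) {
    let tokens = 0;
    let quality = 0;

    for (const c of cases) {
      const chunks = await hooks.retrieve(c.query, k);
      tokens += chunks.reduce((sum, chunk) => sum + hooks.countTokens(chunk), 0);
      quality += hooks.score(await hooks.answer(c.query, chunks), c.expected);
    }

    console.log(
      `k=${k}: avg quality ${(quality / cases.length).toFixed(2)}, ` +
        `avg retrieved tokens ${Math.round(tokens / cases.length)}`
    );
  }
}

If quality is flat from k=3 to k=10, every chunk past the third is pure cost.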

The costs that don't show up on your invoice

Token spend is the obvious cost. But over-fetching hurts in three other ways that compound in production.

Latency. More input tokens mean a longer time-to-first-token. For a user waiting for an answer from a support bot or search interface, each additional 1,000 tokens of context adds measurable latency. On frontier models with long context, this can mean the difference between a 1-second and a 3-second response. (A rough way to measure this on your own workload is sketched below, after the three costs.)

Quality degradation. The "lost in the middle" effect is real. Padding your context with marginally relevant chunks doesn't just waste money — it actively degrades answer quality. The model has to separate signal from noise, and more noise means more mistakes.

KV cache pressure. At scale, every extra token consumes GPU memory for the key-value cache during inference. On shared infrastructure (which is most API usage), this affects throughput. You're not just paying for the tokens directly — you're contributing to the congestion that causes rate limits and increased latency for everyone, including your own concurrent requests.
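
Of the three, latency is the easiest to measure directly. A rough sketch using the Anthropic SDK's streaming mode to time the first token; run it with your real retrieved context at a few different sizes and compare:

measure-ttft.ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Returns milliseconds until the first streamed content token arrives.
async function timeToFirstToken(context: string, question: string): Promise<number> {
  const start = Date.now();

  const stream = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 256,
    stream: true,
    messages: [
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });

  for await (const event of stream) {
    if (event.type === "content_block_delta") {
      return Date.now() - start; // first token arrived
    }
  }
  return Date.now() - start; // stream ended without emitting content
}

// Usage: time the same question with full vs. trimmed context.
// const fullMs = await timeToFirstToken(fullContext, question);
// const trimmedMs = await timeToFirstToken(trimmedContext, question);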

Three layers of fixes, ranked by leverage

If you want the full picture on LLM cost optimization beyond just RAG, start with our broader cost framework. For RAG specifically, here are the three layers, in order of how much effort they take versus how much they save.

Layer 1: Better retrieval

Add a reranker after your initial retrieval pass to score chunks by actual relevance to the query, not just embedding distance. Use dynamic top-k instead of a fixed number — let the reranker score determine how many chunks make the cut. This typically saves 20-40% of retrieval tokens with moderate engineering effort. The downside: it adds a reranking step to your pipeline, which has its own latency and cost.
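
A minimal sketch of the selection step, assuming you already have a reranker in place; scoreRelevance is a placeholder for whatever scorer you use (a cross-encoder, a rerank API), and the threshold and bounds are illustrative:

dynamic-top-k.ts
type ScoredChunk = { text: string; score: number };

async function selectChunks(
  query: string,
  candidates: string[],
  scoreRelevance: (query: string, chunk: string) => Promise<number>, // 0..1
  opts = { threshold: 0.5, minChunks: 1, maxChunks: 5 }
): Promise<string[]> {
  // Score every candidate against the query with the reranker.
  const scored: ScoredChunk[] = await Promise.all(
    candidates.map(async (text) => ({
      text,
      score: await scoreRelevance(query, text),
    }))
  );

  // Highest relevance first.
  scored.sort((a, b) => b.score - a.score);

  // Dynamic top-k: keep chunks above the threshold, bounded so we never
  // return nothing and never exceed the old fixed k.
  const aboveThreshold = scored.filter((c) => c.score >= opts.threshold);
  const selected = (aboveThreshold.length >= opts.minChunks
    ? aboveThreshold
    : scored.slice(0, opts.minChunks)
  ).slice(0, opts.maxChunks);

  return selected.map((c) => c.text);
}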

Layer 2: Chunk hygiene

Clean your chunks at indexing time. Strip HTML artifacts, navigation elements, repeated headers, and boilerplate. Deduplicate near-identical chunks. This is straightforward work — regex, some heuristics, maybe a simple classifier for boilerplate. Typically saves 10-20% of tokens per chunk. Easy wins, but the ceiling is limited because the remaining tokens are genuine content, just more than the model needs.
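
A rough sketch of what that indexing-time pass can look like; the boilerplate patterns and the 0.9 similarity threshold are illustrative, not tuned values:

chunk-hygiene.ts
// Indexing-time cleanup: strip markup and boilerplate, then drop
// near-identical chunks before anything gets embedded.

const BOILERPLATE_PATTERNS = [
  /^copyright .*$/gim,
  /^all rights reserved.*$/gim,
  /^(home|docs|blog)( [>\/] .+)+$/gim, // breadcrumb-style navigation lines
];

function cleanChunk(raw: string): string {
  let text = raw
    .replace(/<[^>]+>/g, " ")    // strip HTML tags
    .replace(/&[a-z]+;/gi, " "); // strip named HTML entities
  for (const pattern of BOILERPLATE_PATTERNS) {
    text = text.replace(pattern, "");
  }
  return text.replace(/\s+/g, " ").trim(); // collapse whitespace
}

// Near-duplicate detection via Jaccard similarity over word sets.
function jaccard(a: string, b: string): number {
  const wordsA = new Set(a.toLowerCase().split(/\s+/));
  const wordsB = new Set(b.toLowerCase().split(/\s+/));
  const shared = [...wordsA].filter((w) => wordsB.has(w)).length;
  const union = new Set([...wordsA, ...wordsB]).size;
  return union === 0 ? 0 : shared / union;
}

function dedupeChunks(chunks: string[], threshold = 0.9): string[] {
  const kept: string[] = [];
  for (const chunk of chunks) {
    if (!kept.some((existing) => jaccard(existing, chunk) >= threshold)) {
      kept.push(chunk);
    }
  }
  return kept;
}

// At indexing time, before embedding:
// const cleanChunks = dedupeChunks(rawChunks.map(cleanChunk));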

Layer 3: Compression at the context layer

This is where the math changes. After retrieval and reranking, compress the assembled context before it hits the LLM. Purpose-built compression removes redundancy across chunks — repeated information, verbose phrasing, tokens the model wouldn't attend to anyway — while preserving the semantic content and structure the model needs to generate accurate answers.

The key difference from naive summarization: good compression preserves citation anchors and source attribution. If your RAG app needs to tell users which document an answer came from, compression must keep those markers intact. Summarization destroys them.

Before and after: a real RAG context

Here's what a typical RAG pipeline looks like in code, and where compression slots in:

rag-with-compression.ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

async function ragQuery(query: string, chunks: string[]) {
  // Assemble context from retrieved chunks
  const ragContext = chunks
    .map((chunk, i) => `[Source ${i + 1}] ${chunk}`)
    .join("\n\n");

  // Before compression: ~2,500 tokens across 5 chunks
  // After compression:  ~1,000 tokens — same info, no noise
  const compressed = await compress(ragContext);

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 512,
    system: "Answer based on the provided sources. Cite sources by number.",
    messages: [
      {
        role: "user",
        content: `Context:\n${compressed}\n\nQuestion: ${query}`,
      },
    ],
  });

  return response;
}

async function compress(text: string): Promise<string> {
  // One API call to The Token Company
  const res = await fetch("https://api.thetokencompany.com/compress", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.TTC_API_KEY}`,
    },
    body: JSON.stringify({ text, model: "bear-1.2" }),
  });
  if (!res.ok) {
    throw new Error(`Compression request failed: ${res.status}`);
  }
  const data = await res.json();
  return data.compressed_text;
}

The numbers shift depending on content type, but across the RAG workloads we process, we consistently see 50-65% of retrieved context tokens being compressible without any change in answer quality. That's not a theoretical number — it's measured on production traffic with eval suites.
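
That kind of number is easy to check on your own traffic before you believe it. A small sketch that reuses the compress() helper from the example above; countTokens is a placeholder for your tokenizer, and the last step is whichever answer-quality eval you already run:

measure-compression.ts
// Measure the compression ratio over a sample of real retrieved contexts.
async function measureCompression(
  contexts: string[],
  countTokens: (text: string) => number
) {
  let before = 0;
  let after = 0;

  for (const ctx of contexts) {
    before += countTokens(ctx);
    after += countTokens(await compress(ctx)); // compress() from the example above
  }

  const saved = 1 - after / before;
  console.log(
    `Tokens before: ${before}, after: ${after}, saved: ${(saved * 100).toFixed(1)}%`
  );
  // Then run your existing answer-quality evals against both variants of the
  // context and confirm the scores match before rolling compression out.
}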

Typical RAG context compression

Before: 2,500 tokens (100%)
After: ~1,000 tokens (~40%)

Source attribution and citation anchors preserved

How The Token Company handles RAG compression

RAG context has structure that generic compression breaks. Chunks come from different sources. Citation markers need to survive compression intact. The boundary between chunks carries semantic meaning — the model uses it to weigh conflicting information.

Our compression models are trained on LLM input specifically, including multi-chunk retrieved context. They understand that [Source 3] is a citation anchor, not disposable text. They deduplicate information that appears across multiple chunks without losing the source attribution. And they do it in under 100ms, adding negligible overhead to your pipeline.

It's a single API call that drops into your existing pipeline between retrieval and the LLM call. No reindexing, no re-chunking, no changes to your retrieval logic.

Start compressing your RAG context today

One API call between retrieval and your LLM. Five minutes to integrate. See the quickstart.

About The Token Company — We build learned compression models that reduce LLM input tokens by up to 66% while preserving or improving output quality. Our API sits between your application and your LLM provider, compressing input in real time with sub-100ms overhead. Used by teams processing billions of tokens per month.