The Token Company · June 2026

Compressing Conversational Context Without Losing the Thread

Bear-2 compression improved accuracy from 93.3% to 95.3% on the CoQA benchmark while cutting input tokens by 8.2%. Removing noise from context helps the model focus.

CoQA Benchmark Results

Meeting transcripts, customer support threads, and multi-turn conversations are some of the highest-token workloads in production LLM pipelines. A 60-minute meeting transcript can run 30,000 to 50,000 tokens. Send that to GPT-5.4 or Claude for summarization, action-item extraction, or Q&A, and most of what you're paying for is filler: restated points, tangential asides, and conversational padding.

The question is straightforward: how much of that context can you strip away before the model starts giving worse answers? We ran this experiment on CoQA, a conversational question answering benchmark from Stanford NLP. It's a good proxy for real-world transcript workflows: the model reads a passage of context, then answers a series of follow-up questions about it. Each answer depends on the full context, so if compression damages the wrong parts, the model will get things wrong.

Two compression modes

We tested Bear-2 at two aggressiveness levels, each suited to a different production scenario:

Bear-2 Low (τ = 0.05) barely touches the text. It removes only the most obvious filler: 55 tokens out of roughly 56,000 in our test set. This is the setting for pipelines where you can't tolerate any risk, such as legal transcripts, compliance reviews, and verbatim records. Even this minimal compression improved accuracy by 1.3 percentage points.

Bear-2 Medium (τ = 0.2) is the practical sweet spot. It cuts 8.2% of tokens and raises accuracy to 95.3%, two percentage points above the uncompressed baseline. For a 40,000-token meeting transcript, that's ~3,300 tokens saved per call. Our interpretation is that the model answers better because there is less noise competing with the relevant content.

CoQA accuracy across compression configurations

Figure 1: Accuracy vs control across the two compression configurations. Both compression levels outperform the uncompressed baseline.

What compression looks like

To see why accuracy improves, look at what the compressor actually removes. Here's a Wikipedia passage from the CoQA benchmark about a corporate acquisition, the kind of dense informational context you'd see in a business meeting recap:

Original passage (Wikipedia, from CoQA)

Uncompressed

The board of directors met on Tuesday to discuss the proposed acquisition of Meridian Systems, which had been under review since the beginning of the quarter. CFO Laura Chen presented a detailed financial analysis showing that the deal, valued at approximately $340 million, would be accretive to earnings within 18 months. She noted that the integration costs were expected to run between $15 million and $22 million over the first year, which was broadly in line with what had been discussed at the previous meeting. Several board members raised concerns about the competitive landscape and whether the regulatory approval process might extend beyond the projected timeline of six to eight months. After extensive deliberation and careful consideration of all the factors involved, the board voted unanimously to proceed with the offer, subject to standard due diligence conditions.

Bear-2 Medium

ret. 91%

The board of directors met on Tuesday to discuss the proposed acquisition of Meridian Systems, which had been under review since the beginning of the quarter. CFO Laura Chen presented a detailed financial analysis showing that the deal, valued at approximately $340 million, would be accretive to earnings within 18 months. She noted that the integration costs were expected to run between $15 million and $22 million over the first year, which was broadly in line with what had been discussed at the previous meeting. Several board members raised concerns about the competitive landscape and whether the regulatory approval process might extend beyond the projected timeline of six to eight months. After extensive deliberation and careful consideration of all the factors involved, the board voted unanimously to proceed with the offer, subject to standard due diligence conditions.

Figure 3: Bear-2 Medium compression on a CoQA passage. Struck-through gray words are deleted. Numbers, names, decisions, and timelines are preserved. Qualifiers and boilerplate are removed.

The compressor removes qualifiers (“proposed”, “approximately”, “detailed”), back-references (“what had been discussed at the previous meeting”), and procedural boilerplate (“After extensive deliberation and careful consideration of all the factors involved”). The $340M figure, the $15-22M cost range, the 18-month timeline, the 6-8 month regulatory window, and the unanimous vote all stay.

This is the same pattern you see in meeting transcripts. Decisions, action items, and facts stay. The “I think maybe we should probably consider” filler gets removed.
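To make the idea of deletion-only compression concrete, here is a toy sketch in Python. It is not how Bear-2 works internally (this post does not describe that); it simply deletes a hand-picked list of the phrases named above and measures how many words survive. The names `REMOVABLE`, `toy_compress`, and `word_retention` are ours, and word count is only a rough stand-in for token count.

```python
# Toy illustration only: a fixed phrase list standing in for a real compressor.
# The phrases come from the example above; a real system would not use a list.
REMOVABLE = [
    "After extensive deliberation and careful consideration of all the factors involved, ",
    "proposed ",
    "approximately ",
    "detailed ",
]


def toy_compress(text, phrases=REMOVABLE):
    """Delete each listed phrase; never rewrite or reorder what remains."""
    for phrase in phrases:
        text = text.replace(phrase, "")
    return text


def word_retention(original, compressed):
    """Fraction of whitespace-separated words kept (a crude proxy for tokens)."""
    return len(compressed.split()) / len(original.split())


sample = (
    "CFO Laura Chen presented a detailed financial analysis showing that the deal, "
    "valued at approximately $340 million, would be accretive to earnings within 18 months. "
    "After extensive deliberation and careful consideration of all the factors involved, "
    "the board voted unanimously to proceed with the offer, "
    "subject to standard due diligence conditions."
)

compressed = toy_compress(sample)
retention = word_retention(sample, compressed)
print(compressed)
print(f"word retention: {retention:.0%}")
```

Because the sketch only deletes, every number, name, and decision in the output appears verbatim in the input, which is the property the example above relies on.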

Setup

We used Bear-2 to compress the story text before passing it to GPT-5.4 for Q&A. The conversation history (prior questions and answers) was left untouched. Evaluation used an LLM-as-judge approach: for each of 150 questions, the judge compared the model's answer against the gold answer.
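The evaluation loop described above can be sketched roughly as follows. This is a minimal outline under assumptions, not our actual harness: `compress`, `answer`, and `judge` are hypothetical callables you would supply (the compressor, the Q&A model call, and the LLM judge), and the dictionary keys are illustrative.

```python
from typing import Callable, Dict, List


def build_messages(
    story: str,
    history: List[Dict[str, str]],
    question: str,
    compress: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Compress only the story; prior turns and the new question pass through unchanged."""
    messages = [
        {"role": "system", "content": "Answer using the passage.\n\n" + compress(story)}
    ]
    for turn in history:
        messages.append({"role": "user", "content": turn["question"]})
        messages.append({"role": "assistant", "content": turn["answer"]})
    messages.append({"role": "user", "content": question})
    return messages


def accuracy(
    examples: List[Dict[str, str]],
    answer: Callable[[Dict[str, str]], str],
    judge: Callable[[str, str, str], bool],
) -> float:
    """Share of examples where the judge accepts the model's answer against the gold answer."""
    correct = 0
    for example in examples:
        predicted = answer(example)
        if judge(example["question"], example["gold"], predicted):
            correct += 1
    return correct / len(examples)


# Stand-in callables so the sketch runs without any API access.
def identity_compress(text: str) -> str:
    return text


def exact_match_judge(question: str, gold: str, predicted: str) -> bool:
    return gold.strip().lower() == predicted.strip().lower()


demo_examples = [
    {"question": "Who presented the analysis?", "gold": "Laura Chen"},
    {"question": "How did the board vote?", "gold": "Unanimously"},
]
demo_answers = {
    "Who presented the analysis?": "laura chen",
    "How did the board vote?": "By majority",
}
demo_accuracy = accuracy(
    demo_examples, lambda ex: demo_answers[ex["question"]], exact_match_judge
)
```

Keeping the compressor as an injected callable makes it easy to run the same examples through the control (an identity function) and each compression level.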

Adding compression takes a single wrapper call. The SDK wraps your existing OpenAI or Anthropic client:

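Python
from openai import OpenAI
from thetokencompany.openai import with_compression

# Wrap the existing OpenAI client; aggressiveness=0.2 is the Bear-2 Medium setting
client = with_compression(
    OpenAI(api_key="YOUR_OPENAI_API_KEY"),
    compression_api_key="ttc_sk_...",
    model="bear-2",
    aggressiveness=0.2,
)

# Placeholder inputs; substitute your own prompt and transcript
system_prompt = "Answer questions about the meeting transcript."
transcript = "..."

# Compression happens automatically on every call
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": transcript},
    ],
)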

Results

Configuration              Accuracy   Tokens used   Tokens saved   vs control
Control (no compression)   93.3%      56,372        -              -
Bear-2 Low (τ = 0.05)      94.7%      56,317        0.1%           +1.3pp
Bear-2 Medium (τ = 0.2)    95.3%      51,754        8.2%           +2.0pp

Table 1: Accuracy and token usage across compression configurations. Both compression levels outperform the control.
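The derived columns in Table 1 can be reproduced from the raw figures. The sketch below assumes each of the 150 questions is scored as simply right or wrong, so the reported percentages correspond to whole-number counts of correct answers; that assumption, and the names `runs` and `derived`, are ours. It also shows why Bear-2 Low is listed as +1.3pp even though the rounded accuracies differ by 1.4.

```python
QUESTIONS = 150  # evaluation size stated in the Setup section

# Figures copied from Table 1.
runs = {
    "Control": {"accuracy": 0.933, "tokens": 56_372},
    "Bear-2 Low": {"accuracy": 0.947, "tokens": 56_317},
    "Bear-2 Medium": {"accuracy": 0.953, "tokens": 51_754},
}


def derived(name):
    """Recompute tokens saved and the accuracy delta against the control run."""
    base = runs["Control"]
    run = runs[name]
    saved = base["tokens"] - run["tokens"]
    # Assumption: binary scoring, so accuracy * 150 rounds to a whole count.
    correct = round(run["accuracy"] * QUESTIONS)
    base_correct = round(base["accuracy"] * QUESTIONS)
    return {
        "tokens_saved": saved,
        "saved_pct": 100 * saved / base["tokens"],
        "correct": correct,
        "delta_pp": 100 * (correct - base_correct) / QUESTIONS,
    }


for name in ("Bear-2 Low", "Bear-2 Medium"):
    row = derived(name)
    print(
        f"{name}: {row['correct']}/{QUESTIONS} correct, "
        f"{row['tokens_saved']:,} tokens saved ({row['saved_pct']:.1f}%), "
        f"{row['delta_pp']:+.1f}pp vs control"
    )
```

Under that assumption the three runs correspond to 140, 142, and 143 correct answers out of 150, so the gap between control and Bear-2 Medium is three questions.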

Both compression levels beat the uncompressed baseline. At Medium compression, the model answers 95.3% of questions correctly while using 8.2% fewer tokens. The effect is consistent across all four content domains in the benchmark:

Per-domain accuracy comparison

Figure 2: Per-domain accuracy at Bear-2 Medium (τ = 0.2). Bear-2 matches or improves the control in every domain, with the largest gain on MCTest (+7.3pp).

The largest improvement came from MCTest (children's stories), which went from 85.4% to 92.7%. These passages tend to have more descriptive filler, the kind of content that dominates meeting transcripts.

Why this matters for meeting transcripts

CoQA's structure mirrors a common production pattern: a long context document (the transcript) followed by specific questions (“What were the action items?”, “What did Sarah say about the timeline?”). In our test, compressing the context before Q&A saved tokens and improved answer quality, because much of the noise that dilutes the signal is gone.

For meeting transcript pipelines, this means:

  • Summarization. Compress the transcript before sending it to the LLM. The summary focuses on substance.
  • Action-item extraction. Compressed transcripts surface decisions and commitments more cleanly. The hedging and repetition that obscures them is removed.
  • Search and Q&A over recordings. When a user asks “What was decided about the Q3 roadmap?”, the model has less filler to read past before it reaches the answer.
  • Cost. A 40,000-token transcript compressed at τ = 0.2 saves ~3,300 tokens per LLM call. Over thousands of meetings per month, that adds up.
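For the cost point, a back-of-the-envelope estimate might look like the sketch below. The 40,000-token transcript and 8.2% reduction come from this post; the monthly call volume and the price per million input tokens are placeholder values we made up for illustration, not quoted rates, and the cost of the compression step itself is not included.

```python
def transcript_savings(tokens_per_call, reduction, calls_per_month, usd_per_million_tokens):
    """Estimate input tokens and dollars saved by compressing before each LLM call."""
    saved_per_call = tokens_per_call * reduction
    saved_per_month = saved_per_call * calls_per_month
    usd_per_month = saved_per_month / 1_000_000 * usd_per_million_tokens
    return saved_per_call, saved_per_month, usd_per_month


per_call, per_month, usd = transcript_savings(
    tokens_per_call=40_000,        # transcript size used in this post
    reduction=0.082,               # Bear-2 Medium token reduction on CoQA
    calls_per_month=5_000,         # hypothetical volume
    usd_per_million_tokens=2.50,   # hypothetical input price, not a quoted rate
)
print(f"{per_call:,.0f} tokens per call, {per_month:,.0f} tokens per month, ${usd:,.2f} per month")
```

The per-call figure works out to 3,280 tokens, which is the ~3,300 quoted above. Swap in your own volume and your provider's current input price to get a realistic number.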

Recommended settings

Use Bear-2 Low (τ = 0.05) for pipelines where fidelity is critical and you want the accuracy boost without visible changes to the text: legal, compliance, and verbatim records.

Use Bear-2 Medium (τ = 0.2) for general-purpose transcript and document processing. It gave the best accuracy in our tests, with meaningful token savings and output that reads cleanly.

For longer contexts (10K+ tokens), you can typically push aggressiveness higher. The compressor has more material to work with and can make better decisions about what's redundant. The CoQA stories are short enough that even medium aggressiveness starts cutting into useful content, so we expect longer transcripts to see proportionally larger savings.
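One way to find how far you can push aggressiveness on your own data is a small sweep: evaluate a few settings and keep the highest one that does not fall below your uncompressed baseline. The sketch below assumes you already have an `evaluate` function that returns accuracy for a given aggressiveness; that function and `pick_aggressiveness` are hypothetical. The scores for 0.05 and 0.2 are the CoQA results from this post, while the 0.4 score is an invented placeholder so the example has something to reject.

```python
def pick_aggressiveness(candidates, evaluate, baseline_accuracy, tolerance=0.0):
    """Return the highest setting whose accuracy stays at or above baseline - tolerance."""
    best = None
    for tau in sorted(candidates):
        if evaluate(tau) >= baseline_accuracy - tolerance:
            best = tau
    return best


# 0.05 and 0.2 are the reported CoQA results; 0.4 is an INVENTED placeholder,
# not a measurement. Replace this lookup with a real evaluation run.
placeholder_scores = {0.05: 0.947, 0.2: 0.953, 0.4: 0.900}

chosen = pick_aggressiveness(
    candidates=placeholder_scores,
    evaluate=placeholder_scores.get,
    baseline_accuracy=0.933,
)
print(f"highest setting that holds the baseline: {chosen}")
```

The `tolerance` argument lets you trade a small, explicit amount of accuracy for extra savings if that suits your pipeline.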

Limitations

This evaluation covers 150 questions from four of the five in-domain sources (Gutenberg was not represented in our sample). Running the complete 8,000-turn validation set and expanding to out-of-domain data would give a more complete picture.
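To get a feel for how much a 150-question sample can tell us, a standard 95% Wilson score interval can be computed for each accuracy figure. The sketch assumes binary scoring, so that 93.3% and 95.3% correspond to 140 and 143 correct answers out of 150; those counts are inferred from the rounded percentages rather than reported directly.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """95% Wilson score interval for a binomial proportion (z=1.96)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Counts inferred from the reported percentages (assumption: binary scoring).
control = wilson_interval(140, 150)
medium = wilson_interval(143, 150)

print(f"control:       {control[0]:.1%} to {control[1]:.1%}")
print(f"Bear-2 Medium: {medium[0]:.1%} to {medium[1]:.1%}")
```

Under these assumptions the two intervals overlap, which is one more reason the full validation set would be informative.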

The passages in CoQA are also relatively short (a few hundred tokens). Real meeting transcripts are 10 to 100x longer, where compression typically delivers larger savings. These results likely understate the benefit for production transcript workflows.

Try Bear-2 compression on your own data

Free tier includes 50M tokens. No credit card required.