
Zero accuracy loss on conversational QA with 14% fewer tokens

bear-1.2 at aggressiveness 0.05 matches baseline accuracy exactly on CoQA — while removing 14% of input tokens across 4 source domains

March 2026 · Compression: bear-1.2 · LLM: gpt-5-mini · 150 questions · 1,200 total evaluations

Conversation · Compressed · 0pp · 14% fewer tokens — Compressing conversational context preserves multi-turn QA accuracy

87.3%

Accuracy maintained

0pp change from baseline

14.3%

Token reduction

7,860 tokens saved

150

Questions

Multi-turn conversations

4

Source domains

CNN, MCTest, RACE, Wikipedia

What is CoQA

CoQA (Conversational Question Answering) is a benchmark created by Reddy et al. at Stanford University. It tests whether models can engage in multi-turn conversations about a passage of text — answering a series of interconnected questions where each answer may depend on the context of previous turns.

The dataset spans 7 diverse domains including children's stories, news articles, Wikipedia, and English exam passages. This diversity tests whether comprehension holds across different writing styles and complexity levels.

What makes CoQA challenging is the conversational nature: questions often use pronouns and references that only make sense given the prior dialogue. The model must track context across turns while also understanding the source passage — a dual attention requirement that stresses both memory and comprehension.

Evaluation design

We sampled 150 questions from the CoQA validation set, drawn from conversations across 4 source domains (CNN, MCTest, RACE, Wikipedia). Each question was run through seven bear-1.2 compression configurations (aggressiveness from 0.05 to 0.7) and one uncompressed control.

The control sends the complete, uncompressed passage and conversation history straight to gpt-5-mini. This isolates compression as the only variable — any accuracy difference is attributable to what bear-1.2 kept or removed. Responses are evaluated by gpt-5-mini acting as an LLM-judge, comparing each answer against the gold answer key from the dataset.

Passage + History → bear-1.2 compression → Compressed context → gpt-5-mini → Answer → LLM-judge
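The pipeline above can be sketched as a simple scoring loop. This is a minimal sketch under stated assumptions, not the actual harness: `compress`, `answer`, and `judge` are hypothetical stand-ins for the bear-1.2 API call, the gpt-5-mini completion, and the LLM-judge comparison against the gold answer.

```python
# Sketch of the evaluation loop. The three helpers below are
# stand-ins (assumptions), not real API signatures.

def compress(text: str, aggressiveness: float) -> str:
    # Stand-in: the real call would hit the bear-1.2 compression API.
    return text

def answer(context: str, question: str) -> str:
    # Stand-in: the real call would send context + question to gpt-5-mini.
    return ""

def judge(prediction: str, gold: str) -> bool:
    # Stand-in: the real call would ask gpt-5-mini to compare the
    # prediction against the dataset's gold answer.
    return prediction.strip().lower() == gold.strip().lower()

def evaluate(questions, aggressiveness=None):
    """Score one configuration; aggressiveness=None is the uncompressed control."""
    correct = 0
    for q in questions:
        context = q["passage"] + "\n" + q["history"]
        if aggressiveness is not None:
            context = compress(context, aggressiveness)
        if judge(answer(context, q["question"]), q["gold"]):
            correct += 1
    return correct / len(questions)
```

Because the control path skips only the `compress` step, any accuracy difference between configurations is attributable to compression alone.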

Results

At aggressiveness 0.05, bear-1.2 matched the uncompressed baseline exactly — 87.3% accuracy, 131 out of 150 correct — while removing 14.3% of input tokens. This is a free reduction in cost and latency with zero impact on answer quality.

Accuracy remained close to baseline at 0.1 (85.3%, −2.0pp) and degraded gradually with heavier compression. At the most aggressive setting (0.7), accuracy dropped to 63.3% — a 24 percentage point loss, indicating that conversational QA is sensitive to heavy token removal.

[Chart: accuracy by compression configuration, no compression vs. bear-1.2 at aggressiveness 0.05–0.7; full data in the table below]
Config            Accuracy  Correct  Change    Token reduction
No compression    87.3%     131/150  —         —
bear-1.2 @ 0.05   87.3%     131/150  0pp       14.3%
bear-1.2 @ 0.1    85.3%     128/150  -2.0pp    19.5%
bear-1.2 @ 0.3    80.0%     120/150  -7.3pp    33.6%
bear-1.2 @ 0.4    78.7%     118/150  -8.6pp    39.1%
bear-1.2 @ 0.5    74.7%     112/150  -12.6pp   44.4%
bear-1.2 @ 0.7    63.3%     95/150   -24.0pp   57.1%

Performance across domains

CoQA draws from multiple source domains, each with different writing styles and complexity. At aggressiveness 0.05, accuracy stayed within a few points of baseline in every domain, and one domain actually improved.

MCTest (children's stories) improved from 85.4% to 90.2% — a 4.8 percentage point gain. These simpler narratives may contain more compressible noise, and removing it helps the model focus on the story elements needed for answers.

RACE and Wikipedia were unchanged at 81.1% and 92.3% respectively. CNN saw a small drop from 91.5% to 88.1%, though it remained the second-highest performing domain.

[Chart: per-domain accuracy, no compression vs. bear-1.2 @ 0.05; full data in the table below]
Domain     n   Baseline        bear-1.2 @ 0.05  Change
CNN        59  54/59 (91.5%)   52/59 (88.1%)    -3.4pp
MCTest     41  35/41 (85.4%)   37/41 (90.2%)    +4.8pp
RACE       37  30/37 (81.1%)   30/37 (81.1%)    0pp
Wikipedia  13  12/13 (92.3%)   12/13 (92.3%)    0pp

Compression works consistently across diverse text domains

At aggressiveness 0.05, accuracy held within a few points of baseline across all four source domains, from news articles to children's stories to encyclopedia text. MCTest actually gained 4.8 percentage points, suggesting that simpler narratives benefit from noise removal.

Accuracy across conversation depth

CoQA is a multi-turn dataset where each conversation contains up to 20 follow-up questions about a passage. Later turns often require resolving coreferences and building on prior answers, making them inherently harder.

At aggressiveness 0.05, bear-1.2 actually improved early-turn accuracy from 90.9% to 94.5% (+3.6pp) on turns 1–5, while mid-conversation turns (6–10) saw a modest dip from 88.9% to 83.3%. Late-conversation turns (11–15) improved from 78.6% to 85.7%, and the deepest turns (16–20) dipped from 84.6% to 76.9%, though that group covers only 13 questions. The gains on longer passages suggest that light compression can help the model focus on the most relevant context.

[Chart: accuracy by conversation depth, no compression vs. bear-1.2 @ 0.05; full data in the table below]
Turn group                       n   Baseline        bear-1.2 @ 0.05  Change
Turns 1–5 (early questions)      55  50/55 (90.9%)   52/55 (94.5%)    +3.6pp
Turns 6–10 (mid conversation)    54  48/54 (88.9%)   45/54 (83.3%)    -5.6pp
Turns 11–15 (late conversation)  28  22/28 (78.6%)   24/28 (85.7%)    +7.1pp
Turns 16–20 (deep conversation)  13  11/13 (84.6%)   10/13 (76.9%)    -7.7pp
Per-turn breakdown (turns 1–20)
Turn  n   Baseline  bear-1.2 @ 0.05  Change
1     11  100.0%    90.9%            -9.1pp
2     11  90.9%     100.0%           +9.1pp
3     11  81.8%     100.0%           +18.2pp
4     11  90.9%     90.9%            0pp
5     11  90.9%     90.9%            0pp
6     11  90.9%     81.8%            -9.1pp
7     11  81.8%     81.8%            0pp
8     11  100.0%    100.0%           0pp
9     11  81.8%     72.7%            -9.1pp
10    10  90.0%     80.0%            -10.0pp
11    8   87.5%     75.0%            -12.5pp
12    7   85.7%     85.7%            0pp
13    5   100.0%    100.0%           0pp
14    4   50.0%     75.0%            +25.0pp
15    4   50.0%     100.0%           +50.0pp
16    3   66.7%     66.7%            0pp
17    3   100.0%    100.0%           0pp
18    3   66.7%     66.7%            0pp
19    2   100.0%    100.0%           0pp
20    2   100.0%    50.0%            -50.0pp

Light compression preserves — and sometimes improves — multi-turn accuracy

At 0.05, overall accuracy is identical to baseline despite turn-by-turn variation. Early turns (1–5) and late turns (11–15) both improved, while mid-conversation turns (6–10) saw a small drop. Individual turn sample sizes are small (2–11 questions), so turn-level differences should be interpreted with caution.

The efficiency tradeoff

Aggressiveness 0.05 is the clear sweet spot: identical accuracy to the uncompressed baseline with 14% token savings. At 0.1, accuracy drops by only 2 percentage points while saving nearly 20% of tokens.

Beyond 0.3, accuracy falls below 80% and the degradation accelerates. At 0.7, over half the tokens are removed but accuracy drops to 63.3%. Conversational QA appears more sensitive to heavy compression than single-turn tasks, likely because multi-turn context tracking requires preserving more of the passage and dialogue history.
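To make the tradeoff concrete, here is a back-of-envelope sketch of the dollar savings at a given reduction rate. The per-million-token price and monthly volume are hypothetical placeholders (not gpt-5-mini's actual pricing); the reduction rates come from the results table above.

```python
def input_token_savings(tokens_per_month: float, reduction: float,
                        price_per_million_usd: float) -> float:
    """Dollars saved per month by removing `reduction` of input tokens."""
    return tokens_per_month * reduction * price_per_million_usd / 1e6

# Hypothetical workload: 100M input tokens/month at $0.25 per million.
light = input_token_savings(100e6, 0.143, 0.25)  # aggressiveness 0.05, no accuracy loss
heavy = input_token_savings(100e6, 0.571, 0.25)  # aggressiveness 0.7, -24pp accuracy
print(f"0.05: ${light:,.2f}/mo   0.7: ${heavy:,.2f}/mo")
```

Since the 0.05 setting costs no accuracy, its savings are pure upside; everything beyond that buys further savings at an accelerating accuracy price.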

[Chart: accuracy vs. token reduction for each configuration, baseline through aggressiveness 0.7]

Key findings

Light compression is free

At aggressiveness 0.05, accuracy matched the uncompressed baseline exactly (87.3%) while removing 14.3% of tokens. This translates directly to lower API costs and faster response times with no quality tradeoff.

Multi-domain robustness

Accuracy held across all four source domains at 0.05: RACE and Wikipedia were unchanged, CNN dipped by just 3.4 points, and MCTest (children's stories) actually improved by 4.8 percentage points. Compression generalizes across different text types.

Conversational context is preserved

CoQA requires tracking multi-turn dialogue where questions reference prior answers. bear-1.2 at 0.05 maintained this conversational thread — evidence that compression preserves the discourse structure needed for follow-up questions.

Heavy compression degrades conversational QA significantly

At aggressiveness 0.7, accuracy dropped to 63.3% — a 24 percentage point loss. MCTest was hit hardest, falling from 85.4% to 41.5%. Multi-turn tasks are more sensitive to aggressive token removal than single-turn benchmarks.

Methodology

Dataset

CoQA — 150 questions from the validation set across 4 domains (CNN, MCTest, RACE, Wikipedia) in multi-turn conversations

Evaluation

gpt-5-mini generates answers and serves as LLM-judge, evaluating responses against the gold answer key

Configurations

7 bear-1.2 aggressiveness levels from 0.05 to 0.7 (results reported above for 0.05, 0.1, 0.3, 0.4, 0.5, 0.7) + 1 uncompressed control

Reproducibility

Full code and results published at github.com/TheTokenCompany/Benchmarks

Limitations

  • This evaluation used gpt-5-mini. Results may vary across different model families and sizes.
  • CoQA tests conversational reading comprehension. Other dialogue formats (open-domain chat, instruction following) may respond differently to compression.
  • Token reduction percentages reflect the specific mix of passage lengths and conversation depths in our sample. Different samples may compress at different rates.

Ready to try it?

Create an account to get your API key and start compressing.