Zero accuracy loss on conversational QA with 14% fewer tokens
bear-1.2 at aggressiveness 0.05 matches baseline accuracy exactly on CoQA — while removing 14% of input tokens across 4 source domains
87.3%
Accuracy maintained
0pp change from baseline
14.3%
Token reduction
7,860 tokens saved
150
Questions
Multi-turn conversations
4
Source domains
CNN, MCTest, RACE, Wikipedia
What is CoQA
CoQA (Conversational Question Answering) is a benchmark created by Reddy et al. at Stanford University. It tests whether models can engage in multi-turn conversations about a passage of text — answering a series of interconnected questions where each answer may depend on the context of previous turns.
The dataset spans 7 diverse domains including children's stories, news articles, Wikipedia, and English exam passages. This diversity tests whether comprehension holds across different writing styles and complexity levels.
What makes CoQA challenging is the conversational nature: questions often use pronouns and references that only make sense given the prior dialogue. The model must track context across turns while also understanding the source passage — a dual attention requirement that stresses both memory and comprehension.
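To make the dual-context requirement concrete, here is a toy CoQA-style example and one way to assemble the model input for a given turn. The passage and turns are invented for illustration, not drawn from the dataset:

```python
# A toy CoQA-style conversation (hypothetical, not from the dataset).
# Note how "she" in turn 2 and "there" in turn 3 only resolve via prior turns.
example = {
    "source": "mctest",
    "passage": (
        "Jessica went to the park on Saturday. "
        "She fed the ducks and then read a book under the old oak tree."
    ),
    "turns": [
        {"question": "Who went to the park?", "answer": "Jessica"},
        {"question": "When did she go?", "answer": "on Saturday"},
        {"question": "What did she do there first?", "answer": "fed the ducks"},
    ],
}

def prompt_for_turn(example, turn_index):
    """Build the model input for one turn: passage + full prior dialogue + question."""
    history = "".join(
        f"Q: {t['question']}\nA: {t['answer']}\n"
        for t in example["turns"][:turn_index]
    )
    return f"{example['passage']}\n\n{history}Q: {example['turns'][turn_index]['question']}"
```

Because every turn carries the passage plus all earlier Q/A pairs, the input grows with conversation depth, which is exactly where token compression has room to help.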
Evaluation design
We sampled 150 questions from the CoQA validation set, drawn from conversations across 4 source domains (CNN, MCTest, RACE, Wikipedia). Each question was run through six bear-1.2 compression configurations (aggressiveness from 0.05 to 0.7) and one uncompressed control.
The control sends the complete, uncompressed passage and conversation history straight to gpt-5-mini. This isolates compression as the only variable — any accuracy difference is attributable to what bear-1.2 kept or removed. Responses are evaluated by gpt-5-mini acting as an LLM-judge, comparing each answer against the gold answer key from the dataset.
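The full harness lives in the repository linked below; as a rough sketch of the control-vs-compressed loop described above (function names and signatures here are illustrative, not the actual benchmark code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    context: str    # passage plus flattened conversation history
    question: str
    gold: str       # gold answer from the dataset's answer key

def run_config(examples: list[Example],
               compress: Callable[[str], str],
               answer: Callable[[str], str],
               judge: Callable[[str, str], bool]) -> float:
    """Score one configuration: compress the context (identity for the
    uncompressed control), ask the answering model, and let the LLM-judge
    grade the prediction against the gold answer. Returns accuracy."""
    correct = 0
    for ex in examples:
        context = compress(ex.context)                       # bear-1.2, or identity
        prediction = answer(f"{context}\nQ: {ex.question}")  # gpt-5-mini
        if judge(prediction, ex.gold):                       # gpt-5-mini as judge
            correct += 1
    return correct / len(examples)
```

Running `run_config` once with an identity `compress` and once per aggressiveness level keeps every other variable fixed, so any accuracy gap is attributable to compression alone.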
Results
At aggressiveness 0.05, bear-1.2 matched the uncompressed baseline exactly — 87.3% accuracy, 131 out of 150 correct — while removing 14.3% of input tokens. At this level, compression is effectively free: lower cost and lower latency with zero impact on answer quality.
Accuracy remained close to baseline at 0.1 (85.3%, −2.0pp) and degraded gradually with heavier compression. At the most aggressive setting (0.7), accuracy dropped to 63.3% — a 24 percentage point loss, indicating that conversational QA is sensitive to heavy token removal.
| Config | Accuracy | Correct | Change | Token Reduction |
|---|---|---|---|---|
| No compression | 87.3% | 131/150 | — | — |
| bear-1.2 @ 0.05 | 87.3% | 131/150 | 0pp | 14.3% |
| bear-1.2 @ 0.1 | 85.3% | 128/150 | -2.0pp | 19.5% |
| bear-1.2 @ 0.3 | 80.0% | 120/150 | -7.3pp | 33.6% |
| bear-1.2 @ 0.4 | 78.7% | 118/150 | -8.6pp | 39.1% |
| bear-1.2 @ 0.5 | 74.7% | 112/150 | -12.6pp | 44.4% |
| bear-1.2 @ 0.7 | 63.3% | 95/150 | -24.0pp | 57.1% |
Performance across domains
CoQA draws from multiple source domains, each with its own writing style and complexity. At aggressiveness 0.05, accuracy stayed within a few points of baseline in every domain, and one domain actually improved.
MCTest (children's stories) improved from 85.4% to 90.2% — a 4.8 percentage point gain. These simpler narratives may contain more compressible noise, and removing it helps the model focus on the story elements needed for answers.
RACE and Wikipedia were unchanged at 81.1% and 92.3% respectively. CNN saw a small drop from 91.5% to 88.1%, though it remained the second-highest performing domain.
| Domain | n | Baseline | bear-1.2 @ 0.05 | Change |
|---|---|---|---|---|
| CNN | 59 | 54/59 (91.5%) | 52/59 (88.1%) | -3.4pp |
| MCTest | 41 | 35/41 (85.4%) | 37/41 (90.2%) | +4.8pp |
| RACE | 37 | 30/37 (81.1%) | 30/37 (81.1%) | 0pp |
| Wikipedia | 13 | 12/13 (92.3%) | 12/13 (92.3%) | 0pp |
Compression works consistently across diverse text domains
At aggressiveness 0.05, accuracy was maintained or improved in three of the four source domains, from children's stories to encyclopedia text, with only a 3.4-point dip on CNN news articles. MCTest actually gained 4.8 percentage points, suggesting that simpler narratives benefit from noise removal.
Accuracy across conversation depth
CoQA is a multi-turn dataset where each conversation contains up to 20 follow-up questions about a passage. Later turns often require resolving coreferences and building on prior answers, making them inherently harder.
At aggressiveness 0.05, bear-1.2 actually improved early-turn accuracy from 90.9% to 94.5% (+3.6pp) on turns 1–5, while mid-conversation turns (6–10) saw a modest dip from 88.9% to 83.3%. Late-conversation turns (11–15) improved from 78.6% to 85.7%, suggesting that light compression can help the model focus on the most relevant parts of longer passages.
| Turn group | n | Baseline | bear-1.2 @ 0.05 | Change |
|---|---|---|---|---|
| Turns 1–5 (early questions) | 55 | 50/55 (90.9%) | 52/55 (94.5%) | +3.6pp |
| Turns 6–10 (mid conversation) | 54 | 48/54 (88.9%) | 45/54 (83.3%) | -5.6pp |
| Turns 11–15 (late conversation) | 28 | 22/28 (78.6%) | 24/28 (85.7%) | +7.1pp |
| Turns 16–20 (deep conversation) | 13 | 11/13 (84.6%) | 10/13 (76.9%) | -7.7pp |
Per-turn breakdown (turns 1–20):
| Turn | n | Baseline | bear-1.2 @ 0.05 | Change |
|---|---|---|---|---|
| 1 | 11 | 100.0% | 90.9% | -9.1pp |
| 2 | 11 | 90.9% | 100.0% | +9.1pp |
| 3 | 11 | 81.8% | 100.0% | +18.2pp |
| 4 | 11 | 90.9% | 90.9% | 0pp |
| 5 | 11 | 90.9% | 90.9% | 0pp |
| 6 | 11 | 90.9% | 81.8% | -9.1pp |
| 7 | 11 | 81.8% | 81.8% | 0pp |
| 8 | 11 | 100.0% | 100.0% | 0pp |
| 9 | 11 | 81.8% | 72.7% | -9.1pp |
| 10 | 10 | 90.0% | 80.0% | -10.0pp |
| 11 | 8 | 87.5% | 75.0% | -12.5pp |
| 12 | 7 | 85.7% | 85.7% | 0pp |
| 13 | 5 | 100.0% | 100.0% | 0pp |
| 14 | 4 | 50.0% | 75.0% | +25.0pp |
| 15 | 4 | 50.0% | 100.0% | +50.0pp |
| 16 | 3 | 66.7% | 66.7% | 0pp |
| 17 | 3 | 100.0% | 100.0% | 0pp |
| 18 | 3 | 66.7% | 66.7% | 0pp |
| 19 | 2 | 100.0% | 100.0% | 0pp |
| 20 | 2 | 100.0% | 50.0% | -50.0pp |
Light compression preserves — and sometimes improves — multi-turn accuracy
At 0.05, overall accuracy is identical to baseline despite turn-by-turn variation. Early turns (1–5) and late turns (11–15) both improved, while mid-conversation turns (6–10) and deep turns (16–20) saw small drops. Individual turn sample sizes are small (2–11 questions), so turn-level differences should be interpreted with caution.
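The turn-group numbers are pooled from the per-turn rows. A quick sketch, using the baseline column of the per-turn table, reconstructs the group accuracies:

```python
# Baseline per-turn results copied from the table above: turn -> (n, accuracy %).
per_turn = {
    1: (11, 100.0), 2: (11, 90.9), 3: (11, 81.8), 4: (11, 90.9), 5: (11, 90.9),
    6: (11, 90.9), 7: (11, 81.8), 8: (11, 100.0), 9: (11, 81.8), 10: (10, 90.0),
    11: (8, 87.5), 12: (7, 85.7), 13: (5, 100.0), 14: (4, 50.0), 15: (4, 50.0),
    16: (3, 66.7), 17: (3, 100.0), 18: (3, 66.7), 19: (2, 100.0), 20: (2, 100.0),
}

def group_accuracy(per_turn, lo, hi):
    """Pool turns lo..hi into one bucket and return (n, accuracy %)."""
    n = sum(per_turn[t][0] for t in range(lo, hi + 1))
    # Recover integer correct counts from the rounded per-turn percentages.
    correct = sum(round(per_turn[t][0] * per_turn[t][1] / 100)
                  for t in range(lo, hi + 1))
    return n, round(correct / n * 100, 1)
```

Pooling matters because single turns have as few as 2 questions; `group_accuracy(per_turn, 1, 5)` recovers the 90.9% baseline figure for early turns from five noisy per-turn estimates.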
The efficiency tradeoff
Aggressiveness 0.05 is the clear sweet spot: identical accuracy to the uncompressed baseline with 14% token savings. At 0.1, accuracy drops by only 2 percentage points while saving nearly 20% of tokens.
Beyond 0.3, accuracy falls below 80% and the degradation accelerates. At 0.7, over half the tokens are removed but accuracy drops to 63.3%. Conversational QA appears more sensitive to heavy compression than single-turn tasks, likely because multi-turn context tracking requires preserving more of the passage and dialogue history.
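One way to quantify the tradeoff is accuracy points given up per 10% of tokens removed. A short sketch over the numbers in the results table:

```python
# (token reduction %, accuracy %) per aggressiveness level, from the results table.
configs = {
    0.05: (14.3, 87.3),
    0.1:  (19.5, 85.3),
    0.3:  (33.6, 80.0),
    0.4:  (39.1, 78.7),
    0.5:  (44.4, 74.7),
    0.7:  (57.1, 63.3),
}
BASELINE = 87.3  # uncompressed accuracy %

def accuracy_cost(aggressiveness):
    """Percentage points of accuracy given up per 10% of tokens removed."""
    reduction, acc = configs[aggressiveness]
    return round((BASELINE - acc) / reduction * 10, 2)
```

By this measure the cost roughly quadruples across the range: about 1pp of accuracy per 10% of tokens at aggressiveness 0.1, versus about 4.2pp per 10% at 0.7, which is why 0.05 and 0.1 sit in the sweet spot.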
Key findings
Light compression is free
At aggressiveness 0.05, accuracy matched the uncompressed baseline exactly (87.3%) while removing 14.3% of tokens. This translates directly to lower API costs and faster response times with no quality tradeoff.
Multi-domain robustness
Accuracy held close to baseline across all four source domains at 0.05: RACE and Wikipedia were unchanged, CNN dipped slightly, and MCTest (children's stories) actually improved by 4.8 percentage points. Compression generalizes across different text types.
Conversational context is preserved
CoQA requires tracking multi-turn dialogue where questions reference prior answers. bear-1.2 at 0.05 maintained this conversational thread — evidence that compression preserves the discourse structure needed for follow-up questions.
Heavy compression degrades conversational QA significantly
At aggressiveness 0.7, accuracy dropped to 63.3% — a 24 percentage point loss. MCTest was hit hardest, falling from 85.4% to 41.5%. Multi-turn tasks are more sensitive to aggressive token removal than single-turn benchmarks.
Methodology
Dataset
CoQA — 150 questions from the validation set across 4 domains (CNN, MCTest, RACE, Wikipedia) in multi-turn conversations
Evaluation
gpt-5-mini generates answers and serves as LLM-judge, evaluating responses against the gold answer key
Configurations
6 bear-1.2 aggressiveness levels (0.05, 0.1, 0.3, 0.4, 0.5, 0.7) + 1 uncompressed control
Reproducibility
Full code and results published at github.com/TheTokenCompany/Benchmarks
Limitations
- This evaluation used gpt-5-mini. Results may vary across different model families and sizes.
- CoQA tests conversational reading comprehension. Other dialogue formats (open-domain chat, instruction following) may respond differently to compression.
- Token reduction percentages reflect the specific mix of passage lengths and conversation depths in our sample. Different samples may compress at different rates.
Ready to try it?
Create an account to get your API key and start compressing.