Zero accuracy loss on conversational QA with 14% fewer tokens
bear-1.2 at aggressiveness 0.05 matches baseline accuracy exactly on CoQA — while removing 14% of input tokens across 4 source domains
87.3%
Accuracy maintained
0pp change from baseline
14.3%
Token reduction
7,860 tokens saved
150
Questions
Multi-turn conversations
4
Source domains
CNN, MCTest, RACE, Wikipedia
What is CoQA
CoQA (Conversational Question Answering) is a benchmark created by Reddy et al. at Stanford University. It tests whether models can engage in multi-turn conversations about a passage of text — answering a series of interconnected questions where each answer may depend on the context of previous turns.
The dataset spans 7 diverse domains including children's stories, news articles, Wikipedia, and English exam passages. This diversity tests whether comprehension holds across different writing styles and complexity levels.
What makes CoQA challenging is the conversational nature: questions often use pronouns and references that only make sense given the prior dialogue. The model must track context across turns while also understanding the source passage — a dual attention requirement that stresses both memory and comprehension.
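To make the dual-context requirement concrete, here is a toy CoQA-style example and one way to assemble the model input for a given turn. The passage and turns are invented for illustration, not drawn from the dataset:

```python
# A toy CoQA-style conversation (hypothetical, not from the dataset).
# Note how "she" in turn 2 and "there" in turn 3 only resolve via prior turns.
example = {
    "source": "mctest",
    "passage": (
        "Jessica went to the park on Saturday. "
        "She fed the ducks and then read a book under the old oak tree."
    ),
    "turns": [
        {"question": "Who went to the park?", "answer": "Jessica"},
        {"question": "When did she go?", "answer": "on Saturday"},
        {"question": "What did she do there first?", "answer": "fed the ducks"},
    ],
}

def prompt_for_turn(example, turn_index):
    """Build the model input for one turn: passage + full prior dialogue + question."""
    history = "".join(
        f"Q: {t['question']}\nA: {t['answer']}\n"
        for t in example["turns"][:turn_index]
    )
    return f"{example['passage']}\n\n{history}Q: {example['turns'][turn_index]['question']}"
```

Because every turn carries the passage plus all earlier Q/A pairs, the input grows with conversation depth, which is exactly where token compression has room to help.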
Evaluation design
We sampled 150 questions from the CoQA validation set, drawn from conversations across 4 source domains (CNN, MCTest, RACE, Wikipedia). Each question was run through six bear-1.2 compression configurations (aggressiveness from 0.05 to 0.7) and one uncompressed control.
The control sends the complete, uncompressed passage and conversation history straight to gpt-5-mini. This isolates compression as the only variable — any accuracy difference is attributable to what bear-1.2 kept or removed. Responses are evaluated by gpt-5-mini acting as an LLM-judge, comparing each answer against the gold answer key from the dataset.
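The full harness lives in the repository linked below; as a rough sketch of the control-vs-compressed loop described above (function names and signatures here are illustrative, not the actual benchmark code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    context: str    # passage plus flattened conversation history
    question: str
    gold: str       # gold answer from the dataset's answer key

def run_config(examples: list[Example],
               compress: Callable[[str], str],
               answer: Callable[[str], str],
               judge: Callable[[str, str], bool]) -> float:
    """Score one configuration: compress the context (identity for the
    uncompressed control), ask the answering model, and let the LLM-judge
    grade the prediction against the gold answer. Returns accuracy."""
    correct = 0
    for ex in examples:
        context = compress(ex.context)                       # bear-1.2, or identity
        prediction = answer(f"{context}\nQ: {ex.question}")  # gpt-5-mini
        if judge(prediction, ex.gold):                       # gpt-5-mini as judge
            correct += 1
    return correct / len(examples)
```

Running `run_config` once with an identity `compress` and once per aggressiveness level keeps every other variable fixed, so any accuracy gap is attributable to compression alone.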
Results
At aggressiveness 0.05, bear-1.2 matched the uncompressed baseline exactly — 87.3% accuracy, 131 out of 150 correct — while removing 14.3% of input tokens. At this level, compression is effectively free: lower cost and lower latency with zero impact on answer quality.
Accuracy remained close to baseline at 0.1 (85.3%, −2.0pp) and degraded gradually with heavier compression. At the most aggressive setting (0.7), accuracy dropped to 63.3% — a 24 percentage point loss, indicating that conversational QA is sensitive to heavy token removal.
| Config | Accuracy | Correct | Change | Token Reduction |
|---|---|---|---|---|
| No compression | 87.3% | 131/150 | — | — |
| bear-1.2 @ 0.05 | 87.3% | 131/150 | 0pp | 14.3% |
| bear-1.2 @ 0.1 | 85.3% | 128/150 | -2.0pp | 19.5% |
| bear-1.2 @ 0.3 | 80.0% | 120/150 | -7.3pp | 33.6% |
| bear-1.2 @ 0.4 | 78.7% | 118/150 | -8.6pp | 39.1% |
| bear-1.2 @ 0.5 | 74.7% | 112/150 | -12.6pp | 44.4% |
| bear-1.2 @ 0.7 | 63.3% | 95/150 | -24.0pp | 57.1% |
Performance across domains
CoQA draws from multiple source domains, each with its own writing style and complexity. At aggressiveness 0.05, accuracy stayed within a few points of baseline in every domain, and one domain actually improved.
MCTest (children's stories) improved from 85.4% to 90.2% — a 4.8 percentage point gain. These simpler narratives may contain more compressible noise, and removing it helps the model focus on the story elements needed for answers.
RACE and Wikipedia were unchanged at 81.1% and 92.3% respectively. CNN saw a small drop from 91.5% to 88.1%, though it remained the second-highest performing domain.
| Domain | n | Baseline | bear-1.2 @ 0.05 | Change |
|---|---|---|---|---|
| CNN | 59 | 54/59 (91.5%) | 52/59 (88.1%) | -3.4pp |
| MCTest | 41 | 35/41 (85.4%) | 37/41 (90.2%) | +4.8pp |
| RACE | 37 | 30/37 (81.1%) | 30/37 (81.1%) | 0pp |
| Wikipedia | 13 | 12/13 (92.3%) | 12/13 (92.3%) | 0pp |
Compression works consistently across diverse text domains
At aggressiveness 0.05, accuracy was maintained or improved in three of the four source domains, from children's stories to encyclopedia text, with only a 3.4-point dip on CNN news articles. MCTest actually gained 4.8 percentage points, suggesting that simpler narratives benefit from noise removal.
Accuracy across conversation depth
CoQA is a multi-turn dataset where each conversation contains up to 20 follow-up questions about a passage. Later turns often require resolving coreferences and building on prior answers, making them inherently harder.
At aggressiveness 0.05, bear-1.2 actually improved early-turn accuracy from 90.9% to 94.5% (+3.6pp) on turns 1–5, while mid-conversation turns (6–10) saw a modest dip from 88.9% to 83.3%. Late-conversation turns (11–15) improved from 78.6% to 85.7%, suggesting that light compression can help the model focus on the most relevant parts of longer passages.
| Turn group | n | Baseline | bear-1.2 @ 0.05 | Change |
|---|---|---|---|---|
| Turns 1–5 (early questions) | 55 | 50/55 (90.9%) | 52/55 (94.5%) | +3.6pp |
| Turns 6–10 (mid conversation) | 54 | 48/54 (88.9%) | 45/54 (83.3%) | -5.6pp |
| Turns 11–15 (late conversation) | 28 | 22/28 (78.6%) | 24/28 (85.7%) | +7.1pp |
| Turns 16–20 (deep conversation) | 13 | 11/13 (84.6%) | 10/13 (76.9%) | -7.7pp |
Per-turn breakdown (turns 1–20):
| Turn | n | Baseline | bear-1.2 @ 0.05 | Change |
|---|---|---|---|---|
| 1 | 11 | 100.0% | 90.9% | -9.1pp |
| 2 | 11 | 90.9% | 100.0% | +9.1pp |
| 3 | 11 | 81.8% | 100.0% | +18.2pp |
| 4 | 11 | 90.9% | 90.9% | 0pp |
| 5 | 11 | 90.9% | 90.9% | 0pp |
| 6 | 11 | 90.9% | 81.8% | -9.1pp |
| 7 | 11 | 81.8% | 81.8% | 0pp |
| 8 | 11 | 100.0% | 100.0% | 0pp |
| 9 | 11 | 81.8% | 72.7% | -9.1pp |
| 10 | 10 | 90.0% | 80.0% | -10.0pp |
| 11 | 8 | 87.5% | 75.0% | -12.5pp |
| 12 | 7 | 85.7% | 85.7% | 0pp |
| 13 | 5 | 100.0% | 100.0% | 0pp |
| 14 | 4 | 50.0% | 75.0% | +25.0pp |
| 15 | 4 | 50.0% | 100.0% | +50.0pp |
| 16 | 3 | 66.7% | 66.7% | 0pp |
| 17 | 3 | 100.0% | 100.0% | 0pp |
| 18 | 3 | 66.7% | 66.7% | 0pp |
| 19 | 2 | 100.0% | 100.0% | 0pp |
| 20 | 2 | 100.0% | 50.0% | -50.0pp |
Light compression preserves — and sometimes improves — multi-turn accuracy
At 0.05, overall accuracy is identical to baseline despite turn-by-turn variation. Early turns (1–5) and late turns (11–15) both improved, while mid-conversation turns (6–10) and deep turns (16–20) saw small drops. Individual turn sample sizes are small (2–11 questions), so turn-level differences should be interpreted with caution.
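The turn-group numbers are pooled from the per-turn rows. A quick sketch, using the baseline column of the per-turn table, reconstructs the group accuracies:

```python
# Baseline per-turn results copied from the table above: turn -> (n, accuracy %).
per_turn = {
    1: (11, 100.0), 2: (11, 90.9), 3: (11, 81.8), 4: (11, 90.9), 5: (11, 90.9),
    6: (11, 90.9), 7: (11, 81.8), 8: (11, 100.0), 9: (11, 81.8), 10: (10, 90.0),
    11: (8, 87.5), 12: (7, 85.7), 13: (5, 100.0), 14: (4, 50.0), 15: (4, 50.0),
    16: (3, 66.7), 17: (3, 100.0), 18: (3, 66.7), 19: (2, 100.0), 20: (2, 100.0),
}

def group_accuracy(per_turn, lo, hi):
    """Pool turns lo..hi into one bucket and return (n, accuracy %)."""
    n = sum(per_turn[t][0] for t in range(lo, hi + 1))
    # Recover integer correct counts from the rounded per-turn percentages.
    correct = sum(round(per_turn[t][0] * per_turn[t][1] / 100)
                  for t in range(lo, hi + 1))
    return n, round(correct / n * 100, 1)
```

Pooling matters because single turns have as few as 2 questions; `group_accuracy(per_turn, 1, 5)` recovers the 90.9% baseline figure for early turns from five noisy per-turn estimates.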
The efficiency tradeoff
Aggressiveness 0.05 is the clear sweet spot: identical accuracy to the uncompressed baseline with 14% token savings. At 0.1, accuracy drops by only 2 percentage points while saving nearly 20% of tokens.
Beyond 0.3, accuracy falls below 80% and the degradation accelerates. At 0.7, over half the tokens are removed but accuracy drops to 63.3%. Conversational QA appears more sensitive to heavy compression than single-turn tasks, likely because multi-turn context tracking requires preserving more of the passage and dialogue history.
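One way to quantify the tradeoff is accuracy points given up per 10% of tokens removed. A short sketch over the numbers in the results table:

```python
# (token reduction %, accuracy %) per aggressiveness level, from the results table.
configs = {
    0.05: (14.3, 87.3),
    0.1:  (19.5, 85.3),
    0.3:  (33.6, 80.0),
    0.4:  (39.1, 78.7),
    0.5:  (44.4, 74.7),
    0.7:  (57.1, 63.3),
}
BASELINE = 87.3  # uncompressed accuracy %

def accuracy_cost(aggressiveness):
    """Percentage points of accuracy given up per 10% of tokens removed."""
    reduction, acc = configs[aggressiveness]
    return round((BASELINE - acc) / reduction * 10, 2)
```

By this measure the cost roughly quadruples across the range: about 1pp of accuracy per 10% of tokens at aggressiveness 0.1, versus about 4.2pp per 10% at 0.7, which is why 0.05 and 0.1 sit in the sweet spot.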
Key findings
Light compression is free
At aggressiveness 0.05, accuracy matched the uncompressed baseline exactly (87.3%) while removing 14.3% of tokens. This translates directly to lower API costs and faster response times with no quality tradeoff.
Multi-domain robustness
Accuracy held close to baseline across all four source domains at 0.05: RACE and Wikipedia were unchanged, CNN dipped slightly, and MCTest (children's stories) actually improved by 4.8 percentage points. Compression generalizes across different text types.
Conversational context is preserved
CoQA requires tracking multi-turn dialogue where questions reference prior answers. bear-1.2 at 0.05 maintained this conversational thread — evidence that compression preserves the discourse structure needed for follow-up questions.
Heavy compression degrades conversational QA significantly
At aggressiveness 0.7, accuracy dropped to 63.3% — a 24 percentage point loss. MCTest was hit hardest, falling from 85.4% to 41.5%. Multi-turn tasks are more sensitive to aggressive token removal than single-turn benchmarks.
Methodology
Dataset
CoQA — 150 questions from the validation set across 4 domains (CNN, MCTest, RACE, Wikipedia) in multi-turn conversations
Evaluation
gpt-5-mini generates answers and serves as LLM-judge, evaluating responses against the gold answer key
Configurations
6 bear-1.2 aggressiveness levels (0.05, 0.1, 0.3, 0.4, 0.5, 0.7) + 1 uncompressed control
Reproducibility
Full code and results published at github.com/TheTokenCompany/Benchmarks
Limitations
- This evaluation used gpt-5-mini. Results may vary across different model families and sizes.
- CoQA tests conversational reading comprehension. Other dialogue formats (open-domain chat, instruction following) may respond differently to compression.
- Token reduction percentages reflect the specific mix of passage lengths and conversation depths in our sample. Different samples may compress at different rates.
Ready to try it?
Create an account to get your API key and start compressing.