
Improving LLM reading comprehension with compression

Compressing context passages improves SQuAD 2.0 accuracy by 4.0 percentage points — with 17% fewer input tokens

March 2026 · Compression: bear-1.2 · LLM: gpt-5-mini · 150 questions · 1,200 total evaluations

Removing noise from reading passages lets LLMs comprehend more accurately.

  • 72.0% best accuracy (+4.0pp over baseline)
  • 17.3% token reduction (at the best-accuracy configuration)
  • 150 questions (69 answerable, 81 unanswerable)
  • 7 configurations (+ 1 uncompressed control)
What is SQuAD 2.0?

SQuAD 2.0 (Stanford Question Answering Dataset) is a reading comprehension benchmark created by Rajpurkar et al. at Stanford University. It tests whether models can read a passage of text and correctly answer questions about it — or recognize when no answer exists.

The dataset combines 100,000+ answerable questions from the original SQuAD 1.1 with over 50,000 adversarially crafted unanswerable questions. The unanswerable questions are designed to look plausible — they are written by crowdworkers who were shown the passage and asked to write questions that cannot be answered from it.

This makes SQuAD 2.0 significantly harder than its predecessor. Models must not only find correct answers in the text but also learn when to abstain — a critical capability for production systems where confidently wrong answers are worse than no answer at all.

Evaluation design

We sampled 150 questions from the SQuAD 2.0 validation set (69 answerable, 81 unanswerable), all drawn from a single article about the Normans to keep passage context consistent. Each question was run through seven bear-1.2 compression configurations (aggressiveness from 0.05 to 0.7) and one uncompressed control.

The control sends the complete, uncompressed Wikipedia passage straight to gpt-5-mini. This isolates compression as the only variable — any accuracy difference is attributable to what bear-1.2 kept or removed. Responses are evaluated by gpt-5-mini acting as an LLM-judge, comparing each answer against the gold answer key from the dataset.

Wikipedia passage → bear-1.2 compression → Compressed passage → gpt-5-mini → Answer → LLM-judge
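One evaluation pass over this pipeline can be sketched in Python. The `compress`, `ask_model`, and `judge` functions below are hypothetical stubs standing in for the bear-1.2 and gpt-5-mini API calls, not the actual client libraries:

```python
def compress(passage: str, aggressiveness: float) -> str:
    # Stub: a real implementation would call the bear-1.2 API
    # with the requested aggressiveness level.
    return passage

def ask_model(context: str, question: str) -> str:
    # Stub: a real implementation would prompt gpt-5-mini with
    # the (possibly compressed) passage plus the question.
    return "unanswerable"

def judge(answer: str, gold: list[str]) -> bool:
    # Stub: a real implementation would ask gpt-5-mini, acting as
    # LLM-judge, to compare the answer against the gold answer key.
    return answer in gold

def evaluate(questions, passage, aggressiveness=None):
    """Fraction of questions judged correct for one configuration.

    aggressiveness=None is the uncompressed control, which isolates
    compression as the only variable between runs.
    """
    context = passage if aggressiveness is None else compress(passage, aggressiveness)
    correct = sum(judge(ask_model(context, q["question"]), q["gold"]) for q in questions)
    return correct / len(questions)

qs = [{"question": "Who conquered England?", "gold": ["unanswerable"]},
      {"question": "When was the battle?", "gold": ["911"]}]
print(evaluate(qs, "The Normans..."))  # 0.5 with these stubs
```

Running each of the 150 questions through all 8 configurations (7 compression levels plus the control) yields the 1,200 total evaluations reported above.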

Results

Light compression improved accuracy over the uncompressed baseline. Heavier compression reduced accuracy below baseline, showing a clear sensitivity to aggressiveness level on this benchmark.

The best result: 72.0% accuracy at aggressiveness 0.05 — a 4.0 percentage point improvement over the 68.0% baseline. On 150 questions, that translates to 6 additional correct answers while removing 17.3% of input tokens.

[Chart: overall accuracy, no compression vs bear-1.2 at each aggressiveness level; data in the table below]
Config | Accuracy | Correct | Change | Token reduction
No compression | 68.0% | 102/150 | baseline | 0%
bear-1.2 @ 0.05 | 72.0% | 108/150 | +4.0pp | 17.3%
bear-1.2 @ 0.1 | 68.7% | 103/150 | +0.7pp | 21.5%
bear-1.2 @ 0.3 | 66.0% | 99/150 | -2.0pp | 31.3%
bear-1.2 @ 0.4 | 67.3% | 101/150 | -0.7pp | 34.7%
bear-1.2 @ 0.5 | 64.0% | 96/150 | -4.0pp | 38.4%
bear-1.2 @ 0.7 | 60.7% | 91/150 | -7.3pp | 45.4%
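The accuracy and change columns follow directly from the raw correct counts over 150 questions; a quick arithmetic check:

```python
# Recompute accuracy and change-vs-baseline from the correct-answer
# counts reported in the results table.
counts = {"none": 102, "0.05": 108, "0.1": 103, "0.3": 99,
          "0.4": 101, "0.5": 96, "0.7": 91}
total = 150
baseline = counts["none"] / total  # 0.68

for cfg, correct in counts.items():
    acc = correct / total
    print(f"{cfg:>5}: {acc:.1%} ({acc - baseline:+.1%} vs baseline)")
```

The +4.0pp headline is 108 − 102 = 6 extra correct answers out of 150.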

Answerable vs unanswerable

SQuAD 2.0 tests two distinct capabilities: finding correct answers in the text (answerable) and recognizing when no answer exists (unanswerable). The impact of compression differed sharply between them.

Answerable questions were highly robust to compression. Accuracy stayed at 95.7% through aggressiveness 0.3 and remained above 88% even at 0.7 — the key information needed to extract an answer was consistently preserved.

Unanswerable questions showed the most dramatic improvement. At aggressiveness 0.05, detection improved from 44.4% to 51.9% — a 7.5 percentage point gain. By removing noise, compression made it easier for the model to confidently determine that a passage does not contain the answer.

[Chart: answerable vs unanswerable accuracy across aggressiveness levels; data in the table below]
Config | Answerable | Unanswerable | Overall
No compression | 66/69 (95.7%) | 36/81 (44.4%) | 68.0%
bear-1.2 @ 0.05 | 66/69 (95.7%) | 42/81 (51.9%) | 72.0%
bear-1.2 @ 0.1 | 66/69 (95.7%) | 37/81 (45.7%) | 68.7%
bear-1.2 @ 0.3 | 66/69 (95.7%) | 33/81 (40.7%) | 66.0%
bear-1.2 @ 0.4 | 65/69 (94.2%) | 36/81 (44.4%) | 67.3%
bear-1.2 @ 0.5 | 63/69 (91.3%) | 33/81 (40.7%) | 64.0%
bear-1.2 @ 0.7 | 61/69 (88.4%) | 30/81 (37.0%) | 60.7%
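Each overall figure is just the pooled correct count across the two splits, which is why a 7.5pp gain on the larger unanswerable split (81 of 150 questions) drives the whole benchmark. For the best configuration:

```python
# Pool the two splits for bear-1.2 @ 0.05 (counts from the table above).
answerable_correct, answerable_total = 66, 69
unanswerable_correct, unanswerable_total = 42, 81

overall = (answerable_correct + unanswerable_correct) / (answerable_total + unanswerable_total)
print(f"{overall:.1%}")  # 72.0%
```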

Compression most improves the hardest task: knowing when NOT to answer

Unanswerable detection improved by 7.5 percentage points at bear-1.2 @ 0.05. By removing noise, the model more confidently identifies that the passage does not contain the answer — the single biggest accuracy driver in this benchmark.

The efficiency tradeoff

Unlike FinanceBench, where all configurations stayed above baseline, SQuAD 2.0 shows a clear inflection point. Light compression (0.05) is the sweet spot — accuracy goes up by 4.0 percentage points while removing 17% of tokens.

At aggressiveness 0.1, accuracy returns to near-baseline (68.7%). Beyond 0.3, accuracy drops below the uncompressed control. This makes reading comprehension more sensitive to compression than financial document QA, likely because SQuAD answers require extracting specific spans from the text — removing too many tokens risks discarding the exact phrase the model needs.

[Chart: accuracy vs token reduction for each configuration, relative to the uncompressed baseline]

Key findings

Light compression improves reading comprehension

At aggressiveness 0.05, accuracy improved by 4.0 percentage points while reducing input tokens by 17.3%. The compression acts as a denoising step, focusing the model on the most relevant parts of the passage.

Unanswerable detection benefits the most from compression

Unanswerable question accuracy jumped from 44.4% to 51.9% — a 7.5 percentage point gain. Removing noise helps the model more confidently determine that a passage does not contain the answer, rather than hallucinating one.

Answerable accuracy is robust to compression

Answerable question accuracy stayed at 95.7% through aggressiveness 0.3 and remained above 88% even at 0.7. The specific text spans needed for answers are consistently preserved by bear-1.2.

Reading comprehension is more sensitive to heavy compression

Beyond aggressiveness 0.3, overall accuracy drops below the uncompressed baseline. Unlike financial QA, extractive reading comprehension requires preserving specific text spans — aggressive token removal risks discarding the exact phrase the model needs to answer.

Methodology

Dataset

SQuAD 2.0 — 150 questions from the validation set (69 answerable, 81 unanswerable) on a Wikipedia article about the Normans

Evaluation

gpt-5-mini generates answers and serves as LLM-judge, evaluating responses against the gold answer key for both answerable and unanswerable questions

Configurations

7 bear-1.2 aggressiveness levels (0.05, 0.1, 0.3, 0.4, 0.5, 0.7) + 1 uncompressed control

Reproducibility

Full code and results published at github.com/TheTokenCompany/Benchmarks

Limitations

  • This evaluation used gpt-5-mini. Results may vary across different model families and sizes.
  • We tested 150 of the 11,873 questions in the SQuAD 2.0 validation set. A larger sample may show different accuracy distributions.
  • All questions are from a single article (Normans). Different article topics and writing styles may respond differently to compression.
  • SQuAD 2.0 tests extractive reading comprehension — answers are text spans from the passage. Other QA formats (abstractive, generative) may behave differently under compression.
  • Token reduction percentages reflect Wikipedia prose. Technical documentation, code, or structured data may compress differently.

Ready to try it?

Create an account to get your API key and start compressing.