Improving LLM reading comprehension with compression
Compressing context passages improves SQuAD 2.0 accuracy by 4.0 percentage points — with 17% fewer input tokens
72.0%
Best accuracy
+4.0pp over baseline
17.3%
Token reduction
At best accuracy config
150
Questions
69 answerable, 81 unanswerable
6
Configurations
+ 1 control
What is SQuAD 2.0?
SQuAD 2.0 (Stanford Question Answering Dataset) is a reading comprehension benchmark created by Rajpurkar et al. at Stanford University. It tests whether models can read a passage of text and correctly answer questions about it — or recognize when no answer exists.
The dataset combines 100,000+ answerable questions from the original SQuAD 1.1 with over 50,000 adversarially crafted unanswerable questions. The unanswerable questions are designed to look plausible — they are written by crowdworkers who were shown the passage and asked to write questions that cannot be answered from it.
This makes SQuAD 2.0 significantly harder than its predecessor. Models must not only find correct answers in the text but also learn when to abstain — a critical capability for production systems where confidently wrong answers are worse than no answer at all.
Evaluation design
We sampled 150 questions from the SQuAD 2.0 validation set (69 answerable, 81 unanswerable), all drawn from a single article about the Normans to keep passage context consistent. Each question was run through six bear-1.2 compression configurations (aggressiveness from 0.05 to 0.7) and one uncompressed control.
The control sends the complete, uncompressed Wikipedia passage straight to gpt-5-mini. This isolates compression as the only variable — any accuracy difference is attributable to what bear-1.2 kept or removed. Responses are evaluated by gpt-5-mini acting as an LLM-judge, comparing each answer against the gold answer key from the dataset.
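The scoring step can be sketched as follows. This is a minimal, self-contained illustration, not the actual harness: the real judge is gpt-5-mini comparing free-form answers against the gold key, and here a normalized exact-match check stands in for it. The abstention phrases and record format are assumptions for the sketch.

```python
def normalize(s: str) -> str:
    # Lowercase and collapse whitespace so "ROLLO " matches "Rollo".
    return " ".join(s.lower().split())

def judge(model_answer: str, gold_answers: list[str]) -> bool:
    """Stand-in for the gpt-5-mini LLM-judge.

    An empty gold list marks an unanswerable question; the model is
    counted correct only if it abstained.
    """
    if not gold_answers:  # unanswerable question
        return normalize(model_answer) in {"unanswerable", "no answer", "cannot be answered"}
    return any(normalize(model_answer) == normalize(g) for g in gold_answers)

def accuracy(records: list[dict]) -> float:
    correct = sum(judge(r["answer"], r["gold"]) for r in records)
    return correct / len(records)

# Toy records: one correct extraction, one correct abstention, one hallucination.
records = [
    {"answer": "Rollo", "gold": ["Rollo"]},
    {"answer": "unanswerable", "gold": []},
    {"answer": "William the Conqueror", "gold": []},
]
print(f"{accuracy(records):.1%}")  # → 66.7%
```

The same `accuracy` helper, applied once per configuration over all 150 judged records, yields the per-config numbers reported below.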
Results
Light compression improved accuracy over the uncompressed baseline. Heavier compression reduced accuracy below baseline, showing a clear sensitivity to aggressiveness level on this benchmark.
The best result: 72.0% accuracy at aggressiveness 0.05 — a 4.0 percentage point improvement over the 68.0% baseline. On 150 questions, that translates to 6 additional correct answers while removing 17.3% of input tokens.
| Config | Accuracy | Correct | Change | Token Reduction |
|---|---|---|---|---|
| No compression | 68.0% | 102/150 | — | — |
| bear-1.2 @ 0.05 | 72.0% | 108/150 | +4.0pp | 17.3% |
| bear-1.2 @ 0.1 | 68.7% | 103/150 | +0.7pp | 21.5% |
| bear-1.2 @ 0.3 | 66.0% | 99/150 | -2.0pp | 31.3% |
| bear-1.2 @ 0.4 | 67.3% | 101/150 | -0.7pp | 34.7% |
| bear-1.2 @ 0.5 | 64.0% | 96/150 | -4.0pp | 38.4% |
| bear-1.2 @ 0.7 | 60.7% | 91/150 | -7.3pp | 45.4% |
Answerable vs unanswerable
SQuAD 2.0 tests two distinct capabilities: finding correct answers in the text (answerable) and recognizing when no answer exists (unanswerable). The impact of compression differed sharply between them.
Answerable questions were highly robust to compression. Accuracy stayed at 95.7% through aggressiveness 0.3 and remained above 88% even at 0.7 — the key information needed to extract an answer was consistently preserved.
Unanswerable questions showed the most dramatic improvement. At aggressiveness 0.05, detection improved from 44.4% to 51.9% — a 7.5 percentage point gain. By removing noise, compression made it easier for the model to confidently determine that a passage does not contain the answer.
| Config | Answerable | Unanswerable | Overall |
|---|---|---|---|
| No compression | 66/69 (95.7%) | 36/81 (44.4%) | 68.0% |
| bear-1.2 @ 0.05 | 66/69 (95.7%) | 42/81 (51.9%) | 72.0% |
| bear-1.2 @ 0.1 | 66/69 (95.7%) | 37/81 (45.7%) | 68.7% |
| bear-1.2 @ 0.3 | 66/69 (95.7%) | 33/81 (40.7%) | 66.0% |
| bear-1.2 @ 0.4 | 65/69 (94.2%) | 36/81 (44.4%) | 67.3% |
| bear-1.2 @ 0.5 | 63/69 (91.3%) | 33/81 (40.7%) | 64.0% |
| bear-1.2 @ 0.7 | 61/69 (88.4%) | 30/81 (37.0%) | 60.7% |
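Note that the overall column is the pooled count over all 150 questions, not the average of the two per-category rates, so the unanswerable side (81 of 150 questions) dominates it. A quick sketch reproducing the overall numbers from the table's counts:

```python
# Per-category correct counts from the table above: (answerable/69, unanswerable/81).
splits = {
    "control": (66, 36),
    "0.05":    (66, 42),
    "0.1":     (66, 37),
    "0.3":     (66, 33),
    "0.4":     (65, 36),
    "0.5":     (63, 33),
    "0.7":     (61, 30),
}

for config, (ans, unans) in splits.items():
    # Overall accuracy pools the raw counts rather than averaging the two rates.
    overall = (ans + unans) / 150
    print(f"{config:>7}: answerable {ans/69:.1%}, "
          f"unanswerable {unans/81:.1%}, overall {overall:.1%}")
```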
Compression most improves the hardest task: knowing when NOT to answer
Unanswerable detection improved by 7.5 percentage points at bear-1.2 @ 0.05. By removing noise, the model more confidently identifies that the passage does not contain the answer — the single biggest accuracy driver in this benchmark.
The efficiency tradeoff
Unlike FinanceBench, where all configurations stayed above baseline, SQuAD 2.0 shows a clear inflection point. Light compression (0.05) is the sweet spot — accuracy goes up by 4.0 percentage points while removing 17% of tokens.
At aggressiveness 0.1, accuracy returns to roughly baseline (68.7%). From 0.3 onward, accuracy falls below the uncompressed control. This makes reading comprehension more sensitive to compression than financial document QA, likely because SQuAD answers require extracting specific spans from the text: removing too many tokens risks discarding the exact phrase the model needs.
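One way to operationalize the sweet-spot choice is to compare two selection rules over the results table: maximize accuracy outright, or maximize token reduction subject to not falling below the control. The numbers are transcribed from the table above; the selection rules themselves are our framing, not something the benchmark prescribes.

```python
# Config -> (accuracy, token_reduction), transcribed from the results table.
results = {
    "control": (0.680, 0.000),
    "0.05":    (0.720, 0.173),
    "0.1":     (0.687, 0.215),
    "0.3":     (0.660, 0.313),
    "0.4":     (0.673, 0.347),
    "0.5":     (0.640, 0.384),
    "0.7":     (0.607, 0.454),
}
baseline = results["control"][0]

# Rule 1: maximize accuracy outright.
best_acc = max(results, key=lambda k: results[k][0])

# Rule 2: maximize token reduction among configs that stay at or above baseline.
safe = {k: v for k, v in results.items() if k != "control" and v[0] >= baseline}
best_cheap = max(safe, key=lambda k: safe[k][1])

print(best_acc, best_cheap)  # → 0.05 0.1
```

Under rule 1 the sweet spot is 0.05, as reported above; under rule 2 it shifts to 0.1, which trades the accuracy gain for an extra 4.2 points of token reduction.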
Key findings
Light compression improves reading comprehension
At aggressiveness 0.05, accuracy improved by 4.0 percentage points while reducing input tokens by 17.3%. The compression acts as a denoising step, focusing the model on the most relevant parts of the passage.
Unanswerable detection benefits the most from compression
Unanswerable question accuracy jumped from 44.4% to 51.9% — a 7.5 percentage point gain. Removing noise helps the model more confidently determine that a passage does not contain the answer, rather than hallucinating one.
Answerable accuracy is robust to compression
Answerable question accuracy stayed at 95.7% through aggressiveness 0.3 and remained above 88% even at 0.7. The specific text spans needed for answers are consistently preserved by bear-1.2.
Reading comprehension is more sensitive to heavy compression
At aggressiveness 0.3 and beyond, overall accuracy drops below the uncompressed baseline. Unlike financial QA, extractive reading comprehension requires preserving specific text spans; aggressive token removal risks discarding the exact phrase the model needs to answer.
Methodology
Dataset
SQuAD 2.0 — 150 questions from the validation set (69 answerable, 81 unanswerable) on a Wikipedia article about the Normans
Evaluation
gpt-5-mini generates answers and serves as LLM-judge, evaluating responses against the gold answer key for both answerable and unanswerable questions
Configurations
6 bear-1.2 aggressiveness levels (0.05, 0.1, 0.3, 0.4, 0.5, 0.7) + 1 uncompressed control
Reproducibility
Full code and results published at github.com/TheTokenCompany/Benchmarks
Limitations
- This evaluation used gpt-5-mini. Results may vary across different model families and sizes.
- We tested 150 of the 11,873 questions in the SQuAD 2.0 validation set. A larger sample may show different accuracy distributions.
- All questions are from a single article (Normans). Different article topics and writing styles may respond differently to compression.
- SQuAD 2.0 tests extractive reading comprehension — answers are text spans from the passage. Other QA formats (abstractive, generative) may behave differently under compression.
- Token reduction percentages reflect Wikipedia prose. Technical documentation, code, or structured data may compress differently.
Ready to try it?
Create an account to get your API key and start compressing.