Accuracy Benchmark
Reduce LLM input tokens by 66% while improving accuracy by up to 1.1%
66%
Reduction
+1.1%
Accuracy gain
230
Questions
50
Runs
The problem
Even large-context LLMs struggle with lengthy inputs: "lost-in-the-middle" failure modes, hard context limits, higher costs, and slower responses.
Scaling context windows alone doesn't resolve attention, budget, or throughput constraints. Compression is a complementary approach that works within existing model architectures.
Results
We tested 11 configurations, from an uncompressed baseline through a 0.95 aggressiveness cutoff.
| Cutoff | Accuracy | Change | Token Reduction |
|---|---|---|---|
| Baseline | 28.2% | — | — |
| 0.1 | 28.3% | +0.1% | 6.3% |
| 0.2 | 27.9% | -0.3% | 13.5% |
| 0.3 | 29.2% | +1.0% | 23.4% |
| 0.4 | 29.0% | +0.8% | 33.2% |
| 0.5 | 28.9% | +0.7% | 42.1% |
| 0.6 | 28.8% | +0.6% | 50.3% |
| 0.7 | 28.0% | -0.2% | 57.8% |
| 0.8 | 29.1% | +0.9% | 62.4% |
| 0.9 | 29.2% | +1.1% | 66.1% |
| 0.95 | 27.7% | -0.5% | 77.4% |
Key findings
Six configurations showed statistically significant improvements
Cutoffs 0.3, 0.4, 0.5, 0.6, 0.8, and 0.9 all outperformed the uncompressed baseline.
Best performance at 0.9
The 0.9 cutoff achieved the highest accuracy gain (+1.1%) with maximum token reduction (66%). This is the sweet spot for cost-sensitive workloads.
Non-monotonic behavior
Some thresholds (0.2, 0.7) underperformed neighboring cutoffs despite removing fewer tokens. We attribute this to interactions with the importance-score distribution rather than to fundamental properties of compression.
Aggressive compression risks
The 0.95 cutoff removed excessive context, losing accuracy. There is a clear threshold beyond which compression degrades performance.
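To make the cutoff parameter concrete, here is a minimal sketch of threshold-based pruning. This is an illustration, not bear-1's actual algorithm: it assumes each token carries a normalized importance score in [0, 1], and that a higher cutoff removes more tokens.

```python
def compress(tokens, scores, cutoff):
    """Keep only tokens whose importance score meets the cutoff.

    Scores are assumed normalized to [0, 1]; raising the cutoff
    prunes more aggressively, trading context for token savings.
    """
    kept = [t for t, s in zip(tokens, scores) if s >= cutoff]
    reduction = 1 - len(kept) / len(tokens)
    return kept, reduction

tokens = ["The", "quick", "brown", "fox", "jumps"]
scores = [0.2, 0.9, 0.4, 0.95, 0.6]   # hypothetical importance scores
kept, reduction = compress(tokens, scores, cutoff=0.5)
# kept → ["quick", "fox", "jumps"], reduction → 0.4
```

Under this framing, the 0.95 result above is intuitive: at a very high cutoff, even moderately important tokens are discarded, and the remaining context is too sparse to answer from.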
Recommended configurations
Conservative
0.3 cutoff
+1.0% accuracy with 23% token reduction. Optimal for users prioritizing performance with minimal context loss.
Aggressive
0.9 cutoff
+1.1% accuracy with 66% token reduction. Ideal for cost-sensitive or latency-constrained workloads.
Both achieved statistical significance across 50 runs.
Bootstrap analysis
10,000 bootstrap iterations confirmed bear-1's effectiveness, with P(better) reaching 100% for multiple configurations. This provides strong statistical evidence that the observed accuracy gains are not due to random chance.
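The P(better) statistic can be estimated as follows — a sketch using synthetic per-run accuracies (not the benchmark data): resample both arms with replacement and count how often the compressed arm's mean beats the baseline's.

```python
import random

def bootstrap_p_better(baseline, treatment, iters=10_000, seed=0):
    """Estimate P(treatment mean > baseline mean) by resampling
    per-run accuracy scores with replacement."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(iters):
        b = [rng.choice(baseline) for _ in baseline]
        t = [rng.choice(treatment) for _ in treatment]
        if sum(t) / len(t) > sum(b) / len(b):
            wins += 1
    return wins / iters

# Illustrative per-run accuracies only, not the real results.
baseline = [0.28, 0.27, 0.29, 0.28, 0.28]
treatment = [0.30, 0.29, 0.31, 0.29, 0.30]
p = bootstrap_p_better(baseline, treatment)
# p approaches 1.0 when the two arms are clearly separated
```

A P(better) of 100% means the treatment arm won in every one of the 10,000 resamples, which is strong (though not literally certain) evidence the gain is real.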
Methodology
Dataset
LongBench v2 multiple-choice questions (230 of the 503 total, filtered to ≤100k tokens)
Token counting
tiktoken (gpt-4o-mini encoding)
Runs
50 independent evaluations per configuration
Temperature
0 (near-deterministic)
Limitations
- Results are specific to GPT-4o-mini; may differ across other models.
- Only a subset of LongBench v2 was used, due to token-length constraints.
- Effect sizes are modest (~1%); practical significance depends on use case.
Ready to try it?
Create an account to get your API key and start compressing.