Accuracy Benchmark
Reduce LLM input tokens by 66% while improving accuracy by up to 1.1%
66%
Reduction
+1.1%
Accuracy gain
230
Questions
50
Runs
The problem
Even large-context LLMs struggle with lengthy inputs: "lost-in-the-middle" failure modes, hard context limits, higher costs, and slower responses.
Scaling context windows alone doesn't resolve attention, budget, or throughput constraints. Compression is a complementary approach that works within existing model architectures.
Results
We tested 11 configurations, from an uncompressed baseline through a 0.95 aggressiveness cutoff.
| Cutoff | Accuracy | Change | Token Reduction |
|---|---|---|---|
| Baseline | 28.2% | — | — |
| 0.1 | 28.3% | +0.1% | 6.3% |
| 0.2 | 27.9% | -0.3% | 13.5% |
| 0.3 | 29.2% | +1.0% | 23.4% |
| 0.4 | 29.0% | +0.8% | 33.2% |
| 0.5 | 28.9% | +0.7% | 42.1% |
| 0.6 | 28.8% | +0.6% | 50.3% |
| 0.7 | 28.0% | -0.2% | 57.8% |
| 0.8 | 29.1% | +0.9% | 62.4% |
| 0.9 | 29.2% | +1.1% | 66.1% |
| 0.95 | 27.7% | -0.5% | 77.4% |
Key findings
Six configurations showed statistically significant improvements
Cutoffs 0.3, 0.4, 0.5, 0.6, 0.8, and 0.9 all outperformed the uncompressed baseline.
Best performance at 0.9
The 0.9 cutoff achieved the highest accuracy gain (+1.1%) with maximum token reduction (66%). This is the sweet spot for cost-sensitive workloads.
Non-monotonic behavior
Some thresholds (0.2, 0.7) underperformed neighboring cutoffs despite removing fewer tokens. We attribute this to interactions with the importance-score distribution rather than to fundamental properties of compression.
Aggressive compression risks
The 0.95 cutoff removed excessive context, losing accuracy. There is a clear threshold beyond which compression degrades performance.
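To make the cutoff parameter concrete, here is a minimal sketch of threshold-based pruning. This is an illustration, not bear-1's actual algorithm: it assumes each token carries a normalized importance score in [0, 1], and that a higher cutoff removes more tokens.

```python
def compress(tokens, scores, cutoff):
    """Keep only tokens whose importance score meets the cutoff.

    Scores are assumed normalized to [0, 1]; raising the cutoff
    prunes more aggressively, trading context for token savings.
    """
    kept = [t for t, s in zip(tokens, scores) if s >= cutoff]
    reduction = 1 - len(kept) / len(tokens)
    return kept, reduction

tokens = ["The", "quick", "brown", "fox", "jumps"]
scores = [0.2, 0.9, 0.4, 0.95, 0.6]   # hypothetical importance scores
kept, reduction = compress(tokens, scores, cutoff=0.5)
# kept → ["quick", "fox", "jumps"], reduction → 0.4
```

Under this framing, the 0.95 result above is intuitive: at a very high cutoff, even moderately important tokens are discarded, and the remaining context is too sparse to answer from.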
Recommended configurations
Conservative
0.3 cutoff
+1.0% accuracy with 23% token reduction. Optimal for users prioritizing performance with minimal context loss.
Aggressive
0.9 cutoff
+1.1% accuracy with 66% token reduction. Ideal for cost-sensitive or latency-constrained workloads.
Both achieved statistical significance across 50 runs.
Bootstrap analysis
10,000 bootstrap iterations confirmed bear-1's effectiveness, with P(better) reaching 100% for multiple configurations. This provides strong statistical evidence that the observed accuracy gains are not due to random chance.
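The P(better) statistic can be estimated as follows — a sketch using synthetic per-run accuracies (not the benchmark data): resample both arms with replacement and count how often the compressed arm's mean beats the baseline's.

```python
import random

def bootstrap_p_better(baseline, treatment, iters=10_000, seed=0):
    """Estimate P(treatment mean > baseline mean) by resampling
    per-run accuracy scores with replacement."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(iters):
        b = [rng.choice(baseline) for _ in baseline]
        t = [rng.choice(treatment) for _ in treatment]
        if sum(t) / len(t) > sum(b) / len(b):
            wins += 1
    return wins / iters

# Illustrative per-run accuracies only, not the real results.
baseline = [0.28, 0.27, 0.29, 0.28, 0.28]
treatment = [0.30, 0.29, 0.31, 0.29, 0.30]
p = bootstrap_p_better(baseline, treatment)
# p approaches 1.0 when the two arms are clearly separated
```

A P(better) of 100% means the treatment arm won in every one of the 10,000 resamples, which is strong (though not literally certain) evidence the gain is real.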
Methodology
Dataset
LongBench v2 multiple-choice questions (230 of the 503 total, filtered to ≤100k tokens)
Token counting
tiktoken (gpt-4o-mini encoding)
Runs
50 independent evaluations per configuration
Temperature
0 (near-deterministic)
Limitations
- Results are specific to GPT-4o-mini; may differ across other models.
- Only a subset of LongBench v2 was used, due to token-length constraints.
- Effect sizes are modest (~1%); practical significance depends on use case.
Ready to try it?
Create an account to get your API key and start compressing.