Benchmarks
Comprehensive performance evaluations of The Token Company compression API. Each benchmark provides detailed methodology, statistical analysis, and reproducible results.
Making LLMs understand financial documents better
Compression improved financial QA accuracy by 2.7 percentage points on 150 SEC filing questions — while reducing input tokens by up to 20%.
February 2026
Reducing LLM response times through compression
Up to 37% faster on Claude Opus 4.6 and up to 30% faster on GPT-5.2 — saving seconds per request across 5 input sizes with sub-120ms compression overhead.
February 2026
Improving LLM reading comprehension with compression
Compression improved SQuAD 2.0 accuracy by 4.0 percentage points on 150 reading comprehension questions — while reducing input tokens by 17%.
March 2026
Zero accuracy loss on conversational QA with 14% fewer tokens
Compression maintained 87.3% accuracy on 150 multi-turn CoQA questions across 4 domains — while reducing input tokens by 14%.
March 2026
We are updating these benchmarks to cover more models and domains using our next-generation compression models.
Why benchmark?
Token compression must balance efficiency with quality. Removing too many tokens risks degrading model performance, while removing too few limits cost savings.
Every result is reproducible. We publish the exact configurations, datasets, and evaluation criteria so you can verify our claims.
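The tradeoff above comes down to two numbers per benchmark: how many input tokens compression removes, and how accuracy changes as a result. Here is a minimal sketch of that bookkeeping; the function names and all counts are illustrative assumptions, not values or code from the published reports.

```python
# Sketch of the two metrics reported on each benchmark card.
# All numbers below are hypothetical, not taken from the reports.

def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of input tokens removed by compression."""
    return 1 - compressed_tokens / original_tokens

def accuracy_delta_pp(baseline_correct: int, compressed_correct: int,
                      total_questions: int) -> float:
    """Change in accuracy after compression, in percentage points."""
    return 100 * (compressed_correct - baseline_correct) / total_questions

# Hypothetical run: 150 questions, 10,000 input tokens compressed to 8,300.
print(f"{token_reduction(10_000, 8_300):.0%} fewer input tokens")
print(f"{accuracy_delta_pp(125, 131, 150):+.1f} pp accuracy change")
```

A positive accuracy delta at a meaningful token reduction is what the benchmarks above test for; a large reduction with a negative delta would mean the compressor cut too aggressively.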