Making LLMs understand financial documents better
Compressing bloat from long financial documents improves QA accuracy by 2.7 percentage points — with up to 20% fewer input tokens
- 84.7% best accuracy (+2.7pp over baseline)
- 20% max token reduction (still above baseline)
- 150 questions from real SEC filings
- 6 compression configurations + 1 uncompressed control
What is FinanceBench
FinanceBench is an open-source benchmark of 150 financial questions built from real SEC filings of publicly traded companies. It was created by Patronus AI to test whether LLMs can handle the kind of document analysis that financial professionals do every day — not synthetic puzzles, but real questions about real companies.
The benchmark covers three question types. Domain-relevant questions test financial terminology and concepts. Metrics-generated questions require extracting specific numbers from filings. Novel-generated questions demand multi-step reasoning — combining data from different parts of a document to reach an answer.
When the benchmark was first published, even GPT-4 Turbo with retrieval incorrectly answered or refused to answer 81% of questions. It remains a demanding test for any LLM pipeline.
Evaluation design
We ran each of the 150 questions through six bear-1.2 compression configurations (aggressiveness from 0.05 to 0.7) and one uncompressed control. Each configuration processes the full source document, compresses it at the specified level, and passes the result to the LLM.
The control sends the complete, uncompressed document straight to the LLM. This isolates compression as the only variable — any accuracy difference is attributable to what bear-1.2 kept or removed.
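The evaluation loop can be sketched as follows. Note that `compress` and `ask_llm` are hypothetical stand-ins for the bear-1.2 API and the LLM call, not the actual interfaces used in the benchmark.

```python
# Sketch of the evaluation loop. compress() and ask_llm() are hypothetical
# stand-ins for the bear-1.2 API and the LLM call -- not real interfaces.
AGGRESSIVENESS_LEVELS = [None, 0.05, 0.1, 0.3, 0.4, 0.5, 0.7]  # None = control

def evaluate(questions, compress, ask_llm):
    """Run every question at every compression level; return accuracy per level."""
    results = {}
    for level in AGGRESSIVENESS_LEVELS:
        correct = 0
        for q in questions:
            doc = q["document"]
            # The control sends the full document; other configs compress first.
            context = doc if level is None else compress(doc, aggressiveness=level)
            answer = ask_llm(question=q["question"], context=context)
            correct += int(answer == q["ground_truth"])  # exact-match grading
        results[level] = correct / len(questions)
    return results
```

Because the control and the compressed runs share the same documents, questions, and grading, any accuracy gap is attributable to compression alone.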
Results
Every compression configuration matched or exceeded the uncompressed baseline of 82.0%.
The best result: 84.7% accuracy at aggressiveness 0.05 and 0.1 — a 2.7 percentage point improvement. On 150 questions, that translates to 4 additional correct answers. Even the heaviest compression we tested (0.7, removing 20% of tokens) scored 83.3%, still 1.3 points above the control.
| Config | Accuracy | Correct | Change | Token Reduction |
|---|---|---|---|---|
| No compression | 82.0% | 123/150 | — | — |
| bear-1.2 @ 0.05 | 84.7% | 127/150 | +2.7pp | 1.5% |
| bear-1.2 @ 0.1 | 84.7% | 127/150 | +2.7pp | 3.9% |
| bear-1.2 @ 0.3 | 84.0% | 126/150 | +2.0pp | 10.4% |
| bear-1.2 @ 0.4 | 83.3% | 125/150 | +1.3pp | 12.4% |
| bear-1.2 @ 0.5 | 82.7% | 124/150 | +0.7pp | 14.4% |
| bear-1.2 @ 0.7 | 83.3% | 125/150 | +1.3pp | 20.0% |
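Each accuracy figure above follows directly from the correct-answer count over 150 questions, so the table can be sanity-checked with a few lines of arithmetic:

```python
# Each accuracy figure in the table is simply correct answers / 150.
TOTAL = 150
BASELINE_CORRECT = 123  # the uncompressed control

def accuracy_pp_change(correct, total=TOTAL, baseline=BASELINE_CORRECT):
    """Return (accuracy %, change in percentage points vs. the baseline)."""
    acc = 100 * correct / total
    return round(acc, 1), round(acc - 100 * baseline / total, 1)

# 127/150 correct -> (84.7, 2.7): 84.7% accuracy, +2.7pp over the 82.0% baseline.
```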
Where compression helps most
The accuracy improvement was not uniform across question types. Novel-generated questions — the hardest category, requiring synthesis and reasoning — showed the largest gain.
Metrics-generated questions (extracting specific financial figures) were stable at 94–96% regardless of compression level. The signal in these questions is explicit enough that removing low-importance tokens has no effect.
Novel-generated questions improved from 70% to 78% — an 8 percentage point gain. Removing noise from long financial documents gives the model's attention mechanism less irrelevant context to process. For harder reasoning tasks, that cleaner context makes a measurable difference.
The hardest questions benefited the most from compression
Novel-generated questions require combining data from multiple sections of a filing. By reducing noise, compression concentrates the model's attention on the passages that matter — yielding the largest accuracy gain of any category (+8pp).
The efficiency tradeoff
The practical question: how much can you compress before accuracy drops below the uncompressed baseline?
On FinanceBench, every configuration we tested stayed above baseline. At 20% token reduction, accuracy was still 1.3 points higher than no compression at all. That means one in five input tokens can be removed — reducing both cost and latency — while getting better results.
Light compression (under 5% reduction) produced the highest accuracy. Heavier compression still outperformed the control, making it viable for cost-sensitive workloads where throughput matters more than marginal accuracy gains.
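The cost side of the tradeoff is straightforward to estimate. The sketch below uses a hypothetical price per million input tokens as a placeholder; substitute your provider's actual rate.

```python
# Back-of-envelope input-cost savings from token reduction. The price per
# million input tokens is a hypothetical placeholder, not a quoted rate.
PRICE_PER_M_INPUT_TOKENS = 3.00  # USD, hypothetical

def monthly_input_savings(tokens_per_month, reduction):
    """Input-token spend saved at a given fractional token reduction."""
    full_cost = tokens_per_month / 1_000_000 * PRICE_PER_M_INPUT_TOKENS
    return full_cost * reduction

# E.g. 500M input tokens/month at 20% reduction: 0.20 * $1500 = $300/month saved.
```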
Reasoning type breakdown
Performance across the reasoning categories in FinanceBench.
| Reasoning Type | n | Baseline | Best w/ Compression |
|---|---|---|---|
| Information extraction | 31 | 96.8% | 100% |
| Numerical reasoning | 43 | 93.0% | 93.0% |
| Logical reasoning (numerical) | 5 | 60.0% | 60.0% |
| Combined reasoning | 5 | 80.0% | 80.0% |
Information extraction was near-perfect regardless of compression, reaching 100% in some configurations. Numerical reasoning held steady at 93%. Logical reasoning — the hardest category with only 5 samples — showed the most variance, consistent with its inherent difficulty and small sample size.
Key findings
Compression does not degrade financial QA accuracy
All six compression configurations matched or exceeded the uncompressed control. This held across question types, reasoning categories, and compression levels up to 20%.
Light compression provides a free accuracy boost
At aggressiveness 0.05 and 0.1, accuracy improved by 2.7 percentage points with minimal token reduction. The compression acts as a denoising step, removing tokens that were actively hurting the model's ability to reason about the document.
Harder questions benefit the most
Novel-generated questions — the category requiring the most multi-step reasoning — saw the largest accuracy gain (+8pp). Straightforward extraction tasks were already near-ceiling and unaffected.
20% token reduction with no accuracy penalty
The most aggressive configuration tested (bear-1.2 @ 0.7) removed one in five tokens while still scoring 1.3 points above the uncompressed baseline. For production workloads, this directly translates to lower API costs and faster response times.
Methodology
Dataset
FinanceBench — 150 questions from real SEC filings of publicly traded companies
Evaluation
Exact-match grading against verified ground-truth answers
Configurations
6 bear-1.2 aggressiveness levels (0.05, 0.1, 0.3, 0.4, 0.5, 0.7) + 1 uncompressed control
Reproducibility
Full code and results published at github.com/TheTokenCompany/Benchmarks
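A minimal exact-match grader in the spirit of the evaluation above looks like the following; the normalization choices (casing, whitespace) are assumptions for illustration, not the published harness.

```python
# Minimal exact-match grader, a sketch of the grading described above.
# Normalization choices (casing, whitespace) are assumptions, not the
# published harness.
def exact_match(prediction: str, ground_truth: str) -> bool:
    """True when the model answer matches the verified answer exactly,
    after trivial whitespace and case normalization."""
    norm = lambda s: " ".join(s.strip().lower().split())
    return norm(prediction) == norm(ground_truth)
```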
Limitations
- This evaluation used a single LLM. Results may vary across different model families and sizes.
- FinanceBench questions have clear-cut answers. Real-world financial analysis often involves ambiguity that this benchmark does not capture.
- Token reduction percentages reflect this specific document set (SEC filings). Other document types may compress differently.
- We tested aggressiveness levels up to 0.7. Higher settings may show different accuracy characteristics.
- Some reasoning categories have small sample sizes (n=5), limiting the statistical power of per-category comparisons.
Ready to try it?
Create an account to get your API key and start compressing.