Making LLMs understand financial documents better
Compressing bloat from long financial documents improves QA accuracy by 2.7 percentage points — with up to 20% fewer input tokens
- 84.7% best accuracy (+2.7pp over baseline)
- 20% max token reduction (still above baseline)
- 150 questions from real SEC filings
- 6 compression configurations + 1 uncompressed control
What is FinanceBench
FinanceBench is an open-source benchmark of 150 financial questions built from real SEC filings of publicly traded companies. It was created by Patronus AI to test whether LLMs can handle the kind of document analysis that financial professionals do every day — not synthetic puzzles, but real questions about real companies.
The benchmark covers three question types. Domain-relevant questions test financial terminology and concepts. Metrics-generated questions require extracting specific numbers from filings. Novel-generated questions demand multi-step reasoning — combining data from different parts of a document to reach an answer.
When the benchmark was first published, even GPT-4 Turbo with retrieval incorrectly answered or refused to answer 81% of questions. It remains a demanding test for any LLM pipeline.
Evaluation design
We ran each of the 150 questions through six bear-1.2 compression configurations (aggressiveness from 0.05 to 0.7) and one uncompressed control. Each configuration processes the full source document, compresses it at the specified level, and passes the result to the LLM.
The control sends the complete, uncompressed document straight to the LLM. This isolates compression as the only variable — any accuracy difference is attributable to what bear-1.2 kept or removed.
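The evaluation loop can be sketched as follows. Note that `compress` and `ask_llm` are hypothetical stand-ins for the bear-1.2 API and the LLM call, not the actual interfaces used in the benchmark.

```python
# Sketch of the evaluation loop. compress() and ask_llm() are hypothetical
# stand-ins for the bear-1.2 API and the LLM call -- not real interfaces.
AGGRESSIVENESS_LEVELS = [None, 0.05, 0.1, 0.3, 0.4, 0.5, 0.7]  # None = control

def evaluate(questions, compress, ask_llm):
    """Run every question at every compression level; return accuracy per level."""
    results = {}
    for level in AGGRESSIVENESS_LEVELS:
        correct = 0
        for q in questions:
            doc = q["document"]
            # The control sends the full document; other configs compress first.
            context = doc if level is None else compress(doc, aggressiveness=level)
            answer = ask_llm(question=q["question"], context=context)
            correct += int(answer == q["ground_truth"])  # exact-match grading
        results[level] = correct / len(questions)
    return results
```

Because the control and the compressed runs share the same documents, questions, and grading, any accuracy gap is attributable to compression alone.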
Results
Every compression configuration matched or exceeded the uncompressed baseline of 82.0%.
The best result: 84.7% accuracy at aggressiveness 0.05 and 0.1 — a 2.7 percentage point improvement. On 150 questions, that translates to 4 additional correct answers. Even the heaviest compression we tested (0.7, removing 20% of tokens) scored 83.3%, still 1.3 points above the control.
| Config | Accuracy | Correct | Change | Token Reduction |
|---|---|---|---|---|
| No compression | 82.0% | 123/150 | — | — |
| bear-1.2 @ 0.05 | 84.7% | 127/150 | +2.7pp | 1.5% |
| bear-1.2 @ 0.1 | 84.7% | 127/150 | +2.7pp | 3.9% |
| bear-1.2 @ 0.3 | 84.0% | 126/150 | +2.0pp | 10.4% |
| bear-1.2 @ 0.4 | 83.3% | 125/150 | +1.3pp | 12.4% |
| bear-1.2 @ 0.5 | 82.7% | 124/150 | +0.7pp | 14.4% |
| bear-1.2 @ 0.7 | 83.3% | 125/150 | +1.3pp | 20.0% |
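Each accuracy figure above follows directly from the correct-answer count over 150 questions, so the table can be sanity-checked with a few lines of arithmetic:

```python
# Each accuracy figure in the table is simply correct answers / 150.
TOTAL = 150
BASELINE_CORRECT = 123  # the uncompressed control

def accuracy_pp_change(correct, total=TOTAL, baseline=BASELINE_CORRECT):
    """Return (accuracy %, change in percentage points vs. the baseline)."""
    acc = 100 * correct / total
    return round(acc, 1), round(acc - 100 * baseline / total, 1)

# 127/150 correct -> (84.7, 2.7): 84.7% accuracy, +2.7pp over the 82.0% baseline.
```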
Where compression helps most
The accuracy improvement was not uniform across question types. Novel-generated questions — the hardest category, requiring synthesis and reasoning — showed the largest gain.
Metrics-generated questions (extracting specific financial figures) were stable at 94–96% regardless of compression level. The signal in these questions is explicit enough that removing low-importance tokens has no effect.
Novel-generated questions improved from 70% to 78% — an 8 percentage point gain. Removing noise from long financial documents gives the model's attention mechanism less irrelevant context to process. For harder reasoning tasks, that cleaner context makes a measurable difference.
The hardest questions benefited the most from compression
Novel-generated questions require combining data from multiple sections of a filing. By reducing noise, compression concentrates the model's attention on the passages that matter — yielding the largest accuracy gain of any category (+8pp).
The efficiency tradeoff
The practical question: how much can you compress before accuracy drops below the uncompressed baseline?
On FinanceBench, every configuration we tested stayed above baseline. At 20% token reduction, accuracy was still 1.3 points higher than no compression at all. That means one in five input tokens can be removed — reducing both cost and latency — while getting better results.
Light compression (under 5% reduction) produced the highest accuracy. Heavier compression still outperformed the control, making it viable for cost-sensitive workloads where throughput matters more than marginal accuracy gains.
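The cost side of the tradeoff is straightforward to estimate. The sketch below uses a hypothetical price per million input tokens as a placeholder; substitute your provider's actual rate.

```python
# Back-of-envelope input-cost savings from token reduction. The price per
# million input tokens is a hypothetical placeholder, not a quoted rate.
PRICE_PER_M_INPUT_TOKENS = 3.00  # USD, hypothetical

def monthly_input_savings(tokens_per_month, reduction):
    """Input-token spend saved at a given fractional token reduction."""
    full_cost = tokens_per_month / 1_000_000 * PRICE_PER_M_INPUT_TOKENS
    return full_cost * reduction

# E.g. 500M input tokens/month at 20% reduction: 0.20 * $1500 = $300/month saved.
```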
Reasoning type breakdown
Performance across the reasoning categories in FinanceBench.
| Reasoning Type | n | Baseline | Best w/ Compression |
|---|---|---|---|
| Information extraction | 31 | 96.8% | 100% |
| Numerical reasoning | 43 | 93.0% | 93.0% |
| Logical reasoning (numerical) | 5 | 60.0% | 60.0% |
| Combined reasoning | 5 | 80.0% | 80.0% |
Information extraction was near-perfect regardless of compression, reaching 100% in some configurations. Numerical reasoning held steady at 93%. Logical reasoning — the hardest category with only 5 samples — showed the most variance, consistent with its inherent difficulty and small sample size.
Key findings
Compression does not degrade financial QA accuracy
All six compression configurations matched or exceeded the uncompressed control. This held across question types, reasoning categories, and compression levels up to 20%.
Light compression provides a free accuracy boost
At aggressiveness 0.05 and 0.1, accuracy improved by 2.7 percentage points with minimal token reduction. The compression acts as a denoising step, removing tokens that were actively hurting the model's ability to reason about the document.
Harder questions benefit the most
Novel-generated questions — the category requiring the most multi-step reasoning — saw the largest accuracy gain (+8pp). Straightforward extraction tasks were already near-ceiling and unaffected.
20% token reduction with no accuracy penalty
The most aggressive configuration tested (bear-1.2 @ 0.7) removed one in five tokens while still scoring 1.3 points above the uncompressed baseline. For production workloads, this directly translates to lower API costs and faster response times.
Methodology
Dataset
FinanceBench — 150 questions from real SEC filings of publicly traded companies
Evaluation
Exact-match grading against verified ground-truth answers
Configurations
6 bear-1.2 aggressiveness levels (0.05, 0.1, 0.3, 0.4, 0.5, 0.7) + 1 uncompressed control
Reproducibility
Full code and results published at github.com/TheTokenCompany/Benchmarks
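A minimal exact-match grader in the spirit of the evaluation above looks like the following; the normalization choices (casing, whitespace) are assumptions for illustration, not the published harness.

```python
# Minimal exact-match grader, a sketch of the grading described above.
# Normalization choices (casing, whitespace) are assumptions, not the
# published harness.
def exact_match(prediction: str, ground_truth: str) -> bool:
    """True when the model answer matches the verified answer exactly,
    after trivial whitespace and case normalization."""
    norm = lambda s: " ".join(s.strip().lower().split())
    return norm(prediction) == norm(ground_truth)
```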
Limitations
- This evaluation used a single LLM. Results may vary across different model families and sizes.
- FinanceBench questions have clear-cut answers. Real-world financial analysis often involves ambiguity that this benchmark does not capture.
- Token reduction percentages reflect this specific document set (SEC filings). Other document types may compress differently.
- We tested aggressiveness levels up to 0.7. Higher settings may show different accuracy characteristics.
- Some reasoning categories have small sample sizes (n=5), limiting the statistical power of per-category comparisons.
Ready to try it?
Create an account to get your API key and start compressing.