
Reducing LLM response times through compression

Up to 37% faster on Claude Opus 4.6 and 30% on GPT-5.2 — saving seconds per request with sub-120ms compression overhead

February 2026 · Models: Claude Opus 4.6, GPT-5.2 · ~50% token reduction · bear-1.2 @ aggressiveness 0.9

Fewer tokens means faster time-to-first-token and lower end-to-end latency.

  • 37% faster (Claude, 100K tokens): 2,016ms saved
  • 30% faster (GPT-5.2, 200K tokens): 827ms saved
  • 1M+ tok/sec compression throughput (100K input, P50)
  • 50 runs per input size, per model

What we measured

End-to-end (E2E) latency is the total wall-clock time from sending a request to receiving the LLM's first token. For chatbots, coding assistants, and search applications, this number determines how fast your product feels to users.

When you compress context before sending it to an LLM, two things happen: you pay a small overhead for compression, but the LLM processes fewer tokens — which means faster prefill and faster time-to-first-token. The question is whether the latency savings from shorter context outweigh the compression cost.

This benchmark measures both paths end-to-end. The baseline sends full context directly to the LLM. The compressed path runs context through the TTC API first, then sends the compressed output to the LLM. The difference tells you exactly how much time compression saves (or costs) at each input size.
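The two-path comparison can be sketched as a small timing harness. Note that `llm_first_token` and `compress_context` in the comments below are hypothetical stand-ins for an LLM client and the TTC API, not real client code:

```python
import time

def median_e2e_seconds(send_request, context: str, runs: int = 50) -> float:
    """Median wall-clock time-to-first-token over `runs` attempts (P50)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        send_request(context)  # assumed to return once the first token arrives
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]

# Baseline path: full context straight to the LLM.
#   baseline = median_e2e_seconds(llm_first_token, full_context)
# Compressed path: TTC first, then the shorter context to the LLM.
#   compressed = median_e2e_seconds(
#       lambda ctx: llm_first_token(compress_context(ctx)), full_context)
# A positive (baseline - compressed) means compression saved time end-to-end.
```

Taking the median rather than the mean keeps a single slow network round-trip from skewing a configuration's result.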

Evaluation design

We measured E2E latency across five input sizes (10K to 200K tokens) using documents from LongBench v2. Each configuration ran 50 times to account for network and API variability.

The compressed path uses bear-1.2 at aggressiveness 0.9, reducing tokens by approximately 50%. Both models were tested with identical text, and all requests were made from the same US West Coast location with gzip encoding enabled.

Document → bear-1.2 compression → Compressed context → LLM → First token

Results: Claude Opus 4.6

On Claude Opus 4.6, compression delivers clear latency savings starting at 25K tokens. The benefit scales with input size — at 100K tokens, the compressed path is 37% faster, saving over 2 seconds per request.

At 10K tokens, the two paths are effectively tied: the difference is just 6ms, well within noise. At 200K tokens, savings remain substantial at 1.6 seconds (27%), though the percentage is lower than at 100K because compression time grows for very large inputs.

[Chart: E2E latency by input size, baseline (direct to LLM) vs. with bear-1.2 compression]
Input Size    Baseline   With TTC   Saved
10K tokens    1.8s       1.8s       +6ms
25K tokens    2.3s       2.1s       +178ms (8%)
50K tokens    3.1s       2.5s       +622ms (20%)
100K tokens   5.4s       3.4s       +2.0s (37%)
200K tokens   5.8s       4.3s       +1.6s (27%)

Results: GPT-5.2

GPT-5.2 shows a similar pattern with lower absolute latencies. At 200K tokens, compression saves 827ms (30%). At 50K tokens, the saving is 218ms (21%).

The 25K data point is an outlier — compressed latency was 42ms higher than baseline. This is within the variance of a 50-run benchmark and likely reflects network jitter rather than a systematic effect. At every other size, compression reduced latency.

[Chart: E2E latency by input size, baseline (direct to LLM) vs. with bear-1.2 compression]
Input Size    Baseline   With TTC   Saved
10K tokens    563ms      520ms      +43ms (8%)
25K tokens    731ms      773ms      -42ms
50K tokens    1.0s       801ms      +218ms (21%)
100K tokens   1.6s       1.4s       +228ms (14%)
200K tokens   2.7s       1.9s       +827ms (30%)

50 runs per size per model. Savings scale with input size.
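The percentage column in both tables is simply the relative drop in median latency. Recomputing two of the headline figures:

```python
def pct_saved(baseline_s: float, compressed_s: float) -> int:
    """Percent reduction in end-to-end latency, rounded to the nearest point."""
    return round((baseline_s - compressed_s) / baseline_s * 100)

print(pct_saved(5.4, 3.4))  # Claude Opus 4.6 @ 100K -> 37
print(pct_saved(2.7, 1.9))  # GPT-5.2 @ 200K -> 30
```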

Compression API performance

The TTC compression API processes tokens at extremely high throughput. At 100K tokens, compression runs at 1M tokens per second with a P50 latency of 100ms. Even at 200K tokens, throughput reaches 1.7M tok/sec with P50 at 115ms.

Compression stays sub-120ms at all input sizes tested. This means the overhead is consistently small relative to LLM processing time, which is why the end-to-end savings are so significant at larger context sizes.
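The throughput figures follow directly from token count divided by P50 latency:

```python
def throughput_tok_per_sec(tokens: int, p50_ms: float) -> float:
    """Tokens processed per second, given a P50 latency in milliseconds."""
    return tokens / (p50_ms / 1000.0)

print(f"{throughput_tok_per_sec(100_000, 100):,.0f} tok/s")  # 1,000,000 tok/s
print(f"{throughput_tok_per_sec(200_000, 115):,.0f} tok/s")  # 1,739,130 tok/s
```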

Input size    Throughput
10K tokens    250K tok/s
25K tokens    455K tok/s
50K tokens    714K tok/s
100K tokens   1M tok/s
200K tokens   1.7M tok/s

Sub-120ms compression at every input size

The compression step adds between 40ms (10K tokens) and 115ms (200K tokens) — consistently small enough that the LLM latency savings more than compensate, especially above 25K tokens.

Key findings

Compression pays for itself at 25K+ tokens

At 25K tokens and above, the LLM latency reduction from processing fewer tokens consistently exceeds the compression overhead. At 10K tokens, the overhead is negligible but savings are minimal.
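The break-even condition is simply that compression overhead plus the LLM's time on the shorter context stays below the LLM's time on the full context. A minimal check (the example numbers are illustrative only; this benchmark reports end-to-end medians, not per-stage timings):

```python
def compression_pays_off(compress_ms: float,
                         llm_full_ms: float,
                         llm_compressed_ms: float) -> bool:
    """True when the prefill saving exceeds the compression overhead."""
    return compress_ms + llm_compressed_ms < llm_full_ms

# Illustrative: a ~70ms overhead against a 600ms prefill saving pays off...
print(compression_pays_off(70, 3100, 2500))   # True
# ...while at small inputs the saving may not cover the overhead.
print(compression_pays_off(70, 1800, 1760))   # False
```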

Savings scale with input size

Larger contexts benefit more from compression. At 100K tokens on Claude, the savings reach 2 seconds — a 37% reduction in end-to-end latency. This is because LLM prefill time grows with input length, while compression overhead grows much more slowly.
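A toy latency model makes the scaling intuition concrete. Every coefficient below is made up for illustration; none is fitted to the measurements in this post:

```python
def e2e_ms(tokens: int, compressed: bool,
           prefill_ms_per_tok: float = 0.03,  # assumed prefill cost per token
           base_ms: float = 500.0,            # assumed fixed request overhead
           compress_ms: float = 110.0,        # assumed compression P50
           reduction: float = 0.5) -> float:  # ~50% token reduction
    """End-to-end latency: fixed cost + (optional) compression + linear prefill."""
    if compressed:
        return base_ms + compress_ms + prefill_ms_per_tok * tokens * (1 - reduction)
    return base_ms + prefill_ms_per_tok * tokens

# Savings grow with input size: prefill is linear in tokens,
# while the compression overhead is a near-constant term.
for n in (10_000, 100_000):
    saved = e2e_ms(n, False) - e2e_ms(n, True)
    print(f"{n:>7} tokens: {saved:+.0f}ms")
```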

Both model families benefit

Claude Opus 4.6 and GPT-5.2 both showed significant latency reductions despite having very different baseline latency profiles. The benefit is not model-specific — it comes from reducing the fundamental work the LLM needs to do.

Compression overhead is consistently low

The TTC compression API processes up to 1.7M tokens per second with sub-120ms P50 latency at all sizes tested. This makes it practical to add compression to any real-time pipeline without introducing a noticeable bottleneck.

Methodology

Dataset

LongBench v2 documents at five size tiers: 10K, 25K, 50K, 100K, and 200K tokens

Measurement

50 runs per input size per model, measuring wall-clock E2E latency including all network overhead

Configuration

bear-1.2 @ aggressiveness 0.9 (~50% token reduction). Gzip encoding enabled for all requests

Reproducibility

Full code and raw results will be published on GitHub once the benchmark repository is made public

Limitations

  • All measurements were taken from a single geographic region (US West Coast). Latency characteristics may differ from other locations.
  • Only one aggressiveness level (0.9) was tested. Lower aggressiveness settings would remove fewer tokens and yield smaller latency savings.
  • Network variability affects individual runs. While running each configuration 50 times mitigates this, results reflect median behavior rather than worst-case.
  • Only two model families were tested. Other models may have different prefill characteristics that affect the break-even point.
  • LongBench v2 documents may not represent all real-world use cases. Highly structured or code-heavy inputs may compress differently.

Ready to try it?

Create an account to get your API key and start compressing.