SuperCompress

The new standard of context compression.

The hidden cost

AI is something we know and love.

But every answer has a physical cost.

Behind the interface are data centers drawing electricity, cooling water, and grid capacity at an accelerating rate.

The numbers

The scale is already difficult to ignore.

415 TWh: global data-center electricity use in 2024.
945 TWh: projected global use by 2030, more than double.
4.4%: share of all U.S. electricity used by data centers in 2023.
66 billion L: direct water consumed by U.S. data centers in 2023.

Sources: IEA Energy and AI; U.S. Department of Energy / Lawrence Berkeley National Laboratory.

Why it keeps growing

Agents process the same context again and again.

  • Every turn can resend the entire conversation.
  • Documents and tool outputs accumulate as the agent works.
  • The GPU still processes lines that have nothing to do with the current question.
Turn 1: Context
Turn 2: Context + history
Turn 3: Context + history + tools
Every turn: GPU prefill

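The growth pattern above can be put into rough numbers. This is a minimal sketch, assuming the full history is resent on every turn with no prompt caching; the function name and the token counts are made up for illustration and do not come from any benchmark.

```python
def prefill_tokens_per_turn(base_tokens: int, added_per_turn: int, turns: int) -> list[int]:
    """Tokens the model must prefill on each turn when the whole history is resent."""
    return [base_tokens + added_per_turn * t for t in range(turns)]


# Hypothetical agent: 2,000 tokens of initial context, 1,500 new tokens per turn.
per_turn = prefill_tokens_per_turn(base_tokens=2000, added_per_turn=1500, turns=10)

total_processed = sum(per_turn)   # everything the GPU prefilled across all turns
unique_tokens = per_turn[-1]      # content that actually exists after the last turn

print(per_turn[0], per_turn[-1])        # 2000 15500
print(total_processed, unique_tokens)   # 87500 15500
```

Under these assumptions the model processes 87,500 tokens to work with 15,500 tokens of distinct content, because each earlier line is read again on every later turn.
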
The quality trap

Removing tokens is easy. Keeping the right ones is hard.

Keep everything: accurate, but wasteful.

Every turn pays to process context that may never affect the answer.

Truncate blindly: cheaper, but fragile.

The one critical line may sit in the middle of the text that gets removed.

Summarize first: slower and lossy.

Another model call adds cost and may rewrite details the answer depends on.
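
The fragility of blind truncation is easy to reproduce. The sketch below assumes one common strategy, keeping the head and tail of the context and dropping the middle; the helper name and the log lines are invented for illustration.

```python
def truncate_head_tail(lines: list[str], keep: int) -> list[str]:
    """Keep the first and last lines, up to `keep` in total, and drop the middle."""
    if keep >= len(lines):
        return list(lines)
    head = (keep + 1) // 2
    tail = keep - head
    # Guard against tail == 0: lines[-0:] would return the whole list.
    return lines[:head] + (lines[-tail:] if tail else [])


# Hypothetical tool output: 20 lines, with the only important one in the middle.
lines = [f"log line {i}" for i in range(20)]
lines[10] = "ERROR: disk quota exceeded on /data"

kept = truncate_head_tail(lines, keep=6)
critical_survived = any("ERROR" in line for line in kept)

print(len(kept), critical_survived)  # 6 False
```

The prompt is 70% shorter, but the one line the answer depends on is gone, and nothing in the truncated prompt signals that it ever existed.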

There is a solution now

SuperCompress.

Learned context compression that keeps what matters before the prompt reaches the language model.

What it is

A smarter memory layer for AI agents.

SuperCompress sits between your application and your language model. It reduces the prompt without replacing the model or changing the workflow.

  • Query-aware. It keeps context for the question being asked now.
  • Model-agnostic. It works with hosted and open-weight models.
  • CPU-first. It does not spend another LLM call to save tokens.
Figure: context and question enter SuperCompress before the language model.

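To make the placement concrete, here is a minimal sketch of a compression step sitting between an application and a model. It is not the SuperCompress API: `answer`, `keep_overlapping_lines`, and `recording_model` are hypothetical stand-ins, and the toy compressor only keeps lines that share a longer word with the question.

```python
from typing import Callable

Compressor = Callable[[str, str], str]  # (context, question) -> shorter context
Model = Callable[[str], str]            # prompt -> answer


def answer(context: str, question: str, compress: Compressor, model: Model) -> str:
    """Compress the context for this question, then call the unchanged model."""
    focused = compress(context, question)
    prompt = f"{focused}\n\nQuestion: {question}"
    return model(prompt)


def keep_overlapping_lines(context: str, question: str) -> str:
    """Toy stand-in compressor: keep lines sharing a 4+ letter word with the question."""
    keywords = {w for w in question.lower().split() if len(w) >= 4}
    return "\n".join(
        line for line in context.splitlines()
        if keywords & set(line.lower().split())
    )


seen_prompts: list[str] = []


def recording_model(prompt: str) -> str:
    """Stand-in for any hosted or open-weight model; records what it was sent."""
    seen_prompts.append(prompt)
    return "ok"


context = "cats sleep a lot\nthe invoice total is 42 dollars\nthe sky is blue"
reply = answer(context, "what is the invoice total", keep_overlapping_lines, recording_model)
print(seen_prompts[0])
```

Because the compressor and the model are both plain callables here, either one can be swapped without touching the other, which is the property the model-agnostic bullet describes.
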
What changes

Less context goes in. The important context stays.

Before: long prompt → SuperCompress → After: focused prompt

  • ~65%: typical token reduction
  • Any LLM: same model, smaller prompt
  • CPU: compression before GPU inference

How it works

Score every line. Preserve the signal.

01 / Read

Context and question enter together.

The current query defines what information is relevant.

02 / Score

Each line receives a relevance score.

The policy uses nine features, including recency, position, and query overlap.

03 / Keep

The strongest lines stay in order.

The focused context is sent to the language model for inference.
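
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the SuperCompress policy: it uses only three of the nine features, the feature definitions and weights are hand-picked rather than learned, and tokens are approximated by a simple regular expression.

```python
import re


def tokenize(text: str) -> list[str]:
    """Crude token approximation (assumption): lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_lines(lines: list[str], question: str,
                weights: tuple[float, float, float] = (0.2, 0.1, 1.0)) -> list[float]:
    """Score each line from recency, position, and query overlap (hand-set weights)."""
    w_recency, w_position, w_overlap = weights
    query = set(tokenize(question))
    n = len(lines)
    scores = []
    for i, line in enumerate(lines):
        words = set(tokenize(line))
        overlap = len(query & words) / len(query) if query else 0.0
        recency = i / (n - 1) if n > 1 else 1.0          # later lines are more recent
        position = 1.0 if i in (0, n - 1) else 0.0       # first and last line get a bonus
        scores.append(w_recency * recency + w_position * position + w_overlap * overlap)
    return scores


def keep_top_lines(lines: list[str], question: str, budget_ratio: float = 0.35) -> list[str]:
    """Greedily keep the highest-scoring lines within the budget, in original order."""
    scores = score_lines(lines, question)
    budget = budget_ratio * sum(len(tokenize(line)) for line in lines)
    kept, used = set(), 0
    for i in sorted(range(len(lines)), key=lambda i: scores[i], reverse=True):
        cost = len(tokenize(lines[i]))
        if used + cost <= budget:
            kept.add(i)
            used += cost
    return [lines[i] for i in sorted(kept)]


lines = [
    "User asked to set up the staging server.",
    "Tool output: installed 214 packages.",
    "The database password is stored in vault path secret/staging/db.",
    "Tool output: build finished in 93 seconds.",
    "User thanked the agent.",
    "Tool output: cache cleared.",
]
# A 50% budget is used so that more than one line survives in this tiny example.
kept = keep_top_lines(lines, "Where is the database password stored?", budget_ratio=0.5)
print("\n".join(kept))
```

Selection happens by score, but the output is re-sorted by original index, so the model still sees the surviving lines in the order they occurred.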

Technical details

Small enough to run before every model call.

  • ~5K learned parameters
  • 9 line-level features
  • ~30 ms CPU time on benchmark seeds
  • 0 extra LLM calls

Figure: SuperCompress preserves answer-critical lines at the same token budget.

At a 35% token budget, SuperCompress preserved 100% of answer-critical lines on benchmark seeds.
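
A claim of this shape rests on two measurements: the share of answer-critical lines that survive, and the share of tokens that remain. The sketch below shows one way to compute both. How the benchmark labels critical lines and counts tokens is not stated here, so the whitespace token count, the function names, and the sample data are assumptions.

```python
def critical_line_recall(kept: list[str], critical: list[str]) -> float:
    """Fraction of answer-critical lines that are still present after compression."""
    if not critical:
        return 1.0
    kept_set = set(kept)
    return sum(1 for line in critical if line in kept_set) / len(critical)


def token_ratio(kept: list[str], original: list[str]) -> float:
    """Kept tokens divided by original tokens, using whitespace-separated words."""
    def count(lines: list[str]) -> int:
        return sum(len(line.split()) for line in lines)

    total = count(original)
    return count(kept) / total if total else 0.0


# Made-up example: four lines of context, one of which the answer depends on.
original = ["alpha beta gamma", "the key is 7", "delta epsilon", "zeta eta theta iota"]
critical = ["the key is 7"]
kept = ["the key is 7"]

print(critical_line_recall(kept, critical))       # 1.0
print(round(token_ratio(kept, original), 3))      # 0.308
```

Reporting both numbers together matters: recall alone can be maximized by keeping everything, and token ratio alone can be minimized by keeping nothing.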
