Vizuara · AI Daily Vol. — Long-context efficiency Jun 11, 2026

The quiet revolution inside attention

Shrinking the KV cache

A million-token context used to mean melting your GPU. In 2026 the cleverest models — Gemma 4, ZAYA1, DeepSeek V4 — barely break a sweat. Here is how they did it, drawn out step by step.

The headline numbers in modern LLMs are about size and benchmarks. The interesting story is quieter: a stack of small architectural tricks that attack the one resource long context burns fastest — memory. This is a visual tour of those tricks.

The cache that eats your GPU

When a language model reads a long document, it does not re-read everything from scratch for every new word. That would be hopelessly slow. Instead, for every token it has already seen, it stores two little vectors — a key and a value — and keeps them in memory so future tokens can attend back to them. This running store is the KV cache, and it is the single most important object in modern long-context inference.

The trouble is arithmetic. The cache grows linearly with the number of tokens: one key and one value per token, per attention head, per layer. A short chat barely registers. But push the context to 128,000 tokens — a long codebase, a book, a multi-hour agent session — and the cache balloons into several gigabytes of GPU memory that must be held the entire time.
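The arithmetic above can be written out directly. This is a minimal sketch: the model shape (32 layers, 8 K/V heads, head dimension 128, 2 bytes per value) is a hypothetical example chosen for illustration, not the configuration of any model named in this article.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 is one key vector plus one value vector,
    # stored per token, per K/V head, per layer.
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, for illustration only.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

short_chat = kv_cache_bytes(num_tokens=2_000, **config)
long_context = kv_cache_bytes(num_tokens=128_000, **config)

print(f"2K tokens:   {short_chat / 2**30:.2f} GiB")
print(f"128K tokens: {long_context / 2**30:.2f} GiB")
```

With these assumed numbers the 128K-token cache is 64 times the size of the 2K-token one, which is the linear growth in action: roughly a quarter of a GiB becomes more than 15 GiB.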

Fig 1 (diagram of the KV cache growing linearly with context length). Every token a model reads leaves behind a key and a value. The cache grows linearly with context, and each new token must read the *entire* cache, so memory bandwidth, not arithmetic, becomes the bottleneck.

And memory is only half the pain. To generate the next token, the model has to read the whole cache back out of memory. The longer the context, the more bytes shuffle across the memory bus for every single word produced. This is why long-context generation feels slow even on fast chips: the bottleneck is not the maths of attention, it is the cost of moving the cache around.
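A rough way to see the bandwidth bottleneck is to divide the cache size by the memory bandwidth. The sketch below assumes a hypothetical 16 GB cache and a hypothetical 2 TB/s of memory bandwidth, and it ignores the model weights, which also have to be read for every token, so the real floor is higher.

```python
def min_seconds_per_token(cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on decode time per token if the whole cache is read once."""
    return cache_bytes / bandwidth_bytes_per_s


cache_bytes = 16 * 10**9   # hypothetical 16 GB KV cache
bandwidth = 2 * 10**12     # hypothetical 2 TB/s of memory bandwidth

floor_s = min_seconds_per_token(cache_bytes, bandwidth)
print(f"{floor_s * 1e3:.1f} ms per token, at most {1 / floor_s:.0f} tokens/s")
```

Under these assumptions no amount of extra compute helps: the chip cannot produce tokens faster than it can stream the cache. Halving the cache halves this floor, which is why every technique that follows targets bytes rather than FLOPs first.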

So the entire 2026 wave of architecture tweaks can be read as different answers to one question: how do we make the KV cache smaller without making the model dumber? There turn out to be three big families of answer — share it, compress it, or summarize it.

Share, then compress

The first wave of fixes lived inside a single attention layer, and they form a neat progression. Start with classic Multi-Head Attention (MHA): every query head gets its own private key and value. Lots of heads means lots of cache — this is the expensive baseline that everything else improves on.

Grouped-Query Attention (GQA) was the pragmatic compromise that dominated 2024–2026. Instead of one key/value per query head, several query heads share a single key/value pair. Llama 3 8B, for instance, runs 32 query heads against just 8 shared K/V groups. With four query heads per group, the cache shrinks to about a quarter of its MHA size, with only a small quality cost.
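Here is a minimal NumPy sketch of grouped-query attention with 32 query heads and 8 K/V heads. The tensor sizes and random inputs are illustrative assumptions, and the code leaves out projections, positional encoding and batching.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v):
    """Causal attention. q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d)."""
    n_q_heads, T, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    # Each stored K/V head serves `group_size` consecutive query heads.
    k_full = np.repeat(k, group_size, axis=0)
    v_full = np.repeat(v, group_size, axis=0)
    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d)      # (n_q_heads, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)       # True above the diagonal
    scores = np.where(future, -np.inf, scores)               # hide future tokens
    return softmax(scores) @ v_full


rng = np.random.default_rng(0)
T, d = 6, 16
q = rng.standard_normal((32, T, d))   # 32 query heads
k = rng.standard_normal((8, T, d))    # only 8 K/V heads are cached
v = rng.standard_normal((8, T, d))

out = grouped_query_attention(q, k, v)
cache_ratio = k.shape[0] / q.shape[0]
print(out.shape, cache_ratio)
```

Only `k` and `v` would live in the cache, so the stored size scales with the 8 K/V heads rather than the 32 query heads. The `np.repeat` here is for clarity; an optimized kernel would index the shared heads instead of copying them.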

Fig 2 (comparison of MHA, GQA and MLA attention variants). Three ways to shrink the cache. MHA stores a key/value per head; GQA makes heads share them; MLA squeezes them into a tiny learned *latent* that is stored instead, and projected back up only when needed.

DeepSeek's Multi-Head Latent Attention (MLA) took a different route. Rather than reducing how many keys and values you keep, it shrinks what each one is. The K and V tensors are squeezed through a learned low-rank projection into a single small latent vector, and it is that latent — not the full-size keys and values — that lives in the cache. At use time, the latent is projected back up to full size.
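The latent idea can be sketched in a few lines of NumPy. All sizes and the random weights below are hypothetical, and the sketch leaves out details of the real design (for example how positional information is handled), so treat it as an illustration of the caching pattern rather than DeepSeek's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes, chosen only for illustration.
d_model, n_heads, head_dim, d_latent = 512, 8, 64, 256
T = 10

W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, n_heads * head_dim)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * head_dim)) / np.sqrt(d_latent)

hidden = rng.standard_normal((T, d_model))

# Only this latent is cached: one small vector per token.
latent_cache = hidden @ W_down                                  # (T, d_latent)

# At attention time, full-size keys and values are rebuilt from the latent.
k = (latent_cache @ W_up_k).reshape(T, n_heads, head_dim)
v = (latent_cache @ W_up_v).reshape(T, n_heads, head_dim)

mha_values_per_token = 2 * n_heads * head_dim   # separate K and V for every head
mla_values_per_token = d_latent                 # one shared latent
print(mla_values_per_token / mha_values_per_token)
```

The saving depends entirely on how narrow the latent is. With the assumed width of 256 against 1,024 values for full K and V, the cached size is a quarter of the MHA baseline; a narrower latent would save more at the risk of losing information.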

The payoff is the surprising part. At a 4K context with 32 heads, MLA cuts cache memory by roughly 75% versus MHA — and in DeepSeek's own ablations it matched or slightly beat full MHA on quality, while GQA gave up about half a perplexity point. Compression, done right, turned out to be nearly free.

Sharing keys is a compromise. Compressing them, it turns out, can be a free lunch.

These are the foundations. Everything in the 2026 crop builds on top of them — by pushing the same ideas across layers, into the attention computation itself, and along the length of the sequence.

Reuse across layers

GQA shares keys and values within a layer. Google's Gemma 4 asks the obvious follow-up: why not share them across layers too? Stacked transformer layers often compute remarkably similar key/value representations, especially deeper in the network — so storing a fresh copy at every layer is partly redundant.

In Gemma 4 E2B, only the first 15 of its 35 layers compute their own keys and values. The remaining 20 layers reuse the K/V tensors from the most recent earlier layer of the same attention type. Each layer still computes its own queries — so it can still ask its own questions — but it borrows the keys and values it asks them against.

Fig 3 (diagram of cross-layer KV sharing in Gemma 4). Cross-layer KV sharing in Gemma 4: only the lower layers store keys and values; upper layers reuse them and keep just their own queries. More than half the layers stop paying for a cache at all.

The accounting is clean: store KV for fewer than half the layers and you cut that part of the cache by roughly half — about 2.7 GB saved at a 128K context in bfloat16, on a small model that has to fit on modest hardware. Gemma 4 pairs this with "per-layer embeddings" — cheap lookup-table vectors that add capacity to a small model without growing the expensive transformer stack. Different trick, same spirit: spend memory where it is cheap, not where it is dear.
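The reuse rule can be sketched as a small lookup: each layer either stores its own K/V or points at an earlier layer of the same attention type. The 15-of-35 split comes from the text above; the layer pattern (four sliding-window layers followed by one global layer) and the function name are assumptions made for illustration.

```python
def kv_source_layers(layer_types, num_kv_layers):
    """For each layer, return the index of the layer whose K/V it attends against."""
    sources = []
    latest_by_type = {}
    for i, layer_type in enumerate(layer_types):
        if i < num_kv_layers:
            # This layer computes and stores its own keys and values.
            latest_by_type[layer_type] = i
            sources.append(i)
        else:
            # Reuse K/V from the most recent storing layer of the same type.
            sources.append(latest_by_type[layer_type])
    return sources


# Hypothetical pattern: four sliding-window layers, then one global layer, repeated.
layer_types = (["sliding"] * 4 + ["global"]) * 7   # 35 layers in total
sources = kv_source_layers(layer_types, num_kv_layers=15)
stored = len(set(sources))

print(sources)
print(f"layers storing K/V: {stored} of {len(layer_types)}")
print(f"fraction of per-layer cache avoided: {1 - stored / len(layer_types):.2f}")
```

Every layer still needs its own query projection, so only the K/V storage shrinks. In this sketch 20 of 35 layers store nothing, which is where the "roughly half" in the accounting comes from.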

This is the theme of the whole movement in miniature. No exotic new mathematics — just the recognition that a resource you were spending uniformly can be spent selectively, because not every layer actually needs its own copy.

Attend inside the squeeze

MLA compresses the keys and values for storage, but then projects them back to full size before computing attention. So it saves memory, not arithmetic — the attention maths still happens at full width. Compressed Convolutional Attention (CCA), introduced in ZAYA1-8B, asks: what if we just did the attention inside the compressed space and never expanded back?

In CCA, the queries, keys and values are all down-projected into a narrow latent space, and the attention scores are computed directly there. Because the vectors are smaller, you save on cache and on the floating-point operations of attention itself — during training and during the prefill of long prompts, not just generation.

Fig 4 (diagram of Compressed Convolutional Attention in ZAYA1). CCA compresses Q, K and V, then attends in the compressed space, saving FLOPs as well as memory. A small 1-D convolution mixes the narrowed Q and K so they keep enough local context to stay expressive.

There is an obvious risk: squeeze the vectors too hard and attention gets blurry — narrow vectors simply carry less information. CCA's fix is the "convolutional" half of its name. Before computing scores, a small 1-D convolution mixes each compressed query and key with its neighbours, giving every position a little local context to lean on. That extra structure buys back the expressiveness that compression took away — and in the paper's tests, CCA outperformed MLA at comparable compression.
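The following is a simplified, single-head sketch of attending inside a compressed space with a convolution on the queries and keys. The sizes, random weights, kernel width of 3 and the depthwise causal convolution are all assumptions made for illustration; this is not the ZAYA1 code.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_conv1d(x, kernel):
    """Depthwise causal convolution over the sequence. x: (T, d); kernel: (width, d)."""
    width, T = kernel.shape[0], x.shape[0]
    padded = np.concatenate([np.zeros((width - 1, x.shape[1])), x], axis=0)
    # Position t mixes itself (kernel[-1]) with the width - 1 positions before it.
    return sum(kernel[j] * padded[j : j + T] for j in range(width))


def compressed_attention(hidden, W_q, W_k, W_v, W_out, conv_q, conv_k):
    q = causal_conv1d(hidden @ W_q, conv_q)    # (T, d_c), narrow queries
    k = causal_conv1d(hidden @ W_k, conv_k)    # (T, d_c), narrow keys
    v = hidden @ W_v                           # (T, d_c), narrow values
    T, d_c = q.shape
    scores = q @ k.T / np.sqrt(d_c)            # scores computed at width d_c
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    context = softmax(scores) @ v              # still in the compressed space
    return context @ W_out                     # ordinary output projection to d_model


rng = np.random.default_rng(0)
T, d_model, d_c, width = 8, 64, 16, 3          # hypothetical sizes
hidden = rng.standard_normal((T, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_c)) / np.sqrt(d_model) for _ in range(3))
W_out = rng.standard_normal((d_c, d_model)) / np.sqrt(d_c)
conv_q = rng.standard_normal((width, d_c))
conv_k = rng.standard_normal((width, d_c))

out = compressed_attention(hidden, W_q, W_k, W_v, W_out, conv_q, conv_k)
print(out.shape)
```

The score matrix costs on the order of T × T × d_c multiplications rather than T × T × d_model, which is where the FLOP saving comes from. The convolution is kept causal so that no position can peek at tokens that come after it.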

It is a subtle shift in where the savings come from. GQA and MLA make the cache cheaper. CCA makes the act of attending cheaper — a different lever on the same problem.

Compress the timeline, not the token

Every trick so far compresses things per token: smaller keys, shared keys, latent keys. But you still keep one entry for every token you have ever seen. At a million tokens, that is a million entries no matter how small each one is. DeepSeek V4's boldest move is to compress along a different axis entirely — the sequence dimension — by summarizing groups of tokens into far fewer entries.

It does this with two complementary mechanisms, interleaved across layers, plus a safety net for recent context:

Fig 5 (diagram of CSA and HCA sequence compression in DeepSeek V4). DeepSeek V4 compresses the sequence itself. A sliding window keeps the last 128 tokens exact; CSA mildly compresses and sparsely selects detailed history; HCA crushes 128 tokens into one block for a cheap global view. Layers interleave CSA and HCA.

CSA (Compressed Sparse Attention) applies mild compression, bundling tokens into small groups (around 4-to-1), then uses sparse top-k selection to attend only to the handful of compressed blocks that actually matter for the current token. It keeps detail, but looks at little of it.

HCA (Heavily Compressed Attention) goes the other way: it crushes blocks of 128 tokens into a single entry and attends densely over those. Each block is a coarse summary, but there are so few of them that the model can afford to read all of them, giving a cheap, blurry, global view of the whole context.

Neither is safe alone, so DeepSeek interleaves CSA and HCA layers — detail where it is needed, global coverage everywhere — and both branches sit alongside a 128-token sliding window that keeps the most recent tokens fully uncompressed and exact.
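The three views described above can be sketched for a single query vector. Mean pooling stands in for the learned compressor, and the block sizes of 4 and 128 and the 128-token window follow the text; the top-k value of 8, the tensor sizes and the function names are assumptions. How DeepSeek V4 actually compresses blocks and combines the outputs is not covered here.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def pool_blocks(x, block):
    """Mean-pool consecutive blocks of tokens (a stand-in for a learned compressor)."""
    usable = (x.shape[0] // block) * block   # a trailing partial block is left to the window
    return x[:usable].reshape(-1, block, x.shape[1]).mean(axis=1)


def attend(q, k, v):
    weights = softmax(k @ q / np.sqrt(q.shape[0]))
    return weights @ v


def csa_like(q, k, v, block=4, top_k=8):
    """Mild compression, then attend only to the best-matching compressed blocks."""
    kc, vc = pool_blocks(k, block), pool_blocks(v, block)
    keep = np.argsort(kc @ q)[-top_k:]       # indices of the top_k highest-scoring blocks
    return attend(q, kc[keep], vc[keep]), keep


def hca_like(q, k, v, block=128):
    """Heavy compression, then dense attention over every compressed block."""
    kc, vc = pool_blocks(k, block), pool_blocks(v, block)
    return attend(q, kc, vc), kc.shape[0]


def sliding_window(q, k, v, size=128):
    """Exact attention over the most recent tokens only."""
    return attend(q, k[-size:], v[-size:])


rng = np.random.default_rng(0)
T, d = 4096, 32                              # hypothetical sequence length and width
k = rng.standard_normal((T, d))
v = rng.standard_normal((T, d))
q = rng.standard_normal(d)

csa_out, selected = csa_like(q, k, v)
hca_out, hca_entries = hca_like(q, k, v)
win_out = sliding_window(q, k, v)
print(len(selected), hca_entries, win_out.shape)
```

In this sketch the query reads 8 compressed blocks in the CSA-style branch, 32 in the HCA-style branch and 128 exact tokens in the window, instead of all 4,096 tokens. Scaled to a million tokens, 128-to-1 compression leaves fewer than 8,000 entries for the dense global view.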

1M tokens of context

27% of the prior FLOPs/token

10% of the prior KV cache

128:1 HCA compression ratio

The result is the headline of the whole article: at a million-token context, DeepSeek V4-Pro runs on roughly 27% of the per-token FLOPs and 10% of the KV cache of the previous generation. Summarizing the timeline, rather than every token on it, is what finally makes million-token context economical.

A wider residual highway

Not every 2026 trick is about the cache. The last one, manifold-constrained hyper-connections (mHC), is about expressiveness — and it appears in DeepSeek V4 right beside the compression machinery, because the two pull in opposite directions and need to be balanced.

Every transformer has a "residual stream" — the running highway of information that each layer reads from and writes back to. Traditionally there is exactly one such stream. mHC replaces it with several parallel streams (four, in DeepSeek V4), giving the model a wider road to carry information up through the network.

Fig 6 (diagram of manifold-constrained hyper-connections). mHC widens the single residual stream into four parallel ones. Mappings combine them into each layer and redistribute the output; a doubly-stochastic mixing matrix between layers keeps signal from blowing up or vanishing.

The danger with multiple streams is stability: mix them carelessly across dozens of layers and the signal either explodes or fades to nothing. The "manifold-constrained" part is the cure. The matrix that mixes the streams between layers is forced to be doubly stochastic — every row and every column sums to one, and all entries stay non-negative. That constraint guarantees the total signal is neither amplified nor shrunk as it travels, keeping very deep models stable.
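One common way to obtain a doubly stochastic matrix is Sinkhorn normalization: alternately rescale rows and columns of a positive matrix until both sum to one. The sketch below uses that method as an assumption, along with hypothetical sizes, and shows only the stream-mixing step, not the layer computations or the mappings into and out of each layer.

```python
import numpy as np


def sinkhorn(logits, num_iters=100):
    """Push a square matrix toward doubly stochastic form by alternating normalization."""
    m = np.exp(logits - logits.max())            # strictly positive entries
    for _ in range(num_iters):
        m = m / m.sum(axis=1, keepdims=True)     # make every row sum to one
        m = m / m.sum(axis=0, keepdims=True)     # make every column sum to one
    return m


rng = np.random.default_rng(0)
n_streams, d_model, n_layers = 4, 16, 48         # hypothetical sizes
streams = rng.standard_normal((n_streams, d_model))
start_total = streams.sum(axis=0)
start_max = np.abs(streams).max()

for _ in range(n_layers):
    mix = sinkhorn(rng.standard_normal((n_streams, n_streams)))
    streams = mix @ streams                      # each new stream blends the old ones

print(np.abs(streams.sum(axis=0) - start_total).max())
print(np.abs(streams).max(), start_max)
```

Because every column sums to one, the total across streams is unchanged after 48 rounds of mixing. Because every row is a non-negative weighting that sums to one, no entry can grow beyond the largest value it started with. Those two properties together are the stability guarantee described above.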

And it is cheap. Four parallel streams add only about 6.7% training overhead on a 27B model — a small price for a richer information highway. Where the compression tricks claw back resources, mHC quietly reinvests a little of them into making the model think better.

Why this matters, and what to watch

Step back and the pattern is striking. None of these ideas is a dramatic reinvention of the transformer. They are six careful answers to one mundane constraint — memory — and together they are what separate a model that can read a million tokens from one that can do it affordably.

| Trick | What it shrinks | The core idea |
| --- | --- | --- |
| GQA / MLA | The per-token cache | Share keys, or compress them into a latent |
| KV sharing | Cache across layers | Upper layers reuse lower layers' K/V |
| CCA | Attention FLOPs | Attend inside the compressed space |
| CSA / HCA | The sequence itself | Summarize groups of tokens, not each one |
| mHC | (adds capacity) | Several stable parallel residual streams |

For anyone building with these models, the practical upshot is that long context is quietly getting cheap. The agent that holds an entire codebase in mind, the assistant that remembers a months-long conversation, the model that reasons over a whole book — these stop being expensive luxuries and become the default. The constraint that shaped a generation of prompt-engineering workarounds is being engineered away.

What to watch next: how aggressively sequence-level compression (CSA/HCA) can be pushed before quality genuinely suffers, whether compressed-space attention like CCA spreads beyond a handful of labs, and whether these tricks compose cleanly — or interfere — when stacked in the same model. The interesting frontier in LLMs right now is not only bigger. It is lighter.

※ ※ ※

Adapted and illustrated from Sebastian Raschka's "Recent Developments in LLM Architectures." Explained simply, drawn out fully.
