Overview:
- Introduction: We'll investigate how Key-Value (KV) caches make language models like ChatGPT faster at generating text, by making a clever trade-off between memory usage and computation time.
- MLA and other Tricks: We'll then look at 11 recent research papers that build upon this fundamental idea to make language models even more efficient.
Understanding the Problem: Why Text Generation is Slow
Let's begin with a simple analogy. Imagine you're writing a story, and for each new word you write, you need to re-read the entire story so far to maintain consistency. The longer your story gets, the more time you spend re-reading. This is exactly what large language models face during text generation!
The Basic Building Block: Self-Attention
At the heart of modern language models is a mechanism called self-attention. For a sequence of tokens (think of tokens as roughly corresponding to words), each token needs to "look at" or "attend to" all other tokens to understand the context.
This looking-at-everything process has a computational cost that grows with the sequence length:
- For n tokens, each token needs to look at all n tokens
- This means the cost is proportional to n × n = n²
- In mathematical notation, we write this as O(n²) complexity
The Real Problem: Generating Text One Token at a Time
When a language model generates text, it does so one token at a time, and without any caching it recomputes self-attention over the entire sequence so far at every step. This is where things get computationally expensive:
- First token: recompute attention over 1 token (cost: 1²)
- Second token: recompute attention over 2 tokens (cost: 2²)
- Third token: recompute attention over 3 tokens (cost: 3²)
- And so on until the n-th token: recompute attention over n tokens (cost: n²)
If we add up all these costs for generating a sequence of length n, we get: 1² + 2² + … + n² = n(n+1)(2n+1)/6, which is O(n³).
This cubic cost means that as your text gets longer, the generation time grows extremely quickly. For example, generating a sequence twice as long takes roughly eight times as long! Clearly, we need a better approach.
The Solution: Key-Value (KV) Cache
The key insight behind KV caching is that we're doing a lot of redundant work. When generating each new token, we're recomputing things for all previous tokens that we've already processed before. Let's see how we can fix this.
What is a Key-Value Cache?
Think of a KV cache like a smart notepad where we write down important information about each token the first time we see it. For each token, we compute and store two things:
- A key (k): Think of this as an addressing mechanism – it helps determine how relevant this token is to future tokens
- A value (v): Think of this as the actual information that gets used when this token is found to be relevant
Mathematically, we compute these as:
- Key: k = W_k · x (where x is the token's embedding and W_k is a learned transformation)
- Value: v = W_v · x (where W_v is another learned transformation)
When generating a new token, we use its query (computed similarly to keys) to find relevant information in our cache by comparing it with all stored keys. The matching values are then used to help generate the token.
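To make this concrete, here is a tiny NumPy sketch of a single attention head using a KV cache during generation. The dimension and the weight matrices W_q, W_k, W_v are made up for illustration; real models do this per layer and per head inside a deep learning framework.

```python
import numpy as np

d = 16                      # embedding / head dimension (made-up)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []   # the KV cache: one key and one value per processed token

def generate_step(x_new):
    """Attend the new token's query against all cached keys/values."""
    # 1) Compute the new token's key and value once, and store them.
    k_cache.append(W_k @ x_new)
    v_cache.append(W_v @ x_new)
    # 2) Compute the new token's query and compare it with every cached key.
    q = W_q @ x_new
    K = np.stack(k_cache)              # (t, d)
    V = np.stack(v_cache)              # (t, d)
    scores = K @ q / np.sqrt(d)        # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over cached tokens
    return weights @ V                 # weighted mix of cached values

# Feed three token embeddings; each step only does O(t) new work.
for t in range(3):
    out = generate_step(rng.standard_normal(d))
print(out.shape)  # (16,)
```

Note that previously computed keys and values are never recomputed; each step only adds one new entry and reads the rest from the cache.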
How the KV Cache Makes Things Faster
With a KV cache, the process becomes much more efficient:
- When we see a new token, we only need to compute its key and value once
- For all future tokens, we can just look up these pre-computed values from our cache
- This means each new token only needs to do a small amount of new work, instead of redoing all previous computations
The trade-off is clear:
- We use more memory to store all the keys and values. For a model with:
- L layers
- H attention heads
- Sequence length n
- Key/value dimension d_k
The total memory cost is 2 × L × H × n × d_k values (the factor of 2 accounts for both keys and values).
This grows linearly with sequence length (O(n)), but the constant factors can be substantial for large models.
- But in return, we reduce the computation cost from O(n³) to O(n²)
To understand why it's O(n²), let's look at the cost at each step:
- Step 1: Process 1 token → cost 1
- Step 2: Process 1 new token + look at 1 cached token → cost 2
- Step 3: Process 1 new token + look at 2 cached tokens → cost 3
- And so on…
Adding these up: 1 + 2 + … + n = n(n+1)/2, which is O(n²).
This is a dramatic improvement over O(n³)! While we still have to do the fundamental work of looking at all previous tokens (O(n²)), we avoid the expensive recomputation at each step.
The Memory Challenge: Why We Need Better Solutions
While the KV cache is a powerful optimization, it comes with a significant memory cost. Let's look at a concrete example using a modern large language model like Llama3 70B with:
- 80 layers
- 64 attention heads
- batch size of 8 sequences
- key/value dimension of 128
- 16-bit precision (2 bytes per value)
The memory required for a batch of 8 sequences of 1000 tokens each would be: 2 × 80 × 64 × 128 × 1000 × 8 × 2 bytes ≈ 21 GB.
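As a quick sanity check of that figure, here is the arithmetic spelled out as a back-of-the-envelope sketch (it deliberately ignores Llama3's grouped-query attention, which we cover later and which shrinks this number considerably):

```python
# KV cache memory = 2 (K and V) * layers * heads * head_dim * tokens * batch * bytes_per_value
layers, heads, head_dim = 80, 64, 128
tokens, batch, bytes_per_value = 1000, 8, 2   # 16-bit precision -> 2 bytes

total_bytes = 2 * layers * heads * head_dim * tokens * batch * bytes_per_value
print(f"{total_bytes / 1e9:.1f} GB")  # ~21.0 GB
```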
This substantial memory usage creates several challenges:
- Scales linearly with sequence length
- Multiplies with batch size for parallel processing
- Limits the maximum context length we can handle
- Constrains deployment on memory-constrained devices
These challenges have sparked a wave of innovation in the research community, leading to various techniques for optimizing KV cache usage. Let's explore these cutting-edge solutions.
The following papers represent key innovations in KV cache optimization. We'll explore them through three main approaches: token selection, post-hoc compression techniques, and architectural redesigns.
Token Selection and Pruning Approaches
1) Heavy-Hitter Oracle (H2O)
H2O introduces the concept of identifying and retaining important tokens in the KV cache:
- Heavy-Hitter Tokens: H2O identifies tokens with the highest accumulated attention scores during generation, which follow a power-law distribution. These tokens are critical for model performance and are retained in the cache.
- Dynamic Submodular Eviction: The method frames cache management as an optimization problem with a submodular objective function that quantifies the importance of a token set as the sum of the accumulated attention scores of its tokens. The cache is updated greedily at each step, ensuring that at most one token is evicted per step. This greedy algorithm is computationally efficient and guarantees near-optimal performance under submodular constraints.
- Results: Achieves a 5× reduction in KV cache size with negligible accuracy loss and up to 29× throughput improvement.
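To illustrate the flavor of the approach, here is a simplified sketch (not the authors' implementation): keep a fixed budget of cache slots, accumulate each cached token's attention mass, and greedily evict the lowest-scoring non-recent token once the budget is exceeded.

```python
import numpy as np

class HeavyHitterCache:
    """Toy H2O-style cache: evict the cached token with the lowest
    accumulated attention score once the budget is exceeded."""

    def __init__(self, budget, recent=4):
        self.budget, self.recent = budget, recent
        self.keys, self.values, self.scores = [], [], []

    def add(self, k, v, attn_weights):
        # attn_weights: the new token's attention over the current cache entries
        for i, w in enumerate(attn_weights):
            self.scores[i] += w            # accumulate attention mass per cached token
        self.keys.append(k); self.values.append(v); self.scores.append(0.0)
        if len(self.keys) > self.budget:
            # never evict the most recent `recent` tokens
            candidates = range(len(self.keys) - self.recent)
            evict = min(candidates, key=lambda i: self.scores[i])
            for buf in (self.keys, self.values, self.scores):
                del buf[evict]

cache = HeavyHitterCache(budget=8)
rng = np.random.default_rng(0)
for t in range(20):
    attn = rng.random(len(cache.keys))
    attn = attn / attn.sum() if len(attn) else attn
    cache.add(rng.standard_normal(16), rng.standard_normal(16), attn)
print(len(cache.keys))  # 8 (the cache never grows beyond its budget)
```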
2) StreamingLLM
- The authors observe the phenomenon of Attention Sinks: initial tokens that act as natural attention anchors during decoding
- Without these attention sink tokens, the performance of naive window attention drops
- Based on that observation, they introduce a rolling cache for recent context combined with the retained initial tokens, enabling infinite-length sequence processing.
- They show that such sink tokens can also be trained, serving as dedicated attention anchors and reducing reliance on multiple initial tokens.
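A minimal sketch of the resulting cache policy (the sink count and window size below are arbitrary): always keep the first few tokens plus a rolling window of the most recent ones.

```python
def streaming_cache_positions(seq_len, n_sink=4, window=8):
    """Return the token positions kept in a StreamingLLM-style cache:
    the first `n_sink` attention-sink tokens plus the most recent `window` tokens."""
    sinks = list(range(min(n_sink, seq_len)))
    recent = list(range(max(n_sink, seq_len - window), seq_len))
    return sinks + recent

print(streaming_cache_positions(20))  # [0, 1, 2, 3, 12, 13, ..., 19]
```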
3) Value-Aware Token Pruning (VATP)
VATP extends H2O's token importance concept by considering both attention patterns and value vector properties:
- Importance Scoring: Combines attention scores with value vector information, scoring each cached token t as S_t = A_t × ||v_t||_1, where A_t is the accumulated attention score and ||v_t||_1 is the value vector's L1 norm.
- Token Pruning: Tokens are ranked by S_t, and those with the lowest scores are pruned, while attention sink tokens (e.g., the start or newline tokens) are preserved to prevent performance degradation.
- Performance and Efficiency:
- Outperforms baselines like H2O and Scissorhands in 12–14 out of 16 LongBench tasks.
- Achieves effective 50% compression with minimal performance loss.
- Introduces negligible computational overhead and is compatible with FlashAttention when combined with Scissorhands.
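Here is a short sketch of the scoring rule described above; the helper name and the sink handling are illustrative.

```python
import numpy as np

def vatp_keep_indices(accum_attn, values, keep, n_sink=1):
    """Rank cached tokens by accumulated attention * L1 norm of the value
    vector, and keep the `keep` highest-scoring ones plus the sink tokens."""
    scores = accum_attn * np.abs(values).sum(axis=-1)   # S_t = A_t * ||v_t||_1
    scores[:n_sink] = np.inf                            # never prune attention sinks
    return np.sort(np.argsort(scores)[-keep:])

rng = np.random.default_rng(0)
accum_attn = rng.random(12)             # accumulated attention per cached token
values = rng.standard_normal((12, 16))  # cached value vectors
print(vatp_keep_indices(accum_attn, values, keep=6))
```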
Post-hoc Compression Techniques
These methods compress or optimize the KV cache while preserving the standard transformer architecture.
4) Adaptive KV Compression (FastGen)
FastGen introduces adaptive compression based on attention patterns observed at run time:
- Attention Profiling: During prompt encoding, FastGen identifies each head's attention pattern and selects the compression policy that minimizes memory cost while preserving attention recovery.
- Adaptive Compression Policies:
- Compression strategies include:
- Special Tokens: Retain only special tokens.
- Locality: Evict tokens beyond a fixed relative distance.
- Frequency: Keep tokens with high cumulative attention scores (heavy hitters).
- Hybrid Policies combine these strategies, starting with the special-token policy, and apply them adaptively to each head.
- Token Generation:
- During decoding, the pre-selected compression policies manage the KV cache efficiently.
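The sketch below is a heavily simplified illustration of the per-head idea, not FastGen's exact algorithm: profile a head's attention on the prompt, pick the cheapest policy that still recovers most of the attention mass, and use that policy during decoding. The policy names and the recovery threshold are assumptions for the example.

```python
import numpy as np

def profile_head(attn, special_pos, window=8, recovery=0.95):
    """attn: (prompt_len, prompt_len) attention matrix of one head on the prompt.
    Return the cheapest policy whose kept positions recover `recovery` of the
    attention mass (illustrative ordering: special -> special+local -> full)."""
    n = attn.shape[0]
    policies = {
        "special": set(special_pos),
        "special+local": set(special_pos) | set(range(max(0, n - window), n)),
        "full": set(range(n)),
    }
    for name, kept in policies.items():   # ordered from cheapest to most expensive
        mask = np.zeros(n); mask[list(kept)] = 1.0
        if (attn * mask).sum() / attn.sum() >= recovery:
            return name, kept
    return "full", policies["full"]

rng = np.random.default_rng(0)
attn = rng.random((16, 16)); attn /= attn.sum(axis=-1, keepdims=True)
print(profile_head(attn, special_pos=[0])[0])  # likely "full" for uniform random attention
```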
5) Dynamic Memory Compression (DMC)
DMC introduces adaptive token merging:
- Decision Mechanism: At time t, the model predicts a merge decision α_t ∈ {0, 1} and an importance weight ω_t.
- Weighted Merging: When α_t = 1, the current key/value pair is merged with the previous cache entry via a weighted average, where the ω weights accumulate importance across merged tokens.
- Training:
- Uses a Gumbel-Sigmoid relaxation of α_t, with a temperature parameter τ, to enable end-to-end training with gradient descent.
- Optimizes a combined objective: the language modeling loss plus a term that encourages the model to match a target compression ratio (CR).
- Results: Up to 8× compression with preserved performance.
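Here is a sketch of just the cache-update step under the definitions above (α_t is the merge decision, ω_t the importance weight); the training machinery is omitted.

```python
import numpy as np

def dmc_update(cache, k_t, v_t, alpha_t, omega_t):
    """Append or merge the new (key, value) into the cache.
    cache: list of [key, value, accumulated_weight] entries."""
    if alpha_t == 1 and cache:
        k_prev, v_prev, w_prev = cache[-1]
        w_new = w_prev + omega_t
        # weighted average of the previous entry and the new token
        cache[-1] = [(w_prev * k_prev + omega_t * k_t) / w_new,
                     (w_prev * v_prev + omega_t * v_t) / w_new,
                     w_new]
    else:
        cache.append([k_t, v_t, omega_t])
    return cache

rng = np.random.default_rng(0)
cache = []
for t in range(10):
    alpha = int(rng.random() < 0.5)   # toy merge decisions
    cache = dmc_update(cache, rng.standard_normal(8), rng.standard_normal(8),
                       alpha, omega_t=float(rng.random()))
print(len(cache))  # fewer than 10 entries whenever merges occurred
```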
6) Norm-Based Compression
This paper presents a surprising observation: a clear correlation between the L2 norm and the attention scores over cached KV pairs, where a low L2 norm of a key embedding usually leads to a high attention score during decoding. Consequently, they propose a simple but effective compression strategy:
- Norm-Based Selection: For the set of cached keys, compute the L2 norm of every key embedding.
- Sorting and Selection: To compress the KV cache, sort all keys by their L2 norm values and retain the keys with the lowest norms, where the number retained is determined by the compression ratio.
- Compressed Cache: The compressed key-value cache consists of the retained low-norm keys and their corresponding values.
- Due to its simplicity, this approach maintains compatibility with FlashAttention.
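The selection rule is simple enough to sketch in a few lines (the function name and the 50% ratio are illustrative):

```python
import numpy as np

def l2_compress(keys, values, compression_ratio=0.5):
    """Keep the fraction `compression_ratio` of cached KV pairs whose keys
    have the smallest L2 norm (low key norm ~ high future attention)."""
    k_keep = max(1, int(len(keys) * compression_ratio))
    norms = np.linalg.norm(keys, axis=-1)
    kept = np.sort(np.argsort(norms)[:k_keep])   # indices of the lowest-norm keys
    return keys[kept], values[kept]

rng = np.random.default_rng(0)
keys, values = rng.standard_normal((100, 64)), rng.standard_normal((100, 64))
small_k, small_v = l2_compress(keys, values)
print(small_k.shape, small_v.shape)  # (50, 64) (50, 64)
```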
Architectural Redesigns
These approaches modify the transformer architecture to handle KV caches more efficiently, often incorporating compression directly into the architecture.
7) Multi-Query Attention (MQA)
- Key Idea: MQA reduces the KV cache size by sharing a single key-value head across all query heads, replacing the traditional Multi-Head Attention (MHA). Each query head keeps its own projection, but the key and value projections W_K and W_V are shared by every head.
- Benefits: Reduces the KV cache size by a factor of H (the number of attention heads), significantly lowering memory bandwidth overhead.
- Trade-Off: While MQA is faster, it often suffers from quality degradation, especially in tasks requiring diverse attention patterns.
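A shape-level sketch of the idea in NumPy (arbitrary dimensions, random weights, no causal mask): each query head has its own projection, but all heads share one key and one value projection, so only a single K and V per token ever needs to be cached.

```python
import numpy as np

n, d_model, n_heads, d_head = 10, 64, 8, 8
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d_model))

W_q = rng.standard_normal((n_heads, d_model, d_head))  # one query projection per head
W_k = rng.standard_normal((d_model, d_head))           # shared key projection
W_v = rng.standard_normal((d_model, d_head))           # shared value projection

K, V = X @ W_k, X @ W_v        # only this single (n, d_head) K and V go into the cache
heads = []
for h in range(n_heads):
    Q = X @ W_q[h]
    scores = Q @ K.T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads.append(weights @ V)
out = np.concatenate(heads, axis=-1)
print(K.shape, out.shape)      # (10, 8) (10, 64)
```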
8) Grouped-Query Attention (GQA)
- Key Idea: GQA interpolates between full multi-head attention and MQA, offering a scalable trade-off between inference speed and model quality. It divides the query heads into G groups, where each group shares a single key-value head:
- GQA-1: Equivalent to MQA (G = 1).
- GQA-H: Equivalent to MHA (G = H).
- Uptraining: GQA can be introduced into existing pre-trained models through fine-tuning (see the sketch after this list):
- First, convert MHA checkpoints to GQA by mean-pooling the key and value heads into groups
- Then fine-tune ("uptrain") the model briefly to adapt to the new attention pattern
- This conversion process requires only about 5% of the original pre-training compute, making it very efficient
- The resulting model maintains quality while gaining GQA's memory benefits
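Here is a sketch of the mean-pooling conversion step, with illustrative shapes:

```python
import numpy as np

def pool_kv_heads(W_kv, n_groups):
    """Convert per-head key (or value) projections of shape
    (n_heads, d_model, d_head) into (n_groups, d_model, d_head) by
    mean-pooling the heads within each group."""
    n_heads = W_kv.shape[0]
    assert n_heads % n_groups == 0
    grouped = W_kv.reshape(n_groups, n_heads // n_groups, *W_kv.shape[1:])
    return grouped.mean(axis=1)

rng = np.random.default_rng(0)
W_k_mha = rng.standard_normal((16, 512, 64))   # 16 key heads in the MHA checkpoint
W_k_gqa = pool_kv_heads(W_k_mha, n_groups=4)   # 4 shared key heads after pooling
print(W_k_gqa.shape)                           # (4, 512, 64)
```

After pooling, each group of query heads attends to one shared key/value head, and a short uptraining run adapts the model to the new pattern.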
9) Multi-head Latent Attention (MLA)
Multi-Head Latent Attention (MLA) shares the goal of reducing KV cache overhead with MQA/GQA but achieves it through low-rank latent compression rather than head sharing.
- MLA reduces KV cache size by compressing keys and values into low-dimensional latent vectors before reconstruction.
- It down-projects the key-value embeddings into a compressed latent space, c_t = W_DKV · h_t, and reconstructs keys and values as k_t = W_UK · c_t and v_t = W_UV · c_t, where W_DKV is the down-projection matrix and W_UK, W_UV are up-projection matrices for keys and values.
- It preserves per-head flexibility through the compressed representations, unlike MQA's complete head sharing.
- It applies Rotary Positional Embeddings (RoPE) in a decoupled way: a separate positional key k_R = RoPE(W_KR · h_t) carries position information alongside the compressed latent.
This keeps KV cache storage small, since only the compressed latent vectors c_t and the positional keys k_R need to be cached.
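A rough sketch of what gets cached and what gets reconstructed (dimensions and weight names are illustrative, and RoPE is reduced to a stub):

```python
import numpy as np

d_model, d_latent, d_head = 64, 8, 16
rng = np.random.default_rng(0)
W_dkv = rng.standard_normal((d_latent, d_model))   # down-projection
W_uk = rng.standard_normal((d_head, d_latent))     # up-projection for keys
W_uv = rng.standard_normal((d_head, d_latent))     # up-projection for values
W_kr = rng.standard_normal((d_head, d_model))      # projection for the positional key

def cache_token(h_t):
    """Per token, cache only the small latent vector and the positional key."""
    c_t = W_dkv @ h_t                 # compressed latent (d_latent values)
    k_rope = W_kr @ h_t               # stand-in for RoPE(W_KR @ h_t)
    return c_t, k_rope

def expand(c_t):
    """Reconstruct the full key and value from the cached latent vector."""
    return W_uk @ c_t, W_uv @ c_t

c, k_pos = cache_token(rng.standard_normal(d_model))
k, v = expand(c)
print(c.shape, k_pos.shape, k.shape, v.shape)  # (8,) (16,) (16,) (16,)
```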
10) SnapKV
- SnapKV introduces an Observation Window: it uses the end-of-prompt tokens to identify the attention patterns over the earlier prompt, aggregating their attention weights and keeping the highest-scoring positions, with the number of kept positions determined by the compression rate.
- Compression: Clusters features around the selected positions using a pooling layer to preserve context completeness.
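A sketch of the selection step under the description above (the pooling width and the aggregation rule are assumptions for the example):

```python
import numpy as np

def snapkv_keep(attn_obs, keep, pool=5):
    """attn_obs: (obs_len, prompt_len) attention from the observation-window
    tokens over the prompt. Return the prompt positions to keep."""
    votes = attn_obs.sum(axis=0)                       # aggregate over the window
    # 1-D max pooling so neighbours of important positions are also favoured
    pad = pool // 2
    padded = np.pad(votes, pad, mode="edge")
    pooled = np.array([padded[i:i + pool].max() for i in range(len(votes))])
    return np.sort(np.argsort(pooled)[-keep:])

rng = np.random.default_rng(0)
attn_obs = rng.random((16, 256))    # last 16 prompt tokens attending over 256 positions
print(snapkv_keep(attn_obs, keep=32)[:8])
```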
11) You Only Cache Once (YOCO)
YOCO modifies the transformer architecture for caching:
- Global Cache: Uses a decoder-decoder design with a single shared KV cache.
- Complexity Reduction: Reduces memory from O(N × L) to O(N), where N is the sequence length and L is the number of layers.
- Efficient Attention: The self-decoder employs sliding-window attention or gated retention, enabling constant memory usage (O(C), where C is a small window size).
Key-Value caching techniques are central to scaling and optimizing Transformer-based models for real-world use. Innovations like dynamic eviction, compression, and structured approximations continue to push the boundaries of what is possible in long-context or resource-constrained scenarios. KV caching remains a vibrant research area, offering both theoretical insights and practical improvements.
PS: This blog post is mostly AI-generated using a PySpur workflow with minimal human edits.