Multi-Head Latent Attention and Other KV Cache Tricks

Overview:

  1. Introduction: We'll investigate how Key-Value (KV) caches make language models like ChatGPT faster at generating text, by making a clever trade-off between memory usage and computation time.
  2. MLA and other Tricks: We'll then look at 11 recent research papers that build upon this fundamental idea to make language models even more efficient.

Understanding the Problem: Why Text Generation is Slow

Let's begin with a straightforward analogy. Imagine you're writing a story, and for each new word you write, you need to re-read the entire story so far to stay consistent. The longer your story gets, the more time you spend re-reading. This is exactly what large language models face during text generation!

The Basic Building Block: Self-Attention

At the heart of modern language models is a mechanism called self-attention. For a sequence of $n$ tokens (think of tokens as roughly corresponding to words), each token needs to "look at", or attend to, all other tokens to understand the context.

This look-at-everything process has a computational cost that grows with the sequence length:

  • For $n$ tokens, each token needs to look at all $n$ tokens
  • This means the cost is proportional to $n \times n = n^2$
  • In mathematical notation, we write this as $O(n^2)$

The Real Problem: Generating Text One Token at a Time

When a language model generates text, it does so one token at a time, and this is where things get computationally expensive:

  1. First token: look at 1 token (cost: $O(1^2)$)
  2. Second token: look at 2 tokens (cost: $O(2^2)$)
  3. Third token: look at 3 tokens (cost: $O(3^2)$)
  4. And so on, until the $n$-th token: look at $n$ tokens (cost: $O(n^2)$)

If we add up all these costs for generating a sequence of length $n$, we get:

$O(1^2 + 2^2 + 3^2 + \dots + n^2) \approx O(n^3)$

This $O(n^3)$ scaling means that generation quickly becomes prohibitively slow as sequences get longer, because each step naively redoes work for the entire prefix, as the sketch below illustrates.
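To make this concrete, here is a minimal NumPy sketch (not from the original post) of this naive decoding loop: every step re-projects keys and values for the whole prefix, which is exactly the redundant work the KV cache removes. Dimensions are simplified so the attention output can stand in for the next token's embedding.

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention of one query against all keys/values."""
    scores = q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def generate_without_cache(x, W_Q, W_K, W_V, n_steps):
    """Naive decoding: every step recomputes K and V for the whole prefix.
    x: (t, d) token embeddings; all projection matrices are (d, d) here."""
    for _ in range(n_steps):
        K = x @ W_K               # O(t) projections, redone every single step
        V = x @ W_V
        q = x[-1] @ W_Q           # query for the newest token only
        out = attention(q, K, V)  # O(t) attention work
        x = np.vstack([x, out])   # treat the output as the next token's embedding
    return x
```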


The Solution: Key-Value (KV) Cache

The key insight behind KV caching is that we're doing a lot of redundant work. When generating each new token, we recompute things for all the previous tokens that we have already processed. Let's see how we can fix this.

What is a Key-Value Cache?

Think of a KV cache like a smart notepad where we write down important information about each token the first time we see it. For each token, we compute and store two things:

  1. A key ($k$): think of this as an addressing mechanism; it helps determine how relevant this token is to future tokens
  2. A value ($v$): think of this as the actual information that gets used once this token is found to be relevant

Mathematically, we compute these as:

  • Key: $k = x W_K$
  • Value: $v = x W_V$

When generating a new token, we use its query (computed similarly to the keys) to find relevant information in our cache by comparing it with all stored keys. The matching values are then used to help generate the token.
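As a minimal sketch (with made-up dimensions, not a real model's), caching a token boils down to computing its key and value once with the formulas above and appending them to a growing list:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                      # illustrative sizes only
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

kv_cache = {"keys": [], "values": []}

def cache_token(x):
    """Compute k = x W_K and v = x W_V once for a new token and store them."""
    kv_cache["keys"].append(x @ W_K)
    kv_cache["values"].append(x @ W_V)
```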

How the KV Cache Makes Things Faster

With a KV cache, the process becomes much more efficient:

  1. When we see a new token, we only need to compute its key and value once
  2. For all future tokens, we can simply look up these pre-computed values from our cache
  3. This means each new token only needs to do a small amount of new work, instead of redoing all previous computations

The trade-off is clear:

  • We use more memory to store all the keys and values. For a model with:
    • $L$ layers
    • $H$ attention heads
    • Sequence length $n$
    • Key/value dimension $d_k$
    the cache holds $L \times H \times n \times d_k \times 2$ numbers per sequence.
  • But in return, we reduce the computation cost from $O(n^3)$ to $O(n^2)$.

To understand why it's $O(n^2)$, let's count the work done at each step:

  1. Step 1: process 1 token → cost $O(1)$
  2. Step 2: process 1 new token + look at 1 cached token → cost $O(2)$
  3. Step 3: process 1 new token + look at 2 cached tokens → cost $O(3)$
  4. And so on…

Adding these up:

$O(1 + 2 + 3 + \dots + n) = O(n^2)$

This is a dramatic improvement over $O(n^3)$, especially for long sequences.
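Here is a minimal sketch of one decoding step with the cache in place: only the new token is projected, and the cached keys and values are simply read back, so step $t$ costs $O(t)$ instead of redoing all earlier projections. Names and shapes are illustrative assumptions.

```python
import numpy as np

def step_with_cache(x_new, W_Q, W_K, W_V, cache_K, cache_V):
    """One decoding step: project only the new token, reuse everything cached."""
    cache_K.append(x_new @ W_K)            # O(1) new projections per step
    cache_V.append(x_new @ W_V)

    K = np.stack(cache_K)                  # (t, d_k), read back from the cache
    V = np.stack(cache_V)
    q = x_new @ W_Q

    scores = q @ K.T / np.sqrt(K.shape[-1])   # O(t) attention work at step t
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                           # context vector for the new token
```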


The Memory Challenge: Why We Need Better Solutions

While the KV cache is a powerful optimization, it comes with a significant memory cost. Let's look at a concrete example using a modern large language model like Llama 3 70B with:

  • $L = 80$ layers
  • $H = 64$ attention heads
  • $B = 8$ sequences per batch
  • $d_k = 128$ key/value dimension
  • 16-bit precision

The memory required for a batch of 8 sequences of 1000 tokens each would be:

$L \times H \times B \times n \times d_k \times 2 \times 2 \text{ bytes} = 80 \times 64 \times 8 \times 1000 \times 128 \times 2 \times 2 \text{ bytes} \approx 20.97\,\text{GB}$
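The arithmetic is easy to reproduce; the helper below simply evaluates the formula above (2 bytes per element for 16-bit precision, times 2 for storing both keys and values):

```python
def kv_cache_bytes(L, H, B, n, d_k, bytes_per_elem=2):
    """KV cache size: L layers x H heads x B sequences x n tokens x d_k dims,
    x 2 (keys and values) x bytes per element."""
    return L * H * B * n * d_k * 2 * bytes_per_elem

print(kv_cache_bytes(L=80, H=64, B=8, n=1000, d_k=128) / 1e9)  # ~20.97 GB
```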

This substantial memory usage creates several challenges:

  1. It scales linearly with sequence length
  2. It multiplies with batch size for parallel processing
  3. It limits the maximum context length we can handle
  4. It constrains deployment on memory-limited devices

These challenges have sparked a wave of innovation in the research community, leading to various techniques for optimizing KV cache usage. Let's explore these cutting-edge solutions.

The following papers represent key innovations in KV cache optimization. We'll explore them through three main approaches: token selection, post-hoc compression techniques, and architectural redesigns.

Token Selection and Pruning Approaches

1) Heavy-Hitter Oracle (H2O)

H2O introduces the idea of identifying and keeping only the important tokens in the KV cache:

  • Heavy-Hitter Tokens: H2O identifies the tokens with the highest accumulated attention scores during generation, which follow a power-law distribution. These tokens are critical for model functionality and are kept in the cache (a simplified eviction sketch follows this list).
  • Dynamic Submodular Eviction: The method frames cache management as an optimization problem with a submodular objective function $F(S)$ that quantifies the importance of a token set $S$:
    $F(S) = \sum_{i \in S} A_i$
  • Results: Achieves a 5× reduction in KV cache size with negligible accuracy loss and up to 29× throughput improvement.
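Below is a simplified sketch of an H2O-style eviction step (not the authors' implementation): keep a window of recent tokens plus the "heavy hitters" with the largest accumulated attention, and drop the rest. The `recent` window size is an illustrative choice.

```python
import numpy as np

def h2o_evict(K, V, acc_attn, budget, recent=32):
    """Keep the most recent tokens plus the highest-scoring heavy hitters.
    K, V: (n, d) cached keys/values; acc_attn: (n,) accumulated attention."""
    n = len(acc_attn)
    if n <= budget:
        return K, V, acc_attn
    recent_idx = set(range(n - recent, n))
    older = sorted((i for i in range(n) if i not in recent_idx),
                   key=lambda i: acc_attn[i], reverse=True)
    keep = np.array(sorted(recent_idx | set(older[: budget - recent])))
    return K[keep], V[keep], acc_attn[keep]
```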

2) StreamingLLM

  • The authors observe the phenomenon of attention sinks: initial tokens that act as natural attention anchors during decoding
    • Without these attention-sink tokens, the performance of naive window attention drops sharply
  • Based on this observation, they introduce a rolling cache for recent context that always retains the initial tokens, enabling processing of effectively unbounded sequence lengths (sketched below).
  • They show that a sink token can also be trained to serve as a dedicated attention anchor, reducing the reliance on multiple initial tokens.
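A tiny sketch of the resulting cache policy: always keep the first few "sink" tokens plus a rolling window of recent tokens. The sink count and window size below are illustrative, not the paper's exact settings.

```python
def streaming_keep_indices(n_cached, n_sink=4, window=1024):
    """Indices retained under a StreamingLLM-style policy: attention sinks
    (the first n_sink tokens) plus the most recent `window` tokens."""
    sinks = list(range(min(n_sink, n_cached)))
    recent = list(range(max(n_sink, n_cached - window), n_cached))
    return sinks + recent
```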

3) Value-Aware Token Pruning (VATP)

VATP extends H2O's notion of token importance by considering both attention patterns and properties of the value vectors:

  • Importance Scoring: Combines accumulated attention scores with the $L_1$ norm of the value vector:
    $I_k^t = S_k^t \cdot \|v_k\|_1, \quad S_k^t = \sum_{k \leq j \leq t} a_{j,k}$
  • Token Pruning: Tokens are ranked by $I_k^t$ and the lowest-scoring ones are pruned from the cache (see the sketch after this list).
  • Performance and Efficiency:
    • Outperforms baselines like H2O and Scissorhands on 12–14 out of 16 LongBench tasks.
    • Achieves effective 50% compression with minimal performance loss.
    • Introduces negligible computational overhead and is compatible with FlashAttention when combined with Scissorhands.
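The scoring rule itself is short; this sketch (assuming a full attention matrix is available, which real implementations avoid) combines each token's accumulated attention with the $L_1$ norm of its value vector, as in the formula above:

```python
import numpy as np

def vatp_scores(attn, V):
    """VATP importance scores I_k = S_k * ||v_k||_1.
    attn: (t, t) causal attention weights; V: (t, d_v) cached value vectors."""
    S = attn.sum(axis=0)              # S_k: attention accumulated by token k
    v_l1 = np.abs(V).sum(axis=1)      # ||v_k||_1 for each value vector
    return S * v_l1                   # prune the lowest-scoring tokens
```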

Post-hoc Compression Techniques

These methods compress or optimize the KV cache while preserving the standard transformer architecture.

4) Adaptive KV Compression (FastGen)

FastGen introduces adaptive compression based on attention patterns observed at run time:

  • Attention Profiling: During prompt encoding, FastGen identifies each head's attention pattern and selects a compression policy $C^*$ for it.
  • Adaptive Compression Policies:
    • Compression strategies include:
      • Special tokens ($C_{\text{special}}$)
      • Locality ($C_{\text{local}}$)
      • Frequency ($C_{\text{frequent}}$)
      • Hybrid policies that combine strategies, starting with $C_{\text{special}}$
  • Token Generation:
    • During decoding, the pre-selected compression policies manage the KV cache efficiently (a simplified hybrid-policy sketch follows this list):
      $K_{C_i}, V_{C_i} = f(K, V, C_i)$
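Here is a simplified sketch of what a hybrid policy might keep for one head: special tokens, a local window, and the most frequently attended positions. The boolean-mask formulation and all thresholds are illustrative assumptions, not FastGen's actual code.

```python
import numpy as np

def hybrid_policy_mask(token_ids, attn_freq, special_ids, n_local=64, n_frequent=128):
    """Boolean mask over cached positions kept by a hybrid policy:
    C_special, C_local, and C_frequent combined (applied per attention head)."""
    n = len(token_ids)
    keep = np.zeros(n, dtype=bool)
    keep[np.isin(token_ids, list(special_ids))] = True   # C_special
    keep[max(0, n - n_local):] = True                    # C_local (recent window)
    keep[np.argsort(attn_freq)[-n_frequent:]] = True     # C_frequent
    return keep
```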

5) Dynamic Memory Compression (DMC)

DMC introduces adaptive token merging:

  • Decision Mechanism: At time $t$, the model predicts a merge decision $\alpha_t \in \{0, 1\}$ that determines whether the new key/value pair is appended to the cache or merged into the last entry.
  • Weighted Merging: When $\alpha_t = 1$, the new key and value are folded into the most recent cache entry via a weighted average instead of being appended (see the sketch after this list).
  • Training:
    • Uses a Gumbel-Sigmoid relaxation of $\alpha_t$ so the merge decisions stay differentiable
    • Optimizes a combined objective:
      $\mathcal{L} = \mathcal{L}_{\text{LM}} + \lambda \max\left(0, \frac{n}{\text{CR}} - \sum_{t} \alpha_t \right)$
  • Results: Up to 8× compression with preserved performance.
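A sketch of the cache update under a hard merge decision (the differentiable Gumbel-Sigmoid machinery is omitted, and the simple running-average weighting is an assumption for illustration):

```python
def dmc_update(cache_K, cache_V, counts, k_new, v_new, alpha):
    """Append the new key/value if alpha == 0; otherwise merge it into the
    last cache slot with a running (weighted) average."""
    if alpha == 1 and cache_K:
        c = counts[-1]
        cache_K[-1] = (c * cache_K[-1] + k_new) / (c + 1)
        cache_V[-1] = (c * cache_V[-1] + v_new) / (c + 1)
        counts[-1] = c + 1
    else:
        cache_K.append(k_new)
        cache_V.append(v_new)
        counts.append(1)
```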

6) $L_2$ Norm-Based Compression

This paper presents a surprising observation: there is a clear correlation between the $L_2$ norm of a cached key and the attention it later receives, with low-norm keys tending to attract the highest attention scores.

  • Norm-Based Selection: For a set of cached keys $K = \{k_1, k_2, \dots, k_n\}$, compute each key's norm $\|k_i\|_2$.
  • Sorting and Selection: To compress the KV cache, sort all keys by their $L_2$ norm values:
    $K_{\text{sorted}} = \text{Sort}\big(\{\|k_1\|_2, \|k_2\|_2, \dots, \|k_n\|_2\}\big)$
  • Compressed Cache: The compressed key-value cache keeps the keys whose norms fall in the first $m$ entries of the sorted list, together with their corresponding values (see the sketch after this list):
    $K_{\text{compressed}} = \{k_i \mid \|k_i\|_2 \in K_{\text{sorted}}[1:m]\}, \quad V_{\text{compressed}} = \{v_i \mid k_i \in K_{\text{compressed}}\}$
  • Due to its simplicity, this approach maintains compatibility with FlashAttention.
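A sketch of the resulting compression step, under the assumption (following the observation above) that the keys worth keeping are the ones with the smallest $L_2$ norm:

```python
import numpy as np

def l2_compress(K, V, m):
    """Keep the m cached key/value pairs whose keys have the smallest L2 norm.
    K: (n, d_k) cached keys; V: (n, d_v) cached values."""
    norms = np.linalg.norm(K, axis=1)        # ||k_i||_2
    keep = np.sort(np.argsort(norms)[:m])    # m lowest-norm positions, in order
    return K[keep], V[keep]
```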

Architectural Redesigns

These approaches modify the transformer architecture to handle KV caches more efficiently, often building compression directly into the architecture.

7) Multi-Query Attention (MQA)

  • Key Idea: MQA reduces the KV cache size by sharing a single key-value head across all query heads, replacing traditional Multi-Head Attention (MHA):
    $K = X W_K, \quad V = X W_V$
    where a single $W_K$ and $W_V$ serve every query head.
  • Benefits: Reduces the KV cache size by a factor of $H$ (the number of attention heads), significantly lowering memory bandwidth overhead.
  • Trade-Off: While MQA is faster, it often suffers from quality degradation, especially on tasks requiring diverse attention patterns.

8) Group-Query Attention (GQA)

  • Key Idea: GQA interpolates between full multi-head attention and MQA, offering a scalable trade-off between inference speed and model quality. It divides the query heads into $G$ groups, where each group shares a single key-value head:
    $K_{\text{group}} = \frac{1}{|G|} \sum_{h \in G} K_h, \quad V_{\text{group}} = \frac{1}{|G|} \sum_{h \in G} V_h$
  • Uptraining: GQA can be introduced to existing pre-trained models through fine-tuning:
    • First, convert the MHA checkpoint to GQA by mean-pooling the key and value heads within each group (see the sketch after this list)
    • Then fine-tune ("uptrain") the model briefly so it adapts to the new attention pattern
    • This conversion requires only about 5% of the original pre-training compute, making it very efficient
    • The resulting model maintains quality while gaining GQA's memory benefits
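A minimal sketch of the mean-pooling initialization mentioned above: per-head key/value projection matrices are averaged within each group. With `n_groups = 1` this collapses to MQA; the shapes are illustrative.

```python
import numpy as np

def mha_to_gqa(W_K_heads, W_V_heads, n_groups):
    """Initialize GQA projections by mean-pooling MHA heads within each group.
    W_K_heads, W_V_heads: (H, d_model, d_k) per-head projection matrices."""
    H = W_K_heads.shape[0]
    assert H % n_groups == 0, "heads must divide evenly into groups"
    per_group = H // n_groups
    W_K_g = W_K_heads.reshape(n_groups, per_group, *W_K_heads.shape[1:]).mean(axis=1)
    W_V_g = W_V_heads.reshape(n_groups, per_group, *W_V_heads.shape[1:]).mean(axis=1)
    return W_K_g, W_V_g          # (G, d_model, d_k); briefly uptrain afterwards
```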

9) Multi-head Latent Attention (MLA)

Multi-Head Latent Attention (MLA) shares the goal of reducing KV cache overhead with MQA/GQA but achieves it through low-rank latent compression rather than head sharing.

  • MLA reduces KV cache size by compressing keys and values into low-dimensional latent vectors and reconstructing them when needed.
  • It down-projects the key-value embeddings into a compressed latent space (see the sketch after this list):
    $c_{\text{KV},t} = W_{\text{DKV}} h_t, \quad k_C = W_{\text{UK}} c_{\text{KV},t}, \quad v_C = W_{\text{UV}} c_{\text{KV},t}$
  • It preserves per-head flexibility through the compressed representations, unlike MQA's complete head sharing.
  • It handles Rotary Positional Embeddings (RoPE) through a decoupled position-aware key:
    $k_R = \text{RoPE}(W_{KR} h_t), \quad k_t = [k_C; k_R]$
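A sketch of what gets cached per token under MLA, following the formulas above: only the small latent $c_{\text{KV},t}$ and the RoPE key are stored, while the content keys and values are reconstructed from the latent when needed. The `rope` function and all shapes are placeholders.

```python
import numpy as np

def mla_cache_and_expand(h_t, W_DKV, W_UK, W_UV, W_KR, rope):
    """Compress the hidden state into a latent, then reconstruct k and v.
    h_t: (d_model,); W_DKV: (d_c, d_model); W_UK, W_UV: (d_k, d_c); W_KR: (d_r, d_model)."""
    c_kv = W_DKV @ h_t            # low-dimensional latent (this is what gets cached)
    k_C = W_UK @ c_kv             # reconstructed "content" key
    v_C = W_UV @ c_kv             # reconstructed value
    k_R = rope(W_KR @ h_t)        # decoupled positional key (also cached)
    k_t = np.concatenate([k_C, k_R])
    return c_kv, k_R, k_t, v_C
```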

10) SnapKV

  • SnapKV introduces an observation window: it uses the end-of-prompt tokens to identify which earlier positions matter:
    $C = \sum_{i=0}^{L_{\text{obs}}} W_{\text{obs}}[:, i, :], \quad I = \text{Top}_k(C, k)$
  • Compression: Clusters features around the selected positions using a pooling layer to preserve context completeness (a combined sketch follows).
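A simplified sketch of the selection step: attention paid by the observation-window queries is summed per earlier position, smoothed with a small pooling pass so neighbouring tokens are kept together, and the top-$k$ positions are retained. Window and pooling sizes are illustrative.

```python
import numpy as np

def snapkv_select(attn, L_obs=16, k=256, pool=7):
    """Positions to keep, chosen from the prompt's attention weights.
    attn: (n, n) prompt attention; the last L_obs rows form the observation window."""
    votes = attn[-L_obs:, :-L_obs].sum(axis=0)       # attention each position receives
    half = pool // 2
    pooled = np.array([votes[max(0, i - half): i + half + 1].max()
                       for i in range(len(votes))])  # crude clustering via max-pooling
    return np.sort(np.argsort(pooled)[-k:])          # top-k positions, in order
```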

11) You Only Cache Once (YOCO)

YOCO modifies the transformer architecture for caching:

  • Global Cache: Uses a decoder-decoder design with a single shared KV cache.
  • Complexity Reduction: Reduces cache memory from $O(N \times L)$ to $O(N)$, since the cache is shared across layers instead of being duplicated in each one.
  • Efficient Attention: The self-decoder uses sliding-window attention or gated retention, enabling constant memory usage ($O(C)$, where $C$ is a small window size).

Key-value caching techniques are central to scaling and optimizing transformer-based models for real-world use. Innovations like dynamic eviction, compression, and structured approximations continue to push the boundaries of what is possible in long-context and resource-constrained scenarios. KV caching remains a vibrant research area, offering both theoretical insights and practical improvements.

PS: This blog post is mostly AI-generated using a PySpur workflow with minimal human edits.


