Context Parallelism splits long sequences across GPUs so each device only computes attention for its chunk — enabling million-token contexts without running out of memory.
Self-attention memory scales quadratically with sequence length. At 128K tokens, a single GPU simply cannot hold the attention matrix.
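The quadratic blow-up is easy to check with a back-of-envelope calculation. The sketch below (a hypothetical helper, not from any library) sizes a single S × S score matrix in fp16 (2 bytes per element):

```python
# Back-of-envelope memory for one S x S attention score matrix.
def attn_matrix_bytes(seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to materialize a full S x S score matrix (fp16 default)."""
    return seq_len * seq_len * bytes_per_elem

for s in (8_192, 32_768, 131_072):
    gib = attn_matrix_bytes(s) / 2**30
    print(f"S = {s:>7,}: {gib:8.3f} GiB")
# Quadrupling S multiplies memory by 16: 0.125 GiB -> 2 GiB -> 32 GiB.
```

Doubling the sequence length quadruples the memory, which is why 128K tokens is out of reach for a single device.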
Drag the slider to see how attention memory explodes with sequence length.
At S = 128K, the attention matrix alone is 32 GB in fp16 — nearly half an A100's memory, leaving no room for model weights, gradients, or activations. Context Parallelism splits this across GPUs.
Arrange GPUs in a ring. Each GPU holds Q for its chunk and passes K,V around the ring, computing partial attention at each step.
Each GPU starts with its own K,V chunk and computes local Q × Kᵀ attention.
At each step, only 1/P of K,V is in flight between neighbors. Each GPU materializes only an O(S²/P²) attention block at a time and stores O(S/P) of K,V, so per-GPU memory shrinks at least linearly with the number of GPUs.
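The ring pass can be sketched in a single process. This is a minimal NumPy simulation (no causal mask, no real communication — the "GPUs" are just loop iterations): each rank keeps its Q chunk fixed while K,V chunks rotate around the ring, and partial results are merged with a running (online) softmax so no rank ever builds the full S × S matrix.

```python
import numpy as np

def ring_attention(Q, K, V, P):
    """Simulate P ring-attention ranks; returns the same output as full attention."""
    S, d = Q.shape
    c = S // P                                    # chunk length per "GPU"
    Qs = [Q[i*c:(i+1)*c] for i in range(P)]
    KVs = [(K[i*c:(i+1)*c], V[i*c:(i+1)*c]) for i in range(P)]
    outs = []
    for rank in range(P):
        q = Qs[rank]
        m = np.full((c, 1), -np.inf)              # running row-wise max
        l = np.zeros((c, 1))                      # running softmax denominator
        acc = np.zeros((c, d))                    # running weighted sum of V
        for step in range(P):                     # K,V chunk "arriving" this step
            k, v = KVs[(rank + step) % P]
            scores = q @ k.T / np.sqrt(d)         # only a (c x c) block at a time
            m_new = np.maximum(m, scores.max(axis=1, keepdims=True))
            scale = np.exp(m - m_new)             # rescale old partials
            p = np.exp(scores - m_new)
            l = l * scale + p.sum(axis=1, keepdims=True)
            acc = acc * scale + p @ v
            m = m_new
        outs.append(acc / l)
    return np.vstack(outs)

# Sanity check against monolithic softmax attention.
rng = np.random.default_rng(0)
S, d, P = 16, 8, 4
Q, K, V = (rng.standard_normal((S, d)) for _ in range(3))
scores = Q @ K.T / np.sqrt(d)
w = np.exp(scores - scores.max(axis=1, keepdims=True))
full = (w / w.sum(axis=1, keepdims=True)) @ V
assert np.allclose(ring_attention(Q, K, V, P), full)
```

In a real implementation, the inner loop's indexing would be replaced by point-to-point sends and receives between neighboring devices, overlapped with the block computation.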
Naive chunking + causal masking creates severe load imbalance. Zigzag interleaving fixes this elegantly.
With naive chunking, GPU 0 gets the earliest tokens and must attend to all chunks (the top-left triangle is full). GPU 3 gets the latest tokens but its causal mask blocks most of the attention — leaving it mostly idle.
Row = query token, Col = key token. Colored = computed, dark = masked.
Number of attention computations (Q×K pairs)
By assigning tokens in a zigzag pattern (first + last, second + second-to-last, ...), each GPU gets a mix of early and late tokens. Under causal masking, this ensures each GPU performs roughly the same number of attention computations — eliminating the load imbalance problem.
Both methods use the same ring communication topology. Zigzag adds smarter token assignment for causal models.
Contiguous chunks → Uneven causal work
Interleaved tokens → Balanced causal work
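The imbalance (and the fix) can be verified by counting unmasked Q-K pairs per GPU. A small sketch, with illustrative helper names — under causal masking each query attends to every key at or before its position, and over a full ring pass each GPU eventually sees all K,V:

```python
def causal_work(token_ids):
    """Unmasked Q-K pairs: query at position q attends to q + 1 keys."""
    return sum(q + 1 for q in token_ids)

def naive_chunks(S, P):
    """Contiguous split: GPU i gets tokens [i*S/P, (i+1)*S/P)."""
    c = S // P
    return [list(range(i * c, (i + 1) * c)) for i in range(P)]

def zigzag_chunks(S, P):
    """Split into 2P blocks; GPU i gets block i and its mirror block 2P-1-i."""
    c = S // (2 * P)
    blocks = [list(range(j * c, (j + 1) * c)) for j in range(2 * P)]
    return [blocks[i] + blocks[2 * P - 1 - i] for i in range(P)]

S, P = 64, 4
print("naive :", [causal_work(t) for t in naive_chunks(S, P)])   # skewed
print("zigzag:", [causal_work(t) for t in zigzag_chunks(S, P)])  # uniform
```

With S = 64 and P = 4, the naive split gives work counts of 136, 392, 648, and 904 pairs — the last GPU does over 6× the work of the first — while the zigzag split gives exactly 520 pairs on every GPU.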
| Metric | Naive Ring Attention | Zigzag Ring Attention |
|---|---|---|
| Token Assignment | Contiguous chunks [0..S/P-1], [S/P..2S/P-1], ... | Interleaved: GPU i gets tokens i, 2P-1-i, 2P+i, ... |
| Load Balance (Causal) | Poor — GPU 0 does most work, GPU P-1 least | Excellent — near-perfect balance |
| Memory per GPU | O(S²/P²) attention + O(S/P) for K,V | Same: O(S²/P²) attention + O(S/P) for K,V |
| Communication Volume | P-1 rounds of K,V transfer, each O(S/P) | Same: P-1 rounds of K,V transfer |
| Communication Pattern | Ring (each GPU sends to next neighbor) | Same ring topology |
| Causal Mask Efficiency | Wastes compute on masked positions | Minimal wasted compute |
| Implementation Complexity | Simple | Slightly more complex token remapping |
| Best For | Bidirectional attention (BERT, encoders) | Causal / autoregressive models (GPT, LLaMA) |
For bidirectional attention (BERT-style), naive Ring Attention is fine — every GPU does the same work. For causal/autoregressive models (GPT, LLaMA, etc.), Zigzag Ring Attention is strictly better: same communication cost, same memory, but near-perfectly balanced workloads.