How weight matrices are split within a layer, and how SP handles the regions TP can't
A single linear layer computes Y = XW + b. The weight matrix W is large, and there are two ways to split it across GPUs: by columns (column parallelism) or by rows (row parallelism).
Column parallelism: each GPU holds the full input X but only 1/4 of the weight columns (assuming 4 GPUs). The output is naturally partitioned; each GPU produces a different slice of Y's columns, so no AllReduce is needed after this operation.
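Column parallelism can be simulated on a single machine. A minimal NumPy sketch, where each "GPU" is just a shard of W (variable names like `W_shards` are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
b_s, h = 8, 16                       # flattened (batch * seq) tokens, hidden size
n_gpus = 4

X = rng.standard_normal((b_s, h))
W = rng.standard_normal((h, 4 * h))  # up-projection weight

# Column parallelism: split W into 4 column blocks, one per "GPU".
W_shards = np.split(W, n_gpus, axis=1)

# Each GPU computes its own column slice of Y using the full X; no communication.
Y_shards = [X @ W_i for W_i in W_shards]

# Concatenating the shards recovers the full output exactly.
Y_full = np.concatenate(Y_shards, axis=1)
assert np.allclose(Y_full, X @ W)
```

The concatenation at the end is only for verification; in real TP the shards are deliberately kept apart so the next layer can consume them directly.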
The MLP block has two linear layers: up-projection (h → 4h) then down-projection (4h → h). TP cleverly combines column and row parallelism to need only ONE AllReduce.
A standard Transformer MLP (FFN) has two linear layers with a GeLU activation in between:
The key insight of Megatron-LM: use Column Parallel for W₁ and Row Parallel for W₂. They fit together perfectly.
Split W₁ by columns across 4 GPUs. Each GPU gets W₁ᵢ with shape [h × h].
GeLU is applied element-wise to each GPU's partial activation independently. Since each GPU has a complete slice of the columns, GeLU can be applied without any communication.
Now W₂ is split by rows. Each GPU already has the right input slice from the column parallel output!
The genius of Megatron-LM: Column parallel feeds directly into Row parallel with zero communication in between!
Only 1 AllReduce per MLP block. The column-to-row transition is seamless — each GPU's column-partitioned output is exactly the input slice the row-parallel layer needs. No reshuffling required.
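The full column-then-row MLP can be checked numerically. A hedged NumPy sketch (the AllReduce is modeled as a plain `sum`; the tanh-approximation GeLU is one common variant, used here for simplicity):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU; applied element-wise, so it commutes with
    # column sharding of the activation.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(1)
b_s, h, n_gpus = 8, 16, 4
X = rng.standard_normal((b_s, h))
W1 = rng.standard_normal((h, 4 * h))   # up-projection
W2 = rng.standard_normal((4 * h, h))   # down-projection

W1_cols = np.split(W1, n_gpus, axis=1)  # column parallel
W2_rows = np.split(W2, n_gpus, axis=0)  # row parallel

# Each GPU: column-parallel matmul -> local GeLU -> row-parallel matmul.
# No communication anywhere in this chain.
partials = [gelu(X @ W1_i) @ W2_i for W1_i, W2_i in zip(W1_cols, W2_rows)]

# The single AllReduce (modeled as a sum) produces the final output.
Y = sum(partials)
assert np.allclose(Y, gelu(X @ W1) @ W2)
```

The assertion confirms the key claim: summing the per-GPU partials reproduces the unsharded MLP exactly, with one collective per block.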
Attention has Q, K, V projections and an output projection. TP splits attention heads across GPUs — each GPU handles independent heads.
The Q, K, V weight matrices are split by columns, which maps to splitting attention heads across GPUs.
After attention, each GPU has partial output from its heads. The output projection uses row parallelism.
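The head-per-GPU scheme can be verified the same way. A NumPy sketch (one head per "GPU" for simplicity; real configs place several heads per GPU, and the final `sum` again stands in for the AllReduce):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
s, h, n_heads = 6, 16, 4              # one head per "GPU"
d = h // n_heads                      # per-head dimension
X = rng.standard_normal((s, h))
Wq, Wk, Wv, Wo = (rng.standard_normal((h, h)) for _ in range(4))

head_outs, partials = [], []
for i in range(n_heads):
    cols = slice(i * d, (i + 1) * d)
    # A column shard of Wq/Wk/Wv is exactly the parameters of head i.
    Q, K, V = X @ Wq[:, cols], X @ Wk[:, cols], X @ Wv[:, cols]
    A = softmax(Q @ K.T / np.sqrt(d)) @ V   # head i runs fully locally
    head_outs.append(A)
    partials.append(A @ Wo[cols, :])        # row shard of output projection

Y = sum(partials)                           # the single AllReduce
# Matches standard multi-head attention: concat heads, then project.
assert np.allclose(Y, np.concatenate(head_outs, axis=1) @ Wo)
```

Because softmax mixes only within a head, each GPU's attention is fully independent, which is why the column split of Q, K, V needs no communication.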
TP handles linear layers brilliantly — but LayerNorm and Dropout can't be tensor-parallelized. SP solves this.
In a transformer block, some operations cannot be split along the hidden dimension by TP: LayerNorm, which needs the full hidden dimension to compute its mean and variance, plus Dropout and the residual additions. SP shards these along the sequence dimension instead.
A complete animated view of one transformer block with TP+SP working together. Every activation is partitioned — zero redundancy.
TP+SP together ensure EVERY activation is partitioned — no redundancy anywhere. In TP regions, activations are split along the hidden dimension. In non-TP regions (LayerNorm, Dropout, Residual), activations are split along the sequence dimension. The AllGather and ReduceScatter ops handle the transitions seamlessly.
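The two transitions can be sketched with NumPy as well. Here the AllGather is modeled as a concatenation along the sequence axis and the ReduceScatter as a sum followed by a sequence split; the TP region is modeled as a single row-parallel matmul (an illustrative stand-in for attention or the MLP):

```python
import numpy as np

rng = np.random.default_rng(3)
s, h, n_gpus = 8, 16, 4
c = h // n_gpus
X = rng.standard_normal((s, h))
W = rng.standard_normal((h, h))

# SP region: activations sharded along the sequence dimension; each GPU can
# apply LayerNorm/Dropout to its own tokens with no communication.
seq_shards = np.split(X, n_gpus, axis=0)

# AllGather (SP -> TP transition): each GPU reassembles the full sequence.
X_full = np.concatenate(seq_shards, axis=0)

# TP region, modeled as a row-parallel matmul: each GPU uses an input slice
# and a row shard of W, producing a partial sum over the full sequence.
partials = [X_full[:, i*c:(i+1)*c] @ W[i*c:(i+1)*c, :] for i in range(n_gpus)]

# ReduceScatter (TP -> SP transition): sum the partials and re-shard along
# the sequence dimension in a single collective.
Y_seq_shards = np.split(sum(partials), n_gpus, axis=0)
assert np.allclose(np.concatenate(Y_seq_shards, axis=0), X @ W)
```

At every point in this pipeline each "GPU" holds only a shard, either of the sequence axis or of the hidden axis, which is the zero-redundancy property described above.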
All communication ops per transformer block, and why TP+SP has the same bandwidth cost but better memory.
| Operation | When | Volume | Role |
|---|---|---|---|
| AllGather | Before Attention | O(b × s × h) | SP → TP transition |
| ReduceScatter | After Attention | O(b × s × h) | TP → SP transition |
| AllGather | Before MLP | O(b × s × h) | SP → TP transition |
| ReduceScatter | After MLP | O(b × s × h) | TP → SP transition |
In vanilla TP, each AllReduce is internally a ReduceScatter followed by an AllGather, and the non-TP regions hold fully replicated activations.
Same total communication volume as 2 AllReduces! But activations are always partitioned.
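The AllReduce = ReduceScatter + AllGather identity is easy to check numerically. A small NumPy sketch, where each collective is modeled by its mathematical effect:

```python
import numpy as np

rng = np.random.default_rng(4)
n_gpus, n = 4, 8
c = n // n_gpus
# Each "GPU" holds a partial-sum vector, as after a row-parallel matmul.
partials = [rng.standard_normal(n) for _ in range(n_gpus)]

# AllReduce: every GPU ends up with the full sum (n elements each).
allreduce = sum(partials)

# ReduceScatter: GPU i ends up with only chunk i of the sum (n/4 elements)...
chunks = [sum(p[i*c:(i+1)*c] for p in partials) for i in range(n_gpus)]

# ...AllGather: concatenating the chunks reproduces the AllReduce result.
assert np.allclose(np.concatenate(chunks), allreduce)
```

This is why TP+SP moves the same number of bytes as vanilla TP: the two halves of each AllReduce are simply pulled apart, with the TP region and the SP region sitting between them.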