Tensor Parallelism

Tensor & Sequence Parallelism Visualized

How weight matrices are split within a layer, and how SP handles the regions TP can't


The Core Idea — Splitting Weight Matrices

A single linear layer computes Y = XW + b. The weight matrix W is large, and there are two ways to split it across GPUs.

Y = X × W + b
W has shape [hidden_dim × hidden_dim], too large for one GPU
Column Parallel: Split W along its columns. W = [W₀ | W₁ | W₂ | W₃]. Input X is replicated on all GPUs. Each GPU computes Yᵢ = X × Wᵢ, producing a slice of the output columns. No communication needed!
Input X (replicated) × Weight W (column-split) = Output Y (column-split), across GPU 0, 1, 2, 3
Key Insight

Each GPU holds the full input X but only 1/4 of the weight columns. The output is naturally partitioned — each GPU gets different columns of Y. No AllReduce needed after this operation.
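The column-parallel split can be sketched in a few lines of NumPy, simulating the four GPUs as entries of a Python list (all names and toy shapes here are illustrative, not part of any real TP implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, h, P = 4, 8, 4                     # toy sizes: tokens, hidden dim, "GPU" count

X = rng.standard_normal((tokens, h))       # input, replicated on every "GPU"
W = rng.standard_normal((h, h))            # full weight [h, h]

# Column parallel: "GPU" i holds a column slice W_i of shape [h, h/P]
W_shards = np.split(W, P, axis=1)

# Each GPU computes its own output columns -- no communication needed
Y_shards = [X @ W_i for W_i in W_shards]

# Concatenating the column shards recovers the full output exactly
assert np.allclose(np.concatenate(Y_shards, axis=1), X @ W)
```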

TP in the Transformer MLP

The MLP block has two linear layers: up-projection (h → 4h) then down-projection (4h → h). TP combines column and row parallelism so that only ONE AllReduce is needed.

Step 1: The MLP Block Structure

A standard Transformer MLP (FFN) has two linear layers with a GeLU activation in between:

MLP(x) = GeLU(x × W₁) × W₂
W₁: [h × 4h] (up-projection)    W₂: [4h × h] (down-projection)

The key insight of Megatron-LM: use Column Parallel for W₁ and Row Parallel for W₂. They fit together perfectly.

Step 2: First Linear — Column Parallel

Split W₁ by columns across 4 GPUs. Each GPU gets W₁ᵢ with shape [h × h].

X (replicated) [b×s, h] × W₁ (col-split) [h, 4h] = Y₁ (col-split) [b×s, 4h]
No communication! Each GPU has the full X and its column slice of W₁. The output is naturally split by columns.

Step 3: GeLU Activation

GeLU is applied element-wise to each GPU's partial activation independently. Since each GPU has a complete slice of the columns, GeLU can be applied without any communication.

Y₁ᵢ (on each GPU) → GeLU(Y₁ᵢ) (element-wise) → Zᵢ (partial activations)
Still no communication! GeLU is element-wise, so each GPU applies it to its local slice. This is why column parallelism is used for the first linear: the non-linearity works independently on each partition.
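Because GeLU is element-wise, applying it shard-by-shard gives the same result as applying it to the full tensor and then splitting. A quick NumPy check, using a common tanh approximation of GeLU (toy shapes, illustrative names):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (the variant popularized by GPT-2)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
Y1 = rng.standard_normal((4, 16))          # full first-linear output [tokens, 4h]
shards = np.split(Y1, 4, axis=1)           # column shards, one per "GPU"

# Element-wise op: GeLU per shard == column shards of GeLU on the full tensor
Z_shards = [gelu(s) for s in shards]
assert np.allclose(np.concatenate(Z_shards, axis=1), gelu(Y1))
```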

Step 4: Second Linear — Row Parallel

Now W₂ is split by rows. Each GPU already has the right input slice from the column parallel output!

Zᵢ (col-partitioned) [b×s, 4h] × W₂ (row-split) [4h, h] → AllReduce (sum partials → full Y)
AllReduce here! Each GPU has a partial sum. We need to sum across all GPUs to get the final output Y. This is the only communication in the MLP block.
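The row-parallel step and its concluding AllReduce can be simulated in NumPy by modeling the AllReduce as a plain sum over per-GPU partial products (toy shapes; GPUs are list entries):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, h, P = 4, 8, 4
Z  = rng.standard_normal((tokens, 4 * h))  # full GeLU output [b×s, 4h]
W2 = rng.standard_normal((4 * h, h))       # down-projection [4h, h]

# "GPU" i already holds column slice Z_i from the col-parallel step,
# and row slice W2_i of the row-split weight -- the shapes line up exactly.
Z_shards  = np.split(Z, P, axis=1)
W2_shards = np.split(W2, P, axis=0)

# Each partial product has the FULL output shape [b×s, h], but only a partial sum
partials = [Z_i @ W2_i for Z_i, W2_i in zip(Z_shards, W2_shards)]

Y = sum(partials)                          # the AllReduce: sum across GPUs
assert np.allclose(Y, Z @ W2)
```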

Step 5: Complete MLP Data Flow

The genius of Megatron-LM: Column parallel feeds directly into Row parallel with zero communication in between!

X (replicated) → Col Parallel (W₁ split) → GeLU (local) → Row Parallel (W₂ split) → AllReduce (sum) → Y (replicated)
Result

Only 1 AllReduce per MLP block. The column-to-row transition is seamless — each GPU's column-partitioned output is exactly the input slice the row-parallel layer needs. No reshuffling required.
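Putting the three steps together, here is a minimal NumPy sketch of the whole tensor-parallel MLP, with the single AllReduce modeled as a sum over per-GPU results (all names and sizes are toy values, not a real implementation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
tokens, h, P = 4, 8, 4
X  = rng.standard_normal((tokens, h))      # replicated input
W1 = rng.standard_normal((h, 4 * h))       # up-projection
W2 = rng.standard_normal((4 * h, h))       # down-projection

W1_cols = np.split(W1, P, axis=1)          # column-parallel split
W2_rows = np.split(W2, P, axis=0)          # row-parallel split

# Per-"GPU" pipeline: col-parallel matmul -> local GeLU -> row-parallel matmul.
# No communication anywhere inside this list comprehension.
partials = [gelu(X @ W1_i) @ W2_i for W1_i, W2_i in zip(W1_cols, W2_rows)]

Y = sum(partials)                          # the ONE AllReduce per MLP block
assert np.allclose(Y, gelu(X @ W1) @ W2)
```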

TP in Multi-Head Attention

Attention has Q, K, V projections and an output projection. TP splits attention heads across GPUs — each GPU handles independent heads.

Q, K, V Projections — Column Parallel

The Q, K, V weight matrices are split by columns, which maps to splitting attention heads across GPUs.

32 heads ÷ 4 GPUs = 8 heads/GPU
Each GPU computes attention for its local heads independently
No communication needed! Each head is independent during the attention computation.
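This head independence is easy to verify with a toy NumPy model, splitting the heads into contiguous blocks, one block per simulated GPU (8 heads over 4 "GPUs" here; all sizes illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # q, k, v: [heads, seq, d_head]; scaled dot-product attention per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
P, n_heads, s, d = 4, 8, 5, 4              # "GPUs", heads, seq len, head dim
Q = rng.standard_normal((n_heads, s, d))
K = rng.standard_normal((n_heads, s, d))
V = rng.standard_normal((n_heads, s, d))

# Each "GPU" runs attention on its own block of heads, with no cross-GPU traffic
out_shards = [attention(q, k, v)
              for q, k, v in zip(np.split(Q, P), np.split(K, P), np.split(V, P))]

# Concatenating per-GPU head outputs matches the full multi-head computation
assert np.allclose(np.concatenate(out_shards), attention(Q, K, V))
```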

Output Projection — Row Parallel

After attention, each GPU has partial output from its heads. The output projection uses row parallelism.

Partial outputs → Row Parallel W_o → AllReduce
ONE AllReduce per attention block
AllReduce here! Same pattern as MLP: column parallel Q/K/V feeds into row parallel output projection.

Attention Block Data Flow (32 heads across 4 GPUs)

X (replicated) → Q,K,V (col parallel) → Attention (local heads) → Output Proj (row parallel) → AllReduce (sum) → Y (replicated)

The Sequence Parallelism Complement

TP handles linear layers brilliantly — but LayerNorm and Dropout can't be tensor-parallelized. SP solves this.

The Problem with Pure TP

In a transformer block, some operations work on the full hidden dimension and cannot be split by TP:

LayerNorm Operates on full hidden dim — REPLICATED across all GPUs
Self-Attention Tensor Parallel — split across GPUs
Dropout + Residual REPLICATED — wasted memory!
LayerNorm REPLICATED — wasted memory!
MLP (FFN) Tensor Parallel — split across GPUs
Dropout + Residual REPLICATED — wasted memory!
Every GPU stores identical copies of activations in the non-TP regions. With 4 GPUs, that is 4× redundant memory for LayerNorm, Dropout, and residual connections.

Without SP — Replicated Non-TP Regions

LayerNorm Full activations [b, s, h] on EVERY GPU
Attention (TP) Each GPU: [b, s, h/P] — partitioned
AllReduce sum partial outputs
Dropout + Residual Full activations [b, s, h] on EVERY GPU
LayerNorm Full activations [b, s, h] on EVERY GPU
MLP (TP) Each GPU: [b, s, 4h/P] — partitioned
AllReduce sum partial outputs
Dropout + Residual Full activations [b, s, h] on EVERY GPU

GPU 0 Memory (Without SP)

LayerNorm 1
b×s×h (FULL)
Attention
b×s×h/4
Dropout+Res
b×s×h (FULL)
LayerNorm 2
b×s×h (FULL)
MLP
b×s×4h/4
Dropout+Res
b×s×h (FULL)
4 regions at FULL size = wasted

GPU 1, 2, 3

LayerNorm 1
IDENTICAL copy
Attention
different slice
Dropout+Res
IDENTICAL copy
LayerNorm 2
IDENTICAL copy
MLP
different slice
Dropout+Res
IDENTICAL copy
Exact same data = pure waste!
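The waste is easy to quantify. A back-of-the-envelope sketch in plain Python, with illustrative values (b=8, s=2048, h=4096, P=4, fp16), counting only the four non-TP regions of one transformer block:

```python
# Activation memory in the non-TP regions of ONE transformer block.
# All numbers are illustrative, not from any specific model config.
b, s, h, P, bytes_per = 8, 2048, 4096, 4, 2   # fp16 = 2 bytes/element

full_region = b * s * h * bytes_per           # replicated: every GPU stores this
sp_region   = b * (s // P) * h * bytes_per    # sequence-partitioned: 1/P of it

n_regions = 4   # 2x LayerNorm + 2x Dropout/Residual per block
waste_per_gpu = n_regions * (full_region - sp_region)

print(f"replicated region:    {full_region / 2**20:.0f} MiB")   # 128 MiB
print(f"SP region:            {sp_region / 2**20:.0f} MiB")     # 32 MiB
print(f"waste per GPU/block:  {waste_per_gpu / 2**20:.0f} MiB") # 384 MiB
```

With these sizes, each GPU carries 384 MiB of pure duplication per block without SP; over dozens of layers that dwarfs the savings TP achieves in the linear layers.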

Full TP+SP Data Flow

A complete step-by-step view of one transformer block with TP+SP working together. Every activation is partitioned — zero redundancy.

1. Input (Sequence-Partitioned): each GPU holds s/P tokens with the full hidden dim [h]
   AllGather: gather the sequence chunks → each GPU now has the full [s, h] for the TP region
2. Self-Attention (Tensor Parallel): each GPU computes 8 of 32 heads on the full sequence; Q, K, V col-parallel, attention local
   ReduceScatter: sum the partial outputs from the row-parallel W_o, scatter the result along the sequence dim
3. LayerNorm + Residual (Sequence-Partitioned): each GPU processes its s/P tokens, no redundancy
   AllGather: gather the sequence chunks → each GPU now has the full [s, h] for the MLP TP region
4. MLP / FFN (Tensor Parallel): col-parallel up-projection → GeLU → row-parallel down-projection
   ReduceScatter: sum the partial MLP outputs, scatter along the sequence dim
5. LayerNorm + Residual (Sequence-Partitioned): each GPU has s/P tokens, ready for the next transformer block
The Key Guarantee

TP+SP together ensure EVERY activation is partitioned — no redundancy anywhere. In TP regions, activations are split along the hidden dimension. In non-TP regions (LayerNorm, Dropout, Residual), activations are split along the sequence dimension. The AllGather and ReduceScatter ops handle the transitions seamlessly.
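The claim that an AllReduce decomposes into ReduceScatter + AllGather can be checked with a toy NumPy model, treating each collective as an operation on a list of per-GPU arrays (shapes and seeds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
P, s, h = 4, 8, 6
# One partial result per "GPU", as produced by a row-parallel layer
partials = [rng.standard_normal((s, h)) for _ in range(P)]

# AllReduce: every GPU ends up holding the full sum
allreduce = sum(partials)

# ReduceScatter: same sum, but GPU i keeps only its sequence chunk
rs_shards = np.split(sum(partials), P, axis=0)

# AllGather: concatenating the chunks reproduces the AllReduce result,
# which is why TP+SP moves the same total volume as plain TP
assert np.allclose(np.concatenate(rs_shards, axis=0), allreduce)
```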

Communication Summary

All communication ops per transformer block, and why TP+SP has the same bandwidth cost but better memory.

Operation      Where             Volume         Transition
AllGather      Before Attention  O(b × s × h)   SP → TP
ReduceScatter  After Attention   O(b × s × h)   TP → SP
AllGather      Before MLP        O(b × s × h)   SP → TP
ReduceScatter  After MLP         O(b × s × h)   TP → SP

TP Alone

2 × AllReduce per block

Each AllReduce = ReduceScatter + AllGather internally. Non-TP regions have replicated activations.

Memory: activations replicated in non-TP regions

TP + SP

2 × ReduceScatter + 2 × AllGather per block

Same total communication volume as 2 AllReduces! But activations are always partitioned.

Memory: ALL activations partitioned across GPUs
All communication runs over NVLink (900 GB/s on H100 SXM). TP groups are always within a single node. The high bandwidth of NVLink makes these frequent, fine-grained collective ops fast enough to keep GPU utilization high.