Tensor Parallelism

Tensor & Sequence Parallelism Visualized

How weight matrices are split within a layer, and how SP handles the regions TP can't


The Core Idea — Splitting Weight Matrices

A single linear layer computes Y = XW + b. The weight matrix W is large, and there are two ways to split it across GPUs.

Y = X × W + b
W has shape [hidden_dim × hidden_dim], too large for one GPU
Column Parallel: Split W along its columns. W = [W₀ | W₁ | W₂ | W₃]. Input X is replicated on all GPUs. Each GPU computes Yᵢ = X × Wᵢ, producing a slice of the output columns. No communication needed!
Input X (replicated) × Weight W (column-split) = Output Y (column-split), across GPU 0, 1, 2, 3
Key Insight

Each GPU holds the full input X but only 1/4 of the weight columns. The output is naturally partitioned — each GPU gets different columns of Y. No AllReduce needed after this operation.
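The column-parallel split can be sketched in a few lines of NumPy, simulating the four GPUs as entries of a Python list (all names and toy shapes here are illustrative, not part of any real TP implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, h, P = 4, 8, 4                     # toy sizes: tokens, hidden dim, "GPU" count

X = rng.standard_normal((tokens, h))       # input, replicated on every "GPU"
W = rng.standard_normal((h, h))            # full weight [h, h]

# Column parallel: "GPU" i holds a column slice W_i of shape [h, h/P]
W_shards = np.split(W, P, axis=1)

# Each GPU computes its own output columns -- no communication needed
Y_shards = [X @ W_i for W_i in W_shards]

# Concatenating the column shards recovers the full output exactly
assert np.allclose(np.concatenate(Y_shards, axis=1), X @ W)
```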

TP in the Transformer MLP

The MLP block has two linear layers: up-projection (h → 4h) then down-projection (4h → h). TP combines column and row parallelism so that only ONE AllReduce is needed.

Step 1: The MLP Block Structure

A standard Transformer MLP (FFN) has two linear layers with a GeLU activation in between:

MLP(x) = GeLU(x × W₁) × W₂
W₁: [h × 4h] (up-projection)    W₂: [4h × h] (down-projection)

The key insight of Megatron-LM: use Column Parallel for W₁ and Row Parallel for W₂. They fit together perfectly.

Step 2: First Linear — Column Parallel

Split W₁ by columns across 4 GPUs. Each GPU gets W₁ᵢ with shape [h × h].

X (replicated) [b×s, h] × W₁ (col-split) [h, 4h] = Y₁ (col-split) [b×s, 4h]
No communication! Each GPU has the full X and its column slice of W₁. The output is naturally split by columns.

Step 3: GeLU Activation

GeLU is applied element-wise to each GPU's partial activation independently. Since each GPU has a complete slice of the columns, GeLU can be applied without any communication.

Y₁ᵢ (on each GPU) → GeLU(Y₁ᵢ) (element-wise) → Zᵢ (partial activations)
Still no communication! GeLU is element-wise, so each GPU applies it to its local slice. This is why column parallelism is used for the first linear: the non-linearity works independently on each partition.
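Because GeLU is element-wise, applying it shard-by-shard gives the same result as applying it to the full tensor and then splitting. A quick NumPy check, using a common tanh approximation of GeLU (toy shapes, illustrative names):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU (the variant popularized by GPT-2)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
Y1 = rng.standard_normal((4, 16))          # full first-linear output [tokens, 4h]
shards = np.split(Y1, 4, axis=1)           # column shards, one per "GPU"

# Element-wise op: GeLU per shard == column shards of GeLU on the full tensor
Z_shards = [gelu(s) for s in shards]
assert np.allclose(np.concatenate(Z_shards, axis=1), gelu(Y1))
```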

Step 4: Second Linear — Row Parallel

Now W₂ is split by rows. Each GPU already has the right input slice from the column parallel output!

Zᵢ (col-partitioned) [b×s, 4h] × W₂ (row-split) [4h, h] → AllReduce (sum partials → full Y)
AllReduce here! Each GPU has a partial sum. We need to sum across all GPUs to get the final output Y. This is the only communication in the MLP block.
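The row-parallel step and its concluding AllReduce can be simulated in NumPy by modeling the AllReduce as a plain sum over per-GPU partial products (toy shapes; GPUs are list entries):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, h, P = 4, 8, 4
Z  = rng.standard_normal((tokens, 4 * h))  # full GeLU output [b×s, 4h]
W2 = rng.standard_normal((4 * h, h))       # down-projection [4h, h]

# "GPU" i already holds column slice Z_i from the col-parallel step,
# and row slice W2_i of the row-split weight -- the shapes line up exactly.
Z_shards  = np.split(Z, P, axis=1)
W2_shards = np.split(W2, P, axis=0)

# Each partial product has the FULL output shape [b×s, h], but only a partial sum
partials = [Z_i @ W2_i for Z_i, W2_i in zip(Z_shards, W2_shards)]

Y = sum(partials)                          # the AllReduce: sum across GPUs
assert np.allclose(Y, Z @ W2)
```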

Step 5: Complete MLP Data Flow

The genius of Megatron-LM: Column parallel feeds directly into Row parallel with zero communication in between!

X (replicated) → Col Parallel (W₁ split) → GeLU (local) → Row Parallel (W₂ split) → AllReduce (sum) → Y (replicated)
Result

Only 1 AllReduce per MLP block. The column-to-row transition is seamless — each GPU's column-partitioned output is exactly the input slice the row-parallel layer needs. No reshuffling required.
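Putting the three steps together, here is a minimal NumPy sketch of the whole tensor-parallel MLP, with the single AllReduce modeled as a sum over per-GPU results (all names and sizes are toy values, not a real implementation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
tokens, h, P = 4, 8, 4
X  = rng.standard_normal((tokens, h))      # replicated input
W1 = rng.standard_normal((h, 4 * h))       # up-projection
W2 = rng.standard_normal((4 * h, h))       # down-projection

W1_cols = np.split(W1, P, axis=1)          # column-parallel split
W2_rows = np.split(W2, P, axis=0)          # row-parallel split

# Per-"GPU" pipeline: col-parallel matmul -> local GeLU -> row-parallel matmul.
# No communication anywhere inside this list comprehension.
partials = [gelu(X @ W1_i) @ W2_i for W1_i, W2_i in zip(W1_cols, W2_rows)]

Y = sum(partials)                          # the ONE AllReduce per MLP block
assert np.allclose(Y, gelu(X @ W1) @ W2)
```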

TP in Multi-Head Attention

Attention has Q, K, V projections and an output projection. TP splits attention heads across GPUs — each GPU handles independent heads.

Q, K, V Projections — Column Parallel

The Q, K, V weight matrices are split by columns, which maps to splitting attention heads across GPUs.

32 heads ÷ 4 GPUs = 8 heads/GPU
Each GPU computes attention for its local heads independently
No communication needed! Each head is independent during the attention computation.
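This head independence is easy to verify with a toy NumPy model, splitting the heads into contiguous blocks, one block per simulated GPU (8 heads over 4 "GPUs" here; all sizes illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # q, k, v: [heads, seq, d_head]; scaled dot-product attention per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
P, n_heads, s, d = 4, 8, 5, 4              # "GPUs", heads, seq len, head dim
Q = rng.standard_normal((n_heads, s, d))
K = rng.standard_normal((n_heads, s, d))
V = rng.standard_normal((n_heads, s, d))

# Each "GPU" runs attention on its own block of heads, with no cross-GPU traffic
out_shards = [attention(q, k, v)
              for q, k, v in zip(np.split(Q, P), np.split(K, P), np.split(V, P))]

# Concatenating per-GPU head outputs matches the full multi-head computation
assert np.allclose(np.concatenate(out_shards), attention(Q, K, V))
```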

Output Projection — Row Parallel

After attention, each GPU has partial output from its heads. The output projection uses row parallelism.

Partial outputs → Row Parallel W_o → AllReduce
ONE AllReduce per attention block
AllReduce here! Same pattern as MLP: column parallel Q/K/V feeds into row parallel output projection.

Attention Block Data Flow (32 heads across 4 GPUs)

X (replicated) → Q,K,V (col parallel) → Attention (local heads) → Output Proj (row parallel) → AllReduce (sum) → Y (replicated)

The Sequence Parallelism Complement

TP handles linear layers brilliantly — but LayerNorm and Dropout can't be tensor-parallelized. SP solves this.

The Problem with Pure TP

In a transformer block, some operations work on the full hidden dimension and cannot be split by TP:

LayerNorm Operates on full hidden dim — REPLICATED across all GPUs
Self-Attention Tensor Parallel — split across GPUs
Dropout + Residual REPLICATED — wasted memory!
LayerNorm REPLICATED — wasted memory!
MLP (FFN) Tensor Parallel — split across GPUs
Dropout + Residual REPLICATED — wasted memory!
Every GPU stores identical copies of activations in the non-TP regions. With 4 GPUs, that is 4× redundant memory for LayerNorm, Dropout, and residual connections.

Without SP — Replicated Non-TP Regions

LayerNorm Full activations [b, s, h] on EVERY GPU
Attention (TP) Each GPU: [b, s, h/P] — partitioned
AllReduce sum partial outputs
Dropout + Residual Full activations [b, s, h] on EVERY GPU
LayerNorm Full activations [b, s, h] on EVERY GPU
MLP (TP) Each GPU: [b, s, 4h/P] — partitioned
AllReduce sum partial outputs
Dropout + Residual Full activations [b, s, h] on EVERY GPU

GPU 0 Memory (Without SP)

LayerNorm 1
b×s×h (FULL)
Attention
b×s×h/4
Dropout+Res
b×s×h (FULL)
LayerNorm 2
b×s×h (FULL)
MLP
b×s×4h/4
Dropout+Res
b×s×h (FULL)
4 regions at FULL size = wasted

GPU 1, 2, 3

LayerNorm 1
IDENTICAL copy
Attention
different slice
Dropout+Res
IDENTICAL copy
LayerNorm 2
IDENTICAL copy
MLP
different slice
Dropout+Res
IDENTICAL copy
Exact same data = pure waste!
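The waste is easy to quantify. A back-of-the-envelope sketch in plain Python, with illustrative values (b=8, s=2048, h=4096, P=4, fp16), counting only the four non-TP regions of one transformer block:

```python
# Activation memory in the non-TP regions of ONE transformer block.
# All numbers are illustrative, not from any specific model config.
b, s, h, P, bytes_per = 8, 2048, 4096, 4, 2   # fp16 = 2 bytes/element

full_region = b * s * h * bytes_per           # replicated: every GPU stores this
sp_region   = b * (s // P) * h * bytes_per    # sequence-partitioned: 1/P of it

n_regions = 4   # 2x LayerNorm + 2x Dropout/Residual per block
waste_per_gpu = n_regions * (full_region - sp_region)

print(f"replicated region:    {full_region / 2**20:.0f} MiB")   # 128 MiB
print(f"SP region:            {sp_region / 2**20:.0f} MiB")     # 32 MiB
print(f"waste per GPU/block:  {waste_per_gpu / 2**20:.0f} MiB") # 384 MiB
```

With these sizes, each GPU carries 384 MiB of pure duplication per block without SP; over dozens of layers that dwarfs the savings TP achieves in the linear layers.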

Full TP+SP Data Flow

A complete step-by-step view of one transformer block with TP+SP working together. Every activation is partitioned — zero redundancy.

1. Input (Sequence-Partitioned): each GPU holds s/P tokens with the full hidden dim [h]
   AllGather: gather the sequence chunks → each GPU now has the full [s, h] for the TP region
2. Self-Attention (Tensor Parallel): each GPU computes 8 of 32 heads on the full sequence; Q, K, V col-parallel, attention local
   ReduceScatter: sum the partial outputs from the row-parallel W_o, scatter the result along the sequence dim
3. LayerNorm + Residual (Sequence-Partitioned): each GPU processes its s/P tokens, no redundancy
   AllGather: gather the sequence chunks → each GPU now has the full [s, h] for the MLP TP region
4. MLP / FFN (Tensor Parallel): col-parallel up-projection → GeLU → row-parallel down-projection
   ReduceScatter: sum the partial MLP outputs, scatter along the sequence dim
5. LayerNorm + Residual (Sequence-Partitioned): each GPU has s/P tokens, ready for the next transformer block
The Key Guarantee

TP+SP together ensure EVERY activation is partitioned — no redundancy anywhere. In TP regions, activations are split along the hidden dimension. In non-TP regions (LayerNorm, Dropout, Residual), activations are split along the sequence dimension. The AllGather and ReduceScatter ops handle the transitions seamlessly.
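The claim that an AllReduce decomposes into ReduceScatter + AllGather can be checked with a toy NumPy model, treating each collective as an operation on a list of per-GPU arrays (shapes and seeds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
P, s, h = 4, 8, 6
# One partial result per "GPU", as produced by a row-parallel layer
partials = [rng.standard_normal((s, h)) for _ in range(P)]

# AllReduce: every GPU ends up holding the full sum
allreduce = sum(partials)

# ReduceScatter: same sum, but GPU i keeps only its sequence chunk
rs_shards = np.split(sum(partials), P, axis=0)

# AllGather: concatenating the chunks reproduces the AllReduce result,
# which is why TP+SP moves the same total volume as plain TP
assert np.allclose(np.concatenate(rs_shards, axis=0), allreduce)
```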

Communication Summary

All communication ops per transformer block, and why TP+SP has the same bandwidth cost but better memory.

Operation      Where             Volume         Transition
AllGather      Before Attention  O(b × s × h)   SP → TP
ReduceScatter  After Attention   O(b × s × h)   TP → SP
AllGather      Before MLP        O(b × s × h)   SP → TP
ReduceScatter  After MLP         O(b × s × h)   TP → SP

TP Alone

2 × AllReduce per block

Each AllReduce = ReduceScatter + AllGather internally. Non-TP regions have replicated activations.

Memory: activations replicated in non-TP regions

TP + SP

2 × ReduceScatter + 2 × AllGather per block

Same total communication volume as 2 AllReduces! But activations are always partitioned.

Memory: ALL activations partitioned across GPUs
All communication runs over NVLink (900 GB/s on H100 SXM). TP groups are always within a single node. The high bandwidth of NVLink makes these frequent, fine-grained collective ops fast enough to keep GPU utilization high.