You're the founding engineer at Vizz AI. Your startup just raised Series A to build the first high-quality Small Language Model for the Vizz language — spoken by 850 million people but virtually absent from ChatGPT, Claude, and every other LLM.
You've collected 500 billion tokens from local publishers, radio transcripts, government records, and web scrapes. Your training data includes documents up to 128K tokens long — legal proceedings, parliamentary debates, epic folklore.
Your target: a 7 billion parameter transformer. Your cluster is ready. Now you need to figure out how to actually train this thing.
A 7B model in BF16 needs 14 GB just for parameters. Gradients need another 14 GB. Adam optimizer states — FP32 master weights, momentum, and variance — add 84 GB more. That's 16 bytes per parameter, 112 GB of fixed cost before a single activation. And activations at 128K sequence length pile tens of GB on top, accumulated through the forward pass. Your H100 has 80 GB. It doesn't come close to fitting. You need to distribute this across your 64 GPUs. But how?
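The fixed-cost arithmetic recurs throughout this section, so it's worth sketching once. A minimal estimator, assuming BF16 parameters and gradients plus FP32 Adam states — the standard 16-bytes-per-parameter breakdown:

```python
# Fixed memory cost per parameter under mixed-precision Adam:
#   2 bytes (BF16 params) + 2 bytes (BF16 grads)
#   + 4 + 4 + 4 bytes (FP32 master weights, momentum, variance)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16

def fixed_cost_gb(n_params: float) -> float:
    """Parameters + gradients + optimizer states, in GB."""
    return n_params * BYTES_PER_PARAM / 1e9

print(fixed_cost_gb(7e9))  # 112.0 — already past one 80 GB H100
```

Activations come on top of this, and unlike the fixed cost they scale with batch size and sequence length.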
Start from the fastest link. NVLink is 900 GB/s intra-node. Split weight matrices of each layer across 8 GPUs within a node. Each GPU holds 1/8th of each layer.
Take each layer's weight matrix and slice it column-wise (or row-wise) across 8 GPUs within a node. Each GPU does 1/8th of the matrix multiply, then they combine results with AllReduce. Since AllReduce is on the critical compute path, this requires fast NVLink — hence it's kept intra-node. Sequence Parallelism (SP) complements TP by splitting activations along the sequence dimension for operations outside TP regions (LayerNorm, Dropout).
A single layer's weight matrix, split across 8 GPUs in one node: the full 4096 × 4096 matrix is sliced column-wise into eight 4096 × 512 shards, one shard per GPU.
Each GPU within a node holds 1/8th of every layer's weight matrix. They all work on the same layer together.
In TP regions (Attention, MLP), each GPU computes a slice of the hidden dimension. But outside TP regions — LayerNorm, Dropout — all GPUs would need the full activations. SP fixes this by splitting activations along the sequence dimension in non-TP regions.
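A toy sketch of column-parallel TP — plain Python lists stand in for GPU shards, and concatenation stands in for the collective. The 2 × 4 matrix and the two "ranks" are illustrative only, not a real TP implementation:

```python
# Toy column-parallel linear layer: W is split column-wise across
# "GPUs" (here: plain lists); each rank computes a slice of the output.

def matmul(x, w):  # x: [n], w: [n][m] -> [m]
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

x = [1.0, 2.0]                      # activations (replicated on all ranks)
W = [[1.0, 2.0, 3.0, 4.0],          # full weight matrix, shape [2 x 4]
     [5.0, 6.0, 7.0, 8.0]]

# Column-wise split across 2 ranks: each holds a [2 x 2] shard.
shards = [[row[:2] for row in W], [row[2:] for row in W]]

# Each rank computes its output slice; an AllGather (here: concat)
# rebuilds the full output. A row-parallel split would instead produce
# partial sums that are combined with an AllReduce.
partials = [matmul(x, shard) for shard in shards]
full = partials[0] + partials[1]
assert full == matmul(x, W)
print(full)  # [11.0, 14.0, 17.0, 20.0]
```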
| Link Type | Bandwidth | Latency |
|---|---|---|
| NVLink (intra-node) | 900 GB/s | ~1 μs |
| InfiniBand (inter-node) | 50 GB/s | ~5 μs |
| Ratio | 18× faster intra-node | |
TP AllReduce is on the critical compute path — the GPU blocks waiting for the AllReduce to complete before it can proceed to the next operation. With inter-node links, every single layer computation would stall. This is why TP is always kept within a node.
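To put the 18× gap in numbers, here is a rough ring-AllReduce cost model using the standard 2(n−1)/n data-volume estimate; the tensor shape is illustrative:

```python
# Rough time for one AllReduce of `nbytes` across n ranks over a link
# with bandwidth bw_gbps (GB/s), using the ring-AllReduce volume
# 2*(n-1)/n * nbytes. Ignores latency and overlap — a ballpark only.
def allreduce_seconds(nbytes, bw_gbps, n=8):
    return 2 * (n - 1) / n * nbytes / (bw_gbps * 1e9)

act = 1 * 16_384 * 4096 * 2          # a [b=1, S=16K, H=4096] BF16 tensor
print(allreduce_seconds(act, 900))   # over NVLink
print(allreduce_seconds(act, 50))    # over InfiniBand: 18x slower
```

Since this sync sits on the critical path of every layer, paying the 18× penalty per layer is exactly what keeping TP intra-node avoids.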
TP = 8 (saturate the node). Each GPU holds 1/8th of every layer's weight matrix. All 8 GPUs within a node communicate via NVLink AllReduce. With 32 attention heads, each GPU handles 4 heads.
TP=8 is set. Each GPU holds 1/8th of weights per layer. But we still have all 32 layers, and at 128K sequence length, how many sequences can we even process at once? How much memory do activations take? Weight matrices are shape [H, H] — independent of batch size. But activation tensors are shape [b, S, H] — batch size is in every tensor. We need to figure out the maximum micro-batch size before anything else.
GPU memory splits into 4 buckets. Only activations scale with batch size. We need to find the largest micro-batch size (mbs) that fits in memory.
Weight matrices have shapes determined by hidden size H and layers L — independent of how many sequences you process. Whether mbs = 1 or 100, the weight matrix W remains shape [H, H]. But activation tensors include the batch dimension: shape [b, S, H]. Activation memory grows linearly with b.
The fixed costs — parameters, gradients, and optimizer states — total 16N bytes for N parameters: 2 (BF16 params) + 2 (BF16 grads) + 12 (FP32 master weights, momentum, variance). Everything else scales with how many sequences you try to fit.
Take your GPU's total memory, subtract the fixed cost, divide by per-sequence activation cost.
The activation formula has an S² term from attention (the QKᵀ score matrix). Doubling sequence length MORE than halves your max batch size. At 128K, this quadratic term dominates everything. Even mbs=1 might be tight! This is why long-context training is fundamentally harder than short-context training.
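A quick sketch of why the quadratic term explodes, assuming 32 heads and 2-byte scores. Real systems avoid materializing this matrix (FlashAttention computes it blockwise), but the raw arithmetic explains the pressure:

```python
# Attention scores per layer are n_heads * S * S elements — the S^2
# term. Head count and dtype are illustrative assumptions.
def attn_scores_gib(seq_len, n_heads=32, bytes_per_el=2):
    return n_heads * seq_len * seq_len * bytes_per_el / 2**30

print(attn_scores_gib(8_192))    # 4.0 GiB per layer if materialized
print(attn_scores_gib(131_072))  # 1024.0 GiB — 16x the length, 256x the scores
```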
In practice, the formula gives a starting estimate. Then you find the exact max by trial:
Start with b=1, then double: 1 → 2 → 4 → 8 ... until OOM. Last successful = your bmax.
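The doubling search can be sketched with a stubbed `fits()` standing in for a real forward/backward pass wrapped in an OOM try/except:

```python
# Doubling search for the max micro-batch size. `fits` is a stub for
# "run one forward/backward at batch size b without OOM".
def find_max_micro_batch(fits, start=1):
    b = start
    while fits(b * 2):
        b *= 2
    return b  # last successful power-of-two batch size

# Hypothetical memory model: pretend batches up to 6 fit.
print(find_max_micro_batch(lambda b: b <= 6))  # 4
```

Note the doubling lands on the last power of two that fits, matching the 1 → 2 → 4 → 8 probe described above.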
With TP=8, each GPU holds 1/8th of each layer's weights. At 128K sequence length, the S² attention term is massive. Even with bf16, mbs=1 is tight for 32 layers. We'll need activation checkpointing and likely PP to split layers. Once those are set: mbs = 1 (possibly mbs = 2 with aggressive checkpointing).
Even with activation checkpointing, 32 layers of activations at 128K is a lot for one TP group. Each GPU in our TP group still needs to store activations for all 32 layers during the forward pass. Can we reduce the number of layers per GPU? If we split layers across different nodes, each node only needs to store activations for its assigned layers.
Split layers across nodes. Instead of all 32 layers on one GPU set, split into stages. Each stage = a group of layers on a different node. Only activations communicated at boundaries.
Divide the 32 transformer layers into pipeline stages. Each stage = a group of layers assigned to a different node. Input micro-batches flow through the pipeline like an assembly line. Each node only stores and computes activations for its own layers. Communication between stages: just the activation tensor at the boundary — far less than all-gathering full parameters.
Watch how micro-batches flow through pipeline stages. Gray cells = bubble (idle time). More micro-batches shrink the bubble.
With PP=2, one pipeline uses 2 nodes. Each node runs 8 TP GPUs on its layers. Activations flow from Stage 0 to Stage 1.
This one pipeline uses 2 nodes = 16 GPUs. We have 64 GPUs total. What about the other 48?
Fewer stages = less bubble waste. Only add PP stages until the model fits. The pipeline bubble is the time GPUs sit idle waiting for micro-batches to flow through.
| PP Stages | Layers / Stage | Nodes / Pipeline | Bubble (4 micro-batches) |
|---|---|---|---|
| PP=1 | 32 | 1 node | 0% (no bubble) |
| PP=2 | 16 | 2 nodes | ~20% |
| PP=4 | 8 | 4 nodes | ~43% |
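The bubble column follows the standard estimate for a GPipe-style schedule — bubble = (p − 1)/(m + p − 1) for p stages and m micro-batches:

```python
# Fraction of total pipeline time spent idle (the "bubble") for a
# simple GPipe-style schedule with p stages and m micro-batches.
def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

for p in (1, 2, 4):
    print(f"PP={p}: {bubble_fraction(p, m=4):.0%}")  # 0%, 20%, 43%
```

More micro-batches shrink the bubble: at p=4, going from m=4 to m=32 drops it from ~43% to under 9%.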
With TP=8 and activation checkpointing, try PP=2 first (16 layers per stage). This halves activation memory per GPU-set and uses 2 nodes per pipeline. If it fits — stop. Don't go to PP=4 unless necessary. PP = 2.
TP=8, PP=2 uses 16 GPUs per pipeline. We have 64 GPUs. What about the other 48? They need something to do. This is where Data Parallelism comes in — not as a choice, but as what remains after TP and PP are set.
DP is not a choice — it's the remaining GPUs. DP = Total / (TP × PP). Each replica processes different data, then gradients are synchronized via AllReduce.
After TP and PP are set, the remaining GPUs form DP replicas. Each replica is an independent copy of the full pipeline. They process different data in parallel. After each step, gradients are averaged across all DP replicas via AllReduce.
4 DP replicas. Each processes a different micro-batch. All replicas synchronize gradients via AllReduce after every step.
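The synchronization step is just an element-wise mean across replicas. A toy simulation, with small lists standing in for gradient tensors:

```python
# DP gradient sync: each replica computes grads on its own data, then
# an AllReduce averages them so every replica applies the same update.
def allreduce_mean(grads_per_replica):
    n = len(grads_per_replica)
    return [sum(g) / n for g in zip(*grads_per_replica)]

# 4 replicas, each with a 2-element gradient (illustrative values).
replica_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(allreduce_mean(replica_grads))  # [4.0, 5.0]
```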
ZeRO optimization shards memory across the DP replicas. Start with the cheapest level and only go deeper if you're still memory-desperate.
Each GPU: params (0.88 GB) + gradients (0.88 GB) + optimizer states (5.25 GB unsharded, ~1.3 GB with ZeRO-1 over DP=4) + activations (~7.5 GB)
| Level | Shards | Overhead | When |
|---|---|---|---|
| ZeRO-1 | Optimizer states | Almost free | Always |
| ZeRO-2 | + Gradients | Minimal | Good default |
| ZeRO-3 | + Parameters | Heavy | Last resort |
4 DP replicas, each with 2 PP stages, each stage = 1 node with 8 TP GPUs.
4 DP replicas × 2 PP stages × 8 TP GPUs = 64 GPUs total
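A sanity check on the layout arithmetic — the parallelism degrees must multiply out to the cluster size, and TP × PP determines each GPU's weight shard:

```python
# Final layout: every GPU must have exactly one job.
TP, PP, DP, TOTAL_GPUS = 8, 2, 4, 64
assert TP * PP * DP == TOTAL_GPUS

# TP and PP both shard weights, so each GPU holds N / (TP * PP) params.
N_PARAMS = 7e9
params_per_gpu_gb = N_PARAMS / (TP * PP) * 2 / 1e9  # BF16: 2 bytes/param
print(params_per_gpu_gb)  # 0.875 — the ~0.88 GB per-GPU figure
```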
DP = 4, ZeRO-1 (free optimizer sharding). With TP=8 and PP=2, memory is already comfortable. No need for ZeRO-2 or ZeRO-3.
We have 4 replicas, each processes mbs sequences per step. Global batch = mbs × 4. With mbs=1, that's only 4 sequences per step — just 4 × 128K = 512K tokens. Our target might be 4M tokens per step for stable training. How do we hit the target batch size without needing more GPUs?
Use gradient accumulation to bridge the gap between what fits in memory and what training requires.
Gradient accumulation lets you simulate a larger batch by accumulating gradients over multiple forward-backward passes before updating weights. Each pass processes mbs sequences, and you accumulate for grad_acc steps before doing an optimizer step.
Instead of updating weights after every forward-backward pass, you accumulate gradients over multiple passes. Only after grad_acc passes do you average the accumulated gradients and take an optimizer step.
grad_acc = 8. Each optimizer step processes 4M tokens. This gives us stable training dynamics without needing more GPUs or larger micro-batches.
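The batch-size arithmetic, plus the accumulation pattern itself (the micro-gradient values are made up for illustration):

```python
# Tokens per optimizer step = mbs * seq_len * DP * grad_acc.
mbs, seq_len, dp, grad_acc = 1, 128 * 1024, 4, 8
tokens_per_step = mbs * seq_len * dp * grad_acc
print(tokens_per_step)  # 4194304 — the ~4M token target

# Accumulation pattern: sum gradients over grad_acc forward/backward
# passes, average, then take ONE optimizer step.
accum = 0.0
for micro_grad in [0.5, 1.5, 1.0, 1.0, 0.5, 1.5, 1.0, 1.0]:  # 8 passes
    accum += micro_grad        # no weight update here
step_grad = accum / grad_acc   # the gradient the optimizer actually sees
print(step_grad)  # 1.0
```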
Training works! But at 128K sequence length, each attention head computes QKᵀ = 128K × 128K = 16 billion elements. That's ~32 GB in FP16 for just one head's attention scores. Even with TP splitting across heads, each head still attends to the full 128K sequence. TP splits across heads, not across the sequence. We need to split the sequence itself.
CP splits the long sequence across GPUs using Ring Attention. EP distributes MoE experts. These are conditional — only needed for specific use cases.
Divide the 128K token sequence into 8 chunks of 16K tokens each, one per GPU. For MLP and LayerNorm, each chunk processes independently (no communication). For attention, we use Ring Attention: KV blocks circulate around a ring of GPUs so every chunk can attend to every other chunk without any GPU holding the full 128K × 128K matrix.
128K token sequence → 8 chunks of 16K
Attention per GPU: 16K × 16K = 256M elements
vs full: 128K × 128K = 16B elements
That's a 64× smaller score matrix per head at any moment!
KV blocks circulate around the ring.
Each step: compute attention on local Q with received KV, then pass KV to next GPU.
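The ring itself can be simulated in a few lines — integer chunk IDs stand in for KV blocks, and a list rotation stands in for the send/receive:

```python
# Ring rotation sketch: each rank starts with its local KV chunk and,
# over CP - 1 steps, receives every other chunk exactly once.
CP = 8
kv = list(range(CP))               # kv[r] = chunk currently held by rank r
seen = [{r} for r in range(CP)]    # chunks each rank has attended to

for _ in range(CP - 1):
    kv = [kv[-1]] + kv[:-1]        # pass KV block to the next rank in the ring
    for r in range(CP):
        seen[r].add(kv[r])         # attend local Q chunk to the received KV

assert all(s == set(range(CP)) for s in seen)  # every rank saw every chunk
# At each step a rank only materializes a 16K x 16K score block (256M
# elements), never the full 128K x 128K matrix.
```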
What if instead of a 7B dense model, Vizz AI switches to a Mixture of Experts architecture? The model has 20B+ total parameters, but each token only activates 2 experts (~2B params). EP distributes experts across GPUs. Tokens are routed to their assigned expert GPUs via all-to-all communication.
Dense baseline: every token passes through all parameters. With MoE, each token activates only its 2 routed experts.
CP enabled for 128K sequences — Ring Attention splits the sequence across GPUs within each node. Dense model for now (EP not needed). If scaling to MoE later, add EP across nodes.
We walked through every step in the order you'd actually make these decisions. Here's the summary staircase showing how each step solved a problem and revealed the next.
Three configurations for different architectures, all on 64 H100 GPUs with 128K sequence length.
TP=8 saturates NVLink. PP=2 minimizes bubble. DP=4 for throughput. ZeRO-1 is free. CP for 128K sequences.
No pipeline bubble at all. ZeRO-2 needed for gradient memory. More DP replicas = more throughput if comm is fast enough.
20B+ total params, 2B active per token. EP distributes experts. DeepSeek-V3 style.
Play with the sliders to explore other configurations. Constraint: DP × TP × PP = 64 GPUs.
| Order | Strategy | Splits what? | Communication | Where? |
|---|---|---|---|---|
| 1 | Tensor + SP | Weight matrices + activations | AllReduce (critical path) | Intra-node (NVLink) |
| 2 | Pipeline | Model layers | Activations at boundaries | Inter-node |
| 3 | Data + ZeRO | Data batches | AllReduce gradients | Inter-node |
| 4 | Context | Sequence chunks | Ring Attention KV | Intra/Inter (overlapped) |
| 5 | Expert | MoE experts | All-to-all routing | Inter-node |
Set TP to saturate NVLink. Find max mbs. Add PP until layers fit. Use activation checkpointing.
DP = remaining GPUs. Use gradient accumulation to reach target global batch size. Add ZeRO-1 for free memory savings.
Add CP for long sequences. Add EP for MoE. Profile, measure MFU, and iterate on the config.
There is no silver bullet. The framework gives you a working baseline. Then you run experiments, profile, and iterate. The best config depends on your model, data, cluster, and network. Measure MFU. Find the bottleneck. Tune.
Now go train that Vizz model.