Train an SLM for 850 million speakers.

You're the founding engineer at Vizz AI. Your startup just raised Series A to build the first high-quality Small Language Model for the Vizz language — spoken by 850 million people but virtually absent from ChatGPT, Claude, and every other LLM.

You've collected 500 billion tokens from local publishers, radio transcripts, government records, and web scrapes. Your training data includes documents up to 128K tokens long — legal proceedings, parliamentary debates, epic folklore.

Your target: a 7 billion parameter transformer. Your cluster is ready. Now you need to figure out how to actually train this thing.

7B parameters · 64 H100 GPUs · 128K max sequence length

Cluster Topology — 8 Nodes × 8 GPUs

NVLink 900 GB/s (intra-node)
InfiniBand 50 GB/s (inter-node)

Single GPU Memory — 80 GB H100

The Problem

A 7B model in FP16 needs 14 GB just for parameters. Adam optimizer states add another 28 GB. Gradients need 14 GB. And activations at 128K sequence length? That's ~60+ GB per layer accumulated through the forward pass. Total: ~116+ GB. Your H100 has 80 GB. It doesn't fit. You need to distribute this across your 64 GPUs. But how?
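The budget above in a few lines of Python (same FP16/Adam accounting as the text; the ~60 GB activation floor is taken from the paragraph, not derived):

```python
GB = 1e9
N = 7e9                       # parameter count

params      = 2 * N / GB      # FP16 weights, 2 bytes each        -> 14 GB
optimizer   = 4 * N / GB      # Adam, two FP16 states per param   -> 28 GB
grads       = 2 * N / GB      # FP16 gradients                    -> 14 GB
activations = 60.0            # rough floor at 128K seq length (from the text)

total = params + optimizer + grads + activations
print(f"need ~{total:.0f} GB, have 80 GB")
```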

Tensor Parallelism + Sequence Parallelism

Start from the fastest link. NVLink is 900 GB/s intra-node. Split weight matrices of each layer across 8 GPUs within a node. Each GPU holds 1/8th of each layer.

The Idea

Take each layer's weight matrix and slice it column-wise (or row-wise) across 8 GPUs within a node. Each GPU does 1/8th of the matrix multiply, then they combine results with AllReduce. Since AllReduce is on the critical compute path, this requires fast NVLink — hence it's kept intra-node. Sequence Parallelism (SP) complements TP by splitting activations along the sequence dimension for operations outside TP regions (LayerNorm, Dropout).
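A toy NumPy sketch of the column-wise split (8 simulated GPUs; the concatenate stands in for the gather that real TP performs over NVLink):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 4096
X = rng.standard_normal((4, H))          # activations: [batch, H]
W = rng.standard_normal((H, H))          # one layer's weight: [H, H]

# Column parallelism: each simulated GPU owns H/8 columns of W
shards = np.split(W, 8, axis=1)
partials = [X @ w for w in shards]       # each GPU computes 1/8 of the output
Y = np.concatenate(partials, axis=1)     # combine the slices (over NVLink in reality)

assert np.allclose(Y, X @ W)             # identical to the unsplit matmul
```

In Megatron-style TP the column-split matmul is followed by a row-split one, and it is the row-split half whose partial sums end in the AllReduce on the critical path.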

Weight Matrix Slicing

A single layer's weight matrix, split across 8 GPUs in one node:

Full Weight Matrix (4096 × 4096)

Split across 8 GPUs (column-wise)

AllReduce over NVLink after each layer

One Node: 8 GPUs Sharing Every Layer

Each GPU within a node holds 1/8th of every layer's weight matrix. They all work on the same layer together.

Sequence Parallelism (SP)

In TP regions (Attention, MLP), each GPU computes a slice of the hidden dimension. But outside TP regions — LayerNorm, Dropout — all GPUs would need the full activations. SP fixes this by splitting activations along the sequence dimension in non-TP regions.

TP Regions (Attention, MLP)
Split along hidden dimension
Each GPU: full seq, 1/8 hidden
SP Regions (LayerNorm, Dropout)
Split along sequence dimension
Each GPU: 1/8 seq, full hidden

Why TP Must Stay Intra-Node

Link Type                  Bandwidth   Latency
NVLink (intra-node)        900 GB/s    ~1 μs
InfiniBand (inter-node)    50 GB/s     ~5 μs

NVLink is 18× faster intra-node.

TP AllReduce is on the critical compute path — the GPU blocks waiting for the AllReduce to complete before it can proceed to the next operation. With inter-node links, every single layer computation would stall. This is why TP is always kept within a node.
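To see the 18× gap in time terms, here is an illustrative payload (one sequence's bf16 activations) pushed over each link; real AllReduce time also depends on latency, topology, and algorithm:

```python
S, H = 131_072, 4096
payload = S * H * 2            # one sequence's activations in bf16, in bytes

nvlink_bw = 900e9              # bytes/s, intra-node
ib_bw     = 50e9               # bytes/s, inter-node

t_nvlink = payload / nvlink_bw
t_ib     = payload / ib_bw
print(f"NVLink: {t_nvlink*1e3:.2f} ms, InfiniBand: {t_ib*1e3:.2f} ms "
      f"({t_ib/t_nvlink:.0f}x slower)")
```

Paying that gap once per layer, every layer, is why TP stays on NVLink.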

Vizz AI Decision

TP = 8 (saturate the node). Each GPU holds 1/8th of every layer's weight matrix. All 8 GPUs within a node communicate via NVLink AllReduce. With 32 attention heads, each GPU handles 4 heads.

Problem That Remains

TP=8 is set. Each GPU holds 1/8th of weights per layer. But we still have all 32 layers, and at 128K sequence length, how many sequences can we even process at once? How much memory do activations take? Weight matrices are shape [H, H] — independent of batch size. But activation tensors are shape [b, S, H] — batch size is in every tensor. We need to figure out the maximum micro-batch size before anything else.

Find Max Micro-Batch Size

GPU memory splits into 4 buckets. Only activations scale with batch size. We need to find the largest micro-batch size (mbs) that fits in memory.

The 4 Memory Buckets

Weight matrices have shapes determined by hidden size H and layers L — independent of how many sequences you process. Whether mbs = 1 or 100, the weight matrix W remains shape [H, H]. But activation tensors include the batch dimension: shape [b, S, H]. Activation memory grows linearly with b.

Parameters
4N bytes
FIXED
Shape: [H, H]
Gradients
4N bytes
FIXED
Same shape as params
Optimizer
8N bytes
FIXED
Adam: 2 states per param
Activations
f(b, S, H, L)
SCALES WITH b
Shape: [b, S, H]

Key Insight

Weight matrices have shape [H, H] — independent of batch size. Activation tensors have shape [b, S, H] — b is in every tensor. The fixed costs (params + grads + optimizer) total 16N bytes. Everything else scales with how many sequences you try to fit.

The Memory Formula

Total GPU memory needed:
M_total = 16N + L × (34 × S × b × H + 5 × n_heads × S² × b) × 2
Fixed cost: 16N bytes (params + grads + optimizer). Activation cost: scales with b (micro-batch size).
Solving for maximum batch size:
b_max = ⌊ (M_GPU − 16N) / (L × (34 × S × H + 5 × n_heads × S²) × 2) ⌋

Take your GPU's total memory, subtract the fixed cost, divide by per-sequence activation cost.
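The formula, implemented as written (a hypothetical helper; plugging in Vizz AI's unsharded numbers shows why sharding comes first: the 16N fixed cost alone is 112 GB):

```python
import math

def b_max(M_gpu, N, L, S, H, n_heads):
    """Max micro-batch size per the article's formula; 0 if nothing fits."""
    fixed = 16 * N                                        # params + grads + optimizer
    per_seq = L * (34 * S * H + 5 * n_heads * S**2) * 2   # activation bytes per sequence
    return max(0, math.floor((M_gpu - fixed) / per_seq))

# 7B model, 32 layers, H=4096, 32 heads, 128K tokens, one 80 GB H100
print(b_max(80e9, 7e9, 32, 131_072, 4096, 32))   # prints 0: fixed cost alone is 112 GB
```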

Key Insight: The S² Term

The activation formula has an S² term from attention (the QKᵀ score matrix). Doubling sequence length MORE than halves your max batch size. At 128K, this quadratic term dominates everything. Even mbs=1 might be tight! This is why long-context training is fundamentally harder than short-context training.

The Empirical Approach: Binary Search

In practice, the formula gives a starting estimate. Then you find the exact max by trial:

Start with b=1 and keep doubling: 1 → 2 → 4 → 8 … until OOM, then binary-search between the last success and the first failure. The last successful value is your bmax.
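A sketch of that search; `fits` is a hypothetical callback that runs one trial training step at micro-batch b and returns False on OOM:

```python
def find_max_mbs(fits, start=1):
    """Doubling search, then binary search between last success and first OOM."""
    assert fits(start), "even the smallest micro-batch OOMs"
    lo = start
    while fits(lo * 2):        # double until the first failure
        lo *= 2
    hi = lo * 2                # first known OOM
    while hi - lo > 1:         # bisect the remaining gap
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo

print(find_max_mbs(lambda b: b <= 13))   # prints 13
```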

If mbs=1 Still Doesn't Fit?

Activation Checkpointing
Recompute activations in backward pass instead of storing them. ~90% activation reduction. Costs ~33% more compute.
Mixed Precision (bf16)
Halves activation memory (2 bytes vs 4). Fixed costs remain 16N (fp32 master weights needed for optimizer).
Add PP (Next Step)
Fewer layers per GPU = less activation memory. Each GPU only stores activations for its own layers.

Vizz AI Decision

With TP=8, each GPU holds 1/8th of each layer's weights. At 128K sequence length, the S² attention term is massive. Even with bf16, mbs=1 is tight for 32 layers. We'll need activation checkpointing and likely PP to split layers. Once those are set: mbs = 1 (possibly mbs = 2 with aggressive checkpointing).

Problem That Remains

Even with activation checkpointing, 32 layers of activations at 128K is a lot for one TP group of GPUs. Each GPU in our TP group still needs to store activations for all 32 layers during the forward pass. Can we reduce the number of layers per GPU? If we split layers across different nodes, each node only needs to store activations for its assigned layers.

Pipeline Parallelism

Split layers across nodes. Instead of all 32 layers on one GPU set, split into stages. Each stage = a group of layers on a different node. Only activations communicated at boundaries.

The Idea

Divide the 32 transformer layers into pipeline stages. Each stage = a group of layers assigned to a different node. Input micro-batches flow through the pipeline like an assembly line. Each node only stores and computes activations for its own layers. Communication between stages: just the activation tensor at the boundary — far less than all-gathering full parameters.

Pipeline Schedule Visualization

Watch how micro-batches flow through pipeline stages. Gray cells = bubble (idle time). More micro-batches shrink the bubble.

Bubble overhead: 43%. More micro-batches = smaller bubble.
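The percentages here follow the standard bubble estimate for a simple GPipe-style schedule (an approximation, not a property of every schedule):

```python
def bubble_fraction(stages, micro_batches):
    """Idle fraction of a GPipe-style pipeline: (p - 1) / (m + p - 1)."""
    p, m = stages, micro_batches
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(4, 4):.0%}")    # 43%, the figure shown above
print(f"{bubble_fraction(4, 32):.0%}")   # 9%: more micro-batches shrink the bubble
```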

One Pipeline: How Layers Split Across Nodes

With PP=2, one pipeline uses 2 nodes. Each node runs 8 TP GPUs on its layers. Activations flow from Stage 0 to Stage 1.

Stage 0 (Layers 0-15)
Stage 1 (Layers 16-31)
TP AllReduce (intra-node, NVLink)
PP activations (inter-node, InfiniBand)

This one pipeline uses 2 nodes = 16 GPUs. We have 64 GPUs total. What about the other 48?

Rule: Minimize PP Stages

Fewer stages = less bubble waste. Only add PP stages until the model fits. The pipeline bubble is the time GPUs sit idle waiting for micro-batches to flow through.

PP Stages   Layers / Stage   Nodes / Pipeline   Bubble (4 micro-batches)
PP=1        32               1 node             0% (no bubble)
PP=2        16               2 nodes            ~20%
PP=4        8                4 nodes            ~43%

Vizz AI Decision

With TP=8 and activation checkpointing, try PP=2 first (16 layers per stage). This halves activation memory per GPU-set and uses 2 nodes per pipeline. If it fits — stop. Don't go to PP=4 unless necessary. PP = 2.

Problem That Remains

TP=8, PP=2 uses 16 GPUs per pipeline. We have 64 GPUs. What about the other 48? They need something to do. This is where Data Parallelism comes in — not as a choice, but as what remains after TP and PP are set.

Data Parallelism + ZeRO

DP is not a choice — it's the remaining GPUs. DP = Total / (TP × PP). Each replica processes different data, then gradients are synchronized via AllReduce.

The Idea

After TP and PP are set, the remaining GPUs form DP replicas. Each replica is an independent copy of the full pipeline. They process different data in parallel. After each step, gradients are averaged across all DP replicas via AllReduce.

Computing DP

DP = Total GPUs / (TP × PP) = 64 / (8 × 2) = 4

4 DP replicas. Each processes a different micro-batch. All replicas synchronize gradients via AllReduce after every step.

ZeRO: Shard Across DP Replicas

ZeRO optimization shards memory across the DP replicas. Start with the cheapest level and only go deeper if you're still memory-desperate.

Per-GPU Memory (with TP=8, PP=2): ~11 GB

Each GPU: params (0.88 GB) + optimizer (1.75 GB) + gradients (0.88 GB) + activations (~7.5 GB)

Level    Shards            Overhead     When
ZeRO-1   Optimizer states  Almost free  Always
ZeRO-2   + Gradients       Minimal      Good default
ZeRO-3   + Parameters      Heavy        Last resort
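A toy sketch of ZeRO-1: four simulated DP ranks each keep Adam moments for only their quarter of the parameters, then the updated shards are gathered. Assumptions: gradients are already AllReduce-averaged (identical on every rank), and this is the first optimizer step.

```python
import numpy as np

DP, n_params = 4, 16                       # 4 DP ranks, toy 16-param model
params = np.zeros(n_params)
grads = np.ones(n_params)                  # AllReduce-averaged, same on every rank

shard = n_params // DP
m = [np.zeros(shard) for _ in range(DP)]   # Adam first moment, 1/DP per rank
v = [np.zeros(shard) for _ in range(DP)]   # Adam second moment, 1/DP per rank
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8

updated = []
for rank in range(DP):                     # each rank updates only its own shard
    sl = slice(rank * shard, (rank + 1) * shard)
    g = grads[sl]
    m[rank] = b1 * m[rank] + (1 - b1) * g
    v[rank] = b2 * v[rank] + (1 - b2) * g * g
    mhat = m[rank] / (1 - b1)              # bias correction at step t = 1
    vhat = v[rank] / (1 - b2)
    updated.append(params[sl] - lr * mhat / (np.sqrt(vhat) + eps))

params = np.concatenate(updated)           # all-gather the updated shards
```

Each rank stores only 1/DP of the optimizer states, which is why ZeRO-1 is nearly free: the all-gather replaces a broadcast that DP training needs anyway.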

Full Cluster — 4 DP Replicas

4 DP replicas, each with 2 PP stages, each stage = 1 node with 8 TP GPUs.

4 DP replicas × 2 PP stages × 8 TP GPUs = 64 GPUs total

Vizz AI Decision

DP = 4, ZeRO-1 (free optimizer sharding). With TP=8 and PP=2, memory is already comfortable. No need for ZeRO-2 or ZeRO-3.

Problem That Remains

We have 4 replicas, each processes mbs sequences per step. Global batch = mbs × 4. With mbs=1, that's only 4 sequences per step — just 4 × 128K = 512K tokens. Our target might be 4M tokens per step for stable training. How do we hit the target batch size without needing more GPUs?

Hit the Target Batch Size

Use gradient accumulation to bridge the gap between what fits in memory and what training requires.

The Idea

Gradient accumulation lets you simulate a larger batch by accumulating gradients over multiple forward-backward passes before updating weights. Each pass processes mbs sequences, and you accumulate for grad_acc steps before doing an optimizer step.

The Batch Size Formula

Global Batch = mbs × DP × grad_acc_steps

mbs = 1 (sequences per GPU per pass)
DP = 4 (parallel replicas)
grad_acc = 8 (accumulation steps)

Gradient Accumulation Explained

Instead of updating weights after every forward-backward pass, you accumulate gradients over multiple passes. Only after grad_acc passes do you average the accumulated gradients and take an optimizer step.

Vizz AI example:
Target: 4M tokens per optimizer step
Per pass: mbs(1) × DP(4) = 4 sequences × 128K = 512K tokens
Need: 4M / 512K = 8 accumulation steps
Result: 1 × 4 × 8 × 128K = 4,194,304 tokens per step
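A minimal sketch of the accumulate-then-step loop (toy scalar model; in a real framework the gradient buffer lives on the GPU and `backward()` adds into it):

```python
import numpy as np

def loss_grad(w, micro_batch):
    # toy loss 0.5 * (w - mean(batch))^2, so grad = w - mean(batch)
    return w - micro_batch.mean()

w, lr, grad_acc = 0.0, 0.5, 8
micro_batches = np.split(np.arange(8.0), grad_acc)   # 8 passes of mbs=1

acc = 0.0
for step, micro in enumerate(micro_batches, start=1):
    acc += loss_grad(w, micro)            # backward pass adds into the grad buffer
    if step % grad_acc == 0:
        w -= lr * (acc / grad_acc)        # optimizer step on the averaged gradient
        acc = 0.0

# identical to one step on the full batch: grad = 0 - mean(0..7) = -3.5, so w = 1.75
```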

Vizz AI Decision

grad_acc = 8. Each optimizer step processes 4M tokens. This gives us stable training dynamics without needing more GPUs or larger micro-batches.

Problem That Remains

Training works! But at 128K sequence length, each attention head computes QKᵀ = 128K × 128K ≈ 16 billion elements. That's ~32 GB in FP16 for just one head's attention scores. Even with TP splitting across heads, each head still attends to the full 128K sequence. TP splits across heads, not across the sequence. We need to split the sequence itself.

Context Parallelism + Expert Parallelism

CP splits the long sequence across GPUs using Ring Attention. EP distributes MoE experts. These are conditional — only needed for specific use cases.

Split the Sequence with Ring Attention

The Idea

Divide the 128K token sequence into 8 chunks of 16K tokens each, one per GPU. For MLP and LayerNorm, each chunk processes independently (no communication). For attention, we use Ring Attention: KV blocks circulate around a ring of GPUs so every chunk can attend to every other chunk without any GPU holding the full 128K × 128K matrix.

Sequence Splitting

128K token sequence → 8 chunks of 16K

Attention per GPU: 16K × 16K = 256M elements

vs full: 128K × 128K = 16B elements

That's a ~64× reduction per head!

Ring Attention

KV blocks circulate around the ring.
Each step: compute attention on local Q with received KV, then pass KV to next GPU.
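A NumPy sketch of single-head, non-causal Ring Attention: the ring communication is simulated by indexing, and a running max/denominator implements the online softmax so no chunk ever materializes the full S × S matrix. Causal masking is omitted for brevity.

```python
import numpy as np

def full_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(-1, keepdims=True))
    return (p / p.sum(-1, keepdims=True)) @ V

def ring_attention(Q, K, V, chunks):
    Qs, Ks, Vs = (np.split(t, chunks) for t in (Q, K, V))
    d = Q.shape[-1]
    out = []
    for qi in range(chunks):                  # each "GPU" owns one Q chunk
        q = Qs[qi]
        m = np.full((len(q), 1), -np.inf)     # running row max
        l = np.zeros((len(q), 1))             # running softmax denominator
        acc = np.zeros_like(q)                # unnormalized output accumulator
        for step in range(chunks):            # KV blocks arrive around the ring
            kv = (qi + step) % chunks
            s = q @ Ks[kv].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(-1, keepdims=True))
            p, scale = np.exp(s - m_new), np.exp(m - m_new)
            l = l * scale + p.sum(-1, keepdims=True)   # rescale old stats, add new
            acc = acc * scale + p @ Vs[kv]
            m = m_new
        out.append(acc / l)
    return np.concatenate(out)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((32, 8)) for _ in range(3))
assert np.allclose(ring_attention(Q, K, V, 4), full_attention(Q, K, V))
```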

Scale Capacity with Mixture of Experts

A Different Architecture Decision

What if instead of a 7B dense model, Vizz AI switches to a Mixture of Experts architecture? The model has 20B+ total parameters, but each token only activates 2 experts (~2B params). EP distributes experts across GPUs. Tokens are routed to their assigned expert GPUs via all-to-all communication.
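A toy top-2 router in NumPy (softmax gate, experts as plain linear maps; the per-expert token gather is where the all-to-all would happen in real EP):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, n_experts, top_k = 6, 8, 4, 2
x = rng.standard_normal((n_tokens, d))
W_gate = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # toy linear experts

logits = x @ W_gate
gates = np.exp(logits - logits.max(-1, keepdims=True))
gates /= gates.sum(-1, keepdims=True)                 # softmax gate weights
top2 = np.argsort(logits, -1)[:, -top_k:]             # 2 experts chosen per token

y = np.zeros_like(x)
for e in range(n_experts):
    sel = (top2 == e).any(-1)                         # tokens routed to expert e
    if sel.any():                                     # all-to-all dispatch in real EP
        y[sel] += gates[sel, e:e+1] * (x[sel] @ experts[e])

assert y.shape == x.shape                             # only 2 experts touched each row
```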

Dense vs MoE Architecture

Dense (7B): every token passes through all parameters: attention (all heads) and the full 7B of MLP weights.

MoE (20B+): every token still uses all attention heads, but the router activates only 2 experts (~2B parameters) per token.

+ CP Advantages

  • Attention memory reduced from O(S²) to O(S²/P) per GPU
  • Enables training with 128K+ sequences
  • MLP/LayerNorm need zero communication
  • Ring Attention overlaps compute and communication

- CP Disadvantages

  • Communication overhead in attention layers
  • Only needed for very long sequences (> 32K)
  • Adds complexity to attention implementation

+ EP Advantages

  • Scale model capacity without scaling per-token compute
  • 20B+ total params, only 2B active per token
  • Experts naturally map to different GPUs
  • Better quality for same compute budget

- EP Disadvantages

  • Requires MoE architecture (design choice)
  • All-to-all communication for token routing
  • Load balancing across experts is challenging
  • More total parameters = more storage

Vizz AI Decision

CP enabled for 128K sequences — Ring Attention splits the sequence across GPUs within each node. Dense model for now (EP not needed). If scaling to MoE later, add EP across nodes.

The Practical Decision Framework

We walked through every step in the order you'd actually make these decisions. Here's the summary staircase showing how each step solved a problem and revealed the next.

The Decision Staircase

1. 116 GB doesn't fit in 80 GB → Setup: 64 H100s (8×8)
2. Which link to use first? → TP=8 (NVLink, fastest)
3. How many sequences fit? → mbs=1 (S² dominates)
4. 32 layers too many per GPU-set → PP=2 (16 layers/stage)
5. 48 GPUs still unused → DP=4 + ZeRO-1
6. Only 512K tokens/step → grad_acc=8 → 4M tokens
7. 128K×128K attention explodes → CP (Ring Attention)
8. Want more capacity? → EP (if switching to MoE)

Vizz AI: Final Configurations

Three configurations for different architectures, all on 64 H100 GPUs with 128K sequence length.

Baseline — Dense 7B
TP 8 · PP 2 · DP 4 · ZeRO Level 1 · CP Enabled · mbs 1 · grad_acc 8

The configuration built step by step above: small pipeline bubble, comfortable memory.

Alternative — More DP
TP 8 · PP 1 · DP 8 · ZeRO Level 2 · CP Enabled · mbs 1 · grad_acc 4

No pipeline bubble at all. ZeRO-2 needed for gradient memory. More DP replicas = more throughput if comm is fast enough.

MoE Variant — 20B
TP 8 · PP 2 · DP 4 · EP 8 experts · CP Enabled · mbs 1 · grad_acc 8

20B+ total params, 2B active per token. EP distributes experts. DeepSeek-V3 style.

Interactive Configuration Builder

Play with the sliders to explore other configurations. Constraint: DP × TP × PP = 64 GPUs.

TP(8) × PP(2) × DP(4) = 64 ✓

Estimated Per-GPU

Params: 0.88 GB
Optimizer: 1.75 GB
Gradients: 0.88 GB
Activations: ~3.75 GB
Total: ~7.3 GB

Cluster Layout

TP (intra-node)
PP / DP (inter-node)

All 5 Dimensions at a Glance (Practical Order)

Order   Strategy      Splits what?                    Communication               Where?
1       Tensor + SP   Weight matrices + activations   AllReduce (critical path)   Intra-node (NVLink)
2       Pipeline      Model layers                    Activations at boundaries   Inter-node
3       Data + ZeRO   Data batches                    AllReduce gradients         Inter-node
4       Context       Sequence chunks                 Ring Attention KV passing   Intra/inter-node (overlapped)
5       Expert        MoE experts                     All-to-all routing          Inter-node

The 3-Step Framework

1

Fit in Memory

Set TP to saturate NVLink. Find max mbs. Add PP until layers fit. Use activation checkpointing.

2

Hit Batch Size

DP = remaining GPUs. Use gradient accumulation to reach target global batch size. Add ZeRO-1 for free memory savings.

3

Max Throughput

Add CP for long sequences. Add EP for MoE. Profile, measure MFU, and iterate on the config.

There is no silver bullet. The framework gives you a working baseline. Then you run experiments, profile, and iterate. The best config depends on your model, data, cluster, and network. Measure MFU. Find the bottleneck. Tune.

Now go train that Vizz model.