You're the founding engineer at Vizz AI. Your startup just raised Series A to build the first high-quality Small Language Model for the Vizz language — spoken by 850 million people but virtually absent from ChatGPT, Claude, and every other LLM.
You've collected 500 billion tokens from local publishers, radio transcripts, government records, and web scrapes. Your training data includes documents up to 128K tokens long — legal proceedings, parliamentary debates, epic folklore.
Your target: a 7 billion parameter transformer. Your cluster is ready. Now you need to figure out how to actually train this thing.
A 7B model in BF16 needs 14 GB just for parameters. Gradients need another 14 GB. Adam optimizer states — FP32 master weights, momentum, and variance — add 84 GB more. That's 16 bytes per parameter, 112 GB of fixed cost before a single activation. And activations at 128K sequence length pile tens of GB on top, accumulated through the forward pass. Your H100 has 80 GB. It doesn't come close to fitting. You need to distribute this across your 64 GPUs. But how?
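The fixed-cost arithmetic recurs throughout this section, so it's worth sketching once. A minimal estimator, assuming BF16 parameters and gradients plus FP32 Adam states — the standard 16-bytes-per-parameter breakdown:

```python
# Fixed memory cost per parameter under mixed-precision Adam:
#   2 bytes (BF16 params) + 2 bytes (BF16 grads)
#   + 4 + 4 + 4 bytes (FP32 master weights, momentum, variance)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16

def fixed_cost_gb(n_params: float) -> float:
    """Parameters + gradients + optimizer states, in GB."""
    return n_params * BYTES_PER_PARAM / 1e9

print(fixed_cost_gb(7e9))  # 112.0 — already past one 80 GB H100
```

Activations come on top of this, and unlike the fixed cost they scale with batch size and sequence length.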
Start from the fastest link. NVLink is 900 GB/s intra-node. Split weight matrices of each layer across 8 GPUs within a node. Each GPU holds 1/8th of each layer.
Take each layer's weight matrix and slice it column-wise (or row-wise) across 8 GPUs within a node. Each GPU does 1/8th of the matrix multiply, then they combine results with AllReduce. Since AllReduce is on the critical compute path, this requires fast NVLink — hence it's kept intra-node. Sequence Parallelism (SP) complements TP by splitting activations along the sequence dimension for operations outside TP regions (LayerNorm, Dropout).
A single layer's weight matrix, split across 8 GPUs in one node: the full 4096 × 4096 matrix is sliced column-wise into eight 4096 × 512 shards, one shard per GPU.
Each GPU within a node holds 1/8th of every layer's weight matrix. They all work on the same layer together.
In TP regions (Attention, MLP), each GPU computes a slice of the hidden dimension. But outside TP regions — LayerNorm, Dropout — all GPUs would need the full activations. SP fixes this by splitting activations along the sequence dimension in non-TP regions.
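A toy sketch of column-parallel TP — plain Python lists stand in for GPU shards, and concatenation stands in for the collective. The 2 × 4 matrix and the two "ranks" are illustrative only, not a real TP implementation:

```python
# Toy column-parallel linear layer: W is split column-wise across
# "GPUs" (here: plain lists); each rank computes a slice of the output.

def matmul(x, w):  # x: [n], w: [n][m] -> [m]
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

x = [1.0, 2.0]                      # activations (replicated on all ranks)
W = [[1.0, 2.0, 3.0, 4.0],          # full weight matrix, shape [2 x 4]
     [5.0, 6.0, 7.0, 8.0]]

# Column-wise split across 2 ranks: each holds a [2 x 2] shard.
shards = [[row[:2] for row in W], [row[2:] for row in W]]

# Each rank computes its output slice; an AllGather (here: concat)
# rebuilds the full output. A row-parallel split would instead produce
# partial sums that are combined with an AllReduce.
partials = [matmul(x, shard) for shard in shards]
full = partials[0] + partials[1]
assert full == matmul(x, W)
print(full)  # [11.0, 14.0, 17.0, 20.0]
```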
| Link Type | Bandwidth | Latency |
|---|---|---|
| NVLink (intra-node) | 900 GB/s | ~1 μs |
| InfiniBand (inter-node) | 50 GB/s | ~5 μs |
| Ratio | 18× faster intra-node | |
TP AllReduce is on the critical compute path — the GPU blocks waiting for the AllReduce to complete before it can proceed to the next operation. With inter-node links, every single layer computation would stall. This is why TP is always kept within a node.
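To put the 18× gap in numbers, here is a rough ring-AllReduce cost model using the standard 2(n−1)/n data-volume estimate; the tensor shape is illustrative:

```python
# Rough time for one AllReduce of `nbytes` across n ranks over a link
# with bandwidth bw_gbps (GB/s), using the ring-AllReduce volume
# 2*(n-1)/n * nbytes. Ignores latency and overlap — a ballpark only.
def allreduce_seconds(nbytes, bw_gbps, n=8):
    return 2 * (n - 1) / n * nbytes / (bw_gbps * 1e9)

act = 1 * 16_384 * 4096 * 2          # a [b=1, S=16K, H=4096] BF16 tensor
print(allreduce_seconds(act, 900))   # over NVLink
print(allreduce_seconds(act, 50))    # over InfiniBand: 18x slower
```

Since this sync sits on the critical path of every layer, paying the 18× penalty per layer is exactly what keeping TP intra-node avoids.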
TP = 8 (saturate the node). Each GPU holds 1/8th of every layer's weight matrix. All 8 GPUs within a node communicate via NVLink AllReduce. With 32 attention heads, each GPU handles 4 heads.
TP=8 is set. Each GPU holds 1/8th of weights per layer. But we still have all 32 layers, and at 128K sequence length, how many sequences can we even process at once? How much memory do activations take? Weight matrices are shape [H, H] — independent of batch size. But activation tensors are shape [b, S, H] — batch size is in every tensor. We need to figure out the maximum micro-batch size before anything else.
GPU memory splits into 4 buckets. Only activations scale with batch size. We need to find the largest micro-batch size (mbs) that fits in memory.
Weight matrices have shapes determined by hidden size H and layers L — independent of how many sequences you process. Whether mbs = 1 or 100, the weight matrix W remains shape [H, H]. But activation tensors include the batch dimension: shape [b, S, H]. Activation memory grows linearly with b.
The fixed costs — parameters, gradients, and optimizer states — total 16N bytes for N parameters: 2 (BF16 params) + 2 (BF16 grads) + 12 (FP32 master weights, momentum, variance). Everything else scales with how many sequences you try to fit.
Take your GPU's total memory, subtract the fixed cost, divide by per-sequence activation cost.
The activation formula has an S² term from attention (the QKᵀ score matrix). Doubling sequence length MORE than halves your max batch size. At 128K, this quadratic term dominates everything. Even mbs=1 might be tight! This is why long-context training is fundamentally harder than short-context training.
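A quick sketch of why the quadratic term explodes, assuming 32 heads and 2-byte scores. Real systems avoid materializing this matrix (FlashAttention computes it blockwise), but the raw arithmetic explains the pressure:

```python
# Attention scores per layer are n_heads * S * S elements — the S^2
# term. Head count and dtype are illustrative assumptions.
def attn_scores_gib(seq_len, n_heads=32, bytes_per_el=2):
    return n_heads * seq_len * seq_len * bytes_per_el / 2**30

print(attn_scores_gib(8_192))    # 4.0 GiB per layer if materialized
print(attn_scores_gib(131_072))  # 1024.0 GiB — 16x the length, 256x the scores
```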
In practice, the formula gives a starting estimate. Then you find the exact max by trial:
Start with b=1, then double: 1 → 2 → 4 → 8 ... until OOM. Last successful = your bmax.
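The doubling search can be sketched with a stubbed `fits()` standing in for a real forward/backward pass wrapped in an OOM try/except:

```python
# Doubling search for the max micro-batch size. `fits` is a stub for
# "run one forward/backward at batch size b without OOM".
def find_max_micro_batch(fits, start=1):
    b = start
    while fits(b * 2):
        b *= 2
    return b  # last successful power-of-two batch size

# Hypothetical memory model: pretend batches up to 6 fit.
print(find_max_micro_batch(lambda b: b <= 6))  # 4
```

Note the doubling lands on the last power of two that fits, matching the 1 → 2 → 4 → 8 probe described above.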
With TP=8, each GPU holds 1/8th of each layer's weights. At 128K sequence length, the S² attention term is massive. Even with bf16, mbs=1 is tight for 32 layers. We'll need activation checkpointing and likely PP to split layers. Once those are set: mbs = 1 (possibly mbs = 2 with aggressive checkpointing).
Even with activation checkpointing, 32 layers of activations at 128K is a lot for one TP group. Each GPU in our TP group still needs to store activations for all 32 layers during the forward pass. Can we reduce the number of layers per GPU? If we split layers across different nodes, each node only needs to store activations for its assigned layers.
Split layers across nodes. Instead of all 32 layers on one GPU set, split into stages. Each stage = a group of layers on a different node. Only activations communicated at boundaries.
Divide the 32 transformer layers into pipeline stages. Each stage = a group of layers assigned to a different node. Input micro-batches flow through the pipeline like an assembly line. Each node only stores and computes activations for its own layers. Communication between stages: just the activation tensor at the boundary — far less than all-gathering full parameters.
Watch how micro-batches flow through pipeline stages. Gray cells = bubble (idle time). More micro-batches shrink the bubble.
With PP=2, one pipeline uses 2 nodes. Each node runs 8 TP GPUs on its layers. Activations flow from Stage 0 to Stage 1.
This one pipeline uses 2 nodes = 16 GPUs. We have 64 GPUs total. What about the other 48?
Fewer stages = less bubble waste. Only add PP stages until the model fits. The pipeline bubble is the time GPUs sit idle waiting for micro-batches to flow through.
| PP Stages | Layers / Stage | Nodes / Pipeline | Bubble (4 micro-batches) |
|---|---|---|---|
| PP=1 | 32 | 1 node | 0% (no bubble) |
| PP=2 | 16 | 2 nodes | ~20% |
| PP=4 | 8 | 4 nodes | ~43% |
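The bubble column follows the standard estimate for a GPipe-style schedule — bubble = (p − 1)/(m + p − 1) for p stages and m micro-batches:

```python
# Fraction of total pipeline time spent idle (the "bubble") for a
# simple GPipe-style schedule with p stages and m micro-batches.
def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

for p in (1, 2, 4):
    print(f"PP={p}: {bubble_fraction(p, m=4):.0%}")  # 0%, 20%, 43%
```

More micro-batches shrink the bubble: at p=4, going from m=4 to m=32 drops it from ~43% to under 9%.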
With TP=8 and activation checkpointing, try PP=2 first (16 layers per stage). This halves activation memory per GPU-set and uses 2 nodes per pipeline. If it fits — stop. Don't go to PP=4 unless necessary. PP = 2.
TP=8, PP=2 uses 16 GPUs per pipeline. We have 64 GPUs. What about the other 48? They need something to do. This is where Data Parallelism comes in — not as a choice, but as what remains after TP and PP are set.
DP is not a choice — it's the remaining GPUs. DP = Total / (TP × PP). Each replica processes different data, then gradients are synchronized via AllReduce.
After TP and PP are set, the remaining GPUs form DP replicas. Each replica is an independent copy of the full pipeline. They process different data in parallel. After each step, gradients are averaged across all DP replicas via AllReduce.
4 DP replicas. Each processes a different micro-batch. All replicas synchronize gradients via AllReduce after every step.
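The synchronization step is just an element-wise mean across replicas. A toy simulation, with small lists standing in for gradient tensors:

```python
# DP gradient sync: each replica computes grads on its own data, then
# an AllReduce averages them so every replica applies the same update.
def allreduce_mean(grads_per_replica):
    n = len(grads_per_replica)
    return [sum(g) / n for g in zip(*grads_per_replica)]

# 4 replicas, each with a 2-element gradient (illustrative values).
replica_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(allreduce_mean(replica_grads))  # [4.0, 5.0]
```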
ZeRO optimization shards memory across the DP replicas. Start with the cheapest level and only go deeper if you're still memory-desperate.
Each GPU: params (0.88 GB) + gradients (0.88 GB) + optimizer states (5.25 GB unsharded, ~1.3 GB with ZeRO-1 over DP=4) + activations (~7.5 GB)
| Level | Shards | Overhead | When |
|---|---|---|---|
| ZeRO-1 | Optimizer states | Almost free | Always |
| ZeRO-2 | + Gradients | Minimal | Good default |
| ZeRO-3 | + Parameters | Heavy | Last resort |
4 DP replicas, each with 2 PP stages, each stage = 1 node with 8 TP GPUs.
4 DP replicas × 2 PP stages × 8 TP GPUs = 64 GPUs total
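A sanity check on the layout arithmetic — the parallelism degrees must multiply out to the cluster size, and TP × PP determines each GPU's weight shard:

```python
# Final layout: every GPU must have exactly one job.
TP, PP, DP, TOTAL_GPUS = 8, 2, 4, 64
assert TP * PP * DP == TOTAL_GPUS

# TP and PP both shard weights, so each GPU holds N / (TP * PP) params.
N_PARAMS = 7e9
params_per_gpu_gb = N_PARAMS / (TP * PP) * 2 / 1e9  # BF16: 2 bytes/param
print(params_per_gpu_gb)  # 0.875 — the ~0.88 GB per-GPU figure
```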
DP = 4, ZeRO-1 (free optimizer sharding). With TP=8 and PP=2, memory is already comfortable. No need for ZeRO-2 or ZeRO-3.
We have 4 replicas, each processes mbs sequences per step. Global batch = mbs × 4. With mbs=1, that's only 4 sequences per step — just 4 × 128K = 512K tokens. Our target might be 4M tokens per step for stable training. How do we hit the target batch size without needing more GPUs?
Use gradient accumulation to bridge the gap between what fits in memory and what training requires.
Gradient accumulation lets you simulate a larger batch by accumulating gradients over multiple forward-backward passes before updating weights. Each pass processes mbs sequences, and you accumulate for grad_acc steps before doing an optimizer step.
Instead of updating weights after every forward-backward pass, you accumulate gradients over multiple passes. Only after grad_acc passes do you average the accumulated gradients and take an optimizer step.
grad_acc = 8. Each optimizer step processes 4M tokens. This gives us stable training dynamics without needing more GPUs or larger micro-batches.
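The batch-size arithmetic, plus the accumulation pattern itself (the micro-gradient values are made up for illustration):

```python
# Tokens per optimizer step = mbs * seq_len * DP * grad_acc.
mbs, seq_len, dp, grad_acc = 1, 128 * 1024, 4, 8
tokens_per_step = mbs * seq_len * dp * grad_acc
print(tokens_per_step)  # 4194304 — the ~4M token target

# Accumulation pattern: sum gradients over grad_acc forward/backward
# passes, average, then take ONE optimizer step.
accum = 0.0
for micro_grad in [0.5, 1.5, 1.0, 1.0, 0.5, 1.5, 1.0, 1.0]:  # 8 passes
    accum += micro_grad        # no weight update here
step_grad = accum / grad_acc   # the gradient the optimizer actually sees
print(step_grad)  # 1.0
```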
Training works! But at 128K sequence length, each attention head computes QKᵀ = 128K × 128K = 16 billion elements. That's ~32 GB in FP16 for just one head's attention scores. Even with TP splitting across heads, each head still attends to the full 128K sequence. TP splits across heads, not across the sequence. We need to split the sequence itself.
CP splits the long sequence across GPUs using Ring Attention. EP distributes MoE experts. These are conditional — only needed for specific use cases.
Divide the 128K token sequence into 8 chunks of 16K tokens each, one per GPU. For MLP and LayerNorm, each chunk processes independently (no communication). For attention, we use Ring Attention: KV blocks circulate around a ring of GPUs so every chunk can attend to every other chunk without any GPU holding the full 128K × 128K matrix.
128K token sequence → 8 chunks of 16K
Attention per GPU: 16K × 16K = 256M elements
vs full: 128K × 128K = 16B elements
That's a 64× smaller score matrix per head at any moment!
KV blocks circulate around the ring.
Each step: compute attention on local Q with received KV, then pass KV to next GPU.
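The ring itself can be simulated in a few lines — integer chunk IDs stand in for KV blocks, and a list rotation stands in for the send/receive:

```python
# Ring rotation sketch: each rank starts with its local KV chunk and,
# over CP - 1 steps, receives every other chunk exactly once.
CP = 8
kv = list(range(CP))               # kv[r] = chunk currently held by rank r
seen = [{r} for r in range(CP)]    # chunks each rank has attended to

for _ in range(CP - 1):
    kv = [kv[-1]] + kv[:-1]        # pass KV block to the next rank in the ring
    for r in range(CP):
        seen[r].add(kv[r])         # attend local Q chunk to the received KV

assert all(s == set(range(CP)) for s in seen)  # every rank saw every chunk
# At each step a rank only materializes a 16K x 16K score block (256M
# elements), never the full 128K x 128K matrix.
```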
What if instead of a 7B dense model, Vizz AI switches to a Mixture of Experts architecture? The model has 20B+ total parameters, but each token only activates 2 experts (~2B params). EP distributes experts across GPUs. Tokens are routed to their assigned expert GPUs via all-to-all communication.
Dense baseline: every token passes through all parameters. With MoE, each token activates only its 2 routed experts.
CP enabled for 128K sequences — Ring Attention splits the sequence across GPUs within each node. Dense model for now (EP not needed). If scaling to MoE later, add EP across nodes.
We walked through every step in the order you'd actually make these decisions. Here's the summary staircase showing how each step solved a problem and revealed the next.
Three configurations for different architectures, all on 64 H100 GPUs with 128K sequence length.
TP=8 saturates NVLink. PP=2 minimizes bubble. DP=4 for throughput. ZeRO-1 is free. CP for 128K sequences.
No pipeline bubble at all. ZeRO-2 needed for gradient memory. More DP replicas = more throughput if comm is fast enough.
20B+ total params, 2B active per token. EP distributes experts. DeepSeek-V3 style.
Play with the sliders to explore other configurations. Constraint: DP × TP × PP = 64 GPUs.
| Order | Strategy | Splits what? | Communication | Where? |
|---|---|---|---|---|
| 1 | Tensor + SP | Weight matrices + activations | AllReduce (critical path) | Intra-node (NVLink) |
| 2 | Pipeline | Model layers | Activations at boundaries | Inter-node |
| 3 | Data + ZeRO | Data batches | AllReduce gradients | Inter-node |
| 4 | Context | Sequence chunks | Ring Attention KV | Intra/Inter (overlapped) |
| 5 | Expert | MoE experts | All-to-all routing | Inter-node |
Set TP to saturate NVLink. Find max mbs. Add PP until layers fit. Use activation checkpointing.
DP = remaining GPUs. Use gradient accumulation to reach target global batch size. Add ZeRO-1 for free memory savings.
Add CP for long sequences. Add EP for MoE. Profile, measure MFU, and iterate on the config.
There is no silver bullet. The framework gives you a working baseline. Then you run experiments, profile, and iterate. The best config depends on your model, data, cluster, and network. Measure MFU. Find the bottleneck. Tune.
Now go train that Vizz model.