Interactive, from-first-principles visual guides for every concept in distributed GPU training — from data parallelism to ZeRO optimizer internals.
Understand how multiple GPUs collaborate on the same model — from naive replication to overlapped gradient synchronization.
From single-GPU to production DDP. Covers naive DP, interleaved DP, PyTorch DDP, and ring all-reduce with Colab benchmarks on real hardware.
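The ring all-reduce at the heart of DDP can be sketched with plain Python lists standing in for GPUs — a toy simulation of the reduce-scatter and all-gather phases, not real NCCL communication:

```python
def ring_all_reduce(data):
    """data[g][c] = value of chunk c on simulated GPU g (one chunk per GPU).
    Returns the fully reduced (summed) chunks, identical on every GPU."""
    n = len(data)
    buf = [list(row) for row in data]
    # Phase 1: reduce-scatter. At step s, GPU g sends chunk (g-s) mod n to
    # GPU g+1, which adds it. After n-1 steps, GPU g owns the complete sum
    # of chunk (g+1) mod n.
    for step in range(n - 1):
        sends = [(g, (g - step) % n, buf[g][(g - step) % n]) for g in range(n)]
        for g, c, val in sends:
            buf[(g + 1) % n][c] += val
    # Phase 2: all-gather. Each GPU forwards the completed chunks around the
    # ring, overwriting stale partial sums.
    for step in range(n - 1):
        sends = [(g, (g + 1 - step) % n, buf[g][(g + 1 - step) % n]) for g in range(n)]
        for g, c, val in sends:
            buf[(g + 1) % n][c] = val
    return buf
```

Each GPU sends 2(n-1)/n of the gradient in total, independent of n — which is why the ring algorithm scales to large GPU counts.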
Naive DP vs DDP overlap. See exactly how DDP hides communication behind backward compute with live step-by-step animations and profiler timelines.
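The overlap effect can be captured in a toy timeline model (our simplification, not PyTorch's scheduler): each gradient bucket's all-reduce may start as soon as backward has produced its gradients, subject to the communication stream being free:

```python
def step_time(bwd_times, comm_times, overlap):
    """Toy model of one training step. Without overlap, all communication
    waits for the full backward pass; with overlap, bucket i's all-reduce
    starts once its gradients are ready."""
    if not overlap:
        return sum(bwd_times) + sum(comm_times)
    ready = 0.0       # backward-pass clock
    comm_free = 0.0   # when the comm stream finishes its previous bucket
    for b, c in zip(bwd_times, comm_times):
        ready += b                      # gradients for this bucket are ready
        start = max(ready, comm_free)   # comm stream may still be busy
        comm_free = start + c
    return max(ready, comm_free)
```

With three equal buckets (2 ms backward, 1 ms comm each), the naive step takes 9 ms while the overlapped step takes 7 ms — only the last bucket's all-reduce is left exposed.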
Simulate large batch sizes without the memory cost. Interactive diagrams show how micro-batches accumulate gradients before a single optimizer step.
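The accumulation trick relies on gradients being linear in the batch: summing per-micro-batch gradients and dividing once recovers the full-batch gradient exactly. A minimal sketch with a toy one-parameter model (names and loss are ours, for illustration):

```python
def grad(w, x, y):
    # dL/dw for a single example under the toy loss (w*x - y)^2
    return 2 * x * (w * x - y)

def accumulated_grad(w, batch, micro_size):
    """Sum gradients over micro-batches, stepping the optimizer only once."""
    total, n = 0.0, len(batch)
    for i in range(0, n, micro_size):
        micro = batch[i:i + micro_size]
        total += sum(grad(w, x, y) for x, y in micro)  # accumulate, no step
    return total / n  # normalize once, after the last micro-batch
```

Peak memory scales with the micro-batch size, while the optimizer sees the gradient of the full effective batch.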
Master the practical knobs of distributed training — memory budgets, batch size selection, and hardware utilization.
The classic memory-compute tradeoff. Visualize how selective and full recomputation slash activation memory, with transformer layer diagrams and memory bar charts.
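The napkin math behind the tradeoff: keeping a checkpoint every √L layers and recomputing within each segment turns O(L) activation memory into O(√L), at the cost of roughly one extra forward pass. A simplified model (our accounting, not a framework's):

```python
import math

def activation_memory(n_layers, per_layer, strategy):
    """Toy activation-memory model.
    'none' : keep every layer's activations for backward.
    'full' : checkpoint every ~sqrt(L)-th layer and recompute the rest
             (the classic O(sqrt(L)) schedule)."""
    if strategy == "none":
        return n_layers * per_layer
    k = math.isqrt(n_layers)  # checkpoint interval ~ sqrt(L)
    # live memory ~ the checkpoints plus one segment being recomputed
    return (n_layers // k + k) * per_layer
```

For a 64-layer model at one unit of activation memory per layer, this drops peak activation memory from 64 units to 16.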
How to find the largest batch that fits in GPU memory. Complete memory breakdown with GPU sharding, model presets, and an interactive batch size calculator.
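The core of such a calculator is simple arithmetic: fixed model-state memory plus activations that grow linearly with batch size. A hedged sketch assuming mixed-precision Adam's usual ~16 bytes per parameter (fp16 params + fp16 grads + fp32 master copy, momentum, and variance), with no sharding:

```python
def largest_batch(gpu_mem_gb, n_params_billion, act_per_sample_gb):
    """Napkin estimate of the largest batch that fits (our simplification;
    ignores fragmentation, CUDA context, and temporary buffers)."""
    fixed = n_params_billion * 16        # GB of model states, ~16 B/param
    free = gpu_mem_gb - fixed
    if free <= 0:
        return 0                         # model states alone don't fit
    return int(free // act_per_sample_gb)
```

A 1.5B-parameter model on an 80 GB GPU with 2 GB of activations per sample leaves room for a batch of 28; a 7B model without sharding doesn't fit at all.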
From matrix multiply to TensorBoard. Understand SM occupancy, memory bandwidth bottlenecks, and how to read profiler traces to squeeze every FLOP from your GPU.
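A useful first diagnostic is arithmetic intensity — FLOPs per byte moved. A matmul below the GPU's ridge point (peak FLOPs ÷ memory bandwidth) is bandwidth-bound no matter how well the kernel is written. A roofline-style estimate under the ideal assumption that each operand is read once:

```python
def matmul_intensity(m, n, k, bytes_per_el=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul in fp16, assuming
    perfect reuse: read A and B once, write C once. Toy estimate only."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved
```

A batch-1 GEMV (1×4096 @ 4096×4096) lands near 1 FLOP/byte — hopelessly bandwidth-bound — while the square 4096³ matmul exceeds 1000 FLOPs/byte and can saturate the tensor cores.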
Choosing the right batch size isn't arbitrary. Explore the gradient noise scale (B_noise) that tells you when larger batches stop helping and just waste compute.
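The tradeoff follows the empirical large-batch model of McCandlish et al.: steps to reach a target loss scale as (1 + B_noise/B), while total examples scale as (1 + B/B_noise). A sketch that takes B_noise as given (estimating it is what the guide covers):

```python
def relative_cost(batch, b_noise):
    """Relative training cost vs the ideal limits, per the empirical
    large-batch-training model: small batches waste steps, large batches
    waste examples (compute)."""
    steps = 1 + b_noise / batch     # -> 1 as batch grows (fewer steps)
    examples = 1 + batch / b_noise  # -> 1 as batch shrinks (less compute)
    return steps, examples
```

At B = B_noise both costs are 2× their minimum — the sweet spot. Pushing to 4×B_noise only shaves steps from 1.25× while burning 5× the examples.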
A progressive walkthrough of the ZeRO optimizer — from config parameters to full FSDP, each with concrete numbers on a tiny transformer.
Every ZeRO tuning knob explained visually. Bucket sizes, compute-comm overlap, parameter persistence, napkin math for why 14 buckets is enough, and a full interactive timeline builder.
A concrete walkthrough with actual numbers. See how sharding Adam optimizer states across GPUs cuts memory by 37.5% with zero extra communication cost.
Building on ZeRO-1, now gradients are sharded too via reduce-scatter. Walk through every step with concrete numbers — 43.8% memory savings, same communication.
The final frontier — parameters themselves are sharded. No full replica anywhere. Walk through gather-compute-flush, prefetching, and 50% memory savings at 1.5× communication cost.
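The 37.5%, 43.8%, and 50% figures above fall straight out of byte accounting on 2 GPUs, following the ZeRO paper's mixed-precision model states — 2 bytes of fp16 params, 2 bytes of fp16 grads, and 12 bytes of optimizer state (fp32 master copy, momentum, variance) per parameter:

```python
def zero_memory_per_gpu(n_params, n_gpus, stage):
    """Per-GPU model-state bytes under each ZeRO stage (activations and
    temporary buffers excluded)."""
    p, g, o = 2, 2, 12          # bytes/param: fp16 params, fp16 grads, Adam states
    if stage >= 1: o /= n_gpus  # ZeRO-1: shard optimizer states
    if stage >= 2: g /= n_gpus  # ZeRO-2: also shard gradients
    if stage >= 3: p /= n_gpus  # ZeRO-3: also shard parameters
    return n_params * (p + g + o)

def savings(n_gpus, stage):
    base = zero_memory_per_gpu(1, n_gpus, 0)   # plain DDP: 16 bytes/param
    return 1 - zero_memory_per_gpu(1, n_gpus, stage) / base
```

With more GPUs the savings keep growing — the dominant 12-byte optimizer term divides by the world size.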
How weight matrices are split within a layer, how SP complements TP, and why the right sharding strategy changes everything.
See how weight matrices split across GPUs — column parallel for up-projection, row parallel for down. Then watch SP eliminate redundant activations in LayerNorm & Dropout regions.
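The column-then-row pairing is what makes a single all-reduce suffice: splitting W_up by columns gives each GPU a slice of the hidden activation, and the matching row slice of W_down turns it into a partial sum of the output. A toy sketch with plain lists (no real GPUs; the nonlinearity between the matmuls is omitted but would apply elementwise to each shard):

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def tp_mlp(x, w_up, w_down, n_gpus=2):
    """Megatron-style MLP split: W_up by columns, W_down by rows. Each 'GPU'
    computes a partial output; one elementwise sum (the all-reduce) recovers
    the full result."""
    cols = len(w_up[0]) // n_gpus
    partials = None
    for g in range(n_gpus):
        up_g = [row[g * cols:(g + 1) * cols] for row in w_up]  # column shard
        h_g = matmul(x, up_g)                                  # local hidden slice
        down_g = w_down[g * cols:(g + 1) * cols]               # matching row shard
        y_g = matmul(h_g, down_g)                              # partial output
        partials = y_g if partials is None else [
            [a + b for a, b in zip(r1, r2)] for r1, r2 in zip(partials, y_g)]
    return partials  # == x @ w_up @ w_down after the "all-reduce"
```

The sharded result matches the unsharded matmul exactly, because x·W_up·W_down = Σ_g (x·U_g)·D_g when U_g/D_g are matching column/row blocks.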
Watch a prompt arrive and tokens generate one by one. See side-by-side what ZeRO-3 and Tensor Parallelism do at each layer — and why TP moves ~10,000× less data per token.
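The ~10,000× gap is back-of-envelope arithmetic: per layer per generated token, TP all-reduces activations (on the order of the hidden size), while ZeRO-3 must all-gather the layer's weights (on the order of the hidden size squared). A hedged sketch using our rough assumption of ~12h² parameters per attention + MLP block:

```python
def decode_bytes_per_layer(hidden, batch=1, dtype_bytes=2):
    """Rough per-layer data movement when generating ONE token.
    TP: ~2 all-reduces of batch*hidden activation elements.
    ZeRO-3: all-gather ~12*hidden^2 weight elements (our block-size estimate)."""
    tp = 2 * batch * hidden * dtype_bytes
    zero3 = 12 * hidden * hidden * dtype_bytes
    return tp, zero3
```

The ratio is ~6·hidden — about 12,000× for a hidden size of 2048 — which is why ZeRO-3 is a poor fit for small-batch inference even though it trains the same model happily.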
How pipeline parallelism splits models into stages across GPUs — and how TP, PP, and DP combine into 3D parallelism at scale.
Interactive 24-GPU Megatron-LM layout. Hover to explore how TP groups (NVLink), PP stages (P2P), and DP replicas (AllReduce) partition communication across the network hierarchy.
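The layout boils down to decomposing each rank into (TP, DP, PP) coordinates. A sketch assuming TP=2, PP=3, DP=4 (2·4·3 = 24) with TP innermost so that TP partners are NVLink neighbors — one common Megatron-style ordering; real frameworks make the order configurable:

```python
def rank_to_groups(rank, tp=2, dp=4, pp=3):
    """Map a global rank to its (tp, dp, pp) coordinates, TP varying
    fastest. Ranks sharing a tp group differ only in tp_rank, etc."""
    tp_rank = rank % tp
    dp_rank = (rank // tp) % dp
    pp_rank = rank // (tp * dp)
    return tp_rank, dp_rank, pp_rank
```

Every rank gets a unique coordinate triple; collectives then run within fixed slices — all-reduce across the TP axis, point-to-point along PP, gradient all-reduce across DP.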
Interleaved stages (v chunks per GPU), Zero Bubble (B/W decomposition), and DualPipe (DeepSeek-V3). Interactive bubble calculator, animated pipeline diagrams, and full comparison.
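The bubble calculator rests on one formula: with p stages and m micro-batches, a 1F1B-style schedule idles for (p-1) micro-batch slots, and interleaving v virtual chunks per GPU shrinks that by ~v (the simplified analysis from Megatron's interleaved schedule):

```python
def bubble_fraction(stages, microbatches, v=1):
    """Fraction of total step time spent idle in the pipeline bubble,
    under the simplified equal-stage-time model."""
    return (stages - 1) / (v * microbatches + stages - 1)
```

Four stages and twelve micro-batches waste 20% of the step; interleaving with v=3 cuts that below 8%. More micro-batches help too, which is why PP pairs naturally with gradient accumulation.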
Step-by-step walkthrough: from single-GPU memory overflow to TP limits to the depth-axis split. Interactive 4-step visualization with adjustable GPU count (2/4/8).
How long sequences are split across GPUs using ring communication — and why zigzag assignment balances the workload.
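The zigzag trick in two lines: split the sequence into 2N chunks and give GPU i chunks i and 2N-1-i. Under causal attention, chunk k attends to k+1 chunks, so each GPU's workload is (i+1) + (2N-i) = 2N+1 — identical for every GPU, where a naive contiguous split would leave the last GPU with most of the work:

```python
def zigzag_assignment(n_gpus):
    """GPU i gets sequence chunks i and 2N-1-i (one early, one late)."""
    return [(i, 2 * n_gpus - 1 - i) for i in range(n_gpus)]

def causal_work(assignment):
    """Causal-attention cost of chunk k is proportional to k+1
    (it attends to itself and every earlier chunk)."""
    return [sum(k + 1 for k in pair) for pair in assignment]
```

With 4 GPUs, every GPU does exactly 9 units of attention work; a contiguous assignment would range from 3 to 15.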
Sparse models that activate only a fraction of parameters per token — and how to distribute experts across GPUs with All-to-All communication.
Dense vs sparse models, router gating, top-K expert selection, and load balancing. Watch tokens flow through the gating network to their assigned experts in real time.
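The router's core is small: softmax over expert logits, keep the top K, renormalize the kept gates. A minimal sketch of that standard shape (real routers add jitter noise and auxiliary load-balancing losses, omitted here):

```python
import math

def route(logits, k=2):
    """Softmax gating + top-k selection for one token.
    Returns {expert_index: gate_weight}, weights summing to 1."""
    mx = max(logits)                                 # stabilize the softmax
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    topk = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    denom = sum(probs[i] for i in topk)
    return {i: probs[i] / denom for i in topk}       # renormalized gates
```

The token's output is then the gate-weighted sum of its K experts' outputs — all other experts do no work for this token, which is the source of MoE's sparsity.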
Distribute MoE experts across GPUs. Watch the All-to-All dispatch: tokens travel to their expert’s GPU, compute, then return. See how EP combines with TP and DP.
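The dispatch round-trip can be simulated without any real communication — bucket tokens by their expert's GPU (the first All-to-All), run the "expert", and scatter results back to original token positions (the second All-to-All). A toy sketch with a stand-in for the expert FFN:

```python
def all_to_all_dispatch(tokens, expert_of, n_gpus):
    """Simulate EP: expert e lives on GPU e % n_gpus (our placement rule).
    Returns per-token results in the original token order."""
    # First All-to-All: bucket token indices by destination GPU
    inbox = [[] for _ in range(n_gpus)]
    for idx, e in enumerate(expert_of):
        inbox[e % n_gpus].append(idx)
    # Expert compute + second All-to-All: results return to token positions
    out = [None] * len(tokens)
    for gpu, idxs in enumerate(inbox):
        for idx in idxs:
            out[idx] = (tokens[idx], expert_of[idx])  # stand-in for expert FFN
    return out
```

Note that token order within each expert's bucket is arbitrary — only the return scatter restores sequence order, which is why both All-to-Alls carry routing metadata in real implementations.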
Everything comes together — TP, PP, DP, ZeRO, CP, and EP in a realistic end-to-end training setup. Follow a startup’s journey configuring 64 H100 GPUs to train a 7B-parameter model.