Interactive, from-first-principles visual guides for every concept in distributed GPU training — from data parallelism to ZeRO optimizer internals.
Understand how multiple GPUs collaborate on the same model — from naive replication to overlapped gradient synchronization.
From single-GPU to production DDP. Covers naive DP, interleaved DP, PyTorch DDP, and ring all-reduce with Colab benchmarks on real hardware.
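The ring all-reduce at the heart of DDP can be sketched with plain Python lists standing in for GPUs — a toy simulation of the reduce-scatter and all-gather phases, not real NCCL communication:

```python
def ring_all_reduce(data):
    """data[g][c] = value of chunk c on simulated GPU g (one chunk per GPU).
    Returns the fully reduced (summed) chunks, identical on every GPU."""
    n = len(data)
    buf = [list(row) for row in data]
    # Phase 1: reduce-scatter. At step s, GPU g sends chunk (g-s) mod n to
    # GPU g+1, which adds it. After n-1 steps, GPU g owns the complete sum
    # of chunk (g+1) mod n.
    for step in range(n - 1):
        sends = [(g, (g - step) % n, buf[g][(g - step) % n]) for g in range(n)]
        for g, c, val in sends:
            buf[(g + 1) % n][c] += val
    # Phase 2: all-gather. Each GPU forwards the completed chunks around the
    # ring, overwriting stale partial sums.
    for step in range(n - 1):
        sends = [(g, (g + 1 - step) % n, buf[g][(g + 1 - step) % n]) for g in range(n)]
        for g, c, val in sends:
            buf[(g + 1) % n][c] = val
    return buf
```

Each GPU sends 2(n-1)/n of the gradient in total, independent of n — which is why the ring algorithm scales to large GPU counts.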
Naive DP vs DDP overlap. See exactly how DDP hides communication behind backward compute with live step-by-step animations and profiler timelines.
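The overlap effect can be captured in a toy timeline model (our simplification, not PyTorch's scheduler): each gradient bucket's all-reduce may start as soon as backward has produced its gradients, subject to the communication stream being free:

```python
def step_time(bwd_times, comm_times, overlap):
    """Toy model of one training step. Without overlap, all communication
    waits for the full backward pass; with overlap, bucket i's all-reduce
    starts once its gradients are ready."""
    if not overlap:
        return sum(bwd_times) + sum(comm_times)
    ready = 0.0       # backward-pass clock
    comm_free = 0.0   # when the comm stream finishes its previous bucket
    for b, c in zip(bwd_times, comm_times):
        ready += b                      # gradients for this bucket are ready
        start = max(ready, comm_free)   # comm stream may still be busy
        comm_free = start + c
    return max(ready, comm_free)
```

With three equal buckets (2 ms backward, 1 ms comm each), the naive step takes 9 ms while the overlapped step takes 7 ms — only the last bucket's all-reduce is left exposed.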
Simulate large batch sizes without the memory cost. Interactive diagrams show how micro-batches accumulate gradients before a single optimizer step.
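The accumulation trick relies on gradients being linear in the batch: summing per-micro-batch gradients and dividing once recovers the full-batch gradient exactly. A minimal sketch with a toy one-parameter model (names and loss are ours, for illustration):

```python
def grad(w, x, y):
    # dL/dw for a single example under the toy loss (w*x - y)^2
    return 2 * x * (w * x - y)

def accumulated_grad(w, batch, micro_size):
    """Sum gradients over micro-batches, stepping the optimizer only once."""
    total, n = 0.0, len(batch)
    for i in range(0, n, micro_size):
        micro = batch[i:i + micro_size]
        total += sum(grad(w, x, y) for x, y in micro)  # accumulate, no step
    return total / n  # normalize once, after the last micro-batch
```

Peak memory scales with the micro-batch size, while the optimizer sees the gradient of the full effective batch.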
Master the practical knobs of distributed training — memory budgets, batch size selection, and hardware utilization.
The classic memory-compute tradeoff. Visualize how selective and full recomputation slash activation memory, with transformer layer diagrams and memory bar charts.
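The napkin math behind the tradeoff: keeping a checkpoint every √L layers and recomputing within each segment turns O(L) activation memory into O(√L), at the cost of roughly one extra forward pass. A simplified model (our accounting, not a framework's):

```python
import math

def activation_memory(n_layers, per_layer, strategy):
    """Toy activation-memory model.
    'none' : keep every layer's activations for backward.
    'full' : checkpoint every ~sqrt(L)-th layer and recompute the rest
             (the classic O(sqrt(L)) schedule)."""
    if strategy == "none":
        return n_layers * per_layer
    k = math.isqrt(n_layers)  # checkpoint interval ~ sqrt(L)
    # live memory ~ the checkpoints plus one segment being recomputed
    return (n_layers // k + k) * per_layer
```

For a 64-layer model at one unit of activation memory per layer, this drops peak activation memory from 64 units to 16.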
How to find the largest batch that fits in GPU memory. Complete memory breakdown with GPU sharding, model presets, and an interactive batch size calculator.
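The core of such a calculator is simple arithmetic: fixed model-state memory plus activations that grow linearly with batch size. A hedged sketch assuming mixed-precision Adam's usual ~16 bytes per parameter (fp16 params + fp16 grads + fp32 master copy, momentum, and variance), with no sharding:

```python
def largest_batch(gpu_mem_gb, n_params_billion, act_per_sample_gb):
    """Napkin estimate of the largest batch that fits (our simplification;
    ignores fragmentation, CUDA context, and temporary buffers)."""
    fixed = n_params_billion * 16        # GB of model states, ~16 B/param
    free = gpu_mem_gb - fixed
    if free <= 0:
        return 0                         # model states alone don't fit
    return int(free // act_per_sample_gb)
```

A 1.5B-parameter model on an 80 GB GPU with 2 GB of activations per sample leaves room for a batch of 28; a 7B model without sharding doesn't fit at all.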
From matrix multiply to TensorBoard. Understand SM occupancy, memory bandwidth bottlenecks, and how to read profiler traces to squeeze every FLOP from your GPU.
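A useful first diagnostic is arithmetic intensity — FLOPs per byte moved. A matmul below the GPU's ridge point (peak FLOPs ÷ memory bandwidth) is bandwidth-bound no matter how well the kernel is written. A roofline-style estimate under the ideal assumption that each operand is read once:

```python
def matmul_intensity(m, n, k, bytes_per_el=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul in fp16, assuming
    perfect reuse: read A and B once, write C once. Toy estimate only."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved
```

A batch-1 GEMV (1×4096 @ 4096×4096) lands near 1 FLOP/byte — hopelessly bandwidth-bound — while the square 4096³ matmul exceeds 1000 FLOPs/byte and can saturate the tensor cores.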
Choosing the right batch size isn't arbitrary. Explore the gradient noise scale (B_noise) that tells you when larger batches stop helping and just waste compute.
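The tradeoff follows the empirical large-batch model of McCandlish et al.: steps to reach a target loss scale as (1 + B_noise/B), while total examples scale as (1 + B/B_noise). A sketch that takes B_noise as given (estimating it is what the guide covers):

```python
def relative_cost(batch, b_noise):
    """Relative training cost vs the ideal limits, per the empirical
    large-batch-training model: small batches waste steps, large batches
    waste examples (compute)."""
    steps = 1 + b_noise / batch     # -> 1 as batch grows (fewer steps)
    examples = 1 + batch / b_noise  # -> 1 as batch shrinks (less compute)
    return steps, examples
```

At B = B_noise both costs are 2× their minimum — the sweet spot. Pushing to 4×B_noise only shaves steps from 1.25× while burning 5× the examples.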
A progressive walkthrough of the ZeRO optimizer — from config parameters to full FSDP, each with concrete numbers on a tiny transformer.
Every ZeRO tuning knob explained visually. Bucket sizes, compute-comm overlap, parameter persistence, napkin math for why 14 buckets is enough, and a full interactive timeline builder.
A concrete walkthrough with actual numbers. See how sharding Adam optimizer states across GPUs cuts memory by 37.5% with zero extra communication cost.
Building on ZeRO-1, now gradients are sharded too via reduce-scatter. Walk through every step with concrete numbers — 43.8% memory savings, same communication.
The final frontier — parameters themselves are sharded. No full replica anywhere. Walk through gather-compute-flush, prefetching, and 50% memory savings at 1.5× communication cost.
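The 37.5%, 43.8%, and 50% figures above fall straight out of byte accounting on 2 GPUs, following the ZeRO paper's mixed-precision model states — 2 bytes of fp16 params, 2 bytes of fp16 grads, and 12 bytes of optimizer state (fp32 master copy, momentum, variance) per parameter:

```python
def zero_memory_per_gpu(n_params, n_gpus, stage):
    """Per-GPU model-state bytes under each ZeRO stage (activations and
    temporary buffers excluded)."""
    p, g, o = 2, 2, 12          # bytes/param: fp16 params, fp16 grads, Adam states
    if stage >= 1: o /= n_gpus  # ZeRO-1: shard optimizer states
    if stage >= 2: g /= n_gpus  # ZeRO-2: also shard gradients
    if stage >= 3: p /= n_gpus  # ZeRO-3: also shard parameters
    return n_params * (p + g + o)

def savings(n_gpus, stage):
    base = zero_memory_per_gpu(1, n_gpus, 0)   # plain DDP: 16 bytes/param
    return 1 - zero_memory_per_gpu(1, n_gpus, stage) / base
```

With more GPUs the savings keep growing — the dominant 12-byte optimizer term divides by the world size.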
How weight matrices are split within a layer, how SP complements TP, and why the right sharding strategy changes everything.
See how weight matrices split across GPUs — column parallel for up-projection, row parallel for down. Then watch SP eliminate redundant activations in LayerNorm & Dropout regions.
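The column-then-row pairing is what makes a single all-reduce suffice: splitting W_up by columns gives each GPU a slice of the hidden activation, and the matching row slice of W_down turns it into a partial sum of the output. A toy sketch with plain lists (no real GPUs; the nonlinearity between the matmuls is omitted but would apply elementwise to each shard):

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def tp_mlp(x, w_up, w_down, n_gpus=2):
    """Megatron-style MLP split: W_up by columns, W_down by rows. Each 'GPU'
    computes a partial output; one elementwise sum (the all-reduce) recovers
    the full result."""
    cols = len(w_up[0]) // n_gpus
    partials = None
    for g in range(n_gpus):
        up_g = [row[g * cols:(g + 1) * cols] for row in w_up]  # column shard
        h_g = matmul(x, up_g)                                  # local hidden slice
        down_g = w_down[g * cols:(g + 1) * cols]               # matching row shard
        y_g = matmul(h_g, down_g)                              # partial output
        partials = y_g if partials is None else [
            [a + b for a, b in zip(r1, r2)] for r1, r2 in zip(partials, y_g)]
    return partials  # == x @ w_up @ w_down after the "all-reduce"
```

The sharded result matches the unsharded matmul exactly, because x·W_up·W_down = Σ_g (x·U_g)·D_g when U_g/D_g are matching column/row blocks.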
Watch a prompt arrive and tokens generate one by one. See side-by-side what ZeRO-3 and Tensor Parallelism do at each layer — and why TP moves ~10,000× less data per token.
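The ~10,000× gap is back-of-envelope arithmetic: per layer per generated token, TP all-reduces activations (on the order of the hidden size), while ZeRO-3 must all-gather the layer's weights (on the order of the hidden size squared). A hedged sketch using our rough assumption of ~12h² parameters per attention + MLP block:

```python
def decode_bytes_per_layer(hidden, batch=1, dtype_bytes=2):
    """Rough per-layer data movement when generating ONE token.
    TP: ~2 all-reduces of batch*hidden activation elements.
    ZeRO-3: all-gather ~12*hidden^2 weight elements (our block-size estimate)."""
    tp = 2 * batch * hidden * dtype_bytes
    zero3 = 12 * hidden * hidden * dtype_bytes
    return tp, zero3
```

The ratio is ~6·hidden — about 12,000× for a hidden size of 2048 — which is why ZeRO-3 is a poor fit for small-batch inference even though it trains the same model happily.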
How pipeline parallelism splits models into stages across GPUs — and how TP, PP, and DP combine into 3D parallelism at scale.
Interactive 24-GPU Megatron-LM layout. Hover to explore how TP groups (NVLink), PP stages (P2P), and DP replicas (AllReduce) partition communication across the network hierarchy.
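The layout boils down to decomposing each rank into (TP, DP, PP) coordinates. A sketch assuming TP=2, PP=3, DP=4 (2·4·3 = 24) with TP innermost so that TP partners are NVLink neighbors — one common Megatron-style ordering; real frameworks make the order configurable:

```python
def rank_to_groups(rank, tp=2, dp=4, pp=3):
    """Map a global rank to its (tp, dp, pp) coordinates, TP varying
    fastest. Ranks sharing a tp group differ only in tp_rank, etc."""
    tp_rank = rank % tp
    dp_rank = (rank // tp) % dp
    pp_rank = rank // (tp * dp)
    return tp_rank, dp_rank, pp_rank
```

Every rank gets a unique coordinate triple; collectives then run within fixed slices — all-reduce across the TP axis, point-to-point along PP, gradient all-reduce across DP.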
Interleaved stages (v chunks per GPU), Zero Bubble (B/W decomposition), and DualPipe (DeepSeek-V3). Interactive bubble calculator, animated pipeline diagrams, and full comparison.
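The bubble calculator rests on one formula: with p stages and m micro-batches, a 1F1B-style schedule idles for (p-1) micro-batch slots, and interleaving v virtual chunks per GPU shrinks that by ~v (the simplified analysis from Megatron's interleaved schedule):

```python
def bubble_fraction(stages, microbatches, v=1):
    """Fraction of total step time spent idle in the pipeline bubble,
    under the simplified equal-stage-time model."""
    return (stages - 1) / (v * microbatches + stages - 1)
```

Four stages and twelve micro-batches waste 20% of the step; interleaving with v=3 cuts that below 8%. More micro-batches help too, which is why PP pairs naturally with gradient accumulation.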
Step-by-step walkthrough: from single-GPU memory overflow to TP limits to the depth-axis split. Interactive 4-step visualization with adjustable GPU count (2/4/8).
How long sequences are split across GPUs using ring communication — and why zigzag assignment balances the workload.
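The zigzag trick in two lines: split the sequence into 2N chunks and give GPU i chunks i and 2N-1-i. Under causal attention, chunk k attends to k+1 chunks, so each GPU's workload is (i+1) + (2N-i) = 2N+1 — identical for every GPU, where a naive contiguous split would leave the last GPU with most of the work:

```python
def zigzag_assignment(n_gpus):
    """GPU i gets sequence chunks i and 2N-1-i (one early, one late)."""
    return [(i, 2 * n_gpus - 1 - i) for i in range(n_gpus)]

def causal_work(assignment):
    """Causal-attention cost of chunk k is proportional to k+1
    (it attends to itself and every earlier chunk)."""
    return [sum(k + 1 for k in pair) for pair in assignment]
```

With 4 GPUs, every GPU does exactly 9 units of attention work; a contiguous assignment would range from 3 to 15.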
Sparse models that activate only a fraction of parameters per token — and how to distribute experts across GPUs with All-to-All communication.
Dense vs sparse models, router gating, top-K expert selection, and load balancing. Watch tokens flow through the gating network to their assigned experts in real time.
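The router's core is small: softmax over expert logits, keep the top K, renormalize the kept gates. A minimal sketch of that standard shape (real routers add jitter noise and auxiliary load-balancing losses, omitted here):

```python
import math

def route(logits, k=2):
    """Softmax gating + top-k selection for one token.
    Returns {expert_index: gate_weight}, weights summing to 1."""
    mx = max(logits)                                 # stabilize the softmax
    exps = [math.exp(l - mx) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    topk = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    denom = sum(probs[i] for i in topk)
    return {i: probs[i] / denom for i in topk}       # renormalized gates
```

The token's output is then the gate-weighted sum of its K experts' outputs — all other experts do no work for this token, which is the source of MoE's sparsity.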
Distribute MoE experts across GPUs. Watch the All-to-All dispatch: tokens travel to their expert’s GPU, compute, then return. See how EP combines with TP and DP.
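The dispatch round-trip can be simulated without any real communication — bucket tokens by their expert's GPU (the first All-to-All), run the "expert", and scatter results back to original token positions (the second All-to-All). A toy sketch with a stand-in for the expert FFN:

```python
def all_to_all_dispatch(tokens, expert_of, n_gpus):
    """Simulate EP: expert e lives on GPU e % n_gpus (our placement rule).
    Returns per-token results in the original token order."""
    # First All-to-All: bucket token indices by destination GPU
    inbox = [[] for _ in range(n_gpus)]
    for idx, e in enumerate(expert_of):
        inbox[e % n_gpus].append(idx)
    # Expert compute + second All-to-All: results return to token positions
    out = [None] * len(tokens)
    for gpu, idxs in enumerate(inbox):
        for idx in idxs:
            out[idx] = (tokens[idx], expert_of[idx])  # stand-in for expert FFN
    return out
```

Note that token order within each expert's bucket is arbitrary — only the return scatter restores sequence order, which is why both All-to-Alls carry routing metadata in real implementations.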
Everything comes together — TP, PP, DP, ZeRO, CP, and EP in a realistic end-to-end training setup. Follow a startup’s journey configuring 64 H100 GPUs to train a 7B-parameter model.