From interleaved stages that shrink the bubble by a factor of v, to zero-bubble schedules that decompose backward passes into B and W operations, to DualPipe's dual-stream design used in DeepSeek-V3: this is the frontier of pipeline parallelism.
1F1B improved memory by limiting in-flight micro-batches, but the pipeline bubble remained stubbornly large.
The bubble fraction with standard 1F1B is (p-1)/m where p = number of pipeline stages and m = number of micro-batches.
1F1B limits peak activations to just p micro-batches in flight (vs. m in GPipe), but the idle time stays the same.
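The bubble fraction formula above is easy to check numerically. A minimal sketch (the function name is illustrative):

```python
def bubble_fraction_1f1b(p: int, m: int) -> float:
    """Idle fraction of a standard 1F1B pipeline: the (p - 1)
    warmup/cooldown slots amortized over m micro-batches."""
    return (p - 1) / m

# With 8 stages and 32 micro-batches, ~21.9% of compute time is bubble.
print(bubble_fraction_1f1b(8, 32))  # 0.21875
```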
Can we do better? Yes: by slicing the model differently across GPUs and being clever about backward pass decomposition, we can push the bubble toward zero.
Increasing m reduces the bubble, but more micro-batches in flight means more activation memory.
Is there another knob we can turn? Enter interleaved stages.
Instead of assigning consecutive layers to each GPU, spread them out.
Each GPU holds v non-contiguous chunks, creating a "looping pipeline."
With v=1, a micro-batch goes GPU0 → GPU1 → GPU2 → GPU3.
With v=2, it goes GPU0 → GPU1 → GPU2 → GPU3 → GPU0 → GPU1 → GPU2 → GPU3.
Each pass through a GPU is faster (fewer layers), enabling better interleaving.
Each forward and backward pass on a GPU takes 1/v the time of the standard case. This means the "warmup" and "cooldown" phases of the pipeline are v times shorter, directly shrinking the bubble.
Since each micro-batch visits every GPU v times instead of once, the number of point-to-point communications increases by a factor of v.
Standard: 2(p-1) sends per micro-batch → Interleaved: 2v(p-1) sends per micro-batch
Explore how pipeline stages (p), micro-batches (m), and interleaved chunks (v) affect the bubble. Drag the sliders to see the formula in action.
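The two formulas the explorer visualizes, the bubble fraction (p-1)/(v·m) and the communication count 2v(p-1), can be sketched directly (function names are illustrative):

```python
def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Bubble fraction with v interleaved chunks per GPU: (p-1)/(v*m).
    v=1 recovers standard 1F1B."""
    return (p - 1) / (v * m)

def p2p_sends_per_microbatch(p: int, v: int = 1) -> int:
    """Point-to-point sends per micro-batch (forward + backward): 2v(p-1)."""
    return 2 * v * (p - 1)

# Doubling v halves the bubble but doubles the sends.
for v in (1, 2, 4):
    print(v, bubble_fraction(8, 32, v), p2p_sends_per_microbatch(8, v))
```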
With interleaved stages, a GPU must decide: process later layers of earlier micro-batches (depth-first), or earlier layers of later micro-batches (breadth-first)?
Depth-first priority: get each micro-batch through the full model as fast as possible.
Breadth-first priority: fill the pipeline as quickly as possible with many micro-batches.
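The two policies produce different orderings of the same work items. A toy sketch of the forward-work order on a single GPU (the scheduling policy names and function are illustrative, not from any real framework):

```python
def schedule(m: int, v: int, policy: str):
    """Order forward work items (micro_batch, chunk) on one GPU.
    depth-first: push each micro-batch through all v chunks before the next;
    breadth-first: run chunk 0 for every micro-batch before touching chunk 1."""
    items = [(mb, c) for mb in range(m) for c in range(v)]
    if policy == "depth":
        key = lambda item: (item[0], item[1])   # micro-batch first
    else:
        key = lambda item: (item[1], item[0])   # chunk (layer group) first
    return sorted(items, key=key)

print(schedule(3, 2, "depth"))    # [(0,0),(0,1),(1,0),(1,1),(2,0),(2,1)]
print(schedule(3, 2, "breadth"))  # [(0,0),(1,0),(2,0),(0,1),(1,1),(2,1)]
```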
The key insight: the backward pass through a matrix multiplication involves two separate operations that can be scheduled independently.
Computes gradients with respect to the inputs of a layer. This is needed by the preceding stage to continue its backward pass.
Computes gradients with respect to the weights of a layer. This is only needed before the optimizer step, not by other stages.
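For a linear layer Y = XW, the two backward operations are just two independent matrix multiplications; a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # layer input (saved activation)
W = rng.standard_normal((8, 3))   # layer weights
dY = rng.standard_normal((4, 3))  # gradient arriving from the next stage

# B: input gradient. Must be produced promptly, because the
# preceding stage is blocked on it to continue its backward pass.
dX = dY @ W.T

# W: weight gradient. Only needed before the optimizer step, so this
# matmul can be deferred and scheduled into pipeline bubbles.
dW = X.T @ dY

assert dX.shape == X.shape and dW.shape == W.shape
```

Note that neither operation depends on the other's output, which is exactly what gives the scheduler freedom to separate them.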
In practice, fully optimizing these fine-grained schedules means solving an Integer Linear Programming problem; real systems fall back on heuristics.
DeepSeek-V3/R1's pipeline schedule: two micro-batch streams propagating from both ends of the pipeline, interleaved to minimize idle time.
One stream of micro-batches enters from GPU 0 (left), another from GPU p-1 (right). They meet in the middle, maximizing utilization.
Like Zero Bubble, DualPipe splits backward into B (input grad) and W (weight grad), gaining scheduling flexibility.
DeepSeek-V3 achieved "near-zero all-to-all communication overhead" using this schedule with 671B parameters.
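To build intuition for the dual-stream idea, here is a toy, forward-only occupancy sketch (entirely illustrative; real DualPipe also interleaves backward B and W work and overlaps communication):

```python
def dualpipe_forward_occupancy(p: int, m: int):
    """Toy sketch: stream A micro-batches enter at stage 0, stream B
    at stage p-1, each advancing one stage per tick. Returns, per tick,
    the set of (stream, stage) slots that are busy."""
    ticks = m + p - 1
    grid = []
    for t in range(ticks):
        busy = set()
        for i in range(m):                 # stream A, left-to-right
            s = t - i
            if 0 <= s < p:
                busy.add(("A", s))
        for j in range(m):                 # stream B, right-to-left
            s = p - 1 - (t - j)
            if t - j >= 0 and 0 <= s < p:
                busy.add(("B", s))
        grid.append(busy)
    return grid

# At tick 0 both ends are already working; by tick p-1 the streams meet.
print(dualpipe_forward_occupancy(4, 4)[0])
```

The point of the sketch is that the pipeline fills from both ends at once, so no GPU sits idle waiting for work to arrive from a single entry point.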
How do all the advanced pipeline schedules stack up against each other?
| Property | Interleaved 1F1B | Zero Bubble (ZB-H2) | DualPipe |
|---|---|---|---|
| Bubble Fraction | (p-1)/(v*m) | ~0% (theoretical) | ~0% (practical) |
| Key Technique | v model chunks per GPU | B/W decomposition | Dual streams + B/W decomposition |
| Communication | v× more p2p sends | Same as 1F1B | Near-zero overhead |
| Scheduling Complexity | Moderate | High (ILP solver) | Very High |
| Memory Overhead | Same as 1F1B | Slightly higher (deferred W) | Higher (dual streams) |
| Backward Granularity | Coarse (full backward) | Fine (B + W split) | Fine (B + W split) |
| Used In | Llama 3.1 (Meta) | Sea AI Lab research | DeepSeek-V3/R1 |
| Year Introduced | 2021 (Megatron-LM) | 2023 (Sea AI Lab) | 2024 (DeepSeek) |
By assigning v non-contiguous model chunks to each GPU, the bubble shrinks by factor v. Trade-off: v× more communication.
Splitting backward into input-gradient (B) and weight-gradient (W) lets us schedule W to fill bubbles. This is the key to zero-bubble schedules.
DeepSeek-V3/R1's DualPipe combines dual streams from both pipeline ends with B/W decomposition, achieving near-zero overhead at 671B parameter scale.
Interleaved schedules must choose between processing micro-batches deep (lower memory) or wide (better utilization). Llama 3.1 makes this tunable.
Finding the mathematically optimal schedule requires solving an Integer Linear Programming problem. In practice, heuristics are used.
Each advancement brings more scheduling complexity. The implementation effort grows from simple 1F1B to the highly engineered DualPipe.