Pipeline Parallelism - Advanced Schedules

Beyond 1F1B

The frontier of pipeline parallelism: interleaved stages that shrink the bubble by a factor of v, zero-bubble schedules that decompose backward passes into B and W operations, and DualPipe's dual-stream design used in DeepSeek-V3.

At a glance:

  • ~0% bubble achievable with zero-bubble schedules
  • 1/v bubble reduction factor from interleaved stages
  • 671B parameters in DeepSeek-V3, trained with DualPipe

Where 1F1B Left Us

1F1B improved memory by limiting in-flight micro-batches, but the pipeline bubble remained stubbornly large.

📈

1F1B Bubble

The bubble fraction with standard 1F1B is (p-1)/m where p = number of pipeline stages and m = number of micro-batches.

🔌

Memory Win

1F1B limits peak activations to just p micro-batches in flight (vs. m in GPipe), but the idle time stays the same.

Can We Do Better?

Yes! By slicing the model differently across GPUs and being clever about backward pass decomposition, we can push the bubble toward zero.

Standard 1F1B Bubble Fraction rbubble = (p - 1) / m
For p=4, m=8: bubble = 3/8 = 37.5%
The challenge: Adding more micro-batches m reduces the bubble but increases memory. Is there another knob we can turn? Enter interleaved stages.
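The formula is easy to check numerically; a minimal sketch (the helper name is ours):

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a standard 1F1B pipeline: (p - 1) / m."""
    return (p - 1) / m

# Worked example from the text: p=4 stages, m=8 micro-batches
print(bubble_fraction(4, 8))   # 0.375 -> 37.5%
# More micro-batches shrink the bubble, at the cost of activation memory
print(bubble_fraction(4, 32))  # 0.09375
```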

Interleaved Stages

Instead of assigning consecutive layers to each GPU, spread them out. Each GPU holds v non-contiguous chunks, creating a "looping pipeline."

How layers are assigned to GPUs

Standard Assignment v = 1
Interleaved Assignment v = 2
The "looping" idea: With interleaved stages, a micro-batch visits each GPU multiple times. In a standard setup with v=1, micro-batch goes GPU0 → GPU1 → GPU2 → GPU3. With v=2, it goes GPU0 → GPU1 → GPU2 → GPU3 → GPU0 → GPU1 → GPU2 → GPU3. Each pass through a GPU is faster (fewer layers), enabling better interleaving.

Why does this reduce the bubble?

Each forward and backward pass on a GPU takes 1/v the time of the standard case. This means the "warmup" and "cooldown" phases of the pipeline are v times shorter, directly shrinking the bubble.

Interleaved Bubble Time: t_pb = (p - 1) × (t_f + t_b) / v

The trade-off: more communication

Since each micro-batch visits every GPU v times instead of once, the number of point-to-point communications increases by a factor of v.

Communication overhead: Standard = 2(p-1) sends per micro-batch
Interleaved = 2v(p-1) sends per micro-batch

Interleaved 1F1B Schedule (p=4, v=2)

Dark blue = chunk 1 (early layers), Teal = chunk 2 (late layers). Each forward/backward is half the width of standard.

Bubble Fraction Calculator

Explore how pipeline stages (p), micro-batches (m), and interleaved chunks (v) affect the bubble. Drag the sliders to see the formula in action.

Bubble Fraction Formula: r_bubble = (p - 1) / (v × m)

Example (p=4, m=8): standard v=1 gives 3/8 = 37.5%; interleaved v=2 gives 3/16 = 18.75%, a 2× smaller bubble.

Bubble Fraction for p=8 Across Configurations

Special cases: m=1 with v=1 is naive PP; the v=1 column is standard 1F1B; v>1 columns are interleaved.
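The chart's values follow directly from the formula; a small sketch that prints a few p=8 configurations (we keep m ≥ 8 so the ratio stays below 100%):

```python
p = 8
print(f"{'m':>4} {'v=1':>8} {'v=2':>8} {'v=4':>8}")
for m in (8, 16, 32, 64):
    # bubble fraction r = (p - 1) / (v * m) for each chunk count v
    cells = " ".join(f"{(p - 1) / (v * m):8.1%}" for v in (1, 2, 4))
    print(f"{m:>4} {cells}")
```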

Depth-First vs Breadth-First

With interleaved stages, a GPU must decide: process later layers of earlier micro-batches (depth-first), or earlier layers of later micro-batches (breadth-first)?

Depth-First

Priority: Get each micro-batch through the full model as fast as possible.

  • + Lower latency per micro-batch
  • + Fewer activations stored simultaneously
  • + Memory efficient
  • - May leave more bubble gaps

Breadth-First

Priority: Fill the pipeline as quickly as possible with many micro-batches.

  • + Better pipeline utilization
  • + Smaller total bubble
  • - Higher peak memory (more in-flight micro-batches)
  • - Higher latency per individual micro-batch
Llama 3.1: Meta's Llama 3.1 uses a 1F1B schedule with interleaved stages and a tunable priority setting between depth-first and breadth-first. The optimal setting depends on the specific model size, cluster topology, and memory constraints.
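The two priorities amount to two sort orders over (micro-batch, chunk) work items; a toy sketch of the forward-launch order on one GPU (function and strategy names are ours):

```python
from itertools import product

def launch_order(m: int, v: int, strategy: str):
    """Order of (micro_batch, chunk) forward launches on one GPU."""
    items = list(product(range(m), range(v)))
    if strategy == "depth_first":
        # finish each micro-batch's chunks before starting the next
        return sorted(items, key=lambda t: (t[0], t[1]))
    # breadth_first: run chunk 0 for every micro-batch first
    return sorted(items, key=lambda t: (t[1], t[0]))

print(launch_order(3, 2, "depth_first"))
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
print(launch_order(3, 2, "breadth_first"))
# [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```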

Scheduling Order Visualization (p=4, v=2, m=4)

Watch how micro-batches flow through the pipeline under each strategy

Zero Bubble Pipeline

The key insight: the backward pass through a matrix multiplication involves two separate operations that can be scheduled independently.

Decomposing the Backward Pass

B: Backward for Inputs

Computes gradients with respect to the inputs of a layer. This is needed by the preceding stage to continue its backward pass.

Must run before: The B operation of the previous stage

W: Backward for Weights

Computes gradients with respect to the weights of a layer. This is only needed before the optimizer step, not by other stages.

Can run anytime: After its corresponding B, before optimizer step
Standard Backward (coarse-grained)
Layer computation:
F (Forward)
B (Full Backward)
B computes both input gradients and weight gradients together
Decomposed Backward (fine-grained)
Layer computation:
F (Forward)
B (Input grad)
W (Weight grad)
B must run in order (needed by previous stage). W can be deferred!
The magic: Since W can be flexibly scheduled anywhere after its corresponding B (and before the optimizer step), we can strategically place W operations to fill the pipeline bubbles. This is the core insight behind zero-bubble schedules.
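For a linear layer y = x·W the decomposition is concrete; a NumPy sketch (our own toy, not any framework's API):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations saved in the forward pass
W = rng.standard_normal((8, 3))
dy = rng.standard_normal((4, 3))   # gradient arriving from the next stage

# B: input gradient -- on the critical path, sent upstream immediately
dx = dy @ W.T

# W: weight gradient -- only the optimizer needs it, so stash the
# operands and run it later, e.g. inside a pipeline bubble
deferred_w = [(x, dy)]

# ... later, before the optimizer step:
dW = sum(xs.T @ dys for xs, dys in deferred_w)

# Identical to what a fused backward would have produced
assert np.allclose(dW, x.T @ dy)
print(dx.shape, dW.shape)  # (4, 8) (8, 3)
```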

Schedules Compared (p=4, m=4)

Standard 1F1B Baseline

Solving for the Optimal Schedule

In practice, fully optimizing these fine-grained schedules involves:

  1. Carefully profiling the duration of F, B, and W operations
  2. Formulating an Integer Linear Programming (ILP) problem
  3. Minimizing total bubble time subject to dependency constraints
  4. Applying heuristics when exact ILP solutions are too costly
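Step 4's heuristics often reduce to something simple: whenever a stage would otherwise idle, pop a pending W operation. A toy sketch (names and structure are ours, not the paper's algorithm):

```python
def fill_idle_with_w(timeline, pending_w):
    """Replace idle slots (None) in a stage's timeline with deferred
    W operations, in the order their B counterparts completed."""
    queue = list(pending_w)
    filled = []
    for slot in timeline:
        if slot is None and queue:
            filled.append(queue.pop(0))
        else:
            filled.append(slot)
    return filled

timeline = ["F0", "F1", "B0", None, "B1", None]
print(fill_idle_with_w(timeline, ["W0", "W1"]))
# ['F0', 'F1', 'B0', 'W0', 'B1', 'W1']
```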

DualPipe

DeepSeek-V3/R1's pipeline schedule: two micro-batch streams propagating from both ends of the pipeline, interleaved to minimize idle time.

🚀

Dual Streams

One stream of micro-batches enters from GPU 0 (left), another from GPU p-1 (right). They meet in the middle, maximizing utilization.

B/W Decomposition

Like Zero Bubble, DualPipe splits backward into B (input grad) and W (weight grad), gaining scheduling flexibility.

🎯

Near-Zero Overhead

DeepSeek-V3 achieved "near-zero all-to-all communication overhead" using this schedule with 671B parameters.

DualPipe Schedule Visualization

Stream A → Micro-batches entering from left
← Stream B Micro-batches entering from right
F (Stream A) F (Stream B) B (Input grad) W (Weight grad) Idle
How it works: By sending micro-batches from both ends simultaneously, GPUs in the middle of the pipeline always have work from one stream when the other would leave a gap. Combined with B/W decomposition, DualPipe achieves near-complete overlap of computation and communication.
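A simplified way to see the effect (a toy intuition of ours that ignores B/W ordering and communication): with streams entering from both ends, GPU g first receives work at step min(g, p-1-g) instead of step g, so warmup idle time roughly halves.

```python
p = 8
# Step at which GPU g first receives work:
single = [g for g in range(p)]                   # one stream, from the left
dual = [min(g, p - 1 - g) for g in range(p)]     # streams from both ends
print(single)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(dual)    # [0, 1, 2, 3, 3, 2, 1, 0]
# The slowest-starting GPU now begins at step (p - 1) // 2 instead of p - 1
```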
DeepSeek-V3 Setup: 671B parameters × 2,048 GPUs × DualPipe
Result: near-zero all-to-all communication overhead

Schedule Comparison

How do all the advanced pipeline schedules stack up against each other?

Property              | Interleaved 1F1B        | Zero Bubble (ZB-H2)           | DualPipe
Bubble Fraction       | (p-1)/(v×m)             | ~0% (theoretical)             | ~0% (practical)
Key Technique         | v model chunks per GPU  | B/W decomposition             | Dual streams + B/W decomposition
Communication         | v× more p2p sends       | Same as 1F1B                  | Near-zero overhead
Scheduling Complexity | Moderate                | High (ILP solver)             | Very high
Memory Overhead       | Same as 1F1B            | Slightly higher (deferred W)  | Higher (dual streams)
Backward Granularity  | Coarse (full backward)  | Fine (B + W split)            | Fine (B + W split)
Used In               | Llama 3.1 (Meta)        | Sea AI Lab research           | DeepSeek-V3/R1
Year Introduced       | 2021 (Megatron-LM)      | 2023 (Sea AI Lab)             | 2024 (DeepSeek)

Evolution of Pipeline Schedules

2019: GPipe (high bubble)
2019: 1F1B (better memory)
2021: Interleaved (1/v bubble)
2023: Zero Bubble (~0% bubble)
2024: DualPipe (near-zero overhead)

Key Takeaways

01

Interleaved = v× Smaller Bubble

By assigning v non-contiguous model chunks to each GPU, the bubble shrinks by factor v. Trade-off: v× more communication.

02

B/W Decomposition Unlocks Flexibility

Splitting backward into input-gradient (B) and weight-gradient (W) lets us schedule W to fill bubbles. This is the key to zero-bubble schedules.

03

DualPipe: State of the Art

DeepSeek-V3/R1's DualPipe combines dual streams from both pipeline ends with B/W decomposition, achieving near-zero overhead at 671B parameter scale.

04

Depth vs Breadth Trade-off

Interleaved schedules must choose between processing micro-batches deep (lower memory) or wide (better utilization). Llama 3.1 makes this tunable.

05

ILP for Optimal Schedules

Finding the mathematically optimal schedule requires solving an Integer Linear Programming problem. In practice, heuristics are used.

06

Complexity Increases

Each advancement brings more scheduling complexity. The implementation effort grows from simple 1F1B to the highly engineered DualPipe.