Vizuara Presents
Understanding the Adam Optimizer
A visual, interactive journey through the most popular optimizer in deep learning — from momentum to adaptive learning rates.
Begin the Journey ↓

Gradient descent struggles on loss surfaces with different curvatures along different dimensions. Watch how vanilla GD oscillates while Adam converges smoothly.
Without momentum, the optimizer only sees the current gradient at each step — it has no memory. Momentum accumulates past gradients to push through noise and accelerate along consistent directions.
Look at the red gradient arrows — they point mostly sideways (across the valley), not downhill along it.
With a learning rate of 0.018, each SGD step overshoots in x (the high-curvature direction), bouncing back and forth, while it barely moves in y (the low-curvature direction).
Result: zig-zag in x, painfully slow progress in y.
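This behavior is easy to reproduce numerically. A minimal sketch, using an anisotropic quadratic f(x, y) = ½(a·x² + b·y²) as a stand-in for the demo's loss surface (the coefficients a and b are illustrative assumptions, not the demo's exact values):

```python
import numpy as np

# Vanilla SGD on f(x, y) = 0.5 * (a*x**2 + b*y**2).
# a >> b mimics a narrow valley: high curvature in x, low in y (assumed values).
a, b = 100.0, 1.0
lr = 0.018

theta = np.array([1.0, 1.0])
xs = []  # track the x-coordinate after each step
for _ in range(20):
    grad = np.array([a * theta[0], b * theta[1]])  # df/dx = a*x, df/dy = b*y
    theta = theta - lr * grad
    xs.append(theta[0])

# Because lr * a > 1, x overshoots and flips sign every step,
# while y only shrinks by a factor of (1 - lr * b) = 0.982 per step.
```

After 20 steps, x has been bouncing across the valley the entire time while y is still nowhere near the minimum.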
Heavy-ball momentum builds up velocity:
vₜ = β · vₜ₋₁ − α · gₜ
θₜ = θₜ₋₁ + vₜ
In x: alternating gradients → velocity stays small → less oscillation
In y: consistent gradients → velocity builds up → faster convergence
β₁ ≈ 0.9 — averages over ~10 steps
β₁ ≈ 0.99 — averages over ~100 steps
Higher β₁ = more momentum = smoother but slower to turn.
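The two heavy-ball equations above can be sketched in a few lines of NumPy on the same kind of anisotropic quadratic (a, b, α, and β below are illustrative assumptions):

```python
import numpy as np

# Heavy-ball momentum on f(x, y) = 0.5 * (a*x**2 + b*y**2); values assumed.
a, b = 100.0, 1.0
alpha, beta = 0.018, 0.9

theta = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(200):
    g = np.array([a * theta[0], b * theta[1]])
    v = beta * v - alpha * g   # v_t = beta * v_{t-1} - alpha * g_t
    theta = theta + v          # theta_t = theta_{t-1} + v_t
```

In x the alternating gradients cancel inside v, damping the oscillation; in y the consistent gradients accumulate, so both coordinates reach the minimum far faster than plain SGD does at the same learning rate.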
Instead of using one learning rate for all parameters, adaptive methods use a different effective learning rate per dimension — automatically adjusting to the local curvature.
Track the squared gradient per dimension:
vₜ = β₂ · vₜ₋₁ + (1 − β₂) · gₜ²
Then divide the update by √vₜ. This normalizes each dimension:
The x-dimension has large gradients → vx is large → dividing by √vx shrinks the step → prevents overshooting.
The y-dimension has small gradients → vy is small → dividing by √vy amplifies the step → speeds up progress.
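A small numerical sketch of this normalization, assuming a constant gradient that is 100× larger in x than in y (the values are illustrative):

```python
import numpy as np

# Gradients differ by 100x across dimensions, but dividing by sqrt(v)
# equalizes the effective per-dimension step sizes.
g = np.array([100.0, 1.0])   # large gradient in x, small in y (assumed)
beta2, eps = 0.999, 1e-8

v = np.zeros(2)
for _ in range(5000):        # with a constant gradient, v approaches g**2
    v = beta2 * v + (1 - beta2) * g**2

step = g / (np.sqrt(v) + eps)
# both entries of step end up close to 1.0: the 100x imbalance is gone
```

Once v has tracked the squared gradient for a while, each dimension moves at a comparable rate regardless of its raw gradient magnitude.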
Adam combines momentum (direction from the past) and adaptive learning rates (reading the terrain) into one elegant algorithm. Compare it with plain gradient descent on a challenging surface.
Five elegant equations that form the complete Adam update rule. Each builds on the last.
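The five equations translate almost line-for-line into NumPy. A minimal sketch (the default hyperparameter values α=0.001, β₁=0.9, β₂=0.999, ε=1e-8 come from the original Adam paper; the toy loss f(x) = x² is an assumption for the usage example):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g           # 1. first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2        # 2. second moment (adaptive rate)
    m_hat = m / (1 - beta1**t)                # 3. bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # 4. bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # 5. update
    return theta, m, v

# Usage: minimize f(x) = x**2, so the gradient is g = 2x.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):                      # t starts at 1 for bias correction
    g = 2 * theta
    theta, m, v = adam_step(theta, g, m, v, t, alpha=0.05)
```

Note that the step counter t starts at 1 — with t = 0 the bias-correction denominators would be zero.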
Without bias correction, the moving averages are biased toward zero in early steps. This chart shows the dramatic difference.
Since m₀ = 0, the first few estimates are biased toward zero. Dividing by (1 − β₁ᵗ) corrects this.
At step 1 with β₁=0.9: the correction factor is 1/(1-0.9) = 10×
At step 10: factor is ≈1.54×
At step 50: factor is ≈1.005× (negligible)
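As a sanity check, the correction factor 1/(1 − β₁ᵗ) can be computed directly for a few step counts:

```python
# Bias-correction factors 1/(1 - beta1**t) for beta1 = 0.9.
beta1 = 0.9
factors = {t: 1 / (1 - beta1**t) for t in (1, 10, 50)}
# The factor starts at 10x and decays toward 1x as t grows.
```

The same computation with β₂ = 0.999 decays much more slowly, which is why the second-moment correction matters for longer.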
Explore how Adam navigates different loss surfaces. Click on the canvas to set a starting point, adjust the hyperparameters, and watch Adam optimize!
1. Click the canvas to place Adam's starting position
2. Press Play (or Step) to watch Adam optimize
3. Change the loss function to see different terrains (Beale has a narrow valley, Rosenbrock has a curved valley, Himmelblau has 4 minima)
4. Tune hyperparameters: α (learning rate), β₁ (momentum), β₂ (adaptive rate), ε (numerical stability)
5. The + markers show the global minima — can Adam find them?