Vizuara Presents
Understanding the Adam Optimizer
A visual, interactive journey through the most popular optimizer in deep learning — from momentum to adaptive learning rates.
Begin the Journey ↓

Gradient descent struggles on loss surfaces with different curvatures along different dimensions. Watch how vanilla GD oscillates while Adam converges smoothly.
Without momentum, the optimizer only sees the current gradient at each step — it has no memory. Momentum accumulates past gradients to push through noise and accelerate along consistent directions.
Look at the red gradient arrows — they point mostly sideways (across the valley), not downhill along it.
With a learning rate of 0.018, each SGD step overshoots in x (the high-curvature direction), bouncing back and forth, while it barely moves in y (the low-curvature direction).
Result: zig-zag in x, painfully slow progress in y.
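This behavior is easy to reproduce numerically. A minimal sketch, using an anisotropic quadratic f(x, y) = ½(a·x² + b·y²) as a stand-in for the demo's loss surface (the coefficients a and b are illustrative assumptions, not the demo's exact values):

```python
import numpy as np

# Vanilla SGD on f(x, y) = 0.5 * (a*x**2 + b*y**2).
# a >> b mimics a narrow valley: high curvature in x, low in y (assumed values).
a, b = 100.0, 1.0
lr = 0.018

theta = np.array([1.0, 1.0])
xs = []  # track the x-coordinate after each step
for _ in range(20):
    grad = np.array([a * theta[0], b * theta[1]])  # df/dx = a*x, df/dy = b*y
    theta = theta - lr * grad
    xs.append(theta[0])

# Because lr * a > 1, x overshoots and flips sign every step,
# while y only shrinks by a factor of (1 - lr * b) = 0.982 per step.
```

After 20 steps, x has been bouncing across the valley the entire time while y is still nowhere near the minimum.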
Heavy-ball momentum builds up velocity:
vₜ = β · vₜ₋₁ − α · gₜ
θₜ = θₜ₋₁ + vₜ
In x: alternating gradients → velocity stays small → less oscillation
In y: consistent gradients → velocity builds up → faster convergence
β₁ ≈ 0.9 — averages over ~10 steps
β₁ ≈ 0.99 — averages over ~100 steps
Higher β₁ = more momentum = smoother but slower to turn.
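The two heavy-ball equations above can be sketched in a few lines of NumPy on the same kind of anisotropic quadratic (a, b, α, and β below are illustrative assumptions):

```python
import numpy as np

# Heavy-ball momentum on f(x, y) = 0.5 * (a*x**2 + b*y**2); values assumed.
a, b = 100.0, 1.0
alpha, beta = 0.018, 0.9

theta = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(200):
    g = np.array([a * theta[0], b * theta[1]])
    v = beta * v - alpha * g   # v_t = beta * v_{t-1} - alpha * g_t
    theta = theta + v          # theta_t = theta_{t-1} + v_t
```

In x the alternating gradients cancel inside v, damping the oscillation; in y the consistent gradients accumulate, so both coordinates reach the minimum far faster than plain SGD does at the same learning rate.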
Instead of using one learning rate for all parameters, adaptive methods use a different effective learning rate per dimension — automatically adjusting to the local curvature.
Track the squared gradient per dimension:
vₜ = β₂ · vₜ₋₁ + (1 − β₂) · gₜ²
Then divide the update by √vₜ. This normalizes each dimension:
The x-dimension has large gradients → vx is large → dividing by √vx shrinks the step → prevents overshooting.
The y-dimension has small gradients → vy is small → dividing by √vy amplifies the step → speeds up progress.
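A small numerical sketch of this normalization, assuming a constant gradient that is 100× larger in x than in y (the values are illustrative):

```python
import numpy as np

# Gradients differ by 100x across dimensions, but dividing by sqrt(v)
# equalizes the effective per-dimension step sizes.
g = np.array([100.0, 1.0])   # large gradient in x, small in y (assumed)
beta2, eps = 0.999, 1e-8

v = np.zeros(2)
for _ in range(5000):        # with a constant gradient, v approaches g**2
    v = beta2 * v + (1 - beta2) * g**2

step = g / (np.sqrt(v) + eps)
# both entries of step end up close to 1.0: the 100x imbalance is gone
```

Once v has tracked the squared gradient for a while, each dimension moves at a comparable rate regardless of its raw gradient magnitude.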
Adam combines momentum (direction from the past) and adaptive learning rates (reading the terrain) into one elegant algorithm. Compare it with plain gradient descent on a challenging surface.
Five elegant equations that form the complete Adam update rule. Each builds on the last.
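The five equations translate almost line-for-line into NumPy. A minimal sketch (the default hyperparameter values α=0.001, β₁=0.9, β₂=0.999, ε=1e-8 come from the original Adam paper; the toy loss f(x) = x² is an assumption for the usage example):

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g           # 1. first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2        # 2. second moment (adaptive rate)
    m_hat = m / (1 - beta1**t)                # 3. bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # 4. bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # 5. update
    return theta, m, v

# Usage: minimize f(x) = x**2, so the gradient is g = 2x.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):                      # t starts at 1 for bias correction
    g = 2 * theta
    theta, m, v = adam_step(theta, g, m, v, t, alpha=0.05)
```

Note that the step counter t starts at 1 — with t = 0 the bias-correction denominators would be zero.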
Without bias correction, the moving averages are biased toward zero in early steps. This chart shows the dramatic difference.
Since m₀ = 0, the first few estimates are biased toward zero. Dividing by (1 − β₁ᵗ) corrects this.
At step 1 with β₁=0.9: the correction factor is 1/(1-0.9) = 10×
At step 10: factor is ≈1.54×
At step 50: factor is ≈1.005× (negligible)
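As a sanity check, the correction factor 1/(1 − β₁ᵗ) can be computed directly for a few step counts:

```python
# Bias-correction factors 1/(1 - beta1**t) for beta1 = 0.9.
beta1 = 0.9
factors = {t: 1 / (1 - beta1**t) for t in (1, 10, 50)}
# The factor starts at 10x and decays toward 1x as t grows.
```

The same computation with β₂ = 0.999 decays much more slowly, which is why the second-moment correction matters for longer.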
Explore how Adam navigates different loss surfaces. Click on the canvas to set a starting point, adjust the hyperparameters, and watch Adam optimize!
1. Click the canvas to place Adam's starting position
2. Press Play (or Step) to watch Adam optimize
3. Change the loss function to see different terrains (Beale has a narrow valley, Rosenbrock has a curved valley, Himmelblau has 4 minima)
4. Tune hyperparameters: α (learning rate), β₁ (momentum), β₂ (adaptive rate), ε (numerical stability)
5. The + markers show the global minima — can Adam find them?