Core Concept
Iterative optimization algorithm to minimize loss by following the slope downhill. The compass that guides ML models to optimal accuracy.
Mountain Analogy
- Mountain: Loss function (error landscape)
- Your location: Current model parameters
- Slope: Gradient (direction of steepest ascent)
- Step size: Learning rate
- Village: Optimal model state (minimum loss)
Update Rule
θ⁽ᵗ⁺¹⁾ = θ⁽ᵗ⁾ - η ∇J(θ⁽ᵗ⁾)
- θ: Model parameters (weights, biases)
- η: Learning rate (step size)
- ∇J(θ): Gradient vector [∂J/∂θ₁, ∂J/∂θ₂, ...]
- Negative sign: Gradient points up, subtract to go down
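The update rule can be sketched as a short loop on a toy one-dimensional loss (the quadratic J(θ) = (θ − 3)² is an illustrative choice, not from the notes):

```python
# Minimal gradient descent on J(theta) = (theta - 3)^2, whose minimum is at theta = 3.
# The gradient is dJ/dtheta = 2 * (theta - 3).

def grad(theta):
    return 2 * (theta - 3)

theta = 0.0   # starting parameter value
eta = 0.1     # learning rate
for _ in range(100):
    theta = theta - eta * grad(theta)  # update rule: theta <- theta - eta * grad J(theta)

print(theta)  # converges very close to 3
```

Each iteration subtracts a scaled gradient, so θ walks downhill toward the minimum at 3.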
Learning Rate (Critical Parameter)
Too Small:
- Slow convergence (takes forever)
- Risk getting stuck in local minima
Too Large:
- Overshoot minimum
- Bounce back and forth
- May diverge (climb instead of descend)
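The three regimes above can be observed on the same toy quadratic loss J(θ) = (θ − 3)² (an illustrative setup, not from the notes):

```python
# Effect of learning rate on gradient descent for J(theta) = (theta - 3)^2.
# For this loss the update multiplies the error by (1 - 2*eta) each step,
# so eta > 1 makes the error grow instead of shrink.

def run(eta, steps=50):
    theta = 0.0
    for _ in range(steps):
        theta = theta - eta * 2 * (theta - 3)  # gradient of (theta-3)^2 is 2(theta-3)
    return theta

print(abs(run(0.001) - 3))  # too small: still far from the minimum after 50 steps
print(abs(run(0.4) - 3))    # well chosen: essentially at the minimum
print(abs(run(1.1) - 3))    # too large: error explodes (divergence)
```

The same loop, with only the learning rate changed, either crawls, converges, or blows up.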
Types of Gradient Descent
Batch Gradient Descent
- Calculate gradient using entire dataset
- Pros: Stable, precise steps
- Cons: Slow and expensive for large data
Stochastic Gradient Descent (SGD)
- Use single random data point per step
- Pros: Very fast, can escape local minima (noisy randomness)
- Cons: Jagged path, may not settle exactly at minimum
Mini-Batch SGD (Modern Standard)
- Use small batch (32-64 examples)
- Best of both: Speed of SGD + stability of batch
- Industry standard approach
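A mini-batch SGD loop can be sketched on a tiny synthetic regression problem (fitting y = w·x with true w = 2; the data and hyperparameters are illustrative assumptions):

```python
import random

# Mini-batch SGD fitting y = w * x on synthetic data (true w = 2.0).
random.seed(0)
data = [(x, 2.0 * x) for x in [random.uniform(-1, 1) for _ in range(256)]]

w, eta, batch_size = 0.0, 0.1, 32
for epoch in range(20):
    random.shuffle(data)                      # new random batches each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # gradient of mean squared error (w*x - y)^2 w.r.t. w, averaged over the batch
        g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= eta * g

print(w)  # converges close to the true value 2.0
```

Each step uses 32 examples: cheap like SGD, but the batch average smooths out single-point noise.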
Gradient vs. Backpropagation
Backpropagation: Calculates the gradient (the slope)
Gradient Descent: Updates weights using that gradient
They are partners, not alternatives.
Common Issues
Local Minima
- Non-convex functions have multiple valleys
- Algorithm can settle in shallow valley (not deepest)
- Solution: Momentum-based optimizers, SGD randomness
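The momentum idea mentioned above can be sketched as follows (same toy quadratic as before; the coefficients are illustrative, not tuned):

```python
# Heavy-ball momentum on J(theta) = (theta - 3)^2. A velocity term accumulates
# an exponentially weighted sum of past gradients, which smooths the path and
# helps carry the parameters through flat regions and shallow valleys.

def grad(theta):
    return 2 * (theta - 3)

theta, v = 0.0, 0.0
eta, beta = 0.1, 0.9   # learning rate and momentum coefficient
for _ in range(200):
    v = beta * v + grad(theta)  # accumulate past gradients into a velocity
    theta -= eta * v            # step along the velocity, not the raw gradient

print(theta)  # settles near 3
```

Because the velocity retains direction from earlier steps, momentum keeps moving even where the local gradient is small, which is why it helps escape shallow valleys.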
Why Need Learning Rate?
- Gradient shows direction, not distance
- A raw gradient step (or a huge learning rate) acts as if the current slope held all the way to the minimum
- Causes wild overshooting
Modern Approaches (2025)
Adam (Adaptive Moment Estimation)
- Automatically adjusts learning rate per parameter
- Parameters with large, frequent gradients → smaller steps
- Parameters with small or rare gradients → larger steps
- Current industry standard
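The Adam update can be sketched in a few lines (same toy loss; standard default coefficients β₁ = 0.9, β₂ = 0.999, ε = 1e-8):

```python
import math

# Adam sketch on J(theta) = (theta - 3)^2. Running estimates of the gradient's
# mean (m) and uncentered variance (v) scale each parameter's effective step.

def grad(theta):
    return 2 * (theta - 3)

theta, m, v = 0.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g       # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)          # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # settles near 3
```

Dividing by √v̂ normalizes the step per parameter: large, frequent gradients shrink the step, while small or rare gradients enlarge it.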
Learn to Optimize (L2O)
- AI designs optimization process itself
- Neural network predicts best step size and direction
- Outperforms hand-tuned rules
Resource Efficiency
- Gradient Checkpointing: Save memory during training
- LoRA (Low-Rank Adaptation): Freeze most parameters, optimize tiny subset
- Enables training massive LLMs on consumer hardware
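The LoRA parameter-count saving can be sketched with plain matrices (an illustrative toy, not the real library API):

```python
# LoRA idea: keep the pretrained weight matrix W (d x d) frozen and train only
# a low-rank update B @ A with rank r << d, so 2*d*r numbers are optimized
# instead of d*d.

d, r = 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.01] * d for _ in range(r)]  # r x d, trainable
B = [[0.0] * r for _ in range(d)]   # d x r, trainable (zero init: no change at start)

def effective_weight():
    # W_eff = W + B @ A
    return [[W[i][j] + sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d)] for i in range(d)]

full, lora = d * d, 2 * d * r
print(full, lora)  # 16 trainable numbers vs 8; the gap widens rapidly as d grows
```

At realistic scale (d in the thousands, r around 8) the trainable fraction becomes a tiny sliver of the full matrix, which is what makes fine-tuning large models feasible on modest hardware.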
Quick Facts
- Powers almost all modern AI (spam filters to LLMs)
- Vanilla GD rarely used alone in 2025
- Foundation of deep learning training
- Iterative: improves with each step