Core Concept
Iterative optimization algorithm to minimize loss by following the slope downhill. The compass that guides ML models to optimal accuracy.
Mountain Analogy
- Mountain: Loss function (error landscape)
- Your location: Current model parameters
- Slope: Gradient (direction of steepest ascent)
- Step size: Learning rate
- Village: Optimal model state (minimum loss)
Update Rule
θ⁽ᵗ⁺¹⁾ = θ⁽ᵗ⁾ - η ∇J(θ⁽ᵗ⁾)
- θ: Model parameters (weights, biases)
- η: Learning rate (step size)
- ∇J(θ): Gradient vector [∂J/∂θ₁, ∂J/∂θ₂, ...]
- Negative sign: Gradient points up, subtract to go down
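The update rule can be sketched as a short loop on a toy one-dimensional loss (the quadratic J(θ) = (θ − 3)² is an illustrative choice, not from the notes):

```python
# Minimal gradient descent on J(theta) = (theta - 3)^2, whose minimum is at theta = 3.
# The gradient is dJ/dtheta = 2 * (theta - 3).

def grad(theta):
    return 2 * (theta - 3)

theta = 0.0   # starting parameter value
eta = 0.1     # learning rate
for _ in range(100):
    theta = theta - eta * grad(theta)  # update rule: theta <- theta - eta * grad J(theta)

print(theta)  # converges very close to 3
```

Each iteration subtracts a scaled gradient, so θ walks downhill toward the minimum at 3.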
Learning Rate (Critical Parameter)
Too Small:
- Slow convergence (takes forever)
- Risk getting stuck in local minima
Too Large:
- Overshoot minimum
- Bounce back and forth
- May diverge (climb instead of descend)
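The three regimes above can be observed on the same toy quadratic loss J(θ) = (θ − 3)² (an illustrative setup, not from the notes):

```python
# Effect of learning rate on gradient descent for J(theta) = (theta - 3)^2.
# For this loss the update multiplies the error by (1 - 2*eta) each step,
# so eta > 1 makes the error grow instead of shrink.

def run(eta, steps=50):
    theta = 0.0
    for _ in range(steps):
        theta = theta - eta * 2 * (theta - 3)  # gradient of (theta-3)^2 is 2(theta-3)
    return theta

print(abs(run(0.001) - 3))  # too small: still far from the minimum after 50 steps
print(abs(run(0.4) - 3))    # well chosen: essentially at the minimum
print(abs(run(1.1) - 3))    # too large: error explodes (divergence)
```

The same loop, with only the learning rate changed, either crawls, converges, or blows up.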
Types of Gradient Descent
Batch Gradient Descent
- Calculate gradient using entire dataset
- Pros: Stable, precise steps
- Cons: Slow and expensive for large data
Stochastic Gradient Descent (SGD)
- Use single random data point per step
- Pros: Very fast, can escape local minima (noisy randomness)
- Cons: Jagged path, may not settle exactly at minimum
Mini-Batch SGD (Modern Standard)
- Use small batch (32-64 examples)
- Best of both: Speed of SGD + stability of batch
- Industry standard approach
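A mini-batch SGD loop can be sketched on a tiny synthetic regression problem (fitting y = w·x with true w = 2; the data and hyperparameters are illustrative assumptions):

```python
import random

# Mini-batch SGD fitting y = w * x on synthetic data (true w = 2.0).
random.seed(0)
data = [(x, 2.0 * x) for x in [random.uniform(-1, 1) for _ in range(256)]]

w, eta, batch_size = 0.0, 0.1, 32
for epoch in range(20):
    random.shuffle(data)                      # new random batches each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # gradient of mean squared error (w*x - y)^2 w.r.t. w, averaged over the batch
        g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= eta * g

print(w)  # converges close to the true value 2.0
```

Each step uses 32 examples: cheap like SGD, but the batch average smooths out single-point noise.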
Gradient vs. Backpropagation
Backpropagation: Calculates the gradient (the slope)
Gradient Descent: Updates weights using that gradient
They are partners, not alternatives.
Common Issues
Local Minima
- Non-convex functions have multiple valleys
- Algorithm can settle in shallow valley (not deepest)
- Solution: Momentum-based optimizers, SGD randomness
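The momentum idea mentioned above can be sketched as follows (same toy quadratic as before; the coefficients are illustrative, not tuned):

```python
# Heavy-ball momentum on J(theta) = (theta - 3)^2. A velocity term accumulates
# an exponentially weighted sum of past gradients, which smooths the path and
# helps carry the parameters through flat regions and shallow valleys.

def grad(theta):
    return 2 * (theta - 3)

theta, v = 0.0, 0.0
eta, beta = 0.1, 0.9   # learning rate and momentum coefficient
for _ in range(200):
    v = beta * v + grad(theta)  # accumulate past gradients into a velocity
    theta -= eta * v            # step along the velocity, not the raw gradient

print(theta)  # settles near 3
```

Because the velocity retains direction from earlier steps, momentum keeps moving even where the local gradient is small, which is why it helps escape shallow valleys.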
Why Need Learning Rate?
- Gradient shows direction, not distance
- A raw gradient step (or a huge learning rate) acts as if the current slope held all the way to the minimum
- Causes wild overshooting
Modern Approaches (2025)
Adam (Adaptive Moment Estimation)
- Automatically adjusts learning rate per parameter
- Parameters with large, frequent gradients → smaller steps
- Parameters with small or rare gradients → larger steps
- Current industry standard
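The Adam update can be sketched in a few lines (same toy loss; standard default coefficients β₁ = 0.9, β₂ = 0.999, ε = 1e-8):

```python
import math

# Adam sketch on J(theta) = (theta - 3)^2. Running estimates of the gradient's
# mean (m) and uncentered variance (v) scale each parameter's effective step.

def grad(theta):
    return 2 * (theta - 3)

theta, m, v = 0.0, 0.0, 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g       # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)          # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (math.sqrt(v_hat) + eps)

print(theta)  # settles near 3
```

Dividing by √v̂ normalizes the step per parameter: large, frequent gradients shrink the step, while small or rare gradients enlarge it.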
Learn to Optimize (L2O)
- AI designs optimization process itself
- Neural network predicts best step size and direction
- Outperforms hand-tuned rules
Resource Efficiency
- Gradient Checkpointing: Save memory during training
- LoRA (Low-Rank Adaptation): Freeze most parameters, optimize tiny subset
- Enables training massive LLMs on consumer hardware
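The LoRA parameter-count saving can be sketched with plain matrices (an illustrative toy, not the real library API):

```python
# LoRA idea: keep the pretrained weight matrix W (d x d) frozen and train only
# a low-rank update B @ A with rank r << d, so 2*d*r numbers are optimized
# instead of d*d.

d, r = 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.01] * d for _ in range(r)]  # r x d, trainable
B = [[0.0] * r for _ in range(d)]   # d x r, trainable (zero init: no change at start)

def effective_weight():
    # W_eff = W + B @ A
    return [[W[i][j] + sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d)] for i in range(d)]

full, lora = d * d, 2 * d * r
print(full, lora)  # 16 trainable numbers vs 8; the gap widens rapidly as d grows
```

At realistic scale (d in the thousands, r around 8) the trainable fraction becomes a tiny sliver of the full matrix, which is what makes fine-tuning large models feasible on modest hardware.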
Quick Facts
- Powers almost all modern AI (spam filters to LLMs)
- Vanilla GD rarely used alone in 2025
- Foundation of deep learning training
- Iterative: improves with each step