Core Concept
Find the decision boundary with the widest possible margin. Like parking in the center of a space, maximizing distance to both neighbors.
Philosophy
Perceptron: Satisfied with any separating line
SVM: Perfectionist, searches for the best line with maximum margin
The Margin
Distance between decision boundary and nearest data points from either class.
Support Vectors: Nearest points that "support" or define the boundary.
Why margin matters: Wide margin = robustness to noisy data. Boundary has breathing room.
SVM Objective
Find decision boundary that:
1. Maximizes the margin
2. Correctly classifies the training points (the hinge loss below relaxes this, tolerating violations at a cost)
Hinge Loss
L(y, f(x)) = max(0, 1 - y · f(x))
Where f(x) = θᵀx + θ₀ and y ∈ {-1, +1}
Three Cases:
Wrong (y · f(x) < 0):
- Big penalty (loss > 1)
- Misclassified point
Correct but inside margin (0 ≤ y · f(x) < 1):
- Small penalty
- Correct but too close to boundary (inside margin)
Correct with room to spare (y · f(x) ≥ 1):
- Zero penalty
- Outside margin, exactly where it should be
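The three cases can be sketched in a few lines of plain Python (the example scores are made up for illustration):

```python
def hinge_loss(y, fx):
    """Hinge loss max(0, 1 - y * f(x)) for labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * fx)

# Wrong side of the boundary: y * f(x) < 0 -> loss > 1
print(hinge_loss(+1, -0.5))  # 1.5
# Correct but inside the margin: 0 <= y * f(x) < 1 -> small penalty
print(hinge_loss(+1, 0.7))   # ~0.3
# Correct and outside the margin: y * f(x) >= 1 -> zero penalty
print(hinge_loss(+1, 2.0))   # 0.0
```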
Optimization Problem
minimize: (1/2)||θ||² + C · Σ max(0, 1 - y⁽ⁱ⁾(θᵀx⁽ⁱ⁾ + θ₀))
- Minimizing this maximizes margin
- Margin is inversely proportional to ||θ||: the distance from the boundary to the margin is 1/||θ||
- Penalizes misclassifications and margin violations
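The objective can be implemented directly from the formula above; a minimal pure-Python sketch, with toy 1-D data chosen just for illustration:

```python
def svm_objective(theta, theta0, X, y, C):
    """(1/2)||theta||^2 + C * sum of hinge losses over the training set."""
    reg = 0.5 * sum(t * t for t in theta)
    hinge = 0.0
    for xi, yi in zip(X, y):
        fx = sum(t * xij for t, xij in zip(theta, xi)) + theta0  # f(x) = theta^T x + theta0
        hinge += max(0.0, 1.0 - yi * fx)
    return reg + C * hinge

# Toy 1-D data: positives on the right, negatives on the left
X = [[2.0], [3.0], [-2.0], [-3.0]]
y = [+1, +1, -1, -1]
print(svm_objective([1.0], 0.0, X, y, C=1.0))  # 0.5: zero hinge loss, pure regularizer
```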
Hyperparameter C
Large C:
- Prioritize perfect training fit
- Small margin
- Risk overfitting
Small C:
- Prioritize wide margin
- Tolerate some training errors
- Better generalization
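A small self-contained illustration of the trade-off (the 1-D dataset and the two candidate boundaries are made up): evaluating the objective shows that a small C prefers a wide-margin boundary despite two margin violations, while a large C prefers a narrow-margin boundary that fits every point with margin at least 1.

```python
def objective(theta, theta0, X, y, C):
    """1-D SVM objective: (1/2)theta^2 + C * total hinge loss."""
    reg = 0.5 * theta * theta
    hinge = sum(max(0.0, 1.0 - yi * (theta * xi + theta0)) for xi, yi in zip(X, y))
    return reg + C * hinge

# Separable 1-D data with a narrow gap between -0.5 and 0.5
X = [-3.0, -2.0, -0.5, 0.5, 2.0, 3.0]
y = [-1, -1, -1, +1, +1, +1]

narrow = (2.0, 0.0)  # margin 1/|theta| = 0.5, zero hinge loss
wide = (0.5, 0.0)    # margin 1/|theta| = 2.0, two points inside the margin

for C in (0.1, 10.0):
    n, w = objective(*narrow, X, y, C), objective(*wide, X, y, C)
    print(f"C={C}: prefers the", "wide" if w < n else "narrow", "margin")
```

With C = 0.1 the wide candidate's objective (0.125 + 0.1 × 1.5 = 0.275) beats the narrow one's (2.0); with C = 10 the hinge term dominates and the preference flips.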
The Kernel Trick
Map data to higher-dimensional space where it becomes linearly separable, without computing transformation explicitly.
Common Kernels:
Linear: K(x, x') = xᵀx'
- Standard SVM
Polynomial: K(x, x') = (xᵀx' + c)ᵈ
- Polynomial boundaries
RBF (Gaussian): K(x, x') = exp(-γ||x - x'||²)
- Complex, smooth boundaries
Enables learning complex, nonlinear boundaries with linear optimization elegance.
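The trick can be made concrete for the quadratic kernel (xᵀx')² in 2-D: it equals an ordinary dot product under an explicit feature map that the kernel lets us skip building (the helper names below are illustrative):

```python
import math

def linear_kernel(x, xp):
    return sum(a * b for a, b in zip(x, xp))

def polynomial_kernel(x, xp, c=1.0, d=3):
    return (linear_kernel(x, xp) + c) ** d

def rbf_kernel(x, xp, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xp)))

# Kernel trick, concretely: (x^T x')^2 in 2-D equals a dot product under the
# explicit map phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2], never constructed in practice.
def phi(x):
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

x, xp = [1.0, 2.0], [3.0, 0.5]
implicit = linear_kernel(x, xp) ** 2       # kernel value, no feature map needed
explicit = linear_kernel(phi(x), phi(xp))  # same value, via the explicit map
```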
Quick Facts
- Margin maximization for robust classification
- Hinge loss creates wide margins
- Kernel trick enables nonlinear boundaries
- Still relevant for small data and interpretability
- Theoretical rigor vs. neural network empiricism