Regression
Predicting the Continuous World
Imagine you're trying to guess the price of a house. You look at its size, the number of bedrooms, and the neighborhood. You don't just say "Expensive" or "Cheap" (that would be classification). You say "$450,000" or "$455,500".
This is Regression: the art of predicting a continuous number. It is one of the most fundamental tools in Machine Learning, serving as the bedrock for everything from stock market forecasting to climate modeling.
The Line of Best Fit
At its heart, regression is about finding relationships. If you plot house price against square footage, you'll likely see a trend: as size goes up, price goes up. Regression algorithms try to draw a line (or a curve) through these points that "fits" the data best.
In the simplest case, Linear Regression, we assume this relationship is a straight line. We try to find the slope and intercept that minimize the distance between our line and the actual data points. It’s like trying to lay a stick down on a scatter plot so that it's as close to all the dots as possible.
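To make this concrete, here is a minimal sketch using NumPy's `polyfit` to find that slope and intercept by least squares. The square-footage and price numbers are made up for illustration, not real market data:

```python
import numpy as np

# Hypothetical data: square footage vs. price (in $1000s)
sqft = np.array([1000, 1500, 2000, 2500, 3000])
price = np.array([200, 270, 340, 410, 480])

# Fit a straight line: price = slope * sqft + intercept (least squares)
slope, intercept = np.polyfit(sqft, price, deg=1)
print(f"slope={slope:.3f}, intercept={intercept:.1f}")
```

With these toy numbers the data happens to lie exactly on a line, so the fit recovers it perfectly; real data will scatter around the fitted line.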

The Cost of Being Wrong
How do we know if our line is "good"? We measure the error. A common way is to look at the difference between our prediction and the actual value, square it, and take the average. This is the Mean Squared Error (MSE).
Why Square the Error? Squaring ensures that positive and negative errors don't cancel each other out. It also punishes large mistakes much more than small ones—being off by 100 is 100 times worse than being off by 10, not just 10 times.
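A sketch of MSE in a few lines of NumPy makes the squaring effect visible:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Square each error so positives and negatives can't cancel, then average
    errors = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(errors ** 2)

# Being off by 100 costs 100x more than being off by 10, not 10x
print(mean_squared_error([100], [0]))  # 10000.0
print(mean_squared_error([10], [0]))   # 100.0
```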
Finding the Bottom of the Valley
To find the best line, we need to minimize this MSE. We can do this in two main ways:
- Analytical Solution (OLS): For small datasets, we can use linear algebra to calculate the perfect line in one go. It’s exact but computationally expensive for huge amounts of data.
- Gradient Descent: For larger datasets, we take an iterative approach. We start with a random line, look at the slope of our error landscape, and take small steps "downhill" until we find the minimum error.
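The gradient descent loop above can be sketched in plain NumPy. We start from a zero line (standing in for a "random" one), compute the gradient of the MSE with respect to the slope and intercept, and step downhill. The data and learning rate here are illustrative choices:

```python
import numpy as np

# Hypothetical 1-D data generated from the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

theta, bias = 0.0, 0.0   # starting line
lr = 0.05                # learning rate: the size of each downhill step

for _ in range(2000):
    error = theta * x + bias - y
    # Gradients of the MSE with respect to slope and intercept
    grad_theta = 2 * np.mean(error * x)
    grad_bias = 2 * np.mean(error)
    theta -= lr * grad_theta
    bias -= lr * grad_bias

print(round(theta, 3), round(bias, 3))  # converges toward 2 and 1
```

The analytical (OLS) route, e.g. `np.polyfit(x, y, 1)`, would land on the same answer in one step for a small dataset like this.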
Practical Mastery
Feature Scaling matters
If one of your features is "number of bedrooms" (1-5) and another is "square footage" (500-5000), the larger numbers will dominate the math. Scaling your features so they are all in a similar range (e.g., 0 to 1) helps the algorithm converge faster and more accurately.
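A minimal min-max scaling helper (one common choice among several scaling schemes) looks like this, using the bedroom and square-footage ranges from above:

```python
import numpy as np

def min_max_scale(values):
    # Rescale a feature linearly onto the [0, 1] range
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

bedrooms = np.array([1, 2, 3, 4, 5])
sqft = np.array([500, 1500, 2500, 3500, 5000])

print(min_max_scale(bedrooms))  # both features now span 0 to 1
print(min_max_scale(sqft))
```

After scaling, a one-bedroom difference and a few-hundred-square-foot difference are numerically comparable, so neither feature dominates the gradient updates.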
Watch out for Outliers
Because we square our errors, a single massive outlier can drag the whole line towards it. Always check your data for anomalies before training.
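A quick demonstration of that drag, with one deliberately corrupted point in otherwise perfectly linear toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = 2 * x                   # perfectly linear: slope 2
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0             # one corrupted measurement

slope_clean, _ = np.polyfit(x, y_clean, 1)
slope_outlier, _ = np.polyfit(x, y_outlier, 1)
print(slope_clean, slope_outlier)  # the single outlier inflates the slope
```

One bad point out of five shifts the fitted slope from 2 to 20, because its squared error dwarfs everything else.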
The Mathematics
Let's formalize our intuition. We have an input vector x (features) and want to predict a scalar y.
1. The Hypothesis
Our model, or hypothesis h(x), is a linear combination of the inputs:

h(x) = θᵀx + θ₀

Where θ is our weight vector (slope) and θ₀ is the bias (intercept).
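In code, the hypothesis is just a dot product plus the bias. The weights below are invented for illustration (a per-square-foot rate and a per-bedroom premium, in $1000s):

```python
import numpy as np

theta = np.array([0.14, 5.0])   # hypothetical weights: per sqft, per bedroom
theta0 = 60.0                   # bias / intercept

def h(x):
    # h(x) = theta . x + theta0
    return np.dot(theta, x) + theta0

house = np.array([2000, 3])     # a 2000 sqft, 3-bedroom house
print(h(house))                 # predicted price in $1000s
```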
2. The Objective Function
We want to find the parameters θ that minimize the cost function J(θ):

J(θ) = (1/2m) · Σ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)², summed over the m training examples.

The factor of 1/2 is a mathematical convenience: it cancels the factor of 2 that appears when we take the derivative during gradient descent.
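A direct translation of this cost function (this is MSE halved; dropping the 1/2 would not change which θ is best, only the scale of J):

```python
import numpy as np

def cost(theta, theta0, X, y):
    # J(theta) = (1 / (2m)) * sum((h(x_i) - y_i)^2)
    m = len(y)
    residuals = X @ theta + theta0 - y
    return np.sum(residuals ** 2) / (2 * m)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
print(cost(np.array([2.0]), 0.0, X, y))  # 0.0: the perfect line
print(cost(np.array([1.0]), 0.0, X, y))  # positive: a worse line
```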
Frequently Asked Questions
Regression vs. Classification?
Regression predicts a quantity (e.g., temperature: 72.5°F). Classification predicts a category (e.g., weather: Sunny or Rainy). If the output is a number, it's usually regression.
What is R-Squared?
R-squared is a metric that tells you how much of the variance in the target variable is explained by your model. An R-squared of 1.0 means perfect prediction; 0.0 means your model is no better than just guessing the average.
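R-squared can be computed directly from this definition: one minus the ratio of the model's squared error to the squared error of always guessing the mean. A small sketch:

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # model's squared error
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of guessing the mean
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                      # 1.0: perfect prediction
print(r_squared(y, np.full(4, y.mean())))   # 0.0: same as guessing the mean
```

A model that predicts worse than the mean on held-out data can even produce a negative R-squared.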