Feature Representation

Turning Raw Data into Learnable Signals

Imagine you are a chef. You have the finest raw ingredients: a sack of flour, a basket of tomatoes, and a block of cheese. But you cannot simply throw the sack of flour into the oven and expect a pizza. You must first process the ingredients: grind, chop, knead, and season.

In Machine Learning, Feature Representation is this culinary prep work. Your raw data—whether it's pixels, text, or database rows—is often unusable in its native form. To make a machine learn, we must transform this raw "flour" into a structured "dough" that algorithms can digest. It is the bridge between the messy real world and the precise mathematical world of AI.

The Art of Translation

Algorithms live in a world of numbers. They understand vectors, matrices, and geometry. They do not understand "Red", "Blue", or "Happy".

The goal of feature representation is to define a mapping, let's call it ϕ (phi), that translates a real-world object x into a vector of numbers v. If we do this well, simple linear models can solve complex, non-linear problems.

Discrete & Text: The Menu and The Soup

The Menu: One-Hot Encoding

Suppose your data is a restaurant menu with categories: [Italian, Mexican, Thai]. You might be tempted to assign numbers: Italian=1, Mexican=2, Thai=3. But this is dangerous! It implies that Mexican is "twice" as much as Italian, or that Thai is "greater" than Mexican.

Instead, we use One-Hot Encoding. We create a vector with a slot for every possible category. "Mexican" becomes [0, 1, 0]. "Thai" becomes [0, 0, 1]. Each category gets its own independent dimension, preventing false mathematical relationships.
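A minimal sketch of this idea in plain Python (the category list is just the menu example from above):

```python
# Illustrative one-hot encoder for a fixed list of categories.
CATEGORIES = ["Italian", "Mexican", "Thai"]

def one_hot(category, categories=CATEGORIES):
    """Return a vector with a 1 in the slot for `category`, 0 elsewhere."""
    vec = [0] * len(categories)
    vec[categories.index(category)] = 1
    return vec

print(one_hot("Mexican"))  # [0, 1, 0]
print(one_hot("Thai"))     # [0, 0, 1]
```

In practice you would use a library encoder (such as scikit-learn's `OneHotEncoder`), which also handles unseen categories and sparse output, but the underlying vector is exactly this.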

The Soup: Bag of Words

For text, imagine chopping up a sentence into a bowl of word soup. We don't care about the order, just the ingredients. The sentence "The quick brown fox" is represented by a vector counting how often each vocabulary word appears. While simple, this Bag of Words model powers everything from spam filters to early search engines.
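The word soup can be sketched in a few lines, assuming a small fixed vocabulary for illustration:

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word appears, ignoring word order."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["the", "quick", "brown", "fox", "lazy"]
print(bag_of_words("The quick brown fox", vocab))  # [1, 1, 1, 1, 0]
```

Note that "fox quick brown the" would map to the same vector; that loss of order is the price of the soup.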

Polynomial Features: Bending the Line

Linear classifiers are great, but they are rigid. They can only draw straight lines. What if your data follows a curve?

We don't need a more complex algorithm; we need richer features. By checking the interactions between our existing features—squaring them, multiplying them together—we can create Polynomial Features.

If you give a linear model x and x² as inputs, it can effectively draw a parabola while still "thinking" it is drawing a straight line in a higher dimension. Use this wisely, though; add too many of these terms, and your model may start hallucinating patterns that don't exist (overfitting).
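A toy sketch of the trick: expand x into [1, x, x²], and a plain weighted sum over those features traces a parabola.

```python
def poly_features(x, k):
    """Map a scalar x to the polynomial features [1, x, x^2, ..., x^k]."""
    return [x ** i for i in range(k + 1)]

def linear_model(theta, features):
    """An ordinary linear model: a dot product of weights and features."""
    return sum(t * f for t, f in zip(theta, features))

# The weights [0, 0, 1] are a "straight line" in feature space,
# but as a function of x they compute y = x^2 — a parabola.
theta = [0.0, 0.0, 1.0]
print(linear_model(theta, poly_features(3.0, 2)))  # 9.0
```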

The Mathematics

Let's formalize this transformation. We define a feature map ϕ: X → ℝᵈ.

1. The Transformed Model

Our original linear hypothesis was h(x) = θᵀx + θ₀. With feature transformation, it becomes:

h(x) = θᵀϕ(x) + θ₀

This allows us to learn non-linear boundaries in the original input space X, as long as they are linear in the transformed feature space ℝᵈ.
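A concrete illustration of that claim, assuming a hand-picked ϕ: a circular boundary x² + y² = 1 is non-linear in (x, y), but linear in the features ϕ(x, y) = (x², y²).

```python
def phi(x, y):
    """Feature map that squares each coordinate."""
    return [x * x, y * y]

def h(point, theta, theta0):
    """Linear model in feature space: theta . phi(point) + theta0."""
    feats = phi(*point)
    return sum(t * f for t, f in zip(theta, feats)) + theta0

# theta = [1, 1], theta0 = -1 encodes the unit-circle boundary x² + y² − 1 = 0.
print(h((0.5, 0.5), [1, 1], -1))  # negative: inside the circle
print(h((2.0, 0.0), [1, 1], -1))  # positive: outside the circle
```

The decision rule sign(h) separates inside from outside with a "straight line" in ℝ², even though the boundary is a circle in the original space.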

2. Polynomial Basis Example

For a 1D input x, a k-th order polynomial feature map looks like:

ϕ(x) = [1, x, x², ..., xᵏ]ᵀ

For a d-dimensional input, this expands rapidly to include all cross-terms (xᵢxⱼ), leading to a combinatorial explosion in dimensions. This curse of dimensionality is why we must rely on domain knowledge to select only the relevant features.
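The explosion is easy to quantify: the number of monomials of total degree at most k in d variables is the binomial coefficient C(d + k, k).

```python
from math import comb

def num_poly_features(d, k):
    """Count monomials of total degree <= k in d variables: C(d + k, k)."""
    return comb(d + k, k)

print(num_poly_features(2, 2))    # 6: 1, x1, x2, x1², x1·x2, x2²
print(num_poly_features(100, 3))  # 176851 — from just 100 raw features
```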

2025: Encoded Meaning

In 2025, we have moved beyond manually cooking up polynomial features. We now rely on Embeddings.

The Embedding Revolution: Instead of us telling the computer "Mexican is a category," we let a Neural Network read millions of menus. It learns to place "Taco" and "Burrito" close together in a 1000-dimensional space, far away from "Sushi".
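Once words live as vectors, "close together" has a precise meaning, usually cosine similarity. A toy sketch with hand-picked 3-dimensional vectors (real embeddings come from a trained network and have hundreds or thousands of dimensions):

```python
import math

# Hypothetical toy embeddings, chosen by hand purely for illustration.
embeddings = {
    "taco":    [0.90, 0.80, 0.10],
    "burrito": [0.85, 0.75, 0.15],
    "sushi":   [0.10, 0.20, 0.95],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# "taco" sits much closer to "burrito" than to "sushi" in this space.
print(cosine(embeddings["taco"], embeddings["burrito"]))
print(cosine(embeddings["taco"], embeddings["sushi"]))
```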
