
The blog of Maxime Kjaer

CS-443 Machine Learning

The course follows a few books:

The repository for code labs and lecture notes is on GitHub. A useful website for this course is matrixcalculus.org.

In this course, we’ll always denote the dataset as a matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$, where $N$ is the data size and $D$ is the dimensionality, or the number of features. We’ll always use subscript $n$ for a data point, and $d$ for a feature. The labels, if any, are denoted in a vector $\mathbf{y} \in \mathbb{R}^N$, and the weights are denoted by $\mathbf{w} \in \mathbb{R}^D$:

$$\mathbf{X} = \begin{pmatrix} x_{11} & \dots & x_{1D} \\ \vdots & & \vdots \\ x_{N1} & \dots & x_{ND} \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \qquad \mathbf{w} = \begin{pmatrix} w_1 \\ \vdots \\ w_D \end{pmatrix}$$

Vectors are denoted in bold and lowercase (e.g. $\mathbf{x}$ or $\mathbf{w}$), and matrices are bold and uppercase (e.g. $\mathbf{X}$). Scalars and functions are in normal font weight1.

Linear regression

A linear regression is a model that assumes a linear relationship between inputs and the output. We will study three types of methods:

  1. Grid search
  2. Iterative optimization algorithms
  3. Least squares

Simple linear regression

For a single input dimension ($D = 1$), we can use a simple linear regression, which is given by:

$$y_n \approx f(x_n) := w_0 + w_1 x_{n1}$$

$w_0, w_1$ are the parameters of the model.

Multiple linear regression

If our data has multiple input dimensions, we obtain multivariate linear regression:

$$y_n \approx f(\mathbf{x}_n) := w_0 + w_1 x_{n1} + \dots + w_D x_{nD} = w_0 + \mathbf{x}_n^T \begin{pmatrix} w_1 \\ \vdots \\ w_D \end{pmatrix} = \tilde{\mathbf{x}}_n^T \tilde{\mathbf{w}}$$

👉 If we wanted to be a little more strict, we should write $f_{\mathbf{w}}(\mathbf{x}_n)$, as the model of course also depends on the weights.

The tilde notation means that we have included the offset term $w_0$, also known as the bias:

$$\tilde{\mathbf{x}}_n = \begin{pmatrix} 1 \\ x_{n1} \\ \vdots \\ x_{nD} \end{pmatrix}, \qquad \tilde{\mathbf{w}} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{pmatrix}$$

The D > N problem

If the number of parameters $D$ exceeds the number of data examples $N$, we say that the task is under-determined. This can be solved by regularization, which we’ll get to more precisely later.

Cost functions

$\mathbf{X}$ and $\mathbf{y}$ are the data, and we can easily understand where they come from. But how does one find a good $\mathbf{w}$ from the data?

A cost function (also called loss function) is used to learn parameters that explain the data well. It quantifies how well our model does by assigning a penalty score to its errors. Our goal is to find parameters that minimize the loss function.

Properties

Desirable properties of cost functions are:

  • Symmetry around 0: that is, being off by a positive or negative amount is equivalent; what matters is the amplitude of the error, not the sign.
  • Robustness: penalizes large errors at about the same rate as very large errors. This is a way to make sure that outliers don’t completely dominate our regression.

Good cost functions

MSE

Probably the most commonly used cost function is Mean Square Error (MSE):

$$\text{MSE}(\mathbf{w}) := \frac{1}{N} \sum_{n=1}^{N} \left( y_n - f_{\mathbf{w}}(\mathbf{x}_n) \right)^2$$

MSE is symmetrical around 0, but also tends to penalize outliers quite harshly (because it squares the error): MSE is not robust. In practice, this is problematic, because outliers occur more often than we’d like.

Note that we often use MSE with a factor $\frac{1}{2N}$ instead of $\frac{1}{N}$. This is because it makes for a cleaner derivative, but we’ll get into that later. Just know that for all intents and purposes, it doesn’t really change anything about the behavior of the models we’ll study.

MAE

When outliers are present, Mean Absolute Error (MAE) tends to fare better:

$$\text{MAE}(\mathbf{w}) := \frac{1}{N} \sum_{n=1}^{N} \left| y_n - f_{\mathbf{w}}(\mathbf{x}_n) \right|$$

Instead of squaring, we take the absolute value. This is more robust. Note that MAE isn’t differentiable at 0, but we’ll talk about that later.
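To make the difference concrete, here is a minimal NumPy sketch (the function names are our own, not part of the course code) comparing the two costs on data with one outlier:

```python
import numpy as np

def mse(y, y_pred):
    """Mean square error: penalizes large residuals quadratically."""
    return np.mean((y - y_pred) ** 2)

def mae(y, y_pred):
    """Mean absolute error: penalizes residuals linearly, more robust to outliers."""
    return np.mean(np.abs(y - y_pred))

y = np.array([1.0, 2.0, 3.0, 100.0])     # the last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.0])

print(mse(y, y_pred))  # dominated by the outlier
print(mae(y, y_pred))  # much less affected
```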

There are other cost functions that are even more robust; these are available as additional reading, but are not exam material.

Convexity

A function is convex iff the line segment joining any two points on its graph never lies below the graph between them. More strictly defined, a function $f: \mathbf{X} \to \mathbb{R}$, with $\mathbf{X} \subseteq \mathbb{R}^D$ convex, is convex if, for any $\mathbf{u}, \mathbf{v} \in \mathbf{X}$, and for any $0 \le \lambda \le 1$, we have:

$$f(\lambda \mathbf{u} + (1 - \lambda) \mathbf{v}) \le \lambda f(\mathbf{u}) + (1 - \lambda) f(\mathbf{v})$$

A function is strictly convex if the above inequality is strict ($<$). This inequality is known as Jensen’s inequality.

A strictly convex function has a unique global minimum $\mathbf{w}^*$. For convex functions, every local minimum is a global minimum. This makes convexity a desirable property for loss functions, since it means that cost function optimization is guaranteed to find the global minimum.

Linear (and affine) functions are convex, and sums of convex functions are also convex. Therefore, MSE and MAE are convex.

We’ll see another way of characterizing convexity for differentiable functions later in the course.

Optimization

Learning / Estimation / Fitting

Given a cost function (or loss function) $\mathcal{L}(\mathbf{w})$, we wish to find the $\mathbf{w}^*$ which minimizes the cost:

$$\mathbf{w}^* = \underset{\mathbf{w}}{\operatorname{argmin}} \ \mathcal{L}(\mathbf{w})$$

This is what we call learning: learning is simply an optimization problem, and as such, we’ll use an optimization algorithm to solve it – that is, to find a good $\mathbf{w}^*$.

Grid search

This is one of the simplest optimization algorithms, although far from being the most efficient one. It can be described as “try all the values”, a kind of brute-force algorithm; you can think of it as nested for-loops over the individual weights.

For instance, if our weights are $\mathbf{w} = (w_1, w_2)$, then we can try, say, 4 values for $w_1$ and 4 values for $w_2$, for a total of 16 values of $\mathbf{w}$.

But obviously, the complexity is exponential, $O(M^D)$ (where $M$ is the number of values to try per weight), which is really bad, especially when we can have millions of parameters. Additionally, grid search has no guarantee that it’ll find an optimum; it’ll just find the best value we tried.

If grid search sounds bad for optimization, that’s because it is. In practice, it is not used for optimization of parameters, but it is used to tune hyperparameters.
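As an illustration, a minimal brute-force grid search over the weights of a linear MSE model might look like this (a sketch with our own function names, not how you would optimize parameters in practice):

```python
import numpy as np
from itertools import product

def mse_cost(X, y, w):
    """(1/2N)-scaled MSE of a linear model."""
    e = y - X @ w
    return e @ e / (2 * len(y))

def grid_search(X, y, grid):
    """Brute-force search: try every combination of candidate weight values."""
    best_w, best_cost = None, np.inf
    for w in product(grid, repeat=X.shape[1]):
        w = np.array(w)
        cost = mse_cost(X, y, w)
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w, best_cost

# toy 2-feature example: 4 candidate values per weight -> 16 combinations
X = np.random.randn(50, 2)
y = X @ np.array([1.0, -2.0]) + 0.1 * np.random.randn(50)
print(grid_search(X, y, grid=np.linspace(-3, 3, 4)))
```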

Optimization landscapes

Local minimum

A vector $\mathbf{w}^*$ is a local minimum of a function $\mathcal{L}$ (we’re interested in the minimum of cost functions $\mathcal{L}$, whose value at the minimum we denote by $\mathcal{L}^*$, as opposed to any other value $\mathcal{L}(\mathbf{w})$, but this obviously holds for any function) if $\exists \epsilon > 0$ such that

$$\mathcal{L}(\mathbf{w}^*) \le \mathcal{L}(\mathbf{w}) \qquad \forall \mathbf{w} \text{ with } \|\mathbf{w} - \mathbf{w}^*\| < \epsilon$$

In other words, the local minimum is better than all the neighbors in some non-zero radius.

Global minimum

The global minimum $\mathbf{w}^*$ is defined by getting rid of the radius and comparing to all other values:

$$\mathcal{L}(\mathbf{w}^*) \le \mathcal{L}(\mathbf{w}) \qquad \forall \mathbf{w} \in \mathbb{R}^D$$

Strict minimum

A minimum is said to be strict if the corresponding inequality is strict for $\mathbf{w} \ne \mathbf{w}^*$, that is, the minimizer is unique.

Smooth (differentiable) optimization

Gradient

A gradient at a given point is the slope of the tangent to the function at that point. It points to the direction of largest increase of the function. By following the gradient (in the opposite direction, because we’re searching for a minimum and not a maximum), we can find the minimum.

Graphs of MSE and MAE

The gradient is defined by:

$$\nabla \mathcal{L}(\mathbf{w}) := \left( \frac{\partial \mathcal{L}(\mathbf{w})}{\partial w_1}, \frac{\partial \mathcal{L}(\mathbf{w})}{\partial w_2}, \dots, \frac{\partial \mathcal{L}(\mathbf{w})}{\partial w_D} \right)^T$$

This is a vector, i.e. $\nabla \mathcal{L}(\mathbf{w}) \in \mathbb{R}^D$. Each dimension $d$ of the vector indicates how fast the cost $\mathcal{L}$ changes depending on the weight $w_d$.

Gradient descent

Gradient descent is an iterative algorithm. We start from a candidate $\mathbf{w}^{(0)}$, and iterate:

$$\mathbf{w}^{(t+1)} := \mathbf{w}^{(t)} - \gamma \nabla \mathcal{L}\left(\mathbf{w}^{(t)}\right)$$

As stated previously, we’re following the negative gradient to find the minimum, hence the subtraction.

$\gamma$ is known as the step-size, which is a small value (maybe 0.1). You don’t want to be too aggressive with it, or you might risk overshooting in your descent. In practice, the step-size that makes the learning as fast as possible is often found by trial and error 🤷🏼‍♂️.
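As a minimal sketch (our own illustrative function, applied to an assumed toy objective), the generic iteration looks like this:

```python
import numpy as np

def gradient_descent(grad_fn, w0, gamma=0.1, num_iters=100):
    """Generic gradient descent: repeatedly step against the gradient."""
    w = w0
    for _ in range(num_iters):
        w = w - gamma * grad_fn(w)
    return w

# example: minimize L(w) = (w - 3)^2, whose gradient is 2(w - 3)
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w0=0.0)
print(w_star)  # close to 3
```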

As an example, we will take an analytical look at a gradient descent, in order to understand its behavior and components. We will do gradient descent on a 1-parameter model ($w_0 \in \mathbb{R}$ and $D = 0$), in which we minimize the MSE, which is defined as follows:

$$\mathcal{L}(w_0) := \frac{1}{2N} \sum_{n=1}^{N} (y_n - w_0)^2$$

Note that we’re dividing by 2 on top of the regular MSE; it has no impact on finding the minimum, but when we compute the gradient below, it will conveniently cancel out the 2 coming from the square.

The gradient of $\mathcal{L}$ is:

$$\frac{\partial \mathcal{L}(w_0)}{\partial w_0} = -\frac{1}{N} \sum_{n=1}^{N} (y_n - w_0) = w_0 - \bar{y}$$

Where $\bar{y}$ denotes the average of all $y_n$ values. And thus, our gradient descent is given by:

$$w_0^{(t+1)} = w_0^{(t)} - \gamma \left( w_0^{(t)} - \bar{y} \right) = (1 - \gamma)\, w_0^{(t)} + \gamma \bar{y}$$

In this case, we’ve managed to find the solution to this exact problem analytically from gradient descent. For a suitable step-size, this sequence is guaranteed to converge to $\bar{y}$, which sets the gradient of the cost function to 0, i.e. we reach the minimum.

The choice of $\gamma$ has an influence on the algorithm’s outcome:

  • If we pick $\gamma = 1$, we would get to the optimum in one step
  • If we pick $0 < \gamma < 1$, we would get a little closer in every step, eventually converging to $\bar{y}$
  • If we pick $\gamma > 1$, we are going to overshoot $\bar{y}$. Slightly bigger than 1 (say, 1.5) would still converge; $\gamma = 2$ would loop infinitely between two points; $\gamma > 2$ diverges.

Gradient descent for linear MSE

Our linear regression is given by a model that fits some data $\mathbf{X} \in \mathbb{R}^{N \times D}$, $\mathbf{y} \in \mathbb{R}^N$:

$$y_n \approx \mathbf{x}_n^T \mathbf{w}$$

We make predictions by multiplying the data by the weights, so our model is:

$$f_{\mathbf{w}}(\mathbf{X}) = \mathbf{X} \mathbf{w}$$

We define the error vector $\mathbf{e}$ by:

$$\mathbf{e} := \mathbf{y} - \mathbf{X} \mathbf{w}$$

The MSE can then be restated as follows:

$$\mathcal{L}(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w} \right)^2 = \frac{1}{2N} \mathbf{e}^T \mathbf{e}$$

And the gradient is, component-wise:

$$\frac{\partial \mathcal{L}(\mathbf{w})}{\partial w_d} = -\frac{1}{N} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w} \right) x_{nd} = -\frac{1}{N} \mathbf{e}^T \mathbf{X}_{:, d}$$

We’re using the column notation $\mathbf{X}_{:, d}$ to signify column $d$ of the matrix $\mathbf{X}$.

And thus, all in all, our gradient is:

$$\nabla \mathcal{L}(\mathbf{w}) = -\frac{1}{N} \mathbf{X}^T \mathbf{e} = -\frac{1}{N} \mathbf{X}^T \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right)$$

To compute this expression, we must compute:

  • The error $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}$, which takes roughly $2ND$ floating point operations (flops) for the matrix-vector multiplication, and $N$ for the subtraction, for a total of about $2ND + N$, which is $O(ND)$
  • The gradient $-\frac{1}{N}\mathbf{X}^T\mathbf{e}$, which costs roughly $2ND$, which is $O(ND)$.

In total, this process is $O(ND)$ at every step. This is not too bad; it’s equivalent to reading the data once.
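Putting the pieces together, a minimal sketch of full gradient descent for linear MSE (function names are our own) could look like this:

```python
import numpy as np

def mse_gradient(X, y, w):
    """Gradient of the (1/2N)-scaled MSE for a linear model: -X^T e / N."""
    e = y - X @ w                  # error vector, O(ND) to compute
    return -X.T @ e / len(y)       # O(ND)

def gd_linear_mse(X, y, gamma=0.1, num_iters=500):
    """Full gradient descent on the linear MSE cost."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w = w - gamma * mse_gradient(X, y, w)
    return w

X = np.random.randn(200, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * np.random.randn(200)
print(gd_linear_mse(X, y))  # approximately [2, -1, 0.5]
```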

Stochastic gradient descent (SGD)

In ML, most cost functions are formulated as a sum over the training points:

$$\mathcal{L}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}_n(\mathbf{w})$$

In practice, this can be expensive to compute, so the solution is to sample a training point $n$ uniformly at random, to be able to make the sum go away.

The stochastic gradient descent step is thus:

$$\mathbf{w}^{(t+1)} := \mathbf{w}^{(t)} - \gamma \nabla \mathcal{L}_n\left(\mathbf{w}^{(t)}\right)$$

Why is it allowed to pick just one $\mathcal{L}_n$ instead of the full sum? We won’t give a full proof, but the intuition is that, in expectation over the random choice of $n$, the stochastic gradient equals the full gradient:

$$\mathbb{E}\left[ \nabla \mathcal{L}_n(\mathbf{w}) \right] = \nabla \mathcal{L}(\mathbf{w})$$

For linear MSE, the gradient of a single $\mathcal{L}_n$ is:

$$\nabla \mathcal{L}_n(\mathbf{w}) = -\mathbf{x}_n \left( y_n - \mathbf{x}_n^T \mathbf{w} \right) = -\mathbf{x}_n e_n$$

Note that $\mathbf{x}_n \in \mathbb{R}^D$, and $e_n := y_n - \mathbf{x}_n^T \mathbf{w} \in \mathbb{R}$. The computational complexity for this is $O(D)$.
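A minimal SGD sketch for linear MSE (our own illustrative code, using the per-sample gradient derived above):

```python
import numpy as np

def sgd_linear_mse(X, y, gamma=0.01, num_iters=5000, seed=0):
    """SGD for linear MSE: each step uses the gradient of a single sample."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        n = rng.integers(N)          # sample one point uniformly at random
        e_n = y[n] - X[n] @ w        # scalar error for that point
        w = w + gamma * e_n * X[n]   # step costs O(D)
    return w

X = np.random.randn(500, 3)
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * np.random.randn(500)
print(sgd_linear_mse(X, y))
```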

Mini-batch SGD

But perhaps just picking a single value is too extreme; there is an intermediate version in which we choose a subset $B \subseteq \{1, \dots, N\}$ instead of a single point:

$$\mathbf{g} := \frac{1}{|B|} \sum_{n \in B} \nabla \mathcal{L}_n\left(\mathbf{w}^{(t)}\right), \qquad \mathbf{w}^{(t+1)} := \mathbf{w}^{(t)} - \gamma \mathbf{g}$$

Note that if $B = \{1, \dots, N\}$ then we’re performing a full gradient descent.

The computation of $\mathbf{g}$ can be parallelized easily over GPU threads, which is quite common in practice; the batch size $|B|$ is thus often dictated by the number of available threads.

The computational complexity is $O(|B| \cdot D)$.
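A mini-batch variant might be sketched as follows (again our own illustration; the batch size and step-size are arbitrary choices):

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, gamma=0.05, num_iters=2000, seed=0):
    """Mini-batch SGD: average the per-sample gradients over a random subset B."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        B = rng.choice(N, size=batch_size, replace=False)
        e = y[B] - X[B] @ w               # errors on the batch
        g = -X[B].T @ e / batch_size      # averaged gradient, O(|B| D)
        w = w - gamma * g
    return w

X = np.random.randn(1000, 4)
y = X @ np.array([0.5, -1.0, 2.0, 0.0]) + 0.1 * np.random.randn(1000)
print(minibatch_sgd(X, y))
```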

Non-smooth (non-differentiable) optimization

We’ve defined convexity previously, but we can also use the following alternative characterization of convexity, for differentiable functions:

$$\mathcal{L}(\mathbf{u}) \ge \mathcal{L}(\mathbf{w}) + \nabla \mathcal{L}(\mathbf{w})^T (\mathbf{u} - \mathbf{w}) \qquad \forall \mathbf{u}, \mathbf{w}$$

Meaning that the function must always lie above its linearization (which is the first-order Taylor expansion) to be convex.

A convex function lies above its linearization

Subgradients

A vector $\mathbf{g} \in \mathbb{R}^D$ such that:

$$\mathcal{L}(\mathbf{u}) \ge \mathcal{L}(\mathbf{w}) + \mathbf{g}^T (\mathbf{u} - \mathbf{w}) \qquad \forall \mathbf{u}$$

is called a subgradient to the function $\mathcal{L}$ at $\mathbf{w}$. The subgradient forms a line that is always below the curve, somewhat like the gradient of a convex function.

The subgradient lies below the function

This definition is valid even for an arbitrary function $\mathcal{L}$ that may not be differentiable, and not even necessarily convex.

If the function is differentiable at $\mathbf{w}$, then the only subgradient at $\mathbf{w}$ is $\mathbf{g} = \nabla \mathcal{L}(\mathbf{w})$.

Subgradient descent

This is exactly like gradient descent, except for the fact that we use a subgradient at the current iterate instead of the gradient:

$$\mathbf{w}^{(t+1)} := \mathbf{w}^{(t)} - \gamma \mathbf{g}, \qquad \text{where } \mathbf{g} \in \partial \mathcal{L}\left(\mathbf{w}^{(t)}\right)$$

For instance, MAE is not differentiable at 0, so we must use a subgradient there.

Here, $\partial \mathcal{L}(\mathbf{w}^{(t)})$ is somewhat confusing notation for the set of all possible subgradients at our position.

For linear regressions, the (sub)gradient is easy to compute using the chain rule.

Let $g: \mathbb{R} \to \mathbb{R}$ be non-differentiable, $f: \mathbb{R}^D \to \mathbb{R}$ differentiable, and $\mathcal{L}(\mathbf{w}) := g(f(\mathbf{w}))$. The chain rule tells us that, at $\mathbf{w}$, a subgradient of $\mathcal{L}$ is:

$$\mathbf{h} = \xi \, \nabla f(\mathbf{w}), \qquad \text{where } \xi \in \partial g(f(\mathbf{w}))$$

Stochastic subgradient descent

This is still commonly abbreviated SGD.

It’s exactly the same, except that $\mathbf{g}$ is a subgradient to the randomly selected $\mathcal{L}_n$ at the current iterate $\mathbf{w}^{(t)}$.

Comparison

|                             | Smooth                                                    | Non-smooth                                                   |
| --------------------------- | --------------------------------------------------------- | ------------------------------------------------------------ |
| Full gradient descent       | Gradient of $\mathcal{L}$; complexity is $O(ND)$ per step  | Subgradient of $\mathcal{L}$; complexity is $O(ND)$ per step  |
| Stochastic gradient descent | Gradient of $\mathcal{L}_n$; complexity is $O(D)$ per step | Subgradient of $\mathcal{L}_n$; complexity is $O(D)$ per step |

Constrained optimization

Sometimes, optimization problems come posed with an additional constraint.

Convex sets

We’ve seen convexity for functions, but we can also define it for sets. A set $\mathcal{C}$ is convex iff the line segment between any two points of $\mathcal{C}$ lies in $\mathcal{C}$. That is, $\forall \mathbf{u}, \mathbf{v} \in \mathcal{C}$ and $\forall \theta \in [0, 1]$, we have:

$$\theta \mathbf{u} + (1 - \theta) \mathbf{v} \in \mathcal{C}$$

This means that the line between any two points in the set must also be fully contained within the set.

Examples of convex and non-convex sets

A couple of properties of convex sets:

  • Intersection of convex sets is also convex.
  • Projections onto convex sets are unique (and often efficient to compute).

Projected gradient descent

When dealing with constrained problems, we have two options. The first one is to add a projection onto $\mathcal{C}$ in every step:

$$P_{\mathcal{C}}(\mathbf{w}') := \underset{\mathbf{v} \in \mathcal{C}}{\operatorname{argmin}} \ \|\mathbf{v} - \mathbf{w}'\|$$

The rule for gradient descent can thus be updated to become:

$$\mathbf{w}^{(t+1)} := P_{\mathcal{C}}\left( \mathbf{w}^{(t)} - \gamma \nabla \mathcal{L}\left(\mathbf{w}^{(t)}\right) \right)$$

This means that at every step, we compute the new $\mathbf{w}$ normally, but apply a projection on top of that. In other words, if the regular gradient descent sets our weights outside of the constrained space, we project them back.

Steps of projected SGD
Here, $\mathbf{w}'$ is the result of a regular (S)GD step, i.e. $\mathbf{w}' = \mathbf{w}^{(t)} - \gamma \nabla \mathcal{L}(\mathbf{w}^{(t)})$.

This is the same for stochastic gradient descent, and we have the same convergence properties.

Note that the computational cost of the projection is very important here, since it is performed at every step.
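As a small illustration (our own sketch; the $L_2$-ball constraint and the function names are assumptions, not something fixed by the course), projected gradient descent on a convex set could look like this:

```python
import numpy as np

def project_l2_ball(w, radius=1.0):
    """Projection onto the L2 ball {w : ||w|| <= radius}, a convex set."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_gd(grad_fn, w0, project, gamma=0.1, num_iters=200):
    """Take a regular gradient step, then project back onto the constraint set."""
    w = w0
    for _ in range(num_iters):
        w_prime = w - gamma * grad_fn(w)   # regular gradient step
        w = project(w_prime)               # projection step
    return w

# minimize ||w - c||^2 subject to ||w|| <= 1, with c outside the ball
c = np.array([2.0, 2.0])
w_star = projected_gd(lambda w: 2 * (w - c), np.zeros(2), project_l2_ball)
print(w_star, np.linalg.norm(w_star))  # lands on the boundary of the ball
```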

Turning constrained problems into unconstrained problems

If projection as described above is approach A, this is approach B.

We use a penalty function, such as the “brick wall” indicator function below:

$$I_{\mathcal{C}}(\mathbf{w}) := \begin{cases} 0 & \text{if } \mathbf{w} \in \mathcal{C} \\ +\infty & \text{otherwise} \end{cases}$$

We could also perhaps use something with a less drastic penalty value than $+\infty$, if we don’t care about the constraint quite as strongly.

Note that this is similar to regularization, which we’ll talk about later.

Now, instead of directly solving $\min_{\mathbf{w} \in \mathcal{C}} \mathcal{L}(\mathbf{w})$, we solve for:

$$\min_{\mathbf{w}} \ \mathcal{L}(\mathbf{w}) + I_{\mathcal{C}}(\mathbf{w})$$

Implementation issues in gradient methods

Stopping criteria

When the gradient $\nabla \mathcal{L}(\mathbf{w})$ is zero (or close to zero), we are often close to the optimum.

Optimality

For a convex optimization problem, a necessary condition for optimality is that the gradient is 0 at the optimum:

$$\mathbf{w}^* \text{ is optimal} \implies \nabla \mathcal{L}(\mathbf{w}^*) = 0$$

For convex functions, the converse also holds: if the gradient is 0, then we’re at an optimum:

$$\nabla \mathcal{L}(\mathbf{w}^*) = 0 \implies \mathbf{w}^* \text{ is a global optimum}$$

This tells us when $\mathbf{w}^*$ is an optimum, but says nothing about whether it’s a minimum or a maximum. To know about that, we must look at the second derivative, or in the general case where $D > 1$, the Hessian. The Hessian is the matrix of second derivatives, defined as follows:

$$\left( \mathbf{H}(\mathbf{w}) \right)_{i, j} := \frac{\partial^2 \mathcal{L}(\mathbf{w})}{\partial w_i \, \partial w_j}$$

If the Hessian at the optimum is positive semi-definite, then it is a minimum (and not a maximum or a saddle point):

$$\mathbf{H}(\mathbf{w}^*) \succeq 0$$

The Hessian is also related to convexity; it is positive semi-definite on the function’s entire domain (i.e. all its eigenvalues are non-negative everywhere) if and only if the function is convex.

Step size

If $\gamma$ is too big, we might diverge (as seen previously). But if it is too small, we might be very slow! Convergence is only guaranteed for $\gamma < \gamma_{\max}$, where $\gamma_{\max}$ is a value that depends on the problem.

Least squares

Normal equations

In some rare cases, we can take an analytical approach to computing the optimum of the cost function, rather than a computational one; for instance, for linear regression with MSE, as we’ve done previously. These types of equations are sometimes called normal equations. This is one of the most popular methods for data fitting, called least squares.

How do we get these normal equations?

First, we show that the problem is convex. If that is the case, then according to the optimality conditions for convex functions, the point at which the gradient is zero is the optimum:

$$\nabla \mathcal{L}(\mathbf{w}^*) = 0$$

This gives us a system of equations known as the normal equations.

Single parameter linear regression

Let’s try this for a single parameter linear regression (where the only parameter is the constant $w_0$), with MSE as the cost function. We will start by accepting that the cost function is convex in the parameter3.

As proven previously, we know that for the single parameter model, the derivative is:

$$\frac{\partial \mathcal{L}(w_0)}{\partial w_0} = w_0 - \bar{y}$$

This means that the derivative is 0 for $w_0 = \bar{y}$. This allows us to define our optimum parameter as $w_0^* := \bar{y}$.

Multiple parameter linear regression

Having done the single-parameter case, let’s look at the general case where we have $D$ parameters. As we know by now, the cost function for linear regression with MSE is:

$$\mathcal{L}(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w} \right)^2 = \frac{1}{2N} \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right)^T \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right)$$

Where the matrices are defined as:

$$\mathbf{y} \in \mathbb{R}^N, \qquad \mathbf{X} \in \mathbb{R}^{N \times D}, \qquad \mathbf{w} \in \mathbb{R}^D$$

We denote the $n$-th row of $\mathbf{X}$ by $\mathbf{x}_n^T$. Each $\mathbf{x}_n$ represents a different data point.

We claim that this cost function is convex in $\mathbf{w}$. We can prove that in any of the following ways:


Simplest way

The cost function is the sum of many convex functions, and is thus also convex.

Directly verify the definition

The difference between the left-hand side and the right-hand side of the convexity inequality reduces to:

$$\mathcal{L}\left( \lambda \mathbf{w}_1 + (1 - \lambda) \mathbf{w}_2 \right) - \lambda \mathcal{L}(\mathbf{w}_1) - (1 - \lambda) \mathcal{L}(\mathbf{w}_2) = -\frac{\lambda (1 - \lambda)}{2N} \left\| \mathbf{X} (\mathbf{w}_1 - \mathbf{w}_2) \right\|^2$$

which indeed is $\le 0$.

Compute the Hessian

As we’ve seen previously, if the Hessian is positive semi-definite, then the function is convex. For our case, the Hessian is given by:

$$\mathbf{H}(\mathbf{w}) = \frac{1}{N} \mathbf{X}^T \mathbf{X}$$

This is indeed positive semi-definite, as its eigenvalues are (up to the $\frac{1}{N}$ factor) the squares of the singular values of $\mathbf{X}$, and must therefore be non-negative.


Knowing that the function is convex, we can find the minimum. If we take the gradient of this expression, we get:

$$\nabla \mathcal{L}(\mathbf{w}) = -\frac{1}{N} \mathbf{X}^T \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right)$$

We can set this to 0 to get the normal equations for linear regression, which are:

$$\mathbf{X}^T \left( \mathbf{y} - \mathbf{X} \mathbf{w}^* \right) = \mathbf{0} \quad \iff \quad \mathbf{X}^T \mathbf{X} \mathbf{w}^* = \mathbf{X}^T \mathbf{y}$$

This proves that the normal equations for linear regression are given by $\mathbf{X}^T (\mathbf{y} - \mathbf{X} \mathbf{w}^*) = \mathbf{0}$.

Geometric interpretation

The above definition of the normal equations is given by $\mathbf{X}^T (\mathbf{y} - \mathbf{X} \mathbf{w}^*) = \mathbf{0}$. How can we visualize that?

The error is given by:

$$\mathbf{e} = \mathbf{y} - \mathbf{X} \mathbf{w}^*$$

By definition, this error vector is orthogonal to all columns of $\mathbf{X}$. Indeed, it tells us how far above or below the span our prediction $\mathbf{X} \mathbf{w}^*$ is.

The span of $\mathbf{X}$ is the space spanned by the columns of $\mathbf{X}$. Every element of the span can be written as $\mathbf{u} = \mathbf{X} \mathbf{w}$ for some choice of $\mathbf{w}$.

For the normal equations, we must pick an optimal $\mathbf{w}^*$ for which the gradient is 0. Picking a $\mathbf{w}$ is equivalent to picking an element $\mathbf{u} = \mathbf{X} \mathbf{w}$ from the span of $\mathbf{X}$.

But which element of $\operatorname{span}(\mathbf{X})$ shall we take, which one is the optimal one? The normal equations tell us that the optimal choice, called $\mathbf{u}^* = \mathbf{X} \mathbf{w}^*$, is the element such that $\mathbf{y} - \mathbf{u}^*$ is orthogonal to $\operatorname{span}(\mathbf{X})$.

In other words, we should pick $\mathbf{u}^*$ to be the projection of $\mathbf{y}$ onto $\operatorname{span}(\mathbf{X})$.

Geometric interpretation of the normal equations

Closed form

All we’ve done so far is to solve the same old problem of a matrix equation:

$$\mathbf{A} \mathbf{x} = \mathbf{b}$$

But we’ve always done so with a bit of a twist; there may not be an exact value of $\mathbf{x}$ satisfying exact equality, but we can find one that gets us as close as possible:

$$\min_{\mathbf{x}} \ \left\| \mathbf{b} - \mathbf{A} \mathbf{x} \right\|$$

This is also what least squares does. It attempts to minimize the MSE to get as close as possible to $\mathbf{b}$.

In this course, we denote the data matrix $\mathbf{A}$ as $\mathbf{X}$, the weights $\mathbf{x}$ as $\mathbf{w}$, and $\mathbf{b}$ as $\mathbf{y}$; in other words, we’re trying to solve:

$$\mathbf{X} \mathbf{w} \approx \mathbf{y}$$

In least squares, we multiply this whole equation by $\mathbf{X}^T$ on the left. We attempt to find $\mathbf{w}^*$, the weight vector that makes the error as small as possible. In other words, we’re trying to solve:

$$\mathbf{X}^T \mathbf{X} \mathbf{w}^* = \mathbf{X}^T \mathbf{y}$$

One way to solve this problem would simply be to invert the matrix, which in our case is $\mathbf{X}^T \mathbf{X}$:

$$\mathbf{w}^* = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}$$

As such, we can use this model to predict values for unseen data points:

$$\hat{y}_m := \mathbf{x}_m^T \mathbf{w}^* = \mathbf{x}_m^T \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}$$

Invertibility and uniqueness

Note that the Gram matrix, defined as $\mathbf{X}^T \mathbf{X} \in \mathbb{R}^{D \times D}$, is invertible if and only if $\mathbf{X}$ has full column rank, or in other words, $\operatorname{rank}(\mathbf{X}) = D$.

Unfortunately, in practice, our data matrix is often rank-deficient.

  • If $N < D$, we always have $\operatorname{rank}(\mathbf{X}) < D$ (since column and row rank are the same, which implies that $\operatorname{rank}(\mathbf{X}) \le N < D$).
  • If $N \ge D$, but some of the columns are collinear (or in practice, nearly collinear), then the matrix is ill-conditioned. This leads to numerical issues when solving the linear system.

    To know how bad things are, we can compute the condition number, which is the maximum eigenvalue of the Gram matrix, divided by the minimum eigenvalue. See the course contents of Numerical Methods.

If our data matrix is rank-deficient or ill-conditioned (which is practically always the case), we certainly shouldn’t be inverting it directly! We’ll introduce high numerical errors that falsify our output.

That doesn’t mean we can’t do least squares in practice. We can still use a linear solver. In Python, that means you should use np.linalg.solve, which uses an LU decomposition internally and thus avoids the worst numerical errors. In any case, do not directly invert the matrix as we have done above!
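A minimal sketch of doing least squares this way (our own function name; np.linalg.solve is the only library call assumed):

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y with a linear solver,
    rather than explicitly inverting the Gram matrix."""
    gram = X.T @ X
    return np.linalg.solve(gram, X.T @ y)

X = np.random.randn(100, 5)
y = X @ np.arange(1.0, 6.0) + 0.1 * np.random.randn(100)
w_star = least_squares(X, y)
y_pred = X @ w_star   # predictions for (seen or unseen) data points
print(w_star)
```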

Maximum likelihood

Maximum likelihood offers a second interpretation of least squares, but starting with a probabilistic approach.

Gaussian distribution

A Gaussian random variable $y \in \mathbb{R}$ has mean $\mu$ and variance $\sigma^2$. Its distribution is given by:

$$p(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2 \sigma^2} \right)$$

For a Gaussian random vector, we have $\mathbf{y} \in \mathbb{R}^N$ (instead of a single random variable in $\mathbb{R}$). The vector has mean $\boldsymbol{\mu} \in \mathbb{R}^N$ and covariance $\boldsymbol{\Sigma} \in \mathbb{R}^{N \times N}$ (which is positive semi-definite), and its distribution is given by:

$$p(\mathbf{y} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2 \pi)^N \det(\boldsymbol{\Sigma})}} \exp\left( -\frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu}) \right)$$

As another reminder, two variables $X$ and $Y$ are said to be independent when $p(x, y) = p(x) \, p(y)$.

A probabilistic model for least squares

We assume that our data is generated by a linear model $\mathbf{x}_n^T \mathbf{w}$, with added Gaussian noise $\epsilon_n$:

$$y_n = \mathbf{x}_n^T \mathbf{w} + \epsilon_n$$

This is often a realistic assumption in practice.

Noise generated by a Gaussian source

The noise is $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$ for each data point $n$. In other words, it is centered at 0, has a certain variance, and the noise on each data point is independent of that on the other data points.

The model $\mathbf{w}$ is, as always, unknown. But we can try to do a thought experiment: if we did know the model $\mathbf{w}$ and the data $\mathbf{X}$, in a system without the noise $\epsilon$, we would know the labels $\mathbf{y}$ with 100% certainty. The only thing that prevents that is the noise $\epsilon$; therefore, given the model and data, the probability distribution of seeing a certain $\mathbf{y}$ is only given by all the noise sources $\epsilon_n$. Since they are generated independently for each data point, we can take the product of these noise sources.

Therefore, given $N$ samples, the likelihood of the data vector $\mathbf{y}$ given the model $\mathbf{w}$ and the input $\mathbf{X}$ is:

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{n=1}^{N} p(y_n \mid \mathbf{x}_n, \mathbf{w}) = \prod_{n=1}^{N} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(y_n - \mathbf{x}_n^T \mathbf{w})^2}{2 \sigma^2} \right)$$

Intuitively, we’d like to maximize this likelihood over the choice of $\mathbf{w}$. The best model is the one that maximizes this likelihood.

Defining cost with log-likelihood

The log-likelihood (LL) is given by:

$$\mathcal{L}_{LL}(\mathbf{w}) := \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = -\frac{1}{2 \sigma^2} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w} \right)^2 + \text{cst}$$

Taking the log allows us to get away from the nasty product, and get a nice sum instead. Notice that this definition looks pretty similar to MSE:

$$\text{MSE}(\mathbf{w}) = \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w} \right)^2$$

Note that we would like to minimize MSE, but we want the log-likelihood to be as high as possible (intuitively, we can look at the sign to understand that).

Maximum likelihood estimator (MLE)

Maximizing the log-likelihood (and thus the likelihood) is equivalent to minimizing the MSE; this gives us another way to design cost functions. We can describe the whole process as:

$$\mathbf{w}^*_{\text{MLE}} = \underset{\mathbf{w}}{\operatorname{argmax}} \ \mathcal{L}_{LL}(\mathbf{w}) = \underset{\mathbf{w}}{\operatorname{argmin}} \ \text{MSE}(\mathbf{w})$$

The maximum likelihood estimator (MLE) can be understood as finding the model under which the observed data is most likely to have been generated (probabilistically). This interpretation has some advantages that we discuss below.

Properties of MLE

MLE is a sample approximation to the expected log-likelihood:

$$\mathcal{L}_{LL}(\mathbf{w}) \approx \mathbb{E}_{p(y, \mathbf{x})}\left[ \log p(y \mid \mathbf{x}, \mathbf{w}) \right]$$

In other words, if we had an infinite amount of data, the MLE objective would perfectly equal the true expected log-likelihood.

This means that MLE is consistent, i.e. it gives us the correct model assuming we have enough data. This means it converges in probability4 to the true value:

$$\mathbf{w}_{\text{MLE}} \xrightarrow{\ p\ } \mathbf{w}_{\text{true}}$$

MLE is asymptotically normal, meaning that the difference between the estimate and the true value of the weights converges in distribution4 to a normal distribution centered at 0, and with variance given by $\frac{1}{N}$ times the inverse Fisher information at the true value:

$$\left( \mathbf{w}_{\text{MLE}} - \mathbf{w}_{\text{true}} \right) \xrightarrow{\ d\ } \mathcal{N}\left( 0, \ \tfrac{1}{N} \mathbf{F}^{-1}(\mathbf{w}_{\text{true}}) \right)$$

Where the Fisher information5 is:

$$\mathbf{F}(\mathbf{w}) = -\mathbb{E}\left[ \nabla^2_{\mathbf{w}} \log p(y \mid \mathbf{x}, \mathbf{w}) \right]$$

This sounds amazing, but the catch is that this all is under the assumption that the noise indeed was generated under a Gaussian model, which may not always be true. We’ll relax this assumption later when we talk about exponential families.

Overfitting and underfitting

Models can be too limited; when we can’t find a function that fits the data well, we say that we are underfitting. But on the other hand, models can also be too rich: in this case, we don’t just model the data, but also the underlying noise. This is called overfitting. Knowing exactly where we are on this spectrum is difficult, since all we have is data, and we don’t know a priori what is signal and what is noise.

Sections 3 and 5 of Pedro Domingos’ paper A Few Useful Things to Know about Machine Learning are a good read on this topic.

Underfitting with linear models

Linear models can very easily underfit; as soon as the data itself is given by anything more complex than a line, fitting a linear model will underfit: the model is too simple for the data, and we’ll have huge errors.

But we can also easily overfit, where our model learns the specificities of the data too intimately. And this happens quite easily with linear combination of high-degree polynomials.

Extended feature vectors

We can actually get high-degree linear combinations of polynomials, but still keep our linear model. Instead of making the model more complex, we simply “augment” the input to become degree $M$. If the input is one-dimensional, we can add a polynomial basis to the input:

$$\boldsymbol{\phi}(x_n) := \begin{pmatrix} 1 \\ x_n \\ x_n^2 \\ \vdots \\ x_n^M \end{pmatrix}$$

Note that, stacked over all data points, this is basically a Vandermonde matrix.

We then fit a linear model to this extended feature vector $\boldsymbol{\phi}(x_n)$:

$$y_n \approx w_0 + w_1 x_n + w_2 x_n^2 + \dots + w_M x_n^M = \boldsymbol{\phi}(x_n)^T \mathbf{w}$$

Here, $\mathbf{w} \in \mathbb{R}^{M+1}$. In other words, there are $M + 1$ parameters in a degree-$M$ extended feature vector. One should be careful with this degree; too high may overfit, too low may underfit.

If it is important to distinguish the original input from the augmented input, then we will use the $\boldsymbol{\phi}(\cdot)$ notation. But often, we can just consider this as a part of the pre-processing, and simply write $\mathbf{X}$ as the input, which will save us a lot of notation.
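A minimal sketch of polynomial feature expansion followed by a linear fit (our own code; np.vander builds the Vandermonde-style matrix mentioned above):

```python
import numpy as np

def poly_expand(x, degree):
    """Build the extended feature matrix [1, x, x^2, ..., x^degree]
    for a one-dimensional input vector x."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)

Phi = poly_expand(x, degree=3)                   # shape (20, 4): M+1 parameters
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)      # fit a linear model on Phi
print(w)
```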

Reducing overfitting

To reduce overfitting, we can choose a less complex model (in the above, we can pick a lower degree $M$), but we could also just add more data:

An overfitted model acts more reasonably when we add a bunch of data

Regularization

To prevent overfitting, we can introduce regularization to penalize complex models. This can be applied to any model.

The idea is to not only minimize cost, but also minimize a regularizer:

$$\min_{\mathbf{w}} \ \mathcal{L}(\mathbf{w}) + \Omega(\mathbf{w})$$

The function $\Omega$ is the regularizer, measuring the complexity of the model. We’ll see some good candidates for the regularizer below.

$L_2$-Regularization: Ridge Regression

The most frequently used regularizer is the standard Euclidean norm ($L_2$-norm):

$$\Omega(\mathbf{w}) = \lambda \left\| \mathbf{w} \right\|_2^2$$

Where $\lambda \in \mathbb{R}$. The value of $\lambda$ will affect the fit; $\lambda \to 0$ can lead to overfitting, while $\lambda \to \infty$ can lead to underfitting.

The $L_2$-norm is given by:

$$\left\| \mathbf{w} \right\|_2 = \sqrt{\sum_{d=1}^{D} w_d^2}$$

The main effect of this is that large model weights will be penalized, while small ones won’t affect our minimization too much.

Ridge regression

Depending on the values we choose for $\mathcal{L}$ and $\Omega$, we get into some special cases. For instance, choosing MSE for $\mathcal{L}$ is called ridge regression, in which we optimize the following:

$$\min_{\mathbf{w}} \ \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w} \right)^2 + \lambda \left\| \mathbf{w} \right\|_2^2$$

Least squares is also a special case of ridge regression, where $\lambda = 0$.

We can find an explicit solution for $\mathbf{w}$ in ridge regression by differentiating the cost and the regularizer:

$$\nabla \mathcal{L}(\mathbf{w}) = -\frac{1}{N} \mathbf{X}^T \left( \mathbf{y} - \mathbf{X} \mathbf{w} \right), \qquad \nabla \Omega(\mathbf{w}) = 2 \lambda \mathbf{w}$$

We can now set the full gradient to zero, which gives us the result:

$$\mathbf{w}^*_{\text{ridge}} = \left( \mathbf{X}^T \mathbf{X} + \lambda' \mathbf{I} \right)^{-1} \mathbf{X}^T \mathbf{y}$$

Where $\lambda' := 2 N \lambda$. Note that for $\lambda = 0$, we recover the least squares solution.
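A minimal sketch of the closed-form ridge solution (our own function; it follows the $\lambda' = 2N\lambda$ scaling used above, and solves the linear system rather than inverting):

```python
import numpy as np

def ridge_regression(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lambda' I)^{-1} X^T y,
    with lambda' = 2 N lambda, computed via a linear system solve."""
    N, D = X.shape
    lam_prime = 2 * N * lam
    return np.linalg.solve(X.T @ X + lam_prime * np.eye(D), X.T @ y)

X = np.random.randn(50, 10)
y = X @ np.random.randn(10) + 0.1 * np.random.randn(50)
print(ridge_regression(X, y, lam=0.1))
print(ridge_regression(X, y, lam=0.0))   # lambda = 0 recovers least squares
```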

Ridge regression to fight ill-conditioning

This formulation of $\mathbf{w}^*_{\text{ridge}}$ is quite nice, because adding the scaled identity matrix $\lambda' \mathbf{I}$ helps us get something that is always invertible; in cases where we have ill-conditioned matrices, it also means that we can invert with more stability.

We’ll prove that the matrix $\mathbf{X}^T \mathbf{X} + \lambda' \mathbf{I}$ indeed is invertible. The gist is that its eigenvalues are all at least $\lambda'$.

To prove it, we’ll write the singular value decomposition (SVD) of $\mathbf{X}$ as $\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^T$. We then have:

$$\mathbf{X}^T \mathbf{X} + \lambda' \mathbf{I} = \mathbf{V} \mathbf{S}^T \mathbf{S} \mathbf{V}^T + \lambda' \mathbf{I} = \mathbf{V} \left( \mathbf{S}^T \mathbf{S} + \lambda' \mathbf{I} \right) \mathbf{V}^T$$

Each squared singular value is “lifted” by an amount $\lambda'$, so all eigenvalues are at least $\lambda' > 0$. There’s an alternative proof in the class notes, but we won’t go into that.

$L_1$-Regularization: The Lasso

We can use a different norm as an alternative measure of complexity. The combination of the $L_1$-norm and MSE is known as The Lasso:

$$\min_{\mathbf{w}} \ \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - \mathbf{x}_n^T \mathbf{w} \right)^2 + \lambda \left\| \mathbf{w} \right\|_1$$

Where the $L_1$-norm is defined as:

$$\left\| \mathbf{w} \right\|_1 := \sum_{d=1}^{D} \left| w_d \right|$$

If we draw out a constant value of the norm, we get a sort of “ball”. Below, we’ve graphed the $L_1$ ball $\{ \mathbf{w} : \| \mathbf{w} \|_1 \le 1 \}$.

Graph of the lasso

To keep things simple in the following, we’ll just claim that $\mathbf{X}^T \mathbf{X}$ is invertible. We’ll also claim that the following set is an ellipsoid which scales around its center as we change $c$:

$$\left\{ \mathbf{w} \in \mathbb{R}^D : \left\| \mathbf{y} - \mathbf{X} \mathbf{w} \right\|^2 = c \right\}$$

The slides have a formal proof for this, but we won’t get into it.

Note that the above definition of the set corresponds to the set of points with equal loss (which we can assume is MSE, for instance):

$$\mathcal{L}(\mathbf{w}) = \frac{1}{2N} \left\| \mathbf{y} - \mathbf{X} \mathbf{w} \right\|^2 = \text{cst}$$

Under these assumptions, we claim that for $L_1$ regularization, the optimum solution will likely be sparse (many zero components) compared to $L_2$ regularization.

To prove this, suppose we know the $L_1$ norm of the optimum solution. Visualizing the $L_1$ ball of that radius, we know that our optimum solution will be somewhere on the surface of that ball. We also know that there are ellipsoids, all with the same center and rotation, describing the equal-error surfaces. The optimum solution is where the “smallest” of these ellipsoids just touches the ball.

Intersection of the L1 ball and the cost ellipses

Due to the geometry of this ball, this point is more likely to be on one of the “corner” points. In turn, sparsity is desirable, since it leads to a “simple” model.

Model selection

As we’ve seen in ridge regression, we have a regularization parameter $\lambda$ that can be tuned to reduce overfitting by reducing model complexity. We say that $\lambda$ is a hyperparameter.

We’ve also seen ways to enrich model complexity, like polynomial feature expansion, in which the degree $M$ is also a hyperparameter.

We’ll now see how best to choose these hyperparameters; this is called the model selection problem.

Probabilistic setup

We assume that there is an (unknown) underlying distribution $\mathcal{D}$ producing the dataset, with range $\mathcal{X} \times \mathcal{Y}$. The dataset we see is produced from $N$ i.i.d. samples from $\mathcal{D}$:

$$S = \left\{ (\mathbf{x}_n, y_n) \right\}_{n=1}^{N}$$

Based on this, the learning algorithm $\mathcal{A}$ chooses the “best” model using the dataset $S$, under the parameters of the algorithm. The resulting prediction function is $f_S := \mathcal{A}(S)$. To indicate that $\mathcal{A}$ sometimes depends on hyperparameters $\lambda$, we can write the prediction function as $f_{S, \lambda}$.

Training Error vs. Generalization Error

Given a model $f$, how can we assess if $f$ is any good? We already have the loss function, but its result is highly dependent on the noise in the data, not just on how good the model is. Instead, we can compute the expected error over all samples chosen according to $\mathcal{D}$:

$$L_{\mathcal{D}}(f) := \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \left[ \ell\left( y, f(\mathbf{x}) \right) \right]$$

Where $\ell$ is our loss function; e.g. for ridge regression, the squared error $\ell(y, f(\mathbf{x})) = (y - f(\mathbf{x}))^2$.

The quantity $L_{\mathcal{D}}(f)$ has many names, including generalization error (or true/expected error/risk/loss). This is the quantity that we are fundamentally interested in, but we cannot compute it since $\mathcal{D}$ is unknown.

What we do know is the data subset6 $S$. It’s therefore natural to compute the equivalent empirical quantity, which is the average loss:

$$L_S(f) := \frac{1}{|S|} \sum_{(\mathbf{x}_n, y_n) \in S} \ell\left( y_n, f(\mathbf{x}_n) \right)$$

But again, we run into trouble. The function $f = f_S$ is itself a function of the data $S$, so what we really do is to compute the quantity:

$$L_S(f_S)$$

$f_S$ is the trained model. This is called the training error. Usually, the training error is smaller than the generalization error, because overfitting can happen (even with regularization, because the hyperparameter may still be too low).

Splitting the data

To avoid validating the model on the same data subset we trained it on (which is conducive to overfitting), we can split the data into a training set and a test set (aka validation set), which we call $S_{\text{train}}$ and $S_{\text{test}}$, so that $S = S_{\text{train}} \cup S_{\text{test}}$. A typical split could be 80% for training and 20% for testing.

We apply the learning algorithm on the training set $S_{\text{train}}$, and compute the function $f_{S_{\text{train}}}$. We then compute the error on the test set, which is the test error:

$$L_{S_{\text{test}}}\left( f_{S_{\text{train}}} \right) = \frac{1}{|S_{\text{test}}|} \sum_{(\mathbf{x}_n, y_n) \in S_{\text{test}}} \ell\left( y_n, f_{S_{\text{train}}}(\mathbf{x}_n) \right)$$

If we have duplicates in our data, then this could be a bit dangerous. Still, in general, this really helps us with the problem of overfitting since $S_{\text{test}}$ is a “fresh” sample, which means that we can hope that the quantity $L_{S_{\text{test}}}(f_{S_{\text{train}}})$ defined above is close to the quantity $L_{\mathcal{D}}(f_{S_{\text{train}}})$. Indeed, in expectation both are the same:

$$\mathbb{E}_{S_{\text{test}}}\left[ L_{S_{\text{test}}}\left( f_{S_{\text{train}}} \right) \right] = L_{\mathcal{D}}\left( f_{S_{\text{train}}} \right)$$

The subscript $S_{\text{test}}$ on the expectation means that the expectation is over samples of the test set, and not for a particular test set (which could give a different result due to the randomness of the selection of $S_{\text{test}}$).

This is a quite nice property, but we paid a price for this. We had to split the data and thus reduce the size of our training data. But we will see that this can be mediated using cross-validation.

Generalization error vs test error

Assume that we have a model $f$ and that our loss function is bounded in $[a, b]$. We are given a test set $S_{\text{test}}$ chosen i.i.d. from the underlying distribution $\mathcal{D}$.

How far apart is the empirical test error from the true generalization error? As we’ve seen above, they are the same in expectation. But we need to worry about the variation, about how far off from the true error we typically are:

$$\left| L_{\mathcal{D}}(f) - L_{S_{\text{test}}}(f) \right|$$

We claim that:

$$\mathbb{P}\left( \left| L_{\mathcal{D}}(f) - L_{S_{\text{test}}}(f) \right| \ge \sqrt{\frac{(b - a)^2 \ln(2 / \delta)}{2 \, |S_{\text{test}}|}} \right) \le \delta$$

Where $\delta > 0$ is a quality parameter. This gives us an upper bound on how far away our empirical loss is from the true loss.

This bound gives us some nice insights. The error decreases in the size of the test set as $O\left( 1 / \sqrt{|S_{\text{test}}|} \right)$, so the more data points we have, the more confident we can be in the empirical loss being close to the true loss.

We’ll prove this claim. We assumed that each sample in the test set is chosen independently. Therefore, given a model $f$, the associated losses are also i.i.d. random variables, taking values in $[a, b]$ by assumption. We can call each such loss $\Theta_n$:

$$\Theta_n := \ell\left( y_n, f(\mathbf{x}_n) \right), \qquad (\mathbf{x}_n, y_n) \in S_{\text{test}}$$

This is just a naming alias; since the underlying value is that of the loss function, the expected value of $\Theta_n$ is simply that of the loss function, which is the true loss:

$$\mathbb{E}[\Theta_n] = L_{\mathcal{D}}(f)$$

The empirical loss on the other hand is equal to the average of such i.i.d. values:

$$L_{S_{\text{test}}}(f) = \frac{1}{|S_{\text{test}}|} \sum_{n=1}^{|S_{\text{test}}|} \Theta_n$$

The claim gives us the probability that the empirical loss diverges from the true loss by more than a given constant, which is a classical problem addressed in the following lemma (which we’ll just assert, not prove).

Chernoff Bound: Let $\Theta_1, \dots, \Theta_N$ be a sequence of i.i.d random variables with mean $\mu$ and range $[a, b]$. Then, for any $\epsilon > 0$:

$$\mathbb{P}\left( \left| \frac{1}{N} \sum_{n=1}^{N} \Theta_n - \mu \right| \ge \epsilon \right) \le 2 e^{- \frac{2 N \epsilon^2}{(b - a)^2}}$$

Using this bound we can show the claim. By setting $\delta := 2 e^{- \frac{2 N \epsilon^2}{(b - a)^2}}$ and solving for $\epsilon$, we find the expression of the claim, as desired.

Method and criteria for model selection

Grid search on hyperparameters

Our main goal was to look for a way to select the hyperparameters of our model. Given a finite set of $K$ candidate values for a hyperparameter $\lambda$, we can run the learning algorithm $K$ times on the same training set $S_{\text{train}}$, and compute the prediction functions $f_{S_{\text{train}}, \lambda_1}, \dots, f_{S_{\text{train}}, \lambda_K}$. For each such prediction function we compute the test error, and choose the $\lambda_k$ which minimizes the test error.

Grid search on lambda

This is essentially a grid search on $\lambda$ using the test error function.

Model selection based on test error

How do we know that, for a fixed function $f$, $L_{S_{\text{test}}}(f)$ is a good approximation to $L_{\mathcal{D}}(f)$? If we’re doing a grid search on hyperparameters to minimize the test error $L_{S_{\text{test}}}(f)$, we may pick a model that obtains a lower test error, but for which the gap to the generalization error $L_{\mathcal{D}}(f)$ is larger.

We’ll therefore try to see how much the bound increases if we pick a false positive, a model that has lower test error but that actually strays further away from the generalization error.

The answer to this follows the same idea as when we talked about generalization vs test error, but we now assume that we have $K$ models $f_k$ for $k = 1, \dots, K$. We assume again that the loss function is bounded in $[a, b]$, and that we’re given a test set $S_{\text{test}}$ whose samples are chosen i.i.d. from $\mathcal{D}$.

How far is each of the (empirical) test errors $L_{S_{\text{test}}}(f_k)$ from the true $L_{\mathcal{D}}(f_k)$? As before, we can bound the deviation for all $K$ candidates, by:

$$\mathbb{P}\left( \max_{k} \left| L_{\mathcal{D}}(f_k) - L_{S_{\text{test}}}(f_k) \right| \ge \sqrt{\frac{(b - a)^2 \ln(2 K / \delta)}{2 \, |S_{\text{test}}|}} \right) \le \delta$$

A bit of intuition of where this comes from: for a general $K$, we check the deviations for the $K$ candidates and ask for the probability that for at least one of them we get a deviation of at least $\epsilon$ (this is what the bound answers). Then by the union bound this probability is at most $K$ times as large as in the case where we are only concerned with a single instance. Thus the upper bound in Chernoff becomes $2 K e^{- \frac{2 N \epsilon^2}{(b - a)^2}}$, which gives us the $\ln(2K/\delta)$ term as above.

As before, this tells us that the error decreases in $|S_{\text{test}}|$ as $O\left( 1 / \sqrt{|S_{\text{test}}|} \right)$.

However, now that we test $K$ hyperparameters, our error bound only goes up by a tiny amount of order $\sqrt{\ln K}$. This follows from the bound above, which we previously proved for the special case of $K = 1$. So we can reasonably do grid search, knowing that in the worst case, the error will only increase by a tiny amount.

Cross-validation

Splitting the data once into two parts (one for training and one for testing) is not the most efficient way to use the data. Cross-validation is a better way.

K-fold cross-validation is a popular variant. We randomly partition the data into $K$ groups, and train $K$ times. Each time, we use one of the $K$ groups as our test set, and the remaining $K - 1$ groups for training.

To get a common result, we average out the results: we compute the test error for each of the $K$ runs and report the average test error over the folds.

Cross-validation returns an unbiased estimate of the generalization error and its variance.
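A minimal K-fold cross-validation sketch (our own helper names; the training and loss functions are passed in as parameters):

```python
import numpy as np

def k_fold_cv_error(X, y, train_fn, loss_fn, K=5, seed=0):
    """K-fold cross-validation: train K times, each time holding out one fold
    as the test set, and average the K test errors."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), K)
    errors = []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        w = train_fn(X[train_idx], y[train_idx])
        errors.append(loss_fn(y[test_idx], X[test_idx] @ w))
    return np.mean(errors)

# example: estimate the generalization error of plain least squares
ls = lambda X, y: np.linalg.solve(X.T @ X, X.T @ y)
mse = lambda y, y_pred: np.mean((y - y_pred) ** 2)
X = np.random.randn(100, 3)
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * np.random.randn(100)
print(k_fold_cv_error(X, y, ls, mse))
```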

Bias-Variance decomposition

When we perform model selection, there is an inherent bias–variance trade-off.

Bullseye representation of bias vs variance
Graphical illustration of bias and variance. Taken from Scott Fortmann-Roe's website

If we were to build the same model over and over again with re-sampled datasets, our predictions would change because of the randomness in the used datasets. Bias tells us how far off from the correct value our predictions are in general, while variance tells us about the variability in predictions for a given point in-between realizations of the models.

For now, we’ll just look at “high-bias & low-variance” models, and “high-variance & low-bias” models.

  • High-bias & low-variance: the model is too simple. It’s underfit, has a large bias, and the variance of $f_S$ is small (i.e. there is little variation due to the random sample $S$).
  • High-variance & low-bias: the model is too complex. It’s overfit, has a small bias and a large variance of $f_S$ (the prediction depends strongly on the exact choice of $S$; a single additional data point is likely to change the prediction function considerably).

Consider a linear regression with one-dimensional input and polynomial feature expansion of degree $M$. The former can be achieved by picking a value of $M$ that is too low, while the latter by picking $M$ too high. The same principle applies to other hyperparameters, such as the ridge regression hyperparameter $\lambda$.

Data generation model

Let’s assume that our data is generated by some arbitrary, unknown function $f_{\text{true}}$, and a noise source $\epsilon$ with distribution $\mathcal{D}_\epsilon$ (i.i.d. from sample to sample, and independent from the data):

$$y_n = f_{\text{true}}(\mathbf{x}_n) + \epsilon_n$$

We can think of $f_{\text{true}}$ as representing the precise, hypothetical function that perfectly produced the data. We assume that the noise has mean zero (without loss of generality, as a non-zero mean could be encoded into $f_{\text{true}}$).

We assume that $\mathbf{x}_n$ is generated according to some fixed but unknown distribution $\mathcal{D}_{\mathbf{x}}$. We’ll be working with the square loss $\ell(y, z) = (y - z)^2$. We will denote the joint distribution on pairs $(\mathbf{x}, y)$ as $\mathcal{D}$.

Error Decomposition

As always, we have a training set $S = \{ (\mathbf{x}_n, y_n) \}_{n=1}^{N}$, which consists of $N$ i.i.d. samples from $\mathcal{D}$. Given our learning algorithm $\mathcal{A}$, we compute the prediction function $f_S := \mathcal{A}(S)$. The square loss of a single prediction for a fixed element $\mathbf{x}_0$ is given by the computation of:

$$\left( y_0 - f_S(\mathbf{x}_0) \right)^2, \qquad \text{where } y_0 = f_{\text{true}}(\mathbf{x}_0) + \epsilon_0$$

Our experiment was to create $S$, learn $f_S$, and then evaluate the performance by computing the square loss for a fixed element $\mathbf{x}_0$. If we run this experiment many times, the expected value is written as:

$$\mathbb{E}_{S, \epsilon_0}\left[ \left( y_0 - f_S(\mathbf{x}_0) \right)^2 \right]$$

This expectation is over randomly selected training sets of size $N$, and over the noise source. We will now show that this expression can be rewritten as a sum of three non-negative terms:

$$\mathbb{E}_{S, \epsilon_0}\left[ \left( y_0 - f_S(\mathbf{x}_0) \right)^2 \right] = \underbrace{\operatorname{Var}[\epsilon_0]}_{\text{noise variance}} + \underbrace{\left( f_{\text{true}}(\mathbf{x}_0) - \mathbb{E}_{S'}\left[ f_{S'}(\mathbf{x}_0) \right] \right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_S\left[ \left( f_S(\mathbf{x}_0) - \mathbb{E}_{S'}\left[ f_{S'}(\mathbf{x}_0) \right] \right)^2 \right]}_{\text{variance}}$$

Note that here, $S'$ is a second training set, also sampled from $\mathcal{D}$, that is independent of the training set $S$. It has the same expectation, but it is a different sample and thus produces a different trained model $f_{S'}$.

The first step uses $y_0 = f_{\text{true}}(\mathbf{x}_0) + \epsilon_0$ as well as linearity of expectation to split off the noise term $\mathbb{E}[\epsilon_0^2]$. The cross-term is zero, as the noise is independent from $S$ and has mean zero.

The next step uses the definition of variance as:

$$\operatorname{Var}[\epsilon_0] = \mathbb{E}\left[ \epsilon_0^2 \right] - \mathbb{E}\left[ \epsilon_0 \right]^2$$

Seeing that our noise has mean zero, we have $\mathbb{E}[\epsilon_0] = 0$ and therefore $\mathbb{E}[\epsilon_0^2] = \operatorname{Var}[\epsilon_0]$.

In the last step, we add and subtract the constant term $\mathbb{E}_{S'}[f_{S'}(\mathbf{x}_0)]$ to the remaining expression like so:

$$\mathbb{E}_S\left[ \left( f_{\text{true}}(\mathbf{x}_0) - \mathbb{E}_{S'}\left[ f_{S'}(\mathbf{x}_0) \right] + \mathbb{E}_{S'}\left[ f_{S'}(\mathbf{x}_0) \right] - f_S(\mathbf{x}_0) \right)^2 \right]$$

We can then expand the square $(a + b)^2 = a^2 + 2ab + b^2$, where $a^2$ becomes the bias term and $b^2$ the variance term; the cross-term $2ab$ vanishes in expectation. We can drop the expectation around the bias term, since it does not depend on the random sample $S$.