In statistics and machine learning, the bias–variance tradeoff describes the relationship between a model's complexity, the accuracy of its predictions, and how well it can make predictions on previously unseen data that were not used to train the model. In general, as the number of tunable parameters in a model increases, the model becomes more flexible and can fit a training data set more closely; it is said to have lower error, or bias. However, a more flexible model will tend to fit differently each time we draw a new set of samples to form a training data set; it is said that there is greater variance in the model's estimated parameters.
The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error, which prevent supervised learning algorithms from generalizing beyond their training set:[1] [2] the bias error, arising from erroneous assumptions in the learning algorithm, which can cause it to miss relevant relations between features and target outputs (underfitting); and the variance error, arising from sensitivity to small fluctuations in the training set, which can cause it to model the random noise in the training data (overfitting).
The bias–variance decomposition is a way of analyzing a learning algorithm's expected generalization error with respect to a particular problem as a sum of three terms, the bias, variance, and a quantity called the irreducible error, resulting from noise in the problem itself.
See also: Accuracy and precision. The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities (i.e. underfit) in the data.
It is a common fallacy[3] [4] to assume that complex models must have high variance. High-variance models are "complex" in some sense, but the reverse need not be true.[5] In addition, one has to be careful how to define complexity. In particular, the number of parameters used to describe a model is a poor measure of complexity. This is illustrated by an example adapted from:[6] the model f_{a,b}(x) = a\sin(bx) has only two parameters (a, b), yet it can interpolate any number of points by oscillating at a high enough frequency, resulting in both high bias and high variance.
An analogy can be made to the relationship between accuracy and precision. Accuracy is a description of bias and can intuitively be improved by selecting from only local information. Consequently, a sample will appear accurate (i.e. have low bias) under the aforementioned selection conditions, but may result in underfitting. In other words, test data may not agree as closely with training data, which would indicate imprecision and therefore inflated variance. A graphical example would be a straight line fit to data exhibiting quadratic behavior overall. Precision is a description of variance and generally can only be improved by selecting information from a comparatively larger space. The option to select many data points over a broad sample space is the ideal condition for any analysis. However, intrinsic constraints (whether physical, theoretical, computational, etc.) will always play a limiting role. The limiting case where only a finite number of data points are selected over a broad sample space may result in improved precision and lower variance overall, but may also result in an overreliance on the training data (overfitting). This means that test data would also not agree as closely with the training data, but in this case the reason is inaccuracy or high bias. To borrow from the previous example, the graphical representation would appear as a high-order polynomial fit to the same data exhibiting quadratic behavior. Note that error in each case is measured the same way, but the reason ascribed to the error is different depending on the balance between bias and variance. To mitigate how much information is used from neighboring observations, a model can be smoothed via explicit regularization, such as shrinkage.
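As a concrete sketch of the graphical examples just described, the following Python snippet fits a straight line and a high-order polynomial to noisy quadratic data and compares training and test error. It is a minimal illustration; the quadratic target, the noise level, the sample sizes, and the degrees 1 and 9 are arbitrary assumptions, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy quadratic data: y = x^2 + noise
def sample(n):
    x = rng.uniform(-3, 3, n)
    y = x**2 + rng.normal(0, 1, n)
    return x, y

x_train, y_train = sample(20)
x_test, y_test = sample(1000)

for degree in (1, 9):
    # Least-squares polynomial fit of the given degree
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")

# The degree-1 fit underfits (high bias): both errors are large.
# The degree-9 fit overfits (high variance): training error is small,
# but test error is inflated by sensitivity to the particular sample.
```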
See main article: Mean squared error. Suppose that we have a training set consisting of a set of points x_1, \dots, x_n and real values y_i associated with each point x_i. We assume that the data are generated by a function f(x) with added noise, y = f(x) + \varepsilon, where the noise \varepsilon has zero mean and variance \sigma^2.
We want to find a function \hat{f}(x;D) that approximates the true function f(x) as well as possible, by means of some learning algorithm based on a training data set (sample) D = \{(x_1,y_1), \dots, (x_n,y_n)\}. We make "as well as possible" precise by measuring the mean squared error between y and \hat{f}(x;D): we want (y - \hat{f}(x;D))^2 to be minimal, both for x_1, \dots, x_n and for points outside of our sample. Of course, we cannot hope to do so perfectly, since the y_i contain noise \varepsilon; this means we must be prepared to accept an irreducible error in any function we come up with.
Finding an \hat{f} that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function \hat{f} we select, we can decompose its expected error on an unseen sample x as follows:
\operatorname{E}_{D,\varepsilon}\left[(y - \hat{f}(x;D))^2\right] = \left(\operatorname{Bias}_D[\hat{f}(x;D)]\right)^2 + \operatorname{Var}_D[\hat{f}(x;D)] + \sigma^2
where
\operatorname{Bias}_D[\hat{f}(x;D)] = \operatorname{E}_D[\hat{f}(x;D) - f(x)] = \operatorname{E}_D[\hat{f}(x;D)] - \operatorname{E}_{y|x}[y(x)],
\operatorname{Var}_D[\hat{f}(x;D)] = \operatorname{E}_D\left[\left(\operatorname{E}_D[\hat{f}(x;D)] - \hat{f}(x;D)\right)^2\right].
and
\sigma^2 = \operatorname{E}_y\left[\left(y - \underbrace{f(x)}_{=\operatorname{E}_{y|x}[y]}\right)^2\right].
The expectation ranges over different choices of the training set D = \{(x_1,y_1), \dots, (x_n,y_n)\}, all sampled from the same joint distribution P(x,y). The three terms represent: the square of the bias of the learning method, which can be thought of as the error caused by the simplifying assumptions built into the method (for example, when approximating a non-linear function f(x) by a linear model, the estimates \hat{f}(x) will be in error because of this assumption); the variance of the learning method, or, intuitively, how much the predictions \hat{f}(x) move around their mean across training sets; and the irreducible error \sigma^2.
Since all three terms are non-negative, the irreducible error forms a lower bound on the expected error on unseen samples.
The more complex the model \hat{f}(x) is, the more data points it will capture, and the lower its bias will be. However, the added complexity also makes the model "move" more to capture the data points, and hence its variance will be larger.
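The decomposition can be estimated empirically by Monte Carlo: repeatedly draw a training set D, fit \hat{f}(\cdot;D), and record its prediction at a fixed point x; the deviation of the average prediction from f(x) estimates the bias, and the spread of the predictions around their average estimates the variance. Below is a minimal NumPy sketch along these lines; the sine target, the noise level, and the two polynomial degrees are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                      # true function f(x)
sigma = 0.3                     # noise standard deviation
x0 = 1.0                        # fixed test point x
n, n_datasets = 30, 2000

for degree in (1, 5):           # low- vs higher-complexity polynomial model
    preds = np.empty(n_datasets)
    for d in range(n_datasets):
        # Draw a fresh training set D from P(x, y)
        x = rng.uniform(0, 2 * np.pi, n)
        y = f(x) + rng.normal(0, sigma, n)
        coeffs = np.polyfit(x, y, degree)
        preds[d] = np.polyval(coeffs, x0)   # \hat{f}(x0; D)
    bias2 = (preds.mean() - f(x0)) ** 2     # (Bias_D[\hat{f}(x0;D)])^2
    var = preds.var()                       # Var_D[\hat{f}(x0;D)]
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}, "
          f"expected error ~ {bias2 + var + sigma**2:.4f}")
```

The degree-1 model shows the larger squared bias and the smaller variance; in both cases the expected squared error at x0 is the sum of the two terms plus the irreducible \sigma^2.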
The derivation of the bias–variance decomposition for squared error proceeds as follows.[9] [10] For notational convenience, we abbreviate f = f(x) and \hat{f} = \hat{f}(x;D), and we drop the D subscript on our expectation operators.
Let us write the mean-squared error of our model:
MSE \triangleq \operatorname{E}[(y - \hat{f})^2] = \operatorname{E}[y^2 - 2y\hat{f} + \hat{f}^2] = \operatorname{E}[y^2] - 2\operatorname{E}[y\hat{f}] + \operatorname{E}[\hat{f}^2]
Firstly, since we model y = f + \varepsilon, we show that
\begin{align}
\operatorname{E}[y^2] &= \operatorname{E}[(f+\varepsilon)^2] \\
&= \operatorname{E}[f^2] + 2\operatorname{E}[f\varepsilon] + \operatorname{E}[\varepsilon^2] && \text{by linearity of } \operatorname{E} \\
&= f^2 + 2f\operatorname{E}[\varepsilon] + \operatorname{E}[\varepsilon^2] && \text{since } f \text{ does not depend on the data} \\
&= f^2 + 2f \cdot 0 + \sigma^2 && \text{since } \varepsilon \text{ has zero mean and variance } \sigma^2
\end{align}
Secondly,
\begin{align}
\operatorname{E}[y\hat{f}] &= \operatorname{E}[(f+\varepsilon)\hat{f}] \\
&= \operatorname{E}[f\hat{f}] + \operatorname{E}[\varepsilon\hat{f}] && \text{by linearity of } \operatorname{E} \\
&= \operatorname{E}[f\hat{f}] + \operatorname{E}[\varepsilon]\operatorname{E}[\hat{f}] && \text{since } \hat{f} \text{ and } \varepsilon \text{ are independent} \\
&= f\operatorname{E}[\hat{f}] && \text{since } \operatorname{E}[\varepsilon] = 0
\end{align}
Lastly,
\begin{align}
\operatorname{E}[\hat{f}^2] &= \operatorname{Var}(\hat{f}) + \operatorname{E}[\hat{f}]^2 && \text{since } \operatorname{Var}[X] \triangleq \operatorname{E}\bigl[(X - \operatorname{E}[X])^2\bigr] = \operatorname{E}[X^2] - \operatorname{E}[X]^2 \text{ for any random variable } X
\end{align}
Plugging these three formulas into our previous expansion of the MSE, we obtain
\begin{align}
MSE &= f^2 + \sigma^2 - 2f\operatorname{E}[\hat{f}] + \operatorname{Var}[\hat{f}] + \operatorname{E}[\hat{f}]^2 \\
&= (f - \operatorname{E}[\hat{f}])^2 + \sigma^2 + \operatorname{Var}[\hat{f}] \\[5pt]
&= \operatorname{Bias}[\hat{f}]^2 + \sigma^2 + \operatorname{Var}[\hat{f}]
\end{align}
Finally, the MSE loss function (or negative log-likelihood) is obtained by taking the expectation value over x \sim P:
MSE = \operatorname{E}_x\left\{\operatorname{Bias}_D[\hat{f}(x;D)]^2 + \operatorname{Var}_D[\hat{f}(x;D)]\right\} + \sigma^2.
Dimensionality reduction and feature selection can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance; examples include the regularization strength of a penalized regression and the number of neighbors k in k-nearest neighbors regression (both discussed below).
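As a small illustration of the training-set-size effect (a hypothetical linear-regression setup, not taken from the text): the fitted slope stays centered on the true value while its variance across repeated samples shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)
true_slope = 2.0

def fit_slope(n):
    """Fit a least-squares line to n noisy points and return its slope."""
    x = rng.uniform(0, 1, n)
    y = true_slope * x + rng.normal(0, 1, n)
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

for n in (10, 100, 1000):
    slopes = np.array([fit_slope(n) for _ in range(500)])
    print(f"n = {n:4d}: mean slope = {slopes.mean():.3f}, variance = {slopes.var():.4f}")

# The variance of the estimated slope falls roughly as 1/n, while its bias stays near zero.
```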
One way of resolving the trade-off is to use mixture models and ensemble learning.[14] [15] For example, boosting combines many "weak" (high bias) models in an ensemble that has lower bias than the individual models, while bagging combines "strong" learners in a way that reduces their variance.
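A brief sketch of the bagging half of this idea, using scikit-learn (the data set and settings are illustrative assumptions): a single unpruned regression tree is a low-bias, high-variance learner, and averaging many such trees trained on bootstrap resamples reduces the variance and, typically, the test error.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)

def sample(n):
    X = rng.uniform(0, 2 * np.pi, (n, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.3, n)
    return X, y

X_train, y_train = sample(200)
X_test, y_test = sample(2000)

# A single unpruned tree: low bias, high variance
tree = DecisionTreeRegressor().fit(X_train, y_train)
# Bagging: average 100 trees, each trained on a bootstrap resample of the data
bag = BaggingRegressor(n_estimators=100).fit(X_train, y_train)

for name, model in [("single tree", tree), ("bagged trees", bag)]:
    mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"{name}: test MSE = {mse:.3f}")
```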
Model validation methods such as cross-validation can be used to tune models so as to optimize the trade-off.
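For example, here is a hedged sketch with scikit-learn (the synthetic data and the candidate grid of penalties are arbitrary choices): k-fold cross-validation is used to choose the ridge penalty, the tunable knob that trades bias against variance in this model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

# Illustrative data: few samples, many features, substantial noise
n, p = 50, 30
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(0, 5.0, n)

# Choose the regularization strength (the bias knob) by 5-fold cross-validation
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(f"alpha = {alpha:6.2f}: CV MSE = {-scores.mean():.2f}")

# The alpha with the lowest cross-validated MSE balances bias and variance.
```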
In the case of k-nearest neighbors regression, when the expectation is taken over the possible labeling of a fixed training set, a closed-form expression exists that relates the bias–variance decomposition to the parameter k:
\operatorname{E}\left[(y - \hat{f}(x))^2 \mid X = x\right] = \left(f(x) - \frac{1}{k}\sum_{i=1}^{k} f(N_i(x))\right)^2 + \frac{\sigma^2}{k} + \sigma^2
where N_1(x), \dots, N_k(x) are the k nearest neighbors of x in the training set. Consequently, increasing k shrinks the variance term \sigma^2/k, typically at the price of a larger bias term.
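This closed form can be checked numerically. The sketch below (pure NumPy; the sine target, noise level, and design points are illustrative assumptions) holds the training inputs fixed, resamples the labels many times, and estimates the squared bias and the variance of the k-NN prediction at one test point for several values of k; the empirical variance should track \sigma^2/k.

```python
import numpy as np

rng = np.random.default_rng(5)
f = np.sin                                           # true regression function
sigma = 0.3                                          # label noise std. dev.
x_train = np.sort(rng.uniform(0, 2 * np.pi, 50))     # fixed design points
x0 = 2.0                                             # test point

def knn_predict(k, y_train):
    """Average the labels of the k training points nearest to x0."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

for k in (1, 5, 25):
    preds = np.array([
        knn_predict(k, f(x_train) + rng.normal(0, sigma, x_train.size))
        for _ in range(5000)     # expectation over labelings of the fixed design
    ])
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print(f"k = {k:2d}: bias^2 = {bias2:.4f}, variance = {var:.4f} "
          f"(theory: sigma^2/k = {sigma**2 / k:.4f})")
```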
The bias–variance decomposition forms the conceptual basis for regression regularization methods such as LASSO and ridge regression. Regularization methods introduce bias into the regression solution that can reduce variance considerably relative to the ordinary least squares (OLS) solution. Although the OLS solution provides unbiased regression estimates, the lower-variance solutions produced by regularization techniques provide superior MSE performance.
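A hedged numerical illustration of this point (closed-form OLS and ridge estimators in NumPy; the ill-conditioned design and the penalty lam = 5 are arbitrary assumptions): across repeated noisy samples, the ridge estimator is biased but has far smaller variance, and its total coefficient MSE comes out below that of OLS.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 40, 10, 5.0
beta = rng.normal(size=p)                       # true coefficients
# Ill-conditioned design: column scales shrink from 1.0 down to 0.05
X = rng.normal(size=(n, p)) @ np.diag(np.linspace(1.0, 0.05, p))

def ols(y):
    # Ordinary least squares: (X'X)^{-1} X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(y):
    # Ridge regression: (X'X + lam*I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def estimates(estimator, n_reps=2000):
    out = np.empty((n_reps, p))
    for r in range(n_reps):
        y = X @ beta + rng.normal(0, 1.0, n)    # fresh noise each replicate
        out[r] = estimator(y)
    return out

for name, est in [("OLS", ols), ("ridge", ridge)]:
    B = estimates(est)
    bias2 = np.sum((B.mean(axis=0) - beta) ** 2)   # squared bias, summed over coefficients
    var = np.sum(B.var(axis=0))                    # total variance of the estimates
    print(f"{name:5s}: bias^2 = {bias2:.3f}, variance = {var:.3f}, MSE = {bias2 + var:.3f}")
```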
The bias–variance decomposition was originally formulated for least-squares regression. For the case of classification under the 0-1 loss (misclassification rate), it is possible to find a similar decomposition, with the caveat that the variance term becomes dependent on the target label.[16] [17] Alternatively, if the classification problem can be phrased as probabilistic classification, then the expected cross-entropy can instead be decomposed to give bias and variance terms with the same semantics but taking a different form.
It has been argued that as training data increases, the variance of learned models will tend to decrease, and hence that as training data quantity increases, error is minimized by methods that learn models with lesser bias, and that conversely, for smaller training data quantities it is ever more important to minimize variance.[18]
Even though the bias–variance decomposition does not directly apply in reinforcement learning, a similar tradeoff can also characterize generalization. When an agent has limited information on its environment, the suboptimality of an RL algorithm can be decomposed into the sum of two terms: a term related to an asymptotic bias and a term due to overfitting. The asymptotic bias is directly related to the learning algorithm (independently of the quantity of data) while the overfitting term comes from the fact that the amount of data is limited.[19]
While widely discussed in the context of machine learning, the bias–variance dilemma has been examined in the context of human cognition, most notably by Gerd Gigerenzer and co-workers in the context of learned heuristics. They have argued (see references below) that the human brain resolves the dilemma in the case of the typically sparse, poorly-characterized training-sets provided by experience by adopting high-bias/low variance heuristics. This reflects the fact that a zero-bias approach has poor generalizability to new situations, and also unreasonably presumes precise knowledge of the true state of the world. The resulting heuristics are relatively simple, but produce better inferences in a wider variety of situations.[20]
Geman et al. argue that the bias–variance dilemma implies that abilities such as generic object recognition cannot be learned from scratch, but require a certain degree of "hard wiring" that is later tuned by experience. This is because model-free approaches to inference require impractically large training sets if they are to avoid high variance.