Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set.
The first algorithm for random decision forests was created in 1995 by Tin Kam Ho[1] using the random subspace method,[2] which, in Ho's formulation, is a way to implement the "stochastic discrimination" approach to classification proposed by Eugene Kleinberg.[3] [4] [5]
An extension of the algorithm was developed by Leo Breiman[6] and Adele Cutler, who registered[7] "Random Forests" as a trademark in 2006 (owned by Minitab, Inc.).[8] The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho[1] and later independently by Amit and Geman[9] in order to construct a collection of decision trees with controlled variance.
The general method of random decision forests was first proposed by Salzberg and Heath in 1993,[10] with a method that used a randomized decision tree algorithm to generate multiple different trees and then combine them using majority voting. This idea was developed further by Ho in 1995.[1] Ho established that forests of trees splitting with oblique hyperplanes can gain accuracy as they grow without suffering from overtraining, as long as the forests are randomly restricted to be sensitive to only selected feature dimensions. A subsequent work along the same lines[2] concluded that other splitting methods behave similarly, as long as they are randomly forced to be insensitive to some feature dimensions. Note that this observation of a more complex classifier (a larger forest) getting more accurate nearly monotonically is in sharp contrast to the common belief that the complexity of a classifier can only grow to a certain level of accuracy before being hurt by overfitting. The explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination.[3] [4] [5]
The early development of Breiman's notion of random forests was influenced by the work of Amit and Geman,[9] who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. The idea of random subspace selection from Ho[2] was also influential in the design of random forests. In this method a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was first introduced by Thomas G. Dietterich.[11]
The proper introduction of random forests was made in a paper by Leo Breiman.[6] This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. In addition, this paper combines several ingredients, some previously known and some novel, which form the basis of the modern practice of random forests, in particular the use of out-of-bag error as an estimate of the generalization error and the measurement of variable importance through permutation.
The report also offers the first theoretical result for random forests in the form of a bound on the generalization error which depends on the strength of the trees in the forest and their correlation.
See main article: Decision tree learning. Decision trees are a popular method for various machine learning tasks. Tree learning "come[s] closest to meeting the requirements for serving as an off-the-shelf procedure for data mining", say Hastie et al., "because it is invariant under scaling and various other transformations of feature values, is robust to inclusion of irrelevant features, and produces inspectable models. However, they are seldom accurate".
In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.
See main article: Bootstrap aggregating. The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x_1, ..., x_n with responses Y = y_1, ..., y_n, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples:

For b = 1, ..., B:
1. Sample, with replacement, n training examples from X, Y; call these X_b, Y_b.
2. Train a classification or regression tree f_b on X_b, Y_b.

After training, predictions for an unseen sample x' can be made by averaging the predictions from all the individual regression trees on x':

\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x'),
or by taking the plurality vote in the case of classification trees.
This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
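As an illustration, the following minimal sketch implements this bagging procedure with scikit-learn decision trees as base learners. The synthetic data, the choice of B = 100 and all variable names are arbitrary choices for the example, not part of the original formulation, and no per-split feature subsampling is performed here (that is what distinguishes a full random forest, described below).

```python
# Illustrative bagging of regression trees (a sketch, not a full random forest).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))                       # synthetic training set (illustrative)
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=200)

B = 100                                              # number of bootstrap samples / trees
trees = []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))       # sample n points with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

x_new = rng.uniform(size=(1, 5))                     # an unseen sample
per_tree = np.array([t.predict(x_new)[0] for t in trees])
f_hat = per_tree.mean()                              # bagged prediction: average over the B trees
sigma = per_tree.std(ddof=1)                         # spread across trees (uncertainty estimate below)
```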
Additionally, an estimate of the uncertainty of the prediction can be made as the standard deviation of the predictions from all the individual regression trees on x':

\sigma = \sqrt{\frac{\sum_{b=1}^{B} (f_b(x') - \hat{f})^2}{B - 1}}.
The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. An optimal number of trees B can be found using cross-validation, or by observing the out-of-bag error: the mean prediction error on each training sample x_i, using only the trees that did not have x_i in their bootstrap sample.[12] The training and test error tend to level off after some number of trees have been fit.
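A brief sketch of monitoring the out-of-bag error as a function of the number of trees, using scikit-learn's oob_score option, is given below. The dataset and the grid of tree counts are placeholders; for a regressor, oob_score_ is the out-of-bag R², so 1 − oob_score_ is reported here as an error-like quantity.

```python
# Sketch: out-of-bag score for increasing numbers of trees (illustrative data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=500)

for n_trees in (50, 100, 500, 1000):
    forest = RandomForestRegressor(n_estimators=n_trees, oob_score=True,
                                   random_state=0).fit(X, y)
    print(n_trees, 1.0 - forest.oob_score_)          # tends to level off as trees are added
```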
See main article: Random subspace method. The above procedure describes the original bagging algorithm for trees. Random forests also include another type of bagging scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the trees, causing them to become correlated. An analysis of how bagging and random subspace projection contribute to accuracy gains under different conditions is given by Ho.[13]
Typically, for a classification problem with p features, \sqrt{p} (rounded down) features are used in each split. For regression problems the inventors recommend p/3 (rounded down) with a minimum node size of 5 as the default. In practice, the best values for these parameters should be tuned on a case-to-case basis for every problem.
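In scikit-learn these recommendations correspond roughly to the max_features and min_samples_leaf parameters, as in the sketch below; min_samples_leaf is used here only as an approximate stand-in for the minimum node size, and the estimators are otherwise left at their defaults.

```python
# Sketch of the recommended per-split feature counts (illustrative settings only).
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(max_features="sqrt")    # about sqrt(p) features tried per split
reg = RandomForestRegressor(max_features=1/3,        # about p/3 features tried per split
                            min_samples_leaf=5)      # approximate "minimum node size of 5"
```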
Adding one further step of randomization yields extremely randomized trees, or ExtraTrees. While similar to ordinary random forests in that they are an ensemble of individual trees, there are two main differences: first, each tree is trained using the whole learning sample (rather than a bootstrap sample), and second, the top-down splitting in the tree learner is randomized. Instead of computing the locally optimal cut-point for each feature under consideration (based on, e.g., information gain or the Gini impurity), a random cut-point is selected. This value is selected from a uniform distribution within the feature's empirical range (in the tree's training set). Then, of all the randomly generated splits, the split that yields the highest score is chosen to split the node. Similar to ordinary random forests, the number of randomly selected features to be considered at each node can be specified. Default values for this parameter are \sqrt{p} for classification and p for regression, where p is the number of features in the model.
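Both variants are available in scikit-learn; the following sketch contrasts them on an arbitrary synthetic classification task (dataset, tree counts and cross-validation setup are illustrative only).

```python
# Sketch: ordinary random forest vs. extremely randomized trees (ExtraTrees).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)   # bootstrap samples, optimal cut-points
et = ExtraTreesClassifier(n_estimators=200, random_state=0)     # whole sample, random cut-points
print("random forest:", cross_val_score(rf, X, y, cv=5).mean())
print("extra trees:  ", cross_val_score(et, X, y, cv=5).mean())
```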
The basic random forest procedure may not work well in situations where there are a large number of features but only a small proportion of these features are informative with respect to sample classification. This can be addressed by encouraging the procedure to focus mainly on features and trees that are informative; several methods for accomplishing this have been proposed.
Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique was described in Breiman's original paper[6] and is implemented in the R package randomForest.[21]
The first step in measuring the variable importance in a data set \mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training). To measure the importance of the j-th feature after training, the values of the j-th feature are permuted in the out-of-bag samples and the out-of-bag error is again computed on this perturbed data set. The importance score for the j-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.
Features which produce large values for this score are ranked as more important than features which produce small values. The statistical definition of the variable importance measure was given and analyzed by Zhu et al.[22]
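A rough sketch of this permutation-based importance using scikit-learn's permutation_importance helper is shown below. It permutes features on a held-out set rather than strictly on the out-of-bag samples of each tree, so it approximates rather than reproduces the procedure described above; the data and settings are illustrative.

```python
# Sketch: permutation importance of features for a fitted random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=15, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

ranking = result.importances_mean.argsort()[::-1]    # features whose permutation hurts most come first
print(ranking[:5], result.importances_mean[ranking[:5]])
```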
This method of determining variable importance has some drawbacks.
This feature importance for random forests is the default implementation in scikit-learn and R. It is described in the book "Classification and Regression Trees" by Leo Breiman.[30] Variables which decrease the impurity during splits a lot are considered important:[31]

\text{unnormalized average importance}(x) = \frac{1}{n_T} \sum_{i=1}^{n_T} \sum_{j \in T_i : \text{split variable}(j) = x} p_{T_i}(j) \, \Delta i_{T_i}(j),

where x indicates a feature, n_T is the number of trees in the forest, T_i indicates tree i, p_{T_i}(j) = \frac{n_j}{n} is the fraction of samples reaching node j, and \Delta i_{T_i}(j) is the change in impurity in tree T_i at node j.
The normalized importance is then obtained by normalizing over all features, so that the sum of normalized feature importances is 1.
The scikit-learn default implementation of mean decrease in impurity feature importance is susceptible to misleading feature importances: it tends to favor features with many distinct values (high cardinality), and because it is computed from training-set statistics it does not necessarily reflect a feature's usefulness for making predictions that generalize to unseen data.
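The impurity-based importances themselves are exposed directly on fitted scikit-learn forests, as in the short sketch below (illustrative data and settings); given the caveats above, cross-checking them against permutation importances is often advisable.

```python
# Sketch: mean-decrease-in-impurity importances from a fitted forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=15, n_informative=4, random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

mdi = forest.feature_importances_                    # normalized so the importances sum to 1
print(mdi.argsort()[::-1][:5])                       # indices of the five most important features
```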
A relationship between random forests and the k-nearest neighbor algorithm (k-NN) was pointed out by Lin and Jeon in 2002.[33] It turns out that both can be viewed as so-called weighted neighborhood schemes. These are models built from a training set \{(x_i, y_i)\}_{i=1}^n that make predictions \hat{y} for new points x' by looking at the "neighborhood" of the point, formalized by a weight function W:

\hat{y} = \sum_{i=1}^{n} W(x_i, x') \, y_i.

Here, W(x_i, x') is the non-negative weight of the i-th training point relative to the new point x' in the same tree. For any particular x', the weights for points x_i must sum to one. Weight functions are given as follows:

In k-NN, the weights are W(x_i, x') = \frac{1}{k} if x_i is one of the k points closest to x', and zero otherwise.
In a tree, W(x_i, x') = \frac{1}{k'} if x_i is one of the k' points in the same leaf as x', and zero otherwise.

Since a forest averages the predictions of a set of m trees with individual weight functions W_j, its predictions are

\hat{y} = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} W_j(x_i, x') \, y_i = \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} W_j(x_i, x') \right) y_i.

This shows that the whole forest is again a weighted neighborhood scheme, with weights that average those of the individual trees. The neighbors of x' in this interpretation are the points x_i sharing the same leaf in any tree j. In this way, the neighborhood of x' depends in a complex way on the structure of the trees, and thus on the structure of the training set.
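This weighted-neighborhood view can be checked numerically on a fitted forest by recovering the weights from leaf co-membership, as in the sketch below. Setting bootstrap=False (so every tree sees the full training set) and the other settings are illustrative choices made so that the identity with the forest prediction holds exactly.

```python
# Sketch: recover the neighborhood weights W(x_i, x') of a forest from leaf membership.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=300)

# Train every tree on the full sample so tree predictions are plain leaf averages.
forest = RandomForestRegressor(n_estimators=50, bootstrap=False,
                               max_features=0.5, random_state=0).fit(X, y)

x_new = rng.uniform(size=(1, 4))
train_leaves = forest.apply(X)                       # (n_samples, n_trees) leaf indices
new_leaves = forest.apply(x_new)[0]                  # leaf containing x_new in each tree

W = np.zeros(len(X))
for j in range(forest.n_estimators):
    same_leaf = train_leaves[:, j] == new_leaves[j]
    W += same_leaf / same_leaf.sum()                 # W_j(x_i, x') = 1/k' inside the shared leaf
W /= forest.n_estimators                             # forest weights average the per-tree weights

assert np.isclose(W @ y, forest.predict(x_new)[0])   # weighted neighborhood = forest prediction
```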
As part of their construction, random forest predictors naturally lead to a dissimilarity measure among the observations. One can also define a random forest dissimilarity measure between unlabeled data: the idea is to construct a random forest predictor that distinguishes the "observed" data from suitably generated synthetic data.[6] [34] The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. A random forest dissimilarity can be attractive because it handles mixed variable types very well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The random forest dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" random forest dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. The random forest dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.[35]
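A possible sketch of such an unsupervised dissimilarity follows. It approximates the "observed versus synthetic" construction by permuting each feature independently to generate the synthetic class (in the spirit of "Addcl 1") and derives the dissimilarity from how often two observations share a leaf; all names, sizes and the square-root transformation are illustrative choices rather than the original procedure.

```python
# Sketch: random-forest dissimilarity for unlabeled data via an observed-vs-synthetic forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                                     # unlabeled data (illustrative)

X_synth = np.column_stack([rng.permutation(col) for col in X.T])  # breaks dependencies between features
X_all = np.vstack([X, X_synth])
y_all = np.r_[np.ones(len(X)), np.zeros(len(X_synth))]            # 1 = observed, 0 = synthetic

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_all, y_all)

leaves = forest.apply(X)                                          # leaf of each observed point in each tree
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dissimilarity = np.sqrt(1.0 - proximity)                          # larger = less often in the same leaf
```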
Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers.[36] [37] [38] In cases that the relationship between the predictors and the target variable is linear, the base learners may have an equally high accuracy as the ensemble learner.[39]
In machine learning, kernel random forests (KeRF) establish the connection between random forests and kernel methods. By slightly modifying their definition, random forests can be rewritten as kernel methods, which are more interpretable and easier to analyze.[40]
Leo Breiman[41] was the first person to notice the link between random forests and kernel methods. He pointed out that random forests which are grown using i.i.d. random vectors in the tree construction are equivalent to a kernel acting on the true margin. Lin and Jeon[42] established the connection between random forests and adaptive nearest neighbors, implying that random forests can be seen as adaptive kernel estimates. Davies and Ghahramani[43] proposed the Random Forest Kernel and showed that it can empirically outperform state-of-the-art kernel methods. Scornet[40] first defined KeRF estimates and gave the explicit link between KeRF estimates and random forests. He also gave explicit expressions for kernels based on the centered random forest[44] and the uniform random forest,[45] two simplified models of random forest. He named these two KeRFs Centered KeRF and Uniform KeRF, and proved upper bounds on their rates of consistency.
Centered forest[44] is a simplified model for Breiman's original random forest, which uniformly selects an attribute among all attributes and performs splits at the center of the cell along the pre-chosen attribute. The algorithm stops when a fully binary tree of level k is built, where k \in \mathbb{N} is a parameter of the algorithm.
Uniform forest[45] is another simplified model for Breiman's original random forest, which uniformly selects a feature among all features and performs splits at a point uniformly drawn on the side of the cell, along the preselected feature.
Given a training sample \mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n of [0,1]^p \times \mathbb{R}-valued independent random variables distributed as the independent prototype pair (X, Y), where \operatorname{E}[Y^2] < \infty, we aim at predicting the response Y, associated with the random variable X, by estimating the regression function m(x) = \operatorname{E}[Y \mid X = x]. A random regression forest is an ensemble of M randomized regression trees. Denote by m_n(x, \Theta_j) the predicted value at point x by the j-th tree, where \Theta_1, \ldots, \Theta_M are independent random variables, distributed as a generic random variable \Theta and independent of the sample \mathcal{D}_n. This random variable can be used to describe the randomness induced by node splitting and the sampling procedure for tree construction. The trees are combined to form the finite forest estimate

m_{M,n}(x, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} m_n(x, \Theta_j).

For regression trees, we have

m_n(x, \Theta_j) = \sum_{i=1}^{n} \frac{Y_i \, \mathbf{1}_{X_i \in A_n(x, \Theta_j)}}{N_n(x, \Theta_j)},

where A_n(x, \Theta_j) is the cell containing x, designed with randomness \Theta_j and dataset \mathcal{D}_n, and

N_n(x, \Theta_j) = \sum_{i=1}^{n} \mathbf{1}_{X_i \in A_n(x, \Theta_j)}

is the number of points falling in the cell containing x.
Thus random forest estimates satisfy, for all x \in [0,1]^d,

m_{M,n}(x, \Theta_1, \ldots, \Theta_M) = \frac{1}{M} \sum_{j=1}^{M} \left( \sum_{i=1}^{n} \frac{Y_i \, \mathbf{1}_{X_i \in A_n(x, \Theta_j)}}{N_n(x, \Theta_j)} \right).

Random regression forests have two levels of averaging: first over the samples in the target cell of a tree, then over all trees. Thus the contributions of observations that are in cells with a high density of data points are smaller than those of observations which belong to less populated cells. In order to improve the random forest methods and compensate for this misestimation, Scornet defined KeRF by

\tilde{m}_{M,n}(x, \Theta_1, \ldots, \Theta_M) = \frac{1}{\sum_{j=1}^{M} N_n(x, \Theta_j)} \sum_{j=1}^{M} \sum_{i=1}^{n} Y_i \, \mathbf{1}_{X_i \in A_n(x, \Theta_j)},

which is equal to the mean of the Y_i's falling in the cells containing x in the forest. If we define the connection function of the M finite forest as

K_{M,n}(x, z) = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}_{z \in A_n(x, \Theta_j)},

i.e. the proportion of cells shared between x and z, then almost surely

\tilde{m}_{M,n}(x, \Theta_1, \ldots, \Theta_M) = \frac{\sum_{i=1}^{n} Y_i \, K_{M,n}(x, X_i)}{\sum_{\ell=1}^{n} K_{M,n}(x, X_\ell)},

which defines the KeRF.
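The difference between the two levels of averaging can be made concrete with a small numerical sketch: below, the KeRF estimate at a point x is computed by pooling, across trees, all training responses whose cell contains x, and compared with the ordinary forest estimate. A scikit-learn forest with bootstrap=False stands in for the theoretical model (so the randomness \Theta reduces to split randomness); all data and settings are illustrative.

```python
# Sketch: KeRF estimate (pool all points sharing a cell with x, across trees)
# versus the forest estimate (average of per-tree leaf means).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=100, bootstrap=False,
                               max_features=0.5, random_state=0).fit(X, y)

x = rng.uniform(size=(1, 3))
train_leaves = forest.apply(X)                       # (n, M) cell membership of training points
x_leaves = forest.apply(x)[0]                        # (M,)  cell containing x in each tree

in_cell = train_leaves == x_leaves                   # indicator 1_{X_i in A_n(x, Theta_j)}
kerf = (in_cell * y[:, None]).sum() / in_cell.sum()  # mean of the Y_i over all cells containing x
rf = forest.predict(x)[0]                            # two-level averaging of the same quantities
```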
The construction of Centered KeRF of level k is the same as for centered forest, except that predictions are made by \tilde{m}_{M,n}(x, \Theta_1, \ldots, \Theta_M); the corresponding kernel function, or connection function, has an explicit closed form given by Scornet.[40]
Uniform KeRF is built in the same way as uniform forest, except that predictions are made by \tilde{m}_{M,n}(x, \Theta_1, \ldots, \Theta_M); again the corresponding connection function admits an explicit closed form.[40]
Predictions given by KeRF and random forests are close if the number of points in each cell is controlled:

Assume that there exist sequences (a_n), (b_n) such that, almost surely,

a_n \le N_n(x, \Theta) \le b_n \quad \text{and} \quad a_n \le \frac{1}{M} \sum_{j=1}^{M} N_n(x, \Theta_j) \le b_n.

Then almost surely,

|m_{M,n}(x) - \tilde{m}_{M,n}(x)| \le \frac{b_n - a_n}{a_n} \, \tilde{m}_{M,n}(x).

When the number of trees M goes to infinity, we obtain the infinite random forest and the infinite KeRF. Their estimates are close if the number of observations in each cell is bounded:

Assume that there exist sequences (\varepsilon_n), (a_n), (b_n) such that, almost surely,

\operatorname{E}[N_n(x, \Theta)] \ge 1,

\operatorname{P}[a_n \le N_n(x, \Theta) \le b_n \mid \mathcal{D}_n] \ge 1 - \varepsilon_n/2,

\operatorname{P}[a_n \le \operatorname{E}_\Theta[N_n(x, \Theta)] \le b_n \mid \mathcal{D}_n] \ge 1 - \varepsilon_n/2.

Then, almost surely, the predictions of the infinite random forest and the infinite KeRF are close, with an explicit bound given by Scornet.[40]
For the consistency results below, assume that Y = m(X) + \varepsilon, where \varepsilon is a centered Gaussian noise, independent of X, with finite variance \sigma^2 < \infty. Moreover, X is uniformly distributed on [0,1]^d and m is Lipschitz. Scornet[40] proved upper bounds on the rates of consistency for Centered KeRF and Uniform KeRF.
For Centered KeRF: providing k \rightarrow \infty and n/2^k \rightarrow \infty, there exists a constant C_1 > 0 such that, for all n,

\operatorname{E}[\tilde{m}_n^{cc}(X) - m(X)]^2 \le C_1 \, n^{-1/(3 + d \log 2)} (\log n)^2.
For Uniform KeRF: providing k \rightarrow \infty and n/2^k \rightarrow \infty, there exists a constant C > 0 such that

\operatorname{E}[\tilde{m}_n^{uf}(X) - m(X)]^2 \le C \, n^{-2/(6 + 3d \log 2)} (\log n)^2.
While random forests often achieve higher accuracy than a single decision tree, they sacrifice the intrinsic interpretability of decision trees. Decision trees, along with linear models, rule-based models, and attention-based models, belong to a fairly small family of machine learning models that are easily interpretable. This interpretability is one of the most desirable qualities of decision trees. It allows developers to confirm that the model has learned realistic information from the data and allows end-users to have trust and confidence in the decisions made by the model. For example, following the path that a decision tree takes to make its decision is quite trivial, but following the paths of tens or hundreds of trees is much harder. To achieve both performance and interpretability, some model compression techniques allow transforming a random forest into a minimal "born-again" decision tree that faithfully reproduces the same decision function.[46] [47] If the predictive attributes are linearly correlated with the target variable, using a random forest may not enhance the accuracy of the base learner. Furthermore, in problems with multiple categorical variables, random forests may not be able to increase the accuracy of the base learner.[48]