In statistics, jackknife variance estimates for random forests are a way to estimate the variance of random forest model predictions while eliminating the bootstrap (Monte Carlo) effects.
The sampling variance of bagged learners is:
V(x) = \mathrm{Var}[\hat{\theta}^{\infty}(x)]
The jackknife variance estimate is:

\hat{V}_J = \frac{n-1}{n} \sum_{i=1}^{n} \left( \hat{\theta}_{(-i)} - \bar{\theta} \right)^2
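As a minimal sketch of the delete-one formula above (using only NumPy; the `jackknife_variance` helper is hypothetical, introduced here for illustration):

```python
import numpy as np

def jackknife_variance(sample, estimator):
    """Delete-one jackknife variance of an estimator:
    V_J = (n - 1) / n * sum_i (theta_(-i) - theta_bar)^2,
    where theta_(-i) is the estimate with observation i left out
    and theta_bar is the mean of the leave-one-out estimates."""
    sample = np.asarray(sample)
    n = len(sample)
    # Leave-one-out estimates theta_(-i)
    theta_loo = np.array([estimator(np.delete(sample, i)) for i in range(n)])
    theta_bar = theta_loo.mean()
    return (n - 1) / n * np.sum((theta_loo - theta_bar) ** 2)

# Sanity check: for the sample mean, the jackknife variance
# coincides with the classical estimate var(sample) / n.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
v_jack = jackknife_variance(x, np.mean)
v_classic = x.var(ddof=1) / len(x)
```

For the sample mean the two estimates agree exactly, which makes this a convenient check of the implementation.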
Applied to a random forest, the estimate becomes:

\hat{V}_J = \frac{n-1}{n} \sum_{i=1}^{n} \left( \bar{t}^{\star}_{(-i)}(x) - \bar{t}^{\star}(x) \right)^2

where \bar{t}^{\star}(x) is the average of the tree predictions t^{\star}(x) over the whole forest, and \bar{t}^{\star}_{(-i)}(x) is the average over only those trees whose bootstrap sample does not contain the ith observation.
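A toy sketch of this jackknife-after-bootstrap computation, under the simplifying assumption that each "tree" just predicts the mean of its bootstrap sample (a stand-in for a real tree's prediction t_b*(x) at a fixed point x):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy setup: n observations, B bootstrap "trees".
n, B = 30, 2000
y = rng.normal(size=n)

boot_idx = rng.integers(0, n, size=(B, n))   # bootstrap sample for each tree
t_star = y[boot_idx].mean(axis=1)            # per-tree predictions t_b*(x)

# in_bag[b, i] is True if observation i appears in tree b's bootstrap sample.
in_bag = np.zeros((B, n), dtype=bool)
for b in range(B):
    in_bag[b, boot_idx[b]] = True

t_bar = t_star.mean()

# Average the predictions of trees whose bootstrap sample
# does NOT contain observation i.
t_bar_minus_i = np.array([t_star[~in_bag[:, i]].mean() for i in range(n)])

V_J = (n - 1) / n * np.sum((t_bar_minus_i - t_bar) ** 2)
```

With B large, roughly a fraction e^{-1} of the trees exclude any given observation, so each leave-one-out average is taken over a few hundred trees here.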
E-mail spam classification is a common benchmark problem in which 57 features are used to classify e-mails as spam or non-spam. Applying the IJ-U variance formula to evaluate the accuracy of random forests with m = 5, 19, and 57 features considered per split, the paper (Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife) shows that the m = 57 random forest appears to be quite unstable, while the predictions made by the m = 5 random forest appear to be quite stable. This corresponds to the evaluation made by error rate, in which the accuracy of the m = 5 model is high and that of the m = 57 model is low.
Here, accuracy is measured by error rate, which is defined as:

\mathrm{ErrorRate} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij},

where y_{ij} is an indicator that equals 1 if the ith observation is misclassified into the jth class and 0 otherwise.
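For illustration, the error rate can be computed from misclassification indicators as follows (toy labels and predictions, not data from the paper):

```python
import numpy as np

# Hypothetical true and predicted class labels for N = 5 observations,
# M = 3 classes.
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 2, 2, 1, 1])
N, M = len(y_true), 3

# y[i, j] = 1 if observation i is misclassified into class j, else 0.
y = np.zeros((N, M))
for i in range(N):
    if y_pred[i] != y_true[i]:
        y[i, y_pred[i]] = 1

# ErrorRate = (1/N) * sum_i sum_j y_ij
error_rate = y.sum() / N
```

Two of the five observations are misclassified here, so the error rate is 0.4.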
Accuracy can also be measured by log loss:

\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}),

where y_{ij} equals 1 if the ith observation belongs to the jth class and 0 otherwise, and p_{ij} is the predicted probability that the ith observation belongs to the jth class.
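A minimal sketch of the log-loss computation on hypothetical one-hot labels and predicted probabilities (toy values, not data from the paper):

```python
import numpy as np

# One-hot labels y[i, j] and predicted probabilities p[i, j];
# each row of p sums to 1.
y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
N = y.shape[0]

# logloss = -(1/N) * sum_i sum_j y_ij * log(p_ij)
logloss = -np.sum(y * np.log(p)) / N
```

Only the probability assigned to the true class of each observation contributes, since y_{ij} is zero elsewhere.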
When using Monte Carlo MSEs for estimating \hat{V}_{IJ}^{\infty} and \hat{V}_{J}^{\infty}, the Monte Carlo bias introduced by using a finite number of trees B must be taken into account:

E[\hat{V}_{IJ}^{B}] \approx \hat{V}_{IJ}^{\infty} + \frac{n \sum_{b=1}^{B} \left( t_b^{\star}(x) - \bar{t}^{\star}(x) \right)^2}{B^2}
To eliminate this bias, the bias-corrected estimate is:

\hat{V}_{IJ\text{-}U}^{B} = \hat{V}_{IJ}^{B} - \frac{n \sum_{b=1}^{B} \left( t_b^{\star}(x) - \bar{t}^{\star}(x) \right)^2}{B^2}
Similarly, the bias-corrected jackknife estimate is:

\hat{V}_{J\text{-}U}^{B} = \hat{V}_{J}^{B} - (e - 1) \frac{n \sum_{b=1}^{B} \left( t_b^{\star}(x) - \bar{t}^{\star}(x) \right)^2}{B^2}
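The two corrections share the common factor n Σ_b (t_b*(x) − t̄*(x))² / B², so they can be sketched together. The `bias_corrected` helper below is hypothetical, introduced here for illustration:

```python
import numpy as np

def bias_corrected(V_IJ, V_J, t_star, n):
    """Apply the Monte Carlo bias corrections to the infinitesimal
    jackknife (IJ) and jackknife (J) variance estimates, given the
    per-tree predictions t_star = [t_1*(x), ..., t_B*(x)] and the
    training-set size n."""
    t_star = np.asarray(t_star)
    B = len(t_star)
    # Shared correction term: n * sum_b (t_b* - t_bar*)^2 / B^2
    correction = n * np.sum((t_star - t_star.mean()) ** 2) / B**2
    return V_IJ - correction, V_J - (np.e - 1.0) * correction
```

When all trees predict the same value the correction vanishes; otherwise both estimates shrink, the jackknife one by the larger factor (e − 1).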