In statistics, the two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA that examines the influence of two different categorical independent variables on one continuous dependent variable. The two-way ANOVA not only aims to assess the main effect of each independent variable but also whether there is any interaction between them.
In 1925, Ronald Fisher mentioned the two-way ANOVA in his celebrated book, Statistical Methods for Research Workers (chapters 7 and 8). In 1934, Frank Yates published procedures for the unbalanced case.[1] Since then, an extensive literature has been produced. The topic was reviewed in 1993 by Yasunori Fujikoshi.[2] In 2005, Andrew Gelman proposed a different approach to ANOVA, viewed as a multilevel model.[3]
Let us imagine a data set for which a dependent variable may be influenced by two factors which are potential sources of variation. The first factor has $I$ levels and the second has $J$ levels. Each combination $(i,j)$ defines a treatment, for a total of $I \times J$ treatments. We represent the number of replicates for treatment $(i,j)$ by $n_{ij}$, and let $k$ be the index of the replicate in this treatment ($k \in \{1,\dots,n_{ij}\}$).
From these data, we can build a contingency table, where
$$n_{i+} = \sum_{j=1}^{J} n_{ij} \quad\text{and}\quad n_{+j} = \sum_{i=1}^{I} n_{ij},$$
and the total number of replicates is
$$n = \sum_{i,j} n_{ij} = \sum_{i} n_{i+} = \sum_{j} n_{+j}.$$
The experimental design is balanced if each treatment has the same number of replicates, $K$; in such a case, the design is also orthogonal, allowing the effects of the two factors to be fully distinguished. We can then write $\forall i,j,\; n_{ij} = K$, or equivalently
$$\forall i,j,\quad n_{ij} = \frac{n_{i+}\, n_{+j}}{n}.$$
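These conditions on the replicate counts are easy to check numerically. A minimal sketch (NumPy, with hypothetical replicate counts for a 3 × 2 design):

```python
import numpy as np

# Hypothetical replicate counts n_ij for a 3 x 2 design
# (rows: levels of the first factor, columns: levels of the second).
n = np.array([[2, 2],
              [2, 2],
              [2, 2]])

n_row = n.sum(axis=1)   # n_{i+}: row margins
n_col = n.sum(axis=0)   # n_{+j}: column margins
n_tot = n.sum()         # n: total number of replicates

# Balanced: every cell has the same number of replicates K.
balanced = bool(np.all(n == n.flat[0]))

# Orthogonal (proportional): n_ij = n_{i+} * n_{+j} / n for all cells.
proportional = bool(np.allclose(n, np.outer(n_row, n_col) / n_tot))

print(balanced, proportional)
```

A balanced design always satisfies the proportionality condition, but the converse does not hold: unequal yet proportional cell counts also give an orthogonal design.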
Upon observing variation among all $n$ data points, for instance via a histogram, probability may be used to describe such variation. Let us hence denote by $Y_{ijk}$ the random variable whose observed value $y_{ijk}$ is the $k$-th measure for treatment $(i,j)$. The two-way ANOVA models all these variables as varying independently and normally around a mean, $\mu_{ij}$, with a constant variance, $\sigma^2$ (homoscedasticity):

$$Y_{ijk} \mid \mu_{ij}, \sigma^2 \;\overset{\text{i.i.d.}}{\sim}\; \mathcal{N}(\mu_{ij}, \sigma^2).$$
Specifically, the mean of the response variable is modeled as a linear combination of the explanatory variables:
$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij},$$

where $\mu$ is the grand mean, $\alpha_i$ is the additive main effect of level $i$ from the first factor ($i$-th row in the contingency table), $\beta_j$ is the additive main effect of level $j$ from the second factor ($j$-th column), and $\gamma_{ij}$ is the non-additive interaction effect of treatment $(i,j)$, for all replicates $k = 1, \dots, n_{ij}$.
Another equivalent way of describing the two-way ANOVA is by mentioning that, besides the variation explained by the factors, there remains some statistical noise. This amount of unexplained variation is handled via the introduction of one random variable per data point, $\epsilon_{ijk}$, called error. These $n$ random variables are seen as deviations from the means, and are assumed to be independent and normally distributed:

$$Y_{ijk} = \mu_{ij} + \epsilon_{ijk} \quad\text{with}\quad \epsilon_{ijk} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2).$$
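Data from this model can be simulated directly. A minimal NumPy sketch, with illustrative (made-up) parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

I, J, K = 3, 2, 5            # levels of each factor, replicates per cell (balanced)
mu = 10.0                    # grand mean
alpha = np.array([-1.0, 0.0, 1.0])   # main effects of the first factor
beta = np.array([-0.5, 0.5])         # main effects of the second factor
gamma = np.array([[0.2, -0.2],
                  [-0.4, 0.4],
                  [0.2, -0.2]])      # interaction effects
sigma = 1.0                          # common error standard deviation

# Cell means mu_ij = mu + alpha_i + beta_j + gamma_ij via broadcasting.
mu_ij = mu + alpha[:, None] + beta[None, :] + gamma

# Y_ijk = mu_ij + eps_ijk, with i.i.d. normal errors.
y = mu_ij[:, :, None] + rng.normal(0.0, sigma, size=(I, J, K))

print(y.shape)   # one value per replicate of each of the I x J treatments
```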
Following Gelman and Hill, the assumptions of the ANOVA, and more generally the general linear model, are, in decreasing order of importance:[5]

1. validity of the data with respect to the scientific question;
2. additivity and linearity of the effects;
3. independence of the errors;
4. equal variance of the errors;
5. normality of the errors.
To ensure identifiability of parameters, we can add the following "sum-to-zero" constraints:
$$\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0.$$
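Given any table of cell means, parameters satisfying these constraints are obtained by centering: the grand mean, the centered row and column means, and what remains as interaction. A sketch with hypothetical cell means:

```python
import numpy as np

# Hypothetical table of cell means mu_ij for a 3 x 2 design.
mu_ij = np.array([[8.0, 11.0],
                  [9.0, 14.0],
                  [7.0, 12.0]])

mu = mu_ij.mean()                     # grand mean
alpha = mu_ij.mean(axis=1) - mu       # row (first-factor) main effects
beta = mu_ij.mean(axis=0) - mu        # column (second-factor) main effects
gamma = mu_ij - mu - alpha[:, None] - beta[None, :]   # interaction effects

# The sum-to-zero constraints hold by construction:
assert np.isclose(alpha.sum(), 0)
assert np.isclose(beta.sum(), 0)
assert np.allclose(gamma.sum(axis=0), 0)
assert np.allclose(gamma.sum(axis=1), 0)
```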
In the classical approach, testing the null hypotheses (that the factors have no effect) relies on significance tests, which require calculating sums of squares.
Testing if the interaction term is significant can be difficult because of the potentially-large number of degrees of freedom.[6]
The following hypothetical example gives the yields of 15 plants subject to two different environmental variations and three different fertilisers.
|               | Extra CO2 | Extra humidity |
|---------------|-----------|----------------|
| No fertiliser | 7, 2, 1   | 7, 6           |
| Nitrate       | 11, 6     | 10, 7, 3       |
| Phosphate     | 5, 3, 4   | 11, 4          |
Five sums of squares are calculated:
| Factor | Calculation | Sum | Terms |
|---|---|---|---|
| Individual | 7² + 2² + 1² + 7² + 6² + 11² + 6² + 10² + 7² + 3² + 5² + 3² + 4² + 11² + 4² | 641 | 15 |
| Fertiliser × Environment | (7+2+1)²/3 + (7+6)²/2 + (11+6)²/2 + (10+7+3)²/3 + (5+3+4)²/3 + (11+4)²/2 | 556.1667 | 6 |
| Fertiliser | (7+2+1+7+6)²/5 + (11+6+10+7+3)²/5 + (5+3+4+11+4)²/5 | 525.4 | 3 |
| Environment | (7+2+1+11+6+5+3+4)²/8 + (7+6+10+7+3+11+4)²/7 | 519.2679 | 2 |
| Composite | (7+2+1+7+6+11+6+10+7+3+5+3+4+11+4)²/15 | 504.6 | 1 |
Finally, the sums of squared deviations required for the analysis of variance can be calculated.
| Factor | Sum | Terms | Total | Environment | Fertiliser | Fertiliser × Environment | Residual |
|---|---|---|---|---|---|---|---|
| Individual | 641 | 15 | 1 | | | | 1 |
| Fertiliser × Environment | 556.1667 | 6 | | | | 1 | -1 |
| Fertiliser | 525.4 | 3 | | | 1 | -1 | |
| Environment | 519.2679 | 2 | | 1 | | -1 | |
| Composite | 504.6 | 1 | -1 | -1 | -1 | 1 | |
| Squared deviations | | | 136.4 | 14.668 | 20.8 | 16.099 | 84.833 |
| Degrees of freedom | | | 14 | 1 | 2 | 2 | 9 |
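The arithmetic in the two tables above can be reproduced in plain Python. A sketch (the treatment labels are mine, chosen for readability):

```python
# Example data, keyed by (fertiliser, environment) treatment.
cells = {
    ("none",      "co2"):      [7, 2, 1],
    ("none",      "humidity"): [7, 6],
    ("nitrate",   "co2"):      [11, 6],
    ("nitrate",   "humidity"): [10, 7, 3],
    ("phosphate", "co2"):      [5, 3, 4],
    ("phosphate", "humidity"): [11, 4],
}

all_vals = [y for ys in cells.values() for y in ys]

def ss(groups):
    # Sum over groups of (group total)^2 / (group size).
    return sum(sum(g) ** 2 / len(g) for g in groups)

# The five sums of squares.
individual = sum(y ** 2 for y in all_vals)                       # 641
interaction = ss(cells.values())                                 # 556.1667
fert = ss([[y for (f, e), ys in cells.items() if f == fe for y in ys]
           for fe in ("none", "nitrate", "phosphate")])          # 525.4
env = ss([[y for (f, e), ys in cells.items() if e == ev for y in ys]
          for ev in ("co2", "humidity")])                        # 519.2679
composite = ss([all_vals])                                       # 504.6

# Squared deviations, combined with the coefficients of the second table.
total_dev = individual - composite                               # 136.4
env_dev = env - composite                                        # 14.668
fert_dev = fert - composite                                      # 20.8
inter_dev = interaction - fert - env + composite                 # 16.099
residual = individual - interaction                              # 84.833
```

The degrees of freedom follow the same linear combinations applied to the "Terms" column, e.g. 15 − 6 = 9 for the residual.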