See also: Conditional dependence.
In probability theory, conditional independence describes situations wherein an observation is irrelevant or redundant when evaluating the certainty of a hypothesis. Conditional independence is usually formulated in terms of conditional probability, as a special case where the probability of the hypothesis given the uninformative observation is equal to the probability without. If
A is the hypothesis, and B and C are observations, conditional independence can be stated as an equality:

P(A\mid B,C)=P(A\mid C)

where P(A\mid B,C) is the probability of A given both B and C. Since the probability of A given C is the same as the probability of A given both B and C, this equality expresses that B contributes nothing to the certainty of A. In this case, A and B are said to be conditionally independent given C, written symbolically as:

(A\perp\perp B\mid C)
In the language of causal equality notation, two functions f(y) and g(y) which both depend on a common variable y are described as conditionally independent using the notation

f\left(y\right)~\overset{\curvearrowleft\curvearrowright}{=}~g\left(y\right)

which is equivalent to the statement

P(f\mid g,y)=P(f\mid y)
The concept of conditional independence is essential to graph-based theories of statistical inference, as it establishes a mathematical relation between a collection of conditional statements and a graphoid.
Let A, B, and C be events. A and B are said to be conditionally independent given C if and only if P(C)>0 and

P(A\mid B,C)=P(A\mid C)

This property is often written (A\perp\perp B\mid C), which should be read ((A\perp\perp B)\vert C).
Equivalently, conditional independence may be stated as:
P(A,B|C)=P(A|C)P(B|C)
where P(A,B\mid C) is the joint probability of A and B given C. This alternate formulation states that A and B are independent events, given C.

It demonstrates that (A\perp\perp B\mid C) is equivalent to (B\perp\perp A\mid C).
Proof of the equivalent definition:

\begin{align} &P(A,B\mid C)=P(A\mid C)P(B\mid C)\\[6pt] \text{iff}\quad&\frac{P(A,B,C)}{P(C)}=\left(\frac{P(A,C)}{P(C)}\right)\left(\frac{P(B,C)}{P(C)}\right)\\[6pt] \text{iff}\quad&P(A,B,C)=\frac{P(A,C)P(B,C)}{P(C)}\\[6pt] \text{iff}\quad&\frac{P(A,B,C)}{P(B,C)}=\frac{P(A,C)}{P(C)}\qquad\text{(divide both sides by }P(B,C)\text{)}\\[6pt] \text{iff}\quad&P(A\mid B,C)=P(A\mid C)\qquad\therefore \end{align}
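For a concrete check of this equivalence, the two formulations can be compared numerically on a small, made-up joint distribution over three binary events. The following Python sketch (all numbers and names are illustrative, not from the text) builds a distribution in which A and B are conditionally independent given C and verifies that both equalities hold:

```python
from itertools import product

# Hypothetical joint distribution P(A, B, C) over three binary events,
# constructed so that A and B are conditionally independent given C:
# P(a, b, c) = P(c) * P(a | c) * P(b | c)
p_c = {0: 0.4, 1: 0.6}
p_a_given_c = {0: 0.1, 1: 0.7}   # P(A=1 | C=c)
p_b_given_c = {0: 0.5, 1: 0.2}   # P(B=1 | C=c)

def bern(p, value):
    return p if value == 1 else 1 - p

P = {(a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
     for a, b, c in product((0, 1), repeat=3)}

def prob(pred):
    return sum(v for k, v in P.items() if pred(*k))

for c in (0, 1):
    pc   = prob(lambda a, b, cc: cc == c)
    pabc = prob(lambda a, b, cc: a == 1 and b == 1 and cc == c)
    pac  = prob(lambda a, b, cc: a == 1 and cc == c)
    pbc  = prob(lambda a, b, cc: b == 1 and cc == c)
    # First formulation: P(A | B, C) == P(A | C)
    lhs1, rhs1 = pabc / pbc, pac / pc
    # Second formulation: P(A, B | C) == P(A | C) P(B | C)
    lhs2, rhs2 = pabc / pc, (pac / pc) * (pbc / pc)
    print(c, abs(lhs1 - rhs1) < 1e-12, abs(lhs2 - rhs2) < 1e-12)
```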
Each cell represents a possible outcome. The events {\color{red}R}, {\color{blue}B} and {\color{gold}Y} are represented by the areas shaded red, blue and yellow respectively, and the overlap between the events {\color{red}R} and {\color{blue}B} is shaded purple. The probabilities of these events are shaded areas with respect to the total area. In both examples {\color{red}R} and {\color{blue}B} are conditionally independent given {\color{gold}Y} because:

\Pr({\color{red}R},{\color{blue}B}\mid{\color{gold}Y})=\Pr({\color{red}R}\mid{\color{gold}Y})\Pr({\color{blue}B}\mid{\color{gold}Y})

but not conditionally independent given \left[\text{not }{\color{gold}Y}\right] because:

\Pr({\color{red}R},{\color{blue}B}\mid\text{not }{\color{gold}Y})\neq\Pr({\color{red}R}\mid\text{not }{\color{gold}Y})\Pr({\color{blue}B}\mid\text{not }{\color{gold}Y})
Let A and B be the events that person A and person B, respectively, will be home in time for dinner, where both people are randomly sampled from the entire world. Events A and B can be assumed to be independent, i.e. knowledge that A is late changes the probability that B will be late minimally, if at all. However, if a third event is introduced, namely that person A and person B live in the same neighborhood, the two events are no longer conditionally independent: traffic conditions and weather-related events that might delay person A might delay person B as well. Given the third event and knowledge that person A was late, the probability that person B will be late does meaningfully change.[2]
Conditional independence depends on the nature of the third event. If you roll two dice, one may assume that the two dice behave independently of each other: looking at the result of one die will not tell you about the result of the other. (That is, the two dice are independent.) If, however, the first die's result is a 3, and someone tells you about a third event, that the sum of the two results is even, then this extra unit of information restricts the options for the second result to an odd number. In other words, two events can be independent, but not conditionally independent.
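This dice example is easy to verify by enumerating the 36 equally likely outcomes. The sketch below (the event and helper names are ours, chosen for illustration) confirms that "first die shows 3" and "second die shows 5" are independent, but not independent given "the sum is even":

```python
from itertools import product
from fractions import Fraction

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event, given=None):
    """Probability of `event`, optionally conditioned on `given`, under the uniform model."""
    space = [o for o in outcomes if given(o)] if given else outcomes
    return Fraction(sum(1 for o in space if event(o)), len(space))

first_is_3 = lambda o: o[0] == 3
second_is_5 = lambda o: o[1] == 5
sum_is_even = lambda o: (o[0] + o[1]) % 2 == 0

# Unconditionally the dice are independent:
assert prob(lambda o: first_is_3(o) and second_is_5(o)) == prob(first_is_3) * prob(second_is_5)

# But conditioned on "the sum is even" they are not:
lhs = prob(lambda o: first_is_3(o) and second_is_5(o), given=sum_is_even)
rhs = prob(first_is_3, given=sum_is_even) * prob(second_is_5, given=sum_is_even)
print(lhs, rhs)  # 1/18 vs 1/36 -> not conditionally independent
```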
Height and vocabulary are dependent since very small people tend to be children, known for their more basic vocabularies. But knowing that two people are 19 years old (i.e., conditional on age), there is no reason to think that one person's vocabulary is larger if we are told that they are taller.
Two discrete random variables X and Y are conditionally independent given a third discrete random variable Z if and only if they are independent in their conditional probability distribution given Z. That is, X and Y are conditionally independent given Z if and only if, given any value of Z, the probability distribution of X is the same for all values of Y and the probability distribution of Y is the same for all values of X. Formally:

X\perp\perp Y\mid Z\quad\iff\quad F_{X,Y\mid Z=z}(x,y)=F_{X\mid Z=z}(x)\cdot F_{Y\mid Z=z}(y)\quad\text{for all }x,y\text{ and }z

where F_{X,Y\mid Z=z}(x,y)=\Pr(X\leq x,Y\leq y\mid Z=z) is the conditional cumulative distribution function of X and Y given Z=z.
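For variables with finitely many values, this factorization can be checked directly on a joint probability table. The following sketch assumes a hypothetical table p[x, y, z] built so that X and Y are independent within every slice Z = z; the names and numbers are illustrative only:

```python
import numpy as np

# Hypothetical joint pmf p[x, y, z] for discrete X, Y, Z (values are indices),
# built so that X and Y are independent inside each slice Z = z.
px_given_z = np.array([[0.2, 0.8],   # P(X | Z=0)
                       [0.6, 0.4]])  # P(X | Z=1)
py_given_z = np.array([[0.5, 0.5],   # P(Y | Z=0)
                       [0.1, 0.9]])  # P(Y | Z=1)
pz = np.array([0.3, 0.7])
p = np.einsum('zx,zy,z->xyz', px_given_z, py_given_z, pz)  # shape (2, 2, 2)

def conditionally_independent(p, tol=1e-12):
    """True if p(x, y | z) == p(x | z) * p(y | z) for every z with p(z) > 0."""
    for z in range(p.shape[2]):
        slice_z = p[:, :, z]
        mass = slice_z.sum()
        if mass == 0:
            continue
        joint = slice_z / mass                    # p(x, y | z)
        outer = np.outer(joint.sum(axis=1),       # p(x | z)
                         joint.sum(axis=0))       # p(y | z)
        if not np.allclose(joint, outer, atol=tol):
            return False
    return True

print(conditionally_independent(p))  # True by construction
```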
Two events R and B are conditionally independent given a σ-algebra \Sigma if

\Pr(R,B\mid\Sigma)=\Pr(R\mid\Sigma)\Pr(B\mid\Sigma)\quad\text{a.s.}

where \Pr(A\mid\Sigma) denotes the conditional expectation of the indicator function of the event A, \chi_A, given the σ-algebra \Sigma. That is,

\Pr(A\mid\Sigma):=\operatorname{E}[\chi_A\mid\Sigma].

Two random variables X and Y are conditionally independent given a σ-algebra \Sigma if the above equation holds for all R in \sigma(X) and B in \sigma(Y).
Two random variables X and Y are conditionally independent given a random variable W if they are independent given \sigma(W), the σ-algebra generated by W. This is commonly written:

X\perp\perp Y\mid W or
X\perp Y\mid W

This is read "X is independent of Y, given W"; the conditioning applies to the whole statement: "(X is independent of Y) given W":

(X\perp\perp Y)\mid W

This notation extends X\perp\perp Y for "(unconditional) independence" of X and Y.

If W assumes a countable set of values, this is equivalent to the conditional independence of X and Y for the events of the form [W=w].
The following two examples show that X\perp\perp Y neither implies nor is implied by (X\perp\perp Y)\mid W.

First, suppose W is 0 with probability 0.5 and 1 otherwise. When W=0 take X and Y to be independent, each having the value 0 with probability 0.99 and the value 1 otherwise. When W=1, X and Y are again independent, but this time they take the value 1 with probability 0.99. Then (X\perp\perp Y)\mid W holds, but X and Y are not (unconditionally) independent: observing, say, Y=0 makes W=0 far more likely and hence X=0 more likely as well.

For the second example, suppose X\perp\perp Y, each taking the values 0 and 1 with probability 0.5. Let W be the product X\cdot Y. Then when W=0, \Pr(X=0)=2/3, but \Pr(X=0\mid Y=0)=1/2, so (X\perp\perp Y)\mid W fails even though X and Y are independent.
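The second example is small enough to verify by enumeration. In the sketch below (helper names are ours) the four outcomes of (X, Y) are equally likely, and conditioning on W = X·Y = 0 breaks the independence:

```python
from fractions import Fraction
from itertools import product

# X and Y are independent fair bits; W is their product.
outcomes = [(x, y) for x, y in product((0, 1), repeat=2)]   # each with probability 1/4

def prob(event, given=lambda x, y: True):
    space = [o for o in outcomes if given(*o)]
    return Fraction(sum(1 for o in space if event(*o)), len(space))

# Unconditionally, X and Y are independent:
assert prob(lambda x, y: x == 0 and y == 0) == prob(lambda x, y: x == 0) * prob(lambda x, y: y == 0)

# Conditioning on W = X*Y = 0 breaks the independence:
w0 = lambda x, y: x * y == 0
print(prob(lambda x, y: x == 0, given=w0))                                # 2/3
print(prob(lambda x, y: x == 0, given=lambda x, y: w0(x, y) and y == 0))  # 1/2
```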
Two random vectors X=(X_1,\ldots,X_l)^{T} and Y=(Y_1,\ldots,Y_m)^{T} are conditionally independent given a third random vector Z=(Z_1,\ldots,Z_n)^{T} if and only if they are independent in their conditional cumulative distribution given Z, that is, if and only if

F_{X,Y\mid Z=z}(x,y)=F_{X\mid Z=z}(x)\cdot F_{Y\mid Z=z}(y)\quad\text{for all }x,y\text{ and }z

where x=(x_1,\ldots,x_l)^{T}, y=(y_1,\ldots,y_m)^{T} and z=(z_1,\ldots,z_n)^{T}, and the conditional cumulative distributions are defined by

\begin{align} F_{X,Y\mid Z=z}(x,y)&=\Pr(X_1\leq x_1,\ldots,X_l\leq x_l,Y_1\leq y_1,\ldots,Y_m\leq y_m\mid Z_1=z_1,\ldots,Z_n=z_n)\\[6pt] F_{X\mid Z=z}(x)&=\Pr(X_1\leq x_1,\ldots,X_l\leq x_l\mid Z_1=z_1,\ldots,Z_n=z_n)\\[6pt] F_{Y\mid Z=z}(y)&=\Pr(Y_1\leq y_1,\ldots,Y_m\leq y_m\mid Z_1=z_1,\ldots,Z_n=z_n) \end{align}
Let p be the proportion of voters who will vote "yes" in an upcoming referendum. In taking an opinion poll, one chooses n voters randomly from the population. For i = 1, ..., n, let Xi = 1 or 0 according to whether or not the ith chosen voter will vote "yes".
In a frequentist approach to statistical inference one would not attribute any probability distribution to p (unless the probabilities could be somehow interpreted as relative frequencies of occurrence of some event or as proportions of some population) and one would say that X1, ..., Xn are independent random variables.
By contrast, in a Bayesian approach to statistical inference, one would assign a probability distribution to p regardless of the non-existence of any such "frequency" interpretation, and one would construe the probabilities as degrees of belief that p is in any interval to which a probability is assigned. In that model, the random variables X1, ..., Xn are not independent, but they are conditionally independent given the value of p. In particular, if a large number of the Xs are observed to be equal to 1, that would imply a high conditional probability, given that observation, that p is near 1, and thus a high conditional probability, given that observation, that the next X to be observed will be equal to 1.
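A simulation makes the contrast concrete. The sketch below assumes a Beta(1, 1) (uniform) prior for p purely for illustration; marginally the X's are dependent, while restricting attention to simulated worlds with p near a fixed value restores (approximate) independence:

```python
import numpy as np

rng = np.random.default_rng(0)

# A minimal sketch of the Bayesian model: p ~ Beta(1, 1) prior (an assumption
# made for illustration), and X_1, ..., X_n are i.i.d. Bernoulli(p) *given* p.
n_sim, n_voters = 200_000, 5
p = rng.beta(1.0, 1.0, size=n_sim)                # one p per simulated world
X = rng.random((n_sim, n_voters)) < p[:, None]    # X_i | p ~ Bernoulli(p)

# Marginally (integrating over p) the X_i are dependent:
print("P(X2=1)          ", X[:, 1].mean())
print("P(X2=1 | X1=1)   ", X[X[:, 0], 1].mean())  # noticeably larger (about 2/3)

# Conditionally on p (here: restricting to worlds with p near 0.3) they are
# approximately independent again:
near = np.abs(p - 0.3) < 0.01
print("P(X2=1 | p~0.3)        ", X[near, 1].mean())
print("P(X2=1 | p~0.3, X1=1)  ", X[near & X[:, 0], 1].mean())
```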
A set of rules governing statements of conditional independence has been derived from the basic definition.[4][5]

These rules were termed "Graphoid Axioms" by Pearl and Paz,[6] because they hold in graphs, where X\perp\perp A\mid B is interpreted to mean: "All paths from X to A are intercepted by the set B".
Symmetry:

X\perp\perp Y\ ⇒ \ Y\perp\perp X

Proof

Note that we are required to prove that if P(X|Y)=P(X) then P(Y|X)=P(Y). Since P(X|Y)=P(X) is equivalent to P(X,Y)=P(X)P(Y), it follows that

P(Y|X)=P(X,Y)/P(X)=P(X)P(Y)/P(X)=P(Y)
Decomposition:

X\perp\perp A,B\ ⇒ \ \begin{cases} X\perp\perp A\\ X\perp\perp B \end{cases}
Proof
p_{X,A,B}(x,a,b)=p_X(x)p_{A,B}(a,b)\qquad\text{(meaning of }X\perp\perp A,B\text{)}

\int_B p_{X,A,B}(x,a,b)\,db=\int_B p_X(x)p_{A,B}(a,b)\,db\qquad\text{(integrate out }b\text{)}

p_{X,A}(x,a)=p_X(x)p_A(a)
A similar proof shows the independence of X and B.
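The discrete analogue of this argument simply sums B out of the joint table. A minimal sketch, using a made-up joint distribution in which X is independent of the pair (A, B):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint pmf p[x, a, b] with X independent of the pair (A, B):
# p(x, a, b) = p(x) * p(a, b)
p_x = rng.dirichlet(np.ones(2))                   # marginal of X
p_ab = rng.dirichlet(np.ones(6)).reshape(2, 3)    # joint of (A, B)
p = np.einsum('x,ab->xab', p_x, p_ab)

# Decomposition: summing out B gives p(x, a) = p(x) * p(a)
p_xa = p.sum(axis=2)
assert np.allclose(p_xa, np.outer(p_x, p_ab.sum(axis=1)))

# Likewise summing out A gives p(x, b) = p(x) * p(b)
p_xb = p.sum(axis=1)
assert np.allclose(p_xb, np.outer(p_x, p_ab.sum(axis=0)))
print("decomposition verified")
```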
Weak union:

X\perp\perp A,B\ ⇒ \ \begin{cases} X\perp\perp A\mid B\\ X\perp\perp B\mid A \end{cases}
Proof
By assumption, \Pr(X)=\Pr(X\mid A,B). Combined with X\perp\perp B (which follows from decomposition), this gives \Pr(X)=\Pr(X\mid B). Hence \Pr(X\mid B)=\Pr(X\mid A,B), which means X\perp\perp A\mid B. The second statement is proved similarly.
Contraction:

\left.\begin{align} X\perp\perp A\mid B\\ X\perp\perp B \end{align}\right\}\ ⇒ \ X\perp\perp A,B
Proof
This property can be proved by noticing \Pr(X\mid A,B)=\Pr(X\mid B)=\Pr(X), where each equality is asserted by X\perp\perp A\mid B and X\perp\perp B, respectively.
For strictly positive probability distributions, the following (intersection) also holds:
\left.\begin{align} X\perp\perp Y\mid Z,W\\ X\perp\perp W\mid Z,Y \end{align}\right\}\ ⇒ \ X\perp\perp W,Y\mid Z
Proof
By assumption:
P(X|Z,W,Y)=P(X|Z,W)\ \land\ P(X|Z,W,Y)=P(X|Z,Y)\ \implies\ P(X|Z,Y)=P(X|Z,W)
Using this equality, together with the law of total probability applied to P(X|Z):

\begin{align} P(X|Z)&=\sum_w P(X|Z,W=w)P(W=w|Z)\\[4pt] &=\sum_w P(X|Z,Y)P(W=w|Z)\\[4pt] &=P(X|Z,Y)\sum_w P(W=w|Z)\\[4pt] &=P(X|Z,Y) \end{align}
Since P(X|Z,W,Y)=P(X|Z,Y) and P(X|Z,Y)=P(X|Z), it follows that P(X|Z,W,Y)=P(X|Z), which is equivalent to X\perp\perp Y,W\mid Z.
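The intersection property can also be spot-checked numerically. The sketch below builds a strictly positive joint table in which X depends on Z only, so both premises hold by construction, and confirms the conclusion as well (all names and numbers are illustrative):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

# Strictly positive joint pmf p[x, y, w, z], built as p(z) p(x|z) p(y, w|z), so both
# premises X ⫫ Y | (Z,W) and X ⫫ W | (Z,Y) hold by construction (X depends on Z only).
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.3, 0.7],    # P(X | Z=0)
                        [0.8, 0.2]])   # P(X | Z=1)
p_yw_given_z = rng.dirichlet(np.ones(4), size=2).reshape(2, 2, 2)   # positive a.s.
p = np.einsum('z,zx,zyw->xywz', p_z, p_x_given_z, p_yw_given_z)     # axes: X, Y, W, Z

def p_x_given(mask):
    """P(X | event), where `mask` is a boolean array over the (Y, W, Z) axes."""
    sub = p[:, mask]                       # shape (2, number of selected (y, w, z) cells)
    return sub.sum(axis=1) / sub.sum()

def event(y=None, w=None, z=None):
    """Boolean mask over (Y, W, Z) selecting the given values (None = unrestricted)."""
    m = np.ones((2, 2, 2), dtype=bool)
    if y is not None: m &= np.arange(2).reshape(2, 1, 1) == y
    if w is not None: m &= np.arange(2).reshape(1, 2, 1) == w
    if z is not None: m &= np.arange(2).reshape(1, 1, 2) == z
    return m

for y, w, z in product(range(2), repeat=3):
    # Premise 1: P(X | Z, W, Y) = P(X | Z, W)
    assert np.allclose(p_x_given(event(y=y, w=w, z=z)), p_x_given(event(w=w, z=z)))
    # Premise 2: P(X | Z, W, Y) = P(X | Z, Y)
    assert np.allclose(p_x_given(event(y=y, w=w, z=z)), p_x_given(event(y=y, z=z)))
    # Conclusion (intersection): P(X | Z, W, Y) = P(X | Z)
    assert np.allclose(p_x_given(event(y=y, w=w, z=z)), p_x_given(event(z=z)))

print("intersection property holds on this example")
```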
Technical note: since these implications hold for any probability space, they will still hold if one considers a sub-universe by conditioning everything on another variable, say K. For example,
X\perp\perp Y ⇒ Y\perp\perp X would also mean that X\perp\perp Y\mid K ⇒ Y\perp\perp X\mid K.