Beliefs depend on the available information. This idea is formalized in probability theory by conditioning. Conditional probabilities, conditional expectations, and conditional probability distributions are treated on three levels: discrete probabilities, probability density functions, and measure theory. Conditioning leads to a non-random result if the condition is completely specified; otherwise, if the condition is left random, the result of conditioning is also random.
Example: A fair coin is tossed 10 times; the random variable X is the number of heads in these 10 tosses, and Y is the number of heads in the first 3 tosses. In spite of the fact that Y emerges before X, it may happen that someone knows X but not Y.
See main article: Conditional probability. Given that X = 1, the conditional probability of the event Y = 0 is
P(Y=0|X=1)=\frac{P(Y=0,X=1)}{P(X=1)}=0.7.
More generally,
\begin{align} P(Y=0|X=x)&=\frac{\binom{7}{x}}{\binom{10}{x}}=\frac{7!\,(10-x)!}{(7-x)!\,10!}&&x=0,1,2,3,4,5,6,7,\\[4pt] P(Y=0|X=x)&=0&&x=8,9,10. \end{align}
One may also treat the conditional probability as a random variable, a function of the random variable X, namely,
P(Y=0|X)=\begin{cases}\binom{7}{X}/\binom{10}{X}&X\leqslant 7,\\ 0&X>7.\end{cases}
The expectation of this random variable is equal to the (unconditional) probability,
E(P(Y=0|X))=\sum_x P(Y=0|X=x)\,P(X=x)=P(Y=0),
namely,
\sum_{x=0}^{7}\frac{\binom{7}{x}}{\binom{10}{x}}\cdot\binom{10}{x}\frac{1}{2^{10}}=\frac{1}{8},
which is an instance of the law of total probability E(P(A|X))=P(A).
Thus, P(Y=0|X=1) may be treated as the value of the random variable P(Y=0|X) corresponding to X = 1. On the other hand, P(Y=0|X=1) is well-defined irrespective of other possible values of X.
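These discrete formulas can be verified by brute-force enumeration of all 2^10 equally likely toss sequences. The following Python sketch is an illustration added here, not part of the original treatment; the helper name prob and the tolerances are ad hoc choices.

```python
from itertools import product
from math import comb

# All 2**10 equally likely toss sequences (1 = heads); X = total heads, Y = heads among the first 3.
outcomes = list(product((0, 1), repeat=10))

def prob(event):
    """Probability of an event given as a predicate on a toss sequence."""
    return sum(event(w) for w in outcomes) / len(outcomes)

# P(Y=0 | X=x) by counting, compared with binom(7,x)/binom(10,x).
for x in range(8):
    joint = prob(lambda w, x=x: sum(w) == x and sum(w[:3]) == 0)
    marginal = prob(lambda w, x=x: sum(w) == x)
    assert abs(joint / marginal - comb(7, x) / comb(10, x)) < 1e-12

# Law of total probability: E(P(Y=0|X)) = P(Y=0) = 1/8.
total = sum(comb(7, x) / comb(10, x) * prob(lambda w, x=x: sum(w) == x) for x in range(8))
assert abs(total - 1 / 8) < 1e-12
print("checks passed")
```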
See main article: Conditional expectation. Given that X = 1, the conditional expectation of the random variable Y is
E(Y|X=1)=\tfrac{3}{10};
more generally,
E(Y|X=x)=\tfrac{3}{10}x
for x=0,\ldots,10.
(In this example it appears to be a linear function, but in general it is nonlinear.) One may also treat the conditional expectation as a random variable, a function of the random variable X, namely,
E(Y|X)=\tfrac{3}{10}X.
The expectation of this random variable is equal to the (unconditional) expectation of Y,
E(E(Y|X))=\sum_x E(Y|X=x)\,P(X=x)=E(Y),
namely,
\sum_{x=0}^{10}\tfrac{3}{10}x\cdot\binom{10}{x}\frac{1}{2^{10}}=\tfrac{3}{2},
or simply
E\left(\tfrac{3}{10}X\right)=\tfrac{3}{10}E(X)=\tfrac{3}{10}\cdot 5=\tfrac{3}{2},
which is an instance of the law of total expectation E(E(Y|X))=E(Y).
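A similar enumeration confirms E(Y|X=x) = (3/10)x and the law of total expectation E(E(Y|X)) = E(Y) = 3/2; again this is only an illustrative sketch, with an arbitrary tolerance.

```python
from itertools import product

# All 2**10 equally likely toss sequences (1 = heads); X = total heads, Y = heads among the first 3.
outcomes = list(product((0, 1), repeat=10))

def cond_exp_Y(x):
    """E(Y | X = x), computed by averaging Y over the outcomes with X = x."""
    ys = [sum(w[:3]) for w in outcomes if sum(w) == x]
    return sum(ys) / len(ys)

for x in range(11):
    assert abs(cond_exp_Y(x) - 0.3 * x) < 1e-12   # E(Y | X = x) = (3/10) x

# Law of total expectation: E(E(Y|X)) = E(Y) = 3/2.
e_cond = {x: cond_exp_Y(x) for x in range(11)}
EY = sum(e_cond[sum(w)] for w in outcomes) / len(outcomes)
assert abs(EY - 1.5) < 1e-12
print("checks passed")
```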
The random variable E(Y|X) is the best predictor of Y given X. That is, it minimizes the mean square error E(Y-f(X))^2 on the class of all random variables of the form f(X).
This class of random variables remains intact if X is replaced, say, with 2X. Thus, E(Y|2X)=E(Y|X). It does not mean that E(Y|2X)=\tfrac{3}{10}\cdot 2X; rather, E(Y|2X)=\tfrac{3}{20}\cdot 2X=\tfrac{3}{10}X. In particular, E(Y|2X=2)=\tfrac{3}{10}. More generally, E(Y|g(X))=E(Y|X) for every function g that is one-to-one on the set of all possible values of X. The values of X are irrelevant; what matters is the partition (denote it \alpha_X)
\Omega=\{X=x_1\}\uplus\{X=x_2\}\uplus\dots
of the sample space Ω into disjoint sets \{X=x_n\}. (Here x_1,x_2,\ldots are all possible values of X.)
Conditional probability may be treated as a special case of conditional expectation. Namely, P(A|X) = E(Y|X) if Y is the indicator of A. Therefore the conditional probability also depends on the partition \alpha_X generated by X rather than on X itself; P(A|g(X)) = P(A|X) = P(A|\alpha), \alpha = \alpha_X = \alpha_{g(X)}.
On the other hand, conditioning on an event B is well-defined, provided that P(B) ≠ 0, irrespective of any partition that may contain B.
See main article: Conditional probability distribution. Given X = x, the conditional distribution of Y is
P(Y=y|X=x)=\frac{\binom{3}{y}\binom{7}{x-y}}{\binom{10}{x}}=\frac{\binom{x}{y}\binom{10-x}{3-y}}{\binom{10}{3}}
for 0 ≤ y ≤ min(3, x). It is the hypergeometric distribution H(x; 3, 7), or equivalently, H(3; x, 10−x). The corresponding expectation 0.3x, obtained from the general formula n\frac{R}{R+W} for H(n; R, W), is nothing but the conditional expectation E(Y|X=x) = 0.3x.
Treating H(X; 3, 7) as a random distribution (a random vector in the four-dimensional space of all measures on {0, 1, 2, 3}), one may take its expectation, getting the unconditional distribution of Y, namely the binomial distribution Bin(3, 0.5). This fact amounts to the equality
\sum_{x=0}^{10}P(Y=y|X=x)\,P(X=x)=P(Y=y)=\frac{1}{2^{3}}\binom{3}{y}
for y = 0, 1, 2, 3, which is an instance of the law of total probability.
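The mixture identity can likewise be checked directly; the sketch below is illustrative only, using the standard-library math.comb to verify that mixing the hypergeometric conditional distributions over x recovers Bin(3, 0.5).

```python
from math import comb

def p_x(x):
    """P(X = x) for X ~ Bin(10, 1/2)."""
    return comb(10, x) / 2 ** 10

def p_y_given_x(y, x):
    """Hypergeometric H(x; 3, 7): P(Y = y | X = x)."""
    if not (0 <= y <= 3 and 0 <= x - y <= 7):
        return 0.0
    return comb(3, y) * comb(7, x - y) / comb(10, x)

# Mixing the conditional distributions over x recovers Y ~ Bin(3, 1/2).
for y in range(4):
    mixture = sum(p_y_given_x(y, x) * p_x(x) for x in range(11))
    assert abs(mixture - comb(3, y) / 2 ** 3) < 1e-12
print("checks passed")
```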
See main article: Probability density function and Conditional probability distribution. Example. A point of the sphere x^2 + y^2 + z^2 = 1 is chosen at random according to the uniform distribution on the sphere.[1] The random variables X, Y, Z are the coordinates of the random point. The joint density of X, Y, Z does not exist (since the sphere is of zero volume), but the joint density f_{X,Y} of X, Y exists,
f_{X,Y}(x,y)=\begin{cases} \frac{1}{2\pi\sqrt{1-x^2-y^2}}&\text{if }x^2+y^2<1,\\ 0&\text{otherwise.} \end{cases}
(The density is non-constant because of a non-constant angle between the sphere and the plane.) The density of X may be calculated by integration,
f_X(x)=\int_{-\infty}^{+\infty}f_{X,Y}(x,y)\,dy=\int_{-\sqrt{1-x^2}}^{+\sqrt{1-x^2}}\frac{dy}{2\pi\sqrt{1-x^2-y^2}}\,;
surprisingly, the result does not depend on x in (−1,1),
f_X(x)=\begin{cases} 0.5&\text{for }-1<x<1,\\ 0&\text{otherwise,} \end{cases}
which means that X is distributed uniformly on (−1,1). The same holds for Y and Z (and in fact, for aX + bY + cZ whenever a^2 + b^2 + c^2 = 1).
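The uniformity of X can be observed empirically. The sketch below samples points uniformly on the sphere by normalizing 3D Gaussian vectors (a standard method; the helper name and sample size are arbitrary choices made for this illustration) and shows that X falls in each quarter of (−1, 1) about equally often.

```python
import random
from math import sqrt

random.seed(0)

def random_point_on_sphere():
    """Uniform point on the unit sphere: normalize a standard 3D Gaussian vector."""
    while True:
        v = [random.gauss(0, 1) for _ in range(3)]
        r = sqrt(sum(c * c for c in v))
        if r > 1e-12:
            return [c / r for c in v]

n = 200_000
xs = [random_point_on_sphere()[0] for _ in range(n)]

# If X is uniform on (-1, 1), each of these four equal sub-intervals receives about 25% of the points.
for a in (-1.0, -0.5, 0.0, 0.5):
    print(a, sum(a < x <= a + 0.5 for x in xs) / n)
```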
Example. A different calculation of a marginal density, for a point chosen at random according to the uniform distribution in the unit ball (rather than on the sphere), is provided below.[2] [3] Here the joint density is constant,
f_{X,Y,Z}(x,y,z)=\frac{3}{4\pi}\quad\text{for }x^2+y^2+z^2\le 1,
and the marginal density of X is obtained by integration,
f_X(x)=\int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}}\int_{-\sqrt{1-y^2-x^2}}^{\sqrt{1-y^2-x^2}}\frac{3}{4\pi}\,dz\,dy=\frac{3}{4}\left(1-x^2\right).
Given that X = 0.5, the conditional probability of the event Y ≤ 0.75 is the integral of the conditional density,
f_{Y|X=0.5}(y)=\frac{f_{X,Y}(0.5,y)}{f_X(0.5)}=\begin{cases} \frac{1}{\pi\sqrt{0.75-y^2}}&\text{for }-\sqrt{0.75}<y<\sqrt{0.75},\\ 0&\text{otherwise.} \end{cases}
P(Y\le 0.75|X=0.5)=\int_{-\infty}^{0.75}f_{Y|X=0.5}(y)\,dy=\int_{-\sqrt{0.75}}^{0.75}\frac{dy}{\pi\sqrt{0.75-y^2}}=\tfrac12+\tfrac1{\pi}\arcsin\sqrt{0.75}=\tfrac56.
More generally,
P(Y\le y|X=x)=\tfrac12+\tfrac1{\pi}\arcsin\frac{y}{\sqrt{1-x^2}}
for -\sqrt{1-x^2}<y<\sqrt{1-x^2}. One may also treat the conditional probability as a random variable, a function of the random variable X, namely,
P(Y\le y|X)=\begin{cases} 0&\text{for }X^2\ge 1-y^2\text{ and }y<0,\\ \tfrac12+\tfrac1{\pi}\arcsin\dfrac{y}{\sqrt{1-X^2}}&\text{for }X^2<1-y^2,\\ 1&\text{for }X^2\ge 1-y^2\text{ and }y>0. \end{cases}
The expectation of this random variable is equal to the (unconditional) probability,
E(P(Y\le y|X))=\int_{-\infty}^{+\infty}P(Y\le y|X=x)\,f_X(x)\,dx=P(Y\le y),
which is an instance of the law of total probability E(P(A|X))=P(A).
The conditional probability P (Y ≤ 0.75 | X = 0.5) cannot be interpreted as P (Y ≤ 0.75, X = 0.5) / P (X = 0.5), since the latter gives 0/0. Accordingly, P (Y ≤ 0.75 | X = 0.5) cannot be interpreted via empirical frequencies, since the exact value X = 0.5 has no chance to appear at random, not even once during an infinite sequence of independent trials.
The conditional probability can be interpreted as a limit,
\begin{align} P(Y\le0.75|X=0.5)&=\lim_{\varepsilon\to0+}P(Y\le0.75|0.5-\varepsilon<X<0.5+\varepsilon)\\ &=\lim_{\varepsilon\to0+}\frac{P(Y\le0.75,\;0.5-\varepsilon<X<0.5+\varepsilon)}{P(0.5-\varepsilon<X<0.5+\varepsilon)}\\ &=\lim_{\varepsilon\to0+}\frac{\int_{0.5-\varepsilon}^{0.5+\varepsilon}dx\int_{-\infty}^{0.75}f_{X,Y}(x,y)\,dy}{\int_{0.5-\varepsilon}^{0.5+\varepsilon}f_X(x)\,dx}\,. \end{align}
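The limit interpretation can be illustrated by Monte Carlo: condition on a thin slab 0.5 − ε < X < 0.5 + ε and estimate the fraction of points with Y ≤ 0.75, which should be close to 5/6 for small ε. The sketch below is illustrative; ε and the sample size are arbitrary choices.

```python
import random
from math import sqrt

random.seed(1)

def random_point_on_sphere():
    """Uniform point on the unit sphere: normalize a standard 3D Gaussian vector."""
    while True:
        v = [random.gauss(0, 1) for _ in range(3)]
        r = sqrt(sum(c * c for c in v))
        if r > 1e-12:
            return [c / r for c in v]

eps = 0.01
hits = total = 0
for _ in range(1_000_000):
    x, y, z = random_point_on_sphere()
    if 0.5 - eps < x < 0.5 + eps:          # condition on a thin slab around X = 0.5
        total += 1
        hits += (y <= 0.75)

print(hits / total, 5 / 6)                 # the ratio approaches 5/6 as eps -> 0
```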
The conditional expectation E (Y | X = 0.5) is of little interest; it vanishes just by symmetry. It is more interesting to calculate E (|Z| | X = 0.5) treating |Z| as a function of X, Y:
\begin{align} |Z|&=h(X,Y)=\sqrt{1-X^2-Y^2};\\ E(|Z|\mid X=0.5)&=\int_{-\infty}^{+\infty}h(0.5,y)\,f_{Y|X=0.5}(y)\,dy=\int_{-\sqrt{0.75}}^{+\sqrt{0.75}}\sqrt{0.75-y^2}\cdot\frac{dy}{\pi\sqrt{0.75-y^2}}=\frac{2}{\pi}\sqrt{0.75}\,. \end{align}
More generally,
E(|Z|\mid X=x)=\frac{2}{\pi}\sqrt{1-x^2}
for -1<x<1. One may also treat the conditional expectation as a random variable, a function of the random variable X, namely,
E(|Z|\mid X)=\frac{2}{\pi}\sqrt{1-X^2}.
The expectation of this random variable is equal to the (unconditional) expectation of |Z|,
E(E(|Z|\mid X))=\int_{-\infty}^{+\infty}E(|Z|\mid X=x)\,f_X(x)\,dx=E(|Z|),
namely,
\int_{-1}^{+1}\frac{2}{\pi}\sqrt{1-x^2}\cdot\frac{dx}{2}=\tfrac{1}{2},
which is an instance of the law of total expectation E(E(|Z|\mid X))=E(|Z|).
The random variable E(|Z| | X) is the best predictor of |Z| given X. That is, it minimizes the mean square error E(|Z| - f(X))^2 on the class of all random variables of the form f(X). Similarly to the discrete case, E(|Z| | g(X)) = E(|Z| | X) for every measurable function g that is one-to-one on (-1,1).
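An analogous simulation estimates E(|Z| | X ≈ 0.5) and compares it with (2/π)√(1 − 0.5^2); this is only a numerical illustration of the formula above, with an arbitrary window width and sample size.

```python
import random
from math import sqrt, pi

random.seed(2)

def random_point_on_sphere():
    """Uniform point on the unit sphere: normalize a standard 3D Gaussian vector."""
    while True:
        v = [random.gauss(0, 1) for _ in range(3)]
        r = sqrt(sum(c * c for c in v))
        if r > 1e-12:
            return [c / r for c in v]

x0, eps = 0.5, 0.02
zs = []
for _ in range(1_000_000):
    x, y, z = random_point_on_sphere()
    if abs(x - x0) < eps:
        zs.append(abs(z))

print(sum(zs) / len(zs))                 # empirical E(|Z| | X in a small window around 0.5)
print(2 / pi * sqrt(1 - x0 ** 2))        # theoretical value (2/pi) * sqrt(0.75), about 0.55
```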
Given X = x, the conditional distribution of Y, given by the density f_{Y|X=x}(y), is the (rescaled) arcsin distribution; its cumulative distribution function is
F_{Y|X=x}(y)=P(Y\le y|X=x)=\tfrac12+\tfrac1{\pi}\arcsin\frac{y}{\sqrt{1-x^2}}
for all x and y such that x^2+y^2<1. The mixture of these conditional distributions, taken for all x according to the distribution of X, is the unconditional distribution of Y. This fact amounts to the equalities
\begin{align} &\int_{-\infty}^{+\infty}f_{Y|X=x}(y)\,f_X(x)\,dx=f_Y(y),\\ &\int_{-\infty}^{+\infty}F_{Y|X=x}(y)\,f_X(x)\,dx=F_Y(y), \end{align}
the latter being an instance of the law of total probability mentioned above.
See main article: Borel–Kolmogorov paradox. On the discrete level, conditioning is possible only if the condition is of nonzero probability (one cannot divide by zero). On the level of densities, conditioning on X = x is possible even though P (X = x) = 0. This success may create the illusion that conditioning is always possible. Regrettably, it is not, for several reasons presented below.
The result P (Y ≤ 0.75 | X = 0.5) = 5/6, mentioned above, is geometrically evident in the following sense. The points (x, y, z) of the sphere x^2 + y^2 + z^2 = 1 satisfying the condition x = 0.5 form a circle y^2 + z^2 = 0.75 of radius \sqrt{0.75} in the plane x = 0.5. The inequality y ≤ 0.75 holds on an arc whose length is 5/6 of the length of the circle, which is why the conditional probability equals 5/6.
This successful geometric explanation may create the illusion that the following question is trivial.
A point of a given sphere is chosen at random (uniformly). Given that the point lies on a given plane, what is its conditional distribution?
It may seem evident that the conditional distribution must be uniform on the given circle (the intersection of the given sphere and the given plane). Sometimes it really is, but in general it is not. In particular, Z is distributed uniformly on (-1,+1) and independent of the ratio Y/X; thus, P (Z ≤ 0.5 | Y/X) = 0.75. On the other hand, the inequality z ≤ 0.5 holds on an arc of the circle x^2 + y^2 + z^2 = 1, y = cx (for any given c). The length of the arc is 2/3 of the length of the circle. However, the conditional probability is 3/4, not 2/3. This is a manifestation of the classical Borel paradox.
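This can be checked numerically: conditioning on |Y/X − c| < δ gives about 3/4 for P(Z ≤ 0.5) regardless of δ, since Z is exactly independent of Y/X, and not the arc-length value 2/3. The sketch below is illustrative; c, δ and the sample size are arbitrary choices.

```python
import random
from math import sqrt

random.seed(3)

def random_point_on_sphere():
    """Uniform point on the unit sphere: normalize a standard 3D Gaussian vector."""
    while True:
        v = [random.gauss(0, 1) for _ in range(3)]
        r = sqrt(sum(c * c for c in v))
        if r > 1e-12:
            return [c / r for c in v]

c, delta = 1.0, 0.05          # condition on the plane y = cx, approximated by |Y/X - c| < delta
hits = total = 0
for _ in range(2_000_000):
    x, y, z = random_point_on_sphere()
    if x != 0 and abs(y / x - c) < delta:
        total += 1
        hits += (z <= 0.5)

print(hits / total)           # close to 3/4 (Z is independent of Y/X), not the arc-length value 2/3
```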
Another example. A random rotation of the three-dimensional space is a rotation by a random angle around a random axis. Geometric intuition suggests that the angle is independent of the axis and distributed uniformly. However, the latter is wrong; small values of the angle are less probable.
Given an event B of zero probability, the formula P(A|B)=P(A\cap B)/P(B) is useless. However, one can try to define the conditional probability as the limit
P(A|B)=\lim_{n\to\infty}P(A\cap B_n)/P(B_n)
over a sequence of events B_n of nonzero probability such that B_1\supset B_2\supset\dots and B_1\cap B_2\cap\dots=B. In general, different sequences may lead to different limits, or to no limit at all.
In the latter two examples the law of total probability is irrelevant, since only a single event (the condition) is given. By contrast, in the example above the law of total probability applies, since the event X = 0.5 is included in a family of events X = x where x runs over (−1,1), and these events are a partition of the probability space.
In order to avoid paradoxes (such as the Borel paradox), the following important distinction should be taken into account. If a given event is of nonzero probability then conditioning on it is well-defined (irrespective of any other events), as was noted above. By contrast, if the given event is of zero probability then conditioning on it is ill-defined unless some additional input is provided. A wrong choice of this additional input leads to wrong conditional probabilities (expectations, distributions). In this sense, "the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible." (Kolmogorov)
The additional input may be (a) a symmetry (invariance group); (b) a sequence of events Bn such that Bn ↓ B, P (Bn) > 0; (c) a partition containing the given event. Measure-theoretic conditioning (below) investigates Case (c), discloses its relation to (b) in general and to (a) when applicable.
Some events of zero probability are beyond the reach of conditioning. An example: let Xn be independent random variables distributed uniformly on (0,1), and B the event "Xn → 0 as n → ∞"; what about P (Xn < 0.5 | B) ? Does it tend to 1, or not? Another example: let X be a random variable distributed uniformly on (0,1), and B the event "X is a rational number"; what about P (X = 1/n | B) ? The only answer is that, once again, "the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible."
See main article: Conditional expectation. Example. Let Y be a random variable distributed uniformly on (0,1), and X = f(Y) where f is a given function. Two cases are treated below: f = f1 and f = f2, where f1 is the continuous piecewise-linear function
f_1(y)=\begin{cases} 3y&\text{for }0\le y\le 1/3,\\ 1.5(1-y)&\text{for }1/3\le y\le 2/3,\\ 0.5&\text{for }2/3\le y\le 1, \end{cases}
and f_2 is the everywhere continuous but nowhere differentiable Weierstrass function.
Given X = 0.75, two values of Y are possible, 0.25 and 0.5. It may seem evident that both values are of conditional probability 0.5 just because one point is congruent to another point. However, this is an illusion; see below.
The conditional probability P (Y ≤ 1/3 | X) may be defined as the best predictor of the indicator
I=\begin{cases} 1&\text{if }Y\le 1/3,\\ 0&\text{otherwise,} \end{cases}
given X; that is, minimizing the mean square error E(I-g(X))^2 on the class of all random variables of the form g(X).
In the case f = f1 the corresponding function g = g1 may be calculated explicitly,[4]
g_1(x)=\begin{cases} 1&\text{for }0<x<0.5,\\ 0&\text{for }x=0.5,\\ 1/3&\text{for }0.5<x<1. \end{cases}
Alternatively, the limiting procedure may be used,
g_1(x)=\lim_{\varepsilon\to 0+}P(Y\le 1/3\,|\,x-\varepsilon\le X\le x+\varepsilon),
giving the same result.
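The limiting procedure is easy to simulate: draw Y uniformly, set X = f1(Y), and estimate P(Y ≤ 1/3 | x − ε ≤ X ≤ x + ε) for a few values of x. The sketch below is illustrative (ε and the sample size are arbitrary) and reproduces g1, keeping in mind that at the atom x = 0.5 the estimate approaches g1(0.5) = 0 only as ε → 0.

```python
import random

random.seed(4)

def f1(y):
    """The piecewise-linear function f1 from the example."""
    if y <= 1 / 3:
        return 3 * y
    if y <= 2 / 3:
        return 1.5 * (1 - y)
    return 0.5

def g1(x):
    """The conditional probability P(Y <= 1/3 | X = x) stated above."""
    if 0 < x < 0.5:
        return 1.0
    if x == 0.5:
        return 0.0
    return 1 / 3

ys = [random.random() for _ in range(1_000_000)]
eps = 0.005
for x in (0.25, 0.5, 0.75):
    window = [y for y in ys if x - eps <= f1(y) <= x + eps]
    estimate = sum(y <= 1 / 3 for y in window) / len(window)
    print(x, estimate, g1(x))   # at the atom x = 0.5 the estimate tends to 0 only as eps -> 0
```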
Thus, P (Y ≤ 1/3 | X) = g1 (X). The expectation of this random variable is equal to the (unconditional) probability, E (P (Y ≤ 1/3 | X)) = P (Y ≤ 1/3), namely,
1\cdot P(X<0.5)+0\cdot P(X=0.5)+\tfrac13\cdot P(X>0.5)=1\cdot\tfrac16+0\cdot\tfrac13+\tfrac13\cdot\left(\tfrac16+\tfrac13\right)=\tfrac13.
In the case f = f2 the corresponding function g = g2 probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically. Indeed, the space L2 (Ω) of all square integrable random variables is a Hilbert space; the indicator I is a vector of this space; and random variables of the form g (X) are a (closed, linear) subspace. The orthogonal projection of this vector to this subspace is well-defined. It can be computed numerically, using finite-dimensional approximations to the infinite-dimensional Hilbert space.
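A crude version of such a finite-dimensional approximation can be sketched as follows: restrict to functions of X that are constant on each of N small bins of the range of X, in which case the orthogonal projection of the indicator I is simply its bin-wise average. Since f2 itself is not reproduced here, the function f2_standin below is only a hypothetical stand-in (a truncated Weierstrass-type sum); the bin count and sample size are likewise arbitrary.

```python
import random
from math import cos, pi

random.seed(5)

def f2_standin(y, terms=10):
    """Hypothetical stand-in for f2: a truncated Weierstrass-type sum rescaled into (0, 1)."""
    s = sum(0.5 ** k * cos(3 ** k * pi * y) for k in range(terms))
    return (s + 2.0) / 4.0

# Sample (X, I) with Y uniform on (0, 1), X = f2(Y), I = 1{Y <= 1/3}.
n, bins = 200_000, 100
samples = []
for _ in range(n):
    y = random.random()
    samples.append((f2_standin(y), 1.0 if y <= 1 / 3 else 0.0))

# Projection onto functions of X constant on each bin: the bin-wise average of the indicator I.
lo = min(x for x, _ in samples)
hi = max(x for x, _ in samples)
sums, counts = [0.0] * bins, [0] * bins
for x, i in samples:
    b = min(int((x - lo) / (hi - lo) * bins), bins - 1)
    sums[b] += i
    counts[b] += 1
g2_approx = [s / c if c else None for s, c in zip(sums, counts)]
# g2_approx[b] estimates P(Y <= 1/3 | X in bin b); refining the bins refines the projection.
```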
Once again, the expectation of the random variable P (Y ≤ 1/3 | X) = g2 (X) is equal to the (unconditional) probability, E (P (Y ≤ 1/3 | X)) = P (Y ≤ 1/3), namely,
\int_0^1 g_2(f_2(y))\,dy=\tfrac13.
However, the Hilbert space approach treats g2 as an equivalence class of functions rather than an individual function. Measurability of g2 is ensured, but continuity (or even Riemann integrability) is not. The value g2 (0.5) is determined uniquely, since the point 0.5 is an atom of the distribution of X. Other values x are not atoms, thus, corresponding values g2 (x) are not determined uniquely. Once again, "the concept of a conditional probability with regard to an isolated hypothesis whose probability equals 0 is inadmissible." (Kolmogorov)
Alternatively, the same function g (be it g1 or g2) may be defined as the Radon–Nikodym derivative
g=\frac{d\nu}{d\mu},
where the measures μ, ν are defined by
\begin{align} \mu(B)&=P(X\in B),\\ \nu(B)&=P(X\in B,\,Y\le\tfrac{1}{3}) \end{align}
for all Borel sets B\subset\R. That is, μ is the (unconditional) distribution of X, while ν is one third of its conditional distribution,
\nu(B)=P(X\in B\,|\,Y\le\tfrac{1}{3})\,P(Y\le\tfrac{1}{3})=\tfrac13 P(X\in B\,|\,Y\le\tfrac{1}{3}).
Both approaches (via the Hilbert space, and via the Radon–Nikodym derivative) treat g as an equivalence class of functions; two functions g and g′ are treated as equivalent, if g (X) = g′ (X) almost surely. Accordingly, the conditional probability P (Y ≤ 1/3 | X) is treated as an equivalence class of random variables; as usual, two random variables are treated as equivalent if they are equal almost surely.
The conditional expectation E(Y|X) may be defined as the best predictor of Y given X, that is, the one minimizing the mean square error E(Y-h(X))^2 on the class of all random variables of the form h(X).
In the case f = f1 the corresponding function h = h1 may be calculated explicitly,[5]
h_1(x)=\begin{cases} \dfrac{x}{3}&\text{for }0<x<\dfrac12,\\[4pt] \dfrac56&\text{for }x=\dfrac12,\\[4pt] \dfrac13(2-x)&\text{for }\dfrac12<x<1. \end{cases}
Alternatively, the limiting procedure may be used,
h_1(x)=\lim_{\varepsilon\to 0+}E(Y\,|\,x-\varepsilon\leqslant X\leqslant x+\varepsilon),
giving the same result.
Thus, E(Y|X)=h_1(X). The expectation of this random variable is equal to the (unconditional) expectation, E(E(Y|X))=E(Y), namely,
\int_0^1 h_1(f_1(y))\,dy=\int_0^{1/6}\frac{3y}{3}\,dy+\int_{1/6}^{1/3}\frac{2-3y}{3}\,dy+\int_{1/3}^{2/3}\frac{2-1.5(1-y)}{3}\,dy+\int_{2/3}^{1}\frac56\,dy=\tfrac12,
which is an instance of the law of total expectation E(E(Y|X))=E(Y).
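Both the integral and the limiting procedure can be checked numerically; the sketch below (illustrative, with arbitrary ε and sample size) estimates E(h1(X)) ≈ 1/2 and E(Y | 0.75 − ε ≤ X ≤ 0.75 + ε) ≈ h1(0.75) = 5/12.

```python
import random

random.seed(6)

def f1(y):
    """The piecewise-linear function f1 from the example."""
    if y <= 1 / 3:
        return 3 * y
    if y <= 2 / 3:
        return 1.5 * (1 - y)
    return 0.5

def h1(x):
    """The conditional expectation E(Y | X = x) stated above."""
    if 0 < x < 0.5:
        return x / 3
    if x == 0.5:
        return 5 / 6
    return (2 - x) / 3

ys = [random.random() for _ in range(1_000_000)]

# Law of total expectation: E(h1(X)) = E(Y) = 1/2.
print(sum(h1(f1(y)) for y in ys) / len(ys))

# Limiting procedure at x = 0.75: E(Y | x - eps <= X <= x + eps) is close to h1(0.75) = 5/12.
x, eps = 0.75, 0.005
window = [y for y in ys if x - eps <= f1(y) <= x + eps]
print(sum(window) / len(window), h1(x))
```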
In the case f = f2 the corresponding function h = h2 probably cannot be calculated explicitly. Nevertheless it exists, and can be computed numerically in the same way as g2 above, as the orthogonal projection in the Hilbert space. The law of total expectation holds, since the projection cannot change the scalar product with the constant 1 belonging to the subspace.
Alternatively, the same function h (be it h1 or h2) may be defined as the Radon–Nikodym derivative
h=\frac{d\nu}{d\mu},
where measures μ, ν are defined by
\begin{align} \mu(B)&=P(X\in B),\\ \nu(B)&=E(Y;\,X\in B) \end{align}
for all Borel sets
B\subset\R.
Here E(Y;A) is the restricted expectation, not to be confused with the conditional expectation E(Y|A)=E(Y;A)/P(A).
See main article: Disintegration theorem and Regular conditional probability. In the case f = f1 the conditional cumulative distribution function may be calculated explicitly, similarly to g1. The limiting procedure gives:
F_{Y|X=3/4}(y)=P\left(Y\leqslant y\left|X=\tfrac{3}{4}\right.\right)=\lim_{\varepsilon\to0+}P\left(Y\leqslant y\left|\tfrac{3}{4}-\varepsilon\leqslant X\leqslant\tfrac{3}{4}+\varepsilon\right.\right)=\begin{cases}0&-\infty<y<\tfrac{1}{4},\\[4pt]\tfrac{1}{6}&y=\tfrac{1}{4},\\[4pt]\tfrac{1}{3}&\tfrac{1}{4}<y<\tfrac{1}{2},\\[4pt]\tfrac{2}{3}&y=\tfrac{1}{2},\\[4pt]1&\tfrac{1}{2}<y<\infty,\end{cases}
which cannot be correct, since a cumulative distribution function must be right-continuous!
This paradoxical result is explained by measure theory as follows. For a given y the corresponding F_{Y|X=x}(y)=P(Y\leqslant y|X=x) is well-defined (via the Hilbert space or the Radon–Nikodym derivative) only as an equivalence class of functions of x. Treated as a function of y for a given x, it is ill-defined unless some additional input is provided: a function of x must be chosen within every equivalence class, and choosing them separately for each y can make the choices mutually incompatible.
A right choice can be made as follows. First, F_{Y|X=x}(y)=P(Y\leqslant y|X=x) is considered for rational y only; this uses only a countable set of equivalence classes, and functions can be chosen within them compatibly, defining the conditional distribution function for almost all x but only for rational y. Second, for each x the function is extended from rational y to all y by right-continuity.
In general the conditional distribution is defined for almost all x (according to the distribution of X), but sometimes the result is continuous in x, in which case individual values are acceptable. In the considered example this is the case; the correct result for x = 0.75,
F_{Y|X=3/4}(y)=P\left(Y\leqslant y\left|X=\tfrac{3}{4}\right.\right)=\begin{cases}0&-\infty<y<\tfrac{1}{4},\\[4pt]\tfrac{1}{3}&\tfrac{1}{4}\leqslant y<\tfrac{1}{2},\\[4pt]1&\tfrac{1}{2}\leqslant y<\infty,\end{cases}
shows that the conditional distribution of Y given X = 0.75 consists of two atoms, at 0.25 and 0.5, of probabilities 1/3 and 2/3 respectively.
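The two atoms can be seen in a simulation: conditioning on X within a small window around 0.75, the values of Y cluster near 0.25 and 0.5 with empirical weights close to 1/3 and 2/3. The sketch below is illustrative; the window width and sample size are arbitrary choices.

```python
import random

random.seed(7)

def f1(y):
    """The piecewise-linear function f1 from the example."""
    if y <= 1 / 3:
        return 3 * y
    if y <= 2 / 3:
        return 1.5 * (1 - y)
    return 0.5

eps = 0.005
near_quarter = near_half = total = 0
for _ in range(1_000_000):
    y = random.random()
    if 0.75 - eps <= f1(y) <= 0.75 + eps:   # condition on X in a small window around 0.75
        total += 1
        near_quarter += abs(y - 0.25) < 0.01
        near_half += abs(y - 0.50) < 0.01

print(near_quarter / total, near_half / total)   # close to 1/3 and 2/3
```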
Similarly, the conditional distribution may be calculated for all x in (0, 0.5) or (0.5, 1).
The value x = 0.5 is an atom of the distribution of X, thus, the corresponding conditional distribution is well-defined and may be calculated by elementary means (the denominator does not vanish); the conditional distribution of Y given X = 0.5 is uniform on (2/3, 1). Measure theory leads to the same result.
The mixture of all conditional distributions is the (unconditional) distribution of Y.
The conditional expectation E(Y|X=x) is nothing but the expectation with respect to the conditional distribution.
In the case f = f2 the corresponding conditional distribution function F_{Y|X=x}(y)=P(Y\leqslant y|X=x) probably cannot be calculated explicitly; nevertheless it exists, and can be obtained by the construction described above.
Once again, the mixture of all conditional distributions is the (unconditional) distribution, and the conditional expectation is the expectation with respect to the conditional distribution.
Proof of the formula for g_1: since X=f_1(Y) and Y is uniform on (0,1),
\begin{align} E(I-g(X))^2&=\int_0^{1/3}(1-g(3y))^2\,dy+\int_{1/3}^{2/3}g^2(1.5(1-y))\,dy+\int_{2/3}^{1}g^2(0.5)\,dy\\ &=\int_0^1(1-g(x))^2\,\frac{dx}{3}+\int_{0.5}^1 g^2(x)\,\frac{dx}{1.5}+\frac13 g^2(0.5)\\ &=\frac13\int_0^{0.5}(1-g(x))^2\,dx+\frac13 g^2(0.5)+\frac13\int_{0.5}^1\left((1-g(x))^2+2g^2(x)\right)dx; \end{align}
it remains to note that (1-a)^2+2a^2 is minimal at a=\tfrac13; minimizing each term pointwise gives g_1.
Proof of the formula for h_1: since X=f_1(Y),
\begin{align} E(Y-h_1(X))^2&=\int_0^1\left(y-h_1(f_1(y))\right)^2dy\\ &=\int_0^{1/6}\left(y-h_1(3y)\right)^2dy+\int_{1/6}^{1/3}\left(y-h_1(3y)\right)^2dy+\int_{1/3}^{2/3}\left(y-h_1(1.5(1-y))\right)^2dy+\int_{2/3}^{1}\left(y-h_1(\tfrac{1}{2})\right)^2dy\\ &=\int_0^1\left(\frac{x}{3}-h_1(x)\right)^2\frac{dx}{3}+\int_{0.5}^1\left(1-\frac{x}{1.5}-h_1(x)\right)^2\frac{dx}{1.5}+\frac13 h_1^2(\tfrac{1}{2})-\frac{5}{9}h_1(\tfrac{1}{2})+\frac{19}{81}\\ &=\frac13\int_0^{0.5}\left(h_1(x)-\frac{x}{3}\right)^2dx+\frac13 h_1^2(\tfrac{1}{2})-\frac{5}{9}h_1(\tfrac{1}{2})+\frac{19}{81}+\frac13\int_{0.5}^1\left(\left(h_1(x)-\frac{x}{3}\right)^2+2\left(h_1(x)-1+\frac{2x}{3}\right)^2\right)dx; \end{align}
it remains to note that
\left(a-\frac{x}{3}\right)^2+2\left(a-1+\frac{2x}{3}\right)^2
is minimal at a=\tfrac{2-x}{3}, and \tfrac13 a^2-\tfrac{5}{9}a is minimal at a=\tfrac56.