Chain rule (probability) explained

In probability theory, the chain rule^[1] (also called the general product rule^[2] ^[3]) describes how to calculate the probability of the intersection of events that are not necessarily independent, or the joint distribution of random variables, using conditional probabilities. The rule expresses a joint probability purely in terms of conditional probabilities.^[4] It is notably used in the context of discrete stochastic processes and in applications such as the study of Bayesian networks, which describe a probability distribution in terms of conditional probabilities.

Chain rule for events

Two events

For two events $A$ and $B$, the chain rule states that

$$P(A \cap B) = P(B \mid A)\,P(A),$$

where $P(B \mid A)$ denotes the conditional probability of $B$ given $A$.

Example

One urn has 1 black ball and 2 white balls, and a second urn has 1 black ball and 3 white balls. Suppose we pick an urn at random and then select a ball from that urn. Let event $A$ be choosing the first urn, i.e. $P(A) = P(\overline{A}) = 1/2$, where $\overline{A}$ is the complementary event of $A$. Let event $B$ be choosing a white ball. The chance of choosing a white ball, given that we have chosen the first urn, is $P(B \mid A) = 2/3$. The intersection $A \cap B$ then describes choosing the first urn and a white ball from it. The probability can be calculated by the chain rule as follows:

$$P(A \cap B) = P(B \mid A)\,P(A) = \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3}.$$
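The urn calculation can be reproduced with exact fractions. The following is a minimal sketch; the dictionary layout and the variable names are illustrative assumptions, not part of the example itself.

```python
from fractions import Fraction

# Urn contents as given in the example: (black, white) ball counts.
urns = {"first": (1, 2), "second": (1, 3)}

p_first = Fraction(1, 2)  # P(A): each urn is equally likely to be picked
black, white = urns["first"]
p_white_given_first = Fraction(white, black + white)  # P(B | A) = 2/3

# Chain rule for two events: P(A ∩ B) = P(B | A) * P(A)
p_first_and_white = p_white_given_first * p_first
print(p_first_and_white)  # 1/3
```

Using `Fraction` rather than floating point keeps the result exactly equal to the hand calculation.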

Finitely many events

For events $A_1, \ldots, A_n$ whose intersection does not have probability zero, the chain rule states

\begin{align}
P\left(A_1 \cap A_2 \cap \ldots \cap A_n\right)
&= P\left(A_n \mid A_1 \cap \ldots \cap A_{n-1}\right) P\left(A_1 \cap \ldots \cap A_{n-1}\right) \\
&= P\left(A_n \mid A_1 \cap \ldots \cap A_{n-1}\right) P\left(A_{n-1} \mid A_1 \cap \ldots \cap A_{n-2}\right) P\left(A_1 \cap \ldots \cap A_{n-2}\right) \\
&= P\left(A_n \mid A_1 \cap \ldots \cap A_{n-1}\right) P\left(A_{n-1} \mid A_1 \cap \ldots \cap A_{n-2}\right) \cdot \ldots \cdot P(A_3 \mid A_1 \cap A_2) P(A_2 \mid A_1) P(A_1) \\
&= P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_1 \cap A_2) \cdot \ldots \cdot P(A_n \mid A_1 \cap \ldots \cap A_{n-1}) \\
&= \prod_{k=1}^{n} P(A_k \mid A_1 \cap \ldots \cap A_{k-1}) \\
&= \prod_{k=1}^{n} P\left(A_k \,\Bigg|\, \bigcap_{j=1}^{k-1} A_j\right).
\end{align}
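The product formula can be checked numerically on a small finite sample space. The sketch below assumes two fair dice with the uniform measure. The three events and the helper names `prob`, `cond_prob` and `chain_rule` are hypothetical and chosen only for illustration. As an assumed convention, `cond_prob` returns 0 when the conditioning event has probability zero.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair six-sided dice (uniform measure).
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform measure on omega."""
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """P(a | b), with the convention P(a | b) = 0 when P(b) = 0."""
    return prob(a & b) / prob(b) if prob(b) > 0 else Fraction(0)

def chain_rule(events):
    """Product of P(A_k | A_1 ∩ ... ∩ A_{k-1}) for k = 1..n."""
    result = Fraction(1)
    seen = set(omega)  # the empty intersection is the whole sample space
    for a in events:
        result *= cond_prob(a, seen)
        seen &= a
    return result

# Hypothetical events, chosen only for illustration.
a1 = {w for w in omega if w[0] % 2 == 0}     # first die is even
a2 = {w for w in omega if w[0] + w[1] >= 8}  # sum is at least 8
a3 = {w for w in omega if w[1] > w[0]}       # second die exceeds the first

direct = prob(a1 & a2 & a3)
via_chain = chain_rule([a1, a2, a3])
print(direct, via_chain)  # 1/12 1/12
```

Here the factors are $1/2$, $1/2$ and $1/3$, and their product matches the directly counted probability of the three-way intersection.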

Example 1

For $n = 4$, i.e. four events, the chain rule reads

\begin{align}
P(A_1 \cap A_2 \cap A_3 \cap A_4) &= P(A_4 \mid A_3 \cap A_2 \cap A_1) P(A_3 \cap A_2 \cap A_1) \\
&= P(A_4 \mid A_3 \cap A_2 \cap A_1) P(A_3 \mid A_2 \cap A_1) P(A_2 \cap A_1) \\
&= P(A_4 \mid A_3 \cap A_2 \cap A_1) P(A_3 \mid A_2 \cap A_1) P(A_2 \mid A_1) P(A_1).
\end{align}

Example 2

We randomly draw 4 cards (one at a time) without replacement from a deck of 52 cards. What is the probability that we have picked 4 aces?

First, we set $A_n := \{\text{draw an ace in the } n\text{-th try}\}$. We get the following probabilities:

$$P(A_1) = \frac{4}{52}, \quad P(A_2 \mid A_1) = \frac{3}{51}, \quad P(A_3 \mid A_1 \cap A_2) = \frac{2}{50}, \quad P(A_4 \mid A_1 \cap A_2 \cap A_3) = \frac{1}{49}.$$

Applying the chain rule,

$$P(A_1 \cap A_2 \cap A_3 \cap A_4) = \frac{4}{52} \cdot \frac{3}{51} \cdot \frac{2}{50} \cdot \frac{1}{49} = \frac{24}{6497400} = \frac{1}{270725}.$$
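The four-aces probability can be recomputed and cross-checked by counting hands. This is a minimal sketch; the variable names are illustrative assumptions, and the counting check relies on there being exactly one favourable hand among the $\binom{52}{4}$ equally likely 4-card hands.

```python
from fractions import Fraction
from math import comb

aces, deck = 4, 52

# Chain rule: multiply P(A_k | A_1 ∩ ... ∩ A_{k-1}) for k = 1..4.
# After `drawn` aces are gone, (aces - drawn) aces remain among (deck - drawn) cards.
p_four_aces = Fraction(1)
for drawn in range(4):
    p_four_aces *= Fraction(aces - drawn, deck - drawn)

# Cross-check by counting: one favourable 4-card hand out of C(52, 4).
p_by_counting = Fraction(1, comb(deck, 4))

print(p_four_aces, p_by_counting)  # 1/270725 1/270725
```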

Statement of the theorem and proof

Let $(\Omega, \mathcal{A}, P)$ be a probability space. Recall that the conditional probability of an $A \in \mathcal{A}$ given $B \in \mathcal{A}$ is defined as

\begin{align}
P(A \mid B) := \begin{cases}
\dfrac{P(A \cap B)}{P(B)}, & P(B) > 0, \\
0, & P(B) = 0.
\end{cases}
\end{align}

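Then we have the following theorem: for events $A_1, \ldots, A_n \in \mathcal{A}$,

$$P\left(A_1 \cap \ldots \cap A_n\right) = P(A_1) \prod_{j=2}^{n} P\left(A_j \mid A_1 \cap \ldots \cap A_{j-1}\right).$$

The formula follows by recursion from the definition of the conditional probability: $P(A_1) P(A_2 \mid A_1) = P(A_1 \cap A_2)$, then $P(A_1 \cap A_2) P(A_3 \mid A_1 \cap A_2) = P(A_1 \cap A_2 \cap A_3)$, and so on up to $A_n$.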

Chain rule for discrete random variables

Two random variables

For two discrete random variables $X, Y$, we use the events $A := \{X = x\}$ and $B := \{Y = y\}$ in the definition above, and find the joint distribution as

$$P(X = x, Y = y) = P(X = x \mid Y = y)\,P(Y = y),$$

or

$$P_{(X,Y)}(x, y) = P_{X \mid Y}(x \mid y)\,P_Y(y),$$

where $P_X(x) := P(X = x)$ is the probability distribution of $X$ and $P_{X \mid Y}(x \mid y)$ is the conditional probability distribution of $X$ given $Y$.
Finitely many random variables

Let $X_1, \ldots, X_n$ be random variables and $x_1, \ldots, x_n \in \mathbb{R}$. By the definition of the conditional probability,

$$P\left(X_n = x_n, \ldots, X_1 = x_1\right) = P\left(X_n = x_n \mid X_{n-1} = x_{n-1}, \ldots, X_1 = x_1\right) P\left(X_{n-1} = x_{n-1}, \ldots, X_1 = x_1\right)$$

and using the chain rule, where we set $A_k := \{X_k = x_k\}$, we can find the joint distribution as

\begin{align}
P\left(X_1 = x_1, \ldots, X_n = x_n\right)
&= P\left(X_n = x_n \mid X_1 = x_1, \ldots, X_{n-1} = x_{n-1}\right) P\left(X_1 = x_1, \ldots, X_{n-1} = x_{n-1}\right) \\
&= P(X_1 = x_1) P(X_2 = x_2 \mid X_1 = x_1) P(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2) \cdot \ldots \\
&\quad \cdot P(X_n = x_n \mid X_1 = x_1, \ldots, X_{n-1} = x_{n-1}).
\end{align}

Example

For $n = 3$, i.e. considering three random variables, the chain rule reads

\begin{align}
P_{(X_1,X_2,X_3)}(x_1, x_2, x_3)
&= P(X_1 = x_1, X_2 = x_2, X_3 = x_3) \\
&= P(X_3 = x_3 \mid X_2 = x_2, X_1 = x_1) P(X_2 = x_2, X_1 = x_1) \\
&= P(X_3 = x_3 \mid X_2 = x_2, X_1 = x_1) P(X_2 = x_2 \mid X_1 = x_1) P(X_1 = x_1) \\
&= P_{X_3 \mid X_2, X_1}(x_3 \mid x_2, x_1) P_{X_2 \mid X_1}(x_2 \mid x_1) P_{X_1}(x_1).
\end{align}
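Read from right to left, the three-variable formula also shows how a joint distribution can be assembled from conditional tables, which is how Bayesian networks use it. The following sketch assumes three binary variables. All table values and the names `p_x1`, `p_x2_given_x1`, `p_x3_given_x1x2` and `joint` are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical conditional tables for binary variables X1, X2, X3.
p_x1 = {0: F(1, 2), 1: F(1, 2)}
p_x2_given_x1 = {                      # indexed as [x1][x2]
    0: {0: F(3, 4), 1: F(1, 4)},
    1: {0: F(1, 3), 1: F(2, 3)},
}
p_x3_given_x1x2 = {                    # indexed as [(x1, x2)][x3]
    (0, 0): {0: F(1, 2), 1: F(1, 2)},
    (0, 1): {0: F(1, 5), 1: F(4, 5)},
    (1, 0): {0: F(2, 3), 1: F(1, 3)},
    (1, 1): {0: F(1, 10), 1: F(9, 10)},
}

def joint(x1, x2, x3):
    """P(X1=x1, X2=x2, X3=x3) via the chain rule for n = 3."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x1x2[(x1, x2)][x3]

# Each conditional table sums to 1, so the assembled joint pmf does too.
total = sum(joint(*xs) for xs in product((0, 1), repeat=3))
print(joint(1, 1, 1), total)  # 3/10 1
```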

Bibliography

- - , p. 496.

Notes and References

[1] Schilling, René L. (2021). Measure, Integral, Probability & Processes - Probab(ilistical)ly the Theoretical Minimum. Technische Universität Dresden, Germany. ISBN 979-8-5991-0488-9. p. 136ff.
[2] Schum, David A. (1994). The Evidential Foundations of Probabilistic Reasoning. Northwestern University Press. ISBN 978-0-8101-1821-8. p. 49.
[3] Klugh, Henry E. (2013). Statistics: The Essentials for Research (3rd ed.). Psychology Press. ISBN 978-1-134-92862-0. p. 149.
[4] Virtue, Pat. 10-606: Mathematical Foundations for Machine Learning. Web site.