In Bayesian probability theory, if, given a likelihood function p(x \mid \theta), the posterior distribution p(\theta \mid x) is in the same probability distribution family as the prior probability distribution p(\theta), the prior and posterior are then called conjugate distributions with respect to that likelihood function, and the prior is called a conjugate prior for the likelihood function p(x \mid \theta).
A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise, numerical integration may be necessary. Further, conjugate priors may give intuition by more transparently showing how a likelihood function updates a prior distribution.
The concept, as well as the term "conjugate prior", were introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory.[1] A similar concept had been discovered independently by George Alfred Barnard.[2]
The form of the conjugate prior can generally be determined by inspection of the probability density or probability mass function of a distribution. For example, consider a random variable which consists of the number of successes s in n Bernoulli trials with unknown probability of success q in [0, 1]. This random variable follows the binomial distribution, with a probability mass function of the form

p(s) = \binom{n}{s} q^s (1-q)^{n-s}
The usual conjugate prior is the beta distribution with parameters (\alpha, \beta):

p(q) = \frac{q^{\alpha-1} (1-q)^{\beta-1}}{\Beta(\alpha, \beta)},

where \alpha and \beta are chosen to reflect any existing belief or information (\alpha = 1 and \beta = 1 would give a uniform distribution) and \Beta(\alpha, \beta) is the Beta function acting as a normalising constant. In this context, \alpha and \beta are called hyperparameters (parameters of the prior), to distinguish them from parameters of the underlying model (here q).
If we sample this random variable and get s successes and f = n - s failures, then we have

\begin{align}
P(s, f \mid q = x) &= \binom{s+f}{s} x^s (1-x)^f, \\
P(q = x) &= \frac{x^{\alpha-1} (1-x)^{\beta-1}}{\Beta(\alpha, \beta)}, \\
P(q = x \mid s, f) &= \frac{P(s, f \mid x)\, P(x)}{\int P(s, f \mid y)\, P(y)\, dy} \\
&= \frac{\binom{s+f}{s} x^{s+\alpha-1} (1-x)^{f+\beta-1} / \Beta(\alpha, \beta)}{\int_{y=0}^{1} \binom{s+f}{s} y^{s+\alpha-1} (1-y)^{f+\beta-1} / \Beta(\alpha, \beta)\, dy} \\
&= \frac{x^{s+\alpha-1} (1-x)^{f+\beta-1}}{\Beta(s+\alpha, f+\beta)},
\end{align}

which is another Beta distribution with parameters (\alpha + s, \beta + f). This posterior can then be used as the prior when processing further samples, with the hyperparameters simply absorbing each extra piece of information as it arrives.
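As a concrete check of this update rule, here is a minimal Python sketch (the values of \alpha, \beta, s and f are made up for illustration) that compares the closed-form Beta posterior with a numerically normalised product of likelihood and prior:

```python
import numpy as np
from scipy import stats

alpha, beta = 2.0, 3.0   # illustrative prior hyperparameters
s, f = 7, 3              # observed successes and failures

# Closed-form conjugate update: Beta(alpha + s, beta + f)
posterior = stats.beta(alpha + s, beta + f)

# Numerical check: normalise likelihood * prior on a grid
x = np.linspace(1e-6, 1 - 1e-6, 10_000)
unnorm = stats.binom.pmf(s, s + f, x) * stats.beta.pdf(x, alpha, beta)
numeric = unnorm / (unnorm.sum() * (x[1] - x[0]))

# Maximum pointwise discrepancy is tiny (grid error only)
print(np.abs(numeric - posterior.pdf(x)).max())
```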
It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of pseudo-observations with properties specified by the parameters. For example, the values \alpha and \beta of a beta distribution can be thought of as corresponding to \alpha - 1 successes and \beta - 1 failures if the posterior mode is used to choose an optimal parameter setting, or \alpha successes and \beta failures if the posterior mean is used: the mode of \operatorname{Beta}(\alpha, \beta) is (\alpha - 1)/(\alpha + \beta - 2), while its mean is \alpha/(\alpha + \beta). This pseudo-observation reading can help provide intuition behind the often messy update equations and help choose reasonable hyperparameters for a prior.
One can think of conditioning on conjugate priors as defining a kind of (discrete-time) dynamical system: from a given set of hyperparameters, incoming data update these hyperparameters, so one can see the change in hyperparameters as a kind of "time evolution" of the system, corresponding to "learning". Starting at different points yields different flows over time. This is again analogous to the dynamical system defined by a linear operator, but note that since different samples lead to different inferences, the state depends not simply on time but on the data observed over time. For related approaches, see Recursive Bayesian estimation and Data assimilation.
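A minimal sketch of this "hyperparameter flow" for the beta-Bernoulli model above (the coin-flip data are made up): each observation advances the state (\alpha, \beta) by one step, and the trajectory ends at the same point as a single batch update.

```python
def update(alpha, beta, success):
    """One step of the beta-Bernoulli 'dynamical system': a success
    increments alpha, a failure increments beta."""
    return (alpha + 1, beta) if success else (alpha, beta + 1)

state = (1, 1)                 # uniform prior as the starting point
data = [1, 0, 1, 1, 0, 1]      # made-up sequence of successes/failures

trajectory = [state]
for outcome in data:
    state = update(*state, outcome)
    trajectory.append(state)

print(trajectory)   # the "time evolution" of the hyperparameters
print(state)        # (5, 3) = (1 + 4 successes, 1 + 2 failures), same as a batch update
```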
Suppose a rental car service operates in your city. Drivers can drop off and pick up cars anywhere inside the city limits. You can find and rent cars using an app.
Suppose you wish to find the probability that you can find a rental car within a short distance of your home address at any time of day.
Over three days you look at the app and find the following number of cars within a short distance of your home address:
x=[3,4,1]
Suppose we assume the data come from a Poisson distribution. In that case, we can compute the maximum likelihood estimate of the parameter of the model, which is

\lambda = \frac{3 + 4 + 1}{3} \approx 2.67.

Using this maximum likelihood estimate, we can compute the probability that there will be at least one car available on a given day:

p(x > 0 \mid \lambda \approx 2.67) = 1 - p(x = 0 \mid \lambda \approx 2.67) = 1 - \frac{2.67^0 e^{-2.67}}{0!} \approx 0.93.

This is the Poisson distribution that is the most likely to have generated the observed data x. But the data could also have come from another Poisson distribution, e.g., one with \lambda = 3, or \lambda = 2, etc. In fact, there is an infinite number of Poisson distributions that could have generated the observed data, and with relatively few data points we should be quite uncertain about which one is correct. Intuitively we should instead take a weighted average of the probability p(x > 0 \mid \lambda) for each possible \lambda, weighted by how likely each \lambda is given the data x we have observed.
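A minimal sketch of the plug-in computation above (using scipy's Poisson distribution; the data are the three counts from the app):

```python
import numpy as np
from scipy import stats

x = np.array([3, 4, 1])        # cars seen near home on each of three days
lam_mle = x.mean()             # Poisson MLE: the sample mean, 8/3 ~ 2.67

# Plug-in probability of at least one available car
p_at_least_one = 1 - stats.poisson.pmf(0, lam_mle)
print(lam_mle, p_at_least_one)  # ~2.67, ~0.93
```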
This weighted average is known as the posterior predictive distribution

p(\tilde{x} \mid x) = \int_\theta p(\tilde{x} \mid \theta)\, p(\theta \mid x)\, d\theta,

where \tilde{x} is a new data point, x is the observed data and \theta are the parameters of the model. Using Bayes' theorem we can expand

p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)},

therefore

p(\tilde{x} \mid x) = \int_\theta p(\tilde{x} \mid \theta)\, \frac{p(x \mid \theta)\, p(\theta)}{p(x)}\, d\theta.

Generally, this integral is hard to compute. However, if we choose a conjugate prior distribution p(\theta), a closed-form expression can be derived.
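When no conjugate closed form is available, this integral can be approximated by Monte Carlo: draw parameter values from the posterior and average p(\tilde{x} \mid \theta) over the draws. A sketch for the rental-car data, assuming the Gamma prior with \alpha = \beta = 2 chosen just below (shape-rate form, so the posterior is Gamma(10, 5)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Posterior over the Poisson rate: Gamma(alpha + sum(x), beta + n) = Gamma(10, 5)
lam = rng.gamma(shape=10.0, scale=1 / 5.0, size=100_000)

# Monte Carlo average of p(x > 0 | lambda) over posterior draws
print(np.mean(1 - stats.poisson.pmf(0, lam)))   # ~0.84
```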
Returning to our example, if we pick the Gamma distribution as our prior distribution over the rate of the Poisson distributions, then the posterior predictive is the negative binomial distribution, as can be seen from the table below. The Gamma distribution is parameterized by two hyperparameters \alpha, \beta, which we have to choose. By looking at plots of the Gamma distribution, we pick \alpha = \beta = 2, which seems a reasonable prior for the average number of cars. The choice of prior hyperparameters is inherently subjective and based on prior knowledge.
Given the prior hyperparameters \alpha and \beta, we can compute the posterior hyperparameters

\alpha' = \alpha + \sum_i x_i = 2 + (3 + 4 + 1) = 10 \quad\text{and}\quad \beta' = \beta + n = 2 + 3 = 5.

Given the posterior hyperparameters, we can finally compute the posterior predictive probability of at least one car being available:

p(\tilde{x} > 0 \mid x) = 1 - p(\tilde{x} = 0 \mid x) = 1 - \operatorname{NB}\left(0 \mid 10, \tfrac{5}{6}\right) \approx 0.84.
This much more conservative estimate reflects the uncertainty in the model parameters, which the posterior predictive takes into account.
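The same ≈ 0.84 follows from the closed form directly; a minimal sketch using scipy's nbinom, whose (n, p) convention gives p(\tilde{x} = 0) = p^n:

```python
from scipy import stats

x = [3, 4, 1]
alpha, beta = 2, 2                    # Gamma(2, 2) prior hyperparameters
alpha_post = alpha + sum(x)           # 10
beta_post = beta + len(x)             # 5

# Posterior predictive NB(alpha', beta'/(1 + beta')): probability of no cars
p_zero = stats.nbinom.pmf(0, alpha_post, beta_post / (1 + beta_post))
print(1 - p_zero)                     # ~0.84, vs. ~0.93 from the plug-in MLE
```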
Let n denote the number of observations. In all cases below, the data is assumed to consist of n points x_1, \ldots, x_n (each of which may be a random vector in the multivariate cases).
If the likelihood function belongs to the exponential family, then a conjugate prior exists, often also in the exponential family; see Exponential family: Conjugate distributions.
When the likelihood function is a discrete distribution:

| Likelihood p(x_i \mid \theta) | Model parameters \theta | Conjugate prior (and posterior) distribution | Prior hyperparameters \Theta | Posterior hyperparameters \Theta' | Interpretation of hyperparameters | Posterior predictive p(\tilde{x} \mid \Theta') |
|---|---|---|---|---|---|---|
| Bernoulli | p (probability) | Beta | \alpha, \beta \in \mathbb{R} | \alpha + \sum_i x_i, \; \beta + n - \sum_i x_i | \alpha successes, \beta failures | p(\tilde{x} = 1) = \frac{\alpha'}{\alpha' + \beta'} (Bernoulli) |
| Binomial with known number of trials, m | p (probability) | Beta | \alpha, \beta \in \mathbb{R} | \alpha + \sum_i x_i, \; \beta + mn - \sum_i x_i | \alpha successes, \beta failures | \operatorname{BetaBin}(\tilde{x} \mid \alpha', \beta') (beta-binomial) |
| Negative binomial with known failure number, r | p (probability) | Beta | \alpha, \beta \in \mathbb{R} | \alpha + rn, \; \beta + \sum_i x_i | \alpha total successes, \beta total failures | \operatorname{BetaNegBin}(\tilde{x} \mid \alpha', \beta') (beta-negative binomial) |
| Poisson | \lambda (rate) | Gamma | k, \theta \in \mathbb{R} (shape, scale) | k + \sum_i x_i, \; \frac{\theta}{n\theta + 1} | k total occurrences in 1/\theta intervals | \operatorname{NB}\left(\tilde{x} \mid k', \frac{1}{\theta' + 1}\right) (negative binomial) |
| Poisson | \lambda (rate) | Gamma | \alpha, \beta \in \mathbb{R} (shape, rate) | \alpha + \sum_i x_i, \; \beta + n | \alpha total occurrences in \beta intervals | \operatorname{NB}\left(\tilde{x} \mid \alpha', \frac{\beta'}{\beta' + 1}\right) (negative binomial) |
| Categorical | p (probability vector), k (number of categories; i.e., size of p) | Dirichlet | \boldsymbol\alpha \in \mathbb{R}^k | \boldsymbol\alpha + (c_1, \ldots, c_k), where c_i is the number of observations in category i | \alpha_i occurrences of category i | p(\tilde{x} = i) = \frac{\alpha_i'}{\sum_j \alpha_j'} (categorical) |
| Multinomial | p (probability vector), k (number of categories; i.e., size of p) | Dirichlet | \boldsymbol\alpha \in \mathbb{R}^k | \boldsymbol\alpha + \sum_i \mathbf{x}_i | \alpha_i occurrences of category i | \operatorname{DirMult}(\tilde{\mathbf{x}} \mid \boldsymbol\alpha') (Dirichlet-multinomial) |
| Hypergeometric with known total population size, N | M (number of target members) | Beta-binomial | n = N, \alpha, \beta | \alpha + \sum_i x_i, \; \beta + \sum_i n_i - \sum_i x_i | \alpha successes, \beta failures | |
| Geometric | p_0 (probability) | Beta | \alpha, \beta \in \mathbb{R} | \alpha + n, \; \beta + \sum_i x_i | \alpha experiments, \beta total failures | |
When the likelihood function is a continuous distribution:

| Likelihood p(x_i \mid \theta) | Model parameters \theta | Conjugate prior (and posterior) distribution | Prior hyperparameters \Theta | Posterior hyperparameters \Theta' | Interpretation of hyperparameters | Posterior predictive p(\tilde{x} \mid \Theta') |
|---|---|---|---|---|---|---|
| Normal with known variance \sigma^2 | \mu (mean) | Normal | \mu_0, \sigma_0^2 | \left(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_i x_i}{\sigma^2}\right) \Big/ \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right), \; \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1} | mean was estimated from observations with total precision (sum of all individual precisions) 1/\sigma_0^2 and with sample mean \mu_0 | \mathcal{N}(\tilde{x} \mid \mu_0', {\sigma_0^2}' + \sigma^2) |
| Normal with known precision \tau | \mu (mean) | Normal | \mu_0, \tau_0^{-1} | \frac{\tau_0 \mu_0 + \tau \sum_i x_i}{\tau_0 + n\tau}, \; \left(\tau_0 + n\tau\right)^{-1} | mean was estimated from observations with total precision (sum of all individual precisions) \tau_0 and with sample mean \mu_0 | \mathcal{N}\left(\tilde{x} \mid \mu_0', \frac{1}{\tau_0'} + \frac{1}{\tau}\right) |
| Normal with known mean \mu | \sigma^2 (variance) | Inverse gamma | \alpha, \beta | \alpha + \frac{n}{2}, \; \beta + \frac{\sum_i (x_i - \mu)^2}{2} | variance was estimated from 2\alpha observations with sample mean \mu and sum of squared deviations 2\beta (i.e., sample variance \beta/\alpha) | t_{2\alpha'}(\tilde{x} \mid \mu, \sigma^2 = \beta'/\alpha') |
| Normal with known mean \mu | \sigma^2 (variance) | Scaled inverse chi-squared | \nu, \sigma_0^2 | \nu + n, \; \frac{\nu \sigma_0^2 + \sum_i (x_i - \mu)^2}{\nu + n} | variance was estimated from \nu observations with sample variance \sigma_0^2 | t_{\nu'}(\tilde{x} \mid \mu, {\sigma_0^2}') |
| Normal with known mean \mu | \tau (precision) | Gamma | \alpha, \beta | \alpha + \frac{n}{2}, \; \beta + \frac{\sum_i (x_i - \mu)^2}{2} | precision was estimated from 2\alpha observations with sample mean \mu and sum of squared deviations 2\beta (i.e., sample variance \beta/\alpha) | t_{2\alpha'}(\tilde{x} \mid \mu, \sigma^2 = \beta'/\alpha') |
| Normal[3] | \mu and \sigma^2, assuming exchangeability | Normal-inverse gamma | \mu_0, \nu, \alpha, \beta | \frac{\nu \mu_0 + n\bar{x}}{\nu + n}, \; \nu + n, \; \alpha + \frac{n}{2}, \; \beta + \tfrac{1}{2}\sum_i (x_i - \bar{x})^2 + \frac{n\nu}{\nu + n} \cdot \frac{(\bar{x} - \mu_0)^2}{2}, where \bar{x} is the sample mean | mean was estimated from \nu observations with sample mean \mu_0; variance was estimated from 2\alpha observations with sample mean \mu_0 and sum of squared deviations 2\beta | t_{2\alpha'}\left(\tilde{x} \mid \mu', \frac{\beta'(\nu' + 1)}{\nu'\alpha'}\right) |
| Normal | \mu and \tau, assuming exchangeability | Normal-gamma | \mu_0, \nu, \alpha, \beta | \frac{\nu \mu_0 + n\bar{x}}{\nu + n}, \; \nu + n, \; \alpha + \frac{n}{2}, \; \beta + \tfrac{1}{2}\sum_i (x_i - \bar{x})^2 + \frac{n\nu}{\nu + n} \cdot \frac{(\bar{x} - \mu_0)^2}{2}, where \bar{x} is the sample mean | mean was estimated from \nu observations with sample mean \mu_0; precision was estimated from 2\alpha observations with sample mean \mu_0 and sum of squared deviations 2\beta | t_{2\alpha'}\left(\tilde{x} \mid \mu', \frac{\beta'(\nu' + 1)}{\alpha'\nu'}\right) |
| Multivariate normal with known covariance matrix \boldsymbol\Sigma | \boldsymbol\mu (mean vector) | Multivariate normal | \boldsymbol\mu_0, \boldsymbol\Sigma_0 | \left(\boldsymbol\Sigma_0^{-1} + n\boldsymbol\Sigma^{-1}\right)^{-1} \left(\boldsymbol\Sigma_0^{-1}\boldsymbol\mu_0 + n\boldsymbol\Sigma^{-1}\bar{\mathbf{x}}\right), \; \left(\boldsymbol\Sigma_0^{-1} + n\boldsymbol\Sigma^{-1}\right)^{-1}, where \bar{\mathbf{x}} is the sample mean | mean was estimated from observations with total precision (sum of all individual precisions) \boldsymbol\Sigma_0^{-1} and with sample mean \boldsymbol\mu_0 | \mathcal{N}(\tilde{\mathbf{x}} \mid \boldsymbol\mu_0', \boldsymbol\Sigma_0' + \boldsymbol\Sigma) |
| Multivariate normal with known precision matrix \boldsymbol\Lambda | \boldsymbol\mu (mean vector) | Multivariate normal | \boldsymbol\mu_0, \boldsymbol\Lambda_0 | \left(\boldsymbol\Lambda_0 + n\boldsymbol\Lambda\right)^{-1} \left(\boldsymbol\Lambda_0\boldsymbol\mu_0 + n\boldsymbol\Lambda\bar{\mathbf{x}}\right), \; \boldsymbol\Lambda_0 + n\boldsymbol\Lambda, where \bar{\mathbf{x}} is the sample mean | mean was estimated from observations with total precision (sum of all individual precisions) \boldsymbol\Lambda_0 and with sample mean \boldsymbol\mu_0 | \mathcal{N}\left(\tilde{\mathbf{x}} \mid \boldsymbol\mu_0', {\boldsymbol\Lambda_0'}^{-1} + \boldsymbol\Lambda^{-1}\right) |
| Multivariate normal with known mean \boldsymbol\mu | \boldsymbol\Sigma (covariance matrix) | Inverse-Wishart | \nu, \boldsymbol\Psi | n + \nu, \; \boldsymbol\Psi + \sum_i (\mathbf{x}_i - \boldsymbol\mu)(\mathbf{x}_i - \boldsymbol\mu)^T | covariance matrix was estimated from \nu observations with sum of pairwise deviation products \boldsymbol\Psi | t_{\nu' - p + 1}\left(\tilde{\mathbf{x}} \mid \boldsymbol\mu, \frac{\boldsymbol\Psi'}{\nu' - p + 1}\right) |
| Multivariate normal with known mean \boldsymbol\mu | \boldsymbol\Lambda (precision matrix) | Wishart | \nu, \mathbf{V} | n + \nu, \; \left(\mathbf{V}^{-1} + \sum_i (\mathbf{x}_i - \boldsymbol\mu)(\mathbf{x}_i - \boldsymbol\mu)^T\right)^{-1} | covariance matrix was estimated from \nu observations with sum of pairwise deviation products \mathbf{V}^{-1} | t_{\nu' - p + 1}\left(\tilde{\mathbf{x}} \mid \boldsymbol\mu, \frac{{\mathbf{V}'}^{-1}}{\nu' - p + 1}\right) |
| Multivariate normal | \boldsymbol\mu (mean vector) and \boldsymbol\Sigma (covariance matrix) | Normal-inverse-Wishart | \boldsymbol\mu_0, \kappa_0, \nu_0, \boldsymbol\Psi | \frac{\kappa_0\boldsymbol\mu_0 + n\bar{\mathbf{x}}}{\kappa_0 + n}, \; \kappa_0 + n, \; \nu_0 + n, \; \boldsymbol\Psi + \mathbf{C} + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{\mathbf{x}} - \boldsymbol\mu_0)(\bar{\mathbf{x}} - \boldsymbol\mu_0)^T, where \mathbf{C} = \sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T | mean was estimated from \kappa_0 observations with sample mean \boldsymbol\mu_0; covariance matrix was estimated from \nu_0 observations with sample mean \boldsymbol\mu_0 and sum of pairwise deviation products \boldsymbol\Psi = \nu_0\boldsymbol\Sigma_0 | t_{\nu_0' - p + 1}\left(\tilde{\mathbf{x}} \mid \boldsymbol\mu', \frac{\kappa_0' + 1}{\kappa_0'(\nu_0' - p + 1)}\boldsymbol\Psi'\right) |
| Multivariate normal | \boldsymbol\mu (mean vector) and \boldsymbol\Lambda (precision matrix) | Normal-Wishart | \boldsymbol\mu_0, \kappa_0, \nu_0, \mathbf{V} | \frac{\kappa_0\boldsymbol\mu_0 + n\bar{\mathbf{x}}}{\kappa_0 + n}, \; \kappa_0 + n, \; \nu_0 + n, \; \left(\mathbf{V}^{-1} + \mathbf{C} + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{\mathbf{x}} - \boldsymbol\mu_0)(\bar{\mathbf{x}} - \boldsymbol\mu_0)^T\right)^{-1}, where \mathbf{C} = \sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T | mean was estimated from \kappa_0 observations with sample mean \boldsymbol\mu_0; covariance matrix was estimated from \nu_0 observations with sample mean \boldsymbol\mu_0 and sum of pairwise deviation products \mathbf{V}^{-1} | t_{\nu_0' - p + 1}\left(\tilde{\mathbf{x}} \mid \boldsymbol\mu', \frac{\kappa_0' + 1}{\kappa_0'(\nu_0' - p + 1)}{\mathbf{V}'}^{-1}\right) |
| Uniform U(0, \theta) | \theta | Pareto | x_m, k | \max\{x_1, \ldots, x_n, x_m\}, \; k + n | k observations with maximum x_m | |
| Pareto with known minimum x_m | k (shape) | Gamma | \alpha, \beta | \alpha + n, \; \beta + \sum_i \ln\frac{x_i}{x_m} | \alpha observations with sum \beta of the order of magnitude of each observation (i.e., the logarithm of its ratio to the minimum x_m) | |
| Weibull with known shape \beta | \theta (scale) | Inverse gamma | a, b | a + n, \; b + \sum_i x_i^\beta | a observations with sum b of the \beta'th power of each observation | |
| Log-normal | | Same as for the normal distribution after applying the natural logarithm to the data for the posterior hyperparameters; see the normal rows above for details. | | | | |
| Exponential | \lambda (rate) | Gamma | \alpha, \beta | \alpha + n, \; \beta + \sum_i x_i | \alpha observations that sum to \beta | \operatorname{Lomax}(\tilde{x} \mid \beta', \alpha') (Lomax distribution) |
| Gamma with known shape \alpha | \beta (rate) | Gamma | \alpha_0, \beta_0 | \alpha_0 + n\alpha, \; \beta_0 + \sum_i x_i | \alpha_0/\alpha observations with sum \beta_0 | \operatorname{CG}(\tilde{x} \mid \alpha, \alpha_0', 1, \beta_0') (compound gamma) |
| Inverse gamma with known shape \alpha | \beta (inverse scale) | Gamma | \alpha_0, \beta_0 | \alpha_0 + n\alpha, \; \beta_0 + \sum_i x_i^{-1} | \alpha_0/\alpha observations with sum \beta_0 | |
| Gamma with known rate \beta | \alpha (shape) | \propto \frac{a^{\alpha - 1}\,\beta^{\alpha c}}{\Gamma(\alpha)^b} | a, b, c | a\prod_i x_i, \; b + n, \; c + n | b or c observations (b when estimating \alpha, c when estimating \beta) with product a | |
| Gamma | \alpha (shape), \beta (inverse scale) | \propto \frac{p^{\alpha - 1}\,e^{-\beta q}}{\Gamma(\alpha)^r\,\beta^{-\alpha s}} | p, q, r, s | p\prod_i x_i, \; q + \sum_i x_i, \; r + n, \; s + n | \alpha was estimated from r observations with product p; \beta was estimated from s observations with sum q | |
| Beta | \alpha, \beta | \propto \frac{\Gamma(\alpha + \beta)^k\, p^\alpha\, q^\beta}{\Gamma(\alpha)^k\,\Gamma(\beta)^k} | p, q, k | p\prod_i x_i, \; q\prod_i (1 - x_i), \; k + n | \alpha and \beta were estimated from k observations with product p (for \alpha) and product q of the complements 1 - x_i (for \beta) | |
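As an illustration of how one row of the table is applied, here is a minimal sketch of the normal-with-known-variance row (all numbers are made up): the prior (\mu_0, \sigma_0^2) is updated in precision form, and the posterior predictive is again normal with variance {\sigma_0^2}' + \sigma^2.

```python
import numpy as np

def update_normal_known_variance(mu0, s0sq, x, ssq):
    """Conjugate update for a normal likelihood with known variance ssq
    and a normal prior N(mu0, s0sq) on the mean."""
    prec = 1 / s0sq + len(x) / ssq                  # posterior precision
    mu_post = (mu0 / s0sq + np.sum(x) / ssq) / prec
    return mu_post, 1 / prec                        # posterior mean, variance

# Made-up data with known observation variance sigma^2 = 4
x = np.array([4.8, 5.6, 5.1, 4.3])
mu_post, var_post = update_normal_known_variance(mu0=0.0, s0sq=100.0, x=x, ssq=4.0)

print(mu_post, var_post)   # posterior hyperparameters mu0', s0sq'
print(var_post + 4.0)      # variance of the posterior predictive
```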