The forward–backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of observations/emissions o_{1:T} := o_1, \dots, o_T, i.e. it computes, for all hidden state variables X_t \in \{X_1, \dots, X_T\}, the distribution P(X_t \mid o_{1:T}).
The term forward–backward algorithm is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward–backward manner. In this sense, the descriptions in the remainder of this article refer only to one specific instance of this class.
In the first pass, the forward–backward algorithm computes a set of forward probabilities which provide, for all t \in \{1, \dots, T\}, the probability of ending up in any particular state given the first t observations in the sequence, i.e. P(X_t \mid o_{1:t}). In the second pass, the algorithm computes a set of backward probabilities which provide the probability of observing the remaining observations given any starting point t, i.e. P(o_{t+1:T} \mid X_t). These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time given the entire observation sequence:

P(X_t \mid o_{1:T}) = P(X_t \mid o_{1:t}, o_{t+1:T}) \propto P(o_{t+1:T} \mid X_t) P(X_t \mid o_{1:t})

The last step follows from an application of Bayes' rule and the conditional independence of o_{t+1:T} and o_{1:t} given X_t.
As outlined above, the algorithm involves three steps:

1. computing forward probabilities
2. computing backward probabilities
3. computing smoothed values
The forward and backward steps may also be called "forward message pass" and "backward message pass" - these terms are due to the message-passing used in general belief propagation approaches. At each single observation in the sequence, probabilities to be used for calculations at the next observation are computed. The smoothing step can be calculated simultaneously during the backward pass. This step allows the algorithm to take into account any past observations of output for computing more accurate results.
The forward–backward algorithm can be used to find the most likely state for any point in time. It cannot, however, be used to find the most likely sequence of states (see Viterbi algorithm).
The following description will use matrices of probability values rather than probability distributions, although in general the forward-backward algorithm can be applied to continuous as well as discrete probability models.
We transform the probability distributions related to a given hidden Markov model into matrix notation as follows. The transition probabilities P(X_t \mid X_{t-1}) of a single random variable X_t representing all possible states in the hidden Markov model are represented by the matrix T, where the column index j represents the target state and the row index i represents the start state. A transition from row-vector state \pi_t to the row-vector state \pi_{t+1} is written as

\pi_{t+1} = \pi_t T

The example below represents a system in which the probability of staying in the same state after each step is 70% and the probability of transitioning to the other state is 30%. The transition matrix is then:

T=\begin{pmatrix} 0.7&0.3\\ 0.3&0.7 \end{pmatrix}
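As a minimal plain-Python sketch (the function name step is illustrative, not from the original), the state-vector update \pi_{t+1} = \pi_t T is just a row-vector-times-matrix product:

```python
# Transition matrix from the example above: 70% chance of staying in
# the same state after each step, 30% chance of switching.
T = [[0.7, 0.3],
     [0.3, 0.7]]

def step(pi, T):
    """One Markov step: pi_{t+1} = pi_t T (row vector times matrix)."""
    n = len(T)
    return [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]

pi = [1.0, 0.0]     # certain that we start in state 1
print(step(pi, T))  # [0.7, 0.3]
```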
In a typical Markov model, we would multiply a state vector by this matrix to obtain the probabilities for the subsequent state. In a hidden Markov model the state is unknown, and we instead observe events associated with the possible states. An event matrix of the form:
B=\begin{pmatrix} 0.9&0.1\\ 0.2&0.8 \end{pmatrix}
provides the probabilities for observing events given a particular state. In the above example, event 1 will be observed 90% of the time if we are in state 1, while event 2 has a 10% probability of occurring in this state. In contrast, event 1 will only be observed 20% of the time if we are in state 2, and event 2 has an 80% chance of occurring. Given an arbitrary row-vector describing the state of the system (\pi), the probability of observing event j is then:

P(O=j) = \sum_i \pi_i B_{i,j}
The probability of a given state leading to the observed event j can be represented in matrix form by multiplying the state row-vector (\pi) with an observation matrix

O_j = \mathrm{diag}(B_{*,o_j})

containing only diagonal entries. Continuing the above example, the observation matrix for event 1 would be:

O_1 = \begin{pmatrix} 0.9&0.0\\ 0.0&0.2 \end{pmatrix}
This allows us to calculate the new unnormalized probabilities state vector \pi' through Bayes rule, weighting by the likelihood that each element of \pi generated event 1:

\pi' = \pi O_1
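A short sketch of this update in plain Python (the helper name observe is illustrative): multiplying by O_1 = diag(B_{*,1}) amounts to an element-wise product of the state vector with the first column of B, optionally followed by normalization:

```python
B = [[0.9, 0.1],   # emission probabilities: rows = states, columns = events
     [0.2, 0.8]]

def observe(pi, B, event):
    """Unnormalized update pi' = pi O_event, with O_event = diag(B[*, event])."""
    return [pi[i] * B[i][event] for i in range(len(pi))]

pi = [0.5, 0.5]
pi_unnorm = observe(pi, B, 0)              # event 1 (column index 0) observed
total = sum(pi_unnorm)                     # 0.55
pi_norm = [p / total for p in pi_unnorm]
print(pi_unnorm)  # [0.45, 0.1]
```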
We can now make this general procedure specific to our series of observations. Assuming an initial state vector \pi_0, we begin with

f_{0:0} = \pi_0, \quad f_{0:1} = \pi_0 T O_1

This process can be carried forward with additional observations using:

f_{0:t} = f_{0:t-1} T O_t

This value is the forward unnormalized probability vector. The i'th entry of this vector provides:

f_{0:t}(i) = P(o_1, o_2, \dots, o_t, X_t = x_i \mid \pi_0)
Typically, we will normalize the probability vector at each step so that its entries sum to 1. A scaling factor is thus introduced at each step such that:

\hat{f}_{0:t} = c_t^{-1} \hat{f}_{0:t-1} T O_t

where \hat{f}_{0:t-1} represents the scaled vector from the previous step and c_t represents the scaling factor that causes the resulting vector's entries to sum to 1. The product of the scaling factors is the total probability for observing the given events irrespective of the final states:

P(o_1, o_2, \dots, o_t \mid \pi_0) = \prod_{s=1}^{t} c_s

This allows us to interpret the scaled probability vector as:

\hat{f}_{0:t}(i) = \frac{f_{0:t}(i)}{\prod_{s=1}^{t} c_s} = P(X_t = x_i \mid o_1, o_2, \dots, o_t, \pi_0)
We thus find that the product of the scaling factors provides us with the total probability for observing the given sequence up to time t and that the scaled probability vector provides us with the probability of being in each state at this time.
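The scaled forward recursion can be sketched in plain Python (illustrative names; T and B as in the examples above, observations given as column indices into B, here using the umbrella sequence from the worked example later in the article, 0 = umbrella, 1 = no umbrella):

```python
def forward(pi0, T, B, observations):
    """Scaled forward pass: returns normalized vectors f_hat[t] and
    the scaling factors c[t] for t = 1..len(observations)."""
    n = len(T)
    f_hat, c = [pi0], []
    for o in observations:
        # unnormalized update: f = f_hat[t-1] T O_o
        f = [sum(f_hat[-1][i] * T[i][j] for i in range(n)) * B[j][o]
             for j in range(n)]
        ct = sum(f)                       # scaling factor c_t
        c.append(ct)
        f_hat.append([x / ct for x in f])
    return f_hat, c

T = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]
f_hat, c = forward([0.5, 0.5], T, B, [0, 0, 1, 0, 0])
print([round(x, 4) for x in f_hat[1]])  # [0.8182, 0.1818]
# product of scaling factors = P(o_1, ..., o_T | pi_0)
p = 1.0
for ct in c:
    p *= ct
```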
A similar procedure can be constructed to find backward probabilities. These intend to provide the probabilities:

b_{t:T}(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \mid X_t = x_i)

That is, we now want to assume that we start in a particular state (X_t = x_i), and we are interested in the probability of observing all future events from this state. Since the initial state is assumed as given (i.e. the prior probability of this state = 100%), we begin with:

b_{T:T} = [1\ 1\ 1\ \dots]^T
Notice that we are now using a column vector while the forward probabilities used row vectors. We can then work backwards using:
b_{t-1:T} = T O_t b_{t:T}
While we could normalize this vector as well so that its entries sum to one, this is not usually done. Noting that each entry contains the probability of the future event sequence given a particular initial state, normalizing this vector would be equivalent to applying Bayes' theorem to find the likelihood of each initial state given the future events (assuming uniform priors for the final state vector). However, it is more common to scale this vector using the same c_t constants used in the forward probability calculation. b_{T:T} is not scaled, but subsequent iterations are scaled using:

\hat{b}_{t-1:T} = c_t^{-1} T O_t \hat{b}_{t:T}

where \hat{b}_{t:T} represents the previous scaled vector. The result is that the scaled probability vector is related to the backward probabilities by:

\hat{b}_{t:T}(i) = \frac{b_{t:T}(i)}{\prod_{s=t+1}^{T} c_s}
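Similarly, a plain-Python sketch of the backward pass (illustrative names), here normalizing each vector to sum to 1, as is done in the worked example later in the article, rather than reusing the forward c_t's:

```python
def backward(T, B, observations):
    """Backward pass: returns vectors b_hat[t] for t = 0..len(observations),
    each normalized to sum to 1 (except the all-ones base case)."""
    n = len(T)
    b_hat = [[1.0] * n]                   # b_{T:T} = (1, ..., 1)^T
    for o in reversed(observations):
        b = b_hat[0]
        # unnormalized update: b_{t-1:T} = T O_o b_{t:T}
        prev = [sum(T[i][j] * B[j][o] * b[j] for j in range(n))
                for i in range(n)]
        s = sum(prev)
        b_hat.insert(0, [x / s for x in prev])
    return b_hat

T = [[0.7, 0.3], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]
b_hat = backward(T, B, [0, 0, 1, 0, 0])
print([round(x, 4) for x in b_hat[0]])  # [0.6469, 0.3531]
```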
This is useful because it allows us to find the total probability of being in each state at a given time, t, by multiplying these values:

\gamma_t(i) = P(X_t = x_i \mid o_1, o_2, \dots, o_T, \pi_0) = \frac{P(o_1, o_2, \dots, o_T, X_t = x_i \mid \pi_0)}{P(o_1, o_2, \dots, o_T \mid \pi_0)} = \hat{f}_{0:t}(i) \cdot \hat{b}_{t:T}(i)

To understand this, we note that f_{0:t}(i) \cdot b_{t:T}(i) provides the probability for observing the given events in a way that passes through state x_i at time t. The values \gamma_t(i) thus provide the probability of being in each state at time t; as such, they are useful for determining the most probable state at any time. The term "most probable state" is somewhat ambiguous, however: while the most probable state is the most likely to be correct at a given point, the sequence of individually probable states is not likely to be the most probable sequence, because the probabilities for each point are calculated independently of each other. They do not take into account the correlations between states, since in general

P(X_t = x_i, X_{t+1} = x_j) \neq P(X_t = x_i) P(X_{t+1} = x_j)

The most probable sequence of states can instead be found with the Viterbi algorithm.
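Putting the pieces together (a plain-Python sketch; the forward and backward vectors are as defined above, and the numeric vectors used here are the t = 1 values from the umbrella example worked out later in the article), the smoothed value is the normalized element-wise product:

```python
def smooth(f_hat_t, b_hat_t):
    """gamma_t(i), proportional to f_hat_{0:t}(i) * b_hat_{t:T}(i)."""
    g = [f * b for f, b in zip(f_hat_t, b_hat_t)]
    s = sum(g)
    return [x / s for x in g]

# Forward and backward vectors for t = 1 of the umbrella example:
print([round(x, 4) for x in smooth([0.8182, 0.1818], [0.5923, 0.4077])])
# [0.8673, 0.1327]
```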
This example takes as its basis the umbrella world in Russell & Norvig 2010, Chapter 15, p. 567, in which we would like to infer the weather given observation of another person either carrying or not carrying an umbrella. We assume two possible states for the weather: state 1 = rain, state 2 = no rain. We assume that the weather has a 70% chance of staying the same each day and a 30% chance of changing. The transition probabilities are then:
T=\begin{pmatrix} 0.7&0.3\\ 0.3&0.7 \end{pmatrix}
We also assume each state generates one of two possible events: event 1 = umbrella, event 2 = no umbrella. The conditional probabilities for these occurring in each state are given by the probability matrix:
B=\begin{pmatrix} 0.9&0.1\\ 0.2&0.8 \end{pmatrix}
We then observe the following sequence of events: {umbrella, umbrella, no umbrella, umbrella, umbrella}, which we will represent in our calculations as:

O_1 = \begin{pmatrix}0.9&0.0\\ 0.0&0.2\end{pmatrix} ~~ O_2 = \begin{pmatrix}0.9&0.0\\ 0.0&0.2\end{pmatrix} ~~ O_3 = \begin{pmatrix}0.1&0.0\\ 0.0&0.8\end{pmatrix} ~~ O_4 = \begin{pmatrix}0.9&0.0\\ 0.0&0.2\end{pmatrix} ~~ O_5 = \begin{pmatrix}0.9&0.0\\ 0.0&0.2\end{pmatrix}

Note that O_3 differs from the others because of the "no umbrella" observation.
In computing the forward probabilities we begin with:

f_{0:0} = \begin{pmatrix}0.5&0.5\end{pmatrix}

which is our prior state vector indicating that we don't know which state the weather is in before our observations. While a state vector should be given as a row vector, we will use the transpose of the matrix so that the calculations below are easier to read. Our calculations are then written in the form:

(\hat{f}_{0:t})^T = c_t^{-1} O_t T^T (\hat{f}_{0:t-1})^T

instead of:

\hat{f}_{0:t} = c_t^{-1} \hat{f}_{0:t-1} T O_t

Notice that the transformation matrix is also transposed, but in our example the transpose is equal to the original matrix. Performing these calculations and normalizing the results provides:

(\hat{f}_{0:1})^T = c_1^{-1} O_1 T^T (\hat{f}_{0:0})^T = \begin{pmatrix}0.8182\\ 0.1818\end{pmatrix}
(\hat{f}_{0:2})^T = c_2^{-1} O_2 T^T (\hat{f}_{0:1})^T = \begin{pmatrix}0.8834\\ 0.1166\end{pmatrix}
(\hat{f}_{0:3})^T = c_3^{-1} O_3 T^T (\hat{f}_{0:2})^T = \begin{pmatrix}0.1907\\ 0.8093\end{pmatrix}
(\hat{f}_{0:4})^T = c_4^{-1} O_4 T^T (\hat{f}_{0:3})^T = \begin{pmatrix}0.7308\\ 0.2692\end{pmatrix}
(\hat{f}_{0:5})^T = c_5^{-1} O_5 T^T (\hat{f}_{0:4})^T = \begin{pmatrix}0.8673\\ 0.1327\end{pmatrix}
For the backward probabilities, we start with:

b_{5:5} = \begin{pmatrix}1.0\\ 1.0\end{pmatrix}

We are then able to compute (using the observations in reverse order and normalizing with different constants, each denoted \alpha):

\hat{b}_{4:5} = \alpha T O_5 \hat{b}_{5:5} = \begin{pmatrix}0.6273\\ 0.3727\end{pmatrix}
\hat{b}_{3:5} = \alpha T O_4 \hat{b}_{4:5} = \begin{pmatrix}0.6533\\ 0.3467\end{pmatrix}
\hat{b}_{2:5} = \alpha T O_3 \hat{b}_{3:5} = \begin{pmatrix}0.3763\\ 0.6237\end{pmatrix}
\hat{b}_{1:5} = \alpha T O_2 \hat{b}_{2:5} = \begin{pmatrix}0.5923\\ 0.4077\end{pmatrix}
\hat{b}_{0:5} = \alpha T O_1 \hat{b}_{1:5} = \begin{pmatrix}0.6469\\ 0.3531\end{pmatrix}
Finally, we will compute the smoothed probability values. These results must also be scaled so that their entries sum to 1 because we did not scale the backward probabilities with the c_t's found earlier. With \circ denoting the element-wise product and \alpha a normalizing constant, the smoothed vectors are:

(\gamma_0)^T = \alpha\begin{pmatrix}0.5000\\ 0.5000\end{pmatrix}\circ\begin{pmatrix}0.6469\\ 0.3531\end{pmatrix} = \alpha\begin{pmatrix}0.3235\\ 0.1765\end{pmatrix} = \begin{pmatrix}0.6469\\ 0.3531\end{pmatrix}
(\gamma_1)^T = \alpha\begin{pmatrix}0.8182\\ 0.1818\end{pmatrix}\circ\begin{pmatrix}0.5923\\ 0.4077\end{pmatrix} = \alpha\begin{pmatrix}0.4846\\ 0.0741\end{pmatrix} = \begin{pmatrix}0.8673\\ 0.1327\end{pmatrix}
(\gamma_2)^T = \alpha\begin{pmatrix}0.8834\\ 0.1166\end{pmatrix}\circ\begin{pmatrix}0.3763\\ 0.6237\end{pmatrix} = \alpha\begin{pmatrix}0.3324\\ 0.0728\end{pmatrix} = \begin{pmatrix}0.8204\\ 0.1796\end{pmatrix}
(\gamma_3)^T = \alpha\begin{pmatrix}0.1907\\ 0.8093\end{pmatrix}\circ\begin{pmatrix}0.6533\\ 0.3467\end{pmatrix} = \alpha\begin{pmatrix}0.1246\\ 0.2806\end{pmatrix} = \begin{pmatrix}0.3075\\ 0.6925\end{pmatrix}
(\gamma_4)^T = \alpha\begin{pmatrix}0.7308\\ 0.2692\end{pmatrix}\circ\begin{pmatrix}0.6273\\ 0.3727\end{pmatrix} = \alpha\begin{pmatrix}0.4584\\ 0.1003\end{pmatrix} = \begin{pmatrix}0.8204\\ 0.1796\end{pmatrix}
(\gamma_5)^T = \alpha\begin{pmatrix}0.8673\\ 0.1327\end{pmatrix}\circ\begin{pmatrix}1.0000\\ 1.0000\end{pmatrix} = \alpha\begin{pmatrix}0.8673\\ 0.1327\end{pmatrix} = \begin{pmatrix}0.8673\\ 0.1327\end{pmatrix}
Notice that the value of \gamma_0 is equal to \hat{b}_{0:5} and that \gamma_5 is equal to \hat{f}_{0:5}. This follows naturally because both \hat{f}_{0:5} and \hat{b}_{0:5} begin with uniform priors over the initial and final state vectors (respectively) and take into account all of the observations. However, \gamma_0 will only be equal to \hat{b}_{0:5} when our initial state vector represents a uniform prior (i.e. all entries are equal). When this is not the case, \hat{b}_{0:5} needs to be combined with the initial state vector to find the most likely initial state.
The calculations above reveal that the most probable weather state on every day except for the third one was "rain". They tell us more than this, however, as they now provide a way to quantify the probabilities of each state at different times. Perhaps most importantly, our value at \gamma_5 quantifies our knowledge of the state vector at the end of the observation sequence, which we can then use to predict the probability of the various weather states on the following day.
The forward–backward algorithm runs with time complexity O(S^2 T) in space O(ST), where T is the length of the sequence and S is the number of states. The algorithm can also run in constant space with time complexity O(S^2 T^2) by recomputing values at each step. For comparison, a brute-force procedure would generate all possible S^T state sequences and calculate the joint probability of each state sequence with the observed series of events, which has time complexity O(T \cdot S^T) and is intractable for realistic sequence lengths.

An enhancement to the general forward-backward algorithm, called the Island algorithm, trades smaller memory usage for longer running time, taking O(S^2 T \log T) time and O(S \log T) memory. Furthermore, it is possible to invert the process model to obtain an O(S) space, O(S^2 T) time algorithm, although the inverted process may not exist or may be ill-conditioned.
In addition, algorithms have been developed to compute \hat{f}_{0:t+1} efficiently through online smoothing, such as the fixed-lag smoothing algorithm.
algorithm forward_backward is
    input: guessState
           int sequenceIndex
    output: result

    if sequenceIndex is past the end of the sequence then
        return 1
    if (guessState, sequenceIndex) has been seen before then
        return saved result
    result := 0
    for each neighboring state n:
        result := result + (transition probability from guessState to n
                            given observation element at sequenceIndex)
                           × Backward(n, sequenceIndex + 1)
    save result for (guessState, sequenceIndex)
    return result
Given HMM (just like in Viterbi algorithm) represented in the Python programming language:
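The definition block that belongs here did not survive in this copy; the following is a plausible reconstruction in the style of the Viterbi-algorithm article, and the concrete states, observations, and probability values are assumptions chosen for illustration:

```python
states = ("Healthy", "Fever")
end_state = "E"                      # distinguished end state
observations = ("normal", "cold", "dizzy")
start_probability = {"Healthy": 0.6, "Fever": 0.4}
transition_probability = {
    "Healthy": {"Healthy": 0.69, "Fever": 0.3, "E": 0.01},
    "Fever": {"Healthy": 0.4, "Fever": 0.59, "E": 0.01},
}
emission_probability = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
```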
We can write the implementation of the forward-backward algorithm like this:

    def fwd_bkw(observations, states, start_prob, trans_prob, emm_prob, end_st):
        """Forward-backward algorithm."""
        # Forward part of the algorithm
        fwd = []
        for i, observation_i in enumerate(observations):
            f_curr = {}
            for st in states:
                if i == 0:
                    # base case for the forward part
                    prev_f_sum = start_prob[st]
                else:
                    prev_f_sum = sum(f_prev[k] * trans_prob[k][st] for k in states)

                f_curr[st] = emm_prob[st][observation_i] * prev_f_sum

            fwd.append(f_curr)
            f_prev = f_curr

        p_fwd = sum(f_curr[k] * trans_prob[k][end_st] for k in states)

        # Backward part of the algorithm
        bkw = []
        for i, observation_i_plus in enumerate(reversed(list(observations[1:]) + [None])):
            b_curr = {}
            for st in states:
                if i == 0:
                    # base case for backward part
                    b_curr[st] = trans_prob[st][end_st]
                else:
                    b_curr[st] = sum(trans_prob[st][l] * emm_prob[l][observation_i_plus] * b_prev[l]
                                     for l in states)

            bkw.insert(0, b_curr)
            b_prev = b_curr

        p_bkw = sum(start_prob[l] * emm_prob[l][observations[0]] * b_curr[l] for l in states)

        # Merging the two parts
        posterior = []
        for i in range(len(observations)):
            posterior.append({st: fwd[i][st] * bkw[i][st] / p_fwd for st in states})

        # both directions must yield the same total probability,
        # up to floating-point error
        assert abs(p_fwd - p_bkw) < 1e-12 * max(p_fwd, p_bkw)
        return fwd, bkw, posterior
The function fwd_bkw takes the following arguments: observations is the sequence of observations, e.g. ['normal', 'cold', 'dizzy']; states is the set of hidden states; start_prob is the start probability; trans_prob holds the transition probabilities; emm_prob holds the emission probabilities; and end_st is a distinguished end state.

For simplicity of code, we assume that the observation sequence observations is non-empty and that trans_prob[i][j] and emm_prob[i][j] are defined for all states i, j.
In the running example, the forward-backward algorithm is used as follows:
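The usage snippet that followed here was lost; a self-contained sketch, under the assumption of an example HMM patterned on the Viterbi article (the states, observations, and probability values are illustrative, and the function is repeated so the snippet runs standalone), might look like:

```python
def fwd_bkw(observations, states, start_prob, trans_prob, emm_prob, end_st):
    # Forward pass: f_curr[st] = P(o_1..o_i, X_i = st)
    fwd, f_prev = [], {}
    for i, o in enumerate(observations):
        f_curr = {}
        for st in states:
            prev = start_prob[st] if i == 0 else \
                sum(f_prev[k] * trans_prob[k][st] for k in states)
            f_curr[st] = emm_prob[st][o] * prev
        fwd.append(f_curr)
        f_prev = f_curr
    p_fwd = sum(f_curr[k] * trans_prob[k][end_st] for k in states)

    # Backward pass: b_curr[st] = P(o_{i+1}..o_T, end | X_i = st)
    bkw, b_prev = [], {}
    for i, o_plus in enumerate(reversed(list(observations[1:]) + [None])):
        b_curr = {}
        for st in states:
            if i == 0:
                b_curr[st] = trans_prob[st][end_st]
            else:
                b_curr[st] = sum(trans_prob[st][l] * emm_prob[l][o_plus] * b_prev[l]
                                 for l in states)
        bkw.insert(0, b_curr)
        b_prev = b_curr
    p_bkw = sum(start_prob[l] * emm_prob[l][observations[0]] * b_curr[l] for l in states)

    # Smoothed posteriors, normalized by the total sequence probability
    posterior = [{st: fwd[i][st] * bkw[i][st] / p_fwd for st in states}
                 for i in range(len(observations))]
    assert abs(p_fwd - p_bkw) < 1e-12
    return fwd, bkw, posterior

# Assumed example HMM (illustrative numbers):
states = ("Healthy", "Fever")
end_state = "E"
observations = ("normal", "cold", "dizzy")
start_probability = {"Healthy": 0.6, "Fever": 0.4}
transition_probability = {
    "Healthy": {"Healthy": 0.69, "Fever": 0.3, "E": 0.01},
    "Fever": {"Healthy": 0.4, "Fever": 0.59, "E": 0.01},
}
emission_probability = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

fwd, bkw, posterior = fwd_bkw(observations, states, start_probability,
                              transition_probability, emission_probability,
                              end_state)
for day, dist in enumerate(posterior):
    print(day, {st: round(p, 4) for st, p in dist.items()})
```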