Forward algorithm explained

The forward algorithm, in the context of a hidden Markov model (HMM), is used to calculate a 'belief state': the probability of a state at a certain time, given the history of evidence. The process is also known as filtering. The forward algorithm is closely related to, but distinct from, the Viterbi algorithm.

Introduction

The forward and backward algorithms are best understood in the context of probability theory, as they appear to be simply names given to a set of standard mathematical procedures used in a few fields. For example, neither "forward algorithm" nor "Viterbi" appears in the Cambridge encyclopedia of mathematics. The main observation to take away from these algorithms is how to organize Bayesian updates and inference so that they are computationally efficient on directed graphs of variables (see sum-product networks).

For an HMM, the belief state is written as $p(x_t \mid y_{1:t})$. Here $x(t)$ is the hidden state, abbreviated as $x_t$, and $y_{1:t}$ are the observations $1$ to $t$.

The backward algorithm complements the forward algorithm by taking future evidence into account, which improves the estimates for past times. This is referred to as smoothing, and the forward/backward algorithm computes $p(x_t \mid y_{1:T})$ for $1 < t < T$. Thus, the full forward/backward algorithm takes into account all evidence. Note that a belief state can be calculated at each time step, but doing so does not, in a strict sense, produce the most likely state sequence; it produces the most likely state at each time step, given the previous history. To obtain the most likely sequence, the Viterbi algorithm is required. It computes the most likely state sequence given the history of observations, that is, the state sequence that maximizes $p(x_{0:t} \mid y_{0:t})$.

Algorithm

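The goal of the forward algorithm is to compute the joint probability $p(x_t, y_{1:t})$, where for notational convenience we have abbreviated $x(t)$ as $x_t$ and $(y(1), y(2), \ldots, y(t))$ as $y_{1:t}$. Once the joint probability $p(x_t, y_{1:t})$ is computed, the other probabilities $p(x_t \mid y_{1:t})$ and $p(y_{1:t})$ are easily obtained.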

Both the state $x_t$ and the observation $y_t$ are assumed to be discrete, finite random variables. The hidden Markov model's state transition probabilities $p(x_t \mid x_{t-1})$, observation/emission probabilities $p(y_t \mid x_t)$, and initial prior probability $p(x_0)$ are assumed to be known. Furthermore, the sequence of observations $y_{1:t}$ is assumed to be given.

Computing $p(x_t, y_{1:t})$ naively would require marginalizing over all possible state sequences $\{x_{1:t-1}\}$, the number of which grows exponentially with $t$. Instead, the forward algorithm takes advantage of the conditional independence rules of the hidden Markov model (HMM) to perform the calculation recursively.

To demonstrate the recursion, let

$$\alpha(x_t) = p(x_t, y_{1:t}) = \sum_{x_{t-1}} p(x_t, x_{t-1}, y_{1:t}).$$

Using the chain rule to expand $p(x_t, x_{t-1}, y_{1:t})$, we can then write

$$\alpha(x_t) = \sum_{x_{t-1}} p(y_t \mid x_t, x_{t-1}, y_{1:t-1}) \, p(x_t \mid x_{t-1}, y_{1:t-1}) \, p(x_{t-1}, y_{1:t-1}).$$

Because $y_t$ is conditionally independent of everything but $x_t$, and $x_t$ is conditionally independent of everything but $x_{t-1}$, this simplifies to

$$\alpha(x_t) = p(y_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1}) \, \alpha(x_{t-1}).$$

Thus, since $p(y_t \mid x_t)$ and $p(x_t \mid x_{t-1})$ are given by the model's emission distributions and transition probabilities, which are assumed to be known, one can quickly calculate $\alpha(x_t)$ from $\alpha(x_{t-1})$ and avoid incurring exponential computation time.

The recursion formula given above can be written in a more compact form. Let $a_{ij} = p(x_t = i \mid x_{t-1} = j)$ be the transition probabilities and $b_{ij} = p(y_t = i \mid x_t = j)$ be the emission probabilities. Then

$$\alpha_t = b_t^T \odot A \alpha_{t-1},$$

where $A = [a_{ij}]$ is the transition probability matrix, $b_t$ is the $i$-th row of the emission probability matrix $B = [b_{ij}]$ which corresponds to the actual observation $y_t = i$ at time $t$, and $\alpha_t = [\alpha(x_t = 1), \ldots, \alpha(x_t = n)]^T$ is the alpha vector. The $\odot$ is the Hadamard product between the transpose of $b_t$ and $A \alpha_{t-1}$.
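As a minimal sketch of one step of this matrix form, the function below follows the text's index convention, where `A[i][j]` stands for $a_{ij} = p(x_t = i \mid x_{t-1} = j)$ and `b_t[i]` is the emission probability of the actual observation in state $i$. The name `forward_step` and the two-state numbers are hypothetical, chosen only for illustration.

```python
def forward_step(alpha_prev, A, b_t):
    """One forward update: alpha_t = b_t (Hadamard) (A @ alpha_prev).

    A[i][j] = p(x_t = i | x_{t-1} = j), the text's a_ij convention.
    b_t[i]  = p(y_t = observed symbol | x_t = i), the row of B picked by y_t.
    """
    n = len(alpha_prev)
    # Matrix-vector product A @ alpha_prev: sum over the previous state j.
    predicted = [sum(A[i][j] * alpha_prev[j] for j in range(n)) for i in range(n)]
    # Hadamard (element-wise) product with the emission probabilities.
    return [b_t[i] * predicted[i] for i in range(n)]


# Hypothetical two-state model; under this convention each column of A sums to 1.
A = [[0.7, 0.4],
     [0.3, 0.6]]
b_t = [0.9, 0.2]          # emission probabilities of the observed symbol
alpha_prev = [0.5, 0.5]   # alpha vector from the previous time step

alpha_t = forward_step(alpha_prev, A, b_t)
print(alpha_t)
```

Each update costs one matrix-vector product plus an element-wise multiplication, which is where the quadratic dependence on the number of states comes from.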

The initial condition is set in accordance with the prior probability over $x_0$ as $\alpha(x_0) = p(y_0 \mid x_0) \, p(x_0)$.

Once the joint probability $\alpha(x_t) = p(x_t, y_{1:t})$ has been computed using the forward algorithm, we can easily obtain the probability of the observations $p(y_{1:t})$ as

$$p(y_{1:t}) = \sum_{x_t} p(x_t, y_{1:t}) = \sum_{x_t} \alpha(x_t)$$

and the required conditional probability $p(x_t \mid y_{1:t})$ as

$$p(x_t \mid y_{1:t}) = \frac{p(x_t, y_{1:t})}{p(y_{1:t})} = \frac{\alpha(x_t)}{\sum_{x_t} \alpha(x_t)}.$$

Once the conditional probability has been calculated, we can also find the point estimate of $x_t$. For instance, the MAP estimate of $x_t$ is given by

$$\widehat{x}_t^{MAP} = \arg\max_{x_t} p(x_t \mid y_{1:t}) = \arg\max_{x_t} \alpha(x_t),$$

while the MMSE estimate of $x_t$ is given by

$$\widehat{x}_t^{MMSE} = E[x_t \mid y_{1:t}] = \sum_{x_t} x_t \, p(x_t \mid y_{1:t}) = \frac{\sum_{x_t} x_t \, \alpha(x_t)}{\sum_{x_t} \alpha(x_t)}.$$
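The normalization and the two point estimates above can be sketched in a few lines. The helper name `filter_estimates`, the alpha values, and the numeric state labels are all hypothetical. The MMSE estimate is only meaningful when the states carry numeric values, which is assumed here.

```python
def filter_estimates(alpha, state_values):
    """Return (posterior, MAP state, MMSE estimate) from an unnormalized alpha."""
    evidence = sum(alpha)                       # p(y_{1:t})
    posterior = [a / evidence for a in alpha]   # p(x_t | y_{1:t})
    # MAP: normalizing does not change the argmax, so alpha can be used directly.
    map_index = max(range(len(alpha)), key=lambda i: alpha[i])
    # MMSE: posterior mean of the (numeric) state values.
    mmse = sum(v * p for v, p in zip(state_values, posterior))
    return posterior, state_values[map_index], mmse


alpha = [0.01, 0.03, 0.06]    # hypothetical alpha(x_t) for three states
state_values = [1, 2, 3]      # hypothetical numeric labels of the states

posterior, x_map, x_mmse = filter_estimates(alpha, state_values)
print(posterior, x_map, x_mmse)
```

With these numbers the MAP estimate is one of the actual states, while the MMSE estimate is a weighted average that need not coincide with any state.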

The forward algorithm is easily modified to account for observations from variants of the hidden Markov model as well, such as the Markov jump linear system.

Pseudocode

  1. Initialize $t = 0$, the transition probabilities $p(x_t \mid x_{t-1})$, the emission probabilities $p(y_t \mid x_t)$, the observed sequence $y_{1:T}$, and the prior probability $\alpha(x_0)$.
  2. For $t = 1$ to $T$, compute $\alpha(x_t) = p(y_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1}) \, \alpha(x_{t-1})$.
  3. Return $p(x_T \mid y_{1:T}) = \frac{\alpha(x_T)}{\sum_{x_T} \alpha(x_T)}$.
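The three steps above can be turned into a short runnable sketch. It assumes that states and observation symbols are encoded as integer indices, that `A[i][j]` is $p(x_t = i \mid x_{t-1} = j)$, and that `B[k][i]` is $p(y_t = k \mid x_t = i)$. The function name `forward` and the two-state model are hypothetical and serve only as an illustration.

```python
def forward(prior, A, B, observations):
    """Return p(x_T | y_{1:T}) for a discrete HMM.

    prior[i]     = alpha(x_0 = i), the prior over the initial state
    A[i][j]      = p(x_t = i | x_{t-1} = j)
    B[k][i]      = p(y_t = k | x_t = i)
    observations = [y_1, ..., y_T] as integer symbol indices
    """
    n = len(prior)
    alpha = list(prior)                          # step 1: alpha(x_0)
    for y in observations:                       # step 2: t = 1, ..., T
        alpha = [
            B[y][i] * sum(A[i][j] * alpha[j] for j in range(n))
            for i in range(n)
        ]
    total = sum(alpha)                           # p(y_{1:T})
    return [a / total for a in alpha]            # step 3: normalize


# Hypothetical two-state, two-symbol model (columns of A and B sum to 1).
prior = [0.5, 0.5]
A = [[0.7, 0.4],
     [0.3, 0.6]]
B = [[0.9, 0.2],
     [0.1, 0.8]]

posterior = forward(prior, A, B, observations=[0, 1])
print(posterior)
```

For long sequences the unnormalized alpha values shrink toward zero, so practical implementations often rescale alpha at every step. That refinement is left out here to stay close to the pseudocode.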

Example

This example concerns inferring the possible states of the weather from the observed condition of seaweed. We have observations of seaweed for three consecutive days: dry, damp, and soggy, in that order. The possible states of the weather are sunny, cloudy, and rainy. In total, there are $3^3 = 27$ possible weather sequences. Exploring all possible state sequences becomes computationally very expensive as the sequence grows. The forward algorithm reduces this cost by using the conditional independence of the sequence steps to calculate partial probabilities,

$$\alpha(x_t) = p(x_t, y_{1:t}) = p(y_t \mid x_t) \sum_{x_{t-1}} p(x_t \mid x_{t-1}) \, \alpha(x_{t-1}),$$

as shown in the above derivation. Hence, each partial probability is the product of the appropriate observation/emission probability $p(y_t \mid x_t)$ (the probability of observing $y_t$ at time $t$ given the state $x_t$) and the sum of the probabilities of reaching that state at time $t$, calculated using the transition probabilities. This reduces the problem from searching the whole space of sequences to reusing the previously computed $\alpha$'s and the transition probabilities.
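To make the comparison concrete, the sketch below computes the partial probabilities for the seaweed observations twice: once with the recursion and once by summing over all 27 weather sequences. Every probability in it is made up for illustration, since the text gives no numbers. It also assumes that `initial` is the weather distribution on the first day, so that exactly three hidden states are enumerated.

```python
from itertools import product

states = ["sunny", "cloudy", "rainy"]

# All numbers below are hypothetical; they are not taken from the text.
initial = [0.5, 0.3, 0.2]            # p(x_1 = i), weather on the first day
# A[i][j] = p(x_t = i | x_{t-1} = j); each column sums to 1.
A = [[0.6, 0.3, 0.2],
     [0.3, 0.4, 0.3],
     [0.1, 0.3, 0.5]]
# B[k][i] = p(y_t = k | x_t = i), with k = 0 (dry), 1 (damp), 2 (soggy).
B = [[0.7, 0.3, 0.1],
     [0.2, 0.4, 0.3],
     [0.1, 0.3, 0.6]]
obs = [0, 1, 2]                      # dry, damp, soggy


def forward_joint(initial, A, B, obs):
    """alpha(x_T) = p(x_T, y_{1:T}) via the forward recursion."""
    n = len(initial)
    alpha = [B[obs[0]][i] * initial[i] for i in range(n)]
    for y in obs[1:]:
        alpha = [B[y][i] * sum(A[i][j] * alpha[j] for j in range(n))
                 for i in range(n)]
    return alpha


def brute_force_joint(initial, A, B, obs):
    """Same quantity, obtained by enumerating every state sequence."""
    n = len(initial)
    joint = [0.0] * n
    for path in product(range(n), repeat=len(obs)):
        p = initial[path[0]] * B[obs[0]][path[0]]
        for t in range(1, len(obs)):
            p *= A[path[t]][path[t - 1]] * B[obs[t]][path[t]]
        joint[path[-1]] += p         # accumulate by final state
    return joint


alpha = forward_joint(initial, A, B, obs)
brute = brute_force_joint(initial, A, B, obs)
for name, a, b in zip(states, alpha, brute):
    print(f"{name}: forward={a:.6f} brute-force={b:.6f}")
```

Both functions should agree up to floating-point rounding. The difference is that the enumeration visits every path, while the recursion only ever keeps one alpha value per state.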

Complexity

The complexity of the forward algorithm is $\Theta(nm^2)$, where $m$ is the number of possible values of the hidden state (the three weather conditions in the example above) and $n$ is the length of the sequence of the observed variable. This is a clear reduction from the brute-force method of exploring all possible state sequences, which has a complexity of $\Theta(nm^n)$.
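As a rough illustration, the two expressions can be read as operation counts with constant factors dropped (an assumption, since $\Theta$ only fixes the growth rate). For $m = 3$ states they give:

```latex
\begin{align*}
n = 3,\ m = 3 &: \quad n m^2 = 27, \qquad n m^n = 3 \cdot 3^3 = 81 \\
n = 10,\ m = 3 &: \quad n m^2 = 90, \qquad n m^n = 10 \cdot 3^{10} = 590490
\end{align*}
```

The gap is modest for the three-day example but already spans several orders of magnitude for a ten-day sequence.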

History

The forward algorithm is one of the algorithms used to solve the decoding problem. It has gained popularity with the development of speech recognition[4], pattern recognition, and related fields that use HMMs, such as computational biology.

Applications

The forward algorithm is mostly used in applications that need to determine the probability of being in a specific state given a known sequence of observations. The algorithm can be applied wherever a model can be trained on incoming data using Baum-Welch[5] or any general EM algorithm. The forward algorithm then gives the probability of the data with respect to what the model expects. One application is in finance, where it can help decide when to buy or sell tangible assets. It can be applied in all fields that use hidden Markov models. Popular ones include natural language processing tasks such as part-of-speech tagging and speech recognition.[4] More recently it has also been used in bioinformatics.

The forward algorithm can also be applied to weather prediction. We can have an HMM describing the weather and its relation to the observations over a few consecutive days (examples of states and observations include dry, damp, soggy, sunny, cloudy, and rainy). Given the HMM, we can calculate the probability of observing any sequence of observations recursively. The probability of reaching an intermediate state is the sum over all possible paths to that state. Thus the partial probabilities for the final observation hold the probability of reaching those states through all possible paths.

Notes and References

  1. Peng, Jian-Xun, Kang Li, and De-Shuang Huang. "A hybrid forward algorithm for RBF neural network construction." Neural Networks, IEEE Transactions on 17.6 (2006): 1439-1451.
  2. Zhang, Ping, and Christos G. Cassandras. "An improved forward algorithm for optimal control of a class of hybrid systems." Automatic Control, IEEE Transactions on 47.10 (2002): 1735-1739.
  3. Peng, Jian-Xun, Kang Li, and George W. Irwin. "A novel continuous forward algorithm for RBF neural modelling." Automatic Control, IEEE Transactions on 52.1 (2007): 117-122.
  4. Lawrence R. Rabiner.
  5. Zhang, Yanxue, Dongmei Zhao, and Jinxing Liu. "The Application of Baum-Welch Algorithm in Multistep Attack." The Scientific World Journal 2014.