Originally introduced by Richard E. Bellman, stochastic dynamic programming is a technique for modelling and solving problems of decision making under uncertainty. Closely related to stochastic programming and dynamic programming, stochastic dynamic programming represents the problem under scrutiny in the form of a Bellman equation. The aim is to compute a policy prescribing how to act optimally in the face of uncertainty.
A gambler has $2; she is allowed to play a game of chance 4 times, and her goal is to maximize her probability of ending up with at least $6. If the gambler bets $b on a play of the game, then with probability 0.4 she wins the game and increases her capital position by $b; with probability 0.6, she loses the bet amount $b. On any play, the gambler may not bet more money than she has available at the beginning of that play.
Stochastic dynamic programming can be employed to model this problem and determine a betting strategy that, for instance, maximizes the gambler's probability of attaining a wealth of at least $6 by the end of the betting horizon.
Note that if there is no limit to the number of games that can be played, the problem becomes a variant of the well known St. Petersburg paradox.
Consider a discrete system defined on n stages, in which each stage t=1,\ldots,n is characterized by: an initial state s_t\in S_t, where S_t is the set of feasible states at the beginning of stage t; a decision variable x_t\in X_t, where X_t is the set of feasible actions at stage t (note that X_t may be a function of the initial state s_t); an immediate cost/reward function p_t(s_t,x_t), representing the cost or reward at stage t if s_t is the initial state and x_t the action selected; and a state transition function g_t(s_t,x_t) that leads the system towards state s_{t+1}=g_t(s_t,x_t).
Let f_t(s_t) represent the optimal cost/reward obtained by following an optimal policy over stages t,t+1,\ldots,n. In what follows we consider, without loss of generality, a reward maximisation setting. In deterministic dynamic programming one usually deals with functional equations taking the following structure

f_t(s_t)=\max_{x_t\in X_t}\{p_t(s_t,x_t)+f_{t+1}(s_{t+1})\},

where s_{t+1}=g_t(s_t,x_t) and the boundary condition of the system is

f_n(s_n)=\max_{x_n\in X_n}\{p_n(s_n,x_n)\}.

The aim is to determine the set of optimal actions that maximise f_1(s_1). Given the current state s_t and the current action x_t, we know with certainty the reward secured during the current stage and, thanks to the state transition function g_t, the future state towards which the system transitions.
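The following is a minimal Python sketch of the deterministic recursion just described; the names solve_deterministic_dp, states, actions, reward and transition are illustrative placeholders standing for S_t, X_t, p_t and g_t, not part of the original formulation.

def solve_deterministic_dp(n, states, actions, reward, transition):
    """Tabulate f_t(s) by backward induction for a deterministic n-stage problem."""
    f = {}  # f[t][s] holds the optimal reward over stages t, ..., n starting from state s
    # boundary condition: f_n(s_n) = max_{x_n in X_n} p_n(s_n, x_n)
    f[n] = {s: max(reward(n, s, x) for x in actions(n, s)) for s in states(n)}
    # recursion: f_t(s_t) = max_{x_t in X_t} { p_t(s_t, x_t) + f_{t+1}(g_t(s_t, x_t)) }
    for t in range(n - 1, 0, -1):
        f[t] = {s: max(reward(t, s, x) + f[t + 1][transition(t, s, x)]
                       for x in actions(t, s))
                for s in states(t)}
    return f[1]  # the value of an optimal policy for every initial state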
In practice, however, even if we know the state of the system at the beginning of the current stage as well as the decision taken, the state of the system at the beginning of the next stage and the current period reward are often random variables that can be observed only at the end of the current stage.
Stochastic dynamic programming deals with problems in which the current period reward and/or the next period state are random, i.e. with multi-stage stochastic systems. The decision maker's goal is to maximise expected (discounted) reward over a given planning horizon.
In their most general form, stochastic dynamic programs deal with functional equations taking the following structure
f_t(s_t)=\max_{x_t\in X_t(s_t)}\left\{(\text{expected reward during stage } t\mid s_t,x_t)+\alpha\sum_{s_{t+1}}\Pr(s_{t+1}\mid s_t,x_t)f_{t+1}(s_{t+1})\right\}

where f_t(s_t) is the maximum expected reward that can be attained during stages t,t+1,\ldots,n, given state s_t at the beginning of stage t; x_t belongs to the set X_t(s_t) of feasible actions at stage t given initial state s_t; \alpha is the discount factor; and \Pr(s_{t+1}\mid s_t,x_t) is the conditional probability that the state at the end of stage t is s_{t+1}, given current state s_t and selected action x_t.
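As an illustration of this functional equation (a sketch, not a prescribed implementation), the expected-value recursion can be tabulated by backward induction as follows; states, actions, expected_reward and transition_prob are hypothetical placeholders for S_t, X_t(s_t), the expected stage reward and \Pr(s_{t+1}\mid s_t,x_t), with states(t) assumed to be defined for t=1,\ldots,n+1 and transition_prob returning a dictionary mapping next states to probabilities.

def solve_stochastic_dp(n, states, actions, expected_reward, transition_prob, alpha=1.0):
    """Tabulate f_t(s) and an optimal policy for a finite-horizon stochastic DP."""
    f = {n + 1: {s: 0.0 for s in states(n + 1)}}  # no reward is earned beyond stage n
    policy = {}
    for t in range(n, 0, -1):
        f[t], policy[t] = {}, {}
        for s in states(t):
            best_value, best_action = float("-inf"), None
            for x in actions(t, s):
                # expected stage reward plus discounted expected value of the next state
                value = expected_reward(t, s, x) + alpha * sum(
                    p * f[t + 1][s_next]
                    for s_next, p in transition_prob(t, s, x).items())
                if value > best_value:
                    best_value, best_action = value, x
            f[t][s], policy[t][s] = best_value, best_action
    return f, policy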
Markov decision processes represent a special class of stochastic dynamic programs in which the underlying stochastic process is a stationary process that features the Markov property.
The gambling game can be formulated as a stochastic dynamic program as follows: there are n=4 games (i.e. stages) in the planning horizon; the state s in period t represents the initial wealth at the beginning of period t; the action given state s in period t is the bet amount b; and the transition probability p^{a}_{i,j} from state i to state j when action a is taken in state i is easily derived from the probability of winning (0.4) or losing (0.6) a game.

Let f_t(s) be the probability that, by the end of game 4, the gambler has at least $6, given that she has $s at the beginning of game t. The immediate profit incurred if action b is taken in state s is given by the expected value p_t(s,b)=0.4f_{t+1}(s+b)+0.6f_{t+1}(s-b).

To derive the functional equation, define b_t(s) as a bet that attains f_t(s); then at the beginning of game t=4: if s<3 it is impossible to attain the goal, i.e. f_4(s)=0 for s<3; if s\geq 6 the goal is attained, i.e. f_4(s)=1 for s\geq 6; and if 3\leq s\leq 5 the gambler should bet enough to attain the goal on a win, i.e. f_4(s)=0.4 for 3\leq s\leq 5.

For t<4, the functional equation is

f_t(s)=\max_{b_t(s)}\{0.4f_{t+1}(s+b)+0.6f_{t+1}(s-b)\},

where b_t(s) ranges in 0,\ldots,s; the aim is to compute f_1(2).
Given the functional equation, an optimal betting policy can be obtained via forward recursion or backward recursion algorithms, as outlined below.
Stochastic dynamic programs can be solved to optimality by using backward recursion or forward recursion algorithms. Memoization is typically employed to enhance performance. However, like deterministic dynamic programming, its stochastic variant also suffers from the curse of dimensionality. For this reason, approximate solution methods are typically employed in practical applications.
Given a bounded state space, backward recursion begins by tabulating f_n(k) for every possible state k belonging to the final stage n. Once these values are tabulated, together with the associated optimal state-dependent actions x_n(k), it is possible to move to stage n-1 and tabulate f_{n-1}(k) for all possible states belonging to stage n-1. The process continues by considering, in a backward fashion, all remaining stages up to the first one. Once this tabulation process is complete, f_1(s), the value of an optimal policy given initial state s, as well as the associated optimal action x_1(s), can easily be retrieved from the table. Since the computation proceeds backwards, backward recursion may compute a large number of states that are not necessary for obtaining f_1(s).
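As an illustration, the following is a minimal, assumed sketch of backward recursion applied to the gambling game (the forward-recursion implementation given later in the article follows the worked example more closely). It exploits the fact that f_t(s)=1 for any s\geq 6, since the gambler can simply stop betting; capping wealth at the target is a simplification introduced here to bound the state space, not part of the original example.

# Backward recursion for the gambling game: win probability 0.4, target $6, 4 games.
N, TARGET = 4, 6
P_WIN, P_LOSE = 0.4, 0.6

# boundary condition: after the last game the value is 1 if the target has been reached
f = {N + 1: {s: (1.0 if s >= TARGET else 0.0) for s in range(TARGET + 1)}}
best_bet = {}

for t in range(N, 0, -1):        # stages N, N-1, ..., 1
    f[t], best_bet[t] = {}, {}
    for s in range(TARGET + 1):  # tabulate every state of stage t
        # f_t(s) = max_{0 <= b <= s} 0.4 f_{t+1}(min(s+b, 6)) + 0.6 f_{t+1}(s-b)
        values = [P_WIN * f[t + 1][min(s + b, TARGET)] + P_LOSE * f[t + 1][s - b]
                  for b in range(s + 1)]
        f[t][s] = max(values)
        best_bet[t][s] = values.index(f[t][s])

print(f[1][2], best_bet[1][2])  # f_1(2) is approximately 0.1984; bets of 1 or 2 are both optimal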
Given the initial state s of the system at the beginning of period 1, forward recursion computes f_1(s) by progressively expanding the functional equation (forward pass). This involves recursive calls for all f_{t+1}(\cdot),f_{t+2}(\cdot),\ldots that are necessary for computing a given f_t(\cdot). The value of an optimal policy and its structure are then retrieved via a backward pass in which these suspended recursive calls are resolved. A key difference from backward recursion is that f_t is computed only for states that are relevant for the computation of f_1(s). Memoization is employed to avoid recomputation of states that have already been considered.
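A compact, assumed sketch of this idea for the gambling game instance above (win probability 0.4, target $6, 4 games), using functools.lru_cache as the memoization mechanism, is given below; the complete implementation at the end of the article uses a hand-written memoize class instead.

from functools import lru_cache

N, TARGET, P_WIN, P_LOSE = 4, 6, 0.4, 0.6

@lru_cache(maxsize=None)
def f(t: int, s: int) -> float:
    """Probability of ending with at least $6, given $s at the beginning of game t."""
    if t > N:
        return 1.0 if s >= TARGET else 0.0
    # expand the functional equation; recursive calls are resolved and cached on demand
    return max(P_WIN * f(t + 1, s + b) + P_LOSE * f(t + 1, s - b) for b in range(s + 1))

print(f(1, 2))  # approximately 0.1984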
We shall illustrate forward recursion in the context of the Gambling game instance previously discussed. We begin the forward pass by considering
f_1(2)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }1,2,3,4\\ \hline 0 & 0.4f_2(2+0)+0.6f_2(2-0)\\ 1 & 0.4f_2(2+1)+0.6f_2(2-1)\\ 2 & 0.4f_2(2+2)+0.6f_2(2-2)\\ \end{array}\right.
At this point we have not yet computed f_2(4),f_2(3),f_2(2),f_2(1),f_2(0), which are needed to compute f_1(2); we proceed and compute these values. Note that f_2(2+0)=f_2(2-0)=f_2(2), so this value needs to be computed only once.
f_2(0)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }2,3,4\\ \hline 0 & 0.4f_3(0+0)+0.6f_3(0-0)\\ \end{array}\right.

f_2(1)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }2,3,4\\ \hline 0 & 0.4f_3(1+0)+0.6f_3(1-0)\\ 1 & 0.4f_3(1+1)+0.6f_3(1-1)\\ \end{array}\right.

f_2(2)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }2,3,4\\ \hline 0 & 0.4f_3(2+0)+0.6f_3(2-0)\\ 1 & 0.4f_3(2+1)+0.6f_3(2-1)\\ 2 & 0.4f_3(2+2)+0.6f_3(2-2)\\ \end{array}\right.

f_2(3)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }2,3,4\\ \hline 0 & 0.4f_3(3+0)+0.6f_3(3-0)\\ 1 & 0.4f_3(3+1)+0.6f_3(3-1)\\ 2 & 0.4f_3(3+2)+0.6f_3(3-2)\\ 3 & 0.4f_3(3+3)+0.6f_3(3-3)\\ \end{array}\right.

f_2(4)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }2,3,4\\ \hline 0 & 0.4f_3(4+0)+0.6f_3(4-0)\\ 1 & 0.4f_3(4+1)+0.6f_3(4-1)\\ 2 & 0.4f_3(4+2)+0.6f_3(4-2)\\ \end{array}\right.
We have now computed f_2(k) for all values of k that are needed to compute f_1(2). However, this has led to additional suspended recursive calls involving f_3(4),f_3(3),f_3(2),f_3(1),f_3(0). We proceed and compute these values.
f_3(0)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(0+0)+0.6f_4(0-0)\\ \end{array}\right.

f_3(1)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(1+0)+0.6f_4(1-0)\\ 1 & 0.4f_4(1+1)+0.6f_4(1-1)\\ \end{array}\right.

f_3(2)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(2+0)+0.6f_4(2-0)\\ 1 & 0.4f_4(2+1)+0.6f_4(2-1)\\ 2 & 0.4f_4(2+2)+0.6f_4(2-2)\\ \end{array}\right.

f_3(3)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(3+0)+0.6f_4(3-0)\\ 1 & 0.4f_4(3+1)+0.6f_4(3-1)\\ 2 & 0.4f_4(3+2)+0.6f_4(3-2)\\ 3 & 0.4f_4(3+3)+0.6f_4(3-3)\\ \end{array}\right.

f_3(4)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(4+0)+0.6f_4(4-0)\\ 1 & 0.4f_4(4+1)+0.6f_4(4-1)\\ 2 & 0.4f_4(4+2)+0.6f_4(4-2)\\ \end{array}\right.

f_3(5)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(5+0)+0.6f_4(5-0)\\ 1 & 0.4f_4(5+1)+0.6f_4(5-1)\\ \end{array}\right.
Since stage 4 is the last stage in our system, the values f_4(\cdot) represent boundary conditions that are easily computed as follows.

\begin{array}{ll} f_4(0)=0 & b_4(0)=0\\ f_4(1)=0 & b_4(1)=\{0,1\}\\ f_4(2)=0 & b_4(2)=\{0,1,2\}\\ f_4(3)=0.4 & b_4(3)=\{3\}\\ f_4(4)=0.4 & b_4(4)=\{2,3,4\}\\ f_4(5)=0.4 & b_4(5)=\{1,2,3,4,5\}\\ f_4(d)=1 & b_4(d)=\{0,\ldots,d-6\}\text{ for }d\geq 6 \end{array}
At this point it is possible to proceed and recover the optimal policy and its value via a backward pass involving, at first, stage 3, in which the suspended values f_3(\cdot) are resolved:
f_3(0)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4(0)+0.6(0)=0\\ \end{array}\right.

f_3(1)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }3,4 & \max\\ \hline 0 & 0.4(0)+0.6(0)=0 & \leftarrow b_3(1)=0\\ 1 & 0.4(0)+0.6(0)=0 & \leftarrow b_3(1)=1\\ \end{array}\right.

f_3(2)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }3,4 & \max\\ \hline 0 & 0.4(0)+0.6(0)=0\\ 1 & 0.4(0.4)+0.6(0)=0.16 & \leftarrow b_3(2)=1\\ 2 & 0.4(0.4)+0.6(0)=0.16 & \leftarrow b_3(2)=2\\ \end{array}\right.

f_3(3)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }3,4 & \max\\ \hline 0 & 0.4(0.4)+0.6(0.4)=0.4 & \leftarrow b_3(3)=0\\ 1 & 0.4(0.4)+0.6(0)=0.16\\ 2 & 0.4(0.4)+0.6(0)=0.16\\ 3 & 0.4(1)+0.6(0)=0.4 & \leftarrow b_3(3)=3\\ \end{array}\right.

f_3(4)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }3,4 & \max\\ \hline 0 & 0.4(0.4)+0.6(0.4)=0.4 & \leftarrow b_3(4)=0\\ 1 & 0.4(0.4)+0.6(0.4)=0.4 & \leftarrow b_3(4)=1\\ 2 & 0.4(1)+0.6(0)=0.4 & \leftarrow b_3(4)=2\\ \end{array}\right.

f_3(5)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }3,4 & \max\\ \hline 0 & 0.4(0.4)+0.6(0.4)=0.4\\ 1 & 0.4(1)+0.6(0.4)=0.64 & \leftarrow b_3(5)=1\\ \end{array}\right.
and, then, stage 2, in which the values f_2(\cdot) are resolved:
f_2(0)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }2,3,4 & \max\\ \hline 0 & 0.4(0)+0.6(0)=0 & \leftarrow b_2(0)=0\\ \end{array}\right.

f_2(1)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }2,3,4 & \max\\ \hline 0 & 0.4(0)+0.6(0)=0\\ 1 & 0.4(0.16)+0.6(0)=0.064 & \leftarrow b_2(1)=1\\ \end{array}\right.

f_2(2)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }2,3,4 & \max\\ \hline 0 & 0.4(0.16)+0.6(0.16)=0.16 & \leftarrow b_2(2)=0\\ 1 & 0.4(0.4)+0.6(0)=0.16 & \leftarrow b_2(2)=1\\ 2 & 0.4(0.4)+0.6(0)=0.16 & \leftarrow b_2(2)=2\\ \end{array}\right.

f_2(3)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }2,3,4 & \max\\ \hline 0 & 0.4(0.4)+0.6(0.4)=0.4 & \leftarrow b_2(3)=0\\ 1 & 0.4(0.4)+0.6(0.16)=0.256\\ 2 & 0.4(0.64)+0.6(0)=0.256\\ 3 & 0.4(1)+0.6(0)=0.4 & \leftarrow b_2(3)=3\\ \end{array}\right.

f_2(4)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }2,3,4 & \max\\ \hline 0 & 0.4(0.4)+0.6(0.4)=0.4\\ 1 & 0.4(0.64)+0.6(0.4)=0.496 & \leftarrow b_2(4)=1\\ 2 & 0.4(1)+0.6(0.16)=0.496 & \leftarrow b_2(4)=2\\ \end{array}\right.
We finally recover the value f_1(2) of an optimal policy:

f_1(2)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }1,2,3,4 & \max\\ \hline 0 & 0.4(0.16)+0.6(0.16)=0.16\\ 1 & 0.4(0.4)+0.6(0.064)=0.1984 & \leftarrow b_1(2)=1\\ 2 & 0.4(0.496)+0.6(0)=0.1984 & \leftarrow b_1(2)=2\\ \end{array}\right.
This backward pass recovers an optimal policy for the problem. Note that there are multiple optimal policies leading to the same optimal value f_1(2)=0.1984.
Python implementation. The following is a complete Python implementation of this example.
from typing import List, Tuple
import functools


class memoize:
    """simple memoization decorator supporting instance methods"""

    def __init__(self, func):
        self.func = func
        self.memoized = {}
        self.method_cache = {}

    def __call__(self, *args):
        return self.cache_get(self.memoized, args, lambda: self.func(*args))

    def __get__(self, obj, objtype):
        return self.cache_get(
            self.method_cache, obj,
            lambda: self.__class__(functools.partial(self.func, obj)))

    def cache_get(self, cache, key, func):
        try:
            return cache[key]
        except KeyError:
            cache[key] = func()
            return cache[key]

    def reset(self):
        self.memoized = {}
        self.method_cache = {}


class State:
    """the state of the gambler's ruin problem"""

    def __init__(self, t: int, wealth: float):
        """state constructor

        Arguments:
        t -- time period
        wealth -- initial wealth
        """
        self.t, self.wealth = t, wealth

    def __eq__(self, other):
        return self.__dict__ == other.__dict__

    def __str__(self):
        return str(self.t) + " " + str(self.wealth)

    def __hash__(self):
        return hash(str(self))


class GamblersRuin:
    def __init__(self, bettingHorizon: int, targetWealth: float,
                 pmf: List[List[Tuple[int, float]]]):
        """the gambler's ruin problem

        Arguments:
        bettingHorizon -- betting horizon
        targetWealth -- target wealth
        pmf -- probability mass function
        """

        # initialize instance variables
        self.bettingHorizon, self.targetWealth, self.pmf = bettingHorizon, targetWealth, pmf

        # lambdas
        self.ag = lambda s: [i for i in range(0, min(self.targetWealth // 2, s.wealth) + 1)]  # action generator
        self.st = lambda s, a, r: State(s.t + 1, s.wealth - a + a * r)  # state transition
        self.iv = lambda s, a, r: 1 if s.wealth - a + a * r >= self.targetWealth else 0  # immediate value function

        self.cache_actions = {}  # cache with optimal state/action pairs

    def f(self, wealth: float) -> float:
        s = State(0, wealth)
        return self._f(s)

    def q(self, t: int, wealth: float) -> float:
        s = State(t, wealth)
        return self.cache_actions[str(s)]

    @memoize
    def _f(self, s: State) -> float:
        # Forward recursion
        values = [sum([p[1] * (self._f(self.st(s, a, p[0]))
                               if s.t < self.bettingHorizon - 1
                               else self.iv(s, a, p[0]))  # value function
                       for p in self.pmf[s.t]])           # bet realisations
                  for a in self.ag(s)]                    # actions

        v = max(values)
        try:
            self.cache_actions[str(s)] = self.ag(s)[values.index(v)]  # store best action
        except ValueError:
            self.cache_actions[str(s)] = None
            print("Error in retrieving best action")
        return v  # return expected total reward


# problem instance: 4 games, target wealth $6; on a win (prob. 0.4) the gambler
# receives twice her bet back, on a loss (prob. 0.6) she receives nothing
instance = {"bettingHorizon": 4,
            "targetWealth": 6,
            "pmf": [[(2, 0.4), (0, 0.6)] for _ in range(4)]}
gr, initial_wealth = GamblersRuin(**instance), 2

# f_1(2): probability of attaining the target wealth given an initial wealth of $2
print("f_1(" + str(initial_wealth) + "): " + str(gr.f(initial_wealth)))

# recover b_2(1), the optimal bet when the gambler holds $1 at the beginning of the second game
t, initial_wealth = 1, 1
print("b_" + str(t + 1) + "(" + str(initial_wealth) + "): " + str(gr.q(t, initial_wealth)))
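Running the script should reproduce the value obtained above via forward recursion, printing f_1(2) as approximately 0.1984 (up to floating-point rounding), followed by b_2(1): 1, the optimal bet when the gambler holds $1 at the beginning of the second game.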
Java implementation. GamblersRuin.java is a standalone Java 8 implementation of the above example.
Introductions to approximate dynamic programming are available in the literature.