Originally introduced by Richard E. Bellman, stochastic dynamic programming is a technique for modelling and solving problems of decision making under uncertainty. Closely related to stochastic programming and dynamic programming, stochastic dynamic programming represents the problem under scrutiny in the form of a Bellman equation. The aim is to compute a policy prescribing how to act optimally in the face of uncertainty.
A gambler has $2; she is allowed to play a game of chance 4 times, and her goal is to maximize her probability of ending up with at least $6. If the gambler bets $b on a play of the game, then with probability 0.4 she wins the game and increases her capital position by $b; with probability 0.6, she loses the bet amount $b. On any play, the gambler may not bet more money than she has available at the beginning of that play.
Stochastic dynamic programming can be employed to model this problem and determine a betting strategy that, for instance, maximizes the gambler's probability of attaining a wealth of at least $6 by the end of the betting horizon.
Note that if there is no limit to the number of games that can be played, the problem becomes a variant of the well known St. Petersburg paradox.
Consider a discrete system defined on n stages, in which each stage t=1,\ldots,n is characterized by: an initial state s_t\in S_t, where S_t is the set of feasible states at the beginning of stage t; a decision variable x_t\in X_t, where X_t is the set of feasible actions at stage t (note that X_t may be a function of the initial state s_t); an immediate cost/reward function p_t(s_t,x_t), representing the cost or reward at stage t if s_t is the initial state and x_t the action selected; and a state transition function g_t(s_t,x_t) that leads the system towards state s_{t+1}=g_t(s_t,x_t).
Let f_t(s_t) represent the optimal cost/reward obtained by following an optimal policy over stages t,t+1,\ldots,n. In what follows we consider, without loss of generality, a reward maximisation setting. In deterministic dynamic programming one usually deals with functional equations taking the following structure

f_t(s_t)=\max_{x_t\in X_t}\{p_t(s_t,x_t)+f_{t+1}(s_{t+1})\},

where s_{t+1}=g_t(s_t,x_t) and the boundary condition of the system is

f_n(s_n)=\max_{x_n\in X_n}\{p_n(s_n,x_n)\}.

The aim is to determine the set of optimal actions that maximise f_1(s_1). Given the current state s_t and the current action x_t, we know with certainty the reward secured during the current stage and, thanks to the state transition function g_t, the future state towards which the system transitions.
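The following is a minimal Python sketch of the deterministic recursion just described; the names solve_deterministic_dp, states, actions, reward and transition are illustrative placeholders standing for S_t, X_t, p_t and g_t, not part of the original formulation.

def solve_deterministic_dp(n, states, actions, reward, transition):
    """Tabulate f_t(s) by backward induction for a deterministic n-stage problem."""
    f = {}  # f[t][s] holds the optimal reward over stages t, ..., n starting from state s
    # boundary condition: f_n(s_n) = max_{x_n in X_n} p_n(s_n, x_n)
    f[n] = {s: max(reward(n, s, x) for x in actions(n, s)) for s in states(n)}
    # recursion: f_t(s_t) = max_{x_t in X_t} { p_t(s_t, x_t) + f_{t+1}(g_t(s_t, x_t)) }
    for t in range(n - 1, 0, -1):
        f[t] = {s: max(reward(t, s, x) + f[t + 1][transition(t, s, x)]
                       for x in actions(t, s))
                for s in states(t)}
    return f[1]  # the value of an optimal policy for every initial state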
In practice, however, even if we know the state of the system at the beginning of the current stage as well as the decision taken, the state of the system at the beginning of the next stage and the current period reward are often random variables that can be observed only at the end of the current stage.
Stochastic dynamic programming deals with problems in which the current period reward and/or the next period state are random, i.e. with multi-stage stochastic systems. The decision maker's goal is to maximise expected (discounted) reward over a given planning horizon.
In their most general form, stochastic dynamic programs deal with functional equations taking the following structure
f_t(s_t)=\max_{x_t\in X_t(s_t)}\left\{(\text{expected reward during stage } t\mid s_t,x_t)+\alpha\sum_{s_{t+1}}\Pr(s_{t+1}\mid s_t,x_t)f_{t+1}(s_{t+1})\right\}

where f_t(s_t) is the maximum expected reward that can be attained during stages t,t+1,\ldots,n, given state s_t at the beginning of stage t; x_t belongs to the set X_t(s_t) of feasible actions at stage t given initial state s_t; \alpha is the discount factor; and \Pr(s_{t+1}\mid s_t,x_t) is the conditional probability that the state at the end of stage t is s_{t+1}, given current state s_t and selected action x_t.
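As an illustration of this functional equation (a sketch, not a prescribed implementation), the expected-value recursion can be tabulated by backward induction as follows; states, actions, expected_reward and transition_prob are hypothetical placeholders for S_t, X_t(s_t), the expected stage reward and \Pr(s_{t+1}\mid s_t,x_t), with states(t) assumed to be defined for t=1,\ldots,n+1 and transition_prob returning a dictionary mapping next states to probabilities.

def solve_stochastic_dp(n, states, actions, expected_reward, transition_prob, alpha=1.0):
    """Tabulate f_t(s) and an optimal policy for a finite-horizon stochastic DP."""
    f = {n + 1: {s: 0.0 for s in states(n + 1)}}  # no reward is earned beyond stage n
    policy = {}
    for t in range(n, 0, -1):
        f[t], policy[t] = {}, {}
        for s in states(t):
            best_value, best_action = float("-inf"), None
            for x in actions(t, s):
                # expected stage reward plus discounted expected value of the next state
                value = expected_reward(t, s, x) + alpha * sum(
                    p * f[t + 1][s_next]
                    for s_next, p in transition_prob(t, s, x).items())
                if value > best_value:
                    best_value, best_action = value, x
            f[t][s], policy[t][s] = best_value, best_action
    return f, policy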
Markov decision processes represent a special class of stochastic dynamic programs in which the underlying stochastic process is a stationary process that features the Markov property.
The gambling game can be formulated as a stochastic dynamic program as follows: there are n=4 games (i.e. stages) in the planning horizon; the state s in period t represents the initial wealth at the beginning of period t; the action given state s in period t is the bet amount b; and the transition probability p^{a}_{i,j} from state i to state j when action a is taken in state i is easily derived from the probability of winning (0.4) or losing (0.6) a game.

Let f_t(s) be the probability that, by the end of game 4, the gambler has at least $6, given that she has $s at the beginning of game t. The immediate profit incurred if action b is taken in state s is given by the expected value p_t(s,b)=0.4f_{t+1}(s+b)+0.6f_{t+1}(s-b).

To derive the functional equation, define b_t(s) as a bet that attains f_t(s); then at the beginning of game t=4: if s<3 it is impossible to attain the goal, i.e. f_4(s)=0 for s<3; if s\geq 6 the goal is attained, i.e. f_4(s)=1 for s\geq 6; and if 3\leq s\leq 5 the gambler should bet enough to attain the goal on a win, i.e. f_4(s)=0.4 for 3\leq s\leq 5.

For t<4, the functional equation is

f_t(s)=\max_{b_t(s)}\{0.4f_{t+1}(s+b)+0.6f_{t+1}(s-b)\},

where b_t(s) ranges in 0,\ldots,s; the aim is to compute f_1(2).
Given the functional equation, an optimal betting policy can be obtained via forward recursion or backward recursion algorithms, as outlined below.
Stochastic dynamic programs can be solved to optimality by using backward recursion or forward recursion algorithms. Memoization is typically employed to enhance performance. However, like deterministic dynamic programming, its stochastic variant also suffers from the curse of dimensionality. For this reason, approximate solution methods are typically employed in practical applications.
Given a bounded state space, backward recursion begins by tabulating f_n(k) for every possible state k belonging to the final stage n. Once these values are tabulated, together with the associated optimal state-dependent actions x_n(k), it is possible to move to stage n-1 and tabulate f_{n-1}(k) for all possible states belonging to stage n-1. The process continues by considering, in a backward fashion, all remaining stages up to the first one. Once this tabulation process is complete, f_1(s), the value of an optimal policy given initial state s, as well as the associated optimal action x_1(s), can easily be retrieved from the table. Since the computation proceeds backwards, backward recursion may compute a large number of states that are not necessary for obtaining f_1(s).
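As an illustration, the following is a minimal, assumed sketch of backward recursion applied to the gambling game (the forward-recursion implementation given later in the article follows the worked example more closely). It exploits the fact that f_t(s)=1 for any s\geq 6, since the gambler can simply stop betting; capping wealth at the target is a simplification introduced here to bound the state space, not part of the original example.

# Backward recursion for the gambling game: win probability 0.4, target $6, 4 games.
N, TARGET = 4, 6
P_WIN, P_LOSE = 0.4, 0.6

# boundary condition: after the last game the value is 1 if the target has been reached
f = {N + 1: {s: (1.0 if s >= TARGET else 0.0) for s in range(TARGET + 1)}}
best_bet = {}

for t in range(N, 0, -1):        # stages N, N-1, ..., 1
    f[t], best_bet[t] = {}, {}
    for s in range(TARGET + 1):  # tabulate every state of stage t
        # f_t(s) = max_{0 <= b <= s} 0.4 f_{t+1}(min(s+b, 6)) + 0.6 f_{t+1}(s-b)
        values = [P_WIN * f[t + 1][min(s + b, TARGET)] + P_LOSE * f[t + 1][s - b]
                  for b in range(s + 1)]
        f[t][s] = max(values)
        best_bet[t][s] = values.index(f[t][s])

print(f[1][2], best_bet[1][2])  # f_1(2) is approximately 0.1984; bets of 1 or 2 are both optimal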
Given the initial state s of the system at the beginning of period 1, forward recursion computes f_1(s) by progressively expanding the functional equation (forward pass). This involves recursive calls for all f_{t+1}(\cdot),f_{t+2}(\cdot),\ldots that are necessary for computing a given f_t(\cdot). The value of an optimal policy and its structure are then retrieved via a backward pass in which these suspended recursive calls are resolved. A key difference from backward recursion is that f_t is computed only for states that are relevant for the computation of f_1(s). Memoization is employed to avoid recomputation of states that have already been considered.
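A compact, assumed sketch of this idea for the gambling game instance above (win probability 0.4, target $6, 4 games), using functools.lru_cache as the memoization mechanism, is given below; the complete implementation at the end of the article uses a hand-written memoize class instead.

from functools import lru_cache

N, TARGET, P_WIN, P_LOSE = 4, 6, 0.4, 0.6

@lru_cache(maxsize=None)
def f(t: int, s: int) -> float:
    """Probability of ending with at least $6, given $s at the beginning of game t."""
    if t > N:
        return 1.0 if s >= TARGET else 0.0
    # expand the functional equation; recursive calls are resolved and cached on demand
    return max(P_WIN * f(t + 1, s + b) + P_LOSE * f(t + 1, s - b) for b in range(s + 1))

print(f(1, 2))  # approximately 0.1984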
We shall illustrate forward recursion in the context of the Gambling game instance previously discussed. We begin the forward pass by considering
f_1(2)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }1,2,3,4\\ \hline 0 & 0.4f_2(2+0)+0.6f_2(2-0)\\ 1 & 0.4f_2(2+1)+0.6f_2(2-1)\\ 2 & 0.4f_2(2+2)+0.6f_2(2-2)\\ \end{array}\right.
At this point we have not yet computed f_2(4),f_2(3),f_2(2),f_2(1),f_2(0), which are needed to compute f_1(2); we proceed and compute these values. Note that f_2(2+0)=f_2(2-0)=f_2(2), so this value needs to be computed only once.
f_2(0)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }2,3,4\\ \hline 0 & 0.4f_3(0+0)+0.6f_3(0-0)\\ \end{array}\right.

f_2(1)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }2,3,4\\ \hline 0 & 0.4f_3(1+0)+0.6f_3(1-0)\\ 1 & 0.4f_3(1+1)+0.6f_3(1-1)\\ \end{array}\right.

f_2(2)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }2,3,4\\ \hline 0 & 0.4f_3(2+0)+0.6f_3(2-0)\\ 1 & 0.4f_3(2+1)+0.6f_3(2-1)\\ 2 & 0.4f_3(2+2)+0.6f_3(2-2)\\ \end{array}\right.

f_2(3)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }2,3,4\\ \hline 0 & 0.4f_3(3+0)+0.6f_3(3-0)\\ 1 & 0.4f_3(3+1)+0.6f_3(3-1)\\ 2 & 0.4f_3(3+2)+0.6f_3(3-2)\\ 3 & 0.4f_3(3+3)+0.6f_3(3-3)\\ \end{array}\right.

f_2(4)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }2,3,4\\ \hline 0 & 0.4f_3(4+0)+0.6f_3(4-0)\\ 1 & 0.4f_3(4+1)+0.6f_3(4-1)\\ 2 & 0.4f_3(4+2)+0.6f_3(4-2)\\ \end{array}\right.
We have now computed f_2(k) for all values of k that are needed to compute f_1(2). However, this has led to additional suspended recursive calls involving f_3(4),f_3(3),f_3(2),f_3(1),f_3(0). We proceed and compute these values.
f_3(0)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(0+0)+0.6f_4(0-0)\\ \end{array}\right.

f_3(1)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(1+0)+0.6f_4(1-0)\\ 1 & 0.4f_4(1+1)+0.6f_4(1-1)\\ \end{array}\right.

f_3(2)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(2+0)+0.6f_4(2-0)\\ 1 & 0.4f_4(2+1)+0.6f_4(2-1)\\ 2 & 0.4f_4(2+2)+0.6f_4(2-2)\\ \end{array}\right.

f_3(3)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(3+0)+0.6f_4(3-0)\\ 1 & 0.4f_4(3+1)+0.6f_4(3-1)\\ 2 & 0.4f_4(3+2)+0.6f_4(3-2)\\ 3 & 0.4f_4(3+3)+0.6f_4(3-3)\\ \end{array}\right.

f_3(4)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(4+0)+0.6f_4(4-0)\\ 1 & 0.4f_4(4+1)+0.6f_4(4-1)\\ 2 & 0.4f_4(4+2)+0.6f_4(4-2)\\ \end{array}\right.

f_3(5)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4f_4(5+0)+0.6f_4(5-0)\\ 1 & 0.4f_4(5+1)+0.6f_4(5-1)\\ \end{array}\right.
Since stage 4 is the last stage in our system, the values f_4(\cdot) represent boundary conditions that are easily computed as follows.

\begin{array}{ll} f_4(0)=0 & b_4(0)=0\\ f_4(1)=0 & b_4(1)=\{0,1\}\\ f_4(2)=0 & b_4(2)=\{0,1,2\}\\ f_4(3)=0.4 & b_4(3)=\{3\}\\ f_4(4)=0.4 & b_4(4)=\{2,3,4\}\\ f_4(5)=0.4 & b_4(5)=\{1,2,3,4,5\}\\ f_4(d)=1 & b_4(d)=\{0,\ldots,d-6\}\text{ for }d\geq 6 \end{array}
At this point it is possible to proceed and recover the optimal policy and its value via a backward pass involving, at first, stage 3, in which the suspended values f_3(\cdot) are resolved:
f_3(0)=\max\left\{\begin{array}{rr} b & \text{success probability in periods }3,4\\ \hline 0 & 0.4(0)+0.6(0)=0\\ \end{array}\right.

f_3(1)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }3,4 & \max\\ \hline 0 & 0.4(0)+0.6(0)=0 & \leftarrow b_3(1)=0\\ 1 & 0.4(0)+0.6(0)=0 & \leftarrow b_3(1)=1\\ \end{array}\right.

f_3(2)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }3,4 & \max\\ \hline 0 & 0.4(0)+0.6(0)=0\\ 1 & 0.4(0.4)+0.6(0)=0.16 & \leftarrow b_3(2)=1\\ 2 & 0.4(0.4)+0.6(0)=0.16 & \leftarrow b_3(2)=2\\ \end{array}\right.

f_3(3)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }3,4 & \max\\ \hline 0 & 0.4(0.4)+0.6(0.4)=0.4 & \leftarrow b_3(3)=0\\ 1 & 0.4(0.4)+0.6(0)=0.16\\ 2 & 0.4(0.4)+0.6(0)=0.16\\ 3 & 0.4(1)+0.6(0)=0.4 & \leftarrow b_3(3)=3\\ \end{array}\right.

f_3(4)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }3,4 & \max\\ \hline 0 & 0.4(0.4)+0.6(0.4)=0.4 & \leftarrow b_3(4)=0\\ 1 & 0.4(0.4)+0.6(0.4)=0.4 & \leftarrow b_3(4)=1\\ 2 & 0.4(1)+0.6(0)=0.4 & \leftarrow b_3(4)=2\\ \end{array}\right.

f_3(5)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }3,4 & \max\\ \hline 0 & 0.4(0.4)+0.6(0.4)=0.4\\ 1 & 0.4(1)+0.6(0.4)=0.64 & \leftarrow b_3(5)=1\\ \end{array}\right.
and, then, stage 2, in which the values f_2(\cdot) are resolved:
f_2(0)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }2,3,4 & \max\\ \hline 0 & 0.4(0)+0.6(0)=0 & \leftarrow b_2(0)=0\\ \end{array}\right.

f_2(1)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }2,3,4 & \max\\ \hline 0 & 0.4(0)+0.6(0)=0\\ 1 & 0.4(0.16)+0.6(0)=0.064 & \leftarrow b_2(1)=1\\ \end{array}\right.

f_2(2)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }2,3,4 & \max\\ \hline 0 & 0.4(0.16)+0.6(0.16)=0.16 & \leftarrow b_2(2)=0\\ 1 & 0.4(0.4)+0.6(0)=0.16 & \leftarrow b_2(2)=1\\ 2 & 0.4(0.4)+0.6(0)=0.16 & \leftarrow b_2(2)=2\\ \end{array}\right.

f_2(3)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }2,3,4 & \max\\ \hline 0 & 0.4(0.4)+0.6(0.4)=0.4 & \leftarrow b_2(3)=0\\ 1 & 0.4(0.4)+0.6(0.16)=0.256\\ 2 & 0.4(0.64)+0.6(0)=0.256\\ 3 & 0.4(1)+0.6(0)=0.4 & \leftarrow b_2(3)=3\\ \end{array}\right.

f_2(4)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }2,3,4 & \max\\ \hline 0 & 0.4(0.4)+0.6(0.4)=0.4\\ 1 & 0.4(0.64)+0.6(0.4)=0.496 & \leftarrow b_2(4)=1\\ 2 & 0.4(1)+0.6(0.16)=0.496 & \leftarrow b_2(4)=2\\ \end{array}\right.
We finally recover the value f_1(2) of an optimal policy:

f_1(2)=\max\left\{\begin{array}{rrr} b & \text{success probability in periods }1,2,3,4 & \max\\ \hline 0 & 0.4(0.16)+0.6(0.16)=0.16\\ 1 & 0.4(0.4)+0.6(0.064)=0.1984 & \leftarrow b_1(2)=1\\ 2 & 0.4(0.496)+0.6(0)=0.1984 & \leftarrow b_1(2)=2\\ \end{array}\right.
This backward pass recovers an optimal policy for the problem. Note that there are multiple optimal policies leading to the same optimal value f_1(2)=0.1984.
Python implementation. The following is a complete Python implementation of this example.
from typing import List, Tuple
import functools


class memoize:
    """simple memoization decorator supporting instance methods"""

    def __init__(self, func):
        self.func = func
        self.memoized = {}
        self.method_cache = {}

    def __call__(self, *args):
        return self.cache_get(self.memoized, args, lambda: self.func(*args))

    def __get__(self, obj, objtype):
        return self.cache_get(
            self.method_cache, obj,
            lambda: self.__class__(functools.partial(self.func, obj)))

    def cache_get(self, cache, key, func):
        try:
            return cache[key]
        except KeyError:
            cache[key] = func()
            return cache[key]

    def reset(self):
        self.memoized = {}
        self.method_cache = {}


class State:
    """the state of the gambler's ruin problem"""

    def __init__(self, t: int, wealth: float):
        """state constructor

        Arguments:
        t -- time period
        wealth -- initial wealth
        """
        self.t, self.wealth = t, wealth

    def __eq__(self, other):
        return self.__dict__ == other.__dict__

    def __str__(self):
        return str(self.t) + " " + str(self.wealth)

    def __hash__(self):
        return hash(str(self))


class GamblersRuin:
    def __init__(self, bettingHorizon: int, targetWealth: float,
                 pmf: List[List[Tuple[int, float]]]):
        """the gambler's ruin problem

        Arguments:
        bettingHorizon -- betting horizon
        targetWealth -- target wealth
        pmf -- probability mass function
        """

        # initialize instance variables
        self.bettingHorizon, self.targetWealth, self.pmf = bettingHorizon, targetWealth, pmf

        # lambdas
        self.ag = lambda s: [i for i in range(0, min(self.targetWealth // 2, s.wealth) + 1)]  # action generator
        self.st = lambda s, a, r: State(s.t + 1, s.wealth - a + a * r)  # state transition
        self.iv = lambda s, a, r: 1 if s.wealth - a + a * r >= self.targetWealth else 0  # immediate value function

        self.cache_actions = {}  # cache with optimal state/action pairs

    def f(self, wealth: float) -> float:
        s = State(0, wealth)
        return self._f(s)

    def q(self, t: int, wealth: float) -> float:
        s = State(t, wealth)
        return self.cache_actions[str(s)]

    @memoize
    def _f(self, s: State) -> float:
        # Forward recursion
        values = [sum([p[1] * (self._f(self.st(s, a, p[0]))
                               if s.t < self.bettingHorizon - 1
                               else self.iv(s, a, p[0]))  # value function
                       for p in self.pmf[s.t]])           # bet realisations
                  for a in self.ag(s)]                    # actions

        v = max(values)
        try:
            self.cache_actions[str(s)] = self.ag(s)[values.index(v)]  # store best action
        except ValueError:
            self.cache_actions[str(s)] = None
            print("Error in retrieving best action")
        return v  # return expected total reward


# problem instance: 4 games, target wealth $6; on a win (prob. 0.4) the gambler
# receives twice her bet back, on a loss (prob. 0.6) she receives nothing
instance = {"bettingHorizon": 4,
            "targetWealth": 6,
            "pmf": [[(2, 0.4), (0, 0.6)] for _ in range(4)]}
gr, initial_wealth = GamblersRuin(**instance), 2

# f_1(2): probability of attaining the target wealth given an initial wealth of $2
print("f_1(" + str(initial_wealth) + "): " + str(gr.f(initial_wealth)))

# recover b_2(1), the optimal bet when the gambler holds $1 at the beginning of the second game
t, initial_wealth = 1, 1
print("b_" + str(t + 1) + "(" + str(initial_wealth) + "): " + str(gr.q(t, initial_wealth)))
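Running the script should reproduce the value obtained above via forward recursion, printing f_1(2) as approximately 0.1984 (up to floating-point rounding), followed by b_2(1): 1, the optimal bet when the gambler holds $1 at the beginning of the second game.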
Java implementation. GamblersRuin.java is a standalone Java 8 implementation of the above example.
Introductions to approximate dynamic programming are available in the literature.