A Bellman equation, named after Richard E. Bellman, is a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming.[1] It writes the "value" of a decision problem at a certain point in time in terms of the payoff from some initial choices and the "value" of the remaining decision problem that results from those initial choices.[2] This breaks a dynamic optimization problem into a sequence of simpler subproblems, as Bellman's "principle of optimality" prescribes.[3] The equation applies to algebraic structures with a total ordering; for algebraic structures with a partial ordering, the generic Bellman's equation can be used.
The Bellman equation was first applied to engineering control theory and to other topics in applied mathematics, and subsequently became an important tool in economic theory; though the basic concepts of dynamic programming are prefigured in John von Neumann and Oskar Morgenstern's Theory of Games and Economic Behavior and Abraham Wald's sequential analysis. The term "Bellman equation" usually refers to the dynamic programming equation (DPE) associated with discrete-time optimization problems. In continuous-time optimization problems, the analogous equation is a partial differential equation that is called the Hamilton–Jacobi–Bellman equation.[4]
In discrete time any multi-stage optimization problem can be solved by analyzing the appropriate Bellman equation. The appropriate Bellman equation can be found by introducing new state variables (state augmentation).[5] However, the resulting augmented-state multi-stage optimization problem has a higher-dimensional state space than the original multi-stage optimization problem, an issue that can potentially render the augmented problem intractable due to the "curse of dimensionality". Alternatively, it has been shown that if the cost function of the multi-stage optimization problem satisfies a "backward separable" structure, then the appropriate Bellman equation can be found without state augmentation.[6]
To understand the Bellman equation, several underlying concepts must be understood. First, any optimization problem has some objective: minimizing travel time, minimizing cost, maximizing profits, maximizing utility, etc. The mathematical function that describes this objective is called the objective function.
Dynamic programming breaks a multi-period planning problem into simpler steps at different points in time. Therefore, it requires keeping track of how the decision situation is evolving over time. The information about the current situation that is needed to make a correct decision is called the "state".[7][8] For example, to decide how much to consume and spend at each point in time, people would need to know (among other things) their initial wealth. Therefore, wealth (W) would be one of their state variables, but there would probably be others.
The variables chosen at any given point in time are often called the control variables. For instance, given their current wealth, people might decide how much to consume now. Choosing the control variables now may be equivalent to choosing the next state; more generally, the next state is affected by other factors in addition to the current control. For example, in the simplest case, today's wealth (the state) and consumption (the control) might exactly determine tomorrow's wealth (the new state), though typically other factors will affect tomorrow's wealth too.
The dynamic programming approach describes the optimal plan by finding a rule that tells what the controls should be, given any possible value of the state. For example, if consumption (c) depends only on wealth (W), we would seek a rule c(W) that gives consumption as a function of wealth. Such a rule, determining the controls as a function of the state, is called a policy function.
Finally, by definition, the optimal decision rule is the one that achieves the best possible value of the objective. For example, if someone chooses consumption, given wealth, in order to maximize happiness (assuming happiness H can be represented by a mathematical function, such as a utility function, and is something defined by wealth), then each level of wealth will be associated with some highest possible level of happiness, H(W). The best possible value of the objective, written as a function of the state, is called the value function.
Bellman showed that a dynamic optimization problem in discrete time can be stated in a recursive, step-by-step form known as backward induction by writing down the relationship between the value function in one period and the value function in the next period. The relationship between these two value functions is called the "Bellman equation". In this approach, the optimal policy in the last time period is specified in advance as a function of the state variable's value at that time, and the resulting optimal value of the objective function is expressed in terms of that value of the state variable. Next, the next-to-last period's optimization involves maximizing the sum of that period's period-specific objective function and the optimal value of the future objective function, giving that period's optimal policy contingent upon the value of the state variable at the next-to-last-period decision. This logic continues recursively back in time until the first-period decision rule is derived, as a function of the initial state variable value, by optimizing the sum of the first-period-specific objective function and the value of the second period's value function, which gives the value for all future periods. Thus, each period's decision is made by explicitly acknowledging that all future decisions will be optimally made.
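To make this concrete, the sketch below carries out backward induction on a discretized finite-horizon problem. The state grid, payoff F, transition T, and parameter values are illustrative assumptions, not taken from the text; what matters is the structure: solve the last period first, then step backward one period at a time.

```python
import numpy as np

# Backward induction on a discretized finite-horizon problem.
# Grids, F, T, and parameters are illustrative assumptions.
beta = 0.95                              # discount factor
horizon = 10                             # number of periods
states = np.linspace(0.1, 10.0, 200)     # state grid (e.g., wealth)
actions = np.linspace(0.0, 1.0, 50)      # control grid (e.g., fraction of wealth saved)

def F(x, a):
    """Period payoff: log utility of consuming the unsaved part of wealth."""
    return np.log(x * (1 - a) + 1e-12)

def T(x, a):
    """Law of motion: saved wealth grows at an assumed 5% gross return."""
    return 1.05 * x * a

V = np.zeros(len(states))                # terminal value: nothing after the last period
policy = []

for t in reversed(range(horizon)):
    V_new = np.empty(len(states))
    a_opt = np.empty(len(states))
    for i, x in enumerate(states):
        # One-period problem: F(x, a) + beta * V_{t+1}(T(x, a)), with V interpolated on the grid.
        candidates = F(x, actions) + beta * np.interp(T(x, actions), states, V)
        j = np.argmax(candidates)
        V_new[i], a_opt[i] = candidates[j], actions[j]
    V = V_new
    policy.append(a_opt)                 # decision rule for period t (stored back to front)

policy.reverse()                         # policy[t][i]: optimal action at period t in state states[i]
```

At each step the stored value function stands in for "the value of the remaining decision problem", so only a one-period optimization is ever carried out.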
Let x_t be the state at time t. For a decision that begins at time 0, we take as given the initial state x_0. At any time, the set of possible actions depends on the current state; we write this as a_t \in \Gamma(x_t), where a particular action a_t represents particular values for one or more control variables, and \Gamma(x_t) is the set of actions available to be taken at state x_t. It is also assumed that the state changes from x to a new state T(x, a) when action a is taken, and that the current payoff from taking action a in state x is F(x, a). Finally, we assume impatience, represented by a discount factor 0 < \beta < 1.
Under these assumptions, an infinite-horizon decision problem takes the following form:
V(x_0) = \max_{\{a_t\}_{t=0}^{\infty}} \sum_{t=0}^{\infty} \beta^t F(x_t, a_t),
subject to the constraints
a_t \in \Gamma(x_t), \quad x_{t+1} = T(x_t, a_t), \quad \forall t = 0, 1, 2, \ldots
Notice that we have defined notation V(x_0) to denote the optimal value that can be obtained by maximizing this objective function subject to the assumed constraints. This function is the value function. It is a function of the initial state variable x_0, since the best value obtainable depends on the initial situation.
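As a concrete illustration, the following minimal sketch encodes one possible instantiation of the objects \Gamma, T, F, and \beta for a simple "cake-eating" problem; the functional forms and parameter value are assumptions chosen for illustration, not part of the original text.

```python
import numpy as np

# One possible instantiation of the problem primitives (assumed, not from the text):
# the state x is the amount of "cake" remaining and the action a is the amount eaten.
beta = 0.9                      # discount factor, 0 < beta < 1

def Gamma(x):
    """Feasible action set at state x: eat any nonnegative amount up to x."""
    return (0.0, x)             # represented here as the interval [0, x]

def T(x, a):
    """Law of motion: next period's cake is what was not eaten."""
    return x - a

def F(x, a):
    """Period payoff: log utility of the amount eaten (small floor avoids log(0))."""
    return np.log(max(a, 1e-12))
```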
The dynamic programming method breaks this decision problem into smaller subproblems. Bellman's principle of optimality describes how to do this:
Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)[10]
In computer science, a problem that can be broken apart like this is said to have optimal substructure. In the context of dynamic game theory, this principle is analogous to the concept of subgame perfect equilibrium, although what constitutes an optimal policy in this case is conditioned on the decision-maker's opponents choosing similarly optimal policies from their points of view.
As suggested by the principle of optimality, we will consider the first decision separately, setting aside all future decisions (we will start afresh from time 1 with the new state x_1). Collecting the future decisions in brackets on the right, the above infinite-horizon decision problem is equivalent to:
\max_{a_0} \left\{ F(x_0, a_0) + \beta \left[ \max_{\{a_t\}_{t=1}^{\infty}} \sum_{t=1}^{\infty} \beta^{t-1} F(x_t, a_t) : a_t \in \Gamma(x_t),\ x_{t+1} = T(x_t, a_t),\ \forall t \geq 1 \right] \right\}
subject to the constraints
a_0 \in \Gamma(x_0), \quad x_1 = T(x_0, a_0).
Here we are choosing a_0, knowing that our choice will cause the time 1 state to be x_1 = T(x_0, a_0). That new state will then affect the decision problem from time 1 on. The whole future decision problem appears inside the square brackets on the right.
So far it seems we have only made the problem uglier by separating today's decision from future decisions. But we can simplify by noticing that what is inside the square brackets on the right is the value of the time 1 decision problem, starting from state x_1 = T(x_0, a_0).
Therefore, the problem can be rewritten as a recursive definition of the value function:
V(x_0) = \max_{a_0} \{ F(x_0, a_0) + \beta V(x_1) \},
subject to the constraints a_0 \in \Gamma(x_0), \quad x_1 = T(x_0, a_0).
This is the Bellman equation. It may be simplified even further if the time subscripts are dropped and the value of the next state is plugged in:
V(x) = \max_{a \in \Gamma(x)} \{ F(x, a) + \beta V(T(x, a)) \}.
The Bellman equation is classified as a functional equation, because solving it means finding the unknown function V, which is the value function. Recall that the value function describes the best possible value of the objective as a function of the state x. By calculating the value function, we will also find the function a(x) that describes the optimal action as a function of the state; this is called the policy function.
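One standard way to solve this functional equation numerically is value function iteration: start from a guess for V, apply the right-hand side of the Bellman equation repeatedly, and stop when successive iterates are close. The sketch below does this on a grid for the cake-eating specification assumed earlier (log payoff, T(x, a) = x - a); those functional forms, grids, and tolerances are illustrative assumptions.

```python
import numpy as np

# Value function iteration for V(x) = max_{a in Gamma(x)} { F(x, a) + beta * V(T(x, a)) }.
# The cake-eating primitives (log payoff, T(x, a) = x - a) are illustrative assumptions.
beta = 0.9
grid = np.linspace(1e-3, 1.0, 300)                 # state grid: cake remaining

def F(x, a):
    return np.log(np.maximum(a, 1e-12))            # period payoff

def T(x, a):
    return x - a                                   # law of motion

V = np.zeros(len(grid))                            # initial guess V_0 = 0
for _ in range(2000):
    V_new = np.empty(len(grid))
    a_pol = np.empty(len(grid))
    for i, x in enumerate(grid):
        a_grid = np.linspace(1e-12, x, 100)        # Gamma(x): eat between 0 and x
        vals = F(x, a_grid) + beta * np.interp(T(x, a_grid), grid, V)
        j = np.argmax(vals)
        V_new[i], a_pol[i] = vals[j], a_grid[j]
    if np.max(np.abs(V_new - V)) < 1e-6:           # sup-norm stopping rule
        V = V_new
        break
    V = V_new

# V now approximates the value function and a_pol the policy function a(x).
```

Because 0 < \beta < 1 and the period payoff is bounded on the grid, the Bellman operator is a contraction in the sup norm, so the iteration converges to a unique fixed point.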
See also: Markov decision process.
In the deterministic setting, other techniques besides dynamic programming can be used to tackle the above optimal control problem. However, the Bellman equation is often the most convenient method of solving stochastic optimal control problems.
For a specific example from economics, consider an infinitely lived consumer with initial wealth endowment a_0 at period 0. The consumer has an instantaneous utility function u(c), where c denotes consumption, and discounts next-period utility at a rate 0 < \beta < 1. Assume that what is not consumed in period t carries over to the next period with interest rate r. Then the consumer's utility maximization problem is to choose a consumption plan \{c_t\} that solves
\max \sum_{t=0}^{\infty} \beta^t u(c_t)
subject to
a_{t+1} = (1+r)(a_t - c_t), \quad c_t \geq 0,
and
\lim_{t \to \infty} a_t \geq 0.
The first constraint is the capital accumulation/law of motion specified by the problem, while the second constraint is a transversality condition that the consumer does not carry debt at the end of their life. The Bellman equation is
V(a) = \max_{0 \le c \le a} \{ u(c) + \beta V((1+r)(a - c)) \}.
Alternatively, one can treat the sequence problem directly using, for example, the Hamiltonian equations.
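As a numerical counterpart to these approaches, the consumer's Bellman equation above can also be solved by value function iteration on a wealth grid. The log utility specification and parameter values in the sketch below are illustrative assumptions.

```python
import numpy as np

# Value function iteration for V(a) = max_{0 <= c <= a} { u(c) + beta * V((1+r)(a-c)) }.
# Log utility and the parameter values are illustrative assumptions.
beta, r = 0.95, 0.04
grid = np.linspace(1e-3, 10.0, 400)       # wealth grid

def u(c):
    return np.log(np.maximum(c, 1e-12))

V = np.zeros(len(grid))
for _ in range(2000):
    V_new = np.empty(len(grid))
    c_pol = np.empty(len(grid))
    for i, a in enumerate(grid):
        c = np.linspace(1e-12, a, 100)                     # feasible consumption in (0, a]
        vals = u(c) + beta * np.interp((1 + r) * (a - c), grid, V)
        j = np.argmax(vals)
        V_new[i], c_pol[i] = vals[j], c[j]
    if np.max(np.abs(V_new - V)) < 1e-6:
        V = V_new
        break
    V = V_new

# For this log-utility specification the exact policy is c = (1 - beta) * a,
# which provides a useful check on the computed c_pol.
```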
Now, if the interest rate varies from period to period, the consumer is faced with a stochastic optimization problem. Let the interest r follow a Markov process with probability transition function Q(r, d\mu_r), where d\mu_r denotes the probability measure governing the distribution of the interest rate next period if the current interest rate is r. In this model the consumer decides their current period consumption after the current period interest rate is announced.
Rather than simply choosing a single sequence \{c_t\}, the consumer now must choose a sequence \{c_t\} for each possible realization of \{r_t\} in such a way that their lifetime expected utility is maximized:
\max_{\{c_t\}_{t=0}^{\infty}} E\left( \sum_{t=0}^{\infty} \beta^t u(c_t) \right).
The expectation E is taken with respect to the appropriate probability measure given by Q on the sequences of r's. Because r is governed by a Markov process, dynamic programming simplifies the problem significantly. Then the Bellman equation is simply:
V(a, r) = \max_{0 \le c \le a} \left\{ u(c) + \beta \int V((1+r)(a - c), r') \, Q(r, d\mu_r) \right\}.
Under some reasonable assumptions, the resulting optimal policy function g(a, r) is measurable.
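A discretized sketch of this stochastic case is given below, with the continuous interest-rate process replaced by an assumed two-state Markov chain and log utility; both are illustrative choices, and the integral against Q(r, d\mu_r) becomes a finite weighted sum.

```python
import numpy as np

# Value function iteration for V(a, r) = max_c { u(c) + beta * E[ V((1+r)(a-c), r') | r ] },
# with the interest rate discretized to a two-state Markov chain (illustrative assumptions).
beta = 0.95
r_states = np.array([0.02, 0.06])            # low / high interest rate
P = np.array([[0.8, 0.2],                    # P[i, j] = Pr(r' = r_states[j] | r = r_states[i])
              [0.3, 0.7]])
grid = np.linspace(1e-3, 10.0, 300)          # wealth grid

def u(c):
    return np.log(np.maximum(c, 1e-12))

V = np.zeros((len(r_states), len(grid)))     # V[k, i] approximates V(grid[i], r_states[k])
for _ in range(2000):
    V_new = np.empty_like(V)
    for k, r in enumerate(r_states):
        EV = P[k] @ V                        # expected continuation value over next-period rates
        for i, a in enumerate(grid):
            c = np.linspace(1e-12, a, 80)
            vals = u(c) + beta * np.interp((1 + r) * (a - c), grid, EV)
            V_new[k, i] = vals.max()
    if np.max(np.abs(V_new - V)) < 1e-6:
        V = V_new
        break
    V = V_new
```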
For a general stochastic sequential optimization problem with Markovian shocks and where the agent is faced with their decision ex-post, the Bellman equation takes a very similar form:
V(x, z) = \max_{c} \left\{ F(x, c, z) + \beta \int V(T(x, c), z') \, d\mu_z(z') \right\}.
The first known application of a Bellman equation in economics is due to Martin Beckmann and Richard Muth.[16] Martin Beckmann also wrote extensively on consumption theory using the Bellman equation in 1959. His work influenced Edmund S. Phelps, among others.
A celebrated economic application of a Bellman equation is Robert C. Merton's seminal 1973 article on the intertemporal capital asset pricing model.[17] (See also Merton's portfolio problem). The solution to Merton's theoretical model, one in which investors chose between income today and future income or capital gains, is a form of Bellman's equation. Because economic applications of dynamic programming usually result in a Bellman equation that is a difference equation, economists refer to dynamic programming as a "recursive method" and a subfield of recursive economics is now recognized within economics.
Nancy Stokey, Robert E. Lucas, and Edward Prescott describe stochastic and nonstochastic dynamic programming in considerable detail, and develop theorems for the existence of solutions to problems meeting certain conditions. They also describe many examples of modeling theoretical problems in economics using recursive methods.[18] This book led to dynamic programming being employed to solve a wide range of theoretical problems in economics, including optimal economic growth, resource extraction, principal–agent problems, public finance, business investment, asset pricing, factor supply, and industrial organization. Lars Ljungqvist and Thomas Sargent apply dynamic programming to study a variety of theoretical questions in monetary policy, fiscal policy, taxation, economic growth, search theory, and labor economics.[19] Avinash Dixit and Robert Pindyck showed the value of the method for thinking about capital budgeting.[20] Anderson adapted the technique to business valuation, including privately held businesses.[21]
Using dynamic programming to solve concrete problems is complicated by informational difficulties, such as choosing the unobservable discount rate. There are also computational issues, the main one being the curse of dimensionality arising from the vast number of possible actions and potential state variables that must be considered before an optimal strategy can be selected. For an extensive discussion of computational issues, see Miranda and Fackler,[22] and Meyn 2007.[23]
In Markov decision processes, a Bellman equation is a recursion for expected rewards. For example, the expected reward for being in a particular state s and following some fixed policy \pi has the Bellman equation:
V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{\pi}(s').
This equation describes the expected reward for taking the action prescribed by some policy \pi.
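For a finite MDP this fixed-policy equation is linear in V^{\pi}, so it can be solved directly rather than by iteration. The sketch below does so for a small, made-up MDP; the transition probabilities, rewards, and policy are illustrative assumptions.

```python
import numpy as np

# Policy evaluation: solve V = R_pi + gamma * P_pi V for a fixed policy pi
# on a small, made-up MDP (transition probabilities and rewards are illustrative).
gamma = 0.9
n_states = 3

# P[a, s, s'] = Pr(s' | s, a); R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])
R = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 0.5]])

pi = np.array([0, 1, 0])                     # fixed deterministic policy: one action per state

P_pi = P[pi, np.arange(n_states)]            # P_pi[s, s'] = P(s' | s, pi(s))
R_pi = R[np.arange(n_states), pi]            # R_pi[s]     = R(s, pi(s))

# Solve the linear system (I - gamma * P_pi) V = R_pi.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(V_pi)
```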
The equation for the optimal policy is referred to as the Bellman optimality equation:
V^{\pi^{*}}(s) = \max_{a} \left\{ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^{\pi^{*}}(s') \right\},
where \pi^{*} denotes the optimal policy and V^{\pi^{*}} refers to the value function of the optimal policy. The equation above describes the reward for taking the action giving the highest expected return.
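The optimality equation can be solved by value iteration: repeatedly apply the max on the right-hand side until the value function stops changing, then read off a greedy policy. The small MDP below is the same illustrative one assumed above.

```python
import numpy as np

# Value iteration on the Bellman optimality equation for a small, made-up MDP
# (transition probabilities and rewards are illustrative assumptions).
gamma = 0.9
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])   # P[a, s, s']
R = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 0.5]])                     # R[s, a]

V = np.zeros(3)
for _ in range(1000):
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P(s' | s, a) * V(s')
    Q = R + gamma * np.einsum('asx,x->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)     # greedy policy with respect to the converged value function
```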