In machine learning, automatic basis function construction (or basis discovery) is the mathematical method of looking for a set of task-independent basis functions that map the state space to a lower-dimensional embedding, while still representing the value function accurately. Automatic basis construction is independent of prior knowledge of the domain, which allows it to perform well where expert-constructed basis functions are difficult or impossible to create.
In reinforcement learning (RL), most real-world Markov Decision Process (MDP) problems have large or continuous state spaces, which typically require some sort of approximation to be represented efficiently.
Linear function approximators[1] (LFAs) are widely adopted for their low theoretical complexity. Two sub-problems need to be solved for better approximation: weight optimization and basis construction. One way to solve the second problem is to design special basis functions by hand. Such basis functions work well in specific tasks but are restricted to their domains. Thus, constructing basis functions automatically is preferred for broader applications.
A Markov decision process with a finite state space and a fixed policy is defined by the 5-tuple (S, A, P, \gamma, r), where S = \{1, 2, \ldots, s\} is the finite state space, A is the action set, P is the transition probability matrix, \gamma \in [0,1) is the discount factor, and r is the reward vector. The Bellman equation is defined as:

v = r + \gamma P v.
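As a concrete illustration, the sketch below solves this linear system for a hypothetical two-state MDP; the matrices and the discount factor are invented for the example.

    import numpy as np

    # Hypothetical 2-state MDP under a fixed policy (numbers invented).
    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])   # row-stochastic transition matrix
    r = np.array([1.0, 0.0])     # reward vector
    gamma = 0.95                 # discount factor

    # Bellman equation v = r + gamma*P*v  <=>  (I - gamma*P) v = r
    v = np.linalg.solve(np.eye(2) - gamma * P, r)
    print(v)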
When the number of elements in S is small, v can be represented as a vector with one component per state. When S is large or infinite, v is approximated by a linear combination of basis functions \Phi = \{\phi_1, \phi_2, \ldots, \phi_n\}:

v \approx \hat{v} = \sum_{i=1}^{n} \theta_i \phi_i = \Phi\theta.

Here \Phi is an |S| \times n matrix whose columns are the basis functions, \theta is a weight vector with n components, and n \ll |S|.
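To make the notation concrete, here is a minimal sketch assuming a hypothetical four-state chain MDP and an arbitrary two-column basis; the weight vector \theta is fit by least squares so that \Phi\theta approximates v.

    import numpy as np

    # Hypothetical 4-state chain MDP (all numbers invented).
    P = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.5, 0.0, 0.5, 0.0],
                  [0.0, 0.5, 0.0, 0.5],
                  [0.0, 0.0, 0.5, 0.5]])
    r = np.array([0.0, 0.0, 0.0, 1.0])
    gamma = 0.9

    v = np.linalg.solve(np.eye(4) - gamma * P, r)  # exact value function

    # Basis matrix Phi (|S| x n, here n = 2): a constant and a linear feature.
    Phi = np.column_stack([np.ones(4), np.arange(4)])

    # Weight optimization: least-squares projection of v onto span(Phi).
    theta, *_ = np.linalg.lstsq(Phi, v, rcond=None)
    v_hat = Phi @ theta                            # approximate value function
    print(np.linalg.norm(v - v_hat))               # approximation error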
Basis construction looks for ways to construct better basis functions \Phi automatically. A good construction method should produce a compact basis (small n) that still represents the value function accurately.
In the proto-value function approach, Mahadevan analyzes the connectivity graph between states to determine a set of basis functions.
The normalized graph Laplacian is defined as:

L = I - D^{-1/2} W D^{-1/2}.
Here W is the adjacency matrix of the undirected graph (N, E) formed by the states of the fixed-policy MDP, and D is the diagonal matrix of node degrees.
In a discrete state space, the adjacency matrix W can be constructed by connecting states that are joined by one-step transitions; the basis functions are then the eigenvectors of L with the smallest eigenvalues, which are the smoothest functions over the graph.
This spectral framework can be used for value function approximation (VFA). Given the fixed policy, the edge weights are determined by the transition probabilities between the corresponding states. To obtain smooth value approximations, diffusion wavelets are used.[3]
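A minimal sketch of this construction, assuming a hypothetical undirected chain graph over four states: form the normalized Laplacian and take the eigenvectors with the smallest eigenvalues (the smoothest functions on the graph) as basis functions.

    import numpy as np

    # Hypothetical adjacency matrix W of an undirected 4-state chain graph.
    W = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    deg = W.sum(axis=1)                      # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))

    # Normalized graph Laplacian: L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(4) - D_inv_sqrt @ W @ D_inv_sqrt

    # Eigenvectors with the smallest eigenvalues are the smoothest over
    # the graph; keep the first n as proto-value basis functions.
    eigvals, eigvecs = np.linalg.eigh(L)
    n = 2
    Phi = eigvecs[:, :n]                     # |S| x n basis matrix
    print(eigvals[:n])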
Krylov basis construction uses the actual transition matrix instead of the random-walk Laplacian. This method assumes that the transition model P and the reward r are available.
The vectors in the Neumann series are denoted y_i = P^i r for i \in [0, \infty). It can be shown that the Krylov space spanned by y_0, y_1, \ldots, y_{m-1} is enough to represent any value function, where m is the degree of the minimal polynomial of I - \gamma P.
Suppose the minimal polynomial takes the form

p(A) = \frac{1}{\alpha_0} \sum_{i=0}^{m-1} \alpha_{i+1} A^i,

with p(A) A = I, so that B = p(A) is the inverse of A = I - \gamma P. The value function then satisfies

v = B r = \frac{1}{\alpha_0} \sum_{i=0}^{m-1} \alpha_{i+1} (I - \gamma P)^i r = \sum_{i=0}^{m-1} \beta_i y_i,

where the coefficients \beta_i collect the binomial expansion of each power (I - \gamma P)^i in terms of the vectors y_i; hence v lies in the span of the first m Krylov vectors.
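The sketch below (reusing a hypothetical chain MDP with invented numbers) builds the Krylov vectors y_i = P^i r and checks numerically that the exact value function lies in their span.

    import numpy as np

    P = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.5, 0.0, 0.5, 0.0],
                  [0.0, 0.5, 0.0, 0.5],
                  [0.0, 0.0, 0.5, 0.5]])
    r = np.array([0.0, 0.0, 0.0, 1.0])
    gamma = 0.9

    # Krylov (Neumann series) vectors y_i = P^i r, i = 0, ..., m-1.
    m = 4
    Y = np.column_stack([np.linalg.matrix_power(P, i) @ r for i in range(m)])

    # Exact value function and its least-squares coordinates in the basis.
    v = np.linalg.solve(np.eye(4) - gamma * P, r)
    beta, *_ = np.linalg.lstsq(Y, v, rcond=None)
    print(np.linalg.norm(Y @ beta - v))      # ~0: v lies in the Krylov span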
Algorithm Augmented Krylov Method[5]

    z_1, z_2, \ldots, z_k : the k augmenting vectors (typically top eigenvectors of P)
    z_{k+1} := r
    for i := 1 : (l + k)
        if i > k + 1
            z_i := P z_{i-1}
        end if
        for j := 1 : (i - 1)
            z_i := z_i - <z_j, z_i> z_j
        end for
        if ||z_i|| ≈ 0
            break
        end if
        z_i := z_i / ||z_i||   (normalize, so the projections above are valid)
    end for
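A Python sketch of this pseudocode, assuming the augmenting vectors z_1, \ldots, z_k are top eigenvectors of P; the example data and the choice k = 1, l = 3 are invented.

    import numpy as np

    def augmented_krylov(P, r, k, l):
        # Orthonormal basis from the top-k eigenvectors of P augmented
        # with Krylov vectors of the reward r (sketch of the pseudocode).
        eigvals, eigvecs = np.linalg.eig(P)
        order = np.argsort(-eigvals.real)                  # descending eigenvalues
        Z = [eigvecs[:, order[j]].real for j in range(k)]  # z_1 .. z_k
        Z.append(r.astype(float))                          # z_{k+1} := r
        basis = []
        for i in range(k + l):
            z = Z[i] if i < len(Z) else P @ basis[-1]      # z_i := P z_{i-1}
            for q in basis:                                # Gram-Schmidt step
                z = z - (q @ z) * q
            norm = np.linalg.norm(z)
            if norm < 1e-10:                               # ||z_i|| ~ 0: stop
                break
            basis.append(z / norm)                         # normalize
        return np.column_stack(basis)

    P = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.5, 0.0, 0.5, 0.0],
                  [0.0, 0.5, 0.0, 0.5],
                  [0.0, 0.0, 0.5, 0.5]])
    r = np.array([0.0, 0.0, 0.0, 1.0])
    Phi = augmented_krylov(P, r, k=1, l=3)
    print(Phi.shape)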
The Bellman error, from which Bellman error basis functions (BEBFs) are built, is defined as:

\varepsilon = r + \gamma P \hat{v} - \hat{v} = r + \gamma P \Phi\theta - \Phi\theta.

Loosely speaking, the Bellman error points towards the optimal value function.[6] Each new BEBF is orthogonal to the space spanned by the current basis, so the sequence of BEBFs forms an orthogonal basis; with a sufficient number of BEBFs, any value function can be represented exactly.
Algorithm BEBF

    stage i = 1: \phi_1 = r
    stage i \in [2, N]:
        compute the weight vector \theta_i for the current basis \Phi_i
        compute the new Bellman error: \varepsilon = r + \gamma P \Phi_i \theta_i - \Phi_i \theta_i
        add the Bellman error to form a new basis function: \Phi_{i+1} = [\Phi_i : \varepsilon]
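A minimal sketch of this loop on invented data; here \theta_i is computed from the projected (LSTD-style) fixed point \Phi^T (r + \gamma P \Phi\theta - \Phi\theta) = 0, one reasonable choice since the algorithm leaves the weight optimizer open. With this choice, each Bellman error is orthogonal to the current basis.

    import numpy as np

    def bebf_basis(P, r, gamma, N):
        # Grow a basis by repeatedly appending the Bellman error.
        Phi = r.reshape(-1, 1).astype(float)       # stage 1: phi_1 = r
        for _ in range(2, N + 1):
            # theta_i from (Phi^T Phi - gamma Phi^T P Phi) theta = Phi^T r.
            A = Phi.T @ Phi - gamma * Phi.T @ P @ Phi
            theta = np.linalg.solve(A, Phi.T @ r)
            # Bellman error of the current approximation:
            eps = r + gamma * P @ Phi @ theta - Phi @ theta
            if np.linalg.norm(eps) < 1e-10:        # value function is exact
                break
            Phi = np.column_stack([Phi, eps])      # Phi_{i+1} = [Phi_i : eps]
        return Phi

    P = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.5, 0.0, 0.5, 0.0],
                  [0.0, 0.5, 0.0, 0.5],
                  [0.0, 0.0, 0.5, 0.5]])
    r = np.array([0.0, 0.0, 0.0, 1.0])
    print(bebf_basis(P, r, gamma=0.9, N=4).shape)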
Bellman Average Reward Bases (BARBs)[7] are similar to Krylov bases, but the reward function is dilated by the average-adjusted transition matrix P - P^*, where P^* is the limiting (average) transition matrix of P. BARBs converge faster than BEBFs and Krylov bases when \gamma is close to 1.
Algorithm BARBs

    stage i = 1: \phi_1 = P^* r
    stage i \in [2, N]:
        compute the weight vector \theta_i for the current basis \Phi_i
        compute the new basis function: \phi_{i+1} = r - P^* r + P \Phi_i \theta_i - \Phi_i \theta_i
        add it to the basis: \Phi_{i+1} = [\Phi_i : \phi_{i+1}]
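A sketch of BARBs on invented data; P^* is computed from the stationary distribution (valid for an ergodic chain), and \theta_i is fit by a plain least-squares choice, since the weight-optimization step is left open.

    import numpy as np

    def barbs_basis(P, r, N):
        # Limiting matrix P*: for an ergodic chain, P* = 1 pi^T with pi
        # the stationary distribution of P.
        eigvals, eigvecs = np.linalg.eig(P.T)
        pi = eigvecs[:, np.argmax(eigvals.real)].real
        pi = pi / pi.sum()
        P_star = np.outer(np.ones(len(r)), pi)

        Phi = (P_star @ r).reshape(-1, 1)          # stage 1: phi_1 = P* r
        for _ in range(2, N + 1):
            # theta_i: least-squares fit of (I - P) Phi theta ~ r - P* r.
            theta, *_ = np.linalg.lstsq((np.eye(len(r)) - P) @ Phi,
                                        r - P_star @ r, rcond=None)
            # New basis vector: phi = r - P* r + P Phi theta - Phi theta.
            phi_new = r - P_star @ r + P @ Phi @ theta - Phi @ theta
            if np.linalg.norm(phi_new) < 1e-10:
                break
            Phi = np.column_stack([Phi, phi_new])  # Phi_{i+1} = [Phi_i : phi]
        return Phi

    P = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.5, 0.0, 0.5, 0.0],
                  [0.0, 0.5, 0.0, 0.5],
                  [0.0, 0.0, 0.5, 0.5]])
    r = np.array([0.0, 0.0, 0.0, 1.0])
    print(barbs_basis(P, r, N=4).shape)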
There are two principal types of basis construction methods.
The first type is reward-sensitive, like Krylov bases and BEBFs; these methods dilate the reward function geometrically through the transition matrix. However, when the discount factor \gamma is close to 1, the underlying Neumann series converges slowly and many basis functions are needed; BARBs alleviate this by separating out the average reward.
The other type is reward-insensitive: proto-value basis functions derived from the graph Laplacian. This approach uses graph information, but the construction of the adjacency matrix makes the method hard to analyze.