In detail, the algorithm is as follows:
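The pseudocode itself is not reproduced in this excerpt. As a hedged illustration only, here is a minimal Python sketch of fitted value iteration with a linear value function $V(s) = \theta^T \phi(s)$; the helpers `sample_states`, `simulate`, `reward`, and `phi`, along with all parameter names, are stand-ins that a real application would supply, not names from the original text.

```python
import numpy as np

def fitted_value_iteration(sample_states, actions, simulate, reward, phi,
                           gamma=0.95, k=10, n_iters=50):
    """Fitted value iteration with linear regression, V(s) = theta^T phi(s).

    sample_states : list of m sampled states s^(1), ..., s^(m)
    actions       : finite action set A
    simulate      : simulate(s, a) -> one sampled successor s' ~ P_sa
    reward        : reward(s) -> immediate reward R(s)
    phi           : phi(s) -> feature vector for a state
    """
    m = len(sample_states)
    Phi = np.array([phi(s) for s in sample_states])   # m x d design matrix
    theta = np.zeros(Phi.shape[1])                    # initialize V := 0

    for _ in range(n_iters):
        y = np.empty(m)
        for i, s in enumerate(sample_states):
            # q(a): Monte Carlo estimate of R(s) + gamma * E[V(s')],
            # averaged over k sampled successors per action
            q = [np.mean([reward(s) + gamma * theta @ phi(simulate(s, a))
                          for _ in range(k)])
                 for a in actions]
            y[i] = max(q)                             # y^(i) = max_a q(a)
        # Supervised step: fit theta so that V(s^(i)) is close to y^(i)
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    return theta
```

The least-squares fit is exactly the supervised regression step discussed below; swapping in another regressor only changes that one line.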
Above, we wrote out fitted value iteration using linear regression as the algorithm for making $V\left({s}^{\left(i\right)}\right)$ close to ${y}^{\left(i\right)}$ . That step of the algorithm is completely analogous to a standard supervised learning (regression) problem in which we have a training set $({x}^{\left(1\right)},{y}^{\left(1\right)}),({x}^{\left(2\right)},{y}^{\left(2\right)}),\ldots,({x}^{\left(m\right)},{y}^{\left(m\right)})$ and want to learn a function mapping from $x$ to $y$ ; the only difference is that here $s$ plays the role of $x$ . Even though our description above used linear regression, other regression algorithms (such as locally weighted linear regression) can clearly be used as well.
Unlike value iteration over a discrete set of states, fitted value iteration cannot, in general, be proved to converge. However, in practice it often does converge (or approximately converge), and it works well for many problems. Note also that if we are using a deterministic simulator/model of the MDP, then fitted value iteration can be simplified by setting $k=1$ in the algorithm. This is because the expectation in Equation [link] becomes an expectation over a deterministic distribution, so a single example is sufficient to compute that expectation exactly. Otherwise, in the algorithm above, we had to draw $k$ samples and average in order to approximate that expectation (see the definition of $q\left(a\right)$ in the algorithm pseudo-code).
Finally, fitted value iteration outputs $V$ , which is an approximation to ${V}^{*}$ . This implicitly defines our policy. Specifically, when our system is in some state $s$ and we need to choose an action, we would like to choose the action

$$\arg \underset{a}{\max }\, {E}_{{s}^{\text{'}}\sim {P}_{sa}}\left[V\left({s}^{\text{'}}\right)\right]$$
The process for computing/approximating this is similar to the inner-loop of fitted value iteration, where for each action, we sample ${s}_{1}^{\text{'}},...,{s}_{k}^{\text{'}}\sim {P}_{sa}$ to approximate the expectation. (And again, if the simulator is deterministic, we can set $k=1$ .)
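This sampling step can be sketched as follows; `simulate`, `phi`, and the parameter names are illustrative stand-ins, not names from the original text.

```python
import numpy as np

def greedy_action(s, actions, simulate, phi, theta, k=10):
    """Pick the action maximizing E_{s' ~ P_sa}[V(s')], estimated with
    k sampled successors per action, where V(s') = theta^T phi(s').

    With a deterministic simulator, k=1 computes the expectation exactly.
    """
    def estimated_value(a):
        # Average V over k successors s'_1, ..., s'_k ~ P_sa
        return np.mean([theta @ phi(simulate(s, a)) for _ in range(k)])
    return max(actions, key=estimated_value)
```

Note that the immediate reward $R(s)$ is the same for every action here, so comparing the expected values of $V$ at the successor states suffices.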
In practice, there are often other ways to approximate this step as well. For example, one very common case is when the simulator is of the form ${s}_{t+1}=f({s}_{t},{a}_{t})+{\epsilon}_{t}$ , where $f$ is some deterministic function of the states (such as $f({s}_{t},{a}_{t})=A{s}_{t}+B{a}_{t}$ ) and ${\epsilon}_{t}$ is zero-mean Gaussian noise. In this case, we can pick the action given by

$$\arg \underset{a}{\max }\, V\left(f\left(s,a\right)\right)$$
In other words, here we are just setting ${\epsilon}_{t}=0$ (i.e., ignoring the noise in the simulator) and setting $k=1$ . Equivalently, this can be derived from Equation [link] using the approximation

$${E}_{{s}^{\text{'}}}\left[V\left({s}^{\text{'}}\right)\right]\approx V\left({E}_{{s}^{\text{'}}}\left[{s}^{\text{'}}\right]\right)=V\left(f\left(s,a\right)\right)$$
where here the expectation is over the random ${s}^{\text{'}}\sim {P}_{sa}$ . So long as the noise terms ${\epsilon}_{t}$ are small, this will usually be a reasonable approximation.
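As a hedged sketch of this noise-free shortcut with the linear dynamics example above, one could compute the greedy action directly from the mean dynamics $f(s,a)=As+Ba$; the matrices `A`, `B` and the helpers `phi`, `theta` below are illustrative assumptions, not part of the original text.

```python
import numpy as np

def greedy_action_deterministic(s, actions, A, B, phi, theta):
    """Pick arg max_a V(f(s, a)) with eps_t set to 0, i.e. using only the
    deterministic part f(s, a) = A s + B a of the dynamics, and k = 1."""
    return max(actions, key=lambda a: theta @ phi(A @ s + B @ a))
```

This evaluates $V$ once per action, avoiding the $k\left|A\right|$ successor samples that the stochastic version requires.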
However, for problems that don't lend themselves to such approximations, having to sample $k\left|A\right|$ states using the model, in order to approximate the expectation above, can be computationally expensive.