
It turns out Bellman’s equation gives you a way of doing this. So coming back to this board – sort of coming back to the previous boards – could you move the camera to point to this board? Okay, cool. So going back to this board, we wrote down Bellman’s equation, this equation, right. Let’s say I have a fixed policy π and I want to solve for the value function for the policy π. Then what this equation does is just impose a set of linear constraints on the value function. So in particular, this says that the value for a given state is equal to some constant, plus some linear function of the other values.

And so you can write down one such equation for every state in your MDP, and this imposes a set of linear constraints on what the value function could be. And it turns out that by solving the resulting linear system of equations, you can then solve for the value function Vπ(s). That’s the high-level description. Let me now make this concrete.
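As a concrete sketch of what “solving the resulting linear system” looks like, here is a minimal Python example. This is not from the lecture: the name evaluate_policy and the argument layout are just illustrative, and it assumes rewards depend only on the state (as in the lecture’s R(s)) and that the transition matrix for the fixed policy has already been assembled.

```python
import numpy as np

def evaluate_policy(R, P_pi, gamma):
    """Solve Bellman's equation for a fixed policy pi.

    For a fixed policy, Bellman's equation V = R + gamma * P_pi @ V is
    linear in V, so it can be solved directly as (I - gamma * P_pi) V = R.

    R     : length-n vector of state rewards, R[s] = R(s)
    P_pi  : n x n matrix, P_pi[s, s'] = P(s' | s, pi(s))
    gamma : discount factor, 0 <= gamma < 1
    """
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
```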

So specifically, let me take the (3,1) state, the state we were using as an example. So Bellman’s equation tells me the value of π for the (3,1) state – oh, and let’s say I have a specific policy so that π of (3,1) – let’s say it takes the North action, which is not the optimal action. For this policy, Bellman’s equation tells me that Vπ of (3,1) is equal to R of the state (3,1), plus gamma times [0.8 times Vπ of (3,2), since with probability 0.8 I get to the (3,2) state, plus 0.1 times Vπ of (4,1), since with probability 0.1 I get to the (4,1) state, plus 0.1 times Vπ of (2,1), since with probability 0.1 I get to the (2,1) state].

And so what I’ve done is I’ve written down Bellman’s equation for the (3,1) state. I hope you know what that means. In my MDP, I’m indexing the states by their grid coordinates, so this state over there where I drew the circle is the (3,1) state.
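Just to show the shape of that single constraint in code, here is the one equation for the (3,1) state, with made-up values for the neighboring states and a hypothetical living reward and discount; only the structure matters here.

```python
gamma = 0.99
R_31 = -0.02   # hypothetical immediate reward at state (3,1)
V = {(3, 2): 0.50, (4, 1): 0.30, (2, 1): 0.40}   # made-up values for the other states

# V_pi((3,1)) = R((3,1)) + gamma * [0.8 V_pi((3,2)) + 0.1 V_pi((4,1)) + 0.1 V_pi((2,1))]
V_31 = R_31 + gamma * (0.8 * V[(3, 2)] + 0.1 * V[(4, 1)] + 0.1 * V[(2, 1)])
print(V_31)
```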

So for every one of the 11 states in my MDP, I can write down an equation like this. This is just for one state. And you notice that if I’m trying to solve for the values – so these are the unknowns – then I have 11 variables, because I’m trying to solve for the value function at each of my 11 states, and I have 11 constraints, because I can write down 11 equations of this form, one such equation for each of my states.

So if you write down this sort of equation for every one of your states, you end up with a system of linear equations – 11 equations with 11 unknowns – and you can solve that linear system of equations to get an explicit solution for Vπ. More generally, if you have n states, you end up with n equations and n unknowns, and you solve that to get the values for all of your states.
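Here is a self-contained toy usage of that idea on a made-up three-state MDP (not the lecture’s 11-state grid world); the rewards and transition probabilities below are invented purely to show the “n equations, n unknowns” structure.

```python
import numpy as np

# Under some fixed policy pi, suppose the rewards and transitions are:
R = np.array([0.0, 0.0, 1.0])
P_pi = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],   # the third state is absorbing
])
gamma = 0.9

# Three Bellman equations, three unknowns V_pi(1), V_pi(2), V_pi(3),
# solved in one shot as a linear system.
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, R)
print(V_pi)
```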

Okay, cool. So actually, could you just raise your hand if this made sense? Cool.

All right, so that was the value function for a specific policy and how to solve for it. Let me define one more thing. So the optimal value function is defined as V*(s) = max over all policies π of Vπ(s). In other words, for any given state s, the optimal value function says: suppose I take a max over all possible policies π, what is the best possible expected sum of discounted rewards that I can get? Or, what is my optimal expected total payoff for starting at state s, taking a max over all possible control policies π.
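To make that definition concrete, here is a brute-force sketch that literally takes the max over all deterministic policies by enumerating them and evaluating each one with the same linear solve as above. This is only to illustrate the definition, not an efficient way to compute V*; the function name and argument layout are just illustrative.

```python
from itertools import product
import numpy as np

def optimal_value_by_enumeration(R, P, gamma):
    """V*(s) = max over policies pi of V_pi(s), computed by brute force.

    R : length-n reward vector, R[s] = R(s)
    P : array of shape (m, n, n), P[a, s, s'] = P(s' | s, a)
    Only practical for tiny MDPs: there are m**n deterministic policies.
    """
    m, n, _ = P.shape
    V_star = np.full(n, -np.inf)
    for pi in product(range(m), repeat=n):                  # one action per state
        P_pi = np.stack([P[pi[s], s] for s in range(n)])    # n x n transition matrix under pi
        V_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, R) # policy evaluation
        V_star = np.maximum(V_star, V_pi)                   # per-state max over policies
    return V_star
```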





