
So what I did was just take this probability of seeing this sequence of states and actions, and then just [inaudible] expanded it out explicitly like this. It turns out later on I’m going to need to write this sum of rewards a lot, so I’m just going to call this the payoff from now on. So whenever later in this lecture I write the word payoff, I just mean this sum. So our goal is to maximize the expected payoff, so our goal is to maximize this sum.

Let me actually just skip ahead. I’m going to write down what the final answer is, and then I’ll come back and justify the algorithm. So here’s the algorithm. This is how we’re going to update the parameters theta. We’re going to sample a state-action sequence. The way you do this is you just take your current stochastic policy and execute it in the MDP: start from some initial state, take a stochastic action according to your current stochastic policy, see where the state transition probabilities take you, and just do that for T time steps. That’s how you sample the state-action sequence. Then you compute the payoff, and then you perform this update.
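The update written on the board is not reproduced in the transcript, so here is a minimal sketch of one iteration of the loop just described, under some assumptions of mine: the standard REINFORCE update (step in the direction of the payoff times the sum of the gradients of log pi), a softmax policy over a small discrete MDP with a parameter matrix theta indexed by state and action, and a hypothetical env object with reset() and step(a) methods. None of these specifics are taken verbatim from the lecture.

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities of the stochastic policy pi_theta(. | s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_step(theta, env, T, alpha):
    """Sample one state-action sequence with the current policy, compute the
    payoff (the sum of rewards), and take one stochastic update step on theta.
    env.reset() and env.step(a) are an assumed interface, not from the lecture."""
    s = env.reset()
    grad_log_pi = np.zeros_like(theta)  # accumulates sum_t grad_theta log pi_theta(a_t | s_t)
    payoff = 0.0
    for _ in range(T):
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)   # stochastic action from the current policy
        # gradient of log pi for a softmax policy: one_hot(a) - p, in the row for state s
        grad_log_pi[s] -= p
        grad_log_pi[s, a] += 1.0
        s, r = env.step(a)                  # the transition probabilities take us to the next state
        payoff += r
    # the update: on expectation this moves theta uphill on the expected payoff
    return theta + alpha * payoff * grad_log_pi, payoff
```

The point of the sketch is just the shape of one iteration: sample a trajectory with the current stochastic policy, compute the payoff, then take one noisy update step.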

So let’s go back and figure out what this algorithm is doing. Notice that this algorithm performs stochastic updates, because on every step it updates theta according to this thing on the right-hand side, and this thing on the right-hand side depends very much on your payoff and on the state-action sequence you saw. Your state-action sequence is random, so on every step I’m taking a step that’s chosen randomly, because it depends on this random state-action sequence. So what I want to do is figure out, on average, how this changes the parameters theta. In particular, I want to know the expected value of the change to my parameters theta. Our goal is to maximize the expected value of the payoff, so as long as the updates are, on expectation, taking us uphill on the expected payoff, then we’re happy.

It turns out that this algorithm is a form of stochastic gradient ascent. Remember when I talked about stochastic gradient descent for least-squares regression, I said that you have some parameters, maybe you’re trying to minimize a quadratic function, and the parameters wander around randomly until they get close to the optimum of the quadratic surface. It turns out that the REINFORCE algorithm will be very much like that. It will be a stochastic gradient ascent algorithm in which the step we take on each iteration is a little bit random; it’s determined by the random state-action sequence, but on expectation this turns out to be essentially a gradient ascent algorithm. And so it’ll do something like this: it’ll wander around randomly, but on average take you towards the optimum.
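In symbols, the property being claimed here (the notation is my reconstruction, not copied from the board) is that the expected update equals a gradient ascent step on the expected payoff:

\[
\mathbb{E}\big[\Delta\theta\big] \;=\; \alpha \,\nabla_\theta\, \mathbb{E}\big[R(s_0) + R(s_1) + \cdots + R(s_T)\big].
\]

Each individual step is noisy because it depends on the particular state-action sequence we happened to sample, but on average the steps move theta uphill on the expected payoff.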

So let me go ahead and prove that now. What I’m going to do is derive a gradient ascent update rule for maximizing the expected payoff, and then hopefully show that by deriving a gradient ascent update rule, I end up with exactly this update on expectation.

Before I do the derivation, let me just remind you of the product rule for differentiation: if I have a product of functions, then the derivative of the product is given by taking the derivatives of the factors one at a time. So first I differentiate F, which gives me F prime, leaving the other two fixed; then I differentiate G, which gives me G prime, leaving the other two fixed; then I differentiate H, which gives me H prime, leaving the other two fixed. So that’s the product rule for derivatives.

Now refer back to the equation where earlier we wrote out the expected payoff: this sum over all state-action sequences of the probability of the sequence times the payoff. What I’m going to do is take the derivative of this expression with respect to the parameters theta, because I want to do gradient ascent on this function. So I’m going to take the derivative of this function with respect to theta, and then try to go uphill on this function.
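For reference, here are the two pieces of board work just described, written out in symbols; the policy notation \(\pi_\theta(a \mid s)\) and transition notation \(P_{sa}(s')\) are my reconstruction from the surrounding description. The product rule for three factors is

\[
\big(f(\theta)\,g(\theta)\,h(\theta)\big)' \;=\; f'(\theta)\,g(\theta)\,h(\theta) \;+\; f(\theta)\,g'(\theta)\,h(\theta) \;+\; f(\theta)\,g(\theta)\,h'(\theta),
\]

and the expected payoff, as a sum over state-action sequences of the probability of the sequence times the payoff, is

\[
\mathbb{E}[\text{payoff}] \;=\; \sum_{(s_0, a_0, \ldots, s_T, a_T)} P(s_0)\,\pi_\theta(a_0 \mid s_0)\,P_{s_0 a_0}(s_1)\,\pi_\theta(a_1 \mid s_1)\cdots \pi_\theta(a_T \mid s_T)\;\big[R(s_0) + \cdots + R(s_T)\big].
\]

Only the \(\pi_\theta\) factors depend on \(\theta\), so when we apply the product rule to this long product, we differentiate one \(\pi_\theta(a_t \mid s_t)\) at a time while leaving every other factor fixed.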

