The proof is not that difficult, but it is also sort of longer than I want to go over in this class. Yeah, that was a good point. Cool. Actually, any questions for any of these?
Okay, so we now have two algorithms for solving an MDP. Given the five-tuple, that is, the set of states, the set of actions, the state transition probabilities, the discount factor, and the reward function, you can now apply policy iteration or value iteration to compute the optimal policy for the MDP.
The last thing I want to talk about is what if you don’t know the state transition probabilities, and sometimes you won’t know the reward function R as well, but let’s leave that aside. And so for example, let’s say you’re trying to fly a helicopter and you don’t really know in advance what state your helicopter will transition to when you take an action in a certain state, because helicopter dynamics are kind of noisy. You often don’t really know what state you’ll end up in.
So the standard thing to do, or one standard thing to do, is then to try to estimate the state transition probabilities from data. Let me just write this out. It turns out that the MDP has its five-tuple, right? S, A, the transition probabilities, gamma, and R. S and A you almost always know. The state space is sort of up to you to define, whatever the state space is for the system you’re trying to control. The set of actions is, again, just whatever your actions are. So we almost always know these. Gamma, the discount factor, is something you choose depending on how much you want to trade off current versus future rewards. The reward function you usually know. There are some exceptional cases, but usually you come up with the reward function yourself, and so you usually know what it is. Sometimes you don’t, but let’s just leave that aside for now. The most common thing for you to have to learn is the state transition probabilities, so we’ll just talk about how to learn those.
So when you don’t know the state transition probabilities, the most common thing to do is just estimate them from data. So what I mean is, imagine some robot, maybe it’s a robot roaming around the hallway, like in that grid example. You would then have the robot just take actions in the MDP, and you would then estimate your state transition probabilities P subscript (s, a) of s prime to be pretty much exactly what you’d expect.
This would be the number of times you took action a in state s and you got to s prime, divided by the number of times you took action a in state s. Okay? So the estimate is just, of all the times you took action a in state s, what’s the fraction of times you actually got to state s prime. It’s pretty much exactly what you’d expect it to be. Or, in case you’ve never actually tried action a in state s, so if this turns out to be 0 over 0, you can then have some default estimate for those; a uniform distribution over all states is a reasonable default.
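The counting estimate just described can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function name `estimate_transitions`, the integer encoding of states and actions, and the tiny `observed` data set are all assumptions made for the example.

```python
def estimate_transitions(transitions, n_states, n_actions):
    """Estimate P[s][a][s_next] from observed (s, a, s_next) triples.

    Hypothetical helper: states and actions are assumed to be
    integers 0..n-1.
    """
    # counts[s][a][s_next] = times action a in state s led to s_next
    counts = [[[0] * n_states for _ in range(n_actions)]
              for _ in range(n_states)]
    for s, a, s_next in transitions:
        counts[s][a][s_next] += 1

    P = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            total = sum(counts[s][a])
            if total == 0:
                # The 0/0 case: action a was never tried in state s,
                # so fall back to a uniform distribution over states.
                row.append([1.0 / n_states] * n_states)
            else:
                row.append([c / total for c in counts[s][a]])
        P.append(row)
    return P


# Made-up experience: (state, action, next state)
observed = [(0, 1, 1), (0, 1, 1), (0, 1, 0), (1, 0, 1)]
P_hat = estimate_transitions(observed, n_states=2, n_actions=2)
print(P_hat[0][1])  # roughly [0.33, 0.67]: 1 of 3 went to state 0, 2 of 3 to state 1
print(P_hat[1][1])  # [0.5, 0.5]: never tried, so the uniform default
```

Keeping the raw counts around, rather than only the ratios, makes it cheap to fold in new experience later.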
And so, putting it all together. And by the way, in most of the earlier parts of this class, where we did supervised learning, when I talked about, say, the logistic regression algorithm, there’s the algorithm, and most implementations of it look pretty much the same; there’s a fairly standard way to do logistic regression, or SVMs, or factor analysis, or whatever. It turns out in reinforcement learning there’s more of a mix-and-match sense, I guess, so there are often different pieces of different algorithms you can choose to use. So in some of the algorithms I write down, there’s more than one way to do it and I’m giving specific examples, but if you’re faced with an AI problem, say you’re controlling a robot, you may want to plug in value iteration here instead of policy iteration, or you may want to do something slightly different from the specific things I wrote down. That’s actually fairly common; in reinforcement learning there are many ways to apply different algorithms and to mix and match them. And this will come up again in the weekly lectures.
So just putting the things I said together, here would be an example of how you might estimate the state transition probabilities in an MDP and find a policy for it. You might repeatedly do the following. Let’s see. Take actions using some policy π to get experience in the MDP, meaning just execute the policy π and observe the state transitions. Based on the data you get, you then update your estimates of the state transition probabilities P subscript (s, a). Then you solve Bellman’s equations using value iteration, which I’m abbreviating to VI, and by Bellman’s equations I mean Bellman’s equations for V*, not for Vπ. So you solve Bellman’s equations using value iteration to get an estimate for V*, and then you update your policy π by setting it to [inaudible].
And now you have a new policy, so you can then go back and execute this policy for a bit more in the MDP to get some more observations of state transitions, use those to update your estimates of the state transition probabilities again, use value iteration or policy iteration to solve for [inaudible] the value function, get a new policy, and so on. Okay? And it turns out when you do this, I actually wrote down value iteration for a reason. It turns out that in the third step of the algorithm, if you’re using value iteration rather than policy iteration, and you initialize value iteration with your solution from the previous iteration of the algorithm, then that’s a very good initialization and this will tend to converge much more quickly. That’s because value iteration tries to solve for V(s) for every state s; it tries to estimate V*(s). And so if you’re looping through this and you initialize your value iteration algorithm using the values you had from the previous round, then that will often make this converge faster.
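The whole loop can be sketched as below. This is a sketch under several assumptions and is not the lecturer’s code: the reward depends only on the state, so the update is V(s) = R(s) + γ max over a of Σ P(s, a, s′) V(s′); the policy update is the usual greedy choice with respect to V, which is a guess because the transcript is inaudible at that step; a little random exploration (`epsilon`) is added so every action gets tried; and `TRUE_P`, `REWARD`, and all function names are invented for the example.

```python
import random


def value_iteration(P, R, gamma, V_init, tol=1e-8, max_sweeps=10_000):
    """Solve Bellman's equations for V*; returns (V, sweeps used)."""
    n_states, n_actions = len(P), len(P[0])
    V = list(V_init)
    for sweep in range(1, max_sweeps + 1):
        new_V = [
            R[s] + gamma * max(
                sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                for a in range(n_actions))
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    return V, sweep


def greedy_policy(P, V):
    """Assumed policy update: best expected next-state value per state."""
    n_states, n_actions = len(P), len(P[0])
    return [
        max(range(n_actions),
            key=lambda a: sum(P[s][a][s2] * V[s2] for s2 in range(n_states)))
        for s in range(n_states)
    ]


def learn_policy(true_P, R, gamma, rounds=20, steps=200, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = len(true_P), len(true_P[0])
    counts = [[[0] * n_states for _ in range(n_actions)]
              for _ in range(n_states)]
    V = [0.0] * n_states      # carried across rounds for the warm start
    policy = [0] * n_states
    s = 0
    for _ in range(rounds):
        # 1. Execute the current policy (with some exploration, an
        #    assumption of this sketch) and record state transitions.
        for _ in range(steps):
            a = rng.randrange(n_actions) if rng.random() < epsilon else policy[s]
            s_next = rng.choices(range(n_states), weights=true_P[s][a])[0]
            counts[s][a][s_next] += 1
            s = s_next
        # 2. Re-estimate P from counts; uniform default for untried pairs.
        P_hat = [[[c / sum(row) for c in row] if sum(row) > 0
                  else [1.0 / n_states] * n_states
                  for row in counts[st]] for st in range(n_states)]
        # 3. Value iteration, initialized from the previous round's V.
        V, _ = value_iteration(P_hat, R, gamma, V_init=V)
        # 4. Update the policy from the new value estimate.
        policy = greedy_policy(P_hat, V)
    return policy, V


# Toy two-state MDP (made up): action 1 usually leads to state 1,
# action 0 usually leads to state 0, and only state 1 is rewarded.
TRUE_P = [[[0.9, 0.1], [0.1, 0.9]],
          [[0.9, 0.1], [0.1, 0.9]]]
REWARD = [0.0, 1.0]

policy, V = learn_policy(TRUE_P, REWARD, gamma=0.9)
print(policy, V)
```

The agent only ever sees sampled transitions from `TRUE_P`, never the table itself. Passing the previous round’s `V` as `V_init` is the warm start described above; policy iteration could be swapped in at step 3 instead.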
But again, here you can also just plug in policy iteration as well, or whatever, and this is a fairly typical example of how you would try to find a good policy for a problem for which you did not know the state transition probabilities in advance.
Cool. Questions about this? Cool. So that sure was exciting. This is like our first two MDP algorithms in just one lecture. All right, let’s close for today. Thanks.
[End of Audio]
Duration: 73 minutes