
So let me say what these are. Actually, could you please raise the screen? I won’t need the laptop anymore today. [Inaudible] more space. Yep, go, great. Thanks.

So an MDP comprises a five-tuple (S, A, {P_sa}, gamma, R). The first of these elements, S, is a set of states, and so for the helicopter example, the set of states would be the possible positions and orientations of the helicopter. A is a set of actions. So again, for the helicopter example, this would be the set of all possible positions that we could put our control sticks into. P_sa are the state transition distributions. So for each state and each action, this is a probability distribution, meaning the sum over s' of P_sa(s') equals 1, and P_sa(s') is greater than or equal to zero.

And the state transition distributions, or state transition probabilities, work as follows. P subscript (s, a) gives me the probability distribution over what state I will transition to, that is, what state I wind up in, if I take the action a in the state s. So this is a probability distribution over the states s' that I can get to when I take the action a in the state s. I'll say more about this in a second.

Gamma is a number called the discount factor. Don't worry about this yet; I'll say what it is in a second. It's usually a number strictly less than 1 and greater than or equal to zero. And R is our reward function, so the reward function maps from the set of states to the set of real numbers, and it can be positive or negative.
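To make the five-tuple concrete in code, here is a minimal sketch of how you might represent an MDP in Python. The class name, field names, and dictionary-based representation are my own illustrative assumptions, not something from the lecture; the only constraints it encodes are the ones stated above, namely that each P_sa is a valid probability distribution over next states.

```python
# A minimal sketch of the MDP five-tuple (S, A, P_sa, gamma, R).
# Names and structure are illustrative assumptions, not from the lecture.

from typing import Dict, List, Tuple

State = str
Action = str


class MDP:
    def __init__(
        self,
        states: List[State],        # S: the set of states
        actions: List[Action],      # A: the set of actions
        transitions: Dict[Tuple[State, Action], Dict[State, float]],  # P_sa(s')
        gamma: float,               # discount factor, 0 <= gamma < 1
        rewards: Dict[State, float],  # R: maps states to real-valued rewards
    ):
        # Each P_sa must be a probability distribution over next states:
        # nonnegative entries that sum to 1.
        for (s, a), dist in transitions.items():
            assert all(p >= 0 for p in dist.values()), f"P_({s},{a}) must be nonnegative"
            assert abs(sum(dist.values()) - 1.0) < 1e-8, f"P_({s},{a}) must sum to 1"
        self.states = states
        self.actions = actions
        self.transitions = transitions
        self.gamma = gamma
        self.rewards = rewards
```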

So just to make these elements concrete, let me give a specific example of an MDP. Rather than talking about something as complicated as helicopters, I'm going to use a much smaller MDP as the running example for the rest of today's lecture. We'll look at much more complicated MDPs in subsequent lectures.

This is an example that I adapted from the textbook by Stuart Russell and Peter Norvig called Artificial Intelligence: A Modern Approach (Second Edition). It's a small MDP that models a robot navigation task: imagine you have a robot that lives in this grid world, where the shaded-in cell is an obstacle, so the robot can't go onto that cell.

And so, let's see. I would really like the robot to get to this upper-right cell, let's say, so I'm going to associate that cell with a +1 reward, and I'd really like it to avoid this other grid cell, so I'm going to associate that grid cell with a -1 reward.

So let's actually iterate through the five elements of the MDP and see what they are for this problem. The robot can be in any of these eleven positions, and so I have an MDP with 11 states: a set capital S corresponding to the 11 places it could be in. And let's say my robot, in this highly simplified example, can try to move in each of the compass directions, so in this MDP I'll have four actions, corresponding to moving North, South, East, or West.

And let's see. Let's say that my robot's dynamics are noisy. If you've worked in robotics before, you know that if you command a robot to go North, then because of wheel slip or imprecision in how you actuate or whatever, there's a small chance that your robot will end up slightly off from where you intended. So if you command your robot to move forward one meter, usually it will move forward somewhere between, say, 95 and 105 centimeters.
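Here is a sketch of what this grid-world MDP might look like in code. The grid is the 4x3 world from Russell and Norvig's book, with the obstacle at (1, 1) (column, row counted from the bottom-left), the +1 reward in the upper-right cell (3, 2), and the -1 reward in the cell below it at (3, 1), which is where that textbook places it. The noise model below, 0.8 probability of moving in the commanded direction and 0.1 of veering to each side, is also the standard choice in that textbook; the exact numbers are an assumption on my part, since this part of the lecture only says the dynamics are "noisy".

```python
# A sketch of the 4x3 grid-world MDP adapted from Russell & Norvig.
# Cell locations for the obstacle and the -1 reward, and the 0.8/0.1/0.1
# noise model, are assumptions taken from that textbook's version of the example.

GRID_W, GRID_H = 4, 3
OBSTACLE = (1, 1)
# The 11 states: every cell of the 4x3 grid except the obstacle.
states = [(x, y) for x in range(GRID_W) for y in range(GRID_H) if (x, y) != OBSTACLE]

# The 4 actions: move in each compass direction.
actions = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
# Veering sideways: commanding North may instead move you East or West, and so on.
side_steps = {"N": ["E", "W"], "S": ["E", "W"], "E": ["N", "S"], "W": ["N", "S"]}

# The reward function R: +1 at the goal cell, -1 at the cell to avoid, 0 elsewhere.
rewards = {s: 0.0 for s in states}
rewards[(3, 2)] = +1.0   # the cell the robot should reach
rewards[(3, 1)] = -1.0   # the cell the robot should avoid


def move(state, action):
    """Deterministic move; bumping into a wall or the obstacle leaves you in place."""
    dx, dy = actions[action]
    nxt = (state[0] + dx, state[1] + dy)
    return nxt if nxt in states else state


def transition(state, action):
    """P_sa: distribution over next states when commanding `action` in `state`."""
    dist = {}
    for a, p in [(action, 0.8)] + [(side, 0.1) for side in side_steps[action]]:
        nxt = move(state, a)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist
```

For instance, transition((0, 0), "N") returns a distribution that puts 0.8 probability on moving up to (0, 1), 0.1 on moving East to (1, 0), and 0.1 on staying at (0, 0), since trying to move West off the grid leaves the robot in place.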





