<< Chapter < Page Chapter >> Page >

The proof is not that difficult, but it is also sort of longer than I want to go over in this class. Yeah, that was a good point. Cool. Actually, any questions for any of these?

Okay, so we now have two algorithms for solving MDP. There’s a given, the five tuple, given the set of states, the set of actions, the state transition properties, the discount factor, and the reward function, you can now apply policy iteration or value iteration to compute the optimal policy for the MDP.

The last thing I want to talk about is what if you don’t know the state transition probabilities, and sometimes you won’t know the reward function R as well, but let’s leave that aside. And so for example, let’s say you’re trying to fly a helicopter and you don’t really know in advance what state your helicopter will transition to and take an action in a certain state, because helicopter dynamics are kind of noisy. You sort of often don’t really know what state you end up in.

So the standard thing to do, or one standard thing to do, is then to try to estimate the state transition probabilities from data. Let me just write this out. It turns out that the MDP has its 5 tuple, right? S, A; you have the transition probabilities, gamma, and R. S and A you almost always know. The state space is sort of up to you to define. What’s the state space at the very bottom, factor you’re trying to control, whatever. Actions is, again, just one of your actions. Usually, we almost always know these. Gamma, the discount factor is something you choose depending on how much you want to trade off current versus future rewards. The reward function you usually know. There are some exceptional cases. Usually, you come up with a reward function and so you usually know what the reward function is. Sometimes you don’t, but let’s just leave that aside for now and the most common thing for you to have to learn are the state transition probabilities. So we’ll just talk about how to learn that. So when you don’t know state transition probabilities, the most common thing to do is just estimate it from data. So what I mean is imagine some robot – maybe it’s a robot roaming around the hallway, like in that grid example – you would then have the robot just take actions in the MDP and you would then estimate your state transition probabilities P subscript (s, a) s prime to be – pretty much exactly what you’d expect it to be.

This would be the number of times you took action a in state s and you got to s prime, divided by the number of times you took action a in state s. Okay? So the estimate of this is just all the times you took the action a in the state s, what’s the fraction of times you actually got to the state s prime. It’s pretty much exactly what you expect it to be. Or you can – or in case you’ve never actually tried action a in state s, so if this turns out to be 0 over 0, you can then have some default estimate for those vector uniform distribution over all states, this reasonable default.

And so, putting it all together – and by the way, it turns out in reinforcement learning, in most of the earlier parts of this class where we did supervised learning, I sort of talked about the logistic regression algorithm, so it does the algorithm and most implementations of logistic regression – like a fairly standard way to do logistic regression were SVMs or faster analysis or whatever. It turns out in reinforcement learning there’s more of a mix and match sense, I guess, so there are often different pieces of different algorithms you can choose to use. So in some of the algorithms I write down, there’s sort of more than one way to do it and I’m sort of giving specific examples, but if you’re faced with an AI problem, some of you in control of robots, you want to plug in value iteration here instead of policy iteration. You want to do something slightly different than one of the specific things I wrote down. That’s actually fairly common, so just in reinforcement learning, there’s sort of other major ways to apply different algorithms and mix and match different algorithms. And this will come up again in the weekly lectures. So just putting the things I said together, here would be a – now this would be an example of how you might estimate the state transition probabilities in a MDP and find the policy for it. So you might repeatedly do the following. Let’s see. Take actions using some policy p to get experience in the MDP, meaning that just execute the policy p observed state transitions. Based on the data you get, you then update estimates of your state transition probabilities P subscript (s, a) based on the experience of the observations you just got. Then you might solve Bellman’s equations using value iterations, which I’m abbreviating to VI, and by Bellman’s equations, I mean Bellman’s equations for V*, not for Vp. Solve Bellman’s equations using value iteration to get an estimate for P* and then you update your policy by events equals [inaudible].

And now you have a new policy so you can then go back and execute this policy for a bit more of the MDPs to get some more observations of state transitions, get the noisy ones in MDP, use that update to estimate your state transition probabilities again; use value iteration or policy iteration to solve for [inaudible] the value function, get a new policy and so on. Okay? And it turns out when you do this, I actually wrote down value iteration for a reason. It turns out in the third step of the algorithm, if you’re using value iteration rather than policy iteration, to initialize value iteration, if you use your solution from the previous used algorithm, right, then that’s a very good initialization condition and this will tend to converge much more quickly because value iteration tries to solve for V(s) for every state s. It tries to estimate V*(s) and the s from the * in V(s) and so if you’re looking through this and you initialize your value iteration algorithm using the values you have from the previous round through this, then that will often make this converge faster.

But again, this is again here, you can also adjust a small part in policy iteration in here as well and whatever, and this is a fairly typical example of how you would solve a policy, correct digits and then key in and try to find a good policy for a problem for which you did not know the state transition probabilities in advance.

Cool. Questions about this? Cool. So that sure was exciting. This is like our first two MDP algorithms in just one lecture. All right, let’s close for today. Thanks.

[End of Audio]

Duration: 73 minutes

Questions & Answers

where we get a research paper on Nano chemistry....?
Maira Reply
nanopartical of organic/inorganic / physical chemistry , pdf / thesis / review
what are the products of Nano chemistry?
Maira Reply
There are lots of products of nano chemistry... Like nano coatings.....carbon fiber.. And lots of others..
Even nanotechnology is pretty much all about chemistry... Its the chemistry on quantum or atomic level
no nanotechnology is also a part of physics and maths it requires angle formulas and some pressure regarding concepts
Preparation and Applications of Nanomaterial for Drug Delivery
Hafiz Reply
Application of nanotechnology in medicine
what is variations in raman spectra for nanomaterials
Jyoti Reply
ya I also want to know the raman spectra
I only see partial conversation and what's the question here!
Crow Reply
what about nanotechnology for water purification
RAW Reply
please someone correct me if I'm wrong but I think one can use nanoparticles, specially silver nanoparticles for water treatment.
yes that's correct
I think
Nasa has use it in the 60's, copper as water purification in the moon travel.
nanocopper obvius
what is the stm
Brian Reply
is there industrial application of fullrenes. What is the method to prepare fullrene on large scale.?
industrial application...? mmm I think on the medical side as drug carrier, but you should go deeper on your research, I may be wrong
How we are making nano material?
what is a peer
What is meant by 'nano scale'?
What is STMs full form?
scanning tunneling microscope
how nano science is used for hydrophobicity
Do u think that Graphene and Fullrene fiber can be used to make Air Plane body structure the lightest and strongest. Rafiq
what is differents between GO and RGO?
what is simplest way to understand the applications of nano robots used to detect the cancer affected cell of human body.? How this robot is carried to required site of body cell.? what will be the carrier material and how can be detected that correct delivery of drug is done Rafiq
analytical skills graphene is prepared to kill any type viruses .
Any one who tell me about Preparation and application of Nanomaterial for drug Delivery
what is Nano technology ?
Bob Reply
write examples of Nano molecule?
The nanotechnology is as new science, to scale nanometric
nanotechnology is the study, desing, synthesis, manipulation and application of materials and functional systems through control of matter at nanoscale
Is there any normative that regulates the use of silver nanoparticles?
Damian Reply
what king of growth are you checking .?
What fields keep nano created devices from performing or assimulating ? Magnetic fields ? Are do they assimilate ?
Stoney Reply
why we need to study biomolecules, molecular biology in nanotechnology?
Adin Reply
yes I'm doing my masters in nanotechnology, we are being studying all these domains as well..
what school?
biomolecules are e building blocks of every organics and inorganic materials.
Got questions? Join the online conversation and get instant answers!
Jobilize.com Reply

Get the best Algebra and trigonometry course in your pocket!

Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Machine learning' conversation and receive update notifications?