1.15 Machine learning lecture 16

The proof is not that difficult, but it is also sort of longer than I want to go over in this class. Yeah, that was a good point. Cool. Actually, any questions for any of these?

Okay, so we now have two algorithms for solving an MDP. Given the five-tuple, that is, given the set of states, the set of actions, the state transition probabilities, the discount factor, and the reward function, you can now apply policy iteration or value iteration to compute the optimal policy for the MDP.

The last thing I want to talk about is what if you don’t know the state transition probabilities, and sometimes you won’t know the reward function R as well, but let’s leave that aside. And so for example, let’s say you’re trying to fly a helicopter and you don’t really know in advance what state your helicopter will transition to when you take an action in a certain state, because helicopter dynamics are kind of noisy. You often don’t really know what state you’ll end up in.

So the standard thing to do, or one standard thing to do, is then to try to estimate the state transition probabilities from data. Let me just write this out. The MDP has its five-tuple, right? S, A, the transition probabilities, gamma, and R. S and A you almost always know. The state space is sort of up to you to define, depending on whatever system you’re trying to control. The actions are, again, just your set of actions. Usually, we almost always know these. Gamma, the discount factor, is something you choose depending on how much you want to trade off current versus future rewards.

The reward function you usually know. There are some exceptional cases, but usually you come up with a reward function yourself, and so you usually know what it is. Sometimes you don’t, but let’s just leave that aside for now. The most common thing for you to have to learn is the state transition probabilities, so we’ll just talk about how to learn those.

So when you don’t know the state transition probabilities, the most common thing to do is just estimate them from data. What I mean is, imagine some robot, maybe a robot roaming around the hallway, like in that grid example. You would have the robot just take actions in the MDP, and you would then estimate your state transition probabilities P_sa(s') to be pretty much exactly what you’d expect.

This would be the number of times you took action a in state s and you got to s', divided by the number of times you took action a in state s. Okay? So the estimate is just, of all the times you took action a in state s, the fraction of times you actually got to state s'. It’s pretty much exactly what you’d expect it to be. Or, in case you’ve never actually tried action a in state s, so that this turns out to be 0 over 0, you can have some default estimate for those; a uniform distribution over all states is a reasonable default.
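The counting estimate described above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function name `estimate_transitions` and the representation of experience as `(s, a, s_next)` triples are assumptions made for the example.

```python
from collections import defaultdict

def estimate_transitions(transitions, states, actions):
    """Return P[(s, a)][s_next] estimated from observed (s, a, s_next) triples."""
    # counts[(s, a)][s_next] = number of times action a in state s led to s_next
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1

    P = {}
    for s in states:
        for a in actions:
            total = sum(counts[(s, a)].values())  # times a was taken in s
            if total == 0:
                # (s, a) was never tried: the ratio is 0/0, so fall back to
                # a uniform distribution over all states.
                P[(s, a)] = {s2: 1.0 / len(states) for s2 in states}
            else:
                P[(s, a)] = {s2: counts[(s, a)][s2] / total for s2 in states}
    return P
```

For instance, if action "go" in state 0 led to state 1 twice and to state 0 once, the estimate for reaching state 1 is 2/3, and any state-action pair that was never tried gets the uniform default.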

And so, putting it all together. By the way, in most of the earlier parts of this class, where we did supervised learning, I’d talk about, say, the logistic regression algorithm, and that pretty much is the algorithm: there’s a fairly standard way to do logistic regression, or SVMs, or factor analysis, or whatever. It turns out in reinforcement learning there’s more of a mix-and-match sense, I guess, so there are often different pieces of different algorithms you can choose to use. So in some of the algorithms I write down, there’s more than one way to do it and I’m giving specific examples, but if you’re faced with an AI problem, say some of you are controlling robots, you might want to plug in value iteration here instead of policy iteration, or do something slightly different from one of the specific things I wrote down. That’s actually fairly common. In reinforcement learning there are many ways to apply different algorithms and to mix and match them. And this will come up again in the weekly lectures.

So just putting the things I said together, here would be an example of how you might estimate the state transition probabilities in an MDP and find a policy for it. You might repeatedly do the following. Let’s see. Take actions using some policy π to get experience in the MDP, meaning just execute the policy π and observe the state transitions. Based on the data you get, you then update your estimates of the state transition probabilities P_sa using the observations you just got. Then you might solve Bellman’s equations using value iteration, which I’m abbreviating to VI, and by Bellman’s equations I mean Bellman’s equations for V*, not for V^π. Solve Bellman’s equations using value iteration to get an estimate for V*, and then you update your policy by setting π(s) equal to [inaudible].
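For reference, the equations being solved in the third step, and the usual greedy policy update that follows, can be written as below. This is the standard form rather than something recovered from the audio; it assumes the reward depends only on the state, R(s), and writes \hat{P}_{sa} for the estimated transition probabilities.

```latex
% Bellman's equations for V^*, with the estimated model in place of the true one
V^*(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} \hat{P}_{sa}(s')\, V^*(s')

% Greedy policy with respect to the resulting value function
\pi(s) = \arg\max_{a \in A} \sum_{s' \in S} \hat{P}_{sa}(s')\, V^*(s')
```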

And now you have a new policy, so you can go back and execute this policy for a bit longer in the MDP to get some more noisy observations of state transitions, use those to update your estimates of the state transition probabilities again, use value iteration or policy iteration to solve for [inaudible] the value function, get a new policy, and so on. Okay?

And I actually wrote down value iteration for a reason. In the third step of the algorithm, if you’re using value iteration rather than policy iteration, then to initialize value iteration you can use your solution from the previous iteration of the algorithm. That’s a very good initialization, and it will tend to converge much more quickly, because value iteration tries to solve for V(s) for every state s; it tries to estimate V*(s). So if you’re looping through this and you initialize your value iteration algorithm using the values you have from the previous round, that will often make it converge faster.
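The loop described above might be sketched as follows. This is an illustrative sketch under stated assumptions, not the lecture's code: the function names, the `step(s, a, rng)` simulator interface, the state-only reward `R[s]`, the toy `chain_step` environment, and the occasional random action (`epsilon`) used to make sure every state-action pair gets tried are all assumptions added for the example.

```python
import random
from collections import defaultdict

def value_iteration(P, R, states, actions, gamma, V_init=None, tol=1e-8):
    """Solve Bellman's equations for V*; V_init warm-starts the solver."""
    V = dict(V_init) if V_init else {s: 0.0 for s in states}
    sweeps = 0
    while True:
        sweeps += 1
        delta = 0.0
        for s in states:
            best = max(sum(P[(s, a)][s2] * V[s2] for s2 in states) for a in actions)
            new_v = R[s] + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V, sweeps

def greedy_policy(P, V, states, actions):
    """Pick, in each state, the action with the best expected next-state value."""
    def q(s, a):
        return sum(P[(s, a)][s2] * V[s2] for s2 in states)
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}

def learn_model_and_policy(step, R, states, actions, gamma,
                           rounds=20, steps_per_round=200, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    counts = defaultdict(lambda: defaultdict(int))
    policy = {s: rng.choice(actions) for s in states}
    V, s = None, states[0]
    for _ in range(rounds):
        # 1. Execute the current policy to gather experience. The occasional
        #    random action (epsilon) is an added assumption, not from the lecture.
        for _ in range(steps_per_round):
            a = rng.choice(actions) if rng.random() < epsilon else policy[s]
            s_next = step(s, a, rng)
            counts[(s, a)][s_next] += 1
            s = s_next
        # 2. Re-estimate the transition probabilities (uniform default on 0/0).
        P = {}
        for st in states:
            for a in actions:
                total = sum(counts[(st, a)].values())
                P[(st, a)] = {s2: counts[(st, a)][s2] / total if total
                              else 1.0 / len(states) for s2 in states}
        # 3. Solve for V* under the estimated model, warm-started from last round.
        V, _ = value_iteration(P, R, states, actions, gamma, V_init=V)
        # 4. Update the policy to be greedy with respect to the new V.
        policy = greedy_policy(P, V, states, actions)
    return P, V, policy

def chain_step(s, a, rng):
    """Hypothetical noisy 3-state chain: the intended move succeeds 80% of the time."""
    if rng.random() < 0.8:
        s = s + 1 if a == "right" else s - 1
    return min(max(s, 0), 2)
```

A call such as `learn_model_and_policy(chain_step, {0: 0.0, 1: 0.0, 2: 1.0}, [0, 1, 2], ["left", "right"], 0.9)` returns the estimated model, the value estimates, and the resulting policy. Passing the previous round's `V` as `V_init` is the warm start mentioned above; swapping policy iteration into step 3 would be an equally valid variant.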

But again, you can just as well plug policy iteration in here instead, or whatever. This is a fairly typical example of how you would collect data and then try to find a good policy for a problem for which you did not know the state transition probabilities in advance.

Cool. Questions about this? Cool. So that sure was exciting. This is like our first two MDP algorithms in just one lecture. All right, let’s close for today. Thanks.

[End of Audio]

Duration: 73 minutes
