<< Chapter < Page Chapter >> Page >

The proof is not that difficult, but it is also sort of longer than I want to go over in this class. Yeah, that was a good point. Cool. Actually, any questions for any of these?

Okay, so we now have two algorithms for solving MDP. There’s a given, the five tuple, given the set of states, the set of actions, the state transition properties, the discount factor, and the reward function, you can now apply policy iteration or value iteration to compute the optimal policy for the MDP.

The last thing I want to talk about is what if you don’t know the state transition probabilities, and sometimes you won’t know the reward function R as well, but let’s leave that aside. And so for example, let’s say you’re trying to fly a helicopter and you don’t really know in advance what state your helicopter will transition to and take an action in a certain state, because helicopter dynamics are kind of noisy. You sort of often don’t really know what state you end up in.

So the standard thing to do, or one standard thing to do, is then to try to estimate the state transition probabilities from data. Let me just write this out. It turns out that the MDP has its 5 tuple, right? S, A; you have the transition probabilities, gamma, and R. S and A you almost always know. The state space is sort of up to you to define. What’s the state space at the very bottom, factor you’re trying to control, whatever. Actions is, again, just one of your actions. Usually, we almost always know these. Gamma, the discount factor is something you choose depending on how much you want to trade off current versus future rewards. The reward function you usually know. There are some exceptional cases. Usually, you come up with a reward function and so you usually know what the reward function is. Sometimes you don’t, but let’s just leave that aside for now and the most common thing for you to have to learn are the state transition probabilities. So we’ll just talk about how to learn that. So when you don’t know state transition probabilities, the most common thing to do is just estimate it from data. So what I mean is imagine some robot – maybe it’s a robot roaming around the hallway, like in that grid example – you would then have the robot just take actions in the MDP and you would then estimate your state transition probabilities P subscript (s, a) s prime to be – pretty much exactly what you’d expect it to be.

This would be the number of times you took action a in state s and you got to s prime, divided by the number of times you took action a in state s. Okay? So the estimate of this is just all the times you took the action a in the state s, what’s the fraction of times you actually got to the state s prime. It’s pretty much exactly what you expect it to be. Or you can – or in case you’ve never actually tried action a in state s, so if this turns out to be 0 over 0, you can then have some default estimate for those vector uniform distribution over all states, this reasonable default.

And so, putting it all together – and by the way, it turns out in reinforcement learning, in most of the earlier parts of this class where we did supervised learning, I sort of talked about the logistic regression algorithm, so it does the algorithm and most implementations of logistic regression – like a fairly standard way to do logistic regression were SVMs or faster analysis or whatever. It turns out in reinforcement learning there’s more of a mix and match sense, I guess, so there are often different pieces of different algorithms you can choose to use. So in some of the algorithms I write down, there’s sort of more than one way to do it and I’m sort of giving specific examples, but if you’re faced with an AI problem, some of you in control of robots, you want to plug in value iteration here instead of policy iteration. You want to do something slightly different than one of the specific things I wrote down. That’s actually fairly common, so just in reinforcement learning, there’s sort of other major ways to apply different algorithms and mix and match different algorithms. And this will come up again in the weekly lectures. So just putting the things I said together, here would be a – now this would be an example of how you might estimate the state transition probabilities in a MDP and find the policy for it. So you might repeatedly do the following. Let’s see. Take actions using some policy p to get experience in the MDP, meaning that just execute the policy p observed state transitions. Based on the data you get, you then update estimates of your state transition probabilities P subscript (s, a) based on the experience of the observations you just got. Then you might solve Bellman’s equations using value iterations, which I’m abbreviating to VI, and by Bellman’s equations, I mean Bellman’s equations for V*, not for Vp. Solve Bellman’s equations using value iteration to get an estimate for P* and then you update your policy by events equals [inaudible].

And now you have a new policy so you can then go back and execute this policy for a bit more of the MDPs to get some more observations of state transitions, get the noisy ones in MDP, use that update to estimate your state transition probabilities again; use value iteration or policy iteration to solve for [inaudible] the value function, get a new policy and so on. Okay? And it turns out when you do this, I actually wrote down value iteration for a reason. It turns out in the third step of the algorithm, if you’re using value iteration rather than policy iteration, to initialize value iteration, if you use your solution from the previous used algorithm, right, then that’s a very good initialization condition and this will tend to converge much more quickly because value iteration tries to solve for V(s) for every state s. It tries to estimate V*(s) and the s from the * in V(s) and so if you’re looking through this and you initialize your value iteration algorithm using the values you have from the previous round through this, then that will often make this converge faster.

But again, this is again here, you can also adjust a small part in policy iteration in here as well and whatever, and this is a fairly typical example of how you would solve a policy, correct digits and then key in and try to find a good policy for a problem for which you did not know the state transition probabilities in advance.

Cool. Questions about this? Cool. So that sure was exciting. This is like our first two MDP algorithms in just one lecture. All right, let’s close for today. Thanks.

[End of Audio]

Duration: 73 minutes

Questions & Answers

Is there any normative that regulates the use of silver nanoparticles?
Damian Reply
what king of growth are you checking .?
Renato
What fields keep nano created devices from performing or assimulating ? Magnetic fields ? Are do they assimilate ?
Stoney Reply
why we need to study biomolecules, molecular biology in nanotechnology?
Adin Reply
?
Kyle
yes I'm doing my masters in nanotechnology, we are being studying all these domains as well..
Adin
why?
Adin
what school?
Kyle
biomolecules are e building blocks of every organics and inorganic materials.
Joe
anyone know any internet site where one can find nanotechnology papers?
Damian Reply
research.net
kanaga
sciencedirect big data base
Ernesto
Introduction about quantum dots in nanotechnology
Praveena Reply
what does nano mean?
Anassong Reply
nano basically means 10^(-9). nanometer is a unit to measure length.
Bharti
do you think it's worthwhile in the long term to study the effects and possibilities of nanotechnology on viral treatment?
Damian Reply
absolutely yes
Daniel
how to know photocatalytic properties of tio2 nanoparticles...what to do now
Akash Reply
it is a goid question and i want to know the answer as well
Maciej
characteristics of micro business
Abigail
for teaching engĺish at school how nano technology help us
Anassong
Do somebody tell me a best nano engineering book for beginners?
s. Reply
there is no specific books for beginners but there is book called principle of nanotechnology
NANO
what is fullerene does it is used to make bukky balls
Devang Reply
are you nano engineer ?
s.
fullerene is a bucky ball aka Carbon 60 molecule. It was name by the architect Fuller. He design the geodesic dome. it resembles a soccer ball.
Tarell
what is the actual application of fullerenes nowadays?
Damian
That is a great question Damian. best way to answer that question is to Google it. there are hundreds of applications for buck minister fullerenes, from medical to aerospace. you can also find plenty of research papers that will give you great detail on the potential applications of fullerenes.
Tarell
what is the Synthesis, properties,and applications of carbon nano chemistry
Abhijith Reply
Mostly, they use nano carbon for electronics and for materials to be strengthened.
Virgil
is Bucky paper clear?
CYNTHIA
carbon nanotubes has various application in fuel cells membrane, current research on cancer drug,and in electronics MEMS and NEMS etc
NANO
so some one know about replacing silicon atom with phosphorous in semiconductors device?
s. Reply
Yeah, it is a pain to say the least. You basically have to heat the substarte up to around 1000 degrees celcius then pass phosphene gas over top of it, which is explosive and toxic by the way, under very low pressure.
Harper
Do you know which machine is used to that process?
s.
how to fabricate graphene ink ?
SUYASH Reply
for screen printed electrodes ?
SUYASH
What is lattice structure?
s. Reply
of graphene you mean?
Ebrahim
or in general
Ebrahim
in general
s.
Graphene has a hexagonal structure
tahir
On having this app for quite a bit time, Haven't realised there's a chat room in it.
Cied
what is biological synthesis of nanoparticles
Sanket Reply
how did you get the value of 2000N.What calculations are needed to arrive at it
Smarajit Reply
Privacy Information Security Software Version 1.1a
Good
Got questions? Join the online conversation and get instant answers!
Jobilize.com Reply

Get the best Algebra and trigonometry course in your pocket!





Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4
Google Play and the Google Play logo are trademarks of Google Inc.

Notification Switch

Would you like to follow the 'Machine learning' conversation and receive update notifications?

Ask