
In policy iteration, we initialize the policy p randomly; the starting point doesn’t matter. It can be the policy that always goes north, or the policy that takes actions at random, or whatever. And then we’ll repeatedly do the following. Okay, so that’s the algorithm.

So the algorithm has two steps. In the first step, we take the current policy p and solve Bellman’s equations to obtain V^p. Remember, earlier I said that if you have a fixed policy p, then Bellman’s equations define a system of linear equations, with 11 unknowns and 11 linear constraints in the 11-state example. So you solve that system of linear equations to get the value function for your current policy p, and by this notation, I mean just let V be the value function for policy p.
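For reference, the linear system described in this step can be written out as follows. This is a sketch under assumed notation that does not appear in this part of the transcript: R(s) is the reward in state s, P_{sa}(s') is the probability of moving from s to s' under action a, and \gamma is the discount factor.

```latex
% One equation per state s; the unknowns are the values V^{p}(s).
V^{p}(s) = R(s) + \gamma \sum_{s'} P_{s\,p(s)}(s')\, V^{p}(s')
\qquad \text{for every state } s.

% Stacking the n equations gives an n-by-n linear system,
% where P^{p} is the transition matrix under policy p:
(I - \gamma P^{p})\, V^{p} = R .
```

With 11 states this is 11 equations in 11 unknowns, matching the count given above.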

Then the second step is you update the policy. In other words, you pretend that your current guess V for the value function is indeed the optimal value function, and you let p(s) be equal to that arg max formula, so as to update your policy p.
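The arg max formula referred to here is written on the board and not captured in the transcript. A likely form, assuming the same notation as the standard greedy update (P_{sa}(s') for transition probabilities, A for the set of actions), is:

```latex
% Greedy update: in each state, pick the action with the highest
% expected value of the next state under the current guess V.
p(s) := \arg\max_{a \in A} \sum_{s'} P_{sa}(s')\, V(s')
```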

And so it turns out that if you do this, then V will converge to V* and p will converge to p*, and so this is another way to find the optimal policy for an MDP.
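The two steps can be sketched in plain Python as follows. This is a minimal illustration, not the lecture's own code: the function names, the nested-list layout `P[s][a][s2]` for transition probabilities, the reward list `R`, and the two-state toy MDP are all assumptions made for the example. The loop stops when the policy no longer changes.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def evaluate_policy(P, R, gamma, policy):
    """Step 1: solve Bellman's equations (I - gamma * P_p) V = R for V."""
    n = len(R)
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][policy[i]][j]
          for j in range(n)] for i in range(n)]
    return solve_linear_system(A, R)


def improve_policy(P, V, policy, num_actions):
    """Step 2: the arg max update, treating V as if it were optimal."""
    n = len(V)

    def expected_value(s, a):
        return sum(P[s][a][t] * V[t] for t in range(n))

    new_policy = []
    for s in range(n):
        best = policy[s]  # keep the current action unless one is strictly better
        for a in range(num_actions):
            if expected_value(s, a) > expected_value(s, best) + 1e-12:
                best = a
        new_policy.append(best)
    return new_policy


def policy_iteration(P, R, gamma, num_actions):
    policy = [0] * len(R)  # arbitrary initial policy
    while True:
        V = evaluate_policy(P, R, gamma, policy)
        new_policy = improve_policy(P, V, policy, num_actions)
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Hypothetical toy MDP: action 0 stays put, action 1 moves to the other state.
# Only state 1 gives reward.
P = [[[1, 0], [0, 1]],
     [[0, 1], [1, 0]]]
R = [0.0, 1.0]
policy, V = policy_iteration(P, R, gamma=0.9, num_actions=2)
print(policy, V)
```

On this toy problem the policy settles on moving from state 0 to state 1 and then staying there.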

In terms of tradeoffs, it turns out that in policy iteration, the computationally expensive step is this one: you need to solve this linear system of equations. If you have n states, you have n equations and n unknowns. And so if you have a problem with a reasonably small number of states, like 11 states, you can solve the linear system of equations fairly efficiently, and so policy iteration tends to work extremely well for problems with smallish numbers of states, where you can actually solve those linear systems of equations efficiently.

So if you have a thousand states or fewer, you can solve a system of a thousand equations very efficiently, so policy iteration will often work fine. But we’ll actually often see MDPs with an enormous number of states: tens of thousands, hundreds of thousands, millions, or tens of millions. If you have a problem with 10 million states and you try to apply policy iteration, then this step requires solving a linear system of 10 million equations, and this would be computationally expensive. And so for these really, really large MDPs, I tend to use value iteration.
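For comparison, here is a minimal sketch of value iteration, the alternative named above. It assumes the usual update V(s) := R(s) + gamma * max over a of the sum over s' of P_sa(s') V(s'), which is not stated in this part of the transcript. The names, the data layout `P[s][a][s2]`, the stopping tolerance `tol`, and the toy MDP are illustrative assumptions. Each sweep only needs sums over next states, so no linear system is solved.

```python
def value_iteration(P, R, gamma, num_actions, tol=1e-8):
    """Repeat the Bellman backup until the largest change drops below tol."""
    n = len(R)
    V = [0.0] * n
    while True:
        new_V = [
            R[s] + gamma * max(
                sum(P[s][a][t] * V[t] for t in range(n))
                for a in range(num_actions)
            )
            for s in range(n)
        ]
        delta = max(abs(new_V[s] - V[s]) for s in range(n))
        V = new_V
        if delta < tol:
            return V


# Hypothetical toy MDP: action 0 stays put, action 1 moves to the other state.
P = [[[1, 0], [0, 1]],
     [[0, 1], [1, 0]]]
R = [0.0, 1.0]
V = value_iteration(P, R, gamma=0.9, num_actions=2)
print(V)
```

With dense transition tables, one sweep costs on the order of n squared times the number of actions, whereas a direct solve of the n-by-n system in policy iteration typically grows like n cubed.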

Let’s see. Any questions about this?

Student: So is this a convex function, where it could work well as a local optimization scheme?

Instructor (Andrew Ng): Ah, yes, you’re right. That’s a good question: Is this a convex function? It actually turns out that there is a way to pose the problem of solving for V* as a convex optimization problem, as a linear program. That is, you can write down V* as the solution to a linear program, which is a problem you can solve. Policy iteration does converge, and we’re not just stuck with local optima, but the proof of the convergence of policy iteration uses somewhat different principles from convex optimization, at least in the versions I’ve seen. You could probably relate this back to convex optimization, but that’s not the principle behind why this converges.
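The linear program is not written out in the lecture. One standard formulation, given here as a sketch under assumed notation (R(s) for rewards, P_{sa}(s') for transition probabilities, \gamma for the discount factor), is:

```latex
% Variables: one value V(s) per state.
\begin{aligned}
\min_{V} \quad & \sum_{s} V(s) \\
\text{subject to} \quad & V(s) \;\ge\; R(s) + \gamma \sum_{s'} P_{sa}(s')\, V(s')
\qquad \text{for all states } s \text{ and actions } a .
\end{aligned}
```

The objective and constraints are linear in V, and the minimizer of this program is V*.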

Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4