In policy iteration, we initialize the policy p randomly, so it doesn’t matter what it is. It can be the policy that always goes north, or the policy that takes actions at random, or whatever. And then we’ll repeatedly do the following. Okay, so that’s the algorithm.

So the algorithm has two steps. In the first step, we solve. We take the current policy p and we solve Bellman’s equations to obtain Vp. So remember, earlier I said if you have a fixed policy p, then yeah, Bellman’s equations define this system of linear equations with 11 unknowns and 11 linear constraints. And so you solve that system of linear equations to get the value function for your current policy p, and by this notation, I mean just let V be the value function for policy p.

Then the second step is you update the policy. In other words, you pretend that your current guess V for the value function is indeed the optimal value function and you let p(s) be equal to that arg max formula, so as to update your policy p.
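The two steps just described can be written out as follows. This is a reconstruction of what was presumably on the board, not a quote from it: the transcript's p is written \pi here, and the symbols R(s) (reward), \gamma (discount factor), and P_{sa}(s') (transition probabilities) are assumed to match the MDP notation used elsewhere in the lectures.

```latex
\text{Repeat until convergence:}
\begin{align*}
&\text{(1) Solve for } V := V^{\pi}, \text{ where }
  V^{\pi}(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s')\, V^{\pi}(s')
  \quad \text{for every state } s, \\
&\text{(2) For every state } s, \text{ set }
  \pi(s) := \arg\max_{a} \sum_{s'} P_{sa}(s')\, V(s').
\end{align*}
```

With the policy held fixed, step (1) is linear in the unknowns V^{\pi}(s), one equation per state, which is why it reduces to solving a linear system.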

And so it turns out that if you do this, then V will converge to V* and p will converge to p*, and so this is another way to find the optimal policy for an MDP.
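Here is a minimal Python sketch of that loop, under stated assumptions. The three-state chain MDP, the function names, and the nested-list layout `P[s][a][s']` are all made up for illustration; they are not the 11-state example from the lecture. The linear solve uses a small hand-written Gaussian elimination so the block has no dependencies.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def evaluate_policy(P, R, gamma, policy):
    """Step 1: solve Bellman's equations (I - gamma * P_policy) V = R."""
    n = len(R)
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][policy[i]][j]
          for j in range(n)] for i in range(n)]
    return solve_linear_system(A, R)


def improve_policy(P, V, n_actions):
    """Step 2: pick, in each state, the action maximizing expected V."""
    def expected_value(s, a):
        return sum(p * v for p, v in zip(P[s][a], V))
    return [max(range(n_actions), key=lambda a: expected_value(s, a))
            for s in range(len(V))]


def policy_iteration(P, R, gamma, n_actions):
    policy = [0] * len(R)  # arbitrary starting policy
    while True:
        V = evaluate_policy(P, R, gamma, policy)
        new_policy = improve_policy(P, V, n_actions)
        if new_policy == policy:  # no state changed its action: stop
            return policy, V
        policy = new_policy


# Hypothetical toy MDP: states 0, 1, 2 in a row; action 0 = left, 1 = right.
# Moves are deterministic, walls keep you in place, reward 1 in state 2.
P = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # from state 0
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # from state 1
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # from state 2
]
R = [0.0, 0.0, 1.0]
policy, V = policy_iteration(P, R, gamma=0.9, n_actions=2)
print(policy, V)
```

On this toy problem the loop settles on "always go right" after a few rounds. The stopping test (the policy did not change) is one common choice; ties in the arg max are broken by taking the lowest-numbered action.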

In terms of tradeoffs, it turns out that, let’s see, in policy iteration, the computationally expensive step is this one. You need to solve this linear system of equations. You have n equations and n unknowns, if you have n states. And so if you have a problem with a reasonably small number of states, if you have a problem with like 11 states, you can solve the linear system of equations fairly efficiently, and so policy iteration tends to work extremely well for problems with smallish numbers of states where you can actually solve those linear systems of equations efficiently.

So if you have a thousand states or fewer, you can solve a system of a thousand equations very efficiently, so policy iteration will often work fine. If you have an MDP with an enormous number of states, and we’ll actually often see MDPs with tens of thousands or hundreds of thousands or millions or tens of millions of states, then it’s a different story. If you have a problem with 10 million states and you try to apply policy iteration, then this step requires solving a linear system of 10 million equations, and this would be computationally expensive. And so for these really, really large MDPs, I tend to use value iteration.
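For contrast, here is a minimal Python sketch of value iteration, assuming the standard update V(s) := R(s) + gamma * max_a sum_s' P_sa(s') V(s'). Each sweep needs only sums and a max per state, with no linear system to solve. The toy three-state MDP, the function name, and the tolerance `tol` are illustrative assumptions, not values from the lecture.

```python
def value_iteration(P, R, gamma, n_actions, tol=1e-10):
    """Repeat the Bellman backup until V changes by less than tol."""
    V = [0.0] * len(R)
    while True:
        new_V = [
            R[s] + gamma * max(
                sum(p * v for p, v in zip(P[s][a], V))
                for a in range(n_actions)
            )
            for s in range(len(R))
        ]
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V


# Hypothetical toy MDP: states 0, 1, 2 in a row; action 0 = left, 1 = right.
# Moves are deterministic, walls keep you in place, reward 1 in state 2.
P = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # from state 0
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # from state 1
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # from state 2
]
R = [0.0, 0.0, 1.0]
V = value_iteration(P, R, gamma=0.9, n_actions=2)
print(V)
```

The price is that value iteration takes many cheap sweeps where policy iteration takes a few expensive rounds. Which one wins depends on the number of states and on how sparse the transition probabilities are.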

Let’s see. Any questions about this?

Student: So is this a convex function, where a local optimization scheme would do well?

Instructor (Andrew Ng): Ah, yes, you’re right. That’s a good question: Is this a convex function? It actually turns out that there is a way to pose the problem of solving for V* as a convex optimization problem, as a linear program. For instance, you can write down V* as the solution to a linear program, which is a problem you can solve. Policy iteration does converge; we’re not just stuck with local optima. But the proof of the convergence of policy iteration uses somewhat different principles from convex optimization, at least in the versions I’ve seen. You could probably relate this back to convex optimization, but that isn’t how I understand the principle of why this converges.
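For reference, one standard way to write V* as the solution of a linear program is shown below. The instructor does not spell out a formulation, so this may not be the one he had in mind; R(s), \gamma, and P_{sa}(s') are assumed to follow the usual MDP notation.

```latex
\begin{align*}
\min_{V} \quad & \sum_{s} V(s) \\
\text{subject to} \quad & V(s) \;\ge\; R(s) + \gamma \sum_{s'} P_{sa}(s')\, V(s')
  \quad \text{for all states } s \text{ and actions } a.
\end{align*}
```

The objective and the constraints are all linear in the unknowns V(s), and V* is the smallest V that satisfies every constraint.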

Source:  OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013 Download for free at http://cnx.org/content/col11500/1.4