In order to provide the background for Matlab procedures for Markov decision processes, we first summarize certain essentials in the analysis of homogeneous Markov chains with costs and rewards associated with states, or with transitions between states. References are to Pfeiffer: PROBABILITY FOR APPLICATIONS.
Let X _{N} be an ergodic, homogeneous Markov chain with state space E and gain sequence $\{{G}_{n}:1\le n\}$ . Let $\mathbf{q}={[{q}_{1}\ {q}_{2}\ \cdots \ {q}_{M}]}^{T}$ where ${q}_{i}=E\left[{G}_{n+1}\right|{X}_{n}=i]\text{ for }i\in \mathbf{E}$ . For $\alpha \in (0,1)$ , let φ be the α -potential for the gain sequence. That is,

$\phi \left(i\right)=E\left[\sum _{n=0}^{\infty}{\alpha}^{n}{G}_{n+1}\mid {X}_{0}=i\right]\quad \text{for }i\in \mathbf{E}$

Consider an ergodic, homogeneous Markov chain X _{N} with gain sequence $\{{G}_{n}:1\le n\}$ and next-period expected gain matrix q . For $\alpha \in (0,1)$ , the α -potential φ and the α -potential matrix ${\mathbf{R}}^{\alpha}$ satisfy

${\mathbf{R}}^{\alpha}=\sum _{n=0}^{\infty}{\alpha}^{n}{\mathbf{P}}^{n}={[\mathbf{I}-\alpha \mathbf{P}]}^{-1}\quad \text{and}\quad \phi ={\mathbf{R}}^{\alpha}\mathbf{q}$
Suppose the transition matrix P and the next-period expected gain matrix q (the latter computed below) are

$\mathbf{P}=\left[\begin{array}{cccc}0.0803& 0.1839& 0.3679& 0.3679\\ 0.6321& 0.3679& 0& 0\\ 0.2642& 0.3679& 0.3679& 0\\ 0.0803& 0.1839& 0.3679& 0.3679\end{array}\right]\qquad \mathbf{q}=\left[\begin{array}{c}86.1668\\ 9.1970\\ 5.1819\\ 1.1668\end{array}\right]$

where states 0, 1, 2, 3 are the possible end-of-period stock levels.
If k units are ordered, the cost in dollars is
For each unit of unsatisfied demand, there is a penalty of 50 dollars. We use the $(m,\phantom{\rule{0.166667em}{0ex}}M)$ inventory policy described in Ex 23.1.3. ${X}_{0}=M$ , and X _{n} is the stock at the end of period n , before restocking. Demand in period n is D _{n} . The cost for restocking at the end of period $n+1$ is
The largest eigenvalue other than one has magnitude $\left|\lambda \right|\approx 0.26$ , so $n\ge 10$ should be sufficient for convergence. We use $n=16$ .
Taking any row of ${\mathbf{P}}^{16}$ , we thus approximate the long-run distribution by

$\boldsymbol{\pi}\approx \left[\begin{array}{cccc}0.2858& 0.2847& 0.2632& 0.1663\end{array}\right]$
For the Poisson distribution
$\sum _{k=n}^{\infty}k{p}_{k}=\lambda \sum _{k=n-1}^{\infty}{p}_{k}$ .
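Since $k{p}_{k}=\lambda {p}_{k-1}$ for the Poisson probabilities, the identity is immediate; it is also easy to confirm numerically. The following Python sketch is a check only, not part of the Matlab development:

```python
from math import exp, factorial

def pmf(k, lam):
    """Poisson(lam) probability P(D = k)."""
    return exp(-lam) * lam**k / factorial(k)

def tail(n, lam, terms=80):
    """sum_{k=n}^inf p_k, truncated where the remaining tail is negligible."""
    return sum(pmf(k, lam) for k in range(n, terms))

def weighted_tail(n, lam, terms=80):
    """sum_{k=n}^inf k * p_k, truncated the same way."""
    return sum(k * pmf(k, lam) for k in range(n, terms))

# Identity: sum_{k=n}^inf k p_k = lam * sum_{k=n-1}^inf p_k, e.g. lam = 1, n = 3
lhs = weighted_tail(3, 1.0)
rhs = 1.0 * tail(2, 1.0)
```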
Hence
${q}_{0}=85+50[\sum _{k=3}^{\infty}{p}_{k}-3\sum _{k=4}^{\infty}{p}_{k}]\approx 85+50[0.0803-3\times 0.0190]=86.1668$
${q}_{1}=50\sum _{k=3}^{\infty}(k-1){p}_{k}=50[\sum _{k=2}^{\infty}{p}_{k}-\sum _{k=3}^{\infty}{p}_{k}]=50{p}_{2}=9.1970$
${q}_{2}=50\sum _{k=3}^{\infty}(k-2){p}_{k}=50[0.2642-2\times 0.0803]=5.1819$
${q}_{3}={q}_{0}-85=1.1668$ so that $\mathbf{q}={\left[86.1668\phantom{\rule{0.277778em}{0ex}}9.1970\phantom{\rule{0.277778em}{0ex}}5.1819\phantom{\rule{0.277778em}{0ex}}1.1668\right]}^{T}$
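These values are easily reproduced numerically. The Python sketch below is a check, not part of the Matlab development; it assumes, as the computed values indicate, Poisson demand with λ = 1 and the cost figures above:

```python
from math import exp, factorial

lam = 1.0
p = [exp(-lam) * lam**k / factorial(k) for k in range(60)]  # Poisson(1) pmf

def tail(n):
    """sum_{k=n}^inf p_k (truncated; the tail beyond k = 59 is negligible)."""
    return sum(p[n:])

q0 = 85 + 50 * (tail(3) - 3 * tail(4))          # restock cost plus shortage penalty
q1 = 50 * sum((k - 1) * p[k] for k in range(3, 60))
q2 = 50 * (tail(2) - 2 * tail(3))
q3 = q0 - 85
```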
Then, for $\alpha =0.8$ , we have $\phi ={[\mathbf{I}-0.8\mathbf{P}]}^{-1}\mathbf{q}$ .
Recall that we are dealing with expected discounted costs. Since we usually consider starting with a full stock, the amount $\phi \left(M\right)=\phi \left(3\right)=100.68$ is the quantity of interest. Note that this is smaller than other values of $\phi \left(j\right)$ , which correspond to smaller beginning stock levels. We expect greater restocking costs in such cases.
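This figure can be verified directly. The Python sketch below reconstructs the chain under the assumptions the computed values indicate (Poisson (1) demand, the (m, M) = (1, 3) policy, and states 0 through 3 representing the end-of-period stock) and solves $[\mathbf{I}-\alpha \mathbf{P}]\phi =\mathbf{q}$ with α = 0.8:

```python
from math import exp, factorial

lam, alpha = 1.0, 0.8
p = [exp(-lam) * lam**k / factorial(k) for k in range(60)]  # Poisson(1) pmf

def row(start):
    """Distribution of next period's ending stock from `start` units on hand."""
    r = [0.0] * 4
    for j in range(1, start + 1):
        r[j] = p[start - j]            # demand leaves j units
    r[0] = sum(p[start:])              # demand meets or exceeds the stock
    return r

# From state 0 the (1, 3) policy restocks to M = 3 before demand occurs
P = [row(3), row(1), row(2), row(3)]
q = [86.1668, 9.1970, 5.1819, 1.1668]  # next-period expected costs from the text

# Solve (I - alpha*P) phi = q by Gaussian elimination (no pivoting needed here)
A = [[(i == j) - alpha * P[i][j] for j in range(4)] for i in range(4)]
b = q[:]
for col in range(4):                   # forward elimination
    for r in range(col + 1, 4):
        f = A[r][col] / A[col][col]
        A[r] = [ar - f * ac for ar, ac in zip(A[r], A[col])]
        b[r] -= f * b[col]
phi = [0.0] * 4
for r in range(3, -1, -1):             # back substitution
    phi[r] = (b[r] - sum(A[r][c] * phi[c] for c in range(r + 1, 4))) / A[r][r]
```

Under these assumptions the computed φ(3) agrees with the 100.68 quoted above, and φ(3) is the smallest entry, consistent with the remark about smaller beginning stock levels.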
We consider again the inventory problem, this time without discounting.
The eigenvalues are $1$ , $0.0920+i0.2434$ , $0.0920-i0.2434$ , and $0$ . Thus, the decay of the transients is controlled by $|{\lambda}^{*}|={({0.0920}^{2}+{0.2434}^{2})}^{1/2}=0.2602$ . Since ${|{\lambda}^{*}|}^{4}\approx 0.0046$ , the transients die out rather rapidly. We obtain the following approximations.
The approximate value of B _{0} is found to be
The value $g=\boldsymbol{\pi}\mathbf{q}\approx 28.80$ is found in the earlier treatment. From the values above, we see that the average gain per period, $g\approx 28.80$ , soon dominates the constant terms represented by the entries in $\mathbf{v}$ .
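These long-run figures can be checked numerically. The Python sketch below reconstructs the chain under the same assumptions as before (Poisson (1) demand, (m, M) = (1, 3) policy, states 0 through 3 the end-of-period stock), approximates the long-run distribution by a row of ${\mathbf{P}}^{16}$ , and recovers $g=\boldsymbol{\pi}\mathbf{q}\approx 28.80$ :

```python
from math import exp, factorial

lam = 1.0
p = [exp(-lam) * lam**k / factorial(k) for k in range(60)]  # Poisson(1) pmf

def row(start):
    """Next period's ending-stock distribution from `start` units on hand."""
    r = [0.0] * 4
    for j in range(1, start + 1):
        r[j] = p[start - j]
    r[0] = sum(p[start:])
    return r

P = [row(3), row(1), row(2), row(3)]   # state 0 restocks to M = 3 first

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

Pn = P
for _ in range(15):                    # P^16; transients decay like 0.26^n
    Pn = matmul(Pn, P)
pi = Pn[0]                             # any row approximates the long-run dbn

q = [86.1668, 9.1970, 5.1819, 1.1668]  # next-period expected costs from the text
g = sum(pi_i * q_i for pi_i, q_i in zip(pi, q))
```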
A Markov decision process has three states: State space $\mathbf{E}=\{1,\phantom{\rule{0.166667em}{0ex}}2,\phantom{\rule{0.166667em}{0ex}}3\}$ .
Actions: State 1: ${A}_{1}=\{1,\phantom{\rule{0.166667em}{0ex}}2,\phantom{\rule{0.166667em}{0ex}}3\}$ State 2: ${A}_{2}=\{1,\phantom{\rule{0.166667em}{0ex}}2\}$ State 3: ${A}_{3}=\{1,\phantom{\rule{0.166667em}{0ex}}2\}$
Transition probabilities and rewards are:
${p}_{1j}\left(1\right)$ : [1/3 1/3 1/3]   ${g}_{1j}\left(1\right)$ : [1 3 4]
${p}_{1j}\left(2\right)$ : [1/4 3/8 3/8]   ${g}_{1j}\left(2\right)$ : [2 2 3]
${p}_{1j}\left(3\right)$ : [1/3 1/3 1/3]   ${g}_{1j}\left(3\right)$ : [2 2 3]
${p}_{2j}\left(1\right)$ : [1/8 3/8 1/2]   ${g}_{2j}\left(1\right)$ : [2 1 2]
${p}_{2j}\left(2\right)$ : [1/2 1/4 1/4]   ${g}_{2j}\left(2\right)$ : [1 4 4]
${p}_{3j}\left(1\right)$ : [3/8 1/4 3/8]   ${g}_{3j}\left(1\right)$ : [2 3 3]
${p}_{3j}\left(2\right)$ : [1/8 1/4 5/8]   ${g}_{3j}\left(2\right)$ : [3 2 2]
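For a type 2 problem, the expected next-period gain for each state-action row is the probability-weighted average of that row's one-step gains (computed as QA = diag(PA*GA') in the Matlab procedures below). A Python sketch of that computation on the data above, with the zero separator rows omitted (a check only, not part of the Matlab development):

```python
# Transition probabilities and one-step gains, one row per (state, action) pair,
# in the order listed above (states 1, 2, 3; actions within each state)
PA = [
    [1/3, 1/3, 1/3],   # state 1, action 1
    [1/4, 3/8, 3/8],   # state 1, action 2
    [1/3, 1/3, 1/3],   # state 1, action 3
    [1/8, 3/8, 1/2],   # state 2, action 1
    [1/2, 1/4, 1/4],   # state 2, action 2
    [3/8, 1/4, 3/8],   # state 3, action 1
    [1/8, 1/4, 5/8],   # state 3, action 2
]
GA = [
    [1, 3, 4],
    [2, 2, 3],
    [2, 2, 3],
    [2, 1, 2],
    [1, 4, 4],
    [2, 3, 3],
    [3, 2, 2],
]
# Expected next-period gain per row: the diagonal of PA * GA'
QA = [sum(pr * gr for pr, gr in zip(prow, grow)) for prow, grow in zip(PA, GA)]
```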
A computational procedure utilizing Matlab. We first put the data in an m-file. Since we have several cases, it is expedient to include the case number. This example belongs to type 2. The data are in file md61.m.
type = 2
PA = [1/3 1/3 1/3; 1/4 3/8 3/8; 1/3 1/3 1/3; 0 0 0;
      1/8 3/8 1/2; 1/2 1/4 1/4; 0 0 0; 3/8 1/4 3/8; 1/8 1/4 5/8]
GA = [1 3 4; 2 2 3; 2 2 3; 0 0 0;    % Zero rows are separators
      2 1 2; 1 4 4; 0 0 0;
      2 3 3; 3 2 2]
A = [1 2 3 0 1 2 0 1 2]
md61
type = 2
PA =
    0.3333    0.3333    0.3333
    0.2500    0.3750    0.3750
    0.3333    0.3333    0.3333
         0         0         0
    0.1250    0.3750    0.5000
    0.5000    0.2500    0.2500
         0         0         0
    0.3750    0.2500    0.3750
    0.1250    0.2500    0.6250
GA =
     1     3     4
     2     2     3
     2     2     3
     0     0     0
     2     1     2
     1     4     4
     0     0     0
     2     3     3
     3     2     2
A =
     1     2     3     0     1     2     0     1     2
Once the data are entered into Matlab by the call for file “md61,” we make preparation for the policy-improvement step. The procedure is in the file newpolprep.m.
% file newpolprep.m
% version of 3/23/92
disp('Data needed:')
disp('   Matrix PA of transition probabilities for states and actions')
disp('   Matrix GA of gains for states and actions')
disp('   Type number to identify GA matrix types')
disp('     Type 1. Gains associated with a state')
disp('     Type 2. One-step transition gains')
disp('     Type 3. Gains associated with a demand')
disp('   Row vector of actions')
disp('   Value of alpha (= 1 for no discounting)')
c  = input('Enter type number to show gain type ');
a  = input('Enter value of alpha (= 1 for no discounting) ');
PA = input('Enter matrix PA of transition probabilities ');
GA = input('Enter matrix GA of gains ');
if c == 3
  PD = input('Enter row vector of demand probabilities ');
end
if c == 1
  QA = GA';
elseif c == 2
  QA = diag(PA*GA');       % (qx1)
else
  QA = GA*PD';
end
A = input('Enter row vector A of possible actions ');   % (1xq)
m = length(PA(1,:));
q = length(A);
n = input('Enter n, the power of P to approximate P0 ');
s = input('Enter s, the power of P in the approximation of V ');
QD = [(1:q)' A' QA];       % First col is index; second is A; third is QA
DIS = ['   Index   Action    Value'];
disp(DIS)
disp(QD)
if a == 1
  call = 'Call for newpol';
else
  call = 'Call for newpold';
end
disp(call)

newpolprep    % Call for preparatory program in file newpolprep.m
Data needed:
   Matrix PA of transition probabilities for states and actions
   Matrix GA of gains for states and actions
   Type number to identify GA matrix types
     Type 1. Gains associated with a state
     Type 2. One-step transition gains
     Type 3. Gains associated with a demand
   Row vector of actions
   Value of alpha (= 1 for no discounting)
Enter type number to show gain type  2
Enter value of alpha (= 1 for no discounting)  1
Enter matrix PA of transition probabilities  PA
Enter matrix GA of gains  GA
Enter row vector A of possible actions  A
Enter n, the power of P to approximate P0  16
Enter s, the power of P in the approximation of V  6
   Index   Action    Value
   1.0000   1.0000   2.6667
   2.0000   2.0000   2.3750
   3.0000   3.0000   2.3333
   4.0000        0        0
   5.0000   1.0000   1.6250
   6.0000   2.0000   2.5000
   7.0000        0        0
   8.0000   1.0000   2.6250
   9.0000   2.0000   2.1250
Call for newpol
The output prompts a call for the procedure newpol (in file newpol.m).
% file: newpol.m (without discounting)
% version of 3/23/92
d  = input('Enter policy as row matrix of indices ');
D  = A(d);             % policy in terms of actions
P  = PA(d',:);         % selects probabilities for policy
Q  = QA(d',:);         % selects next-period gains for policy
P0 = P^n;              % Display to check convergence
PI = P0(1,:);          % Long-run distribution
G  = PI*Q              % Long-run expected gain per period
C  = P + eye(size(P));
for j = 2:s
  C = C + P^j;         % C = I + P + P^2 + ... + P^s
end
V = (C - (s+1)*P0)*Q;  % V = B0*Q
disp(' ')
disp('Approximation to P0; rows are long-run dbn')
disp(P0)
disp('Policy in terms of indices')
disp(d)
disp('Policy in terms of actions')
disp(D)
disp('Values for the policy selected')
disp(V)
disp('Long-run expected gain per period G')
disp(G)
T = [(1:q)' A' (QA + PA*V)];   % Test values for determining new policy
DIS = ['   Index   Action   Test Value'];
disp(DIS)
disp(T)
disp('Check current policy against new test values.')
disp('--If new policy needed, call for newpol')
disp('--If not, use policy, values V, and gain G, above')
newpol
Enter policy as row matrix of indices  [2 6 9]   % A deliberately poor choice
Approximation to P0; rows are long-run dbn
    0.2642    0.2830    0.4528
    0.2642    0.2830    0.4528
    0.2642    0.2830    0.4528
Policy in terms of indices
     2     6     9
Policy in terms of actions
     2     2     2
Long-run expected gain per period G
    2.2972
   Index   Action   Test Value
   1.0000   1.0000   2.7171
   2.0000   2.0000   2.4168
   3.0000   3.0000   2.3838
   4.0000        0        0
   5.0000   1.0000   1.6220
   6.0000   2.0000   2.5677
   7.0000        0        0
   8.0000   1.0000   2.6479
   9.0000   2.0000   2.0583
Check current policy against new test values.
--If new policy needed, call for newpol
--If not, use policy, values V, and gain G, above   % New policy is needed
newpol
Enter policy as row matrix of indices  [1 6 8]
Approximation to P0; rows are long-run dbn
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
Policy in terms of indices
     1     6     8
Policy in terms of actions
     1     2     1
Values for the policy selected
    0.0526
   -0.0989
    0.0223
Long-run expected gain per period G
    2.6061
   Index   Action   Test Value
   1.0000   1.0000   2.6587
   2.0000   2.0000   2.3595
   3.0000   3.0000   2.3254
   4.0000        0        0
   5.0000   1.0000   1.6057
   6.0000   2.0000   2.5072
   7.0000        0        0
   8.0000   1.0000   2.6284
   9.0000   2.0000   2.1208
Check current policy against new test values.
--If new policy needed, call for newpol
--If not, use policy, values V, and gain G, above
Since the policy selected on this iteration is the same as the previous one, the procedure has converged to an optimal policy. The first of the first three rows is maximum; the second of the next two rows is maximum; and the first of the final two rows is maximum. Thus, we have selected rows 1, 6, 8, corresponding to the optimal policy ${d}^{*}\sim \left(1\phantom{\rule{0.277778em}{0ex}}2\phantom{\rule{0.277778em}{0ex}}1\right)$ . The expected long-run gain per period is $g=2.6061$ .
The value matrix is $\mathbf{V}={\left[0.0526\phantom{\rule{0.277778em}{0ex}}-0.0989\phantom{\rule{0.277778em}{0ex}}0.0223\right]}^{T}$ , as displayed in the second iteration above.
We made a somewhat arbitrary choice of the powers of P used in the approximation of P _{0} and B _{0} . As we note in the development of the approximation procedures in Sec 4, the convergence of ${\mathbf{P}}^{n}$ is governed by the magnitude of the largest eigenvalue other than one. We can always get a check on the reliability of the calculations by checking the eigenvalues of P corresponding to the presumed optimal policy. For the choice above, we find the eigenvalues to be $1$ , $-0.1250$ , and $0.0833$ . Since ${0.125}^{4}\approx 0.0002$ , the choices of exponents should be quite satisfactory. In fact, we could probably use ${\mathbf{P}}^{8}$ as a satisfactory approximation to P _{0} , if that were desirable. The margin allows for the possibility that for some policies the eigenvalues may not be so small. — $\square $
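The converged results can be checked outside Matlab. The Python sketch below rebuilds P and Q from rows 1, 6, 8 of the PA and QA data above, approximates P _{0} by ${\mathbf{P}}^{16}$ , and applies the same truncated-sum formula that newpol.m uses for V (a numerical check, not part of the original development):

```python
# Transition rows and expected gains for the optimal policy d* ~ (1 2 1)
P = [[1/3, 1/3, 1/3],
     [1/2, 1/4, 1/4],
     [3/8, 1/4, 3/8]]
Q = [8/3, 5/2, 21/8]       # QA entries for rows 1, 6, 8

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, x):
    return [sum(A[i][k] * x[k] for k in range(3)) for i in range(3)]

Pn = P
for _ in range(15):        # Pn = P^16; transients decay like 0.125^n
    Pn = matmul(Pn, P)
pi = Pn[0]                 # long-run distribution
G = sum(pi[i] * Q[i] for i in range(3))

# V = (I + P + ... + P^s - (s+1)*P0) * Q with s = 6, as in newpol.m
s = 6
I = [[float(i == j) for j in range(3)] for i in range(3)]
C, Pj = I, I
for _ in range(s):
    Pj = matmul(Pj, P)
    C = [[C[i][j] + Pj[i][j] for j in range(3)] for i in range(3)]
B0 = [[C[i][j] - (s + 1) * Pn[i][j] for j in range(3)] for i in range(3)]
V = matvec(B0, Q)
```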
Assume data in file md61.m is in Matlab; if not, call for md61. We use the procedure newpolprep to prepare for the iterative procedure by making some initial choices.
newpolprep
Data needed:
   Matrix PA of transition probabilities for states and actions
   Matrix GA of gains for states and actions
   Type number to identify GA matrix types
     Type 1. Gains associated with a state
     Type 2. One-step transition gains
     Type 3. Gains associated with a demand
   Row vector of actions
   Value of alpha (= 1 for no discounting)
Enter type number to show gain type  2
Enter value of alpha (= 1 for no discounting)  0.8
Enter matrix PA of transition probabilities  PA
Enter matrix GA of gains  GA
Enter row vector A of possible actions  A
Enter n, the power of P to approximate P0  16
Enter s, the power of P in the approximation of V  6
   Index   Action    Value
   1.0000   1.0000   2.6667
   2.0000   2.0000   2.3750
   3.0000   3.0000   2.3333
   4.0000        0        0
   5.0000   1.0000   1.6250
   6.0000   2.0000   2.5000
   7.0000        0        0
   8.0000   1.0000   2.6250
   9.0000   2.0000   2.1250
Call for newpold
The procedure for policy iteration is in the file newpold.m.
% file newpold.m (with discounting)
% version of 3/23/92
d  = input('Enter policy as row matrix of indices ');
D  = A(d);                 % policy in terms of actions
P  = PA(d',:);             % transition matrix for policy selected
Q  = QA(d',:);             % average next-period gain for policy selected
P0 = P^n;                  % approximation to the long-run matrix
V  = (eye(size(P)) - a*P)\Q;     % Present values for unlimited future
T  = [(1:q)' A' (QA + a*PA*V)];  % Test values for determining new policy
disp(' ')
disp('Approximation to P0; rows are long-run dbn')
disp(P0)
disp('Policy in terms of indices')
disp(d)
disp('Policy in terms of actions')
disp(D)
disp('Values for policy')
disp(V)
DIS = ['   Index   Action   Test Value'];
disp(DIS)
disp(T)
disp('Check current policy against new test values.')
disp('--If new policy needed, call for newpold')
disp('--If not, use policy and values above')
newpold
Enter policy as row matrix of indices  [3 5 9]   % Deliberately poor choice
Approximation to P0; rows are long-run dbn
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
Policy in terms of indices
     3     5     9
Policy in terms of actions
     3     1     2
Values for policy
   10.3778
    9.6167
   10.1722
   Index   Action   Test Value
   1.0000   1.0000  10.7111
   2.0000   2.0000  10.3872
   3.0000   3.0000  10.3778
   4.0000        0        0
   5.0000   1.0000   9.6167
   6.0000   2.0000  10.6089
   7.0000        0        0
   8.0000   1.0000  10.7133
   9.0000   2.0000  10.1722
Check current policy against new test values.
--If new policy needed, call for newpold
--If not, use policy and values above

newpold
Enter policy as row matrix of indices  [1 6 8]
Approximation to P0; rows are long-run dbn
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
    0.3939    0.2828    0.3232
Policy in terms of indices
     1     6     8
Policy in terms of actions
     1     2     1
Values for policy
   13.0844
   12.9302
   13.0519
   Index   Action   Test Value
   1.0000   1.0000  13.0844
   2.0000   2.0000  12.7865
   3.0000   3.0000  12.7511
   4.0000        0        0
   5.0000   1.0000  12.0333
   6.0000   2.0000  12.9302
   7.0000        0        0
   8.0000   1.0000  13.0519
   9.0000   2.0000  12.5455
Check current policy against new test values.
--If new policy needed, call for newpold
--If not, use policy and values above
Since the policy indicated is the same as the previous policy, we know this is an optimal policy. Rows 1, 6, 8 indicate the optimal policy to be ${d}^{*}\sim (1,\phantom{\rule{0.166667em}{0ex}}2,\phantom{\rule{0.166667em}{0ex}}1)$ . It turns out in this example that the optimal policies are the same for the discounted and undiscounted cases; that is not always true. The V matrix gives the present values for unlimited futures, one for each of the three possible starting states.
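As an independent check on the discounted values, V for the selected policy satisfies $\mathbf{V}={[\mathbf{I}-\alpha \mathbf{P}]}^{-1}\mathbf{Q}$ , the same system the backslash operation in newpold.m solves. The Python sketch below (the helper solve3 is a small Gaussian elimination written here, not a library call) solves it for the optimal rows 1, 6, 8 with α = 0.8:

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]          # work on copies
    b = b[:]
    n = 3
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):     # back substitution
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

alpha = 0.8
P = [[1/3, 1/3, 1/3],      # transition rows for policy d* ~ (1 2 1)
     [1/2, 1/4, 1/4],
     [3/8, 1/4, 3/8]]
Q = [8/3, 5/2, 21/8]       # QA entries for rows 1, 6, 8

# Build I - alpha*P and solve for the present values V
A = [[(1.0 if i == j else 0.0) - alpha * P[i][j] for j in range(3)] for i in range(3)]
V = solve3(A, Q)
```

Under these assumptions the solution reproduces the values for policy [1 6 8] shown in the run above.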