# -*- coding: utf-8 -*-
"""Markov Decision Process (MDP) Toolbox: ``mdp`` module
=====================================================

The ``mdp`` module provides classes for the resolution of discrete-time Markov
Decision Processes.

Available classes
-----------------
MDP
    Base Markov decision process class
FiniteHorizon
    Backwards induction finite horizon MDP
LP
    Linear programming MDP
PolicyIteration
    Policy iteration MDP
PolicyIterationModified
    Modified policy iteration MDP
QLearning
    Q-learning MDP
RelativeValueIteration
    Relative value iteration MDP
ValueIteration
    Value iteration MDP
ValueIterationGS
    Gauss-Seidel value iteration MDP
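
Example
-------
A minimal usage sketch (it assumes the companion ``mdptoolbox.example``
module is available, as in the class examples in this module):

>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.forest()
>>> vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96)
>>> vi.run()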

"""

# Copyright (c) 2011-2013 Steven A. W. Cordwell
# Copyright (c) 2009 INRA
# 
# All rights reserved.
# 
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# 
#   * Redistributions of source code must retain the above copyright notice,
#     this list of conditions and the following disclaimer.
#   * Redistributions in binary form must reproduce the above copyright notice,
#     this list of conditions and the following disclaimer in the documentation
#     and/or other materials provided with the distribution.
#   * Neither the name of the <ORGANIZATION> nor the names of its contributors
#     may be used to endorse or promote products derived from this software
#     without specific prior written permission.
# 
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

from math import ceil, log, sqrt
from time import time

from numpy import absolute, array, empty, mean, mod, multiply, ones, zeros
from numpy.random import randint, random
from scipy.sparse import csr_matrix as sparse

from .utils import check, getSpan

class MDP(object):
    
    """A Markov Decision Problem.
    
    Let S = the number of states, and A = the number of actions.
    
    Parameters
    ----------
    transitions : array
        Transition probability matrices. These can be defined in a variety of
        ways. The simplest is a numpy array that has the shape (A, S, S),
        though there are other possibilities. It can be a tuple or list or
        numpy object array of length A, where each element contains a numpy
        array or matrix that has the shape (S, S). This "list of matrices" form
        is useful when the transition matrices are sparse, because
        scipy.sparse.csr_matrix matrices can be used. In summary, each action's
        transition matrix must be indexable like ``P[a]`` where
        ``a`` ∈ {0, 1...A-1}. See the Examples section at the end of this
        docstring for a short sketch of both forms.
    reward : array
        Reward matrices or vectors. Like the transition matrices, these can
        also be defined in a variety of ways. Again the simplest is a numpy
        array that has the shape (S, A), (S,) or (A, S, S). A list of lists can
        be used, where each inner list has length S. A list of numpy arrays is
        possible where each inner array can be of the shape (S,), (S, 1),
        (1, S) or (S, S). Also scipy.sparse.csr_matrix can be used instead of
        numpy arrays. In addition, the outer list can be replaced with a tuple
        or a numpy object array.
    discount : float
        Discount factor. The per time-step discount factor on future rewards.
        Valid values are greater than 0 up to and including 1. If the discount
        factor is 1, then convergence cannot be assumed and a warning will
        be displayed. Subclasses of ``MDP`` may pass None in the case where the
        algorithm does not use a discount factor.
    epsilon : float
        Stopping criterion. The maximum change in the value function at each
        iteration is compared against ``epsilon``. Once the change falls below
        this value, then the value function is considered to have converged to
        the optimal value function. Subclasses of ``MDP`` may pass None in the
        case where the algorithm does not use a stopping criterion.
    max_iter : int
        Maximum number of iterations. The algorithm will be terminated once
        this many iterations have elapsed. This must be greater than 0 if
        specified. Subclasses of ``MDP`` may pass None in the case where the 
        algorithm does not use a maximum number of iterations.
    
    Attributes
    ----------
    P : array
        Transition probability matrices.
    R : array
        Reward vectors.
    V : tuple
        The optimal value function. Each element is a float corresponding to
        the expected value of being in that state assuming the optimal policy
        is followed.
    discount : float
        The discount rate on future rewards.
    max_iter : int
        The maximum number of iterations.
    policy : tuple
        The optimal policy.
    time : float
        The time used to converge to the optimal policy.
    verbose : boolean
        Whether verbose output should be displayed or not.
    
    Methods
    -------
    run
        Implemented in child classes as the main algorithm loop. Raises an
        exception if it has not been overridden.
    setSilent
        Turn the verbosity off
    setVerbose
        Turn the verbosity on
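    
    Examples
    --------
    The transition and reward arguments accept several equivalent formats; a
    minimal sketch of two of them (the names ``P_dense``, ``P_sparse`` and
    ``R`` are illustrative only):
    
    >>> import numpy as np
    >>> import scipy.sparse as sp
    >>> import mdptoolbox
    >>> S, A = 3, 2
    >>> # dense form: P has shape (A, S, S), R has shape (S, A)
    >>> P_dense = np.array([np.eye(S)] * A)
    >>> R = np.ones((S, A))
    >>> # "list of matrices" form: one (S, S) sparse matrix per action
    >>> P_sparse = [sp.csr_matrix(np.eye(S)) for _ in range(A)]
    >>> m = mdptoolbox.mdp.MDP(P_dense, R, 0.9, 0.01, 1000)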
    
    """
    
    def __init__(self, transitions, reward, discount, epsilon, max_iter):
        # Initialise an MDP based on the input parameters.
        
        # if the discount is None then the algorithm is assumed to not use it
        # in its computations
        if discount is not None:
            self.discount = float(discount)
            assert 0.0 < self.discount <= 1.0, "Discount rate must be in ]0; 1]"
            if self.discount == 1:
                print("WARNING: check conditions of convergence. With no "
                      "discount, convergence cannot be assumed.")
        # if the max_iter is None then the algorithm is assumed to not use it
        # in its computations
        if max_iter is not None:
            self.max_iter = int(max_iter)
            assert self.max_iter > 0, "The maximum number of iterations " \
                                      "must be greater than 0."
        # check that epsilon is something sane
        if epsilon is not None:
            self.epsilon = float(epsilon)
            assert self.epsilon > 0, "Epsilon must be greater than 0."
        # we run a check on P and R to make sure they are describing an MDP. If
        # an exception isn't raised then they are assumed to be correct.
        check(transitions, reward)
        # computePR will assign the variables self.S, self.A, self.P and self.R
        self._computePR(transitions, reward)
        # the verbosity is by default turned off
        self.verbose = False
        # Initially the time taken to perform the computations is set to None
        self.time = None
        # set the initial iteration count to zero
        self.iter = 0
        # V should be stored as a vector ie shape of (S,) or (1, S)
        self.V = None
        # policy can also be stored as a vector
        self.policy = None
    
    def __repr__(self):
        P_repr = "P: \n"
        R_repr = "R: \n"
        for aa in range(self.A):
            P_repr += repr(self.P[aa]) + "\n"
            R_repr += repr(self.R[aa]) + "\n"
        return(P_repr + "\n" + R_repr)
    
    def _bellmanOperator(self, V=None):
        # Apply the Bellman operator on the value function.
        # 
        # Updates the value function and the Vprev-improving policy.
        # 
        # Returns: (policy, value), tuple of new policy and its value
        #
        # If V hasn't been passed into the method, then we assume we are
        # working on the object's V attribute
        if V is None:
            # this V should be a reference to the data rather than a copy
            V = self.V
        else:
            # make sure the user supplied V is of the right shape
            try:
                assert V.shape in ((self.S,), (1, self.S)), "V is not the " \
                    "right shape (Bellman operator)."
            except AttributeError:
                raise TypeError("V must be a numpy array or matrix.")
        # Looping through each action, the Q-value matrix is calculated.
        # P and V can be any object that supports indexing, so it is important
        # that you know they define a valid MDP before calling the
        # _bellmanOperator method. Otherwise the results will be meaningless.
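        # For each action aa the update below computes
        #     Q[aa, s] = R[aa][s] + discount * sum_s' P[aa][s, s'] * V[s']
        # i.e. the expected immediate reward plus the discounted expected
        # value of the successor state.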
        Q = empty((self.A, self.S))
        for aa in range(self.A):
            Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)
        # Get the policy and value, for now it is being returned but...
        # Which way is better?
        # 1. Return, (policy, value)
        return (Q.argmax(axis=0), Q.max(axis=0))
        # 2. update self.policy and self.V directly
        # self.V = Q.max(axis=1)
        # self.policy = Q.argmax(axis=1)
    
    def _computeP(self, P):
        # Set self.P as a tuple of length A, with each element storing an S×S
        # matrix.
        self.A = len(P)
        try:
            if P.ndim == 3:
                self.S = P.shape[1]
            else:
                self.S = P[0].shape[0]
        except AttributeError:
            self.S = P[0].shape[0]
        # convert P to a tuple of numpy arrays
        self.P = tuple(P[aa] for aa in range(self.A))
    
    def _computePR(self, P, R):
        # Compute the reward for the system in one state choosing an action.
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        #    P(SxSxA)  = transition matrix 
        #        P could be an array with 3 dimensions or  a cell array (1xA), 
        #        each cell containing a matrix (SxS) possibly sparse
        #    R(SxSxA) or (SxA) = reward matrix
        #        R could be an array with 3 dimensions (SxSxA) or  a cell array 
        #        (1xA), each cell containing a sparse matrix (SxS) or a 2D 
        #        array(SxA) possibly sparse  
        # Evaluation
        # ----------
        #    PR(SxA)   = reward matrix
        #
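        # For the (SxSxA) reward form the expected immediate reward is
        #     PR[s, a] = sum_s' P[a][s, s'] * R[a][s, s']
        # which is what the multiply(...).sum(1) expressions below compute;
        # for the (SxA) form the columns of R are used directly.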
        # We assume that P and R define an MDP, i.e. the assumption is that
        # check(P, R) has already been run and doesn't fail.
        #
        # First compute and store P, S, and A
        self._computeP(P)
        # Set self.R as a tuple of length A, with each element storing a 1×S
        # vector.
        try:
            if R.ndim == 1:
                r = array(R).reshape(self.S)
                self.R = tuple(r for aa in range(self.A))
            elif R.ndim == 2:
                self.R = tuple(array(R[:, aa]).reshape(self.S)
                                for aa in range(self.A))
            else:
                self.R = tuple(multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                                for aa in range(self.A))
        except AttributeError:
            if len(R) == self.A:
                self.R = tuple(multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                                for aa in range(self.A))
            else:
                r = array(R).reshape(self.S)
                self.R = tuple(r for aa in range(self.A))
    
    def run(self):
        # Raise error because child classes should implement this function.
        raise NotImplementedError("You should create a run() method.")
    
    def setSilent(self):
        """Set the MDP algorithm to silent mode."""
        self.verbose = False
    
    def setVerbose(self):
        """Set the MDP algorithm to verbose mode."""
        self.verbose = True

class FiniteHorizon(MDP):
    
    """An MDP solved using the finite-horizon backwards induction algorithm.
    
    Parameters
    ----------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    N : int
        Number of periods. Must be greater than 0.
    h : array, optional
        Terminal reward. Default: a vector of zeros.
    
    Data Attributes
    ---------------
    V : array
        Optimal value function. Shape = (S, N+1). ``V[:, n]`` = optimal value
        function at stage ``n``, with ``n`` in (0, 1...N-1). ``V[:, N]`` =
        value function for the terminal stage.
    policy : array
        Optimal policy. Shape = (S, N). ``policy[:, n]`` = optimal policy at
        stage ``n``, with ``n`` in (0, 1...N-1).
    time : float
        used CPU time
  
    Notes
    -----
    In verbose mode, displays the current stage and policy transpose.
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
    >>> fh.run()
    >>> fh.V
    array([[ 2.6973,  0.81  ,  0.    ,  0.    ],
           [ 5.9373,  3.24  ,  1.    ,  0.    ],
           [ 9.9373,  7.24  ,  4.    ,  0.    ]])
    >>> fh.policy
    array([[0, 0, 0],
           [0, 0, 1],
           [0, 0, 0]])
    
    """

    def __init__(self, transitions, reward, discount, N, h=None):
        # Initialise a finite horizon MDP.
        self.N = int(N)
        assert self.N > 0, 'PyMDPtoolbox: N must be greater than 0.'
        # Initialise the base class
        MDP.__init__(self, transitions, reward, discount, None, None)
        # remove the iteration counter, it is not meaningful for backwards
        # induction
        del self.iter
        # There are value vectors for each time step up to the horizon
        self.V = zeros((self.S, N + 1))
        # There are policy vectors for each time step before the horizon, when
        # we reach the horizon we don't need to make decisions anymore.
        self.policy = empty((self.S, N), dtype=int)
        # Set the reward for the final transition to h, if specified.
        if h is not None:
            self.V[:, N] = h
        # Call the iteration method
        #self.run()
        
    def run(self):
        # Run the finite horizon algorithm.
        self.time = time()
        # loop through each time period
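        # Backwards induction: starting from the terminal values V[:, N], each
        # pass applies the Bellman operator to fill in stage N - n - 1 from
        # stage N - n,
        #     V[:, N-n-1] = max_a ( R_a + discount * P_a . V[:, N-n] ),
        # and records the maximising action as the policy for that stage.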
        for n in range(self.N):
            W, X = self._bellmanOperator(self.V[:, self.N - n])
            self.V[:, self.N - n - 1] = X
            self.policy[:, self.N - n - 1] = W
            if self.verbose:
                print(("stage: %s ... policy transpose : %s") % (
                    self.N - n, self.policy[:, self.N - n - 1].tolist()))
        # update time spent running
        self.time = time() - self.time
        # After this we could create a tuple of tuples for the values and 
        # policies.
        #V = []
        #p = []
        #for n in xrange(self.N):
        #    V.append()
        #    p.append()
        #V.append()
        #self.V = tuple(V)
        #self.policy = tuple(p)

class LP(MDP):
    
    """A discounted MDP solved using linear programming.
    
    This class requires the Python ``cvxopt`` module to be installed.

    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    
    Data Attributes
    ---------------
    V : tuple
        optimal values
    policy : tuple
        optimal policy
    time : float
        used CPU time
    
    Notes
    -----
    In verbose mode, the cvxopt solver's own progress output is left enabled.
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> lp = mdptoolbox.mdp.LP(P, R, 0.9)
    >>> lp.run()
    
    """

    def __init__(self, transitions, reward, discount):
        # Initialise a linear programming MDP.
        # import some functions from cvxopt and set them as object methods
        try:
            from cvxopt import matrix, solvers
            self._linprog = solvers.lp
            self._cvxmat = matrix
        except ImportError:
            raise ImportError("The python module cvxopt is required to use "
                              "linear programming functionality.")
        # we also need diagonal matrices, and using a sparse one may be more
        # memory efficient
        from scipy.sparse import eye as speye
        self._speye = speye
        # initialise the MDP. epsilon and max_iter are not needed
        MDP.__init__(self, transitions, reward, discount, None, None)
        # Set the cvxopt solver to be quiet by default, but ...
        # this doesn't do what I want it to do c.f. issue #3
        if not self.verbose:
            solvers.options['show_progress'] = False
        # Call the iteration method
        #self.run()
    
    def run(self):
        # Run the linear programming algorithm.
        self.time = time()
        # The objective is to resolve : min V / V >= PR + discount*P*V
        # The function linprog of the optimisation Toolbox of Mathworks
        # resolves :
        # min f'* x / M * x <= b
        # So the objective could be expressed as :
        # min V / (discount*P-I) * V <= - PR
        # To avoid loop on states, the matrix M is structured following actions
        # M(A*S,S)
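        # Written out, the LP solved below is
        #     minimise    sum_s V(s)
        #     subject to  (discount * P_a - I) V <= -PR_a   for every action a,
        # i.e. V(s) >= PR(s, a) + discount * sum_s' P_a(s, s') V(s'),
        # with the A*S constraints stacked action by action into M and h.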
        f = self._cvxmat(ones((self.S, 1)))
        h = self._cvxmat(self.R.reshape(self.S * self.A, 1, order="F"), tc='d')
        M = zeros((self.A * self.S, self.S))
        for aa in range(self.A):
            pos = (aa + 1) * self.S
            M[(pos - self.S):pos, :] = (
                self.discount * self.P[aa] - self._speye(self.S, self.S))
        M = self._cvxmat(M)
        # Using the glpk option will make this behave more like Octave
        # (Octave uses glpk) and perhaps Matlab. If solver=None (ie using the 
        # default cvxopt solver) then V agrees with the Octave equivalent
        # only to 10e-8 places. This assumes glpk is installed of course.
        self.V = array(self._linprog(f, M, -h, solver='glpk')['x'])
        # apply the Bellman operator
        self.policy, self.V = self._bellmanOperator()
        # update the time spent solving
        self.time = time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class PolicyIteration(MDP):
    
    """A discounted MDP solved using the policy iteration algorithm.
    
    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details. 
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    policy0 : array, optional
        Starting policy.
    max_iter : int, optional
        Maximum number of iterations. See the documentation for the ``MDP``
        class for details. Default is 1000.
    eval_type : int or string, optional
        Type of function used to evaluate policy. 0 or "matrix" to solve as a
        set of linear equations. 1 or "iterative" to solve iteratively.
        Default: 0.
             
    Data Attributes
    ---------------
    V : tuple
        value function 
    policy : tuple
        optimal policy
    iter : int
        number of done iterations
    time : float
        used CPU time
    
    Notes
    -----
    In verbose mode, at each iteration, displays the number of actions that
    differ between policy n-1 and policy n.
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.rand()
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()
    
    >>> P, R = mdptoolbox.example.forest()
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()
    >>> pi.V
    (26.244000000000018, 29.48400000000002, 33.484000000000016)
    >>> pi.policy
    (0, 0, 0)
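
    The policy can also be evaluated iteratively instead of by solving a
    linear system; a short sketch reusing the forest example above (the
    variable name ``pi_it`` is illustrative only):

    >>> pi_it = mdptoolbox.mdp.PolicyIteration(P, R, 0.9, eval_type="iterative")
    >>> pi_it.run()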
    """
    
    def __init__(self, transitions, reward, discount, policy0=None,
                 max_iter=1000, eval_type=0):
        # Initialise a policy iteration MDP.
        #
        # Set up the MDP, but don't need to worry about epsilon values
        MDP.__init__(self, transitions, reward, discount, None, max_iter)
        # Check if the user has supplied an initial policy. If not make one.
        if policy0 is None:
            # Initialise the policy to the one which maximises the expected
            # immediate reward
            null = zeros(self.S)
            self.policy, null = self._bellmanOperator(null)
            del null
        else:
            # Use the policy that the user supplied
            # Make sure it is a numpy array
            policy0 = array(policy0)
            # Make sure the policy is the right size and shape
            assert policy0.shape in ((self.S, ), (self.S, 1), (1, self.S)), \
                "'policy0' must be a vector with length S."
            # reshape the policy to be a vector
            policy0 = policy0.reshape(self.S)
            # The policy can only contain integers between 0 and S-1
            msg = "'policy0' must be a vector of integers between 0 and S-1."
            assert not mod(policy0, 1).any(), msg
            assert (policy0 >= 0).all(), msg
            assert (policy0 < self.S).all(), msg
            self.policy = policy0
        # set the initial values to zero
        self.V = zeros(self.S)
        # Do some setup depending on the evaluation type
        if eval_type in (0, "matrix"):
            from numpy.linalg import solve
            from scipy.sparse import eye
            self._speye = eye
            self._lin_eq = solve
            self.eval_type = "matrix"
        elif eval_type in (1, "iterative"):
            self.eval_type = "iterative"
        else:
            raise ValueError("PyMDPtoolbox: eval_type should be 0 for matrix "
                             "evaluation or 1 for iterative evaluation. "
                             "The strings 'matrix' and 'iterative' can also "
                             "be used.")
        # Call the iteration method
        #self.run()
    
    def _computePpolicyPRpolicy(self):
        # Compute the transition matrix and the reward matrix for a policy.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix 
        #     P could be an array with 3 dimensions or a cell array (1xA),
        #     each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #     R could be an array with 3 dimensions (SxSxA) or 
        #     a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #     a 2D array(SxA) possibly sparse  
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Ppolicy(SxS)  = transition matrix for policy
        # PRpolicy(S)   = reward matrix for policy
        #
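        # Concretely, for every state s the loop below picks out the row of
        # the action chosen by the policy:
        #     Ppolicy[s, :] = P[policy[s]][s, :]
        #     Rpolicy[s]    = R[policy[s]][s]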
        Ppolicy = empty((self.S, self.S))
        Rpolicy = zeros(self.S)
        for aa in range(self.A): # avoid looping over S
            # the rows that use action a.
            ind = (self.policy == aa).nonzero()[0]
            # if no rows use action a, then no need to assign this
            if ind.size > 0:
                try:
                    Ppolicy[ind, :] = self.P[aa][ind, :]
                except ValueError:
                    Ppolicy[ind, :] = self.P[aa][ind, :].todense()
                #PR = self._computePR() # an apparently unneeded line, and
                # perhaps harmful in this implementation c.f.
                # mdp_computePpolicyPRpolicy.m
                Rpolicy[ind] = self.R[aa][ind]
        # self.R cannot be sparse with the code in its current condition, but
        # it should be possible in the future. Also, if R is so big that it's
        # a good idea to use a sparse matrix for it, then converting PRpolicy
        # from a dense to sparse matrix doesn't seem very memory efficient
        if type(self.R) is sparse:
            Rpolicy = sparse(Rpolicy)
        #self.Ppolicy = Ppolicy
        #self.Rpolicy = Rpolicy
        return (Ppolicy, Rpolicy)
    
    def _evalPolicyIterative(self, V0=0, epsilon=0.0001, max_iter=10000):
        # Evaluate a policy using iteration.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix 
        #    P could be an array with 3 dimensions or 
        #    a cell array (1xA), each cell containing a matrix possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #    R could be an array with 3 dimensions (SxSxA) or 
        #    a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #    a 2D array(SxA) possibly sparse  
        # discount  = discount rate in ]0; 1[
        # policy(S) = a policy
        # V0(S)     = starting value function, optional (default : zeros(S,1))
        # epsilon   = epsilon-optimal policy search, greater than 0,
        #    optional (default : 0.0001)
        # max_iter  = maximum number of iterations to be done, greater than 0,
        #    optional (default : 10000)
        #    
        # Evaluation
        # ----------
        # Vpolicy(S) = value function associated with a specific policy
        #
        # Notes
        # -----
        # In verbose mode, at each iteration, displays the condition which
        # stopped iterations: epsilon-optimum value function found or maximum
        # number of iterations reached.
        #
        try:
            assert V0.shape in ((self.S, ), (self.S, 1), (1, self.S)), \
                "'V0' must be a vector of length S."
            policy_V = array(V0).reshape(self.S)
        except AttributeError:
            # V0 has no ``shape`` attribute: it is either the scalar default
            # (0) or a plain sequence such as a list
            if isinstance(V0, (int, float)):
                policy_V = zeros(self.S)
            elif len(V0) == self.S:
                policy_V = array(V0).reshape(self.S)
            else:
                policy_V = zeros(self.S)
        
        policy_P, policy_R = self._computePpolicyPRpolicy()
        
        if self.verbose:
            print('  Iteration    V_variation')
        
        itr = 0
        done = False
        while not done:
            itr += 1
            
            Vprev = policy_V
            policy_V = policy_R + self.discount * policy_P.dot(Vprev)
            
            variation = absolute(policy_V - Vprev).max()
            if self.verbose:
                print(('      %s         %s') % (itr, variation))
            
            # ensure |Vn - Vpolicy| < epsilon
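            # (contraction bound: when successive iterates differ by less than
            # (1 - discount) / discount * epsilon, the current iterate is
            # within epsilon of the true value of this policy)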
            if variation < ((1 - self.discount) / self.discount) * epsilon:
                done = True
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped, epsilon-optimal "
                          "value function.")
            elif itr == max_iter:
                done = True
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped by maximum number "
                          "of iteration condition.")
        
        self.V = policy_V
    
    def _evalPolicyMatrix(self):
        # Evaluate the value function of the policy using linear equations.
        #
        # Arguments 
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA) = transition matrix 
        #      P could be an array with 3 dimensions or a cell array (1xA),
        #      each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #      R could be an array with 3 dimensions (SxSxA) or 
        #      a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #      a 2D array(SxA) possibly sparse  
        # discount = discount rate in ]0; 1[
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Vpolicy(S) = value function of the policy
        #
        Ppolicy, Rpolicy = self._computePpolicyPRpolicy()
        # V = PR + gPV  => (I-gP)V = PR  => V = inv(I-gP)* PR
        self.V = self._lin_eq(
            (self._speye(self.S, self.S) - self.discount * Ppolicy), Rpolicy)
    
    def run(self):
        # Run the policy iteration algorithm.
        # If verbose, print a header
        if self.verbose:
            print('  Iteration  Number_of_different_actions')
        # Set up the while stopping condition and the current time
        done = False
        self.time = time()
        # loop until a stopping condition is reached
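        # Each iteration alternates policy evaluation (computing the value of
        # the current policy, exactly or iteratively) with policy improvement
        # (acting greedily on that value via the Bellman operator), stopping
        # when the policy no longer changes or max_iter is reached.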
        while not done:
            self.iter += 1
            # these _evalPolicy* functions will update the class's value
            # attribute
            if self.eval_type == "matrix":
                self._evalPolicyMatrix()
            elif self.eval_type == "iterative":
                self._evalPolicyIterative()
            # This should update the class's policy attribute but leave the
            # value alone
            policy_next, null = self._bellmanOperator()
            del null
            # calculate in how many places the old policy disagrees with the
            # new policy
            n_different = (policy_next != self.policy).sum()
            # if verbose then continue printing a table
            if self.verbose:
                print(('       %s                 %s') % (self.iter,
                                                         n_different))
            # Once the policy is unchanging or the maximum number of
            # iterations has been reached, then stop
            if n_different == 0:
                done = True
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped, unchanging "
                          "policy found.")
            elif (self.iter == self.max_iter):
                done = True 
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped by maximum number "
                          "of iteration condition.")
            else:
                self.policy = policy_next
        # update the time to return the computation time
        self.time = time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class PolicyIterationModified(PolicyIteration):
    
    """A discounted MDP solved using a modified policy iteration algorithm.
    
    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    epsilon : float, optional
        Stopping criterion. See the documentation for the ``MDP`` class for
        details. Default: 0.01.
    max_iter : int, optional
        Maximum number of iterations. See the documentation for the ``MDP``
        class for details. Default is 10.
    
    Data Attributes
    ---------------
    V : tuple
        value function 
    policy : tuple
        optimal policy
    iter : int
        number of done iterations
    time : float
        used CPU time
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> pim = mdptoolbox.mdp.PolicyIterationModified(P, R, 0.9)
    >>> pim.run()
    >>> pim.policy
    (0, 0, 0)
    >>> pim.V
    (21.81408652334702, 25.054086523347017, 29.054086523347017)
    
    """
    
    def __init__(self, transitions, reward, discount, epsilon=0.01,
                 max_iter=10):
        # Initialise a (modified) policy iteration MDP.
        
        # Maybe it's better not to subclass from PolicyIteration, because the
        # initialisation of the two are quite different. eg there is policy0
        # being calculated here which doesn't need to be. The only thing that
        # is needed from the PolicyIteration class is the _evalPolicyIterative
        # function. Perhaps there is a better way to do it?
        PolicyIteration.__init__(self, transitions, reward, discount, None,
                                 max_iter, 1)
        
        # PolicyIteration doesn't pass epsilon to MDP.__init__() so we will
        # check it here
        self.epsilon = float(epsilon)
        assert epsilon > 0, "'epsilon' must be greater than 0."
        
        # computation of threshold of variation for V for an epsilon-optimal
        # policy
        if self.discount != 1:
            self.thresh = self.epsilon * (1 - self.discount) / self.discount
        else:
            self.thresh = self.epsilon
        
        if self.discount == 1:
            self.V = zeros((self.S, 1))
        else:
            Rmin = min(R.min() for R in self.R)
            self.V = 1 / (1 - self.discount) * Rmin * ones((self.S,))
        
        # Call the iteration method
        #self.run()
    
    def run(self):
        # Run the modified policy iteration algorithm.
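        # Modified policy iteration replaces the exact policy evaluation step
        # with a bounded number of iterative evaluation sweeps: each pass does
        # one greedy (Bellman) update to obtain the policy, then calls
        # _evalPolicyIterative to partially evaluate it.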
        
        if self.verbose:
            print('\tIteration\tV-variation')
        
        self.time = time()
        
        done = False
        while not done:
            self.iter += 1
            
            self.policy, Vnext = self._bellmanOperator()
            #[Ppolicy, PRpolicy] = mdp_computePpolicyPRpolicy(P, PR, policy);
            
            variation = getSpan(Vnext - self.V)
            if self.verbose:
                print(("\t%s\t%s" % (self.iter, variation)))
            
            self.V = Vnext
            if variation < self.thresh:
                done = True
            else:
                is_verbose = False
                if self.verbose:
                    self.setSilent()
                    is_verbose = True
                
                self._evalPolicyIterative(self.V, self.epsilon, self.max_iter)
                
                if is_verbose:
                    self.setVerbose()
        
        self.time = time() - self.time
        
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class QLearning(MDP):
    
    """A discounted MDP solved using the Q learning algorithm.
    
    Parameters
    ----------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details. 
    n_iter : int, optional
        Number of iterations to execute. This is ignored unless it is an 
        integer greater than the default value. Default: 10,000.
    
    Data Attributes
    ---------------
    Q : array
        learned Q matrix (SxA)     
    V : tuple
        learned value function (S).    
    policy : tuple
        learned optimal policy (S).    
    mean_discrepancy : array
        Vector of mean V discrepancies, one entry per 100 iterations. Its
        length is therefore N/100, i.e. 100 for the default value of N.

    Examples
    --------
    >>> # These examples are reproducible only if random seed is set to 0 in
    >>> # both the random and numpy.random modules.
    >>> import numpy as np
    >>> import mdptoolbox, mdptoolbox.example
    >>> np.random.seed(0)
    >>> P, R = mdptoolbox.example.forest()
    >>> ql = mdptoolbox.mdp.QLearning(P, R, 0.96)
    >>> ql.run()
    >>> ql.Q
    array([[ 68.38037354,  43.24888454],
           [ 72.37777922,  42.75549145],
           [ 77.02892702,  64.68712932]])
    >>> ql.V
    (68.38037354422798, 72.37777921607258, 77.02892701616531)
    >>> ql.policy
    (0, 0, 0)
    
    >>> import mdptoolbox
    >>> import numpy as np
    >>> P = np.array([[[0.5, 0.5],[0.8, 0.2]],[[0, 1],[0.1, 0.9]]])
    >>> R = np.array([[5, 10], [-1, 2]])