# -*- coding: utf-8 -*-
"""Markov Decision Process (MDP) Toolbox: ``mdp`` module
=====================================================

The ``mdp`` module provides classes for the resolution of discrete-time Markov
Decision Processes.

Available classes
-----------------
MDP
    Base Markov decision process class
FiniteHorizon
    Backwards induction finite horizon MDP
LP
    Linear programming MDP
PolicyIteration
    Policy iteration MDP
PolicyIterationModified
    Modified policy iteration MDP
QLearning
    Q-learning MDP
RelativeValueIteration
    Relative value iteration MDP
ValueIteration
    Value iteration MDP
ValueIterationGS
    Gauss-Seidel value iteration MDP
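
As an illustrative quick start (a sketch that assumes the companion
``mdptoolbox.example`` module is available to build a sample problem):

>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.forest()
>>> vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
>>> vi.run()

After ``run()`` the optimal policy is available as the tuple ``vi.policy``.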

"""

# Copyright (c) 2011-2013 Steven A. W. Cordwell
# Copyright (c) 2009 INRA
# 
# All rights reserved.
# 
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# 
#   * Redistributions of source code must retain the above copyright notice,
#     this list of conditions and the following disclaimer.
#   * Redistributions in binary form must reproduce the above copyright notice,
#     this list of conditions and the following disclaimer in the documentation
#     and/or other materials provided with the distribution.
#   * Neither the name of the <ORGANIZATION> nor the names of its contributors
#     may be used to endorse or promote products derived from this software
#     without specific prior written permission.
# 
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

from math import ceil, log, sqrt
from time import time

from numpy import absolute, array, empty, mean, mod, multiply, ones, zeros
from numpy.random import randint, random
from scipy.sparse import csr_matrix as sparse

from .utils import check, getSpan

MSG_STOP_MAX_ITER = "Iterating stopped due to maximum number of iterations " \
    "condition."
MSG_STOP_EPSILON_OPTIMAL_POLICY = "Iterating stopped, epsilon-optimal " \
    "policy found."
MSG_STOP_EPSILON_OPTIMAL_VALUE = "Iterating stopped, epsilon-optimal value " \
    "function found."
MSG_STOP_UNCHANGING_POLICY = "Iterating stopped, unchanging policy found."

class MDP(object):
    
    """A Markov Decision Problem.
    
    Let ``S`` = the number of states, and ``A`` = the number of actions.
    
    Parameters
    ----------
    transitions : array
        Transition probability matrices. These can be defined in a variety of
        ways. The simplest is a numpy array that has the shape ``(A, S, S)``,
        though there are other possibilities. It can be a tuple or list or
        numpy object array of length ``A``, where each element contains a numpy
        array or matrix that has the shape ``(S, S)``. This "list of matrices"
        form is useful when the transition matrices are sparse as
        ``scipy.sparse.csr_matrix`` matrices can be used. In summary, each
        action's transition matrix must be indexable like ``transitions[a]``
        where ``a`` ∈ {0, 1...A-1}, and ``transitions[a]`` returns an ``S`` ×
        ``S`` array-like object.
    reward : array
        Reward matrices or vectors. Like the transition matrices, these can
        also be defined in a variety of ways. Again the simplest is a numpy
        array that has the shape ``(S, A)``, ``(S,)`` or ``(A, S, S)``. A list
        of lists can be used, where each inner list has length ``S`` and the
        outer list has length ``A``. A list of numpy arrays is possible where
        each inner array can be of the shape ``(S,)``, ``(S, 1)``, ``(1, S)``
        or ``(S, S)``. Also ``scipy.sparse.csr_matrix`` can be used instead of
        numpy arrays. In addition, the outer list can be replaced by any object
        that can be indexed like ``reward[a]`` such as a tuple or numpy object
        array of length ``A``.
    discount : float
        Discount factor. The per time-step discount factor on future rewards.
        Valid values are greater than 0 up to and including 1. If the discount
        factor is 1, then convergence cannot be assumed and a warning will
        be displayed. Subclasses of ``MDP`` may pass ``None`` in the case where
        the algorithm does not use a discount factor.
    epsilon : float
        Stopping criterion. The maximum change in the value function at each
        iteration is compared against ``epsilon``. Once the change falls below
        this value, then the value function is considered to have converged to
        the optimal value function. Subclasses of ``MDP`` may pass ``None`` in
        the case where the algorithm does not use an epsilon-optimal stopping
        criterion.
    max_iter : int
        Maximum number of iterations. The algorithm will be terminated once
        this many iterations have elapsed. This must be greater than 0 if
        specified. Subclasses of ``MDP`` may pass ``None`` in the case where
        the algorithm does not use a maximum number of iterations.
    
    Attributes
    ----------
    P : array
        Transition probability matrices.
    R : array
        Reward vectors.
    V : tuple
        The optimal value function. Each element is a float corresponding to
        the expected value of being in that state assuming the optimal policy
        is followed.
    discount : float
        The discount rate on future rewards.
    max_iter : int
        The maximum number of iterations.
    policy : tuple
        The optimal policy.
    time : float
        The time used to converge to the optimal policy.
    verbose : boolean
        Whether verbose output should be displayed or not.
    
    Methods
    -------
    run
        Implemented in child classes as the main algorithm loop. Raises an
        exception if it has not been overridden.
    setSilent
        Turn the verbosity off
    setVerbose
        Turn the verbosity on
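    
    Examples
    --------
    A minimal sketch of the "list of matrices" input form described in the
    Parameters section above (the identity transition matrices here are just
    placeholders, not a meaningful model):
    
    >>> import numpy as np
    >>> import scipy.sparse as sp
    >>> S, A = 3, 2
    >>> transitions = [sp.csr_matrix(np.eye(S)) for _ in range(A)]
    >>> reward = np.ones((S, A))
    
    Data built this way can be passed to any of the solver classes listed in
    the module documentation.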
    
    """
    
    def __init__(self, transitions, reward, discount, epsilon, max_iter):
        # Initialise an MDP based on the input parameters.
        
        # if the discount is None then the algorithm is assumed to not use it
        # in its computations
        if discount is not None:
            self.discount = float(discount)
            assert 0.0 < self.discount <= 1.0, "Discount rate must be in ]0; 1]"
            if self.discount == 1:
                print("WARNING: check conditions of convergence. With no "
                      "discount, convergence cannot be assumed.")
        # if the max_iter is None then the algorithm is assumed to not use it
        # in its computations
        if max_iter is not None:
            self.max_iter = int(max_iter)
            assert self.max_iter > 0, "The maximum number of iterations " \
                                      "must be greater than 0."
        # check that epsilon is something sane
        if epsilon is not None:
            self.epsilon = float(epsilon)
            assert self.epsilon > 0, "Epsilon must be greater than 0."
        # we run a check on P and R to make sure they are describing an MDP. If
        # an exception isn't raised then they are assumed to be correct.
        check(transitions, reward)
        # computePR will assign the variables self.S, self.A, self.P and self.R
        self._computePR(transitions, reward)
        # the verbosity is by default turned off
        self.verbose = False
        # Initially the time taken to perform the computations is set to None
        self.time = None
        # set the initial iteration count to zero
        self.iter = 0
        # V should be stored as a vector, i.e. shape of (S,) or (1, S)
        self.V = None
        # policy can also be stored as a vector
        self.policy = None
    
    def __repr__(self):
        P_repr = "P: \n"
        R_repr = "R: \n"
        for aa in range(self.A):
            P_repr += repr(self.P[aa]) + "\n"
            R_repr += repr(self.R[aa]) + "\n"
        return P_repr + "\n" + R_repr
    
    def _bellmanOperator(self, V=None):
        # Apply the Bellman operator on the value function.
        #
        # Updates the value function and the Vprev-improving policy.
        #
        # Returns: (policy, value), tuple of new policy and its value
        #
        # If V hasn't been sent into the method, then we assume to be working
        # on the object's V attribute
        if V is None:
            # this V should be a reference to the data rather than a copy
            V = self.V
        else:
            # make sure the user-supplied V is of the right shape
            try:
                assert V.shape in ((self.S,), (1, self.S)), "V is not the " \
                    "right shape (Bellman operator)."
            except AttributeError:
                raise TypeError("V must be a numpy array or matrix.")
        # Looping through each action, the Q-value matrix is calculated.
        # P and V can be any object that supports indexing, so it is important
        # that you know they define a valid MDP before calling the
        # _bellmanOperator method. Otherwise the results will be meaningless.
        Q = empty((self.A, self.S))
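        # A sketch of what the following loop fills in, for each action a:
        #     Q[a, s] = R[a][s] + discount * sum_{s'} P[a][s, s'] * V[s']
        # The greedy policy and its value are then the argmax and max of Q
        # over the action axis.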
        for aa in range(self.A):
            Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)
        # Get the policy and value, for now it is being returned but...
        # Which way is better?
        # 1. Return, (policy, value)
        return (Q.argmax(axis=0), Q.max(axis=0))
        # 2. update self.policy and self.V directly
        # self.V = Q.max(axis=0)
        # self.policy = Q.argmax(axis=0)
    
    def _computeP(self, P):
        # Set self.P as a tuple of length A, with each element storing an S×S
        # matrix.
        self.A = len(P)
        try:
            if P.ndim == 3:
                self.S = P.shape[1]
            else:
                self.S = P[0].shape[0]
        except AttributeError:
            self.S = P[0].shape[0]
        # convert P to a tuple of numpy arrays
        self.P = tuple(P[aa] for aa in range(self.A))
    
    def _computePR(self, P, R):
        # Compute the reward for the system in one state choosing an action.
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        #    P(SxSxA)  = transition matrix 
        #        P could be an array with 3 dimensions or  a cell array (1xA), 
        #        each cell containing a matrix (SxS) possibly sparse
        #    R(SxSxA) or (SxA) = reward matrix
        #        R could be an array with 3 dimensions (SxSxA) or  a cell array 
        #        (1xA), each cell containing a sparse matrix (SxS) or a 2D 
        #        array(SxA) possibly sparse  
        # Evaluation
        # ----------
        #    PR(SxA)   = reward matrix
        #
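        # As an illustration of the (A, S, S) reward case handled below, the
        # expected immediate reward that gets stored is
        #     self.R[a][s] = sum_{s'} P[a][s, s'] * R[a][s, s']
        # which is what the multiply(...).sum(1) calls compute.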
        # We assume that P and R define an MDP, i.e. the assumption is that
        # check(P, R) has already been run and doesn't fail.
        #
        # First compute and store P, S, and A
        self._computeP(P)
        # Set self.R as a tuple of length A, with each element storing a 1×S
        # vector.
        try:
            if R.ndim == 1:
                r = array(R).reshape(self.S)
                self.R = tuple(r for aa in range(self.A))
            elif R.ndim == 2:
                self.R = tuple(array(R[:, aa]).reshape(self.S)
                                for aa in range(self.A))
            else:
                self.R = tuple(multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                                for aa in range(self.A))
        except AttributeError:
            if len(R) == self.A:
                self.R = tuple(multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                                for aa in range(self.A))
            else:
                r = array(R).reshape(self.S)
                self.R = tuple(r for aa in range(self.A))
    
    def run(self):
        # Raise error because child classes should implement this function.
        raise NotImplementedError("You should create a run() method.")
    
    def setSilent(self):
        """Set the MDP algorithm to silent mode."""
        self.verbose = False
    
    def setVerbose(self):
        """Set the MDP algorithm to verbose mode."""
        self.verbose = True

class FiniteHorizon(MDP):
    
    """An MDP solved using the finite-horizon backwards induction algorithm.
    
    Parameters
    ----------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    N : int
        Number of periods. Must be greater than 0.
    h : array, optional
        Terminal reward. Default: a vector of zeros.
    
    Data Attributes
    ---------------
    V : array 
        Optimal value function. Shape = (S, N+1). ``V[:, n]`` = optimal value
        function at stage ``n`` with stage in {0, 1...N-1}. ``V[:, N]`` = value
        function for the terminal stage.
    policy : array
        Optimal policy. ``policy[:, n]`` = optimal policy at stage ``n`` with
        stage in {0, 1...N-1}.
    time : float
        used CPU time
  
    Notes
    -----
    In verbose mode, displays the current stage and policy transpose.
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
    >>> fh.run()
    >>> fh.V
    array([[ 2.6973,  0.81  ,  0.    ,  0.    ],
           [ 5.9373,  3.24  ,  1.    ,  0.    ],
           [ 9.9373,  7.24  ,  4.    ,  0.    ]])
    >>> fh.policy
    array([[0, 0, 0],
           [0, 0, 1],
           [0, 0, 0]])
    
    """

    def __init__(self, transitions, reward, discount, N, h=None):
        # Initialise a finite horizon MDP.
        self.N = int(N)
        assert self.N > 0, "N must be greater than 0."
        # Initialise the base class
        MDP.__init__(self, transitions, reward, discount, None, None)
        # remove the iteration counter, it is not meaningful for backwards
        # induction
        del self.iter
        # There are value vectors for each time step up to the horizon
        self.V = zeros((self.S, N + 1))
        # There are policy vectors for each time step before the horizon, when
        # we reach the horizon we don't need to make decisions anymore.
        self.policy = empty((self.S, N), dtype=int)
        # Set the reward for the final transition to h, if specified.
        if h is not None:
            self.V[:, N] = h
        # Call the iteration method
        #self.run()
        
    def run(self):
        # Run the finite horizon algorithm.
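        # Sketch of the backwards induction below: starting from the terminal
        # values stored in V[:, N], each pass applies the Bellman operator to
        # fill in V[:, stage] and policy[:, stage] for stage = N-1, ..., 0.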
        self.time = time()
        # loop through each time period
        for n in range(self.N):
            W, X = self._bellmanOperator(self.V[:, self.N - n])
            stage = self.N - n - 1
            self.V[:, stage] = X
            self.policy[:, stage] = W
            if self.verbose:
                print(("stage: %s, policy: %s") % (
                    stage, self.policy[:, stage].tolist()))
        # update time spent running
        self.time = time() - self.time
        # After this we could create a tuple of tuples for the values and
        # policies.
        #self.V = tuple(tuple(self.V[:, n].tolist()) for n in range(self.N))
        #self.policy = tuple(tuple(self.policy[:, n].tolist())
        #                    for n in range(self.N))

class LP(MDP):
    
    """A discounted MDP solved using linear programming.
    
    This class requires the Python ``cvxopt`` module to be installed.

    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    
    Data Attributes
    ---------------
    V : tuple
        optimal values
    policy : tuple
        optimal policy
    time : float
        used CPU time
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> lp = mdptoolbox.mdp.LP(P, R, 0.9)
    >>> lp.run()
    
    """

    def __init__(self, transitions, reward, discount):
        # Initialise a linear programming MDP.
        # import some functions from cvxopt and set them as object methods
        try:
            from cvxopt import matrix, solvers
            self._linprog = solvers.lp
            self._cvxmat = matrix
        except ImportError:
            raise ImportError("The python module cvxopt is required to use "
                              "linear programming functionality.")
        # we also need diagonal matrices, and using a sparse one may be more
        # memory efficient
        from scipy.sparse import eye as speye
        self._speye = speye
        # initialise the MDP. epsilon and max_iter are not needed
        MDP.__init__(self, transitions, reward, discount, None, None)
        # Set the cvxopt solver to be quiet by default, but ...
        # this doesn't do what I want it to do c.f. issue #3
        if not self.verbose:
            solvers.options['show_progress'] = False
        # Call the iteration method
        #self.run()
    
    def run(self):
        # Run the linear programming algorithm.
        self.time = time()
        # The objective is to solve: min V s.t. V >= PR + discount*P*V
        # The linprog function of the MATLAB Optimization Toolbox solves:
        # min f'* x / M * x <= b
        # So the objective could be expressed as :
        # min V / (discount*P-I) * V <= - PR
        # To avoid loop on states, the matrix M is structured following actions
        # M(A*S,S)
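        # (illustration: row aa*S + s of M encodes the constraint
        #  discount * P[aa][s, :].dot(V) - V[s] <= -PR[aa][s])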
        f = self._cvxmat(ones((self.S, 1)))
        # self.R is stored as a tuple of per-action reward vectors, so stack
        # it into an (S*A, 1) column with action 0's rewards first, matching
        # the row ordering of M built below
        h = self._cvxmat(array(self.R).reshape(self.S * self.A, 1), tc='d')
        M = zeros((self.A * self.S, self.S))
        for aa in range(self.A):
            pos = (aa + 1) * self.S
            M[(pos - self.S):pos, :] = (
                self.discount * self.P[aa] - self._speye(self.S, self.S))
        M = self._cvxmat(M)
        # Using the glpk option will make this behave more like Octave
        # (Octave uses glpk) and perhaps Matlab. If solver=None (ie using the 
        # default cvxopt solver) then V agrees with the Octave equivalent
        # only to 10e-8 places. This assumes glpk is installed of course.
        # the solution comes back as an (S, 1) column; flatten it so that the
        # Bellman operator's shape check accepts it
        self.V = array(self._linprog(f, M, -h, solver='glpk')['x'])
        self.V = self.V.reshape(self.S)
        # apply the Bellman operator
        self.policy, self.V = self._bellmanOperator()
        # update the time spent solving
        self.time = time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class PolicyIteration(MDP):
    
    """A discounted MDP solved using the policy iteration algorithm.
    
    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details. 
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    policy0 : array, optional
        Starting policy.
    max_iter : int, optional
        Maximum number of iterations. See the documentation for the ``MDP``
        class for details. Default is 1000.
    eval_type : int or string, optional
        Type of function used to evaluate policy. 0 or "matrix" to solve as a
        set of linear equations. 1 or "iterative" to solve iteratively.
        Default: 0.
    
    Data Attributes
    ---------------
    V : tuple
        value function 
    policy : tuple
        optimal policy
    iter : int
        number of iterations taken
    time : float
        used CPU time
    
    Notes
    -----
    In verbose mode, at each iteration, displays the number of different
    actions between policy n-1 and policy n.
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.rand(10, 3)
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()
    
    >>> P, R = mdptoolbox.example.forest()
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()
    >>> pi.V
    (26.244000000000018, 29.48400000000002, 33.484000000000016)
    >>> pi.policy
    (0, 0, 0)
    """
    
    def __init__(self, transitions, reward, discount, policy0=None,
                 max_iter=1000, eval_type=0):
        # Initialise a policy iteration MDP.
        #
        # Set up the MDP, but don't need to worry about epsilon values
        MDP.__init__(self, transitions, reward, discount, None, max_iter)
        # Check if the user has supplied an initial policy. If not make one.
        if policy0 is None:
            # Initialise the policy to the one which maximises the expected
            # immediate reward
            null = zeros(self.S)
            self.policy, null = self._bellmanOperator(null)
            del null
        else:
            # Use the policy that the user supplied
            # Make sure it is a numpy array
            policy0 = array(policy0)
            # Make sure the policy is the right size and shape
            assert policy0.shape in ((self.S, ), (self.S, 1), (1, self.S)), \
                "'policy0' must be a vector with length S."
            # reshape the policy to be a vector
            policy0 = policy0.reshape(self.S)
            # The policy can only contain integers between 0 and S-1
            msg = "'policy0' must be a vector of integers between 0 and S-1."
            assert not mod(policy0, 1).any(), msg
            assert (policy0 >= 0).all(), msg
            assert (policy0 < self.S).all(), msg
            self.policy = policy0
        # set the initial values to zero
        self.V = zeros(self.S)
        # Do some setup depending on the evaluation type
        if eval_type in (0, "matrix"):
            from numpy.linalg import solve
            from scipy.sparse import eye
            self._speye = eye
            self._lin_eq = solve
            self.eval_type = "matrix"
        elif eval_type in (1, "iterative"):
            self.eval_type = "iterative"
        else:
            raise ValueError("'eval_type' should be '0' for matrix evaluation "
                             "or '1' for iterative evaluation. The strings "
                             "'matrix' and 'iterative' can also be used.")
        # Call the iteration method
        #self.run()
    
    def _computePpolicyPRpolicy(self):
        # Compute the transition matrix and the reward matrix for a policy.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix 
        #     P could be an array with 3 dimensions or a cell array (1xA),
        #     each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #     R could be an array with 3 dimensions (SxSxA) or 
        #     a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #     a 2D array(SxA) possibly sparse  
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Ppolicy(SxS)  = transition matrix for policy
        # PRpolicy(S)   = reward matrix for policy
        #
        Ppolicy = empty((self.S, self.S))
        Rpolicy = zeros(self.S)
        for aa in range(self.A): # avoid looping over S
            # the rows that use action a.
            ind = (self.policy == aa).nonzero()[0]
            # if no rows use action a, then no need to assign this
            if ind.size > 0:
                try:
                    Ppolicy[ind, :] = self.P[aa][ind, :]
                except ValueError:
                    Ppolicy[ind, :] = self.P[aa][ind, :].todense()
                #PR = self._computePR() # an apparently unneeded line, and
                # perhaps harmful in this implementation c.f.
                # mdp_computePpolicyPRpolicy.m
                Rpolicy[ind] = self.R[aa][ind]
        # self.R cannot be sparse with the code in its current condition, but
        # it should be possible in the future. Also, if R is so big that it's
        # a good idea to use a sparse matrix for it, then converting PRpolicy
        # from a dense to sparse matrix doesn't seem very memory efficient
        if type(self.R) is sparse:
            Rpolicy = sparse(Rpolicy)
        #self.Ppolicy = Ppolicy
        #self.Rpolicy = Rpolicy
        return (Ppolicy, Rpolicy)
    
    def _evalPolicyIterative(self, V0=0, epsilon=0.0001, max_iter=10000):
        # Evaluate a policy using iteration.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix 
        #    P could be an array with 3 dimensions or 
        #    a cell array (1xA), each cell containing a matrix possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #    R could be an array with 3 dimensions (SxSxA) or 
        #    a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #    a 2D array(SxA) possibly sparse  
        # discount  = discount rate in ]0; 1[
        # policy(S) = a policy
        # V0(S)     = starting value function, optional (default : zeros(S,1))
        # epsilon   = epsilon-optimal policy search, greater than 0,
        #    optional (default : 0.0001)
        # max_iter  = maximum number of iterations to be done, greater than 0,
        #    optional (default : 10000)
        #    
        # Evaluation
        # ----------
        # Vpolicy(S) = value function associated with a specific policy
        #
        # Notes
        # -----
        # In verbose mode, at each iteration, displays the condition which
        # stopped iterations: epsilon-optimum value function found or maximum
        # number of iterations reached.
        #
        try:
            assert V0.shape in ((self.S, ), (self.S, 1), (1, self.S)), \
                "'V0' must be a vector of length S."
            policy_V = array(V0).reshape(self.S)
        except AttributeError:
            if V0 == 0:
                policy_V = zeros(self.S)
            else:
                policy_V = array(V0).reshape(self.S)
        
        policy_P, policy_R = self._computePpolicyPRpolicy()
        
        if self.verbose:
            print('    Iteration\t\t    V variation')
        
        itr = 0
        done = False
        while not done:
            itr += 1
            
            Vprev = policy_V
            policy_V = policy_R + self.discount * policy_P.dot(Vprev)
            
            variation = absolute(policy_V - Vprev).max()
            if self.verbose:
                print(('      %s\t\t      %s') % (itr, variation))
            
            # ensure |Vn - Vpolicy| < epsilon
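            # (sketch of why the bound below works: for a discount below 1 the
            # policy evaluation operator is a contraction, so a one-step change
            # smaller than (1 - discount) / discount * epsilon implies the
            # current iterate is within epsilon of the fixed point V_policy)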
            if variation < ((1 - self.discount) / self.discount) * epsilon:
                done = True
                if self.verbose:
                    print(MSG_STOP_EPSILON_OPTIMAL_VALUE)
            elif itr == max_iter:
                done = True
                if self.verbose:
                    print(MSG_STOP_MAX_ITER)
        
        self.V = policy_V
    
    def _evalPolicyMatrix(self):
        # Evaluate the value function of the policy using linear equations.
        #
        # Arguments 
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA) = transition matrix 
        #      P could be an array with 3 dimensions or a cell array (1xA),
        #      each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #      R could be an array with 3 dimensions (SxSxA) or 
        #      a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #      a 2D array(SxA) possibly sparse  
        # discount = discount rate in ]0; 1[
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Vpolicy(S) = value function of the policy
        #
        Ppolicy, Rpolicy = self._computePpolicyPRpolicy()
        # V = PR + gPV  => (I-gP)V = PR  => V = inv(I-gP)* PR
        self.V = self._lin_eq(
            (self._speye(self.S, self.S) - self.discount * Ppolicy), Rpolicy)
    
    def run(self):
        # Run the policy iteration algorithm.
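        # Sketch of the loop below: repeatedly (1) evaluate the current policy
        # to obtain V, using either the matrix (linear solve) or iterative
        # evaluation, then (2) improve the policy greedily via the Bellman
        # operator, stopping when the policy stops changing or max_iter is hit.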
        # If verbose then print a header
        if self.verbose:
            print('  Iteration\t\tNumber of different actions')
        # Set up the while stopping condition and the current time
        done = False
        self.time = time()
        # loop until a stopping condition is reached
        while not done:
            self.iter += 1
            # these _evalPolicy* functions will update the class's value
            # attribute
            if self.eval_type == "matrix":
                self._evalPolicyMatrix()
            elif self.eval_type == "iterative":
                self._evalPolicyIterative()
            # This should update the class's policy attribute but leave the
            # value alone
            policy_next, null = self._bellmanOperator()
            del null
            # calculate in how many places the old policy disagrees with
            # the new policy
            n_different = (policy_next != self.policy).sum()
            # if verbose then continue printing a table
            if self.verbose:
                print(('    %s\t\t  %s') % (self.iter, n_different))
            # Once the policy is unchanging or the maximum number of
            # iterations has been reached then stop
            if n_different == 0:
                done = True
                if self.verbose:
                    print(MSG_STOP_UNCHANGING_POLICY)
            elif (self.iter == self.max_iter):
                done = True
                if self.verbose:
                    print(MSG_STOP_MAX_ITER)
            else:
                self.policy = policy_next
        # update the time to return the computation time
        self.time = time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class PolicyIterationModified(PolicyIteration):
    
    """A discounted MDP solved using a modified policy iteration algorithm.
    
    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    epsilon : float, optional
        Stopping criterion. See the documentation for the ``MDP`` class for
        details. Default: 0.01.
    max_iter : int, optional
        Maximum number of iterations. See the documentation for the ``MDP``
        class for details. Default is 10.
    
    Data Attributes
    ---------------
    V : tuple
        value function 
    policy : tuple
        optimal policy
    iter : int
        number of iterations taken
    time : float
        used CPU time
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> pim = mdptoolbox.mdp.PolicyIterationModified(P, R, 0.9)
    >>> pim.run()
    >>> pim.policy
    (0, 0, 0)
    >>> pim.V
    (21.81408652334702, 25.054086523347017, 29.054086523347017)
    
    
    """
    
    def __init__(self, transitions, reward, discount, epsilon=0.01,
                 max_iter=10):
        # Initialise a (modified) policy iteration MDP.
        
        # Maybe it's better not to subclass from PolicyIteration, because the
        # initialisations of the two are quite different. E.g. there is policy0
        # being calculated here which doesn't need to be. The only thing that
        # is needed from the PolicyIteration class is the _evalPolicyIterative
        # function. Perhaps there is a better way to do it?
        PolicyIteration.__init__(self, transitions, reward, discount, None,
                                 max_iter, 1)
        