# -*- coding: utf-8 -*-
"""Markov Decision Process (MDP) Toolbox
=====================================

The MDP toolbox provides classes and functions for the resolution of
discrete-time Markov Decision Processes.

Available classes
-----------------
MDP
    Base Markov decision process class
FiniteHorizon
    Backwards induction finite horizon MDP
LP
    Linear programming MDP
PolicyIteration
    Policy iteration MDP
PolicyIterationModified
    Modified policy iteration MDP
QLearning
    Q-learning MDP
RelativeValueIteration
    Relative value iteration MDP
ValueIteration
    Value iteration MDP
ValueIterationGS
    Gauss-Seidel value iteration MDP

Available functions
-------------------
check
    Check that an MDP is properly defined
checkSquareStochastic
    Check that a matrix is square and stochastic
exampleForest
    A simple forest management example
exampleRand
    A random example

How to use the documentation
----------------------------
Documentation is available both as docstrings provided with the code and
in HTML or PDF format from
`The MDP toolbox homepage <http://www.somewhere.com>`_. The docstring
examples assume that the `mdp` module has been imported like so::

  >>> import mdptoolbox.mdp as mdp

Code snippets are indicated by three greater-than signs::

  >>> x = 17
  >>> x = x + 1
  >>> x
  18
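
As a sketch of a typical workflow (this assumes the companion
``mdptoolbox.example`` module that is used throughout the class docstrings
below), the transition and reward arrays are built first, then a solver is
constructed and its ``run()`` method is called::

  >>> import mdptoolbox.example
  >>> P, R = mdptoolbox.example.forest()
  >>> vi = mdp.ValueIteration(P, R, 0.96)
  >>> vi.run()

The optimal policy is then available as ``vi.policy`` and the value function
as ``vi.V``.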

The documentation can be displayed with
`IPython <http://ipython.scipy.org>`_. For example, to view the docstring of
the ValueIteration class use ``mdp.ValueIteration?<ENTER>``, and to view its
source code use ``mdp.ValueIteration??<ENTER>``.

Acknowledgments
---------------
This module is modified from the MDPtoolbox (c) 2009 INRA available at 
http://www.inra.fr/mia/T/MDPtoolbox/.

"""

# Copyright (c) 2011-2013 Steven A. W. Cordwell
# Copyright (c) 2009 INRA
# 
# All rights reserved.
# 
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# 
#   * Redistributions of source code must retain the above copyright notice,
#     this list of conditions and the following disclaimer.
#   * Redistributions in binary form must reproduce the above copyright notice,
#     this list of conditions and the following disclaimer in the documentation
#     and/or other materials provided with the distribution.
#   * Neither the name of the <ORGANIZATION> nor the names of its contributors
#     may be used to endorse or promote products derived from this software
#     without specific prior written permission.
# 
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

from math import ceil, log, sqrt
from time import time

from numpy import absolute, array, empty, mean, mod, multiply
from numpy import ndarray, ones, zeros
from numpy.random import randint, random
from scipy.sparse import csr_matrix as sparse

from utils import check, getSpan

class MDP(object):
    
    """A Markov Decision Problem.
    
    Parameters
    ----------
    transitions : array
        Transition probability matrices
    reward : array
        Reward matrices
    discount : float or None
        Discount factor
    epsilon : float or None
        Stopping criterion
    max_iter : int or None
        Maximum number of iterations
    
    Attributes
    ----------
    P : array
        Transition probability matrices
    R : array
        Reward matrices
    V : list
        Value function
    discount : float
        Discount rate
    max_iter : int
        Maximum number of iterations
    policy : list
        Optimal policy
    time : float
        CPU time used to run the algorithm
    verbose : logical
        Whether verbose output is displayed
    
    Methods
    -------
    run
        To be implemented in child classes, raises exception
    setSilent
        Turn the verbosity off
    setVerbose
        Turn the verbosity on
    
    """
    
    def __init__(self, transitions, reward, discount, epsilon, max_iter):
        # Initialise an MDP based on the input parameters.
        
        # if the discount is None then the algorithm is assumed to not use it
        # in its computations
        if discount is not None:
            self.discount = float(discount)
            assert 0.0 < self.discount <= 1.0, "Discount rate must be in ]0; 1]"
            if self.discount == 1:
                print("PyMDPtoolbox WARNING: check conditions of convergence. "
                      "With a discount rate of 1, convergence is not "
                      "guaranteed.")
        # if the max_iter is None then the algorithm is assumed to not use it
        # in its computations
        if max_iter is not None:
            self.max_iter = int(max_iter)
            assert self.max_iter > 0, "The maximum number of iterations " \
                                      "must be greater than 0."
        # check that epsilon is something sane
        if epsilon is not None:
            self.epsilon = float(epsilon)
            assert self.epsilon > 0, "Epsilon must be greater than 0."
        # we run a check on P and R to make sure they are describing an MDP. If
        # an exception isn't raised then they are assumed to be correct.
        check(transitions, reward)
        # computePR will assign the variables self.S, self.A, self.P and self.R
        self._computePR(transitions, reward)
        # the verbosity is by default turned off
        self.verbose = False
        # Initially the time taken to perform the computations is set to None
        self.time = None
        # set the initial iteration count to zero
        self.iter = 0
        # V should be stored as a vector ie shape of (S,) or (1, S)
        self.V = None
        # policy can also be stored as a vector
        self.policy = None
    
    def __repr__(self):
        P_repr = "P: \n"
        R_repr = "R: \n"
        for aa in range(self.A):
            P_repr += repr(self.P[aa]) + "\n"
            R_repr += repr(self.R[aa]) + "\n"
        return(P_repr + "\n" + R_repr)
    
    def _bellmanOperator(self, V=None):
        # Apply the Bellman operator on the value function.
        # 
        # Updates the value function and the Vprev-improving policy.
        # 
        # Returns: (policy, value), tuple of new policy and its value
        #
        # If V hasn't been sent into the method, then we assume to be working
        # on the object's V attribute
        if V is None:
            # this V should be a reference to the data rather than a copy
            V = self.V
        else:
            # make sure the user supplied V is of the right shape
            try:
                assert V.shape in ((self.S,), (1, self.S)), "V is not the " \
                    "right shape (Bellman operator)."
            except AttributeError:
                raise TypeError("V must be a numpy array or matrix.")
        # Looping through each action, the Q-value matrix is calculated.
        # P and V can be any object that supports indexing, so it is important
        # that you know they define a valid MDP before calling the
        # _bellmanOperator method. Otherwise the results will be meaningless.
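        # For reference, the update computed below is, in matrix form,
        #     Q[a] = R[a] + discount * P[a] . V
        # where R[a] is the 1xS vector of expected immediate rewards for
        # action a and P[a] is the SxS transition probability matrix.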
        Q = empty((self.A, self.S))
        for aa in range(self.A):
            Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)
        # Get the policy and value, for now it is being returned but...
        # Which way is better?
        # 1. Return, (policy, value)
        return (Q.argmax(axis=0), Q.max(axis=0))
        # 2. update self.policy and self.V directly
        # self.V = Q.max(axis=0)
        # self.policy = Q.argmax(axis=0)
    
    def _computeP(self, P):
        # Set self.P as a tuple of length A, with each element storing an S×S
        # matrix.
        self.A = len(P)
        try:
            if P.ndim == 3:
                self.S = P.shape[1]
            else:
                self.S = P[0].shape[0]
        except AttributeError:
            self.S = P[0].shape[0]
        # convert P to a tuple of numpy arrays
        self.P = tuple([P[aa] for aa in range(self.A)])
    
    def _computePR(self, P, R):
        # Compute the reward for the system in one state choosing an action.
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        #    P(SxSxA)  = transition matrix 
        #        P could be an array with 3 dimensions or  a cell array (1xA), 
        #        each cell containing a matrix (SxS) possibly sparse
        #    R(SxSxA) or (SxA) = reward matrix
        #        R could be an array with 3 dimensions (SxSxA) or  a cell array 
        #        (1xA), each cell containing a sparse matrix (SxS) or a 2D 
        #        array(SxA) possibly sparse  
        # Evaluation
        # ----------
        #    PR(SxA)   = reward matrix
        #
        # We assume that P and R define an MDP, i.e. the assumption is that
        # check(P, R) has already been run and doesn't fail.
        #
        # First compute and store P, S, and A
        self._computeP(P)
        # Set self.R as a tuple of length A, with each element storing a 1×S
        # vector.
        try:
            if R.ndim == 2:
                self.R = tuple([array(R[:, aa]).reshape(self.S)
                                for aa in range(self.A)])
            else:
                self.R = tuple([multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                                for aa in range(self.A)])
        except AttributeError:
            self.R = tuple([multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                            for aa in range(self.A)])
    
    def run(self):
        # Raise error because child classes should implement this function.
        raise NotImplementedError("You should create a run() method.")
    
    def setSilent(self):
        """Set the MDP algorithm to silent mode."""
        self.verbose = False
    
    def setVerbose(self):
        """Set the MDP algorithm to verbose mode."""
        self.verbose = True

class FiniteHorizon(MDP):
    
    """An MDP solved using the finite-horizon backwards induction algorithm.
    
    Let S = number of states, A = number of actions
    
    Parameters
    ----------
    P(SxSxA) = transition matrix 
             P could be an array with 3 dimensions or a cell array (1xA),
             each cell containing a matrix (SxS) possibly sparse
    R(SxSxA) or (SxA) = reward matrix
             R could be an array with 3 dimensions (SxSxA) or 
             a cell array (1xA), each cell containing a sparse matrix (SxS) or
             a 2D array(SxA) possibly sparse  
    discount = discount factor, in ]0, 1]
    N        = number of periods, greater than 0
    h(S)     = terminal reward, optional (default [0; 0; ... 0] )
    
    Attributes
    ----------
    V(S,N+1)     = optimal value function
                 V(:,n) = optimal value function at stage n
                        with stage in 1, ..., N
                        V(:,N+1) = value function for terminal stage 
    policy(S,N)  = optimal policy
                 policy(:,n) = optimal policy at stage n
                        with stage in 1, ...,N
                        policy(:,N) = policy for stage N
    cpu_time = used CPU time
  
    Notes
    -----
    In verbose mode, displays the current stage and policy transpose.
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
    >>> fh.run()
    >>> fh.V
    array([[ 2.6973,  0.81  ,  0.    ,  0.    ],
           [ 5.9373,  3.24  ,  1.    ,  0.    ],
           [ 9.9373,  7.24  ,  4.    ,  0.    ]])
    >>> fh.policy
    array([[0, 0, 0],
           [0, 0, 1],
           [0, 0, 0]])
    
    """

    def __init__(self, transitions, reward, discount, N, h=None):
        # Initialise a finite horizon MDP.
        self.N = int(N)
        assert self.N > 0, 'PyMDPtoolbox: N must be greater than 0.'
        # Initialise the base class
        MDP.__init__(self, transitions, reward, discount, None, None)
        # remove the iteration counter, it is not meaningful for backwards
        # induction
        del self.iter
        # There are value vectors for each time step up to the horizon
        self.V = zeros((self.S, N + 1))
        # There are policy vectors for each time step before the horizon, when
        # we reach the horizon we don't need to make decisions anymore.
        self.policy = empty((self.S, N), dtype=int)
        # Set the reward for the final transition to h, if specified.
        if h is not None:
            self.V[:, N] = h
        # Call the iteration method
        #self.run()
        
    def run(self):
        # Run the finite horizon algorithm.
        self.time = time()
        # loop through each time period
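        # Backward induction: starting from the terminal values stored in
        # column N of V, each pass applies the Bellman operator to fill in
        # the value function and policy of the preceding stage.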
        for n in range(self.N):
            W, X = self._bellmanOperator(self.V[:, self.N - n])
            self.V[:, self.N - n - 1] = X
            self.policy[:, self.N - n - 1] = W
            if self.verbose:
                print("stage: %s ... policy transpose : %s" % (
                    self.N - n, self.policy[:, self.N - n - 1].tolist()))
        # update time spent running
        self.time = time() - self.time
        # After this we could create a tuple of tuples for the values and 
        # policies.
        #V = []
        #p = []
        #for n in xrange(self.N):
        #    V.append()
        #    p.append()
        #V.append()
        #self.V = tuple(V)
        #self.policy = tuple(p)

class LP(MDP):
    
    """A discounted MDP solved using linear programming.

    Arguments
    ---------
    Let S = number of states, A = number of actions
    P(SxSxA) = transition matrix 
             P could be an array with 3 dimensions or a cell array (1xA),
             each cell containing a matrix (SxS) possibly sparse
    R(SxSxA) or (SxA) = reward matrix
             R could be an array with 3 dimensions (SxSxA) or 
             a cell array (1xA), each cell containing a sparse matrix (SxS) or
             a 2D array(SxA) possibly sparse  
    discount = discount rate, in ]0; 1[
    
    Evaluation
    ----------
    V(S)   = optimal values
    policy(S) = optimal policy
    cpu_time = used CPU time
    
    Notes    
    -----
    In verbose mode, displays the current stage and policy transpose.
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> lp = mdptoolbox.mdp.LP(P, R, 0.9)
    >>> lp.run()
    
    """

    def __init__(self, transitions, reward, discount):
        # Initialise a linear programming MDP.
        # import some functions from cvxopt and set them as object methods
        try:
            from cvxopt import matrix, solvers
            self._linprog = solvers.lp
            self._cvxmat = matrix
        except ImportError:
            raise ImportError("The python module cvxopt is required to use "
                              "linear programming functionality.")
        # we also need diagonal matrices, and using a sparse one may be more
        # memory efficient
        from scipy.sparse import eye as speye
        self._speye = speye
        # initialise the MDP. epsilon and max_iter are not needed
        MDP.__init__(self, transitions, reward, discount, None, None)
        # Set the cvxopt solver to be quiet by default, but ...
        # this doesn't do what I want it to do c.f. issue #3
        if not self.verbose:
            solvers.options['show_progress'] = False
        # Call the iteration method
        #self.run()
    
    def run(self):
        # Run the linear programming algorithm.
        self.time = time()
        # The objective is to resolve : min V / V >= PR + discount*P*V
        # The function linprog of the optimisation Toolbox of Mathworks
        # resolves :
        # min f'* x / M * x <= b
        # So the objective could be expressed as :
        # min V / (discount*P-I) * V <= - PR
        # To avoid loop on states, the matrix M is structured following actions
        # M(A*S,S)
        f = self._cvxmat(ones((self.S, 1)))
        # self.R is stored as a tuple of reward vectors (one per action), so
        # stack them into a single column vector whose action blocks match
        # the rows of M built below
        h = self._cvxmat(array(self.R).reshape(self.S * self.A, 1), tc='d')
        M = zeros((self.A * self.S, self.S))
        for aa in range(self.A):
            pos = (aa + 1) * self.S
            M[(pos - self.S):pos, :] = (
                self.discount * self.P[aa] - self._speye(self.S, self.S))
        M = self._cvxmat(M)
        # Using the glpk option will make this behave more like Octave
        # (Octave uses glpk) and perhaps Matlab. If solver=None (ie using the 
        # default cvxopt solver) then V agrees with the Octave equivalent
        # only to 10e-8 places. This assumes glpk is installed of course.
        # cvxopt returns a column vector, so flatten it to the (S,) shape
        # that _bellmanOperator expects
        self.V = array(self._linprog(f, M, -h,
                                     solver='glpk')['x']).reshape(self.S)
        # apply the Bellman operator
        self.policy, self.V = self._bellmanOperator()
        # update the time spent solving
        self.time = time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class PolicyIteration(MDP):
    
    """A discounted MDP solved using the policy iteration algorithm.
    
    Arguments
    ---------
    Let S = number of states, A = number of actions
    P(SxSxA) = transition matrix 
             P could be an array with 3 dimensions or a cell array (1xA),
             each cell containing a matrix (SxS) possibly sparse
    R(SxSxA) or (SxA) = reward matrix
             R could be an array with 3 dimensions (SxSxA) or 
             a cell array (1xA), each cell containing a sparse matrix (SxS) or
             a 2D array(SxA) possibly sparse  
    discount = discount rate, in ]0, 1[
    policy0(S) = starting policy, optional 
    max_iter = maximum number of iterations to be done, greater than 0,
             optional (default 1000)
    eval_type = type of function used to evaluate policy: 
             0 for mdp_eval_policy_matrix, else mdp_eval_policy_iterative
             optional (default 0)
             
    Evaluation
    ----------
    V(S)   = value function 
    policy(S) = optimal policy
    iter     = number of iterations taken
    cpu_time = used CPU time
    
    Notes
    -----
    In verbose mode, at each iteration, displays the number
    of different actions between policy n-1 and n
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.rand()
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()
    
    >>> P, R = mdptoolbox.example.forest()
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()
    >>> pi.V
    (26.244000000000018, 29.48400000000002, 33.484000000000016)
    >>> pi.policy
    (0, 0, 0)
    """
    
    def __init__(self, transitions, reward, discount, policy0=None,
                 max_iter=1000, eval_type=0):
        # Initialise a policy iteration MDP.
        #
        # Set up the MDP, but don't need to worry about epsilon values
        MDP.__init__(self, transitions, reward, discount, None, max_iter)
        # Check if the user has supplied an initial policy. If not make one.
        if policy0 is None:
            # Initialise the policy to the one which maximises the expected
            # immediate reward
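            # (with a zero value function the Bellman operator reduces to an
            # argmax over the immediate rewards R[a])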
            null = zeros(self.S)
            self.policy, null = self._bellmanOperator(null)
            del null
        else:
            # Use the policy that the user supplied
            # Make sure it is a numpy array
            policy0 = array(policy0)
            # Make sure the policy is the right size and shape
            if policy0.shape not in ((self.S, ), (self.S, 1), (1, self.S)):
                raise ValueError("PyMDPtoolbox: policy0 must be a vector with "
                                 "length S.")
            # reshape the policy to be a vector
            policy0 = policy0.reshape(self.S)
            # The policy can only contain integers between 0 and S-1
            if (mod(policy0, 1).any() or (policy0 < 0).any() or
                    (policy0 >= self.S).any()):
                raise ValueError("PyMDPtoolbox: policy0 must be a vector of "
                                 "integers between 0 and S-1.")
            else:
                self.policy = policy0
        # set the initial values to zero
        self.V = zeros(self.S)
        # Do some setup depending on the evaluation type
        if eval_type in (0, "matrix"):
            from numpy.linalg import solve
            from scipy.sparse import eye
            self._speye = eye
            self._lin_eq = solve
            self.eval_type = "matrix"
        elif eval_type in (1, "iterative"):
            self.eval_type = "iterative"
        else:
            raise ValueError("PyMDPtoolbox: eval_type should be 0 for matrix "
                             "evaluation or 1 for iterative evaluation. "
                             "The strings 'matrix' and 'iterative' can also "
                             "be used.")
        # Call the iteration method
        #self.run()
    
    def _computePpolicyPRpolicy(self):
        # Compute the transition matrix and the reward matrix for a policy.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix 
        #     P could be an array with 3 dimensions or a cell array (1xA),
        #     each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #     R could be an array with 3 dimensions (SxSxA) or 
        #     a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #     a 2D array(SxA) possibly sparse  
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Ppolicy(SxS)  = transition matrix for policy
        # PRpolicy(S)   = reward matrix for policy
        #
        Ppolicy = empty((self.S, self.S))
        Rpolicy = zeros(self.S)
        for aa in range(self.A): # avoid looping over S
            # the rows that use action a.
            ind = (self.policy == aa).nonzero()[0]
            # if no rows use action a, then no need to assign this
            if ind.size > 0:
                try:
                    Ppolicy[ind, :] = self.P[aa][ind, :]
                except ValueError:
                    Ppolicy[ind, :] = self.P[aa][ind, :].todense()
                #PR = self._computePR() # an apparently unneeded line, and
                # perhaps harmful in this implementation c.f.
                # mdp_computePpolicyPRpolicy.m
                Rpolicy[ind] = self.R[aa][ind]
        # self.R cannot be sparse with the code in its current condition, but
        # it should be possible in the future. Also, if R is so big that it's
        # a good idea to use a sparse matrix for it, then converting PRpolicy
        # from a dense to sparse matrix doesn't seem very memory efficient
        if type(self.R) is sparse:
            Rpolicy = sparse(Rpolicy)
        #self.Ppolicy = Ppolicy
        #self.Rpolicy = Rpolicy
        return (Ppolicy, Rpolicy)
    
    def _evalPolicyIterative(self, V0=0, epsilon=0.0001, max_iter=10000):
        # Evaluate a policy using iteration.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix 
        #    P could be an array with 3 dimensions or 
        #    a cell array (1xS), each cell containing a matrix possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #    R could be an array with 3 dimensions (SxSxA) or 
        #    a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #    a 2D array(SxA) possibly sparse  
        # discount  = discount rate in ]0; 1[
        # policy(S) = a policy
        # V0(S)     = starting value function, optional (default : zeros(S,1))
        # epsilon   = epsilon-optimal policy search, greater than 0,
        #    optional (default : 0.0001)
        # max_iter  = maximum number of iterations to be done, greater than 0,
        #    optional (default : 10000)
        #    
        # Evaluation
        # ----------
        # Vpolicy(S) = value function, associated to a specific policy
        #
        # Notes
        # -----
        # In verbose mode, at each iteration, displays the condition which
        # stopped iterations: epsilon-optimum value function found or maximum
        # number of iterations reached.
        #
        if (type(V0) in (int, float)) and (V0 == 0):
            policy_V = zeros(self.S)
        else:
            if isinstance(V0, ndarray) and V0.shape == (self.S, 1):
                policy_V = V0
            else:
                raise ValueError("PyMDPtoolbox: V0 vector/array type not "
                                 "supported. Use an ndarray or matrix column "
                                 "vector of length S.")
        
        policy_P, policy_R = self._computePpolicyPRpolicy()
        
        if self.verbose:
            print('  Iteration    V_variation')
        
        itr = 0
        done = False
        while not done:
            itr += 1
            
            Vprev = policy_V
            policy_V = policy_R + self.discount * policy_P.dot(Vprev)
            
            variation = absolute(policy_V - Vprev).max()
            if self.verbose:
                print('      %s         %s' % (itr, variation))
            
            # ensure |Vn - Vpolicy| < epsilon
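            # (iterative evaluation is a contraction with modulus `discount`,
            # so ||V_n - V_policy|| <= discount/(1 - discount) * variation,
            # which makes this test sufficient for the epsilon bound above)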
            if variation < ((1 - self.discount) / self.discount) * epsilon:
                done = True
                if self.verbose:
689 690
                    print("PyMDPtoolbox: iterations stopped, epsilon-optimal "
                          "value function.")
Steven Cordwell's avatar
Steven Cordwell committed
691 692 693
            elif itr == max_iter:
                done = True
                if self.verbose:
694 695
                    print("PyMDPtoolbox: iterations stopped by maximum number "
                          "of iteration condition.")
Steven Cordwell's avatar
Steven Cordwell committed
696
        
Steven Cordwell's avatar
Steven Cordwell committed
697
        self.V = policy_V
698
    
699
    def _evalPolicyMatrix(self):
Steven Cordwell's avatar
Steven Cordwell committed
700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718
        # Evaluate the value function of the policy using linear equations.
        #
        # Arguments 
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA) = transition matrix 
        #      P could be an array with 3 dimensions or a cell array (1xA),
        #      each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #      R could be an array with 3 dimensions (SxSxA) or 
        #      a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #      a 2D array(SxA) possibly sparse  
        # discount = discount rate in ]0; 1[
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Vpolicy(S) = value function of the policy
        #
719
        Ppolicy, Rpolicy = self._computePpolicyPRpolicy()
Steven Cordwell's avatar
Steven Cordwell committed
720
        # V = PR + gPV  => (I-gP)V = PR  => V = inv(I-gP)* PR
721 722
        self.V = self._lin_eq(
            (self._speye(self.S, self.S) - self.discount * Ppolicy), Rpolicy)
Steven Cordwell's avatar
Steven Cordwell committed
723
    
724
    def run(self):
Steven Cordwell's avatar
Steven Cordwell committed
725 726
        # Run the policy iteration algorithm.
        # If verbose the print a header
727 728
        if self.verbose:
            print('  Iteration  Number_of_different_actions')
Steven Cordwell's avatar
Steven Cordwell committed
729
        # Set up the while stopping condition and the current time
Steven Cordwell's avatar
Steven Cordwell committed
730
        done = False
731
        self.time = time()
Steven Cordwell's avatar
Steven Cordwell committed
732
        # loop until a stopping condition is reached
Steven Cordwell's avatar
Steven Cordwell committed
733
        while not done:
734
            self.iter += 1
735
            # these _evalPolicy* functions will update the classes value
Steven Cordwell's avatar
Steven Cordwell committed
736
            # attribute
Steven Cordwell's avatar
Steven Cordwell committed
737
            if self.eval_type == "matrix":
738
                self._evalPolicyMatrix()
Steven Cordwell's avatar
Steven Cordwell committed
739
            elif self.eval_type == "iterative":
740
                self._evalPolicyIterative()
Steven Cordwell's avatar
Steven Cordwell committed
741 742
            # This should update the classes policy attribute but leave the
            # value alone
743
            policy_next, null = self._bellmanOperator()
744
            del null
Steven Cordwell's avatar
Steven Cordwell committed
745 746
            # calculate in how many places does the old policy disagree with
            # the new policy
747
            n_different = (policy_next != self.policy).sum()
Steven Cordwell's avatar
Steven Cordwell committed
748
            # if verbose then continue printing a table
749
            if self.verbose:
750 751
                print('       %s                 %s') % (self.iter,
                                                         n_different)
Steven Cordwell's avatar
Steven Cordwell committed
752 753
            # Once the policy is unchanging of the maximum number of 
            # of iterations has been reached then stop
754
            if n_different == 0:
Steven Cordwell's avatar
Steven Cordwell committed
755
                done = True
756
                if self.verbose:
757 758
                    print("PyMDPtoolbox: iterations stopped, unchanging "
                          "policy found.")
759 760 761
            elif (self.iter == self.max_iter):
                done = True 
                if self.verbose:
762 763
                    print("PyMDPtoolbox: iterations stopped by maximum number "
                          "of iteration condition.")
764 765
            else:
                self.policy = policy_next
Steven Cordwell's avatar
Steven Cordwell committed
766
        # update the time to return th computation time
767
        self.time = time() - self.time
Steven Cordwell's avatar
Steven Cordwell committed
768
        # store value and policy as tuples
769 770
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())
Steven Cordwell's avatar
Steven Cordwell committed
771

772
class PolicyIterationModified(PolicyIteration):
773
    
774
    """A discounted MDP  solved using a modifified policy iteration algorithm.
Steven Cordwell's avatar
Steven Cordwell committed
775 776 777 778 779
    
    Arguments
    ---------
    Let S = number of states, A = number of actions
    P(SxSxA) = transition matrix 
             P could be an array with 3 dimensions or a cell array (1xA),
             each cell containing a matrix (SxS) possibly sparse
    R(SxSxA) or (SxA) = reward matrix
             R could be an array with 3 dimensions (SxSxA) or 
             a cell array (1xA), each cell containing a sparse matrix (SxS) or
             a 2D array(SxA) possibly sparse  
    discount = discount rate, in ]0, 1[
    epsilon  = epsilon-optimal policy search, greater than 0,
             optional (default 0.01)
    max_iter = maximum number of iterations to be done, greater than 0,
             optional (default 10)
    
    Data Attributes
    ---------------
    V(S)   = value function 
    policy(S) = optimal policy
    iter     = number of iterations taken
    cpu_time = used CPU time
    
    Notes
    -----
    In verbose mode, at each iteration, displays the number
    of different actions between policy n-1 and n
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> pim = mdptoolbox.mdp.PolicyIterationModified(P, R, 0.9)
    >>> pim.run()
    >>> pim.policy
    FIXME
    >>> pim.V
    FIXME
    
    """
    
    def __init__(self, transitions, reward, discount, epsilon=0.01,
                 max_iter=10):
        # Initialise a (modified) policy iteration MDP.
        
        # Maybe it's better not to subclass from PolicyIteration, because the
        # initialisation of the two are quite different. eg there is policy0
        # being calculated here which doesn't need to be. The only thing that
        # is needed from the PolicyIteration class is the _evalPolicyIterative
        # function. Perhaps there is a better way to do it?
        PolicyIteration.__init__(self, transitions, reward, discount, None,
                                 max_iter, 1)
        
        # PolicyIteration doesn't pass epsilon to MDP.__init__() so we will
        # check it here
        if type(epsilon) in (int, float):
            if epsilon <= 0:
                raise ValueError("PyMDPtoolbox: epsilon must be greater than "
                                 "0.")
        else:
            raise ValueError("PyMDPtoolbox: epsilon must be a positive real "
                             "number greater than zero.")
        
        # computation of threshold of variation for V for an epsilon-optimal
        # policy
        if self.discount != 1:
            self.thresh = epsilon * (1 - self.discount) / self.discount
        else:
            self.thresh = epsilon
        
        self.epsilon = epsilon
        
        if discount == 1:
            self.V = zeros((self.S, 1))
        else:
            # self.R is a tuple of per-action reward vectors, so take the
            # minimum reward over all states and actions
            self.V = (1 / (1 - discount) * min(rr.min() for rr in self.R) *
                      ones((self.S, 1)))
        
        # Call the iteration method
        #self.run()
    
    def run(self):
        # Run the modified policy iteration algorithm.
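        # Modified policy iteration alternates a policy improvement step (the
        # Bellman operator below) with a truncated iterative policy
        # evaluation, instead of evaluating each policy exactly.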
        
        if self.verbose:
            print('\tIteration\tV-variation')
        
        self.time = time()
        
        done = False
        while not done:
            self.iter += 1
            
            self.policy, Vnext = self._bellmanOperator()
            #[Ppolicy, PRpolicy] = mdp_computePpolicyPRpolicy(P, PR, policy);
            
            variation = getSpan(Vnext - self.V)
            if self.verbose:
                print("\t%s\t%s" % (self.iter, variation))
            
            self.V = Vnext
            if variation < self.thresh:
                done = True
            else:
                is_verbose = False
                if self.verbose:
                    self.setSilent()
                    is_verbose = True
                
                self._evalPolicyIterative(self.V, self.epsilon, self.max_iter)
                
                if is_verbose:
                    self.setVerbose()
        
        self.time = time() - self.time
        
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class QLearning(MDP):
    
    """A discounted MDP solved using the Q learning algorithm.
    
    Let S = number of states, A = number of actions
    
    Parameters
    ----------
    P : transition matrix (SxSxA)
        P could be an array with 3 dimensions or a cell array (1xA), each
        cell containing a sparse matrix (SxS)
    R : reward matrix(SxSxA) or (SxA)
        R could be an array with 3 dimensions (SxSxA) or a cell array
        (1xA), each cell containing a sparse matrix (SxS) or a 2D
        array(SxA) possibly sparse
    discount : discount rate
        in ]0; 1[    
    n_iter : number of iterations to execute (optional).
        Default value = 10000; it must be an integer greater than the default
        value.
    
    Results
    -------
    Q : learned Q matrix (SxA) 
    
    V : learned value function (S).
    
    policy : learned optimal policy (S).
    
    mean_discrepancy : vector of V discrepancy mean over 100 iterations
        Thus the length of this vector for the default value of N is 100
        (N/100).

    Examples
Steven Cordwell committed
932
    ---------
933 934 935
    >>> # These examples are reproducible only if random seed is set to 0 in
    >>> # both the random and numpy.random modules.
    >>> import numpy as np
936
    >>> import mdptoolbox, mdptoolbox.example
937
    >>> np.random.seed(0)
938 939
    >>> P, R = mdptoolbox.example.forest()
    >>> ql = mdptoolbox.mdp.QLearning(P, R, 0.96)
940
    >>> ql.run()
Steven Cordwell's avatar
Steven Cordwell committed
941
    >>> ql.Q
942 943 944
    array([[ 68.38037354,  43.24888454],
           [ 72.37777922,  42.75549145],
           [ 77.02892702,  64.68712932]])
Steven Cordwell's avatar
Steven Cordwell committed
945
    >>> ql.V
946
    (68.38037354422798, 72.37777921607258, 77.02892701616531)
Steven Cordwell's avatar
Steven Cordwell committed
947
    >>> ql.policy
948
    (0, 0, 0)
Steven Cordwell's avatar
Steven Cordwell committed
949
    
950
    >>> import mdptoolbox
Steven Cordwell's avatar