mdp.py 57.5 KB
 Steven Cordwell committed Jun 12, 2012 1 ``````# -*- coding: utf-8 -*- `````` Steven Cordwell committed Feb 04, 2013 2 ``````"""Markov Decision Process (MDP) Toolbox `````` Steven Cordwell committed Feb 04, 2013 3 ``````===================================== `````` Steven Cordwell committed Jan 26, 2013 4 `````` `````` Steven Cordwell committed Feb 04, 2013 5 6 ``````The MDP toolbox provides classes and functions for the resolution of descrete-time Markov Decision Processes. `````` Steven Cordwell committed Jun 12, 2012 7 `````` `````` Steven Cordwell committed Feb 04, 2013 8 9 10 11 12 ``````Available classes ----------------- MDP Base Markov decision process class FiniteHorizon `````` Steven Cordwell committed Feb 04, 2013 13 `````` Backwards induction finite horizon MDP `````` Steven Cordwell committed Feb 04, 2013 14 15 16 17 18 19 20 21 22 23 24 25 26 27 ``````LP Linear programming MDP PolicyIteration Policy iteration MDP PolicyIterationModified Modified policy iteration MDP QLearning Q-learning MDP RelativeValueIteration Relative value iteration MDP ValueIteration Value iteration MDP ValueIterationGS Gauss-Seidel value iteration MDP `````` Steven Cordwell committed Jun 12, 2012 28 `````` `````` Steven Cordwell committed Feb 04, 2013 29 30 31 32 33 34 35 36 37 38 ``````Available functions ------------------- check Check that an MDP is properly defined checkSquareStochastic Check that a matrix is square and stochastic exampleForest A simple forest management example exampleRand A random example `````` Steven Cordwell committed Jun 12, 2012 39 `````` `````` Steven Cordwell committed Feb 04, 2013 40 41 42 43 44 ``````How to use the documentation ---------------------------- Documentation is available both as docstrings provided with the code and in html or pdf format from `The MDP toolbox homepage `_. The docstring `````` Steven Cordwell committed Aug 24, 2013 45 ``````examples assume that the `mdp` module has been imported imported like so:: `````` Steven Cordwell committed Jun 12, 2012 46 `````` `````` Steven Cordwell committed Aug 24, 2013 47 `````` >>> import mdptoolbox.mdp as mdp `````` Steven Cordwell committed Feb 04, 2013 48 49 50 51 52 `````` Code snippets are indicated by three greater-than signs:: >>> x = 17 >>> x = x + 1 `````` Steven Cordwell committed May 18, 2013 53 54 `````` >>> x 18 `````` Steven Cordwell committed Feb 04, 2013 55 56 57 58 59 `````` The documentation can be displayed with `IPython `_. For example, to view the docstring of the ValueIteration class use ``mdp.ValueIteration?``, and to view its source code use ``mdp.ValueIteration??``. `````` 60 `````` `````` 61 62 63 ``````Acknowledgments --------------- This module is modified from the MDPtoolbox (c) 2009 INRA available at `````` Steven Cordwell committed Feb 08, 2013 64 ``````http://www.inra.fr/mia/T/MDPtoolbox/. `````` 65 `````` `````` Steven Cordwell committed Jun 12, 2012 66 67 ``````""" `````` 68 69 ``````# Copyright (c) 2011-2013 Steven A. W. Cordwell # Copyright (c) 2009 INRA `````` Steven Cordwell committed Feb 04, 2013 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 ``````# # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright notice, # this list of conditions and the following disclaimer. # * Redistributions in binary form must reproduce the above copyright notice, # this list of conditions and the following disclaimer in the documentation # and/or other materials provided with the distribution. # * Neither the name of the nor the names of its contributors # may be used to endorse or promote products derived from this software # without specific prior written permission. # # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" # AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE # ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS # INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN # CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) # ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE # POSSIBILITY OF SUCH DAMAGE. `````` Steven Cordwell committed Jan 26, 2013 97 98 99 ``````from math import ceil, log, sqrt from time import time `````` Steven Cordwell committed Aug 18, 2013 100 101 ``````from numpy import absolute, array, empty, mean, mod, multiply from numpy import ndarray, ones, zeros `````` Steven Cordwell committed Aug 24, 2013 102 ``````from numpy.random import randint, random `````` Steven Cordwell committed Oct 29, 2012 103 ``````from scipy.sparse import csr_matrix as sparse `````` Steven Cordwell committed Jun 12, 2012 104 `````` `````` Steven Cordwell committed Aug 18, 2013 105 ``````from utils import check, getSpan `````` 106 `````` `````` Steven Cordwell committed Feb 04, 2013 107 ``````# These need to be fixed so that we use classes derived from Error. `````` Steven Cordwell committed Oct 29, 2012 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 ``````mdperr = { "mat_nonneg" : "PyMDPtoolbox: Probabilities must be non-negative.", "mat_square" : "PyMDPtoolbox: The matrix must be square.", "mat_stoch" : "PyMDPtoolbox: Rows of the matrix must sum to one (1).", "mask_numpy" : "PyMDPtoolbox: mask must be a numpy array or matrix; i.e. type(mask) is " "ndarray or type(mask) is matrix.", "mask_SbyS" : "PyMDPtoolbox: The mask must have shape SxS; i.e. mask.shape = (S, S).", "obj_shape" : "PyMDPtoolbox: Object arrays for transition probabilities and rewards " "must have only 1 dimension: the number of actions A. Each element of " "the object array contains an SxS ndarray or matrix.", "obj_square" : "PyMDPtoolbox: Each element of an object array for transition " "probabilities and rewards must contain an SxS ndarray or matrix; i.e. " "P[a].shape = (S, S) or R[a].shape = (S, S).", `````` Steven Cordwell committed Nov 04, 2012 128 ``````"P_type" : `````` Steven Cordwell committed Oct 29, 2012 129 130 131 132 133 134 135 136 137 138 `````` "PyMDPtoolbox: The transition probabilities must be in a numpy array; " "i.e. type(P) is ndarray.", "P_shape" : "PyMDPtoolbox: The transition probability array must have the shape " "(A, S, S) with S : number of states greater than 0 and A : number of " "actions greater than 0. i.e. R.shape = (A, S, S)", "PR_incompat" : "PyMDPtoolbox: Incompatibility between P and R dimensions.", "prob_in01" : "PyMDPtoolbox: Probability p must be in [0; 1].", `````` Steven Cordwell committed Nov 04, 2012 139 ``````"R_type" : `````` Steven Cordwell committed Oct 29, 2012 140 `````` "PyMDPtoolbox: The rewards must be in a numpy array; i.e. type(R) is " `````` Steven Cordwell committed Nov 04, 2012 141 `````` "ndarray, or numpy matrix; i.e. type(R) is matrix.", `````` Steven Cordwell committed Oct 29, 2012 142 143 ``````"R_shape" : "PyMDPtoolbox: The reward matrix R must be an array of shape (A, S, S) or " `````` Steven Cordwell committed Jan 26, 2013 144 145 `````` "(S, A) with S : number of states greater than 0 and A : number of " "actions greater than 0. i.e. R.shape = (S, A) or (A, S, S).", `````` Steven Cordwell committed Oct 29, 2012 146 147 148 149 150 151 ``````"R_gt_0" : "PyMDPtoolbox: The rewards must be greater than 0.", "S_gt_1" : "PyMDPtoolbox: Number of states S must be greater than 1.", "SA_gt_1" : "PyMDPtoolbox: The number of states S and the number of actions A must be " `````` Steven Cordwell committed Jan 21, 2013 152 153 154 155 156 `````` "greater than 1.", "discount_rng" : "PyMDPtoolbox: Discount rate must be in ]0; 1]", "maxi_min" : "PyMDPtoolbox: The maximum number of iterations must be greater than 0" `````` Steven Cordwell committed Oct 29, 2012 157 158 ``````} `````` Steven Cordwell committed Nov 04, 2012 159 ``````class MDP(object): `````` 160 `````` `````` Steven Cordwell committed Feb 04, 2013 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 `````` """A Markov Decision Problem. Parameters ---------- transitions : array transition probability matrices reward : array reward matrices discount : float or None discount factor epsilon : float or None stopping criteria max_iter : int or None maximum number of iterations Attributes ---------- P : array Transition probability matrices R : array Reward matrices V : list Value function discount : float b max_iter : int a policy : list a time : float a verbose : logical a Methods ------- iterate To be implemented in child classes, raises exception setSilent Turn the verbosity off setVerbose Turn the verbosity on """ `````` Steven Cordwell committed Oct 29, 2012 205 `````` `````` Steven Cordwell committed Jan 25, 2013 206 `````` def __init__(self, transitions, reward, discount, epsilon, max_iter): `````` Steven Cordwell committed Jan 26, 2013 207 `````` """Initialise a MDP based on the input parameters.""" `````` 208 `````` `````` Steven Cordwell committed Jan 24, 2013 209 210 `````` # if the discount is None then the algorithm is assumed to not use it # in its computations `````` Steven Cordwell committed Jan 25, 2013 211 `````` if type(discount) in (int, float): `````` Steven Cordwell committed Jan 21, 2013 212 213 214 `````` if (discount <= 0) or (discount > 1): raise ValueError(mdperr["discount_rng"]) else: `````` Steven Cordwell committed Jan 24, 2013 215 `````` if discount == 1: `````` Steven Cordwell committed Jan 26, 2013 216 217 218 `````` print("PyMDPtoolbox WARNING: check conditions of " "convergence. With no discount, convergence is not " "always assumed.") `````` Steven Cordwell committed Jan 21, 2013 219 `````` self.discount = discount `````` Steven Cordwell committed Jan 26, 2013 220 `````` elif discount is not None: `````` Steven Cordwell committed Jan 26, 2013 221 222 `````` raise ValueError("PyMDPtoolbox: the discount must be a positive " "real number less than or equal to one.") `````` Steven Cordwell committed Jan 24, 2013 223 224 `````` # if the max_iter is None then the algorithm is assumed to not use it # in its computations `````` Steven Cordwell committed Jan 25, 2013 225 226 `````` if type(max_iter) in (int, float): if max_iter <= 0: `````` Steven Cordwell committed Jan 21, 2013 227 228 229 `````` raise ValueError(mdperr["maxi_min"]) else: self.max_iter = max_iter `````` Steven Cordwell committed Jan 26, 2013 230 `````` elif max_iter is not None: `````` Steven Cordwell committed Jan 26, 2013 231 232 `````` raise ValueError("PyMDPtoolbox: max_iter must be a positive real " "number greater than zero.") `````` Steven Cordwell committed Jun 21, 2013 233 `````` # check that epsilon is something sane `````` Steven Cordwell committed Jan 25, 2013 234 235 `````` if type(epsilon) in (int, float): if epsilon <= 0: `````` Steven Cordwell committed Jan 26, 2013 236 237 `````` raise ValueError("PyMDPtoolbox: epsilon must be greater than " "0.") `````` Steven Cordwell committed Jan 26, 2013 238 `````` elif epsilon is not None: `````` Steven Cordwell committed Jan 26, 2013 239 240 `````` raise ValueError("PyMDPtoolbox: epsilon must be a positive real " "number greater than zero.") `````` Steven Cordwell committed Jan 22, 2013 241 242 243 244 `````` # we run a check on P and R to make sure they are describing an MDP. If # an exception isn't raised then they are assumed to be correct. check(transitions, reward) # computePR will assign the variables self.S, self.A, self.P and self.R `````` 245 `````` self._computePR(transitions, reward) `````` Steven Cordwell committed Jan 21, 2013 246 247 248 249 `````` # the verbosity is by default turned off self.verbose = False # Initially the time taken to perform the computations is set to None self.time = None `````` Steven Cordwell committed Jan 21, 2013 250 251 `````` # set the initial iteration count to zero self.iter = 0 `````` Steven Cordwell committed Feb 10, 2013 252 `````` # V should be stored as a vector ie shape of (S,) or (1, S) `````` Steven Cordwell committed Jan 24, 2013 253 `````` self.V = None `````` Steven Cordwell committed Feb 10, 2013 254 `````` # policy can also be stored as a vector `````` Steven Cordwell committed Jan 21, 2013 255 `````` self.policy = None `````` Steven Cordwell committed Oct 29, 2012 256 `````` `````` 257 `````` def _bellmanOperator(self, V=None): `````` Steven Cordwell committed Jun 21, 2013 258 259 260 261 262 263 `````` # Apply the Bellman operator on the value function. # Updates the value function and the Vprev-improving policy. # Returns: (policy, value), tuple of new policy and its value # # If V hasn't been sent into the method, then we assume to be working # on the objects V attribute `````` Steven Cordwell committed Jan 26, 2013 264 265 `````` if V is None: # this V should be a reference to the data rather than a copy `````` Steven Cordwell committed Jan 25, 2013 266 267 `````` V = self.V else: `````` Steven Cordwell committed Jun 21, 2013 268 `````` # make sure the user supplied V is of the right shape `````` Steven Cordwell committed Jan 26, 2013 269 `````` try: `````` Steven Cordwell committed Feb 10, 2013 270 `````` if V.shape not in ((self.S,), (1, self.S)): `````` Steven Cordwell committed Jan 27, 2013 271 `````` raise ValueError("bellman: V is not the right shape.") `````` Steven Cordwell committed Jan 26, 2013 272 273 `````` except AttributeError: raise TypeError("bellman: V must be a numpy array or matrix.") `````` Steven Cordwell committed Jun 21, 2013 274 `````` # Looping through each action the the Q-value matrix is calculated `````` Steven Cordwell committed May 18, 2013 275 `````` Q = empty((self.A, self.S)) `````` Steven Cordwell committed Jun 12, 2012 276 `````` for aa in range(self.A): `````` Steven Cordwell committed May 18, 2013 277 `````` Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V) `````` Steven Cordwell committed Jun 21, 2013 278 `````` # Get the policy and value, for now it is being returned but... `````` Steven Cordwell committed Jan 26, 2013 279 `````` # Which way is better? `````` 280 `````` # 1. Return, (policy, value) `````` Steven Cordwell committed May 18, 2013 281 `````` return (Q.argmax(axis=0), Q.max(axis=0)) `````` Steven Cordwell committed Jan 24, 2013 282 283 `````` # 2. update self.policy and self.V directly # self.V = Q.max(axis=1) `````` Steven Cordwell committed Jan 24, 2013 284 `````` # self.policy = Q.argmax(axis=1) `````` Steven Cordwell committed Jun 12, 2012 285 `````` `````` 286 `````` def _computePR(self, P, R): `````` Steven Cordwell committed Jun 21, 2013 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 `````` # Compute the reward for the system in one state chosing an action. # Arguments # --------- # Let S = number of states, A = number of actions # P(SxSxA) = transition matrix # P could be an array with 3 dimensions or a cell array (1xA), # each cell containing a matrix (SxS) possibly sparse # R(SxSxA) or (SxA) = reward matrix # R could be an array with 3 dimensions (SxSxA) or a cell array # (1xA), each cell containing a sparse matrix (SxS) or a 2D # array(SxA) possibly sparse # Evaluation # ---------- # PR(SxA) = reward matrix # `````` Steven Cordwell committed Jan 26, 2013 302 `````` # We assume that P and R define a MDP i,e. assumption is that `````` Steven Cordwell committed Jan 22, 2013 303 `````` # check(P, R) has already been run and doesn't fail. `````` Steven Cordwell committed Jan 26, 2013 304 `````` # `````` Steven Cordwell committed Feb 09, 2013 305 306 `````` # Set self.P as a tuple of length A, with each element storing an S×S # matrix. `````` Steven Cordwell committed Feb 08, 2013 307 `````` self.A = len(P) `````` Steven Cordwell committed Feb 08, 2013 308 309 310 `````` try: if P.ndim == 3: self.S = P.shape[1] `````` Steven Cordwell committed Feb 08, 2013 311 312 `````` else: self.S = P[0].shape[0] `````` Steven Cordwell committed Feb 08, 2013 313 `````` except AttributeError: `````` Steven Cordwell committed Feb 08, 2013 314 `````` self.S = P[0].shape[0] `````` Steven Cordwell committed Feb 08, 2013 315 316 317 318 319 `````` except: raise # convert Ps to matrices self.P = [] for aa in xrange(self.A): `````` Steven Cordwell committed Feb 09, 2013 320 `````` self.P.append(P[aa]) `````` Steven Cordwell committed Feb 08, 2013 321 `````` self.P = tuple(self.P) `````` Steven Cordwell committed Feb 10, 2013 322 323 `````` # Set self.R as a tuple of length A, with each element storing an 1×S # vector. `````` Steven Cordwell committed Feb 09, 2013 324 `````` try: `````` Steven Cordwell committed Jan 24, 2013 325 `````` if R.ndim == 2: `````` Steven Cordwell committed Feb 09, 2013 326 327 `````` self.R = [] for aa in xrange(self.A): `````` Steven Cordwell committed Feb 10, 2013 328 `````` self.R.append(array(R[:, aa]).reshape(self.S)) `````` Steven Cordwell committed Jun 12, 2012 329 `````` else: `````` Steven Cordwell committed Feb 09, 2013 330 331 332 333 `````` raise AttributeError except AttributeError: self.R = [] for aa in xrange(self.A): `````` Steven Cordwell committed Feb 10, 2013 334 335 336 337 338 339 `````` try: self.R.append(P[aa].multiply(R[aa]).sum(1).reshape(self.S)) except AttributeError: self.R.append(multiply(P[aa],R[aa]).sum(1).reshape(self.S)) except: raise `````` Steven Cordwell committed Feb 09, 2013 340 341 342 `````` except: raise self.R = tuple(self.R) `````` Steven Cordwell committed Jan 26, 2013 343 `````` `````` 344 `````` def _iterate(self): `````` Steven Cordwell committed Jun 21, 2013 345 `````` # Raise error because child classes should implement this function. `````` 346 `````` raise NotImplementedError("You should create an _iterate() method.") `````` Steven Cordwell committed Jun 12, 2012 347 `````` `````` Steven Cordwell committed Oct 29, 2012 348 `````` def setSilent(self): `````` Steven Cordwell committed Jan 26, 2013 349 `````` """Set the MDP algorithm to silent mode.""" `````` Steven Cordwell committed Oct 29, 2012 350 351 352 `````` self.verbose = False def setVerbose(self): `````` Steven Cordwell committed Jan 26, 2013 353 `````` """Set the MDP algorithm to verbose mode.""" `````` Steven Cordwell committed Oct 29, 2012 354 `````` self.verbose = True `````` Steven Cordwell committed Oct 29, 2012 355 356 `````` class FiniteHorizon(MDP): `````` 357 `````` `````` Steven Cordwell committed Feb 04, 2013 358 `````` """A MDP solved using the finite-horizon backwards induction algorithm. `````` Steven Cordwell committed Jan 21, 2013 359 360 `````` Let S = number of states, A = number of actions `````` Steven Cordwell committed Feb 04, 2013 361 362 363 `````` Parameters ---------- `````` Steven Cordwell committed Jan 21, 2013 364 `````` P(SxSxA) = transition matrix `````` Steven Cordwell committed Jan 26, 2013 365 366 `````` P could be an array with 3 dimensions ora cell array (1xA), each cell containing a matrix (SxS) possibly sparse `````` Steven Cordwell committed Jan 21, 2013 367 368 369 370 371 372 373 `````` R(SxSxA) or (SxA) = reward matrix R could be an array with 3 dimensions (SxSxA) or a cell array (1xA), each cell containing a sparse matrix (SxS) or a 2D array(SxA) possibly sparse discount = discount factor, in ]0, 1] N = number of periods, upper than 0 h(S) = terminal reward, optional (default [0; 0; ... 0] ) `````` Steven Cordwell committed Feb 04, 2013 374 375 `````` Attributes `````` Steven Cordwell committed Jan 21, 2013 376 `````` ---------- `````` Steven Cordwell committed Feb 04, 2013 377 378 379 `````` Methods ------- `````` Steven Cordwell committed Jan 21, 2013 380 381 382 383 384 385 386 387 388 389 390 391 392 `````` V(S,N+1) = optimal value function V(:,n) = optimal value function at stage n with stage in 1, ..., N V(:,N+1) = value function for terminal stage policy(S,N) = optimal policy policy(:,n) = optimal policy at stage n with stage in 1, ...,N policy(:,N) = policy for stage N cpu_time = used CPU time Notes ----- In verbose mode, displays the current stage and policy transpose. `````` 393 `````` `````` Steven Cordwell committed Feb 04, 2013 394 395 `````` Examples -------- `````` Steven Cordwell committed Aug 24, 2013 396 397 398 `````` >>> import mdptoolbox, mdptoolbox.example >>> P, R = mdptoolbox.example.forest() >>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3) `````` Steven Cordwell committed Feb 04, 2013 399 400 401 402 403 404 405 406 `````` >>> fh.V array([[ 2.6973, 0.81 , 0. , 0. ], [ 5.9373, 3.24 , 1. , 0. ], [ 9.9373, 7.24 , 4. , 0. ]]) >>> fh.policy array([[0, 0, 0], [0, 0, 1], [0, 0, 0]]) `````` Steven Cordwell committed Aug 24, 2013 407 `````` `````` Steven Cordwell committed Oct 29, 2012 408 `````` """ `````` Steven Cordwell committed Jan 21, 2013 409 `````` `````` Steven Cordwell committed Jan 21, 2013 410 `````` def __init__(self, transitions, reward, discount, N, h=None): `````` Steven Cordwell committed Jan 26, 2013 411 `````` """Initialise a finite horizon MDP.""" `````` Steven Cordwell committed Jan 21, 2013 412 `````` if N < 1: `````` Steven Cordwell committed Jan 22, 2013 413 `````` raise ValueError('PyMDPtoolbox: N must be greater than 0') `````` Steven Cordwell committed Jan 21, 2013 414 `````` else: `````` Steven Cordwell committed Jan 21, 2013 415 `````` self.N = N `````` Steven Cordwell committed Feb 10, 2013 416 `````` # Initialise the base class `````` Steven Cordwell committed Jan 25, 2013 417 `````` MDP.__init__(self, transitions, reward, discount, None, None) `````` Steven Cordwell committed Feb 10, 2013 418 419 `````` # remove the iteration counter, it is not meaningful for backwards # induction `````` Steven Cordwell committed Jan 25, 2013 420 `````` del self.iter `````` Steven Cordwell committed Feb 10, 2013 421 `````` # There are value vectors for each time step up to the horizon `````` Steven Cordwell committed Jan 25, 2013 422 `````` self.V = zeros((self.S, N + 1)) `````` Steven Cordwell committed Feb 10, 2013 423 424 425 426 427 `````` # There are policy vectors for each time step before the horizon, when # we reach the horizon we don't need to make decisions anymore. self.policy = empty((self.S, N), dtype=int) # Set the reward for the final transition to h, if specified. if h is not None: `````` Steven Cordwell committed Jan 25, 2013 428 `````` self.V[:, N] = h `````` 429 430 431 432 `````` # Call the iteration method self._iterate() def _iterate(self): `````` Steven Cordwell committed Jun 21, 2013 433 `````` # Run the finite horizon algorithm. `````` Steven Cordwell committed Jan 21, 2013 434 `````` self.time = time() `````` Steven Cordwell committed Jun 21, 2013 435 `````` # loop through each time period `````` Steven Cordwell committed Jan 25, 2013 436 `````` for n in range(self.N): `````` Steven Cordwell committed Feb 10, 2013 437 438 439 `````` W, X = self._bellmanOperator(self.V[:, self.N - n]) self.V[:, self.N - n - 1] = X self.policy[:, self.N - n - 1] = W `````` Steven Cordwell committed Jan 21, 2013 440 `````` if self.verbose: `````` Steven Cordwell committed Jan 26, 2013 441 442 `````` print("stage: %s ... policy transpose : %s") % ( self.N - n, self.policy[:, self.N - n -1].tolist()) `````` Steven Cordwell committed Jun 21, 2013 443 `````` # update time spent running `````` Steven Cordwell committed Jan 21, 2013 444 `````` self.time = time() - self.time `````` Steven Cordwell committed Feb 10, 2013 445 446 447 448 449 450 451 452 453 454 `````` # After this we could create a tuple of tuples for the values and # policies. #V = [] #p = [] #for n in xrange(self.N): # V.append() # p.append() #V.append() #self.V = tuple(V) #self.policy = tuple(p) `````` Steven Cordwell committed Oct 29, 2012 455 456 `````` class LP(MDP): `````` 457 `````` `````` Steven Cordwell committed Jan 26, 2013 458 `````` """A discounted MDP soloved using linear programming. `````` Steven Cordwell committed Jan 21, 2013 459 460 461 462 463 `````` Arguments --------- Let S = number of states, A = number of actions P(SxSxA) = transition matrix `````` Steven Cordwell committed Jan 26, 2013 464 465 `````` P could be an array with 3 dimensions or a cell array (1xA), each cell containing a matrix (SxS) possibly sparse `````` Steven Cordwell committed Jan 21, 2013 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 `````` R(SxSxA) or (SxA) = reward matrix R could be an array with 3 dimensions (SxSxA) or a cell array (1xA), each cell containing a sparse matrix (SxS) or a 2D array(SxA) possibly sparse discount = discount rate, in ]0; 1[ h(S) = terminal reward, optional (default [0; 0; ... 0] ) Evaluation ---------- V(S) = optimal values policy(S) = optimal policy cpu_time = used CPU time Notes ----- In verbose mode, displays the current stage and policy transpose. Examples -------- `````` Steven Cordwell committed Aug 24, 2013 485 486 487 `````` >>> import mdptoolbox, mdptoolbox.example >>> P, R = mdptoolbox.example.forest() >>> lp = mdptoolbox.mdp.LP(P, R, 0.9) `````` 488 `````` `````` Steven Cordwell committed Oct 29, 2012 489 `````` """ `````` Steven Cordwell committed Jan 21, 2013 490 `````` `````` Steven Cordwell committed Jan 21, 2013 491 `````` def __init__(self, transitions, reward, discount): `````` Steven Cordwell committed Jan 26, 2013 492 `````` """Initialise a linear programming MDP.""" `````` Steven Cordwell committed Jun 21, 2013 493 `````` # import some functions from cvxopt and set them as object methods `````` Steven Cordwell committed Jan 21, 2013 494 495 `````` try: from cvxopt import matrix, solvers `````` Steven Cordwell committed Jan 26, 2013 496 497 `````` self._linprog = solvers.lp self._cvxmat = matrix `````` Steven Cordwell committed Jan 21, 2013 498 `````` except ImportError: `````` Steven Cordwell committed Jan 26, 2013 499 500 `````` raise ImportError("The python module cvxopt is required to use " "linear programming functionality.") `````` Steven Cordwell committed Jun 21, 2013 501 502 `````` # we also need diagonal matrices, and using a sparse one may be more # memory efficient `````` Steven Cordwell committed Jan 21, 2013 503 `````` from scipy.sparse import eye as speye `````` Steven Cordwell committed Jan 26, 2013 504 `````` self._speye = speye `````` Steven Cordwell committed Jun 21, 2013 505 `````` # initialise the MDP. epsilon and max_iter are not needed `````` Steven Cordwell committed Jan 26, 2013 506 `````` MDP.__init__(self, transitions, reward, discount, None, None) `````` Steven Cordwell committed Jun 21, 2013 507 `````` # Set the cvxopt solver to be quiet by default, but ... `````` Steven Cordwell committed Jan 26, 2013 508 `````` # this doesn't do what I want it to do c.f. issue #3 `````` Steven Cordwell committed Jan 26, 2013 509 510 `````` if not self.verbose: solvers.options['show_progress'] = False `````` 511 512 `````` # Call the iteration method self._iterate() `````` Steven Cordwell committed Jan 26, 2013 513 `````` `````` 514 `````` def _iterate(self): `````` Steven Cordwell committed Jun 21, 2013 515 `````` #Run the linear programming algorithm. `````` Steven Cordwell committed Jan 26, 2013 516 `````` self.time = time() `````` Steven Cordwell committed Jan 21, 2013 517 `````` # The objective is to resolve : min V / V >= PR + discount*P*V `````` Steven Cordwell committed Jan 26, 2013 518 519 `````` # The function linprog of the optimisation Toolbox of Mathworks # resolves : `````` Steven Cordwell committed Jan 21, 2013 520 `````` # min f'* x / M * x <= b `````` Steven Cordwell committed Jan 26, 2013 521 522 523 524 `````` # So the objective could be expressed as : # min V / (discount*P-I) * V <= - PR # To avoid loop on states, the matrix M is structured following actions # M(A*S,S) `````` Steven Cordwell committed Jan 26, 2013 525 526 527 `````` f = self._cvxmat(ones((self.S, 1))) h = self._cvxmat(self.R.reshape(self.S * self.A, 1, order="F"), tc='d') M = zeros((self.A * self.S, self.S)) `````` Steven Cordwell committed Jan 21, 2013 528 529 `````` for aa in range(self.A): pos = (aa + 1) * self.S `````` Steven Cordwell committed Jan 26, 2013 530 531 532 `````` M[(pos - self.S):pos, :] = ( self.discount * self.P[aa] - self._speye(self.S, self.S)) M = self._cvxmat(M) `````` Steven Cordwell committed Jan 26, 2013 533 534 535 `````` # Using the glpk option will make this behave more like Octave # (Octave uses glpk) and perhaps Matlab. If solver=None (ie using the # default cvxopt solver) then V agrees with the Octave equivalent `````` Steven Cordwell committed Jun 21, 2013 536 `````` # only to 10e-8 places. This assumes glpk is installed of course. `````` Steven Cordwell committed Feb 10, 2013 537 `````` self.V = array(self._linprog(f, M, -h, solver='glpk')['x']) `````` Steven Cordwell committed Jun 21, 2013 538 `````` # apply the Bellman operator `````` 539 `````` self.policy, self.V = self._bellmanOperator() `````` Steven Cordwell committed Jun 21, 2013 540 `````` # update the time spent solving `````` Steven Cordwell committed Jan 21, 2013 541 `````` self.time = time() - self.time `````` Steven Cordwell committed Jan 26, 2013 542 `````` # store value and policy as tuples `````` 543 544 `````` self.V = tuple(self.V.tolist()) self.policy = tuple(self.policy.tolist()) `````` Steven Cordwell committed Oct 29, 2012 545 546 `````` class PolicyIteration(MDP): `````` 547 `````` `````` Steven Cordwell committed Jan 26, 2013 548 `````` """A discounted MDP solved using the policy iteration algorithm. `````` Steven Cordwell committed Nov 04, 2012 549 `````` `````` Steven Cordwell committed Jan 22, 2013 550 551 552 553 `````` Arguments --------- Let S = number of states, A = number of actions P(SxSxA) = transition matrix `````` Steven Cordwell committed Jan 26, 2013 554 555 `````` P could be an array with 3 dimensions or a cell array (1xA), each cell containing a matrix (SxS) possibly sparse `````` Steven Cordwell committed Jan 22, 2013 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 `````` R(SxSxA) or (SxA) = reward matrix R could be an array with 3 dimensions (SxSxA) or a cell array (1xA), each cell containing a sparse matrix (SxS) or a 2D array(SxA) possibly sparse discount = discount rate, in ]0, 1[ policy0(S) = starting policy, optional max_iter = maximum number of iteration to be done, upper than 0, optional (default 1000) eval_type = type of function used to evaluate policy: 0 for mdp_eval_policy_matrix, else mdp_eval_policy_iterative optional (default 0) Evaluation ---------- V(S) = value function policy(S) = optimal policy iter = number of done iterations cpu_time = used CPU time Notes ----- In verbose mode, at each iteration, displays the number of differents actions between policy n-1 and n `````` Steven Cordwell committed Nov 04, 2012 580 581 `````` Examples -------- `````` Steven Cordwell committed Aug 24, 2013 582 583 584 `````` >>> import mdptoolbox, mdptoolbox.example >>> P, R = mdptoolbox.example.rand() >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9) `````` Steven Cordwell committed Jun 21, 2013 585 `````` `````` Steven Cordwell committed Aug 24, 2013 586 587 `````` >>> P, R = mdptoolbox.example.forest() >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9) `````` Steven Cordwell committed Jun 21, 2013 588 `````` >>> pi.V `````` Steven Cordwell committed Jun 21, 2013 589 `````` (26.244000000000018, 29.48400000000002, 33.484000000000016) `````` Steven Cordwell committed Jun 21, 2013 590 `````` >>> pi.policy `````` Steven Cordwell committed Jun 21, 2013 591 `````` (0, 0, 0) `````` Steven Cordwell committed Oct 29, 2012 592 `````` """ `````` Steven Cordwell committed Nov 04, 2012 593 `````` `````` Steven Cordwell committed Jan 26, 2013 594 595 `````` def __init__(self, transitions, reward, discount, policy0=None, max_iter=1000, eval_type=0): `````` Steven Cordwell committed Jun 21, 2013 596 597 598 `````` # Initialise a policy iteration MDP. # # Set up the MDP, but don't need to worry about epsilon values `````` Steven Cordwell committed Jan 25, 2013 599 `````` MDP.__init__(self, transitions, reward, discount, None, max_iter) `````` Steven Cordwell committed Jun 21, 2013 600 `````` # Check if the user has supplied an initial policy. If not make one. `````` Steven Cordwell committed Jan 22, 2013 601 `````` if policy0 == None: `````` Steven Cordwell committed Jun 21, 2013 602 `````` # Initialise the policy to the one which maximises the expected `````` Steven Cordwell committed Jan 22, 2013 603 `````` # immediate reward `````` Steven Cordwell committed Jun 21, 2013 604 605 `````` null = zeros(self.S) self.policy, null = self._bellmanOperator(null) `````` Steven Cordwell committed Jan 24, 2013 606 `````` del null `````` Steven Cordwell committed Jan 22, 2013 607 `````` else: `````` Steven Cordwell committed Jun 21, 2013 608 609 `````` # Use the policy that the user supplied # Make sure it is a numpy array `````` Steven Cordwell committed Jan 22, 2013 610 `````` policy0 = array(policy0) `````` Steven Cordwell committed Jun 21, 2013 611 `````` # Make sure the policy is the right size and shape `````` Steven Cordwell committed Jan 22, 2013 612 `````` if not policy0.shape in ((self.S, ), (self.S, 1), (1, self.S)): `````` Steven Cordwell committed Jan 26, 2013 613 614 `````` raise ValueError("PyMDPtolbox: policy0 must a vector with " "length S.") `````` Steven Cordwell committed Jun 21, 2013 615 `````` # reshape the policy to be a vector `````` Steven Cordwell committed Feb 10, 2013 616 `````` policy0 = policy0.reshape(self.S) `````` Steven Cordwell committed Jun 21, 2013 617 `````` # The policy can only contain integers between 1 and S `````` Steven Cordwell committed Jan 26, 2013 618 619 620 621 `````` if (mod(policy0, 1).any() or (policy0 < 0).any() or (policy0 >= self.S).any()): raise ValueError("PyMDPtoolbox: policy0 must be a vector of " "integers between 1 and S.") `````` Steven Cordwell committed Jan 22, 2013 622 623 `````` else: self.policy = policy0 `````` Steven Cordwell committed Jun 21, 2013 624 `````` # set the initial values to zero `````` Steven Cordwell committed Feb 10, 2013 625 `````` self.V = zeros(self.S) `````` Steven Cordwell committed Jun 21, 2013 626 `````` # Do some setup depending on the evaluation type `````` Steven Cordwell committed Jan 22, 2013 627 `````` if eval_type in (0, "matrix"): `````` Steven Cordwell committed Jan 24, 2013 628 `````` from numpy.linalg import solve `````` Steven Cordwell committed Jan 24, 2013 629 `````` from scipy.sparse import eye `````` 630 631 `````` self._speye = eye self._lin_eq = solve `````` Steven Cordwell committed Jan 22, 2013 632 633 634 635 `````` self.eval_type = "matrix" elif eval_type in (1, "iterative"): self.eval_type = "iterative" else: `````` Steven Cordwell committed Jan 26, 2013 636 637 638 639 `````` raise ValueError("PyMDPtoolbox: eval_type should be 0 for matrix " "evaluation or 1 for iterative evaluation. " "The strings 'matrix' and 'iterative' can also " "be used.") `````` 640 641 `````` # Call the iteration method self._iterate() `````` Steven Cordwell committed Jan 22, 2013 642 `````` `````` 643 `````` def _computePpolicyPRpolicy(self): `````` Steven Cordwell committed Jun 21, 2013 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 `````` # Compute the transition matrix and the reward matrix for a policy. # # Arguments # --------- # Let S = number of states, A = number of actions # P(SxSxA) = transition matrix # P could be an array with 3 dimensions or a cell array (1xA), # each cell containing a matrix (SxS) possibly sparse # R(SxSxA) or (SxA) = reward matrix # R could be an array with 3 dimensions (SxSxA) or # a cell array (1xA), each cell containing a sparse matrix (SxS) or # a 2D array(SxA) possibly sparse # policy(S) = a policy # # Evaluation # ---------- # Ppolicy(SxS) = transition matrix for policy # PRpolicy(S) = reward matrix for policy # `````` Steven Cordwell committed Feb 10, 2013 663 664 `````` Ppolicy = empty((self.S, self.S)) Rpolicy = zeros(self.S) `````` Steven Cordwell committed Jan 24, 2013 665 `````` for aa in range(self.A): # avoid looping over S `````` Steven Cordwell committed Feb 10, 2013 666 667 `````` # the rows that use action a. ind = (self.policy == aa).nonzero()[0] `````` Steven Cordwell committed Jan 26, 2013 668 669 `````` # if no rows use action a, then no need to assign this if ind.size > 0: `````` Steven Cordwell committed Jan 24, 2013 670 `````` Ppolicy[ind, :] = self.P[aa][ind, :] `````` 671 `````` #PR = self._computePR() # an apparently uneeded line, and `````` Steven Cordwell committed Jan 24, 2013 672 673 `````` # perhaps harmful in this implementation c.f. # mdp_computePpolicyPRpolicy.m `````` Steven Cordwell committed Feb 09, 2013 674 `````` Rpolicy[ind] = self.R[aa][ind] `````` Steven Cordwell committed Jan 24, 2013 675 676 677 678 679 680 681 682 683 684 `````` # self.R cannot be sparse with the code in its current condition, but # it should be possible in the future. Also, if R is so big that its # a good idea to use a sparse matrix for it, then converting PRpolicy # from a dense to sparse matrix doesn't seem very memory efficient if type(self.R) is sparse: Rpolicy = sparse(Rpolicy) #self.Ppolicy = Ppolicy #self.Rpolicy = Rpolicy return (Ppolicy, Rpolicy) `````` 685 `````` def _evalPolicyIterative(self, V0=0, epsilon=0.0001, max_iter=10000): `````` Steven Cordwell committed Jun 21, 2013 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 `````` # Evaluate a policy using iteration. # # Arguments # --------- # Let S = number of states, A = number of actions # P(SxSxA) = transition matrix # P could be an array with 3 dimensions or # a cell array (1xS), each cell containing a matrix possibly sparse # R(SxSxA) or (SxA) = reward matrix # R could be an array with 3 dimensions (SxSxA) or # a cell array (1xA), each cell containing a sparse matrix (SxS) or # a 2D array(SxA) possibly sparse # discount = discount rate in ]0; 1[ # policy(S) = a policy # V0(S) = starting value function, optional (default : zeros(S,1)) # epsilon = epsilon-optimal policy search, upper than 0, # optional (default : 0.0001) # max_iter = maximum number of iteration to be done, upper than 0, # optional (default : 10000) # # Evaluation # ---------- # Vpolicy(S) = value function, associated to a specific policy # # Notes # ----- # In verbose mode, at each iteration, displays the condition which # stopped iterations: epsilon-optimum value function found or maximum # number of iterations reached. # `````` Steven Cordwell committed Jan 25, 2013 716 `````` if (type(V0) in (int, float)) and (V0 == 0): `````` Steven Cordwell committed Feb 10, 2013 717 `````` policy_V = zeros(self.S) `````` Steven Cordwell committed Jan 22, 2013 718 `````` else: `````` Steven Cordwell committed May 18, 2013 719 `````` if (type(V0) in (ndarray)) and (V0.shape == (self.S, 1)): `````` Steven Cordwell committed Jan 25, 2013 720 721 `````` policy_V = V0 else: `````` Steven Cordwell committed Jan 26, 2013 722 723 724 `````` raise ValueError("PyMDPtoolbox: V0 vector/array type not " "supported. Use ndarray of matrix column " "vector length S.") `````` Steven Cordwell committed Jan 22, 2013 725 `````` `````` 726 `````` policy_P, policy_R = self._computePpolicyPRpolicy() `````` Steven Cordwell committed Jan 22, 2013 727 728 729 `````` if self.verbose: print(' Iteration V_variation') `````` Steven Cordwell committed Jan 24, 2013 730 `````` `````` Steven Cordwell committed Jan 22, 2013 731 732 733 `````` itr = 0 done = False while not done: `````` Steven Cordwell committed Feb 10, 2013 734 `````` itr += 1 `````` Steven Cordwell committed Jan 24, 2013 735 736 `````` Vprev = policy_V `````` 737 `````` policy_V = policy_R + self.discount * policy_P.dot(Vprev) `````` Steven Cordwell committed Jan 24, 2013 738 739 `````` variation = absolute(policy_V - Vprev).max() `````` Steven Cordwell committed Jan 22, 2013 740 741 `````` if self.verbose: print(' %s %s') % (itr, variation) `````` Steven Cordwell committed Jan 26, 2013 742 743 744 `````` # ensure |Vn - Vpolicy| < epsilon if variation < ((1 - self.discount) / self.discount) * epsilon: `````` Steven Cordwell committed Jan 22, 2013 745 746 `````` done = True if self.verbose: `````` Steven Cordwell committed Jan 26, 2013 747 748 `````` print("PyMDPtoolbox: iterations stopped, epsilon-optimal " "value function.") `````` Steven Cordwell committed Jan 22, 2013 749 750 751 `````` elif itr == max_iter: done = True if self.verbose: `````` Steven Cordwell committed Jan 26, 2013 752 753 `````` print("PyMDPtoolbox: iterations stopped by maximum number " "of iteration condition.") `````` Steven Cordwell committed Jan 22, 2013 754 `````` `````` Steven Cordwell committed Jan 24, 2013 755 `````` self.V = policy_V `````` Steven Cordwell committed Jan 21, 2013 756 `````` `````` 757 `````` def _evalPolicyMatrix(self): `````` Steven Cordwell committed Jun 21, 2013 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 `````` # Evaluate the value function of the policy using linear equations. # # Arguments # --------- # Let S = number of states, A = number of actions # P(SxSxA) = transition matrix # P could be an array with 3 dimensions or a cell array (1xA), # each cell containing a matrix (SxS) possibly sparse # R(SxSxA) or (SxA) = reward matrix # R could be an array with 3 dimensions (SxSxA) or # a cell array (1xA), each cell containing a sparse matrix (SxS) or # a 2D array(SxA) possibly sparse # discount = discount rate in ]0; 1[ # policy(S) = a policy # # Evaluation # ---------- # Vpolicy(S) = value function of the policy # `````` 777 `````` Ppolicy, Rpolicy = self._computePpolicyPRpolicy() `````` Steven Cordwell committed Jan 22, 2013 778 `````` # V = PR + gPV => (I-gP)V = PR => V = inv(I-gP)* PR `````` 779 780 `````` self.V = self._lin_eq( (self._speye(self.S, self.S) - self.discount * Ppolicy), Rpolicy) `````` Steven Cordwell committed Nov 04, 2012 781 `````` `````` 782 `````` def _iterate(self): `````` Steven Cordwell committed Jun 21, 2013 783 784 `````` # Run the policy iteration algorithm. # If verbose the print a header `````` Steven Cordwell committed Jan 21, 2013 785 786 `````` if self.verbose: print(' Iteration Number_of_different_actions') `````` Steven Cordwell committed Jun 21, 2013 787 `````` # Set up the while stopping condition and the current time `````` Steven Cordwell committed Jan 24, 2013 788 `````` done = False `````` Steven Cordwell committed Jan 21, 2013 789 `````` self.time = time() `````` Steven Cordwell committed Jun 21, 2013 790 `````` # loop until a stopping condition is reached `````` Steven Cordwell committed Nov 04, 2012 791 `````` while not done: `````` Steven Cordwell committed Feb 10, 2013 792 `````` self.iter += 1 `````` 793 `````` # these _evalPolicy* functions will update the classes value `````` Steven Cordwell committed Jan 24, 2013 794 `````` # attribute `````` Steven Cordwell committed Jan 22, 2013 795 `````` if self.eval_type == "matrix": `````` 796 `````` self._evalPolicyMatrix() `````` Steven Cordwell committed Jan 22, 2013 797 `````` elif self.eval_type == "iterative": `````` 798 `````` self._evalPolicyIterative() `````` Steven Cordwell committed Jan 24, 2013 799 800 `````` # This should update the classes policy attribute but leave the # value alone `````` 801 `````` policy_next, null = self._bellmanOperator() `````` Steven Cordwell committed Jan 24, 2013 802 `````` del null `````` Steven Cordwell committed Jun 21, 2013 803 804 `````` # calculate in how many places does the old policy disagree with # the new policy `````` Steven Cordwell committed Jan 24, 2013 805 `````` n_different = (policy_next != self.policy).sum() `````` Steven Cordwell committed Jun 21, 2013 806 `````` # if verbose then continue printing a table `````` Steven Cordwell committed Jan 21, 2013 807 `````` if self.verbose: `````` Steven Cordwell committed Jan 26, 2013 808 809 `````` print(' %s %s') % (self.iter, n_different) `````` Steven Cordwell committed Jun 21, 2013 810 811 `````` # Once the policy is unchanging of the maximum number of # of iterations has been reached then stop `````` Steven Cordwell committed Jan 24, 2013 812 `````` if n_different == 0: `````` Steven Cordwell committed Nov 04, 2012 813 `````` done = True `````` Steven Cordwell committed Jan 24, 2013 814 `````` if self.verbose: `````` Steven Cordwell committed Jan 26, 2013 815 816 `````` print("PyMDPtoolbox: iterations stopped, unchanging " "policy found.") `````` Steven Cordwell committed Jan 24, 2013 817 818 819 `````` elif (self.iter == self.max_iter): done = True if self.verbose: `````` Steven Cordwell committed Jan 26, 2013 820 821 `````` print("PyMDPtoolbox: iterations stopped by maximum number " "of iteration condition.") `````` Steven Cordwell committed Jan 24, 2013 822 823 `````` else: self.policy = policy_next `````` Steven Cordwell committed Jun 21, 2013 824 `````` # update the time to return th computation time `````` Steven Cordwell committed Jan 21, 2013 825 `````` self.time = time() - self.time `````` Steven Cordwell committed Nov 04, 2012 826 `````` # store value and policy as tuples `````` 827 828 `````` self.V = tuple(self.V.tolist()) self.policy = tuple(self.policy.tolist()) `````` Steven Cordwell committed Oct 29, 2012 829 `````` `````` Steven Cordwell committed Jan 25, 2013 830 ``````class PolicyIterationModified(PolicyIteration): `````` 831 `````` `````` Steven Cordwell committed Jan 26, 2013 832 `````` """A discounted MDP solved using a modifified policy iteration algorithm. `````` Steven Cordwell committed Jan 20, 2013 833 834 835 836 837 `````` Arguments --------- Let S = number of states, A = number of actions P(SxSxA) = transition matrix `````` Steven Cordwell committed Jan 26, 2013 838 839 `````` P could be an array with 3 dimensions or a cell array (1xA), each cell containing a matrix (SxS) possibly sparse `````` Steven Cordwell committed Jan 20, 2013 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 `````` R(SxSxA) or (SxA) = reward matrix R could be an array with 3 dimensions (SxSxA) or a cell array (1xA), each cell containing a sparse matrix (SxS) or a 2D array(SxA) possibly sparse discount = discount rate, in ]0, 1[ policy0(S) = starting policy, optional max_iter = maximum number of iteration to be done, upper than 0, optional (default 1000) eval_type = type of function used to evaluate policy: 0 for mdp_eval_policy_matrix, else mdp_eval_policy_iterative optional (default 0) Data Attributes --------------- V(S) = value function policy(S) = optimal policy iter = number of done iterations cpu_time = used CPU time Notes ----- In verbose mode, at each iteration, displays the number of differents actions between policy n-1 and n Examples -------- `````` Steven Cordwell committed Aug 24, 2013 866 867 868 869 870 871 872 `````` >>> import mdptoolbox, mdptoolbox.example >>> P, R = mdptoolbox.example.forest() >>> pim = mdptoolbox.mdp.PolicyIterationModified(P, R, 0.9) >>> pim.policy FIXME >>> pim.V FIXME `````` 873 `````` `````` Steven Cordwell committed Oct 29, 2012 874 `````` """ `````` Steven Cordwell committed Jan 20, 2013 875 `````` `````` Steven Cordwell committed Jan 26, 2013 876 877 `````` def __init__(self, transitions, reward, discount, epsilon=0.01, max_iter=10): `````` Steven Cordwell committed Jan 26, 2013 878 `````` """Initialise a (modified) policy iteration MDP.""" `````` Steven Cordwell committed Jan 21, 2013 879 `````` `````` Steven Cordwell committed Jan 25, 2013 880 881 882 `````` # Maybe its better not to subclass from PolicyIteration, because the # initialisation of the two are quite different. eg there is policy0 # being calculated here which doesn't need to be. The only thing that `````` 883 `````` # is needed from the PolicyIteration class is the _evalPolicyIterative `````` Steven Cordwell committed Jan 25, 2013 884 `````` # function. Perhaps there is a better way to do it? `````` Steven Cordwell committed Jan 26, 2013 885 886 `````` PolicyIteration.__init__(self, transitions, reward, discount, None, max_iter, 1) `````` Steven Cordwell committed Jan 20, 2013 887 `````` `````` Steven Cordwell committed Jan 25, 2013 888 889 890 891 `````` # PolicyIteration doesn't pass epsilon to MDP.__init__() so we will # check it here if type(epsilon) in (int, float): if epsilon <= 0: `````` Steven Cordwell committed Jan 26, 2013 892 893 `````` raise ValueError("PyMDPtoolbox: epsilon must be greater than " "0.") `````` Steven Cordwell committed Jan 25, 2013 894 `````` else: `````` Steven Cordwell committed Jan 26, 2013 895 896 `````` raise ValueError("PyMDPtoolbox: epsilon must be a positive real " "number greater than zero.") `````` Steven Cordwell committed Jan 20, 2013 897 `````` `````` 898 899 `````` # computation of threshold of variation for V for an epsilon-optimal # policy `````` Steven Cordwell committed Jan 20, 2013 900 901 902 903 904 `````` if self.discount != 1: self.thresh = epsilon * (1 - self.discount) / self.discount else: self.thresh = epsilon `````` Steven Cordwell committed Jan 25, 2013 905 906 `````` self.epsilon = epsilon `````` Steven Cordwell committed Jan 20, 2013 907 `````` if discount == 1: `````` Steven Cordwell committed May 18, 2013 908 `````` self.V = zeros((self.S, 1)) `````` Steven Cordwell committed Jan 20, 2013 909 910 `````` else: # min(min()) is not right `````` Steven Cordwell committed Jan 25, 2013 911 `````` self.V = 1 / (1 - discount) * self.R.min() * ones((self.S, 1)) `````` 912 913 914 `````` # Call the iteration method self._iterate() `````` Steven Cordwell committed Jan 21, 2013 915 `````` `````` 916 `````` def _iterate(self): `````` Steven Cordwell committed Jan 26, 2013 917 `````` """Run the modified policy iteration algorithm.""" `````` Steven Cordwell committed Jan 20, 2013 918 919 920 921 `````` if self.verbose: print(' Iteration V_variation') `````` Steven Cordwell committed Jan 21, 2013 922 `````` self.time = time() `````` Steven Cordwell committed Jan 21, 2013 923 `````` `````` Steven Cordwell committed Jan 20, 2013 924 925 `````` done = False while not done: `````` Steven Cordwell committed Feb 10, 2013 926 `````` self.iter += 1 `````` Steven Cordwell committed Jan 20, 2013 927 `````` `````` 928 `````` self.policy, Vnext = self._bellmanOperator() `````` Steven Cordwell committed Jan 20, 2013 929 `````` #[Ppolicy, PRpolicy] = mdp_computePpolicyPRpolicy(P, PR, policy); `````` Steven Cordwell committed Jan 20, 2013 930 `````` `````` Steven Cordwell committed Jan 25, 2013 931 `````` variation = getSpan(Vnext - self.V) `````` Steven Cordwell committed Jan 20, 2013 932 933 934 `````` if self.verbose: print(" %s %s" % (self.iter, variation)) `````` Steven Cordwell committed Jan 24, 2013 935 `````` self.V = Vnext `````` Steven Cordwell committed Jan 21, 2013 936 `````` if variation < self.thresh: `````` Steven Cordwell committed Jan 20, 2013 937 938 939 940 `````` done = True else: is_verbose = False if self.verbose: `````` Steven Cordwell committed Jan 25, 2013 941 `````` self.setSilent() `````` Steven Cordwell committed Jan 20, 2013 942 943 `````` is_verbose = True `````` 944 `````` self._evalPolicyIterative(self.V, self.epsilon, self.max_iter) `````` Steven Cordwell committed Jan 20, 2013 945 946 `````` if is_verbose: `````` Steven Cordwell committed Jan 25, 2013 947 `````` self.setVerbose() `````` Steven Cordwell committed Jan 20, 2013 948 `````` `````` Steven Cordwell committed Jan 20, 2013 949 `````` self.time = time() - self.time `````` Steven Cordwell committed Jan 25, 2013 950 951 `````` # store value and policy as tuples `````` 952 953 `````` self.V = tuple(self.V.tolist()) self.policy = tuple(self.policy.tolist()) `````` Steven Cordwell committed Oct 29, 2012 954 955 `````` class QLearning(MDP): `````` 956 `````` `````` Steven Cordwell committed Jan 26, 2013 957 `````` """A discounted MDP solved using the Q learning algorithm. `````` Steven Cordwell committed Oct 30, 2012 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 `````` Let S = number of states, A = number of actions Parameters ---------- P : transition matrix (SxSxA) P could be an array with 3 dimensions or a cell array (1xA), each cell containing a sparse matrix (SxS) R : reward matrix(SxSxA) or (SxA) R could be an array with 3 dimensions (SxSxA) or a cell array (1xA), each cell containing a sparse matrix (SxS) or a 2D array(SxA) possibly sparse discount : discount rate in ]0; 1[ n_iter : number of iterations to execute (optional). `````` 973 974 `````` Default value = 10000; it is an integer greater than the default value. `````` Steven Cordwell committed Oct 30, 2012 975 976 977 978 979 `````` Results ------- Q : learned Q matrix (SxA) `````` Steven Cordwell committed Jan 24, 2013 980 `````` V : learned value function (S). `````` Steven Cordwell committed Oct 30, 2012 981 982 983 984 985 986 987 `````` policy : learned optimal policy (S). mean_discrepancy : vector of V discrepancy mean over 100 iterations Then the length of this vector for the default value of N is 100 (N/100). `````` Steven Cordwell committed Jan 25, 2013 988 `````` Examples `````` Steven Cordwell committed Oct 30, 2012 989 `````` --------- `````` Steven Cordwell committed Jun 21, 2013 990 991 992 `````` >>> # These examples are reproducible only if random seed is set to 0 in >>> # both the random and numpy.random modules. >>> import numpy as np `````` Steven Cordwell committed Aug 24, 2013 993 `````` >>> import mdptoolbox, mdptoolbox.example `````` Steven Cordwell committed Jun 21, 2013 994 `````` >>> np.random.seed(0) `````` Steven Cordwell committed Aug 24, 2013 995 996 `````` >>> P, R = mdptoolbox.example.forest() >>> ql = mdptoolbox.mdp.QLearning(P, R, 0.96) `````` Steven Cordwell committed Oct 30, 2012 997 `````` >>> ql.Q `````` Steven Cordwell committed Jun 21, 2013 998 999 1000 `````` array([[ 68.38037354, 43.24888454], [ 72.37777922, 42.75549145], [ 77.02892702, 64.68712932]]) `````` Steven Cordwell committed Jan 24, 2013 1001 `````` >>> ql.V `````` Steven Cordwell committed Jun 21, 2013 1002 `````` (68.38037354422798, 72.37777921607258, 77.02892701616531) `````` Steven Cordwell committed Oct 30, 2012 1003 `````` >>> ql.policy `````` Steven Cordwell committed Jan 25, 2013 1004 `````` (0, 0, 0) `````` Steven Cordwell committed Oct 30, 2012 1005 `````` `````` Steven Cordwell committed Aug 24, 2013 1006 `````` >>> import mdptoolbox `````` Steven Cordwell committed Oct 30, 2012 1007 1008 1009 `````` >>> import numpy as np >>> P = np.array([[[0.5, 0.5],[0.8, 0.2]],[[0, 1],[0.1, 0.9]]]) >>> R = np.array([[5, 10], [-1, 2]]) `````` Steven Cordwell committed Jun 21, 2013 1010 `````` >>> np.random.seed(0) `````` Steven Cordwell committed Aug 24, 2013 1011 `````` >>> pim = mdptoolbox.mdp.QLearning(P, R, 0.9) `````` Steven Cordwell committed Oct 30, 2012 1012 `````` >>> ql.Q `````` Steven Cordwell committed Jun 21, 2013 1013 1014 `````` array([[ 39.933691 , 43.17543338], [ 36.94394224, 35.42568056]]) `````` Steven Cordwell committed Jan 24, 2013 1015 `````` >>> ql.V `````` Steven Cordwell committed Jun 21, 2013 1016 `````` (43.17543338090149, 36.943942243204454) `````` Steven Cordwell committed Oct 30, 2012 1017 `````` >>> ql.policy `````` Steven Cordwell committed Jan 25, 2013 1018 `````` (1, 0) `````` 1019 `````` `````` Steven Cordwell committed Oct 29, 2012 1020 1021 1022 `````` """ def __init__(self, transitions, reward, discount, n_iter=10000): `````` Steven Cordwell committed Jan 26, 2013 1023 `````` """Initialise a Q-learning MDP.""" `````` Steven Cordwell committed Oct 29, 2012 1024 `````` `````` Steven Cordwell committed Jan 21, 2013 1025 1026 1027 `````` # The following check won't be done in MDP()'s initialisation, so let's # do it here if (n_iter < 10000): `````` Steven Cordwell committed Jan 26, 2013 1028 1029 `````` raise ValueError("PyMDPtoolbox: n_iter should be greater than " "10000.") `````` Steven Cordwell committed Oct 29, 2012 1030 `````` `````` 1031 `````` # We don't want to send this to MDP because _computePR should not be `````` Steven Cordwell committed Jan 25, 2013 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 `````` # run on it # MDP.__init__(self, transitions, reward, discount, None, n_iter) check(transitions, reward) if (transitions.dtype is object): self.P = transitions self.A = self.P.shape[0] self.S = self.P[0].shape[0] else: # convert to an object array self.A = transitions.shape[0] self.S = transitions.shape[1] self.P = zeros(self.A, dtype=object) for aa in range(self.A): self.P[aa] = transitions[aa, :, :] self.R = reward ``````