# -*- coding: utf-8 -*-
"""Markov Decision Process (MDP) Toolbox: ``mdp`` module
=====================================================

The ``mdp`` module provides classes for the resolution of discrete-time Markov
Decision Processes.

Available classes
-----------------
MDP
    Base Markov decision process class
FiniteHorizon
    Backwards induction finite horizon MDP
LP
    Linear programming MDP
PolicyIteration
    Policy iteration MDP
PolicyIterationModified
    Modified policy iteration MDP
QLearning
    Q-learning MDP
RelativeValueIteration
    Relative value iteration MDP
ValueIteration
    Value iteration MDP
ValueIterationGS
    Gauss-Seidel value iteration MDP

"""
# Copyright (c) 2011-2013 Steven A. W. Cordwell
# Copyright (c) 2009 INRA
# 
# All rights reserved.
# 
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# 
#   * Redistributions of source code must retain the above copyright notice,
#     this list of conditions and the following disclaimer.
#   * Redistributions in binary form must reproduce the above copyright notice,
#     this list of conditions and the following disclaimer in the documentation
#     and/or other materials provided with the distribution.
#   * Neither the name of the <ORGANIZATION> nor the names of its contributors
#     may be used to endorse or promote products derived from this software
#     without specific prior written permission.
# 
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

from math import ceil, log, sqrt
from time import time

from numpy import absolute, array, empty, mean, mod, multiply, ones, zeros
from numpy.random import randint, random
from scipy.sparse import csr_matrix as sparse

from .utils import check, getSpan

class MDP(object):
    
    """A Markov Decision Problem.
    
    Let ``S`` = the number of states, and ``A`` = the number of actions.
    
    Parameters
    ----------
    transitions : array
        Transition probability matrices. These can be defined in a variety of
        ways. The simplest is a numpy array that has the shape ``(A, S, S)``,
        though there are other possibilities. It can be a tuple or list or
        numpy object array of length ``A``, where each element contains a numpy
        array or matrix that has the shape ``(S, S)``. This "list of matrices"
        form is useful when the transition matrices are sparse as
        ``scipy.sparse.csr_matrix`` matrices can be used. In summary, each
        action's transition matrix must be indexable like ``transitions[a]``
        where ``a`` ∈ {0, 1...A-1}, and ``transitions[a]`` returns an ``S`` ×
        ``S`` array-like object.
    reward : array
        Reward matrices or vectors. Like the transition matrices, these can
        also be defined in a variety of ways. Again the simplest is a numpy
        array that has the shape ``(S, A)``, ``(S,)`` or ``(A, S, S)``. A list
        of lists can be used, where each inner list has length ``S`` and the
        outer list has length ``A``. A list of numpy arrays is possible where
        each inner array can be of the shape ``(S,)``, ``(S, 1)``, ``(1, S)``
        or ``(S, S)``. Also ``scipy.sparse.csr_matrix`` can be used instead of
        numpy arrays. In addition, the outer list can be replaced by any object
        that can be indexed like ``reward[a]`` such as a tuple or numpy object
        array of length ``A``.
    discount : float
        Discount factor. The per time-step discount factor on future rewards.
        Valid values are greater than 0 up to and including 1. If the discount
        factor is 1, then convergence cannot be assumed and a warning will
        be displayed. Subclasses of ``MDP`` may pass ``None`` in the case where
        the algorithm does not use a discount factor.
    epsilon : float
        Stopping criterion. The maximum change in the value function at each
        iteration is compared against ``epsilon``. Once the change falls below
        this value, then the value function is considered to have converged to
        the optimal value function. Subclasses of ``MDP`` may pass ``None`` in
        the case where the algorithm does not use an epsilon-optimal stopping
        criterion.
    max_iter : int
        Maximum number of iterations. The algorithm will be terminated once
        this many iterations have elapsed. This must be greater than 0 if
        specified. Subclasses of ``MDP`` may pass ``None`` in the case where
        the algorithm does not use a maximum number of iterations.
    
    Attributes
    ----------
    P : array
        Transition probability matrices.
    R : array
        Reward vectors.
    V : tuple
        The optimal value function. Each element is a float corresponding to
        the expected value of being in that state assuming the optimal policy
        is followed.
    discount : float
        The discount rate on future rewards.
    max_iter : int
        The maximum number of iterations.
    policy : tuple
        The optimal policy.
    time : float
        The time used to converge to the optimal policy.
    verbose : boolean
        Whether verbose output should be displayed or not.
    
    Methods
    -------
    run
        Implemented in child classes as the main algorithm loop. Raises an
        exception if it has not been overridden.
    setSilent
        Turn the verbosity off
    setVerbose
        Turn the verbosity on
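
    Examples
    --------
    A minimal sketch (illustrative data only, not from the toolbox
    documentation) of two of the accepted input forms described above: an
    object array of sparse ``(S, S)`` transition matrices together with an
    ``(S, A)`` reward array.

    >>> import numpy as np
    >>> import scipy.sparse as sp
    >>> S, A = 3, 2
    >>> transitions = np.empty(A, dtype=object)
    >>> transitions[0] = sp.csr_matrix(np.eye(S))
    >>> transitions[1] = sp.csr_matrix(np.full((S, S), 1.0 / S))
    >>> reward = np.ones((S, A))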
    
    """
    
    def __init__(self, transitions, reward, discount, epsilon, max_iter):
        # Initialise an MDP based on the input parameters.
        
        # if the discount is None then the algorithm is assumed to not use it
        # in its computations
        if discount is not None:
            self.discount = float(discount)
            assert 0.0 < self.discount <= 1.0, "Discount rate must be in ]0; 1]"
            if self.discount == 1:
                print("WARNING: check conditions of convergence. With no "
                      "discount, convergence cannot be assumed.")
        # if the max_iter is None then the algorithm is assumed to not use it
        # in its computations
        if max_iter is not None:
            self.max_iter = int(max_iter)
            assert self.max_iter > 0, "The maximum number of iterations " \
                                      "must be greater than 0."
        # check that epsilon is something sane
        if epsilon is not None:
            self.epsilon = float(epsilon)
            assert self.epsilon > 0, "Epsilon must be greater than 0."
        # we run a check on P and R to make sure they are describing an MDP. If
        # an exception isn't raised then they are assumed to be correct.
        check(transitions, reward)
        # computePR will assign the variables self.S, self.A, self.P and self.R
        self._computePR(transitions, reward)
        # the verbosity is by default turned off
        self.verbose = False
        # Initially the time taken to perform the computations is set to None
        self.time = None
        # set the initial iteration count to zero
        self.iter = 0
        # V should be stored as a vector ie shape of (S,) or (1, S)
        self.V = None
        # policy can also be stored as a vector
        self.policy = None
    
    def __repr__(self):
        P_repr = "P: \n"
        R_repr = "R: \n"
        for aa in range(self.A):
            P_repr += repr(self.P[aa]) + "\n"
            R_repr += repr(self.R[aa]) + "\n"
        return(P_repr + "\n" + R_repr)
    
    def _bellmanOperator(self, V=None):
        # Apply the Bellman operator on the value function.
        #
        # Updates the value function and the Vprev-improving policy.
        #
        # Returns: (policy, value), tuple of new policy and its value
        #
        # If V hasn't been sent into the method, then we assume to be working
        # on the objects V attribute
        if V is None:
            # this V should be a reference to the data rather than a copy
            V = self.V
        else:
            # make sure the user supplied V is of the right shape
            try:
                assert V.shape in ((self.S,), (1, self.S)), "V is not the " \
                    "right shape (Bellman operator)."
            except AttributeError:
                raise TypeError("V must be a numpy array or matrix.")
        # Looping through each action, the Q-value matrix is calculated.
        # P and V can be any object that supports indexing, so it is important
        # that you know they define a valid MDP before calling the
        # _bellmanOperator method. Otherwise the results will be meaningless.
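        # In vector form, the backup computed below for each action aa is
        #     Q[aa] = R[aa] + discount * P[aa] . V
        # i.e. Q[aa, s] = R[aa][s] + discount * sum_s' P[aa][s, s'] * V[s'].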
        Q = empty((self.A, self.S))
        for aa in range(self.A):
            Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)
        # Get the policy and value, for now it is being returned but...
        # Which way is better?
        # 1. Return, (policy, value)
        return (Q.argmax(axis=0), Q.max(axis=0))
        # 2. update self.policy and self.V directly
        # self.V = Q.max(axis=1)
        # self.policy = Q.argmax(axis=1)
    
    def _computeP(self, P):
        # Set self.P as a tuple of length A, with each element storing an S×S
        # matrix.
        self.A = len(P)
        try:
            if P.ndim == 3:
                self.S = P.shape[1]
            else:
                self.S = P[0].shape[0]
        except AttributeError:
            self.S = P[0].shape[0]
        # convert P to a tuple of numpy arrays
        self.P = tuple(P[aa] for aa in range(self.A))
    
    def _computePR(self, P, R):
        # Compute the reward for the system in one state choosing an action.
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        #    P(SxSxA)  = transition matrix 
        #        P could be an array with 3 dimensions or  a cell array (1xA), 
        #        each cell containing a matrix (SxS) possibly sparse
        #    R(SxSxA) or (SxA) = reward matrix
        #        R could be an array with 3 dimensions (SxSxA) or  a cell array 
        #        (1xA), each cell containing a sparse matrix (SxS) or a 2D 
        #        array(SxA) possibly sparse  
        # Evaluation
        # ----------
        #    PR(SxA)   = reward matrix
        #
        # We assume that P and R define an MDP, i.e. the assumption is that
        # check(P, R) has already been run and doesn't fail.
        #
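        # When R is given per transition, i.e. with shape (A, S, S), the
        # expected immediate reward used below is
        #     R[aa][s] = sum_s' P[aa][s, s'] * R[aa][s, s'],
        # which is what multiply(P[aa], R[aa]).sum(1) computes.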
        # First compute and store P, S and A
        self._computeP(P)
        # Set self.R as a tuple of length A, with each element storing a 1×S
        # vector.
        try:
            if R.ndim == 1:
                r = array(R).reshape(self.S)
                self.R = tuple(r for aa in range(self.A))
            elif R.ndim == 2:
                self.R = tuple(array(R[:, aa]).reshape(self.S)
                                for aa in range(self.A))
            else:
                self.R = tuple(multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                                for aa in range(self.A))
        except AttributeError:
            if len(R) == self.A:
                self.R = tuple(multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                                for aa in range(self.A))
            else:
                r = array(R).reshape(self.S)
                self.R = tuple(r for aa in range(self.A))
    
    def run(self):
        # Raise error because child classes should implement this function.
        raise NotImplementedError("You should create a run() method.")
    
    def setSilent(self):
        """Set the MDP algorithm to silent mode."""
        self.verbose = False
    
    def setVerbose(self):
        """Set the MDP algorithm to verbose mode."""
        self.verbose = True

class FiniteHorizon(MDP):
    
    """An MDP solved using the finite-horizon backwards induction algorithm.
    
    Parameters
    ----------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    N : int
        Number of periods. Must be greater than 0.
    h : array, optional
        Terminal reward. Default: a vector of zeros.
    
    Data Attributes
    ---------------
    V : array 
        Optimal value function. Shape = (S, N+1). ``V[:, n]`` = optimal value
        function at stage ``n`` with stage in {0, 1...N-1}. ``V[:, N]`` value
        function for terminal stage. 
    policy : array
        Optimal policy. ``policy[:, n]`` = optimal policy at stage ``n`` with
        stage in {0, 1...N-1}. ``policy[:, N-1]`` = policy for stage ``N-1``.
    time : float
        used CPU time
  
    Notes
    -----
    In verbose mode, displays the current stage and policy transpose.
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
    >>> fh.run()
    >>> fh.V
    array([[ 2.6973,  0.81  ,  0.    ,  0.    ],
           [ 5.9373,  3.24  ,  1.    ,  0.    ],
           [ 9.9373,  7.24  ,  4.    ,  0.    ]])
    >>> fh.policy
    array([[0, 0, 0],
           [0, 0, 1],
           [0, 0, 0]])
    
    """

    def __init__(self, transitions, reward, discount, N, h=None):
        # Initialise a finite horizon MDP.
        self.N = int(N)
        assert self.N > 0, 'PyMDPtoolbox: N must be greater than 0.'
        # Initialise the base class
        MDP.__init__(self, transitions, reward, discount, None, None)
        # remove the iteration counter, it is not meaningful for backwards
        # induction
        del self.iter
        # There are value vectors for each time step up to the horizon
        self.V = zeros((self.S, N + 1))
        # There are policy vectors for each time step before the horizon, when
        # we reach the horizon we don't need to make decisions anymore.
        self.policy = empty((self.S, N), dtype=int)
        # Set the reward for the final transition to h, if specified.
        if h is not None:
            self.V[:, N] = h
        # Call the iteration method
        #self.run()
        
    def run(self):
        # Run the finite horizon algorithm.
        self.time = time()
        # loop through each time period
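        # Backwards induction: starting from the terminal values in V[:, N],
        # each pass applies one Bellman backup,
        #     V[:, n] = max_a ( R_a + discount * P_a . V[:, n + 1] ),
        # and records the maximising action in policy[:, n].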
        for n in range(self.N):
            W, X = self._bellmanOperator(self.V[:, self.N - n])
            self.V[:, self.N - n - 1] = X
            self.policy[:, self.N - n - 1] = W
            if self.verbose:
                print(("stage: %s ... policy transpose : %s") % (
                    self.N - n, self.policy[:, self.N - n -1].tolist()))
        # update time spent running
        self.time = time() - self.time
        # After this we could create a tuple of tuples for the values and 
        # policies.
        #V = []
        #p = []
        #for n in xrange(self.N):
        #    V.append()
        #    p.append()
        #V.append()
        #self.V = tuple(V)
        #self.policy = tuple(p)

class LP(MDP):
    
    """A discounted MDP solved using linear programming.
    
    This class requires the Python ``cvxopt`` module to be installed.

    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    
    Data Attributes
    ---------------
    V : tuple
        optimal values
    policy : tuple
        optimal policy
    time : float
        used CPU time
    
    Notes
    -----
    In verbose mode, the ``cvxopt`` solver's progress output is enabled.
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> lp = mdptoolbox.mdp.LP(P, R, 0.9)
    >>> lp.run()
    
    """

    def __init__(self, transitions, reward, discount):
        # Initialise a linear programming MDP.
        # import some functions from cvxopt and set them as object methods
        try:
            from cvxopt import matrix, solvers
            self._linprog = solvers.lp
            self._cvxmat = matrix
        except ImportError:
            raise ImportError("The python module cvxopt is required to use "
                              "linear programming functionality.")
        # we also need diagonal matrices, and using a sparse one may be more
        # memory efficient
        from scipy.sparse import eye as speye
        self._speye = speye
        # initialise the MDP. epsilon and max_iter are not needed
        MDP.__init__(self, transitions, reward, discount, None, None)
        # Set the cvxopt solver to be quiet by default, but ...
        # this doesn't do what I want it to do c.f. issue #3
        if not self.verbose:
            solvers.options['show_progress'] = False
        # Call the iteration method
        #self.run()
    
    def run(self):
        # Run the linear programming algorithm.
        self.time = time()
        # The objective is to resolve : min V / V >= PR + discount*P*V
        # The function linprog of the optimisation Toolbox of Mathworks
        # resolves :
        # min f'* x / M * x <= b
        # So the objective could be expressed as :
        # min V / (discount*P-I) * V <= - PR
        # To avoid loop on states, the matrix M is structured following actions
        # M(A*S,S)
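        # Concretely, block aa of M (rows aa*S to (aa+1)*S - 1) holds
        # discount * P[aa] - I and h stacks the action-wise reward vectors, so
        # the single constraint M * V <= -h enforces
        #     V >= R_a + discount * P_a . V   for every action a.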
        f = self._cvxmat(ones((self.S, 1)))
        h = self._cvxmat(self.R.reshape(self.S * self.A, 1, order="F"), tc='d')
        M = zeros((self.A * self.S, self.S))
        for aa in range(self.A):
            pos = (aa + 1) * self.S
            M[(pos - self.S):pos, :] = (
                self.discount * self.P[aa] - self._speye(self.S, self.S))
        M = self._cvxmat(M)
        # Using the glpk option will make this behave more like Octave
        # (Octave uses glpk) and perhaps Matlab. If solver=None (ie using the 
        # default cvxopt solver) then V agrees with the Octave equivalent
        # only to 10e-8 places. This assumes glpk is installed of course.
        self.V = array(self._linprog(f, M, -h, solver='glpk')['x'])
        # apply the Bellman operator
        self.policy, self.V = self._bellmanOperator()
        # update the time spent solving
        self.time = time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class PolicyIteration(MDP):
    
    """A discounted MDP solved using the policy iteration algorithm.
    
    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details. 
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    policy0 : array, optional
        Starting policy.
    max_iter : int, optional
        Maximum number of iterations. See the documentation for the ``MDP``
        class for details. Default is 1000.
    eval_type : int or string, optional
        Type of function used to evaluate policy. 0 or "matrix" to solve as a
        set of linear equations. 1 or "iterative" to solve iteratively.
        Default: 0.
    
    Data Attributes
    ---------------
    V : tuple
        value function 
    policy : tuple
        optimal policy
    iter : int
        number of done iterations
    time : float
        used CPU time
    
    Notes
    -----
    In verbose mode, at each iteration, displays the number
    of different actions between policy n-1 and n.
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.rand()
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()
    
    >>> P, R = mdptoolbox.example.forest()
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()
    >>> pi.V
    (26.244000000000018, 29.48400000000002, 33.484000000000016)
    >>> pi.policy
    (0, 0, 0)
    """
    
    def __init__(self, transitions, reward, discount, policy0=None,
                 max_iter=1000, eval_type=0):
        # Initialise a policy iteration MDP.
        #
        # Set up the MDP, but don't need to worry about epsilon values
        MDP.__init__(self, transitions, reward, discount, None, max_iter)
        # Check if the user has supplied an initial policy. If not make one.
        if policy0 is None:
            # Initialise the policy to the one which maximises the expected
            # immediate reward
            null = zeros(self.S)
            self.policy, null = self._bellmanOperator(null)
            del null
        else:
            # Use the policy that the user supplied
            # Make sure it is a numpy array
            policy0 = array(policy0)
            # Make sure the policy is the right size and shape
            assert policy0.shape in ((self.S, ), (self.S, 1), (1, self.S)), \
                "'policy0' must be a vector with length S."
            # reshape the policy to be a vector
            policy0 = policy0.reshape(self.S)
            # The policy can only contain integers between 0 and S-1
            msg = "'policy0' must be a vector of integers between 0 and S-1."
            assert not mod(policy0, 1).any(), msg
            assert (policy0 >= 0).all(), msg
            assert (policy0 < self.S).all(), msg
            self.policy = policy0
        # set the initial values to zero
        self.V = zeros(self.S)
        # Do some setup depending on the evaluation type
        if eval_type in (0, "matrix"):
            from numpy.linalg import solve
            from scipy.sparse import eye
            self._speye = eye
            self._lin_eq = solve
            self.eval_type = "matrix"
        elif eval_type in (1, "iterative"):
            self.eval_type = "iterative"
        else:
            raise ValueError("PyMDPtoolbox: eval_type should be 0 for matrix "
                             "evaluation or 1 for iterative evaluation. "
                             "The strings 'matrix' and 'iterative' can also "
                             "be used.")
        # Call the iteration method
        #self.run()
    
    def _computePpolicyPRpolicy(self):
        # Compute the transition matrix and the reward matrix for a policy.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix 
        #     P could be an array with 3 dimensions or a cell array (1xA),
        #     each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #     R could be an array with 3 dimensions (SxSxA) or 
        #     a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #     a 2D array(SxA) possibly sparse  
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Ppolicy(SxS)  = transition matrix for policy
        # PRpolicy(S)   = reward matrix for policy
        #
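        # In effect, row s of Ppolicy is P[policy[s]][s, :] and entry s of
        # Rpolicy is R[policy[s]][s]; the loop below fills them in action by
        # action so that individual states never need to be iterated over.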
        Ppolicy = empty((self.S, self.S))
        Rpolicy = zeros(self.S)
        for aa in range(self.A): # avoid looping over S
            # the rows that use action a.
            ind = (self.policy == aa).nonzero()[0]
            # if no rows use action a, then no need to assign this
            if ind.size > 0:
                try:
                    Ppolicy[ind, :] = self.P[aa][ind, :]
                except ValueError:
                    Ppolicy[ind, :] = self.P[aa][ind, :].todense()
                #PR = self._computePR() # an apparently unneeded line, and
                # perhaps harmful in this implementation c.f.
                # mdp_computePpolicyPRpolicy.m
                Rpolicy[ind] = self.R[aa][ind]
        # self.R cannot be sparse with the code in its current condition, but
        # it should be possible in the future. Also, if R is so big that it's
        # a good idea to use a sparse matrix for it, then converting PRpolicy
        # from a dense to sparse matrix doesn't seem very memory efficient
        if type(self.R) is sparse:
            Rpolicy = sparse(Rpolicy)
        #self.Ppolicy = Ppolicy
        #self.Rpolicy = Rpolicy
        return (Ppolicy, Rpolicy)
    
    def _evalPolicyIterative(self, V0=0, epsilon=0.0001, max_iter=10000):
        # Evaluate a policy using iteration.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix 
        #    P could be an array with 3 dimensions or 
        #    a cell array (1xS), each cell containing a matrix possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #    R could be an array with 3 dimensions (SxSxA) or 
        #    a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #    a 2D array(SxA) possibly sparse  
        # discount  = discount rate in ]0; 1[
        # policy(S) = a policy
        # V0(S)     = starting value function, optional (default : zeros(S,1))
        # epsilon   = epsilon-optimal policy search, greater than 0,
        #    optional (default : 0.0001)
        # max_iter  = maximum number of iterations to be done, greater than 0,
        #    optional (default : 10000)
        #    
        # Evaluation
        # ----------
        # Vpolicy(S) = value function, associated to a specific policy
        #
        # Notes
        # -----
        # In verbose mode, at each iteration, displays the condition which
        # stopped iterations: epsilon-optimum value function found or maximum
        # number of iterations reached.
        #
        try:
            assert V0.shape in ((self.S, ), (self.S, 1), (1, self.S)), \
                "'V0' must be a vector of length S."
            policy_V = array(V0).reshape(self.S)
        except AttributeError:
            if len(V0) == self.S:
                policy_V = array(V0).reshape(self.S)
            else:
                policy_V = zeros(self.S)
        
        policy_P, policy_R = self._computePpolicyPRpolicy()
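        # The evaluation below repeatedly applies
        #     V_{n+1} = R_policy + discount * P_policy . V_n
        # until the largest componentwise change falls below the
        # epsilon-based threshold or max_iter sweeps have been done.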
        
        if self.verbose:
            print('  Iteration    V_variation')
        
        itr = 0
        done = False
        while not done:
            itr += 1
            
            Vprev = policy_V
            policy_V = policy_R + self.discount * policy_P.dot(Vprev)
            
            variation = absolute(policy_V - Vprev).max()
            if self.verbose:
                print(('      %s         %s') % (itr, variation))
            
            # ensure |Vn - Vpolicy| < epsilon
            if variation < ((1 - self.discount) / self.discount) * epsilon:
                done = True
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped, epsilon-optimal "
                          "value function.")
            elif itr == max_iter:
                done = True
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped by maximum number "
                          "of iteration condition.")
        
        self.V = policy_V
    
    def _evalPolicyMatrix(self):
        # Evaluate the value function of the policy using linear equations.
        #
        # Arguments 
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA) = transition matrix 
        #      P could be an array with 3 dimensions or a cell array (1xA),
        #      each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #      R could be an array with 3 dimensions (SxSxA) or 
        #      a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #      a 2D array(SxA) possibly sparse  
        # discount = discount rate in ]0; 1[
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Vpolicy(S) = value function of the policy
        #
        Ppolicy, Rpolicy = self._computePpolicyPRpolicy()
        # V = PR + gPV  => (I-gP)V = PR  => V = inv(I-gP)* PR
        self.V = self._lin_eq(
            (self._speye(self.S, self.S) - self.discount * Ppolicy), Rpolicy)
    
    def run(self):
        # Run the policy iteration algorithm.
        # If verbose then print a header
        if self.verbose:
            print('  Iteration  Number_of_different_actions')
        # Set up the while stopping condition and the current time
        done = False
        self.time = time()
        # loop until a stopping condition is reached
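        # Each sweep: (1) evaluate the current policy (matrix or iterative
        # evaluation), (2) improve it greedily with one Bellman backup, and
        # (3) stop once the greedy policy no longer changes or max_iter is hit.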
        while not done:
            self.iter += 1
            # these _evalPolicy* functions will update the class's value
            # attribute
            if self.eval_type == "matrix":
                self._evalPolicyMatrix()
            elif self.eval_type == "iterative":
                self._evalPolicyIterative()
            # This should update the classes policy attribute but leave the
            # value alone
            policy_next, null = self._bellmanOperator()
            del null
            # calculate in how many places does the old policy disagree with
            # the new policy
            n_different = (policy_next != self.policy).sum()
            # if verbose then continue printing a table
            if self.verbose:
                print(('       %s                 %s') % (self.iter,
                                                         n_different))
            # Once the policy is unchanging or the maximum number of
            # iterations has been reached then stop
            if n_different == 0:
                done = True
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped, unchanging "
                          "policy found.")
            elif (self.iter == self.max_iter):
                done = True 
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped by maximum number "
                          "of iteration condition.")
            else:
                self.policy = policy_next
        # update the time to return the computation time
        self.time = time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class PolicyIterationModified(PolicyIteration):
    
    """A discounted MDP solved using a modified policy iteration algorithm.
    
    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    epsilon : float, optional
        Stopping criterion. See the documentation for the ``MDP`` class for
        details. Default: 0.01.
    max_iter : int, optional
        Maximum number of iterations. See the documentation for the ``MDP``
        class for details. Default is 10.
    
    Data Attributes
    ---------------
    V : tuple
        value function 
    policy : tuple
        optimal policy
    iter : int
        number of done iterations
    time : float
        used CPU time
    
    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> pim = mdptoolbox.mdp.PolicyIterationModified(P, R, 0.9)
    >>> pim.run()
    >>> pim.policy
    (0, 0, 0)
    >>> pim.V
    (21.81408652334702, 25.054086523347017, 29.054086523347017)
    
    """
    
    def __init__(self, transitions, reward, discount, epsilon=0.01,
                 max_iter=10):
        # Initialise a (modified) policy iteration MDP.
        
        # Maybe its better not to subclass from PolicyIteration, because the
        # initialisation of the two are quite different. eg there is policy0
        # being calculated here which doesn't need to be. The only thing that
        # is needed from the PolicyIteration class is the _evalPolicyIterative
        # function. Perhaps there is a better way to do it?
        PolicyIteration.__init__(self, transitions, reward, discount, None,
                                 max_iter, 1)
        
        # PolicyIteration doesn't pass epsilon to MDP.__init__() so we will
        # check it here
        self.epsilon = float(epsilon)
        assert epsilon > 0, "'epsilon' must be greater than 0."
        
        # computation of threshold of variation for V for an epsilon-optimal
        # policy
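        # (standard stopping rule: once the span of successive value functions
        # drops below epsilon * (1 - discount) / discount the greedy policy is
        # epsilon-optimal; with no discounting the raw epsilon is used instead)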
        if self.discount != 1:
            self.thresh = self.epsilon * (1 - self.discount) / self.discount
        else:
            self.thresh = self.epsilon
        
        if self.discount == 1:
            self.V = zeros((self.S, 1))
        else:
            Rmin = min(R.min() for R in self.R)
            self.V = 1 / (1 - self.discount) * Rmin * ones((self.S,))
        
        # Call the iteration method
        #self.run()
    
    def run(self):
        # Run the modified policy iteration algorithm.
        
        if self.verbose:
            print('\tIteration\tV-variation')
        
        self.time = time()
        
        done = False
        while not done:
            self.iter += 1
            
            self.policy, Vnext = self._bellmanOperator()
            #[Ppolicy, PRpolicy] = mdp_computePpolicyPRpolicy(P, PR, policy);
            
            variation = getSpan(Vnext - self.V)
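            # getSpan measures the span (max minus min) of the change in the
            # value function, which is the quantity that the epsilon-optimal
            # stopping threshold self.thresh is defined for.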
            if self.verbose:
                print(("\t%s\t%s" % (self.iter, variation)))
            
            self.V = Vnext
            if variation < self.thresh:
                done = True
            else:
                is_verbose = False
                if self.verbose:
                    self.setSilent()
                    is_verbose = True
                
                self._evalPolicyIterative(self.V, self.epsilon, self.max_iter)
                
                if is_verbose:
                    self.setVerbose()
        
        self.time = time() - self.time
        
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class QLearning(MDP):
    
    """A discounted MDP solved using the Q learning algorithm.
    
    Parameters
    ----------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details. 
    n_iter : int, optional
        Number of iterations to execute. This is ignored unless it is an 
        integer greater than the default value. Default: 10,000.
    
    Data Attributes
    ---------------
    Q : array
        learned Q matrix (SxA)     
    V : tuple
        learned value function (S).    
    policy : tuple
        learned optimal policy (S).    
    mean_discrepancy : array
        Vector of mean V discrepancies, one entry per 100 iterations. The
        length of this vector for the default value of N is 100 (N/100).

    Examples