# -*- coding: utf-8 -*-

"""Markov Decision Process (MDP) Toolbox: ``mdp`` module
=====================================================

The ``mdp`` module provides classes for the resolution of discrete-time Markov
Decision Processes.

Available classes
-----------------
MDP
    Base Markov decision process class
FiniteHorizon
    Backwards induction finite horizon MDP
LP
    Linear programming MDP
PolicyIteration
    Policy iteration MDP
PolicyIterationModified
    Modified policy iteration MDP
QLearning
    Q-learning MDP
RelativeValueIteration
    Relative value iteration MDP
ValueIteration
    Value iteration MDP
ValueIterationGS
    Gauss-Seidel value iteration MDP

"""

# Copyright (c) 2011-2013 Steven A. W. Cordwell
# Copyright (c) 2009 INRA
# 
# All rights reserved.
# 
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# 
#   * Redistributions of source code must retain the above copyright notice,
#     this list of conditions and the following disclaimer.
#   * Redistributions in binary form must reproduce the above copyright notice,
#     this list of conditions and the following disclaimer in the documentation
#     and/or other materials provided with the distribution.
#   * Neither the name of the <ORGANIZATION> nor the names of its contributors
#     may be used to endorse or promote products derived from this software
#     without specific prior written permission.
# 
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

from math import ceil, log, sqrt
from time import time

from numpy import absolute, array, empty, mean, mod, multiply, ones, zeros
from numpy.random import randint, random
from scipy.sparse import csr_matrix as sparse

from .utils import check, getSpan

class MDP(object):

    """A Markov Decision Problem.

    Let S = the number of states, and A = the number of actions.

    Parameters
    ----------
    transitions : array
        Transition probability matrices. These can be defined in a variety of
        ways. The simplest is a numpy array that has the shape (A, S, S),
        though there are other possibilities. It can be a tuple or list or
        numpy object array of length A, where each element contains a numpy
        array or matrix that has the shape (S, S). This "list of matrices" form
        is useful when the transition matrices are sparse as
        scipy.sparse.csr_matrix matrices can be used. In summary, each action's
        transition matrix must be indexable like ``P[a]`` where
        ``a`` ∈ {0, 1...A-1}. A sketch of the simplest form is given in the
        Examples section below.
    reward : array
        Reward matrices or vectors. Like the transition matrices, these can
        also be defined in a variety of ways. Again the simplest is a numpy
        array that has the shape (S, A), (S,) or (A, S, S). A list of lists can
        be used, where each inner list has length S. A list of numpy arrays is
        possible where each inner array can be of the shape (S,), (S, 1),
        (1, S) or (S, S). Also scipy.sparse.csr_matrix can be used instead of
        numpy arrays. In addition, the outer list can be replaced with a tuple
        or a numpy object array.
    discount : float
        Discount factor. The per time-step discount factor on future rewards.
        Valid values are greater than 0 up to and including 1. If the discount
        factor is 1, then convergence cannot be assumed and a warning will be
        displayed. Subclasses of ``MDP`` may pass None in the case where the
        algorithm does not use a discount factor.
    epsilon : float
        Stopping criterion. The maximum change in the value function at each
        iteration is compared against ``epsilon``. Once the change falls below
        this value, then the value function is considered to have converged to
        the optimal value function. Subclasses of ``MDP`` may pass None in the
        case where the algorithm does not use a stopping criterion.
    max_iter : int
        Maximum number of iterations. The algorithm will be terminated once
        this many iterations have elapsed. This must be greater than 0 if
        specified. Subclasses of ``MDP`` may pass None in the case where the
        algorithm does not use a maximum number of iterations.

    Attributes
    ----------
    P : array
        Transition probability matrices.
    R : array
        Reward vectors.
    V : tuple
        The optimal value function.
    discount : float
        The discount rate on future rewards.
    max_iter : int
        The maximum number of iterations.
    policy : tuple
        The optimal policy.
    time : float
        The time used to converge to the optimal policy.
    verbose : boolean
        Whether verbose output should be displayed or not.

    Methods
    -------
    run
        Implemented in child classes as the main algorithm loop. Raises an
        exception if it has not been overridden.
    setSilent
        Turn the verbosity off
    setVerbose
        Turn the verbosity on
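
    Examples
    --------
    A minimal sketch of the simplest input layout, for illustration only. The
    values are the same two-state, two-action problem used in the
    ``QLearning`` example later in this module: an (A, S, S) transition array
    and an (S, A) reward array.

    >>> import numpy as np
    >>> P = np.array([[[0.5, 0.5], [0.8, 0.2]],
    ...               [[0.0, 1.0], [0.1, 0.9]]])  # P[a] is the S x S matrix for action a
    >>> R = np.array([[5, 10], [-1, 2]])          # R[s, a] = reward for action a in state s
    >>> P.shape, R.shape
    ((2, 2, 2), (2, 2))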
    
    """

    def __init__(self, transitions, reward, discount, epsilon, max_iter):
        # Initialise an MDP based on the input parameters.

        # if the discount is None then the algorithm is assumed to not use it
        # in its computations
        if discount is not None:
            self.discount = float(discount)
            assert 0.0 < self.discount <= 1.0, "Discount rate must be in ]0; 1]"
            if self.discount == 1:
                print("PyMDPtoolbox WARNING: check conditions of convergence. "
                      "With no discount, convergence is not always assumed.")
        # if the max_iter is None then the algorithm is assumed to not use it
        # in its computations
        if max_iter is not None:
            self.max_iter = int(max_iter)
            assert self.max_iter > 0, "The maximum number of iterations " \
                                      "must be greater than 0."
        # check that epsilon is something sane
        if epsilon is not None:
            self.epsilon = float(epsilon)
            assert self.epsilon > 0, "Epsilon must be greater than 0."
        # we run a check on P and R to make sure they are describing an MDP.
        # If an exception isn't raised then they are assumed to be correct.
        check(transitions, reward)
        # computePR will assign the variables self.S, self.A, self.P and self.R
        self._computePR(transitions, reward)
        # the verbosity is by default turned off
        self.verbose = False
        # Initially the time taken to perform the computations is set to None
        self.time = None
        # set the initial iteration count to zero
        self.iter = 0
        # V should be stored as a vector, i.e. with shape (S,) or (1, S)
        self.V = None
        # policy can also be stored as a vector
        self.policy = None

    def __repr__(self):
        P_repr = "P: \n"
        R_repr = "R: \n"
        for aa in range(self.A):
            P_repr += repr(self.P[aa]) + "\n"
            R_repr += repr(self.R[aa]) + "\n"
        return P_repr + "\n" + R_repr

    def _bellmanOperator(self, V=None):
        # Apply the Bellman operator on the value function.
        #
        # Updates the value function and the Vprev-improving policy.
        #
        # Returns: (policy, value), tuple of new policy and its value
        #
        # If V hasn't been sent into the method, then we assume to be working
        # on the object's V attribute
        if V is None:
            # this V should be a reference to the data rather than a copy
            V = self.V
        else:
            # make sure the user supplied V is of the right shape
            try:
                assert V.shape in ((self.S,), (1, self.S)), "V is not the " \
                    "right shape (Bellman operator)."
            except AttributeError:
                raise TypeError("V must be a numpy array or matrix.")
        # Looping through each action, the Q-value matrix is calculated.
        # P and V can be any object that supports indexing, so it is important
        # that you know they define a valid MDP before calling the
        # _bellmanOperator method. Otherwise the results will be meaningless.
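        # For each action a the update below computes
        #     Q[a, s] = R[a][s] + discount * sum_{s'} P[a][s, s'] * V[s'],
        # i.e. the expected immediate reward plus the discounted expected
        # value of the successor state.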
        Q = empty((self.A, self.S))
        for aa in range(self.A):
            Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)
        # Get the policy and value, for now it is being returned but...
        # Which way is better?
        # 1. Return, (policy, value)
        return (Q.argmax(axis=0), Q.max(axis=0))
        # 2. update self.policy and self.V directly
        # self.V = Q.max(axis=1)
        # self.policy = Q.argmax(axis=1)

    def _computeP(self, P):
        # Set self.P as a tuple of length A, with each element storing an S×S
        # matrix.
        self.A = len(P)
        try:
            if P.ndim == 3:
                self.S = P.shape[1]
            else:
                self.S = P[0].shape[0]
        except AttributeError:
            self.S = P[0].shape[0]
        # convert P to a tuple of numpy arrays
        self.P = tuple(P[aa] for aa in range(self.A))

    def _computePR(self, P, R):
        # Compute the reward for the system in one state choosing an action.
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        #    P(SxSxA)  = transition matrix
        #        P could be an array with 3 dimensions or a cell array (1xA),
        #        each cell containing a matrix (SxS) possibly sparse
        #    R(SxSxA) or (SxA) = reward matrix
        #        R could be an array with 3 dimensions (SxSxA) or a cell array
        #        (1xA), each cell containing a sparse matrix (SxS) or a 2D
        #        array (SxA) possibly sparse
        # Evaluation
        # ----------
        #    PR(SxA)   = reward matrix
        #
        # We assume that P and R define a MDP, i.e. the assumption is that
        # check(P, R) has already been run and doesn't fail.
        #
        # First compute and store P, S, and A
        self._computeP(P)
        # Set self.R as a tuple of length A, with each element storing a 1×S
        # vector.
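        # When R is three-dimensional (or a per-action list of S×S matrices),
        # the expected immediate reward is taken:
        #     R[a][s] = sum_{s'} P[a][s, s'] * R_input[a][s, s']
        # which is what the multiply(...).sum(1) expressions below compute.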
        try:
            if R.ndim == 1:
                r = array(R).reshape(self.S)
                self.R = tuple(r for aa in range(self.A))
            elif R.ndim == 2:
                self.R = tuple(array(R[:, aa]).reshape(self.S)
                               for aa in range(self.A))
            else:
                self.R = tuple(multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                               for aa in range(self.A))
        except AttributeError:
            if len(R) == self.A:
                self.R = tuple(multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                               for aa in range(self.A))
            else:
                r = array(R).reshape(self.S)
                self.R = tuple(r for aa in range(self.A))

    def run(self):
        # Raise error because child classes should implement this function.
        raise NotImplementedError("You should create a run() method.")

    def setSilent(self):
        """Set the MDP algorithm to silent mode."""
        self.verbose = False

    def setVerbose(self):
        """Set the MDP algorithm to verbose mode."""
        self.verbose = True

class FiniteHorizon(MDP):

    """An MDP solved using the finite-horizon backwards induction algorithm.

    Let S = number of states, A = number of actions

    Parameters
    ----------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    N : int
        Number of periods, must be greater than 0.
    h : array, optional
        Terminal reward, a vector of length S (default: a vector of zeros).

    Attributes
    ----------
    V : array of shape (S, N+1)
        Optimal value function. ``V[:, n]`` is the optimal value function at
        stage ``n``, with stage in 0, ..., N-1; ``V[:, N]`` is the value
        function for the terminal stage.
    policy : array of shape (S, N)
        Optimal policy. ``policy[:, n]`` is the optimal policy at stage ``n``.
    time : float
        Used CPU time.

    Methods
    -------
    run
        Run the finite horizon algorithm.

    Notes
    -----
    In verbose mode, displays the current stage and policy transpose.

    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
    >>> fh.run()
    >>> fh.V
    array([[ 2.6973,  0.81  ,  0.    ,  0.    ],
           [ 5.9373,  3.24  ,  1.    ,  0.    ],
           [ 9.9373,  7.24  ,  4.    ,  0.    ]])
    >>> fh.policy
    array([[0, 0, 0],
           [0, 0, 1],
           [0, 0, 0]])

    """

    def __init__(self, transitions, reward, discount, N, h=None):
        # Initialise a finite horizon MDP.
        self.N = int(N)
        assert self.N > 0, 'PyMDPtoolbox: N must be greater than 0.'
        # Initialise the base class
        MDP.__init__(self, transitions, reward, discount, None, None)
        # remove the iteration counter, it is not meaningful for backwards
        # induction
        del self.iter
        # There are value vectors for each time step up to the horizon
        self.V = zeros((self.S, N + 1))
        # There are policy vectors for each time step before the horizon, when
        # we reach the horizon we don't need to make decisions anymore.
        self.policy = empty((self.S, N), dtype=int)
        # Set the reward for the final transition to h, if specified.
        if h is not None:
            self.V[:, N] = h
        # Call the iteration method
        #self.run()

    def run(self):
        # Run the finite horizon algorithm.
        self.time = time()
        # loop through each time period
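        # Backwards induction: starting from the terminal values V[:, N],
        # each pass computes
        #     V[:, N-n-1] = max_a ( R[a] + discount * P[a] V[:, N-n] )
        # together with the maximising policy for that stage.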
        for n in range(self.N):
            W, X = self._bellmanOperator(self.V[:, self.N - n])
            self.V[:, self.N - n - 1] = X
            self.policy[:, self.N - n - 1] = W
            if self.verbose:
                print(("stage: %s ... policy transpose : %s") % (
                    self.N - n, self.policy[:, self.N - n - 1].tolist()))
        # update time spent running
        self.time = time() - self.time
        # After this we could create a tuple of tuples for the values and 
        # policies.
        #V = []
        #p = []
        #for n in xrange(self.N):
        #    V.append()
        #    p.append()
        #V.append()
        #self.V = tuple(V)
        #self.policy = tuple(p)

class LP(MDP):

    """A discounted MDP solved using linear programming.

    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.

    Evaluation
    ----------
    V(S)   = optimal values
    policy(S) = optimal policy
    cpu_time = used CPU time

    Notes
    -----
    In verbose mode, the progress output of the cvxopt solver is displayed.

    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> lp = mdptoolbox.mdp.LP(P, R, 0.9)
    >>> lp.run()

    """

    def __init__(self, transitions, reward, discount):
        # Initialise a linear programming MDP.
        # import some functions from cvxopt and set them as object methods
        try:
            from cvxopt import matrix, solvers
            self._linprog = solvers.lp
            self._cvxmat = matrix
        except ImportError:
            raise ImportError("The python module cvxopt is required to use "
                              "linear programming functionality.")
        # we also need diagonal matrices, and using a sparse one may be more
        # memory efficient
        from scipy.sparse import eye as speye
        self._speye = speye
        # initialise the MDP. epsilon and max_iter are not needed
        MDP.__init__(self, transitions, reward, discount, None, None)
        # Set the cvxopt solver to be quiet by default, but ...
        # this doesn't do what I want it to do c.f. issue #3
        if not self.verbose:
            solvers.options['show_progress'] = False
        # Call the iteration method
        #self.run()

    def run(self):
        # Run the linear programming algorithm.
        self.time = time()
        # The objective is to resolve : min V / V >= PR + discount*P*V
        # The function linprog of the optimisation Toolbox of Mathworks
        # resolves :
        # min f'* x / M * x <= b
        # So the objective could be expressed as :
        # min V / (discount*P-I) * V <= - PR
        # To avoid loop on states, the matrix M is structured following actions
        # M(A*S,S)
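        # As a sketch: the constraint matrix stacks one block per action,
        #     M = [ discount * P[0] - I ]
        #         [ discount * P[1] - I ]
        #         [          ...        ]
        # and the LP solved below is  min 1'V  subject to  M V <= -PR.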
        f = self._cvxmat(ones((self.S, 1)))
        h = self._cvxmat(self.R.reshape(self.S * self.A, 1, order="F"), tc='d')
        M = zeros((self.A * self.S, self.S))
        for aa in range(self.A):
            pos = (aa + 1) * self.S
            M[(pos - self.S):pos, :] = (
                self.discount * self.P[aa] - self._speye(self.S, self.S))
        M = self._cvxmat(M)
        # Using the glpk option will make this behave more like Octave
        # (Octave uses glpk) and perhaps Matlab. If solver=None (ie using the
        # default cvxopt solver) then V agrees with the Octave equivalent
        # only to 10e-8 places. This assumes glpk is installed of course.
        self.V = array(self._linprog(f, M, -h, solver='glpk')['x'])
        # apply the Bellman operator
        self.policy, self.V = self._bellmanOperator()
        # update the time spent solving
        self.time = time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class PolicyIteration(MDP):

    """A discounted MDP solved using the policy iteration algorithm.

    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    policy0 : array, optional
        Starting policy.
    max_iter : int
        Maximum number of iterations. See the documentation for the ``MDP``
        class for details. Default is 1000.
    eval_type : int or string, optional
        Type of function used to evaluate the policy: 0 or "matrix" for
        evaluation by solving linear equations, 1 or "iterative" for iterative
        policy evaluation. Default: 0.

    Evaluation
    ----------
    V(S)   = value function
    policy(S) = optimal policy
    iter     = number of done iterations
    cpu_time = used CPU time

    Notes
    -----
    In verbose mode, at each iteration, displays the number of different
    actions between policy n-1 and n

    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.rand()
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()

    >>> P, R = mdptoolbox.example.forest()
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()
    >>> pi.V
    (26.244000000000018, 29.48400000000002, 33.484000000000016)
    >>> pi.policy
    (0, 0, 0)
    """

    def __init__(self, transitions, reward, discount, policy0=None,
                 max_iter=1000, eval_type=0):
        # Initialise a policy iteration MDP.
        #
        # Set up the MDP, but don't need to worry about epsilon values
        MDP.__init__(self, transitions, reward, discount, None, max_iter)
        # Check if the user has supplied an initial policy. If not make one.
        if policy0 is None:
            # Initialise the policy to the one which maximises the expected
            # immediate reward
            null = zeros(self.S)
            self.policy, null = self._bellmanOperator(null)
            del null
        else:
            # Use the policy that the user supplied
            # Make sure it is a numpy array
            policy0 = array(policy0)
            # Make sure the policy is the right size and shape
            assert policy0.shape in ((self.S, ), (self.S, 1), (1, self.S)), \
                "'policy0' must be a vector with length S."
            # reshape the policy to be a vector
            policy0 = policy0.reshape(self.S)
            # The policy can only contain integers between 0 and S-1
            msg = "'policy0' must be a vector of integers between 0 and S-1."
            assert not mod(policy0, 1).any(), msg
            assert (policy0 >= 0).all(), msg
            assert (policy0 < self.S).all(), msg
            self.policy = policy0
        # set the initial values to zero
        self.V = zeros(self.S)
        # Do some setup depending on the evaluation type
        if eval_type in (0, "matrix"):
            from numpy.linalg import solve
            from scipy.sparse import eye
            self._speye = eye
            self._lin_eq = solve
            self.eval_type = "matrix"
        elif eval_type in (1, "iterative"):
            self.eval_type = "iterative"
        else:
            raise ValueError("PyMDPtoolbox: eval_type should be 0 for matrix "
                             "evaluation or 1 for iterative evaluation. "
                             "The strings 'matrix' and 'iterative' can also "
                             "be used.")
        # Call the iteration method
        #self.run()

    def _computePpolicyPRpolicy(self):
        # Compute the transition matrix and the reward matrix for a policy.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix
        #     P could be an array with 3 dimensions or a cell array (1xA),
        #     each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #     R could be an array with 3 dimensions (SxSxA) or
        #     a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #     a 2D array(SxA) possibly sparse
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Ppolicy(SxS)  = transition matrix for policy
        # PRpolicy(S)   = reward matrix for policy
        #
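        # In index form:
        #     Ppolicy[s, :] = P[policy[s]][s, :]
        #     Rpolicy[s]    = R[policy[s]][s]
        # built action by action below to avoid a Python loop over states.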
        Ppolicy = empty((self.S, self.S))
        Rpolicy = zeros(self.S)
        for aa in range(self.A): # avoid looping over S
            # the rows that use action a.
            ind = (self.policy == aa).nonzero()[0]
            # if no rows use action a, then no need to assign this
            if ind.size > 0:
                try:
                    Ppolicy[ind, :] = self.P[aa][ind, :]
                except ValueError:
                    Ppolicy[ind, :] = self.P[aa][ind, :].todense()
                #PR = self._computePR() # an apparently unneeded line, and
                # perhaps harmful in this implementation c.f.
                # mdp_computePpolicyPRpolicy.m
                Rpolicy[ind] = self.R[aa][ind]
        # self.R cannot be sparse with the code in its current condition, but
        # it should be possible in the future. Also, if R is so big that it's
        # a good idea to use a sparse matrix for it, then converting PRpolicy
        # from a dense to sparse matrix doesn't seem very memory efficient
        if type(self.R) is sparse:
            Rpolicy = sparse(Rpolicy)
        #self.Ppolicy = Ppolicy
        #self.Rpolicy = Rpolicy
        return (Ppolicy, Rpolicy)

    def _evalPolicyIterative(self, V0=0, epsilon=0.0001, max_iter=10000):
        # Evaluate a policy using iteration.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix
        #    P could be an array with 3 dimensions or
        #    a cell array (1xS), each cell containing a matrix possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #    R could be an array with 3 dimensions (SxSxA) or
        #    a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #    a 2D array(SxA) possibly sparse
        # discount  = discount rate in ]0; 1[
        # policy(S) = a policy
        # V0(S)     = starting value function, optional (default : zeros(S,1))
        # epsilon   = epsilon-optimal policy search, greater than 0,
        #    optional (default : 0.0001)
        # max_iter  = maximum number of iterations to be done, greater than 0,
        #    optional (default : 10000)
        #
        # Evaluation
        # ----------
        # Vpolicy(S) = value function, associated to a specific policy
        #
        # Notes
        # -----
        # In verbose mode, at each iteration, displays the condition which
        # stopped iterations: epsilon-optimum value function found or maximum
        # number of iterations reached.
        #
        try:
            assert V0.shape in ((self.S, ), (self.S, 1), (1, self.S)), \
                "'V0' must be a vector of length S."
            policy_V = array(V0).reshape(self.S)
        except AttributeError:
            if len(V0) == self.S:
                policy_V = array(V0).reshape(self.S)
            else:
                policy_V = zeros(self.S)

        policy_P, policy_R = self._computePpolicyPRpolicy()

        if self.verbose:
            print('  Iteration    V_variation')

        itr = 0
        done = False
        while not done:
            itr += 1

            Vprev = policy_V
            policy_V = policy_R + self.discount * policy_P.dot(Vprev)

            variation = absolute(policy_V - Vprev).max()
            if self.verbose:
                print(('      %s         %s') % (itr, variation))

            # ensure |Vn - Vpolicy| < epsilon
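            # (variation is the sup-norm of V_n - V_{n-1}; because policy
            # evaluation is a contraction with modulus ``discount``,
            #     ||V_n - Vpolicy|| <= discount / (1 - discount) * variation,
            # so the test below guarantees the epsilon bound stated above.)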
            if variation < ((1 - self.discount) / self.discount) * epsilon:
                done = True
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped, epsilon-optimal "
                          "value function.")
            elif itr == max_iter:
                done = True
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped by maximum number "
                          "of iteration condition.")

        self.V = policy_V
    def _evalPolicyMatrix(self):
        # Evaluate the value function of the policy using linear equations.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA) = transition matrix
        #      P could be an array with 3 dimensions or a cell array (1xA),
        #      each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #      R could be an array with 3 dimensions (SxSxA) or
        #      a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #      a 2D array(SxA) possibly sparse
        # discount = discount rate in ]0; 1[
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Vpolicy(S) = value function of the policy
        #
        Ppolicy, Rpolicy = self._computePpolicyPRpolicy()
        # V = PR + gPV  => (I-gP)V = PR  => V = inv(I-gP)* PR
        self.V = self._lin_eq(
            (self._speye(self.S, self.S) - self.discount * Ppolicy), Rpolicy)

    def run(self):
        # Run the policy iteration algorithm.
        # If verbose then print a header
        if self.verbose:
            print('  Iteration  Number_of_different_actions')
        # Set up the while stopping condition and the current time
        done = False
        self.time = time()
        # loop until a stopping condition is reached
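        # Each pass is one policy-iteration step: evaluate the current policy
        # (by solving linear equations or iteratively), then improve it
        # greedily with a single application of the Bellman operator.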
        while not done:
            self.iter += 1
            # these _evalPolicy* functions will update the class's value
            # attribute
            if self.eval_type == "matrix":
                self._evalPolicyMatrix()
            elif self.eval_type == "iterative":
                self._evalPolicyIterative()
            # This should update the class's policy attribute but leave the
            # value alone
            policy_next, null = self._bellmanOperator()
            del null
            # calculate in how many places the old policy disagrees with
            # the new policy
            n_different = (policy_next != self.policy).sum()
            # if verbose then continue printing a table
            if self.verbose:
                print(('       %s                 %s') % (self.iter,
                                                          n_different))
            # Once the policy is unchanging or the maximum number of
            # iterations has been reached then stop
            if n_different == 0:
                done = True
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped, unchanging "
                          "policy found.")
            elif self.iter == self.max_iter:
                done = True
                if self.verbose:
                    print("PyMDPtoolbox: iterations stopped by maximum number "
                          "of iteration condition.")
            else:
                self.policy = policy_next
        # update the time to return the computation time
        self.time = time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class PolicyIterationModified(PolicyIteration):

    """A discounted MDP solved using a modified policy iteration algorithm.

    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    epsilon : float, optional
        Stopping criterion. See the documentation for the ``MDP`` class for
        details. Default: 0.01.
    max_iter : int
        Maximum number of iterations. See the documentation for the ``MDP``
        class for details. Default is 10.

    Data Attributes
    ---------------
    V(S)   = value function
    policy(S) = optimal policy
    iter     = number of done iterations
    cpu_time = used CPU time

    Notes
    -----
    In verbose mode, at each iteration, displays the variation of V.

    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> pim = mdptoolbox.mdp.PolicyIterationModified(P, R, 0.9)
    >>> pim.run()
    >>> pim.policy
    (0, 0, 0)
    >>> pim.V
    (21.81408652334702, 25.054086523347017, 29.054086523347017)

    """

    def __init__(self, transitions, reward, discount, epsilon=0.01,
                 max_iter=10):
        # Initialise a (modified) policy iteration MDP.

        # Maybe it's better not to subclass from PolicyIteration, because the
        # initialisation of the two are quite different. eg there is policy0
        # being calculated here which doesn't need to be. The only thing that
        # is needed from the PolicyIteration class is the _evalPolicyIterative
        # function. Perhaps there is a better way to do it?
        PolicyIteration.__init__(self, transitions, reward, discount, None,
                                 max_iter, 1)

        # PolicyIteration doesn't pass epsilon to MDP.__init__() so we will
        # check it here
        self.epsilon = float(epsilon)
        assert epsilon > 0, "'epsilon' must be greater than 0."

        # computation of threshold of variation for V for an epsilon-optimal
        # policy
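        # (the same epsilon * (1 - discount) / discount scaling used in
        # _evalPolicyIterative: keeping the span of successive value updates
        # below this threshold yields an epsilon-optimal value function.)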
        if self.discount != 1:
            self.thresh = self.epsilon * (1 - self.discount) / self.discount
        else:
            self.thresh = self.epsilon

        if self.discount == 1:
            self.V = zeros((self.S, 1))
        else:
            Rmin = min(R.min() for R in self.R)
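            # Rmin / (1 - discount) is a lower bound on any discounted return,
            # so the initial guess below never overestimates the true value
            # function.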
            self.V = 1 / (1 - self.discount) * Rmin * ones((self.S,))

        # Call the iteration method
        #self.run()

    def run(self):
        # Run the modified policy iteration algorithm.

        if self.verbose:
            print('\tIteration\tV-variation')

        self.time = time()

        done = False
        while not done:
            self.iter += 1

            self.policy, Vnext = self._bellmanOperator()
            #[Ppolicy, PRpolicy] = mdp_computePpolicyPRpolicy(P, PR, policy);

            variation = getSpan(Vnext - self.V)
            if self.verbose:
                print(("\t%s\t%s" % (self.iter, variation)))

            self.V = Vnext
            if variation < self.thresh:
                done = True
            else:
                is_verbose = False
                if self.verbose:
                    self.setSilent()
                    is_verbose = True

                self._evalPolicyIterative(self.V, self.epsilon, self.max_iter)

                if is_verbose:
                    self.setVerbose()

        self.time = time() - self.time

        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())

class QLearning(MDP):

    """A discounted MDP solved using the Q-learning algorithm.

    Parameters
    ----------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    n_iter : int
        Number of iterations to execute. Default value = 10000. This is ignored
        unless it is an integer greater than the default value.

    Results
    -------
    Q : learned Q matrix (SxA)

    V : learned value function (S).

    policy : learned optimal policy (S).

    mean_discrepancy : vector of V discrepancy mean over 100 iterations.
        For the default value of N, the length of this vector is 100 (N/100).
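
    Notes
    -----
    Each iteration simulates one transition (s, a, s', r) and applies the
    standard Q-learning update,
    Q(s, a) <- Q(s, a) + alpha * (r + discount * max_a' Q(s', a') - Q(s, a)),
    where alpha is a learning rate; see the ``run`` method for the exact
    sampling and learning-rate schedule used in this implementation.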

    Examples
    --------
    >>> # These examples are reproducible only if random seed is set to 0 in
    >>> # both the random and numpy.random modules.
    >>> import numpy as np
    >>> import mdptoolbox, mdptoolbox.example
    >>> np.random.seed(0)
    >>> P, R = mdptoolbox.example.forest()
    >>> ql = mdptoolbox.mdp.QLearning(P, R, 0.96)
    >>> ql.run()
    >>> ql.Q
    array([[ 68.38037354,  43.24888454],
           [ 72.37777922,  42.75549145],
           [ 77.02892702,  64.68712932]])
    >>> ql.V
    (68.38037354422798, 72.37777921607258, 77.02892701616531)
    >>> ql.policy
    (0, 0, 0)

    >>> import mdptoolbox
    >>> import numpy as np
    >>> P = np.array([[[0.5, 0.5],[0.8, 0.2]],[[0, 1],[0.1, 0.9]]])
    >>> R = np.array([[5, 10], [-1, 2]])
    >>> np.random.seed(0)
    >>> ql = mdptoolbox.mdp.QLearning(P, R, 0.9)
    >>> ql.run()
    >>> ql.Q
    array([[ 39.933691  ,  43.17543338],
           [ 36.94394224,  35.42568056]])
    >>> ql.V
    (43.17543338090149, 36.943942243204454)
    >>> ql.policy
    (1, 0)

    """

    def __init__(self, transitions, reward, discount, n_iter=10000):
        # Initialise a Q-learning MDP.
