# -*- coding: utf-8 -*-

"""Markov Decision Process (MDP) Toolbox: ``mdp`` module
=====================================================

The ``mdp`` module provides classes for the resolution of discrete-time Markov
Decision Processes.

Available classes
-----------------
MDP
    Base Markov decision process class
FiniteHorizon
    Backwards induction finite horizon MDP
LP
    Linear programming MDP
PolicyIteration
    Policy iteration MDP
PolicyIterationModified
    Modified policy iteration MDP
QLearning
    Q-learning MDP
RelativeValueIteration
    Relative value iteration MDP
ValueIteration
    Value iteration MDP
ValueIterationGS
    Gauss-Seidel value iteration MDP
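
Example of use (a sketch; it relies on the ``mdptoolbox.example`` module and
mirrors the per-class examples further down, so the resulting policy is not
asserted here):

>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.forest()
>>> vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
>>> vi.run()
>>> vi.policy # doctest: +SKIP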

"""

# Copyright (c) 2011-2015 Steven A. W. Cordwell
# Copyright (c) 2009 INRA
#
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
#
#   * Redistributions of source code must retain the above copyright notice,
#     this list of conditions and the following disclaimer.
#   * Redistributions in binary form must reproduce the above copyright notice,
#     this list of conditions and the following disclaimer in the documentation
#     and/or other materials provided with the distribution.
#   * Neither the name of the <ORGANIZATION> nor the names of its contributors
#     may be used to endorse or promote products derived from this software
#     without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.

import math as _math
import time as _time

import numpy as _np
import scipy.sparse as _sp

import mdptoolbox.util as _util

_MSG_STOP_MAX_ITER = "Iterating stopped due to maximum number of iterations " \
    "condition."
_MSG_STOP_EPSILON_OPTIMAL_POLICY = "Iterating stopped, epsilon-optimal " \
    "policy found."
_MSG_STOP_EPSILON_OPTIMAL_VALUE = "Iterating stopped, epsilon-optimal value " \
    "function found."
_MSG_STOP_UNCHANGING_POLICY = "Iterating stopped, unchanging policy found."


class MDP(object):

    """A Markov Decision Problem.

    Let ``S`` = the number of states, and ``A`` = the number of actions.

    Parameters
    ----------
    transitions : array
        Transition probability matrices. These can be defined in a variety of
        ways. The simplest is a numpy array that has the shape ``(A, S, S)``,
        though there are other possibilities. It can be a tuple or list or
        numpy object array of length ``A``, where each element contains a numpy
        array or matrix that has the shape ``(S, S)``. This "list of matrices"
        form is useful when the transition matrices are sparse as
        ``scipy.sparse.csr_matrix`` matrices can be used. In summary, each
        action's transition matrix must be indexable like ``transitions[a]``
        where ``a`` ∈ {0, 1...A-1}, and ``transitions[a]`` returns an ``S`` ×
        ``S`` array-like object.
    reward : array
        Reward matrices or vectors. Like the transition matrices, these can
        also be defined in a variety of ways. Again the simplest is a numpy
        array that has the shape ``(S, A)``, ``(S,)`` or ``(A, S, S)``. A list
        of lists can be used, where each inner list has length ``S`` and the
        outer list has length ``A``. A list of numpy arrays is possible where
        each inner array can be of the shape ``(S,)``, ``(S, 1)``, ``(1, S)``
        or ``(S, S)``. Also ``scipy.sparse.csr_matrix`` can be used instead of
        numpy arrays. In addition, the outer list can be replaced by any object
        that can be indexed like ``reward[a]`` such as a tuple or numpy object
        array of length ``A``.
    discount : float
        Discount factor. The per time-step discount factor on future rewards.
        Valid values are greater than 0 up to and including 1. If the discount
        factor is 1, then convergence cannot be assumed and a warning will
        be displayed. Subclasses of ``MDP`` may pass ``None`` in the case where
        the algorithm does not use a discount factor.
    epsilon : float
        Stopping criterion. The maximum change in the value function at each
        iteration is compared against ``epsilon``. Once the change falls below
        this value, then the value function is considered to have converged to
        the optimal value function. Subclasses of ``MDP`` may pass ``None`` in
        the case where the algorithm does not use an epsilon-optimal stopping
        criterion.
    max_iter : int
        Maximum number of iterations. The algorithm will be terminated once
        this many iterations have elapsed. This must be greater than 0 if
        specified. Subclasses of ``MDP`` may pass ``None`` in the case where
        the algorithm does not use a maximum number of iterations.

    Attributes
    ----------
    P : array
        Transition probability matrices.
    R : array
        Reward vectors.
    V : tuple
        The optimal value function. Each element is a float corresponding to
        the expected value of being in that state assuming the optimal policy
        is followed.
    discount : float
        The discount rate on future rewards.
    max_iter : int
        The maximum number of iterations.
    policy : tuple
        The optimal policy.
    time : float
        The time used to converge to the optimal policy.
    verbose : boolean
        Whether verbose output should be displayed or not.

    Methods
    -------
    run
        Implemented in child classes as the main algorithm loop. Raises an
        exception if it has not been overridden.
    setSilent
        Turn the verbosity off
    setVerbose
        Turn the verbosity on
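
    Examples
    --------
    A sketch of two of the accepted input layouts (the values are illustrative
    only, and the sparse layout is just one of the possibilities described
    above):

    >>> import numpy as np
    >>> import scipy.sparse as sp
    >>> # dense layout: one (A, S, S) transition array and an (S, A) reward
    >>> P = np.array([[[0.5, 0.5], [0.8, 0.2]],
    ...               [[0.0, 1.0], [0.1, 0.9]]])
    >>> R = np.array([[5, 10], [-1, 2]])
    >>> # sparse layout: an object array of length A holding (S, S) matrices
    >>> P_sparse = np.empty(2, dtype=object)
    >>> P_sparse[0] = sp.csr_matrix(P[0])
    >>> P_sparse[1] = sp.csr_matrix(P[1])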

    """

    def __init__(self, transitions, reward, discount, epsilon, max_iter):
        # Initialise an MDP based on the input parameters.

        # if the discount is None then the algorithm is assumed to not use it
        # in its computations
        if discount is not None:
            self.discount = float(discount)
            assert 0.0 < self.discount <= 1.0, "Discount rate must be in ]0; 1]"
            if self.discount == 1:
                print("WARNING: check conditions of convergence. With no "
                      "discount, convergence can not be assumed.")
        # if the max_iter is None then the algorithm is assumed to not use it
        # in its computations
        if max_iter is not None:
            self.max_iter = int(max_iter)
            assert self.max_iter > 0, "The maximum number of iterations " \
                                      "must be greater than 0."
        # check that epsilon is something sane
        if epsilon is not None:
            self.epsilon = float(epsilon)
            assert self.epsilon > 0, "Epsilon must be greater than 0."
        # we run a check on P and R to make sure they are describing an MDP. If
        # an exception isn't raised then they are assumed to be correct.
        _util.check(transitions, reward)
        # computePR will assign the variables self.S, self.A, self.P and self.R
        self._computePR(transitions, reward)
        # the verbosity is by default turned off
        self.verbose = False
        # Initially the time taken to perform the computations is set to None
        self.time = None
        # set the initial iteration count to zero
        self.iter = 0
        # V should be stored as a vector, i.e. shape of (S,) or (1, S)
        self.V = None
        # policy can also be stored as a vector
        self.policy = None

    def __repr__(self):
        P_repr = "P: \n"
        R_repr = "R: \n"
        for aa in range(self.A):
            P_repr += repr(self.P[aa]) + "\n"
            R_repr += repr(self.R[aa]) + "\n"
        return(P_repr + "\n" + R_repr)

    def _bellmanOperator(self, V=None):
        # Apply the Bellman operator on the value function.
        #
        # Updates the value function and the Vprev-improving policy.
        #
        # Returns: (policy, value), tuple of new policy and its value
        #
        # If V hasn't been sent into the method, then we assume to be working
        # on the object's V attribute
        if V is None:
            # this V should be a reference to the data rather than a copy
            V = self.V
        else:
            # make sure the user supplied V is of the right shape
            try:
                assert V.shape in ((self.S,), (1, self.S)), "V is not the " \
                    "right shape (Bellman operator)."
            except AttributeError:
                raise TypeError("V must be a numpy array or matrix.")
        # Looping through each action the Q-value matrix is calculated.
        # P and V can be any object that supports indexing, so it is important
        # that you know they define a valid MDP before calling the
        # _bellmanOperator method. Otherwise the results will be meaningless.
        Q = _np.empty((self.A, self.S))
        for aa in range(self.A):
            Q[aa] = self.R[aa] + self.discount * self.P[aa].dot(V)
        # Get the policy and value, for now it is being returned but...
        # Which way is better?
        # 1. Return, (policy, value)
        return (Q.argmax(axis=0), Q.max(axis=0))
        # 2. update self.policy and self.V directly
        # self.V = Q.max(axis=1)
        # self.policy = Q.argmax(axis=1)

    def _computeP(self, P):
        # Set self.P as a tuple of length A, with each element storing an S×S
        # matrix.
        self.A = len(P)
        try:
            if P.ndim == 3:
                self.S = P.shape[1]
            else:
                self.S = P[0].shape[0]
        except AttributeError:
            self.S = P[0].shape[0]
        # convert P to a tuple of numpy arrays
        self.P = tuple(P[aa] for aa in range(self.A))

    def _computePR(self, P, R):
        # Compute the reward for the system in one state choosing an action.
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        #    P(SxSxA)  = transition matrix
        #        P could be an array with 3 dimensions or a cell array (1xA),
        #        each cell containing a matrix (SxS) possibly sparse
        #    R(SxSxA) or (SxA) = reward matrix
        #        R could be an array with 3 dimensions (SxSxA) or a cell array
        #        (1xA), each cell containing a sparse matrix (SxS) or a 2D
        #        array(SxA) possibly sparse
        # Evaluation
        # ----------
        #    PR(SxA)   = reward matrix
        #
        # We assume that P and R define an MDP, i.e. the assumption is that
        # _util.check(P, R) has already been run and doesn't fail.
        #
        # First compute and store P, S, and A
        self._computeP(P)
        # Set self.R as a tuple of length A, with each element storing a 1×S
        # vector.
        try:
            if R.ndim == 1:
                r = _np.array(R).reshape(self.S)
                self.R = tuple(r for aa in range(self.A))
            elif R.ndim == 2:
                self.R = tuple(_np.array(R[:, aa]).reshape(self.S)
                               for aa in range(self.A))
            else:
                self.R = tuple(_np.multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                               for aa in range(self.A))
        except AttributeError:
            if len(R) == self.A:
                self.R = tuple(_np.multiply(P[aa], R[aa]).sum(1).reshape(self.S)
                               for aa in range(self.A))
            else:
                r = _np.array(R).reshape(self.S)
                self.R = tuple(r for aa in range(self.A))

    def run(self):
        # Raise error because child classes should implement this function.
        raise NotImplementedError("You should create a run() method.")

    def setSilent(self):
        """Set the MDP algorithm to silent mode."""
        self.verbose = False

    def setVerbose(self):
        """Set the MDP algorithm to verbose mode."""
        self.verbose = True


class FiniteHorizon(MDP):

    """An MDP solved using the finite-horizon backwards induction algorithm.

    Parameters
    ----------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    N : int
        Number of periods. Must be greater than 0.
    h : array, optional
        Terminal reward. Default: a vector of zeros.

    Data Attributes
    ---------------
    V : array
        Optimal value function. Shape = (S, N+1). ``V[:, n]`` = optimal value
        function at stage ``n`` with stage in {0, 1...N-1}. ``V[:, N]`` = value
        function for the terminal stage.
    policy : array
        Optimal policy. Shape = (S, N). ``policy[:, n]`` = optimal policy at
        stage ``n`` with stage in {0, 1...N-1}.
    time : float
        used CPU time

    Notes
    -----
    In verbose mode, displays the current stage and policy transpose.

    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
    >>> fh.run()
    >>> fh.V
    array([[ 2.6973,  0.81  ,  0.    ,  0.    ],
           [ 5.9373,  3.24  ,  1.    ,  0.    ],
           [ 9.9373,  7.24  ,  4.    ,  0.    ]])
    >>> fh.policy
    array([[0, 0, 0],
           [0, 0, 1],
           [0, 0, 0]])
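
    A terminal reward can be supplied through ``h`` (a sketch; the vector
    below is an arbitrary illustration of a length ``S`` terminal reward):

    >>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3, h=[1, 1, 1])
    >>> fh.run()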

    """

    def __init__(self, transitions, reward, discount, N, h=None):
        # Initialise a finite horizon MDP.
        self.N = int(N)
        assert self.N > 0, "N must be greater than 0."
        # Initialise the base class
        MDP.__init__(self, transitions, reward, discount, None, None)
        # remove the iteration counter, it is not meaningful for backwards
        # induction
        del self.iter
        # There are value vectors for each time step up to the horizon
        self.V = _np.zeros((self.S, N + 1))
        # There are policy vectors for each time step before the horizon, when
        # we reach the horizon we don't need to make decisions anymore.
        self.policy = _np.empty((self.S, N), dtype=int)
        # Set the reward for the final transition to h, if specified.
        if h is not None:
            self.V[:, N] = h

    def run(self):
        # Run the finite horizon algorithm.
        self.time = _time.time()
        # loop through each time period
        for n in range(self.N):
            W, X = self._bellmanOperator(self.V[:, self.N - n])
            stage = self.N - n - 1
            self.V[:, stage] = X
            self.policy[:, stage] = W
            if self.verbose:
                print(("stage: %s, policy: %s") % (
                    stage, self.policy[:, stage].tolist()))
        # update time spent running
        self.time = _time.time() - self.time
        # After this we could create a tuple of tuples for the values and
        # policies.
        #self.V = tuple(tuple(self.V[:, n].tolist()) for n in range(self.N))
        #self.policy = tuple(tuple(self.policy[:, n].tolist())
        #                    for n in range(self.N))


class LP(MDP):

    """A discounted MDP solved using linear programming.

    This class requires the Python ``cvxopt`` module to be installed.

    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.

    Data Attributes
    ---------------
    V : tuple
        optimal values
    policy : tuple
        optimal policy
    time : float
        used CPU time

    Examples
    --------
    >>> import mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> lp = mdptoolbox.mdp.LP(P, R, 0.9)
    >>> lp.run()

    >>> import numpy, mdptoolbox
    >>> P = numpy.array((((0.5, 0.5), (0.8, 0.2)), ((0, 1), (0.1, 0.9))))
    >>> R = numpy.array(((5, 10), (-1, 2)))
    >>> lp = mdptoolbox.mdp.LP(P, R, 0.9)
    >>> lp.run()
    >>> #lp.policy #FIXME: gives (1, 1), should be (1, 0)

    """

    def __init__(self, transitions, reward, discount):
        # Initialise a linear programming MDP.
        # import some functions from cvxopt and set them as object methods
        try:
            from cvxopt import matrix, solvers
            self._linprog = solvers.lp
            self._cvxmat = matrix
        except ImportError:
            raise ImportError("The python module cvxopt is required to use "
                              "linear programming functionality.")
        # initialise the MDP. epsilon and max_iter are not needed
        MDP.__init__(self, transitions, reward, discount, None, None)
        # Set the cvxopt solver to be quiet by default, but ...
        # this doesn't do what I want it to do c.f. issue #3
        if not self.verbose:
            solvers.options['show_progress'] = False

    def run(self):
        # Run the linear programming algorithm.
        self.time = _time.time()
        # The objective is to solve: min V / V >= PR + discount*P*V
        # The linprog function of the MATLAB Optimisation Toolbox solves:
        # min f'* x / M * x <= b
        # So the objective can be expressed as:
        # min V / (discount*P-I) * V <= - PR
        # To avoid looping over states, the matrix M is structured by action:
        # M(A*S,S)
        f = self._cvxmat(_np.ones((self.S, 1)))
        h = _np.array(self.R).reshape(self.S * self.A, 1, order="F")
        h = self._cvxmat(h, tc='d')
        M = _np.zeros((self.A * self.S, self.S))
        for aa in range(self.A):
            pos = (aa + 1) * self.S
            M[(pos - self.S):pos, :] = (
                self.discount * self.P[aa] - _sp.eye(self.S, self.S))
        M = self._cvxmat(M)
        # Using the glpk option will make this behave more like Octave
        # (Octave uses glpk) and perhaps Matlab. If solver=None (ie using the
        # default cvxopt solver) then V agrees with the Octave equivalent
        # only to 10e-8 places. This assumes glpk is installed of course.
        self.V = _np.array(self._linprog(f, M, -h)['x']).reshape(self.S)
        # apply the Bellman operator
        self.policy, self.V = self._bellmanOperator()
        # update the time spent solving
        self.time = _time.time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())


class PolicyIteration(MDP):

    """A discounted MDP solved using the policy iteration algorithm.

    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    policy0 : array, optional
        Starting policy.
    max_iter : int, optional
        Maximum number of iterations. See the documentation for the ``MDP``
        class for details. Default is 1000.
    eval_type : int or string, optional
        Type of function used to evaluate policy. 0 or "matrix" to solve as a
        set of linear equations. 1 or "iterative" to solve iteratively.
        Default: 0.

    Data Attributes
    ---------------
    V : tuple
        value function
    policy : tuple
        optimal policy
    iter : int
        number of done iterations
    time : float
        used CPU time

    Notes
    -----
    In verbose mode, at each iteration, displays the number of different
    actions between policy n-1 and policy n.

    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.rand(10, 3)
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()

    >>> P, R = mdptoolbox.example.forest()
    >>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
    >>> pi.run()
    >>> expected = (26.244000000000014, 29.484000000000016, 33.484000000000016)
    >>> all(expected[k] - pi.V[k] < 1e-12 for k in range(len(expected)))
    True
    >>> pi.policy
    (0, 0, 0)
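
    The iterative policy evaluation can be selected with ``eval_type`` (a
    sketch only; it should reach the same policy as the matrix evaluation
    above):

    >>> pi_it = mdptoolbox.mdp.PolicyIteration(P, R, 0.9, eval_type="iterative")
    >>> pi_it.run()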

    """

    def __init__(self, transitions, reward, discount, policy0=None,
                 max_iter=1000, eval_type=0):
        # Initialise a policy iteration MDP.
        #
        # Set up the MDP, but don't need to worry about epsilon values
        MDP.__init__(self, transitions, reward, discount, None, max_iter)
        # Check if the user has supplied an initial policy. If not make one.
        if policy0 is None:
            # Initialise the policy to the one which maximises the expected
            # immediate reward
            null = _np.zeros(self.S)
            self.policy, null = self._bellmanOperator(null)
            del null
        else:
            # Use the policy that the user supplied
            # Make sure it is a numpy array
            policy0 = _np.array(policy0)
            # Make sure the policy is the right size and shape
            assert policy0.shape in ((self.S, ), (self.S, 1), (1, self.S)), \
                "'policy0' must be a vector with length S."
            # reshape the policy to be a vector
            policy0 = policy0.reshape(self.S)
            # The policy can only contain integers between 0 and S-1
            msg = "'policy0' must be a vector of integers between 0 and S-1."
            assert not _np.mod(policy0, 1).any(), msg
            assert (policy0 >= 0).all(), msg
            assert (policy0 < self.S).all(), msg
            self.policy = policy0
        # set the initial values to zero
        self.V = _np.zeros(self.S)
        # Do some setup depending on the evaluation type
        if eval_type in (0, "matrix"):
            self.eval_type = "matrix"
        elif eval_type in (1, "iterative"):
            self.eval_type = "iterative"
        else:
            raise ValueError("'eval_type' should be '0' for matrix evaluation "
                             "or '1' for iterative evaluation. The strings "
                             "'matrix' and 'iterative' can also be used.")

    def _computePpolicyPRpolicy(self):
        # Compute the transition matrix and the reward matrix for a policy.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix
        #     P could be an array with 3 dimensions or a cell array (1xA),
        #     each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #     R could be an array with 3 dimensions (SxSxA) or
        #     a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #     a 2D array(SxA) possibly sparse
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Ppolicy(SxS)  = transition matrix for policy
        # PRpolicy(S)   = reward matrix for policy
        #
        Ppolicy = _np.empty((self.S, self.S))
        Rpolicy = _np.zeros(self.S)
        for aa in range(self.A):  # avoid looping over S
            # the rows that use action a.
            ind = (self.policy == aa).nonzero()[0]
            # if no rows use action a, then no need to assign this
            if ind.size > 0:
                try:
                    Ppolicy[ind, :] = self.P[aa][ind, :]
                except ValueError:
                    Ppolicy[ind, :] = self.P[aa][ind, :].todense()
                #PR = self._computePR() # an apparently unneeded line, and
                # perhaps harmful in this implementation c.f.
                # mdp_computePpolicyPRpolicy.m
                Rpolicy[ind] = self.R[aa][ind]
        # self.R cannot be sparse with the code in its current condition, but
        # it should be possible in the future. Also, if R is so big that it's
        # a good idea to use a sparse matrix for it, then converting PRpolicy
        # from a dense to sparse matrix doesn't seem very memory efficient
        if type(self.R) is _sp.csr_matrix:
            Rpolicy = _sp.csr_matrix(Rpolicy)
        #self.Ppolicy = Ppolicy
        #self.Rpolicy = Rpolicy
        return (Ppolicy, Rpolicy)

    def _evalPolicyIterative(self, V0=0, epsilon=0.0001, max_iter=10000):
        # Evaluate a policy using iteration.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA)  = transition matrix
        #    P could be an array with 3 dimensions or
        #    a cell array (1xA), each cell containing a matrix possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #    R could be an array with 3 dimensions (SxSxA) or
        #    a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #    a 2D array(SxA) possibly sparse
        # discount  = discount rate in ]0; 1[
        # policy(S) = a policy
        # V0(S)     = starting value function, optional (default : zeros(S,1))
        # epsilon   = epsilon-optimal policy search, greater than 0,
        #    optional (default : 0.0001)
        # max_iter  = maximum number of iterations to be done, greater than 0,
        #    optional (default : 10000)
        #
        # Evaluation
        # ----------
        # Vpolicy(S) = value function, associated with a specific policy
        #
        # Notes
        # -----
        # In verbose mode, at each iteration, displays the condition which
        # stopped iterations: epsilon-optimum value function found or maximum
        # number of iterations reached.
        #
        try:
            assert V0.shape in ((self.S, ), (self.S, 1), (1, self.S)), \
                "'V0' must be a vector of length S."
            policy_V = _np.array(V0).reshape(self.S)
        except AttributeError:
            if V0 == 0:
                policy_V = _np.zeros(self.S)
            else:
                policy_V = _np.array(V0).reshape(self.S)

        policy_P, policy_R = self._computePpolicyPRpolicy()

        if self.verbose:
            print('    Iteration\t\t    V variation')

        itr = 0
        done = False
        while not done:
            itr += 1

            Vprev = policy_V
            policy_V = policy_R + self.discount * policy_P.dot(Vprev)

            variation = _np.absolute(policy_V - Vprev).max()
            if self.verbose:
                print(('      %s\t\t      %s') % (itr, variation))

            # ensure |Vn - Vpolicy| < epsilon
            if variation < ((1 - self.discount) / self.discount) * epsilon:
                done = True
                if self.verbose:
                    print(_MSG_STOP_EPSILON_OPTIMAL_VALUE)
            elif itr == max_iter:
                done = True
                if self.verbose:
                    print(_MSG_STOP_MAX_ITER)

        self.V = policy_V

    def _evalPolicyMatrix(self):
        # Evaluate the value function of the policy using linear equations.
        #
        # Arguments
        # ---------
        # Let S = number of states, A = number of actions
        # P(SxSxA) = transition matrix
        #      P could be an array with 3 dimensions or a cell array (1xA),
        #      each cell containing a matrix (SxS) possibly sparse
        # R(SxSxA) or (SxA) = reward matrix
        #      R could be an array with 3 dimensions (SxSxA) or
        #      a cell array (1xA), each cell containing a sparse matrix (SxS) or
        #      a 2D array(SxA) possibly sparse
        # discount = discount rate in ]0; 1[
        # policy(S) = a policy
        #
        # Evaluation
        # ----------
        # Vpolicy(S) = value function of the policy
        #
        Ppolicy, Rpolicy = self._computePpolicyPRpolicy()
        # V = PR + gPV  => (I-gP)V = PR  => V = inv(I-gP)* PR
        self.V = _np.linalg.solve(
            (_sp.eye(self.S, self.S) - self.discount * Ppolicy), Rpolicy)

    def run(self):
        # Run the policy iteration algorithm.
        # If verbose then print a header
        if self.verbose:
            print('  Iteration\t\tNumber of different actions')
        # Set up the while stopping condition and the current time
        done = False
        self.time = _time.time()
        # loop until a stopping condition is reached
        while not done:
            self.iter += 1
            # these _evalPolicy* functions will update the class's value
            # attribute
            if self.eval_type == "matrix":
                self._evalPolicyMatrix()
            elif self.eval_type == "iterative":
                self._evalPolicyIterative()
            # This should update the class's policy attribute but leave the
            # value alone
            policy_next, null = self._bellmanOperator()
            del null
            # calculate in how many places the old policy disagrees with
            # the new policy
            n_different = (policy_next != self.policy).sum()
            # if verbose then continue printing a table
            if self.verbose:
                print(('    %s\t\t  %s') % (self.iter, n_different))
            # Once the policy is unchanging or the maximum number of
            # iterations has been reached then stop
            if n_different == 0:
                done = True
                if self.verbose:
                    print(_MSG_STOP_UNCHANGING_POLICY)
            elif self.iter == self.max_iter:
                done = True
                if self.verbose:
                    print(_MSG_STOP_MAX_ITER)
            else:
                self.policy = policy_next
        # update the time to return the computation time
        self.time = _time.time() - self.time
        # store value and policy as tuples
        self.V = tuple(self.V.tolist())
        self.policy = tuple(self.policy.tolist())


class PolicyIterationModified(PolicyIteration):

    """A discounted MDP solved using a modified policy iteration algorithm.

    Arguments
    ---------
    transitions : array
        Transition probability matrices. See the documentation for the ``MDP``
        class for details.
    reward : array
        Reward matrices or vectors. See the documentation for the ``MDP`` class
        for details.
    discount : float
        Discount factor. See the documentation for the ``MDP`` class for
        details.
    epsilon : float, optional
        Stopping criterion. See the documentation for the ``MDP`` class for
        details. Default: 0.01.
    max_iter : int, optional
        Maximum number of iterations. See the documentation for the ``MDP``
        class for details. Default is 10.

    Data Attributes
    ---------------
    V : tuple
        value function
    policy : tuple
        optimal policy
    iter : int
        number of done iterations
    time : float
        used CPU time

    Examples
    --------
    >>> import mdptoolbox, mdptoolbox.example
    >>> P, R = mdptoolbox.example.forest()
    >>> pim = mdptoolbox.mdp.PolicyIterationModified(P, R, 0.9)