### fix some typos in docstrings

parent 118a2fa8
    @@ -70,46 +70,50 @@ class MDP(object):

         """A Markov Decision Problem.

    -    Let S = the number of states, and A = the number of acions.
    +    Let ``S`` = the number of states, and ``A`` = the number of acions.

         Parameters
         ----------
         transitions : array
             Transition probability matrices. These can be defined in a variety of
    -        ways. The simplest is a numpy array that has the shape (A, S, S),
    +        ways. The simplest is a numpy array that has the shape ``(A, S, S)``,
             though there are other possibilities. It can be a tuple or list or
    -        numpy object array of length A, where each element contains a numpy
    -        array or matrix that has the shape (S, S). This "list of matrices" form
    -        is useful when the transition matrices are sparse as
    -        scipy.sparse.csr_matrix matrices can be used. In summary, each action's
    -        transition matrix must be indexable like ``P[a]`` where
    -        ``a`` ∈ {0, 1...A-1}.
    +        numpy object array of length ``A``, where each element contains a numpy
    +        array or matrix that has the shape ``(S, S)``. This "list of matrices"
    +        form is useful when the transition matrices are sparse as
    +        ``scipy.sparse.csr_matrix`` matrices can be used. In summary, each
    +        action's transition matrix must be indexable like ``transitions[a]``
    +        where ``a`` ∈ {0, 1...A-1}, and ``transitions[a]`` returns an ``S`` ×
    +        ``S`` array-like object.
         reward : array
             Reward matrices or vectors. Like the transition matrices, these can
             also be defined in a variety of ways. Again the simplest is a numpy
    -        array that has the shape (S, A), (S,) or (A, S, S). A list of lists can
    -        be used, where each inner list has length S. A list of numpy arrays is
    -        possible where each inner array can be of the shape (S,), (S, 1),
    -        (1, S) or (S, S). Also scipy.sparse.csr_matrix can be used instead of
    -        numpy arrays. In addition, the outer list can be replaced with a tuple
    -        or numpy object array can be used.
    +        array that has the shape ``(S, A)``, ``(S,)`` or ``(A, S, S)``. A list
    +        of lists can be used, where each inner list has length ``S`` and the
    +        outer list has length ``A``. A list of numpy arrays is possible where
    +        each inner array can be of the shape ``(S,)``, ``(S, 1)``, ``(1, S)``
    +        or ``(S, S)``. Also ``scipy.sparse.csr_matrix`` can be used instead of
    +        numpy arrays. In addition, the outer list can be replaced by any object
    +        that can be indexed like ``reward[a]`` such as a tuple or numpy object
    +        array of length ``A``.
         discount : float
             Discount factor. The per time-step discount factor on future rewards.
             Valid values are greater than 0 upto and including 1. If the discount
             factor is 1, then convergence is cannot be assumed and a warning will
    -        be displayed. Subclasses of ``MDP`` may pass None in the case where the
    -        algorithm does not use a discount factor.
    +        be displayed. Subclasses of ``MDP`` may pass ``None`` in the case where
    +        the algorithm does not use a discount factor.
         epsilon : float
             Stopping criterion. The maximum change in the value function at each
             iteration is compared against ``epsilon``. Once the change falls below
             this value, then the value function is considered to have converged to
    -        the optimal value function. Subclasses of ``MDP`` may pass None in the
    -        case where the algorithm does not use a stopping criterion.
    +        the optimal value function. Subclasses of ``MDP`` may pass ``None`` in
    +        the case where the algorithm does not use an epsilon-optimal stopping
    +        criterion.
         max_iter : int
             Maximum number of iterations. The algorithm will be terminated once
             this many iterations have elapsed. This must be greater than 0 if
    -        specified. Subclasses of ``MDP`` may pass None in the case where the
    -        algorithm does not use a maximum number of iterations.
    +        specified. Subclasses of ``MDP`` may pass ``None`` in the case where
    +        the algorithm does not use a maximum number of iterations.

         Attributes
         ----------
    @@ -130,12 +134,12 @@ class MDP(object):
         time : float
             The time used to converge to the optimal policy.
         verbose : boolean
    -        Whether verbose output should be displayed in not.
    +        Whether verbose output should be displayed or not.

         Methods
         -------
         run
    -        Implemented in child classes as the main algorithm loop. Raises and
    +        Implemented in child classes as the main algorithm loop. Raises an
             exception if it has not been overridden.
         setSilent
             Turn the verbosity off
    @@ -314,11 +318,11 @@ class FiniteHorizon(MDP):
         ---------------
         V : array
             Optimal value function. Shape = (S, N+1). ``V[:, n]`` = optimal value
    -        function at stage ``n`` with stage in (0, 1...N-1). ``V[:, N]`` value
    +        function at stage ``n`` with stage in {0, 1...N-1}. ``V[:, N]`` value
             function for terminal stage.
         policy : array
    -        Optimal policy. ``policy[:, n]`` = optimal policy at stage ``n`` with
    -        stage in (0, 1...N). ``policy[:, N]`` = policy for stage ``N``.
    +        Optimal policy. ``policy[:, n]`` = optimal policy at stage ``n`` with
    +        stage in {0, 1...N}. ``policy[:, N]`` = policy for stage ``N``.
         time : float
             used CPU time
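The updated `transitions` docstring describes an indexing contract: `transitions[a]` must work for every action `a` in {0, 1...A-1} and return an `S` × `S` array-like object. The sketch below illustrates that contract. `check_transitions` is a hypothetical helper written for this note, not part of the library, and it uses plain nested lists in place of numpy arrays only to stay dependency-free. The check that each row sums to 1 is an assumption about how the probability matrices are oriented; the docstring does not state it.

```python
def check_transitions(transitions, num_actions, num_states, tol=1e-9):
    """Hypothetical helper: check the documented indexing contract.

    transitions[a] must exist for every a in 0..A-1 and look like an
    S x S matrix. Rows are assumed (not confirmed by the docstring)
    to be probability distributions that sum to 1.
    """
    for a in range(num_actions):
        matrix = transitions[a]  # any container indexable by action works
        if len(matrix) != num_states:
            raise ValueError("action %d: expected %d rows" % (a, num_states))
        for s, row in enumerate(matrix):
            if len(row) != num_states:
                raise ValueError(
                    "action %d, state %d: expected %d columns"
                    % (a, s, num_states)
                )
            if abs(sum(row) - 1.0) > tol:
                raise ValueError(
                    "action %d, state %d: row does not sum to 1" % (a, s)
                )
    return True


# Two actions (A = 2) and three states (S = 3), stored as a tuple of
# per-action matrices, i.e. the "list of matrices" form.
transitions = (
    [[0.5, 0.5, 0.0], [0.0, 1.0, 0.0], [0.2, 0.3, 0.5]],
    [[1.0, 0.0, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]],
)
ok = check_transitions(transitions, num_actions=2, num_states=3)
```

Because the helper only relies on `transitions[a]`, `len()` and iteration over rows, the outer container can be a tuple or a list, which mirrors the flexibility the docstring describes.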