Commit 7b64d26b authored by Steven Cordwell

add another option for returning the policy and value from bellmanOperator(): not yet implemented; comments
parent cafe5342
 # -*- coding: utf-8 -*-
 """
-Copyright (c) 2011, 2012 Steven Cordwell
+Copyright (c) 2011, 2012, 2013 Steven Cordwell
 Copyright (c) 2009, Iadine Chadès
 Copyright (c) 2009, Marie-Josée Cros
 Copyright (c) 2009, Frédérick Garcia
@@ -244,16 +244,19 @@ class MDP(object):
         Updates the value function and the Vprev-improving policy.
         
-        Updates
+        Returns
         -------
-        self.value : new value function
-        self.policy : Vprev-improving policy
+        (policy, value) : tuple of new policy and its value
         """
         Q = matrix(zeros((self.S, self.A)))
         for aa in range(self.A):
             Q[:, aa] = self.R[:, aa] + (self.discount * self.P[aa] * self.value)
-        # update the value and policy
+        # Which way is better? if choose the first way, then the classes that
+        # call this function must be changed
+        # 1. Return, (policy, value)
+        # return (Q.argmax(axis=1), Q.max(axis=1))
+        # 2. change self.policy and self.value directly
         self.value = Q.max(axis=1)
         self.policy = Q.argmax(axis=1)
...
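The comments in the diff weigh two designs for `bellmanOperator()`: returning `(policy, value)` versus mutating `self.policy` and `self.value` in place. A minimal standalone sketch of the first option (the one the commit marks as not yet implemented) might look like the following; the function name, the use of plain NumPy arrays instead of `matrix`, and the argument layout are illustrative assumptions, not the toolbox's actual API.

```python
import numpy as np

def bellman_operator(P, R, discount, value):
    """One application of the Bellman operator, returning (policy, value).

    Hypothetical sketch of "option 1" from the commit's comments: compute
    and return the greedy policy and its value rather than mutating object
    state. P is a list of (S, S) transition matrices, one per action; R is
    an (S, A) reward array; value is a length-S vector; 0 <= discount <= 1.
    """
    S, A = R.shape
    Q = np.empty((S, A))
    for aa in range(A):
        # expected immediate reward plus discounted expected next-state value
        Q[:, aa] = R[:, aa] + discount * (P[aa] @ value)
    # greedy (value-improving) policy and the improved value function
    return Q.argmax(axis=1), Q.max(axis=1)
```

Returning the pair keeps the operator side-effect free, but as the commit comment notes, every class that currently relies on `self.value`/`self.policy` being updated in place would have to be changed to unpack the returned tuple.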