POLICY IMPROVEMENT IN MARKOV DECISION PROCESSES AND MARKOV POTENTIAL THEORY
Author(s) -
Masami Yasuda
Publication year - 1978
Publication title -
bulletin of mathematical statistics
Language(s) - English
Resource type - Journals
ISSN - 0007-4993
DOI - 10.5109/13123
Subject(s) - markov decision process , markov chain , markov model , markov process , computer science , mathematics , mathematical optimization , statistics
A connection between Markov Decision Process (MDP) and Markov potential theory has two sides. One is the potential theoritic development of MDP and the other is the alternative proof of the results in MDP owing to Markov potential theory. Shaufele [12] belongs to the later, but it seems interesting from the standpoint of the mathematical programming to establish the development of MDP by using certain potential notion. Several approaches have been tried. Watanabe [16] interpreted the monotonicity of Howard's iteration [8] in the relation to the a dual problem of Linear Programming. By the property of a potential kernel, Furukawa [6] and Aso and Kimura [1] proved a policy improvement. A formulation of MDP by potential theoretic notion has been tried by Hordijk [7]. In many cases it is restricted to a transient potential theory because its analysis is simpler. In this paper we shall define a new potential in order to serve a general policy improvement. Our aim is to expose theorems which are available to several cases of MDP. By the potential theoretic terms, we can interpret policy improvements of MDP as follows ; The increase of rewards in MDP consists of the potential with a charge of an increment of the policy improvement and a regular function. If it is transient, then the potential is reduced to the ordinary one and the regular function equals zero. Hence this consists with that of Watanabe [16]. The merit of the potential is that it connects the policy improvement with the increment of rewards. We shall consider the following cost criteria of MDP ; (1) discounted case, (2) average case, (3) nearly optimal case and (4) sensitive discounted case. Case (1) and (2) are representitive and discussed by many authors. Especially we list up Howard [7] and Blackwell [2], [3] for (1) and Howard [8] and Derman [4], [5] for (2). Case (3) is due to Blackwell [2]. Extending case (3), case (4) is studied by Miller and Veinott [11] and Veinott [14], [15].
Accelerating Research
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom
Address
John Eccles HouseRobert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom