A LEARNING ALGORITHM FOR COMMUNICATING MARKOV DECISION PROCESSES WITH UNKNOWN TRANSITION MATRICES
Author(s) -
Tetsuichiro Iki,
Masayuki Horiguchi,
Masami Yasuda,
Masami Kurano
Publication year - 2007
Publication title -
Bulletin of Informatics and Cybernetics
Language(s) - English
Resource type - Journals
eISSN - 2435-743X
pISSN - 0286-522X
DOI - 10.5109/16771
Subject(s) - markov decision process , markov chain , computer science , stochastic matrix , markov model , algorithm , markov process , artificial intelligence , machine learning , mathematics , statistics
This study is concerned with finite Markov decision processes (MDPs) whose states are exactly observable but whose transition matrices are unknown. We develop a learning algorithm of the reward-penalty type for the communicating case of multi-chain MDPs. An adaptively optimal policy and an asymptotic sequence of adaptive policies with nearly optimal properties are constructed under the average expected reward criterion. A numerical experiment is also given to show the practical effectiveness of the algorithm.
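The paper's reward-penalty algorithm itself is not reproduced here. As a loose, hypothetical illustration of the setting the abstract describes (a finite MDP whose transition matrix must be learned from observed transitions, evaluated by the average expected reward), the Python sketch below estimates the transition matrix from empirical counts and periodically re-plans greedily against the estimate, i.e. a generic certainty-equivalence scheme with epsilon-greedy exploration. All names (P_true, R, greedy_policy) and the 3-state, 2-action example are invented for illustration only and are not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state, 2-action communicating MDP; P_true is hidden from the agent.
S, A = 3, 2
P_true = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.3, 0.4, 0.3], [0.2, 0.2, 0.6]],
    [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]],
])                                    # P_true[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.5, 1.5]])            # known reward r(s, a)

def greedy_policy(P_hat, iters=500):
    # Relative value iteration on the estimated model (average-reward criterion).
    h = np.zeros(S)
    for _ in range(iters):
        Q = R + P_hat @ h             # Q[s, a] = r(s, a) + sum_{s'} P_hat[s, a, s'] h(s')
        h = Q.max(axis=1)
        h -= h[0]                     # normalize so the relative values stay bounded
    return Q.argmax(axis=1)

counts = np.ones((S, A, S))           # Laplace-smoothed transition counts
state, policy = 0, np.zeros(S, dtype=int)
for t in range(20000):
    if t % 200 == 0:                  # periodically re-estimate the model and re-plan
        P_hat = counts / counts.sum(axis=2, keepdims=True)
        policy = greedy_policy(P_hat)
    # epsilon-greedy exploration keeps every action being sampled in every state
    a = rng.integers(A) if rng.random() < 0.1 else policy[state]
    nxt = rng.choice(S, p=P_true[state, a])
    counts[state, a, nxt] += 1
    state = nxt

P_hat = counts / counts.sum(axis=2, keepdims=True)
print("estimated transition matrices:\n", np.round(P_hat, 2))
print("greedy policy w.r.t. the estimate:", greedy_policy(P_hat))

Relative value iteration is used in the sketch because it targets the average-reward criterion directly, which is the criterion the abstract states; a discounted planner would optimize a different objective.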