z-logo
open-access-imgOpen Access
Task-Aware Verifiable RNN-Based Policies for Partially Observable Markov Decision Processes
Author(s) -
Steven Carr,
Nils Jansen,
Ufuk Topcu
Publication year - 2021
Publication title -
journal of artificial intelligence research/the journal of artificial intelligence research
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.79
H-Index - 123
eISSN - 1943-5037
pISSN - 1076-9757
DOI - 10.1613/jair.1.12963
Subject(s) - partially observable markov decision process , computer science , markov decision process , recurrent neural network , temporal logic , markov chain , artificial intelligence , probabilistic logic , markov process , machine learning , markov model , theoretical computer science , artificial neural network , mathematics , statistics
Partially observable Markov decision processes (POMDPs) are models for sequential decision-making under uncertainty and incomplete information. Machine learning methods typically train recurrent neural networks (RNN) as effective representations of POMDP policies that can efficiently process sequential data. However, it is hard to verify whether the POMDP driven by such RNN-based policies satisfies safety constraints, for instance, given by temporal logic specifications. We propose a novel method that combines techniques from machine learning with the field of formal methods: training an RNN-based policy and then automatically extracting a so-called finite-state controller (FSC) from the RNN. Such FSCs offer a convenient way to verify temporal logic constraints. Implemented on a POMDP, they induce a Markov chain, and probabilistic verification methods can efficiently check whether this induced Markov chain satisfies a temporal logic specification. Using such methods, if the Markov chain does not satisfy the specification, a byproduct of verification is diagnostic information about the states in the POMDP that are critical for the specification. The method exploits this diagnostic information to either adjust the complexity of the extracted FSC or improve the policy by performing focused retraining of the RNN. The method synthesizes policies that satisfy temporal logic specifications for POMDPs with up to millions of states, which are three orders of magnitude larger than comparable approaches.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here