This paper considers a multi-armed bandit (MAB) problem in which multiple mobile agents receive rewards by sampling from a collection of spatially dispersed stochastic processes, called bandits. The goal is to formulate a decentralized policy for each agent, in order to maximize the total cumulative reward over all agents, subject to option availability and inter-agent communication constraints. The problem formulation is motivated by applications in which a team of autonomous mobile robots cooperates to accomplish an exploration and exploitation task in an uncertain environment. Bandit locations are represented by vertices of the spatial graph. At any time, an agent's options consist of sampling the bandit at its current location, or traveling along an edge of the spatial graph to a new bandit location. Communication constraints are described by a directed, non-stationary, stochastic communication graph. At any time, agents may receive data only from their communication graph in-neighbors. For the case of a single agent on a fully connected spatial graph, it is known that the expected regret for any optimal policy is necessarily bounded below by a function that grows as the logarithm of time. A class of policies called upper confidence bound (UCB) algorithms asymptotically achieves logarithmic regret for the classical MAB problem. In this paper, we propose a UCB-based decentralized motion and option selection policy and a non-stationary stochastic communication protocol that guarantee logarithmic regret. To our knowledge, this is the first such decentralized policy for non-fully connected spatial graphs with communication constraints. When the spatial graph is fully connected and the communication graph is stationary, our decentralized algorithm matches or exceeds the best reported prior results from the literature.
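
For readers unfamiliar with the UCB algorithms mentioned above, the following is a minimal sketch of the classical single-agent UCB1 index rule on a fully connected set of arms. It is not the decentralized policy proposed in the paper: it has no spatial graph, no travel, and no communication. The function name `ucb1`, the Bernoulli reward model, the exploration constant 2, and all parameter names are assumptions made for illustration.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Classical UCB1 on Bernoulli arms (illustrative, single agent).

    arm_means: assumed true success probabilities, used only to simulate
               rewards and to measure regret; the policy never reads them.
    Returns (pull counts per arm, cumulative expected regret).
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of times each arm was sampled
    sums = [0.0] * n_arms   # total reward collected from each arm
    best_mean = max(arm_means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialization: sample every arm once.
            arm = t - 1
        else:
            # Pick the arm with the largest upper confidence bound:
            # empirical mean plus an exploration bonus that shrinks
            # as the arm is sampled more often.
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )

        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # Expected regret: gap between the best arm and the chosen arm.
        regret += best_mean - arm_means[arm]

    return counts, regret


if __name__ == "__main__":
    counts, regret = ucb1([0.2, 0.5, 0.9], horizon=5000, seed=1)
    print("pulls per arm:", counts)
    print("cumulative expected regret:", round(regret, 1))
```

In this sketch the exploration bonus grows with the logarithm of elapsed time and shrinks with the number of samples of each arm, so suboptimal arms are revisited only occasionally; this is the mechanism behind the logarithmic regret the abstract refers to for the classical problem.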

A Decentralized Policy with Logarithmic Regret for a Class of Multi-Agent Multi-Armed Bandit Problems with Option Unavailability Constraints and Stochastic Communication Protocols