We consider an agent who is involved in a Markov decision process and receives a vector of outcomes every round. Her objective is to maximize a global concave reward function on the average vectorial outcome. The problem models applications such as multi-objective optimization, maximum entropy exploration, and constrained optimization in Markovian environments. In our general setting where a stationary policy could have multiple recurrent classes, the agent faces a subtle yet consequential trade-off in alternating among different actions for balancing the vectorial outcomes. In particular, stationary policies are in general sub-optimal. We propose a no-regret algorithm based on online convex optimization (OCO) tools (Agrawal and Devanur 2014) and UCRL2 (Jaksch et al. 2010). Importantly, we introduce a novel gradient threshold procedure, which carefully controls the switches among actions to handle the subtle trade-off. By delaying the gradient updates, our procedure produces a non-stationary policy that diversifies the outcomes for optimizing the objective. The procedure is compatible with a variety of OCO tools.
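To make the role of the concave objective concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a stateless toy problem with two hypothetical actions whose outcome vectors are (1, 0) and (0, 1), and an illustrative concave reward g(v) = sqrt(v1) + sqrt(v2) on the average outcome. There is no transition structure, no UCRL2-style estimation, and no gradient threshold here; the names `OUTCOMES`, `g`, `grad_g`, and `gradient_greedy` are all assumptions made for this example. The sketch only shows why repeating a single action can be worse than alternating, and how following the gradient of g at the running average diversifies the outcomes.

```python
import math

# Hypothetical toy problem: each action deterministically yields one outcome vector.
OUTCOMES = {0: (1.0, 0.0), 1: (0.0, 1.0)}


def g(v):
    """Illustrative concave reward applied to the average outcome vector."""
    return sum(math.sqrt(x) for x in v)


def grad_g(v, eps=1e-9):
    """Gradient of g, with a small eps to avoid division by zero at the boundary."""
    return tuple(0.5 / math.sqrt(x + eps) for x in v)


def average_outcome(actions):
    """Average vectorial outcome of a sequence of actions."""
    totals = [0.0, 0.0]
    for a in actions:
        for i, x in enumerate(OUTCOMES[a]):
            totals[i] += x
    return tuple(t / len(actions) for t in totals)


def gradient_greedy(num_rounds):
    """Each round, play the action whose outcome best aligns with the
    gradient of g at the current running average (a Frank-Wolfe-style rule)."""
    totals = [0.0, 0.0]
    actions = []
    for t in range(num_rounds):
        avg = tuple(x / max(t, 1) for x in totals)
        grad = grad_g(avg)
        best = max(
            OUTCOMES,
            key=lambda k: sum(gi * oi for gi, oi in zip(grad, OUTCOMES[k])),
        )
        actions.append(best)
        for i, x in enumerate(OUTCOMES[best]):
            totals[i] += x
    return actions


fixed_reward = g(average_outcome([0] * 1000))            # always play action 0
greedy_reward = g(average_outcome(gradient_greedy(1000)))  # alternate via gradients
print(fixed_reward, greedy_reward)
```

In this toy setting the gradient-following rule ends up alternating between the two actions, so the average outcome is balanced and the concave reward is higher than under any single repeated action. In the Markovian setting described above, switching actions also affects which states are visited, which is the trade-off the gradient threshold procedure is said to control; this sketch does not model that effect.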

Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards