Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between the target policy and the behavior policy. One common approach is to measure the difference between the two policies in a probabilistic way, as in importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement are inefficient when utilizing traces under a greedy target policy, which is ineffective for control problems. The traces are cut immediately when a non-greedy action is taken, which may lose the advantage of eligibility traces and slow down the learning process. Alternatively, some non-probabilistic measurement methods such as General Q($\lambda$) and Naive Q($\lambda$) never cut traces, but face convergence problems in practice. To address the above issues, this paper introduces a new method named TBQ($\sigma$), which effectively unifies the tree-backup algorithm and Naive Q($\lambda$). By introducing a new parameter $\sigma$ to illustrate the \emph{degree} of utilizing traces, TBQ($\sigma$) creates an effective integration of TB($\lambda$) and Naive Q($\lambda$) and a continuous role shift between them. The contraction property of TBQ($\sigma$) is theoretically analyzed for both policy evaluation and control settings. We also derive the online version of TBQ($\sigma$) and give the convergence proof. We empirically show that, for $\epsilon\in(0,1]$ in $\epsilon$-greedy policies, there exists some degree of utilizing traces for $\lambda\in[0,1]$ that can improve the efficiency of trace utilization for off-policy reinforcement learning, both accelerating the learning process and improving the performance.
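The trace-cutting behavior described above can be illustrated with a minimal sketch. The abstract does not state the TBQ($\sigma$) update, so the mixing rule below is an assumption made only for illustration: the per-step trace decay is taken to be $\gamma\lambda[(1-\sigma)\pi(a|s)+\sigma]$, which reduces to a tree-backup style decay at $\sigma=0$ and to a Naive Q($\lambda$) style decay at $\sigma=1$. The function names `trace_decay` and `decay_traces` and all numeric values are hypothetical.

```python
def trace_decay(pi_prob: float, gamma: float, lam: float, sigma: float) -> float:
    """Per-step trace decay factor under an ASSUMED linear mix (not taken from the paper).

    sigma = 0 gives a tree-backup style decay, gamma * lam * pi(a|s).
    sigma = 1 gives a Naive Q(lambda) style decay, gamma * lam (never cut).
    """
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must lie in [0, 1]")
    tree_backup_weight = pi_prob  # probabilistic measurement: weight by pi(a|s)
    naive_weight = 1.0            # non-probabilistic: ignore the policy mismatch
    return gamma * lam * ((1.0 - sigma) * tree_backup_weight + sigma * naive_weight)


def decay_traces(traces, pi_prob, gamma, lam, sigma):
    """Scale every stored eligibility trace by the mixed decay factor."""
    factor = trace_decay(pi_prob, gamma, lam, sigma)
    return {key: factor * value for key, value in traces.items()}


if __name__ == "__main__":
    # Under a greedy target policy, a non-greedy action has pi(a|s) = 0.
    traces = {("s0", "a1"): 1.0}
    for sigma in (0.0, 0.5, 1.0):
        print(sigma, decay_traces(traces, pi_prob=0.0, gamma=0.9, lam=0.8, sigma=sigma))
```

Under this assumed rule, a non-greedy action wipes the trace at $\sigma=0$, leaves it decaying only by $\gamma\lambda$ at $\sigma=1$, and keeps a fraction of it for intermediate $\sigma$, which is one way to read the "degree of utilizing traces" in the abstract.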

TBQ($\sigma$): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning