Journal articles on the topic 'Off-Policy learning'

Author: Grafiati

Published: 1 June 2024

Consult the top 50 journal articles for your research on the topic 'Off-Policy learning.'

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Meng, Wenjia, Qian Zheng, Gang Pan, and Yilong Yin. "Off-Policy Proximal Policy Optimization." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 9162–70. http://dx.doi.org/10.1609/aaai.v37i8.26099.

Abstract:

Proximal Policy Optimization (PPO) is an important reinforcement learning method that has achieved great success in sequential decision-making problems. However, PPO is sample-inefficient because it cannot make use of off-policy data. In this paper, we propose an Off-Policy Proximal Policy Optimization method (Off-Policy PPO) that improves the sample efficiency of PPO by utilizing off-policy data. Specifically, we first propose a clipped surrogate objective function that can utilize off-policy data and avoid excessively large policy updates. Next, we theoretically clarify the stability of the optimization process of the proposed surrogate objective by demonstrating that the degree of policy update distance is consistent with that in PPO. We then describe the implementation details of the proposed Off-Policy PPO, which iteratively updates policies by optimizing the proposed clipped surrogate objective. Finally, the experimental results on representative continuous control tasks validate that our method outperforms the state-of-the-art methods on most tasks.
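
As a rough illustration of the clipped surrogate idea mentioned in this abstract, the sketch below computes the standard PPO clipped objective with the probability ratio taken against whichever policy collected the data. It is not the exact objective proposed in the paper; the function name, argument names, and the default `eps=0.2` are assumptions made for illustration.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_behaviour, advantages, eps=0.2):
    """Mean clipped surrogate objective (to be maximised).

    logp_new       : log-probabilities of the logged actions under the policy being updated
    logp_behaviour : log-probabilities of the same actions under the data-collecting policy
    advantages     : advantage estimates for the logged actions
    All names are illustrative assumptions, not taken from the paper.
    """
    ratio = np.exp(logp_new - logp_behaviour)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the elementwise minimum removes the incentive to move the ratio
    # far outside [1 - eps, 1 + eps], which limits the size of each update.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

With a ratio of 2 and a positive advantage, the clipped term caps the contribution at `1 + eps` times the advantage; with a negative advantage the unclipped (more pessimistic) term is kept.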

2

Schmitt, Simon, John Shawe-Taylor, and Hado van Hasselt. "Chaining Value Functions for Off-Policy Learning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8187–95. http://dx.doi.org/10.1609/aaai.v36i8.20792.

Abstract:

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn 'off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated out of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions this comes arbitrarily close to the off-policy TD solution when we increase the length of the chain. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore, it can be interpreted as estimating a novel objective, which we call a 'k-step expedition', of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
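
The chaining idea in this abstract can be sketched in an exact, tabular, model-based form. This is a simplification: the paper concerns sample-based prediction, and the matrix names `P_mu`, `P_pi` and reward vectors `r_mu`, `r_pi` (state-to-state transitions and expected rewards under the behaviour and target policies) are assumptions introduced here. The first link is the on-policy value of the behaviour policy; each later link bootstraps one target-policy step on the previous link, so link k is the value of following the target policy for k steps and the behaviour policy afterwards.

```python
import numpy as np

def chained_values(P_mu, r_mu, P_pi, r_pi, gamma, k):
    """Return the chain [v_0, ..., v_k] of value estimates.

    v_0 is the exact on-policy value of the behaviour policy.
    v_{j+1} = r_pi + gamma * P_pi @ v_j takes one more step under the target policy.
    """
    n = len(r_mu)
    v = np.linalg.solve(np.eye(n) - gamma * P_mu, r_mu)  # on-policy solution
    chain = [v]
    for _ in range(k):
        v = r_pi + gamma * P_pi @ v  # bootstrap on the previous link
        chain.append(v)
    return chain

# Small hypothetical two-state example.
P_mu = np.array([[0.5, 0.5], [0.5, 0.5]])
r_mu = np.array([0.0, 1.0])
P_pi = np.array([[0.1, 0.9], [0.2, 0.8]])
r_pi = np.array([0.5, 1.0])
chain = chained_values(P_mu, r_mu, P_pi, r_pi, gamma=0.9, k=200)
v_pi = np.linalg.solve(np.eye(2) - 0.9 * P_pi, r_pi)  # target-policy value for comparison
```

In this exact setting the error of each link shrinks by at least a factor of `gamma` in the max norm, so a long chain approaches the target-policy value.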

3

Xu, Da, Yuting Ye, Chuanwei Ruan, and Bo Yang. "Towards Robust Off-Policy Learning for Runtime Uncertainty." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 9 (June 28, 2022): 10101–9. http://dx.doi.org/10.1609/aaai.v36i9.21249.

Abstract:

Off-policy learning plays a pivotal role in optimizing and evaluating policies prior to online deployment. However, during real-time serving, we observe a variety of interventions and constraints that cause inconsistency between the online and offline settings, which we summarize and term runtime uncertainty. Such uncertainty cannot be learned from the logged data because it is abnormal and rare. To assert a certain level of robustness, we perturb the off-policy estimators along an adversarial direction in view of the runtime uncertainty. This allows the resulting estimators to be robust not only to observed but also to unexpected runtime uncertainties. Leveraging this idea, we bring runtime-uncertainty robustness to three major off-policy learning methods: the inverse propensity score method, the reward-model method, and the doubly robust method. We theoretically justify the robustness of our methods to runtime uncertainty, and demonstrate their effectiveness using both simulation and real-world online experiments.
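
For orientation, the three estimator families named in this abstract can be sketched in their plain, non-robust textbook form for logged contextual-bandit data. The adversarial perturbation proposed in the paper is not shown. The array layout (`pi_probs` and `q_hat` of shape `(n, K)`, logged `actions`, `rewards`, and logging `propensities` of length `n`) is an assumption made for this sketch.

```python
import numpy as np

def ips(pi_probs, actions, rewards, propensities):
    """Inverse propensity score estimate of the target policy's value."""
    idx = np.arange(len(actions))
    weights = pi_probs[idx, actions] / propensities
    return float(np.mean(weights * rewards))

def direct_method(pi_probs, q_hat):
    """Reward-model estimate: average the model's reward under the target policy."""
    return float(np.mean(np.sum(pi_probs * q_hat, axis=1)))

def doubly_robust(pi_probs, actions, rewards, propensities, q_hat):
    """Reward-model baseline plus an importance-weighted residual correction."""
    idx = np.arange(len(actions))
    weights = pi_probs[idx, actions] / propensities
    baseline = np.sum(pi_probs * q_hat, axis=1)
    correction = weights * (rewards - q_hat[idx, actions])
    return float(np.mean(baseline + correction))
```

Two sanity checks follow directly from the formulas: with an all-zero reward model the doubly robust estimate reduces to IPS, and with a reward model that matches every logged reward exactly it reduces to the reward-model estimate.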

4

Peters, James F., and Christopher Henry. "Approximation spaces in off-policy Monte Carlo learning." Engineering Applications of Artificial Intelligence 20, no. 5 (August 2007): 667–75. http://dx.doi.org/10.1016/j.engappai.2006.11.005.

5

Yu, Jiayu, Jingyao Li, Shuai Lü, and Shuai Han. "Mixed experience sampling for off-policy reinforcement learning." Expert Systems with Applications 251 (October 2024): 124017. http://dx.doi.org/10.1016/j.eswa.2024.124017.

6

Cetin, Edoardo, and Oya Celiktutan. "Learning Pessimism for Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (June 26, 2023): 6971–79. http://dx.doi.org/10.1609/aaai.v37i6.25852.

Abstract:

Off-policy deep reinforcement learning algorithms commonly compensate for overestimation bias during temporal-difference learning by utilizing pessimistic estimates of the expected target returns. In this work, we propose Generalized Pessimism Learning (GPL), a strategy employing a novel learnable penalty to enact such pessimism. In particular, we propose to learn this penalty alongside the critic with dual TD-learning, a new procedure to estimate and minimize the magnitude of the target returns bias with trivial computational cost. GPL enables us to accurately counteract overestimation bias throughout training without incurring the downsides of overly pessimistic targets. By integrating GPL with popular off-policy algorithms, we achieve state-of-the-art results in both competitive proprioceptive and pixel-based benchmarks.
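
To make the notion of a pessimistic target concrete, the sketch below writes a two-critic TD target as the critics' mean minus a penalty proportional to their disagreement. This is a generic illustration and not the GPL penalty or its dual TD-learning update, which the abstract does not spell out; `beta` is a fixed hypothetical coefficient here, whereas the paper describes a learned penalty. With `beta = 0.5` the expression equals the familiar minimum of the two critics, since min(a, b) = (a + b)/2 - |a - b|/2.

```python
import numpy as np

def pessimistic_target(reward, q1_next, q2_next, beta=0.5, gamma=0.99):
    """TD target built from two critic estimates at the next state-action pair.

    beta = 0   -> plain average of the critics (no pessimism)
    beta = 0.5 -> minimum of the two critics
    Larger beta values give more pessimistic targets.
    """
    mean_q = 0.5 * (q1_next + q2_next)
    penalty = beta * np.abs(q1_next - q2_next)
    return reward + gamma * (mean_q - penalty)
```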

7

Kong, Seung-Hyun, I. Made Aswin Nahrendra, and Dong-Hee Paek. "Enhanced Off-Policy Reinforcement Learning With Focused Experience Replay." IEEE Access 9 (2021): 93152–64. http://dx.doi.org/10.1109/access.2021.3085142.

8

Li, Lihong. "A perspective on off-policy evaluation in reinforcement learning." Frontiers of Computer Science 13, no. 5 (June 17, 2019): 911–12. http://dx.doi.org/10.1007/s11704-019-9901-7.

9

Luo, Biao, Huai-Ning Wu, and Tingwen Huang. "Off-Policy Reinforcement Learning for H∞ Control Design." IEEE Transactions on Cybernetics 45, no. 1 (January 2015): 65–76. http://dx.doi.org/10.1109/tcyb.2014.2319577.

10

Sun, Mingfei, Sam Devlin, Katja Hofmann, and Shimon Whiteson. "Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8378–85. http://dx.doi.org/10.1609/aaai.v36i8.20813.

Abstract:

Sample efficiency is crucial for imitation learning methods to be applicable in real-world applications. Many studies improve sample efficiency by extending adversarial imitation to be off-policy regardless of the fact that these off-policy extensions could either change the original objective or involve complicated optimization. We revisit the foundation of adversarial imitation and propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization. Our formulation capitalizes on two key insights: (1) the similarity between the Bellman equation and the stationary state-action distribution equation allows us to derive a novel temporal difference (TD) learning approach; and (2) the use of a deterministic policy simplifies the TD learning. Combined, these insights yield a practical algorithm, Deterministic and Discriminative Imitation (D2-Imitation), which operates by first partitioning samples into two replay buffers and then learning a deterministic policy via off-policy reinforcement learning. Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation on many control tasks.

11

Jain, Arushi, Gandharv Patil, Ayush Jain, Khimya Khetarpal, and Doina Precup. "Variance Penalized On-Policy and Off-Policy Actor-Critic." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 7899–907. http://dx.doi.org/10.1609/aaai.v35i9.16964.

Abstract:

Reinforcement learning algorithms are typically geared towards optimizing the expected return of an agent. However, in many practical applications, low variance in the return is desired to ensure the reliability of an algorithm. In this paper, we propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return. Previous work uses the second moment of return to estimate the variance indirectly. Instead, we use a much simpler recently proposed direct variance estimator which updates the estimates incrementally using temporal difference methods. Using the variance-penalized criterion, we guarantee the convergence of our algorithm to locally optimal policies for finite state action Markov decision processes. We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.

12

Hao, Longyan, Chaoli Wang, and Yibo Shi. "Quadratic Tracking Control of Linear Stochastic Systems with Unknown Dynamics Using Average Off-Policy Q-Learning Method." Mathematics 12, no. 10 (May 14, 2024): 1533. http://dx.doi.org/10.3390/math12101533.

Abstract:

This article investigates the optimal tracking control problem for data-based stochastic discrete-time linear systems. An average off-policy Q-learning algorithm is proposed to solve the optimal control problem with random disturbances. Compared with the existing off-policy reinforcement learning (RL) algorithm, the proposed average off-policy Q-learning algorithm avoids the assumption of an initial stability control. First, a pole placement strategy is used to design an initial stable control for systems with unknown dynamics. Second, the initial stable control is used to design a data-based average off-policy Q-learning algorithm. Then, this algorithm is used to solve the stochastic linear quadratic tracking (LQT) problem, and a convergence proof of the algorithm is provided. Finally, numerical examples show that this algorithm outperforms other algorithms in a simulation.

13

Gelada, Carles, and Marc G. Bellemare. "Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 3647–55. http://dx.doi.org/10.1609/aaai.v33i01.33013647.

Abstract:

In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method, online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.’s solution is appealing, it cannot easily be transferred to nonlinear function approximation. First, it requires a projection step onto the probability simplex; second, even though the operator describing the expected behavior of the off-policy learning algorithm is convergent, it is not known to be a contraction mapping, and hence, may be more unstable in practice. We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be minimized online and obviates the need for an explicit projection step. We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally, we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain, where we find performance gains for our approach.

14

Xiao, Teng, and Suhang Wang. "Towards Off-Policy Learning for Ranking Policies with Logged Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8700–8707. http://dx.doi.org/10.1609/aaai.v36i8.20849.

Abstract:

Probabilistic learning to rank (LTR) has been the dominating approach for optimizing the ranking metric, but cannot maximize long-term rewards. Reinforcement learning models have been proposed to maximize user long-term rewards by formulating the recommendation as a sequential decision-making problem, but could only achieve inferior accuracy compared to LTR counterparts, primarily due to the lack of online interactions and the characteristics of ranking. In this paper, we propose a new off-policy value ranking (VR) algorithm that can simultaneously maximize user long-term rewards and optimize the ranking metric offline for improved sample efficiency in a unified Expectation-Maximization (EM) framework. We theoretically and empirically show that the EM process guides the learned policy to enjoy the benefit of integration of the future reward and ranking metric, and learn without any online interactions. Extensive offline and online experiments demonstrate the effectiveness of our methods.

15

Li, Jinna, Hamidreza Modares, Tianyou Chai, Frank L. Lewis, and Lihua Xie. "Off-Policy Reinforcement Learning for Synchronization in Multiagent Graphical Games." IEEE Transactions on Neural Networks and Learning Systems 28, no. 10 (October 2017): 2434–45. http://dx.doi.org/10.1109/tnnls.2016.2609500.

16

Zhang, Hengrui, Youfang Lin, Shuo Shen, Sheng Han, and Kai Lv. "Enhancing Off-Policy Constrained Reinforcement Learning through Adaptive Ensemble C Estimation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 19 (March 24, 2024): 21770–78. http://dx.doi.org/10.1609/aaai.v38i19.30177.

Abstract:

In the domain of real-world agents, the application of Reinforcement Learning (RL) remains challenging due to the necessity for safety constraints. Previously, Constrained Reinforcement Learning (CRL) has predominantly focused on on-policy algorithms. Although these algorithms exhibit a degree of efficacy, their interactivity efficiency in real-world settings is sub-optimal, highlighting the demand for more efficient off-policy methods. However, off-policy CRL algorithms grapple with challenges in precise estimation of the C-function, particularly due to the fluctuations in the constrained Lagrange multiplier. Addressing this gap, our study focuses on the nuances of C-value estimation in off-policy CRL and introduces the Adaptive Ensemble C-learning (AEC) approach to reduce these inaccuracies. Building on state-of-the-art off-policy algorithms, we propose AEC-based CRL algorithms designed for enhanced task optimization. Extensive experiments on nine constrained robotics tasks reveal the superior interaction efficiency and performance of our algorithms in comparison to preceding methods.

17

Zhang, Shangtong, Bo Liu, and Shimon Whiteson. "Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10905–13. http://dx.doi.org/10.1609/aaai.v35i12.17302.

Abstract:

We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite horizon MDP optimizing the variance of a per-step reward random variable. MVPI enjoys great flexibility in that any policy evaluation method and risk-neutral control method can be dropped in for risk-averse control off the shelf, in both on- and off-policy settings. This flexibility reduces the gap between risk-neutral control and risk-averse control and is achieved by working on a novel augmented MDP directly. We propose risk-averse TD3 as an example instantiating MVPI, which outperforms vanilla TD3 and many previous risk-averse control methods in challenging MuJoCo robot simulation tasks under a risk-aware performance metric. This risk-averse TD3 is the first to introduce deterministic policies and off-policy learning into risk-averse reinforcement learning, both of which are key to the performance boost we show in MuJoCo domains.

18

Ali, Raja Farrukh, Kevin Duong, Nasik Muhammad Nafi, and William Hsu. "Multi-Horizon Learning in Procedurally-Generated Environments for Off-Policy Reinforcement Learning (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 13 (June 26, 2023): 16150–51. http://dx.doi.org/10.1609/aaai.v37i13.26935.

Abstract:

Value estimates at multiple timescales can help create advanced discounting functions and allow agents to form more effective predictive models of their environment. In this work, we investigate learning over multiple horizons concurrently for off-policy reinforcement learning by using an advantage-based action selection method and introducing architectural improvements. Our proposed agent learns over multiple horizons simultaneously, while using either exponential or hyperbolic discounting functions. We implement our approach on Rainbow, a value-based off-policy algorithm, and test on Procgen, a collection of procedurally-generated environments, to demonstrate the effectiveness of this approach, specifically to evaluate the agent's performance in previously unseen scenarios.
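
One way to see why value estimates at several exponential horizons are useful for hyperbolic discounting is the identity 1/(1 + k t) = ∫₀¹ γ^(k t) dγ: a hyperbolically discounted return is an average of exponentially discounted returns over a range of discount factors. The sketch below checks this numerically on a fixed reward sequence with a midpoint Riemann sum. It is a standalone illustration, not the agent architecture from the paper; the function names, `k=0.1`, and `n_heads=1000` are assumptions.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Exponentially discounted return of a finite reward sequence."""
    t = np.arange(len(rewards))
    return float(np.sum(gamma ** t * rewards))

def hyperbolic_return(rewards, k=0.1):
    """Return under the hyperbolic discount 1 / (1 + k * t)."""
    t = np.arange(len(rewards))
    return float(np.sum(rewards / (1.0 + k * t)))

def hyperbolic_from_exponentials(rewards, k=0.1, n_heads=1000):
    """Approximate the hyperbolic return by averaging exponential returns.

    Each 'head' uses the discount factor g ** k, where g runs over the
    midpoints of n_heads equal sub-intervals of (0, 1).
    """
    gammas = (np.arange(n_heads) + 0.5) / n_heads
    returns = [discounted_return(rewards, g ** k) for g in gammas]
    return float(np.mean(returns))

rewards = np.ones(10)
exact = hyperbolic_return(rewards)
approx = hyperbolic_from_exponentials(rewards)
```

Here `exact` and `approx` should agree closely, and the agreement improves as `n_heads` grows.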

19

Tennenholtz, Guy, Uri Shalit, and Shie Mannor. "Off-Policy Evaluation in Partially Observable Environments." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 06 (April 3, 2020): 10276–83. http://dx.doi.org/10.1609/aaai.v34i06.6590.

Abstract:

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data.
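
The Importance Sampling baseline mentioned in this abstract can be sketched as the ordinary trajectory-wise estimator below. The data layout (each step stored as a tuple of target-policy probability, behaviour-policy probability, and reward) is an assumption made for illustration. The estimator treats the logged behaviour probabilities as correct given what was observed; as the abstract points out, when the behaviour policy depended on unobserved state this kind of estimate can be biased.

```python
import numpy as np

def trajectory_is_estimate(trajectories, gamma=1.0):
    """Trajectory-wise importance sampling estimate of the target policy's value.

    trajectories: list of trajectories, each a list of
                  (pi_prob, mu_prob, reward) tuples for the logged action.
    """
    estimates = []
    for traj in trajectories:
        weight = 1.0   # product of per-step probability ratios
        ret = 0.0      # discounted return of the logged trajectory
        for t, (pi_prob, mu_prob, reward) in enumerate(traj):
            weight *= pi_prob / mu_prob
            ret += gamma ** t * reward
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```

When the target and behaviour policies coincide, every weight is 1 and the estimate is simply the average logged return.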

20

Nakamura, Yutaka, Takeshi Mori, Yoichi Tokita, Tomohiro Shibata, and Shin Ishii. "Off-Policy Natural Policy Gradient Method for a Biped Walking Using a CPG Controller." Journal of Robotics and Mechatronics 17, no. 6 (December 20, 2005): 636–44. http://dx.doi.org/10.20965/jrm.2005.p0636.

Abstract:

Referring to the mechanism of animals’ rhythmic movements, motor control schemes using a central pattern generator (CPG) controller have been studied. We previously proposed a reinforcement learning (RL) method called the CPG-actor-critic model as an autonomous learning framework for a CPG controller. Here, we propose an off-policy natural policy gradient RL algorithm for the CPG-actor-critic model, to solve the “exploration-exploitation” problem by meta-controlling the “behavior policy.” We apply this RL algorithm to an automatic control problem using a biped robot simulator. Computer simulation demonstrated that the CPG controller enables the biped robot to walk stably and efficiently based on our new algorithm.

21

Wang, Mingyang, Zhenshan Bing, Xiangtong Yao, Shuai Wang, Huang Kai, Hang Su, Chenguang Yang, and Alois Knoll. "Meta-Reinforcement Learning Based on Self-Supervised Task Representation Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 10157–65. http://dx.doi.org/10.1609/aaai.v37i8.26210.

Abstract:

Meta-reinforcement learning enables artificial agents to learn from related training tasks and adapt to new tasks efficiently with minimal interaction data. However, most existing research is still limited to narrow task distributions that are parametric and stationary, and does not consider out-of-distribution tasks during the evaluation, thus restricting its application. In this paper, we propose MoSS, a context-based Meta-reinforcement learning algorithm based on Self-Supervised task representation learning to address this challenge. We extend meta-RL to broad non-parametric task distributions which have never been explored before, and also achieve state-of-the-art results in non-stationary and out-of-distribution tasks. Specifically, MoSS consists of a task inference module and a policy module. We utilize the Gaussian mixture model for task representation to imitate the parametric and non-parametric task variations. Additionally, our online adaptation strategy enables the agent to react at the first sight of a task change, thus being applicable in non-stationary tasks. MoSS also exhibits strong generalization robustness in out-of-distribution tasks, which benefits from the reliable and robust task representation. The policy is built on top of an off-policy RL algorithm and the entire network is trained completely off-policy to ensure high sample efficiency. On MuJoCo and Meta-World benchmarks, MoSS outperforms prior works in terms of asymptotic performance, sample efficiency (3-50x faster), adaptation efficiency, and generalization robustness on broad and diverse task distributions.

22

Cao, Jiaqing, Quan Liu, Fei Zhu, Qiming Fu, and Shan Zhong. "Gradient temporal-difference learning for off-policy evaluation using emphatic weightings." Information Sciences 580 (November 2021): 311–30. http://dx.doi.org/10.1016/j.ins.2021.08.082.

23

Tian, Chang, An Liu, Guan Huang, and Wu Luo. "Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning." IEEE Transactions on Signal Processing 70 (2022): 1609–24. http://dx.doi.org/10.1109/tsp.2022.3158737.

24

Karimpanal, Thommen George, and Erik Wilhelm. "Identification and off-policy learning of multiple objectives using adaptive clustering." Neurocomputing 263 (November 2017): 39–47. http://dx.doi.org/10.1016/j.neucom.2017.04.074.

25

Kiumarsi, Bahare, Frank L. Lewis, and Zhong-Ping Jiang. "H∞ control of linear discrete-time systems: Off-policy reinforcement learning." Automatica 78 (April 2017): 144–52. http://dx.doi.org/10.1016/j.automatica.2016.12.009.

26

Li, Jinna, Zhenfei Xiao, and Ping Li. "Discrete-Time Multi-Player Games Based on Off-Policy Q-Learning." IEEE Access 7 (2019): 134647–59. http://dx.doi.org/10.1109/access.2019.2939384.

27

Kiumarsi, Bahare, Wei Kang, and Frank L. Lewis. "H∞ Control of Nonaffine Aerial Systems Using Off-policy Reinforcement Learning." Unmanned Systems 04, no. 01 (January 2016): 51–60. http://dx.doi.org/10.1142/s2301385016400069.

Abstract:

This paper presents a completely model-free H∞ optimal tracking solution to the control of a general class of nonlinear nonaffine systems in the presence of input constraints. The proposed method is motivated by a nonaffine unmanned aerial vehicle (UAV) system as a real application. First, a general class of nonlinear nonaffine system dynamics is presented as an affine system in terms of a nonlinear function of the control input. It is shown that the optimal control of nonaffine systems may not have an admissible solution if the utility function is not defined properly. Moreover, the boundedness of the optimal control input cannot be guaranteed for standard performance functions. A new performance function is defined and used in the [Formula: see text]-gain condition for this class of nonaffine systems. This performance function guarantees the existence of an admissible solution (if any exists) and boundedness of the control input solution. An off-policy reinforcement learning (RL) algorithm is employed to iteratively solve the H∞ optimal tracking control problem online using the measured data along the system trajectories. The proposed off-policy RL does not require any knowledge of the system dynamics. Moreover, the disturbance input does not need to be adjustable in a specific manner.

28

Lian, Bosen, Wenqian Xue, Yijing Xie, Frank L. Lewis, and Ali Davoudi. "Off-policy inverse Q-learning for discrete-time antagonistic unknown systems." Automatica 155 (September 2023): 111171. http://dx.doi.org/10.1016/j.automatica.2023.111171.

29

Kim, Man-Je, Hyunsoo Park, and Chang Wook Ahn. "Nondominated Policy-Guided Learning in Multi-Objective Reinforcement Learning." Electronics 11, no. 7 (March 28, 2022): 1069. http://dx.doi.org/10.3390/electronics11071069.

Abstract:

Control intelligence is a typical field where there is a trade-off between target objectives, and researchers in this field have longed for artificial intelligence that achieves the target objectives. Multi-objective deep reinforcement learning was sufficient to satisfy this need. In particular, multi-objective deep reinforcement learning methods based on policy optimization are leading the optimization of control intelligence. However, multi-objective reinforcement learning has difficulty finding diverse Pareto-optimal solutions for multiple objectives due to the greedy nature of reinforcement learning. We propose a method of policy assimilation to solve this problem. This method was applied to MO-V-MPO, one of the preference-based multi-objective reinforcement learning methods, to increase diversity. The performance of this method has been verified through experiments in a continuous control environment.

30

Chaudhari, Shreyas, David Arbour, Georgios Theocharous, and Nikos Vlassis. "Distributional Off-Policy Evaluation for Slate Recommendations." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (March 24, 2024): 8265–73. http://dx.doi.org/10.1609/aaai.v38i8.28667.

Abstract:

Recommendation strategies are typically evaluated by using previously logged data, employing off-policy evaluation methods to estimate their expected performance. However, for strategies that present users with slates of multiple items, the resulting combinatorial action space renders many of these methods impractical. Prior work has developed estimators that leverage the structure in slates to estimate the expected off-policy performance, but the estimation of the entire performance distribution remains elusive. Estimating the complete distribution allows for a more comprehensive evaluation of recommendation strategies, particularly along the axes of risk and fairness that employ metrics computable from the distribution. In this paper, we propose an estimator for the complete off-policy performance distribution for slates and establish conditions under which the estimator is unbiased and consistent. This builds upon prior work on off-policy evaluation for slates and off-policy distribution estimation in reinforcement learning. We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures.

31

Zhang, Ruiyi, Tong Yu, Yilin Shen, and Hongxia Jin. "Text-Based Interactive Recommendation via Offline Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11694–702. http://dx.doi.org/10.1609/aaai.v36i10.21424.

Abstract:

Interactive recommendation with natural-language feedback can provide richer user feedback and has demonstrated advantages over traditional recommender systems. However, the classical online paradigm involves iteratively collecting experience via interaction with users, which is expensive and risky. We consider an offline interactive recommendation to exploit arbitrary experience collected by multiple unknown policies. A direct application of policy learning with such fixed experience suffers from the distribution shift. To tackle this issue, we develop a behavior-agnostic off-policy correction framework to make offline interactive recommendation possible. Specifically, we leverage the conservative Q-function to perform off-policy evaluation, which enables learning effective policies from fixed datasets without further interactions. Empirical results on the simulator derived from real-world datasets demonstrate the effectiveness of our proposed offline training framework.

32

Xu, Z., L. Cao, and X. Chen. "Deep Reinforcement Learning with Adaptive Update Target Combination." Computer Journal 63, no. 7 (August 15, 2019): 995–1003. http://dx.doi.org/10.1093/comjnl/bxz066.

Abstract:

Simple and efficient exploration remains a core challenge in deep reinforcement learning. While many exploration methods can be applied to high-dimensional tasks, these methods manually adjust exploration parameters according to domain knowledge. This paper proposes a novel method that can automatically balance exploration and exploitation, as well as combine on-policy and off-policy update targets through dynamic weighting based on value difference. The proposed method does not directly affect the probability of a selected action but utilizes the value difference produced during the learning process to adjust the update target, guiding the direction of the agent’s learning. We demonstrate the performance of the proposed method on CartPole-v1, MountainCar-v0, and LunarLander-v2 classic control tasks from the OpenAI Gym. Empirical evaluation results show that by integrating on-policy and off-policy update targets dynamically, this method exhibits superior performance and stability compared with the exclusive use of either update target.

33

Shahid, Asad Ali, Dario Piga, Francesco Braghin, and Loris Roveda. "Continuous control actions learning and adaptation for robotic manipulation through reinforcement learning." Autonomous Robots 46, no. 3 (February 9, 2022): 483–98. http://dx.doi.org/10.1007/s10514-022-10034-z.

Abstract:

This paper presents a learning-based method that uses simulation data to learn an object manipulation task using two model-free reinforcement learning (RL) algorithms. The learning performance is compared across on-policy and off-policy algorithms: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). In order to accelerate the learning process, a fine-tuning procedure is proposed that demonstrates the continuous adaptation of on-policy RL to new environments, allowing the learned policy to adapt and execute the (partially) modified task. A dense reward function is designed for the task to enable efficient learning by the agent. A grasping task involving a Franka Emika Panda manipulator is considered as the reference task to be learned. The learned control policy is demonstrated to be generalizable across multiple object geometries and initial robot/parts configurations. The approach is finally tested on a real Franka Emika Panda robot, showing the possibility to transfer the learned behavior from simulation. Experimental results show a 100% grasping success rate, making the proposed approach applicable to real-world applications.

34

Hollenstein, Jakob, Georg Martius, and Justus Piater. "Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 11 (March 24, 2024): 12466–72. http://dx.doi.org/10.1609/aaai.v38i11.29139.

Abstract:

Proximal Policy Optimization (PPO), a popular on-policy deep reinforcement learning method, employs a stochastic policy for exploration. In this paper, we propose a colored noise-based stochastic policy variant of PPO. Previous research highlighted the importance of temporal correlation in action noise for effective exploration in off-policy reinforcement learning. Building on this, we investigate whether correlated noise can also enhance exploration in on-policy methods like PPO. We discovered that correlated noise for action selection improves learning performance and outperforms the currently popular uncorrelated white noise approach in on-policy methods. Unlike off-policy learning, where pink noise was found to be highly effective, we found that a colored noise, intermediate between white and pink, performed best for on-policy learning in PPO. We examined the impact of varying the amount of data collected for each update by modifying the number of parallel simulation environments for data collection and observed that with a larger number of parallel environments, more strongly correlated noise is beneficial. Due to the significant impact and ease of implementation, we recommend switching to correlated noise as the default noise source in PPO.

35

Ren, He, Jing Dai, Huaguang Zhang, and Kun Zhang. "Off-policy integral reinforcement learning algorithm in dealing with nonzero sum game for nonlinear distributed parameter systems." Transactions of the Institute of Measurement and Control 42, no. 15 (July 6, 2020): 2919–28. http://dx.doi.org/10.1177/0142331220932634.

Abstract:

Benefitting from the technology of integral reinforcement learning (IRL), the nonzero sum (NZS) game for distributed parameter systems is effectively solved in this paper when the information of system dynamics is unavailable. The Karhunen-Loève decomposition (KLD) is employed to convert the partial differential equation (PDE) systems into high-order ordinary differential equation (ODE) systems. Moreover, the off-policy IRL technology is introduced to design the optimal strategies for the NZS game. To confirm that the presented algorithm will converge to the optimal value functions, the traditional adaptive dynamic programming (ADP) method is first discussed. Then, the equivalence between the traditional ADP method and the presented off-policy method is proved. For implementing the presented off-policy IRL method, actor and critic neural networks are utilized to approximate the value functions and control strategies in the iteration process, respectively. Finally, a numerical simulation is shown to illustrate the effectiveness of the proposed off-policy algorithm.

36

Levine, Alexander, and Soheil Feizi. "Goal-Conditioned Q-learning as Knowledge Distillation." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 7 (June 26, 2023): 8500–8509. http://dx.doi.org/10.1609/aaai.v37i7.26024.

Abstract:

Many applications of reinforcement learning can be formalized as goal-conditioned environments, where, in each episode, there is a "goal" that affects the rewards obtained during that episode but does not affect the dynamics. Various techniques have been proposed to improve performance in goal-conditioned environments, such as automatic curriculum generation and goal relabeling. In this work, we explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation. In particular: the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals. We therefore apply Gradient-Based Attention Transfer (Zagoruyko and Komodakis 2017), a knowledge distillation technique, to the Q-function update. We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional. We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals, where the agent can attain a reward by achieving any one of a large set of objectives, all specified at test time. Finally, to provide theoretical support, we give examples of classes of environments where (under some assumptions) standard off-policy algorithms such as DDPG require at least O(d^2) replay buffer transitions to learn an optimal policy, while our proposed technique requires only O(d) transitions, where d is the dimensionality of the goal and state space. Code and appendix are available at https://github.com/alevine0/ReenGAGE.

37

Yang, Hyunjun, Hyeonjun Park, and Kyungjae Lee. "A Selective Portfolio Management Algorithm with Off-Policy Reinforcement Learning Using Dirichlet Distribution." Axioms 11, no. 12 (November 23, 2022): 664. http://dx.doi.org/10.3390/axioms11120664.

Abstract:

Existing methods in portfolio management deterministically produce an optimal portfolio. However, according to modern portfolio theory, there exists a trade-off between a portfolio’s expected returns and risks. Therefore, no single optimal portfolio exists; rather, several do, and using only one deterministic portfolio is disadvantageous for risk management. We propose Dirichlet Distribution Trader (DDT), an algorithm that calculates multiple optimal portfolios by taking a Dirichlet distribution as a policy. The DDT algorithm makes several optimal portfolios according to risk levels. In addition, by obtaining the pi value from the distribution and applying importance sampling to off-policy learning, the sample is used efficiently. Furthermore, the architecture of our model is scalable because the feed-forward of information between portfolio stocks occurs independently. This means that even if untrained stocks are added to the portfolio, the optimal weight can be adjusted. We also conducted three experiments. In the scalability experiment, it was shown that the DDT extended model, which is trained with only three stocks, had little difference in performance from the DDT model that learned all the stocks in the portfolio. In an experiment comparing the off-policy algorithm and the on-policy algorithm, it was shown that the off-policy algorithm had good performance regardless of the stock price trend. In an experiment comparing investment results according to risk level, it was shown that a higher return or a better Sharpe ratio could be obtained through risk control.

38

Suttle, Wesley, Zhuoran Yang, Kaiqing Zhang, Zhaoran Wang, Tamer Başar, and Ji Liu. "A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning." IFAC-PapersOnLine 53, no. 2 (2020): 1549–54. http://dx.doi.org/10.1016/j.ifacol.2020.12.2021.

39

Stanković, Miloš S., Marko Beko, and Srdjan S. Stanković. "Distributed Gradient Temporal Difference Off-policy Learning With Eligibility Traces: Weak Convergence." IFAC-PapersOnLine 53, no. 2 (2020): 1563–68. http://dx.doi.org/10.1016/j.ifacol.2020.12.2184.

40

Li, Jinna, Zhenfei Xiao, Tianyou Chai, Frank L. Lewis, and Sarangapani Jagannathan. "Off-Policy Q-Learning for Anti-Interference Control of Multi-Player Systems." IFAC-PapersOnLine 53, no. 2 (2020): 9189–94. http://dx.doi.org/10.1016/j.ifacol.2020.12.2180.

41

Kim and Park. "Exploration with Multiple Random ε-Buffers in Off-Policy Deep Reinforcement Learning." Symmetry 11, no. 11 (November 1, 2019): 1352. http://dx.doi.org/10.3390/sym11111352.

Abstract:

In terms of deep reinforcement learning (RL), exploration is highly significant in achieving better generalization. In benchmark studies, ε-greedy random actions have been used to encourage exploration and prevent over-fitting, thereby improving generalization. Deep RL with random ε-greedy policies, such as deep Q-networks (DQNs), can demonstrate efficient exploration behavior. A random ε-greedy policy exploits additional replay buffers in an environment of sparse and binary rewards, such as in the real-time online detection of network securities by verifying whether the network is “normal or anomalous.” Prior studies have illustrated that a prioritized replay memory attributed to a complex temporal difference error provides superior theoretical results. However, another implementation illustrated that in certain environments, the prioritized replay memory is not superior to the randomly-selected buffers of random ε-greedy policy. Moreover, a key challenge of hindsight experience replay inspires our objective by using additional buffers corresponding to each different goal. Therefore, we attempt to exploit multiple random ε-greedy buffers to enhance explorations for a more near-perfect generalization with one original goal in off-policy RL. We demonstrate the benefit of off-policy learning from our method through an experimental comparison of DQN and a deep deterministic policy gradient in terms of discrete action, as well as continuous control for complete symmetric environments.

42

Chen, Ning, Shuhan Luo, Jiayang Dai, Biao Luo, and Weihua Gui. "Optimal Control of Iron-Removal Systems Based on Off-Policy Reinforcement Learning." IEEE Access 8 (2020): 149730–40. http://dx.doi.org/10.1109/access.2020.3015801.

43

Hachiya, Hirotaka, Takayuki Akiyama, Masashi Sugiyama, and Jan Peters. "Adaptive importance sampling for value function approximation in off-policy reinforcement learning." Neural Networks 22, no. 10 (December 2009): 1399–410. http://dx.doi.org/10.1016/j.neunet.2009.01.002.

44

Zuo, Guoyu, Qishen Zhao, Kexin Chen, Jiangeng Li, and Daoxiong Gong. "Off-policy adversarial imitation learning for robotic tasks with low-quality demonstrations." Applied Soft Computing 97 (December 2020): 106795. http://dx.doi.org/10.1016/j.asoc.2020.106795.

45

Givchi, Arash, and Maziar Palhang. "Off-policy temporal difference learning with distribution adaptation in fast mixing chains." Soft Computing 22, no. 3 (January 30, 2017): 737–50. http://dx.doi.org/10.1007/s00500-017-2490-1.

46

Liu, Mushuang, Yan Wan, Frank L. Lewis, and Victor G. Lopez. "Adaptive Optimal Control for Stochastic Multiplayer Differential Games Using On-Policy and Off-Policy Reinforcement Learning." IEEE Transactions on Neural Networks and Learning Systems 31, no. 12 (December 2020): 5522–33. http://dx.doi.org/10.1109/tnnls.2020.2969215.

47

Pritchett, Lant, and Justin Sandefur. "Learning from Experiments when Context Matters." American Economic Review 105, no. 5 (May 1, 2015): 471–75. http://dx.doi.org/10.1257/aer.p20151016.

Abstract:

Suppose a policymaker is interested in the impact of an existing social program. Impact estimates using observational data suffer potential bias, while unbiased experimental estimates are often limited to other contexts. This creates a practical trade-off between internal and external validity for evidence-based policymaking. We explore this trade-off empirically for several common policies analyzed in development economics, including microcredit, migration, and education interventions. Based on mean-squared error, non-experimental evidence within context outperforms experimental evidence from another context. This advantage declines, but may not reverse, with experimental replication. We offer four reasons these findings are of general relevance to policy evaluation.

48

Chen, Zaiwei. "A Unified Lyapunov Framework for Finite-Sample Analysis of Reinforcement Learning Algorithms." ACM SIGMETRICS Performance Evaluation Review 50, no. 3 (December 30, 2022): 12–15. http://dx.doi.org/10.1145/3579342.3579346.

Abstract:

Reinforcement learning (RL) is a paradigm where an agent learns to accomplish tasks by interacting with the environment, similar to how humans learn. RL is therefore viewed as a promising approach to achieve artificial intelligence, as evidenced by the remarkable empirical successes. However, many RL algorithms are theoretically not well-understood, especially in the setting where function approximation and off-policy sampling are employed. My thesis [1] aims at developing a thorough theoretical understanding of the performance of various RL algorithms through finite-sample analysis. Since most of the RL algorithms are essentially stochastic approximation (SA) algorithms for solving variants of the Bellman equation, the first part of the thesis is dedicated to the analysis of general SA involving a contraction operator, and under Markovian noise. We develop a Lyapunov approach where we construct a novel Lyapunov function called the generalized Moreau envelope. The results on SA enable us to establish finite-sample bounds of various RL algorithms in the tabular setting (cf. Part II of the thesis) and when using function approximation (cf. Part III of the thesis), which in turn provide theoretical insights into several important problems in the RL community, such as the efficiency of bootstrapping, the bias-variance trade-off in off-policy learning, and the stability of off-policy control. The main body of this document provides an overview of the contributions of my thesis.

49

Narita, Yusuke, Kyohei Okumura, Akihiro Shimizu, and Kohei Yata. "Counterfactual Learning with General Data-Generating Policies." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 9286–93. http://dx.doi.org/10.1609/aaai.v37i8.26113.

Abstract:

Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full support and deficient support logging policies in contextual-bandit settings. This class includes deterministic bandit (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We validate our method with experiments on partly and entirely deterministic logging policies. Finally, we apply it to evaluate coupon targeting policies by a major online platform and show how to improve the existing policy.

50

Kim, MyeongSeop, Jung-Su Kim, Myoung-Su Choi, and Jae-Han Park. "Adaptive Discount Factor for Deep Reinforcement Learning in Continuing Tasks with Uncertainty." Sensors 22, no. 19 (September 25, 2022): 7266. http://dx.doi.org/10.3390/s22197266.

Abstract:

Reinforcement learning (RL) trains an agent by maximizing the sum of a discounted reward. Since the discount factor has a critical effect on the learning performance of the RL agent, it is important to choose the discount factor properly. When uncertainties are involved in the training, the learning performance with a constant discount factor can be limited. For the purpose of obtaining acceptable learning performance consistently, this paper proposes an adaptive rule for the discount factor based on the advantage function. Additionally, how to use the advantage function in both on-policy and off-policy algorithms is presented. To demonstrate the performance of the proposed adaptive rule, it is applied to PPO (Proximal Policy Optimization) for Tetris in order to validate the on-policy case, and to SAC (Soft Actor-Critic) for the motion planning of a robot manipulator to validate the off-policy case. In both cases, the proposed method results in a better or similar performance compared with cases using the best constant discount factors found by exhaustive search. Hence, the proposed adaptive discount factor automatically finds a discount factor that leads to comparable training performance, and that can be applied to representative deep reinforcement learning problems.