Journal articles on the topic "Off-Policy learning"

Author: Grafiati

Published on June 1, 2024

Browse the top 50 journal articles for your research on the topic "Off-Policy learning".

1

Meng, Wenjia, Qian Zheng, Gang Pan, and Yilong Yin. "Off-Policy Proximal Policy Optimization." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 9162–70. http://dx.doi.org/10.1609/aaai.v37i8.26099.

Abstract:

Proximal Policy Optimization (PPO) is an important reinforcement learning method, which has achieved great success in sequential decision-making problems. However, PPO faces the issue of sample inefficiency, because PPO cannot make use of off-policy data. In this paper, we propose an Off-Policy Proximal Policy Optimization method (Off-Policy PPO) that improves the sample efficiency of PPO by utilizing off-policy data. Specifically, we first propose a clipped surrogate objective function that can utilize off-policy data and avoid excessively large policy updates. Next, we theoretically clarify the stability of the optimization process of the proposed surrogate objective by demonstrating that the degree of policy update distance is consistent with that in PPO. We then describe the implementation details of the proposed Off-Policy PPO, which iteratively updates policies by optimizing the proposed clipped surrogate objective. Finally, the experimental results on representative continuous control tasks validate that our method outperforms the state-of-the-art methods on most tasks.
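As a rough illustration of what a clipped surrogate objective looks like, the sketch below implements the standard PPO clipping rule applied to an importance ratio. It is not the objective proposed in the paper above, whose exact form the abstract does not give; the function names and the default `eps=0.2` are assumptions made for the example.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO-style clipped surrogate for a single sample.

    ratio     -- pi_new(a|s) / pi_data(a|s), the importance ratio
    advantage -- advantage estimate for the sampled (s, a)
    eps       -- clipping range (0.2 is an assumed, commonly used value)
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Keep the smaller of the unclipped and clipped terms, so a large
    # ratio cannot be exploited to make an oversized policy update.
    return min(ratio * advantage, clipped_ratio * advantage)


def mean_surrogate(ratios, advantages, eps=0.2):
    """Average the per-sample objective over a batch."""
    values = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(values) / len(values)
```

With a positive advantage, the objective stops growing once the ratio exceeds `1 + eps`; with a negative advantage, it stops improving once the ratio falls below `1 - eps`.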

2

Schmitt, Simon, John Shawe-Taylor, and Hado van Hasselt. "Chaining Value Functions for Off-Policy Learning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8187–95. http://dx.doi.org/10.1609/aaai.v36i8.20792.

Abstract:

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn 'off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated out of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions this comes arbitrarily close to the off-policy TD solution when we increase the length of the chain. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore it can be interpreted as estimating a novel objective, which we call a 'k-step expedition', of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.

3

Xu, Da, Yuting Ye, Chuanwei Ruan, and Bo Yang. "Towards Robust Off-Policy Learning for Runtime Uncertainty." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 9 (June 28, 2022): 10101–9. http://dx.doi.org/10.1609/aaai.v36i9.21249.

Abstract:

Off-policy learning plays a pivotal role in optimizing and evaluating policies prior to the online deployment. However, during the real-time serving, we observe varieties of interventions and constraints that cause inconsistency between the online and offline setting, which we summarize and term as runtime uncertainty. Such uncertainty cannot be learned from the logged data due to its abnormality and rareness nature. To assert a certain level of robustness, we perturb the off-policy estimators along an adversarial direction in view of the runtime uncertainty. It allows the resulting estimators to be robust not only to observed but also unexpected runtime uncertainties. Leveraging this idea, we bring runtime-uncertainty robustness to three major off-policy learning methods: the inverse propensity score method, reward-model method, and doubly robust method. We theoretically justify the robustness of our methods to runtime uncertainty, and demonstrate their effectiveness using both the simulation and the real-world online experiments.
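The three estimator families named in this abstract can be sketched in their plain textbook form for logged bandit data. The code below does not include the adversarial perturbation the paper adds for runtime uncertainty; all function and argument names are hypothetical and chosen only for this illustration.

```python
def ips(rewards, target_probs, logging_probs):
    """Inverse propensity score estimate of the target policy's value."""
    weights = [t / l for t, l in zip(target_probs, logging_probs)]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def reward_model(model_values):
    """Reward-model estimate: average the model's predicted value of the
    target policy over the logged contexts."""
    return sum(model_values) / len(model_values)


def doubly_robust(rewards, target_probs, logging_probs, model_values, model_logged):
    """Reward-model estimate plus an IPS-weighted correction of its residual.

    model_values -- predicted value of the target policy in each context
    model_logged -- predicted reward of the action that was actually logged
    """
    total = 0.0
    for r, t, l, v, q in zip(rewards, target_probs, logging_probs,
                             model_values, model_logged):
        total += v + (t / l) * (r - q)
    return total / len(rewards)
```

When the reward model predicts the logged rewards exactly, the correction term vanishes and the doubly robust estimate equals the reward-model estimate; when the model predicts zero everywhere, it reduces to the IPS estimate.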

4

Peters, James F., and Christopher Henry. "Approximation spaces in off-policy Monte Carlo learning." Engineering Applications of Artificial Intelligence 20, no. 5 (August 2007): 667–75. http://dx.doi.org/10.1016/j.engappai.2006.11.005.

5

Yu, Jiayu, Jingyao Li, Shuai Lü, and Shuai Han. "Mixed experience sampling for off-policy reinforcement learning." Expert Systems with Applications 251 (October 2024): 124017. http://dx.doi.org/10.1016/j.eswa.2024.124017.

6

Cetin, Edoardo, and Oya Celiktutan. "Learning Pessimism for Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (June 26, 2023): 6971–79. http://dx.doi.org/10.1609/aaai.v37i6.25852.

Abstract:

Off-policy deep reinforcement learning algorithms commonly compensate for overestimation bias during temporal-difference learning by utilizing pessimistic estimates of the expected target returns. In this work, we propose Generalized Pessimism Learning (GPL), a strategy employing a novel learnable penalty to enact such pessimism. In particular, we propose to learn this penalty alongside the critic with dual TD-learning, a new procedure to estimate and minimize the magnitude of the target returns bias with trivial computational cost. GPL enables us to accurately counteract overestimation bias throughout training without incurring the downsides of overly pessimistic targets. By integrating GPL with popular off-policy algorithms, we achieve state-of-the-art results in both competitive proprioceptive and pixel-based benchmarks.
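To make the idea of a pessimistic target concrete, here is a minimal sketch of a TD target built from two critics with an adjustable penalty. It is only an illustration: the penalty coefficient `beta` is a fixed, hypothetical parameter here, whereas the abstract describes a penalty that is learned with dual TD-learning, which this sketch does not implement.

```python
def pessimistic_target(reward, gamma, q1_next, q2_next, beta=1.0, done=False):
    """TD target from two critic estimates with a pessimism penalty.

    beta = 0 bootstraps from the mean of the two critics;
    beta = 1 recovers min(q1_next, q2_next), the clipped double-Q target.
    """
    mean_q = 0.5 * (q1_next + q2_next)
    # Half the critics' disagreement: mean - penalty equals the minimum.
    penalty = 0.5 * abs(q1_next - q2_next)
    bootstrap = 0.0 if done else mean_q - beta * penalty
    return reward + gamma * bootstrap
```

Values of `beta` between 0 and 1 interpolate between an unpenalized target and the fully pessimistic minimum.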

7

Kong, Seung-Hyun, I. Made Aswin Nahrendra, and Dong-Hee Paek. "Enhanced Off-Policy Reinforcement Learning With Focused Experience Replay." IEEE Access 9 (2021): 93152–64. http://dx.doi.org/10.1109/access.2021.3085142.

8

Li, Lihong. "A perspective on off-policy evaluation in reinforcement learning." Frontiers of Computer Science 13, no. 5 (June 17, 2019): 911–12. http://dx.doi.org/10.1007/s11704-019-9901-7.

9

Luo, Biao, Huai-Ning Wu, and Tingwen Huang. "Off-Policy Reinforcement Learning for $H_\infty$ Control Design." IEEE Transactions on Cybernetics 45, no. 1 (January 2015): 65–76. http://dx.doi.org/10.1109/tcyb.2014.2319577.

10

Sun, Mingfei, Sam Devlin, Katja Hofmann, and Shimon Whiteson. "Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8378–85. http://dx.doi.org/10.1609/aaai.v36i8.20813.

Abstract:

Sample efficiency is crucial for imitation learning methods to be applicable in real-world applications. Many studies improve sample efficiency by extending adversarial imitation to be off-policy regardless of the fact that these off-policy extensions could either change the original objective or involve complicated optimization. We revisit the foundation of adversarial imitation and propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization. Our formulation capitalizes on two key insights: (1) the similarity between the Bellman equation and the stationary state-action distribution equation allows us to derive a novel temporal difference (TD) learning approach; and (2) the use of a deterministic policy simplifies the TD learning. Combined, these insights yield a practical algorithm, Deterministic and Discriminative Imitation (D2-Imitation), which operates by first partitioning samples into two replay buffers and then learning a deterministic policy via off-policy reinforcement learning. Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation on many control tasks.

11

Jain, Arushi, Gandharv Patil, Ayush Jain, Khimya Khetarpal, and Doina Precup. "Variance Penalized On-Policy and Off-Policy Actor-Critic." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 7899–907. http://dx.doi.org/10.1609/aaai.v35i9.16964.

Abstract:

Reinforcement learning algorithms are typically geared towards optimizing the expected return of an agent. However, in many practical applications, low variance in the return is desired to ensure the reliability of an algorithm. In this paper, we propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return. Previous work uses the second moment of return to estimate the variance indirectly. Instead, we use a much simpler recently proposed direct variance estimator which updates the estimates incrementally using temporal difference methods. Using the variance-penalized criterion, we guarantee the convergence of our algorithm to locally optimal policies for finite state action Markov decision processes. We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
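A criterion that trades off mean and variance of the return can be illustrated with a simple sample estimate. The form below, mean minus a weighted variance, is a common way to write such a criterion and is assumed here; the abstract does not state the exact expression the authors optimize, and the name `psi` for the trade-off weight is hypothetical.

```python
def variance_penalized_score(returns, psi):
    """Sample estimate of E[G] - psi * Var[G] from a list of episode returns.

    psi -- non-negative weight on the variance penalty (assumed name)
    """
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((g - mean) ** 2 for g in returns) / n  # population variance
    return mean - psi * variance
```

For example, with `psi = 0.5` the steady returns `[2.0, 2.0]` score 2.0, while the noisier returns `[0.0, 4.0]` have the same mean but score 0.0.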

12

Hao, Longyan, Chaoli Wang, and Yibo Shi. "Quadratic Tracking Control of Linear Stochastic Systems with Unknown Dynamics Using Average Off-Policy Q-Learning Method." Mathematics 12, no. 10 (May 14, 2024): 1533. http://dx.doi.org/10.3390/math12101533.

Abstract:

This article investigates the optimal tracking control problem for data-based stochastic discrete-time linear systems. An average off-policy Q-learning algorithm is proposed to solve the optimal control problem with random disturbances. Compared with the existing off-policy reinforcement learning (RL) algorithm, the proposed average off-policy Q-learning algorithm avoids the assumption of an initial stability control. First, a pole placement strategy is used to design an initial stable control for systems with unknown dynamics. Second, the initial stable control is used to design a data-based average off-policy Q-learning algorithm. Then, this algorithm is used to solve the stochastic linear quadratic tracking (LQT) problem, and a convergence proof of the algorithm is provided. Finally, numerical examples show that this algorithm outperforms other algorithms in a simulation.

13

Gelada, Carles, and Marc G. Bellemare. "Off-Policy Deep Reinforcement Learning by Bootstrapping the Covariate Shift." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 3647–55. http://dx.doi.org/10.1609/aaai.v33i01.33013647.

Abstract:

In this paper we revisit the method of off-policy corrections for reinforcement learning (COP-TD) pioneered by Hallak et al. (2017). Under this method, online updates to the value function are reweighted to avoid divergence issues typical of off-policy learning. While Hallak et al.’s solution is appealing, it cannot easily be transferred to nonlinear function approximation. First, it requires a projection step onto the probability simplex; second, even though the operator describing the expected behavior of the off-policy learning algorithm is convergent, it is not known to be a contraction mapping, and hence, may be more unstable in practice. We address these two issues by introducing a discount factor into COP-TD. We analyze the behavior of discounted COP-TD and find it better behaved from a theoretical perspective. We also propose an alternative soft normalization penalty that can be minimized online and obviates the need for an explicit projection step. We complement our analysis with an empirical evaluation of the two techniques in an off-policy setting on the game Pong from the Atari domain where we find discounted COP-TD to be better behaved in practice than the soft normalization penalty. Finally, we perform a more extensive evaluation of discounted COP-TD in 5 games of the Atari domain, where we find performance gains for our approach.

14

Xiao, Teng, and Suhang Wang. "Towards Off-Policy Learning for Ranking Policies with Logged Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8700–8707. http://dx.doi.org/10.1609/aaai.v36i8.20849.

Abstract:

Probabilistic learning to rank (LTR) has been the dominating approach for optimizing the ranking metric, but cannot maximize long-term rewards. Reinforcement learning models have been proposed to maximize user long-term rewards by formulating the recommendation as a sequential decision-making problem, but could only achieve inferior accuracy compared to LTR counterparts, primarily due to the lack of online interactions and the characteristics of ranking. In this paper, we propose a new off-policy value ranking (VR) algorithm that can simultaneously maximize user long-term rewards and optimize the ranking metric offline for improved sample efficiency in a unified Expectation-Maximization (EM) framework. We theoretically and empirically show that the EM process guides the learned policy to enjoy the benefit of integration of the future reward and ranking metric, and learn without any online interactions. Extensive offline and online experiments demonstrate the effectiveness of our methods.

15

Li, Jinna, Hamidreza Modares, Tianyou Chai, Frank L. Lewis, and Lihua Xie. "Off-Policy Reinforcement Learning for Synchronization in Multiagent Graphical Games." IEEE Transactions on Neural Networks and Learning Systems 28, no. 10 (October 2017): 2434–45. http://dx.doi.org/10.1109/tnnls.2016.2609500.

16

Zhang, Hengrui, Youfang Lin, Shuo Shen, Sheng Han, and Kai Lv. "Enhancing Off-Policy Constrained Reinforcement Learning through Adaptive Ensemble C Estimation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 19 (March 24, 2024): 21770–78. http://dx.doi.org/10.1609/aaai.v38i19.30177.

Abstract:

In the domain of real-world agents, the application of Reinforcement Learning (RL) remains challenging due to the necessity for safety constraints. Previously, Constrained Reinforcement Learning (CRL) has predominantly focused on on-policy algorithms. Although these algorithms exhibit a degree of efficacy, their interactivity efficiency in real-world settings is sub-optimal, highlighting the demand for more efficient off-policy methods. However, off-policy CRL algorithms grapple with challenges in precise estimation of the C-function, particularly due to the fluctuations in the constrained Lagrange multiplier. Addressing this gap, our study focuses on the nuances of C-value estimation in off-policy CRL and introduces the Adaptive Ensemble C-learning (AEC) approach to reduce these inaccuracies. Building on state-of-the-art off-policy algorithms, we propose AEC-based CRL algorithms designed for enhanced task optimization. Extensive experiments on nine constrained robotics tasks reveal the superior interaction efficiency and performance of our algorithms in comparison to preceding methods.

17

Zhang, Shangtong, Bo Liu, and Shimon Whiteson. "Mean-Variance Policy Iteration for Risk-Averse Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10905–13. http://dx.doi.org/10.1609/aaai.v35i12.17302.

Abstract:

We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite horizon MDP optimizing the variance of a per-step reward random variable. MVPI enjoys great flexibility in that any policy evaluation method and risk-neutral control method can be dropped in for risk-averse control off the shelf, in both on- and off-policy settings. This flexibility reduces the gap between risk-neutral control and risk-averse control and is achieved by working on a novel augmented MDP directly. We propose risk-averse TD3 as an example instantiating MVPI, which outperforms vanilla TD3 and many previous risk-averse control methods in challenging Mujoco robot simulation tasks under a risk-aware performance metric. This risk-averse TD3 is the first to introduce deterministic policies and off-policy learning into risk-averse reinforcement learning, both of which are key to the performance boost we show in Mujoco domains.

18

Ali, Raja Farrukh, Kevin Duong, Nasik Muhammad Nafi, and William Hsu. "Multi-Horizon Learning in Procedurally-Generated Environments for Off-Policy Reinforcement Learning (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 13 (June 26, 2023): 16150–51. http://dx.doi.org/10.1609/aaai.v37i13.26935.

Abstract:

Value estimates at multiple timescales can help create advanced discounting functions and allow agents to form more effective predictive models of their environment. In this work, we investigate learning over multiple horizons concurrently for off-policy reinforcement learning by using an advantage-based action selection method and introducing architectural improvements. Our proposed agent learns over multiple horizons simultaneously, while using either exponential or hyperbolic discounting functions. We implement our approach on Rainbow, a value-based off-policy algorithm, and test on Procgen, a collection of procedurally-generated environments, to demonstrate the effectiveness of this approach, specifically to evaluate the agent's performance in previously unseen scenarios.
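The difference between the two discounting schemes mentioned in this abstract can be shown with a short sketch. It uses the usual textbook forms, `gamma ** t` for exponential and `1 / (1 + k * t)` for hyperbolic discounting; the function names and the parameter name `k` are assumptions, and nothing here reproduces the multi-horizon agent itself.

```python
def exponential_discount(t, gamma):
    """Weight of a reward received t steps ahead under exponential discounting."""
    return gamma ** t


def hyperbolic_discount(t, k):
    """Weight of a reward received t steps ahead under hyperbolic discounting."""
    return 1.0 / (1.0 + k * t)


def discounted_return(rewards, discount):
    """Sum of rewards weighted by discount(t), where t is the step index."""
    return sum(discount(t) * r for t, r in enumerate(rewards))
```

For instance, `discounted_return([1, 1, 1], lambda t: exponential_discount(t, 0.5))` weights the three rewards by 1, 0.5 and 0.25, whereas the hyperbolic scheme with `k = 0.5` decays more slowly at longer delays.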

19

Tennenholtz, Guy, Uri Shalit, and Shie Mannor. "Off-Policy Evaluation in Partially Observable Environments." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 06 (April 3, 2020): 10276–83. http://dx.doi.org/10.1609/aaai.v34i06.6590.

Abstract:

This work studies the problem of batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our result on synthetic medical data.
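For reference, the Importance Sampling baseline mentioned in this abstract can be sketched as the ordinary trajectory-wise estimator below. It assumes the action probabilities of both policies are known for every logged step, which is exactly the assumption that becomes problematic under partial observability; the data layout and function name are hypothetical.

```python
def trajectory_is_estimate(trajectories, gamma=1.0):
    """Ordinary importance sampling estimate of the target policy's value.

    Each trajectory is a list of (target_prob, behavior_prob, reward) steps,
    where the probabilities are those of the action actually taken.
    """
    total = 0.0
    for steps in trajectories:
        weight = 1.0  # product of per-step probability ratios
        ret = 0.0     # discounted return of the logged trajectory
        for t, (p_target, p_behavior, reward) in enumerate(steps):
            weight *= p_target / p_behavior
            ret += (gamma ** t) * reward
        total += weight * ret
    return total / len(trajectories)
```

Because the weight is a product over all steps, its variance grows quickly with trajectory length, which is one reason this estimator is usually treated as a baseline rather than a final method.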

20

Nakamura, Yutaka, Takeshi Mori, Yoichi Tokita, Tomohiro Shibata, and Shin Ishii. "Off-Policy Natural Policy Gradient Method for a Biped Walking Using a CPG Controller." Journal of Robotics and Mechatronics 17, no. 6 (December 20, 2005): 636–44. http://dx.doi.org/10.20965/jrm.2005.p0636.

Abstract:

Referring to the mechanism of animals’ rhythmic movements, motor control schemes using a central pattern generator (CPG) controller have been studied. We previously proposed reinforcement learning (RL) called the CPG-actor-critic model, as an autonomous learning framework for a CPG controller. Here, we propose an off-policy natural policy gradient RL algorithm for the CPG-actor-critic model, to solve the “exploration-exploitation” problem by meta-controlling “behavior policy.” We apply this RL algorithm to an automatic control problem using a biped robot simulator. Computer simulation demonstrated that the CPG controller enables the biped robot to walk stably and efficiently based on our new algorithm.

21

Wang, Mingyang, Zhenshan Bing, Xiangtong Yao, Shuai Wang, Huang Kai, Hang Su, Chenguang Yang, and Alois Knoll. "Meta-Reinforcement Learning Based on Self-Supervised Task Representation Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 10157–65. http://dx.doi.org/10.1609/aaai.v37i8.26210.

Abstract:

Meta-reinforcement learning enables artificial agents to learn from related training tasks and adapt to new tasks efficiently with minimal interaction data. However, most existing research is still limited to narrow task distributions that are parametric and stationary, and does not consider out-of-distribution tasks during the evaluation, thus restricting its application. In this paper, we propose MoSS, a context-based Meta-reinforcement learning algorithm based on Self-Supervised task representation learning to address this challenge. We extend meta-RL to broad non-parametric task distributions which have never been explored before, and also achieve state-of-the-art results in non-stationary and out-of-distribution tasks. Specifically, MoSS consists of a task inference module and a policy module. We utilize the Gaussian mixture model for task representation to imitate the parametric and non-parametric task variations. Additionally, our online adaptation strategy enables the agent to react at the first sight of a task change, thus being applicable in non-stationary tasks. MoSS also exhibits strong generalization robustness in out-of-distribution tasks, which benefits from the reliable and robust task representation. The policy is built on top of an off-policy RL algorithm and the entire network is trained completely off-policy to ensure high sample efficiency. On MuJoCo and Meta-World benchmarks, MoSS outperforms prior works in terms of asymptotic performance, sample efficiency (3-50x faster), adaptation efficiency, and generalization robustness on broad and diverse task distributions.

22

Cao, Jiaqing, Quan Liu, Fei Zhu, Qiming Fu, and Shan Zhong. "Gradient temporal-difference learning for off-policy evaluation using emphatic weightings." Information Sciences 580 (November 2021): 311–30. http://dx.doi.org/10.1016/j.ins.2021.08.082.

23

Tian, Chang, An Liu, Guan Huang, and Wu Luo. "Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning." IEEE Transactions on Signal Processing 70 (2022): 1609–24. http://dx.doi.org/10.1109/tsp.2022.3158737.

24

Karimpanal, Thommen George, and Erik Wilhelm. "Identification and off-policy learning of multiple objectives using adaptive clustering." Neurocomputing 263 (November 2017): 39–47. http://dx.doi.org/10.1016/j.neucom.2017.04.074.

25

Kiumarsi, Bahare, Frank L. Lewis, and Zhong-Ping Jiang. "H∞ control of linear discrete-time systems: Off-policy reinforcement learning." Automatica 78 (April 2017): 144–52. http://dx.doi.org/10.1016/j.automatica.2016.12.009.

26

Li, Jinna, Zhenfei Xiao, and Ping Li. "Discrete-Time Multi-Player Games Based on Off-Policy Q-Learning." IEEE Access 7 (2019): 134647–59. http://dx.doi.org/10.1109/access.2019.2939384.

27

Kiumarsi, Bahare, Wei Kang, and Frank L. Lewis. "H∞ Control of Nonaffine Aerial Systems Using Off-policy Reinforcement Learning." Unmanned Systems 04, no. 01 (January 2016): 51–60. http://dx.doi.org/10.1142/s2301385016400069.

Abstract:

This paper presents a completely model-free H∞ optimal tracking solution to the control of a general class of nonlinear nonaffine systems in the presence of the input constraints. The proposed method is motivated by nonaffine unmanned aerial vehicle (UAV) system as a real application. First, a general class of nonlinear nonaffine system dynamics is presented as an affine system in terms of a nonlinear function of the control input. It is shown that the optimal control of nonaffine systems may not have an admissible solution if the utility function is not defined properly. Moreover, the boundedness of the optimal control input cannot be guaranteed for standard performance functions. A new performance function is defined and used in the [Formula: see text]-gain condition for this class of nonaffine system. This performance function guarantees the existence of an admissible solution (if any exists) and boundedness of the control input solution. An off-policy reinforcement learning (RL) is employed to iteratively solve the H∞ optimal tracking control online using the measured data along the system trajectories. The proposed off-policy RL does not require any knowledge of the system dynamics. Moreover, the disturbance input does not need to be adjustable in a specific manner.

28

Lian, Bosen, Wenqian Xue, Yijing Xie, Frank L. Lewis, and Ali Davoudi. "Off-policy inverse Q-learning for discrete-time antagonistic unknown systems." Automatica 155 (September 2023): 111171. http://dx.doi.org/10.1016/j.automatica.2023.111171.

29

Kim, Man-Je, Hyunsoo Park, and Chang Wook Ahn. "Nondominated Policy-Guided Learning in Multi-Objective Reinforcement Learning." Electronics 11, no. 7 (March 28, 2022): 1069. http://dx.doi.org/10.3390/electronics11071069.

Abstract:

Control intelligence is a typical field where there is a trade-off between target objectives, and researchers in this field have longed for artificial intelligence that achieves the target objectives. Multi-objective deep reinforcement learning was sufficient to satisfy this need. In particular, multi-objective deep reinforcement learning methods based on policy optimization are leading the optimization of control intelligence. However, multi-objective reinforcement learning has difficulties when finding various Pareto optimals of multi-objectives due to the greedy nature of reinforcement learning. We propose a method of policy assimilation to solve this problem. This method was applied to MO-V-MPO, one of preference-based multi-objective reinforcement learning, to increase diversity. The performance of this method has been verified through experiments in a continuous control environment.

30

Chaudhari, Shreyas, David Arbour, Georgios Theocharous, and Nikos Vlassis. "Distributional Off-Policy Evaluation for Slate Recommendations." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (March 24, 2024): 8265–73. http://dx.doi.org/10.1609/aaai.v38i8.28667.

Abstract:

Recommendation strategies are typically evaluated by using previously logged data, employing off-policy evaluation methods to estimate their expected performance. However, for strategies that present users with slates of multiple items, the resulting combinatorial action space renders many of these methods impractical. Prior work has developed estimators that leverage the structure in slates to estimate the expected off-policy performance, but the estimation of the entire performance distribution remains elusive. Estimating the complete distribution allows for a more comprehensive evaluation of recommendation strategies, particularly along the axes of risk and fairness that employ metrics computable from the distribution. In this paper, we propose an estimator for the complete off-policy performance distribution for slates and establish conditions under which the estimator is unbiased and consistent. This builds upon prior work on off-policy evaluation for slates and off-policy distribution estimation in reinforcement learning. We validate the efficacy of our method empirically on synthetic data as well as on a slate recommendation simulator constructed from real-world data (MovieLens-20M). Our results show a significant reduction in estimation variance and improved sample efficiency over prior work across a range of slate structures.

31

Zhang, Ruiyi, Tong Yu, Yilin Shen, and Hongxia Jin. "Text-Based Interactive Recommendation via Offline Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11694–702. http://dx.doi.org/10.1609/aaai.v36i10.21424.

Abstract:

Interactive recommendation with natural-language feedback can provide richer user feedback and has demonstrated advantages over traditional recommender systems. However, the classical online paradigm involves iteratively collecting experience via interaction with users, which is expensive and risky. We consider an offline interactive recommendation to exploit arbitrary experience collected by multiple unknown policies. A direct application of policy learning with such fixed experience suffers from the distribution shift. To tackle this issue, we develop a behavior-agnostic off-policy correction framework to make offline interactive recommendation possible. Specifically, we leverage the conservative Q-function to perform off-policy evaluation, which enables learning effective policies from fixed datasets without further interactions. Empirical results on the simulator derived from real-world datasets demonstrate the effectiveness of our proposed offline training framework.

32

Xu, Z., L. Cao, and X. Chen. "Deep Reinforcement Learning with Adaptive Update Target Combination." Computer Journal 63, no. 7 (August 15, 2019): 995–1003. http://dx.doi.org/10.1093/comjnl/bxz066.

Abstract:

Simple and efficient exploration remains a core challenge in deep reinforcement learning. While many exploration methods can be applied to high-dimensional tasks, these methods manually adjust exploration parameters according to domain knowledge. This paper proposes a novel method that can automatically balance exploration and exploitation, as well as combine on-policy and off-policy update targets through dynamic weighting based on value difference. The proposed method does not directly affect the probability of a selected action but utilizes the value difference produced during the learning process to adjust the update target that guides the direction of the agent's learning. We demonstrate the performance of the proposed method on the CartPole-v1, MountainCar-v0, and LunarLander-v2 classic control tasks from the OpenAI Gym. Empirical evaluation results show that by integrating on-policy and off-policy update targets dynamically, this method exhibits better performance and stability than the exclusive use of a single update target.
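As a rough illustration of blending on-policy and off-policy update targets, the sketch below mixes a SARSA-style target with a Q-learning-style target. The function names, the fixed `weight`, and the choice of SARSA and Q-learning as the two targets are assumptions made for illustration. The abstract does not give the paper's value-difference weighting rule, so it is not reproduced here.

```python
def sarsa_target(reward, gamma, next_q_values, next_action):
    # On-policy target: bootstrap from the action the agent actually takes next.
    return reward + gamma * next_q_values[next_action]


def q_learning_target(reward, gamma, next_q_values):
    # Off-policy target: bootstrap from the greedy action in the next state.
    return reward + gamma * max(next_q_values)


def mixed_target(reward, gamma, next_q_values, next_action, weight):
    # Convex combination of the two targets.
    # `weight` is a hypothetical fixed mixing coefficient in [0, 1];
    # weight=1 gives pure SARSA, weight=0 gives pure Q-learning.
    on_policy = sarsa_target(reward, gamma, next_q_values, next_action)
    off_policy = q_learning_target(reward, gamma, next_q_values)
    return weight * on_policy + (1.0 - weight) * off_policy


# Example: the next action (index 0) is not the greedy one (index 1).
target = mixed_target(reward=1.0, gamma=0.9, next_q_values=[1.0, 2.0],
                      next_action=0, weight=0.5)
```

With a non-greedy next action the two targets differ, and the mixed target lies between them.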

33

Shahid, Asad Ali, Dario Piga, Francesco Braghin, and Loris Roveda. "Continuous control actions learning and adaptation for robotic manipulation through reinforcement learning." Autonomous Robots 46, no. 3 (February 9, 2022): 483–98. http://dx.doi.org/10.1007/s10514-022-10034-z.

Abstract:

This paper presents a learning-based method that uses simulation data to learn an object manipulation task using two model-free reinforcement learning (RL) algorithms. The learning performance is compared across on-policy and off-policy algorithms: Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). In order to accelerate the learning process, a fine-tuning procedure is proposed that demonstrates the continuous adaptation of on-policy RL to new environments, allowing the learned policy to adapt and execute the (partially) modified task. A dense reward function is designed for the task to enable efficient learning by the agent. A grasping task involving a Franka Emika Panda manipulator is considered as the reference task to be learned. The learned control policy is demonstrated to be generalizable across multiple object geometries and initial robot/parts configurations. The approach is finally tested on a real Franka Emika Panda robot, showing the possibility of transferring the learned behavior from simulation. Experimental results show a 100% success rate on the grasping tasks, making the proposed approach applicable to real applications.

34

Hollenstein, Jakob, Georg Martius, and Justus Piater. "Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 11 (March 24, 2024): 12466–72. http://dx.doi.org/10.1609/aaai.v38i11.29139.

Abstract:

Proximal Policy Optimization (PPO), a popular on-policy deep reinforcement learning method, employs a stochastic policy for exploration. In this paper, we propose a colored noise-based stochastic policy variant of PPO. Previous research highlighted the importance of temporal correlation in action noise for effective exploration in off-policy reinforcement learning. Building on this, we investigate whether correlated noise can also enhance exploration in on-policy methods like PPO. We discovered that correlated noise for action selection improves learning performance and outperforms the currently popular uncorrelated white noise approach in on-policy methods. Unlike off-policy learning, where pink noise was found to be highly effective, we found that a colored noise, intermediate between white and pink, performed best for on-policy learning in PPO. We examined the impact of varying the amount of data collected for each update by modifying the number of parallel simulation environments for data collection and observed that with a larger number of parallel environments, more strongly correlated noise is beneficial. Due to the significant impact and ease of implementation, we recommend switching to correlated noise as the default noise source in PPO.

35

Ren, He, Jing Dai, Huaguang Zhang, and Kun Zhang. "Off-policy integral reinforcement learning algorithm in dealing with nonzero sum game for nonlinear distributed parameter systems." Transactions of the Institute of Measurement and Control 42, no. 15 (July 6, 2020): 2919–28. http://dx.doi.org/10.1177/0142331220932634.

Abstract:

Benefitting from the technology of integral reinforcement learning (IRL), the nonzero sum (NZS) game for distributed parameter systems is effectively solved in this paper when information about the system dynamics is unavailable. The Karhunen-Loève decomposition (KLD) is employed to convert the partial differential equation (PDE) systems into high-order ordinary differential equation (ODE) systems. Moreover, the off-policy IRL technology is introduced to design the optimal strategies for the NZS game. To confirm that the presented algorithm will converge to the optimal value functions, the traditional adaptive dynamic programming (ADP) method is first discussed. Then, the equivalence between the traditional ADP method and the presented off-policy method is proved. For implementing the presented off-policy IRL method, critic and actor neural networks are utilized to approximate the value functions and control strategies, respectively, in the iteration process. Finally, a numerical simulation is shown to illustrate the effectiveness of the proposed off-policy algorithm.

36

Levine, Alexander, and Soheil Feizi. "Goal-Conditioned Q-learning as Knowledge Distillation." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 7 (June 26, 2023): 8500–8509. http://dx.doi.org/10.1609/aaai.v37i7.26024.

Abstract:

Many applications of reinforcement learning can be formalized as goal-conditioned environments, where, in each episode, there is a "goal" that affects the rewards obtained during that episode but does not affect the dynamics. Various techniques have been proposed to improve performance in goal-conditioned environments, such as automatic curriculum generation and goal relabeling. In this work, we explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation. In particular: the current Q-value function and the target Q-value estimate are both functions of the goal, and we would like to train the Q-value function to match its target for all goals. We therefore apply Gradient-Based Attention Transfer (Zagoruyko and Komodakis 2017), a knowledge distillation technique, to the Q-function update. We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional. We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals, where the agent can attain a reward by achieving any one of a large set of objectives, all specified at test time. Finally, to provide theoretical support, we give examples of classes of environments where (under some assumptions) standard off-policy algorithms such as DDPG require at least O(d^2) replay buffer transitions to learn an optimal policy, while our proposed technique requires only O(d) transitions, where d is the dimensionality of the goal and state space. Code and appendix are available at https://github.com/alevine0/ReenGAGE.

37

Yang, Hyunjun, Hyeonjun Park, and Kyungjae Lee. "A Selective Portfolio Management Algorithm with Off-Policy Reinforcement Learning Using Dirichlet Distribution." Axioms 11, no. 12 (November 23, 2022): 664. http://dx.doi.org/10.3390/axioms11120664.

Abstract:

Existing methods in portfolio management deterministically produce an optimal portfolio. However, according to modern portfolio theory, there exists a trade-off between a portfolio’s expected returns and risks. Therefore, the optimal portfolio does not exist definitively, but several exist, and using only one deterministic portfolio is disadvantageous for risk management. We proposed Dirichlet Distribution Trader (DDT), an algorithm that calculates multiple optimal portfolios by taking Dirichlet Distribution as a policy. The DDT algorithm makes several optimal portfolios according to risk levels. In addition, by obtaining the pi value from the distribution and applying importance sampling to off-policy learning, the sample is used efficiently. Furthermore, the architecture of our model is scalable because the feed-forward of information between portfolio stocks occurs independently. This means that even if untrained stocks are added to the portfolio, the optimal weight can be adjusted. We also conducted three experiments. In the scalability experiment, it was shown that the DDT extended model, which is trained with only three stocks, had little difference in performance from the DDT model that learned all the stocks in the portfolio. In an experiment comparing the off-policy algorithm and the on-policy algorithm, it was shown that the off-policy algorithm had good performance regardless of the stock price trend. In an experiment comparing investment results according to risk level, it was shown that a higher return or a better Sharpe ratio could be obtained through risk control.

38

Suttle, Wesley, Zhuoran Yang, Kaiqing Zhang, Zhaoran Wang, Tamer Başar, and Ji Liu. "A Multi-Agent Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning." IFAC-PapersOnLine 53, no. 2 (2020): 1549–54. http://dx.doi.org/10.1016/j.ifacol.2020.12.2021.

39

Stanković, Miloš S., Marko Beko, and Srdjan S. Stanković. "Distributed Gradient Temporal Difference Off-policy Learning With Eligibility Traces: Weak Convergence." IFAC-PapersOnLine 53, no. 2 (2020): 1563–68. http://dx.doi.org/10.1016/j.ifacol.2020.12.2184.

40

Li, Jinna, Zhenfei Xiao, Tianyou Chai, Frank L. Lewis, and Sarangapani Jagannathan. "Off-Policy Q-Learning for Anti-Interference Control of Multi-Player Systems." IFAC-PapersOnLine 53, no. 2 (2020): 9189–94. http://dx.doi.org/10.1016/j.ifacol.2020.12.2180.

41

Kim and Park. "Exploration with Multiple Random ε-Buffers in Off-Policy Deep Reinforcement Learning." Symmetry 11, no. 11 (November 1, 2019): 1352. http://dx.doi.org/10.3390/sym11111352.

Abstract:

In terms of deep reinforcement learning (RL), exploration is highly significant in achieving better generalization. In benchmark studies, ε-greedy random actions have been used to encourage exploration and prevent over-fitting, thereby improving generalization. Deep RL with random ε-greedy policies, such as deep Q-networks (DQNs), can demonstrate efficient exploration behavior. A random ε-greedy policy exploits additional replay buffers in an environment of sparse and binary rewards, such as in the real-time online detection of network securities by verifying whether the network is “normal or anomalous.” Prior studies have illustrated that a prioritized replay memory attributed to a complex temporal difference error provides superior theoretical results. However, another implementation illustrated that in certain environments, the prioritized replay memory is not superior to the randomly-selected buffers of random ε-greedy policy. Moreover, a key challenge of hindsight experience replay inspires our objective by using additional buffers corresponding to each different goal. Therefore, we attempt to exploit multiple random ε-greedy buffers to enhance explorations for a more near-perfect generalization with one original goal in off-policy RL. We demonstrate the benefit of off-policy learning from our method through an experimental comparison of DQN and a deep deterministic policy gradient in terms of discrete action, as well as continuous control for complete symmetric environments.
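For readers unfamiliar with the ε-greedy policies mentioned in this abstract, a minimal sketch of standard ε-greedy action selection follows. It is not the multi-buffer method of the paper. The function name `epsilon_greedy` and its parameters are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon, explore: pick a uniformly random action index.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Otherwise exploit: pick the index of the largest estimated value.
    return max(range(len(q_values)), key=q_values.__getitem__)


# Example with three actions; epsilon=0.0 makes the choice purely greedy.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Setting `epsilon` to 1.0 makes every choice random, and intermediate values trade exploration against exploitation.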

42

Chen, Ning, Shuhan Luo, Jiayang Dai, Biao Luo, and Weihua Gui. "Optimal Control of Iron-Removal Systems Based on Off-Policy Reinforcement Learning." IEEE Access 8 (2020): 149730–40. http://dx.doi.org/10.1109/access.2020.3015801.

43

Hachiya, Hirotaka, Takayuki Akiyama, Masashi Sugiyama, and Jan Peters. "Adaptive importance sampling for value function approximation in off-policy reinforcement learning." Neural Networks 22, no. 10 (December 2009): 1399–410. http://dx.doi.org/10.1016/j.neunet.2009.01.002.

44

Zuo, Guoyu, Qishen Zhao, Kexin Chen, Jiangeng Li, and Daoxiong Gong. "Off-policy adversarial imitation learning for robotic tasks with low-quality demonstrations." Applied Soft Computing 97 (December 2020): 106795. http://dx.doi.org/10.1016/j.asoc.2020.106795.

45

Givchi, Arash, and Maziar Palhang. "Off-policy temporal difference learning with distribution adaptation in fast mixing chains." Soft Computing 22, no. 3 (January 30, 2017): 737–50. http://dx.doi.org/10.1007/s00500-017-2490-1.

46

Liu, Mushuang, Yan Wan, Frank L. Lewis, and Victor G. Lopez. "Adaptive Optimal Control for Stochastic Multiplayer Differential Games Using On-Policy and Off-Policy Reinforcement Learning." IEEE Transactions on Neural Networks and Learning Systems 31, no. 12 (December 2020): 5522–33. http://dx.doi.org/10.1109/tnnls.2020.2969215.

47

Pritchett, Lant, and Justin Sandefur. "Learning from Experiments when Context Matters." American Economic Review 105, no. 5 (May 1, 2015): 471–75. http://dx.doi.org/10.1257/aer.p20151016.

Abstract:

Suppose a policymaker is interested in the impact of an existing social program. Impact estimates using observational data suffer potential bias, while unbiased experimental estimates are often limited to other contexts. This creates a practical trade-off between internal and external validity for evidence-based policymaking. We explore this trade-off empirically for several common policies analyzed in development economics, including microcredit, migration, and education interventions. Based on mean-squared error, non-experimental evidence within context outperforms experimental evidence from another context. This advantage declines, but may not reverse, with experimental replication. We offer four reasons these findings are of general relevance to policy evaluation.

48

Chen, Zaiwei. "A Unified Lyapunov Framework for Finite-Sample Analysis of Reinforcement Learning Algorithms." ACM SIGMETRICS Performance Evaluation Review 50, no. 3 (December 30, 2022): 12–15. http://dx.doi.org/10.1145/3579342.3579346.

Abstract:

Reinforcement learning (RL) is a paradigm where an agent learns to accomplish tasks by interacting with the environment, similar to how humans learn. RL is therefore viewed as a promising approach to achieve artificial intelligence, as evidenced by the remarkable empirical successes. However, many RL algorithms are theoretically not well-understood, especially in the setting where function approximation and off-policy sampling are employed. My thesis [1] aims at developing a thorough theoretical understanding of the performance of various RL algorithms through finite-sample analysis. Since most of the RL algorithms are essentially stochastic approximation (SA) algorithms for solving variants of the Bellman equation, the first part of the thesis is dedicated to the analysis of general SA involving a contraction operator, and under Markovian noise. We develop a Lyapunov approach where we construct a novel Lyapunov function called the generalized Moreau envelope. The results on SA enable us to establish finite-sample bounds of various RL algorithms in the tabular setting (cf. Part II of the thesis) and when using function approximation (cf. Part III of the thesis), which in turn provide theoretical insights into several important problems in the RL community, such as the efficiency of bootstrapping, the bias-variance trade-off in off-policy learning, and the stability of off-policy control. The main body of this document provides an overview of the contributions of my thesis.

49

Narita, Yusuke, Kyohei Okumura, Akihiro Shimizu, and Kohei Yata. "Counterfactual Learning with General Data-Generating Policies." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 9286–93. http://dx.doi.org/10.1609/aaai.v37i8.26113.

Abstract:

Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log data from a different policy. We extend its applicability by developing an OPE method for a class of both full support and deficient support logging policies in contextual-bandit settings. This class includes deterministic bandit (such as Upper Confidence Bound) as well as deterministic decision-making based on supervised and unsupervised learning. We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases. We validate our method with experiments on partly and entirely deterministic logging policies. Finally, we apply it to evaluate coupon targeting policies by a major online platform and show how to improve the existing policy.
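To make the idea of off-policy evaluation from logged data concrete, the sketch below implements the standard inverse propensity scoring (IPS) estimator for a contextual bandit. This is a common baseline that requires full support, not the method proposed in the paper, which also targets deficient-support logging policies. The log format and the names `ips_estimate` and `target_policy` are assumptions made for illustration.

```python
def ips_estimate(logs, target_policy):
    """Estimate the target policy's expected reward from logged data.

    logs: list of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave to the
          logged action in that context (hypothetical log format).
    target_policy: function (context, action) -> probability under the
          policy being evaluated.
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        if logging_prob <= 0.0:
            # IPS is undefined without a positive logging probability.
            raise ValueError("logged action has zero logging probability")
        # Reweight each reward by how much more (or less) likely the
        # target policy is to take the logged action.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logs)


# Toy data: uniform logging over two actions; only action 1 pays off.
logs = [(0, 0, 0.0, 0.5), (0, 1, 1.0, 0.5), (0, 1, 1.0, 0.5), (0, 0, 0.0, 0.5)]


def always_action_one(context, action):
    # Deterministic target policy that always plays action 1.
    return 1.0 if action == 1 else 0.0


estimate = ips_estimate(logs, always_action_one)
```

In this toy log the estimate equals 1.0, the reward the always-action-1 policy would actually earn. A deterministic logging policy would leave some actions with zero probability, which is the deficient-support case this baseline cannot handle.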

50

Kim, MyeongSeop, Jung-Su Kim, Myoung-Su Choi, and Jae-Han Park. "Adaptive Discount Factor for Deep Reinforcement Learning in Continuing Tasks with Uncertainty." Sensors 22, no. 19 (September 25, 2022): 7266. http://dx.doi.org/10.3390/s22197266.

Abstract:

Reinforcement learning (RL) trains an agent by maximizing the sum of a discounted reward. Since the discount factor has a critical effect on the learning performance of the RL agent, it is important to choose the discount factor properly. When uncertainties are involved in the training, the learning performance with a constant discount factor can be limited. For the purpose of obtaining acceptable learning performance consistently, this paper proposes an adaptive rule for the discount factor based on the advantage function. Additionally, how to use the advantage function in both on-policy and off-policy algorithms is presented. To demonstrate the performance of the proposed adaptive rule, it is applied to PPO (Proximal Policy Optimization) for Tetris in order to validate the on-policy case, and to SAC (Soft Actor-Critic) for the motion planning of a robot manipulator to validate the off-policy case. In both cases, the proposed method results in a better or similar performance compared with cases using the best constant discount factors found by exhaustive search. Hence, the proposed adaptive discount factor automatically finds a discount factor that leads to comparable training performance, and that can be applied to representative deep reinforcement learning problems.
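The discounted sum of rewards mentioned at the start of this abstract is the sum over time steps t of gamma^t times the reward at step t. A minimal sketch, using the hypothetical helper name `discounted_return`, shows how the choice of a constant discount factor changes the quantity being maximized. The paper's adaptive rule is not reproduced here.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated back to front."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [1.0, 1.0, 1.0]
short_sighted = discounted_return(rewards, 0.5)  # 1 + 0.5 + 0.25
far_sighted = discounted_return(rewards, 1.0)    # 1 + 1 + 1
```

A smaller discount factor shrinks the contribution of later rewards, which is why the learned behavior can be sensitive to this choice.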
