Academic literature on the topic 'Off-Policy learning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Off-Policy learning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Off-Policy learning":

1

Meng, Wenjia, Qian Zheng, Gang Pan, and Yilong Yin. "Off-Policy Proximal Policy Optimization." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 9162–70. http://dx.doi.org/10.1609/aaai.v37i8.26099.

Abstract:
Proximal Policy Optimization (PPO) is an important reinforcement learning method, which has achieved great success in sequential decision-making problems. However, PPO faces the issue of sample inefficiency because it cannot make use of off-policy data. In this paper, we propose an Off-Policy Proximal Policy Optimization method (Off-Policy PPO) that improves the sample efficiency of PPO by utilizing off-policy data. Specifically, we first propose a clipped surrogate objective function that can utilize off-policy data and avoid excessively large policy updates. Next, we theoretically clarify the stability of the optimization process under the proposed surrogate objective by demonstrating that the degree of policy update distance is consistent with that in PPO. We then describe the implementation details of the proposed Off-Policy PPO, which iteratively updates policies by optimizing the proposed clipped surrogate objective. Finally, the experimental results on representative continuous control tasks validate that our method outperforms the state-of-the-art methods on most tasks.
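To make the idea above concrete, here is a minimal Python sketch of a PPO-style clipped surrogate evaluated on data drawn from a behaviour policy that differs from the policy being optimized. The function name, the toy data, and the use of NumPy are illustrative assumptions; the paper's exact objective and update rule may differ.

import numpy as np

def clipped_surrogate(logp_new, logp_behaviour, advantages, eps=0.2):
    # Importance ratio between the policy being optimized and the
    # behaviour policy that generated the data (off-policy in general).
    ratio = np.exp(logp_new - logp_behaviour)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic minimum, as in PPO, to discourage excessively large updates.
    return np.mean(np.minimum(unclipped, clipped))

# Toy usage with random numbers standing in for real rollout statistics.
rng = np.random.default_rng(0)
print(clipped_surrogate(rng.normal(size=128), rng.normal(size=128), rng.normal(size=128)))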
2

Schmitt, Simon, John Shawe-Taylor, and Hado van Hasselt. "Chaining Value Functions for Off-Policy Learning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8187–95. http://dx.doi.org/10.1609/aaai.v36i8.20792.

Abstract:
To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn `off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated out of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions this comes arbitrarily close to the off-policy TD solution when we increase the length of the chain. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore it can be interpreted as estimating a novel objective -- that we call a `k-step expedition' -- of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically we evaluate the idea on challenging MDPs such as Baird's counter example and observe favourable results.
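The chaining construction can be illustrated with an exact tabular computation: start from the on-policy value of the behaviour policy and repeatedly apply a target-policy backup, so that k links correspond to the 'k-step expedition' described above. The sketch below is a simplified, model-based rendering (the paper works with TD learning and function approximation); the function names and the toy MDP are assumptions for illustration.

import numpy as np

def chained_values(P, r, behaviour, target, gamma=0.9, links=5):
    # P[a, s, s2]: transition probabilities; r[a, s]: expected rewards.
    n_states = P.shape[1]

    def induced(policy):
        # State-to-state dynamics and rewards induced by a stochastic policy.
        P_pi = np.einsum('sa,ast->st', policy, P)
        r_pi = np.einsum('sa,as->s', policy, r)
        return P_pi, r_pi

    P_b, r_b = induced(behaviour)
    P_t, r_t = induced(target)
    # Link 0: fully on-policy value of the behaviour policy (stable to compute).
    v = np.linalg.solve(np.eye(n_states) - gamma * P_b, r_b)
    # Each further link bootstraps one target-policy backup on the previous link.
    for _ in range(links):
        v = r_t + gamma * P_t @ v
    return v

# Toy two-state, two-action MDP with a uniform behaviour policy.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.5, 0.5]]])
r = np.array([[1.0, 0.0], [0.0, 1.0]])
behaviour = np.full((2, 2), 0.5)
target = np.array([[1.0, 0.0], [0.0, 1.0]])
print(chained_values(P, r, behaviour, target))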
3

Xu, Da, Yuting Ye, Chuanwei Ruan, and Bo Yang. "Towards Robust Off-Policy Learning for Runtime Uncertainty." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 9 (June 28, 2022): 10101–9. http://dx.doi.org/10.1609/aaai.v36i9.21249.

Abstract:
Off-policy learning plays a pivotal role in optimizing and evaluating policies prior to online deployment. However, during real-time serving, we observe a variety of interventions and constraints that cause inconsistency between the online and offline settings, which we summarize and term runtime uncertainty. Such uncertainty cannot be learned from the logged data due to its abnormal and rare nature. To assert a certain level of robustness, we perturb the off-policy estimators along an adversarial direction in view of the runtime uncertainty. This allows the resulting estimators to be robust not only to observed but also to unexpected runtime uncertainties. Leveraging this idea, we bring runtime-uncertainty robustness to three major off-policy learning methods: the inverse propensity score method, the reward-model method, and the doubly robust method. We theoretically justify the robustness of our methods to runtime uncertainty, and demonstrate their effectiveness using both simulation and real-world online experiments.
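For reference, the three base estimators named in the abstract can be written down in a few lines; the adversarial perturbation that makes them robust to runtime uncertainty is the paper's contribution and is not reproduced here. Function names and arguments are illustrative assumptions.

import numpy as np

def ips_value(rewards, logging_propensities, target_propensities):
    # Inverse propensity scoring: reweight logged rewards by the policy ratio.
    w = target_propensities / logging_propensities
    return np.mean(w * rewards)

def reward_model_value(model_value_target):
    # Reward-model (direct) method: average the model's predicted reward
    # under the target policy.
    return np.mean(model_value_target)

def doubly_robust_value(rewards, logging_propensities, target_propensities,
                        model_rewards_logged, model_value_target):
    # Doubly robust: model baseline plus an importance-weighted correction
    # on the logged actions.
    w = target_propensities / logging_propensities
    return np.mean(model_value_target + w * (rewards - model_rewards_logged))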
4

Peters, James F., and Christopher Henry. "Approximation spaces in off-policy Monte Carlo learning." Engineering Applications of Artificial Intelligence 20, no. 5 (August 2007): 667–75. http://dx.doi.org/10.1016/j.engappai.2006.11.005.

5

Yu, Jiayu, Jingyao Li, Shuai Lü, and Shuai Han. "Mixed experience sampling for off-policy reinforcement learning." Expert Systems with Applications 251 (October 2024): 124017. http://dx.doi.org/10.1016/j.eswa.2024.124017.

6

Cetin, Edoardo, and Oya Celiktutan. "Learning Pessimism for Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (June 26, 2023): 6971–79. http://dx.doi.org/10.1609/aaai.v37i6.25852.

Abstract:
Off-policy deep reinforcement learning algorithms commonly compensate for overestimation bias during temporal-difference learning by utilizing pessimistic estimates of the expected target returns. In this work, we propose Generalized Pessimism Learning (GPL), a strategy employing a novel learnable penalty to enact such pessimism. In particular, we propose to learn this penalty alongside the critic with dual TD-learning, a new procedure to estimate and minimize the magnitude of the target returns bias with trivial computational cost. GPL enables us to accurately counteract overestimation bias throughout training without incurring the downsides of overly pessimistic targets. By integrating GPL with popular off-policy algorithms, we achieve state-of-the-art results in both competitive proprioceptive and pixel-based benchmarks.
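As a rough illustration of learnable pessimism, the sketch below penalizes the bootstrapped target by the disagreement between two critics and nudges the penalty coefficient toward whatever value cancels the measured bias. This is an assumption-laden simplification, not the paper's GPL or dual TD-learning procedure; all names are hypothetical.

import numpy as np

def pessimistic_target(reward, q1_next, q2_next, beta, gamma=0.99):
    # Bootstrapped target penalized by critic disagreement; beta sets the
    # amount of pessimism.
    mean_q = 0.5 * (q1_next + q2_next)
    disagreement = np.abs(q1_next - q2_next)
    return reward + gamma * (mean_q - beta * disagreement)

def update_beta(beta, estimated_target_bias, lr=1e-3):
    # Increase the penalty when targets overestimate and decrease it when
    # they underestimate, driving the estimated bias toward zero.
    return beta + lr * estimated_target_bias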
7

Kong, Seung-Hyun, I. Made Aswin Nahrendra, and Dong-Hee Paek. "Enhanced Off-Policy Reinforcement Learning With Focused Experience Replay." IEEE Access 9 (2021): 93152–64. http://dx.doi.org/10.1109/access.2021.3085142.

8

Li, Lihong. "A perspective on off-policy evaluation in reinforcement learning." Frontiers of Computer Science 13, no. 5 (June 17, 2019): 911–12. http://dx.doi.org/10.1007/s11704-019-9901-7.

9

Luo, Biao, Huai-Ning Wu, and Tingwen Huang. "Off-Policy Reinforcement Learning for H∞ Control Design." IEEE Transactions on Cybernetics 45, no. 1 (January 2015): 65–76. http://dx.doi.org/10.1109/tcyb.2014.2319577.

10

Sun, Mingfei, Sam Devlin, Katja Hofmann, and Shimon Whiteson. "Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8378–85. http://dx.doi.org/10.1609/aaai.v36i8.20813.

Abstract:
Sample efficiency is crucial for imitation learning methods to be applicable in real-world applications. Many studies improve sample efficiency by extending adversarial imitation to be off-policy regardless of the fact that these off-policy extensions could either change the original objective or involve complicated optimization. We revisit the foundation of adversarial imitation and propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization. Our formulation capitalizes on two key insights: (1) the similarity between the Bellman equation and the stationary state-action distribution equation allows us to derive a novel temporal difference (TD) learning approach; and (2) the use of a deterministic policy simplifies the TD learning. Combined, these insights yield a practical algorithm, Deterministic and Discriminative Imitation (D2-Imitation), which operates by first partitioning samples into two replay buffers and then learning a deterministic policy via off-policy reinforcement learning. Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation on many control tasks.
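A minimal sketch of the two-buffer bookkeeping described above is given below; the partition rule, the fixed surrogate rewards (1.0 for expert-like samples, 0.0 otherwise), and the class name are illustrative assumptions, and the deterministic off-policy actor-critic that would consume these batches is omitted.

import random

class TwoBufferStore:
    # Transitions are partitioned into two replay buffers and tagged with a
    # fixed surrogate reward, to be consumed by an off-policy TD learner.
    def __init__(self):
        self.expert_like = []   # samples judged similar to the demonstrations
        self.other = []         # everything else

    def add(self, transition, is_expert_like):
        (self.expert_like if is_expert_like else self.other).append(transition)

    def sample(self, batch_size):
        half = batch_size // 2
        batch = [(t, 1.0) for t in random.sample(
            self.expert_like, min(half, len(self.expert_like)))]
        batch += [(t, 0.0) for t in random.sample(
            self.other, min(half, len(self.other)))]
        return batch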

Dissertations / Theses on the topic "Off-Policy learning":

1

Hauser, Kristen. "Hyperparameter Tuning for Reinforcement Learning with Bandits and Off-Policy Sampling." Case Western Reserve University School of Graduate Studies / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=case1613034993418088.

2

Tosatto, Samuele. "Off-Policy Reinforcement Learning for Robotics." PhD thesis, supervised by Jan Peters and Martha White. Darmstadt: Universitäts- und Landesbibliothek, 2021. http://d-nb.info/1227582293/34.

3

Sakhi, Otmane. "Offline Contextual Bandit : Theory and Large Scale Applications." Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAG011.

Abstract:
This thesis presents contributions to the problem of learning from logged interactions using the offline contextual bandit framework. We are interested in two related topics: (1) offline policy learning with performance certificates, and (2) fast and efficient policy learning applied to large scale, real world recommendation. For (1), we first leverage results from the distributionally robust optimisation framework to construct asymptotic, variance-sensitive bounds to evaluate policies' performances. These bounds lead to new, more practical learning objectives thanks to their composite nature and straightforward calibration. We then analyse the problem from the PAC-Bayesian perspective, and provide tighter, non-asymptotic bounds on the performance of policies. Our results motivate new strategies, that offer performance certificates before deploying the policies online. The newly derived strategies rely on composite learning objectives that do not require additional tuning. For (2), we first propose a hierarchical Bayesian model, that combines different signals, to efficiently estimate the quality of recommendation. We provide proper computational tools to scale the inference to real world problems, and demonstrate empirically the benefits of the approach in multiple scenarios. We then address the question of accelerating common policy optimisation approaches, particularly focusing on recommendation problems with catalogues of millions of items. We derive optimisation routines, based on new gradient approximations, computed in logarithmic time with respect to the catalogue size. Our approach improves on common, linear time gradient computations, yielding fast optimisation with no loss on the quality of the learned policies
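As one concrete example of the kind of learning objective the thesis studies, the sketch below scores a candidate policy on logged bandit data with an IPS estimate penalized by its empirical standard error. It is only an illustration of a variance-sensitive objective under assumed inputs, not the thesis's actual bounds or their calibration.

import numpy as np

def variance_penalised_objective(rewards, logging_propensities,
                                 target_propensities, penalty=1.0):
    # IPS estimate of the candidate policy's reward, minus a penalty on its
    # empirical standard error (a crude stand-in for a variance-sensitive bound).
    w = target_propensities / logging_propensities
    terms = w * rewards
    return terms.mean() - penalty * terms.std(ddof=1) / np.sqrt(len(terms))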
4

Tosatto, Samuele. "Off-Policy Reinforcement Learning for Robotics." Phd thesis, 2021. https://tuprints.ulb.tu-darmstadt.de/17536/1/thesis.pdf.

Abstract:
Nowadays, industrial processes are vastly automated by means of robotic manipulators. In some cases, robots occupy a large fraction of the production line, performing a rich range of tasks. In contrast to their tireless ability to repeatedly perform the same tasks with millimetric precision, current robotics exhibits low adaptability to new scenarios. This lack of adaptability in many cases hinders a closer human-robot interaction; furthermore, when one needs to apply some change to the production line, the robots need to be reconfigured by highly-qualified figures. Machine learning and, more particularly, reinforcement learning hold the promise to provide automated systems that can adapt to new situations and learn new tasks. Despite the overwhelming progress in recent years in the field, the vast majority of reinforcement learning is not directly applicable to real robotics. State-of-the-art reinforcement learning algorithms require intensive interaction with the environment and are unsafe in the early stage of learning when the policy perform poorly and potentially harms the systems. For these reasons, the application of reinforcement learning has been successful mainly on simulated tasks such as computer- and board-games, where it is possible to collect a vast amount of samples in parallel, and there is no possibility to damage any real system. To mitigate these issues, researchers proposed first to employ imitation learning to obtain a reasonable policy, and subsequently to refine it via reinforcement learning. In this thesis, we focus on two main issues that prevent the mentioned pipe-line from working efficiently: (i) robotic movements are represented with a high number of parameters, which prevent both safe and efficient exploration; (ii) the policy improvement is usually on-policy, which also causes inefficient and unsafe updates. In Chapter 3 we propose an efficient method to perform dimensionality reduction of learned robotic movements, exploiting redundancies in the movement spaces (which occur more commonly in manipulation tasks) rather than redundancies in the robot kinematics. The dimensionality reduction allows the projection to latent spaces, representing with high probability movements close to the demonstrated ones. To make reinforcement learning safer and more efficient, we define the off-policy update in the movement’s latent space in Chapter 4. In Chapter 5, we propose a novel off-policy gradient estimation, which makes use of a particular non-parametric technique named Nadaraya-Watson kernel regression. Building on a solid theoretical framework, we derive statistical guarantees. We believe that providing strong guarantees is at the core of a safe machine learning. In this spirit, we further expand and analyze the statistical guarantees on Nadaraya-Watson kernel regression in Chapter 6. Usually, to avoid challenging exploration in reinforcement learning applied to robotics, one must define highly engineered reward-function. This limitation hinders the possibility of allowing non-expert users to define new tasks. Exploration remains an open issue in high-dimensional and sparse reward. To mitigate this issue, we propose a far-sighted exploration bonus built on information-theoretic principles in Chapter 7. To test our algorithms, we provided a full analysis both on simulated environment, and in some cases on real world robotic tasks. 
The analysis supports our statement, showing that our proposed techniques can safely learn in the presence of a limited set of demonstration and robotic interactions.
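The non-parametric building block named in the abstract, Nadaraya-Watson kernel regression, is easy to state on its own: the prediction at a query point is a kernel-weighted average of the observed targets. The sketch below uses a Gaussian kernel and synthetic data; it illustrates the regression itself, not the thesis's off-policy gradient estimator built on top of it, and all names are assumptions.

import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=0.5):
    # Squared distances between each query point and each training point.
    d2 = np.sum((x_query[:, None, :] - x_train[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel weights, normalized per query point.
    weights = np.exp(-0.5 * d2 / bandwidth ** 2)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ y_train

# Toy usage: recover a noisy sine curve at a few query points.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=200)
xq = np.linspace(-3, 3, 5).reshape(-1, 1)
print(nadaraya_watson(xq, x, y))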
5

Delp, Michael. "Experiments in off-policy reinforcement learning with the GQ(lambda) algorithm." Master's thesis, 2011. http://hdl.handle.net/10048/1762.

Abstract:
Off-policy reinforcement learning is useful in many contexts. Maei, Sutton, Szepesvari, and others, have recently introduced a new class of algorithms, the most advanced of which is GQ(lambda), for off-policy reinforcement learning. These algorithms are the first stable methods for general off-policy learning whose computational complexity scales linearly with the number of parameters, thereby making them potentially applicable to large applications involving function approximation. Despite these promising theoretical properties, these algorithms have received no significant empirical test of their effectiveness in off-policy settings prior to the current work. Here, GQ(lambda) is applied to a variety of prediction and control domains, including on a mobile robot, where it is able to learn multiple optimal policies in parallel from random actions. Overall, we find GQ(lambda) to be a promising algorithm for use with large real-world continuous learning tasks. We believe it could be the base algorithm of an autonomous sensorimotor robot.
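For readers unfamiliar with this family, the sketch below shows a single step of a simpler relative, off-policy TDC/GTD with linear function approximation and an importance-sampling ratio. Like GQ(lambda), its per-step cost is linear in the number of features, but it omits eligibility traces and the other generalizations of GQ(lambda); names and step sizes are assumptions.

import numpy as np

def tdc_update(theta, w, phi, phi_next, reward, rho,
               gamma=0.99, alpha=0.01, beta=0.1):
    # One off-policy TDC (gradient-TD) update; rho is the importance
    # sampling ratio pi(a|s) / b(a|s) for the logged action.
    delta = reward + gamma * theta @ phi_next - theta @ phi
    theta = theta + alpha * rho * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * rho * (delta - w @ phi) * phi
    return theta, w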
6

Diddigi, Raghuram Bharadwaj. "Reinforcement Learning Algorithms for Off-Policy, Multi-Agent Learning and Applications to Smart Grids." Thesis, 2022. https://etd.iisc.ac.in/handle/2005/5673.

Abstract:
Reinforcement Learning (RL) algorithms are a popular class of algorithms for training an agent to learn desired behavior through interaction with an environment whose dynamics is unknown to the agent. RL algorithms combined with neural network architectures have enjoyed much success in various disciplines like games, medicine, energy management, economics and supply chain management. In our thesis, we study interesting extensions of standard single-agent RL settings, like off-policy and multi-agent settings. We discuss the motivations and importance of these settings and propose convergent algorithms to solve these problems. Finally, we consider one of the important applications of RL, namely smart grids. The goal of the smart grid is to develop a power grid model that intelligently manages its energy resources. In our thesis, we propose RL models for efficient smart grid design. Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving prediction problems. TD algorithms with linear function approximation are convergent when the data samples are generated from the target policy (known as on-policy prediction) itself. However, it has been well established in the literature that off-policy TD algorithms under linear function approximation may diverge. In the first part of the thesis, we propose a convergent online off-policy TD algorithm under linear function approximation. The main idea is to penalize updates of the algorithm to ensure convergence of the iterates. We provide a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the effectiveness of our proposed scheme. Subsequently, we consider the “off-policy control” setup in RL, where an agent’s objective is to compute an optimal policy based on the data obtained from a behavior policy. As the optimal policy can be very different from the behavior policy, learning optimal behavior is very hard in the “offpolicy” setting compared to the “on-policy” setting wherein the data is collected from the new policy updates. In this work, we propose the first deep off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency. Unlike the existing natural gradient-based actor-critic algorithms that use only fixed features for policy and value function approximation, the proposed natural actor-critic algorithm can utilize a deep neural network’s power to approximate both policy and value function. We illustrate the benefit of the proposed off-policy natural gradient algorithm by comparing it with the Euclidean gradient actor-critic algorithm on benchmark RL tasks. In the third part of the thesis, we consider the problem of two-player zero-sum games. In this setting, there are two agents, both of whom aim to optimize their payoffs. Both the agents observe the same state of the game, and the agents’ objective is to compute a strategy profile that maximizes their payoffs. However, the payoff of the second agent is the negative of the payoff obtained by the first agent. Therefore, the objective of the second agent is to minimize the total payoff obtained by the first agent. 
This problem is formulated as a min-max Markov game in the literature. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation. Successive relaxation has been successfully applied in the literature to compute a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the two-player zero-sum games. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is unknown. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm utilizing stochastic approximation techniques. Through experiments, we demonstrate the advantages of our proposed algorithm. Next, we consider a cooperative stochastic games framework where multiple agents work towards learning optimal joint actions in an unknown environment to achieve a common goal. In many realworld applications, however, constraints are often imposed on the actions that the agents can jointly take. In such scenarios, the agents aim to learn joint actions to achieve a common goal (minimizing a specified cost function) while meeting the given constraints (specified via certain penalty functions). Our work considers the relaxation of the constrained optimization problem by constructing the Lagrangian of the cost and penalty functions. We propose a nested actor-critic solution approach to solve this relaxed problem. In this approach, an actor-critic scheme is employed to improve the policy for a given Lagrange parameter update on a faster timescale as in the classical actor-critic architecture. Using this faster timescale policy update, a meta actor-critic scheme is employed to improve the Lagrange parameters on the slower timescale. Utilizing the proposed nested actor-critic scheme, we develop three Nested Actor-Critic (N-AC) algorithms. In recent times, actor-critic algorithms with attention mechanisms have been successfully applied to obtain optimal actions for RL agents in multi-agent environments. In the fifth part of our thesis, we extend this algorithm to the constrained multi-agent RL setting considered above. The idea here is that optimizing the common goal and satisfying the constraints may require different modes of attention. Thus, by incorporating different attention modes, the agents can select useful information required for optimizing the objective and satisfying the constraints separately, thereby yielding better actions. Through experiments on benchmark multi-agent environments, we discuss the advantages of our proposed attention-based actor-critic algorithm. In the last part of our thesis, we study the applications of RL algorithms to Smart Grids. We consider two important problems - on the supply-side and demand-side, respectively, and study both in a unified framework. On the supply side, we study the problem of energy trading among microgrids to maximize profit obtained from selling power while at the same time satisfying the customer demand. On the demand side, we consider optimally scheduling the time-adjustable demand - i.e., of loads with flexible time windows in which they can be scheduled. While previous works have treated these two problems in isolation, we combine these problems and provide a unified Markov decision process (MDP) framework for these problems.
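Of the several algorithms summarized above, the two-player minimax Q-learning update is the easiest to sketch. The version below is tabular and approximates the stage-game value by a pure-strategy max-min; the thesis's successive-relaxation generalization, and the mixed-strategy (matrix-game) value used in the classical formulation, are not reproduced. All names are illustrative.

import numpy as np

def minimax_q_update(Q, s, a, b, reward, s_next, alpha=0.1, gamma=0.95):
    # Q has shape (num_states, num_actions_player1, num_actions_player2).
    next_value = np.max(np.min(Q[s_next], axis=1))   # max over a', min over b'
    td_target = reward + gamma * next_value
    Q[s, a, b] += alpha * (td_target - Q[s, a, b])
    return Q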

Books on the topic "Off-Policy learning":

1

Kabay, Sarah. Access, Quality, and the Global Learning Crisis. Oxford University Press, 2021. http://dx.doi.org/10.1093/oso/9780192896865.001.0001.

Abstract:
Around the world, 250 million children cannot read, write, or perform basic mathematics. They represent almost 40 percent of all primary school-aged children. This situation has come to be called the “Global Learning Crisis,” and it is one of the most critical challenges facing the world today. Work to address this situation depends on how it is understood. Typically, the Global Learning Crisis and efforts to improve primary education are defined in relation to two terms: access and quality. This book is focused on the connection between them. In a mixed-methods case study, this book provides detailed, contextualized analysis of Ugandan primary education. As one of the first countries in sub-Saharan Africa to enact dramatic and far-reaching primary education policy, Uganda serves as a compelling case study. With both quantitative and qualitative data from over 400 Ugandan schools and communities, the book analyzes grade repetition, private primary schools, and school fees, viewing each issue as an illustration of the connection between access to education and education quality. This analysis finds evidence of a positive association, challenging a key assumption that there is a trade-off or disconnect between efforts to improve access to education and efforts to improve education quality. The book concludes that embracing the complexity of education systems and focusing on dynamics where improvements in access and quality can be mutually reinforcing can be a new approach for improving basic education in contexts around the world.
2

Startz, Richard. Profit of Education. ABC-CLIO, LLC, 2010. http://dx.doi.org/10.5040/9798216001799.

Abstract:
This important book translates evidence and examines policy, proposing a plan to save America's schools by rewarding teachers with professional-level salaries distributed wisely. Profit of Education makes it clear that rethinking the teaching profession is the key to repairing America's broken-down education system and securing our nation's future. Accomplishing that, author Dick Startz says, requires lifting teacher pay to professional levels and rewarding teachers for student success, with the goal of improving student learning by the equivalent of one extra year of schooling. Profit of Education takes the reader on a chapter-by-chapter walk through the evidence on pay-oriented, teacher-centric reform of the public school system, showing that such an approach can work. Startz translates the extensive scientific evidence on school reform into easily understood terms, demonstrating the enormous difference teachers make in student outcomes. Proposed levels of teacher salaries are established, and the difficult issue of differential pay is examined in depth, as are many of the practical and political issues involved in measuring teacher success. Last, but hardly least, Startz shows how teacher-centric school reform will pay off for the taxpayer and the economy.

Book chapters on the topic "Off-Policy learning":

1

Li, Jinna, Frank L. Lewis, and Jialu Fan. "Off-Policy Game Reinforcement Learning." In Reinforcement Learning, 185–232. Cham: Springer International Publishing, 2023. http://dx.doi.org/10.1007/978-3-031-28394-9_7.

2

Zhang, Li, Xin Li, Mingzhong Wang, and Andong Tian. "Off-Policy Differentiable Logic Reinforcement Learning." In Machine Learning and Knowledge Discovery in Databases. Research Track, 617–32. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-86520-7_38.

3

Cief, Matej, Jacek Golebiowski, Philipp Schmidt, Ziawasch Abedjan, and Artur Bekasov. "Learning Action Embeddings for Off-Policy Evaluation." In Lecture Notes in Computer Science, 108–22. Cham: Springer Nature Switzerland, 2024. http://dx.doi.org/10.1007/978-3-031-56027-9_7.

4

Klein, Edouard, Matthieu Geist, and Olivier Pietquin. "Batch, Off-Policy and Model-Free Apprenticeship Learning." In Lecture Notes in Computer Science, 285–96. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. http://dx.doi.org/10.1007/978-3-642-29946-9_28.

5

Rak, Alexandra, Alexey Skrynnik, and Aleksandr I. Panov. "Flexible Data Augmentation in Off-Policy Reinforcement Learning." In Artificial Intelligence and Soft Computing, 224–35. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-87986-0_20.

6

Steckelmacher, Denis, Hélène Plisnier, Diederik M. Roijers, and Ann Nowé. "Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics." In Machine Learning and Knowledge Discovery in Databases, 19–34. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-46133-1_2.

7

Roettger, Frederic. "Reviewing On-Policy/Off-Policy Critic Learning in the Context of Temporal Differences and Residual Learning." In Reinforcement Learning Algorithms: Analysis and Applications, 15–24. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-41188-6_2.

8

Zhang, Qichao, Dongbin Zhao, and Sibo Zhang. "Off-Policy Reinforcement Learning for Partially Unknown Nonzero-Sum Games." In Neural Information Processing, 822–30. Cham: Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-70087-8_84.

9

Wei, Qinglai, Ruizhuo Song, Benkai Li, and Xiaofeng Lin. "Off-Policy IRL Optimal Tracking Control for Continuous-Time Chaotic Systems." In Self-Learning Optimal Control of Nonlinear Systems, 201–14. Singapore: Springer Singapore, 2017. http://dx.doi.org/10.1007/978-981-10-4080-1_9.


Conference papers on the topic "Off-Policy learning":

1

He, Li, Long Xia, Wei Zeng, Zhi-Ming Ma, Yihong Zhao, and Dawei Yin. "Off-policy Learning for Multiple Loggers." In KDD '19: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2019. http://dx.doi.org/10.1145/3292500.3330864.

2

White, Adam, Joseph Modayil, and Richard S. Sutton. "Scaling life-long off-policy learning." In 2012 IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL). IEEE, 2012. http://dx.doi.org/10.1109/devlrn.2012.6400860.

3

Zhang, Yan, and Michael M. Zavlanos. "Distributed off-Policy Actor-Critic Reinforcement Learning with Policy Consensus." In 2019 IEEE 58th Conference on Decision and Control (CDC). IEEE, 2019. http://dx.doi.org/10.1109/cdc40024.2019.9029969.

4

Zheng, Bowen, and Ran Cheng. "Rethinking Population-assisted Off-policy Reinforcement Learning." In GECCO '23: Genetic and Evolutionary Computation Conference. New York, NY, USA: ACM, 2023. http://dx.doi.org/10.1145/3583131.3590512.

5

Cheng, Zhihao, Li Shen, and Dacheng Tao. "Off-policy Imitation Learning from Visual Inputs." In 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023. http://dx.doi.org/10.1109/icra48891.2023.10161566.

6

Miao, Dadong, Yanan Wang, Guoyu Tang, Lin Liu, Sulong Xu, Bo Long, Yun Xiao, Lingfei Wu, and Yunjiang Jiang. "Sequential Search with Off-Policy Reinforcement Learning." In CIKM '21: The 30th ACM International Conference on Information and Knowledge Management. New York, NY, USA: ACM, 2021. http://dx.doi.org/10.1145/3459637.3481954.

7

Jeunen, Olivier, Sean Murphy, and Ben Allison. "Off-Policy Learning-to-Bid with AuctionGym." In KDD '23: The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2023. http://dx.doi.org/10.1145/3580305.3599877.

8

Saito, Yuta, Himan Abdollahpouri, Jesse Anderton, Ben Carterette, and Mounia Lalmas. "Long-term Off-Policy Evaluation and Learning." In WWW '24: The ACM Web Conference 2024. New York, NY, USA: ACM, 2024. http://dx.doi.org/10.1145/3589334.3645446.

9

Joseph, Ajin George, and Shalabh Bhatnagar. "Bounds for off-policy prediction in reinforcement learning." In 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017. http://dx.doi.org/10.1109/ijcnn.2017.7966359.

10

Marvi, Zahra, and Bahare Kiumarsi. "Safe Off-policy Reinforcement Learning Using Barrier Functions." In 2020 American Control Conference (ACC). IEEE, 2020. http://dx.doi.org/10.23919/acc45564.2020.9147584.


Reports on the topic "Off-Policy learning":

1

Private sector and food security. Commercial Agriculture for Smallholders and Agribusiness (CASA), 2023. http://dx.doi.org/10.1079/20240191178.

Abstract:
The global community is facing escalating acute food insecurity crises, predominantly in Sub- Saharan Africa, due to climate change, the Russia-Ukraine conflict, and COVID-19 shocks. Related impacts on donor government budgets, domestic conflicts and limited fiscal capacity in countries already experiencing acute food insecurity, often on top of high chronic food insecurity levels, further exacerbate the issue. This policy brief examines the potential of private sector financing to alleviate acute food insecurity, through providing a targeted review of key mechanisms for mobilizing private sector investment in priority regions affected by acute food insecurity. These mechanisms include (1) donor-private sector partnerships, (2) private sector industry initiatives, and (3) standalone investors and institutions. They have been analysed through case studies and stakeholder consultations, to offer insights into the potential of private sector investment to address acute food insecurity challenges. The analysis emphasizes the role of private sector commercial investment, including short-term investments in addressing immediate food supply needs and medium- to long-term investments in enhancing the resilience of local food systems, focusing on geographies experiencing Integrated Food Security Phase Classification (IPC) Acute Food Insecurity Phases 2 and 3.1 These are acute food insecurity contexts where the private sector might still perceive a viable investment opportunity and where such investments can contribute to building more resilient food systems. Based on this initial review of mechanisms to mobilize private sector financing, the brief concludes that private sector financing has a role to play in building the resilience of medium-term food systems in order to prevent future emergencies, but that it is not suitable for addressing short-term, urgent financing needs related to acute food insecurity that is at crisis levels or near to them. Private sector investors also need significant de-risking and blended finance in countries that are most affected by acute food insecurity, as well as policy predictability and demonstrated national commitments to domestic and regional food and agriculture strategies, due to the long timeframes of, and risks for, most agricultural investments. This indicates that substantial additional donor and public sector intervention is needed to catalyse private sector investment and to direct it towards investments that will have the biggest impact on food security. Learnings from the case studies and other documents reviewed for this policy brief, along with interviews with a range of sectoral stakeholders, indicate that initiatives to mobilize private sector investment should prioritize two objectives so as to achieve the most food security impact. These will shift countries that are experiencing acute food insecurity away from exporting unprocessed agricultural production and importing consumable food and towards national and regional processing and value addition for local consumption. First, focus efforts on catalysing private investment in local agricultural processing and value addition. The missing value chain link in many acutely food-insecure countries is local processing and value addition capacity, which would also provide local off-take for domestic agricultural production. Many initiatives to date have not focused on this piece of the equation, but rather on access to inputs and smallholder farmer support. 
Second, leverage blended financing to mobilize local financial institutional lending to processing and value addition SMEs. Local currency lending is often the type of financing that agricultural SMEs most need: SME financing needs are not well-matched with the types of foreign currency investment that development finance institutions (DFIs) and other international investors offer, especially with regard to ticket size and return expectations. This brief also recognizes the limitations of its approach and the complexity of the dynamics around using private sector investment to alleviate acute food insecurity. Therefore, the brief concludes by highlighting critical questions for further research, including the positioning of smallholder engagement for food security, innovation in blended financing instruments, and enabling trade and agricultural policy frameworks.
