Academic literature on the topic 'Sparse Reward'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Sparse Reward.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Sparse Reward"

1

Park, Junseok, Yoonsung Kim, Hee bin Yoo, Min Whoo Lee, Kibeom Kim, Won-Seok Choi, Minsu Lee, and Byoung-Tak Zhang. "Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 1 (March 24, 2024): 592–600. http://dx.doi.org/10.1609/aaai.v38i1.27815.

Abstract:
Toddlers evolve from free exploration with sparse feedback to exploiting prior experiences for goal-directed learning with denser rewards. Drawing inspiration from this Toddler-Inspired Reward Transition, we set out to explore the implications of varying reward transitions when incorporated into Reinforcement Learning (RL) tasks. Central to our inquiry is the transition from sparse to potential-based dense rewards, which share optimal strategies regardless of reward changes. Through various experiments, including those in egocentric navigation and robotic arm manipulation tasks, we found that proper reward transitions significantly influence sample efficiency and success rates. Of particular note is the efficacy of the toddler-inspired Sparse-to-Dense (S2D) transition. Beyond these performance metrics, using the Cross-Density Visualizer technique, we observed that transitions, especially the S2D transition, smooth the policy loss landscape, promoting wide minima that enhance generalization in RL models.
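The premise that a switch from sparse to potential-based dense rewards leaves optimal strategies unchanged follows from the standard potential-based shaping result, where the shaping term has the form γΦ(s') − Φ(s). The snippet below is a minimal, hypothetical sketch of such a sparse-to-dense switch; the goal-distance potential and the switch step are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def potential(state, goal):
    # Illustrative potential: negative Euclidean distance to the goal.
    return -np.linalg.norm(np.asarray(state, dtype=float) - np.asarray(goal, dtype=float))

def shaped_reward(sparse_r, state, next_state, goal, gamma=0.99):
    # Potential-based shaping r + gamma * phi(s') - phi(s) preserves optimal policies.
    return sparse_r + gamma * potential(next_state, goal) - potential(state, goal)

def s2d_reward(sparse_r, state, next_state, goal, step, switch_step=100_000):
    # Sparse-to-Dense (S2D) style transition: train on the sparse reward first,
    # then switch to the potential-based dense reward after `switch_step`.
    if step < switch_step:
        return sparse_r
    return shaped_reward(sparse_r, state, next_state, goal)
```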
2

Xu, Pei, Junge Zhang, Qiyue Yin, Chao Yu, Yaodong Yang, and Kaiqi Huang. "Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 10 (June 26, 2023): 11717–25. http://dx.doi.org/10.1609/aaai.v37i10.26384.

Abstract:
Exploration under sparse rewards is a key challenge for multi-agent reinforcement learning problems. One possible solution to this issue is to exploit inherent task structures for an acceleration of exploration. In this paper, we present a novel exploration approach, which encodes a special structural prior on the reward function into exploration, for sparse-reward multi-agent tasks. Specifically, a novel entropic exploration objective which encodes the structural prior is proposed to accelerate the discovery of rewards. By maximizing the lower bound of this objective, we then propose an algorithm with moderate computational cost, which can be applied to practical tasks. Under the sparse-reward setting, we show that the proposed algorithm significantly outperforms the state-of-the-art algorithms in the multiple-particle environment, the Google Research Football and StarCraft II micromanagement tasks. To the best of our knowledge, on some hard tasks (such as 27m_vs_30m) which have a relatively large number of agents and need non-trivial strategies to defeat enemies, our method is the first to learn winning strategies under the sparse-reward setting.
3

Mguni, David, Taher Jafferjee, Jianhong Wang, Nicolas Perez-Nieves, Wenbin Song, Feifei Tong, Matthew Taylor, et al. "Learning to Shape Rewards Using a Game of Two Partners." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 10 (June 26, 2023): 11604–12. http://dx.doi.org/10.1609/aaai.v37i10.26371.

Abstract:
Reward shaping (RS) is a powerful method in reinforcement learning (RL) for overcoming the problem of sparse or uninformative rewards. However, RS typically relies on manually engineered shaping-reward functions whose construction is time-consuming and error-prone. It also requires domain knowledge which runs contrary to the goal of autonomous learning. We introduce Reinforcement Learning Optimising Shaping Algorithm (ROSA), an automated reward shaping framework in which the shaping-reward function is constructed in a Markov game between two agents. A reward-shaping agent (Shaper) uses switching controls to determine which states to add shaping rewards for more efficient learning while the other agent (Controller) learns the optimal policy for the task using these shaped rewards. We prove that ROSA, which adopts existing RL algorithms, learns to construct a shaping-reward function that is beneficial to the task, thus ensuring efficient convergence to high-performance policies. We demonstrate ROSA’s properties in three didactic experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse reward environments.
4

Meng, Fanxiao. "Research on Multi-agent Sparse Reward Problem." Highlights in Science, Engineering and Technology 85 (March 13, 2024): 96–103. http://dx.doi.org/10.54097/er0mx710.

Abstract:
Sparse reward poses a significant challenge in deep reinforcement learning, leading to issues such as low sample utilization, slow agent convergence, and subpar performance of optimal policies. Overcoming these challenges requires tackling the complexity of sparse reward algorithms and addressing the lack of unified understanding. This paper aims to address these issues by introducing the concepts of reinforcement learning and sparse reward, as well as presenting three categories of sparse reward algorithms. Furthermore, the paper conducts an analysis and summary of three key aspects: manual labeling, hierarchical reinforcement learning, and the incorporation of intrinsic rewards. Hierarchical reinforcement learning is further divided into option-based and subgoal-based methods. The implementation principles, advantages, and disadvantages of all algorithms are thoroughly examined. In conclusion, this paper provides a comprehensive review and offers future directions for research in this field.
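Of the three families this survey covers, the intrinsic-reward category is the simplest to illustrate. The snippet below is a generic, textbook count-based exploration bonus, not code from the paper, and assumes a discrete or discretized state.

```python
import math
from collections import defaultdict

class CountBasedBonus:
    """Count-based intrinsic reward: the bonus decays as a state is revisited."""
    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def __call__(self, state):
        key = tuple(state)                 # assumes a hashable / discretized state
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])

# The learner then optimizes r_total = r_env + bonus(state), so rarely visited
# states stay attractive even when the environment reward is sparse.
```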
5

Zuo, Guoyu, Qishen Zhao, Jiahao Lu, and Jiangeng Li. "Efficient hindsight reinforcement learning using demonstrations for robotic tasks with sparse rewards." International Journal of Advanced Robotic Systems 17, no. 1 (January 1, 2020): 172988141989834. http://dx.doi.org/10.1177/1729881419898342.

Abstract:
The goal of reinforcement learning is to enable an agent to learn by using rewards. However, some robotic tasks are naturally specified with sparse rewards, and manually shaping reward functions is a difficult undertaking. In this article, we propose a general and model-free approach for reinforcement learning to learn robotic tasks with sparse rewards. First, a variant of Hindsight Experience Replay, Curious and Aggressive Hindsight Experience Replay, is proposed to improve the sample efficiency of reinforcement learning methods and avoid the need for complicated reward engineering. Second, based on the Twin Delayed Deep Deterministic policy gradient algorithm, demonstrations are leveraged to overcome the exploration problem and speed up the policy training process. Finally, the action loss is added into the loss function in order to minimize the vibration of the output action while maximizing its value. The experiments on simulated robotic tasks are performed with different hyperparameters to verify the effectiveness of our method. Results show that our method can effectively solve the sparse reward problem and obtain a high learning speed.
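For context, plain Hindsight Experience Replay relabels transitions with goals that were actually achieved later in the same episode, so even failed trajectories yield reward signal. The sketch below shows only the standard 'future' relabeling strategy; the article's Curious and Aggressive variant adds machinery not reproduced here, and the transition layout and reward function are assumptions.

```python
import random

def her_relabel(episode, compute_reward, k=4):
    """Standard 'future'-strategy HER relabeling.
    `episode` is a list of (obs, action, goal, next_obs, achieved_goal) tuples;
    `compute_reward(achieved_goal, goal)` returns e.g. 0.0 on success, -1.0 otherwise."""
    extra = []
    for t, (obs, action, goal, next_obs, achieved) in enumerate(episode):
        for _ in range(k):
            # Pick a goal that was actually achieved at or after step t.
            _, _, _, _, future_goal = random.choice(episode[t:])
            extra.append((obs, action, future_goal, next_obs,
                          compute_reward(achieved, future_goal)))
    return extra
```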
6

Velasquez, Alvaro, Brett Bissey, Lior Barak, Andre Beckus, Ismail Alkhouri, Daniel Melcer, and George Atia. "Dynamic Automaton-Guided Reward Shaping for Monte Carlo Tree Search." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (May 18, 2021): 12015–23. http://dx.doi.org/10.1609/aaai.v35i13.17427.

Abstract:
Reinforcement learning and planning have been revolutionized in recent years, due in part to the mass adoption of deep convolutional neural networks and the resurgence of powerful methods to refine decision-making policies. However, the problem of sparse reward signals and their representation remains pervasive in many domains. While various reward-shaping mechanisms and imitation learning approaches have been proposed to mitigate this problem, the use of human-aided artificial rewards introduces human error, sub-optimal behavior, and a greater propensity for reward hacking. In this paper, we mitigate this by representing objectives as automata in order to define novel reward shaping functions over this structured representation. In doing so, we address the sparse rewards problem within a novel implementation of Monte Carlo Tree Search (MCTS) by proposing a reward shaping function which is updated dynamically to capture statistics on the utility of each automaton transition as it pertains to satisfying the goal of the agent. We further demonstrate that such automaton-guided reward shaping can be utilized to facilitate transfer learning between different environments when the objective is the same.
7

Corazza, Jan, Ivan Gavran, and Daniel Neider. "Reinforcement Learning with Stochastic Reward Machines." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 6 (June 28, 2022): 6429–36. http://dx.doi.org/10.1609/aaai.v36i6.20594.

Abstract:
Reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. To overcome this practical limitation, we introduce a novel type of reward machines, called stochastic reward machines, and an algorithm for learning them. Our algorithm, based on constraint solving, learns minimal stochastic reward machines from the explorations of a reinforcement learning agent. This algorithm can easily be paired with existing reinforcement learning algorithms for reward machines and guarantees to converge to an optimal policy in the limit. We demonstrate the effectiveness of our algorithm in two case studies and show that it outperforms both existing methods and a naive approach for handling noisy reward functions.
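A reward machine is, in essence, a finite-state automaton whose transitions emit rewards as the agent observes high-level events. The minimal deterministic sketch below (with hypothetical "key"/"door" labels) shows the data structure the paper generalizes; a stochastic reward machine would attach a reward distribution to each transition rather than a scalar.

```python
class RewardMachine:
    """Minimal deterministic reward machine: transitions keyed by
    (machine_state, label) emit a reward and move the machine state."""
    def __init__(self, transitions, initial_state=0):
        self.transitions = transitions
        self.state = initial_state

    def step(self, label):
        # Unlisted (state, label) pairs self-loop with zero reward.
        self.state, reward = self.transitions.get((self.state, label),
                                                  (self.state, 0.0))
        return reward

# Hypothetical task: reward only after observing "key" and then "door".
rm = RewardMachine({
    (0, "key"): (1, 0.0),
    (1, "door"): (2, 1.0),
})
```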
8

Gaina, Raluca D., Simon M. Lucas, and Diego Pérez-Liébana. "Tackling Sparse Rewards in Real-Time Games with Statistical Forward Planning Methods." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 1691–98. http://dx.doi.org/10.1609/aaai.v33i01.33011691.

Abstract:
One of the issues general AI game players are required to deal with is the different reward systems in the variety of games they are expected to be able to play at a high level. Some games may present plentiful rewards which the agents can use to guide their search for the best solution, whereas others feature sparse reward landscapes that provide little information to the agents. The work presented in this paper focuses on the latter case, which most agents struggle with. Thus, modifications are proposed for two algorithms, Monte Carlo Tree Search and Rolling Horizon Evolutionary Algorithms, aiming at improving performance in this type of game while maintaining overall win rate across those where rewards are plentiful. Results show that longer rollouts and individual lengths, either fixed or responsive to changes in fitness landscape features, lead to a boost in performance in the games during testing without being detrimental to non-sparse reward scenarios.
9

Zhou, Xiao, Song Zhou, Xingang Mou, and Yi He. "Multirobot Collaborative Pursuit Target Robot by Improved MADDPG." Computational Intelligence and Neuroscience 2022 (February 25, 2022): 1–10. http://dx.doi.org/10.1155/2022/4757394.

Abstract:
Policy formulation is one of the main problems in multirobot systems, especially in multirobot pursuit-evasion scenarios, where both sparse rewards and random environmental changes make it difficult to find a better strategy. Existing multirobot decision-making methods mostly use environmental rewards to encourage robots to complete the target task, which cannot achieve good results. This paper proposes a multirobot pursuit method based on improved multiagent deep deterministic policy gradient (MADDPG), which solves the problem of sparse rewards in multirobot pursuit-evasion scenarios by combining the intrinsic reward and the external environmental reward. The state similarity module based on the threshold constraint is used as a part of the intrinsic reward signal output by the intrinsic curiosity module, which is used to balance overexploration and insufficient exploration, so that the agent can use the intrinsic reward more effectively to learn better strategies. The simulation experiment results show that the proposed method can improve the reward value of robots and the success rate of the pursuit task significantly. This change is clearly reflected in the real-time distance between the pursuer and the escapee: the pursuer trained with the improved algorithm can approach the escapee more quickly, and the average following distance also decreases.
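The intrinsic signal described above couples a curiosity term with a threshold-based state-similarity gate. The sketch below is a rough rendering of that idea under assumed names and a guessed gating rule, not the paper's exact module.

```python
import numpy as np

def intrinsic_reward(forward_model, state, action, next_state,
                     recent_states, eta=0.5, sim_threshold=0.1):
    """Curiosity bonus = forward-model prediction error, suppressed when the
    next state is nearly identical to recently visited states."""
    predicted = np.asarray(forward_model(state, action))   # predicted next-state features
    surprise = float(np.sum((predicted - np.asarray(next_state)) ** 2))
    if recent_states:
        nearest = min(np.linalg.norm(np.asarray(next_state) - np.asarray(s))
                      for s in recent_states)
        if nearest < sim_threshold:                         # too familiar: no bonus
            return 0.0
    return eta * surprise
```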
10

Jiang, Jiechuan, and Zongqing Lu. "Generative Exploration and Exploitation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 4337–44. http://dx.doi.org/10.1609/aaai.v34i04.5858.

Abstract:
Sparse reward is one of the biggest challenges in reinforcement learning (RL). In this paper, we propose a novel method called Generative Exploration and Exploitation (GENE) to overcome sparse reward. GENE automatically generates start states to encourage the agent to explore the environment and to exploit received reward signals. GENE can adaptively trade off between exploration and exploitation according to the varying distributions of states experienced by the agent as the learning progresses. GENE relies on no prior knowledge about the environment and can be combined with any RL algorithm, whether on-policy or off-policy, single-agent or multi-agent. Empirically, we demonstrate that GENE significantly outperforms existing methods in three tasks with only binary rewards, including Maze, Maze Ant, and Cooperative Navigation. Ablation studies verify the emergence of progressive exploration and automatic reversing.

Dissertations / Theses on the topic "Sparse Reward"

1

Hanski, Jari, and Kaan Baris Biçak. "An Evaluation of the Unity Machine Learning Agents Toolkit in Dense and Sparse Reward Video Game Environments." Thesis, Uppsala universitet, Institutionen för speldesign, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-444982.

Abstract:
In computer games, one use case for artificial intelligence is to create interesting problems for the player. To do this, new techniques such as reinforcement learning allow game developers to create artificial intelligence agents with human-like or superhuman abilities. The Unity ML-Agents toolkit is a plugin that provides game developers with access to reinforcement learning algorithms without requiring expertise in machine learning. In this paper, we compare reinforcement learning methods and provide empirical training data from two different environments. First, we describe the chosen reinforcement learning methods and then explain the design of both training environments, comparing the benefits in both dense and sparse reward environments. The reinforcement learning methods were evaluated by comparing the training speed and cumulative rewards of the agents. The goal was to evaluate how much the combination of extrinsic and intrinsic rewards accelerated the training process in the sparse reward environment. We hope this study helps game developers utilize reinforcement learning more effectively, saving time during the training process by choosing the most fitting training method for their video game environment. The results show that, when training reinforcement learning agents in sparse reward environments, the agents trained faster with the combination of extrinsic and intrinsic rewards, whereas an agent trained with only extrinsic rewards failed to learn to complete the task.
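The comparison above comes down to summing a weighted extrinsic reward stream with a weighted intrinsic (curiosity) stream. The generic sketch below illustrates that combination; the strength values are made up, and this is not the ML-Agents API.

```python
class CombinedRewardSignal:
    """Weighted sum of an extrinsic and an intrinsic reward stream."""
    def __init__(self, intrinsic_fn, extrinsic_strength=1.0, intrinsic_strength=0.02):
        self.intrinsic_fn = intrinsic_fn            # e.g. a curiosity bonus
        self.extrinsic_strength = extrinsic_strength
        self.intrinsic_strength = intrinsic_strength

    def __call__(self, extrinsic_reward, obs, action, next_obs):
        bonus = self.intrinsic_fn(obs, action, next_obs)
        return (self.extrinsic_strength * extrinsic_reward
                + self.intrinsic_strength * bonus)
```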
2

Castanet, Nicolas. "Automatic state representation and goal selection in unsupervised reinforcement learning." Electronic Thesis or Diss., Sorbonne université, 2025. http://www.theses.fr/2025SORUS005.

Abstract:
In the past few years, Reinforcement Learning (RL) achieved tremendous success by training specialized agents owning the ability to drastically exceed human performance in complex games like Chess or Go, or in robotics applications. These agents often lack versatility, requiring human engineering to design their behavior for specific tasks with a predefined reward signal, limiting their ability to handle new circumstances. This specialization results in poor generalization capabilities, which makes these agents vulnerable to small variations of external factors and to adversarial attacks. A long-term objective in artificial intelligence research is to move beyond today's specialized RL agents toward more generalist systems endowed with the capability to adapt in real time to unpredictable external factors and to new downstream tasks. This work moves in this direction, tackling unsupervised reinforcement learning problems, a framework where agents are not provided with external rewards, and thus must autonomously learn new tasks throughout their lifespan, guided by intrinsic motivations. The concept of intrinsic motivation arises from our understanding of humans' ability to exhibit certain self-sufficient behaviors during their development, such as playing or having curiosity. This ability allows individuals to design and solve their own tasks, and to build inner physical and social representations of their environments, acquiring an open-ended set of skills throughout their lifespan as a result. This thesis is part of the research effort to incorporate these essential features in artificial agents, leveraging goal-conditioned reinforcement learning to design agents able to discover and master every feasible goal in complex environments. In our first contribution, we investigate autonomous intrinsic goal setting, as a versatile agent should be able to determine its own goals and the order in which to learn these goals to enhance its performance. By leveraging a learned model of the agent's current goal-reaching abilities, we show that we can shape an optimal-difficulty goal distribution, enabling us to sample goals in the Zone of Proximal Development (ZPD) of the agent, a psychological concept referring to the frontier between what a learner knows and what it does not, constituting the space of knowledge that is not mastered yet but has the potential to be acquired. We demonstrate that targeting the agent's ZPD results in a significant increase in performance for a great variety of goal-reaching tasks. Another core competence is to extract a relevant representation of what matters in the environment from observations coming from any available sensors. We address this question in our second contribution, by highlighting the difficulty of learning a correct representation of the environment in an online setting, where the agent acquires knowledge incrementally as it makes progress. In this context, recently achieved goals are outliers, as there are very few occurrences of these new skills in the agent's experiences, making their representations brittle. We leverage the adversarial setting of Distributionally Robust Optimization in order for the agent's representations of such outliers to be reliable. We show that our method leads to a virtuous circle, as learning accurate representations for new goals fosters the exploration of the environment.
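The first contribution above selects goals of intermediate estimated difficulty. A hypothetical sketch of such a selection rule, with assumed thresholds and an assumed success-probability model, could look like this:

```python
import random

def sample_zpd_goals(candidate_goals, success_prob, n=16, low=0.2, high=0.8):
    """Keep goals whose estimated success probability is intermediate
    (neither already mastered nor currently out of reach) and sample from them."""
    pool = [g for g in candidate_goals if low < success_prob(g) < high]
    if not pool:                       # fall back to all candidates if the band is empty
        pool = list(candidate_goals)
    return random.sample(pool, k=min(n, len(pool)))
```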
3

Paolo, Giuseppe. "Learning in Sparse Rewards setting through Quality Diversity algorithms." Electronic Thesis or Diss., Sorbonne université, 2021. http://www.theses.fr/2021SORUS400.

Abstract:
Embodied agents, both natural and artificial, can learn to interact with the environment they are in through a process of trial and error. This process can be formalized through the Reinforcement Learning framework, in which the agent performs an action in the environment and observes its outcome through an observation and a reward signal. It is the reward signal that tells the agent how good the performed action is with respect to the task. This means that the more often a reward is given, the easier it is to improve on the current solution. When this is not the case, and the reward is given sparingly, the agent finds itself in a situation of sparse rewards. This requires a big focus on exploration, that is, on testing different things, in order to discover which action, or set of actions, leads to the reward. RL agents usually struggle with this. Exploration is the focus of Quality-Diversity methods, a family of evolutionary algorithms that searches for a set of policies whose behaviors are as different as possible, while also improving on their performances. In this thesis, we approach the problem of sparse rewards with these algorithms, and in particular with Novelty Search. This is a method that, contrary to many other Quality-Diversity approaches, does not improve on the performances of the discovered solutions, but only on their diversity. Thanks to this it can quickly explore the whole space of possible policy behaviors. The first part of the thesis focuses on autonomously learning a representation of the search space in which the algorithm evaluates the discovered policies. In this regard, we propose the Task Agnostic eXploration of Outcome spaces through Novelty and Surprise (TAXONS) algorithm. This method learns a low-dimensional representation of the search space in situations in which it is not easy to hand-design said representation. TAXONS has proven effective in three different environments but still requires information on when to capture the observation used to learn the search space. This limitation is addressed by performing a study on multiple ways to encode into the search space information about the whole trajectory of observations generated during a policy evaluation. Among the studied methods, we analyze in particular the mathematical transform called signature and its relevance to building trajectory-level representations. The manuscript continues with the study of a complementary problem to the one addressed by TAXONS: how to focus on the most interesting parts of the search space. Novelty Search is limited by the fact that all information about any reward discovered during the exploration process is ignored. In our second contribution, we introduce the Sparse Reward Exploration via Novelty Search and Emitters (SERENE) algorithm. This method separates the exploration of the search space from the exploitation of the reward through a two-step alternating approach. The exploration is performed through Novelty Search, but whenever a reward is discovered, it is exploited by instances of reward-based methods - called emitters - that perform local optimization of the reward. Experiments on different environments show how SERENE can quickly obtain high-rewarding solutions without hindering the exploration performances of the method. In our third and final contribution, we combine the two ideas presented with TAXONS and SERENE into a single approach: SERENE augmented TAXONS (STAX).
This algorithm can autonomously learn a low-dimensional representation of the search space while quickly optimizing any discovered reward through emitters. Experiments conducted on various environments show how the method can i) learn a representation allowing the discovery of all rewards and ii) quickly [...]
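Novelty Search, which all three contributions build on, scores a policy by how far its behavior descriptor lies from previously observed behaviors. The snippet below is the standard k-nearest-neighbour formulation of that score, not code from the thesis.

```python
import numpy as np

def novelty(behavior, archive, k=15):
    """Mean distance from `behavior` to its k nearest neighbours in the archive
    of previously observed behavior descriptors."""
    if not archive:
        return float("inf")
    dists = np.sort([np.linalg.norm(np.asarray(behavior) - np.asarray(b))
                     for b in archive])
    return float(np.mean(dists[:k]))
```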
4

Beretta, Davide. "Experience Replay in Sparse Rewards Problems using Deep Reinforcement Techniques." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/17531/.

Abstract:
This work introduces the reader to Reinforcement Learning, an area of Machine Learning that has seen a great deal of research in recent years. It then presents some modifications to ACER, a well-known and very interesting algorithm that makes use of Experience Replay. The aim is to increase its performance on general problems, and in particular on sparse reward problems. To assess the soundness of the proposed ideas, Montezuma's Revenge is used, a game developed for the Atari 2600 and considered among the most difficult to handle.
5

Parisi, Simone [author], Jan [academic supervisor] Peters, and Joschka [academic supervisor] Boedeker. "Reinforcement Learning with Sparse and Multiple Rewards / Simone Parisi ; Jan Peters, Joschka Boedeker." Darmstadt : Universitäts- und Landesbibliothek Darmstadt, 2020. http://d-nb.info/1203301545/34.

6

Benini, Francesco. "Predicting death in games with deep reinforcement learning." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/20755/.

Abstract:
This work is set in a branch of machine learning called reinforcement learning. The thesis aims to improve the work developed by the colleague M. Conciatori, focusing on games with very sparse rewards, where the previous solution had failed to make progress. Games with sparse rewards are those in which the agent must perform a large number of actions before obtaining a reward that lets it understand it is executing the correct sequence of actions. Among games with these characteristics, the focus is on one, Montezuma's Revenge, which stands out because a large number of actions is required to obtain the first reward; for this reason, virtually all of the algorithms developed so far have failed to achieve satisfactory results. The idea of continuing M. Conciatori's work arose precisely from the fact that Lower Bound DQN only managed to obtain the first reward. The main goal was therefore to find a solution capable of reaching optimal results, and to this end the idea was to predict the agent's death, thereby helping it avoid wrong actions and earn greater rewards. In this setting the agent takes longer to explore the environment and learn which behaviors yield a positive return. Consequently, the agent is assisted by returning it a penalty for what was harmful to its way of acting, that is, by assigning a sanction to all actions that cause the termination of the episode and therefore its death. Negative experiences are stored in a dedicated buffer, called the done buffer, from which they are then sampled to train the network. When the agent finds itself in the same situation again, it will know which action is best avoided and, over time, also which one to choose.
7

Gallouedec, Quentin. "Toward the generalization of reinforcement learning." Electronic Thesis or Diss., Ecully, Ecole centrale de Lyon, 2024. http://www.theses.fr/2024ECDL0013.

Abstract:
Conventional Reinforcement Learning (RL) involves training a unimodal agent on a single, well-defined task, guided by a gradient-optimized reward signal. This framework does not allow us to envisage a learning agent adapted to real-world problems involving diverse modality streams and multiple tasks, often poorly defined, sometimes not defined at all. Hence, we advocate for transitioning towards a more general framework, aiming to create RL algorithms that are more inherently versatile. To advance in this direction, we identify two primary areas of focus. The first aspect involves improving exploration, enabling the agent to learn from the environment with reduced dependence on the reward signal. We present Latent Go-Explore (LGE), an extension of the Go-Explore algorithm. While Go-Explore achieved impressive results, it was constrained by domain-specific knowledge. LGE overcomes these limitations, offering wider applicability within a general framework. In various tested environments, LGE consistently outperforms the baselines, showcasing its enhanced effectiveness and versatility. The second focus is to design a general-purpose agent that can operate in a variety of environments, thus involving a multimodal structure and even transcending the conventional sequential framework of RL. We introduce Jack of All Trades (JAT), a multimodal Transformer-based architecture uniquely tailored to sequential decision tasks. Using a single set of weights, JAT demonstrates robustness and versatility, competing with its single baseline on several RL benchmarks and even showing promising performance on vision and textual tasks. We believe that these two contributions are a valuable step towards a more general approach to RL. In addition, we present other methodological and technical advances that are closely related to our core research question. The first is the introduction of a set of sparsely rewarded simulated robotic environments designed to provide the community with the necessary tools for learning under conditions of low supervision. Notably, three years after its introduction, this contribution has been widely adopted by the community and continues to receive active maintenance and support. On the other hand, we present Open RL Benchmark, our pioneering initiative to provide a comprehensive and fully tracked set of RL experiments, going beyond typical data to include all algorithm-specific and system metrics. This benchmark aims to improve research efficiency by providing out-of-the-box RL data and facilitating accurate reproducibility of experiments. With its community-driven approach, it has quickly become an important resource, documenting over 25,000 runs. These technical and methodological advances, along with the scientific contributions described above, are intended to promote a more general approach to Reinforcement Learning and, we hope, represent a meaningful step toward the eventual development of a more operative RL agent.
8

Junyent, Barbany Miquel. "Width-Based Planning and Learning." Doctoral thesis, Universitat Pompeu Fabra, 2021. http://hdl.handle.net/10803/672779.

Abstract:
Optimal sequential decision making is a fundamental problem in many diverse fields. In recent years, Reinforcement Learning (RL) methods have experienced unprecedented success, largely enabled by the use of deep learning models, reaching human-level performance in several domains, such as the Atari video games or the ancient game of Go. In contrast to the RL approach, in which the agent learns a policy from environment interaction samples, ignoring the structure of the problem, the planning approach for decision making assumes known models for the agent's goals and domain dynamics, and focuses on determining how the agent should behave to achieve its objectives. Current planners are able to solve problem instances involving huge state spaces by precisely exploiting the problem structure that is defined in the state-action model. In this work we combine the two approaches, leveraging fast and compact policies from learning methods and the capacity to perform lookaheads in combinatorial problems from planning methods. In particular, we focus on a family of planners called width-based planners, which has demonstrated great success in recent years due to its ability to scale independently of the size of the state space. The basic algorithm, Iterated Width (IW), was originally proposed for classical planning problems, where a model for state transitions and goals, represented by sets of atoms, is fully determined. Nevertheless, width-based planners do not require a fully defined model of the environment, and can be used with simulators. For instance, they have been recently applied in pixel domains such as the Atari games. Despite its success, IW is purely exploratory, and does not leverage past reward information. Furthermore, it requires the state to be factored into features that need to be pre-defined for the particular task. Moreover, running the algorithm with a width larger than 1 is usually computationally intractable in practice, prohibiting IW from solving higher-width problems. We begin this dissertation by studying the complexity of width-based methods when the state space is defined by multivalued features, as in the RL setting, instead of Boolean atoms. We provide a tight upper bound on the number of nodes expanded by IW, as well as overall algorithmic complexity results. In order to deal with more challenging problems (i.e., those with a width higher than 1), we present a hierarchical algorithm that plans at two levels of abstraction. A high-level planner uses abstract features that are incrementally discovered from low-level pruning decisions. We illustrate this algorithm in classical planning PDDL domains as well as in pixel-based simulator domains. In classical planning, we show how IW(1) at two levels of abstraction can solve problems of width 2. To leverage past reward information, we extend width-based planning by incorporating an explicit policy in the action selection mechanism. Our method, called π-IW, interleaves width-based planning and policy learning using the state-actions visited by the planner. The policy estimate takes the form of a neural network and is in turn used to guide the planning step, thus reinforcing promising paths. Notably, the representation learned by the neural network can be used as a feature space for the width-based planner without degrading its performance, thus removing the requirement of pre-defined features for the planner.
We compare π-IW with previous width-based methods and with AlphaZero, a method that also interleaves planning and learning, in simple environments, and show that π-IW has superior performance. We also show that the π-IW algorithm outperforms previous width-based methods in the pixel setting of the Atari games suite. Finally, we show that the proposed hierarchical IW can be seamlessly integrated with our policy learning scheme, resulting in an algorithm that outperforms flat IW-based planners in Atari games with sparse rewards.
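The pruning rule at the heart of Iterated Width with width 1 is compact enough to sketch: a breadth-first search that discards any state failing to make at least one new atom (feature value) true. The generic rendering below assumes `atoms`, `successors`, and `is_goal` are user-supplied callables; it is not the thesis' implementation.

```python
from collections import deque

def iw1(initial_state, successors, atoms, is_goal):
    """IW(1): breadth-first search pruning states that add no unseen atom."""
    seen_atoms = set(atoms(initial_state))
    frontier = deque([initial_state])
    while frontier:
        state = frontier.popleft()
        if is_goal(state):
            return state
        for nxt in successors(state):
            new_atoms = [a for a in atoms(nxt) if a not in seen_atoms]
            if new_atoms:                     # novelty 1: at least one unseen atom
                seen_atoms.update(new_atoms)
                frontier.append(nxt)
    return None                               # exhausted without reaching a goal
```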
9

Parisi, Simone. "Reinforcement Learning with Sparse and Multiple Rewards." Phd thesis, 2020. https://tuprints.ulb.tu-darmstadt.de/11372/1/THESIS.PDF.

Abstract:
Over the course of the last decade, the framework of reinforcement learning has developed into a promising tool for learning a large variety of tasks. The idea of reinforcement learning is, at its core, very simple yet effective. The learning agent is left to explore the world by performing actions based on its observations of the state of the world, and in turn receives feedback, called a reward, assessing the quality of its behavior. However, learning soon becomes challenging and even impractical as the complexity of the environment and of the task increase. In particular, learning without any pre-defined behavior (1) in the presence of rarely emitted or sparse rewards, (2) maintaining stability even with limited data, and (3) with possibly multiple conflicting objectives are some of the most prominent issues that the agent has to face. Consider the simple problem of a robot that needs to learn a parameterized controller in order to reach a certain point based solely on the raw sensory observation, e.g., internal readings of joint positions and camera images of the surrounding environment, and on the binary reward "success" / "failure". Without any prior knowledge of the world's dynamics, or any hint on how to behave, the robot will start acting randomly. Such an exploration strategy will be (1) very unlikely to bring the robot closer to the goal, and thus to experience the "success" feedback, and (2) likely to generate useless trajectories; subsequently, learning will be unstable. Furthermore, (3) there are many different ways the robot can reach the goal. For instance, the robot can quickly accelerate and then suddenly stop at the desired point, or it can slowly and smoothly navigate to the goal. These behaviors are clearly opposite, but the binary feedback does not provide any hint on which is more desirable. It should be clear that even simple problems such as a reaching task can turn non-trivial for reinforcement learning. One possible solution is to pre-engineer the task, e.g., hand-crafting the initial exploration behavior with imitation learning, shaping the reward based on the distance from the goal, or adding auxiliary rewards based on speed and safety. Following this solution, in recent years a lot of effort has been directed towards scaling reinforcement learning to solve complex real-world problems, such as robotic tasks with many degrees of freedom, videogames, and board games like Chess, Go, and Shogi. These advances, however, were possible largely thanks to experts' prior knowledge and engineering, such as pre-initialized parameterized agent behaviors and reward shaping, and often required a prohibitive amount of data. This large amount of required prior knowledge and pre-structuring is arguably in stark contrast to the goal of developing autonomous learning. In this thesis we will present methods to increase the autonomy of reinforcement learning algorithms, i.e., learning without expert pre-engineering, by addressing the issues discussed above. The key points of our research address (1) techniques to deal with multiple conflicting reward functions, (2) methods to enhance exploration in the presence of sparse rewards, and (3) techniques to enable more stable and safer learning. Progress in each of these aspects will lift reinforcement learning to a higher level of autonomy. First, we will address the presence of conflicting objectives from a multi-objective optimization perspective.
In this scenario, the standard concept of optimality is replaced by Pareto optimality, a concept for representing compromises among the objectives. Subsequently, the goal is to find the Pareto frontier, a set of solutions representing different compromises among the objectives. Despite recent advances in multi-objective optimization, achieving an accurate representation of the Pareto frontier is still an important challenge. Common practical approaches rely on experts to manually set priority or thresholds on the objectives. These methods require prior knowledge and are not able to learn the whole Pareto frontier but just a portion of it, possibly missing interesting solutions. On the contrary, we propose a manifold-based method which learn a continuous approximation of the frontier without the need of any prior knowledge. We will then consider learning in the presence of sparse rewards and present novel exploration strategies. Classical exploration techniques in reinforcement learning mostly revolve around the immediate reward, that is, how to choose an action to balance between exploitation and exploration for the current state. These methods, however, perform extremely poorly if only sparse rewards are provided. State-of-the-art exploration strategies, thus, rely either on local exploration along the current solution together with sensible initialization, or on handcrafted strategies based on heuristics. These approaches, however, either require prior knowledge or have poor guarantees of convergence, and often falls in local optima. On the contrary, we propose an approach that plans exploration actions far into the future based on what we call long-term visitation value. Intuitively, this value assesses the number of unvisited states that the agent can visit in the future by performing that action. Finally, we address the problem of stabilizing learning when little data is available. Even assuming efficient exploration strategies, dense rewards, and the presence of only one objective, reinforcement learning can exhibit unstable behavior. Interestingly, the most successful algorithms, namely actor-critic methods, are also the most sensible to this issue. These methods typically separate the problem of learning the value of a given state from the problem of learning the optimal action to execute in such a state. The former is fullfilled by the so-called critic, while the latter by the so-called actor. In this scenario, the instability is due the interplay between these two components, especially when nonlinear approximators, such as neural networks, are employed. To avoid such issues, we propose to regularize the learning objective of the actor by penalizing the error of the critic. This improves stability by avoiding large steps in the actor update whenever the critic is highly inaccurate. Altogether, the individual contributions of this thesis allow reinforcement learning to rely less on expert pre-engineering. The proposed methods can be applied to a large variety of common algorithms, and are evaluated on a wide array of tasks. Results on both standard and novel benchmarks confirm their effectiveness.
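To make the last contribution concrete, the sketch below shows one way an actor update could be penalized by the critic's temporal-difference error, so the policy moves cautiously while the critic is still inaccurate. This is only a minimal illustration under stated assumptions (a deterministic policy, a Q-critic, and an illustrative penalty weight eta), not the thesis's exact formulation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, eta = 3, 1, 0.99, 0.1   # eta weights the critic-error penalty (illustrative)
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(obs, act):
    # Q-value of a state-action pair under the current critic.
    return critic(torch.cat([obs, act], dim=-1))

def update(obs, act, reward, next_obs, done):
    # Critic: standard one-step temporal-difference regression on Q(s, a).
    with torch.no_grad():
        target = reward + gamma * (1 - done) * q(next_obs, actor(next_obs))
    critic_loss = (q(obs, act) - target).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy-gradient term plus a penalty on the TD error
    # induced by the actor's own actions, discouraging large policy steps
    # wherever the critic is still inaccurate.
    pg_term = -q(obs, actor(obs)).mean()
    td_error = (reward + gamma * (1 - done) * q(next_obs, actor(next_obs))
                - q(obs, actor(obs)))
    actor_loss = pg_term + eta * td_error.pow(2).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# Illustrative call on a random batch of transitions.
batch = (torch.randn(8, obs_dim), torch.rand(8, act_dim) * 2 - 1,
         torch.randn(8, 1), torch.randn(8, obs_dim), torch.zeros(8, 1))
update(*batch)
```

The point of the penalty term is simply that the actor's objective worsens when the critic's one-step error is large, which in practice slows the policy update in poorly learned regions.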
APA, Harvard, Vancouver, ISO, and other styles
10

Chi, Lu-cheng (紀律呈). "An Improved Deep Reinforcement Learning with Sparse Rewards." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/eq94pr.

Full text
Abstract:
Master's thesis
National Sun Yat-sen University
Department of Electrical Engineering
Academic year 107 (2018)
In reinforcement learning, how an agent explores an environment with sparse rewards is a long-standing problem. The improved deep reinforcement learning method described in this thesis encourages an agent to explore unvisited environmental states in such environments. In deep reinforcement learning, an agent directly uses an image observation from the environment as input to the neural network. However, some neglected observations from the environment, such as depth, might provide valuable information. The proposed method is based on the Actor-Critic algorithm and uses a convolutional neural network as a hetero-encoder between the image input and other observations from the environment. In environments with sparse rewards, we use these neglected observations as target outputs of supervised learning and provide the agent with denser training signals through supervised learning to bootstrap reinforcement learning. In addition, we use the loss from supervised learning as feedback for the agent's exploration behavior, called the label reward, to encourage the agent to explore unvisited environmental states. Finally, we construct multiple neural networks with the Asynchronous Advantage Actor-Critic algorithm and learn the policy with multiple agents. The proposed method is compared with other deep reinforcement learning methods in an environment with sparse rewards and achieves better performance.
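As a rough illustration of the idea (a minimal sketch under assumed shapes and hyperparameters, such as 84x84 RGB inputs, a depth-map target, and a coefficient beta; this is not the thesis's code), the snippet below shows a shared convolutional encoder with policy, value, and depth-prediction heads. The supervised depth loss both trains the shared encoder and is reused, scaled by beta, as an intrinsic "label reward" that would be added to the sparse environment reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeteroEncoderAC(nn.Module):
    """Shared CNN encoder with policy, value, and auxiliary depth-prediction heads."""
    def __init__(self, n_actions):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.Flatten())
        feat = 32 * 9 * 9                                # feature size for 84x84 RGB inputs
        self.depth_head = nn.Linear(feat, 84 * 84)       # predicts a flattened depth map
        self.policy_head = nn.Linear(feat, n_actions)
        self.value_head = nn.Linear(feat, 1)

    def forward(self, rgb):
        z = self.encoder(rgb)
        return self.policy_head(z), self.value_head(z), self.depth_head(z)

def label_reward(model, rgb, depth_target, beta=0.01):
    """Supervised loss on the neglected observation (depth), reused as an intrinsic bonus."""
    _, _, depth_pred = model(rgb)
    aux_loss = F.mse_loss(depth_pred, depth_target.flatten(1))
    return aux_loss, beta * aux_loss.detach()            # (training loss, label reward)

# Illustrative usage on a random batch; the bonus would be added to the sparse reward.
model = HeteroEncoderAC(n_actions=4)
rgb = torch.rand(2, 3, 84, 84)
depth = torch.rand(2, 1, 84, 84)
aux_loss, bonus = label_reward(model, rgb, depth)
```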
APA, Harvard, Vancouver, ISO, and other styles

Books on the topic "Sparse Reward"

1

Rudyard, Kipling. Rewards and fairies. Garden City, N.Y: Doubleday, Page, 1989.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
2

Rudyard, Kipling. Puck of Pook's Hill ; and, Rewards and fairies. Oxford: Oxford University Press, 1992.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
3

Persson, Fabian. Women at the Early Modern Swedish Court. NL Amsterdam: Amsterdam University Press, 2021. http://dx.doi.org/10.5117/9789463725200.

Full text
Abstract:
What was possible for a woman to achieve at an early modern court? By analysing the experiences of a wide range of women at the court of Sweden, this book demonstrates the opportunities open to women who served at, and interacted with, the court; the complexities of women's agency in a court society; and, ultimately, the precariousness of power. In doing so, it provides an institutional context to women's lives at court, charting the full extent of the rewards that they might obtain, alongside the social and institutional constrictions that they faced. Its longue durée approach, moreover, clarifies how certain periods, such as that of the queens regnant, brought new possibilities. Based on an extensive array of Swedish and international primary sources, including correspondence, financial records and diplomatic reports, it also takes into account the materialities used to create hierarchies and ceremonies, such as physical structures and spaces within the court. Comprehensive in its scope, the book is divided into three parts, which focus respectively on outsiders at court, insiders, and members of the royal family.
APA, Harvard, Vancouver, ISO, and other styles
4

Prima. Official Sega Genesis: Power Tips Book. Rocklin, CA: Prima Publishing, 1992.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
5

Mcdermott, Leeanne. GamePro Presents: Sega Genesis Games Secrets: Greatest Tips. Rocklin: Prima Publishing, 1992.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
6

Sandler, Corey. Official Sega Genesis and Game Gear strategies, 3RD Edition. New York: Bantam Books, 1992.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
7

Rudyard, Kipling. Rewards & Fairies. Amereon Ltd, 1988.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
8

Rudyard, Kipling. Rewards and Fairies. Createspace Independent Publishing Platform, 2016.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
9

Rudyard, Kipling. Rewards and Fairies. Independently Published, 2021.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
10

Rudyard, Kipling. Rewards and Fairies. Pan MacMillan, 2016.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
More sources

Book chapters on the topic "Sparse Reward"

1

Hensel, Maximilian. "Exploration Methods in Sparse Reward Environments." In Reinforcement Learning Algorithms: Analysis and Applications, 35–45. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-41188-6_4.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Moy, Glenn, and Slava Shekh. "Evolution Strategies for Sparse Reward Gridworld Environments." In AI 2022: Advances in Artificial Intelligence, 266–78. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-22695-3_19.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Jeewa, Asad, Anban W. Pillay, and Edgar Jembere. "Learning to Generalise in Sparse Reward Navigation Environments." In Artificial Intelligence Research, 85–100. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-66151-9_6.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Chen, Zhongpeng, and Qiang Guan. "Continuous Exploration via Multiple Perspectives in Sparse Reward Environment." In Pattern Recognition and Computer Vision, 57–68. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-8435-0_5.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Lei, Hejun, Paul Weng, Juan Rojas, and Yisheng Guan. "Planning with Q-Values in Sparse Reward Reinforcement Learning." In Intelligent Robotics and Applications, 603–14. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-13844-7_56.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Fu, Yupeng, Yuan Xiao, Jun Fang, Xiangyang Deng, Ziqiang Zhu, and Limin Zhang. "Distributed Advantage-Based Weights Reshaping Algorithm with Sparse Reward." In Lecture Notes in Computer Science, 391–400. Singapore: Springer Nature Singapore, 2024. http://dx.doi.org/10.1007/978-981-97-7181-3_31.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Le, Bang-Giang, Thi-Linh Hoang, Hai-Dang Kieu, and Viet-Cuong Ta. "Structural and Compact Latent Representation Learning on Sparse Reward Environments." In Intelligent Information and Database Systems, 40–51. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-5837-5_4.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Wu, Feng, and Xiaoping Chen. "Solving Large-Scale and Sparse-Reward DEC-POMDPs with Correlation-MDPs." In RoboCup 2007: Robot Soccer World Cup XI, 208–19. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. http://dx.doi.org/10.1007/978-3-540-68847-1_18.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Mizukami, Naoki, Jun Suzuki, Hirotaka Kameko, and Yoshimasa Tsuruoka. "Exploration Bonuses Based on Upper Confidence Bounds for Sparse Reward Games." In Lecture Notes in Computer Science, 165–75. Cham: Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-71649-7_14.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Kang, Yongxin, Enmin Zhao, Yifan Zang, Kai Li, and Junliang Xing. "Towards a Unified Benchmark for Reinforcement Learning in Sparse Reward Environments." In Communications in Computer and Information Science, 189–201. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-1639-9_16.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Sparse Reward"

1

Hossain, Jumman, Abu-Zaher Faridee, Nirmalya Roy, Jade Freeman, Timothy Gregory, and Theron Trout. "TopoNav: Topological Navigation for Efficient Exploration in Sparse Reward Environments." In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 693–700. IEEE, 2024. https://doi.org/10.1109/iros58592.2024.10802380.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Huang, Chao, Yibei Guo, Zhihui Zhu, Mei Si, Daniel Blankenberg, and Rui Liu. "Quantum Exploration-based Reinforcement Learning for Efficient Robot Path Planning in Sparse-Reward Environment." In 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (ROMAN), 516–21. IEEE, 2024. http://dx.doi.org/10.1109/ro-man60168.2024.10731199.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Yang, Kai, Zhirui Fang, Xiu Li, and Jian Tao. "CMBE: Curiosity-driven Model-Based Exploration for Multi-Agent Reinforcement Learning in Sparse Reward Settings." In 2024 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE, 2024. http://dx.doi.org/10.1109/ijcnn60899.2024.10650769.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Farkaš, Igor. "Explaining Internal Representations in Deep Networks: Adversarial Vulnerability of Image Classifiers and Learning Sequential Tasks with Sparse Reward." In 2025 IEEE 23rd World Symposium on Applied Machine Intelligence and Informatics (SAMI), 000015–16. IEEE, 2025. https://doi.org/10.1109/sami63904.2025.10883317.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Xi, Lele, Hongkun Wang, Zhijie Li, and Changchun Hua. "An Experience Replay Approach Based on SSIM to Solve the Sparse Reward Problem in Pursuit Evasion Game*." In 2024 China Automation Congress (CAC), 6238–43. IEEE, 2024. https://doi.org/10.1109/cac63892.2024.10864615.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Wang, Guojian, Faguo Wu, and Xiao Zhang. "Trajectory-Oriented Policy Optimization with Sparse Rewards." In 2024 2nd International Conference on Intelligent Perception and Computer Vision (CIPCV), 76–81. IEEE, 2024. http://dx.doi.org/10.1109/cipcv61763.2024.00023.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Cheng, Hao, Jiahang Cao, Erjia Xiao, Mengshu Sun, and Renjing Xu. "Gaining the Sparse Rewards by Exploring Lottery Tickets in Spiking Neural Networks." In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 442–49. IEEE, 2024. https://doi.org/10.1109/iros58592.2024.10802854.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Huang, Yuming, Bin Ren, Ziming Xu, and Lianghong Wu. "MRHER: Model-based Relay Hindsight Experience Replay for Sequential Object Manipulation Tasks with Sparse Rewards." In 2024 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE, 2024. http://dx.doi.org/10.1109/ijcnn60899.2024.10650959.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Tian, Yuhe, Ayooluwa Akintola, Yazhou Jiang, Dewei Wang, Jie Bao, Miguel A. Zamarripa, Brandon Paul, et al. "Reinforcement Learning-Driven Process Design: A Hydrodealkylation Example." In Foundations of Computer-Aided Process Design, 387–93. Hamilton, Canada: PSE Press, 2024. http://dx.doi.org/10.69997/sct.119603.

Full text
Abstract:
In this work, we present a follow-up to prior work on reinforcement learning (RL)-driven process design using the Institute for Design of Advanced Energy Systems Process Systems Engineering (IDAES-PSE) Framework. Herein, process designs are generated as stream inlet-outlet matrices and optimized using the IDAES platform, whose objective function value serves as the reward to the RL agent. A Deep Q-Network is employed as the RL agent, comprising a series of convolutional neural network layers and fully connected layers that compute the actions of adding or removing stream connections, thus creating a new process design. The resulting process design is then fed back to the RL agent to refine its learning. The iteration continues until the maximum number of steps is reached, with feasible process designs generated. To further expedite the RL search of the design space, which can comprise the selection of any candidate unit(s) with arbitrary stream connections, we investigate the role of the RL reward function and its impact on exploring more complicated versus intensified process configurations. A sub-space search strategy is also developed to branch the combinatorial design space and accelerate the discovery of feasible process design solutions, particularly when a large pool of candidate process units is selected by the user. The potential of the enhanced RL-assisted process design strategy is showcased via a hydrodealkylation example.
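To give a flavour of this setup, the following is a heavily simplified, hypothetical sketch: it encodes a process design as a binary stream-connection matrix, lets a small Q-network choose "toggle connection (i, j)" actions, and stubs out the external flowsheet evaluation that would supply the reward. The paper's agent uses convolutional layers and delegates evaluation to the IDAES platform; the unit count, network shape, and placeholder objective below are illustrative assumptions only.

```python
import numpy as np
import torch
import torch.nn as nn

N_UNITS = 6                                    # illustrative number of candidate units
N_ACTIONS = N_UNITS * N_UNITS                  # one "toggle connection (i, j)" action per matrix entry

q_net = nn.Sequential(nn.Flatten(), nn.Linear(N_ACTIONS, 128), nn.ReLU(),
                      nn.Linear(128, N_ACTIONS))

def evaluate_flowsheet(connections: np.ndarray) -> float:
    """Stub for the external flowsheet optimization that returns the objective value."""
    return -float(connections.sum())           # placeholder objective, not a real flowsheet model

def design_step(connections: np.ndarray, epsilon: float = 0.1):
    """One epsilon-greedy step: toggle a stream connection and collect the reward."""
    state = torch.as_tensor(connections, dtype=torch.float32).unsqueeze(0)
    if np.random.rand() < epsilon:
        action = np.random.randint(N_ACTIONS)
    else:
        with torch.no_grad():
            action = int(q_net(state).argmax())
    i, j = divmod(action, N_UNITS)
    connections[i, j] ^= 1                     # add or remove the stream from unit i to unit j
    reward = evaluate_flowsheet(connections)   # objective value fed back as the reward
    return connections, action, reward

# Illustrative usage starting from an empty design.
design = np.zeros((N_UNITS, N_UNITS), dtype=np.int64)
design, action, reward = design_step(design)
```

Training the Q-network on the resulting (state, action, reward, next state) tuples would then proceed as in a standard DQN loop, with the reward shaped by whatever objective the flowsheet evaluation returns.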
APA, Harvard, Vancouver, ISO, and other styles
10

Yang, Dong, and Yuhua Tang. "Adaptive Inner-reward Shaping in Sparse Reward Games." In 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020. http://dx.doi.org/10.1109/ijcnn48605.2020.9207302.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Reports on the topic "Sparse Reward"

1

Erik Lyngdorf, Niels, Selina Thelin Ruggaard, Kathrin Otrel-Cass, and Eamon Costello. The Hacking Innovative Pedagogies (HIP) framework: - Rewilding the digital learning ecology. Aalborg University, 2023. http://dx.doi.org/10.54337/aau602808725.

Full text
Abstract:
The HIP framework aims to guide higher education (HE) teachers and researchers in reconsidering and rethinking HE pedagogy in new and different ways. It builds on insights from the report Hacking Innovative Pedagogy: Innovation and Digitisation to Rewild Higher Education. A Commented Atlas (Beskorsa et al., 2023) and incorporates the spirit of rewilding and hacking pedagogies to inspire new professional communities focused on innovating digital education. The framework considers and guides the development of teachers' digital pedagogy competences through an inclusive bottom-up approach that gives space for individual teachers' agency while also ensuring a collective teaching culture. The framework emphasizes how pedagogical approaches can address the different needs of HE teachers and student communities, reflecting disciplinary cultures and/or the diversity of learners. Only a framework mindful of heterogeneity will be able to address questions of justice and fair access to education. Likewise, in the spirit of rewilding, the framework should not be considered a static "one size fits all" solution. We aim for an organic and dynamic framework that may be used to pause and reflect, and then to turn back to one's own teaching community to learn from, listen to, and respond to the teaching and learning of different communities. Therefore, we plan for this framework to be a living document throughout the HIP project's lifetime.
APA, Harvard, Vancouver, ISO, and other styles
2

Murray, Chris, Keith Williams, Norrie Millar, Monty Nero, Amy O'Brien, and Damon Herd. A New Palingenesis. University of Dundee, November 2022. http://dx.doi.org/10.20933/100001273.

Full text
Abstract:
Robert Duncan Milne (1844-99), from Cupar, Fife, was a pioneering author of science fiction stories, most of which appeared in San Francisco’s Argonaut magazine in the 1880s and ’90s. SF historian Sam Moskowitz credits Milne with being the first full-time SF writer, and his contribution to the genre is arguably greater than anyone else’s, including Stevenson and Conan Doyle, yet it has all but disappeared into oblivion. Milne was fascinated by science. He drew on the investigations of Scottish physicists and inventors such as James Clerk Maxwell and Alexander Graham Bell into the possibilities of electromagnetic forces and new communications media to overcome distances in space and time. Milne wrote about visual time-travelling long before H.G. Wells. He foresaw virtual ‘tele-presencing’, remote surveillance, mobile phones and worldwide satellite communications – not to mention climate change, scientific terrorism and drone warfare, cryogenics and molecular reengineering. Milne also wrote on alien life forms, artificial immortality, identity theft and personality exchange, lost worlds and the rediscovery of extinct species. ‘A New Palingenesis’, originally published in The Argonaut on July 7th 1883, and adapted in this comic, is a secular version of the resurrection myth. Mary Shelley was the first scientiser of the occult to rework the supernatural idea of reanimating the dead through the mysterious powers of electricity in Frankenstein (1818). Milne’s story, in which Doctor S- dissolves his terminally ill wife’s body in order to bring her back to life in restored health, is a striking further modernisation of Frankenstein, reflecting late-nineteenth-century interest in electromagnetic science and spiritualism. In particular, it is a retelling of Shelley’s narrative strand about Frankenstein’s aborted attempt to shape a female mate for his creature, but also his misogynistic ambition to bypass the sexual principle in reproducing life altogether. By doing so, Milne interfused Shelley’s updating of the Promethean myth with others. ‘A New Palingenesis’ is also a version of Pygmalion and his male-ordered, wish-fulfilling desire to animate his idealised female sculpture, Galatea from Ovid’s Metamorphoses, perhaps giving a positive twist to Orpheus’s attempt to bring his corpse-bride Eurydice back from the underworld as well? With its basis in spiritualist ideas about the soul as a kind of electrical intelligence, detachable from the body but a material entity nonetheless, Doctor S- treats his wife as an ‘intelligent battery’. He is thus able to preserve her personality after death and renew her body simultaneously because that captured electrical intelligence also carries a DNA-like code for rebuilding the individual organism itself from its chemical constituents. The descriptions of the experiment and the body’s gradual re-materialisation are among Milne’s most visually impressive, anticipating the X-ray-like anatomisation and reversal of Griffin’s disappearance process in Wells’s The Invisible Man (1897). In the context of the 1880s, it must have been a compelling scientisation of the paranormal, combining highly technical descriptions of the Doctor’s system of electrically linked glass coffins with ghostly imagery. It is both dramatic and highly visual, even cinematic in its descriptions, and is here brought to life in the form of a comic.
APA, Harvard, Vancouver, ISO, and other styles