Journal articles on the topic "Sparse Reward"
Create an accurate citation in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 journal articles for your research on the topic "Sparse Reward".
Next to every source in the list of references there is an "Add to bibliography" button. Press it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Explore journal articles on a wide variety of disciplines and organize your bibliography correctly.
Park, Junseok, Yoonsung Kim, Hee bin Yoo, Min Whoo Lee, Kibeom Kim, Won-Seok Choi, Minsu Lee, and Byoung-Tak Zhang. "Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 1 (March 24, 2024): 592–600. http://dx.doi.org/10.1609/aaai.v38i1.27815.
Xu, Pei, Junge Zhang, Qiyue Yin, Chao Yu, Yaodong Yang, and Kaiqi Huang. "Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 10 (June 26, 2023): 11717–25. http://dx.doi.org/10.1609/aaai.v37i10.26384.
Mguni, David, Taher Jafferjee, Jianhong Wang, Nicolas Perez-Nieves, Wenbin Song, Feifei Tong, Matthew Taylor, et al. "Learning to Shape Rewards Using a Game of Two Partners". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 10 (June 26, 2023): 11604–12. http://dx.doi.org/10.1609/aaai.v37i10.26371.
Meng, Fanxiao. "Research on Multi-agent Sparse Reward Problem". Highlights in Science, Engineering and Technology 85 (March 13, 2024): 96–103. http://dx.doi.org/10.54097/er0mx710.
Zuo, Guoyu, Qishen Zhao, Jiahao Lu, and Jiangeng Li. "Efficient hindsight reinforcement learning using demonstrations for robotic tasks with sparse rewards". International Journal of Advanced Robotic Systems 17, no. 1 (January 1, 2020): 172988141989834. http://dx.doi.org/10.1177/1729881419898342.
Velasquez, Alvaro, Brett Bissey, Lior Barak, Andre Beckus, Ismail Alkhouri, Daniel Melcer, and George Atia. "Dynamic Automaton-Guided Reward Shaping for Monte Carlo Tree Search". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (May 18, 2021): 12015–23. http://dx.doi.org/10.1609/aaai.v35i13.17427.
Corazza, Jan, Ivan Gavran, and Daniel Neider. "Reinforcement Learning with Stochastic Reward Machines". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 6 (June 28, 2022): 6429–36. http://dx.doi.org/10.1609/aaai.v36i6.20594.
Gaina, Raluca D., Simon M. Lucas, and Diego Pérez-Liébana. "Tackling Sparse Rewards in Real-Time Games with Statistical Forward Planning Methods". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 1691–98. http://dx.doi.org/10.1609/aaai.v33i01.33011691.
Zhou, Xiao, Song Zhou, Xingang Mou, and Yi He. "Multirobot Collaborative Pursuit Target Robot by Improved MADDPG". Computational Intelligence and Neuroscience 2022 (February 25, 2022): 1–10. http://dx.doi.org/10.1155/2022/4757394.
Jiang, Jiechuan, and Zongqing Lu. "Generative Exploration and Exploitation". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 4337–44. http://dx.doi.org/10.1609/aaai.v34i04.5858.
Kong, Yan, Yefeng Rui, and Chih-Hsien Hsia. "A Deep Reinforcement Learning-Based Approach in Porker Game". 電腦學刊 34, no. 2 (April 2023): 041–51. http://dx.doi.org/10.53106/199115992023043402004.
Dann, Michael, Fabio Zambetta, and John Thangarajah. "Deriving Subgoals Autonomously to Accelerate Learning in Sparse Reward Domains". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 881–89. http://dx.doi.org/10.1609/aaai.v33i01.3301881.
Bougie, Nicolas, and Ryutaro Ichise. "Skill-based curiosity for intrinsically motivated reinforcement learning". Machine Learning 109, no. 3 (October 10, 2019): 493–512. http://dx.doi.org/10.1007/s10994-019-05845-8.
Catacora Ocana, Jim Martin, Roberto Capobianco, and Daniele Nardi. "An Overview of Environmental Features that Impact Deep Reinforcement Learning in Sparse-Reward Domains". Journal of Artificial Intelligence Research 76 (April 26, 2023): 1181–218. http://dx.doi.org/10.1613/jair.1.14390.
Zhu, Yiwen, Yuan Zheng, Wenya Wei, and Zhou Fang. "Enhancing Automated Maneuvering Decisions in UCAV Air Combat Games Using Homotopy-Based Reinforcement Learning". Drones 8, no. 12 (December 13, 2024): 756. https://doi.org/10.3390/drones8120756.
Gehring, Clement, Masataro Asai, Rohan Chitnis, Tom Silver, Leslie Kaelbling, Shirin Sohrabi, and Michael Katz. "Reinforcement Learning for Classical Planning: Viewing Heuristics as Dense Reward Generators". Proceedings of the International Conference on Automated Planning and Scheduling 32 (June 13, 2022): 588–96. http://dx.doi.org/10.1609/icaps.v32i1.19846.
Xu, Zhe, Ivan Gavran, Yousef Ahmad, Rupak Majumdar, Daniel Neider, Ufuk Topcu, and Bo Wu. "Joint Inference of Reward Machines and Policies for Reinforcement Learning". Proceedings of the International Conference on Automated Planning and Scheduling 30 (June 1, 2020): 590–98. http://dx.doi.org/10.1609/icaps.v30i1.6756.
Ye, Chenhao, Wei Zhu, Shiluo Guo, and Jinyin Bai. "DQN-Based Shaped Reward Function Mold for UAV Emergency Communication". Applied Sciences 14, no. 22 (November 14, 2024): 10496. http://dx.doi.org/10.3390/app142210496.
Dharmavaram, Akshay, Matthew Riemer, and Shalabh Bhatnagar. "Hierarchical Average Reward Policy Gradient Algorithms (Student Abstract)". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 10 (April 3, 2020): 13777–78. http://dx.doi.org/10.1609/aaai.v34i10.7160.
Abu Bakar, Mohamad Hafiz, Abu Ubaidah Shamsudin, Zubair Adil Soomro, Satoshi Tadokoro, and C. J. Salaan. "FUSION SPARSE AND SHAPING REWARD FUNCTION IN SOFT ACTOR-CRITIC DEEP REINFORCEMENT LEARNING FOR MOBILE ROBOT NAVIGATION". Jurnal Teknologi 86, no. 2 (January 15, 2024): 37–49. http://dx.doi.org/10.11113/jurnalteknologi.v86.20147.
Sharip, Zati, Mohd Hafiz Zulkifli, Mohd Nur Farhan Abd Wahab, Zubaidi Johar, and Mohd Zaki Mat Amin. "ASSESSING TROPHIC STATE AND WATER QUALITY OF SMALL LAKES AND PONDS IN PERAK". Jurnal Teknologi 86, no. 2 (January 15, 2024): 51–59. http://dx.doi.org/10.11113/jurnalteknologi.v86.20566.
Parisi, Simone, Davide Tateo, Maximilian Hensel, Carlo D’Eramo, Jan Peters, and Joni Pajarinen. "Long-Term Visitation Value for Deep Exploration in Sparse-Reward Reinforcement Learning". Algorithms 15, no. 3 (February 28, 2022): 81. http://dx.doi.org/10.3390/a15030081.
Forbes, Grant C., and David L. Roberts. "Potential-Based Reward Shaping for Intrinsic Motivation (Student Abstract)". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 21 (March 24, 2024): 23488–89. http://dx.doi.org/10.1609/aaai.v38i21.30441.
Guo, Yijie, Qiucheng Wu, and Honglak Lee. "Learning Action Translator for Meta Reinforcement Learning on Sparse-Reward Tasks". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 6 (June 28, 2022): 6792–800. http://dx.doi.org/10.1609/aaai.v36i6.20635.
Booth, Serena, W. Bradley Knox, Julie Shah, Scott Niekum, Peter Stone, and Alessandro Allievi. "The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 5 (June 26, 2023): 5920–29. http://dx.doi.org/10.1609/aaai.v37i5.25733.
Linke, Cam, Nadia M. Ady, Martha White, Thomas Degris, and Adam White. "Adapting Behavior via Intrinsic Reward: A Survey and Empirical Study". Journal of Artificial Intelligence Research 69 (December 14, 2020): 1287–332. http://dx.doi.org/10.1613/jair.1.12087.
Velasquez, Alvaro, Brett Bissey, Lior Barak, Daniel Melcer, Andre Beckus, Ismail Alkhouri, and George Atia. "Multi-Agent Tree Search with Dynamic Reward Shaping". Proceedings of the International Conference on Automated Planning and Scheduling 32 (June 13, 2022): 652–61. http://dx.doi.org/10.1609/icaps.v32i1.19854.
Sorg, Jonathan, Satinder Singh, and Richard Lewis. "Optimal Rewards versus Leaf-Evaluation Heuristics in Planning Agents". Proceedings of the AAAI Conference on Artificial Intelligence 25, no. 1 (August 4, 2011): 465–70. http://dx.doi.org/10.1609/aaai.v25i1.7931.
Yin, Haiyan, Jianda Chen, Sinno Jialin Pan, and Sebastian Tschiatschek. "Sequential Generative Exploration Model for Partially Observable Reinforcement Learning". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10700–10708. http://dx.doi.org/10.1609/aaai.v35i12.17279.
Hasanbeig, Mohammadhosein, Natasha Yogananda Jeppu, Alessandro Abate, Tom Melham, and Daniel Kroening. "DeepSynth: Automata Synthesis for Automatic Task Segmentation in Deep Reinforcement Learning". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 7647–56. http://dx.doi.org/10.1609/aaai.v35i9.16935.
Hasanbeig, Hosein, Natasha Yogananda Jeppu, Alessandro Abate, Tom Melham, and Daniel Kroening. "Symbolic Task Inference in Deep Reinforcement Learning". Journal of Artificial Intelligence Research 80 (July 23, 2024): 1099–137. http://dx.doi.org/10.1613/jair.1.14063.
Jiang, Nan, Sheng Jin, and Changshui Zhang. "Hierarchical automatic curriculum learning: Converting a sparse reward navigation task into dense reward". Neurocomputing 360 (September 2019): 265–78. http://dx.doi.org/10.1016/j.neucom.2019.06.024.
Jin, Tianyuan, Hao-Lun Hsu, William Chang, and Pan Xu. "Finite-Time Frequentist Regret Bounds of Multi-Agent Thompson Sampling on Sparse Hypergraphs". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 11 (March 24, 2024): 12956–64. http://dx.doi.org/10.1609/aaai.v38i11.29193.
Ma, Ang, Yanhua Yu, Chuan Shi, Shuai Zhen, Liang Pang, and Tat-Seng Chua. "PMHR: Path-Based Multi-Hop Reasoning Incorporating Rule-Enhanced Reinforcement Learning and KG Embeddings". Electronics 13, no. 23 (December 9, 2024): 4847. https://doi.org/10.3390/electronics13234847.
Wei, Tianqi, Qinghai Guo, and Barbara Webb. "Learning with sparse reward in a gap junction network inspired by the insect mushroom body". PLOS Computational Biology 20, no. 5 (May 23, 2024): e1012086. http://dx.doi.org/10.1371/journal.pcbi.1012086.
Kang, Yongxin, Enmin Zhao, Kai Li, and Junliang Xing. "Exploration via State influence Modeling". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 8047–54. http://dx.doi.org/10.1609/aaai.v35i9.16981.
Sakamoto, Yuma, and Kentarou Kurashige. "Self-Generating Evaluations for Robot’s Autonomy Based on Sensor Input". Machines 11, no. 9 (September 6, 2023): 892. http://dx.doi.org/10.3390/machines11090892.
Morrison, Sara E., Vincent B. McGinty, Johann du Hoffmann, and Saleem M. Nicola. "Limbic-motor integration by neural excitations and inhibitions in the nucleus accumbens". Journal of Neurophysiology 118, no. 5 (November 1, 2017): 2549–67. http://dx.doi.org/10.1152/jn.00465.2017.
Han, Ziyao, Fan Yi, and Kazuhiro Ohkura. "Collective Transport Behavior in a Robotic Swarm with Hierarchical Imitation Learning". Journal of Robotics and Mechatronics 36, no. 3 (June 20, 2024): 538–45. http://dx.doi.org/10.20965/jrm.2024.p0538.
Tang, Wanxing, Chuang Cheng, Haiping Ai, and Li Chen. "Dual-Arm Robot Trajectory Planning Based on Deep Reinforcement Learning under Complex Environment". Micromachines 13, no. 4 (March 31, 2022): 564. http://dx.doi.org/10.3390/mi13040564.
Xu, Xibao, Yushen Chen, and Chengchao Bai. "Deep Reinforcement Learning-Based Accurate Control of Planetary Soft Landing". Sensors 21, no. 23 (December 6, 2021): 8161. http://dx.doi.org/10.3390/s21238161.
Song, Qingpeng, Yuansheng Liu, Ming Lu, Jun Zhang, Han Qi, Ziyu Wang, and Zijian Liu. "Autonomous Driving Decision Control Based on Improved Proximal Policy Optimization Algorithm". Applied Sciences 13, no. 11 (May 24, 2023): 6400. http://dx.doi.org/10.3390/app13116400.
Potjans, Wiebke, Abigail Morrison, and Markus Diesmann. "A Spiking Neural Network Model of an Actor-Critic Learning Agent". Neural Computation 21, no. 2 (February 2009): 301–39. http://dx.doi.org/10.1162/neco.2008.08-07-593.
Kim, MyeongSeop, and Jung-Su Kim. "Policy-based Deep Reinforcement Learning for Sparse Reward Environment". Transactions of The Korean Institute of Electrical Engineers 70, no. 3 (March 31, 2021): 506–14. http://dx.doi.org/10.5370/kiee.2021.70.3.506.
Dai, Tianhong, Hengyan Liu, and Anil Anthony Bharath. "Episodic Self-Imitation Learning with Hindsight". Electronics 9, no. 10 (October 21, 2020): 1742. http://dx.doi.org/10.3390/electronics9101742.
Kubovčík, Martin, Iveta Dirgová Luptáková, and Jiří Pospíchal. "Signal Novelty Detection as an Intrinsic Reward for Robotics". Sensors 23, no. 8 (April 14, 2023): 3985. http://dx.doi.org/10.3390/s23083985.
Liu, Yushen. "On the Performance of the Minimax Optimal Strategy in the Stochastic Case of Logistic Bandits". Applied and Computational Engineering 83, no. 1 (October 31, 2024): 130–39. http://dx.doi.org/10.54254/2755-2721/83/2024glg0072.
Alkaff, Muhammad, Abdullah Basuhail, and Yuslena Sari. "Optimizing Water Use in Maize Irrigation with Reinforcement Learning". Mathematics 13, no. 4 (February 11, 2025): 595. https://doi.org/10.3390/math13040595.
de Hauwere, Yann-Michaël, Sam Devlin, Daniel Kudenko, and Ann Nowé. "Context-sensitive reward shaping for sparse interaction multi-agent systems". Knowledge Engineering Review 31, no. 1 (January 2016): 59–76. http://dx.doi.org/10.1017/s0269888915000193.
Wang, Xusheng, Jiexin Xie, Shijie Guo, Yue Li, Pengfei Sun, and Zhongxue Gan. "Deep reinforcement learning-based rehabilitation robot trajectory planning with optimized reward functions". Advances in Mechanical Engineering 13, no. 12 (December 2021): 168781402110670. http://dx.doi.org/10.1177/16878140211067011.