Journal articles on the topic 'Sparsely rewarded environments'

Author: Grafiati

Published: 7 July 2024

Last updated: 7 July 2024

Consult the top 50 journal articles for your research on the topic 'Sparsely rewarded environments.'

1

Dubey, Rachit, Thomas L. Griffiths, and Peter Dayan. "The pursuit of happiness: A reinforcement learning perspective on habituation and comparisons." PLOS Computational Biology 18, no. 8 (August 4, 2022): e1010316. http://dx.doi.org/10.1371/journal.pcbi.1010316.

Abstract:

In evaluating our choices, we often suffer from two tragic relativities. First, when our lives change for the better, we rapidly habituate to the higher standard of living. Second, we cannot escape comparing ourselves to various relative standards. Habituation and comparisons can be very disruptive to decision-making and happiness, and till date, it remains a puzzle why they have come to be a part of cognition in the first place. Here, we present computational evidence that suggests that these features might play an important role in promoting adaptive behavior. Using the framework of reinforcement learning, we explore the benefit of employing a reward function that, in addition to the reward provided by the underlying task, also depends on prior expectations and relative comparisons. We find that while agents equipped with this reward function are less happy, they learn faster and significantly outperform standard reward-based agents in a wide range of environments. Specifically, we find that relative comparisons speed up learning by providing an exploration incentive to the agents, and prior expectations serve as a useful aid to comparisons, especially in sparsely-rewarded and non-stationary environments. Our simulations also reveal potential drawbacks of this reward function and show that agents perform sub-optimally when comparisons are left unchecked and when there are too many similar options. Together, our results help explain why we are prone to becoming trapped in a cycle of never-ending wants and desires, and may shed light on psychopathologies such as depression, materialism, and overconsumption.

2

Shi, Xiaoping, Shiqi Zou, Shenmin Song, and Rui Guo. "A multi-objective sparse evolutionary framework for large-scale weapon target assignment based on a reward strategy." Journal of Intelligent & Fuzzy Systems 40, no. 5 (April 22, 2021): 10043–61. http://dx.doi.org/10.3233/jifs-202679.

Abstract:

The asset-based weapon target assignment (ABWTA) problem is one of the important branches of the weapon target assignment (WTA) problem. Due to the current large-scale battlefield environment, the ABWTA problem is a multi-objective optimization problem (MOP) with strong constraints, large-scale and sparse properties. A novel model of the ABWTA problem with the operation error parameter is established. An evolutionary algorithm for large-scale sparse problems (SparseEA) is introduced as the main framework for solving the large-scale sparse ABWTA problem. In the proposed framework (SparseEA-ABWTA), a problem-specific initialization method and genetic operators with a reward strategy generate solutions efficiently by taking the sparsity of variables into account, and an improved non-dominated solution selection method is presented to handle the constraints. With large-scale cases constructed by a dedicated case generator, two numerical experiments on four outstanding multi-objective evolutionary algorithms (MOEAs) show that SparseEA-ABWTA runs nearly 50% faster than the others at the same convergence, and that the gap between SparseEA-ABWTA and the MOEAs improved by its mechanism is reduced to nearly 20% in convergence and distribution.

3

Sakamoto, Yuma, and Kentarou Kurashige. "Self-Generating Evaluations for Robot’s Autonomy Based on Sensor Input." Machines 11, no. 9 (September 6, 2023): 892. http://dx.doi.org/10.3390/machines11090892.

Abstract:

Reinforcement learning has been explored within the context of robot operation in different environments. Designing the reward function in reinforcement learning is challenging for designers because it requires specialized knowledge. To reduce the design burden, we propose a reward design method that is independent of both specific environments and tasks in which reinforcement learning robots evaluate and generate rewards autonomously based on sensor information received from the environment. This method allows the robot to operate autonomously based on sensors. However, the existing approach to adaption attempts to adapt without considering the input properties for the strength of the sensor input, which may cause a robot to learn harmful actions from the environment. In this study, we propose a method for changing the threshold of a sensor input while considering the strength of the input and other properties. We also demonstrate the utility of the proposed method by presenting the results of simulation experiments on a path-finding problem conducted in an environment with sparse rewards.

4

Parisi, Simone, Davide Tateo, Maximilian Hensel, Carlo D’Eramo, Jan Peters, and Joni Pajarinen. "Long-Term Visitation Value for Deep Exploration in Sparse-Reward Reinforcement Learning." Algorithms 15, no. 3 (February 28, 2022): 81. http://dx.doi.org/10.3390/a15030081.

Abstract:

Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on getting feedback via extrinsic rewards to train the agent, and in situations where this occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent receives also rewards that create suboptimal modes of the objective function, it will likely prematurely stop exploring. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods that use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment.

5

Mguni, David, Taher Jafferjee, Jianhong Wang, Nicolas Perez-Nieves, Wenbin Song, Feifei Tong, Matthew Taylor, et al. "Learning to Shape Rewards Using a Game of Two Partners." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 10 (June 26, 2023): 11604–12. http://dx.doi.org/10.1609/aaai.v37i10.26371.

Abstract:

Reward shaping (RS) is a powerful method in reinforcement learning (RL) for overcoming the problem of sparse or uninformative rewards. However, RS typically relies on manually engineered shaping-reward functions whose construction is time-consuming and error-prone. It also requires domain knowledge which runs contrary to the goal of autonomous learning. We introduce Reinforcement Learning Optimising Shaping Algorithm (ROSA), an automated reward shaping framework in which the shaping-reward function is constructed in a Markov game between two agents. A reward-shaping agent (Shaper) uses switching controls to determine which states to add shaping rewards for more efficient learning while the other agent (Controller) learns the optimal policy for the task using these shaped rewards. We prove that ROSA, which adopts existing RL algorithms, learns to construct a shaping-reward function that is beneficial to the task thus ensuring efficient convergence to high performance policies. We demonstrate ROSA’s properties in three didactic experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse reward environments.

6

Forbes, Grant C., and David L. Roberts. "Potential-Based Reward Shaping for Intrinsic Motivation (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 21 (March 24, 2024): 23488–89. http://dx.doi.org/10.1609/aaai.v38i21.30441.

Abstract:

Recently there has been a proliferation of intrinsic motivation (IM) reward shaping methods to learn in complex and sparse-reward environments. These methods can often inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior. Previous work on mitigating the risks of reward shaping, particularly through potential-based reward shaping (PBRS), has not been applicable to many IM methods, as they are often complex, trainable functions themselves, and therefore dependent on a wider set of variables than the traditional reward functions that PBRS was developed for. We present an extension to PBRS that we show preserves the set of optimal policies under a more general set of functions than has been previously demonstrated. We also present Potential-Based Intrinsic Motivation (PBIM), a method for converting IM rewards into a potential-based form that are useable without altering the set of optimal policies. Testing in the MiniGrid DoorKey environment, we demonstrate that PBIM successfully prevents the agent from converging to a suboptimal policy and can speed up training.
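The classic potential-based shaping term that this abstract builds on has the form F(s, s') = γΦ(s') − Φ(s). The sketch below illustrates only that classic form, not the PBIM extension the authors propose. The grid goal, the `neg_manhattan` potential, and the convention of treating the terminal potential as zero are illustrative assumptions.

```python
from typing import Callable, Tuple

State = Tuple[int, int]  # hypothetical grid-cell state, for illustration only


def shaped_reward(
    reward: float,
    state: State,
    next_state: State,
    potential: Callable[[State], float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Return r + gamma * phi(s') - phi(s), the classic potential-based form."""
    # Assumption: the potential of a terminal state is taken to be zero.
    next_potential = 0.0 if done else potential(next_state)
    return reward + gamma * next_potential - potential(state)


GOAL: State = (3, 3)  # hypothetical goal cell


def neg_manhattan(state: State) -> float:
    """Hypothetical potential: negative Manhattan distance to GOAL."""
    x, y = state
    return -float(abs(GOAL[0] - x) + abs(GOAL[1] - y))


# With gamma = 1, a step toward the goal and the step back cancel exactly,
# so a cycle collects no net shaping reward.
step_toward = shaped_reward(0.0, (0, 0), (0, 1), neg_manhattan, gamma=1.0)
step_away = shaped_reward(0.0, (0, 1), (0, 0), neg_manhattan, gamma=1.0)
```

Because the shaping term telescopes along a trajectory, it cannot reward an agent for looping, which is the intuition behind the policy-preservation property the abstract refers to.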

7

Xu, Pei, Junge Zhang, Qiyue Yin, Chao Yu, Yaodong Yang, and Kaiqi Huang. "Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 10 (June 26, 2023): 11717–25. http://dx.doi.org/10.1609/aaai.v37i10.26384.

Abstract:

Exploration under sparse rewards is a key challenge for multi-agent reinforcement learning problems. One possible solution to this issue is to exploit inherent task structures for an acceleration of exploration. In this paper, we present a novel exploration approach, which encodes a special structural prior on the reward function into exploration, for sparse-reward multi-agent tasks. Specifically, a novel entropic exploration objective which encodes the structural prior is proposed to accelerate the discovery of rewards. By maximizing the lower bound of this objective, we then propose an algorithm with moderate computational cost, which can be applied to practical tasks. Under the sparse-reward setting, we show that the proposed algorithm significantly outperforms the state-of-the-art algorithms in the multiple-particle environment, the Google Research Football and StarCraft II micromanagement tasks. To the best of our knowledge, on some hard tasks (such as 27m_vs_30m) which have a relatively large number of agents and need non-trivial strategies to defeat enemies, our method is the first to learn winning strategies under the sparse-reward setting.

8

Kubovčík, Martin, Iveta Dirgová Luptáková, and Jiří Pospíchal. "Signal Novelty Detection as an Intrinsic Reward for Robotics." Sensors 23, no. 8 (April 14, 2023): 3985. http://dx.doi.org/10.3390/s23083985.

Abstract:

In advanced robot control, reinforcement learning is a common technique used to transform sensor data into signals for actuators, based on feedback from the robot’s environment. However, the feedback or reward is typically sparse, as it is provided mainly after the task’s completion or failure, leading to slow convergence. Additional intrinsic rewards based on the state visitation frequency can provide more feedback. In this study, an Autoencoder deep learning neural network was utilized as novelty detection for intrinsic rewards to guide the search process through a state space. The neural network processed signals from various types of sensors simultaneously. It was tested on simulated robotic agents in a benchmark set of classic control OpenAI Gym test environments (including Mountain Car, Acrobot, CartPole, and LunarLander), achieving more efficient and accurate robot control in three of the four tasks (with only slight degradation in the Lunar Lander task) when purely intrinsic rewards were used compared to standard extrinsic rewards. By incorporating autoencoder-based intrinsic rewards, robots could potentially become more dependable in autonomous operations like space or underwater exploration or during natural disaster response. This is because the system could better adapt to changing environments or unexpected situations.
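The abstract mentions intrinsic rewards based on state visitation frequency as background to its autoencoder approach. A minimal tabular sketch of that background idea follows. It is a simpler stand-in and not the authors' autoencoder method. The `CountBonus` name and the `scale / sqrt(N(s))` bonus form are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Hashable


class CountBonus:
    """Tabular visitation-count bonus: rarely seen states earn more."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale
        self.counts: Counter = Counter()

    def __call__(self, state: Hashable) -> float:
        # Count this visit first, so the first visit to a state yields `scale`.
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])


bonus = CountBonus()
first = bonus("s0")   # first visit to s0
second = bonus("s0")  # second visit to s0, smaller bonus
novel = bonus("s1")   # first visit to a different state
```

In continuous sensor spaces exact counts are unavailable, which is one reason to replace them with a learned novelty signal such as reconstruction error.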

9

Catacora Ocana, Jim Martin, Roberto Capobianco, and Daniele Nardi. "An Overview of Environmental Features that Impact Deep Reinforcement Learning in Sparse-Reward Domains." Journal of Artificial Intelligence Research 76 (April 26, 2023): 1181–218. http://dx.doi.org/10.1613/jair.1.14390.

Abstract:

Deep reinforcement learning has achieved impressive results in recent years; yet, it is still severely troubled by environments showcasing sparse rewards. On top of that, not all sparse-reward environments are created equal, i.e., they can differ in the presence or absence of various features, with many of them having a great impact on learning. In light of this, the present work puts together a literature compilation of such environmental features, covering particularly those that have been taken advantage of and those that continue to pose a challenge. We expect this effort to provide guidance to researchers for assessing the generality of their new proposals and to call their attention to issues that remain unresolved when dealing with sparse rewards.

10

Zhou, Xiao, Song Zhou, Xingang Mou, and Yi He. "Multirobot Collaborative Pursuit Target Robot by Improved MADDPG." Computational Intelligence and Neuroscience 2022 (February 25, 2022): 1–10. http://dx.doi.org/10.1155/2022/4757394.

Abstract:

Policy formulation is one of the main problems in multirobot systems, especially in multirobot pursuit-evasion scenarios, where both sparse rewards and random environment changes make it very difficult to find a better strategy. Existing multirobot decision-making methods mostly rely on environmental rewards to drive robots to complete the target task, which does not achieve good results. This paper proposes a multirobot pursuit method based on improved multiagent deep deterministic policy gradient (MADDPG), which solves the problem of sparse rewards in multirobot pursuit-evasion scenarios by combining the intrinsic reward and the external environment. The state similarity module based on the threshold constraint forms part of the intrinsic reward signal output by the intrinsic curiosity module and is used to balance overexploration and insufficient exploration, so that the agent can use the intrinsic reward more effectively to learn better strategies. The simulation experiment results show that the proposed method can improve the reward value of robots and the success rate of the pursuit task significantly. The change is clearly reflected in the real-time distance between the pursuer and the escapee: the pursuer trained with the improved algorithm gets closer to the escapee more quickly, and the average following distance also decreases.

11

Velasquez, Alvaro, Brett Bissey, Lior Barak, Andre Beckus, Ismail Alkhouri, Daniel Melcer, and George Atia. "Dynamic Automaton-Guided Reward Shaping for Monte Carlo Tree Search." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (May 18, 2021): 12015–23. http://dx.doi.org/10.1609/aaai.v35i13.17427.

Abstract:

Reinforcement learning and planning have been revolutionized in recent years, due in part to the mass adoption of deep convolutional neural networks and the resurgence of powerful methods to refine decision-making policies. However, the problem of sparse reward signals and their representation remains pervasive in many domains. While various reward-shaping mechanisms and imitation learning approaches have been proposed to mitigate this problem, the use of human-aided artificial rewards introduces human error, sub-optimal behavior, and a greater propensity for reward hacking. In this paper, we mitigate this by representing objectives as automata in order to define novel reward shaping functions over this structured representation. In doing so, we address the sparse rewards problem within a novel implementation of Monte Carlo Tree Search (MCTS) by proposing a reward shaping function which is updated dynamically to capture statistics on the utility of each automaton transition as it pertains to satisfying the goal of the agent. We further demonstrate that such automaton-guided reward shaping can be utilized to facilitate transfer learning between different environments when the objective is the same.

12

Kong, Yan, Yefeng Rui, and Chih-Hsien Hsia. "A Deep Reinforcement Learning-Based Approach in Porker Game." 電腦學刊 34, no. 2 (April 2023): 041–51. http://dx.doi.org/10.53106/199115992023043402004.

Abstract:

Recent years have witnessed the great success that deep reinforcement learning has achieved in the domain of card and board games, such as Go, chess and Texas Hold’em poker. However, Dou Di Zhu, a traditional Chinese card game, is still a challenging task for deep reinforcement learning methods due to the enormous action space and the sparse and delayed reward of each action from the environment. Basic reinforcement learning algorithms are more effective in simple environments that have small action spaces and valuable and concrete reward functions and, unfortunately, are shown not to be able to deal with Dou Di Zhu satisfactorily. This work introduces an approach named Two-steps Q-Network, based on DQN, for playing Dou Di Zhu. It compresses the huge action space by dividing it into two parts according to the rules of Dou Di Zhu, and fills in the sparse rewards using inverse reinforcement learning (IRL) by abstracting the reward function from experts’ demonstrations. Experiments show that the Two-steps Q-Network achieves substantial improvements over DQN in Dou Di Zhu.

13

Bougie, Nicolas, and Ryutaro Ichise. "Skill-based curiosity for intrinsically motivated reinforcement learning." Machine Learning 109, no. 3 (October 10, 2019): 493–512. http://dx.doi.org/10.1007/s10994-019-05845-8.

Abstract:

Reinforcement learning methods rely on rewards provided by the environment that are extrinsic to the agent. However, many real-world scenarios involve sparse or delayed rewards. In such cases, the agent can develop its own intrinsic reward function called curiosity to enable the agent to explore its environment in the quest of new skills. We propose a novel end-to-end curiosity mechanism for deep reinforcement learning methods that allows an agent to gradually acquire new skills. Our method scales to high-dimensional problems, avoids the need of directly predicting the future, and can perform in sequential decision scenarios. We formulate the curiosity as the ability of the agent to predict its own knowledge about the task. We base the prediction on the idea of skill learning to incentivize the discovery of new skills, and guide exploration towards promising solutions. To further improve data efficiency and generalization of the agent, we propose to learn a latent representation of the skills. We present a variety of sparse reward tasks in MiniGrid, MuJoCo, and Atari games. We compare the performance of an augmented agent that uses our curiosity reward to state-of-the-art learners. Experimental evaluation exhibits higher performance compared to reinforcement learning models that only learn by maximizing extrinsic rewards.

14

Jiang, Jiechuan, and Zongqing Lu. "Generative Exploration and Exploitation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 4337–44. http://dx.doi.org/10.1609/aaai.v34i04.5858.

Abstract:

Sparse reward is one of the biggest challenges in reinforcement learning (RL). In this paper, we propose a novel method called Generative Exploration and Exploitation (GENE) to overcome sparse reward. GENE automatically generates start states to encourage the agent to explore the environment and to exploit received reward signals. GENE can adaptively tradeoff between exploration and exploitation according to the varying distributions of states experienced by the agent as the learning progresses. GENE relies on no prior knowledge about the environment and can be combined with any RL algorithm, no matter on-policy or off-policy, single-agent or multi-agent. Empirically, we demonstrate that GENE significantly outperforms existing methods in three tasks with only binary rewards, including Maze, Maze Ant, and Cooperative Navigation. Ablation studies verify the emergence of progressive exploration and automatic reversing.

15

Huang, Xiao, and Juyang Weng. "Inherent Value Systems for Autonomous Mental Development." International Journal of Humanoid Robotics 04, no. 02 (June 2007): 407–33. http://dx.doi.org/10.1142/s0219843607001011.

Abstract:

The inherent value system of a developmental agent enables autonomous mental development to take place right after the agent's "birth." Biologically, it is not clear what basic components constitute a value system. In the computational model introduced here, we propose that inherent value systems should have at least three basic components: punishment, reward and novelty with decreasing weights from the first component to the last. Punishments and rewards are temporally sparse but novelty is temporally dense. We present a biologically inspired computational architecture that guides development of sensorimotor skills through real-time interactions with the environments, driven by an inborn value system. The inherent value system has been successfully tested on an artificial agent in a simulation environment and a robot in the real world.

16

Li, Yuangang, Tao Guo, Qinghua Li, and Xinyue Liu. "Optimized Feature Extraction for Sample Efficient Deep Reinforcement Learning." Electronics 12, no. 16 (August 18, 2023): 3508. http://dx.doi.org/10.3390/electronics12163508.

Abstract:

In deep reinforcement learning, agent exploration still has certain limitations, while low efficiency exploration further leads to the problem of low sample efficiency. In order to solve the exploration dilemma caused by white noise interference and the separation derailment problem in the environment, we present an innovative approach by introducing an intricately honed feature extraction module to harness the predictive errors, generate intrinsic rewards, and use an ancillary agent training paradigm that effectively solves the above problems and significantly enhances the agent’s capacity for comprehensive exploration within environments characterized by sparse reward distribution. The efficacy of the optimized feature extraction module is substantiated through comparative experiments conducted within the arduous exploration problem scenarios often employed in reinforcement learning investigations. Furthermore, a comprehensive performance analysis of our method is executed within the esteemed Atari 2600 experimental setting, yielding noteworthy advancements in performance and showcasing the attainment of superior outcomes in six selected experimental environments.

17

Tang, Wanxing, Chuang Cheng, Haiping Ai, and Li Chen. "Dual-Arm Robot Trajectory Planning Based on Deep Reinforcement Learning under Complex Environment." Micromachines 13, no. 4 (March 31, 2022): 564. http://dx.doi.org/10.3390/mi13040564.

Abstract:

In this article, the trajectory planning of the two manipulators of a dual-arm robot approaching a patient in a complex environment is studied with deep reinforcement learning algorithms. The shapes of the human body and the bed are complex, which may lead to collisions between the human and the robot. Because the sparse reward the robot obtains from the environment may not be enough for the robot to accomplish the task, a neural network is trained to control the manipulators of the robot to prepare to hold the patient up by using a proximal policy optimization algorithm with a continuous reward function. Firstly, considering the realistic scene, a 3D simulation environment is built to conduct the research. Secondly, inspired by the idea of the artificial potential field, a new reward and punishment function is proposed to help the robot obtain enough rewards to explore the environment. The function consists of four parts: the reward guidance function, collision detection, the obstacle avoidance function, and the time function. The reward guidance function guides the robot to approach the targets to hold the patient, the collision detection and obstacle avoidance functions complement each other and are used to avoid obstacles, and the time function is used to reduce the number of training episodes. Finally, after the robot is trained to reach the targets, the training results are analyzed. Compared with the DDPG algorithm, the PPO algorithm needs about 4 million fewer training steps to converge. Moreover, compared with the other reward and punishment functions, the function used in this paper obtains many more rewards in the same training time. It also takes much less time to converge, and the episode length is shorter, which verifies the advantage of the algorithm used in this paper.

18

Shah, Naman, and Siddharth Srivastava. "Hierarchical Planning and Learning for Robots in Stochastic Settings Using Zero-Shot Option Invention." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 9 (March 24, 2024): 10358–67. http://dx.doi.org/10.1609/aaai.v38i9.28903.

Abstract:

This paper addresses the problem of inventing and using hierarchical representations for stochastic robot-planning problems. Rather than using hand-coded state or action representations as input, it presents new methods for learning how to create a high-level action representation for long-horizon, sparse reward robot planning problems in stochastic settings with unknown dynamics. After training, this system yields a robot-specific but environment independent planning system. Given new problem instances in unseen stochastic environments, it first creates zero-shot options (without any experience on the new environment) with dense pseudo-rewards and then uses them to solve the input problem in a hierarchical planning and refinement process. Theoretical results identify sufficient conditions for completeness of the presented approach. Extensive empirical analysis shows that even in settings that go beyond these sufficient conditions, this approach convincingly outperforms baselines by 2x in terms of solution time with orders of magnitude improvement in solution quality.

19

Li, Huale, Rui Cao, Xuan Wang, Xiaohan Hou, Tao Qian, Fengwei Jia, Jiajia Zhang, and Shuhan Qi. "AIBPO: Combine the Intrinsic Reward and Auxiliary Task for 3D Strategy Game." Complexity 2021 (July 13, 2021): 1–9. http://dx.doi.org/10.1155/2021/6698231.

Abstract:

In recent years, deep reinforcement learning (DRL) achieves great success in many fields, especially in the field of games, such as AlphaGo, AlphaZero, and AlphaStar. However, due to the reward sparsity problem, the traditional DRL-based method shows limited performance in 3D games, which contain much higher dimension of state space. To solve this problem, in this paper, we propose an intrinsic-based policy optimization (IBPO) algorithm for reward sparsity. In the IBPO, a novel intrinsic reward is integrated into the value network, which provides an additional reward in the environment with sparse reward, so as to accelerate the training. Besides, to deal with the problem of value estimation bias, we further design three types of auxiliary tasks, which can evaluate the state value and the action more accurately in 3D scenes. Finally, a framework of auxiliary intrinsic-based policy optimization (AIBPO) is proposed, which improves the performance of the IBPO. The experimental results show that the method is able to deal with the reward sparsity problem effectively. Therefore, the proposed method may be applied to real-world scenarios, such as 3-dimensional navigation and automatic driving, which can improve the sample utilization to reduce the cost of interactive sample collected by the real equipment.

20

Dharmavaram, Akshay, Matthew Riemer, and Shalabh Bhatnagar. "Hierarchical Average Reward Policy Gradient Algorithms (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 10 (April 3, 2020): 13777–78. http://dx.doi.org/10.1609/aaai.v34i10.7160.

Abstract:

Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignments. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem for the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady-state of the Markov chain defined by the agent's policy. Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions, converge to their corresponding optimal values, with probability one. Finally, we illustrate the competitive advantage of learning options, in the average reward setting, on a grid-world environment with sparse rewards.
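The contrast this abstract draws between discounting and the average reward criterion can be written out with the standard textbook definitions. The notation (J_γ, ρ, r_t) is assumed here for illustration and is not taken from the paper.

```latex
% Discounted return: rewards far in the future are down-weighted by \gamma^t.
J_\gamma(\pi) = \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right],
\qquad 0 \le \gamma < 1

% Average reward: every time step carries equal weight in the long run.
\rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\,
\mathbb{E}_\pi\!\left[ \sum_{t=0}^{T-1} r_t \right]
```

Under the first objective a reward arriving after t steps contributes only γ^t of its value, so very delayed sparse rewards nearly vanish. The second objective weights all steps equally.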

21

Zhu, Chenyang, Yujie Cai, Jinyu Zhu, Can Hu, and Jia Bi. "GR(1)-Guided Deep Reinforcement Learning for Multi-Task Motion Planning under a Stochastic Environment." Electronics 11, no. 22 (November 13, 2022): 3716. http://dx.doi.org/10.3390/electronics11223716.

Abstract:

Motion planning has been used in robotics research to make movement decisions under certain movement constraints. Deep Reinforcement Learning (DRL) approaches have been applied to the cases of motion planning with continuous state representations. However, current DRL approaches suffer from reward sparsity and overestimation issues. It is also challenging to train the agents to deal with complex task specifications under deep neural network approximations. This paper considers one of the fragments of Linear Temporal Logic (LTL), Generalized Reactivity of rank 1 (GR(1)), as a high-level reactive temporal logic to guide robots in learning efficient movement strategies under a stochastic environment. We first use the synthesized strategy of GR(1) to construct a potential-based reward machine, to which we save the experiences per state. We integrate GR(1) with DQN, double DQN and dueling double DQN. We also observe that the synthesized strategies of GR(1) could be in the form of directed cyclic graphs. We develop a topological-sort-based reward-shaping approach to calculate the potential values of the reward machine, based on which we use the dueling architecture on the double deep Q-network with the experiences to train the agents. Experiments on multi-task learning show that the proposed approach outperforms the state-of-the-art algorithms in learning rate and optimal rewards. In addition, in cases where the synthesized strategies are in the form of directed cyclic graphs, our topological-sort-based reward-shaping approach achieves a higher accumulated reward than the value-iteration-based reward-shaping approaches.
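
Illustrative sketch (not taken from the cited paper): potential-based reward shaping adds gamma * phi(u') - phi(u) to the environment reward. The Python code below assigns potentials to the states of a small, hypothetical reward-machine graph in topological order and then shapes a reward. The graph, the state names, and the rule "potential = minus the number of edges to the goal" are assumptions for illustration. The sketch handles acyclic graphs only (`graphlib` raises `CycleError` on cycles), so it does not reproduce the paper's treatment of cyclic strategies.

```python
from graphlib import TopologicalSorter

# Hypothetical reward-machine graph: each state maps to its successor states.
SUCCESSORS = {"u0": ["u1", "u2"], "u1": ["goal"], "u2": ["u1"], "goal": []}

def potentials(successors, goal="goal"):
    """Potential of a state = -(fewest edges from it to the goal)."""
    # TopologicalSorter reads the mapping as node -> predecessors, so passing
    # successors gives an order in which every successor is visited first.
    # Assumes every non-goal state has at least one successor.
    phi = {}
    for u in TopologicalSorter(successors).static_order():
        if u == goal:
            phi[u] = 0.0
        else:
            phi[u] = max(phi[v] for v in successors[u]) - 1.0
    return phi

def shaped_reward(reward, phi, u, u_next, gamma=0.99):
    """Potential-based shaping: reward + gamma * phi(u_next) - phi(u)."""
    return reward + gamma * phi[u_next] - phi[u]

phi = potentials(SUCCESSORS)
```

With this graph, moving from `u1` to `goal` earns a positive shaping bonus even when the environment reward is zero, which is the dense signal that shaping is meant to provide.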

22

Ramakrishnan, Santhosh K., Dinesh Jayaraman, and Kristen Grauman. "Emergence of exploratory look-around behaviors through active observation completion." Science Robotics 4, no. 30 (May 15, 2019): eaaw6326. http://dx.doi.org/10.1126/scirobotics.aaw6326.

Abstract:

Standard computer vision systems assume access to intelligently captured inputs (e.g., photos from a human photographer), yet autonomously capturing good observations is a major challenge in itself. We address the problem of learning to look around: How can an agent learn to acquire informative visual observations? We propose a reinforcement learning solution, where the agent is rewarded for reducing its uncertainty about the unobserved portions of its environment. Specifically, the agent is trained to select a short sequence of glimpses, after which it must infer the appearance of its full environment. To address the challenge of sparse rewards, we further introduce sidekick policy learning, which exploits the asymmetry in observability between training and test time. The proposed methods learned observation policies that not only performed the completion task for which they were trained but also generalized to exhibit useful “look-around” behavior for a range of active perception tasks.

23

Hasanbeig, Mohammadhosein, Natasha Yogananda Jeppu, Alessandro Abate, Tom Melham, and Daniel Kroening. "DeepSynth: Automata Synthesis for Automatic Task Segmentation in Deep Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 7647–56. http://dx.doi.org/10.1609/aaai.v35i9.16935.

Abstract:

This paper proposes DeepSynth, a method for effective training of deep Reinforcement Learning (RL) agents when the reward is sparse and non-Markovian, but at the same time progress towards the reward requires achieving an unknown sequence of high-level objectives. Our method employs a novel algorithm for synthesis of compact automata to uncover this sequential structure automatically. We synthesise a human-interpretable automaton from trace data collected by exploring the environment. The state space of the environment is then enriched with the synthesised automaton so that the generation of a control policy by deep RL is guided by the discovered structure encoded in the automaton. The proposed approach is able to cope with both high-dimensional, low-level features and unknown sparse non-Markovian rewards. We have evaluated DeepSynth's performance in a set of experiments that includes the Atari game Montezuma's Revenge. Compared to existing approaches, we obtain a reduction of two orders of magnitude in the number of iterations required for policy synthesis, and also a significant improvement in scalability.

24

Han, Huiyan, Jiaqi Wang, Liqun Kuang, Xie Han, and Hongxin Xue. "Improved Robot Path Planning Method Based on Deep Reinforcement Learning." Sensors 23, no. 12 (June 15, 2023): 5622. http://dx.doi.org/10.3390/s23125622.

Abstract:

With the advancement of robotics, the field of path planning is currently experiencing a period of prosperity. Researchers strive to address this nonlinear problem and have achieved remarkable results through the implementation of the Deep Reinforcement Learning (DRL) algorithm DQN (Deep Q-Network). However, persistent challenges remain, including the curse of dimensionality, difficulties of model convergence and sparsity in rewards. To tackle these problems, this paper proposes an enhanced DDQN (Double DQN) path planning approach, in which the information after dimensionality reduction is fed into a two-branch network that incorporates expert knowledge and an optimized reward function to guide the training process. The data generated during the training phase are initially discretized into corresponding low-dimensional spaces. An “expert experience” module is introduced to facilitate the model’s early-stage training acceleration in the Epsilon–Greedy algorithm. To tackle navigation and obstacle avoidance separately, a dual-branch network structure is presented. We further optimize the reward function enabling intelligent agents to receive prompt feedback from the environment after performing each action. Experiments conducted in both virtual and real-world environments have demonstrated that the enhanced algorithm can accelerate model convergence, improve training stability and generate a smooth, shorter and collision-free path.

25

Zhang, Tengteng, and Hongwei Mo. "Research on Perception and Control Technology for Dexterous Robot Operation." Electronics 12, no. 14 (July 13, 2023): 3065. http://dx.doi.org/10.3390/electronics12143065.

Abstract:

Robotic grasping in cluttered environments is a fundamental and challenging task in robotics research. The ability to autonomously grasp objects in cluttered scenes is crucial for robots to perform complex tasks in real-world scenarios. Conventional grasping is based on a known object model in a structured environment, but its adaptability to unknown objects and complicated situations is limited. In this paper, we present a robotic grasp architecture based on attention-based deep reinforcement learning. To prevent the loss of local information, the prominent characteristics of input images are automatically extracted using a fully convolutional network. In contrast to previous model-based and data-driven methods, the reward is remodeled in an effort to address the sparse rewards. The experimental results show that our method can double the learning speed in grasping a series of randomly placed objects. In real-world experiments, the grasping success rate of the robot platform reaches 90.4%, which outperforms several baselines.

26

Neider, Daniel, Jean-Raphael Gaglione, Ivan Gavran, Ufuk Topcu, Bo Wu, and Zhe Xu. "Advice-Guided Reinforcement Learning in a non-Markovian Environment." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 9073–80. http://dx.doi.org/10.1609/aaai.v35i10.17096.

Abstract:

We study a class of reinforcement learning tasks in which the agent receives its reward for complex, temporally-extended behaviors sparsely. For such tasks, the problem is how to augment the state-space so as to make the reward function Markovian in an efficient way. While some existing solutions assume that the reward function is explicitly provided to the learning algorithm (e.g., in the form of a reward machine), the others learn the reward function from the interactions with the environment, assuming no prior knowledge provided by the user. In this paper, we generalize both approaches and enable the user to give advice to the agent, representing the user’s best knowledge about the reward function, potentially fragmented, partial, or even incorrect. We formalize advice as a set of DFAs and present a reinforcement learning algorithm that takes advantage of such advice, with optimal convergence guarantee. The experiments show that using well-chosen advice can reduce the number of training steps needed for convergence to optimal policy, and can decrease the computation time to learn the reward function by up to two orders of magnitude.
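
Illustrative sketch (not taken from the cited paper): one common way to make a history-dependent reward Markovian is to track a DFA state alongside the environment state. The Python code below does this for a hypothetical "pick up the key, then reach the door" task. The `DFA` class, the labels, and the reward of 1.0 in accepting states are assumptions for illustration, and the sketch does not include the paper's handling of partial or incorrect advice.

```python
class DFA:
    """Minimal deterministic finite automaton over event labels."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions  # maps (dfa_state, label) -> dfa_state
        self.start = start
        self.accepting = set(accepting)

    def step(self, q, label):
        # Labels with no listed transition leave the DFA state unchanged.
        return self.transitions.get((q, label), q)

def run_augmented(dfa, labelled_trace):
    """Pair each environment state with the DFA state reached so far.

    labelled_trace -- list of (env_state, label) pairs
    Returns a list of ((env_state, dfa_state), reward) pairs.
    """
    q = dfa.start
    out = []
    for env_state, label in labelled_trace:
        q = dfa.step(q, label)
        out.append(((env_state, q), 1.0 if q in dfa.accepting else 0.0))
    return out

# Hypothetical task: the door only counts after the key has been picked up.
key_then_door = DFA({("q0", "key"): "q1", ("q1", "door"): "q2"}, "q0", {"q2"})
trace = [("s0", "door"), ("s1", "key"), ("s2", "door")]
augmented = run_augmented(key_then_door, trace)
```

In the augmented state `(env_state, dfa_state)`, the reward depends only on the current pair, even though in the original state space it depends on whether the key was collected earlier.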

27

Zhang, Xiaoping, Yihao Liu, Li Wang, Dunli Hu, and Lei Liu. "A Curiosity-Based Autonomous Navigation Algorithm for Maze Robot." Journal of Advanced Computational Intelligence and Intelligent Informatics 26, no. 6 (November 20, 2022): 893–904. http://dx.doi.org/10.20965/jaciii.2022.p0893.

Abstract:

The external reward plays an important role in the reinforcement learning process, and the quality of its design determines the final effect of the algorithm. However, in several real-world scenarios, rewards extrinsic to the agent are extremely sparse. This is particularly evident in mobile robot navigation. To solve this problem, this paper proposes a curiosity-based autonomous navigation algorithm that consists of a reinforcement learning framework and curiosity system. The curiosity system consists of three parts: prediction network, associative memory network, and curiosity rewards. The prediction network predicts the next state. An associative memory network was used to represent the world. Based on the associative memory network, an inference algorithm and distance calibration algorithm were designed. Curiosity rewards were combined with extrinsic rewards as complementary inputs to the Q-learning algorithm. The simulation results show that the algorithm helps the agent reduce repeated exploration of the environment during autonomous navigation. The algorithm also exhibits a better convergence effect.
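
Illustrative sketch (not taken from the cited paper): the abstract describes curiosity rewards being combined with extrinsic rewards as input to Q-learning. The Python class below shows that combination in tabular form. The lookup table standing in for the prediction network, the 0/1 surprise signal, and the weight `beta` are assumptions for illustration. The paper's associative memory network and its inference and distance calibration algorithms are not modeled.

```python
from collections import defaultdict

class CuriousQLearner:
    """Tabular Q-learning with a prediction-error curiosity bonus."""

    def __init__(self, n_actions, alpha=0.5, gamma=0.9, beta=0.1):
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.predicted_next = {}  # tabular stand-in for a prediction network
        self.alpha, self.gamma, self.beta = alpha, gamma, beta

    def curiosity(self, s, a, s_next):
        """Return 1.0 if the transition was mispredicted or unseen, else 0.0."""
        surprised = self.predicted_next.get((s, a)) != s_next
        self.predicted_next[(s, a)] = s_next  # learn the observed transition
        return 1.0 if surprised else 0.0

    def update(self, s, a, r_ext, s_next):
        """Q-learning step on extrinsic reward plus weighted curiosity."""
        r = r_ext + self.beta * self.curiosity(s, a, s_next)
        target = r + self.gamma * max(self.q[s_next])
        self.q[s][a] += self.alpha * (target - self.q[s][a])
        return r
```

The first visit to a transition earns the bonus and later visits do not, so the combined reward nudges the agent away from repeating transitions it already predicts correctly.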

28

Han, Ziyao, Fan Yi, and Kazuhiro Ohkura. "Collective Transport Behavior in a Robotic Swarm with Hierarchical Imitation Learning." Journal of Robotics and Mechatronics 36, no. 3 (June 20, 2024): 538–45. http://dx.doi.org/10.20965/jrm.2024.p0538.

Abstract:

Swarm robotics is the study of how a large number of relatively simple physically embodied robots can be designed such that a desired collective behavior emerges from local interactions. Furthermore, reinforcement learning (RL) is a promising approach for training robotic swarm controllers. However, the conventional RL approach suffers from the sparse reward problem in some complex tasks, such as key-to-door tasks. In this study, we applied hierarchical imitation learning to train a robotic swarm to address a key-to-door transport task with sparse rewards. The results demonstrate that the proposed approach outperforms the conventional RL method. Moreover, the proposed method outperforms the conventional hierarchical RL method in its ability to adapt to changes in the training environment.

29

Abu Bakar, Mohamad Hafiz, Abu Ubaidah Shamsudin, Zubair Adil Soomro, Satoshi Tadokoro, and C. J. Salaan. "FUSION SPARSE AND SHAPING REWARD FUNCTION IN SOFT ACTOR-CRITIC DEEP REINFORCEMENT LEARNING FOR MOBILE ROBOT NAVIGATION." Jurnal Teknologi 86, no. 2 (January 15, 2024): 37–49. http://dx.doi.org/10.11113/jurnalteknologi.v86.20147.

Abstract:

Nowadays, advances in autonomous robots are driven by the development of a world surrounded by new technologies. Deep Reinforcement Learning (DRL) allows systems to operate automatically, so that the robot learns its next movement from interaction with the environment. Moreover, since robots require continuous actions, Soft Actor-Critic Deep Reinforcement Learning (SAC DRL) is considered a recent DRL solution. SAC is used because of its ability to control continuous actions and produce more accurate movements. SAC is fundamentally robust against unpredictability, but some weaknesses have been identified, particularly in the exploration process needed for accurate learning and faster convergence. To address this issue, the study identified a solution that uses a reward function appropriate for the system to guide the learning process. This research proposes several types of reward functions based on sparse and shaping rewards in the SAC method to investigate the effectiveness of mobile robot learning. Finally, the experiment shows that, using fused sparse and shaping rewards in SAC DRL, the robot successfully navigates to the target position, and accuracy also improves, with an average error of 4.99%.

30

Sharip, Zati, Mohd Hafiz Zulkifli, Mohd Nur Farhan Abd Wahab, Zubaidi Johar, and Mohd Zaki Mat Amin. "ASSESSING TROPHIC STATE AND WATER QUALITY OF SMALL LAKES AND PONDS IN PERAK." Jurnal Teknologi 86, no. 2 (January 15, 2024): 51–59. http://dx.doi.org/10.11113/jurnalteknologi.v86.20566.

31

Su, Linfeng, Jinbo Wang, and Hongbo Chen. "A Real-Time and Optimal Hypersonic Entry Guidance Method Using Inverse Reinforcement Learning." Aerospace 10, no. 11 (November 7, 2023): 948. http://dx.doi.org/10.3390/aerospace10110948.

Abstract:

The mission of hypersonic vehicles faces the problem of highly nonlinear dynamics and complex environments, which presents challenges to the intelligent level and real-time performance of onboard guidance algorithms. In this paper, inverse reinforcement learning is used to address the hypersonic entry guidance problem. The state-control sample pairs and state-rewards sample pairs obtained by interacting with hypersonic entry dynamics are used to train the neural network by applying the distributed proximal policy optimization method. To overcome the sparse reward problem in the hypersonic entry problem, a novel reward function combined with a sophisticated discriminator network is designed to generate dense optimal rewards continuously, which is the main contribution of this paper. The optimized guidance methodology can achieve good terminal accuracy and high success rates with a small number of trajectories as datasets while satisfying heating rate, overload, and dynamic pressure constraints. The proposed guidance method is employed for two typical hypersonic entry vehicles (Common Aero Vehicle-Hypersonic and Reusable Launch Vehicle) to demonstrate the feasibility and potential. Numerical simulation results validate the real-time performance and optimality of the proposed method and indicate its suitability for onboard applications in the hypersonic entry flight.

32

Wang, Yifan, and Meibao Yao. "Autonomous Robots Traverse Multi-Terrain Environments via Hierarchical Reinforcement Learning with Skill Discovery." Journal of Physics: Conference Series 2762, no. 1 (May 1, 2024): 012003. http://dx.doi.org/10.1088/1742-6596/2762/1/012003.

Abstract:

Although deep reinforcement learning (DRL) has been widely used in robotic mapless navigation tasks, most current research focuses on structured environments such as indoor or maze scenes, and little of it targets outdoor environments. Unlike indoor environments and mazes, outdoor fields tend to be unstructured, with complex landforms and sparse rewards for robotic navigation. The performance of most DRL-based strategies is directly affected by the design of the reward function, which greatly reduces their generalizability to outdoor environments. To this end, we propose a two-stage learning paradigm based on skill discovery and hierarchical reinforcement learning (HRL) to cope with this challenge. Specifically, we implement skill discovery through a pre-training stage to acquire diverse skills with terrain-adaptive exploration strategies; then we select multiple skills using HRL to cope with more complex scenarios. We carry out the robotic multi-terrain traverse task on a high-fidelity robotic simulation platform, Webots, and implement extensive comparative experiments and ablation studies to demonstrate the effectiveness of our approach.

33

Zhang, Yilin, Huimin Sun, Honglin Sun, Yuan Huang, and Kenji Hashimoto. "Biped Robots Control in Gusty Environments with Adaptive Exploration Based DDPG." Biomimetics 9, no. 6 (June 8, 2024): 346. http://dx.doi.org/10.3390/biomimetics9060346.

Abstract:

As technology rapidly evolves, the application of bipedal robots in various environments has widely expanded. These robots, compared to their wheeled counterparts, exhibit a greater degree of freedom and a higher complexity in control, making the challenge of maintaining balance and stability under changing wind speeds particularly intricate. Overcoming this challenge is critical as it enables bipedal robots to sustain more stable gaits during outdoor tasks, thereby increasing safety and enhancing operational efficiency in outdoor settings. To transcend the constraints of existing methodologies, this research introduces an adaptive bio-inspired exploration framework for bipedal robots facing wind disturbances, which is based on the Deep Deterministic Policy Gradient (DDPG) approach. This framework allows the robots to perceive their bodily states through wind force inputs and adaptively modify their exploration coefficients. Additionally, to address the convergence challenges posed by sparse rewards, this study incorporates Hindsight Experience Replay (HER) and a reward-reshaping strategy to provide safer and more effective training guidance for the agents. Simulation outcomes reveal that robots utilizing this advanced method can more swiftly explore behaviors that contribute to stability in complex conditions, and demonstrate improvements in training speed and walking distance over traditional DDPG algorithms.

34

Song, Qingpeng, Yuansheng Liu, Ming Lu, Jun Zhang, Han Qi, Ziyu Wang, and Zijian Liu. "Autonomous Driving Decision Control Based on Improved Proximal Policy Optimization Algorithm." Applied Sciences 13, no. 11 (May 24, 2023): 6400. http://dx.doi.org/10.3390/app13116400.

Abstract:

The decision-making control of autonomous driving in complex urban road environments is a difficult problem in autonomous driving research. To address the high-dimensional state space and sparse rewards in autonomous driving decision control in this environment, this paper proposes Coordinated Convolution Multi-Reward Proximal Policy Optimization (CCMR-PPO). This method reduces the dimension of the bird’s-eye view data through the coordinated convolution network and then fuses the processed data with the vehicle state data as the input of the algorithm to optimize the state space. The control commands acc (throttle and brake) and steer of the vehicle are used as the output of the algorithm. Taking into account the lateral error, safety distance, speed, and other factors of the vehicle, a multi-objective reward mechanism is designed to alleviate the sparse reward. Experiments on the CARLA simulation platform show that the proposed method can effectively improve performance: compared with the PPO algorithm, the number of line crossings is reduced by 24%, and the number of tasks completed is increased by 54%.
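
Illustrative sketch (not taken from the cited paper): a multi-objective reward of the kind described in this abstract can be written as a weighted sum of per-step terms, so that the agent receives feedback at every step rather than only at task completion. In the Python function below, the three terms, their normalization, the weights, and the default thresholds are all assumptions for illustration. The paper's actual reward terms and coefficients are not given in the abstract.

```python
def multi_objective_reward(lateral_error_m, gap_m, speed_mps, *,
                           safe_gap_m=10.0, target_speed_mps=8.0,
                           weights=(1.0, 1.0, 0.5)):
    """Dense driving reward as a weighted sum of three penalty terms.

    lateral_error_m -- distance from the lane center, in meters
    gap_m           -- distance to the vehicle ahead, in meters
    speed_mps       -- current speed, in meters per second
    """
    w_lat, w_gap, w_speed = weights
    r_lat = -abs(lateral_error_m)                           # stay near lane center
    r_gap = -max(0.0, safe_gap_m - gap_m) / safe_gap_m      # penalize unsafe gaps only
    r_speed = -abs(speed_mps - target_speed_mps) / target_speed_mps  # track target speed
    return w_lat * r_lat + w_gap * r_gap + w_speed * r_speed
```

A vehicle that is centered, keeps a safe gap, and drives at the target speed receives a reward of zero. Each deviation lowers the reward in proportion to its weight.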

35

Kim, MyeongSeop, and Jung-Su Kim. "Policy-based Deep Reinforcement Learning for Sparse Reward Environment." Transactions of The Korean Institute of Electrical Engineers 70, no. 3 (March 31, 2021): 506–14. http://dx.doi.org/10.5370/kiee.2021.70.3.506.

36

Potjans, Wiebke, Abigail Morrison, and Markus Diesmann. "A Spiking Neural Network Model of an Actor-Critic Learning Agent." Neural Computation 21, no. 2 (February 2009): 301–39. http://dx.doi.org/10.1162/neco.2008.08-07-593.

Abstract:

The ability to adapt behavior to maximize reward as a result of interactions with the environment is crucial for the survival of any higher organism. In the framework of reinforcement learning, temporal-difference learning algorithms provide an effective strategy for such goal-directed adaptation, but it is unclear to what extent these algorithms are compatible with neural computation. In this article, we present a spiking neural network model that implements actor-critic temporal-difference learning by combining local plasticity rules with a global reward signal. The network is capable of solving a nontrivial gridworld task with sparse rewards. We derive a quantitative mapping of plasticity parameters and synaptic weights to the corresponding variables in the standard algorithmic formulation and demonstrate that the network learns with a similar speed to its discrete time counterpart and attains the same equilibrium performance.

37

Rauber, Paulo, Avinash Ummadisingu, Filipe Mutz, and Jürgen Schmidhuber. "Reinforcement Learning in Sparse-Reward Environments With Hindsight Policy Gradients." Neural Computation 33, no. 6 (May 13, 2021): 1498–553. http://dx.doi.org/10.1162/neco_a_01387.

Abstract:

A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enabling sample efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this letter, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.

38

Yu, Sheng, Wei Zhu, and Yong Wang. "Research on Wargame Decision-Making Method Based on Multi-Agent Deep Deterministic Policy Gradient." Applied Sciences 13, no. 7 (April 4, 2023): 4569. http://dx.doi.org/10.3390/app13074569.

Abstract:

Wargames are essential simulators for various war scenarios. However, the increasing pace of warfare has rendered traditional wargame decision-making methods inadequate. To address this challenge, wargame-assisted decision-making methods that leverage artificial intelligence techniques, notably reinforcement learning, have emerged as a promising solution. The current wargame environment is beset by a large decision space and sparse rewards, presenting obstacles to optimizing decision-making methods. To overcome these hurdles, a Multi-Agent Deep Deterministic Policy Gradient (MADDPG) based wargame decision-making method is presented. The Partially Observable Markov Decision Process (POMDP), joint action-value function, and the Gumbel-Softmax estimator are applied to optimize MADDPG in order to adapt to the wargame environment. Furthermore, a wargame decision-making method based on the improved MADDPG algorithm is proposed. Using supervised learning in the proposed approach, the training efficiency is improved and the space for manipulation before the reinforcement learning phase is reduced. In addition, a policy gradient estimator is incorporated to reduce the action space and to obtain the global optimal solution. Furthermore, an additional reward function is designed to address the sparse reward problem. The experimental results demonstrate that our proposed wargame decision-making method outperforms the pre-optimization algorithm and other algorithms based on the AC framework in the wargame environment. Our approach offers a promising solution to the challenging problem of decision-making in wargame scenarios, particularly given the increasing speed and complexity of modern warfare.

39

Zhang, Danyang, Zhaolong Xuan, Yang Zhang, Jiangyi Yao, Xi Li, and Xiongwei Li. "Path Planning of Unmanned Aerial Vehicle in Complex Environments Based on State-Detection Twin Delayed Deep Deterministic Policy Gradient." Machines 11, no. 1 (January 13, 2023): 108. http://dx.doi.org/10.3390/machines11010108.

Abstract:

This paper investigates the path planning problem of an unmanned aerial vehicle (UAV) for completing a raid mission through ultra-low altitude flight in complex environments. The UAV needs to avoid radar detection areas, low-altitude static obstacles, and low-altitude dynamic obstacles during the flight process. The uncertainty of low-altitude dynamic obstacle movement can slow down the convergence of existing algorithm models and also reduce the mission success rate of UAVs. To solve this problem, this paper designs a state detection method to encode the environmental state of the UAV’s direction of travel and compress the environmental state space. Considering the continuity of the state space and action space, the SD-TD3 algorithm is proposed in combination with the twin delayed deep deterministic policy gradient algorithm (TD3), which can accelerate the training convergence speed and improve the obstacle avoidance capability of the algorithm model. Further, to address the sparse reward problem of traditional reinforcement learning, a heuristic dynamic reward function is designed to give real-time rewards and guide the UAV to complete the task. The simulation results show that the training results of the SD-TD3 algorithm converge faster than those of the TD3 algorithm, and the actual results of the converged model are better.

40

Yao, Jiangyi, Xiongwei Li, Yang Zhang, Jingyu Ji, Yanchao Wang, and Yicen Liu. "Path Planning of Unmanned Helicopter in Complex Dynamic Environment Based on State-Coded Deep Q-Network." Symmetry 14, no. 5 (April 21, 2022): 856. http://dx.doi.org/10.3390/sym14050856.

Abstract:

Unmanned helicopters (UH) can avoid radar detection by flying at ultra-low altitudes; thus, they have been widely used on the battlefield. The flight safety of UH is seriously affected by moving obstacles such as flocks of birds in low airspace. Therefore, an algorithm that can plan a safe path for UH is urgently needed. Due to the strong randomness of the movement of bird flocks, the existing path planning algorithms are inadequate for this task. To solve this problem, a state-coded deep Q-network (SC-DQN) algorithm with symmetric properties is proposed, which can effectively avoid randomly moving obstacles and plan a safe path for UH. First, a dynamic reward function is designed to give UH appropriate rewards in real time, so as to mitigate the sparse reward problem. Then, a state-coding scheme is proposed, which uses binary Boolean expressions to encode the environment state and thereby compress the environment state space. The encoded state is used as the input to the deep learning network, which is an important improvement over the traditional algorithm. Experimental results show that the SC-DQN algorithm can help UH safely and effectively avoid moving obstacles with unknown motion status in the environment and successfully complete the raid task.
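
Illustrative sketch (not taken from the cited paper): encoding the environment as a handful of Boolean flags collapses many raw observations into a small discrete state space. The Python function below packs such flags into a single integer code. The choice of flags (obstacle present per direction, goal lies in each direction) and their ordering are assumptions for illustration, since the abstract does not specify the paper's actual coding scheme.

```python
def encode_state(obstacle_flags, goal_flags):
    """Pack Boolean observations into one integer, first flag as the highest bit.

    obstacle_flags -- hypothetical: True where an obstacle lies in that direction
    goal_flags     -- hypothetical: True where the goal lies in that direction
    """
    code = 0
    for flag in list(obstacle_flags) + list(goal_flags):
        code = (code << 1) | int(bool(flag))
    return code
```

With n flags the encoded state takes at most 2**n distinct values regardless of how large the underlying map is, which is the compression effect the abstract refers to.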

41

Lei, Xiaoyun, Zhian Zhang, and Peifang Dong. "Dynamic Path Planning of Unknown Environment Based on Deep Reinforcement Learning." Journal of Robotics 2018 (September 18, 2018): 1–10. http://dx.doi.org/10.1155/2018/5781591.

Abstract:

Dynamic path planning in unknown environments has always been a challenge for mobile robots. In this paper, we apply the double deep Q-network (DDQN) reinforcement learning method proposed by DeepMind in 2016 to dynamic path planning in unknown environments. The reward and punishment function and the training method are designed to address the instability of the training stage and the sparsity of the environment state space. In different training stages, we dynamically adjust the starting position and target position. With the updating of the neural network and the increase of the greedy rule probability, the local space searched by the agent is expanded. The Pygame module in Python is used to establish dynamic environments. Taking the lidar signal and local target position as inputs, convolutional neural networks (CNNs) are used to generalize the environmental state. The Q-learning algorithm enhances the agents' ability to avoid dynamic obstacles and plan locally in the environment. The results show that, after training in different dynamic environments and testing in a new environment, the agent is able to reach the local target position successfully in an unknown dynamic environment.
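
Illustrative sketch (not taken from the cited paper): the defining step of double DQN is that the online network selects the next action while the target network evaluates it. The Python function below computes that target from plain lists of Q-values. The function name, the argument layout, and the default discount factor are assumptions for illustration, and the paper's network architecture and reward design are not modeled.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN bootstrap target for a single transition.

    q_online_next -- Q-values of the next state from the online network
    q_target_next -- Q-values of the next state from the target network
    """
    if done:
        return reward  # no bootstrapping past a terminal state
    # Action selection uses the online network ...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ... while action evaluation uses the target network.
    return reward + gamma * q_target_next[best_action]
```

For example, with online values `[0.2, 0.9]` and target values `[0.5, 0.3]`, the online network picks action 1 and the target uses 0.3 rather than the larger 0.5, which is how the method reduces overestimation compared with taking the maximum of the target network directly.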

42

Zhang, Zhizhuo, and Change Zheng. "Simulation of Robotic Arm Grasping Control Based on Proximal Policy Optimization Algorithm." Journal of Physics: Conference Series 2203, no. 1 (February 1, 2022): 012065. http://dx.doi.org/10.1088/1742-6596/2203/1/012065.

Abstract:

There are many kinds of inverse kinematics solutions for robots. Deep reinforcement learning can enable a robot to find the optimal inverse kinematics solution in a short time. To address the problem of sparse rewards in deep reinforcement learning, this paper proposes an improved PPO algorithm. First, a simulation environment for the operation of the robotic arm is built. Second, a convolutional neural network is used to process the data read by the camera of the robotic arm, yielding the Actor and Critic networks. Third, based on the principle of inverse kinematics of the robotic arm and the reward mechanism in deep reinforcement learning, a hierarchical reward function that includes motion accuracy is designed to promote the convergence of the PPO algorithm. Finally, the improved PPO algorithm is compared with the traditional PPO algorithm. The results show that the improved PPO algorithm improves both the convergence speed and the operating accuracy.

43

Luu, Tung M., and Chang D. Yoo. "Hindsight Goal Ranking on Replay Buffer for Sparse Reward Environment." IEEE Access 9 (2021): 51996–2007. http://dx.doi.org/10.1109/access.2021.3069975.

44

Feng, Shiying, Xiaofeng Li, Lu Ren, and Shuiqing Xu. "Reinforcement learning with parameterized action space and sparse reward for UAV navigation." Intelligence & Robotics 3, no. 2 (June 27, 2023): 161–75. http://dx.doi.org/10.20517/ir.2023.10.

Abstract:

Autonomous navigation of unmanned aerial vehicles (UAVs) is widely used in building rescue systems. As the complexity of the task increases, traditional methods based on environment models are hard to apply. In this paper, a reinforcement learning (RL) algorithm is proposed to solve the UAV navigation problem. The UAV navigation task is modeled as a Markov Decision Process (MDP) with parameterized actions. In addition, the sparse reward problem is also taken into account. To address these issues, we develop the HER-MPDQN by combining Multi-Pass Deep Q-Network (MP-DQN) and Hindsight Experience Replay (HER). Two UAV navigation simulation environments with progressive difficulty are constructed to evaluate our method. The results show that HER-MPDQN outperforms other baselines in relatively simple tasks. Especially for complex tasks involving relay operations, only our method can achieve satisfactory performance.
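
For readers unfamiliar with Hindsight Experience Replay, the sketch below illustrates the general relabeling idea under simple assumptions: a failed episode is stored a second time with the goal replaced by the state the agent actually reached, so that at least one transition carries a non-negative reward. The `Transition` type, the grid-cell states, the 0/-1 reward convention, and the "final state" relabeling strategy are choices made for this example; the abstract does not say which variant HER-MPDQN uses.

```python
from typing import List, NamedTuple, Tuple

State = Tuple[int, int]  # hypothetical grid-cell state


class Transition(NamedTuple):
    state: State
    action: int
    next_state: State
    goal: State
    reward: float


def sparse_reward(achieved: State, goal: State) -> float:
    """0 when the goal is reached, -1 otherwise (a common sparse convention)."""
    return 0.0 if achieved == goal else -1.0


def relabel_with_final_goal(episode: List[Transition]) -> List[Transition]:
    """Copy an episode, pretending its final achieved state was the goal."""
    if not episode:
        return []
    new_goal = episode[-1].next_state
    return [
        t._replace(goal=new_goal, reward=sparse_reward(t.next_state, new_goal))
        for t in episode
    ]


# A failed two-step episode: the agent never reaches the goal (3, 3).
goal = (3, 3)
episode = [
    Transition((0, 0), 0, (1, 0), goal, sparse_reward((1, 0), goal)),
    Transition((1, 0), 1, (1, 1), goal, sparse_reward((1, 1), goal)),
]
relabeled = relabel_with_final_goal(episode)
```

Both the original and the relabeled transitions would normally be pushed into the replay buffer. The relabeled copy ends with a reward of 0, which gives the learner a success signal even though the original goal was never reached.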

45

Liu, Zeyang, Lipeng Wan, Xinrui Yang, Zhuoran Chen, Xingyu Chen, and Xuguang Lan. "Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (March 24, 2024): 17487–95. http://dx.doi.org/10.1609/aaai.v38i16.29698.

Abstract:

Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often face challenges obtaining specific joint action sequences to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state that can influence each other's transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at the critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. Particularly, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models.

46

Jiang, Haobin, Ziluo Ding, and Zongqing Lu. "Settling Decentralized Multi-Agent Coordinated Exploration by Novelty Sharing." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (March 24, 2024): 17444–52. http://dx.doi.org/10.1609/aaai.v38i16.29693.

Abstract:

Exploration in decentralized cooperative multi-agent reinforcement learning faces two challenges. One is that the novelty of global states is unavailable, while the novelty of local observations is biased. The other is how agents can explore in a coordinated way. To address these challenges, we propose MACE, a simple yet effective multi-agent coordinated exploration method. By communicating only local novelty, agents can take into account other agents' local novelty to approximate the global novelty. Further, we introduce weighted mutual information to measure the influence of one agent's action on other agents' accumulated novelty. We convert it into an intrinsic reward in hindsight to encourage agents to exert more influence on other agents' exploration and boost coordinated exploration. Empirically, we show that MACE achieves superior performance in three multi-agent environments with sparse rewards.
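
The novelty-sharing idea in this abstract can be pictured with a small sketch. Here each agent keeps visit counts over its own observations, reports only a scalar novelty value, and the shared values are summed as a stand-in for global novelty. The count-based novelty measure (1/sqrt(n)), the summation, and all names are assumptions made for illustration; the abstract does not specify how MACE measures or aggregates novelty, and the weighted mutual information term is not covered here.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


class LocalNoveltyTracker:
    """Count-based novelty over one agent's own local observations."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def observe(self, obs: Hashable) -> float:
        # Novelty decays as the same local observation is revisited.
        self.counts[obs] += 1
        return 1.0 / math.sqrt(self.counts[obs])


def approximate_global_novelty(local_novelties: Iterable[float]) -> float:
    """Combine the scalar novelties that agents communicated (assumed: sum)."""
    return sum(local_novelties)


# Two agents; only the scalar novelty values are exchanged, never observations.
agents = [LocalNoveltyTracker(), LocalNoveltyTracker()]

step1 = approximate_global_novelty(
    [agents[0].observe("room_a"), agents[1].observe("room_b")]
)
step2 = approximate_global_novelty(
    [agents[0].observe("room_a"), agents[1].observe("room_c")]
)
```

In the second step agent 0 revisits an observation while agent 1 sees something new, so the combined signal drops but stays well above what agent 0 would compute from its own counts alone.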

47

Xu, He A., Alireza Modirshanechi, Marco P. Lehmann, Wulfram Gerstner, and Michael H. Herzog. "Novelty is not surprise: Human exploratory and adaptive behavior in sequential decision-making." PLOS Computational Biology 17, no. 6 (June 3, 2021): e1009070. http://dx.doi.org/10.1371/journal.pcbi.1009070.

Abstract:

Classic reinforcement learning (RL) theories cannot explain human behavior in the absence of external reward or when the environment changes. Here, we employ a deep sequential decision-making paradigm with sparse reward and abrupt environmental changes. To explain the behavior of human participants in these environments, we show that RL theories need to include surprise and novelty, each with a distinct role. While novelty drives exploration before the first encounter of a reward, surprise increases the rate of learning of a world-model as well as of model-free action-values. Even though the world-model is available for model-based RL, we find that human decisions are dominated by model-free action choices. The world-model is only marginally used for planning, but it is important to detect surprising events. Our theory predicts human action choices with high probability and allows us to dissociate surprise, novelty, and reward in EEG signals.

48

Zeng, Junjie, Rusheng Ju, Long Qin, Yue Hu, Quanjun Yin, and Cong Hu. "Navigation in Unknown Dynamic Environments Based on Deep Reinforcement Learning." Sensors 19, no. 18 (September 5, 2019): 3837. http://dx.doi.org/10.3390/s19183837.

Abstract:

In this paper, we propose a novel Deep Reinforcement Learning (DRL) algorithm which can navigate non-holonomic robots with continuous control in an unknown dynamic environment with moving obstacles. We call the approach MK-A3C (Memory and Knowledge-based Asynchronous Advantage Actor-Critic) for short. As its first component, MK-A3C builds a GRU-based memory neural network to enhance the robot’s capability for temporal reasoning. Robots without it tend to suffer from a lack of rationality in the face of incomplete and noisy estimations for complex environments. Additionally, robots with the memory ability endowed by MK-A3C can avoid local minima traps by estimating the environmental model. Secondly, MK-A3C combines a domain knowledge-based reward function with a transfer learning-based training task architecture, which can solve the policy non-convergence problems caused by sparse rewards. With these improvements, MK-A3C can efficiently navigate robots in unknown dynamic environments and satisfy kinetic constraints while handling moving objects. Simulation experiments show that, compared with existing methods, MK-A3C can realize successful robotic navigation in unknown and challenging environments by outputting continuous acceleration commands.

49

Park, Minjae, Chaneun Park, and Nam Kyu Kwon. "Autonomous Driving of Mobile Robots in Dynamic Environments Based on Deep Deterministic Policy Gradient: Reward Shaping and Hindsight Experience Replay." Biomimetics 9, no. 1 (January 13, 2024): 51. http://dx.doi.org/10.3390/biomimetics9010051.

Abstract:

In this paper, we propose a reinforcement learning-based end-to-end learning method for the autonomous driving of a mobile robot in a dynamic environment with obstacles. Applying two additional techniques for reinforcement learning simultaneously helps the mobile robot in finding an optimal policy to reach the destination without collisions. First, the multifunctional reward-shaping technique guides the agent toward the goal by utilizing information about the destination and obstacles. Next, employing the hindsight experience replay technique to address the experience imbalance caused by the sparse reward problem assists the agent in finding the optimal policy. We validated the proposed technique in both simulation and real-world environments. To assess the effectiveness of the proposed method, we compared experiments for five different cases.

50

Mourad, Nafee, Ali Ezzeddine, Babak Nadjar Araabi, and Majid Nili Ahmadabadi. "Learning from Demonstrations and Human Evaluative Feedbacks: Handling Sparsity and Imperfection Using Inverse Reinforcement Learning Approach." Journal of Robotics 2020 (January 13, 2020): 1–18. http://dx.doi.org/10.1155/2020/3849309.

Abstract:

Programming by demonstrations is one of the most efficient methods for knowledge transfer to develop advanced learning systems, provided that teachers deliver abundant and correct demonstrations, and learners correctly perceive them. Nevertheless, demonstrations are sparse and inaccurate in almost all real-world problems. Complementary information is needed to compensate for these shortcomings of demonstrations. In this paper, we target programming by a combination of nonoptimal and sparse demonstrations and a limited number of binary evaluative feedbacks, where the learner uses its own evaluated experiences as new demonstrations in an extended inverse reinforcement learning method. This provides the learner with a broader generalization and less regret as well as robustness in the face of sparsity and nonoptimality in demonstrations and feedbacks. Our method alleviates the unrealistic burden on teachers to provide optimal and abundant demonstrations. Employing an evaluative feedback, which is easy for teachers to deliver, provides the opportunity to correct the learner’s behavior in an interactive social setting without requiring teachers to know and use their own accurate reward function. Here, we enhance inverse reinforcement learning (IRL) to estimate the reward function using a mixture of nonoptimal and sparse demonstrations and evaluative feedbacks. Our method, called IRL from demonstration and human’s critique (IRLDC), has two phases. The teacher first provides some demonstrations for the learner to initialize its policy. Next, the learner interacts with the environment and the teacher provides binary evaluative feedbacks. Taking into account possible inconsistencies and mistakes in issuing and receiving feedbacks, the learner revises the estimated reward function by solving a single optimization problem. The IRLDC is devised to handle errors and sparsities in demonstrations and feedbacks and can generalize different combinations of these two sources of expertise.
We apply our method to three domains: a simulated navigation task, a simulated car driving problem with human interactions, and a navigation experiment of a mobile robot. The results indicate that the IRLDC significantly enhances the learning process where the standard IRL methods fail and learning from feedbacks (LfF) methods have high regret. Also, the IRLDC works well at different levels of sparsity and optimality of the teacher’s demonstrations and feedbacks, where other state-of-the-art methods fail.