A selection of scholarly literature on the topic "Sparsely rewarded environments"

Format your citation in APA, MLA, Chicago, Harvard, and other styles

Browse the lists of current articles, books, theses, conference papers, and other scholarly sources on the topic "Sparsely rewarded environments".

Next to each work in the list of references you will find an "Add to bibliography" button. Click it, and we will automatically generate a bibliographic reference to the selected work in the citation style of your choice: APA, MLA, Harvard, Chicago, Vancouver, and others.

You can also download the full text of a publication as a .pdf file and read its abstract online, provided these details are available in the metadata.

Journal articles on the topic "Sparsely rewarded environments":

1

Dubey, Rachit, Thomas L. Griffiths, and Peter Dayan. "The pursuit of happiness: A reinforcement learning perspective on habituation and comparisons." PLOS Computational Biology 18, no. 8 (August 4, 2022): e1010316. http://dx.doi.org/10.1371/journal.pcbi.1010316.

Abstract:
In evaluating our choices, we often suffer from two tragic relativities. First, when our lives change for the better, we rapidly habituate to the higher standard of living. Second, we cannot escape comparing ourselves to various relative standards. Habituation and comparisons can be very disruptive to decision-making and happiness, and till date, it remains a puzzle why they have come to be a part of cognition in the first place. Here, we present computational evidence that suggests that these features might play an important role in promoting adaptive behavior. Using the framework of reinforcement learning, we explore the benefit of employing a reward function that, in addition to the reward provided by the underlying task, also depends on prior expectations and relative comparisons. We find that while agents equipped with this reward function are less happy, they learn faster and significantly outperform standard reward-based agents in a wide range of environments. Specifically, we find that relative comparisons speed up learning by providing an exploration incentive to the agents, and prior expectations serve as a useful aid to comparisons, especially in sparsely-rewarded and non-stationary environments. Our simulations also reveal potential drawbacks of this reward function and show that agents perform sub-optimally when comparisons are left unchecked and when there are too many similar options. Together, our results help explain why we are prone to becoming trapped in a cycle of never-ending wants and desires, and may shed light on psychopathologies such as depression, materialism, and overconsumption.
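To make the reward design described in this abstract concrete, the following minimal sketch (not the authors' code; the bandit setting, weights, and update rule are hypothetical simplifications) shows a learning signal that combines the task reward with a habituating expectation and a relative comparison.

```python
import random

# Hypothetical sketch of a "subjective" reward combining task reward,
# habituation to a running expectation, and comparison to the best-known
# alternative. Weights and setting are illustrative, not from the paper.

random.seed(0)
TRUE_MEANS = [0.3, 0.7]            # hidden payoff probabilities of two arms
ALPHA, ETA, EPS = 0.1, 0.05, 0.1   # learning rate, habituation rate, exploration

q = [0.0, 0.0]       # action-value estimates
expectation = 0.0    # prior expectation that habituates to recent outcomes

for t in range(2000):
    a = random.randrange(2) if random.random() < EPS else max(range(2), key=q.__getitem__)
    r = 1.0 if random.random() < TRUE_MEANS[a] else 0.0

    comparison = r - max(q)         # how the outcome compares to the best-known option
    habituation = r - expectation   # how it compares to the adapted standard
    subjective_r = r + 0.5 * comparison + 0.5 * habituation

    q[a] += ALPHA * (subjective_r - q[a])
    expectation += ETA * (r - expectation)   # the standard drifts upward with success

print("learned action values:", [round(v, 2) for v in q])
```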
2

Shi, Xiaoping, Shiqi Zou, Shenmin Song, and Rui Guo. "A multi-objective sparse evolutionary framework for large-scale weapon target assignment based on a reward strategy." Journal of Intelligent & Fuzzy Systems 40, no. 5 (April 22, 2021): 10043–61. http://dx.doi.org/10.3233/jifs-202679.

Abstract:
The asset-based weapon target assignment (ABWTA) problem is one of the important branches of the weapon target assignment (WTA) problem. Owing to the scale of modern battlefield environments, the ABWTA problem is a multi-objective optimization problem (MOP) with strong constraints and large-scale, sparse properties. A novel model of the ABWTA problem with an operation error parameter is established. An evolutionary algorithm for large-scale sparse problems (SparseEA) is introduced as the main framework for solving the large-scale sparse ABWTA problem. The proposed framework (SparseEA-ABWTA) combines a problem-specific initialization method and genetic operators with a reward strategy that generate solutions efficiently by exploiting the sparsity of the variables, and an improved non-dominated solution selection method is presented to handle the constraints. On large-scale cases constructed by a dedicated case generator, two numerical experiments against four leading multi-objective evolutionary algorithms (MOEAs) show that the runtime of SparseEA-ABWTA is nearly 50% lower than that of the other algorithms at the same level of convergence, and that the gap in convergence and distribution between SparseEA-ABWTA and the MOEAs augmented with its mechanism is reduced to nearly 20%.
3

Sakamoto, Yuma, and Kentarou Kurashige. "Self-Generating Evaluations for Robot’s Autonomy Based on Sensor Input." Machines 11, no. 9 (September 6, 2023): 892. http://dx.doi.org/10.3390/machines11090892.

Abstract:
Reinforcement learning has been explored within the context of robot operation in different environments. Designing the reward function in reinforcement learning is challenging for designers because it requires specialized knowledge. To reduce the design burden, we propose a reward design method that is independent of both specific environments and tasks, in which reinforcement learning robots evaluate and generate rewards autonomously based on sensor information received from the environment. This method allows the robot to operate autonomously based on sensors. However, the existing approach to adaptation attempts to adapt without considering input properties such as the strength of the sensor input, which may cause a robot to learn harmful actions from the environment. In this study, we propose a method for changing the threshold of a sensor input while considering the strength of the input and other properties. We also demonstrate the utility of the proposed method by presenting the results of simulation experiments on a path-finding problem conducted in an environment with sparse rewards.
4

Parisi, Simone, Davide Tateo, Maximilian Hensel, Carlo D’Eramo, Jan Peters, and Joni Pajarinen. "Long-Term Visitation Value for Deep Exploration in Sparse-Reward Reinforcement Learning." Algorithms 15, no. 3 (February 28, 2022): 81. http://dx.doi.org/10.3390/a15030081.

Abstract:
Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on getting feedback via extrinsic rewards to train the agent, and in situations where this occurs very rarely, the agent learns slowly or cannot learn at all. Similarly, if the agent also receives rewards that create suboptimal modes of the objective function, it will likely stop exploring prematurely. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods that use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment.
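A crude tabular approximation of the decoupling idea is sketched below: a separate exploration value function is trained on a count-based bonus and only mixed in at action-selection time. The cited method uses a long-term visitation count rather than this naive 1/sqrt(N) bonus, so treat the sketch as an illustration of the decoupling, not of the paper's algorithm.

```python
import math
from collections import defaultdict

# Rough tabular sketch of *decoupled* exploitation/exploration values
# (a simplification; the cited paper uses a long-term visitation count and
# a different exploration objective, not this naive 1/sqrt(N) bonus).

GAMMA = 0.95
ALPHA = 0.1
BETA = 1.0  # weight of the exploration value at action-selection time

q_task = defaultdict(float)      # value of (s, a) under the extrinsic reward
q_explore = defaultdict(float)   # value of (s, a) under the exploration reward
visits = defaultdict(int)        # state-action visitation counts

def update(s, a, r_ext, s_next, actions):
    visits[(s, a)] += 1
    r_explore = 1.0 / math.sqrt(visits[(s, a)])   # rarely tried pairs look attractive
    best_task = max(q_task[(s_next, b)] for b in actions)
    best_expl = max(q_explore[(s_next, b)] for b in actions)
    q_task[(s, a)] += ALPHA * (r_ext + GAMMA * best_task - q_task[(s, a)])
    q_explore[(s, a)] += ALPHA * (r_explore + GAMMA * best_expl - q_explore[(s, a)])

def act(s, actions):
    # The exploration value is kept separate and only mixed in when choosing
    # actions, so the task Q-function never chases a non-stationary shaped target.
    return max(actions, key=lambda a: q_task[(s, a)] + BETA * q_explore[(s, a)])
```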
5

Mguni, David, Taher Jafferjee, Jianhong Wang, Nicolas Perez-Nieves, Wenbin Song, Feifei Tong, Matthew Taylor, et al. "Learning to Shape Rewards Using a Game of Two Partners." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 10 (June 26, 2023): 11604–12. http://dx.doi.org/10.1609/aaai.v37i10.26371.

Abstract:
Reward shaping (RS) is a powerful method in reinforcement learning (RL) for overcoming the problem of sparse or uninformative rewards. However, RS typically relies on manually engineered shaping-reward functions whose construction is time-consuming and error-prone. It also requires domain knowledge which runs contrary to the goal of autonomous learning. We introduce Reinforcement Learning Optimising Shaping Algorithm (ROSA), an automated reward shaping framework in which the shaping-reward function is constructed in a Markov game between two agents. A reward-shaping agent (Shaper) uses switching controls to determine which states to add shaping rewards for more efficient learning while the other agent (Controller) learns the optimal policy for the task using these shaped rewards. We prove that ROSA, which adopts existing RL algorithms, learns to construct a shaping-reward function that is beneficial to the task thus ensuring efficient convergence to high performance policies. We demonstrate ROSA’s properties in three didactic experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse reward environments.
6

Forbes, Grant C., and David L. Roberts. "Potential-Based Reward Shaping for Intrinsic Motivation (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 21 (March 24, 2024): 23488–89. http://dx.doi.org/10.1609/aaai.v38i21.30441.

Abstract:
Recently there has been a proliferation of intrinsic motivation (IM) reward shaping methods to learn in complex and sparse-reward environments. These methods can often inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior. Previous work on mitigating the risks of reward shaping, particularly through potential-based reward shaping (PBRS), has not been applicable to many IM methods, as they are often complex, trainable functions themselves, and therefore dependent on a wider set of variables than the traditional reward functions that PBRS was developed for. We present an extension to PBRS that we show preserves the set of optimal policies under a more general set of functions than has been previously demonstrated. We also present Potential-Based Intrinsic Motivation (PBIM), a method for converting IM rewards into a potential-based form that are useable without altering the set of optimal policies. Testing in the MiniGrid DoorKey environment, we demonstrate that PBIM successfully prevents the agent from converging to a suboptimal policy and can speed up training.
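Potential-based shaping itself has a compact form, F(s, s') = γΦ(s') − Φ(s), which leaves the set of optimal policies unchanged. The sketch below uses a hypothetical distance-based potential on a grid; it illustrates plain PBRS, not the PBIM construction for trainable intrinsic-motivation rewards described in the abstract.

```python
# Classic potential-based reward shaping: F(s, s') = gamma * phi(s') - phi(s).
# The potential here (negative Manhattan distance to a goal cell) is a made-up
# example; PBIM from the cited paper extends the idea to learned intrinsic rewards.

GAMMA = 0.99
GOAL = (9, 9)

def phi(state):
    """Potential of a grid cell: higher (less negative) closer to the goal."""
    x, y = state
    return -(abs(x - GOAL[0]) + abs(y - GOAL[1]))

def shaped_reward(r_env, state, next_state):
    return r_env + GAMMA * phi(next_state) - phi(state)

# Moving toward the goal yields a positive shaping bonus even when r_env == 0.
print(shaped_reward(0.0, (0, 0), (0, 1)))   # approx +1.17
print(shaped_reward(0.0, (0, 1), (0, 0)))   # approx -0.82: moving away is discouraged
```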
7

Xu, Pei, Junge Zhang, Qiyue Yin, Chao Yu, Yaodong Yang, and Kaiqi Huang. "Subspace-Aware Exploration for Sparse-Reward Multi-Agent Tasks." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 10 (June 26, 2023): 11717–25. http://dx.doi.org/10.1609/aaai.v37i10.26384.

Abstract:
Exploration under sparse rewards is a key challenge for multi-agent reinforcement learning problems. One possible solution to this issue is to exploit inherent task structures for an acceleration of exploration. In this paper, we present a novel exploration approach, which encodes a special structural prior on the reward function into exploration, for sparse-reward multi-agent tasks. Specifically, a novel entropic exploration objective which encodes the structural prior is proposed to accelerate the discovery of rewards. By maximizing the lower bound of this objective, we then propose an algorithm with moderate computational cost, which can be applied to practical tasks. Under the sparse-reward setting, we show that the proposed algorithm significantly outperforms the state-of-the-art algorithms in the multiple-particle environment, the Google Research Football, and StarCraft II micromanagement tasks. To the best of our knowledge, on some hard tasks (such as 27m_vs_30m), which have a relatively large number of agents and need non-trivial strategies to defeat enemies, our method is the first to learn winning strategies under the sparse-reward setting.
8

Kubovčík, Martin, Iveta Dirgová Luptáková, and Jiří Pospíchal. "Signal Novelty Detection as an Intrinsic Reward for Robotics." Sensors 23, no. 8 (April 14, 2023): 3985. http://dx.doi.org/10.3390/s23083985.

Abstract:
In advanced robot control, reinforcement learning is a common technique used to transform sensor data into signals for actuators, based on feedback from the robot’s environment. However, the feedback or reward is typically sparse, as it is provided mainly after the task’s completion or failure, leading to slow convergence. Additional intrinsic rewards based on the state visitation frequency can provide more feedback. In this study, an autoencoder deep learning neural network was utilized for novelty detection, providing intrinsic rewards that guide the search process through the state space. The neural network processed signals from various types of sensors simultaneously. It was tested on simulated robotic agents in a benchmark set of classic control OpenAI Gym test environments (including Mountain Car, Acrobot, CartPole, and LunarLander), achieving more efficient and accurate robot control in three of the four tasks (with only slight degradation in the Lunar Lander task) when purely intrinsic rewards were used compared to standard extrinsic rewards. By incorporating autoencoder-based intrinsic rewards, robots could potentially become more dependable in autonomous operations like space or underwater exploration or during natural disaster response, because the system could better adapt to changing environments or unexpected situations.
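The general mechanism, using an autoencoder's reconstruction error as an intrinsic novelty reward, can be sketched as follows (a PyTorch sketch with an invented network size and reward scale, not the configuration used in the paper):

```python
import torch
import torch.nn as nn

# Sketch only: a small autoencoder over concatenated sensor readings whose
# reconstruction error serves as an intrinsic "novelty" reward. Architecture
# and scaling are illustrative, not the cited paper's configuration.

class SensorAutoencoder(nn.Module):
    def __init__(self, obs_dim: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, obs_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def intrinsic_reward(model, optimizer, obs, scale=0.1):
    """Return a novelty bonus and train the autoencoder on the observation."""
    x = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
    loss = nn.functional.mse_loss(model(x), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Poorly reconstructed (novel) observations earn a larger bonus.
    return scale * loss.item()

obs_dim = 4  # e.g. a CartPole-style observation vector
model = SensorAutoencoder(obs_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
print("intrinsic bonus:", intrinsic_reward(model, optimizer, [0.1, -0.2, 0.03, 0.5]))
```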
9

Catacora Ocana, Jim Martin, Roberto Capobianco, and Daniele Nardi. "An Overview of Environmental Features that Impact Deep Reinforcement Learning in Sparse-Reward Domains." Journal of Artificial Intelligence Research 76 (April 26, 2023): 1181–218. http://dx.doi.org/10.1613/jair.1.14390.

Abstract:
Deep reinforcement learning has achieved impressive results in recent years; yet, it is still severely troubled by environments showcasing sparse rewards. On top of that, not all sparse-reward environments are created equal, i.e., they can differ in the presence or absence of various features, with many of them having a great impact on learning. In light of this, the present work puts together a literature compilation of such environmental features, covering particularly those that have been taken advantage of and those that continue to pose a challenge. We expect this effort to provide guidance to researchers for assessing the generality of their new proposals and to call their attention to issues that remain unresolved when dealing with sparse rewards.
10

Zhou, Xiao, Song Zhou, Xingang Mou, and Yi He. "Multirobot Collaborative Pursuit Target Robot by Improved MADDPG." Computational Intelligence and Neuroscience 2022 (February 25, 2022): 1–10. http://dx.doi.org/10.1155/2022/4757394.

Abstract:
Policy formulation is one of the main problems in multirobot systems, especially in multirobot pursuit-evasion scenarios, where both sparse rewards and random environment changes make it difficult to find good strategies. Existing multirobot decision-making methods mostly rely on environmental rewards to drive robots toward the target task, which does not achieve good results. This paper proposes a multirobot pursuit method based on an improved multiagent deep deterministic policy gradient (MADDPG), which addresses the problem of sparse rewards in multirobot pursuit-evasion scenarios by combining an intrinsic reward with the external environmental reward. A state similarity module based on a threshold constraint forms part of the intrinsic reward signal output by the intrinsic curiosity module and is used to balance over-exploration and insufficient exploration, so that the agent can use the intrinsic reward more effectively to learn better strategies. The simulation results show that the proposed method significantly improves the reward obtained by the robots and the success rate of the pursuit task. The improvement is clearly reflected in the real-time distance between the pursuer and the escapee: a pursuer trained with the improved algorithm closes in on the escapee more quickly, and the average following distance also decreases.
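As a loose illustration of a threshold-constrained state-similarity gate (not the paper's ICM/MADDPG implementation; the similarity measure, threshold, and bonus values are invented), an intrinsic bonus can be suppressed whenever the new observation is too close to recently visited ones:

```python
import numpy as np
from collections import deque

# Loose illustration (not the cited implementation): an intrinsic bonus that is
# suppressed when the new observation is too similar to recently seen ones,
# mimicking a threshold-constrained state-similarity gate.

class SimilarityGatedBonus:
    def __init__(self, threshold=0.95, memory=200, bonus=0.05):
        self.threshold = threshold      # cosine-similarity cutoff (hypothetical value)
        self.recent = deque(maxlen=memory)
        self.bonus = bonus

    def __call__(self, obs):
        v = np.asarray(obs, dtype=np.float64)
        v = v / (np.linalg.norm(v) + 1e-8)
        if any(float(v @ u) > self.threshold for u in self.recent):
            reward = 0.0                # too similar to recent states: no bonus
        else:
            reward = self.bonus         # sufficiently new state: reward exploration
        self.recent.append(v)
        return reward

gate = SimilarityGatedBonus()
print(gate([1.0, 0.0]), gate([0.99, 0.01]), gate([0.0, 1.0]))  # 0.05 0.0 0.05
```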

Dissertations and theses on the topic "Sparsely rewarded environments":

1

Gallouedec, Quentin. "Toward the generalization of reinforcement learning." Electronic Thesis or Diss., Ecully, Ecole centrale de Lyon, 2024. http://www.theses.fr/2024ECDL0013.

Abstract:
Conventional Reinforcement Learning (RL) involves training a unimodal agent on a single, well-defined task, guided by a gradient-optimized reward signal. This framework does not allow us to envisage a learning agent adapted to real-world problems involving diverse modality streams and multiple tasks that are often poorly defined, sometimes not defined at all. Hence, we advocate transitioning towards a more general framework, aiming to create RL algorithms that are more inherently versatile. To advance in this direction, we identify two primary areas of focus. The first involves improving exploration, enabling the agent to learn from the environment with reduced dependence on the reward signal. We present Latent Go-Explore (LGE), an extension of the Go-Explore algorithm. While Go-Explore achieved impressive results, it was constrained by domain-specific knowledge; LGE overcomes these limitations, offering wider applicability within a general framework. In various tested environments, LGE consistently outperforms the baselines, showcasing its enhanced effectiveness and versatility. The second focus is to design a general-purpose agent that can operate in a variety of environments, thus involving a multimodal structure and even transcending the conventional sequential framework of RL. We introduce Jack of All Trades (JAT), a multimodal Transformer-based architecture uniquely tailored to sequential decision tasks. Using a single set of weights, JAT demonstrates robustness and versatility, competing with its single baseline on several RL benchmarks and even showing promising performance on vision and textual tasks. We believe that these two contributions are a valuable step towards a more general approach to RL. In addition, we present other methodological and technical advances that are closely related to our core research question. The first is the introduction of a set of sparsely rewarded simulated robotic environments designed to provide the community with the necessary tools for learning under conditions of low supervision. Notably, three years after its introduction, this contribution has been widely adopted by the community and continues to receive active maintenance and support. We also present Open RL Benchmark, our pioneering initiative to provide a comprehensive and fully tracked set of RL experiments, going beyond typical data to include all algorithm-specific and system metrics. This benchmark aims to improve research efficiency by providing out-of-the-box RL data and facilitating accurate reproducibility of experiments. With its community-driven approach, it has quickly become an important resource, documenting over 25,000 runs. These technical and methodological advances, along with the scientific contributions described above, are intended to promote a more general approach to Reinforcement Learning and, we hope, represent a meaningful step toward the eventual development of a more operational RL agent.
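The sparsely rewarded robotic environments mentioned in the abstract are not named here, but the characteristic reward structure of such goal-conditioned tasks is easy to illustrate (hypothetical threshold and signature, not taken from any particular suite): the agent receives a constant penalty until the goal is reached, with no gradient of progress in between.

```python
import numpy as np

# Generic shape of a sparse, goal-conditioned reward for a reaching task
# (illustrative only; threshold and signature are not taken from any
# particular environment suite).

def sparse_reward(achieved_goal, desired_goal, distance_threshold=0.05):
    """Return 0.0 on success and -1.0 otherwise: no gradient of progress."""
    d = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 0.0 if d < distance_threshold else -1.0

print(sparse_reward([0.10, 0.20, 0.30], [0.11, 0.21, 0.30]))  # close enough:  0.0
print(sparse_reward([0.10, 0.20, 0.30], [0.50, 0.20, 0.30]))  # far away:     -1.0
```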
2

Hanski, Jari, and Kaan Baris Biçak. "An Evaluation of the Unity Machine Learning Agents Toolkit in Dense and Sparse Reward Video Game Environments." Thesis, Uppsala universitet, Institutionen för speldesign, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-444982.

Abstract:
In computer games, one use of artificial intelligence is to create interesting problems for the player. To do this, new techniques such as reinforcement learning allow game developers to create artificial intelligence agents with human-like or superhuman abilities. The Unity ML-Agents toolkit is a plugin that gives game developers access to reinforcement learning algorithms without requiring expertise in machine learning. In this paper, we compare reinforcement learning methods and provide empirical training data from two different environments. First, we describe the chosen reinforcement learning methods and then explain the design of both training environments. We compared their benefits in both dense and sparse reward environments. The reinforcement learning methods were evaluated by comparing the training speed and cumulative rewards of the agents. The goal was to evaluate how much the combination of extrinsic and intrinsic rewards accelerated the training process in the sparse reward environment. We hope this study helps game developers utilize reinforcement learning more effectively, saving time during the training process by choosing the most fitting training method for their video game environment. The results show that, when training reinforcement learning agents in sparse reward environments, the agents trained faster with the combination of extrinsic and intrinsic rewards, whereas an agent trained in a sparse reward environment with only extrinsic rewards failed to learn to complete the task.

Book chapters on the topic "Sparsely rewarded environments":

1

Hensel, Maximilian. "Exploration Methods in Sparse Reward Environments." In Reinforcement Learning Algorithms: Analysis and Applications, 35–45. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-41188-6_4.

2

Moy, Glennn, and Slava Shekh. "Evolution Strategies for Sparse Reward Gridworld Environments." In AI 2022: Advances in Artificial Intelligence, 266–78. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-22695-3_19.

3

Jeewa, Asad, Anban W. Pillay, and Edgar Jembere. "Learning to Generalise in Sparse Reward Navigation Environments." In Artificial Intelligence Research, 85–100. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-66151-9_6.

4

Chen, Zhongpeng, and Qiang Guan. "Continuous Exploration via Multiple Perspectives in Sparse Reward Environment." In Pattern Recognition and Computer Vision, 57–68. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-8435-0_5.

5

Le, Bang-Giang, Thi-Linh Hoang, Hai-Dang Kieu, and Viet-Cuong Ta. "Structural and Compact Latent Representation Learning on Sparse Reward Environments." In Intelligent Information and Database Systems, 40–51. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-5837-5_4.

6

Kang, Yongxin, Enmin Zhao, Yifan Zang, Kai Li, and Junliang Xing. "Towards a Unified Benchmark for Reinforcement Learning in Sparse Reward Environments." In Communications in Computer and Information Science, 189–201. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-1639-9_16.

7

Liu, Xi, Long Ma, Zhen Chen, Changgang Zheng, Ren Chen, Yong Liao, and Shufan Yang. "A Novel State Space Exploration Method for the Sparse-Reward Reinforcement Learning Environment." In Artificial Intelligence XL, 216–21. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-47994-6_18.

8

Xie, Zaipeng, Yufeng Zhang, Chentai Qiao, and Sitong Shen. "IPERS: Individual Prioritized Experience Replay with Subgoals for Sparse Reward Multi-Agent Reinforcement Learning." In Frontiers in Artificial Intelligence and Applications. IOS Press, 2023. http://dx.doi.org/10.3233/faia230586.

Abstract:
Multi-agent reinforcement learning commonly uses a global team reward signal to represent overall collaborative performance. Value decomposition breaks this global reward into estimated individual value functions per agent, enabling efficient training. However, in sparse reward environments, agents struggle to assess if their actions achieve the team goal, slowing convergence. This impedes the algorithm’s convergence rate and overall efficacy. We present IPERS, an Individual Prioritized Experience Replay algorithm with Subgoals for Sparse Reward Multi-Agent Reinforcement Learning. IPERS integrates joint action decomposition and prioritized experience replay, maintaining invariance between global and individual loss gradients. Subgoals serve as intermediate goals that break down complex tasks into simpler steps with dense feedback and provide helpful intrinsic rewards that guide agents. This facilitates learning coordinated policies in challenging collaborative environments with sparse rewards. Experimental evaluations of IPERS in both the SMAC and GRF environments demonstrate rapid adaptation to diverse multi-agent tasks and significant improvements in win rate and convergence performance relative to state-of-the-art algorithms.
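For reference, the prioritized experience replay mechanism that IPERS builds on can be sketched in its basic proportional form (single-agent, without subgoals or joint-action decomposition; capacity and hyperparameters are illustrative, not the paper's settings):

```python
import random

# Bare-bones proportional prioritized experience replay (single-agent sketch).
# IPERS builds on this mechanism but adds per-agent priorities, joint-action
# decomposition, and subgoal-based intrinsic rewards, which are omitted here.

class PrioritizedReplay:
    def __init__(self, capacity=10000, alpha=0.6, beta=0.4, eps=1e-5):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        max_p = max(self.priorities, default=1.0)   # new samples get max priority
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(max_p)

    def sample(self, batch_size):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [p / total for p in scaled]
        idx = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = [(len(self.data) * probs[i]) ** (-self.beta) for i in idx]
        max_w = max(weights)
        return idx, [self.data[i] for i in idx], [w / max_w for w in weights]

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + self.eps
```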
9

Shah, Syed Ihtesham Hussain, Antonio Coronato, and Muddasar Naeem. "Inverse Reinforcement Learning Based Approach for Investigating Optimal Dynamic Treatment Regime." In Ambient Intelligence and Smart Environments. IOS Press, 2022. http://dx.doi.org/10.3233/aise220052.

Abstract:
In recent years, the importance of artificial intelligence (AI) and reinforcement learning (RL) has increased exponentially in healthcare and in learning Dynamic Treatment Regimes (DTR). These techniques are used to learn and recover the best of a doctor's treatment policies. However, methods based on existing RL approaches face several limitations: behavior cloning (BC) methods suffer from compounding errors, and RL techniques use self-defined reward functions that are either too sparse or require clinical guidance. To tackle the limitations associated with RL models, a technique named Inverse Reinforcement Learning (IRL) was introduced, in which the reward function is learned from expert demonstrations. In this paper, we propose an IRL approach for finding the true reward function underlying expert demonstrations. Results show that the rewards obtained with the proposed technique allow an existing RL model to learn faster than with self-defined rewards.
10

Abate, Alessandro, Yousif Almulla, James Fox, David Hyland, and Michael Wooldridge. "Learning Task Automata for Reinforcement Learning Using Hidden Markov Models." In Frontiers in Artificial Intelligence and Applications. IOS Press, 2023. http://dx.doi.org/10.3233/faia230247.

Abstract:
Training reinforcement learning (RL) agents using scalar reward signals is often infeasible when an environment has sparse and non-Markovian rewards. Moreover, handcrafting these reward functions before training is prone to misspecification. We learn non-Markovian finite task specifications as finite-state ‘task automata’ from episodes of agent experience within environments with unknown dynamics. First, we learn a product MDP, a model composed of the specification’s automaton and the environment’s MDP (both initially unknown), by treating it as a partially observable MDP and employing a hidden Markov model learning algorithm. Second, we efficiently distil the task automaton (assumed to be a deterministic finite automaton) from the learnt product MDP. Our automaton enables a task to be decomposed into sub-tasks, so an RL agent can later synthesise an optimal policy more efficiently. It is also an interpretable encoding of high-level task features, so a human can verify that the agent’s learnt tasks have no misspecifications. Finally, we also take steps towards ensuring that the automaton is environment-agnostic, making it well-suited for use in transfer learning.

Conference papers on the topic "Sparsely rewarded environments":

1

Camacho, Alberto, Rodrigo Toro Icarte, Toryn Q. Klassen, Richard Valenzano, and Sheila A. McIlraith. "LTL and Beyond: Formal Languages for Reward Function Specification in Reinforcement Learning." In Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}. California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/840.

Abstract:
In Reinforcement Learning (RL), an agent is guided by the rewards it receives from the reward function. Unfortunately, it may take many interactions with the environment to learn from sparse rewards, and it can be challenging to specify reward functions that reflect complex reward-worthy behavior. We propose using reward machines (RMs), which are automata-based representations that expose reward function structure, as a normal form representation for reward functions. We show how specifications of reward in various formal languages, including LTL and other regular languages, can be automatically translated into RMs, easing the burden of complex reward function specification. We then show how the exposed structure of the reward function can be exploited by tailored q-learning algorithms and automated reward shaping techniques in order to improve the sample efficiency of reinforcement learning methods. Experiments show that these RM-tailored techniques significantly outperform state-of-the-art (deep) RL algorithms, solving problems that otherwise cannot reasonably be solved by existing approaches.
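A reward machine can be written as a small finite automaton whose transitions over high-level labels carry rewards. The toy task and labels below are invented for illustration; the paper compiles such machines automatically from LTL and other formal languages and exploits their structure in tailored q-learning and reward shaping.

```python
# Minimal reward machine sketch: machine states track task progress over
# high-level labels, and rewards are attached to transitions. The task here
# ("pick up the key, then reach the door") and its labels are invented.

class RewardMachine:
    def __init__(self):
        # (rm_state, label) -> (next_rm_state, reward)
        self.delta = {
            ("u0", "key"):  ("u1", 0.0),
            ("u1", "door"): ("u2", 1.0),   # task completed
        }
        self.state = "u0"

    def step(self, label):
        """Advance on the event label observed in the environment."""
        next_state, reward = self.delta.get((self.state, label), (self.state, 0.0))
        self.state = next_state
        return reward

rm = RewardMachine()
for label in ["empty", "key", "empty", "door"]:
    print(label, "->", rm.step(label), rm.state)
```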
2

Bougie, Nicolas, and Ryutaro Ichise. "Towards High-Level Intrinsic Exploration in Reinforcement Learning." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/733.

Abstract:
Deep reinforcement learning (DRL) methods traditionally struggle with tasks where environment rewards are sparse or delayed, which entails that exploration remains one of the key challenges of DRL. Instead of solely relying on extrinsic rewards, many state-of-the-art methods use intrinsic curiosity as exploration signal. While they hold promise of better local exploration, discovering global exploration strategies is beyond the reach of current methods. We propose a novel end-to-end intrinsic reward formulation that introduces high-level exploration in reinforcement learning. Our curiosity signal is driven by a fast reward that deals with local exploration and a slow reward that incentivizes long-time horizon exploration strategies. We formulate curiosity as the error in an agent’s ability to reconstruct the observations given their contexts. Experimental results show that this high-level exploration enables our agents to outperform prior work in several Atari games.
3

Wan, Shanchuan, Yujin Tang, Yingtao Tian, and Tomoyuki Kaneko. "DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards." In Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}. California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/477.

Abstract:
Exploration is a fundamental aspect of reinforcement learning (RL), and its effectiveness is a deciding factor in the performance of RL algorithms, especially when facing sparse extrinsic rewards. Recent studies have shown the effectiveness of encouraging exploration with intrinsic rewards estimated from novelties in observations. However, there is a gap between the novelty of an observation and an exploration, as both the stochasticity in the environment and the agent's behavior may affect the observation. To evaluate exploratory behaviors accurately, we propose DEIR, a novel method in which we theoretically derive an intrinsic reward with a conditional mutual information term that principally scales with the novelty contributed by agent explorations, and then implement the reward with a discriminative forward model. Extensive experiments on both standard and advanced exploration tasks in MiniGrid show that DEIR quickly learns a better policy than the baselines. Our evaluations on ProcGen demonstrate both the generalization capability and the general applicability of our intrinsic reward.
4

Noever, David, and Ryerson Burdick. "Puzzle Solving without Search or Human Knowledge: An Unnatural Language Approach." In 9th International Conference on Artificial Intelligence and Applications (AIAPP 2022). Academy and Industry Research Collaboration Center (AIRCC), 2022. http://dx.doi.org/10.5121/csit.2022.120902.

Abstract:
The application of Generative Pre-trained Transformer (GPT-2) to learn text-archived game notation provides a model environment for exploring sparse reward gameplay. The transformer architecture proves amenable to training on solved text archives describing mazes, Rubik’s Cube, and Sudoku solvers. The method benefits from fine-tuning the transformer architecture to visualize plausible strategies derived outside any guidance from human heuristics or domain expertise. The large search space (>10^19) for the games provides a puzzle environment in which the solution has few intermediate rewards and a final move that solves the challenge.
5

Chatterjee, Palash, Ashutosh Chapagain, Weizhe Chen, and Roni Khardon. "DiSProD: Differentiable Symbolic Propagation of Distributions for Planning." In Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}. California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/591.

Abstract:
The paper introduces DiSProD, an online planner developed for environments with probabilistic transitions in continuous state and action spaces. DiSProD builds a symbolic graph that captures the distribution of future trajectories, conditioned on a given policy, using independence assumptions and approximate propagation of distributions. The symbolic graph provides a differentiable representation of the policy's value, enabling efficient gradient-based optimization for long-horizon search. The propagation of approximate distributions can be seen as an aggregation of many trajectories, making it well-suited for dealing with sparse rewards and stochastic environments. An extensive experimental evaluation compares DiSProD to state-of-the-art planners in discrete-time planning and real-time control of robotic systems. The proposed method improves over existing planners in handling stochastic environments, sensitivity to search depth, sparsity of rewards, and large action spaces. Additional real-world experiments demonstrate that DiSProD can control ground vehicles and surface vessels to successfully navigate around obstacles.
6

Xu, Pei, Junge Zhang, and Kaiqi Huang. "Exploration via Joint Policy Diversity for Sparse-Reward Multi-Agent Tasks." In Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}. California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/37.

Abstract:
Exploration under sparse rewards is a key challenge for multi-agent reinforcement learning problems. Previous works argue that complex dynamics between agents and the huge exploration space in MARL scenarios amplify the vulnerability of classical count-based exploration methods when combined with agents parameterized by neural networks, resulting in inefficient exploration. In this paper, we show that introducing constrained joint policy diversity into a classical count-based method can significantly improve exploration when agents are parameterized by neural networks. Specifically, we propose a joint policy diversity to measure the difference between current joint policy and previous joint policies, and then use a filtering-based exploration constraint to further refine the joint policy diversity. Under the sparse-reward setting, we show that the proposed method significantly outperforms the state-of-the-art methods in the multiple-particle environment, the Google Research Football, and StarCraft II micromanagement tasks. To the best of our knowledge, on the hard 3s_vs_5z task which needs non-trivial strategies to defeat enemies, our method is the first to learn winning strategies without domain knowledge under the sparse-reward setting.
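The classical count-based exploration baseline referred to in this abstract can be written in a few lines for the multi-agent case by counting discretised joint observations (the hashing, rounding, and bonus scale below are hypothetical); the paper's contribution, constrained joint policy diversity, is layered on top of such a signal and is not shown here.

```python
from collections import defaultdict

# Classical count-based bonus over discretised *joint* observations; the bonus
# scale and rounding are illustrative. The cited method augments this kind of
# count-based signal with a constrained joint-policy-diversity term.

counts = defaultdict(int)

def count_bonus(joint_obs, beta=0.1, precision=1):
    key = tuple(round(x, precision) for agent_obs in joint_obs for x in agent_obs)
    counts[key] += 1
    return beta / counts[key] ** 0.5   # rarely seen joint states earn larger bonuses

print(count_bonus([[0.11, 0.52], [0.90, 0.13]]))  # first visit: 0.1
print(count_bonus([[0.12, 0.49], [0.91, 0.08]]))  # rounds to the same key: smaller bonus
```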
7

Memarian, Farzan, Wonjoon Goo, Rudolf Lioutikov, Scott Niekum, and Ufuk Topcu. "Self-Supervised Online Reward Shaping in Sparse-Reward Environments." In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021. http://dx.doi.org/10.1109/iros51168.2021.9636020.

8

Lin, Xingyu, Pengsheng Guo, Carlos Florensa, and David Held. "Adaptive Variance for Changing Sparse-Reward Environments." In 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019. http://dx.doi.org/10.1109/icra.2019.8793650.

9

Seurin, Mathieu, Florian Strub, Philippe Preux, and Olivier Pietquin. "Don’t Do What Doesn’t Matter: Intrinsic Motivation with Action Usefulness." In Thirtieth International Joint Conference on Artificial Intelligence {IJCAI-21}. California: International Joint Conferences on Artificial Intelligence Organization, 2021. http://dx.doi.org/10.24963/ijcai.2021/406.

Abstract:
Sparse rewards are double-edged training signals in reinforcement learning: easy to design but hard to optimize. Intrinsic motivation methods have thus been developed to alleviate the resulting exploration problem. They usually incentivize agents to look for new states through novelty signals. Yet, such methods encourage exhaustive exploration of the state space rather than focusing on the environment's salient interaction opportunities. We propose a new exploration method, called Don't Do What Doesn't Matter (DoWhaM), shifting the emphasis from state novelty to states with relevant actions. While most actions consistently change the state when used, e.g. moving the agent, some actions are only effective in specific states, e.g., opening a door, grabbing an object. DoWhaM detects and rewards actions that seldom affect the environment. We evaluate DoWhaM on the procedurally-generated environment MiniGrid against state-of-the-art methods. Experiments consistently show that DoWhaM greatly reduces sample complexity, establishing a new state of the art in MiniGrid.
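A rough approximation of the action-usefulness idea (not the exact DoWhaM bonus) is to track how often each action is used versus how often it actually changes the state, and to pay a larger bonus when a rarely-effective action does have an effect:

```python
from collections import defaultdict

# Rough approximation of an action-usefulness bonus in the spirit of DoWhaM
# (not the exact formulation from the paper): actions that rarely change the
# state are rewarded on the occasions when they do.

used = defaultdict(int)       # how often each action was taken
effective = defaultdict(int)  # how often it actually changed the state

def usefulness_bonus(action, state, next_state, scale=0.1):
    used[action] += 1
    if next_state == state:
        return 0.0                       # no effect, no bonus
    effective[action] += 1
    rarity = 1.0 - effective[action] / used[action]
    return scale * (1.0 + rarity)        # rarely-effective actions earn more

# "move" almost always changes the state; "toggle" only works in front of a door.
print(usefulness_bonus("move", (0, 0), (0, 1)))
print(usefulness_bonus("toggle", (0, 1), (0, 1)))
print(usefulness_bonus("toggle", (0, 1), "door_open"))
```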
10

Juliani, Arthur, Ahmed Khalifa, Vincent-Pierre Berges, Jonathan Harper, Ervin Teng, Hunter Henry, Adam Crespi, Julian Togelius, and Danny Lange. "Obstacle Tower: A Generalization Challenge in Vision, Control, and Planning." In Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}. California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/373.

Abstract:
The rapid pace of recent research in AI has been driven in part by the presence of fast and challenging simulation environments. These environments often take the form of games; with tasks ranging from simple board games, to competitive video games. We propose a new benchmark - Obstacle Tower: a high fidelity, 3D, 3rd person, procedurally generated environment. An agent in Obstacle Tower must learn to solve both low-level control and high-level planning problems in tandem while learning from pixels and a sparse reward signal. Unlike other benchmarks such as the Arcade Learning Environment, evaluation of agent performance in Obstacle Tower is based on an agent's ability to perform well on unseen instances of the environment. In this paper we outline the environment and provide a set of baseline results produced by current state-of-the-art Deep RL methods as well as human players. These algorithms fail to produce agents capable of performing near human level.
