Academic literature on the topic 'Factored reinforcement learning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Factored reinforcement learning.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Factored reinforcement learning":

1. Wu, Bo, Yan Peng Feng, and Hong Yan Zheng. "A Model-Based Factored Bayesian Reinforcement Learning Approach." Applied Mechanics and Materials 513-517 (February 2014): 1092–95. http://dx.doi.org/10.4028/www.scientific.net/amm.513-517.1092.
Abstract:
Bayesian reinforcement learning has turned out to be an effective solution to the optimal tradeoff between exploration and exploitation. However, in practical applications, the exponential growth of the learning parameters is the main impediment to online planning and learning. To overcome this problem, we bring factored representations, model-based learning, and Bayesian reinforcement learning together in a new approach. First, we exploit a factored representation to describe the states, which reduces the number of learning parameters, and adopt a Bayesian inference method to learn the unknown structure and parameters simultaneously. Then, we use an online point-based value iteration algorithm to plan and learn. The experimental results show that the proposed approach is an effective way of improving learning efficiency in large-scale state spaces.
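To make the factored-representation idea in this abstract concrete, here is a minimal illustrative sketch (not code from the paper; the variable names, parent sets, and probabilities are invented) of a DBN-style transition model in which each next-state variable depends only on a few parents, so the parameter count grows with the local scopes rather than with the full joint state space:

```python
# Illustrative sketch of a factored (DBN-style) transition model; the variable
# names, parents, and probabilities are invented for the example, not from the paper.

# Per-variable conditional probability tables:
# var -> ((parent values), action) -> {next value: probability}
cpts = {
    "battery": {
        ((1,), "move"): {1: 0.8, 0: 0.2}, ((0,), "move"): {0: 1.0},
        ((1,), "wait"): {1: 1.0},         ((0,), "wait"): {0: 1.0},
    },
    "location": {
        ((0,), "move"): {1: 1.0}, ((1,), "move"): {1: 1.0},
        ((0,), "wait"): {0: 1.0}, ((1,), "wait"): {1: 1.0},
    },
}
parents = {"battery": ["battery"], "location": ["location"]}

def transition_prob(state: dict, action: str, next_state: dict) -> float:
    """Joint transition probability as a product of per-variable conditionals."""
    prob = 1.0
    for var, cpt in cpts.items():
        parent_vals = tuple(state[p] for p in parents[var])
        prob *= cpt[(parent_vals, action)].get(next_state[var], 0.0)
    return prob

# Two binary variables need a handful of local parameters instead of one flat
# joint transition table, which is the saving factored representations exploit.
print(transition_prob({"battery": 1, "location": 0}, "move",
                      {"battery": 1, "location": 1}))  # 0.8
```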
2. Li, Chao, Yupeng Zhang, Jianqi Wang, Yujing Hu, Shaokang Dong, Wenbin Li, Tangjie Lv, Changjie Fan, and Yang Gao. "Optimistic Value Instructors for Cooperative Multi-Agent Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (March 24, 2024): 17453–60. http://dx.doi.org/10.1609/aaai.v38i16.29694.
Abstract:
In cooperative multi-agent reinforcement learning, decentralized agents hold the promise of overcoming the combinatorial explosion of the joint action space and enabling greater scalability. However, they are susceptible to a game-theoretic pathology called relative overgeneralization (RO) that shadows the optimal joint action. Although recent value-decomposition algorithms guide decentralized agents by learning a factored global action value function, the representational limitation and the inaccurate sampling of optimal joint actions during the learning process mean the problem persists. To address this limitation, this paper proposes a novel algorithm called Optimistic Value Instructors (OVI). The main idea behind OVI is to introduce multiple optimistic instructors into the value-decomposition paradigm, which are capable of suggesting potentially optimal joint actions and rectifying the factored global action value function to recover these optimal actions. Specifically, the instructors maintain optimistic value estimations of per-agent local actions and thus eliminate the negative effects caused by other agents' exploratory or sub-optimal non-cooperation, enabling accurate identification and suggestion of optimal joint actions. Based on the instructors' suggestions, the paper further presents two instructive constraints to rectify the factored global action value function to recover these optimal joint actions, thus overcoming the RO problem. Experimental evaluation of OVI on various cooperative multi-agent tasks demonstrates its superior performance against multiple baselines, highlighting its effectiveness.
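As a rough intuition for the "optimistic instructor" idea (a hedged sketch only; the class and update rule below are invented and are not the authors' implementation), each agent can keep an optimistic per-local-action value estimate that is never dragged down by teammates' exploratory mistakes:

```python
# Hedged sketch of optimistic per-agent local value estimates (illustrative only).
from collections import defaultdict

class OptimisticInstructor:
    def __init__(self, learning_rate=0.1):
        self.q = defaultdict(float)  # optimistic estimate per local action
        self.lr = learning_rate

    def update(self, local_action, observed_return):
        # Only move the estimate upward: poor returns caused by teammates'
        # exploratory or sub-optimal actions do not lower the estimate.
        target = self.q[local_action] + self.lr * (observed_return - self.q[local_action])
        self.q[local_action] = max(self.q[local_action], target)

    def suggest(self):
        # Suggest the local action with the highest optimistic value.
        return max(self.q, key=self.q.get) if self.q else None

instructor = OptimisticInstructor()
instructor.update("cooperate", 10.0)   # good joint outcome
instructor.update("cooperate", 0.0)    # teammate defected; estimate is unchanged
print(instructor.suggest(), round(instructor.q["cooperate"], 2))  # cooperate 1.0
```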
3. Kveton, Branislav, and Georgios Theocharous. "Structured Kernel-Based Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 27, no. 1 (June 30, 2013): 569–75. http://dx.doi.org/10.1609/aaai.v27i1.8669.
Abstract:
Kernel-based reinforcement learning (KBRL) is a popular approach to learning non-parametric value function approximations. In this paper, we present structured KBRL, a paradigm for kernel-based RL that allows for modeling independencies in the transition and reward models of problems. Real-world problems often exhibit this structure and can be solved more efficiently when it is modeled. We make three contributions. First, we motivate our work, define a structured backup operator, and prove that it is a contraction. Second, we show how to evaluate our operator efficiently. Our analysis reveals that the fixed point of the operator is the optimal value function in a special factored MDP. Finally, we evaluate our method on a synthetic problem and compare it to two KBRL baselines. In most experiments, we learn better policies than the baselines from an order of magnitude less training data.
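For orientation, a plain (unstructured) kernel-based backup of the kind KBRL builds on can be sketched as follows; this is a generic illustration under assumed notation and sample data, not the structured operator defined in the paper:

```python
# Generic kernel-based RL backup (illustrative, unstructured version; the
# paper's contribution is a *structured* backup operator not shown here).
import math

gamma = 0.9

# Sample transitions per action: (state, reward, next_state); states are scalars here.
samples = {
    "left":  [(0.0, 0.0, -1.0), (1.0, 0.0, 0.0)],
    "right": [(0.0, 0.0, 1.0),  (1.0, 1.0, 2.0)],
}

def kernel(s, s_i, bandwidth=1.0):
    return math.exp(-((s - s_i) ** 2) / bandwidth)

def backup(s, value):
    """One KBRL-style Bellman backup at query state s, using a value function
    `value` evaluated at the sampled next states."""
    q_values = {}
    for action, trans in samples.items():
        weights = [kernel(s, s_i) for (s_i, _, _) in trans]
        total = sum(weights)
        q_values[action] = sum(
            w / total * (r + gamma * value(s_next))
            for w, (_, r, s_next) in zip(weights, trans)
        )
    return max(q_values.values()), q_values

v, q = backup(0.5, value=lambda s: 0.0)
print(q)  # expected one-step returns per action at query state s = 0.5
```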
4. Simão, Thiago D., and Matthijs T. J. Spaan. "Safe Policy Improvement with Baseline Bootstrapping in Factored Environments." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4967–74. http://dx.doi.org/10.1609/aaai.v33i01.33014967.
Abstract:
We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent’s behavior. Factored reinforcement learning, on the other hand, is known to make good use of the data provided. It can achieve a better sample complexity by exploiting independence between features of the environment, but it lacks a confidence level. We study how to improve the sample efficiency of the safe policy improvement with baseline bootstrapping algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm.
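The baseline-bootstrapping idea the abstract builds on can be pictured with a small, simplified sketch (the counts, threshold, and names are assumptions, and the factored refinement contributed by the paper is not shown): the improved policy only deviates from the baseline on state-action pairs observed often enough in the batch.

```python
# Simplified SPIBB-style bootstrapping rule (illustrative only): keep the
# baseline's action wherever the batch gives too little evidence to deviate.
from collections import Counter

N_MIN = 5  # confidence threshold on state-action counts (assumed value)

def improved_policy(state, batch_counts: Counter, baseline, greedy):
    """Return the greedy (improved) action only when the state-action pair was
    seen at least N_MIN times in the batch; otherwise fall back to the baseline."""
    candidate = greedy(state)
    if batch_counts[(state, candidate)] >= N_MIN:
        return candidate
    return baseline(state)

# Toy usage: counts collected from a batch of past experience.
counts = Counter({("s0", "a1"): 12, ("s1", "a1"): 2})
baseline = lambda s: "a0"
greedy = lambda s: "a1"
print(improved_policy("s0", counts, baseline, greedy))  # a1 (enough data to deviate)
print(improved_policy("s1", counts, baseline, greedy))  # a0 (bootstrap to baseline)
```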
5. Truong, Van Binh, and Long Bao Le. "Electric vehicle charging design: The factored action based reinforcement learning approach." Applied Energy 359 (April 2024): 122737. http://dx.doi.org/10.1016/j.apenergy.2024.122737.
6. Simm, Jaak, Masashi Sugiyama, and Hirotaka Hachiya. "Multi-Task Approach to Reinforcement Learning for Factored-State Markov Decision Problems." IEICE Transactions on Information and Systems E95.D, no. 10 (2012): 2426–37. http://dx.doi.org/10.1587/transinf.e95.d.2426.
7. Wang, Zizhao, Caroline Wang, Xuesu Xiao, Yuke Zhu, and Peter Stone. "Building Minimal and Reusable Causal State Abstractions for Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (March 24, 2024): 15778–86. http://dx.doi.org/10.1609/aaai.v38i14.29507.
Abstract:
Two desiderata of reinforcement learning (RL) algorithms are the ability to learn from relatively little experience and the ability to learn policies that generalize to a range of problem specifications. In factored state spaces, one approach towards achieving both goals is to learn state abstractions, which only keep the necessary variables for learning the tasks at hand. This paper introduces Causal Bisimulation Modeling (CBM), a method that learns the causal relationships in the dynamics and reward functions for each task to derive a minimal, task-specific abstraction. CBM leverages and improves implicit modeling to train a high-fidelity causal dynamics model that can be reused for all tasks in the same environment. Empirical validation on two manipulation environments and four tasks reveals that CBM's learned implicit dynamics models identify the underlying causal relationships and state abstractions more accurately than explicit ones. Furthermore, the derived state abstractions allow a task learner to achieve near-oracle levels of sample efficiency and outperform baselines on all tasks.
8. Mohamad Hafiz Abu Bakar, Abu Ubaidah bin Shamsudin, Ruzairi Abdul Rahim, Zubair Adil Soomro, and Andi Adrianshah. "Comparison Method Q-Learning and SARSA for Simulation of Drone Controller using Reinforcement Learning." Journal of Advanced Research in Applied Sciences and Engineering Technology 30, no. 3 (May 15, 2023): 69–78. http://dx.doi.org/10.37934/araset.30.3.6978.
Abstract:
Drone technology is advancing as part of an increasingly technology-driven world. One aspect emphasized here is the difficulty of controlling a drone, since existing systems still rely on full manual control by the user. Reinforcement Learning is used to let the system operate autonomously, so that the drone learns its next movement from the interaction between the agent and the environment. In this study, Q-Learning and State-Action-Reward-State-Action (SARSA) are applied to an autonomous drone controller, and the performance and effectiveness of the two methods are compared in simulation. The simulation results show that Q-Learning trains the drone controller to reach the desired behaviour more effectively, and with better performance, than the SARSA algorithm.
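Since the comparison in this abstract hinges on the difference between the two update rules, here are the standard Q-Learning (off-policy) and SARSA (on-policy) tabular updates in a minimal sketch (generic textbook forms, not the authors' simulation code):

```python
# Standard tabular update rules compared in the paper (generic textbook forms).
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # Q[(state, action)]

def q_learning_update(s, a, r, s_next, actions):
    # Off-policy: bootstrap on the best next action, regardless of the action taken.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap on the action actually taken next.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

actions = ["up", "down", "left", "right"]
q_learning_update("hover", "up", 1.0, "climb", actions)
sarsa_update("climb", "up", 0.5, "cruise", "right")
print(dict(Q))
```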
9. Kong, Minseok, and Jungmin So. "Empirical Analysis of Automated Stock Trading Using Deep Reinforcement Learning." Applied Sciences 13, no. 1 (January 3, 2023): 633. http://dx.doi.org/10.3390/app13010633.
Abstract:
There are several automated stock trading programs using reinforcement learning, one of which is an ensemble strategy. The main idea of the ensemble strategy is to train DRL agents and make an ensemble with three different actor–critic algorithms: Advantage Actor–Critic (A2C), Deep Deterministic Policy Gradient (DDPG), and Proximal Policy Optimization (PPO). This novel idea was the concept mainly used in this paper. However, we did not stop there, but we refined the automated stock trading in two areas. First, we made another DRL-based ensemble and employed it as a new trading agent. We named it Remake Ensemble, and it combines not only A2C, DDPG, and PPO but also Actor–Critic using Kronecker-Factored Trust Region (ACKTR), Soft Actor–Critic (SAC), Twin Delayed DDPG (TD3), and Trust Region Policy Optimization (TRPO). Furthermore, we expanded the application domain of automated stock trading. Although the existing stock trading method treats only 30 Dow Jones stocks, ours handles KOSPI stocks, JPX stocks, and Dow Jones stocks. We conducted experiments with our modified automated stock trading system to validate its robustness in terms of cumulative return. Finally, we suggested some methods to gain relatively stable profits following the experiments.
10. Mutti, Mirco, Riccardo De Santi, Emanuele Rossi, Juan Felipe Calderon, Michael Bronstein, and Marcello Restelli. "Provably Efficient Causal Model-Based Reinforcement Learning for Systematic Generalization." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 9251–59. http://dx.doi.org/10.1609/aaai.v37i8.26109.
Abstract:
In the sequential decision making setting, an agent aims to achieve systematic generalization over a large, possibly infinite, set of environments. Such environments are modeled as discrete Markov decision processes with both states and actions represented through a feature vector. The underlying structure of the environments allows the transition dynamics to be factored into two components: one that is environment-specific and another that is shared. Consider a set of environments that share the laws of motion as an example. In this setting, the agent can take a finite amount of reward-free interactions from a subset of these environments. The agent then must be able to approximately solve any planning task defined over any environment in the original set, relying on the above interactions only. Can we design a provably efficient algorithm that achieves this ambitious goal of systematic generalization? In this paper, we give a partially positive answer to this question. First, we provide a tractable formulation of systematic generalization by employing a causal viewpoint. Then, under specific structural assumptions, we provide a simple learning algorithm that guarantees any desired planning error up to an unavoidable sub-optimality term, while showcasing a polynomial sample complexity.

Dissertations / Theses on the topic "Factored reinforcement learning":

1. Kozlova, Olga. "Hierarchical and factored reinforcement learning." Paris 6, 2010. http://www.theses.fr/2010PA066196.
Abstract:
Hierarchical and factored reinforcement learning (HFRL) methods are based on the formalisms of factored Markov decision processes (FMDPs) and hierarchical MDPs (HMDPs). In this thesis, we propose an HFRL method that uses indirect (model-based) reinforcement learning approaches and the options framework to solve decision-making problems in dynamic environments without prior knowledge of the problem structure. In the first contribution of this thesis, we show how to model problems in which certain combinations of variables do not exist, and we demonstrate the performance of our algorithms on classical toy problems from the literature, MAZE6 and BLOCKSWORLD, in comparison with the standard approach. The second contribution of this thesis is TeXDYNA, an algorithm for solving large MDPs whose structure is unknown. TeXDYNA hierarchically decomposes the FMDP based on the automatic discovery of subtasks directly from the problem structure, which is itself learned through interaction with the environment. We evaluate TeXDYNA on two benchmarks, namely the TAXI and LIGHTBOX problems. Finally, we assess the potential and limitations of TeXDYNA on a toy problem more representative of the industrial simulation domain.
2. Tournaire, Thomas. "Model-based reinforcement learning for dynamic resource allocation in cloud environments." Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAS004.
Abstract:
The emergence of new technologies (Internet of Things, smart cities, autonomous vehicles, health, industrial automation, ...) requires efficient resource allocation to satisfy the demand. These new offers are compatible with the new 5G network infrastructure, since it can provide low latency and reliability. However, these new needs require high computational power to fulfill the demand, implying more energy consumption, in particular in cloud infrastructures and more particularly in data centers. Therefore, it is critical to find new solutions that can satisfy these needs while reducing the power usage of resources in cloud environments. In this thesis we propose and compare new AI solutions (Reinforcement Learning) to orchestrate virtual resources in virtual network environments such that performance is guaranteed and operational costs are minimised. We consider queuing systems as a model for cloud IaaS infrastructures and bring learning methodologies to efficiently allocate the right number of resources for the users. Our objective is to minimise a cost function considering performance costs and operational costs. We go through different types of reinforcement learning algorithms (from model-free to relational model-based) to learn the best policy. Reinforcement learning is concerned with how a software agent ought to take actions in an environment to maximise some cumulative reward. We first develop a queuing model of a cloud system with one physical node hosting several virtual resources. In this first part we assume the agent perfectly knows the model (dynamics of the environment and the cost function), giving it the opportunity to use dynamic programming methods for optimal policy computation. Since the model is known in this part, we also concentrate on the properties of the optimal policies, which are threshold-based and hysteresis-based rules. This allows us to integrate the structural property of the policies into MDP algorithms. After providing a concrete cloud model with exponential arrivals with real intensities and energy data for the cloud provider, we compare in this first approach the efficiency and computation time of MDP algorithms against heuristics built on top of the stationary distributions of the queuing Markov chain. In a second part we consider that the agent does not have access to the model of the environment and concentrate our work on reinforcement learning techniques, especially model-based reinforcement learning. We first develop model-based reinforcement learning methods where the agent can re-use its experience replay to update its value function. We also consider online MDP techniques where the autonomous agent approximates the environment model to perform dynamic programming. This part is evaluated in a larger network environment with two physical nodes in tandem, and we assess the convergence time and accuracy of the different reinforcement learning methods, mainly model-based techniques versus state-of-the-art model-free methods (e.g. Q-Learning). The last part focuses on model-based reinforcement learning techniques with a relational structure between environment variables. As these tandem networks have structural properties due to the shape of the infrastructure, we integrate factored and causal approaches into reinforcement learning methods to include this knowledge. We provide the autonomous agent with a relational knowledge of the environment that allows it to understand how the variables are related to each other. The main goal is to accelerate convergence: first with a more compact representation through factorisation, where we devise an online factored-MDP algorithm that we evaluate against model-free and model-based reinforcement learning algorithms; second by integrating causal and counterfactual reasoning, which can tackle environments with partial observations and unobserved confounders.
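To illustrate the threshold-and-hysteresis policy structure mentioned in this abstract (an illustrative sketch with invented thresholds and names, not the thesis's model), an activation rule for virtual resources might look like:

```python
# Illustrative hysteresis-based activation rule for virtual resources
# (thresholds and variable names are invented for the example).
ACTIVATE_AT = 8    # queue length at which one more VM is switched on
DEACTIVATE_AT = 2  # queue length at which one VM is switched off

def next_num_vms(queue_length, num_vms, max_vms=10, min_vms=1):
    """Hysteresis: two different thresholds for scaling up and down, so the
    controller does not oscillate when the load sits near a single threshold."""
    if queue_length >= ACTIVATE_AT and num_vms < max_vms:
        return num_vms + 1
    if queue_length <= DEACTIVATE_AT and num_vms > min_vms:
        return num_vms - 1
    return num_vms

print(next_num_vms(queue_length=9, num_vms=3))  # 4: scale up
print(next_num_vms(queue_length=5, num_vms=4))  # 4: hold (inside the hysteresis band)
print(next_num_vms(queue_length=1, num_vms=4))  # 3: scale down
```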
3. Magnan, Jean-Christophe. "Représentations graphiques de fonctions et processus décisionnels Markoviens factorisés." Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066042/document.
Abstract:
In decision-theoretic planning, the factored framework (Factored Markov Decision Process, FMDP) has produced several efficient algorithms for solving large sequential decision-making problems under uncertainty. The efficiency of these algorithms relies on data structures such as decision trees or algebraic decision diagrams (ADDs). These planning techniques are exploited in Reinforcement Learning by the SDYNA architecture in order to solve large, unknown problems. However, the state-of-the-art learning and planning algorithms used in SDYNA require the problem to be specified using only binary variables and/or rely on data structures whose compactness can be improved. In this thesis, we present our research work on designing and using a more efficient and less restrictive data structure, and on integrating it into a new instance of the SDYNA architecture. In the first part, we present the state-of-the-art modeling tools used by algorithms that tackle large sequential decision-making problems under uncertainty, detailing modeling with decision trees and ADDs. We then introduce the Ordered and Reduced Graphical Representation of Functions (ORGRFs), a new data structure proposed in this thesis to address the problems inherent to ADDs, and demonstrate that ORGRFs improve on ADDs for modeling large problems. In the second part, we turn to solving large sequential decision problems under uncertainty with Dynamic Programming. After introducing the main solution algorithms, we examine their factored variants in detail, point out where these factored versions can be improved, and describe a new version of these algorithms that improves on those points and exploits the previously introduced ORGRFs. In the last part, we address the use of FMDPs in Reinforcement Learning and introduce a new learning algorithm dedicated to the proposed data structure. Building on this algorithm, a new instance of the SDYNA architecture based on ORGRFs, the SPIMDDI instance, is proposed and tested on several standard problems from the literature. Finally, we present further work around this new instance: a new algorithm for managing the exploration-exploitation trade-off, intended to simplify F-RMax, and an application of SPIMDDI to unit management in a real-time strategy video game.
4. Heron, Michael James. "The ACCESS Framework: reinforcement learning for accessibility and cognitive support for older adults." Thesis, University of Dundee, 2011. https://discovery.dundee.ac.uk/en/studentTheses/0952d5ff-7a23-4c29-b050-fd799035652c.
Abstract:
This dissertation focuses on the ACCESS Framework, an open source software framework designed to address four issues regarding older and novice users with accessibility needs: that they often do not know what support is available within their systems, that they often do not know how to change those settings they know exist, that they often lack the confidence to make the changes they know how to make, and that they are often unable to physically enable accessibility support. The software discussed in this dissertation serves as a bridge between what users are expected to know and what they actually know by assuming the responsibility for identifying user accessibility requirements and making those changes on the user's behalf. User interaction with the framework is limited to expressing approval or disapproval of corrective action. Individual corrections are deployed as plug-ins within this tool. Four studies were conducted during this research. Three of these studies were aimed at evaluating the ACCESS Framework directly, with the remaining study being an exploration of a cognitive support tool deployed using the framework. Two of these studies involved participants attempting to perform specific, well-defined tasks on systems that had been configured to the extremes of what was possible with operating system settings. These tasks were attempted with and without the support of the framework. The final study was a focus group in which issues of the framework were discussed by individuals who had been through the experimental trials. The research provided strong evidence that this is an effective mechanism for accessibility configuration when there is a strong match between identified accessibility needs and available operating system support. The system was seen as understandable, useful and appropriate by participants, with a majority stating that they would be willing to use a similar system on their own machines.
5. Al-Safi, Abdullah Taha. "Social reinforcement and risk-taking factors to enhance creativity in Saudi Arabian school children." Thesis, Cardiff University, 1988. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.296226.
6. Abed-Alguni, Bilal Hashem Kalil. "Cooperative reinforcement learning for independent learners." Thesis, 2014. http://hdl.handle.net/1959.13/1052917.
Abstract:
Research Doctorate - Doctor of Philosophy (PhD)
Machine learning in multi-agent domains poses several research challenges. One challenge is how to model cooperation between reinforcement learners. Cooperation between independent reinforcement learners is known to accelerate convergence to optimal solutions. In large state space problems, independent reinforcement learners normally cooperate to accelerate the learning process using decomposition techniques or knowledge sharing strategies. This thesis presents two techniques to multi-agent reinforcement learning and a comparison study. The first technique is a formal decomposition model and an algorithm for distributed systems. The second technique is a cooperative Q-learning algorithm for multi-goal decomposable systems. The comparison study compares the performance of some of the best known cooperative Q-learning algorithms for independent learners. Distributed systems are normally organised into two levels: system and subsystem levels. This thesis presents a formal solution for decomposition of Markov Decision Processes (MDPs) in distributed systems that takes advantage of the organisation of distributed systems and provides support for migration of learners. This is accomplished by two proposals: a Distributed, Hierarchical Learning Model (DHLM) and an Intelligent Distributed Q-Learning algorithm (IDQL) that are based on three specialisations of agents: workers, tutors and consultants. Worker agents are the actual learners and performers of tasks, while tutor agents and consultant agents are coordinators at the subsystem level and the system level, respectively. A main duty of consultant and tutor agents is the assignment of problem space to worker agents. The experimental results in a distributed hunter prey problem suggest that IDQL converges to a solution faster than the single agent Q-learning approach. An important feature of DHLM is that it provides a solution for migration of agents. This feature provides support for the IDQL algorithm where the problem space of each worker agent can change dynamically. Other hierarchical RL models do not cover this issue. Problems that have multiple goal-states can be decomposed into sub-problems by taking advantage of the loosely-coupled bonds among the goal states. In such problems, each goal state and its problem space form a sub-problem. This thesis introduces Q-learning with Aggregation algorithm (QA-learning), an algorithm for problems with multiple goal-states that is based on two roles: learner and tutor. A learner is an agent that learns and uses the knowledge of its neighbours (tutors) to construct its Q-table. A tutor is a learner that is ready to share its Q-table with its neighbours (learners). These roles are based on the concept of learners reusing tutors' sub-solutions. This algorithm provides solutions to problems with multiple goal-states. In this algorithm, each learner incorporates its tutors' knowledge into its own Q-table calculations. A comprehensive solution can then be obtained by combining these partial solutions. The experimental results in an instance of the shortest path problem suggest that the output of QA-learning is comparable to the output of a single Q-learner whose problem space is the whole system. But the QA-learning algorithm converges to a solution faster than a single learner approach. Cooperative Q-learning algorithms for independent learners accelerate the learning process of individual learners. 
In this type of Q-learning, independent learners share and update their Q-values by following a sharing strategy after some episodes learning independently. This thesis presents a comparison study of the performance of some famous cooperative Q-learning algorithms (BEST-Q, AVE-Q, PSO-Q, and WSS) as well as an algorithm that aggregates their results. These algorithms are compared in two cases: equal experience and different experiences cases. In the first case, the learners have equal learning time, while in the second case, the learners have different learning times. The comparison study also examines the effects of the frequency of Q-value sharing on the learning speed of independent learners. The experimental results in the equal experience case indicate that sharing of Q-values is not beneficial and produces similar results to single agent Q-learning. While, the experimental results in the different experiences case suggest that each of the cooperative Q-learning algorithms performs similarly, but better than single agent Q-learning. In both cases, high-frequency sharing of Q-values accelerates the convergence to optimal solutions compared to low-frequency sharing. Low-frequency Q-value sharing degrades the performance of the cooperative Q-learning algorithms in the equal experience and different experiences cases.
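As a concrete picture of the Q-value sharing examined in this comparison study, here is a hedged sketch of a periodic averaging step between independent learners, roughly in the spirit of an averaging strategy such as AVE-Q (the function and details below are assumed, not taken from the thesis):

```python
# Hedged sketch of periodic Q-value sharing between independent learners.
from collections import defaultdict

def share_q_tables(q_tables):
    """Average the learners' Q-values entry by entry and copy the result back,
    so experience gathered by one learner benefits the others."""
    keys = set().union(*(q.keys() for q in q_tables))
    averaged = {k: sum(q.get(k, 0.0) for q in q_tables) / len(q_tables) for k in keys}
    for q in q_tables:
        q.clear()
        q.update(averaged)

q1 = defaultdict(float, {("s0", "a0"): 1.0})
q2 = defaultdict(float, {("s0", "a0"): 0.0, ("s0", "a1"): 0.5})
share_q_tables([q1, q2])
print(dict(q1))  # {('s0', 'a0'): 0.5, ('s0', 'a1'): 0.25}
```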
7. Baker, Travis Edward. "Genetics, drugs, and cognitive control: uncovering individual differences in substance dependence." Thesis, 2012. http://hdl.handle.net/1828/4265.
Abstract:
Why is it that only some people who use drugs actually become addicted? In fact, addiction depends on a complicated process involving a confluence of risk factors related to biology, cognition, behaviour, and personality. Notably, all addictive drugs act on a neural system for reinforcement learning called the midbrain dopamine system, which projects to and regulates the brain's system for cognitive control, called frontal cortex and basal ganglia. Further, the development and expression of the dopamine system is determined in part by genetic factors that vary across individuals such that dopamine related genes are partly responsible for addiction-proneness. Taken together, these observations suggest that the cognitive and behavioral impairments associated with substance abuse result from the impact of disrupted dopamine signals on frontal brain areas involved in cognitive control: By acting on the abnormal reinforcement learning system of the genetically vulnerable, addictive drugs hijack the control system to reinforce maladaptive drug-taking behaviors. The goal of this research was to investigate this hypothesis by conducting a series of experiments that assayed the integrity of the dopamine system and its neural targets involved in cognitive control and decision making in young adults using a combination of electrophysiological, behavioral, and genetic assays together with surveys of substance use and personality. First, this research demonstrated that substance dependent individuals produce an abnormal Reward-positivity, an electrophysiological measure of a cortical mechanism for dopamine-dependent reward processing and cognitive control, and behaved abnormally on a decision making task that is diagnostic of dopamine dysfunction. Second, several dopamine-related neural pathways underlying individual differences in substance dependence were identified and modeled, providing a theoretical framework for bridging the gap between genes and behavior in drug addiction. Third, the neural mechanisms that underlie individual differences in decision making function and dysfunction were identified, revealing possible risk factors in the decision making system. In sum, these results illustrate how future interventions might be individually tailored for specific genetic, cognitive and personality profiles.

Books on the topic "Factored reinforcement learning":

1. Sallans, Brian. Reinforcement learning for factored Markov decision processes. 2002.

Book chapters on the topic "Factored reinforcement learning":

1. Sigaud, Olivier, Martin V. Butz, Olga Kozlova, and Christophe Meyer. "Anticipatory Learning Classifier Systems and Factored Reinforcement Learning." In Anticipatory Behavior in Adaptive Learning Systems, 321–33. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009. http://dx.doi.org/10.1007/978-3-642-02565-5_18.
2. Kozlova, Olga, Olivier Sigaud, and Christophe Meyer. "TeXDYNA: Hierarchical Reinforcement Learning in Factored MDPs." In From Animals to Animats 11, 489–500. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010. http://dx.doi.org/10.1007/978-3-642-15193-4_46.
3. Kozlova, Olga, Olivier Sigaud, Pierre-Henri Wuillemin, and Christophe Meyer. "Considering Unseen States as Impossible in Factored Reinforcement Learning." In Machine Learning and Knowledge Discovery in Databases, 721–35. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009. http://dx.doi.org/10.1007/978-3-642-04180-8_64.
4. Degris, Thomas, Olivier Sigaud, and Pierre-Henri Wuillemin. "Exploiting Additive Structure in Factored MDPs for Reinforcement Learning." In Lecture Notes in Computer Science, 15–26. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. http://dx.doi.org/10.1007/978-3-540-89722-4_2.
5. Coqueret, Guillaume, and Tony Guida. "Reinforcement learning." In Machine Learning for Factor Investing, 257–72. Boca Raton: Chapman and Hall/CRC, 2023. http://dx.doi.org/10.1201/9781003121596-20.
6. Klar, M., J. Mertes, M. Glatt, B. Ravani, and J. C. Aurich. "A Holistic Framework for Factory Planning Using Reinforcement Learning." In Proceedings of the 3rd Conference on Physical Modeling for Virtual Manufacturing Systems and Processes, 129–48. Cham: Springer International Publishing, 2023. http://dx.doi.org/10.1007/978-3-031-35779-4_8.
Abstract:
The generation of an optimized factory layout is a central element of the factory planning process. The generated factory layout predefines multiple characteristics of the future factory, such as the operational costs and proper resource allocations. However, manual layout planning is often time and resource-consuming and involves creative processes. In order to reduce the manual planning effort, automated, computer-aided planning approaches can support the factory planner to deal with this complexity by generating valuable solutions in the early phase of factory layout planning. Novel approaches have introduced Reinforcement Learning based planning schemes to generate optimized factory layouts. However, the existing research mainly focuses on the technical feasibility and does not highlight how a Reinforcement Learning based planning approach can be integrated into the factory planning process. Furthermore, it is unclear which information is required for its application. This paper addresses this research gap by presenting a holistic framework for Reinforcement Learning based factory layout planning that can be applied at the initial planning (greenfield planning) stages as well as in the restructuring (brownfield planning) of a factory layout. The framework consists of five steps: the initialization of the layout planning problem, the initialization of the algorithm, the execution of multiple training sets, the evaluation of the training results, and a final manual planning step for a selected layout variant. Each step consists of multiple sub-steps that are interlinked by an information flow. The framework describes the necessary and optional information for each sub-step and further provides guidance for future developments.
7. Wu, Tingyao, and Werner Van Leekwijck. "Factor Selection for Reinforcement Learning in HTTP Adaptive Streaming." In MultiMedia Modeling, 553–67. Cham: Springer International Publishing, 2014. http://dx.doi.org/10.1007/978-3-319-04114-8_47.
8. Alur, Rajeev, Osbert Bastani, Kishor Jothimurugan, Mateo Perez, Fabio Somenzi, and Ashutosh Trivedi. "Policy Synthesis and Reinforcement Learning for Discounted LTL." In Computer Aided Verification, 415–35. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-37706-8_21.
Abstract:
The difficulty of manually specifying reward functions has led to an interest in using linear temporal logic (LTL) to express objectives for reinforcement learning (RL). However, LTL has the downside that it is sensitive to small perturbations in the transition probabilities, which prevents probably approximately correct (PAC) learning without additional assumptions. Time discounting provides a way of removing this sensitivity, while retaining the high expressivity of the logic. We study the use of discounted LTL for policy synthesis in Markov decision processes with unknown transition probabilities, and show how to reduce discounted LTL to discounted-sum reward via a reward machine when all discount factors are identical.
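To give a flavor of the reward-machine construction mentioned at the end of the abstract, here is a toy, hand-written machine for the property "eventually reach the goal" (all details are assumed for illustration; the paper's reduction from discounted LTL is more general than this sketch):

```python
# Toy reward machine for "eventually reach goal" (illustrative only).
# A reward machine is a finite automaton that reads label sets emitted by the
# environment and outputs a scalar reward on each transition.
REWARD_MACHINE = {
    # (machine state, frozenset of true propositions) -> (next machine state, reward)
    ("u0", frozenset({"goal"})): ("u1", 1.0),
    ("u0", frozenset()):         ("u0", 0.0),
    ("u1", frozenset({"goal"})): ("u1", 0.0),
    ("u1", frozenset()):         ("u1", 0.0),
}

def rm_step(u, labels):
    return REWARD_MACHINE[(u, frozenset(labels))]

u, total, discount, gamma = "u0", 0.0, 1.0, 0.9
for labels in [set(), set(), {"goal"}, set()]:  # a trace of environment labels
    u, r = rm_step(u, labels)
    total += discount * r
    discount *= gamma
print(round(total, 3))  # 0.81: discounted reward for satisfying the property on step 3
```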
9. Vitorino, João, Rui Andrade, Isabel Praça, Orlando Sousa, and Eva Maia. "A Comparative Analysis of Machine Learning Techniques for IoT Intrusion Detection." In Foundations and Practice of Security, 191–207. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-08147-7_13.
Abstract:
The digital transformation faces tremendous security challenges. In particular, the growing number of cyber-attacks targeting Internet of Things (IoT) systems restates the need for a reliable detection of malicious network activity. This paper presents a comparative analysis of supervised, unsupervised and reinforcement learning techniques on nine malware captures of the IoT-23 dataset, considering both binary and multi-class classification scenarios. The developed models consisted of Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Isolation Forest (iForest), Local Outlier Factor (LOF) and a Deep Reinforcement Learning (DRL) model based on a Double Deep Q-Network (DDQN), adapted to the intrusion detection context. The most reliable performance was achieved by LightGBM. Nonetheless, iForest displayed good anomaly detection results and the DRL model demonstrated the possible benefits of employing this methodology to continuously improve the detection. Overall, the obtained results indicate that the analyzed techniques are well suited for IoT intrusion detection.
10. Hammler, Patric, Nicolas Riesterer, Gang Mu, and Torsten Braun. "Multi-Echelon Inventory Optimization Using Deep Reinforcement Learning." In Quantitative Models in Life Science Business, 73–93. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-11814-2_5.
Abstract:
In this chapter, we provide an overview of inventory management within the pharmaceutical industry and how to model and optimize it. Inventory management is a highly relevant topic, as it causes high costs such as holding, shortage, and reordering costs. A stock-out, especially, can cause damage that goes beyond the monetary damage of lost sales. To minimize those costs is the task of an optimized reorder policy. A reorder policy is optimal when it minimizes the accumulated cost in every situation. However, finding an optimal policy is not trivial. First, the problem is highly stochastic, as we need to consider variable demands and lead times. Second, the supply chain consists of several warehouses, including the factory, global distribution warehouses, and local affiliate warehouses, whereby the reorder policy of each warehouse has an impact on the optimal reorder policy of related warehouses. In this context, we discuss the concept of multi-echelon inventory optimization and a methodology that is capable of capturing both the stochastic behavior of the environment and how it is impacted by the reorder policy: Markov decision processes (MDPs). On this basis, we introduce the concept of a methodology named Reinforcement Learning (RL), along with its benefits and weaknesses. RL is capable of finding (near-) optimal (reorder) policies for MDPs. Furthermore, some simulation-based results and current research directions are presented.
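To make the MDP framing of the inventory problem concrete, here is a single-warehouse, one-step sketch with invented costs and demand (the chapter's multi-echelon setting, with coupled warehouses and lead times, is more involved and is not modeled here):

```python
# One-step cost sketch for a single-warehouse inventory MDP (illustrative only;
# costs and demand are invented, and the order is assumed to arrive immediately).
HOLDING_COST = 1.0    # per unit left in stock
SHORTAGE_COST = 5.0   # per unit of unmet demand (lost sales)
ORDER_COST = 10.0     # fixed cost per reorder

def step(inventory, order_quantity, demand):
    """Apply a reorder decision and a demand realization; return next inventory
    and the cost incurred (the quantity an RL agent would learn to minimize)."""
    stock = inventory + order_quantity
    unmet = max(demand - stock, 0)
    remaining = max(stock - demand, 0)
    cost = (HOLDING_COST * remaining
            + SHORTAGE_COST * unmet
            + (ORDER_COST if order_quantity > 0 else 0.0))
    return remaining, cost

print(step(inventory=2, order_quantity=5, demand=4))  # (3, 13.0)
```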

Conference papers on the topic "Factored reinforcement learning":

1. Strehl, Alexander L. "Model-Based Reinforcement Learning in Factored-State MDPs." In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning. IEEE, 2007. http://dx.doi.org/10.1109/adprl.2007.368176.
2. Sahin, Coskun, Erkin Cilden, and Faruk Polat. "Memory efficient factored abstraction for reinforcement learning." In 2015 IEEE 2nd International Conference on Cybernetics (CYBCONF). IEEE, 2015. http://dx.doi.org/10.1109/cybconf.2015.7175900.
3. Yao, Hengshuai, Csaba Szepesvari, Bernardo Avila Pires, and Xinhua Zhang. "Pseudo-MDPs and factored linear action models." In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, 2014. http://dx.doi.org/10.1109/adprl.2014.7010633.
4. Degris, Thomas, Olivier Sigaud, and Pierre-Henri Wuillemin. "Learning the structure of Factored Markov Decision Processes in reinforcement learning problems." In Proceedings of the 23rd International Conference on Machine Learning (ICML '06). New York, New York, USA: ACM Press, 2006. http://dx.doi.org/10.1145/1143844.1143877.
5. Wu, Bo, and Yanpeng Feng. "Monte-Carlo Bayesian Reinforcement Learning Using a Compact Factored Representation." In 2017 4th International Conference on Information Science and Control Engineering (ICISCE). IEEE, 2017. http://dx.doi.org/10.1109/icisce.2017.104.
6. Simão, Thiago D. "Safe and Sample-Efficient Reinforcement Learning Algorithms for Factored Environments." In Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19). California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/919.
Abstract:
Reinforcement Learning (RL) deals with problems that can be modeled as a Markov Decision Process (MDP) where the transition function is unknown. In situations where an arbitrary policy pi is already in execution and the experiences with the environment were recorded in a batch D, an RL algorithm can use D to compute a new policy pi'. However, the policy computed by traditional RL algorithms might have worse performance compared to pi. Our goal is to develop safe RL algorithms, where the agent has a high confidence that the performance of pi' is better than the performance of pi given D. To develop sample-efficient and safe RL algorithms we combine ideas from exploration strategies in RL with a safe policy improvement method.
7. Kroon, Mark, and Shimon Whiteson. "Automatic Feature Selection for Model-Based Reinforcement Learning in Factored MDPs." In 2009 International Conference on Machine Learning and Applications (ICMLA). IEEE, 2009. http://dx.doi.org/10.1109/icmla.2009.71.
8. France, Kordel K., and John W. Sheppard. "Factored Particle Swarm Optimization for Policy Co-training in Reinforcement Learning." In GECCO '23: Genetic and Evolutionary Computation Conference. New York, NY, USA: ACM, 2023. http://dx.doi.org/10.1145/3583131.3590376.
9. Simão, Thiago D., and Matthijs T. J. Spaan. "Structure Learning for Safe Policy Improvement." In Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19). California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/479.
Abstract:
We investigate how Safe Policy Improvement (SPI) algorithms can exploit the structure of factored Markov decision processes when such structure is unknown a priori. To facilitate the application of reinforcement learning in the real world, SPI provides probabilistic guarantees that policy changes in a running process will improve the performance of this process. However, current SPI algorithms have requirements that might be impractical, such as: (i) availability of a large amount of historical data, or (ii) prior knowledge of the underlying structure. To overcome these limitations we enhance a Factored SPI (FSPI) algorithm with different structure learning methods. The resulting algorithms need fewer samples to improve the policy and require weaker prior knowledge assumptions. In well-factorized domains, the proposed algorithms improve performance significantly compared to a flat SPI algorithm, demonstrating a sample complexity closer to an FSPI algorithm that knows the structure. This indicates that the combination of FSPI and structure learning algorithms is a promising solution to real-world problems involving many variables.
10. Panda, Swetasudha, and Yevgeniy Vorobeychik. "Scalable Initial State Interdiction for Factored MDPs." In Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18). California: International Joint Conferences on Artificial Intelligence Organization, 2018. http://dx.doi.org/10.24963/ijcai.2018/667.
Abstract:
We propose a novel Stackelberg game model of MDP interdiction in which the defender modifies the initial state of the planner, who then responds by computing an optimal policy starting with that state. We first develop a novel approach for MDP interdiction in factored state space that allows the defender to modify the initial state. The resulting approach can be computationally expensive for large factored MDPs. To address this, we develop several interdiction algorithms that leverage variations of reinforcement learning using both linear and non-linear function approximation. Finally, we extend the interdiction framework to consider a Bayesian interdiction problem in which the interdictor is uncertain about some of the planner's initial state features. Extensive experiments demonstrate the effectiveness of our approaches.

Reports on the topic "Factored reinforcement learning":

1. Rinaudo, Christina, William Leonard, Jaylen Hopson, Christopher Morey, Robert Hilborn, and Theresa Coumbe. Enabling understanding of artificial intelligence (AI) agent wargaming decisions through visualizations. Engineer Research and Development Center (U.S.), April 2024. http://dx.doi.org/10.21079/11681/48418.
Abstract:
The process to develop options for military planning course of action (COA) development and analysis relies on human subject matter expertise. Analyzing COAs requires examining several factors and understanding complex interactions and dependencies associated with actions, reactions, proposed counteractions, and multiple reasonable outcomes. In Fiscal Year 2021, the Institute for Systems Engineering Research team completed efforts resulting in a wargaming maritime framework capable of training an artificial intelligence (AI) agent with deep reinforcement learning (DRL) techniques within a maritime scenario where the AI agent credibly competes against blue agents in gameplay. However, a limitation of using DRL for agent training relates to the transparency of how the AI agent makes decisions. If leaders were to rely on AI agents for COA development or analysis, they would want to understand those decisions. In order to support increased understanding, researchers engaged with stakeholders to determine visualization requirements and developed initial prototypes for stakeholder feedback in order to support increased understanding of AI-generated decisions and recommendations. This report describes the prototype visualizations developed to support the use case of a mission planner and an AI agent trainer. The prototypes include training results charts, heat map visualizations of agent paths, weight matrix visualizations, and ablation testing graphs.
