Dissertations on the topic "Partially Observable Markov Decision Processes (POMDPs)"

To see other types of publications on this topic, follow the link: Partially Observable Markov Decision Processes (POMDPs).

Consult the top 35 dissertations for your research on the topic "Partially Observable Markov Decision Processes (POMDPs)."

Next to every entry in the list of references there is an "Add to bibliography" button. Press it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the publication as a .pdf file and read its abstract online, whenever these are available in the metadata.

Browse dissertations across a wide variety of disciplines and compile your bibliography correctly.

1

Aberdeen, Douglas Alexander. "Policy-Gradient Algorithms for Partially Observable Markov Decision Processes." The Australian National University, Research School of Information Sciences and Engineering, 2003. http://thesis.anu.edu.au./public/adt-ANU20030410.111006.

Abstract:
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms is the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs).

In the most general case, POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember.

Monte-Carlo policy-gradient approaches tend to produce gradient estimates with high variance. Two novel methods for reducing variance are introduced. The first uses high-order filters to replace the eligibility trace of the gradient estimator. The second uses a low-variance value-function method to learn a subset of the parameters and a policy-gradient method to learn the remainder.

The algorithms are applied to large domains including a simulated robot navigation scenario, a multi-agent scenario with 21,000 states, and the complex real-world task of large vocabulary continuous speech recognition. To the best of the author's knowledge, no other policy-gradient algorithms have performed well at such tasks.

The high variance of Monte-Carlo methods requires lengthy simulation and hence a supercomputer to train agents within a reasonable time. The ANU "Bunyip" Linux cluster was built with such tasks in mind. It was used for several of the experimental results presented here. One chapter of this thesis describes an application written for the Bunyip cluster that won the international Gordon Bell prize for price/performance in 2001.
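For readers unfamiliar with the technique, the core of a Monte-Carlo policy-gradient estimator of this kind can be sketched in a few lines. This is an illustrative GPOMDP-style sketch under assumed interfaces (an `env` with `reset`/`step`, a `policy` returning action probabilities, and its score function `grad_log_policy`), not code from the thesis; the discounted eligibility trace below is the quantity the thesis replaces with higher-order filters.

```python
import numpy as np

def gpomdp_gradient(env, policy, grad_log_policy, theta, beta=0.95, T=10_000, seed=0):
    """Monte-Carlo estimate of the gradient of the long-run average reward.

    A discounted eligibility trace z accumulates the score function
    grad log pi(a|o); averaging r_t * z_t over the run approximates
    the policy gradient.
    """
    rng = np.random.default_rng(seed)
    z = np.zeros_like(theta)          # eligibility trace
    g = np.zeros_like(theta)          # running gradient estimate
    obs = env.reset()
    for t in range(1, T + 1):
        probs = policy(theta, obs)                    # distribution over actions
        a = rng.choice(len(probs), p=probs)
        z = beta * z + grad_log_policy(theta, obs, a)
        obs, r, done = env.step(a)
        g += (r * z - g) / t                          # incremental average
        if done:
            obs, z = env.reset(), np.zeros_like(theta)
    return g
```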
2

Olafsson, Björgvin. "Partially Observable Markov Decision Processes for Faster Object Recognition." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-198632.

Abstract:
Object recognition in the real world is a big challenge in the field of computer vision. Given the potentially enormous size of the search space, it is essential to be able to make intelligent decisions about where in the visual field to obtain information, so as to reduce the computational resources needed. In this report, a POMDP (Partially Observable Markov Decision Process) learning framework, using a policy gradient method and information rewards as a training signal, has been implemented and used to train fixation policies that aim to maximize the information gathered in each fixation. The purpose of such policies is to make object recognition faster by reducing the number of fixations needed. The trained policies are evaluated by simulation and compared with several fixed policies. Finally, it is shown that the framework can be used to train policies that outperform the fixed policies for certain observation models.
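As a minimal sketch of the "information reward" idea (an assumption about the general mechanism, not the report's exact formulation), the reward for one fixation can be taken as the reduction in entropy of a discrete belief over object hypotheses after a Bayes update:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def belief_update(b, likelihood):
    """Bayes update of belief b given per-hypothesis observation likelihoods."""
    post = b * likelihood
    return post / post.sum()

def information_reward(b, likelihood):
    """Information gathered by one fixation = reduction in belief entropy."""
    return entropy(b) - entropy(belief_update(b, likelihood))
```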
3

Lusena, Christopher. "Finite Memory Policies for Partially Observable Markov Decision Processes." UKnowledge, 2001. http://uknowledge.uky.edu/gradschool_diss/323.

Abstract:
This dissertation makes contributions to two areas of research on planning with POMDPs: complexity-theoretic results and heuristic techniques. The most important contributions are probably the complexity of approximating the optimal history-dependent finite-horizon policy for a POMDP, and the idea of heuristic search over the space of FFTs.
4

Skoglund, Caroline. "Risk-aware Autonomous Driving Using POMDPs and Responsibility-Sensitive Safety." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-300909.

Abstract:
Autonomous vehicles promise to play an important role in increasing the efficiency and safety of road transportation. Although we have seen several examples of autonomous vehicles on the road over the past years, how to ensure the safety of an autonomous vehicle in an uncertain and dynamic environment is still a challenging problem. This thesis studies this problem by developing a risk-aware decision making framework. The system that integrates the dynamics of an autonomous vehicle and the uncertain environment is modelled as a Partially Observable Markov Decision Process (POMDP). A risk measure is proposed based on the Responsibility-Sensitive Safety (RSS) distance, which quantifies the minimum distance to other vehicles required for safety. This risk measure is incorporated into the reward function of the POMDP to achieve risk-aware decision making. The proposed risk-aware POMDP framework is evaluated in two case studies. In a single-lane car following scenario, it is shown that the ego vehicle is able to successfully avoid a collision in an emergency event where the vehicle in front of it makes a full stop. In the merge scenario, the ego vehicle successfully enters the main road from a ramp with a satisfactory distance to other vehicles. In conclusion, the risk-aware POMDP framework is able to realize a trade-off between safety and usability by keeping a reasonable distance and adapting to other vehicles' behaviours.
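The RSS distance referred to above has a standard closed form (from Shalev-Shwartz et al.). The sketch below computes the minimum safe longitudinal gap and a simple normalized risk from it; the parameter values and the normalization are illustrative assumptions, not the thesis' exact reward design.

```python
def rss_min_distance(v_rear, v_front, rho=1.0, a_max=3.0, b_min=4.0, b_max=8.0):
    """Minimum RSS-safe longitudinal gap (m) between an ego vehicle (rear)
    and a lead vehicle (front), with speeds in m/s.

    rho: ego response time; a_max: max acceleration during the response;
    b_min: ego's guaranteed braking; b_max: lead's maximum braking.
    """
    d = (v_rear * rho + 0.5 * a_max * rho ** 2
         + (v_rear + rho * a_max) ** 2 / (2 * b_min)
         - v_front ** 2 / (2 * b_max))
    return max(d, 0.0)

def rss_risk(gap, v_rear, v_front):
    """Fraction by which the actual gap falls short of the RSS distance."""
    d_min = rss_min_distance(v_rear, v_front)
    return max(0.0, (d_min - gap) / d_min) if d_min > 0 else 0.0
```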
5

You, Yang. "Probabilistic Decision-Making Models for Multi-Agent Systems and Human-Robot Collaboration." Electronic Thesis or Diss., Université de Lorraine, 2023. http://www.theses.fr/2023LORR0014.

Abstract:
In this thesis, using Markov decision models, we investigate high-level decision-making (task-level planning) for robotics in two aspects: robot-robot collaboration and human-robot collaboration.

In robot-robot collaboration (RRC), we study the decision problems of multiple robots collaborating to achieve a shared goal, and we use the decentralized partially observable Markov decision process (Dec-POMDP) framework to model such RRC problems. We then propose two novel algorithms for solving Dec-POMDPs. The first algorithm (Inf-JESP) finds Nash equilibrium solutions by iteratively building the best-response policy for each agent until no improvement can be made. To handle infinite-horizon Dec-POMDPs, we represent each agent's policy using a finite-state controller. The second algorithm (MC-JESP) extends Inf-JESP with generative models, which enables us to scale up to large problems. Through experiments, we demonstrate that our methods are competitive with existing Dec-POMDP solvers.

In human-robot collaboration (HRC), we can only control the robot, and the robot faces uncertain human objectives and induced behaviors. We therefore attempt to derive robot policies for HRC that are robust to the uncertainties about human behaviors. In this direction, we discuss possible mental models that can be used to model humans in an HRC task. We propose a general approach to derive, automatically and without prior knowledge, a model of human behaviors based on the assumption that the human could also control the robot. From here, we design two algorithms for computing robust robot policies that rely on solving a robot POMDP whose state contains the human's internal state. The first algorithm operates offline and gives a complete robot policy that can be used during the robot's execution. The second algorithm is an online method, i.e., it plans the robot's action at each time step during execution. Compared with the offline approach, the online method only requires a generative model and thus can scale up to large problems. Experiments with synthetic and real humans are conducted in a simulated environment to evaluate these algorithms. We observe that our methods can provide robust robot decisions despite the uncertainties over human objectives and behaviors.

Our research on RRC provides a foundation for building best-response policies in a partially observable, multi-agent setting, which serves as an important intermediate step for addressing HRC problems. Moreover, in each contribution we provide more flexible algorithms using generative models, which we believe will facilitate applying our contributions to real-world applications.
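Both Inf-JESP and the POMDP policies described above use finite-state controllers as policy representations. As a rough illustration of that representation (an assumed data layout, not the thesis' implementation), a stochastic controller samples actions from an internal node and moves to a new node on each observation:

```python
import numpy as np

class FiniteStateController:
    """A stochastic finite-state controller for a POMDP agent."""

    def __init__(self, action_probs, node_transition, start_node=0):
        self.action_probs = action_probs        # [node, action]
        self.node_transition = node_transition  # [node, observation, next_node]
        self.node = start_node

    def act(self, rng):
        """Sample an action from the current internal node."""
        return rng.choice(self.action_probs.shape[1],
                          p=self.action_probs[self.node])

    def observe(self, obs, rng):
        """Move to a new internal node after receiving an observation."""
        self.node = rng.choice(self.node_transition.shape[2],
                               p=self.node_transition[self.node, obs])
```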
6

Cheng, Hsien-Te. "Algorithms for partially observable Markov decision processes." Thesis, University of British Columbia, 1988. http://hdl.handle.net/2429/29073.

Abstract:
The thesis develops methods to solve discrete-time finite-state partially observable Markov decision processes. For the infinite-horizon problem, only the discounted reward case is considered. Several new algorithms for the finite-horizon and the infinite-horizon problems are developed.

For the finite-horizon problem, two new algorithms are developed. The first algorithm is called the relaxed region algorithm. For each support in the value function, this algorithm determines a region not smaller than its support region and modifies it implicitly in later steps until the exact support region is found. The second algorithm, called the linear support algorithm, systematically approximates the value function until all supports in the value function are found. The most important feature of this algorithm is that it can be modified to find an approximate value function. The number of regions determined explicitly by both algorithms equals the number of supports in the value function, which is much less than the number of regions generated by the one-pass algorithm. Since the vertices of each region have to be found, these two algorithms are more efficient than the one-pass algorithm. The limited numerical examples also show that both methods are more efficient than the existing algorithms.

For the infinite-horizon problem, it is first shown that the approximation version of the linear support algorithm can be used to substitute for the policy improvement step in a standard successive approximation method to obtain an ε-optimal value function. Next, an iterative discretization procedure is developed which uses a small number of states to find new supports and improve the value function between two policy improvement steps. Since only a finite number of states are chosen in this process, some techniques developed for finite MDPs can be applied here. Finally, we prove that the policy improvement step in the iterative discretization procedure can be replaced by the approximation version of the linear support algorithm.

The last part of the thesis deals with problems with continuous signals. We first show that if the signal processes are uniformly distributed, then the problem can be reformulated as a problem with a finite number of signals. The result is then extended to the case where the signal processes are step functions. Since step functions can easily be used to approximate most probability distributions, this method can be used to approximate most problems with continuous signals. Finally, we present some conditions which guarantee that the linear support can be computed for any given state; the methods developed for the finite-signal case can then be easily modified and applied to problems for which these conditions hold.
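The "supports" in this abstract are what later literature calls alpha vectors: the POMDP value function is piecewise linear and convex, and the value of a belief is its best dot product with any support. A minimal illustrative sketch (not the thesis code):

```python
import numpy as np

def value(belief, alpha_vectors):
    """Value of a belief under a piecewise-linear convex value function."""
    return max(float(np.dot(a, belief)) for a in alpha_vectors)

def best_support(belief, alpha_vectors):
    """The support (alpha vector) defining the value at this belief; the set
    of beliefs sharing one best support is that support's region."""
    return max(alpha_vectors, key=lambda a: float(np.dot(a, belief)))
```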
7

Jaulmes, Robin. "Active learning in partially observable Markov decision processes." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=98733.

Abstract:
People are efficient when they make decisions under uncertainty, even when their decisions have long-term ramifications or when their knowledge and their perception of the environment are uncertain. We are able to experiment with the environment and learn, improving our behavior as experience is gathered. Most of the problems we face in real life are of that kind, as are most of the problems that an automated agent would face in robotics.
Our goal is to build Artificial Intelligence algorithms able to reproduce the reasoning of humans for these complex problems. We use the Reinforcement Learning framework, which allows an agent to learn optimal behaviors in dynamic environments. More precisely, we adapt Partially-Observable Markov Decision Processes (POMDPs) to environments that are only partially known.
We take inspiration from the field of Active Learning: we assume the existence of an oracle, who can, during a short learning phase, provide the agent with additional information about its environment. The agent actively learns everything that is useful in the environment, with minimal use of the oracle.
After reviewing existing methods for solving learning problems in partially observable environments, we present a theoretical active learning setup. We propose an algorithm, MEDUSA, and provide theoretical and empirical evidence of its performance.
8

Aberdeen, Douglas Alexander. "Policy-gradient algorithms for partially observable Markov decision processes /." View thesis entry in Australian Digital Theses Program, 2003. http://thesis.anu.edu.au/public/adt-ANU20030410.111006/index.html.

9

Zawaideh, Zaid. "Eliciting preferences sequentially using partially observable Markov decision processes." Thesis, McGill University, 2008. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=18794.

Abstract:
Decision support systems have been gaining in importance recently. Yet one of the bottlenecks in designing such systems lies in understanding how the user values different decision outcomes, or more simply, what the user's preferences are. Preference elicitation promises to remove the guesswork from designing decision making agents by providing more formal methods for measuring the 'goodness' of outcomes. This thesis aims to address some of the challenges of preference elicitation, such as the high dimensionality of the underlying problem. The problem is formulated as a partially observable Markov decision process (POMDP) using a factored representation to take advantage of the structure inherent to preference elicitation problems. Moreover, simple preference knowledge about problem attributes is used to acquire more accurate preferences without increasing the burden on the user. Sparse terminal actions are defined to allow a flexible trade-off between the speed and accuracy of the elicited preference function. Empirical simulations are used to validate the proposed methodology. The result is a framework flexible enough to be applied to a wide range of domains, addressing some of the challenges facing preference elicitation methods.
10

Williams, Jason Douglas. "Partially observable Markov decision processes for spoken dialogue management." Thesis, University of Cambridge, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.612754.

11

Lusena, Christopher. "Finite memory policies for partially observable Markov decision processes." Lexington, Ky. : [University of Kentucky Libraries], 2001. http://lib.uky.edu/ETD/ukycosc2001d00021/lusena01.pdf.

12

Yu, Huizhen. "Approximate solution methods for partially observable Markov and semi-Markov decision processes." Thesis, Massachusetts Institute of Technology, 2006. http://hdl.handle.net/1721.1/35299.

Abstract:
We consider approximation methods for discrete-time infinite-horizon partially observable Markov and semi-Markov decision processes (POMDP and POSMDP). One of the main contributions of this thesis is a lower cost approximation method for finite-space POMDPs with the average cost criterion, and its extensions to semi-Markov partially observable problems and constrained POMDP problems, as well as to problems with the undiscounted total cost criterion. Our method is an extension of several lower cost approximation schemes, proposed individually by various authors, for discounted POMDP problems. We introduce a unified framework for viewing all of these schemes together with some new ones. In particular, we establish that due to the special structure of hidden states in a POMDP, there is a class of approximating processes, which are either POMDPs or belief MDPs, that provide lower bounds to the optimal cost function of the original POMDP problem.

Theoretically, POMDPs with the long-run average cost criterion are still not fully understood. The major difficulties relate to the structure of the optimal solutions, such as conditions for a constant optimal cost function, the existence of solutions to the optimality equations, and the existence of optimal policies that are stationary and deterministic. Thus, our lower bound result is useful not only in providing a computational method, but also in characterizing the optimal solution. We show that regardless of these theoretical difficulties, lower bounds of the optimal liminf average cost function can be computed efficiently by solving modified problems using multichain MDP algorithms, and the approximating cost functions can also be used to obtain suboptimal stationary control policies. We prove the asymptotic convergence of the lower bounds under certain assumptions. For semi-Markov problems and total cost problems, we show that the same method can be applied for computing lower bounds of the optimal cost function. For constrained average cost POMDPs, we show that lower bounds of the constrained optimal cost function can be computed by solving finite-dimensional LPs.

We also consider reinforcement learning methods for POMDPs and MDPs. We propose an actor-critic type policy gradient algorithm that uses a structured policy known as a finite-state controller. We thus provide an alternative to the earlier actor-only algorithm GPOMDP. Our work also clarifies the relationship between the reinforcement learning methods for POMDPs and those for MDPs. For average cost MDPs, we provide a convergence and convergence rate analysis for a least squares temporal difference (TD) algorithm, called LSPE, previously proposed for discounted problems. We use this algorithm in the critic portion of the policy gradient algorithm for POMDPs with finite-state controllers. Finally, we investigate the properties of the limsup and liminf average cost functions of various types of policies. We show various convexity and concavity properties of these cost functions, and we give a new necessary condition for the optimal liminf average cost to be constant. Based on this condition, we prove the near-optimality of the class of finite-state controllers under the assumption of a constant optimal liminf average cost. This result provides a theoretical guarantee for the finite-state controller approach.
13

Tobin, Ludovic. "A Stochastic Point-Based Algorithm for Partially Observable Markov Decision Processes." Thesis, Université Laval, 2008. http://www.theses.ulaval.ca/2008/25194/25194.pdf.

Abstract:
Decision making under uncertainty is a popular topic in the field of artificial intelligence. One popular way to attack such problems is to use a sound mathematical model. Notably, Partially Observable Markov Decision Processes (POMDPs) have been the subject of extensive research over the last ten years or so. However, solving a POMDP is a very time-consuming task, and for this reason the model has not been used extensively. Our objective was to continue the tremendous progress that has been made over the last couple of years, with the hope that our work will be a step toward applying POMDPs to large-scale problems. To do so, we combined different ideas to produce a new algorithm called SSVI (Stochastic Search Value Iteration). Three major accomplishments were achieved throughout this research work. Firstly, we developed a new offline POMDP algorithm which, on benchmark problems, proved to be more efficient than state-of-the-art algorithms. The originality of our method comes from the fact that it is a stochastic algorithm, in contrast with the usual deterministic algorithms. Secondly, the algorithm we developed can also be applied in a particular type of online environment, in which it outperforms the competition by a significant margin. Finally, we also applied a basic version of our algorithm in a complex military simulation in the context of the Combat Identification project from DRDC-Valcartier.
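For context, SSVI is described as a stochastic variant of point-based value iteration; the deterministic point-based Bellman backup that such methods build on can be sketched as follows (the array shapes are assumptions for illustration, not the thesis code):

```python
import numpy as np

def point_based_backup(b, Gamma, T, O, R, gamma=0.95):
    """One point-based Bellman backup at belief b.

    T[a][s, s']: transition probabilities; O[a][s', o]: observation
    probabilities; R[a][s]: rewards; Gamma: current alpha vectors.
    Returns the alpha vector of the best one-step lookahead plan at b.
    """
    best, best_val = None, -np.inf
    for a in range(len(T)):
        alpha_a = R[a].astype(float)
        for o in range(O[a].shape[1]):
            # Back-project each alpha vector through action a, observation o,
            # then keep the projection that is best at belief b.
            g = [T[a] @ (O[a][:, o] * alpha) for alpha in Gamma]
            alpha_a = alpha_a + gamma * max(g, key=lambda v: float(v @ b))
        if float(alpha_a @ b) > best_val:
            best, best_val = alpha_a, float(alpha_a @ b)
    return best
```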
14

Olsen, Alan. "Pond-Hindsight: Applying Hindsight Optimization to Partially-Observable Markov Decision Processes." DigitalCommons@USU, 2011. https://digitalcommons.usu.edu/etd/1035.

Abstract:
Partially-observable Markov decision processes (POMDPs) are especially good at modeling real-world problems because they allow for sensor and effector uncertainty. Unfortunately, such uncertainty makes solving a POMDP computationally challenging. Traditional approaches, which are based on value iteration, can be slow because they find optimal actions for every possible situation. With the help of the Fast Forward (FF) planner, FF-Replan and FF-Hindsight have shown success in quickly solving fully-observable Markov decision processes (MDPs) by solving classical planning translations of the problem. This thesis extends the concept of problem determination to POMDPs by sampling action observations (similar to how FF-Replan samples action outcomes) and guiding the construction of policy trajectories with a conformant (as opposed to classical) planning heuristic. The resultant planner is called POND-Hindsight.
15

Hudson, Joshua. "A Partially Observable Markov Decision Process for Breast Cancer Screening." Thesis, Linköpings universitet, Statistik och maskininlärning, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-154437.

Abstract:
In the US, breast cancer is one of the most common forms of cancer and the most lethal. There are many decisions that must be made by the doctor and/or the patient when dealing with a potential breast cancer. Many of these decisions are made under uncertainty, whether it is the uncertainty related to the progression of the patient's health, or that related to the accuracy of the doctor's tests. Each possible action under consideration can have positive effects, such as a surgery successfully removing a tumour, and negative effects: a post-surgery infection for example. The human mind simply cannot take into account all the variables involved and possible outcomes when making these decisions. In this report, a detailed Partially Observable Markov Decision Process (POMDP) for breast cancer screening decisions is presented. It includes 151 states, covering 144 different cancer states, and 2 competing screening methods. The necessary parameters were first set up using relevant medical literature and a patient history simulator. Then the POMDP was solved optimally for an infinite horizon, using the Perseus algorithm. The resulting policy provided several recommendations for breast cancer screening. The results indicated that clinical breast examinations are important for screening younger women. Regarding the decision to operate on a woman with breast cancer, the policy showed that invasive cancers with either a tumour size above 1.5 cm or which are in metastasis, should be surgically removed as soon as possible. However, the policy also recommended that patients who are certain to be healthy should have a breast biopsy. The cause of this error was explored further and the conclusion was reached that a finite horizon may be more appropriate for this application.
16

Castro, Rivadeneira Pablo Samuel. "On planning, prediction and knowledge transfer in fully and partially observable Markov decision processes." Thesis, McGill University, 2011. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=104525.

Abstract:
This dissertation addresses the problem of sequential decision making under uncertainty in large systems. The formalisms used to study this problem are fully and partially observable Markov Decision Processes (MDPs and POMDPs, respectively). The first contribution of this dissertation is a theoretical analysis of the behavior of POMDPs when only subsets of the observation set are used. One of these subsets is used to update the agent's state estimate, while the other subset contains observations the agent is interested in predicting and/or optimizing. The behaviors are formalized as three types of equivalence relations. The first groups states based on their values under optimal or general policies; the second groups states according to their ability to predict observation sequences; the third type is based on bisimulation, which is a well known equivalence relation borrowed from concurrency theory.

Bisimulation relations can be generalized to bisimulation metrics. This dissertation introduces bisimulation metrics for an MDP with temporally extended actions (formalized as options) and proposes a new bisimulation metric that provides a tighter bound on the difference in optimal values. A new proof is provided for the convergence of an approximation method for computing bisimulation metrics that is based on statistical sampling, using only a finite number of samples. The new proof allows one to determine the minimum number of samples needed in order to achieve the desired quality of approximation with high probability.

Although bisimulation metrics have previously been used for state space compression, this dissertation proposes using them to transfer policies from one MDP to another. In contrast to existing transfer work, the mapping between the two systems is determined automatically by means of the bisimulation metrics. Theoretical results are provided that bound the loss in optimality incurred by the transferred policy. A number of algorithms are introduced, which are evaluated empirically in the context of planning and learning.
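For reference, the MDP bisimulation metric that this line of work generalizes (in the style of Ferns et al.) is the least fixed point of the operator below; the dissertation's option-based metric and its tighter variant modify this construction. Here W_1(d) is the Kantorovich (Wasserstein-1) distance between next-state distributions under the metric d, and the weights c_R, c_T >= 0 with c_T < 1 make the operator a contraction:

```latex
d(s,t) \;=\; \max_{a \in A} \Big( c_R \,\big|\, R(s,a) - R(t,a) \,\big|
        \;+\; c_T \, W_1(d)\big( P(\cdot \mid s,a),\; P(\cdot \mid t,a) \big) \Big)
```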
17

Horgan, Casey Vi. "Dealing with uncertainty : a comparison of robust optimization and partially observable Markov decision processes." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/112410.

Abstract:
Uncertainty is often present in real-life problems. Deciding how to deal with this uncertainty can be difficult. The proper formulation of a problem can be the larger part of the work required to solve it. This thesis is intended to be used by a decision maker to determine how best to formulate a problem. Robust optimization and partially observable Markov decision processes (POMDPs) are two methods of dealing with uncertainty in real-life problems. Robust optimization is used primarily in operations research, while engineers will be more familiar with POMDPs. For a decision maker who is unfamiliar with one or both of these methods, this thesis provides insight into a different way of problem solving in the presence of uncertainty. The formulation of each method is explained in detail, and the theory of common solution methods is presented. In addition, several examples are given for each method. While a decision maker may try to solve an entire problem using one method, sometimes there are natural partitions to a problem that encourage using multiple solution methods. In this thesis, one such problem is presented: a military planning problem consisting of two parts. The first part is best solved with POMDPs and the second with robust optimization. The reasoning behind this partition is explained and the formulation of each part is presented. Finally, a discussion of the problem types suitable for each method, including multiple applications, is provided.
18

Crook, Paul A. "Learning in a state of confusion : employing active perception and reinforcement learning in partially observable worlds." Thesis, University of Edinburgh, 2007. http://hdl.handle.net/1842/1471.

Abstract:
In applying reinforcement learning to agents acting in the real world we are often faced with tasks that are non-Markovian in nature. Much work has been done using state estimation algorithms to try to uncover Markovian models of tasks in order to allow the learning of optimal solutions using reinforcement learning. Unfortunately these algorithms, which attempt to simultaneously learn a Markov model of the world and how to act, have proved very brittle. Our focus differs. In considering embodied, embedded and situated agents we have a preference for simple learning algorithms which reliably learn satisficing policies. The learning algorithms we consider do not try to uncover the underlying Markovian states; instead they aim to learn successful deterministic reactive policies such that the agents' actions are based directly upon the observations provided by their sensors. Existing results have shown that such reactive policies can be arbitrarily worse than a policy that has access to the underlying Markov process, and in some cases no satisficing reactive policy can exist.

Our first contribution is to show that providing agents with alternative actions and viewpoints on the task through the addition of active perception can provide a practical solution in such circumstances. We demonstrate empirically that: (i) adding arbitrary active perception actions to agents which can only learn deterministic reactive policies can allow the learning of satisficing policies where none were originally possible; (ii) active perception actions allow the learning of better satisficing policies than those that existed previously; and (iii) our approach converges more reliably to satisficing solutions than existing state estimation algorithms such as U-Tree and the Lion Algorithm.

Our other contributions focus on issues which affect the reliability with which deterministic reactive satisficing policies can be learnt in non-Markovian environments. We show that greedy action selection may be a necessary condition for the existence of stable deterministic reactive policies on partially observable Markov decision processes (POMDPs). We also set out the concept of Consistent Exploration: the idea of estimating state-action values by acting as though the policy has been changed to incorporate the action being explored. We demonstrate that this concept can be used to develop better algorithms for learning reactive policies for POMDPs by presenting a new reinforcement learning algorithm, the Consistent Exploration Q(λ) algorithm (CEQ(λ)). We demonstrate on a significant number of problems that CEQ(λ) is more reliable at learning satisficing solutions than the algorithm currently regarded as the best for learning deterministic reactive policies, SARSA(λ).
19

Omidshafiei, Shayegan. "Decentralized control of multi-robot systems using partially observable Markov Decision Processes and belief space macro-actions." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/101447.

Abstract:
Planning, control, perception, and learning for multi-robot systems present significant challenges. Transition dynamics of the robots may be stochastic, making it difficult to select the best action each robot should take at a given time. The observation model, a function of the robots' sensors, may be noisy or partial, meaning that deterministic knowledge of the team's state is often impossible to attain. Robots designed for real-world applications require careful consideration of such sources of uncertainty. This thesis contributes a framework for multi-robot planning in continuous spaces with partial observability. Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) are general models for multi-robot coordination problems. However, representing and solving Dec-POMDPs is often intractable for large problems. This thesis extends the Dec-POMDP framework to the Decentralized Partially Observable Semi-Markov Decision Process (Dec-POSMDP), taking advantage of high-level representations that are natural for multi-robot problems. Dec-POSMDPs allow asynchronous decision-making, which is crucial in multi-robot domains. This thesis also presents algorithms for solving Dec-POSMDPs, which are more scalable than previous methods due to the use of closed-loop macro-actions in planning. The proposed framework's performance is evaluated in a constrained multi-robot package delivery domain, showing its ability to provide high-quality solutions for large problems. Due to the probabilistic nature of state transitions and observations, robots operate in belief space, the space of probability distributions over all of their possible states. This thesis also contributes a hardware platform called Measurable Augmented Reality for Prototyping Cyber-Physical Systems (MAR-CPS), which allows real-time visualization of the belief space in laboratory settings.
20

Folsom-Kovarik, Jeremiah. "Leveraging Help Requests in POMDP Intelligent Tutors." Doctoral diss., University of Central Florida, 2012. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/5210.

Abstract:
Intelligent tutoring systems (ITSs) are computer programs that model individual learners and adapt instruction to help each learner differently. One way ITSs differ from human tutors is that few ITSs give learners a way to ask questions. When learners can ask for help, their questions have the potential to improve learning directly and also act as a new source of model data to help the ITS personalize instruction. Inquiry modeling gives ITSs the ability to answer learner questions and refine their learner models with an inexpensive new input channel. In order to support inquiry modeling, an advanced planning formalism is applied to ITS learner modeling. Partially observable Markov decision processes (POMDPs) differ from more widely used ITS architectures because they can plan complex action sequences in uncertain situations with machine learning. Tractability issues have previously precluded POMDP use in ITS models. This dissertation introduces two improvements, priority queues and observation chains, to make POMDPs scale well and encompass the large problem sizes that real-world ITSs must confront. A new ITS was created to support trainees practicing a military task in a virtual environment. The development of the Inquiry Modeling POMDP Adaptive Trainer (IMP) began with multiple formative studies on human and simulated learners that explored inquiry modeling and POMDPs in intelligent tutoring. The studies suggest the new POMDP representations will be effective in ITS domains having certain common characteristics. Finally, a summative study evaluated IMP's ability to train volunteers in specific practice scenarios. IMP users achieved post-training scores averaging up to 4.5 times higher than users who practiced without support and up to twice as high as trainees who used an ablated version of IMP with no inquiry modeling. IMP's implementation and evaluation helped explore questions about how inquiry modeling and POMDP ITSs work, while empirically demonstrating their efficacy.
21

Pradhan, Neil. "Deep Reinforcement Learning for Autonomous Highway Driving Scenario." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-289444.

Abstract:
We present an autonomous driving agent in a simulated highway driving scenario with vehicles such as cars and trucks moving with stochastically variable velocity profiles. The focus of the simulated environment is to test tactical decision making in highway driving scenarios. When an agent (vehicle) maintains an optimal range of velocity, it is beneficial both in terms of energy efficiency and a greener environment. In order to maintain an optimal range of velocity, this thesis proposes two novel reward structures: (a) a Gaussian reward structure and (b) an exponential rise-and-fall reward structure. I trained two deep reinforcement learning agents, one per reward structure, to study their differences and evaluate their performance based on a set of parameters that are most relevant in highway driving scenarios. The algorithm implemented in this thesis is a double dueling deep Q-network with a prioritized experience replay buffer. Experiments were performed by adding noise to the inputs, simulating a Partially Observable Markov Decision Process, in order to compare the reliability of the different reward structures. A velocity occupancy grid was found to be better than a binary occupancy grid as input for the algorithm. Furthermore, a methodology for generating fuel-efficient policies is discussed and demonstrated with an example.
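One plausible reading of the two reward structures named above (the exact functional forms and parameter values in the thesis may differ) is a reward that peaks at a target speed, either as a Gaussian bump or as an exponential rise followed by an exponential fall:

```python
import numpy as np

def gaussian_reward(v, v_target=25.0, sigma=2.0):
    """Peaks at the target speed, decaying smoothly on both sides."""
    return float(np.exp(-((v - v_target) ** 2) / (2.0 * sigma ** 2)))

def rise_fall_reward(v, v_target=25.0, k_rise=0.3, k_fall=0.5):
    """Rises exponentially toward the target speed and falls past it."""
    if v <= v_target:
        return float(np.exp(-k_rise * (v_target - v)))
    return float(np.exp(-k_fall * (v - v_target)))
```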
22

Murugesan, Sugumar. "Opportunistic Scheduling Using Channel Memory in Markov-modeled Wireless Networks." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1282065836.

23

Ibrahim, Rita. "Utilisation des communications Device-to-Device pour améliorer l'efficacité des réseaux cellulaires." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLC002/document.

Abstract:
This thesis considers Device-to-Device (D2D) communications as a promising technique for enhancing future cellular networks. Modeling, evaluating and optimizing D2D features are the fundamental goals of this thesis and are mainly achieved using the following mathematical tools: queuing theory, Lyapunov optimization and Partially Observable Markov Decision Processes (POMDPs). The findings of this study are presented in three parts. In the first part, we investigate a D2D mode selection scheme. We derive the queuing stability regions of two scenarios: pure cellular networks and D2D-enabled cellular networks. Comparing the two scenarios leads us to elaborate a D2D vs. cellular mode selection design that improves the capacity of the network. In the second part, we develop a D2D resource allocation algorithm. We observe that D2D users are able to estimate their local Channel State Information (CSI), whereas the base station needs some signaling exchange to acquire this information. Based on the D2D users' knowledge of their local CSI, we provide an energy-efficient resource allocation framework that shows how distributed scheduling outperforms the centralized one. In the distributed approach, collisions may occur between the different CSI reports; thus, we propose a collision reduction algorithm. Moreover, we give a detailed description of how both the centralized and distributed algorithms can be implemented in practice. In the third part, we propose a mobile relay selection policy in a D2D relay-aided network. Relays' mobility appears as a crucial challenge for defining the strategy of selecting the optimal D2D relays. The problem is formulated as a constrained POMDP which captures the dynamism of the relays and aims to find the optimal relay selection policy that maximizes the performance of the network under cost constraints.
24

Gonçalves, Luciano Vargas. "Uma arquitetura de Agentes BDI para auto-regulação de Trocas Sociais em Sistemas Multiagentes Abertos." Universidade Catolica de Pelotas, 2009. http://tede.ucpel.edu.br:8080/jspui/handle/tede/105.

Abstract:
The study and development of systems to control interactions in multiagent systems is an open problem in Artificial Intelligence. Piaget's system of social exchange values is a social approach that provides foundations for modeling interactions between agents: interactions are seen as service exchanges between pairs of agents, with an evaluation of the services performed and received (the investments and profits of the exchange) and of the credits and debits to be charged or collected in future exchanges. Agents may perform this evaluation in different ways, since they may have different exchange personality traits. Over the course of an exchange process, these differing evaluations of profits and losses may unbalance the exchange balances, with some agents accumulating profits and others accumulating losses. To address this equilibrium problem, we use Partially Observable Markov Decision Processes (POMDPs) to support the agent's choice of actions that can lead the social exchanges toward equilibrium. Each agent has its own internal process for evaluating the current balance of its exchanges with other agents: observing its internal state and its partner's exchange behavior, it deliberates on the best action to perform in order to bring the exchanges into equilibrium. In an open multiagent system, a mechanism is also needed to recognize the different personality traits, so that the POMDPs managing exchanges between pairs of agents can be built. This recognition task is performed by Hidden Markov Models (HMMs), which, starting from models of known personality traits, approximate the personality traits of new partners simply by analyzing observations of their exchange behavior. The aim of this work is to develop a hybrid agent architecture for the self-regulation of social exchanges between personality-based agents in an open multiagent system, built on the BDI (Beliefs, Desires, Intentions) architecture, where agent plans are obtained from optimal policies of POMDPs that model the personality traits recognized by the HMMs. To evaluate the proposed approach, simulations were run with both known and new personality traits.
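As a rough sketch of the HMM recognition step described above, the following scores an observed exchange sequence against a set of known trait models using the standard forward algorithm and returns the best-matching trait; the model set and its parameterization are hypothetical, not the dissertation's.

def forward_likelihood(A, B, pi, obs):
    # Standard HMM forward algorithm: likelihood of the observation
    # sequence obs under a model with transition matrix A, emission
    # matrix B and initial distribution pi (all nested lists).
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [B[t][o] * sum(alpha[s] * A[s][t] for s in range(n))
                 for t in range(n)]
    return sum(alpha)

def recognize_trait(models, obs):
    # Pick the known personality-trait model under which the partner's
    # observed exchange behaviour is most likely.
    return max(models, key=lambda name: forward_likelihood(*models[name], obs))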
25

Sachan, Mohit. "Learning in Partially Observable Markov Decision Processes." 2013. http://hdl.handle.net/1805/3451.

Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
Learning in Partially Observable Markov Decision Processes (POMDPs) is motivated by the need to address a number of realistic problems. A number of methods exist for learning in POMDPs, but learning with only limited information about the model remains an open challenge. Learning with minimal information is desirable in complex systems, since methods requiring complete information among decision makers become impractical as the problem dimensionality grows. In this thesis we address the problem of decentralized control of POMDPs with unknown transition probabilities and rewards. We suggest learning in POMDPs using a tree-based approach: the states of the POMDP are guessed using this tree, and each node in the tree contains an automaton that acts as a decentralized decision maker for the POMDP. The start state of the POMDP is known as the landmark state. Each automaton in the tree uses a simple learning scheme to update its action choice and requires minimal information. The principal result is that, without knowledge of the transition probabilities and rewards, the tree of automata converges to a set of actions that maximizes the long-term expected reward per unit time obtained by the system. The analysis is based on learning in sequential stochastic games and on properties of ergodic Markov chains. Simulation results compare the long-term rewards of the system under different decision control algorithms.
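The abstract does not spell out the automata's "simple learning scheme"; a classic choice behind convergence results of this kind is the linear reward-inaction update, sketched below under that assumption.

def lri_update(probs, chosen, favourable, a=0.1):
    # Linear reward-inaction (L_R-I) update: after a favourable response
    # the chosen action's probability moves toward 1 and the others shrink
    # proportionally; after an unfavourable response nothing changes.
    if not favourable:
        return probs
    return [p + a * (1.0 - p) if i == chosen else (1.0 - a) * p
            for i, p in enumerate(probs)]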
26

Koltunova, Veronika. "Active Sensing for Partially Observable Markov Decision Processes." Thesis, 2013. http://hdl.handle.net/10012/7222.

Abstract:
Context information on a smartphone can be used to tailor applications to specific situations (e.g., provide routing advice based on location, gas prices and traffic). However, typical context-aware smartphone applications use very limited context information such as user identity, location and time. In the future, smartphones will need to decide which of a wide range of sensors to gather information from in order to best accommodate user needs and preferences in a given context. In this thesis, we present a model for active sensor selection within decision-making processes, in which observational features are selected based on their longer-term impact on the decisions made by the smartphone. The thesis formulates the problem as a partially observable Markov decision process (POMDP) and proposes a non-myopic solution using a state-of-the-art approximate planning algorithm, Symbolic Perseus. We have tested the method on three small example domains, comparing different policy types, discount factors and cost settings. The experimental results show that the proposed approach delivers a better policy when sensors are costly, while also computing policies faster and with less memory.
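The thesis solves sensor selection non-myopically with Symbolic Perseus; for contrast, here is a minimal myopic baseline that picks the sensor whose expected entropy reduction best justifies its cost. The observation models, costs and static-state assumption are all illustrative.

import math

def entropy(b):
    return -sum(p * math.log(p) for p in b if p > 0)

def expected_posterior_entropy(b, O):
    # O[o][s] = P(o | s) for the candidate sensor; average the posterior
    # entropy over the observations the sensor might return.
    h = 0.0
    for dist in O:
        p_o = sum(dist[s] * b[s] for s in range(len(b)))
        if p_o > 0:
            posterior = [dist[s] * b[s] / p_o for s in range(len(b))]
            h += p_o * entropy(posterior)
    return h

def pick_sensor(belief, sensors, costs):
    # Myopic rule: maximise expected information gain minus sensing cost.
    def net_gain(name):
        gain = entropy(belief) - expected_posterior_entropy(belief, sensors[name])
        return gain - costs[name]
    return max(sensors, key=net_gain)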
27

Aberdeen, Douglas. "Policy-Gradient Algorithms for Partially Observable Markov Decision Processes." Phd thesis, 2003. http://hdl.handle.net/1885/48180.

Abstract:
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs). In the most general case, POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember.
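A minimal sketch of the Monte-Carlo policy-gradient estimator family this work builds on (GPOMDP-style), assuming generic env.reset/env.step and policy callables that stand in for any real interface:

import numpy as np

def gpomdp_gradient(env, policy, grad_log_pi, theta, beta=0.95, T=100000):
    # An eligibility trace z accumulates discounted score-function terms;
    # the running average of reward * z estimates the gradient of the
    # long-run average reward with respect to the policy parameters theta.
    z = np.zeros_like(theta)
    grad = np.zeros_like(theta)
    obs = env.reset()
    for t in range(1, T + 1):
        action = policy(obs, theta)
        z = beta * z + grad_log_pi(obs, action, theta)
        obs, reward = env.step(action)
        grad += (reward * z - grad) / t  # incremental average
    return grad

The discount beta is the bias-variance knob: values near 1 reduce bias but inflate exactly the variance that the thesis's two reduction methods target.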
28

Kinathil, Shamin. "Closed-form Solutions to Sequential Decision Making within Markets." Phd thesis, 2018. http://hdl.handle.net/1885/186490.

Abstract:
Sequential decision making is a pervasive and inescapable requirement of everyday life. Deciding which sequence of actions to take is complicated by incomplete information about the environment, the effects of each decision on the future state of the environment, ill-defined objectives and our own cognitive limitations. These challenges are exacerbated in financial markets, which are in a constant state of flux, with prices adjusting to new information, winning traders replacing losing traders, and new technologies being introduced. Decision-theoretic planning provides powerful and flexible frameworks for many sequential decision making problems within financial markets. Markov Decision Processes (MDPs) and Partially Observable Markov Decision Processes (POMDPs) are two such frameworks, which can model financial decision making problems with discrete variables, continuous variables, or both. Traditional methods for solving MDPs and POMDPs use Monte-Carlo techniques or discretise the continuous state space, often to the detriment of solution quality; in many financial domains, such as those involving asset prices, solutions are required to be exact and closed-form. Closed-form solutions, by virtue of being declarative and symbolic, allow the decision maker to undertake further analysis and optimisation. In this thesis we present novel techniques to calculate exact and closed-form solutions to new classes of MDPs and POMDPs by leveraging Symbolic Dynamic Programming (SDP). Our specific contributions include: (i) closed-form solutions to a subclass of Continuous Stochastic Games, which we use to encapsulate and solve a zero-sum game between a Binary Option trader and an adversarial market; (ii) closed-form solutions to Sequential Market Making with Inventory, which extends seminal works from market microstructure theory and algorithmic market-making; and (iii) Analytic Decision Analysis for Parameterised Hybrid MDPs, which we use to examine the sensitivity of a model for optimal transaction execution to its parameters. Our experimental evaluations confirm the efficacy of our novel solutions for a range of new classes of MDPs and POMDPs.
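To give a toy flavour of what a closed-form symbolic Bellman backup produces, the snippet below uses sympy's piecewise case calculus; the two actions and their payoffs are invented for illustration and are unrelated to the thesis's market models.

import sympy as sp

x = sp.Symbol('x')  # continuous state, e.g. an asset price

# Hypothetical one-step payoffs of two discrete actions, piecewise in x.
reward_a = sp.Piecewise((x - 1, x >= 1), (0, True))
reward_b = sp.Piecewise((1 - x, x < 1), (0, True))

# One symbolic backup: V(x) = max over actions of the payoff, kept as a
# closed-form piecewise expression instead of being sampled or discretised.
V = sp.Max(reward_a, reward_b)
print(sp.simplify(V))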
29

Daswani, Mayank. "Generic Reinforcement Learning Beyond Small MDPs." Phd thesis, 2015. http://hdl.handle.net/1885/110545.

Abstract:
Feature reinforcement learning (FRL) is a framework within which an agent can automatically reduce a complex environment to a Markov Decision Process (MDP) by finding a map which aggregates similar histories into the states of an MDP. The primary motivation behind this thesis is to build FRL agents that work in practice, both for larger environments and larger classes of environments. We focus on empirical work targeted at practitioners in the field of general reinforcement learning, with theoretical results wherever necessary. The current state of the art in FRL uses suffix trees, which have issues with large observation spaces and long-term dependencies. We start by addressing the issue of long-term dependencies using a class of maps known as looping suffix trees, which have previously been used to represent deterministic POMDPs. We show the best existing results on the TMaze domain and good results on larger domains that require long-term memory. We introduce a new value-based cost function that can be evaluated model-free. The value-based cost allows for smaller representations, and its model-free nature allows for its extension to the function approximation setting, which has computational and representational advantages for large state spaces. We evaluate the performance of this new cost in both the tabular and function approximation settings on a variety of domains, and show performance better than the state-of-the-art algorithm MC-AIXI-CTW on the domain POCMAN. When the environment is very large, an FRL agent needs to explore systematically in order to find a good representation; however, it needs a good representation in order to explore systematically. We decouple the two by considering a different setting, one where the agent has access to the value of any state-action pair from an oracle in a training phase. The agent must learn an approximate representation of the optimal value function. We formulate a regression-based solution based on online learning methods to build such an agent, and test it on the Arcade Learning Environment using a simple class of linear function approximators. While we made progress on the issue of scalability, two major issues with the FRL framework remain: the need for a stochastic search method to minimise the objective function, and the need to store an uncompressed history, both of which can be very computationally demanding.
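The function-approximation setting mentioned above can be pictured with a single linear Q-learning step; the feature vectors are assumed to come from some history-based feature map, not from the thesis's value-based cost.

import numpy as np

def td_update(w, phi_sa, reward, phi_next_greedy, alpha=0.01, gamma=0.99):
    # One Q-learning step with a linear approximator Q(s, a) = w . phi(s, a):
    # phi_sa is the feature vector of the current state-action pair and
    # phi_next_greedy that of the greedy pair in the successor state.
    td_error = reward + gamma * w.dot(phi_next_greedy) - w.dot(phi_sa)
    return w + alpha * td_error * phi_sa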
30

Poupart, Pascal. "Exploiting structure to efficiently solve large scale partially observable Markov decision processes." 2005. http://link.library.utoronto.ca/eir/EIRdetail.cfm?Resources__ID=232732&T=F.

31

Leung, Siu-Ki. "Exploring partially observable Markov decision processes by exploiting structure and heuristic information." Thesis, 1996. http://hdl.handle.net/2429/5772.

Abstract:
This thesis is about chance and choice, or decisions under uncertainty. The desire to create an intelligent agent performing rewarding tasks in a realistic world calls for working models for sequential decision making and planning. In response to this grand wish, decision-theoretic planning (DTP) has evolved from decision theory and control theory, and has been applied to planning in artificial intelligence. Recent interest has been directed toward Markov Decision Processes (MDPs), introduced from operations research. While fruitful results have been reaped from research in fully observable MDPs, partially observable MDPs (POMDPs) are still too difficult to solve once observation uncertainties are incorporated, and abstraction and approximation techniques become the focus. This research attempts to enhance POMDPs by applying AI techniques. In particular, we transform the linear POMDP constructs into a structured representation based on binary decision trees and Bayesian networks to achieve compactness. A handful of tree-oriented operations is then developed to perform structural belief updates and value computation. Along with the structured representation, we explore the belief space with a heuristic online search approach, in which a best-first search strategy with heuristic pruning is employed. Experimenting with a structured testbed domain reveals great potential for exploiting structure and heuristics to empower POMDPs for more practical applications.
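The online search loop described above can be pictured as a standard best-first skeleton over belief nodes; expand and score are assumed callbacks (successor generation and the heuristic) rather than the thesis's actual operators.

import heapq
import itertools

def best_first_search(root, expand, score, max_expansions=10000):
    # Best-first exploration: always expand the node the heuristic ranks
    # highest; the counter breaks ties so nodes themselves are never compared.
    counter = itertools.count()
    frontier = [(-score(root), next(counter), root)]
    best, best_score = root, score(root)
    expanded = 0
    while frontier and expanded < max_expansions:
        neg_score, _, node = heapq.heappop(frontier)
        expanded += 1
        if -neg_score > best_score:
            best, best_score = node, -neg_score
        for child in expand(node):
            heapq.heappush(frontier, (-score(child), next(counter), child))
    return best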
32

Poupart, Pascal. "Approximate value-directed belief state monitoring for partially observable Markov decision processes." Thesis, 2000. http://hdl.handle.net/2429/11462.

Abstract:
Partially observable Markov decision processes (POMDPs) provide a principled approach to planning under uncertainty. Unfortunately, several sources of intractability currently limit the application of POMDPs to simple problems. This thesis is concerned with one source of intractability in particular: the belief state monitoring task. As an agent executes a plan, it must track the state of the world by updating its beliefs with respect to the current state; based on its current beliefs, the agent can then look up the next action to execute in its plan. In many situations, an agent may be required to decide in real time which action to execute next, so efficient algorithms to update the current belief state are desirable. Unfortunately, exact belief state monitoring turns out to be very time consuming for many domains. This thesis introduces a value-directed framework to analyze and design approximation methods that speed up the monitoring task. The goal of approximate belief state monitoring is to trade monitoring accuracy for efficiency; the framework thus outlines a principled approach to quantify the impact of approximating belief states on the original plan. Since at any point in time an action is executed based on the current belief state, a less desirable action may end up being executed as a result of the approximations used to infer the current belief state. The framework developed covers a wide range of approximation methods, including projection schemes and density trees. First, several bounds on the loss in decision quality due to approximate belief state monitoring are derived. Then, given a class of approximation methods, a few search algorithms are proposed to seek a relatively good approximation scheme within the given class; these algorithms essentially try to minimize the derived bounds. Next, a vector space analysis is performed to gain insight into which properties of approximation methods are likely to ensure a minimal impact on decision quality. Finally, faster algorithms than the previous ones are designed to search for approximation methods that exhibit such properties.
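For concreteness, here is the exact monitoring step being approximated, followed by one crude approximation of the flavour the thesis analyses; the truncation scheme is purely illustrative, not the thesis's projection or density-tree methods.

def belief_update(b, T, O, a, o):
    # Exact Bayesian monitoring: b'(s') is proportional to
    # O[a][o][s'] * sum_s T[a][s][s'] * b[s], then renormalised.
    n = len(b)
    nb = [O[a][o][s2] * sum(T[a][s][s2] * b[s] for s in range(n))
          for s2 in range(n)]
    z = sum(nb)
    return [p / z for p in nb]

def truncate(b, eps=1e-3):
    # Approximate monitoring: zero out negligible beliefs and renormalise,
    # trading accuracy for speed; bounding the decision-quality loss of
    # schemes like this is exactly what a value-directed framework does.
    kept = [p if p >= eps else 0.0 for p in b]
    z = sum(kept)
    return [p / z for p in kept]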
33

Amato, Christopher. "Increasing scalability in algorithms for centralized and decentralized partially observable Markov decision processes: Efficient decision-making and coordination in uncertain environments." 2010. https://scholarworks.umass.edu/dissertations/AAI3427492.

Abstract:
As agents are built for ever more complex environments, methods that consider the uncertainty in the system have strong advantages. This uncertainty is common in domains such as robot navigation, medical diagnosis and treatment, inventory management, sensor networks and e-commerce. When a single decision maker is present, the partially observable Markov decision process (POMDP) model is a popular and powerful choice. When choices are made in a decentralized manner by a set of decision makers, the problem can be modeled as a decentralized partially observable Markov decision process (DEC-POMDP). While POMDPs and DEC-POMDPs offer rich frameworks for sequential decision making under uncertainty, the computational complexity of each model presents an important research challenge. As a way to address this high complexity, this thesis develops several solution methods based on utilizing domain structure, memory-bounded representations and sampling. These approaches address some of the major bottlenecks for decision-making in real-world uncertain systems. The methods include a more efficient optimal algorithm for DEC-POMDPs as well as scalable approximate algorithms for POMDPs and DEC-POMDPs. Key contributions include optimizing compact representations as well as automatic structure extraction and exploitation. These approaches increase the scalability of algorithms, while also increasing their solution quality.
34

Goswami, Anindya. "Semi-Markov Processes In Dynamic Games And Finance." Thesis, 2008. https://etd.iisc.ac.in/handle/2005/727.

Abstract:
Two different sets of problems are addressed in this thesis. The first is on partially observed semi-Markov games (POSMG) and the second on a semi-Markov modulated financial market model. We study a partially observable semi-Markov game over an infinite time horizon. The study of a partially observable game (POG) involves three major steps: (i) construct an equivalent completely observable game (COG); (ii) establish the equivalence between the POG and the COG by showing that if the COG admits an equilibrium, the POG does so too; (iii) study the equilibrium of the COG and find the corresponding equilibrium of the original partially observable problem. For infinite-horizon games there are two payoff criteria: the discounted payoff criterion and the average payoff criterion. First, a partially observable semi-Markov decision process on a general state space with the discounted cost criterion is studied; an optimal policy is shown to exist by considering a Shapley equation for the corresponding completely observable model. Next the discounted payoff problem is studied for the two-person zero-sum case, and a saddle point equilibrium is shown to exist. Then the variable-sum game is investigated; for this case the Nash equilibrium strategy is obtained in the Markov class under suitable assumptions. Next the POSMG problem on a countable state space is addressed for the average payoff criterion. It is well known that under this criterion the game problem does not have a solution in general; to ensure a solution one needs some kind of ergodicity of the transition kernel. We find an appropriate ergodicity of the partially observed model which in turn induces geometric ergodicity in the equivalent model. Using this we establish a solution of the corresponding average payoff optimality equation (APOE); thus the value and a saddle point equilibrium are obtained for the original partially observable model. A value iteration scheme is also developed to find the average value of the game. Next we study a financial market model whose key parameters are modulated by semi-Markov processes. Two different problems are addressed under this market assumption. In the first we show that this market is incomplete. In such an incomplete market we find the locally risk minimizing prices of exotic options in the Föllmer-Schweizer framework. In this model the stock prices are no longer Markov. Stock price processes are generally modeled as Markov processes because otherwise one may not obtain a PDE representation of the price of a contingent claim. To overcome this difficulty we find an appropriate Markov process which includes the stock price as a component and then find its infinitesimal generator. Using the Feynman-Kac formula we obtain a system of non-local partial differential equations satisfied by the option price functions in the mild sense. This system is then shown to have a classical solution for given initial or boundary conditions, and the solution is used to obtain a Föllmer-Schweizer decomposition of the option price. Thus we obtain the locally risk minimizing prices of different options. Furthermore we obtain an integral equation satisfied by the unique solution of this system, which enables us to compute the price of a contingent claim and find the risk minimizing hedging strategy numerically. We further develop an efficient and stable numerical method to compute the prices.
Besides this work on derivative pricing, the portfolio optimization problem in a semi-Markov modulated market is also studied in this thesis. We find the optimal portfolio selections by optimizing the expected utility of terminal wealth, and also obtain the optimal portfolio selections under a risk sensitive criterion for both finite and infinite time horizons.
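In the simpler discrete-time, zero-sum, discounted setting, the Shapley equation invoked above has the familiar form

v(x) = \operatorname{val}_{(a,b)}\Big[\, r(x,a,b) + \alpha \sum_{y} q(y \mid x,a,b)\, v(y) \,\Big],

where val is the value of the one-shot matrix game at state x, r the running payoff, q the transition kernel and \alpha \in (0,1) the discount factor; the thesis's semi-Markov version additionally discounts over random sojourn times, which this generic sketch omits.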
