Dissertations / Theses on the topic 'Reinforcement Learning'

Consult the top 50 dissertations / theses for your research on the topic 'Reinforcement Learning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Izquierdo, Ayala Pablo. "Learning comparison: Reinforcement Learning vs Inverse Reinforcement Learning : How well does inverse reinforcement learning perform in simple markov decision processes in comparison to reinforcement learning?" Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-259371.

Full text
Abstract:
This research project presents a qualitative comparison between two learning approaches, Reinforcement Learning (RL) and Inverse Reinforcement Learning (IRL), over the Gridworld Markov Decision Process. The focus is set on the second learning paradigm, IRL, as it is relatively new and little work has been done in this field of study. As observed, RL outperforms IRL, obtaining a correct solution in all the scenarios studied. However, the behaviour of the IRL algorithms can be improved, and this is shown and analyzed as part of the scope of the work.
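To make the tabular side of such a comparison concrete, a minimal Q-learning loop on a toy Gridworld might look as follows (the grid size, rewards and hyperparameters are illustrative assumptions, not the setup used in the thesis):

import random

# Illustrative 4x4 Gridworld: states are (row, col), goal in the bottom-right corner.
SIZE, GOAL = 4, (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(state, action):
    r, c = state
    dr, dc = action
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    return nxt, (1.0 if nxt == GOAL else -0.04), nxt == GOAL

Q = {((r, c), a): 0.0 for r in range(SIZE) for c in range(SIZE) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.95, 0.1            # assumed hyperparameters

for episode in range(3000):
    s = (0, 0)
    for _ in range(100):                       # cap episode length
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda m: (Q[(s, m)], random.random()))  # greedy, random tie-break
        s2, reward, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, m)] for m in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])     # Q-learning update
        s = s2
        if done:
            break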
APA, Harvard, Vancouver, ISO, and other styles
2

Seymour, B. J. "Aversive reinforcement learning." Thesis, University College London (University of London), 2010. http://discovery.ucl.ac.uk/800107/.

Full text
Abstract:
We hypothesise that human aversive learning can be described algorithmically by Reinforcement Learning models. Our first experiment uses a second-order conditioning design to study sequential outcome prediction. We show that aversive prediction errors are expressed robustly in the ventral striatum, supporting the validity of temporal difference algorithms (as in reward learning), and suggesting a putative critical area for appetitive-aversive interactions. With this in mind, the second experiment explores the nature of pain relief, which, as expounded in theories of motivational opponency, is rewarding. In a Pavlovian conditioning task with phasic relief of tonic noxious thermal stimulation, we show that both appetitive and aversive prediction errors are co-expressed in anatomically dissociable regions (in a mirror opponent pattern) and that striatal activity appears to reflect integrated appetitive-aversive processing. Next we designed a Pavlovian task in which cues predicted either financial gains, losses, or both, thereby forcing integration of both motivational streams. This showed anatomical dissociation of aversive and appetitive predictions along a posterior-anterior gradient within the striatum. Lastly, we studied aversive instrumental control (avoidance). We designed a simultaneous pain avoidance and financial reward learning task, in which subjects had to independently learn about each, and trade off aversive and appetitive predictions. We show that predictions for both converge on the medial head of the caudate nucleus, suggesting that this is a critical site for appetitive-aversive integration in instrumental decision making. We also tested whether serotonin (5HT) modulates either phasic or tonic opponency using acute tryptophan depletion. Both behavioural and imaging data confirm the latter, in which it appears to mediate an average reward term, providing an aspiration level against which the benefits of exploration are judged. In summary, our data provide a basic computational and neuroanatomical framework for human aversive learning. We demonstrate the algorithmic and implementational validity of reinforcement learning models for both aversive prediction and control, illustrate the nature and neuroanatomy of appetitive-aversive integration, and discover the critical (and somewhat unexpected) central role of the striatum.
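For reference, the temporal difference algorithms mentioned above are built around a prediction error of the standard form (standard notation, not taken from the thesis), and it is the neural correlate of this quantity that is examined in the striatum:

\[ \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \qquad V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t . \]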
APA, Harvard, Vancouver, ISO, and other styles
3

Akrour, Riad. "Robust Preference Learning-based Reinforcement Learning." Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112236/document.

Full text
Abstract:
The contributions of this thesis revolve around sequential decision making and more precisely Reinforcement Learning (RL). Taking its roots in Machine Learning in the same way as supervised and unsupervised learning, RL quickly grew in popularity over the last two decades due to a handful of achievements on both the theoretical and applicative fronts. RL supposes that the learning agent and its environment follow a stochastic Markovian decision process over a state and action space. The process is a decision process because the agent is asked to choose, at each time step, an action to take. It is stochastic because selecting a given action in a given state does not systematically yield the same next state but rather defines a distribution over the state space. It is Markovian because this distribution only depends on the current state-action pair. Consequent to the choice of an action, the agent receives a reward. The goal of RL is then to solve the underlying optimization problem of finding the behaviour that maximizes the sum of rewards along the interaction of the agent with its environment. From an applicative point of view, a large spectrum of problems can be cast as an RL problem, from Backgammon (TD-Gammon, one of Machine Learning's first successes, gave rise to a world-class player) to decision problems in the industrial and medical worlds. However, the optimization problem solved by RL depends on the prior definition of a reward function, which requires a certain level of domain expertise as well as knowledge of the internal quirks of RL algorithms. As such, the first contribution of the thesis is to propose a learning framework that lightens the requirements made of the user. The user no longer needs to know the exact solution of the problem but only to be able to choose, between two behaviours exhibited by the agent, the one that more closely matches the solution. Learning is interactive between the agent and the user and revolves around the following three main points: i) the agent demonstrates a behaviour; ii) the user compares it to the current best one; iii) the agent uses this feedback to update its preference model of the user and uses it to find the next behaviour to demonstrate. To reduce the number of interactions required before finding the optimal behaviour, the second contribution of the thesis is to define a theoretically sound criterion making the trade-off between the sometimes contradicting desires of complying with the user's preferences and demonstrating sufficiently different behaviours. The last contribution is to ensure the robustness of the algorithm with respect to the feedback errors that the user might make, which happen more often than not in practice, especially in the initial phase of the interaction, when all the behaviours are far from the expected solution.
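A minimal sketch of the interaction loop in points i)-iii) is given below; the behaviour sampler, the comparison oracle and the update rule are hypothetical placeholders, and the thesis's diversity criterion and robustness mechanisms are not reproduced:

def preference_based_rl(sample_behaviour, user_prefers, update_model, n_rounds=50):
    """Generic interactive loop: the agent proposes behaviours, the user only
    compares them pairwise, and a preference model is refined from that feedback."""
    model = None                       # the agent's current model of the user's preferences
    best = sample_behaviour(model)     # initial candidate behaviour
    for _ in range(n_rounds):
        candidate = sample_behaviour(model)                              # i) exhibit a new behaviour
        if user_prefers(candidate, best):                                # ii) user compares to best so far
            model = update_model(model, winner=candidate, loser=best)    # iii) update the preference model
            best = candidate
        else:
            model = update_model(model, winner=best, loser=candidate)
    return best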
APA, Harvard, Vancouver, ISO, and other styles
4

Tabell, Johnsson Marco, and Ala Jafar. "Efficiency Comparison Between Curriculum Reinforcement Learning & Reinforcement Learning Using ML-Agents." Thesis, Blekinge Tekniska Högskola, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-20218.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Yang, Zhaoyuan Yang. "Adversarial Reinforcement Learning for Control System Design: A Deep Reinforcement Learning Approach." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu152411491981452.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Cortesi, Daniele. "Reinforcement Learning in Rogue." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amslaurea.unibo.it/16138/.

Full text
Abstract:
In this work we use Reinforcement Learning to play the famous Rogue, a dungeon-crawler video game and the father of the rogue-like genre. By employing different algorithms we substantially improve on the results obtained in previous work, addressing and solving the problems that had arisen. We then devise and perform new experiments to test the limits of our own solution, and encounter additional and unexpected issues in the process. In one of the investigated scenarios we clearly see that our approach is not yet enough to perform better than a random agent, and we propose ideas for future work.
APA, Harvard, Vancouver, ISO, and other styles
7

Girgin, Sertan. "Abstraction In Reinforcement Learning." Phd thesis, METU, 2007. http://etd.lib.metu.edu.tr/upload/12608257/index.pdf.

Full text
Abstract:
Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. Generally, the problem to be solved contains subtasks that repeat at different regions of the state space. Without any guidance an agent has to learn the solutions of all subtask instances independently, which degrades the learning performance. In this thesis, we propose two approaches that build connections between different regions of the search space, leading to better utilization of gained experience and accelerated learning. In the first approach, we extend the existing work of McGovern and propose the formalization of stochastic conditionally terminating sequences with higher representational power. Then, we describe how to efficiently discover and employ useful abstractions during learning based on such sequences. The method constructs a tree structure to keep track of frequently used action sequences together with visited states. This tree is then used to select actions to be executed at each step. In the second approach, we propose a novel method to identify states with similar sub-policies, and show how they can be integrated into the reinforcement learning framework to improve the learning performance. The method uses an efficient data structure to find common action sequences started from observed states and defines a similarity function between states based on the number of such sequences. Using this similarity function, updates on the action-value function of a state are reflected to all similar states. This, consequently, allows experience acquired during learning to be applied to a broader context. The effectiveness of both approaches is demonstrated empirically by conducting extensive experiments on various domains.
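As an illustration of the second approach, reflecting an action-value update onto similar states could be sketched as follows (the similarity function and data structures are placeholders, not the thesis's sequence-tree implementation):

def reflected_update(Q, similar_states, s, a, delta):
    """Apply a temporal-difference update at (s, a) and propagate it, weighted by
    similarity, to states that share common action sequences with s."""
    Q[(s, a)] += delta
    for s2, sim in similar_states(s):    # sim in [0, 1], e.g. from counting shared action sequences
        Q[(s2, a)] += sim * delta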
APA, Harvard, Vancouver, ISO, and other styles
8

Suay, Halit Bener. "Reinforcement Learning from Demonstration." Digital WPI, 2016. https://digitalcommons.wpi.edu/etd-dissertations/173.

Full text
Abstract:
Off-the-shelf Reinforcement Learning (RL) algorithms suffer from slow learning performance, partly because they are expected to learn a task from scratch merely through an agent's own experience. In this thesis, we show that learning from scratch is a limiting factor for learning performance, and that when prior knowledge is available RL agents can learn a task faster. We evaluate relevant previous work and our own algorithms in various experiments. Our first contribution is the first implementation and evaluation of an existing interactive RL algorithm in a real-world domain with a humanoid robot. Interactive RL had previously been evaluated in a simulated domain, which motivated us to evaluate its practicality on a robot. Our evaluation shows that guidance reduces learning time, and that its positive effects increase with state space size. A natural follow-up question after our first evaluation was how other previous works compare to interactive RL. Our second contribution is an analysis of a user study in which naïve human teachers demonstrated a real-world object-catching task with a humanoid robot. We present the first comparison of several previous works in a common real-world domain with a user study. One conclusion of the user study was the high potential of RL despite poor usability due to its slow learning rate. As an effort to improve the learning efficiency of RL learners, our third contribution is a novel human-agent knowledge transfer algorithm. Using demonstrations from three teachers with varying expertise in a simulated domain, we show that regardless of the skill level, human demonstrations can improve the asymptotic performance of an RL agent. As an alternative approach for encoding human knowledge in RL, we investigated the use of reward shaping. Our final contributions are the Static Inverse Reinforcement Learning Shaping and Dynamic Inverse Reinforcement Learning Shaping algorithms, which use human demonstrations to recover a shaping reward function. Our experiments in simulated domains show that our approach outperforms the state of the art in cumulative reward, learning rate and asymptotic performance. Overall we show that human demonstrators with varying skills can help RL agents to learn tasks more efficiently.
APA, Harvard, Vancouver, ISO, and other styles
9

Gao, Yang. "Argumentation accelerated reinforcement learning." Thesis, Imperial College London, 2014. http://hdl.handle.net/10044/1/26603.

Full text
Abstract:
Reinforcement Learning (RL) is a popular statistical Artificial Intelligence (AI) technique for building autonomous agents, but it suffers from the curse of dimensionality: the computational requirement for obtaining the optimal policies grows exponentially with the size of the state space. Integrating heuristics into RL has proven to be an effective approach to combat this curse, but deriving high-quality heuristics from people's (typically conflicting) domain knowledge is challenging, yet it has received little research attention. Argumentation theory is a logic-based AI technique well known for its conflict resolution capability and intuitive appeal. In this thesis, we investigate the integration of argumentation frameworks into RL algorithms, so as to improve the convergence speed of RL algorithms. In particular, we propose a variant of the Value-based Argumentation Framework (VAF) to represent domain knowledge and to derive heuristics from this knowledge. We prove that the heuristics derived from this framework can effectively instruct individual learning agents as well as multiple cooperative learning agents. In addition, we propose the Argumentation Accelerated RL (AARL) framework to integrate these heuristics into different RL algorithms via Potential Based Reward Shaping (PBRS) techniques: we use classical PBRS techniques for flat-RL (e.g. SARSA(λ)) based AARL, and propose a novel PBRS technique for MAXQ-0, a hierarchical RL (HRL) algorithm, so as to implement HRL-based AARL. We empirically test two AARL implementations - SARSA(λ)-based AARL and MAXQ-based AARL - in multiple application domains, including single-agent and multi-agent learning problems. Empirical results indicate that AARL can improve the convergence speed of RL, and can also be easily used by people who have little background in Argumentation and RL.
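The classical Potential Based Reward Shaping referred to here has the standard form below, in which a potential function Φ (derived in this work from the argumentation framework) augments the environment reward; the thesis's novel PBRS variant for MAXQ-0 is not shown:

\[ F(s, s') = \gamma\,\Phi(s') - \Phi(s), \qquad \tilde{r}(s, a, s') = r(s, a, s') + F(s, s') . \]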
APA, Harvard, Vancouver, ISO, and other styles
10

Alexander, John W. "Transfer in reinforcement learning." Thesis, University of Aberdeen, 2015. http://digitool.abdn.ac.uk:80/webclient/DeliveryManager?pid=227908.

Full text
Abstract:
The problem of developing skill repertoires autonomously in robotics and artificial intelligence is becoming ever more pressing. Currently, the issues of how to apply prior knowledge to new situations and which knowledge to apply have not been sufficiently studied. We present a transfer setting where a reinforcement learning agent faces multiple problem-solving tasks drawn from an unknown generative process, where each task has similar dynamics. The task dynamics are changed by varying the transition function between states. The tasks are presented sequentially, with the latest task presented considered as the target for transfer. We describe two approaches to solving this problem. Firstly, we present an algorithm for transfer of the function encoding the state-action value, defined as value function transfer. This algorithm uses the value function of a source policy to initialise the policy of a target task. We varied the type of basis the algorithm used to approximate the value function. Empirical results in several well-known domains showed that the learners benefited from the transfer in the majority of cases. Results also showed that the Radial basis performed better in general than the Fourier. However, contrary to expectation, the Fourier basis benefited most from the transfer. Secondly, we present an algorithm for learning an informative prior which encodes beliefs about the underlying dynamics shared across all tasks. We call this agent the Informative Prior agent (IP). The prior is learnt through experience and captures the commonalities in the transition dynamics of the domain and allows for a quantification of the agent's uncertainty about these. By using a sparse distribution of the uncertainty in the dynamics as a prior, the IP agent can successfully learn a model of 1) the set of feasible transitions rather than the set of possible transitions, and 2) the likelihood of each of the feasible transitions. Analysis focusing on the accuracy of the learned model showed that IP had a very good accuracy bound, which is expressible in terms of only the permissible error and the diffusion, a factor that describes the concentration of the prior mass around the truth and which decreases as the number of tasks experienced grows. The empirical evaluation of IP showed that an agent which uses the informative prior outperforms several existing Bayesian reinforcement learning algorithms on tasks with shared structure in a domain where multiple related tasks were presented only once to the learners. IP is a step towards the autonomous acquisition of behaviours in artificial intelligence. IP also provides a contribution towards the analysis of exploration and exploitation in the transfer paradigm.
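A minimal sketch of value function transfer with a linear basis is shown below; the environment interface (reset/step returning next state, reward and a done flag), the basis function and the hyperparameters are assumptions, and the Radial and Fourier bases from the thesis are not reproduced:

import numpy as np

def transfer_and_learn(source_weights, basis, env, alpha=0.01, gamma=0.99, episodes=100):
    """Value function transfer sketch: the target task's linear Q-approximation is
    initialised with the source task's weights and then refined by ordinary TD learning."""
    w = np.array(source_weights, dtype=float)       # one weight vector per action, copied from the source
    n_actions = w.shape[0]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            q = np.array([w[a] @ basis(s) for a in range(n_actions)])
            a = int(np.argmax(q))                   # greedy action under the transferred estimate
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(w[b] @ basis(s2) for b in range(n_actions))
            w[a] += alpha * (target - w[a] @ basis(s)) * basis(s)   # semi-gradient update
            s = s2
    return w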
APA, Harvard, Vancouver, ISO, and other styles
11

Leslie, David S. "Reinforcement learning in games." Thesis, University of Bristol, 2004. http://hdl.handle.net/1983/420b3f4b-a8b3-4a65-be23-6d21f6785364.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Schneider, Markus. "Reinforcement Learning für Laufroboter." [S.l. : s.n.], 2007. http://nbn-resolving.de/urn:nbn:de:bsz:747-opus-344.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Wülfing, Jan [Verfasser], and Martin [Akademischer Betreuer] Riedmiller. "Stable deep reinforcement learning." Freiburg : Universität, 2019. http://d-nb.info/1204826188/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Zhang, Jingwei [Verfasser], and Wolfram [Akademischer Betreuer] Burgard. "Learning navigation policies with deep reinforcement learning." Freiburg : Universität, 2021. http://d-nb.info/1235325571/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Rottmann, Axel [Verfasser], and Wolfram [Akademischer Betreuer] Burgard. "Approaches to online reinforcement learning for miniature airships = Online Reinforcement Learning Verfahren für Miniaturluftschiffe." Freiburg : Universität, 2012. http://d-nb.info/1123473560/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Hengst, Bernhard Computer Science & Engineering Faculty of Engineering UNSW. "Discovering hierarchy in reinforcement learning." Awarded by: University of New South Wales. Computer Science and Engineering, 2003. http://handle.unsw.edu.au/1959.4/20497.

Full text
Abstract:
This thesis addresses the open problem of automatically discovering hierarchical structure in reinforcement learning. Current algorithms for reinforcement learning fail to scale as problems become more complex. Many complex environments empirically exhibit hierarchy and can be modeled as interrelated subsystems, each in turn with hierarchic structure. Subsystems are often repetitive in time and space, meaning that they reoccur as components of different tasks or occur multiple times in different circumstances in the environment. A learning agent may sometimes scale to larger problems if it successfully exploits this repetition. Evidence suggests that a bottom up approach that repetitively finds building-blocks at one level of abstraction and uses them as background knowledge at the next level of abstraction, makes learning in many complex environments tractable. An algorithm, called HEXQ, is described that automatically decomposes and solves a multi-dimensional Markov decision problem (MDP) by constructing a multi-level hierarchy of interlinked subtasks without being given the model beforehand. The effectiveness and efficiency of the HEXQ decomposition depends largely on the choice of representation in terms of the variables, their temporal relationship and whether the problem exhibits a type of constrained stochasticity. The algorithm is first developed for stochastic shortest path problems and then extended to infinite horizon problems. The operation of the algorithm is demonstrated using a number of examples including a taxi domain, various navigation tasks, the Towers of Hanoi and a larger sporting problem. The main contributions of the thesis are the automation of (1) decomposition, (2) sub-goal identification, and (3) discovery of hierarchical structure for MDPs with states described by a number of variables or features. It points the way to further scaling opportunities that encompass approximations, partial observability, selective perception, relational representations and planning. The longer term research aim is to train rather than program intelligent agents.
APA, Harvard, Vancouver, ISO, and other styles
17

Blixt, Rikard, and Anders Ye. "Reinforcement learning AI to Hive." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-134908.

Full text
Abstract:
This report is about the game Hive, a unique board game. We first describe what Hive is, then detail our implementation of it, the issues we ran into during the implementation, and how we solved those issues. We also attempted to build an AI and, by using reinforcement learning, teach it to become good at playing Hive. More precisely, we used two AIs that have no knowledge of Hive other than the game rules. This, however, turned out to be infeasible within a reasonable timeframe; our estimate is that it would have to run on a high-end home computer for at least 140 years to become decent at playing the game.
APA, Harvard, Vancouver, ISO, and other styles
18

Borgstrand, Richard, and Patrik Servin. "Reinforcement Learning AI till Fightingspel." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-3113.

Full text
Abstract:
The project consisted of implementing two fighting-game Artificial Intelligences (abbreviated AI): one non-adaptive, more deterministic AI, and one adaptive, dynamic AI that uses reinforcement learning. This was done by scripting the behaviour of the AI in a free 2D fighting-game engine called "MUGEN". The AI uses scripted sequences that are executed through MUGEN's own trigger and state system. This system checks whether the scripted, specified conditions are fulfilled for the AI to "trigger", that is, to perform the designated action. The more static AI was built from hand-crafted sequences and rules that are executed partly based on the situation and partly at random. To approximate a reinforcement learning AI, the sequences were assigned a variable that increases the probability of performing an action when that action has led to something positive, and decreases it when the action has caused something negative.
APA, Harvard, Vancouver, ISO, and other styles
19

Arnekvist, Isac. "Reinforcement learning for robotic manipulation." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-216386.

Full text
Abstract:
Reinforcement learning was recently successfully used for real-world robotic manipulation tasks, without the need for human demonstration, using a normalized advantage function algorithm (NAF). Limitations on the shape of the advantage function, however, pose doubts as to what kind of policies can be learned using this method. For similar tasks, convolutional neural networks have been used for pose estimation from images taken with fixed-position cameras. For some applications, however, this might not be a valid assumption. It was also shown that the quality of policies for robotic tasks severely deteriorates from small camera offsets. This thesis investigates the use of NAF for a pushing task with clear multimodal properties. The results are compared with using a deterministic policy with minimal constraints on the Q-function surface. Methods for pose estimation using convolutional neural networks are further investigated, especially with regard to randomly placed cameras with unknown offsets. By defining the coordinate frame of objects with respect to some visible feature, it is hypothesized that relative pose estimation can be accomplished even when the camera is not fixed and the offset is unknown. NAF is successfully implemented to solve a simple reaching task on a real robotic system where data collection is distributed over several robots, and learning is done on a separate server. Using NAF to learn a pushing task fails to converge to a good policy, both on the real robots and in simulation. Deep deterministic policy gradient (DDPG) is instead used in simulation and successfully learns to solve the task. The learned policy is then applied on the real robots and solves the task in the real setting as well. Pose estimation from fixed-position camera images is learned and the policy is still able to solve the task using these estimates. By defining a coordinate frame from an object visible to the camera, in this case the robot arm, a neural network learns to regress the pushable object's pose in this frame without the assumption of a fixed camera. However, the predictions were too imprecise to be used for solving the pushing task. With further modifications, however, this approach could prove to be a feasible solution for randomly placed cameras with unknown poses.
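The doubts about the shape of the advantage function stem from NAF's quadratic parameterisation, reproduced here in its standard form (standard notation, not the thesis's own derivation):

\[ Q(s, a) = V(s) + A(s, a), \qquad A(s, a) = -\tfrac{1}{2}\,\bigl(a - \mu(s)\bigr)^{\top} P(s)\,\bigl(a - \mu(s)\bigr), \]

with P(s) positive definite, so Q is unimodal in the action with its maximum at a = μ(s); this is precisely why a pushing task with multimodal structure is a stress test for the method.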
APA, Harvard, Vancouver, ISO, and other styles
20

Cleland, Benjamin George. "Reinforcement Learning for Racecar Control." The University of Waikato, 2006. http://hdl.handle.net/10289/2507.

Full text
Abstract:
This thesis investigates the use of reinforcement learning to learn to drive a racecar in the simulated environment of the Robot Automobile Racing Simulator. Real-life race driving is known to be difficult for humans, and expert human drivers use complex sequences of actions. There are a large number of variables, some of which change stochastically and all of which may affect the outcome. This makes driving a promising domain for testing and developing Machine Learning techniques that have the potential to be robust enough to work in the real world. Therefore the principles of the algorithms from this work may be applicable to a range of problems. The investigation starts by finding a suitable data structure to represent the information learnt. This is tested using supervised learning. Reinforcement learning is added and roughly tuned, and the supervised learning is then removed. A simple tabular representation is found satisfactory, and this avoids difficulties with more complex methods and allows the investigation to concentrate on the essentials of learning. Various reward sources are tested and a combination of three are found to produce the best performance. Exploration of the problem space is investigated. Results show exploration is essential but controlling how much is done is also important. It turns out the learning episodes need to be very long and because of this the task needs to be treated as continuous by using discounting to limit the size of the variables stored. Eligibility traces are used with success to make the learning more efficient. The tabular representation is made more compact by hashing and more accurate by using smaller buckets. This slows the learning but produces better driving. The improvement given by a rough form of generalisation indicates the replacement of the tabular method by a function approximator is warranted. These results show reinforcement learning can work within the Robot Automobile Racing Simulator, and lay the foundations for building a more efficient and competitive agent.
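A minimal sketch of a tabular update with eligibility traces and discounting of the kind described above (shown here in SARSA(λ) form as one common variant; the reward sources, state hashing and bucket sizes from the thesis are not reproduced):

from collections import defaultdict

def sarsa_lambda_step(Q, E, s, a, r, s2, a2, alpha=0.1, gamma=0.99, lam=0.9):
    """One SARSA(lambda) update over a tabular representation with accumulating traces."""
    delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]
    E[(s, a)] += 1.0                          # bump the trace of the visited state-action pair
    for key in list(E):
        Q[key] += alpha * delta * E[key]      # credit all recently visited pairs
        E[key] *= gamma * lam                 # decay traces
    return Q, E

Q, E = defaultdict(float), defaultdict(float)   # tabular value function and traces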
APA, Harvard, Vancouver, ISO, and other styles
21

Kim, Min Sub Computer Science & Engineering Faculty of Engineering UNSW. "Reinforcement learning by incremental patching." Awarded by: University of New South Wales, 2007. http://handle.unsw.edu.au/1959.4/39716.

Full text
Abstract:
This thesis investigates how an autonomous reinforcement learning agent can improve on an approximate solution by augmenting it with a small patch, which overrides the approximate solution at certain states of the problem. In reinforcement learning, many approximate solutions are smaller and easier to produce than 'flat' solutions that maintain distinct parameters for each fully enumerated state, but the best solution within the constraints of the approximation may fall well short of global optimality. This thesis proposes that the remaining gap to global optimality can be efficiently minimised by learning a small patch over the approximate solution. In order to improve the agent's behaviour, algorithms are presented for learning the overriding patch. The patch is grown around particular regions of the problem where the approximate solution is found to be deficient. Two heuristic strategies are proposed for concentrating resources to those areas where inaccuracies in the approximate solution are most costly, drawing a compromise between solution quality and storage requirements. Patching also handles problems with continuous state variables, by two alternative methods: Kuhn triangulation over a fixed discretisation and nearest neighbour interpolation with a variable discretisation. As well as improving the agent's behaviour, patching is also applied to the agent's model of the environment. Inaccuracies in the agent's model of the world are detected by statistical testing, using a selective sampling strategy to limit storage requirements for collecting data. The patching algorithms are demonstrated in several problem domains, illustrating the effectiveness of patching under a wide range of conditions. A scenario drawn from a real-time strategy game demonstrates the ability of patching to handle large complex tasks. These contributions combine to form a general framework for patching over approximate solutions in reinforcement learning. Complex problems cannot be solved by brute force alone, and some form of approximation is necessary to handle large problems. However, this does not mean that the limitations of approximate solutions must be accepted without question. Patching demonstrates one way in which an agent can leverage approximation techniques without losing the ability to handle fine yet important details.
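The central idea of overriding an approximate solution at selected states can be sketched very simply (hypothetical names; the patch-growing heuristics and continuous-state interpolation from the thesis are omitted):

def patched_policy(approx_policy, patch):
    """Return a policy that follows the approximate solution everywhere except at the
    (few) states covered by the learned patch, where the patch takes precedence."""
    def act(state):
        return patch[state] if state in patch else approx_policy(state)
    return act

# Usage sketch: patch only the states where the approximation was found deficient.
# policy = patched_policy(approx_policy, patch={bad_state: better_action})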
APA, Harvard, Vancouver, ISO, and other styles
22

Patrascu, Relu-Eugen. "Adaptive exploration in reinforcement learning." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp01/MQ35921.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Li, Jingxian. "Reinforcement learning using sensorimotor traces." Thesis, University of British Columbia, 2013. http://hdl.handle.net/2429/45590.

Full text
Abstract:
The skilled motions of humans and animals are the result of learning good solutions to difficult sensorimotor control problems. This thesis explores new models for using reinforcement learning to acquire motion skills, with potential applications to computer animation and robotics. Reinforcement learning offers a principled methodology for tackling control problems. However, it is difficult to apply in high-dimensional settings, such as the ones that we wish to explore, where the body can have many degrees of freedom, the environment can have significant complexity, and there can be further redundancies that exist in the sensory representations that are available to perceive the state of the body and the environment. In this context, challenges to overcome include: a state space that cannot be fully explored; the need to model how the state of the body and the perceived state of the environment evolve together over time; and solutions that can work with only a small number of sensorimotor experiences. Our contribution is a reinforcement learning method that implicitly represents the current state of the body and the environment using sensorimotor traces. A distance metric is defined between the ongoing sensorimotor trace and previously experienced sensorimotor traces and this is used to model the current state as a weighted mixture of past experiences. Sensorimotor traces play multiple roles in our method: they provide an embodied representation of the state (and therefore also the value function and the optimal actions), and they provide an embodied model of the system dynamics. In our implementation, we focus specifically on learning steering behaviors for a vehicle driving along straight roads, winding roads, and through intersections. The vehicle is equipped with a set of distance sensors. We apply value-iteration using off-policy experiences in order to produce control policies capable of steering the vehicle in a wide range of circumstances. An experimental analysis is provided of the effect of various design choices. In the future we expect that similar ideas can be applied to other high-dimensional systems, such as bipedal systems that are capable of walking over variable terrain, also driven by control policies based on sensorimotor traces.
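A sketch of modelling the current state as a distance-weighted mixture of past sensorimotor traces is given below; the Euclidean distance and Gaussian kernel are illustrative assumptions rather than the thesis's actual metric:

import numpy as np

def mixture_value(current_trace, past_traces, past_values, bandwidth=1.0):
    """Estimate the value of the current (implicit) state as a weighted mixture of the
    values attached to previously experienced sensorimotor traces."""
    dists = np.array([np.linalg.norm(current_trace - t) for t in past_traces])
    weights = np.exp(-dists**2 / (2 * bandwidth**2))    # closer traces weigh more
    weights /= weights.sum()
    return float(weights @ np.array(past_values))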
APA, Harvard, Vancouver, ISO, and other styles
24

Rummery, Gavin Adrian. "Problem solving with reinforcement learning." Thesis, University of Cambridge, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.363828.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

McCabe, Jonathan Aiden. "Reinforcement learning in virtual reality." Thesis, University of Cambridge, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.608852.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Budhraja, Karan Kumar. "Neuroevolution Based Inverse Reinforcement Learning." Thesis, University of Maryland, Baltimore County, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10140581.

Full text
Abstract:

Motivated by such learning in nature, the problem of Learning from Demonstration is targeted at learning to perform tasks based on observed examples. One of the approaches to Learning from Demonstration is Inverse Reinforcement Learning, in which actions are observed to infer rewards. This work combines a feature-based state evaluation approach to Inverse Reinforcement Learning with neuroevolution, a paradigm for modifying neural networks based on their performance on a given task. Neural networks are used to learn from a demonstrated expert policy and are evolved to generate a policy similar to the demonstration. The algorithm is discussed and evaluated against competitive feature-based Inverse Reinforcement Learning approaches. At the cost of execution time, neural networks allow for non-linear combinations of features in state evaluations. These valuations may correspond to state value or state reward. This results in better correspondence to observed examples as opposed to using linear combinations. This work also extends existing work on Bayesian Non-Parametric Feature construction for Inverse Reinforcement Learning by using non-linear combinations of intermediate data to improve performance. The algorithm is observed to be specifically suitable for linearly solvable non-deterministic Markov Decision Processes in which multiple rewards are sparsely scattered in state space. Performance of the algorithm is shown to be limited by the parameters used, implying adjustable capability. A conclusive performance hierarchy between the evaluated algorithms is constructed.

APA, Harvard, Vancouver, ISO, and other styles
27

Piano, Francesco. "Deep Reinforcement Learning con PyTorch." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2022. http://amslaurea.unibo.it/25340/.

Full text
Abstract:
Reinforcement Learning is a research field of Machine Learning in which an agent solves problems by choosing the most suitable action to perform through an iterative learning process, in a dynamic environment that provides incentives in the form of rewards. Deep Learning, another Machine Learning approach, exploits artificial neural networks to apply representation learning methods in order to obtain a data structure better suited for processing. Only recently has Deep Reinforcement Learning, created by combining these two learning paradigms, made it possible to solve problems previously considered intractable, achieving remarkable success and renewing researchers' interest in the application of Reinforcement Learning algorithms. This thesis studies Reinforcement Learning applied to simple problems, and then examines how it can overcome its characteristic limitations through the use of artificial neural networks, so that it can be applied in a Deep Learning context using the PyTorch framework, a library currently widely used for scientific computing and Machine Learning.
APA, Harvard, Vancouver, ISO, and other styles
28

Kozlova, Olga. "Hierarchical and factored reinforcement learning." Paris 6, 2010. http://www.theses.fr/2010PA066196.

Full text
Abstract:
Hierarchical and factored reinforcement learning (HFRL) methods are based on the formalism of factored Markov decision processes (FMDPs) and hierarchical MDPs (HMDPs). In this thesis, we propose an HFRL method that uses indirect reinforcement learning approaches and the options formalism to solve decision-making problems in dynamic environments without prior knowledge of the problem structure. In the first contribution of this thesis, we show how to model problems in which certain combinations of variables do not exist, and we demonstrate the performance of our algorithms on classical toy problems from the literature, MAZE6 and BLOCKSWORLD, in comparison with the standard approach. The second contribution of this thesis is TeXDYNA, an algorithm for solving large MDPs whose structure is unknown. TeXDYNA hierarchically decomposes the FMDP based on the automatic discovery of subtasks directly from the structure of the problem, which is itself learned through interaction with the environment. We evaluate TeXDYNA on two benchmarks, the TAXI and LIGHTBOX problems. Finally, we assess the potential and the limitations of TeXDYNA on a toy problem more representative of the industrial simulation domain.
APA, Harvard, Vancouver, ISO, and other styles
29

Blows, Curtly. "Reinforcement learning for telescope optimisation." Master's thesis, Faculty of Science, 2019. http://hdl.handle.net/11427/31352.

Full text
Abstract:
Reinforcement learning is a relatively new and unexplored branch of machine learning with a wide variety of applications. This study investigates reinforcement learning and provides an overview of its application to a variety of different problems. We then explore the possible use of reinforcement learning for telescope target selection and scheduling in astronomy, with the hope of effectively mimicking the choices made by professional astronomers. This is relevant as next-generation astronomy surveys will require near real-time decision making in response to high-speed transient discoveries. We experiment with and apply some of the leading approaches in reinforcement learning to simplified models of the target selection problem. We find that the methods used in this study show promise but do not generalise well. Hence, while there are indications that reinforcement learning algorithms could work, more sophisticated algorithms and simulations are needed.
APA, Harvard, Vancouver, ISO, and other styles
30

Stigenberg, Jakob. "Scheduling using Deep Reinforcement Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-284506.

Full text
Abstract:
As radio networks have continued to evolve in recent decades, so have their complexity and the difficulty in efficiently utilizing the available resources. In a cellular network, the scheduler controls the allocation of time, frequency and spatial resources to users in both uplink and downlink directions. The scheduler is therefore a key component in terms of efficient usage of network resources. Although the scope and characteristics of the resources available to schedulers are well defined in network standards, e.g. Long-Term Evolution or New Radio, their actual implementation is not. Most previous work focuses on constructing heuristics, based on metrics such as Quality of Service (QoS) classes, channel quality and delay, from which packets are then sorted and scheduled. In this work, a new approach to time domain scheduling using reinforcement learning is presented. The proposed algorithm leverages model-free reinforcement learning in order to treat the frequency domain scheduler as a black box. The proposed algorithm uses end-to-end learning and considers all packets, including control packets such as scheduling requests and CSI reports. Using a Deep Q-Network, the algorithm was evaluated in a setting with multiple delay-sensitive VoIP users and one best-effort user. Compared to a priority based scheduler, the agent was able to improve total cell throughput by 20.5%, 23.5%, and 16.2% in the 10th, 50th, and 90th percentiles, respectively, while simultaneously reducing the VoIP packet delay by 29.6%, thus improving QoS.
APA, Harvard, Vancouver, ISO, and other styles
31

Jesu, Alberto. "Reinforcement learning over encrypted data." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/23257/.

Full text
Abstract:
Reinforcement learning is a particular paradigm of machine learning that, recently, has proven time and time again to be a very effective and powerful approach. On the other hand, cryptography usually takes the opposite direction. While machine learning aims at analyzing data, cryptography aims at maintaining its privacy by hiding such data. However, the two techniques can be jointly used to create privacy-preserving models, able to make inferences on the data without leaking sensitive information. Despite the numerous studies performed on machine learning and cryptography, reinforcement learning in particular has never been applied to such cases before. Being able to successfully make use of reinforcement learning in an encrypted scenario would allow us to create an agent that efficiently controls a system without providing it with full knowledge of the environment it is operating in, leading the way to many possible use cases. Therefore, we have decided to apply the reinforcement learning paradigm to encrypted data. In this project we have applied one of the most well-known reinforcement learning algorithms, called Deep Q-Learning, to simple simulated environments and studied how the encryption affects the training performance of the agent, in order to see if it is still able to learn how to behave even when the input data is no longer readable by humans. The results of this work highlight that the agent is still able to learn with no issues whatsoever in small state spaces with non-secure encryption, like AES in ECB mode. For fixed environments, it is also able to reach a suboptimal solution even in the presence of secure modes, like AES in CBC mode, showing a significant improvement with respect to a random agent; however, its ability to generalize in stochastic environments or big state spaces suffers greatly.
APA, Harvard, Vancouver, ISO, and other styles
32

Suggs, Sterling. "Reinforcement Learning with Auxiliary Memory." BYU ScholarsArchive, 2021. https://scholarsarchive.byu.edu/etd/9028.

Full text
Abstract:
Deep reinforcement learning algorithms typically require vast amounts of data to train to a useful level of performance. Each time new data is encountered, the network must inefficiently update all of its parameters. Auxiliary memory units can help deep neural networks train more efficiently by separating computation from storage, and providing a means to rapidly store and retrieve precise information. We present four deep reinforcement learning models augmented with external memory, and benchmark their performance on ten tasks from the Arcade Learning Environment. Our discussion and insights will be helpful for future RL researchers developing their own memory agents.
APA, Harvard, Vancouver, ISO, and other styles
33

Liu, Chong. "Reinforcement learning with time perception." Thesis, University of Manchester, 2012. https://www.research.manchester.ac.uk/portal/en/theses/reinforcement-learning-with-time-perception(a03580bd-2dd6-4172-a061-90e8ac3022b8).html.

Full text
Abstract:
Classical value estimation reinforcement learning algorithms do not perform very well in dynamic environments. On the other hand, the reinforcement learning of animals is quite flexible: they can adapt to dynamic environments very quickly and deal with noisy inputs very effectively. One feature that may contribute to animals' good performance in dynamic environments is that they learn and perceive the time to reward. In this research, we attempt to learn and perceive the time to reward and explore situations where the learned time information can be used to improve the performance of the learning agent in dynamic environments. The type of dynamic environments that we are interested in is that type of switching environment which stays the same for a long time, then changes abruptly, and then holds for a long time before another change. The type of dynamics that we mainly focus on is the time to reward, though we also extend the ideas to learning and perceiving other criteria of optimality, e.g. the discounted return, so that they can still work even when the amount of reward may also change. Specifically, both the mean and variance of the time to reward are learned and then used to detect changes in the environment and to decide whether the agent should give up a suboptimal action. When a change in the environment is detected, the learning agent responds specifically to the change in order to recover quickly from it. When it is found that the current action is still worse than the optimal one, the agent gives up this time's exploration of the action and then remakes its decision in order to avoid longer than necessary exploration. The results of our experiments using two real-world problems show that they have effectively sped up learning, reduced the time taken to recover from environmental changes, and improved the performance of the agent after the learning converges in most of the test cases compared with classical value estimation reinforcement learning algorithms. In addition, we have successfully used spiking neurons to implement various phenomena of classical conditioning, the simplest form of animal reinforcement learning in dynamic environments, and also pointed out a possible implementation of instrumental conditioning and general reinforcement learning using similar models.
APA, Harvard, Vancouver, ISO, and other styles
34

Tluk, von Toschanowitz Katharina. "Relevance determination in reinforcement learning." Tönning Lübeck Marburg Der Andere Verl, 2009. http://d-nb.info/993341128/04.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Bonneau, Maxime. "Reinforcement Learning for 5G Handover." Thesis, Linköpings universitet, Statistik och maskininlärning, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-140816.

Full text
Abstract:
The development of the 5G network is in progress, and one part of the process that needs to be optimised is the handover. This operation, consisting of changing the base station (BS) providing data to a user equipment (UE), needs to be efficient enough to be a seamless operation. From the BS point of view, this operation should be as economical as possible, while satisfying the UE's needs. In this thesis, the problem of 5G handover has been addressed, and the chosen tool to solve this problem is reinforcement learning. A review of the different methods proposed by reinforcement learning led to the restricted field of model-free, off-policy methods, more specifically the Q-Learning algorithm. In its basic form, and used with simulated data, this method provides information on which kind of reward and which kinds of action space and state space produce good results. However, despite working on some restricted datasets, this algorithm does not scale well due to lengthy computation times. This means that the trained agent cannot use a lot of data for its learning process, and neither the state space nor the action space can be extended much, restricting the use of the basic Q-Learning algorithm to discrete variables. Since the strength of the signal (RSRP), which is of high interest for matching the UE's needs, is a continuous variable, a continuous form of Q-learning needs to be used. A function approximation method is then investigated, namely artificial neural networks. In addition to the lengthy computation times, the results obtained are not yet convincing. Thus, despite some interesting results obtained from the basic form of the Q-Learning algorithm, the extension to the continuous case has not been successful. Moreover, the computation times make reinforcement learning applicable in this domain only on really powerful computers.
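For reference, a minimal sketch of the tabular, model-free, off-policy Q-Learning baseline the abstract refers to. The discretised handover state (serving BS plus an RSRP bucket) and the set of candidate base stations as actions are illustrative assumptions, not the thesis's exact formulation.

```python
import random
from collections import defaultdict

# Hypothetical discretisation: states are (serving_bs, rsrp_bucket), actions are candidate BSs.
ACTIONS = [0, 1, 2, 3]          # candidate base stations
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = defaultdict(float)          # Q[(state, action)] -> estimated return

def choose_action(state):
    # Epsilon-greedy exploration over the discrete action space.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Standard off-policy Q-learning backup.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

The table grows with the product of state and action discretisations, which is exactly the scaling limitation the abstract identifies before turning to function approximation.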
APA, Harvard, Vancouver, ISO, and other styles
36

Ovidiu, Chelcea Vlad, and Björn Ståhl. "Deep Reinforcement Learning for Snake." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-239362.

Full text
Abstract:
The world has recently seen a large increase in research and development as well as layman use of machine learning. Machine learning has a broad application domain, e.g., in marketing, production and finance. Although these applications have a predetermined set of rules or goals, this project deals with another aspect of machine learning, which is general intelligence. During the course of the project a non-human player (known as an agent) learns how to play the game SNAKE without any outside influence or knowledge of the environment dynamics. After having the agent train for 66 hours and almost two million games, an average of 16 points per game out of 35 possible was reached. This is realized by the use of reinforcement learning and deep convolutional neural networks (CNN).
APA, Harvard, Vancouver, ISO, and other styles
37

Edlund, Joar, and Jack Jönsson. "Reinforcement Learning for Video Games." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-239363.

Full text
Abstract:
We present an implementation of a specific type of deep reinforcement learning algorithm known as deep Q-learning. With a Convolutional Neural Network (CNN) combined with our Q-learning algorithm, we trained an agent to play the game of Snake. The input to the CNN is the raw pixel values from the Snake environment and the output is a value function which estimates future rewards for different actions. We implemented the Q-learning algorithm on a grid-based and a pixel-based representation of the Snake environment and found that the algorithm can perform at human level on the smaller grid-based representation, whilst the performance on the pixel-based representation was fairly limited.
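A minimal PyTorch sketch of a convolutional Q-network of the kind described here, mapping a stack of game frames to one Q-value per action. The layer sizes, frame stack of four, and 84x84 input are common Atari-style choices assumed for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SnakeQNet(nn.Module):
    """Map a stack of grayscale game frames to one Q-value per action."""

    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),   # infers the flattened size at the first call
            nn.Linear(256, n_actions),
        )

    def forward(self, frames):               # frames: (batch, in_channels, H, W)
        return self.head(self.conv(frames))

# Greedy action for a dummy 84x84 input.
net = SnakeQNet(n_actions=4)
q_values = net(torch.zeros(1, 4, 84, 84))
action = q_values.argmax(dim=1)
```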
APA, Harvard, Vancouver, ISO, and other styles
38

Magnusson, Björn, and Måns Forslund. "SAFE AND EFFICIENT REINFORCEMENT LEARNING." Thesis, Örebro universitet, Institutionen för naturvetenskap och teknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-76588.

Full text
Abstract:
Pre-programming a robot may be efficient to some extent, but since a human has coded the robot, it will only be as efficient as the programming. The problem can be solved by using machine learning, which lets the robot learn the most efficient way by itself. This thesis is a continuation of a previous work that covered the development of the framework Safe-To-Explore-State-Spaces (STESS) for safe robot manipulation. This thesis evaluates the efficiency of Q-Learning with normalized advantage function (NAF), a deep reinforcement learning algorithm, when integrated with the safety framework STESS. It does this by performing a 2D task where the robot moves the tooltip on a plane from point A to point B in a set workspace. To test the viability, different scenarios were presented to the robot: no obstacles, sphere obstacles and cylinder obstacles. The reinforcement learning algorithm only knew the starting position, while STESS pre-defined the workspace, constraining the areas which the robot could not enter. By satisfying these constraints the robot could explore and learn the most efficient way to complete its task. The results show that in simulation the NAF algorithm learns quickly and efficiently, while avoiding the obstacles without collision.
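A short numpy sketch of the normalized advantage function decomposition that gives NAF its name: the Q-function is a state value plus a quadratic advantage whose maximum is the network's predicted action. The function signature and variable names are illustrative assumptions.

```python
import numpy as np

def naf_q_value(a, v, mu, L):
    """Q(s, a) under the normalized advantage function decomposition.

    v  : scalar state value V(s) predicted by the network
    mu : greedy action mu(s), shape (d,)
    L  : lower-triangular matrix, shape (d, d), so that P = L @ L.T is positive semi-definite
    """
    P = L @ L.T
    diff = a - mu
    advantage = -0.5 * diff @ P @ diff     # quadratic in a, maximised at a = mu
    return v + advantage

# With this parameterisation the greedy continuous action is available in closed form:
# argmax_a Q(s, a) = mu(s), which is what makes Q-learning tractable for continuous control.
```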
APA, Harvard, Vancouver, ISO, and other styles
39

Liu, Bai S. M. Massachusetts Institute of Technology. "Reinforcement learning in network control." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/122414.

Full text
Abstract:
With the rapid growth of information technology, network systems have become increasingly complex. In particular, designing network control policies requires knowledge of underlying network dynamics, which are often unknown, and need to be learned. Existing reinforcement learning methods such as Q-Learning, Actor-Critic, etc. are heuristic and do not offer performance guarantees. In contrast, model-based learning methods offer performance guarantees, but can only be applied with bounded state spaces. In the thesis, we propose to use model-based reinforcement learning. By applying Lyapunov analysis, our algorithm can be applied to queueing networks with unbounded state spaces. We prove that under our algorithm, the average queue backlog can get arbitrarily close to the optimal result. We also implement simulations to illustrate the effectiveness of our algorithm.
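As a rough, simplified sketch of the model-based ingredient only (the thesis additionally relies on Lyapunov analysis to handle unbounded state spaces), the snippet below estimates transition probabilities from visit counts and plans on the estimated model over a truncated set of queue states. The truncation, cost function, and names are assumptions made for illustration.

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next] = observed visits

def record(s, a, s_next):
    counts[(s, a)][s_next] += 1

def estimated_model(s, a):
    # Empirical transition distribution for (s, a).
    c = counts[(s, a)]
    total = sum(c.values())
    return {s_next: n / total for s_next, n in c.items()} if total else {}

def value_iteration(states, actions, cost, gamma=0.99, iters=200):
    # Plan on the estimated model over a truncated (bounded) set of queue states;
    # states outside the truncation are treated as having zero value.
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        for s in states:
            V[s] = min(
                cost(s, a) + gamma * sum(p * V.get(s2, 0.0)
                                         for s2, p in estimated_model(s, a).items())
                for a in actions
            )
    return V
```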
APA, Harvard, Vancouver, ISO, and other styles
40

Garcelon, Evrard. "Constrained Exploration in Reinforcement Learning." Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAG007.

Full text
Abstract:
A major application of machine learning is to provide personalized content to different users. In general, the algorithms powering these recommendations are supervised learning algorithms, meaning the data used during the learning phase are assumed to be sampled from the same distribution. However, these data are generated through interactions between the users and the recommendation algorithms themselves: recommendations for a user at time t can change the set of pertinent recommendations at a later time. It is therefore necessary to take those interactions into account in order to provide the best possible service. This setting is reminiscent of the online learning problem. Among online learning algorithms, bandit and Reinforcement Learning (RL) algorithms look the most promising to replace supervised learning methods for applications requiring a certain degree of personalization. The deployment of RL algorithms in production presents a number of challenges, such as guaranteeing a certain level of performance during exploration phases or guaranteeing the privacy of the data collected by these algorithms. In this thesis, we consider different constraints limiting the use of RL algorithms and provide both empirical and theoretical results on the speed of learning in the presence of those constraints.
APA, Harvard, Vancouver, ISO, and other styles
41

Wei, Ermo. "Learning to Play Cooperative Games via Reinforcement Learning." Thesis, George Mason University, 2019. http://pqdtopen.proquest.com/#viewpdf?dispub=13420351.

Full text
Abstract:

Being able to accomplish tasks with multiple learners through learning has long been a goal of the multiagent systems and machine learning communities. One of the main approaches people have taken is reinforcement learning, but due to certain conditions and restrictions, applying reinforcement learning in a multiagent setting has not achieved the same level of success when compared to its single agent counterparts.

This thesis aims to make coordination better for agents in cooperative games by improving on reinforcement learning algorithms in several ways. I begin by examining certain pathologies that can lead to the failure of reinforcement learning in cooperative games, and in particular the pathology of relative overgeneralization. In relative overgeneralization, agents do not learn to optimally collaborate because during the learning process each agent instead converges to behaviors which are robust in conjunction with the other agent's exploratory (and thus random), rather than optimal, choices. One solution to this is so-called lenient learning, where agents are forgiving of the poor choices of their teammates early in the learning cycle. In the first part of the thesis, I develop a lenient learning method to deal with relative overgeneralization in independent learner settings with small stochastic games and discrete actions.

I then examine certain issues in a more complex multiagent domain involving parameterized action Markov decision processes, motivated by the RoboCup 2D simulation league. I propose two methods, one batch method and one actor-critic method, based on state of the art reinforcement learning algorithms, and show experimentally that the proposed algorithms can train the agents in a significantly more sample-efficient way than more common methods.

I then broaden the parameterized-action scenario to consider both repeated and stochastic games with continuous actions. I show how relative overgeneralization prevents the multiagent actor-critic model from learning optimal behaviors and demonstrate how to use Soft Q-Learning to solve this problem in repeated games.

Finally, I extend imitation learning to the multiagent setting to solve related issues in stochastic games, and prove that, given the demonstration from an expert, multiagent Imitation Learning is exactly the multiagent actor-critic model in the Maximum Entropy Reinforcement Learning framework. I further show that when demonstration samples meet certain conditions, the relative overgeneralization problem can be avoided during the learning process.
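A minimal sketch of the lenient-learning idea discussed above for an independent Q-learner: early in training, updates that would lower Q(s, a) are ignored with high probability, so an agent forgives a teammate's exploratory mistakes rather than converging to overly cautious behaviour. The per-pair temperature schedule and constants are illustrative assumptions, not the thesis's exact rule.

```python
import random
from collections import defaultdict

Q = defaultdict(float)
temperature = defaultdict(lambda: 1.0)   # per (state, action) leniency temperature
ALPHA, GAMMA, DECAY = 0.1, 0.95, 0.995

def lenient_update(s, a, reward, s_next, actions):
    target = reward + GAMMA * max(Q[(s_next, b)] for b in actions)
    delta = target - Q[(s, a)]
    # Always accept updates that raise the estimate; accept decreases only
    # with probability (1 - temperature), so early ("hot") estimates are forgiving.
    if delta >= 0 or random.random() > temperature[(s, a)]:
        Q[(s, a)] += ALPHA * delta
    temperature[(s, a)] *= DECAY   # become less lenient as (s, a) is visited more often
```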

APA, Harvard, Vancouver, ISO, and other styles
42

Stachenfeld, Kimberly. "Learning Neural Representations that Support Efficient Reinforcement Learning." Thesis, Princeton University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10824319.

Full text
Abstract:

RL has been transformative for neuroscience by providing a normative anchor for interpreting neural and behavioral data. End-to-end RL methods have scored impressive victories with minimal compromises in autonomy, hand-engineering, and generality. The cost of this minimalism in practice is that model-free RL methods are slow to learn and generalize poorly. Humans and animals exhibit substantially improved flexibility and generalize learned information rapidly to new environments by learning invariants and features of the environment that support fast learning and rapid transfer. An important question for both neuroscience and machine learning is what kind of ``representational objectives'' encourage humans and other animals to encode structure about the world. This can be formalized as ``representation feature learning,'' in which the animal or agent learns to form representations with information potentially relevant to the downstream RL process. We overview different representational objectives that have received attention in neuroscience and in machine learning. The focus of this overview is to first highlight conditions under which these seemingly unrelated objectives are actually mathematically equivalent. We use this to motivate a breakdown of properties of different learned representations that are meaningfully different and can be used to inform contrasting hypotheses for neuroscience. We then use this perspective to motivate our model of the hippocampus. A cognitive map has long been the dominant metaphor for hippocampal function, embracing the idea that place cells encode a geometric representation of space. However, evidence for predictive coding, reward sensitivity, and policy dependence in place cells suggests that the representation is not purely spatial. We approach the problem of understanding hippocampal representations from a reinforcement learning perspective, focusing on what kind of spatial representation is most useful for maximizing future reward. We show that the answer takes the form of a predictive representation. This representation captures many aspects of place cell responses that fall outside the traditional view of a cognitive map. We go on to argue that entorhinal grid cells encode a low-dimensional basis set for the predictive representation, useful for suppressing noise in predictions and extracting multiscale structure for hierarchical planning.
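A minimal sketch of a predictive (successor-representation-style) state representation learned by temporal differences, which is the kind of predictive map the abstract argues for. State values then factor into the predictive map times a reward vector, so changing the reward re-weights predictions without relearning the map. Sizes and learning rates are illustrative.

```python
import numpy as np

n_states, alpha, gamma = 25, 0.1, 0.95
M = np.eye(n_states)          # M[s, s'] ~ expected discounted future occupancy of s' from s
w = np.zeros(n_states)        # reward weights per state

def sr_td_update(s, s_next, reward):
    # TD update of the predictive map: the row for s moves toward
    # the indicator of s plus the discounted prediction from s_next.
    onehot = np.zeros(n_states); onehot[s] = 1.0
    M[s] += alpha * (onehot + gamma * M[s_next] - M[s])
    # Reward weights learned by simple error correction.
    w[s_next] += alpha * (reward - w[s_next])

def state_values():
    # V = M @ w: values are predictions of future state occupancy weighted by reward.
    return M @ w
```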

APA, Harvard, Vancouver, ISO, and other styles
43

Effraimidis, Dimitros. "Computation approaches for continuous reinforcement learning problems." Thesis, University of Westminster, 2016. https://westminsterresearch.westminster.ac.uk/item/q0y82/computation-approaches-for-continuous-reinforcement-learning-problems.

Full text
Abstract:
Optimisation theory is at the heart of any control process, where we seek to control the behaviour of a system through a set of actions. Linear control problems have been extensively studied, and optimal control laws have been identified. But the world around us is highly non-linear and unpredictable. For these dynamic systems, which don't possess the nice mathematical properties of their linear counterparts, classic control theory breaks down and other methods have to be employed. But nature thrives by optimising non-linear and over-complicated systems. Evolutionary Computing (EC) methods exploit nature's way by imitating the evolution process and avoid solving the control problem analytically. Reinforcement Learning (RL), on the other hand, regards the optimal control problem as a sequential one. In every discrete time step an action is applied. The transition of the system to a new state is accompanied by a sole numerical value, the "reward", which designates the quality of the control action. Even though the amount of feedback information is limited to a single real number, the introduction of the Temporal Difference method made it possible to have accurate predictions of the value functions. This paved the way to optimising complex structures, like Neural Networks, which are used to approximate the value functions. In this thesis we investigate the solution of continuous Reinforcement Learning control problems by EC methodologies. The accumulated reward of such problems throughout an episode suffices as information to formulate the required measure, fitness, in order to optimise a population of candidate solutions. In particular, we explore the limits of applicability of a specific branch of EC, that of Genetic Programming (GP). The evolving population in the GP case is comprised of individuals which are immediately translated to mathematical functions that can serve as a control law. The major contribution of this thesis is the proposed unification of these disparate Artificial Intelligence paradigms. The information provided by the system is exploited on a step-by-step basis by the RL part of the proposed scheme and on an episodic basis by GP. This makes it possible to augment the function set of the GP scheme with adaptable Neural Networks. In the quest to achieve stable behaviour of the RL part of the system, a modification of the Actor-Critic algorithm has been implemented. Finally, we successfully apply the GP method to multi-action control problems, extending the spectrum of problems that this method has been proven to solve. We also investigated the capability of GP in relation to problems from the food industry. These types of problems also exhibit non-linearity, and there is no definite model describing their behaviour.
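A small sketch of the episodic fitness measure described above, in which a candidate control law evolved by GP is scored by the reward it accumulates over rollouts. A Gymnasium-style environment interface is assumed for illustration; it is not the thesis's simulation setup.

```python
def episodic_fitness(controller, env, n_episodes=5, max_steps=1000):
    """Score a candidate control law (a function state -> action) by its accumulated reward."""
    total = 0.0
    for _ in range(n_episodes):
        state, _ = env.reset()
        for _ in range(max_steps):
            action = controller(state)               # candidate function evolved by GP
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            if terminated or truncated:
                break
    return total / n_episodes    # average return used as the fitness for selection
```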
APA, Harvard, Vancouver, ISO, and other styles
44

Le, Piane Fabio. "Training cognitivo adattativo mediante Reinforcement Learning." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amslaurea.unibo.it/17289/.

Full text
Abstract:
Multiple sclerosis (MS) is an autoimmune disease that affects the central nervous system, causing various organic and functional alterations. In particular, a significant percentage of patients develop deficits in different cognitive domains. To limit the progression of these deficits, specialist teams have devised protocols for cognitive rehabilitation. To carry out rehabilitation sessions, patients must travel to specialized clinics, requiring the assistance of qualified personnel and performing the exercises with pen and paper. More recently, a path towards the digitalization of this kind of experience has begun. A multidisciplinary team composed of researchers from DISI - Università di Bologna and specialists from various Italian centres has designed a software system, MS-Rehab, whose purpose is to provide healthcare facilities with a complete and easy-to-use system specifically for MS rehabilitation. This software supports numerous exercises in three cognitive domains: attention, memory and executive functions. This thesis work focused on integrating Reinforcement Learning (RL) methods into MS-Rehab, with the aim of building a mechanism for the adaptive automation of exercise difficulty. Such a solution is novel in the field of cognitive rehabilitation. To verify whether it provides a rehabilitation experience equal to or better than the one currently offered, an experiment was carried out in which selected individuals took a preliminary test assessing their level in the cognitive functions of attention and memory, followed by a training period on MS-Rehab, and finally by a new instance of the initial test. The results obtained are encouraging: the neuro-psychological test scores were noticeably higher for the group that used the RL-based version.
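One hypothetical way to shape the reward for such a difficulty-adaptation agent is to favour keeping the patient's success rate inside a target band, so exercises are neither too easy nor too hard. The function below is purely illustrative; the target, band, and penalty are assumptions, not values from MS-Rehab.

```python
def difficulty_reward(success_rate, target=0.75, band=0.1):
    """Hypothetical shaping: highest reward when the patient's success rate
    stays inside the target band, penalising drift toward too easy or too hard."""
    if abs(success_rate - target) <= band:
        return 1.0
    return -abs(success_rate - target)
```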
APA, Harvard, Vancouver, ISO, and other styles
45

Mariani, Tommaso. "Deep reinforcement learning for industrial applications." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/20548/.

Full text
Abstract:
In recent years there has been growing attention from the research world and from companies to the field of Machine Learning. This interest, thanks mainly to the increasing availability of large amounts of data and the corresponding strengthening of the hardware needed to analyse them, has led to the birth of Deep Learning. Growing computing capacity and the use of mathematical optimization techniques, already studied in depth but previously limited by low computational power, have then allowed the development of a new approach called Reinforcement Learning. This thesis work is part of an industrial process for the selection of fruit for sale, based on the identification and classification of any defects present on it. The final objective is to measure the similarity between defects, so as to identify and link them together even when they come from optical acquisitions obtained at different time steps. We therefore studied a class of algorithms characteristic of Reinforcement Learning, the policy gradient methods, in order to train a feedforward neural network to compare possible malformations of the same fruit. Finally, an applicability study was made, based on real data, in which the model was compared on different fruit rolling dynamics and with different versions of the network.
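For context, a minimal numpy sketch of the basic policy-gradient (REINFORCE) update for a linear softmax policy. It illustrates the class of methods named in the abstract, not the thesis's network or training procedure.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update. theta: (n_actions, n_features) weights of a linear softmax policy.
    episode: list of (features, action, reward) tuples from one rollout."""
    G, returns = 0.0, []
    for _, _, r in reversed(episode):           # discounted return from each step
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (x, a, _), G in zip(episode, returns):
        probs = softmax(theta @ x)
        grad_log = -np.outer(probs, x)          # d log pi(a|x) / d theta
        grad_log[a] += x
        theta += alpha * G * grad_log           # ascend the policy-gradient estimate
    return theta
```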
APA, Harvard, Vancouver, ISO, and other styles
46

Rossi, Martina. "Opponent Modelling using Inverse Reinforcement Learning." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/22263/.

Full text
Abstract:
A particularly active research area in artificial intelligence (AI) concerns the study of autonomous agents, which are increasingly widespread in everyday life. The main goal is to develop agents that interact efficiently with other agents or with humans. These relationships could be greatly simplified by the ability to autonomously infer the preferences of other entities and adapt the agent's strategy accordingly. The purpose of this thesis is therefore to implement a learning agent that interacts with another entity in the same environment and uses this experience to extrapolate the opponent's preferences. This information can be used to cooperate with or exploit the other party, depending on the agent's goal. The central topics are thus Reinforcement Learning, multi-agent environments and value alignment. The presented agent learns via Deep Q-Learning and receives a reward computed by combining the feedback from the environment with the opponent's reward, the latter being obtained by running the Maximum Entropy Inverse Reinforcement Learning algorithm on previous interactions. The behaviour of the proposed agent is tested in two different environments: the Centipede game and the Apple Picking game. The results obtained are promising, since they show that the agent can correctly infer the opponent's preferences and use this knowledge to adapt its strategy. However, the final behaviour does not always match expectations; the limits of the current approach and future developments to improve the agent are therefore analysed.
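A short sketch of the two ingredients described above: a Maximum Entropy IRL-style gradient that matches feature expectations of the opponent's demonstrated behaviour, and a combined reward that blends environment feedback with the inferred opponent reward. The weight `w` and its sign convention (positive to cooperate, negative to exploit) are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def maxent_irl_gradient(expert_features, policy_features):
    """MaxEnt IRL adjusts reward weights so the learner's expected feature counts
    match those observed in the opponent's demonstrations.
    Both arguments: arrays of shape (n_samples, n_features)."""
    return expert_features.mean(axis=0) - policy_features.mean(axis=0)

def combined_reward(env_reward, inferred_opponent_reward, w=0.5):
    """Blend the agent's own feedback with the opponent's inferred reward:
    w > 0 biases toward cooperation, w < 0 toward exploiting the opponent."""
    return env_reward + w * inferred_opponent_reward
```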
APA, Harvard, Vancouver, ISO, and other styles
47

Borga, Magnus. "Reinforcement Learning Using Local Adaptive Models." Licentiate thesis, Linköping University, Linköping University, Computer Vision, 1995. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-53352.

Full text
Abstract:

In this thesis, the theory of reinforcement learning is described and its relation to learning in biological systems is discussed. Some basic issues in reinforcement learning, the credit assignment problem and perceptual aliasing, are considered. The methods of temporal difference are described. Three important design issues are discussed: information representation and system architecture, rules for improving the behaviour and rules for the reward mechanisms. The use of local adaptive models in reinforcement learning is suggested and exemplified by some experiments. This idea is behind all the work presented in this thesis. A method for learning to predict the reward called the prediction matrix memory is presented. This structure is similar to the correlation matrix memory but differs in that it is not only able to generate responses to given stimuli but also to predict the rewards in reinforcement learning. The prediction matrix memory uses the channel representation, which is also described. A dynamic binary tree structure that uses the prediction matrix memories as local adaptive models is presented. The theory of canonical correlation is described and its relation to the generalized eigenproblem is discussed. It is argued that the directions of canonical correlations can be used as linear models in the input and output spaces respectively in order to represent input and output signals that are maximally correlated. It is also argued that this is a better representation in a response generating system than, for example, principal component analysis since the energy of the signals has nothing to do with their importance for the response generation. An iterative method for finding the canonical correlations is presented. Finally, the possibility of using the canonical correlation for response generation in a reinforcement learning system is indicated.
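The abstract relates canonical correlation to a generalized eigenproblem; the thesis itself presents an iterative method, but the closed-form sketch below illustrates that formulation: the largest eigenvalues are the canonical correlations and the eigenvectors hold the linear models for the input and output spaces. The regularisation term is an assumption added for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def canonical_correlations(X, Y, reg=1e-6):
    """Directions of maximal correlation via the generalized eigenproblem
    [0 Cxy; Cyx 0] w = rho [Cxx 0; 0 Cyy] w."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n, p = X.shape
    q = Y.shape[1]
    Cxx = X.T @ X / n + reg * np.eye(p)
    Cyy = Y.T @ Y / n + reg * np.eye(q)
    Cxy = X.T @ Y / n
    A = np.block([[np.zeros((p, p)), Cxy], [Cxy.T, np.zeros((q, q))]])
    B = np.block([[Cxx, np.zeros((p, q))], [np.zeros((q, p)), Cyy]])
    rho, W = eigh(A, B)                 # symmetric-definite generalized eigenproblem
    order = np.argsort(rho)[::-1]       # largest eigenvalues = canonical correlations
    return rho[order], W[:, order][:p], W[:, order][p:]
```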

APA, Harvard, Vancouver, ISO, and other styles
48

Mastour, Eshgh Somayeh Sadat. "Distributed Reinforcement Learning for Overlay Networks." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-92131.

Full text
Abstract:
In this thesis, we study Collaborative Reinforcement Learning (CRL) in the context of Information Retrieval in unstructured distributed systems. Collaborative reinforcement learning is an extension of reinforcement learning that supports multiple agents which both share value functions and cooperate to solve tasks. Specifically, we propose and develop an algorithm for searching in peer-to-peer systems using collaborative reinforcement learning. We present a search technique that achieves higher performance than currently available techniques, but is straightforward and practical enough to be easily incorporated into existing systems. The approach is profitable because reinforcement learning methods search for good behaviors gradually during the lifetime of the learning peer. However, we must overcome the challenges due to the fundamental partial observability inherent in distributed systems, which have a highly dynamic nature and in which changes in configuration are common practice. We also undertake a performance study of the effects that some environment parameters, such as the number of peers, network traffic bandwidth, and partial behavioral knowledge from previous experience, have on the speed and reliability of learning. In the process, we show how CRL can be used to establish and maintain autonomic properties of decentralized distributed systems. This thesis is an empirical study of collaborative reinforcement learning; however, our results contribute to the broader understanding of learning strategies and the design of different search policies in distributed systems. Our experimental results confirm the performance improvement of CRL in heterogeneous overlay networks over standard techniques such as random walking.
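A rough sketch of the kind of shared-value update a peer might use when deciding where to forward a query: the locally observed outcome is blended with the value the neighbour advertises for handling that query type itself. The state/action encoding and constants are illustrative assumptions, not the thesis's exact CRL update.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.2, 0.9
Q = defaultdict(float)      # Q[(query_type, neighbour)] -> estimated search value

def crl_update(query_type, neighbour, local_reward, advertised_value):
    """Blend the locally observed outcome of forwarding a query with the value
    the neighbour advertises (its own estimate) for that query type."""
    target = local_reward + GAMMA * advertised_value
    Q[(query_type, neighbour)] += ALPHA * (target - Q[(query_type, neighbour)])

def best_neighbour(query_type, neighbours):
    # Route the query toward the neighbour currently believed most valuable.
    return max(neighbours, key=lambda n: Q[(query_type, n)])
```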
APA, Harvard, Vancouver, ISO, and other styles
49

Humphrys, Mark. "Action selection methods using reinforcement learning." Thesis, University of Cambridge, 1996. https://www.repository.cam.ac.uk/handle/1810/252269.

Full text
Abstract:
The Action Selection problem is the problem of run-time choice between conflicting and heterogeneous goals, a central problem in the simulation of whole creatures (as opposed to the solution of isolated uninterrupted tasks). This thesis argues that Reinforcement Learning has been overlooked in the solution of the Action Selection problem. Considering a decentralised model of mind, with internal tension and competition between selfish behaviors, this thesis introduces an algorithm called "W-learning", whereby different parts of the mind modify their behavior based on whether or not they are succeeding in getting the body to execute their actions. This thesis sets W-learning in context among the different ways of exploiting Reinforcement Learning numbers for the purposes of Action Selection. It is a 'Minimize the Worst Unhappiness' strategy. The different methods are tested and their strengths and weaknesses analysed in an artificial world.
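A minimal sketch of the W-learning idea as described in the abstract: each selfish behaviour keeps its own Q-values, tracks how much it loses when its preferred action is not executed, and the body obeys the behaviour that stands to lose the most ("Minimize the Worst Unhappiness"). The exact update rule in the thesis may differ; constants and class names are illustrative.

```python
import numpy as np

class WBehavior:
    """One selfish behaviour with its own Q-values and a W-value per state."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9):
        self.Q = np.zeros((n_states, n_actions))
        self.W = np.zeros(n_states)
        self.alpha, self.gamma = alpha, gamma

    def preferred_action(self, s):
        return int(self.Q[s].argmax())

    def observe(self, s, executed_action, reward, s_next, won):
        # Every behaviour keeps learning its own Q-values from what actually happened.
        target = reward + self.gamma * self.Q[s_next].max()
        self.Q[s, executed_action] += self.alpha * (target - self.Q[s, executed_action])
        if not won:
            # W tracks how much this behaviour lost because its action was not executed.
            loss = self.Q[s, self.preferred_action(s)] - target
            self.W[s] += self.alpha * (loss - self.W[s])

def select_action(behaviors, s):
    # Obey the behaviour that would be most unhappy if ignored.
    winner = max(range(len(behaviors)), key=lambda i: behaviors[i].W[s])
    return winner, behaviors[winner].preferred_action(s)
```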
APA, Harvard, Vancouver, ISO, and other styles
50

Namvar, Gharehshiran Omid. "Reinforcement learning in non-stationary games." Thesis, University of British Columbia, 2015. http://hdl.handle.net/2429/51993.

Full text
Abstract:
The unifying theme of this thesis is the design and analysis of adaptive procedures that are aimed at learning the optimal decision in the presence of uncertainty. The first part is devoted to strategic decision making involving multiple individuals with conflicting interests, which is the subject of non-cooperative game theory. The proliferation of social networks has led to new ways of sharing information. Individuals subscribe to social groups in which their experiences are shared. These new information patterns facilitate the resolution of uncertainties. We present an adaptive learning algorithm that exploits these new patterns. Despite its deceptive simplicity, if followed by all individuals, the emergent global behavior resembles that obtained from fully rational considerations, namely, correlated equilibrium. Further, it responds to random unpredictable changes in the environment by properly tracking the evolving correlated equilibria set. Numerical evaluations verify that these new information patterns can lead to improved adaptability of individuals and, hence, faster convergence to correlated equilibrium. Motivated by the self-configuration feature of the game-theoretic design and the prevalence of wireless-enabled electronics, the proposed adaptive learning procedure is then employed to devise an energy-aware activation mechanism for wireless-enabled sensors which are assigned a parameter estimation task. The proposed game-theoretic model trades off sensors' contribution to the estimation task and the associated energy costs. The second part considers the problem of a single decision maker who seeks the optimal choice in the presence of uncertainty. This problem is mathematically formulated as a discrete stochastic optimization. In many real-life systems, due to the unexplained randomness and complexity involved, there typically exists no explicit relation between the performance measure of interest and the decision variables. In such cases, computer simulations are used as models of real systems to evaluate output responses. We present two simulation-based adaptive search schemes and show that, by following these schemes, the global optimum can be properly tracked as it undergoes random unpredictable jumps over time. Further, most of the simulation effort is exhausted on the global optimizer. Numerical evaluations verify faster convergence and improved efficiency as compared with existing random search, simulated annealing, and upper confidence bound methods.
Applied Science, Faculty of
Electrical and Computer Engineering, Department of
Graduate
APA, Harvard, Vancouver, ISO, and other styles
We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!

To the bibliography