Dissertations / Theses on the topic 'Policy gradients'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 30 dissertations / theses for your research on the topic 'Policy gradients.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Crowley, Mark. "Equilibrium policy gradients for spatiotemporal planning." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/38971.
Full text
Sehnke, Frank [Verfasser], Patrick van der [Akademischer Betreuer] Smagt, and Jürgen [Akademischer Betreuer] Schmidhuber. "Parameter Exploring Policy Gradients and their Implications / Frank Sehnke. Gutachter: Jürgen Schmidhuber. Betreuer: Patrick van der Smagt." München : Universitätsbibliothek der TU München, 2012. http://d-nb.info/1030099820/34.
Full text
Tolman, Deborah A. "Environmental Gradients, Community Boundaries, and Disturbance: the Darlingtonia Fens of Southwestern Oregon." PDXScholar, 2004. https://pdxscholar.library.pdx.edu/open_access_etds/3013.
Full text
Masoudi, Mohammad Amin. "Robust Deep Reinforcement Learning for Portfolio Management." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/42743.
Full text
Jacobzon, Gustaf, and Martin Larsson. "Generalizing Deep Deterministic Policy Gradient." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-239365.
Full text
Ковальов, Костянтин Миколайович. "Комп'ютерна система управління промисловим роботом" [Computer control system for an industrial robot]. Bachelor's thesis, КПІ ім. Ігоря Сікорського, 2019. https://ela.kpi.ua/handle/123456789/28610.
Full text
Qualifying work includes an explanatory note (56 pp., 2 appendices). The object of the study is reinforcement learning algorithms for the task of controlling an industrial robotic arm. Continuous control of an industrial robotic arm for non-trivial tasks is too complicated, or even unsolvable, for classical methods of robotics. Reinforcement learning methods can be used in this case: they are quite simple to implement, allow generalization to unseen cases, and learn from high-dimensional data. We implement the deep deterministic policy gradient algorithm, which is suitable for complex continuous control tasks. During the study:
• An analysis of existing classical methods for the problem of industrial robot control was conducted
• An analysis of existing reinforcement learning algorithms and their use in the field of robotics was conducted
• The deep deterministic policy gradient algorithm was implemented
• The implemented algorithm was tested on a simplified environment
• A neural network architecture was proposed for solving the problem
• The algorithm was tested on the training set of objects
• The algorithm was tested for its generalization ability on the test set
It was shown that the deep deterministic policy gradient algorithm, with a neural network as the policy approximator, is able to solve the problem with an image as input and to generalize to objects not seen before.
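The deterministic policy-gradient update at the core of the algorithm named above can be sketched with linear actor/critic approximators standing in for the deep networks. All names, dimensions, and the linear forms are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2

# Linear stand-ins for the deep networks: mu(s) = W_actor @ s,
# Q(s, a) = w_critic . [s; a].
W_actor = rng.normal(scale=0.1, size=(action_dim, state_dim))
w_critic = rng.normal(scale=0.1, size=state_dim + action_dim)

def actor(s):
    return W_actor @ s

def critic(s, a):
    return w_critic @ np.concatenate([s, a])

def ddpg_actor_update(s, lr=1e-2):
    # Deterministic policy gradient: gradient ascent on Q(s, mu(s)).
    # For this linear critic, dQ/da is the critic's action weights,
    # and d mu / d W_actor contributes outer(dQ/da, s).
    dq_da = w_critic[state_dim:]
    return W_actor + lr * np.outer(dq_da, s)

s = rng.normal(size=state_dim)
W_new = ddpg_actor_update(s)
```

For this linear critic the update cannot decrease Q(s, mu(s)); the full algorithm adds replay buffers, target networks, and exploration noise on top of this core step.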
Greensmith, Evan, and evan greensmith@gmail com. "Policy Gradient Methods: Variance Reduction and Stochastic Convergence." The Australian National University. Research School of Information Sciences and Engineering, 2005. http://thesis.anu.edu.au./public/adt-ANU20060106.193712.
Full text
Greensmith, Evan. "Policy gradient methods : variance reduction and stochastic convergence /." View thesis entry in Australian Digital Theses Program, 2005. http://thesis.anu.edu.au/public/adt-ANU20060106.193712/index.html.
Full text
Aberdeen, Douglas Alexander, and doug aberdeen@anu edu au. "Policy-Gradient Algorithms for Partially Observable Markov Decision Processes." The Australian National University. Research School of Information Sciences and Engineering, 2003. http://thesis.anu.edu.au./public/adt-ANU20030410.111006.
Full text
Aberdeen, Douglas Alexander. "Policy-gradient algorithms for partially observable Markov decision processes /." View thesis entry in Australian Digital Theses Program, 2003. http://thesis.anu.edu.au/public/adt-ANU20030410.111006/index.html.
Full text
Lidström, Christian, and Hannes Leskelä. "Learning for RoboCup Soccer : Policy Gradient Reinforcement Learning in multi-agent systems." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-157469.
Full text
RoboCup Soccer is an annual worldwide robotics competition in which teams of autonomous robot agents play football against each other. This report focuses on the 2D simulator, a variant in which no physical robots are needed; instead, the player clients communicate with a server that keeps track of the game state. RoboCup Soccer 2D simulation has become a major subject of research on artificial intelligence, on cooperation and behaviour in multi-agent systems, and on how these are learned. Some form of machine learning is a requirement for competing at the highest level, as the problem is too complex for the decision-making to be programmed manually. This report finds that PGRL is a common machine learning method among RoboCup teams and is used by some of the best teams in RoboCup. The report also finds that PGRL is an efficient form of machine learning in terms of learning speed, but that many factors can affect this. A trade-off must usually be made between learning speed and precision.
GAVELLI, VIKTOR, and ALEXANDER GOMEZ. "Multi-agent system with Policy Gradient Reinforcement Learning for RoboCup Soccer Simulator." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-157418.
Full text
RoboCup Soccer Simulator is a multi-agent football simulator used in competitions to simulate robots playing football. These competitions are held mainly to promote research in robotics and artificial intelligence by providing a cheap and accessible way to program robot-like agents. This report describes and tests an implementation of a multi-agent football team. Policy Gradient Reinforcement Learning (PGRL) is used to train and modify the team's behaviour. The results show that PGRL improves the team's performance, but when the team's performance differs considerably from the opponent's, the results become inconclusive.
Poulin, Nolan. "Proactive Planning through Active Policy Inference in Stochastic Environments." Digital WPI, 2018. https://digitalcommons.wpi.edu/etd-theses/1267.
Full text
Pianazzi, Enrico. "A deep reinforcement learning approach based on policy gradient for mobile robot navigation." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2022.
Find full text
Fleming, Brian James. "The social gradient in health : trends in C20th ideas, Australian Health Policy 1970-1998, and a health equity policy evaluation of Australian aged care planning /." Title page, abstract and table of contents only, 2003. http://web4.library.adelaide.edu.au/theses/09PH/09phf5971.pdf.
Full text
Björnberg, Adam, and Haris Poljo. "Impact of observation noise and reward sparseness on Deep Deterministic Policy Gradient when applied to inverted pendulum stabilization." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-259758.
Full text
Deep reinforcement learning (RL) algorithms have been shown to solve complex problems. Deep Deterministic Policy Gradient (DDPG) is a modern deep RL algorithm that can handle environments with continuous action spaces. This study evaluates how the DDPG algorithm performs, in terms of solution rate and outcome, depending on observation noise and reward sparseness in a simple environment. A threshold for how much Gaussian noise can be added to observations before the algorithm's performance begins to degrade was found between standard deviations of 0.025 and 0.05. It was also concluded that reward sparseness leads to inconsistent results and irreproducibility, which shows the importance of a well-designed reward function. Further tests are required to thoroughly evaluate the effect of combining noisy observations and sparse reward signals.
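The noise-threshold experiment described in the abstract above amounts to corrupting each observation with Gaussian noise before the agent sees it. A minimal sketch of such a wrapper, with a toy stand-in environment (all names and the environment interface are illustrative assumptions, not the thesis code):

```python
import numpy as np

class NoisyObservations:
    """Wrap an environment so its observations are corrupted with
    Gaussian noise of a chosen standard deviation (e.g. 0.025 vs. 0.05)."""
    def __init__(self, env, std, seed=0):
        self.env = env
        self.std = std
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        obs, reward, done = self.env.step(action)
        noisy = obs + self.rng.normal(0.0, self.std, size=obs.shape)
        return noisy, reward, done

class ConstantEnv:
    """Toy stand-in environment that always returns a zero observation."""
    def step(self, action):
        return np.zeros(3), 0.0, False

env = NoisyObservations(ConstantEnv(), std=0.05)
obs, _, _ = env.step(None)
```

Sweeping `std` over a grid and measuring the solution rate at each value is then enough to locate the kind of degradation threshold the study reports.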
Tagesson, Dennis. "A Comparison Between Deep Q-learning and Deep Deterministic Policy Gradient for an Autonomous Drone in a Simulated Environment." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-55134.
Full text
Kaisaravalli, Bhojraj Gokul, and Yeswanth Surya Achyut Markonda. "Policy-based Reinforcement learning control for window opening and closing in an office building." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34420.
Full text
Olafsson, Björgvin. "Partially Observable Markov Decision Processes for Faster Object Recognition." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-198632.
Full text
Cox, Carissa. "Spatial Patterns in Development Regulation: Tree Preservation Ordinances of the DFW Metropolitan Area." Thesis, University of North Texas, 2011. https://digital.library.unt.edu/ark:/67531/metadc84194/.
Full text
McDowell, Journey. "Comparison of Modern Controls and Reinforcement Learning for Robust Control of Autonomously Backing Up Tractor-Trailers to Loading Docks." DigitalCommons@CalPoly, 2019. https://digitalcommons.calpoly.edu/theses/2100.
Full text
Michaud, Brianna. "A Habitat Analysis of Estuarine Fishes and Invertebrates, with Observations on the Effects of Habitat-Factor Resolution." Scholar Commons, 2016. http://scholarcommons.usf.edu/etd/6543.
Full text
Olsson, Anton, and Felix Rosberg. "Domain Transfer for End-to-end Reinforcement Learning." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-43042.
Full text
Aklil, Nassim. "Apprentissage actif sous contrainte de budget en robotique et en neurosciences computationnelles. Localisation robotique et modélisation comportementale en environnement non stationnaire." Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066225/document.
Full text
Decision-making is a highly researched field in science, be it in neuroscience, to understand the processes underlying animal decision-making, or in robotics, to model efficient and rapid decision-making processes in real environments. In neuroscience, this problem is addressed online with sequential decision-making models based on reinforcement learning. In robotics, the primary objective is efficiency, so that systems can be deployed in real environments. However, what can be called the budget in robotics (the limitations inherent to the hardware, such as computation time, the limited actions available to the robot, or the lifetime of the robot's battery) is often not taken into account at present. In this thesis we propose to introduce the notion of budget as an explicit constraint in robotic learning processes applied to a localization task, by implementing a model based on work developed in statistical learning that processes data under explicit constraints, limiting the input of data or imposing a more explicit time constraint. In order to discuss the online operation of this type of budgeted learning algorithm, we also discuss some possible inspirations that could be drawn from computational neuroscience. In this context, the alternation between retrieving information for localization and deciding to move may, for a robot, be indirectly linked to the notion of the exploration-exploitation trade-off. We present our contribution to the modeling of this trade-off in animals in a non-stationary task involving different levels of uncertainty, and we make the link with multi-armed bandit methods.
Su, Xiaoshan. "Three Essays on the Design, Pricing, and Hedging of Insurance Contracts." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE2065.
Full text
This thesis makes use of theoretical tools from finance, decision theory, and machine learning to improve the design, pricing, and hedging of insurance contracts. Chapter 3 develops closed-form pricing formulas for participating life insurance contracts, based on matrix Wiener-Hopf factorization, where multiple risk sources, such as credit, market, and economic risks, are considered. The pricing method proves to be accurate and efficient. Dynamic and semi-static hedging strategies are introduced to help insurance companies reduce the risk exposure arising from issuing participating contracts. Chapter 4 discusses optimal contract design when the insured is third-degree risk averse. The results show that dual limited stop-loss, change-loss, dual change-loss, and stop-loss can be optimal contracts favored by both risk averters and risk lovers in different settings. Chapter 5 develops a stochastic gradient boosting frequency-severity model, which improves on the important and popular GLM and GAM frequency-severity models. This model fully inherits the advantages of the gradient boosting algorithm, overcoming the restrictive linear or additive forms of the GLM and GAM frequency-severity models by learning the model structure from data. Further, our model can also capture the flexible nonlinear dependence between claim frequency and severity.
Cai, Bo-Yin, and 蔡博胤. "A Behavior Fusion Approach Based on Policy Gradient." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/u6ctx3.
Full text
National Sun Yat-sen University (國立中山大學)
Department of Electrical Engineering (電機工程學系研究所)
Academic year 107 (ROC calendar)
In this study, we propose a behavior fusion algorithm based on the policy gradient. We use an Actor-Critic algorithm to train the sub-tasks; after training is completed, the behavior fusion algorithm proposed in this paper is used for learning complex tasks. We obtain the state-value function of each sub-task in each state by reading the trained sub-task neural networks, then calculate the return of each sub-task, and pass the normalized returns to the behavior fusion algorithm as a policy gradient. When reinforcement learning is applied to a complex task, the reward function is often difficult to design. With a sparse reward, although the best solution can be reached in theory, training takes a long time; with a dense reward, although training is faster, the agent easily gets stuck in a local minimum. If the complex task is decomposed into several sub-tasks for training, the reward functions of the sub-tasks are easier to design, and after training is completed these sub-tasks can be merged to achieve the complex task. In this study, we use the wafer probe simulator designed by our laboratory and Pong from the Atari games as test environments. The wafer probe simulator simulates how the probe moves when the fab inspects chips; the goal is to have every chip on the wafer checked exactly once, without repeatedly checking the same chip. The Pong environment is about letting the agent learn on its own to defeat the computer.
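The fusion step described above, normalized sub-task returns weighting the sub-policies, can be sketched as follows. The softmax normalization and all names are assumptions for illustration, not the thesis's exact scheme:

```python
import numpy as np

def fuse_policies(sub_values, sub_action_probs):
    """Blend pre-trained sub-policies into one action distribution.

    sub_values: (k,) state values reported by the k sub-task critics.
    sub_action_probs: (k, n_actions) action distributions of the k
        sub-policies in the current state.
    Returns a single fused action distribution over n_actions.
    """
    v = np.asarray(sub_values, dtype=float)
    # Softmax-normalize the returns so higher-value sub-tasks get
    # more weight in the fused behavior.
    weights = np.exp(v - v.max())
    weights /= weights.sum()
    fused = weights @ np.asarray(sub_action_probs, dtype=float)
    return fused / fused.sum()

# Two sub-policies: the second has the higher state value, so its
# preferred action should dominate the fused distribution.
probs = fuse_policies([1.0, 3.0], [[0.9, 0.1], [0.2, 0.8]])
```

The fused distribution can then be sampled directly, or used as the target of a policy-gradient update as the abstract suggests.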
Chen, Yi-Ching, and 陳怡靜. "Solving Rubik's Cube by Policy Gradient Based Reinforcement Learning." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/t842yt.
Full text
National Tsing Hua University (國立清華大學)
Department of Computer Science (資訊工程學系所)
Academic year 107 (ROC calendar)
Reinforcement Learning provides a mechanism for training an agent to interact with its environment. Policy gradient methods make the right actions more probable. We propose using a linear policy gradient method in deep neural network-based reinforcement learning. The proposed method employs an intensifying reward function to increase the probabilities of right actions for solving Rubik's Cube problems. Experiments show that our proposed neural network learned to solve some Rubik's Cube states. For more difficult initial states, the network still cannot always give the correct suggestion.
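The core mechanism the abstract describes, making right actions more probable via the policy gradient, can be sketched with a REINFORCE-style update on a two-action toy problem. The linear softmax policy and reward here are illustrative; the thesis uses a deep network on Rubik's Cube states:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(2)  # one logit per action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(theta, lr=0.5):
    """Score-function (REINFORCE) update: sample an action, then push
    probability mass toward it in proportion to the reward received."""
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    reward = 1.0 if a == 0 else 0.0   # action 0 is the "right" move
    grad_log = -probs                  # d log pi(a) / d theta ...
    grad_log[a] += 1.0                 # ... = one-hot(a) - probs
    return theta + lr * reward * grad_log

for _ in range(200):
    theta = reinforce_step(theta)
```

Only the rewarded action's log-probability is ever reinforced, so the policy concentrates on it; an intensifying reward function as in the abstract would scale `reward` up as the cube gets closer to solved.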
Kiah-Yang Chong and 張家揚. "Design and Implementation of Fuzzy Policy Gradient Gait Learning Method for Humanoid Robot." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/90100127378597192142.
Full text
National Cheng Kung University (國立成功大學)
Department of Electrical Engineering, Master's and Doctoral Program (電機工程學系碩博士班)
Academic year 98 (ROC calendar)
The design and implementation of a Fuzzy Policy Gradient Learning (FPGL) method for a small-sized humanoid robot is proposed in this thesis. The thesis not only introduces the mechanical structure of the humanoid robot, named aiRobots-V, and the hardware system used on it, but also improves and parameterizes the robot's gait pattern. Arm movement is added to the gait pattern to reduce the tilt of the trunk while walking. FPGL is an integrated machine learning method based on the Policy Gradient Reinforcement Learning (PGRL) method and fuzzy logic concepts, intended to improve the efficiency and speed of gait learning computation. The humanoid robot is trained with FPGL, using the walking distance over a constant number of walking cycles as the reward, to learn a faster and more stable gait automatically. The tilt of the trunk is chosen as the reward for learning the arm movement within the walking cycle. The experimental results show that FPGL could train the gait pattern from a walking speed of 9.26 mm/s to 162.27 mm/s in about an hour. The training data also show that this method can improve the efficiency of the basic PGRL method by up to 13%. The effect of arm movement in reducing the tilt of the trunk is likewise confirmed by the experimental results. The robot was also entered in the throw-in technical challenge of RoboCup 2010.
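PGRL-style gait learning of the kind FPGL builds on is typically a finite-difference policy gradient over the gait parameters: perturb the parameters, score each perturbed gait with the walking-distance reward, and step along the estimated gradient. A toy sketch, where the quadratic `walk_reward` is a stand-in for real robot rollouts and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def walk_reward(params):
    # Stand-in for a robot rollout: walking distance peaks at some
    # unknown optimal gait-parameter vector.
    optimum = np.array([0.3, -0.1, 0.5])
    return -np.sum((params - optimum) ** 2)

def fd_policy_gradient_step(params, eps=0.05, lr=0.2, n_perturb=12):
    """One finite-difference policy-gradient step over gait parameters."""
    grad = np.zeros_like(params)
    base = walk_reward(params)
    for _ in range(n_perturb):
        # Perturb each parameter by -eps, 0, or +eps and credit the
        # perturbation direction with the observed reward change.
        delta = rng.choice([-eps, 0.0, eps], size=params.shape)
        grad += (walk_reward(params + delta) - base) * delta
    return params + lr * grad / (n_perturb * eps ** 2)

params = np.zeros(3)
for _ in range(50):
    params = fd_policy_gradient_step(params)
```

The fuzzy-logic component of FPGL would then shape step sizes or perturbation magnitudes; this sketch shows only the underlying PGRL loop.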
"Adaptive Curvature for Stochastic Optimization." Master's thesis, 2019. http://hdl.handle.net/2286/R.I.53675.
Full text
Dissertation/Thesis
Master's Thesis, Computer Science, 2019
Pereira, Bruno Alexandre Barbosa. "Deep reinforcement learning for robotic manipulation tasks." Master's thesis, 2021. http://hdl.handle.net/10773/33654.
Full text
Recent advances in Artificial Intelligence (AI) open up a set of new opportunities for robotics. Deep Reinforcement Learning (DRL) is a subfield of AI that results from the combination of Deep Learning (DL) with Reinforcement Learning (RL). This subfield defines machine learning algorithms that learn directly from experience and offers a comprehensive approach to studying the interaction between learning, representation, and decision-making. These algorithms have already been used successfully in different domains. Notably, DRL agents learned to play video games for the Atari 2600 console directly from pixels and reached human-comparable performance on 49 of those games. More recently, DRL together with other techniques produced agents capable of playing the board game Go at a professional level, something that until then was seen as a problem too complex to solve because of its enormous search space. In robotics, DRL has been used in problems of planning, navigation, optimal control, and others. In these applications, the excellent function approximation and representation learning capabilities of Deep Neural Networks allow RL to scale to problems with multidimensional state and action spaces. Additionally, properties inherent to DRL make transfer learning useful when moving from simulation to the real world. This dissertation aims to investigate the applicability and effectiveness of DRL techniques for learning successful policies in the domain of robotic manipulation tasks. Initially, a set of three classic RL problems was solved using RL and DRL algorithms in order to explore their practical implementation and arrive at a class of algorithms appropriate for these robotics tasks.
Subsequently, a task was defined in simulation in which an agent must control a manipulator with 6 degrees of freedom so that its end-effector reaches a target. This task is used to evaluate the effect on performance of different state representations, hyperparameters, and state-of-the-art DRL algorithms, which resulted in agents with high success rates. The focus is then placed on the speed and time constraints of end-effector positioning. To this end, different reward systems were tested so that an agent can learn a modified version of the previous task at higher joint velocities. In this scenario, several improvements over the original reward system were observed. Finally, an application of the best agent obtained in the previous experiments is demonstrated in a ball-catching scenario.
Master's in Computer and Telematics Engineering (Mestrado em Engenharia de Computadores e Telemática)