Dissertations / Theses on the topic 'Policy gradient'

To see the other types of publications on this topic, follow the link: Policy gradient.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 49 dissertations / theses for your research on the topic 'Policy gradient.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Jacobzon, Gustaf, and Martin Larsson. "Generalizing Deep Deterministic Policy Gradient." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-239365.

Full text
Abstract:
We extend Deep Deterministic Policy Gradient, a state-of-the-art algorithm for continuous control, in order to achieve a high generalization capability. To improve the agent's generalization we introduce dropout, one of the most successful regularization techniques for generalization in machine learning, into the algorithm. We use the recently published exploration technique, parameter space noise, to achieve higher stability and a lower likelihood of converging to a poor local minimum. We also replace the Rectified Linear Unit (ReLU) nonlinearity with the Exponential Linear Unit (ELU) for greater stability and faster learning. Our results show that an agent trained with dropout has generalization capabilities that far exceed those of one trained with L2 regularization, when evaluated in the racing simulator TORCS. Further, we found ELU to produce a more stable and faster learning process than ReLU when evaluated in the physics simulator MuJoCo.
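As a rough illustration of the techniques named in this abstract, here is a minimal sketch (not the authors' code) of a DDPG-style actor network that uses ELU activations and dropout, plus a crude form of parameter space noise; the layer widths, dropout rate, noise scale, observation/action dimensions, and tanh output squashing are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy network with ELU activations and dropout.

    Illustrative only: layer widths, dropout rate and the tanh output
    squashing are assumptions, not values taken from the thesis.
    """
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256, p_drop: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ELU(),                # ELU instead of ReLU
            nn.Dropout(p=p_drop),    # dropout regularization for generalization
            nn.Linear(hidden, hidden),
            nn.ELU(),
            nn.Dropout(p=p_drop),
            nn.Linear(hidden, act_dim),
            nn.Tanh(),               # bounded continuous actions in [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

def perturb_parameters(actor: Actor, stddev: float = 0.05) -> Actor:
    """Crude parameter space noise: perturb a copy of the actor's weights
    before rolling out an episode, instead of adding noise to the actions."""
    noisy = copy.deepcopy(actor)
    with torch.no_grad():
        for param in noisy.parameters():
            param.add_(torch.randn_like(param) * stddev)
    return noisy

actor = Actor(obs_dim=8, act_dim=2)          # placeholder dimensions
noisy_actor = perturb_parameters(actor)      # exploration policy for one episode
```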
APA, Harvard, Vancouver, ISO, and other styles
2

Greensmith, Evan. "Policy Gradient Methods: Variance Reduction and Stochastic Convergence." The Australian National University. Research School of Information Sciences and Engineering, 2005. http://thesis.anu.edu.au./public/adt-ANU20060106.193712.

Full text
Abstract:
In a reinforcement learning task an agent must learn a policy for performing actions so as to perform well in a given environment. Policy gradient methods consider a parameterized class of policies and, using a policy from the class and a trajectory through the environment taken by the agent using this policy, estimate the performance of the policy with respect to the parameters. Policy gradient methods avoid some of the problems of value function methods, such as policy degradation, where inaccuracy in the value function leads to the choice of a poor policy. However, the estimates produced by policy gradient methods can have high variance.

In Part I of this thesis we study the estimation variance of policy gradient algorithms, in particular when augmenting the estimate with a baseline, a common method for reducing estimation variance, and when using actor-critic methods. A baseline adjusts the reward signal supplied by the environment, and can be used to reduce the variance of a policy gradient estimate without adding any bias. We find the baseline that minimizes the variance. We also consider the class of constant baselines, and find the constant baseline that minimizes the variance. We compare this to the common technique of adjusting the rewards by an estimate of the performance measure. Actor-critic methods usually attempt to learn a value function accurate enough to be used in a gradient estimate without adding much bias. In this thesis we propose that in learning the value function we should also consider the variance. We show how considering the variance of the gradient estimate when learning a value function can be beneficial, and we introduce a new optimization criterion for selecting a value function.

In Part II of this thesis we consider online versions of policy gradient algorithms, where we update our policy for selecting actions at each step in time, and study the convergence of these online algorithms. For such online gradient-based algorithms, convergence results aim to show that the gradient of the performance measure approaches zero. Such a result has been shown for an algorithm that is based on observing trajectories between visits to a special state of the environment. However, the algorithm is not suitable in a partially observable setting, where we are unable to access the full state of the environment, and its variance depends on the time between visits to the special state, which may be large even when only a few samples are needed to estimate the gradient. To date, convergence results for algorithms that do not rely on a special state are weaker. We show that, for a certain algorithm that does not rely on a special state, the gradient of the performance measure approaches zero. We show that this continues to hold when using certain baseline algorithms suggested by the results of Part I.
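To make the baseline idea concrete, the following toy sketch (not from the thesis) estimates a REINFORCE gradient for a one-state softmax policy with and without a constant baseline; the reward values, noise level, and baseline choice are assumptions. The point is only that subtracting a constant baseline leaves the mean of the estimate unchanged while changing its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-state bandit with a softmax policy over two actions.
theta = np.array([0.2, -0.1])          # policy parameters (one logit per action)
true_reward = np.array([1.0, 0.0])     # expected reward of each action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(action, probs):
    # For a softmax policy: d/dtheta log pi(action) = indicator - probabilities.
    g = -probs.copy()
    g[action] += 1.0
    return g

def estimate_gradient(n_samples, baseline=0.0):
    """REINFORCE samples of the performance gradient, optionally with a
    constant baseline subtracted from the observed reward."""
    probs = softmax(theta)
    grads = []
    for _ in range(n_samples):
        a = rng.choice(2, p=probs)
        r = true_reward[a] + rng.normal(scale=0.5)   # noisy reward
        grads.append((r - baseline) * grad_log_pi(a, probs))
    return np.array(grads)

no_baseline = estimate_gradient(10_000, baseline=0.0)
with_baseline = estimate_gradient(10_000, baseline=true_reward.mean())

# Both estimators have (approximately) the same mean, but the variance differs.
print("means   :", no_baseline.mean(axis=0), with_baseline.mean(axis=0))
print("variance:", no_baseline.var(axis=0), with_baseline.var(axis=0))
```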
APA, Harvard, Vancouver, ISO, and other styles
3

Greensmith, Evan. "Policy gradient methods : variance reduction and stochastic convergence /." View thesis entry in Australian Digital Theses Program, 2005. http://thesis.anu.edu.au/public/adt-ANU20060106.193712/index.html.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Aberdeen, Douglas Alexander. "Policy-Gradient Algorithms for Partially Observable Markov Decision Processes." The Australian National University. Research School of Information Sciences and Engineering, 2003. http://thesis.anu.edu.au./public/adt-ANU20030410.111006.

Full text
Abstract:
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms is the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs).

In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember.

Monte-Carlo policy-gradient approaches tend to produce gradient estimates with high variance. Two novel methods for reducing variance are introduced. The first uses high-order filters to replace the eligibility trace of the gradient estimator. The second uses a low-variance value-function method to learn a subset of the parameters and a policy-gradient method to learn the remainder.

The algorithms are applied to large domains including a simulated robot navigation scenario, a multi-agent scenario with 21,000 states, and the complex real-world task of large-vocabulary continuous speech recognition. To the best of the author's knowledge, no other policy-gradient algorithms have performed well at such tasks.

The high variance of Monte-Carlo methods requires lengthy simulation and hence a supercomputer to train agents within a reasonable time. The ANU "Bunyip" Linux cluster was built with such tasks in mind. It was used for several of the experimental results presented here. One chapter of this thesis describes an application written for the Bunyip cluster that won the international Gordon Bell prize for price/performance in 2001.
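For context, the eligibility-trace estimator that the first variance-reduction method replaces with higher-order filters is, in its standard form, the GPOMDP-style estimator sketched below; the discount factor and the dummy trajectory are placeholders, not values from the thesis.

```python
import numpy as np

def gpomdp_estimate(score_grads, rewards, beta=0.95):
    """GPOMDP-style gradient estimate from a single trajectory.

    score_grads[t] is grad_theta log pi(a_t | o_t) and rewards[t] is the
    reward received after taking a_t.  The eligibility trace z is the
    first-order filter that higher-order filters would replace.
    """
    z = np.zeros_like(score_grads[0])       # eligibility trace
    delta = np.zeros_like(score_grads[0])   # running gradient estimate
    for t, (g, r) in enumerate(zip(score_grads, rewards)):
        z = beta * z + g
        delta += (r * z - delta) / (t + 1)  # running average of r_t * z_t
    return delta

# Dummy trajectory just to make the sketch runnable.
rng = np.random.default_rng(1)
T, d = 200, 3
score_grads = [rng.normal(size=d) for _ in range(T)]
rewards = rng.normal(size=T)
print(gpomdp_estimate(score_grads, rewards))
```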
APA, Harvard, Vancouver, ISO, and other styles
5

Aberdeen, Douglas Alexander. "Policy-gradient algorithms for partially observable Markov decision processes /." View thesis entry in Australian Digital Theses Program, 2003. http://thesis.anu.edu.au/public/adt-ANU20030410.111006/index.html.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Lidström, Christian, and Hannes Leskelä. "Learning for RoboCup Soccer : Policy Gradient Reinforcement Learning in multi-agent systems." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-157469.

Full text
Abstract:
RoboCup Soccer is a long-running yearly worldwide robotics competition in which teams of autonomous robot agents play soccer against each other. This report focuses on the 2D simulator variant, where no actual robots are needed and the agents instead communicate with a server which keeps track of the game state. RoboCup Soccer 2D simulation has become a major topic of research for artificial intelligence, cooperative behaviour in multi-agent systems, and the learning thereof. Some form of machine learning is mandatory if you want to compete at the highest level, as the problem is too complex for manual configuration of a team's decision making. This report finds that PGRL is a common method for machine learning in RoboCup teams; it is utilized in some of the best teams in RoboCup. The report also finds that PGRL is an effective form of machine learning in terms of learning speed, but there are many factors which affect this. Most often a compromise has to be made between speed of learning and precision.
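As background, one common PGRL formulation used for parameterized soccer skills is finite-difference policy gradient, sketched below under the assumption of a black-box evaluate() function that returns an average score from simulated matches; this is illustrative only and is not taken from the report or from any particular RoboCup team.

```python
import numpy as np

rng = np.random.default_rng(2)

def evaluate(params):
    """Placeholder for running simulated matches with the given policy
    parameters and returning an average performance score."""
    target = np.array([0.5, -0.3, 0.8])
    return -np.sum((params - target) ** 2) + rng.normal(scale=0.01)

def finite_difference_pg(params, epsilon=0.05, n_perturbations=20, step=0.1):
    """One finite-difference policy-gradient update: perturb each parameter
    up/down, compare the resulting scores, and move along the estimated slope."""
    perturbations = rng.choice([-epsilon, 0.0, epsilon],
                               size=(n_perturbations, params.size))
    scores = np.array([evaluate(params + p) for p in perturbations])
    grad = np.zeros_like(params)
    for d in range(params.size):
        plus = scores[perturbations[:, d] > 0]
        minus = scores[perturbations[:, d] < 0]
        if len(plus) and len(minus):
            grad[d] = plus.mean() - minus.mean()
    norm = np.linalg.norm(grad)
    return params + step * grad / norm if norm > 0 else params

params = np.zeros(3)
for _ in range(50):
    params = finite_difference_pg(params)
print(params)   # should move toward the (hidden) optimum of evaluate()
```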
APA, Harvard, Vancouver, ISO, and other styles
7

GAVELLI, VIKTOR, and ALEXANDER GOMEZ. "Multi-agent system with Policy Gradient Reinforcement Learning for RoboCup Soccer Simulator." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-157418.

Full text
Abstract:
The RoboCup Soccer Simulator is a multi-agent soccer simulator used in competitions to simulate soccer-playing robots. These competitions are mainly held to promote robotics and AI research by providing a cheap and accessible way to program robot-like agents. In this report a learning multi-agent soccer team is implemented, described and tested. Policy Gradient Reinforcement Learning (PGRL) is used to train and alter the strategic decision making of the agents. The results show that PGRL improves the performance of the learning team, but when the gap in performance between the learning team and the opponent is big the results were inconclusive.
APA, Harvard, Vancouver, ISO, and other styles
8

Pianazzi, Enrico. "A deep reinforcement learning approach based on policy gradient for mobile robot navigation." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2022.

Find full text
Abstract:
Reinforcement learning is a model-free technique to solve decision-making problems by learning the best behavior to solve a specific task in a given environment. This thesis work focuses on state-of-the-art reinforcement learning methods and their application to mobile robot navigation and control. Our work is inspired by the recent developments in deep reinforcement learning and by the ever-growing need for complex control and navigation capabilities in autonomous mobile robots. We propose a reinforcement learning controller based on an actor-critic approach to navigate a mobile robot in an initially unknown environment. The task is to navigate the robot from a random initial point on the map to a fixed goal point, while trying to stay within the environment limits and to avoid obstacles on the path. The agent has no initial knowledge of the environment's characteristics, including the goal and obstacle positions. The adopted algorithm is the so-called Deep Deterministic Policy Gradient (DDPG), which is able to deal with continuous states and inputs thanks to the use of neural networks in the actor-critic architecture and of the policy gradient to update the neural network representing the control policy. The learned controller directly outputs velocity commands to the robot, basing its decisions on the robot's position, without the need for additional sensory data. The robot is simulated as a unicycle kinematic model, and we present an implementation of the learning algorithm and robot simulation, developed in Python, that is able to solve the goal-reaching task while avoiding obstacles with a success rate above 95%.
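For readers unfamiliar with DDPG, the sketch below shows the core update step: the critic regresses toward a bootstrapped target, the actor follows the deterministic policy gradient, and target networks are updated by Polyak averaging. The network sizes, learning rates, and dummy batch are assumptions, not the thesis' actual configuration.

```python
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 3, 2, 0.99, 0.005   # illustrative sizes/values

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(batch):
    """One DDPG step on a batch of transitions (s, a, r, s2, done)."""
    s, a, r, s2, done = batch
    # Critic: regress Q(s, a) toward the bootstrapped target.
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * critic_target(
            torch.cat([s2, actor_target(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, actor(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak averaging of the target networks.
    with torch.no_grad():
        for p, pt in zip(actor.parameters(), actor_target.parameters()):
            pt.mul_(1 - tau).add_(tau * p)
        for p, pt in zip(critic.parameters(), critic_target.parameters()):
            pt.mul_(1 - tau).add_(tau * p)

# Dummy batch so the sketch runs end to end.
n = 32
batch = (torch.randn(n, obs_dim), torch.rand(n, act_dim) * 2 - 1,
         torch.randn(n, 1), torch.randn(n, obs_dim), torch.zeros(n, 1))
ddpg_update(batch)
```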
APA, Harvard, Vancouver, ISO, and other styles
9

Poulin, Nolan. "Proactive Planning through Active Policy Inference in Stochastic Environments." Digital WPI, 2018. https://digitalcommons.wpi.edu/etd-theses/1267.

Full text
Abstract:
In multi-agent Markov Decision Processes, a controllable agent must perform optimal planning in a dynamic and uncertain environment that includes another unknown and uncontrollable agent. Given a task specification for the controllable agent, its ability to complete the task can be impeded by an inaccurate model of the intent and behaviors of other agents. In this work, we introduce an active policy inference algorithm that allows a controllable agent to infer a policy of the environmental agent through interaction. Active policy inference is data-efficient and is particularly useful when data are time-consuming or costly to obtain. The controllable agent synthesizes an exploration-exploitation policy that incorporates the knowledge learned about the environment's behavior. Whenever possible, the agent also tries to elicit behavior from the other agent to improve the accuracy of the environmental model. This is done by mapping the uncertainty in the environmental model to a bonus reward, which helps elicit the most informative exploration, and allows the controllable agent to return to its main task as fast as possible. Experiments demonstrate the improved sample efficiency of active learning and the convergence of the policy for the controllable agents.
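One simple way to map model uncertainty to a bonus reward, in the spirit described above, is sketched below using the entropy of a Dirichlet-count estimate of the other agent's action distribution in each state; the counts, bonus weight, and use of entropy are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

n_states, n_actions = 5, 3
# Dirichlet pseudo-counts over the environmental agent's actions in each state.
counts = np.ones((n_states, n_actions))

def observe(state, action):
    """Update the model of the other agent after seeing it act."""
    counts[state, action] += 1

def uncertainty_bonus(state, beta=0.5):
    """Bonus reward proportional to the entropy of the estimated action
    distribution: high entropy = poorly known behavior = large bonus."""
    p = counts[state] / counts[state].sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return beta * entropy

def shaped_reward(task_reward, state):
    """Task reward plus exploration bonus used by the controllable agent."""
    return task_reward + uncertainty_bonus(state)

observe(0, 2); observe(0, 2); observe(0, 1)
print(uncertainty_bonus(0), uncertainty_bonus(1))   # state 0 is better known than state 1
```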
APA, Harvard, Vancouver, ISO, and other styles
10

Fleming, Brian James. "The social gradient in health : trends in C20th ideas, Australian Health Policy 1970-1998, and a health equity policy evaluation of Australian aged care planning /." Title page, abstract and table of contents only, 2003. http://web4.library.adelaide.edu.au/theses/09PH/09phf5971.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Björnberg, Adam, and Haris Poljo. "Impact of observation noise and reward sparseness on Deep Deterministic Policy Gradient when applied to inverted pendulum stabilization." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-259758.

Full text
Abstract:
Deep Reinforcement Learning (RL) algorithms have been shown to solve complex problems. Deep Deterministic Policy Gradient (DDPG) is a state-of-the-art deep RL algorithm able to handle environments with continuous action spaces. This thesis evaluates how the DDPG algorithm performs in terms of success rate and results depending on observation noise and reward sparseness, using a simple environment. A threshold for how much Gaussian noise can be added to observations before algorithm performance starts to decrease was found between a standard deviation of 0.025 and 0.05. It was also concluded that reward sparseness leads to result inconsistency and irreproducibility, showing the importance of a well-designed reward function. Further testing is required to thoroughly evaluate the performance impact when noisy observations and sparse rewards are combined.
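A minimal sketch of how Gaussian observation noise of the kind studied here can be injected through an environment wrapper is given below; the generic reset()/step() interface and the dummy environment are assumptions made only to keep the example self-contained.

```python
import numpy as np

class NoisyObservationWrapper:
    """Wraps any environment exposing reset()/step(action) and adds
    zero-mean Gaussian noise to every observation it returns."""

    def __init__(self, env, stddev=0.025):   # 0.025-0.05 was the critical range above
        self.env = env
        self.stddev = stddev
        self.rng = np.random.default_rng()

    def _noisy(self, obs):
        return obs + self.rng.normal(scale=self.stddev, size=np.shape(obs))

    def reset(self):
        return self._noisy(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._noisy(obs), reward, done, info

class DummyPendulum:
    """Stand-in environment so the sketch runs without external dependencies."""
    def reset(self):
        return np.zeros(3)
    def step(self, action):
        return np.zeros(3), 0.0, False, {}

env = NoisyObservationWrapper(DummyPendulum(), stddev=0.05)
print(env.reset())
```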
APA, Harvard, Vancouver, ISO, and other styles
12

Tagesson, Dennis. "A Comparison Between Deep Q-learning and Deep Deterministic Policy Gradient for an Autonomous Drone in a Simulated Environment." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-55134.

Full text
Abstract:
This thesis investigates how Deep Q-Network (DQN), which uses a continuous state space and a discrete action space, compares with Deep Deterministic Policy Gradient (DDPG), which uses a continuous state and action space, when both are trained in an environment with a continuous state and action space. The environment was a simulation where the task for the algorithms was to control a drone from the start position to the location of the goal. The purpose of this investigation is to gain insight into how important it is to consider the action space of the environment when choosing a reinforcement learning algorithm. The action space of the environment is discretized for Deep Q-Network by restricting the number of possible actions to six. A simulation experiment was conducted where the algorithms were trained in the environment. The experiments were divided into six tests, where each test had the algorithms trained for 5,000, 10,000, or 35,000 steps and with two different goal locations. The experiment was followed by an exploratory analysis of the collected data. Four different metrics were used to determine the performance. My analysis showed that DQN needed less experience to learn a successful policy than DDPG. Also, DQN outperformed DDPG in all tests but one. These results show that when choosing a reinforcement learning algorithm for a task, an algorithm with the same type of state and action space as the environment is not necessarily the most effective one.
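The discretization implied by "restricting the number of possible actions to six" could look like the hypothetical mapping below, where each discrete DQN action index is translated into a fixed velocity command; the specific commands and speed are assumptions, not the thesis' actual action set.

```python
import numpy as np

# Hypothetical mapping from six discrete DQN actions to continuous drone
# velocity commands (vx, vy, vz); the actual commands used in the thesis
# are not specified here.
DISCRETE_ACTIONS = np.array([
    [ 1.0,  0.0,  0.0],   # forward
    [-1.0,  0.0,  0.0],   # backward
    [ 0.0,  1.0,  0.0],   # left
    [ 0.0, -1.0,  0.0],   # right
    [ 0.0,  0.0,  1.0],   # up
    [ 0.0,  0.0, -1.0],   # down
])

def dqn_action_to_command(action_index: int, speed: float = 0.5) -> np.ndarray:
    """Translate the argmax of the Q-network's six outputs into a velocity command."""
    return speed * DISCRETE_ACTIONS[action_index]

# A DDPG actor, by contrast, would emit the (vx, vy, vz) vector directly.
print(dqn_action_to_command(4))
```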
APA, Harvard, Vancouver, ISO, and other styles
13

Olafsson, Björgvin. "Partially Observable Markov Decision Processes for Faster Object Recognition." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-198632.

Full text
Abstract:
Object recognition in the real world is a big challenge in the field of computer vision. Given the potentially enormous size of the search space, it is essential to be able to make intelligent decisions about where in the visual field to obtain information from, in order to reduce the computational resources needed. In this report a POMDP (Partially Observable Markov Decision Process) learning framework, using a policy gradient method and information rewards as a training signal, has been implemented and used to train fixation policies that aim to maximize the information gathered in each fixation. The purpose of such policies is to make object recognition faster by reducing the number of fixations needed. The trained policies are evaluated in simulation and compared with several fixed policies. Finally, it is shown that it is possible to use the framework to train policies that outperform the fixed policies for certain observation models.
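To illustrate what an information reward can mean in practice, the toy sketch below scores a fixation by the reduction in entropy of a belief over object classes after a Bayesian update; the class likelihoods and the use of entropy reduction are assumptions for illustration, not the report's exact reward.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

def belief_update(belief, likelihood):
    """Bayesian update of the belief over object classes given the likelihood
    of the current fixation's observation under each class."""
    posterior = belief * likelihood
    return posterior / posterior.sum()

def information_reward(belief, likelihood):
    """Reward = entropy reduction achieved by this fixation (information gained)."""
    new_belief = belief_update(belief, likelihood)
    return entropy(belief) - entropy(new_belief), new_belief

belief = np.array([0.25, 0.25, 0.25, 0.25])      # four candidate objects
likelihood = np.array([0.7, 0.1, 0.1, 0.1])      # observation favors class 0
reward, belief = information_reward(belief, likelihood)
print(reward, belief)
```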
APA, Harvard, Vancouver, ISO, and other styles
14

Kaisaravalli, Bhojraj Gokul, and Yeswanth Surya Achyut Markonda. "Policy-based Reinforcement learning control for window opening and closing in an office building." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34420.

Full text
Abstract:
The level of indoor comfort in an office building can be highly influenced by the occupants' window opening and closing behavior. If not properly managed, it affects not only the comfort level but also the energy consumption. This occupant behavior is not easy to predict and control in a conventional way. Nowadays, for a system to be called smart it must learn user behavior, as this gives valuable information to the controlling system. To control a window efficiently, we propose in this thesis RL (Reinforcement Learning), which should be able to learn user behavior and maintain an optimal indoor climate. The model-free nature of RL gives flexibility in developing an intelligent control system in a simpler way than conventional techniques. The data in our thesis are taken from an office building in Beijing. Value-based reinforcement learning has been implemented before for controlling the window; in this thesis we apply policy-based RL (the REINFORCE algorithm) and compare our results with a value-based method (Q-learning), thereby getting a better idea of which suits the task at hand and how the two behave. Based on our work, we find that policy-based RL provides a good trade-off between maintaining optimal indoor temperature and learning occupant behavior, which is important for a system to be called smart.
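A minimal sketch of the REINFORCE update for a binary open/close window action under a logistic policy is shown below; the features, reward definition, and learning rate are placeholders, not the design used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = np.zeros(3)   # weights for [indoor_temp, outdoor_temp, bias]

def policy(features):
    """Probability of opening the window under a logistic policy."""
    return 1.0 / (1.0 + np.exp(-features @ theta))

def reinforce_episode(episode, lr=0.01):
    """One REINFORCE update from a list of (features, action, reward) steps."""
    global theta
    returns = np.cumsum([r for _, _, r in episode][::-1])[::-1]   # reward-to-go
    for (features, action, _), g in zip(episode, returns):
        p_open = policy(features)
        grad_log_pi = (action - p_open) * features   # d/dtheta log Bernoulli(p_open)
        theta = theta + lr * g * grad_log_pi

# Toy episode: reward favors opening when indoors is warmer than outdoors.
episode = []
for _ in range(24):
    features = np.array([rng.uniform(20, 30), rng.uniform(10, 25), 1.0])
    action = rng.random() < policy(features)
    reward = 1.0 if action == (features[0] > features[1]) else -1.0
    episode.append((features, float(action), reward))
reinforce_episode(episode)
print(theta)
```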
APA, Harvard, Vancouver, ISO, and other styles
15

Cox, Carissa. "Spatial Patterns in Development Regulation: Tree Preservation Ordinances of the DFW Metropolitan Area." Thesis, University of North Texas, 2011. https://digital.library.unt.edu/ark:/67531/metadc84194/.

Full text
Abstract:
Land use regulations are typically established as a response to development activity. For effective growth management and habitat preservation, the opposite should occur. This study considers tree preservation ordinances of the Dallas-Fort Worth metropolitan area as a means of evaluating development regulation in a metropolitan context. It documents the impact urban cores have on regulations and policies throughout their region, demonstrating that the same urban-rural gradient used to describe physical components of our metropolitan areas also holds true in terms of policy formation. Although sophistication of land use regulation generally dissipates as one moves away from an urban core, native habitat is more pristine at the outer edges. To more effectively protect native habitat, regional preservation measures are recommended.
APA, Harvard, Vancouver, ISO, and other styles
16

McDowell, Journey. "Comparison of Modern Controls and Reinforcement Learning for Robust Control of Autonomously Backing Up Tractor-Trailers to Loading Docks." DigitalCommons@CalPoly, 2019. https://digitalcommons.calpoly.edu/theses/2100.

Full text
Abstract:
Two controller performances are assessed for generalization in the path following task of autonomously backing up a tractor-trailer. Starting from random locations and orientations, paths are generated to loading docks with arbitrary pose using Dubins Curves. The combination vehicles can be varied in wheelbase, hitch length, weight distributions, and tire cornering stiffness. The closed-form calculation of the gains for the Linear Quadratic Regulator (LQR) relies heavily on having an accurate model of the plant. However, real-world applications cannot expect to have an updated model for each new trailer. Finding alternative robust controllers when the trailer model is changed was the motivation of this research. Reinforcement learning, with neural networks as its function approximators, allows for generalized control from learned experience that is characterized by a scalar reward value. The Linear Quadratic Regulator and the Deep Deterministic Policy Gradient (DDPG) are compared for robust control when the trailer is changed. This investigation quantifies the capabilities and limitations of both controllers in simulation using a kinematic model. The controllers are evaluated for generalization by altering the kinematic model trailer wheelbase, hitch length, and velocity from the nominal case. In order to close the gap between simulation and reality, the control methods are also assessed with sensor noise and various controller frequencies. The root mean squared and maximum errors from the path are used as metrics, along with the number of times the controllers cause the vehicle to jackknife or reach the goal. Considering the runs where the LQR did not cause the trailer to jackknife, the LQR tended to have slightly better precision. DDPG, however, controlled the trailer successfully on the paths where the LQR jackknifed. Reinforcement learning was found to sacrifice a short-term reward, such as precision, to maximize the future expected reward, such as reaching the loading dock. The reinforcement learning agent learned a policy that imposed nonlinear constraints such that it never jackknifed, even when it was not the trailer it was trained on.
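The "closed-form calculation of the gains" for a discrete-time LQR can be sketched as a backward Riccati iteration, as below; the toy double-integrator plant stands in for the tractor-trailer kinematics, and the cost matrices are illustrative assumptions.

```python
import numpy as np

def dlqr(A, B, Q, R, iterations=500):
    """Discrete-time LQR gain K (u = -K x) obtained by iterating the Riccati
    equation.  The result depends entirely on the plant model (A, B), which is
    exactly the dependence on an accurate model noted in the abstract."""
    P = Q.copy()
    for _ in range(iterations):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Toy double-integrator plant as a stand-in for the trailer kinematics.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.diag([1.0, 0.1])     # state cost
R = np.array([[0.01]])      # control cost
K = dlqr(A, B, Q, R)
print("LQR gain:", K)
```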
APA, Harvard, Vancouver, ISO, and other styles
17

Michaud, Brianna. "A Habitat Analysis of Estuarine Fishes and Invertebrates, with Observations on the Effects of Habitat-Factor Resolution." Scholar Commons, 2016. http://scholarcommons.usf.edu/etd/6543.

Full text
Abstract:
Between 1988 and 2014, otter trawls, seine nets, and plankton nets were deployed along the salinity gradients of 18 estuaries by the University of South Florida and the Florida Fish and Wildlife Research Institute (FWRI, a research branch of the Florida Fish and Wildlife Conservation Commission). The purpose of these surveys was to document the responses of aquatic estuarine biota to variation in the quantity and quality of freshwater inflows that were being managed by the Southwest Florida Water Management District (SWFWMD). In the present analyses, four community types collected by these gears were compared with a diversity of habitat factors to identify the factors with the greatest influence on beta diversity, and also to identify the factors that were most influential to important prey species and economically important species. The four community types were (1) plankton-net invertebrates, (2) plankton-net ichthyoplankton, (3) seine nekton, and (4) trawl nekton. The habitat factors were (1) vertical profiles of salinity, dissolved oxygen, pH, and water temperature taken at the time of the biological collections, (2) various characterizations of local habitat associated with seine and trawl deployments, (3) chlorophyll a, color, and turbidity data obtained from the STORET database (US Environmental Protection Agency), and (4) data that characterize the effects of freshwater inflow on different estuarine zones, including factors for freshwater inflow, freshwater turnover time, and temporal instability in freshwater inflow (flashiness). Only 13 of the 18 estuaries had data that were comprehensive enough to allow habitat-factor analysis. An existing study had performed distance-based redundancy analysis (dbRDA) and principal component analysis (PCA) for these data within 78 estuarine survey zones that were composited together (i.e., regardless of estuary of origin). Based on that study's findings, the communities of primarily spring-fed and primarily surface-fed estuaries were analyzed separately in the present study. Analysis was also performed with the habitat factors grouped into three categories (water management, restoration, and water quality) based on their ability to be directly modified by different management sectors. For an analysis of beta diversity interactions with habitat factors, dbRDA (called distance-based linear modeling (DistLM) in the PRIMER software) was performed using PRIMER 7 software (Quest Research Limited, Auckland, NZ). The dbRDA indicated pH, salinity, and distance to the Gulf of Mexico (distance-to-GOM) usually explained the most variation in the biotic data. These results were compared with partial dbRDA using the Akaike Information Criterion (AIC) as the model selection criterion with distance-to-GOM held as a covariate to reduce the effect of differences in the connectivity of marine-derived organisms to the different estuaries; distance-to-GOM explained between 8.46% and 32.4% of the variation in beta diversity. Even with the variation from distance-to-GOM removed, salinity was still selected as the most influential factor, explaining up to an additional 23.7% of the variation in beta diversity. Factors associated with the water-management sector were most influential (primarily salinity), followed by factors associated with the restoration sector (primarily factors that describe shoreline type and bottom type).
For the analysis of individual species, canonical analysis of principal coordinates (CAP) was performed to test for significant difference in community structure between groups of sites that represented high and low levels of each factor. For those communities that were significantly different, an indicator value (IndVal) was calculated for each species for high and low levels of each factor. Among species with significant IndVal for high or low levels of at least one factor, emphasis was given to important prey species (polychaetes, copepods, mysids, shrimps, bay anchovy juveniles, and gammaridean amphipods) and to species of economic importance, including adults, larvae and juveniles of commercial and recreational fishes, pink shrimp, and blue crab. Shrimps, copepods and mysids were all associated with estuarine zones that had low percentages of wooded or lawn-type shoreline, a factor that may serve as a proxy for flood conditions, as lawns or trees were usually only sampled with seines at high water elevations and in the freshwater reaches of the estuaries. Many copepod and shrimp species were strongly associated with high flushing times, which suggests that if flushing times were too short in an estuarine zone, then these species or their prey would be flushed out. Multiple regression analysis was performed on each of the selected indicator species, using AIC as a selection criterion and distance-to-GOM as a covariate. As might be expected, the apparent influences of different habitat factors varied from species to species, but there were some general patterns. For prey species in both spring-fed and surface-fed estuaries, pH and flushing time explained a significant amount of variation. In surface-fed estuaries, the presence of oysters on the bottom also had a positive effect for many prey species. For economically important species, depth was important in both spring-fed and surface-fed estuaries. This suggested the importance of maintaining large, shallow areas, particularly in surface-fed estuaries. Another important factor in spring-fed estuaries was the percent coverage of the bottom with sand; however, a mixture of positive and negative coefficients on this factor suggested the importance of substrate variety. In surface-fed estuaries, flashiness also often explained substantial variation for many economically important species, usually with positive coefficients, possibly due to the importance of alternation between nutrient-loading and high-primary-productivity periods. When comparing the three management sectors, the restoration sector was the most explanatory. Several factors were averaged over entire estuaries due to data scarcity or due to the nature of the factors themselves. Specifically, the STORET data for chlorophyll, color, and turbidity was inconsistently distributed within the survey areas and was not collected at the same time as the biological samples. Moreover, certain water-management factors such as freshwater-inflow rate and flashiness are inherently less dimensional than other factors, and could only be represented by a single observation (i.e., no spatial variation) at any point in time. Due to concern that reduced spatiotemporal concurrence/dimensionality was masking the influence of habitat factors, the community analysis was repeated after representing each estuary with a single value for each habitat factor. We found that far fewer factors were selected in this analysis; salinity was the only factor selected from the water-management factors.
Overall, the factor that explained the most variation most often was the presence of emergent vegetation on the shoreline. This factor is a good proxy for urban development (more developed areas have lower levels of emergent vegetation on the shoreline). Unlike the previous analysis, the restoration sector overwhelmingly had the highest R2 values compared with other management sectors. In general, these results indicate that the seeming importance of salinity in the previous analysis was likely because it had a higher resolution compared with many other factors, and that the lack of resolution homogeneity did influence the results. Of the habitat factors determined to be most influential in the analysis of communities and individual species (salinity, pH, emergent vegetation and lawn-and-trees shoreline types, oyster and sand bottom types, depth, flashiness, and flushing time), most were part of an estuarine gradient with high values at one end of the estuary and a gradual shift to low values at the other end. Since many of the analyzed species also showed a gradient distribution across the estuary, the abundance and community patterns could be explained by any of the habitat factors with that same gradient pattern. Therefore, there is a certain limitation to determining which factors are most influential in estuaries using this type of regression-based analysis. Three selected factors that do not have a strong estuarine gradient pattern are the sand bottom type, depth, and flashiness. In particular, flashiness has a single value for each estuary so it is incapable of following the estuarine gradient. This suggests that flashiness has an important process-based role that merits further investigation of its effect on estuarine species.
APA, Harvard, Vancouver, ISO, and other styles
18

Olsson, Anton, and Felix Rosberg. "Domain Transfer for End-to-end Reinforcement Learning." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-43042.

Full text
Abstract:
In this master thesis project, LiDAR-based, depth-image-based and semantic-segmentation-image-based reinforcement learning agents are investigated and compared for learning in simulation and performing in real time. The project utilizes the Deep Deterministic Policy Gradient architecture for learning continuous actions and was designed to control an RC car. It is one of the first projects to deploy an agent in a real scenario after training in a similar simulation. The project demonstrated that, with a proper reward function and by tuning driving parameters such as restricting steering, maximum velocity and minimum velocity, and performing input data scaling, a LiDAR-based agent could drive indefinitely on a simple but completely unseen track in real time.
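The input scaling and driving-parameter restrictions mentioned above could, for example, take the simple form below; the LiDAR clipping distance, steering limit, and velocity bounds are made-up values, not the ones used in the project.

```python
import numpy as np

MAX_RANGE = 10.0           # assumed LiDAR clipping distance (meters)
STEER_LIMIT = 0.3          # assumed maximum steering magnitude (radians)
V_MIN, V_MAX = 0.5, 2.0    # assumed velocity bounds (m/s)

def scale_lidar(ranges):
    """Clip raw LiDAR ranges and scale them to [0, 1] for the agent."""
    return np.clip(ranges, 0.0, MAX_RANGE) / MAX_RANGE

def restrict_action(raw_action):
    """Map a raw DDPG output in [-1, 1]^2 to a bounded (steering, velocity) pair."""
    steer = STEER_LIMIT * np.clip(raw_action[0], -1.0, 1.0)
    velocity = V_MIN + (V_MAX - V_MIN) * (np.clip(raw_action[1], -1.0, 1.0) + 1.0) / 2.0
    return steer, velocity

print(scale_lidar(np.array([0.4, 3.2, 25.0])))
print(restrict_action(np.array([0.9, -0.5])))
```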
APA, Harvard, Vancouver, ISO, and other styles
19

Crowley, Mark. "Equilibrium policy gradients for spatiotemporal planning." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/38971.

Full text
Abstract:
In spatiotemporal planning, agents choose actions at multiple locations in space over some planning horizon to maximize their utility and satisfy various constraints. In forestry planning, for example, the problem is to choose actions for thousands of locations in the forest each year. The actions at each location could include harvesting trees, treating trees against disease and pests, or doing nothing. A utility model could place value on sale of forest products, ecosystem sustainability or employment levels, and could incorporate legal and logistical constraints such as avoiding large contiguous areas of clearcutting and managing road access. Planning requires a model of the dynamics. Existing simulators developed by forestry researchers can provide detailed models of the dynamics of a forest over time, but these simulators are often not designed for use in automated planning. This thesis presents spatiotemporal planning in terms of factored Markov decision processes. A policy gradient planning algorithm optimizes a stochastic spatial policy using existing simulators for dynamics. When a planning problem includes spatial interaction between locations, deciding on an action to carry out at one location requires considering the actions performed at other locations. This spatial interdependence is common in forestry and other environmental planning problems and makes policy representation and planning challenging. We define a spatial policy in terms of local policies defined as distributions over actions at one location conditioned upon actions at other locations. A policy gradient planning algorithm using this spatial policy is presented which uses Markov Chain Monte Carlo simulation to sample the landscape policy, estimate its gradient and use this gradient to guide policy improvement. Evaluation is carried out on a forestry planning problem with 1880 locations using a variety of value models and constraints. The distribution over joint actions at all locations can be seen as the equilibrium of a cyclic causal model. This equilibrium semantics is compared to Structural Equation Models. We also define an algorithm for approximating the equilibrium distribution for cyclic causal networks which exploits graphical structure, and we analyse when the algorithm is exact.
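As a toy illustration of sampling a joint action from local conditional policies with Markov Chain Monte Carlo, the sketch below Gibbs-samples actions for a line of locations whose local policies discourage matching their neighbours; the neighbourhood structure, coupling strength, and two-action space are assumptions, not the thesis' forestry model.

```python
import numpy as np

rng = np.random.default_rng(4)

n_locations, n_actions = 30, 2        # e.g. 2 = {cut, leave} at each forest cell
weights = rng.normal(scale=0.5, size=(n_locations, n_actions))
coupling = 1.0                         # strength of spatial interaction

def local_conditional(i, joint_action):
    """Distribution over actions at location i given its neighbours' actions
    (here: the two adjacent locations on a line)."""
    logits = weights[i].copy()
    for j in (i - 1, i + 1):
        if 0 <= j < n_locations:
            # Discourage taking the same action as a neighbour (toy constraint,
            # loosely mimicking adjacency/clear-cut restrictions).
            logits[joint_action[j]] -= coupling
    p = np.exp(logits - logits.max())
    return p / p.sum()

def gibbs_sample_joint_action(sweeps=50):
    """Draw one joint action over all locations from the equilibrium of the
    local conditional policies via Gibbs sampling."""
    joint = rng.integers(n_actions, size=n_locations)
    for _ in range(sweeps):
        for i in range(n_locations):
            joint[i] = rng.choice(n_actions, p=local_conditional(i, joint))
    return joint

print(gibbs_sample_joint_action())
```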
APA, Harvard, Vancouver, ISO, and other styles
20

Aklil, Nassim. "Apprentissage actif sous contrainte de budget en robotique et en neurosciences computationnelles. Localisation robotique et modélisation comportementale en environnement non stationnaire." Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066225/document.

Full text
Abstract:
Decision-making is a highly researched field in science, be it in neuroscience, to understand the processes underlying animal decision-making, or in robotics, to model efficient and rapid decision-making processes in real environments. In neuroscience, this problem is solved online with sequential decision-making models based on reinforcement learning. In robotics, the primary objective is efficiency, so that systems can be deployed in real environments. However, in robotics what can be called the budget, which concerns the limitations inherent to the hardware, such as computation time, the limited actions available to the robot, or the lifetime of the robot's battery, is often not taken into account at present. We propose in this thesis to introduce the notion of budget as an explicit constraint in robotic learning processes applied to a localization task, by implementing a model based on work developed in statistical learning that processes data under explicit constraints, limiting the input of data or imposing a more explicit time constraint. With a view to an online version of this type of budgeted learning algorithm, we also discuss some possible inspirations that could be drawn from computational neuroscience. In this context, the alternation between gathering information for localization and deciding to move may be indirectly linked to the notion of the exploration-exploitation trade-off. We present our contribution to the modeling of this trade-off in animals in a non-stationary task involving different levels of uncertainty, and we make the link with multi-armed bandit methods.
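For readers unfamiliar with the multi-armed bandit framing of the exploration-exploitation trade-off, the sketch below runs the standard UCB1 rule on a toy three-armed Bernoulli bandit; the payoff probabilities are placeholders and the algorithm choice is illustrative, not the one analysed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(5)
true_means = np.array([0.2, 0.5, 0.7])   # hidden payoff of each arm (placeholder)

def ucb1(n_rounds=1000):
    """UCB1: pick the arm with the best optimistic estimate, balancing
    exploitation (high estimated mean) and exploration (rarely pulled arms)."""
    counts = np.zeros(len(true_means))
    values = np.zeros(len(true_means))
    for t in range(1, n_rounds + 1):
        if t <= len(true_means):
            arm = t - 1                                  # pull each arm once first
        else:
            ucb = values + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = float(rng.random() < true_means[arm])   # Bernoulli payoff
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts

print(ucb1())   # most pulls should concentrate on the best arm
```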
APA, Harvard, Vancouver, ISO, and other styles
21

Nilsson, Anna-Maria, and Malin Björk. "Interpretation and Grading in the Current Grading System." Thesis, Malmö högskola, Lärarutbildningen (LUT), 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-29798.

Full text
Abstract:
This dissertation deals with the goal- and criterion-referenced grading system in use in Swedish schools today. We have chosen to investigate how teachers perceive the current grading system and what challenges they are faced with when grading in the English subject. Our interest in this topic deepened during our final in-school practice, after which we discussed the issue and concluded that the grading system would be worth delving into in order to feel more secure when leaving the teacher training college. The current grading system is debated in schools on a daily basis since it is a tool for teachers to work with. Although the teachers give the impression of still having difficulties with the current grading system, the results show that the majority of the interviewees have grasped the system. It rather seems that the difficulty lies in how to interpret the different policy documents, since the goals and the criteria are at times of a general nature. A general opinion among most of the teachers is that they do not have difficulties with the grading itself in the current grading system; they do, however, request further grading steps in order to be better able to explain where the students are on the grading scale. Moreover, we concluded that the teachers believe the system benefits the students as well as themselves, and that it is necessary to continuously have discussions concerning the current grading system so as to better understand what it entails.
APA, Harvard, Vancouver, ISO, and other styles
22

Henry, Dawn Therese. "Standards-based Grading: The Effect of Common Grading Criteria on Academic Growth." Bowling Green State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1522846892709392.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Sehnke, Frank [Verfasser], Patrick van der [Akademischer Betreuer] Smagt, and Jürgen [Akademischer Betreuer] Schmidhuber. "Parameter Exploring Policy Gradients and their Implications / Frank Sehnke. Gutachter: Jürgen Schmidhuber. Betreuer: Patrick van der Smagt." München : Universitätsbibliothek der TU München, 2012. http://d-nb.info/1030099820/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Dennis, Janelle. "No-Zero Policy in Middle School: A Comparison of High School Student Achievement." ScholarWorks, 2018. https://scholarworks.waldenu.edu/dissertations/5694.

Full text
Abstract:
Local middle schools have begun implementing a no-zero policy, which compels teachers to assign grades no lower than 50% even if a student did not turn in assignments for grading. In the study setting, high school teachers are struggling to motivate students who have attended a middle school with a no-zero policy in place. High school students who have attended a middle school with a no-zero policy show signs of learned helplessness. The purpose of this study was to examine the differences in core course grades between high school students who attended a middle school with a no-zero policy (NZPMS) and high school students who attended a middle school without this policy, where F grades are assigned if earned by the student (FPMS). The theoretical framework is Seligman's theory of learned helplessness. The sample included 1,396 students in a high school who attended either of the two middle schools. Mean high school mathematics, science, and English grades were compared using a one-tailed t-test. Effect sizes were measured using Cohen's d. The findings indicated statistically significant small to medium differences in students' core course grades. Students who had attended the NZPMS earned lower high school core course grades in mathematics, science, and English than students who had attended the FPMS. Professional development activities were created to train teachers and administrators at the NZPMS about the negative effects of awarding students passing grades without their expending any, or only minimal, effort. Positive social change could occur for students' academic careers and professional lives if the no-zero policy is rescinded.
APA, Harvard, Vancouver, ISO, and other styles
25

Tolman, Deborah A. "Environmental Gradients, Community Boundaries, and Disturbance: the Darlingtonia Fens of Southwestern Oregon." PDXScholar, 2004. https://pdxscholar.library.pdx.edu/open_access_etds/3013.

Full text
Abstract:
The Darlingtonia fens, found on serpentine soils in southern Oregon, are distinct communities that frequently undergo dramatic changes in size and shape in response to a wide array of environmental factors. Since few systems demonstrate a balance among high water tables, shallow soils, the presence of heavy metals, and limited nutrients, conservation efforts have been made to preserve them. This dissertation investigates the role of fire in nutrient cycling and succession in three separate fens, each with a different time since fire. I specifically analyze the spatial distributions of soil properties, the physical and ecological characteristics of ecotones between Jeffrey pine savanna and Darlingtonia fens, and the vegetation structure of fire-disturbed systems. Soil, water, and vegetation sampling were conducted along an array of transects, oriented perpendicular to community boundaries and main environmental gradients, at each of the three fens. Abrupt changes in vegetation, across communities, were consistently identified at each of the three sites, although statistical analysis did not always identify distinct mid-canopy communities. Below-ground variables were likewise distinguished at the fen and savanna boundary for two of the three sites. At the third site, discontinuities did not align with the fen boundaries, but followed fluctuations in soil NH4. My results suggest that below-ground discontinuities may be more important than fire at preserving these uniquely-adapted systems, while vegetation undergoes postfire succession from fen to mid-canopy to savanna after approximately 100 years since fire. Although restoration of ecosystem structure and processes was not the primary focus of this study, my data suggest that time since fire may drive ecosystem processes in a trajectory away from the normal succession cycle. Moreover, time since fire may decrease overall vigor of Darlingtonia populations.
APA, Harvard, Vancouver, ISO, and other styles
26

De, Larkin Christian Martin II. "A Study of Teacher-Buy-In and Grading Policy Reform in a Los Angeles Archdiocesan Catholic High School." Thesis, Loyola Marymount University, 2013. http://pqdtopen.proquest.com/#viewpdf?dispub=3597221.

Full text
Abstract:

This study examined the construct of teacher buy-in (TBI) during a grading policy reform effort in a high school. The purpose of this study was to identify and describe teachers' perceived value of the grading reform. Additionally, the researcher studied teacher behavior by identifying the teachers' actual practice of the policy. The study finally compared the identified reported values of the participants with their actual grading practices to determine the convergence of values and practice.

The research provided empirical evidence for a new way to study TBI and its relationship to a reform implementation. This study addressed a school-site policy reform effort and described TBI contributing to, and perhaps challenging, current practices in school reform and teacher grading policies. This study described the extent to which teachers bought into the grading policies and provided a framework for studying TBI and grading policies in the context of Standards-Based Reform in the future. The findings and discussion highlight how grading policies are a critical element of the student evaluation process in the increasing movement towards national learning standards and testing.

APA, Harvard, Vancouver, ISO, and other styles
27

De, Larkin Christian Martín II. "A Study of Teacher-Buy-In and Grading Policy Reform in a Los Angeles Archdiocesan Catholic High School." Digital Commons at Loyola Marymount University and Loyola Law School, 2013. https://digitalcommons.lmu.edu/etd/220.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Wolford, Walter Paul. "Policy and Practice Concerning Essay-Grading Criteria in Developmental English and College-Level English Programs in Tennessee Community Colleges." Digital Commons @ East Tennessee State University, 2000. https://dc.etsu.edu/etd/4.

Full text
Abstract:
The criteria used to grade college essays have been the subject of research for over three decades. Using quantitative data, this study investigated the differences in essay-grading criteria and essay-grading policy among full-time faculty members who teach English composition in Tennessee's community colleges. This study revealed beliefs about the importance of essay-grading criteria and beliefs about written and unwritten essay-grading policies among those who teach developmental English, college-level English, and those who teach both levels of English. This study hypothesized that there were no differences among the English composition teachers' beliefs about the importance of the twenty essay-grading criteria, nor in their beliefs regarding written and unwritten grading policies. Chi-square analysis of the non-parametric data collected during this study indicated statistically significant differences among the English teachers regarding only one of the essay-grading criteria and no statistically significant differences regarding the essay-grading policies.
APA, Harvard, Vancouver, ISO, and other styles
29

Souter, Dawn Hopkins. "The Nature of Feedback Provided to Elementary Students in Classrooms where Grading and Reporting are Standards-Based." Digital Archive @ GSU, 2009. http://digitalarchive.gsu.edu/eps_diss/62.

Full text
Abstract:
Feedback is one of the most powerful influences on learning and achievement. Hattie (2002) found that the giving of quality feedback to students is one of the top five strategies teachers can use to improve student achievement. Research has confirmed that the right kind of feedback is essential for effective teaching and learning (McMillan, 2007). The University of Queensland (Australia) notes that feedback is the entity that brings assessment into the learning process (1998). The evidence also shows, however, that how feedback is given and the types of feedback given can produce disparate results in both achievement and student motivation. One mitigating factor in the giving and receiving of feedback in classrooms is a climate of evaluation, competition, rewards, punishments, winners and losers. In fact, research shows that while the giving of descriptive feedback enhances learning and motivation, the giving of norm-referenced grades has a negative impact on students (Bandura, 1993; Black & Wiliam, 1998; Butler & Nisan, 1986; Butler, 1987). This qualitative study used interviews, teacher observations, and document analysis to examine the nature of feedback provided to students in a standards-based school district, where grading is standards-based rather than norm-referenced. The literature review suggests particular properties and circumstances that make feedback effective, and the researcher has used this research to analyze the oral and written feedback that teachers provide students. The analysis describes the use of feedback and feedback loops in these classrooms, and the findings add to the current knowledge base about the giving and receiving of feedback in standards-based schools and suggest areas for teacher improvement and development.
APA, Harvard, Vancouver, ISO, and other styles
30

Rice, William Robertson. "Subjectivity in grading: The role individual subjectivity plays in assigning grades." Miami University / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=miami1623317108089967.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Haley, James. "To Curve or Not to Curve? The Effect of College Science Grading Policies on Implicit Theories of Intelligence, Perceived Classroom Goal Structures, and Self-efficacy." Thesis, Boston College, 2015. http://hdl.handle.net/2345/bc-ir:104165.

Full text
Abstract:
Thesis advisor: George M. Barnett
There is currently a shortage of students graduating with STEM (science, technology, engineering, or mathematics) degrees, particularly women and students of color. Approximately half of the students who begin a STEM major eventually switch out. Many switchers cite the competitiveness, grading curves, and weed-out culture of introductory STEM classes as reasons for the switch. Variables known to influence resilience include a student's implicit theory of intelligence and achievement goal orientation. Incremental theory (the belief that intelligence is malleable) and mastery goals (the pursuit of increased competence) are more adaptive in challenging classroom contexts. This dissertation investigates the role that college science grading policies and messages about the importance of effort play in shaping both implicit theories and achievement goal orientation. College students (N = 425) were randomly assigned to read one of three grading scenarios: (1) a "mastery" scenario, which used criterion-referenced grading, permitted tests to be retaken, and included a strong effort message; (2) a "norm" scenario, which used norm-referenced grading (grading on the curve); or (3) an "effort" scenario, which combined a strong effort message with the norm-referenced policies. The dependent variables included implicit theories of intelligence, perceived classroom goal structure, and self-efficacy. A separate sample of students (N = 15) read a randomly assigned scenario, verbalized their thoughts, and answered questions in a semi-structured interview. Results showed that students reading the mastery scenario were more likely to endorse an incremental theory of intelligence, perceived a greater mastery goal structure, and had higher self-efficacy. The effort message had no effect on self-efficacy, implicit theory, or most of the goal structure measures. The interviews revealed that the retake policy in the mastery scenario and the competitive atmosphere in the norm-referenced scenarios were likely driving the results. Competitive grading policies appear to be incompatible with mastery goals, cooperative learning, and a belief in the efficacy of effort. Implications for college STEM instruction are discussed.
Thesis (PhD) — Boston College, 2015
Submitted to: Boston College. Lynch School of Education
Discipline: Teacher Education, Special Education, Curriculum and Instruction
APA, Harvard, Vancouver, ISO, and other styles
32

Masoudi, Mohammad Amin. "Robust Deep Reinforcement Learning for Portfolio Management." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/42743.

Full text
Abstract:
In finance, the use of Automated Trading Systems (ATS) on markets is growing every year, and trades generated by algorithms now account for most of the orders that arrive at stock exchanges (Kissell, 2020). Historically, these systems were based on advanced statistical methods and signal processing designed to extract trading signals from financial data. The recent success of machine learning has attracted the interest of the financial community. Reinforcement learning is a subcategory of machine learning and has been broadly applied by investors and researchers in building trading systems (Kissell, 2020). In this thesis, we address the issue that deep reinforcement learning may be susceptible to sampling errors and over-fitting, and we propose a robust deep reinforcement learning method that integrates techniques from reinforcement learning and robust optimization. We back-test the developed algorithm, Robust DDPG, compare its performance with the UBAH (Uniform Buy and Hold) benchmark and other RL algorithms, and show that the robust algorithm can reduce the downside risk of an investment strategy significantly and can ensure a safer path for the investor’s portfolio value.
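To make the combination concrete, the following is a toy Python sketch, not the thesis's actual Robust DDPG, of the two ingredients the abstract names: a deterministic policy that maps recent returns to portfolio weights, and a robust-style evaluation that scores those weights under the worst of several perturbed return scenarios. The asset count, window length, and perturbation size are illustrative assumptions.

# Toy sketch: deterministic portfolio policy + worst-case (robust) evaluation.
import numpy as np

rng = np.random.default_rng(0)
n_assets, window = 4, 10
theta = rng.normal(scale=0.1, size=(n_assets, n_assets * window))  # actor weights

def act(state):
    """Deterministic actor: recent returns -> long-only weights (softmax keeps them on the simplex)."""
    logits = theta @ state.ravel()
    e = np.exp(logits - logits.max())
    return e / e.sum()

def worst_case_return(weights, returns, n_scenarios=20, eps=0.01):
    """Robust evaluation: worst portfolio return over randomly perturbed return scenarios."""
    scenarios = returns + eps * rng.normal(size=(n_scenarios, n_assets))
    return (scenarios @ weights).min()

state = rng.normal(size=(n_assets, window))              # recent return history (toy data)
next_returns = rng.normal(scale=0.01, size=n_assets)     # next-period asset returns (toy data)
w = act(state)
print("weights:", np.round(w, 3), "worst-case return:", round(worst_case_return(w, next_returns), 4))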
APA, Harvard, Vancouver, ISO, and other styles
33

Ковальов, Костянтин Миколайович. "Комп'ютерна система управління промисловим роботом" [Computer control system for an industrial robot]. Bachelor's thesis, КПІ ім. Ігоря Сікорського, 2019. https://ela.kpi.ua/handle/123456789/28610.

Full text
Abstract:
The qualifying work includes an explanatory note (56 pages, 2 appendices). The object of the study is reinforcement learning algorithms for the task of controlling an industrial robotic arm. Continuous control of an industrial robotic arm for non-trivial tasks is too complicated, or even unsolvable, for classical methods of robotics. Reinforcement learning methods can be used in this case: they are relatively simple to implement, allow generalization to unseen cases, and can learn from high-dimensional data. We implement the deep deterministic policy gradient algorithm, which is suitable for complex continuous control tasks. During the study: an analysis of existing classical methods for the problem of industrial robot control was conducted; an analysis of existing reinforcement learning algorithms and their use in the field of robotics was conducted; the deep deterministic policy gradient algorithm was implemented; the implemented algorithm was tested in a simplified environment; a neural network architecture was proposed for solving the problem; the algorithm was tested on the training set of objects; and the algorithm was tested for its generalization ability on the test set. It was shown that the deep deterministic policy gradient algorithm, with a neural network as the policy approximator, is able to solve the problem with an image as input and to generalize to objects not seen before.
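For reference, below is a minimal, generic sketch of the core DDPG update the abstract refers to (PyTorch): the critic regresses onto a bootstrapped target computed with target networks, the actor is updated by ascending the critic's value of its own actions, and the target networks are Polyak-averaged. Network sizes, hyperparameters, and the random batch are illustrative assumptions rather than the thesis's configuration.

# Minimal sketch of one DDPG update step (not the thesis's exact implementation).
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 8, 3, 0.99, 0.005

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

actor, critic = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
actor_t, critic_t = mlp(obs_dim, act_dim), mlp(obs_dim + act_dim, 1)
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(obs, act, rew, obs2, done):
    # Critic: regress Q(s, a) onto a bootstrapped target built with the target networks.
    with torch.no_grad():
        q_next = critic_t(torch.cat([obs2, actor_t(obs2)], dim=-1))
        target = rew + gamma * (1 - done) * q_next
    q = critic(torch.cat([obs, act], dim=-1))
    critic_loss = ((q - target) ** 2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    # Actor: deterministic policy gradient, i.e. ascend Q(s, actor(s)).
    actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    # Slowly track the online networks with the target networks (Polyak averaging).
    with torch.no_grad():
        for net, net_t in ((actor, actor_t), (critic, critic_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.mul_(1 - tau).add_(tau * p)

batch = 32
ddpg_update(torch.randn(batch, obs_dim), torch.randn(batch, act_dim),
            torch.randn(batch, 1), torch.randn(batch, obs_dim), torch.zeros(batch, 1))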
APA, Harvard, Vancouver, ISO, and other styles
34

Helmér, Henrik. "Närbyråkraters individuella handlingsutrymme : Lärares handlingslogiker vid myndighetsutövning i form av bedömning och betygsättning" [Street-level bureaucrats' individual discretion: Teachers' logics of action in the exercise of authority in the form of assessment and grading]. Thesis, Linköpings universitet, Statsvetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166307.

Full text
Abstract:
The point of departure for this study is Michael Lipsky's description and problematization of street-level bureaucrats' discretion. Street-level bureaucrats, such as teachers, have the possibility to influence the implementation of policy at the point of delivery to citizens. This can create a problem within the democratic policy process, as policy does not materialize in the way that politicians intended. I used a qualitative research design and interviewed ten teachers in upper secondary schools about their exercise of authority, in order to investigate a factor that may lead to policy-making: logics of action. I claim that logics of action are suitable tools for analyzing and discussing the policy-making that street-level bureaucrats perform in the democratic policy process. The main purpose of this study is to contribute to such a discussion. A second purpose is to elucidate logics of action as a type of factor that guides teachers' exercise of authority but has not been noticed to any great extent in previous research. I investigated which logics of action are mainly present in teachers' exercise of authority concerning assessment and grading: a logic of consequences or a logic of appropriateness; a manufacturing logic or a service logic; and an instrumental logic or alternative logics. The relationship between the logics of consequences and appropriateness is complex: it is difficult to say that one logic is the dominant force behind teachers' exercise of authority, because of the constantly changing circumstances in the school environment. As for the manufacturing and service logics, the latter is dominant in assessment and grading. This does not influence decision-making as such, but enriches policy with a certain value production. Lastly, teachers claim that they instrumentally follow the guidelines in their exercise of authority, but at the same time alternative logics, such as gaming and cheating with the rules, are very much present in assessment and grading. Alternative logics distort teachers' decision-making in several ways. These results show that logics of action are indeed tools that can help us better understand what influences street-level bureaucrats' exercise of authority, and how this contributes to policy-making. I conclude by suggesting how the use of logics of action as analytical tools can enhance our knowledge of street-level bureaucrats' discretion in future research.
APA, Harvard, Vancouver, ISO, and other styles
35

Su, Xiaoshan. "Three Essays on the Design, Pricing, and Hedging of Insurance Contracts." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE2065.

Full text
Abstract:
This thesis makes use of theoretical tools from finance, decision theory, and machine learning to improve the design, pricing, and hedging of insurance contracts. Chapter 3 develops closed-form pricing formulas for participating life insurance contracts, based on matrix Wiener-Hopf factorization, where multiple risk sources, such as credit, market, and economic risks, are considered. The pricing method proves to be accurate and efficient. Dynamic and semi-static hedging strategies are introduced to help insurance companies reduce the risk exposure arising from issuing participating contracts. Chapter 4 discusses the optimal contract design when the insured is third-degree risk averse. The results show that dual limited stop-loss, change-loss, dual change-loss, and stop-loss can be optimal contracts favored by both risk averters and risk lovers in different settings. Chapter 5 develops a stochastic gradient boosting frequency-severity model, which improves on the important and popular GLM and GAM frequency-severity models. This model fully inherits the advantages of the gradient boosting algorithm, overcoming the restrictive linear or additive forms of the GLM and GAM frequency-severity models by learning the model structure from data. Further, the model can also capture the flexible nonlinear dependence between claim frequency and severity.
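As a rough illustration of the frequency-severity split that Chapter 5 builds on, the Python sketch below fits gradient-boosted trees to synthetic claim data, with a Poisson loss for claim counts and a standard regression loss for severity. It is a generic baseline rather than the thesis's model; in particular it ignores the frequency-severity dependence the thesis captures, and all features and data are synthetic placeholders.

# Generic gradient-boosted frequency-severity baseline on synthetic data.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([rng.integers(18, 80, n),            # toy feature: driver age
                     rng.uniform(0, 30, n)])             # toy feature: vehicle age
freq = rng.poisson(0.1 + 0.002 * X[:, 1], size=n)                 # claim counts
sev = np.where(freq > 0, rng.gamma(2.0, 500.0, size=n), 0.0)      # average claim size

freq_model = HistGradientBoostingRegressor(loss="poisson").fit(X, freq)
sev_model = HistGradientBoostingRegressor().fit(X[freq > 0], sev[freq > 0])

# Expected pure premium = expected frequency * expected severity.
pure_premium = freq_model.predict(X) * sev_model.predict(X)
print("mean predicted pure premium:", round(float(pure_premium.mean()), 2))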
APA, Harvard, Vancouver, ISO, and other styles
36

Senate, University of Arizona Faculty. "Faculty Senate Minutes March 6, 2017." University of Arizona Faculty Senate (Tucson, AZ), 2017. http://hdl.handle.net/10150/623059.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Cai, Bo-Yin (蔡博胤). "A Behavior Fusion Approach Based on Policy Gradient." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/u6ctx3.

Full text
Abstract:
Master's thesis
National Sun Yat-sen University
Department of Electrical Engineering
107 (academic year)
In this study, we propose a behavior fusion algorithm based on the policy gradient. We use the Actor-Critic algorithm to train the sub-tasks; after this training is completed, the proposed behavior fusion algorithm is used for learning complex tasks. By reading the trained sub-task neural networks we obtain the state value of each sub-task in each state, compute the return of each sub-task, and then pass the normalized returns to the behavior fusion algorithm as a policy gradient. When reinforcement learning is applied to a complex task, the reward function is often difficult to design. With a sparse reward, the best solution can in theory be reached, but training takes a long time; with a dense reward, training is faster, but the agent easily gets stuck in a local minimum. If the complex task is decomposed into several sub-tasks, the reward functions of the sub-tasks are easier to design, and after training these sub-tasks can be merged to accomplish the complex task. In this study, we use a wafer probing simulator designed by our laboratory and the Atari game Pong as test environments. The wafer probing simulator models how the probe moves when the fab tests chips; the goal is to have every chip on the wafer checked exactly once, without repeatedly checking the same chip. In the Pong environment, the agent learns on its own to defeat the computer.
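The following Python fragment is a toy sketch of one way such a fusion could be wired up, based on the abstract's description: each trained sub-task supplies a state value and an action distribution, the values are normalized into weights, and the fused policy mixes the sub-task action distributions accordingly. The softmax normalization, shapes, and numbers are illustrative assumptions, not the thesis's exact scheme.

# Toy sketch: fuse sub-task policies by weighting them with normalized returns.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(sub_values, sub_action_probs):
    """sub_values: (K,) estimated returns of K sub-tasks in the current state.
    sub_action_probs: (K, A) action distribution of each sub-task policy."""
    weights = softmax(np.asarray(sub_values))            # normalized returns as mixture weights
    fused = weights @ np.asarray(sub_action_probs)       # weighted mixture of action distributions
    return fused / fused.sum()

# Two sub-tasks, three discrete actions: the higher-value sub-task dominates the fused policy.
probs = fuse([1.5, 0.2], [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
print(probs)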
APA, Harvard, Vancouver, ISO, and other styles
38

Greensmith, Evan. "Policy Gradient Methods: Variance Reduction and Stochastic Convergence." Phd thesis, 2005. http://hdl.handle.net/1885/47105.

Full text
Abstract:
In a reinforcement learning task an agent must learn a policy for performing actions so as to perform well in a given environment. Policy gradient methods consider a parameterized class of policies, and using a policy from the class, and a trajectory through the environment taken by the agent using this policy, estimate the performance of the policy with respect to the parameters. Policy gradient methods avoid some of the problems of value function methods, such as policy degradation, where inaccuracy in the value function leads to the choice of a poor policy. However, the estimates produced by policy gradient methods can have high variance. ¶ ...
APA, Harvard, Vancouver, ISO, and other styles
39

Chen, Yi-Ching (陳怡靜). "Solving Rubik's Cube by Policy Gradient Based Reinforcement Learning." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/t842yt.

Full text
Abstract:
Master's thesis
National Tsing Hua University
Department of Computer Science
107 (academic year)
Reinforcement learning provides a mechanism for training an agent to interact with its environment, and policy gradient methods make the right actions more probable. We propose using a linear policy gradient method within deep neural network-based reinforcement learning. The proposed method employs an intensifying reward function to increase the probabilities of the right actions for solving Rubik's Cube. Experiments show that the proposed neural network learned to solve some Rubik's Cube states; for more difficult initial states, the network still cannot always give the correct suggestion.
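As context for the approach, here is a minimal REINFORCE-style sketch in Python: a softmax policy over cube moves is pushed toward actions that earned reward, with reward-to-go standing in for the abstract's intensifying reward. The state encoding, reward values, and sizes are illustrative assumptions and not the thesis's actual setup.

# Minimal REINFORCE sketch with a linear softmax policy over moves.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_moves, lr = 54, 12, 0.01          # e.g. 54 sticker features, 12 face turns (assumed)
theta = np.zeros((n_moves, n_features))

def policy(state):
    logits = theta @ state
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_step(episode):
    """episode: list of (state, action, reward); update theta along the policy gradient."""
    global theta
    returns = np.cumsum([r for _, _, r in episode][::-1])[::-1]      # reward-to-go
    for (state, action, _), G in zip(episode, returns):
        probs = policy(state)
        grad_logp = -np.outer(probs, state)                          # d log pi / d theta ...
        grad_logp[action] += state                                   # ... for a linear softmax policy
        theta += lr * G * grad_logp                                  # ascend expected return

# One toy episode with random states and an increasing ("intensifying") reward signal.
episode = [(rng.random(n_features), int(rng.integers(n_moves)), r) for r in (0.1, 0.3, 1.0)]
reinforce_step(episode)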
APA, Harvard, Vancouver, ISO, and other styles
40

Aberdeen, Douglas. "Policy-Gradient Algorithms for Partially Observable Markov Decision Processes." Phd thesis, 2003. http://hdl.handle.net/1885/48180.

Full text
Abstract:
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a scalable approach for controlling partially observable Markov decision processes (POMDPs). In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting: directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember.
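To illustrate the kind of policy such algorithms optimize, the Python sketch below parameterizes an agent with a small internal memory: at each step it samples an action and a next memory state from distributions conditioned on the observation and the current memory, so learning the parameters means learning both how to act and what to remember. The tabular softmax parameterization and the sizes are illustrative assumptions, not the thesis's formulation.

# Sketch of a policy with learnable internal memory (finite-state-controller style).
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_mem, n_act = 5, 4, 3
act_params = np.zeros((n_mem, n_obs, n_act))   # logits for P(action | memory, observation)
mem_params = np.zeros((n_mem, n_obs, n_mem))   # logits for P(next memory | memory, observation)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(memory, obs):
    """Sample an action and the next internal-memory state given the current observation."""
    action = rng.choice(n_act, p=softmax(act_params[memory, obs]))
    next_memory = rng.choice(n_mem, p=softmax(mem_params[memory, obs]))
    return action, next_memory

memory = 0
for obs in rng.integers(n_obs, size=5):        # a toy observation sequence
    action, memory = step(memory, obs)
    print(f"obs={obs} action={action} memory={memory}")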
APA, Harvard, Vancouver, ISO, and other styles
41

Chong, Kiah-Yang (張家揚). "Design and Implementation of Fuzzy Policy Gradient Gait Learning Method for Humanoid Robot." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/90100127378597192142.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Department of Electrical Engineering (Master's and Doctoral Program)
98 (academic year)
The design and implementation of a Fuzzy Policy Gradient Learning (FPGL) method for a small-sized humanoid robot are proposed in this thesis. The thesis introduces the mechanical structure of the humanoid robot, named aiRobots-V, and the hardware system used on it, and also improves and parameterizes the robot's gait pattern. Arm movement is added to the gait pattern to reduce the tilt of the trunk while walking. FPGL is an integrated machine learning method that combines Policy Gradient Reinforcement Learning (PGRL) with fuzzy logic in order to improve the efficiency and speed of gait learning. The robot is trained with FPGL using the walking distance over a fixed number of walking cycles as the reward, so that it automatically learns a faster and more stable gait; the tilt of the trunk is used as the reward for learning the arm movement within the walking cycle. The experimental results show that FPGL could improve the gait from a walking speed of 9.26 mm/s to 162.27 mm/s in about an hour, and the training data also show that the method can improve the efficiency of the basic PGRL method by up to 13%. The effect of the arm movement in reducing the trunk tilt is likewise confirmed by the experimental results. The robot was also used in the throw-in technical challenge of RoboCup 2010.
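For orientation, policy gradient gait tuning of this kind is often done by perturbing a vector of gait parameters, scoring each perturbation by the distance walked over a fixed number of cycles, and stepping along the estimated gradient. The Python sketch below shows that finite-difference loop under stated assumptions: the fuzzy component of FPGL is omitted, and walk_distance() is a hypothetical stand-in for a trial on the real robot.

# Finite-difference policy gradient sketch for gait parameter tuning.
import numpy as np

rng = np.random.default_rng(0)

def walk_distance(params):
    """Hypothetical evaluation: distance walked (mm) over fixed cycles with these gait parameters."""
    target = np.array([30.0, 12.0, 5.0, 8.0])          # toy optimum, standing in for the real robot
    return 170.0 - np.sum((params - target) ** 2)

params = np.array([20.0, 8.0, 2.0, 4.0])               # e.g. step length, foot lift, period, arm swing
epsilon, alpha, n_trials = 0.5, 1.0, 8

for _ in range(30):
    perturbs = rng.choice([-1.0, 0.0, 1.0], size=(n_trials, params.size)) * epsilon
    scores = np.array([walk_distance(params + d) for d in perturbs])
    grad = perturbs.T @ (scores - scores.mean())        # correlate score changes with perturbations
    params += alpha * grad / (np.linalg.norm(grad) + 1e-8)

print("tuned parameters:", np.round(params, 2), "distance:", round(walk_distance(params), 1))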
APA, Harvard, Vancouver, ISO, and other styles
42

"Adaptive Curvature for Stochastic Optimization." Master's thesis, 2019. http://hdl.handle.net/2286/R.I.53675.

Full text
Abstract:
This thesis presents a family of adaptive curvature methods for gradient-based stochastic optimization. In particular, a general algorithmic framework is introduced along with a practical implementation that yields an efficient, adaptive curvature gradient descent algorithm. To this end, a theoretical and practical link between curvature matrix estimation and shrinkage methods for covariance matrices is established. The use of shrinkage improves the estimation accuracy of the curvature matrix when data samples are scarce. The thesis also introduces several insights that result in data- and computation-efficient update equations. Empirical results suggest that the proposed method compares favorably with existing second-order techniques based on the Fisher or Gauss-Newton matrices and with adaptive stochastic gradient descent methods on both supervised and reinforcement learning tasks.
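A toy Python sketch of the shrinkage idea the abstract describes: a curvature matrix estimated from only a few samples is noisy (here it is not even full rank), so it is shrunk toward a scaled identity, a classic covariance-shrinkage move, before being used to precondition the gradient step. The shrinkage weight, the use of a gradient covariance as the curvature proxy, and the sizes are illustrative assumptions.

# Toy sketch: shrink a noisy curvature estimate, then precondition the gradient step.
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples, lam = 10, 5, 0.3

grads = rng.normal(size=(n_samples, dim))            # per-sample gradients (toy data)
mean_grad = grads.mean(axis=0)

C_sample = (grads - mean_grad).T @ (grads - mean_grad) / n_samples     # noisy, rank-deficient curvature proxy
C_shrunk = (1 - lam) * C_sample + lam * (np.trace(C_sample) / dim) * np.eye(dim)

step = np.linalg.solve(C_shrunk + 1e-6 * np.eye(dim), mean_grad)       # preconditioned update direction
print("plain gradient norm:", round(np.linalg.norm(mean_grad), 3),
      "preconditioned step norm:", round(np.linalg.norm(step), 3))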
Dissertation/Thesis
Masters Thesis Computer Science 2019
APA, Harvard, Vancouver, ISO, and other styles
43

Fleming, Brian James. "The social gradient in health : trends in C20th ideas, Australian Health Policy 1970-1998, and a health equity policy evaluation of Australian aged care planning / Brian James Fleming." Thesis, 2003. http://hdl.handle.net/2440/22062.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Pereira, Bruno Alexandre Barbosa. "Deep reinforcement learning for robotic manipulation tasks." Master's thesis, 2021. http://hdl.handle.net/10773/33654.

Full text
Abstract:
The recent advances in Artificial Intelligence (AI) present new opportunities for robotics on many fronts. Deep Reinforcement Learning (DRL) is a sub-field of AI which results from the combination of Deep Learning (DL) and Reinforcement Learning (RL). It categorizes machine learning algorithms which learn directly from experience and offers a comprehensive framework for studying the interplay among learning, representation and decision-making. It has already been successfully used to solve tasks in many domains. Most notably, DRL agents learned to play Atari 2600 video games directly from pixels and achieved human-comparable performance in 49 of those games. Additionally, recent efforts using DRL in conjunction with other techniques produced agents capable of playing the board game of Go at a professional level, which had long been viewed as an intractable problem due to its enormous search space. In the context of robotics, DRL is often applied to planning, navigation, optimal control and other problems. Here, the powerful function approximation and representation learning properties of deep neural networks enable RL to scale up to problems with high-dimensional state and action spaces. Additionally, inherent properties of DRL make transfer learning useful when moving from simulation to the real world. This dissertation aims to investigate the applicability and effectiveness of DRL for learning successful policies in the domain of robot manipulator tasks. Initially, a set of three classic RL problems were solved using RL and DRL algorithms in order to explore their practical implementation and arrive at a class of algorithms appropriate for these robotic tasks. Afterwards, a task is defined in simulation in which an agent controls a 6 DoF manipulator to reach a target with its end effector. This task is used to evaluate the effects of different state representations, hyperparameters and state-of-the-art DRL algorithms on performance, resulting in agents with high success rates. The emphasis is then placed on the speed and time restrictions of the end effector's positioning. To this end, different reward systems were tested for an agent learning a modified version of the previous reaching task with faster joint speeds. In this setting, a number of improvements were verified in relation to the original reward system. Finally, an application of the best reaching agent obtained from the previous experiments is demonstrated on a simplified ball-catching scenario.
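To make the reward-shaping comparison concrete, the Python sketch below shows the kind of shaped reward such experiments trade off: progress toward the target is rewarded, each time step and excessive joint speed are penalized, and a bonus is paid on success. The coefficients, thresholds, and function name are illustrative assumptions, not the dissertation's actual reward systems.

# Toy shaped reward for a timed reaching task with a speed penalty.
import numpy as np

def reaching_reward(dist, prev_dist, joint_speeds, reached,
                    progress_w=10.0, time_penalty=0.05,
                    speed_limit=2.0, speed_penalty=0.1, success_bonus=5.0):
    reward = progress_w * (prev_dist - dist)                 # reward closing the distance to the target
    reward -= time_penalty                                   # per-step pressure to finish quickly
    excess = np.clip(np.abs(joint_speeds) - speed_limit, 0.0, None)
    reward -= speed_penalty * excess.sum()                   # discourage joint speeds over the limit
    if reached:
        reward += success_bonus                              # terminal bonus when the target is reached
    return reward

print(reaching_reward(dist=0.10, prev_dist=0.15,
                      joint_speeds=np.array([0.5, 2.5, 1.0, 0.2, 0.1, 3.0]),
                      reached=False))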
Master's degree in Computer and Telematics Engineering
APA, Harvard, Vancouver, ISO, and other styles
45

Shang, Li-Dan (商瓈丹). "Does Instructors’ Grading Policy Affect Student Evaluation of Teaching? Evidence from National Tsing Hua University." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/84201754667124562349.

Full text
Abstract:
Master's thesis
National Tsing Hua University
Department of Economics
104 (academic year)
This paper takes advantage of a uniquely compiled data set from National Tsing Hua University to evaluate empirically whether the results of student evaluation of teaching (SET) are affected by the reputation of instructors, controlling for other factors. Our results show a significant relationship between reputation/justice and SET. In particular, the empirical results suggest that instructors who want better SET outcomes can achieve them by establishing a reputation for giving students higher grades and treating students fairly. We also find that course-, instructor- and student-related characteristics matter for SET.
APA, Harvard, Vancouver, ISO, and other styles
46

"Policy and Practice Concerning Essay-Grading Criteria in Developmental English and College-Level English Programs in Tennessee Community Colleges." East Tennessee State University, 2000. http://etd-submit.etsu.edu/etd/theses/available/etd-0330100-210610/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

"Grade Inflation in English 102: Thirty Years of Data." Doctoral diss., 2015. http://hdl.handle.net/2286/R.I.29962.

Full text
Abstract:
This study investigates grades in English 102 at the Arizona State University Tempe campus from 1980 to 2010 to see whether grade inflation has taken place; it concludes that it has and then examines the causes. The data were collected from records held in the archives of the Registrar's Office, collated, and saved in numerical format for analysis. The data were then reviewed to establish whether grade point averages rise as consumer demands rise, as measured by student responses to evaluation questions, and whether demands for adequate performance in classrooms have declined. The study statistically analyzes students' final grades in ENG102 over thirty years, concludes that grade compression at the top of the grading scale exists, and discusses the implications of that compression at length.
Dissertation/Thesis
Doctoral Dissertation English 2015
APA, Harvard, Vancouver, ISO, and other styles
48

Mickwitz, Larissa. "En reformerad lärare : Konstruktionen av en professionell och betygssättande lärare i skolpolitik och skolpraktik" [A reformed teacher: The construction of a professional and grading teacher in school policy and school practice]. Doctoral thesis, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-115348.

Full text
Abstract:
This doctoral thesis investigates the interrelatedness between school policy and practice. In the thesis, the construction of "the teacher" is analysed in school policy documents and teacher interviews. I am particularly interested in the relation between school policy and school practice in light of the two latest curriculum reforms, of 1994 and 2011, and the teacher accreditation and registration reform of 2011. The analysis focuses on two topics: grading and the professional teacher. In fact, an analytic link is made between the emphasis on grading and the discursive construction of the teacher in Swedish education policy. The theoretical framework is positioned within institutional theory, within which I combine curriculum theory and sociological new institutionalism with discourse theory. The analyses of policy documents reveal three different discursive constructions of "the teacher". In the period of deregulation and decentralization, a professional teacher is constructed and the need for an autonomous teacher for school quality is expressed. By the 1990s–2000s an unprofessional grading teacher is constructed. In the period of the teacher accreditation and registration reform, a quality-assured teacher is constructed: a teacher who is formally authorized and in need of continuing evaluation. In the focus group interviews, teachers construct two types of professionalism. One is in line with the professionalism articulated in the policy texts and is about control and formal regulation; the other is about autonomy. Furthermore, the teachers relate grading to their ability to act in accordance with their overall teaching assignment, and grading was often constructed as opposed to teaching. Demands for documentation, quality reports and the requirement of teacher accreditation are described as institutional practices defined from above, practices that make it difficult for teachers to complete their teaching assignments. The study indicates that these kinds of requirements for transparency in teachers' work have reduced teachers' autonomy in an increasingly regulated schooling culture.
APA, Harvard, Vancouver, ISO, and other styles
49

Magalane, T. Phoshoko. "Exploring the adaptability of indigenous African marriage song to piano for classroom and the university level education." Diss., 2017. http://hdl.handle.net/11602/957.

Full text
Abstract:
MAAS
Centre for African Studies
This study explored the adaptability of indigenous African marriage songs to the piano. Music education has long been biased towards Western music content, to the exclusion of local musical traditions. A vast repertoire of music exists within indigenous African societies, yet formal music education seems oblivious to this resource, even though some educators decry the dearth of materials. There is a need for a music curriculum that is located within an African context and that includes indigenous African musical practices; such a need is also expressed in the new Curriculum and Assessment Policy Statement (CAPS) document. This study explored the feasibility of building a repertoire of indigenous songs for classroom purposes. A number of songs were collected, transcribed, analysed and placed at various levels of difficulty. These were then matched with the requisite proficiency levels, congruent with other graded piano regimes commonly used in the school system. The assumption is that the adaptation and arrangement of indigenous marriage songs will help bring indigenous African musical practices into the modern music education space. Furthermore, it is envisaged that the philosophical understanding and the knowledge attendant to the music practices yielding these songs, and the context in which they are performed, will form the basis for further advancement.
APA, Harvard, Vancouver, ISO, and other styles