Academic literature on the topic 'Constrained Reinforcement Learning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Constrained Reinforcement Learning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Constrained Reinforcement Learning"

1

Pankayaraj, Pathmanathan, and Pradeep Varakantham. "Constrained Reinforcement Learning in Hard Exploration Problems." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 12 (June 26, 2023): 15055–63. http://dx.doi.org/10.1609/aaai.v37i12.26757.

Full text
Abstract:
One approach to guaranteeing safety in Reinforcement Learning is through cost constraints that are dependent on the policy. Recent works in constrained RL have developed methods that ensure constraints are enforced even at learning time while maximizing the overall value of the policy. Unfortunately, as demonstrated in our experimental results, such approaches do not perform well on complex multi-level tasks, with longer episode lengths or sparse rewards. To that end, we propose a scalable hierarchical approach for constrained RL problems that employs backward cost value functions in the context of task hierarchy and a novel intrinsic reward function in lower levels of the hierarchy to enable cost constraint enforcement. One of our key contributions is in proving that backward value functions are theoretically viable even when there are multiple levels of decision making. We also show that our new approach, referred to as Hierarchically Limited consTraint Enforcement (HiLiTE), significantly improves on state-of-the-art Constrained RL approaches for many benchmark problems from the literature. We further demonstrate that this performance (on value and constraint enforcement) clearly outperforms existing best approaches for constrained RL and hierarchical RL.
2

HasanzadeZonuzy, Aria, Archana Bura, Dileep Kalathil, and Srinivas Shakkottai. "Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 7667–74. http://dx.doi.org/10.1609/aaai.v35i9.16937.

Full text
Abstract:
Many physical systems have underlying safety considerations that require that the policy employed ensures the satisfaction of a set of constraints. The analytical formulation usually takes the form of a Constrained Markov Decision Process (CMDP). We focus on the case where the CMDP is unknown, and RL algorithms obtain samples to discover the model and compute an optimal constrained policy. Our goal is to characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy---both objective maximization and constraint satisfaction---in a PAC sense. We explore two classes of RL algorithms, namely, (i) a generative model based approach, wherein samples are taken initially to estimate a model, and (ii) an online approach, wherein the model is updated as samples are obtained. Our main finding is that compared to the best known bounds of the unconstrained regime, the sample complexity of constrained RL algorithms is increased by a factor that is logarithmic in the number of constraints, which suggests that the approach may be easily utilized in real systems.
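For orientation, most of the works in this list build on the same constrained MDP formulation; a generic statement (not specific to this paper) is

    \max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \right]
    \quad \text{subject to} \quad
    \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} c_i(s_t, a_t) \right] \le d_i, \qquad i = 1, \dots, m,

where r is the reward function, the c_i are cost functions, γ is the discount factor, and the d_i are the corresponding cost budgets.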
3

Dai, Juntao, Jiaming Ji, Long Yang, Qian Zheng, and Gang Pan. "Augmented Proximal Policy Optimization for Safe Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (June 26, 2023): 7288–95. http://dx.doi.org/10.1609/aaai.v37i6.25888.

Full text
Abstract:
Safe reinforcement learning considers practical scenarios that maximize the return while satisfying safety constraints. Current algorithms, which suffer from training oscillations or approximation errors, still struggle to update the policy efficiently with precise constraint satisfaction. In this article, we propose Augmented Proximal Policy Optimization (APPO), which augments the Lagrangian function of the primal constrained problem via attaching a quadratic deviation term. The constructed multiplier-penalty function dampens cost oscillation for stable convergence while being equivalent to the primal constrained problem to precisely control safety costs. APPO alternately updates the policy and the Lagrangian multiplier via solving the constructed augmented primal-dual problem, which can be easily implemented by any first-order optimizer. We apply our APPO methods in diverse safety-constrained tasks, setting a new state of the art compared with a comprehensive list of safe RL baselines. Extensive experiments verify the merits of our method in easy implementation, stable convergence, and precise cost control.
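As a rough illustration of the multiplier-penalty idea (one common augmented-Lagrangian form, stated here as an assumption rather than APPO's exact objective), the constrained problem above can be relaxed to

    L_{\rho}(\theta, \lambda) \;=\; J_r(\theta) \;-\; \lambda \big( J_c(\theta) - d \big) \;-\; \frac{\rho}{2} \, \max\!\big( 0,\; J_c(\theta) - d \big)^{2},

where J_r(θ) and J_c(θ) are the expected return and expected cost of the policy π_θ, d is the cost limit, λ ≥ 0 is the multiplier, and the quadratic term with weight ρ > 0 is what damps cost oscillations; the policy and the multiplier are then updated alternately with a first-order optimizer.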
4

Bhatia, Abhinav, Pradeep Varakantham, and Akshat Kumar. "Resource Constrained Deep Reinforcement Learning." Proceedings of the International Conference on Automated Planning and Scheduling 29 (May 25, 2021): 610–20. http://dx.doi.org/10.1609/icaps.v29i1.3528.

Full text
Abstract:
In urban environments, resources have to be constantly matched to the “right” locations where customer demand is present. For instance, ambulances have to be matched to base stations regularly so as to reduce response time for emergency incidents in ERS (Emergency Response Systems); vehicles (cars, bikes, among others) have to be matched to docking stations to reduce lost demand in shared mobility systems. Such problems are challenging owing to the demand uncertainty, combinatorial action spaces and constraints on allocation of resources (e.g., total resources, minimum and maximum number of resources at locations and regions). Existing systems typically employ myopic and greedy optimization approaches to optimize resource allocation. Such approaches typically are unable to handle surges or variances in demand patterns well. Recent work has demonstrated the ability of Deep RL methods in adapting well to highly uncertain environments. However, existing Deep RL methods are unable to handle combinatorial action spaces and constraints on allocation of resources. To that end, we have developed three approaches on top of the well-known actor-critic approach, DDPG (Deep Deterministic Policy Gradient), that are able to handle constraints on resource allocation. We also demonstrate that they are able to outperform leading approaches on simulators validated on semi-real and real data sets.
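To make the flavour of these allocation constraints concrete, here is a minimal, hypothetical sketch (the function name and the softmax-plus-rounding scheme are illustrative assumptions, not the paper's method) of mapping raw actor outputs to an integer allocation that respects a total-resource budget; the paper additionally handles per-location minimum and maximum bounds.

    import numpy as np

    def allocate(scores, total_resources):
        """Map raw actor scores to an integer allocation summing to the budget (illustrative only)."""
        probs = np.exp(scores - scores.max())          # softmax, shifted for numerical stability
        probs /= probs.sum()
        base = np.floor(probs * total_resources).astype(int)
        remainder = int(total_resources - base.sum())  # units left over after flooring
        frac = probs * total_resources - base          # fractional parts
        for idx in np.argsort(-frac)[:remainder]:      # hand leftovers to the largest fractions
            base[idx] += 1
        return base

    # Example: allocate 20 ambulances across 5 hypothetical base stations
    # allocate(np.array([0.2, 1.5, -0.3, 0.9, 0.4]), 20)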
5

Yang, Qisong, Thiago D. Simão, Simon H. Tindemans, and Matthijs T. J. Spaan. "WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10639–46. http://dx.doi.org/10.1609/aaai.v35i12.17272.

Full text
Abstract:
Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast it as constrained reinforcement learning, where expected long-term costs of policies are constrained. However, it can be hazardous to set constraints on the expected safety signal without considering the tail of the distribution. For instance, in safety-critical domains, worst-case analysis is required to avoid disastrous results. We present a novel reinforcement learning algorithm called Worst-Case Soft Actor Critic, which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety measure to judge the constraint satisfaction, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. The empirical analysis shows that our algorithm attains better risk control compared to expectation-based methods.
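For reference, the risk measure involved is the standard conditional Value-at-Risk of the cost distribution: for a continuous cumulative-cost random variable C and confidence level α,

    \mathrm{CVaR}_{\alpha}(C) \;=\; \mathbb{E}\big[\, C \;\big|\; C \ge \mathrm{VaR}_{\alpha}(C) \,\big],

i.e., the expected cost over the worst (1 − α) share of outcomes, and the worst-case constraint replaces the usual \mathbb{E}[C] \le d with \mathrm{CVaR}_{\alpha}(C) \le d (a generic statement of the constraint, not the paper's full algorithm).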
6

Zhou, Zixian, Mengda Huang, Feiyang Pan, Jia He, Xiang Ao, Dandan Tu, and Qing He. "Gradient-Adaptive Pareto Optimization for Constrained Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 9 (June 26, 2023): 11443–51. http://dx.doi.org/10.1609/aaai.v37i9.26353.

Full text
Abstract:
Constrained Reinforcement Learning (CRL) has attracted broad interest in recent years; it pursues maximizing long-term returns while constraining costs. Although CRL can be cast as a multi-objective optimization problem, it still faces the key challenge that gradient-based Pareto optimization methods tend to stick to known Pareto-optimal solutions even when they yield poor returns (e.g., the safest self-driving car that never moves) or violate the constraints (e.g., the record-breaking racer that crashes the car). In this paper, we propose Gradient-adaptive Constrained Policy Optimization (GCPO for short), a novel Pareto optimization method for CRL with two adaptive gradient recalibration techniques. First, to find Pareto-optimal solutions with balanced performance over all targets, we propose gradient rebalancing, which forces the agent to improve more on under-optimized objectives at every policy iteration. Second, to guarantee that the cost constraints are satisfied, we propose gradient perturbation, which can temporarily sacrifice the returns for costs. Experiments on the SafetyGym benchmarks show that our method consistently outperforms previous CRL methods in reward while satisfying the constraints.
7

He, Tairan, Weiye Zhao, and Changliu Liu. "AutoCost: Evolving Intrinsic Cost for Zero-Violation Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 12 (June 26, 2023): 14847–55. http://dx.doi.org/10.1609/aaai.v37i12.26734.

Full text
Abstract:
Safety is a critical hurdle that limits the application of deep reinforcement learning to real-world control tasks. To this end, constrained reinforcement learning leverages cost functions to improve safety in constrained Markov decision processes. However, constrained methods fail to achieve zero violation even when the cost limit is zero. This paper analyzes the reason for such failure, which suggests that a proper cost function plays an important role in constrained RL. Inspired by the analysis, we propose AutoCost, a simple yet effective framework that automatically searches for cost functions that help constrained RL to achieve zero-violation performance. We validate the proposed method and the searched cost function on the safety benchmark Safety Gym. We compare the performance of augmented agents that use our cost function to provide additive intrinsic costs to a Lagrangian-based policy learner and a constrained-optimization policy learner with baseline agents that use the same policy learners but with only extrinsic costs. Results show that the converged policies with intrinsic costs in all environments achieve zero constraint violation and comparable performance with baselines.
8

Yang, Zhaoxing, Haiming Jin, Rong Ding, Haoyi You, Guiyun Fan, Xinbing Wang, and Chenghu Zhou. "DeCOM: Decomposed Policy for Constrained Cooperative Multi-Agent Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 9 (June 26, 2023): 10861–70. http://dx.doi.org/10.1609/aaai.v37i9.26288.

Full text
Abstract:
In recent years, multi-agent reinforcement learning (MARL) has presented impressive performance in various applications. However, physical limitations, budget restrictions, and many other factors usually impose constraints on a multi-agent system (MAS), which cannot be handled by traditional MARL frameworks. Specifically, this paper focuses on constrained MASes where agents work cooperatively to maximize the expected team-average return under various constraints on expected team-average costs, and develops a constrained cooperative MARL framework, named DeCOM, for such MASes. In particular, DeCOM decomposes the policy of each agent into two modules, which empowers information sharing among agents to achieve better cooperation. In addition, with such modularization, the training algorithm of DeCOM separates the original constrained optimization into an unconstrained optimization on reward and a constraints satisfaction problem on costs. DeCOM then iteratively solves these problems in a computationally efficient manner, which makes DeCOM highly scalable. We also provide theoretical guarantees on the convergence of DeCOM's policy update algorithm. Finally, we conduct extensive experiments to show the effectiveness of DeCOM with various types of costs in both moderate-scale and large-scale (with 500 agents) environments that originate from real-world applications.
9

Martins, Miguel S. E., Joaquim L. Viegas, Tiago Coito, Bernardo Marreiros Firme, João M. C. Sousa, João Figueiredo, and Susana M. Vieira. "Reinforcement Learning for Dual-Resource Constrained Scheduling." IFAC-PapersOnLine 53, no. 2 (2020): 10810–15. http://dx.doi.org/10.1016/j.ifacol.2020.12.2866.

Full text
10

Guenter, Florent, Micha Hersch, Sylvain Calinon, and Aude Billard. "Reinforcement learning for imitating constrained reaching movements." Advanced Robotics 21, no. 13 (January 1, 2007): 1521–44. http://dx.doi.org/10.1163/156855307782148550.

Full text

Dissertations / Theses on the topic "Constrained Reinforcement Learning"

1

Chung, Jen Jen. "Learning to soar: exploration strategies in reinforcement learning for resource-constrained missions." Thesis, The University of Sydney, 2014. http://hdl.handle.net/2123/11733.

Full text
Abstract:
An unpowered aerial glider learning to soar in a wind field presents a new manifestation of the exploration-exploitation trade-off. This thesis proposes a directed, adaptive and nonmyopic exploration strategy in a temporal difference reinforcement learning framework for tackling the resource-constrained exploration-exploitation task of this autonomous soaring problem. The complete learning algorithm is developed in a SARSA(λ) framework, which uses a Gaussian process with a squared exponential covariance function to approximate the value function. The three key contributions of this thesis form the proposed exploration-exploitation strategy. Firstly, a new information measure is derived from the change in the variance volume surrounding the Gaussian process estimate. This measure of information gain is used to define the exploration reward of an observation. Secondly, a nonmyopic information value is presented that captures both the immediate exploration reward due to taking an action as well as future exploration opportunities that result. Finally, this information value is combined with the state-action value of SARSA(λ) through a dynamic weighting factor to produce an exploration-exploitation management scheme for resource-constrained learning systems. The proposed learning strategy encourages either exploratory or exploitative behaviour depending on the requirements of the learning task and the available resources. The performance of the learning algorithms presented in this thesis is compared against other SARSA(λ) methods. Results show that actively directing exploration to regions of the state-action space with high uncertainty improves the rate of learning, while dynamic management of the exploration-exploitation behaviour according to the available resources produces prudent learning behaviour in resource-constrained systems.
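For context, the tabular SARSA(λ) update that the thesis builds on (the thesis replaces the table with a Gaussian-process value estimate and adds an information-based exploration reward) is, after each transition (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}),

    \delta_t = r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t), \qquad e(s_t, a_t) \leftarrow e(s_t, a_t) + 1,
    Q(s, a) \leftarrow Q(s, a) + \alpha\, \delta_t\, e(s, a), \qquad e(s, a) \leftarrow \gamma \lambda\, e(s, a) \quad \text{for all } (s, a),

with step size α, discount factor γ, and trace-decay parameter λ.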
2

Araújo, Anderson Viçoso de. "ERG-ARCH: a reinforcement learning architecture for propositionally constrained multi-agent state spaces." Instituto Tecnológico de Aeronáutica, 2014. http://www.bd.bibl.ita.br/tde_busca/arquivo.php?codArquivo=3096.

Full text
Abstract:
The main goal of this work is to present an approach that finds an appropriate set of sequential actions for a group of cooperative agents interacting over a constrained environment. This search is considered a complex task for autonomous agents, and it is not possible to use default reinforcement learning algorithms to learn the adequate policy. In this thesis, a technique that deals with propositionally constrained state spaces and makes use of a Reinforcement Learning algorithm based on Markov Decision Process is proposed. A new model is also presented which formally defines this restricted search space. By so doing, this work aims at reducing the overall exploratory need, thus improving the performance of the learning algorithm. To constrain the state space the concept of extended reachability goals is employed. Through them it is possible to define an objective to be preserved during the interaction with the environment and another that defines a goal state. In this cooperative environment, the information about the propositions is shared among the agents during its interaction. An architecture to solve problems in such environments is also presented. Experiments to validate the proposed algorithm were performed on different test cases and showed interesting results. A performance evaluation against standard Reinforcement Learning techniques showed that extending autonomous learning with propositional constraints updated along the learning process can produce faster convergence to adequate policies. The best results achieved present an important reduction in execution time (34.32%) and number of iterations (67.94%). This occurs due to the early state space reduction caused by shared information on state space constraints.
3

Pavesi, Alessandro. "Design and implementation of a Reinforcement Learning framework for iOS devices." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2022. http://amslaurea.unibo.it/25811/.

Full text
Abstract:
Reinforcement Learning is an increasingly popular area of Artificial Intelligence. The applications of this learning paradigm are many, but its application in mobile computing is in its infancy. This study aims to provide an overview of current Reinforcement Learning applications on mobile devices, as well as to introduce a new framework for iOS devices: Swift-RL Lib. This new Swift package allows developers to easily support and integrate two of the most common RL algorithms, Q-Learning and Deep Q-Network, in a fully customizable environment. All processes are performed on the device, without any need for remote computation. The framework was tested in different settings and evaluated through several use cases. Through an in-depth performance analysis, we show that the platform provides effective and efficient support for Reinforcement Learning for mobile applications.
4

Watanabe, Takashi. "Regret analysis of constrained irreducible MDPs with reset action." Kyoto University, 2020. http://hdl.handle.net/2433/253371.

Full text
Abstract:
Kyoto University (京都大学). Doctoral thesis, Doctor of Human and Environmental Studies, Graduate School of Human and Environmental Studies, Kyoto University, 2019 (degree no. 甲第22535号; examining committee: Assoc. Prof. 櫻川 貴司 (chair), Prof. 立木 秀樹, and Prof. 日置 尋久).
5

Allmendinger, Richard. "Tuning evolutionary search for closed-loop optimization." Thesis, University of Manchester, 2012. https://www.research.manchester.ac.uk/portal/en/theses/tuning-evolutionary-search-for-closedloop-optimization(d54e63e2-7927-42aa-b974-c41e717298cb).html.

Full text
Abstract:
Closed-loop optimization deals with problems in which candidate solutions are evaluated by conducting experiments, e.g. physical or biochemical experiments. Although this form of optimization is becoming more popular across the sciences, it may be subject to rather unexplored resourcing issues, as any experiment may require resources in order to be conducted. In this thesis we are concerned with understanding how evolutionary search is affected by three particular resourcing issues -- ephemeral resource constraints (ERCs), changes of variables, and lethal environments -- and the development of search strategies to combat these issues. The thesis makes three broad contributions. First, we motivate and formally define the resourcing issues considered. Here, concrete examples in a range of applications are given. Secondly, we theoretically and empirically investigate the effect of the resourcing issues considered on evolutionary search. This investigation reveals that resourcing issues affect optimization in general, and that clear patterns emerge relating specific properties of the different resourcing issues to performance effects. Thirdly, we develop and analyze various search strategies augmented on an evolutionary algorithm (EA) for coping with resourcing issues. To cope specifically with ERCs, we develop several static constraint-handling strategies, and investigate the application of reinforcement learning techniques to learn when to switch between these static strategies during an optimization process. We also develop several online resource-purchasing strategies to cope with ERCs that leave the arrangement of resources to the hands of the optimizer. For problems subject to changes of variables relating to the resources, we find that knowing which variables are changed provides an optimizer with valuable information, which we exploit using a novel dynamic strategy. Finally, for lethal environments, where visiting parts of the search space can cause the permanent loss of resources, we observe that a standard EA's population may be reduced in size rapidly, complicating the search for innovative solutions. To cope with such scenarios, we consider some non-standard EA setups that are able to innovate genetically whilst simultaneously mitigating risks to the evolving population.
6

Irani, Arya John. "Utilizing negative policy information to accelerate reinforcement learning." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/53481.

Full text
Abstract:
A pilot study by Subramanian et al. on Markov decision problem task decomposition by humans revealed that participants break down tasks into both short-term subgoals with a defined end-condition (such as "go to food") and long-term considerations and invariants with no end-condition (such as "avoid predators"). In the context of Markov decision problems, behaviors having clear start and end conditions are well-modeled by an abstraction known as options, but no abstraction exists in the literature for continuous constraints imposed on the agent's behavior. We propose two representations to fill this gap: the state constraint (a set or predicate identifying states that the agent should avoid) and the state-action constraint (identifying state-action pairs that should not be taken). State-action constraints can be directly utilized by an agent, which must choose an action in each state, while state constraints require an approximation of the MDP’s state transition function to be used; however, it is important to support both representations, as certain constraints may be more easily expressed in terms of one as compared to the other, and users may conceive of rules in either form. Using domains inspired by classic video games, this dissertation demonstrates the thesis that explicitly modeling this negative policy information improves reinforcement learning performance by decreasing the amount of training needed to achieve a given level of performance. In particular, we will show that even the use of negative policy information captured from individuals with no background in artificial intelligence yields improved performance. We also demonstrate that the use of options and constraints together form a powerful combination: an option and constraint can be taken together to construct a constrained option, which terminates in any situation where the original option would violate a constraint. In this way, a naive option defined to perform well in a best-case scenario may still accelerate learning in domains where the best-case scenario is not guaranteed.
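As a minimal sketch of how such negative policy information can be consumed by a learner (the helper and predicate names below are hypothetical; the dissertation's constrained options are richer than this), a state-action constraint can simply filter greedy action selection:

    def constrained_greedy_action(q_values, state, actions, violates):
        """Greedy choice restricted by a state-action constraint (illustrative sketch).

        `q_values` maps (state, action) pairs to value estimates, and
        `violates(state, action)` is a user-supplied predicate encoding
        negative policy information ("never take this action here").
        """
        allowed = [a for a in actions if not violates(state, a)]
        if not allowed:                       # constraint rules out everything:
            allowed = list(actions)           # fall back rather than dead-end
        return max(allowed, key=lambda a: q_values[(state, a)])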
7

Acevedo, Valle Juan Manuel. "Sensorimotor exploration: constraint awareness and social reinforcement in early vocal development." Doctoral thesis, Universitat Politècnica de Catalunya, 2018. http://hdl.handle.net/10803/667500.

Full text
Abstract:
This research is motivated by the benefits that knowledge regarding early development in infants may provide to different fields of science. In particular, early sensorimotor exploration behaviors are studied in the framework of developmental robotics. The main objective is to understand the role of motor constraint awareness and imitative behaviors during sensorimotor exploration. Particular emphasis is placed on prelinguistic vocal development because during this stage infants start to master the motor systems that will later allow them to pronounce their first words. Previous works have demonstrated that goal-directed intrinsically motivated sensorimotor exploration is an essential element for sensorimotor control learning. Moreover, evidence coming from biological sciences strongly suggests that knowledge acquisition is shaped by the environment in which an agent is embedded and the embodiment of the agent itself, including developmental processes that shape what can be learned and when. In this dissertation, we firstly provide a collection of theoretical evidence that supports the relevance of our study. Starting from concepts of cognitive and developmental sciences, we arrived at the conclusion that spoken language, i.e., early vocal development, must be studied as an embodied and situated phenomenon. Considering a synthetic approach allows us to use robots and realistic simulators as artifacts to study natural cognitive phenomena. In this work, we adopt a toy example to test our cognitive architectures and a speech synthesizer that mimics the mechanisms by which humans produce speech. Next, we introduce a mechanism to endow embodied agents with motor constraint awareness. Intrinsic motivation has been studied as an important element to explain the emergence of structured developmental stages during early vocal development. However, previous studies failed to acknowledge the constraints imposed by the embodiment and situatedness at sensory, motor, cognitive and social levels. We assume that during the onset of sensorimotor exploratory behaviors, motor constraints are unknown to the developmental agent. Thus, the agent must discover and learn during exploration what those motor constraints are. The agent is endowed with a somesthetic system based on tactile information. This system generates a sensor signal indicating if a motor configuration was reached or not. This information is later used to create a somesthetic model to predict constraint violations. Finally, we propose to include social reinforcement during exploration. Some works studying early vocal development have shown that environmental speech shapes the sensory space explored during babbling. More generally, imitative behaviors have been demonstrated to be crucial for early development in children as they constrain the search space during sensorimotor exploration. Therefore, based on early interactions of infants and caregivers, we proposed an imitative mechanism to reinforce intrinsically motivated sensorimotor exploration with relevant sensory units. Thus, we modified the constraint-aware sensorimotor exploration architecture to include a social instructor, expert in sensor units relevant to communication, which interacts with the developmental agent. Interaction occurs when the learner's production is sufficiently similar to one relevant to communication. In that case, the instructor perceives this similitude and reformulates with the relevant sensor unit. When the learner perceives an utterance by the instructor, it attempts to imitate it. In general, our results suggest that somesthetic senses and social reinforcement contribute to achieving better results during intrinsically motivated exploration: exploration is less redundant, exploration and evaluation errors decrease, and developmental transitions appear more clearly.
8

Racey, Deborah Elaine. "EFFECTS OF RESPONSE FREQUENCY CONSTRAINTS ON LEARNING IN A NON-STATIONARY MULTI-ARMED BANDIT TASK." OpenSIUC, 2009. https://opensiuc.lib.siu.edu/dissertations/86.

Full text
Abstract:
An n-armed bandit task was used to investigate the trade-off between exploratory (choosing lesser-known options) and exploitive (choosing options with the greatest probability of reinforcement) human choice in a trial-and-error learning problem. In Experiment 1 a different probability of reinforcement was assigned to each of 8 response options using random-ratios (RRs), and participants chose by clicking buttons in a circular display on a computer screen using a computer mouse. Relative frequency thresholds (ranging from .10 to 1.0) were randomly assigned to each participant and acted as task constraints limiting the proportion of total responses that could be attributed to any response option. Preference for the richer keys was shown, and those with greater constraints explored more and earned less reinforcement. Those with the highest constraints showed no preference, distributing their responses among the options with equal probability. In Experiment 2 the payoff probabilities changed partway through, for some the leanest options increased to richest, and for others the richest became leanest. When the RRs changed, the decrease participants with moderate and low constraints showed immediate increases in exploration and change in preference to the new richest keys, while increase participants showed no increase in exploration, and more gradual changes in preference. For Experiment 3 the constraint was held constant at .85, and the two richest options were decreased midway through the task by varying amounts (0 to .60). Decreases were detected early for participants in all but the smallest decrease conditions, and exploration increased.
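To illustrate the response-frequency constraint described above (a hypothetical sketch assuming an epsilon-greedy chooser; this is not the experimental software used in the dissertation):

    import random

    def constrained_choice(counts, values, max_share, epsilon=0.1):
        """Epsilon-greedy arm choice under a per-arm relative-frequency cap.

        `counts[i]` and `values[i]` are the pull count and estimated value of
        arm i; an arm is eligible only if choosing it keeps its share of all
        responses at or below `max_share`.
        """
        total = sum(counts) + 1                      # include the upcoming response
        eligible = [i for i in range(len(counts))
                    if (counts[i] + 1) / total <= max_share]
        if not eligible:                             # cap binds everywhere: ignore it
            eligible = list(range(len(counts)))
        if random.random() < epsilon:
            return random.choice(eligible)
        return max(eligible, key=lambda i: values[i])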
9

Cline, Tammy Lynn. "Effects of Training Accurate Component Strokes Using Response Constraint and Self-evaluation on Whole Letter Writing." Thesis, University of North Texas, 2006. https://digital.library.unt.edu/ark:/67531/metadc5472/.

Full text
Abstract:
This study analyzed the effects of a training package containing response constraint, self-evaluation, reinforcement, and a fading procedure on written letter components and whole letter writing in four elementary school participants. The effect on accuracy of written components was evaluated using a multiple-baseline-across components and a continuous probe design of components, as well as pre-test, baseline, and post-test measures. The results of this study show that response constraint and self-evaluation quickly improved students' performance in writing components. Fading of the intervention was achieved quickly and performance maintained. Results also show that improvement in component writing improved whole letter and full name writing and letter reversals in the presence of a model were corrected.
10

Hester, Todd. "Texplore: temporal difference reinforcement learning for robots and time-constrained domains." Thesis, 2012. http://hdl.handle.net/2152/ETD-UT-2012-12-6763.

Full text
Abstract:
Robots have the potential to solve many problems in society, because of their ability to work in dangerous places doing necessary jobs that no one wants or is able to do. One barrier to their widespread deployment is that they are mainly limited to tasks where it is possible to hand-program behaviors for every situation that may be encountered. For robots to meet their potential, they need methods that enable them to learn and adapt to novel situations that they were not programmed for. Reinforcement learning (RL) is a paradigm for learning sequential decision making processes and could solve the problems of learning and adaptation on robots. This dissertation identifies four key challenges that must be addressed for an RL algorithm to be practical for robotic control tasks. These RL for Robotics Challenges are: 1) it must learn in very few samples; 2) it must learn in domains with continuous state features; 3) it must handle sensor and/or actuator delays; and 4) it should continually select actions in real time. This dissertation focuses on addressing all four of these challenges. In particular, this dissertation is focused on time-constrained domains where the first challenge is critically important. In these domains, the agent's lifetime is not long enough for it to explore the domain thoroughly, and it must learn in very few samples. Although existing RL algorithms successfully address one or more of the RL for Robotics Challenges, no prior algorithm addresses all four of them. To fill this gap, this dissertation introduces TEXPLORE, the first algorithm to address all four challenges. TEXPLORE is a model-based RL method that learns a random forest model of the domain which generalizes dynamics to unseen states. Each tree in the random forest model represents a hypothesis of the domain's true dynamics, and the agent uses these hypotheses to explore states that are promising for the final policy, while ignoring states that do not appear promising. With sample-based planning and a novel parallel architecture, TEXPLORE can select actions continually in real time whenever necessary. We empirically evaluate each component of TEXPLORE in comparison with other state-of-the-art approaches. In addition, we present modifications of TEXPLORE's exploration mechanism for different types of domains. The key result of this dissertation is a demonstration of TEXPLORE learning to control the velocity of an autonomous vehicle on-line, in real time, while running on-board the robot. After controlling the vehicle for only two minutes, TEXPLORE is able to learn to move the pedals of the vehicle to drive at the desired velocities. The work presented in this dissertation represents an important step towards applying RL to robotics and enabling robots to perform more tasks in society. By enabling robots to learn in few actions while acting on-line in real time on robots with continuous state and actuator delays, TEXPLORE significantly broadens the applicability of RL to robots.

Books on the topic "Constrained Reinforcement Learning"

1

Hester, Todd. TEXPLORE: Temporal Difference Reinforcement Learning for Robots and Time-Constrained Domains. Heidelberg: Springer International Publishing, 2013. http://dx.doi.org/10.1007/978-3-319-01168-4.

Full text
2

Hester, Todd. TEXPLORE: Temporal Difference Reinforcement Learning for Robots and Time-Constrained Domains. Heidelberg: Springer International Publishing, 2013.

Find full text
3

Hester, Todd. TEXPLORE: Temporal Difference Reinforcement Learning for Robots and Time-Constrained Domains. Springer, 2013.

Find full text
4

Hester, Todd. TEXPLORE: Temporal Difference Reinforcement Learning for Robots and Time-Constrained Domains. Springer, 2013.

Find full text
5

Hester, Todd. TEXPLORE: Temporal Difference Reinforcement Learning for Robots and Time-Constrained Domains. Springer, 2016.

Find full text

Book chapters on the topic "Constrained Reinforcement Learning"

1

Junges, Sebastian, Nils Jansen, Christian Dehnert, Ufuk Topcu, and Joost-Pieter Katoen. "Safety-Constrained Reinforcement Learning for MDPs." In Tools and Algorithms for the Construction and Analysis of Systems, 130–46. Berlin, Heidelberg: Springer Berlin Heidelberg, 2016. http://dx.doi.org/10.1007/978-3-662-49674-9_8.

Full text
2

Wang, Huiwei, Huaqing Li, and Bo Zhou. "Reinforcement Learning for Constrained Games with Incomplete Information." In Distributed Optimization, Game and Learning Algorithms, 161–89. Singapore: Springer Singapore, 2021. http://dx.doi.org/10.1007/978-981-33-4528-7_8.

Full text
3

Ling, Jiajing, Arambam James Singh, Nguyen Duc Thien, and Akshat Kumar. "Constrained Multiagent Reinforcement Learning for Large Agent Population." In Machine Learning and Knowledge Discovery in Databases, 183–99. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-26412-2_12.

Full text
4

Winkel, David, Niklas Strauß, Matthias Schubert, Yunpu Ma, and Thomas Seidl. "Constrained Portfolio Management Using Action Space Decomposition for Reinforcement Learning." In Advances in Knowledge Discovery and Data Mining, 373–85. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-33377-4_29.

Full text
Abstract:
Financial portfolio managers typically face multi-period optimization tasks such as short-selling or investing at least a particular portion of the portfolio in a specific industry sector. A common approach to tackle these problems is to use constrained Markov decision process (CMDP) methods, which may suffer from sample inefficiency, hyperparameter tuning, and lack of guarantees for constraint violations. In this paper, we propose Action Space Decomposition Based Optimization (ADBO) for optimizing a more straightforward surrogate task that allows actions to be mapped back to the original task. We examine our method on two real-world data portfolio construction tasks. The results show that our new approach consistently outperforms state-of-the-art benchmark approaches for general CMDPs.
5

Li, Wei, and Waleed Meleis. "Adaptive Adjacency Kanerva Coding for Memory-Constrained Reinforcement Learning." In Machine Learning and Data Mining in Pattern Recognition, 187–201. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-96136-1_16.

Full text
6

Hasanbeig, Mohammadhosein, Daniel Kroening, and Alessandro Abate. "LCRL: Certified Policy Synthesis via Logically-Constrained Reinforcement Learning." In Quantitative Evaluation of Systems, 217–31. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-16336-4_11.

Full text
7

Ferrari, Silvia, Keith Rudd, and Gianluca Di Muro. "A Constrained Backpropagation Approach to Function Approximation and Approximate Dynamic Programming." In Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, 162–81. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2013. http://dx.doi.org/10.1002/9781118453988.ch8.

Full text
8

Sharif, Muddsair, Charitha Buddhika Heendeniya, and Gero Lückemeyer. "ARaaS: Context-Aware Optimal Charging Distribution Using Deep Reinforcement Learning." In iCity. Transformative Research for the Livable, Intelligent, and Sustainable City, 199–209. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-030-92096-8_12.

Full text
Abstract:
Electromobility has profound economic and ecological impacts on human society. Much of the mobility sector’s transformation is catalyzed by digitalization, enabling many stakeholders, such as vehicle users and infrastructure owners, to interact with each other in real time. This article presents a new concept based on deep reinforcement learning to optimize agent interactions and decision-making in a smart mobility ecosystem. The algorithm performs context-aware, constrained optimization that fulfills on-demand requests from each agent. The algorithm can learn from the surrounding environment until the agent interactions reach an optimal equilibrium point in a given context. The methodology implements an automatic template-based approach via a continuous integration and delivery (CI/CD) framework using a GitLab runner and transfers highly computationally intensive tasks over a high-performance computing cluster automatically without manual intervention.
9

Jędrzejowicz, Piotr, and Ewa Ratajczak-Ropel. "Reinforcement Learning Strategy for A-Team Solving the Resource-Constrained Project Scheduling Problem." In Computational Collective Intelligence. Technologies and Applications, 457–66. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-40495-5_46.

Full text
10

Dutta, Hrishikesh, Amit Kumar Bhuyan, and Subir Biswas. "Reinforcement Learning for Protocol Synthesis in Resource-Constrained Wireless Sensor and IoT Networks." In Ubiquitous Networking, 183–99. Cham: Springer International Publishing, 2023. http://dx.doi.org/10.1007/978-3-031-29419-8_14.

Full text

Conference papers on the topic "Constrained Reinforcement Learning"

1

Hu, Chengpeng, Jiyuan Pei, Jialin Liu, and Xin Yao. "Evolving Constrained Reinforcement Learning Policy." In 2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 2023. http://dx.doi.org/10.1109/ijcnn54540.2023.10191982.

Full text
2

Zhang, Linrui, Li Shen, Long Yang, Shixiang Chen, Xueqian Wang, Bo Yuan, and Dacheng Tao. "Penalized Proximal Policy Optimization for Safe Reinforcement Learning." In Thirty-First International Joint Conference on Artificial Intelligence {IJCAI-22}. California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/520.

Full text
Abstract:
Safe reinforcement learning aims to learn the optimal policy while satisfying safety constraints, which is essential in real-world applications. However, current algorithms still struggle for efficient policy updates with hard constraint satisfaction. In this paper, we propose Penalized Proximal Policy Optimization (P3O), which solves the cumbersome constrained policy iteration via a single minimization of an equivalent unconstrained problem. Specifically, P3O utilizes a simple yet effective penalty approach to eliminate cost constraints and removes the trust-region constraint by the clipped surrogate objective. We theoretically prove the exactness of the penalized method with a finite penalty factor and provide a worst-case analysis for approximate error when evaluated on sample trajectories. Moreover, we extend P3O to more challenging multi-constraint and multi-agent scenarios which are less studied in previous work. Extensive experiments show that P3O outperforms state-of-the-art algorithms with respect to both reward improvement and constraint satisfaction on a set of constrained locomotive tasks.
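The exact-penalty idea can be written compactly (a generic form in the CMDP notation used earlier in this list, with J_r and J_c the expected return and cost of the policy and d the cost limit; P3O additionally applies PPO-style clipped surrogates to both terms):

    \min_{\theta} \; -J_r(\theta) \;+\; \kappa \, \max\!\big( 0,\; J_c(\theta) - d \big),

which, as the abstract notes, is exact for a sufficiently large but finite penalty factor κ.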
3

HasanzadeZonuzy, Aria, Dileep Kalathil, and Srinivas Shakkottai. "Model-Based Reinforcement Learning for Infinite-Horizon Discounted Constrained Markov Decision Processes." In Thirtieth International Joint Conference on Artificial Intelligence {IJCAI-21}. California: International Joint Conferences on Artificial Intelligence Organization, 2021. http://dx.doi.org/10.24963/ijcai.2021/347.

Full text
Abstract:
In many real-world reinforcement learning (RL) problems, in addition to maximizing the objective, the learning agent has to maintain some necessary safety constraints. We formulate the problem of learning a safe policy as an infinite-horizon discounted Constrained Markov Decision Process (CMDP) with an unknown transition probability matrix, where the safety requirements are modeled as constraints on expected cumulative costs. We propose two model-based constrained reinforcement learning (CRL) algorithms for learning a safe policy, namely, (i) GM-CRL algorithm, where the algorithm has access to a generative model, and (ii) UC-CRL algorithm, where the algorithm learns the model using an upper confidence style online exploration method. We characterize the sample complexity of these algorithms, i.e., the number of samples needed to ensure a desired level of accuracy with high probability, both with respect to objective maximization and constraint satisfaction.
4

Skalse, Joar, Lewis Hammond, Charlie Griffin, and Alessandro Abate. "Lexicographic Multi-Objective Reinforcement Learning." In Thirty-First International Joint Conference on Artificial Intelligence {IJCAI-22}. California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/476.

Full text
Abstract:
In this work we introduce reinforcement learning techniques for solving lexicographic multi-objective problems. These are problems that involve multiple reward signals, and where the goal is to learn a policy that maximises the first reward signal, and subject to this constraint also maximises the second reward signal, and so on. We present a family of both action-value and policy gradient algorithms that can be used to solve such problems, and prove that they converge to policies that are lexicographically optimal. We evaluate the scalability and performance of these algorithms empirically, and demonstrate their applicability in practical settings. As a more specific application, we show how our algorithms can be used to impose safety constraints on the behaviour of an agent, and compare their performance in this context with that of other constrained reinforcement learning algorithms.
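Schematically, with return objectives J_1, ..., J_k ordered by priority, the lexicographic problem asks for

    \pi^{*} \in \arg\max_{\pi \in \Pi_{k-1}} J_k(\pi), \qquad \Pi_0 = \Pi, \qquad \Pi_i = \Big\{ \pi \in \Pi_{i-1} \;:\; J_i(\pi) \ge \max_{\pi' \in \Pi_{i-1}} J_i(\pi') - \epsilon_i \Big\},

i.e., each objective is optimized only over policies that are optimal (up to a slack ε_i ≥ 0) for all higher-priority objectives; this is a standard formalization of the setting rather than the authors' exact notation.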
5

Zhao, Weiye, Tairan He, Rui Chen, Tianhao Wei, and Changliu Liu. "State-wise Safe Reinforcement Learning: A Survey." In Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}. California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/763.

Full text
Abstract:
Despite the tremendous success of Reinforcement Learning (RL) algorithms in simulation environments, applying RL to real-world applications still faces many challenges. A major concern is safety, in other words, constraint satisfaction. State-wise constraints are one of the most common constraints in real-world applications and one of the most challenging constraints in Safe RL. Enforcing state-wise constraints is necessary and essential to many challenging tasks such as autonomous driving and robot manipulation. This paper provides a comprehensive review of existing approaches that address state-wise constraints in RL. Under the framework of State-wise Constrained Markov Decision Process (SCMDP), we will discuss the connections, differences, and trade-offs of existing approaches in terms of (i) safety guarantee and scalability, (ii) safety and reward performance, and (iii) safety after convergence and during training. We also summarize limitations of current methods and discuss potential future directions.
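The distinction the survey draws can be stated compactly: instead of bounding an expected cumulative cost as in the CMDP formulation shown earlier in this list, a state-wise constraint must hold at every step of the trajectory,

    c_i(s_t, a_t) \le w_i \qquad \text{for all } t \ \text{and} \ i = 1, \dots, m,

which is generally harder to enforce, especially during training (a generic statement of the SCMDP constraint, following the abstract).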
6

Sarafian, Elad, Aviv Tamar, and Sarit Kraus. "Constrained Policy Improvement for Efficient Reinforcement Learning." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/396.

Full text
Abstract:
We propose a policy improvement algorithm for Reinforcement Learning (RL) termed Rerouted Behavior Improvement (RBI). RBI is designed to take into account the evaluation errors of the Q-function. Such errors are common in RL when learning the Q-value from finite experience data. Greedy policies or even constrained policy optimization algorithms that ignore these errors may suffer from an improvement penalty (i.e., a policy impairment). To reduce the penalty, the idea of RBI is to attenuate rapid policy changes to actions that were rarely sampled. This approach is shown to avoid catastrophic performance degradation and reduce regret when learning from a batch of transition samples. Through a two-armed bandit example, we show that it also increases data efficiency when the optimal action has a high variance. We evaluate RBI in two tasks in the Atari Learning Environment: (1) learning from observations of multiple behavior policies and (2) iterative RL. Our results demonstrate the advantage of RBI over greedy policies and other constrained policy optimization algorithms both in learning from observations and in RL tasks.
7

Lee, Jongmin, Youngsoo Jang, Pascal Poupart, and Kee-Eung Kim. "Constrained Bayesian Reinforcement Learning via Approximate Linear Programming." In Twenty-Sixth International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization, 2017. http://dx.doi.org/10.24963/ijcai.2017/290.

Full text
Abstract:
In this paper, we consider the safe learning scenario where we need to restrict the exploratory behavior of a reinforcement learning agent. Specifically, we treat the problem as a form of Bayesian reinforcement learning in an environment that is modeled as a constrained MDP (CMDP) where the cost function penalizes undesirable situations. We propose a model-based Bayesian reinforcement learning (BRL) algorithm for such an environment, eliciting risk-sensitive exploration in a principled way. Our algorithm efficiently solves the constrained BRL problem by approximate linear programming, and generates a finite state controller in an off-line manner. We provide theoretical guarantees and demonstrate empirically that our approach outperforms the state of the art.
8

Chen, Weiqin, Dharmashankar Subramanian, and Santiago Paternain. "Policy Gradients for Probabilistic Constrained Reinforcement Learning." In 2023 57th Annual Conference on Information Sciences and Systems (CISS). IEEE, 2023. http://dx.doi.org/10.1109/ciss56502.2023.10089763.

Full text
9

Abe, Naoki, Melissa Kowalczyk, Mark Domick, Timothy Gardinier, Prem Melville, Cezar Pendus, Chandan K. Reddy, et al. "Optimizing debt collections using constrained reinforcement learning." In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, New York, USA: ACM Press, 2010. http://dx.doi.org/10.1145/1835804.1835817.

Full text
10

Malik, Shehryar, Muhammad Umair Haider, Omer Iqbal, and Murtaza Taj. "Neural Network Pruning Through Constrained Reinforcement Learning." In 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022. http://dx.doi.org/10.1109/icpr56361.2022.9956050.

Full text