Journal articles on the topic 'Constrained Reinforcement Learning'

Consult the top 50 journal articles for your research on the topic 'Constrained Reinforcement Learning.'

1

Pankayaraj, Pathmanathan, and Pradeep Varakantham. "Constrained Reinforcement Learning in Hard Exploration Problems." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 12 (June 26, 2023): 15055–63. http://dx.doi.org/10.1609/aaai.v37i12.26757.

Abstract:
One approach to guaranteeing safety in Reinforcement Learning is through cost constraints that are dependent on the policy. Recent works in constrained RL have developed methods that ensure constraints are enforced even at learning time while maximizing the overall value of the policy. Unfortunately, as demonstrated in our experimental results, such approaches do not perform well on complex multi-level tasks, with longer episode lengths or sparse rewards. To that end, we propose a scalable hierarchical approach for constrained RL problems that employs backward cost value functions in the context of task hierarchy and a novel intrinsic reward function in lower levels of the hierarchy to enable cost constraint enforcement. One of our key contributions is in proving that backward value functions are theoretically viable even when there are multiple levels of decision making. We also show that our new approach, referred to as Hierarchically Limited consTraint Enforcement (HiLiTE), significantly improves on state-of-the-art constrained RL approaches for many benchmark problems from the literature. We further demonstrate that this performance (on value and constraint enforcement) clearly outperforms existing best approaches for constrained RL and hierarchical RL.
2

HasanzadeZonuzy, Aria, Archana Bura, Dileep Kalathil, and Srinivas Shakkottai. "Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 7667–74. http://dx.doi.org/10.1609/aaai.v35i9.16937.

Abstract:
Many physical systems have underlying safety considerations that require that the policy employed ensures the satisfaction of a set of constraints. The analytical formulation usually takes the form of a Constrained Markov Decision Process (CMDP). We focus on the case where the CMDP is unknown, and RL algorithms obtain samples to discover the model and compute an optimal constrained policy. Our goal is to characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy, in terms of both objective maximization and constraint satisfaction, in a PAC sense. We explore two classes of RL algorithms, namely, (i) a generative-model-based approach, wherein samples are taken initially to estimate a model, and (ii) an online approach, wherein the model is updated as samples are obtained. Our main finding is that, compared to the best known bounds of the unconstrained regime, the sample complexity of constrained RL algorithms is increased by a factor that is logarithmic in the number of constraints, which suggests that the approach may be easily utilized in real systems.
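
For readers new to the area, the CMDP objective that recurs throughout the entries below can be written in a standard discounted form; the notation here is generic and the paper's exact setting (horizon, accuracy definitions) may differ:

```latex
\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t)\right]
\quad\text{subject to}\quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\, c_i(s_t,a_t)\right]\le d_i,
\qquad i=1,\dots,m,
```

where r is the reward, the c_i are cost signals, the d_i are constraint budgets, and gamma in [0,1) is the discount factor.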
3

Dai, Juntao, Jiaming Ji, Long Yang, Qian Zheng, and Gang Pan. "Augmented Proximal Policy Optimization for Safe Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (June 26, 2023): 7288–95. http://dx.doi.org/10.1609/aaai.v37i6.25888.

Abstract:
Safe reinforcement learning considers practical scenarios that maximize the return while satisfying safety constraints. Current algorithms, which suffer from training oscillations or approximation errors, still struggle to update the policy efficiently with precise constraint satisfaction. In this article, we propose Augmented Proximal Policy Optimization (APPO), which augments the Lagrangian function of the primal constrained problem via attaching a quadratic deviation term. The constructed multiplier-penalty function dampens cost oscillation for stable convergence while being equivalent to the primal constrained problem to precisely control safety costs. APPO alternately updates the policy and the Lagrangian multiplier via solving the constructed augmented primal-dual problem, which can be easily implemented by any first-order optimizer. We apply our APPO methods in diverse safety-constrained tasks, setting a new state of the art compared with a comprehensive list of safe RL baselines. Extensive experiments verify the merits of our method in easy implementation, stable convergence, and precise cost control.
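
As background for the multiplier-penalty idea described above, the textbook augmented Lagrangian for an inequality-constrained problem (minimize f(theta) subject to g(theta) <= 0) is sketched below; this is a generic form, not necessarily APPO's exact construction:

```latex
\mathcal{L}_{\rho}(\theta,\lambda)
  = f(\theta)
  + \frac{\rho}{2}\left[\max\!\left(0,\ \frac{\lambda}{\rho} + g(\theta)\right)\right]^{2}
  - \frac{\lambda^{2}}{2\rho},
\qquad
\lambda \leftarrow \max\!\bigl(0,\ \lambda + \rho\, g(\theta)\bigr).
```

In a safe-RL setting one would typically take f(theta) = -J_r(theta) (negated return) and g(theta) = J_c(theta) - d (cost above its limit), with rho > 0 the penalty weight.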
4

Bhatia, Abhinav, Pradeep Varakantham, and Akshat Kumar. "Resource Constrained Deep Reinforcement Learning." Proceedings of the International Conference on Automated Planning and Scheduling 29 (May 25, 2021): 610–20. http://dx.doi.org/10.1609/icaps.v29i1.3528.

Abstract:
In urban environments, resources have to be constantly matched to the “right” locations where customer demand is present. For instance, ambulances have to be matched to base stations regularly so as to reduce response time for emergency incidents in ERS (Emergency Response Systems); vehicles (cars, bikes, among others) have to be matched to docking stations to reduce lost demand in shared mobility systems. Such problems are challenging owing to the demand uncertainty, combinatorial action spaces and constraints on allocation of resources (e.g., total resources, minimum and maximum number of resources at locations and regions). Existing systems typically employ myopic and greedy optimization approaches to optimize resource allocation. Such approaches are typically unable to handle surges or variances in demand patterns well. Recent work has demonstrated the ability of Deep RL methods to adapt well to highly uncertain environments. However, existing Deep RL methods are unable to handle combinatorial action spaces and constraints on allocation of resources. To that end, we have developed three approaches on top of the well-known actor-critic approach, DDPG (Deep Deterministic Policy Gradient), that are able to handle constraints on resource allocation. We also demonstrate that they are able to outperform leading approaches on simulators validated on semi-real and real data sets.
5

Yang, Qisong, Thiago D. Simão, Simon H. Tindemans, and Matthijs T. J. Spaan. "WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10639–46. http://dx.doi.org/10.1609/aaai.v35i12.17272.

Abstract:
Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast it as constrained reinforcement learning, where expected long-term costs of policies are constrained. However, it can be hazardous to set constraints on the expected safety signal without considering the tail of the distribution. For instance, in safety-critical domains, worst-case analysis is required to avoid disastrous results. We present a novel reinforcement learning algorithm called Worst-Case Soft Actor Critic, which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety measure to judge the constraint satisfaction, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. The empirical analysis shows that our algorithm attains better risk control compared to expectation-based methods.
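
For context, the conditional Value-at-Risk used as the safety measure above is commonly written in the Rockafellar-Uryasev form; the risk-level convention below is generic and may differ from the paper's:

```latex
\mathrm{CVaR}_{\alpha}(C) \;=\; \min_{\nu\in\mathbb{R}}\ \left\{ \nu + \frac{1}{1-\alpha}\,\mathbb{E}\!\left[(C-\nu)^{+}\right] \right\},
\qquad
\text{safety constraint: } \mathrm{CVaR}_{\alpha}\!\left(C^{\pi}\right) \le d,
```

where C^pi is the random cumulative cost under policy pi; the closer alpha is to 1, the further into the tail of the cost distribution the constraint looks.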
6

Zhou, Zixian, Mengda Huang, Feiyang Pan, Jia He, Xiang Ao, Dandan Tu, and Qing He. "Gradient-Adaptive Pareto Optimization for Constrained Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 9 (June 26, 2023): 11443–51. http://dx.doi.org/10.1609/aaai.v37i9.26353.

Abstract:
Constrained Reinforcement Learning (CRL), which pursues maximizing long-term returns while constraining costs, has attracted broad interest in recent years. Although CRL can be cast as a multi-objective optimization problem, it still faces the key challenge that gradient-based Pareto optimization methods tend to stick to known Pareto-optimal solutions even when they yield poor returns (e.g., the safest self-driving car that never moves) or violate the constraints (e.g., the record-breaking racer that crashes the car). In this paper, we propose Gradient-adaptive Constrained Policy Optimization (GCPO for short), a novel Pareto optimization method for CRL with two adaptive gradient recalibration techniques. First, to find Pareto-optimal solutions with balanced performance over all targets, we propose gradient rebalancing, which forces the agent to improve more on under-optimized objectives at every policy iteration. Second, to guarantee that the cost constraints are satisfied, we propose gradient perturbation, which can temporarily sacrifice returns for costs. Experiments on the SafetyGym benchmarks show that our method consistently outperforms previous CRL methods in reward while satisfying the constraints.
7

He, Tairan, Weiye Zhao, and Changliu Liu. "AutoCost: Evolving Intrinsic Cost for Zero-Violation Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 12 (June 26, 2023): 14847–55. http://dx.doi.org/10.1609/aaai.v37i12.26734.

Abstract:
Safety is a critical hurdle that limits the application of deep reinforcement learning to real-world control tasks. To this end, constrained reinforcement learning leverages cost functions to improve safety in constrained Markov decision processes. However, constrained methods fail to achieve zero violation even when the cost limit is zero. This paper analyzes the reason for such failure, which suggests that a proper cost function plays an important role in constrained RL. Inspired by the analysis, we propose AutoCost, a simple yet effective framework that automatically searches for cost functions that help constrained RL to achieve zero-violation performance. We validate the proposed method and the searched cost function on the safety benchmark Safety Gym. We compare the performance of augmented agents that use our cost function to provide additive intrinsic costs to a Lagrangian-based policy learner and a constrained-optimization policy learner with baseline agents that use the same policy learners but with only extrinsic costs. Results show that the converged policies with intrinsic costs in all environments achieve zero constraint violation and performance comparable to the baselines.
8

Yang, Zhaoxing, Haiming Jin, Rong Ding, Haoyi You, Guiyun Fan, Xinbing Wang, and Chenghu Zhou. "DeCOM: Decomposed Policy for Constrained Cooperative Multi-Agent Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 9 (June 26, 2023): 10861–70. http://dx.doi.org/10.1609/aaai.v37i9.26288.

Abstract:
In recent years, multi-agent reinforcement learning (MARL) has presented impressive performance in various applications. However, physical limitations, budget restrictions, and many other factors usually impose constraints on a multi-agent system (MAS), which cannot be handled by traditional MARL frameworks. Specifically, this paper focuses on constrained MASes where agents work cooperatively to maximize the expected team-average return under various constraints on expected team-average costs, and develops a constrained cooperative MARL framework, named DeCOM, for such MASes. In particular, DeCOM decomposes the policy of each agent into two modules, which empowers information sharing among agents to achieve better cooperation. In addition, with such modularization, the training algorithm of DeCOM separates the original constrained optimization into an unconstrained optimization on reward and a constraints satisfaction problem on costs. DeCOM then iteratively solves these problems in a computationally efficient manner, which makes DeCOM highly scalable. We also provide theoretical guarantees on the convergence of DeCOM's policy update algorithm. Finally, we conduct extensive experiments to show the effectiveness of DeCOM with various types of costs in both moderate-scale and large-scale (with 500 agents) environments that originate from real-world applications.
9

Martins, Miguel S. E., Joaquim L. Viegas, Tiago Coito, Bernardo Marreiros Firme, João M. C. Sousa, João Figueiredo, and Susana M. Vieira. "Reinforcement Learning for Dual-Resource Constrained Scheduling." IFAC-PapersOnLine 53, no. 2 (2020): 10810–15. http://dx.doi.org/10.1016/j.ifacol.2020.12.2866.

10

Guenter, Florent, Micha Hersch, Sylvain Calinon, and Aude Billard. "Reinforcement learning for imitating constrained reaching movements." Advanced Robotics 21, no. 13 (January 1, 2007): 1521–44. http://dx.doi.org/10.1163/156855307782148550.

11

Chung, Jen Jen, Nicholas R. J. Lawrance, and Salah Sukkarieh. "Learning to soar: Resource-constrained exploration in reinforcement learning." International Journal of Robotics Research 34, no. 2 (December 16, 2014): 158–72. http://dx.doi.org/10.1177/0278364914553683.

12

Bai, Qinbo, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal. "Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Primal-Dual Approach." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 4 (June 28, 2022): 3682–89. http://dx.doi.org/10.1609/aaai.v36i4.20281.

Abstract:
Reinforcement learning is widely used in applications where one needs to perform sequential decisions while interacting with the environment. The problem becomes more challenging when the decision requirement includes satisfying some safety constraints. The problem is mathematically formulated as constrained Markov decision process (CMDP). In the literature, various algorithms are available to solve CMDP problems in a model-free manner to achieve epsilon-optimal cumulative reward with epsilon feasible policies. An epsilon-feasible policy implies that it suffers from constraint violation. An important question here is whether we can achieve epsilon-optimal cumulative reward with zero constraint violations or not. To achieve that, we advocate the use of a randomized primal-dual approach to solve the CMDP problems and propose a conservative stochastic primal-dual algorithm (CSPDA) which is shown to exhibit O(1/epsilon^2) sample complexity to achieve epsilon-optimal cumulative reward with zero constraint violations. In the prior works, the best available sample complexity for the epsilon-optimal policy with zero constraint violation is O(1/epsilon^5). Hence, the proposed algorithm provides a significant improvement compared to the state of the art.
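
The primal-dual idea referenced above can be illustrated with the plain Lagrangian template below; the CSPDA algorithm adds conservatism and randomization on top of this, so treat these updates as a generic sketch rather than the paper's method:

```latex
\theta_{k+1} = \theta_k + \eta_{\theta}\,\nabla_{\theta}\Bigl(\widehat{J}_r(\theta_k) - \lambda_k\,\widehat{J}_c(\theta_k)\Bigr),
\qquad
\lambda_{k+1} = \Bigl[\lambda_k + \eta_{\lambda}\bigl(\widehat{J}_c(\theta_k) - d\bigr)\Bigr]_{+},
```

where J_r-hat and J_c-hat are sampled estimates of the reward and cost objectives, d is the cost limit, and [.]_+ projects the multiplier onto the non-negative reals.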
13

Bai, Qinbo, Amrit Singh Bedi, and Vaneet Aggarwal. "Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (June 26, 2023): 6737–44. http://dx.doi.org/10.1609/aaai.v37i6.25826.

Abstract:
We consider the problem of constrained Markov decision processes (CMDPs) in continuous state-action spaces, where the goal is to maximize the expected cumulative reward subject to some constraints. We propose a novel Conservative Natural Policy Gradient Primal-Dual Algorithm (CNPGPD) to achieve zero constraint violation while achieving state-of-the-art convergence results for the objective value function. For general policy parametrization, we prove convergence of the value function to the global optimum up to an approximation error due to the restricted policy class. We improve the sample complexity of the existing constrained NPGPD algorithm. To the best of our knowledge, this is the first work to establish zero constraint violation with natural-policy-gradient-style algorithms for infinite-horizon discounted CMDPs. We demonstrate the merits of the proposed algorithm via experimental evaluations.
14

Zhao, Hang, Qijin She, Chenyang Zhu, Yin Yang, and Kai Xu. "Online 3D Bin Packing with Constrained Deep Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 1 (May 18, 2021): 741–49. http://dx.doi.org/10.1609/aaai.v35i1.16155.

Abstract:
We solve a challenging yet practically useful variant of the 3D Bin Packing Problem (3D-BPP). In our problem, the agent has limited information about the items to be packed into a single bin, and an item must be packed immediately after its arrival without buffering or readjusting. The item's placement is also subject to the constraints of order dependence and physical stability. We formulate this online 3D-BPP as a constrained Markov decision process (CMDP). To solve the problem, we propose an effective and easy-to-implement constrained deep reinforcement learning (DRL) method under the actor-critic framework. In particular, we introduce a prediction-and-projection scheme: The agent first predicts a feasibility mask for the placement actions as an auxiliary task and then uses the mask to modulate the action probabilities output by the actor during training. Such supervision and projection help the agent learn feasible policies very efficiently. Our method can be easily extended to handle lookahead items, multi-bin packing, and item re-orienting. We have conducted extensive evaluation showing that the learned policy significantly outperforms the state-of-the-art methods. A preliminary user study even suggests that our method might attain human-level performance.
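
The mask-modulation step described above (suppressing infeasible placements with a predicted feasibility mask) can be sketched in a few lines. The snippet below is illustrative only and is not taken from the paper's code; in practice the mask would come from the auxiliary prediction head.

```python
import torch


def masked_policy(logits: torch.Tensor, feasibility_mask: torch.Tensor):
    """Return a categorical distribution over placement actions.

    logits:            raw actor outputs, shape (num_actions,)
    feasibility_mask:  1.0 for feasible placements, 0.0 for infeasible ones
    """
    # Infeasible actions receive -inf logits, so the softmax assigns them
    # (numerically) zero probability and the agent can only sample feasible moves.
    masked_logits = logits.masked_fill(feasibility_mask == 0, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits)


# Usage sketch (actor and predicted_mask are hypothetical):
# dist = masked_policy(actor(state), predicted_mask)
# action = dist.sample()
```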
15

Petsagkourakis, P., I. O. Sandoval, E. Bradford, D. Zhang, and E. A. del Rio-Chanona. "Constrained Reinforcement Learning for Dynamic Optimization under Uncertainty." IFAC-PapersOnLine 53, no. 2 (2020): 11264–70. http://dx.doi.org/10.1016/j.ifacol.2020.12.361.

16

Pan, Elton, Panagiotis Petsagkourakis, Max Mowbray, Dongda Zhang, and Ehecatl Antonio del Rio-Chanona. "Constrained model-free reinforcement learning for process optimization." Computers & Chemical Engineering 154 (November 2021): 107462. http://dx.doi.org/10.1016/j.compchemeng.2021.107462.

17

Giuseppi, Alessandro, and Antonio Pietrabissa. "Chance-Constrained Control With Lexicographic Deep Reinforcement Learning." IEEE Control Systems Letters 4, no. 3 (July 2020): 755–60. http://dx.doi.org/10.1109/lcsys.2020.2979635.

18

Ge, Yangyang, Fei Zhu, Wei Huang, Peiyao Zhao, and Quan Liu. "Multi-agent cooperation Q-learning algorithm based on constrained Markov Game." Computer Science and Information Systems 17, no. 2 (2020): 647–64. http://dx.doi.org/10.2298/csis191220009g.

Abstract:
Multi-Agent systems have broad applications in the real world, yet their security performance is barely considered. Reinforcement learning is one of the most important methods for resolving Multi-Agent problems. At present, certain progress has been made in applying Multi-Agent reinforcement learning to robot systems, man-machine matches, automation, etc. However, in these areas, an agent may fall into unsafe states where it finds it difficult to bypass obstacles, receive information from other agents, and so on. Ensuring the safety of a Multi-Agent system is of great importance in such areas, where an agent may fall into dangerous states that are irreversible, causing great damage. To solve the safety problem, in this paper we introduce a Multi-Agent Cooperation Q-Learning Algorithm based on a Constrained Markov Game. In this method, safety constraints are added to the set of actions, and each agent, when interacting with the environment to search for optimal values, is restricted by the safety rules, so as to obtain an optimal policy that satisfies the security requirements. Since traditional Multi-Agent reinforcement learning algorithms are no longer suitable for the proposed model, a new solution is introduced for calculating the global optimal state-action function that satisfies the safety constraints. We take advantage of the Lagrange multiplier method to determine the optimal action that can be performed in the current state, on the premise of linearizing the constraint functions and under the condition that the state-action function and the constraint function are both differentiable, which not only improves the efficiency and accuracy of the algorithm but also guarantees the global optimal solution. The experiments verify the effectiveness of the algorithm.
19

Fachantidis, Anestis, Matthew Taylor, and Ioannis Vlahavas. "Learning to Teach Reinforcement Learning Agents." Machine Learning and Knowledge Extraction 1, no. 1 (December 6, 2017): 21–42. http://dx.doi.org/10.3390/make1010002.

Abstract:
In this article, we study the transfer learning model of action advice under a budget. We focus on reinforcement learning teachers providing action advice to heterogeneous students playing the game of Pac-Man under a limited advice budget. First, we examine several critical factors affecting advice quality in this setting, such as the average performance of the teacher, its variance and the importance of reward discounting in advising. The experiments show that the best performers are not always the best teachers and reveal the non-trivial importance of the coefficient of variation (CV) as a statistic for choosing policies that generate advice. The CV statistic relates variance to the corresponding mean. Second, the article studies policy learning for distributing advice under a budget. Whereas most methods in the relevant literature rely on heuristics for advice distribution, we formulate the problem as a learning one and propose a novel reinforcement learning algorithm capable of learning when to advise or not. The proposed algorithm is able to advise even when it does not have knowledge of the student’s intended action and needs significantly less training time compared to previous learning approaches. Finally, in this article, we argue that learning to advise under a budget is an instance of a more generic learning problem: Constrained Exploitation Reinforcement Learning.
20

Xu, Yizhen, Zhengyang Zhao, Peng Cheng, Zhuo Chen, Ming Ding, Branka Vucetic, and Yonghui Li. "Constrained Reinforcement Learning for Resource Allocation in Network Slicing." IEEE Communications Letters 25, no. 5 (May 2021): 1554–58. http://dx.doi.org/10.1109/lcomm.2021.3053612.

21

Mowbray, M., P. Petsagkourakis, E. A. del Rio-Chanona, and D. Zhang. "Safe chance constrained reinforcement learning for batch process control." Computers & Chemical Engineering 157 (January 2022): 107630. http://dx.doi.org/10.1016/j.compchemeng.2021.107630.

22

Poznyak, A. S., and K. Najim. "Learning through reinforcement for N-person repeated constrained games." IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics) 32, no. 6 (December 2002): 759–71. http://dx.doi.org/10.1109/tsmcb.2002.1049610.

23

Tsai, Ya-Yen, Bo Xiao, Edward Johns, and Guang-Zhong Yang. "Constrained-Space Optimization and Reinforcement Learning for Complex Tasks." IEEE Robotics and Automation Letters 5, no. 2 (April 2020): 683–90. http://dx.doi.org/10.1109/lra.2020.2965392.

24

Gao, Yuanqi, Wei Wang, Jie Shi, and Nanpeng Yu. "Batch-Constrained Reinforcement Learning for Dynamic Distribution Network Reconfiguration." IEEE Transactions on Smart Grid 11, no. 6 (November 2020): 5357–69. http://dx.doi.org/10.1109/tsg.2020.3005270.

25

Lin, Wei-Song, and Chen-Hong Zheng. "Constrained adaptive optimal control using a reinforcement learning agent." Automatica 48, no. 10 (October 2012): 2614–19. http://dx.doi.org/10.1016/j.automatica.2012.06.064.

26

Hu, Zhenzhen, and Wenyin Gong. "Constrained evolutionary optimization based on reinforcement learning using the objective function and constraints." Knowledge-Based Systems 237 (February 2022): 107731. http://dx.doi.org/10.1016/j.knosys.2021.107731.

27

Geibel, P., and F. Wysotzki. "Risk-Sensitive Reinforcement Learning Applied to Control under Constraints." Journal of Artificial Intelligence Research 24 (July 1, 2005): 81–108. http://dx.doi.org/10.1613/jair.1666.

Abstract:
In this paper, we consider Markov Decision Processes (MDPs) with error states. Error states are states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We will show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
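
The weighting scheme described above can be summarized compactly; the symbols below are generic (the paper uses its own notation) and are intended only to make the construction concrete:

```latex
\max_{\pi}\ V^{\pi}(s) \quad \text{s.t.}\quad \rho^{\pi}(s) \le \omega
\qquad\Longrightarrow\qquad
\text{optimize } V^{\pi}(s) - \xi\,\rho^{\pi}(s),
```

where rho^pi(s) is the probability of ever entering an error state from s under pi, omega is the user-specified risk threshold, and the weight xi >= 0 is adapted until a feasible, high-value policy is found.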
28

Szwarcfiter, Claudio, Yale T. Herer, and Avraham Shtub. "Balancing Project Schedule, Cost, and Value under Uncertainty: A Reinforcement Learning Approach." Algorithms 16, no. 8 (August 21, 2023): 395. http://dx.doi.org/10.3390/a16080395.

Abstract:
Industrial projects are plagued by uncertainties, often resulting in both time and cost overruns. This research introduces an innovative approach, employing Reinforcement Learning (RL), to address three distinct project management challenges within a setting of uncertain activity durations. The primary objective is to identify stable baseline schedules. The first challenge encompasses the multimode lean project management problem, wherein the goal is to maximize a project’s value function while adhering to both due date and budget chance constraints. The second challenge involves the chance-constrained critical chain buffer management problem in a multimode context. Here, the aim is to minimize the project delivery date while considering resource constraints and duration-chance constraints. The third challenge revolves around striking a balance between the project value and its net present value (NPV) within a resource-constrained multimode environment. To tackle these three challenges, we devised mathematical programming models, some of which were solved optimally. Additionally, we developed competitive RL-based algorithms and verified their performance against established benchmarks. Our RL algorithms consistently generated schedules that compared favorably with the benchmarks, leading to higher project values and NPVs and shorter schedules while staying within the stakeholders’ risk thresholds. The potential beneficiaries of this research are project managers and decision-makers who can use this approach to generate an efficient frontier of optimal project plans.
29

Qin, Chunbin, Yinliang Wu, Jishi Zhang, and Tianzeng Zhu. "Reinforcement Learning-Based Decentralized Safety Control for Constrained Interconnected Nonlinear Safety-Critical Systems." Entropy 25, no. 8 (August 2, 2023): 1158. http://dx.doi.org/10.3390/e25081158.

Abstract:
This paper addresses the problem of decentralized safety control (DSC) of constrained interconnected nonlinear safety-critical systems under reinforcement learning strategies, where asymmetric input constraints and security constraints are considered. To begin with, improved performance functions associated with the actuator estimates for each auxiliary subsystem are constructed. Then, the decentralized control problem with security constraints and asymmetric input constraints is transformed into an equivalent decentralized control problem with asymmetric input constraints using the barrier function. This approach ensures that safety-critical systems operate and learn optimal DSC policies within their safe global domains. Then, the optimal control strategy is shown to ensure that the entire system is uniformly ultimately bounded (UUB). In addition, all signals in the closed-loop auxiliary subsystem, based on Lyapunov theory, are uniformly ultimately bounded, and the effectiveness of the designed method is verified by practical simulation.
30

Ding, Yuhao, and Javad Lavaei. "Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (June 26, 2023): 7396–404. http://dx.doi.org/10.1609/aaai.v37i6.25900.

Abstract:
We consider primal-dual-based reinforcement learning (RL) in episodic constrained Markov decision processes (CMDPs) with non-stationary objectives and constraints, which plays a central role in ensuring the safety of RL in time-varying environments. In this problem, the reward/utility functions and the state transition functions are both allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain known variation budgets. Designing safe RL algorithms in time-varying environments is particularly challenging because of the need to integrate constraint violation reduction, safe exploration, and adaptation to the non-stationarity. To this end, we identify two alternative conditions on the time-varying constraints under which we can guarantee safety in the long run. We also propose the Periodically Restarted Optimistic Primal-Dual Proximal Policy Optimization (PROPD-PPO) algorithm that can coordinate with both conditions. Furthermore, a dynamic regret bound and a constraint violation bound are established for the proposed algorithm in both the linear kernel CMDP function approximation setting and the tabular CMDP setting under the two alternative conditions. This paper provides the first provably efficient algorithm for non-stationary CMDPs with safe exploration.
31

Fu, Yanbo, Wenjie Zhao, and Liu Liu. "Safe Reinforcement Learning for Transition Control of Ducted-Fan UAVs." Drones 7, no. 5 (May 22, 2023): 332. http://dx.doi.org/10.3390/drones7050332.

Abstract:
Ducted-fan tail-sitter unmanned aerial vehicles (UAVs) provide versatility and unique benefits, attracting significant attention in various applications. This study focuses on developing a safe reinforcement learning method for back-transition control between level flight mode and hover mode for ducted-fan tail-sitter UAVs. Our method enables transition control with a minimal altitude change and transition time while adhering to the velocity constraint. We employ the Trust Region Policy Optimization, Proximal Policy Optimization with Lagrangian, and Constrained Policy Optimization (CPO) algorithms for controller training, showcasing the superiority of the CPO algorithm and the necessity of the velocity constraint. The transition trajectory achieved using the CPO algorithm closely resembles the optimal trajectory obtained via the well-known GPOPS-II software with the SNOPT solver. Meanwhile, the CPO algorithm also exhibits strong robustness under unknown perturbations of UAV model parameters and wind disturbance.
32

Wei, Honghao, Xin Liu, and Lei Ying. "A Provably-Efficient Model-Free Algorithm for Infinite-Horizon Average-Reward Constrained Markov Decision Processes." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 4 (June 28, 2022): 3868–76. http://dx.doi.org/10.1609/aaai.v36i4.20302.

Abstract:
This paper presents a model-free reinforcement learning (RL) algorithm for infinite-horizon average-reward Constrained Markov Decision Processes (CMDPs). Considering a learning horizon K, which is sufficiently large, the proposed algorithm achieves sublinear regret and zero constraint violation. The bounds depend on the number of states S, the number of actions A, and two constants which are independent of the learning horizon K.
33

Qi, Qi, Wenbin Lin, Boyang Guo, Jinshan Chen, Chaoping Deng, Guodong Lin, Xin Sun, and Youjia Chen. "Augmented Lagrangian-Based Reinforcement Learning for Network Slicing in IIoT." Electronics 11, no. 20 (October 19, 2022): 3385. http://dx.doi.org/10.3390/electronics11203385.

Abstract:
Network slicing enables the multiplexing of independent logical networks on the same physical network infrastructure to provide different network services for different applications. The resource allocation problem involved in network slicing is typically a decision-making problem, falling within the scope of reinforcement learning. The advantage of adapting to dynamic wireless environments makes reinforcement learning a good candidate for problem solving. In this paper, to tackle the constrained mixed integer nonlinear programming problem in network slicing, we propose an augmented Lagrangian-based soft actor–critic (AL-SAC) algorithm. In this algorithm, a hierarchical action selection network is designed to handle the hybrid action space. More importantly, inspired by the augmented Lagrangian method, both neural networks for Lagrange multipliers and a penalty item are introduced to deal with the constraints. Experiment results show that the proposed AL-SAC algorithm can strictly satisfy the constraints, and achieve better performance than other benchmark algorithms.
34

Pocius, Rey, Lawrence Neal, and Alan Fern. "Strategic Tasks for Explainable Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 10007–8. http://dx.doi.org/10.1609/aaai.v33i01.330110007.

Abstract:
Commonly used sequential decision making tasks such as the games in the Arcade Learning Environment (ALE) provide rich observation spaces suitable for deep reinforcement learning. However, they consist mostly of low-level control tasks which are of limited use for the development of explainable artificial intelligence (XAI) due to the fine temporal resolution of the tasks. Many of these domains also lack built-in high-level abstractions and symbols. Existing tasks that provide for both strategic decision-making and rich observation spaces are either difficult to simulate or are intractable. We provide a set of new strategic decision-making tasks specialized for the development and evaluation of explainable AI methods, built as constrained mini-games within the StarCraft II Learning Environment.
35

Dinu, Alexandru, and Petre Lucian Ogrutan. "Reinforcement Learning Made Affordable for Hardware Verification Engineers." Micromachines 13, no. 11 (November 1, 2022): 1887. http://dx.doi.org/10.3390/mi13111887.

Abstract:
Constrained random stimulus generation is no longer sufficient to fully simulate the functionality of a digital design. The increasing complexity of today’s hardware devices must be supported by powerful development and simulation environments, powerful computational mechanisms, and appropriate software to exploit them. Reinforcement learning, a powerful technique belonging to the field of artificial intelligence, provides the means to efficiently exploit computational resources to find even the least obvious correlations between configuration parameters, stimuli applied to digital design inputs, and their functional states. This paper, in which a novel software system is used to simplify the analysis of simulation outputs and the generation of input stimuli through reinforcement learning methods, provides important details about the setup of the proposed method to automate the verification process. By understanding how to properly configure a reinforcement algorithm to fit the specifics of a digital design, verification engineers can more quickly adopt this automated and efficient stimulus generation method (compared with classical verification) to bring the digital design to a desired functional state. The results obtained are most promising, with even 52 times fewer steps needed to reach a target state using reinforcement learning than when constrained random stimulus generation was used.
36

Brosowsky, Mathis, Florian Keck, Olaf Dünkel, and Marius Zöllner. "Sample-Specific Output Constraints for Neural Networks." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 6812–21. http://dx.doi.org/10.1609/aaai.v35i8.16841.

Abstract:
It is common practice to constrain the output space of a neural network with the final layer to a problem-specific value range. However, for many tasks it is desired to restrict the output space for each input independently to a different subdomain with a non-trivial geometry, e.g. in safety-critical applications, to exclude hazardous outputs sample-wise. We propose ConstraintNet—a scalable neural network architecture which constrains the output space in each forward pass independently. Contrary to prior approaches, which perform a projection in the final layer, ConstraintNet applies an input-dependent parametrization of the constrained output space. Thereby, the complete interior of the constrained region is covered and computational costs are reduced significantly. For constraints in form of convex polytopes, we leverage the vertex representation to specify the parametrization. The second modification consists of adding an auxiliary input in form of a tensor description of the constraint to enable the handling of multiple constraints for the same sample. Finally, ConstraintNet is end-to-end trainable with almost no overhead in the forward and backward pass. We demonstrate ConstraintNet on two regression tasks: First, we modify a CNN and construct several constraints for facial landmark detection tasks. Second, we demonstrate the application to a follow object controller for vehicles and accomplish safe reinforcement learning in this case. In both experiments, ConstraintNet improves performance and we conclude that our approach is promising for applying neural networks in safety-critical environments.
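
One natural way to realize the vertex-based parametrization mentioned above is to output a point as a convex combination of the polytope's vertices, with the combination weights produced by a softmax. The snippet below is a sketch under that assumption and is not the paper's reference implementation.

```python
import torch


def constrained_output(weights_logits: torch.Tensor, vertices: torch.Tensor) -> torch.Tensor:
    """Map unconstrained network outputs into a convex polytope.

    weights_logits: shape (k,)   unconstrained scores, one per vertex
    vertices:       shape (k, d) vertices of the sample-specific polytope
    """
    # Softmax yields non-negative weights summing to one, so the result is a
    # convex combination of the vertices and therefore lies inside the polytope.
    weights = torch.softmax(weights_logits, dim=0)
    return weights @ vertices
```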
37

Parsonson, Christopher W. F., Alexandre Laterre, and Thomas D. Barrett. "Reinforcement Learning for Branch-and-Bound Optimisation Using Retrospective Trajectories." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 4 (June 26, 2023): 4061–69. http://dx.doi.org/10.1609/aaai.v37i4.25521.

Abstract:
Combinatorial optimisation problems framed as mixed integer linear programmes (MILPs) are ubiquitous across a range of real-world applications. The canonical branch-and-bound algorithm seeks to exactly solve MILPs by constructing a search tree of increasingly constrained sub-problems. In practice, its solving time performance is dependent on heuristics, such as the choice of the next variable to constrain ('branching'). Recently, machine learning (ML) has emerged as a promising paradigm for branching. However, prior works have struggled to apply reinforcement learning (RL), citing sparse rewards, difficult exploration, and partial observability as significant challenges. Instead, leading ML methodologies resort to approximating high quality handcrafted heuristics with imitation learning (IL), which precludes the discovery of novel policies and requires expensive data labelling. In this work, we propose retro branching; a simple yet effective approach to RL for branching. By retrospectively deconstructing the search tree into multiple paths each contained within a sub-tree, we enable the agent to learn from shorter trajectories with more predictable next states. In experiments on four combinatorial tasks, our approach enables learning-to-branch without any expert guidance or pre-training. We outperform the current state-of-the-art RL branching algorithm by 3-5x and come within 20% of the best IL method's performance on MILPs with 500 constraints and 1000 variables, with ablations verifying that our retrospectively constructed trajectories are essential to achieving these results.
38

Brázdil, Tomáš, Krishnendu Chatterjee, Petr Novotný, and Jiří Vahala. "Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 06 (April 3, 2020): 9794–801. http://dx.doi.org/10.1609/aaai.v34i06.6531.

Abstract:
Markov decision processes (MDPs) are the de facto framework for sequential decision making in the presence of stochastic uncertainty. A classical optimization criterion for MDPs is to maximize the expected discounted-sum payoff, which ignores low-probability catastrophic events with a highly negative impact on the system. On the other hand, risk-averse policies require the probability of undesirable events to be below a given threshold, but they do not account for optimization of the expected payoff. We consider MDPs with discounted-sum payoff with failure states which represent catastrophic outcomes. The objective of risk-constrained planning is to maximize the expected discounted-sum payoff among risk-averse policies that ensure the probability to encounter a failure state is below a desired threshold. Our main contribution is an efficient risk-constrained planning algorithm that combines UCT-like search with a predictor learned through interaction with the MDP (in the style of AlphaZero) and with risk-constrained action selection via linear programming. We demonstrate the effectiveness of our approach with experiments on classical MDPs from the literature, including benchmarks with on the order of 10^6 states.
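
The risk-constrained objective described above can be stated as follows; the notation is generic rather than the paper's:

```latex
\max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty}\gamma^{t} r_t\right]
\quad\text{subject to}\quad
\Pr_{\pi}\bigl[\exists\, t:\ s_t \in F\bigr] \le \Delta,
```

where F is the set of failure states and Delta is the user-specified risk threshold.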
39

Zhang, Hongchang, Jianzhun Shao, Yuhang Jiang, Shuncheng He, Guanwen Zhang, and Xiangyang Ji. "State Deviation Correction for Offline Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 9022–30. http://dx.doi.org/10.1609/aaai.v36i8.20886.

Abstract:
Offline reinforcement learning aims to maximize the expected cumulative rewards with a fixed collection of data. The basic principle of current offline reinforcement learning methods is to restrict the policy to the offline dataset action space. However, they ignore the case where the dataset's trajectories fail to cover the state space completely. Especially, when the dataset's size is limited, it is likely that the agent would encounter unseen states during test time. Prior policy-constrained methods are incapable of correcting the state deviation, and may lead the agent to its unexpected regions further. In this paper, we propose the state deviation correction (SDC) method to constrain the policy's induced state distribution by penalizing the out-of-distribution states which might appear during the test period. We first perturb the states sampled from the logged dataset, then simulate noisy next states on the basis of a dynamics model and the policy. We then train the policy to minimize the distances between the noisy next states and the offline dataset. In this manner, we allow the trained policy to guide the agent to its familiar regions. Experimental results demonstrate that our proposed method is competitive with the state-of-the-art methods in a GridWorld setup, offline Mujoco control suite, and a modified offline Mujoco dataset with a finite number of valuable samples.
40

Bai, Fengshuo, Hongming Zhang, Tianyang Tao, Zhiheng Wu, Yanna Wang, and Bo Xu. "PiCor: Multi-Task Deep Reinforcement Learning with Policy Correction." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (June 26, 2023): 6728–36. http://dx.doi.org/10.1609/aaai.v37i6.25825.

Abstract:
Multi-task deep reinforcement learning (DRL) ambitiously aims to train a general agent that masters multiple tasks simultaneously. However, the varying learning speeds of different tasks, compounded with negative gradient interference, make policy learning inefficient. In this work, we propose PiCor, an efficient multi-task DRL framework that splits learning into policy optimization and policy correction phases. The policy optimization phase improves the policy with any DRL algorithm on the sampled single task without considering other tasks. The policy correction phase first constructs an adaptively adjusted performance constraint set. Then the intermediate policy learned by the first phase is constrained to the set, which controls the negative interference and balances the learning speeds across tasks. Empirically, we demonstrate that PiCor outperforms previous methods and significantly improves sample efficiency on simulated robotic manipulation and continuous control tasks. We additionally show that adaptive weight adjusting can further improve data efficiency and performance.
41

Lee, Xian Yeow, Sambit Ghadai, Kai Liang Tan, Chinmay Hegde, and Soumik Sarkar. "Spatiotemporally Constrained Action Space Attacks on Deep Reinforcement Learning Agents." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 4577–84. http://dx.doi.org/10.1609/aaai.v34i04.5887.

Abstract:
Robustness of Deep Reinforcement Learning (DRL) algorithms towards adversarial attacks in real-world applications, such as those deployed in cyber-physical systems (CPS), is of increasing concern. Numerous studies have investigated the mechanisms of attacks on the RL agent's state space. Nonetheless, attacks on the RL agent's action space (corresponding to actuators in engineering systems) are equally perverse, but such attacks are relatively less studied in the ML literature. In this work, we first frame the problem as an optimization problem of minimizing the cumulative reward of an RL agent with decoupled constraints as the budget of attack. We propose the white-box Myopic Action Space (MAS) attack algorithm that distributes the attacks across the action space dimensions. Next, we reformulate the optimization problem above with the same objective function, but with a temporally coupled constraint on the attack budget to take into account the approximated dynamics of the agent. This leads to the white-box Look-ahead Action Space (LAS) attack algorithm that distributes the attacks across the action and temporal dimensions. Our results showed that, using the same amount of resources, the LAS attack degrades the agent's performance significantly more than the MAS attack. This reveals the possibility that with limited resources, an adversary can utilize the agent's dynamics to malevolently craft attacks that cause the agent to fail. Additionally, we leverage these attack strategies as a possible tool to gain insights on the potential vulnerabilities of DRL agents.
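
The budgeted action-space attack described above can be framed, in schematic notation (not the paper's exact MAS/LAS formulations), as:

```latex
\min_{\delta_{1:T}}\ \mathbb{E}\!\left[\sum_{t=1}^{T} r\bigl(s_t,\ a_t + \delta_t\bigr)\right]
\quad\text{s.t.}\quad
\lVert \delta_t \rVert \le b\ \ \forall t \ \ \text{(decoupled, MAS-style)}
\quad\text{or}\quad
\sum_{t=1}^{T}\lVert \delta_t \rVert \le B \ \ \text{(temporally coupled, LAS-style)},
```

where delta_t is the perturbation injected into the agent's action (actuator) at step t and b, B are per-step and total attack budgets.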
42

Costero, Luis, Arman Iranfar, Marina Zapater, Francisco D. Igual, Katzalin Olcoz, and David Atienza. "Resource Management for Power-Constrained HEVC Transcoding Using Reinforcement Learning." IEEE Transactions on Parallel and Distributed Systems 31, no. 12 (December 1, 2020): 2834–50. http://dx.doi.org/10.1109/tpds.2020.3004735.

43

Uchibe, Eiji, and Kenji Doya. "Finding intrinsic rewards by embodied evolution and constrained reinforcement learning." Neural Networks 21, no. 10 (December 2008): 1447–55. http://dx.doi.org/10.1016/j.neunet.2008.09.013.

44

Li, Hepeng, Zhiqiang Wan, and Haibo He. "Constrained EV Charging Scheduling Based on Safe Deep Reinforcement Learning." IEEE Transactions on Smart Grid 11, no. 3 (May 2020): 2427–39. http://dx.doi.org/10.1109/tsg.2019.2955437.

45

Wang, Huiwei, Tingwen Huang, Xiaofeng Liao, Haitham Abu-Rub, and Guo Chen. "Reinforcement Learning for Constrained Energy Trading Games With Incomplete Information." IEEE Transactions on Cybernetics 47, no. 10 (October 2017): 3404–16. http://dx.doi.org/10.1109/tcyb.2016.2539300.

46

Dong, Wenbo, Shaofan Liu, and Shiliang Sun. "Safe batch constrained deep reinforcement learning with generative adversarial network." Information Sciences 634 (July 2023): 259–70. http://dx.doi.org/10.1016/j.ins.2023.03.108.

47

Korivand, Soroush, Nader Jalili, and Jiaqi Gong. "Inertia-Constrained Reinforcement Learning to Enhance Human Motor Control Modeling." Sensors 23, no. 5 (March 1, 2023): 2698. http://dx.doi.org/10.3390/s23052698.

Abstract:
Locomotor impairment is a highly prevalent and significant source of disability and significantly impacts the quality of life of a large portion of the population. Despite decades of research on human locomotion, challenges remain in simulating human movement to study the features of musculoskeletal drivers and clinical conditions. Most recent efforts to utilize reinforcement learning (RL) techniques are promising in the simulation of human locomotion and reveal musculoskeletal drives. However, these simulations often fail to mimic natural human locomotion because most reinforcement strategies have yet to consider any reference data regarding human movement. To address these challenges, in this study, we designed a reward function based on the trajectory optimization rewards (TOR) and bio-inspired rewards, which includes the rewards obtained from reference motion data captured by a single Inertial Moment Unit (IMU) sensor. The sensor was equipped on the participants’ pelvis to capture reference motion data. We also adapted the reward function by leveraging previous research on walking simulations for TOR. The experimental results showed that the simulated agents with the modified reward function performed better in mimicking the collected IMU data from participants, which means that the simulated human locomotion was more realistic. As a bio-inspired defined cost, IMU data enhanced the agent’s capacity to converge during the training process. As a result, the models’ convergence was faster than those developed without reference motion data. Consequently, human locomotion can be simulated more quickly and in a broader range of environments, with a better simulation performance.
48

Jing, Mingxuan, Xiaojian Ma, Wenbing Huang, Fuchun Sun, Chao Yang, Bin Fang, and Huaping Liu. "Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 5109–16. http://dx.doi.org/10.1609/aaai.v34i04.5953.

Abstract:
In this paper, we study Reinforcement Learning from Demonstrations (RLfD), which improves the exploration efficiency of Reinforcement Learning (RL) by providing expert demonstrations. Most existing RLfD methods require demonstrations to be perfect and sufficient, which is often unrealistic in practice. To work with imperfect demonstrations, we first define an imperfect expert setting for RLfD in a formal way, and then point out that previous methods suffer from two issues in terms of optimality and convergence, respectively. Building on the theoretical findings we have derived, we tackle these two issues by regarding the expert guidance as a soft constraint on regulating the policy exploration of the agent, which eventually leads to a constrained optimization problem. We further demonstrate that such a problem can be addressed efficiently by performing a local linear search on its dual form. Considerable empirical evaluations on a comprehensive collection of benchmarks indicate that our method attains consistent improvement over other RLfD counterparts.
49

Ma, Jing, So Hasegawa, Song-Ju Kim, and Mikio Hasegawa. "A Reinforcement-Learning-Based Distributed Resource Selection Algorithm for Massive IoT." Applied Sciences 9, no. 18 (September 6, 2019): 3730. http://dx.doi.org/10.3390/app9183730.

Abstract:
Massive IoT including the large number of resource-constrained IoT devices has gained great attention. IoT devices generate enormous traffic, which causes network congestion. To manage network congestion, multi-channel-based algorithms are proposed. However, most of the existing multi-channel algorithms require strict synchronization, an extra overhead for negotiating channel assignment, which poses significant challenges to resource-constrained IoT devices. In this paper, a distributed channel selection algorithm utilizing the tug-of-war (TOW) dynamics is proposed for improving successful frame delivery of the whole network by letting IoT devices always select suitable channels for communication adaptively. The proposed TOW dynamics-based channel selection algorithm has a simple reinforcement learning procedure that only needs to receive the acknowledgment (ACK) frame for the learning procedure, while simply requiring minimal memory and computation capability. Thus, the proposed TOW dynamics-based algorithm can run on resource-constrained IoT devices. We prototype the proposed algorithm on an extremely resource-constrained single-board computer, which hereafter is called the cognitive-IoT prototype. Moreover, the cognitive-IoT prototype is densely deployed in a frequently-changing radio environment for evaluation experiments. The evaluation results show that the cognitive-IoT prototype accurately and adaptively makes decisions to select the suitable channel when the real environment regularly varies. Accordingly, the successful frame ratio of the network is improved.
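
The ACK-only learning loop described above can be illustrated with a minimal bandit-style score update. This sketch is a generic stand-in and does not reproduce the paper's tug-of-war (TOW) dynamics; the transmit_on call is hypothetical.

```python
import random


def select_channel(scores, epsilon=0.1):
    """Pick a channel: mostly the highest-scoring one, occasionally a random one."""
    if random.random() < epsilon:
        return random.randrange(len(scores))
    return max(range(len(scores)), key=lambda ch: scores[ch])


def update_scores(scores, channel, ack_received, reward=1.0, penalty=1.0):
    """Reinforce a channel when its frame is acknowledged, weaken it otherwise."""
    scores[channel] += reward if ack_received else -penalty


# Usage sketch for one transmission round on a 4-channel device:
# scores = [0.0, 0.0, 0.0, 0.0]
# ch = select_channel(scores)
# ack = transmit_on(ch)          # hypothetical radio call returning True on ACK
# update_scores(scores, ch, ack)
```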
50

Ding, Zhenhuan, Xiaoge Huang, and Zhao Liu. "Active Exploration by Chance-Constrained Optimization for Voltage Regulation with Reinforcement Learning." Energies 15, no. 2 (January 16, 2022): 614. http://dx.doi.org/10.3390/en15020614.

Abstract:
Voltage regulation in distribution networks encounters a challenge of handling uncertainties caused by the high penetration of photovoltaics (PV). This research proposes an active exploration (AE) method based on reinforcement learning (RL) to respond to the uncertainties by regulating the voltage of a distribution network with battery energy storage systems (BESS). The proposed method integrates engineering knowledge to accelerate the training process of RL. The engineering knowledge is the chance-constrained optimization. We formulate the problem in a chance-constrained optimization with a linear load flow approximation. The optimization results are used to guide the action selection of the exploration for improving training efficiency and reducing the conserveness characteristic. The comparison of methods focuses on how BESSs are used, training efficiency, and robustness under varying uncertainties and BESS sizes. We implement the proposed algorithm, a chance-constrained optimization, and a traditional Q-learning in the IEEE 13 Node Test Feeder. Our evaluation shows that the proposed AE method has a better response to the training efficiency compared to traditional Q-learning. Meanwhile, the proposed method has advantages in BESS usage in conserveness compared to the chance-constrained optimization.