Journal articles on the topic 'Actor-critic algorithm'

Consult the top 50 journal articles for your research on the topic 'Actor-critic algorithm.'

1

Wang, Jing, and Ioannis Ch Paschalidis. "An Actor-Critic Algorithm With Second-Order Actor and Critic." IEEE Transactions on Automatic Control 62, no. 6 (June 2017): 2689–703. http://dx.doi.org/10.1109/tac.2016.2616384.

2

Zheng, Liyuan, Tanner Fiez, Zane Alumbaugh, Benjamin Chasnov, and Lillian J. Ratliff. "Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 9217–24. http://dx.doi.org/10.1609/aaai.v36i8.20908.

Abstract:
The hierarchical interaction between the actor and critic in actor-critic based reinforcement learning algorithms naturally lends itself to a game-theoretic interpretation. We adopt this viewpoint and model the actor and critic interaction as a two-player general-sum game with a leader-follower structure known as a Stackelberg game. Given this abstraction, we propose a meta-framework for Stackelberg actor-critic algorithms where the leader player follows the total derivative of its objective instead of the usual individual gradient. From a theoretical standpoint, we develop a policy gradient theorem for the refined update and provide a local convergence guarantee for the Stackelberg actor-critic algorithms to a local Stackelberg equilibrium. From an empirical standpoint, we demonstrate via simple examples that the learning dynamics we study mitigate cycling and accelerate convergence compared to the usual gradient dynamics given cost structures induced by actor-critic formulations. Finally, extensive experiments on OpenAI gym environments show that Stackelberg actor-critic algorithms always perform at least as well and often significantly outperform the standard actor-critic algorithm counterparts.
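
For orientation, the leader's total-derivative update described above is commonly written via the implicit function theorem. With leader loss L_1(\theta_1, \theta_2), follower loss L_2(\theta_1, \theta_2), and a follower that best-responds through \theta_2^*(\theta_1), one standard form (an assumption for illustration, not necessarily the paper's exact notation) is

    \nabla_{\theta_1} L_1\big(\theta_1, \theta_2^*(\theta_1)\big) = \nabla_1 L_1 - \big(\nabla_{21} L_2\big)^{\top} \big(\nabla_{22} L_2\big)^{-1} \nabla_2 L_1,

where the second term follows from differentiating the follower's stationarity condition \nabla_2 L_2 = 0 and replaces the usual individual gradient \nabla_1 L_1.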
3

Iwaki, Ryo, and Minoru Asada. "Implicit incremental natural actor critic algorithm." Neural Networks 109 (January 2019): 103–12. http://dx.doi.org/10.1016/j.neunet.2018.10.007.

4

Kim, Gi-Soo, Jane P. Kim, and Hyun-Joon Yang. "Robust Tests in Online Decision-Making." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 9 (June 28, 2022): 10016–24. http://dx.doi.org/10.1609/aaai.v36i9.21240.

Abstract:
Bandit algorithms are widely used in sequential decision problems to maximize the cumulative reward. One potential application is mobile health, where the goal is to promote the user's health through personalized interventions based on user specific information acquired through wearable devices. Important considerations include the type of, and frequency with which data is collected (e.g. GPS, or continuous monitoring), as such factors can severely impact app performance and users’ adherence. In order to balance the need to collect data that is useful with the constraint of impacting app performance, one needs to be able to assess the usefulness of variables. Bandit feedback data are sequentially correlated, so traditional testing procedures developed for independent data cannot apply. Recently, a statistical testing procedure was developed for the actor-critic bandit algorithm. An actor-critic algorithm maintains two separate models, one for the actor, the action selection policy, and the other for the critic, the reward model. The performance of the algorithm as well as the validity of the test are guaranteed only when the critic model is correctly specified. However, misspecification is frequent in practice due to incorrect functional form or missing covariates. In this work, we propose a modified actor-critic algorithm which is robust to critic misspecification and derive a novel testing procedure for the actor parameters in this case.
5

Sergey, Denisov, and Jee-Hyong Lee. "Actor-Critic Algorithm with Transition Cost Estimation." International Journal of Fuzzy Logic and Intelligent Systems 16, no. 4 (December 25, 2016): 270–75. http://dx.doi.org/10.5391/ijfis.2016.16.4.270.

6

Ahmed, Ayman Elshabrawy M. "Controller parameter tuning using actor-critic algorithm." IOP Conference Series: Materials Science and Engineering 610 (October 11, 2019): 012054. http://dx.doi.org/10.1088/1757-899x/610/1/012054.

7

Ding, Siyuan, Shengxiang Li, Guangyi Liu, Ou Li, Ke Ke, Yijie Bai, and Weiye Chen. "Decentralized Multiagent Actor-Critic Algorithm Based on Message Diffusion." Journal of Sensors 2021 (December 8, 2021): 1–14. http://dx.doi.org/10.1155/2021/8739206.

Abstract:
The exponential explosion of joint actions and massive data collection are two main challenges in multiagent reinforcement learning algorithms with centralized training. To overcome these problems, in this paper, we propose a model-free and fully decentralized actor-critic multiagent reinforcement learning algorithm based on message diffusion. To this end, the agents are assumed to be placed in a time-varying communication network. Each agent makes limited observations regarding the global state and joint actions; therefore, it needs to obtain and share information with others over the network. In the proposed algorithm, agents hold local estimations of the global state and joint actions and update them with local observations and the messages received from neighbors. Under the hypothesis of the global value decomposition, the gradient of the global objective function to an individual agent is derived. The convergence of the proposed algorithm with linear function approximation is guaranteed according to the stochastic approximation theory. In the experiments, the proposed algorithm was applied to a passive location task multiagent environment and achieved superior performance compared to state-of-the-art algorithms.
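
As a rough illustration of the message-diffusion idea (the mixing-matrix formulation and names below are assumptions, not the paper's notation), each agent can blend its neighbours' estimates of a global quantity with its own partial observation at every step:

    import numpy as np

    def diffuse_estimates(estimates, W, local_obs, alpha=0.5):
        """One diffusion step over a time-varying communication graph.
        estimates: (n_agents, d) local estimates of a global quantity
        W:         (n_agents, n_agents) row-stochastic mixing weights (zero where no link)
        local_obs: (n_agents, d) each agent's new partial observation
        """
        mixed = W @ estimates                              # average messages from neighbours
        return (1.0 - alpha) * mixed + alpha * local_obs   # blend in the agent's own observation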
8

Hafez, Muhammad Burhan, Cornelius Weber, Matthias Kerzel, and Stefan Wermter. "Deep intrinsically motivated continuous actor-critic for efficient robotic visuomotor skill learning." Paladyn, Journal of Behavioral Robotics 10, no. 1 (January 1, 2019): 14–29. http://dx.doi.org/10.1515/pjbr-2019-0005.

Abstract:
In this paper, we present a new intrinsically motivated actor-critic algorithm for learning continuous motor skills directly from raw visual input. Our neural architecture is composed of a critic and an actor network. Both networks receive the hidden representation of a deep convolutional autoencoder which is trained to reconstruct the visual input, while the centre-most hidden representation is also optimized to estimate the state value. Separately, an ensemble of predictive world models generates, based on its learning progress, an intrinsic reward signal which is combined with the extrinsic reward to guide the exploration of the actor-critic learner. Our approach is more data-efficient and inherently more stable than the existing actor-critic methods for continuous control from pixel data. We evaluate our algorithm for the task of learning robotic reaching and grasping skills on a realistic physics simulator and on a humanoid robot. The results show that the control policies learned with our approach can achieve better performance than the compared state-of-the-art and baseline algorithms in both dense-reward and challenging sparse-reward settings.
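
A minimal sketch of the reward combination described above, assuming the intrinsic signal is the learning progress (error decrease) of the world-model ensemble; the function and weighting below are illustrative, not the authors' code:

    def combined_reward(r_ext, prev_errors, curr_errors, beta=0.5):
        """Mix the extrinsic reward with an intrinsic learning-progress bonus."""
        progress = (sum(prev_errors) - sum(curr_errors)) / max(len(curr_errors), 1)
        r_int = max(progress, 0.0)      # reward only improvements in ensemble prediction
        return r_ext + beta * r_int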
9

Zhang, Haifei, Jian Xu, Jian Zhang, and Quan Liu. "Network Architecture for Optimizing Deep Deterministic Policy Gradient Algorithms." Computational Intelligence and Neuroscience 2022 (November 18, 2022): 1–10. http://dx.doi.org/10.1155/2022/1117781.

Abstract:
The traditional Deep Deterministic Policy Gradient (DDPG) algorithm has been widely used in continuous action spaces, but it still suffers from the problems of easily falling into local optima and large error fluctuations. Aiming at these deficiencies, this paper proposes a dual-actor-dual-critic DDPG algorithm (DN-DDPG). First, on the basis of the algorithm's original actor-critic network architecture, a second critic network is added to assist training, and the smaller Q value of the two critic networks is taken as the estimated value of the action in each update, which reduces the probability of falling into local optima. Second, the idea of a dual-actor network is introduced to alleviate the value underestimation produced by the dual-critic design, and the action with the greater value across the two actor networks is selected for the update, stabilizing the training process. Finally, the improved method is validated on four continuous action tasks provided by MuJoCo, and the results show that it can reduce the fluctuation range of error and improve the cumulative return compared with the classical algorithm.
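
The target computation implied by this abstract can be pictured as follows (a sketch under assumed interfaces, not the authors' code): the smaller of the two critics' estimates curbs overestimation, and taking the better of the two actors' proposals curbs the resulting underestimation.

    def dn_ddpg_target(r, s_next, done, mu1, mu2, q1, q2, gamma=0.99):
        """Bootstrap target for one transition; mu1/mu2 are target actors,
        q1/q2 are target critics, all plain callables returning floats."""
        a1, a2 = mu1(s_next), mu2(s_next)
        v_a1 = min(q1(s_next, a1), q2(s_next, a1))   # clipped double-critic estimate
        v_a2 = min(q1(s_next, a2), q2(s_next, a2))
        v_next = max(v_a1, v_a2)                     # keep the higher-valued actor proposal
        return r + gamma * (1.0 - done) * v_next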
10

Jain, Arushi, Gandharv Patil, Ayush Jain, Khimya Khetarpal, and Doina Precup. "Variance Penalized On-Policy and Off-Policy Actor-Critic." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 7899–907. http://dx.doi.org/10.1609/aaai.v35i9.16964.

Abstract:
Reinforcement learning algorithms are typically geared towards optimizing the expected return of an agent. However, in many practical applications, low variance in the return is desired to ensure the reliability of an algorithm. In this paper, we propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return. Previous work uses the second moment of return to estimate the variance indirectly. Instead, we use a much simpler recently proposed direct variance estimator which updates the estimates incrementally using temporal difference methods. Using the variance-penalized criterion, we guarantee the convergence of our algorithm to locally optimal policies for finite state action Markov decision processes. We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
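
One common way to write the criterion the abstract refers to (the paper's precise formulation and estimator may differ) is

    J_\lambda(\pi) = \mathbb{E}_\pi[G] - \lambda \, \mathrm{Var}_\pi[G],

where G is the return and \lambda \ge 0 trades expected performance against variability; the variance term is tracked by its own incremental TD-style estimator rather than through the second moment of the return.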
11

Hendzel, Zenon, and Marcin Szuster. "Discrete Action Dependant Heuristic Dynamic Programming in Control of a Wheeled Mobile Robot." Solid State Phenomena 164 (June 2010): 419–24. http://dx.doi.org/10.4028/www.scientific.net/ssp.164.419.

Abstract:
In the presented paper we propose a discrete tracking control algorithm for a two-wheeled mobile robot. The control algorithm consists of a discrete Adaptive Critic Design (ACD) in the Action Dependant Heuristic Dynamic Programming (ADHDP) configuration, a PD controller, and a supervisory term derived from the Lyapunov stability theorem and based on variable structure systems theory. Adaptive Critic Designs are a group of algorithms that use two independent structures: one for estimating the optimal value function from the Bellman equation and one for estimating the optimal control law. The ADHDP algorithm consists of an Actor (ASE - Associate Search Element) that estimates the optimal control law and a Critic (ACE - Adaptive Critic Element) that evaluates the quality of control by estimating the optimal value function from the Bellman equation. Both structures are realized in the form of Neural Networks (NN). The ADHDP algorithm does not require a plant model (the wheeled mobile robot (WMR) model) for the ACE or ASE neural network weight update procedure, in contrast with other ACD configurations, e.g. Heuristic Dynamic Programming or Dual Heuristic Programming, that use the plant model. In the presented control algorithm the Actor-Critic structure is supported by the PD controller and the supervisory term, which guarantee stable tracking in the initial learning phase of the adaptive critic neural networks and robustness in the face of disturbances. Verification of the proposed control algorithm was carried out on the two-wheeled mobile robot Pioneer-2DX.
12

Su, Jianyu, Stephen Adams, and Peter Beling. "Value-Decomposition Multi-Agent Actor-Critics." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (May 18, 2021): 11352–60. http://dx.doi.org/10.1609/aaai.v35i13.17353.

Abstract:
The exploitation of extra state information has been an active research area in multi-agent reinforcement learning (MARL). QMIX represents the joint action-value using a non-negative function approximator and achieves the best performance on the StarCraft II micromanagement testbed, a common MARL benchmark. However, our experiments demonstrate that, in some cases, QMIX performs sub-optimally with the A2C framework, a training paradigm that promotes algorithm training efficiency. To obtain a reasonable trade-off between training efficiency and algorithm performance, we extend value-decomposition to actor-critic methods that are compatible with A2C and propose a novel actor-critic framework, value-decomposition actor-critic (VDAC). We evaluate VDAC on the StarCraft II micromanagement task and demonstrate that the proposed framework improves median performance over other actor-critic methods. Furthermore, we use a set of ablation experiments to identify the key factors that contribute to the performance of VDAC.
13

Qiu, Shuang, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. "On Finite-Time Convergence of Actor-Critic Algorithm." IEEE Journal on Selected Areas in Information Theory 2, no. 2 (June 2021): 652–64. http://dx.doi.org/10.1109/jsait.2021.3078754.

14

Ding, Feng, Guanfeng Ma, Zhikui Chen, Jing Gao, and Peng Li. "Averaged Soft Actor-Critic for Deep Reinforcement Learning." Complexity 2021 (April 1, 2021): 1–16. http://dx.doi.org/10.1155/2021/6658724.

Abstract:
With the advent of the era of artificial intelligence, deep reinforcement learning (DRL) has achieved unprecedented success in high-dimensional and large-scale artificial intelligence tasks. However, the insecurity and instability of the DRL algorithm have an important impact on its performance. The Soft Actor-Critic (SAC) algorithm uses advanced functions to update the policy and value network to alleviate some of these problems. However, SAC still has some problems. In order to reduce the error caused by the overestimation of SAC, we propose a new SAC algorithm called Averaged-SAC. By averaging the previously learned action-state estimates, it reduces the overestimation problem of soft Q-learning, thereby contributing to a more stable training process and improving performance. We evaluate the performance of Averaged-SAC through some games in the MuJoCo environment. The experimental results show that the Averaged-SAC algorithm effectively improves the performance of the SAC algorithm and the stability of the training process.
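
A sketch of the averaging idea, assuming the last K critic snapshots are kept and their mean estimate enters the entropy-regularized bootstrap target (names and interfaces below are illustrative, not the authors' code):

    def averaged_q(critic_snapshots, s, a):
        return sum(q(s, a) for q in critic_snapshots) / len(critic_snapshots)

    def averaged_soft_target(r, s_next, done, sample_action, critic_snapshots, alpha, gamma=0.99):
        """sample_action(s) -> (action, log_prob) drawn from the current policy."""
        a_next, log_p = sample_action(s_next)
        q_avg = averaged_q(critic_snapshots, s_next, a_next)
        return r + gamma * (1.0 - done) * (q_avg - alpha * log_p)   # soft (entropy-regularized) value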
15

Hatakeyama, Hiroyuki, Shingo Mabu, Kotaro Hirasawa, and Jinglu Hu. "Genetic Network Programming with Actor-Critic." Journal of Advanced Computational Intelligence and Intelligent Informatics 11, no. 1 (January 20, 2007): 79–86. http://dx.doi.org/10.20965/jaciii.2007.p0079.

Abstract:
A new graph-based evolutionary algorithm named “Genetic Network Programming, GNP” has been already proposed. GNP represents its solutions as graph structures, which can improve the expression ability and performance. In addition, GNP with Reinforcement Learning (GNP-RL) was proposed a few years ago. Since GNP-RL can do reinforcement learning during task execution in addition to evolution after task execution, it can search for solutions efficiently. In this paper, GNP with Actor-Critic (GNP-AC) which is a new type of GNP-RL is proposed. Originally, GNP deals with discrete information, but GNP-AC aims to deal with continuous information. The proposed method is applied to the controller of the Khepera simulator and its performance is evaluated.
16

Hyeon, Soo-Jong, Tae-Young Kang, and Chang-Kyung Ryoo. "A Path Planning for Unmanned Aerial Vehicles Using SAC (Soft Actor Critic) Algorithm." Journal of Institute of Control, Robotics and Systems 28, no. 2 (February 28, 2022): 138–45. http://dx.doi.org/10.5302/j.icros.2022.21.0220.

17

Li, Xinzhou, Guifen Chen, Guowei Wu, Zhiyao Sun, and Guangjiao Chen. "Research on Multi-Agent D2D Communication Resource Allocation Algorithm Based on A2C." Electronics 12, no. 2 (January 10, 2023): 360. http://dx.doi.org/10.3390/electronics12020360.

Abstract:
Device to device (D2D) communication technology is the main component of future communication, which greatly improves the utilization of spectrum resources. However, in the D2D subscriber multiplex communication network, the interference between communication links is serious and the system performance is degraded. Traditional resource allocation schemes need a lot of channel information when dealing with interference problems in the system, and have the problems of weak dynamic resource allocation capability and low system throughput. Aiming at this challenge, this paper proposes a multi-agent D2D communication resource allocation algorithm based on Advantage Actor Critic (A2C). First, a multi-D2D cellular communication system model based on A2C Critic is established, then the parameters of the actor network and the critic network in the system are updated, and finally the resource allocation scheme of D2D users is dynamically and adaptively output. The simulation results show that compared with DQN (deep Q-network) and MAAC (multi-agent actor–critic), the average throughput of the system is improved by 26% and 12.5%, respectively.
18

Arvindhan, M., and D. Rajesh Kumar. "Adaptive Resource Allocation in Cloud Data Centers using Actor-Critical Deep Reinforcement Learning for Optimized Load Balancing." International Journal on Recent and Innovation Trends in Computing and Communication 11, no. 5s (May 18, 2023): 310–18. http://dx.doi.org/10.17762/ijritcc.v11i5s.6671.

Abstract:
This paper proposes a deep reinforcement learning-based actor-critic method for efficient resource allocation in cloud computing. The proposed method uses an actor network to generate the allocation strategy and a critic network to evaluate the quality of the allocation. The actor and critic networks are trained using a deep reinforcement learning algorithm to optimize the allocation strategy. The proposed method is evaluated using a simulation-based experimental study, and the results show that it outperforms several existing allocation methods in terms of resource utilization, energy efficiency and overall cost. Some algorithms for managing workloads or virtual machines have been developed in previous works in an effort to reduce energy consumption; however, these solutions often fail to take into account the highly dynamic nature of server states and are not implemented at a sufficiently large scale. In order to guarantee the QoS of workloads while simultaneously lowering the computational energy consumption of physical servers, this study proposes the Actor Critic based Compute-Intensive Workload Allocation Scheme (AC-CIWAS). AC-CIWAS captures the dynamic feature of server states in a continuous manner, and considers the influence of different workloads on energy consumption, to accomplish logical task allocation. In order to determine how best to allocate workloads in terms of energy efficiency, AC-CIWAS uses a Deep Reinforcement Learning (DRL)-based Actor Critic (AC) algorithm to calculate the projected cumulative return over time. Through simulation, we see that the proposed AC-CIWAS can schedule workloads with QoS assurance while achieving a reduction of around 20% compared to existing baseline allocation methods. The report also covers the ways in which the proposed technology could be used in cloud computing and offers suggestions for future study.
19

Seo, Kanghyeon, and Jihoon Yang. "Differentially Private Actor and Its Eligibility Trace." Electronics 9, no. 9 (September 10, 2020): 1486. http://dx.doi.org/10.3390/electronics9091486.

Abstract:
We present a differentially private actor and its eligibility trace in an actor-critic approach, wherein an actor takes actions directly interacting with an environment; however, the critic estimates only the state values that are obtained through bootstrapping. In other words, the actor reflects the more detailed information about the sequence of taken actions on its parameter than the critic. Moreover, their corresponding eligibility traces have the same properties. Therefore, it is necessary to preserve the privacy of an actor and its eligibility trace while training on private or sensitive data. In this paper, we confirm the applicability of differential privacy methods to the actors updated using the policy gradient algorithm and discuss the advantages of such an approach with regard to differentially private critic learning. In addition, we measured the cosine similarity between the differentially private applied eligibility trace and the non-differentially private eligibility trace to analyze whether their anonymity is appropriately protected in the differentially private actor or the critic. We conducted the experiments considering two synthetic examples imitating real-world problems in medical and autonomous navigation domains, and the results confirmed the feasibility of the proposed method.
20

Liao, Junrong, Shiyue Liu, Qinghe Wu, Jiabin Chen, and Fuhua Wei. "PID Control of Permanent Magnet Synchronous Motor Based on Improved Actor-Critic Framework." Journal of Physics: Conference Series 2213, no. 1 (March 1, 2022): 012005. http://dx.doi.org/10.1088/1742-6596/2213/1/012005.

Abstract:
Aiming at the shortcoming of the traditional PID control method that its parameters cannot be adjusted flexibly, an improved Actor-Critic reinforcement learning algorithm combined with incremental PID control is proposed to improve the control performance of the permanent magnet synchronous motor (PMSM). The strategy function of the Actor and the value function of the Critic are approximated by two back propagation (BP) neural networks, respectively. The simulation results show that the proposed algorithm achieves better control performance than the traditional PID control method.
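
For reference, the incremental (velocity-form) PID law that such a scheme builds on is

    \Delta u_k = K_p (e_k - e_{k-1}) + K_i e_k + K_d (e_k - 2 e_{k-1} + e_{k-2}), \qquad u_k = u_{k-1} + \Delta u_k,

with e_k the tracking error at step k; in the approach summarized above, the actor-critic learner adapts the controller online (exactly which quantities are adapted is specific to the paper and not reproduced here).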
21

Zhong, Shan, Quan Liu, and QiMing Fu. "Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning." Computational Intelligence and Neuroscience 2016 (2016): 1–15. http://dx.doi.org/10.1155/2016/4824072.

Abstract:
To improve the convergence rate and the sample efficiency, two efficient learning methods AC-HMLP and RAC-HMLP (AC-HMLP with l2-regularization) are proposed by combining actor-critic algorithm with hierarchical model learning and planning. The hierarchical models consisting of the local and the global models, which are learned at the same time during learning of the value function and the policy, are approximated by local linear regression (LLR) and linear function approximation (LFA), respectively. Both the local model and the global model are applied to generate samples for planning; the former is used only if the state-prediction error does not surpass the threshold at each time step, while the latter is utilized at the end of each episode. The purpose of taking both models is to improve the sample efficiency and accelerate the convergence rate of the whole algorithm through fully utilizing the local and global information. Experimentally, AC-HMLP and RAC-HMLP are compared with three representative algorithms on two Reinforcement Learning (RL) benchmark problems. The results demonstrate that they perform best in terms of convergence rate and sample efficiency.
22

HIROYASU, Tomoyuki, Akiyuki NAKAMURA, Mitsunori MIKI, Masato YOSHIMI, and Hisatake YOKOUCHI. "The Sensuous Lighting Control System using Actor-Critic Algorithm." Journal of Japan Society for Fuzzy Theory and Intelligent Informatics 23, no. 4 (2011): 501–12. http://dx.doi.org/10.3156/jsoft.23.501.

23

Borkar, V. S. "An actor-critic algorithm for constrained Markov decision processes." Systems & Control Letters 54, no. 3 (March 2005): 207–13. http://dx.doi.org/10.1016/j.sysconle.2004.08.007.

24

Li, Shuang, Yanghui Yan, Ju Ren, Yuezhi Zhou, and Yaoxue Zhang. "A Sample-Efficient Actor-Critic Algorithm for Recommendation Diversification." Chinese Journal of Electronics 29, no. 1 (January 1, 2020): 89–96. http://dx.doi.org/10.1049/cje.2019.10.004.

25

Shi, Wei, Long Chen, and Xia Zhu. "Task Offloading Decision-Making Algorithm for Vehicular Edge Computing: A Deep-Reinforcement-Learning-Based Approach." Sensors 23, no. 17 (September 1, 2023): 7595. http://dx.doi.org/10.3390/s23177595.

Abstract:
Efficient task offloading decision is a crucial technology in vehicular edge computing, which aims to fulfill the computational performance demands of complex vehicular tasks with respect to delay and energy consumption while minimizing network resource competition and consumption. Conventional distributed task offloading decisions rely solely on the local state of the vehicle, failing to optimize the utilization of the server's resources to its fullest potential. In addition, the mobility aspect of vehicles is often neglected in these decisions. In this paper, a cloud-edge-vehicle three-tier vehicular edge computing (VEC) system is proposed, where vehicles partially offload their computing tasks to edge or cloud servers while keeping the remaining tasks local to the vehicle terminals. Under the restrictions of vehicle mobility and discrete variables, task scheduling and task offloading proportion are jointly optimized with the objective of minimizing the total system cost. Considering the non-convexity, high-dimensional complex state and continuous action space requirements of the optimization problem, we propose a task offloading decision-making algorithm based on deep deterministic policy gradient (TODM_DDPG). The TODM_DDPG algorithm adopts the actor–critic framework, in which the actor network outputs floating point numbers to represent a deterministic policy, while the critic network evaluates the action output by the actor network and adjusts its evaluation according to the rewards from the environment to maximize the long-term reward. To explore the algorithm's performance, we conduct parameter-setting experiments to tune the algorithm's core hyper-parameters and select the optimal combination of parameters. In addition, in order to verify algorithm performance, we also carry out a series of comparative experiments with baseline algorithms. The results demonstrate that in terms of reducing system costs, the proposed algorithm outperforms the compared baseline algorithms, such as the deep Q network (DQN) and the actor–critic (AC), and the performance is improved by about 13% on average.
26

Zhou, Chengmin, Bingding Huang, and Pasi Fränti. "A review of motion planning algorithms for intelligent robots." Journal of Intelligent Manufacturing 33, no. 2 (November 25, 2021): 387–424. http://dx.doi.org/10.1007/s10845-021-01867-z.

Abstract:
Principles of typical motion planning algorithms are investigated and analyzed in this paper. These algorithms include traditional planning algorithms, classical machine learning algorithms, optimal value reinforcement learning, and policy gradient reinforcement learning. Traditional planning algorithms investigated include graph search algorithms, sampling-based algorithms, interpolating curve algorithms, and reaction-based algorithms. Classical machine learning algorithms include multiclass support vector machine, long short-term memory, Monte-Carlo tree search and convolutional neural network. Optimal value reinforcement learning algorithms include Q learning, deep Q-learning network, double deep Q-learning network, dueling deep Q-learning network. Policy gradient algorithms include policy gradient method, actor-critic algorithm, asynchronous advantage actor-critic, advantage actor-critic, deterministic policy gradient, deep deterministic policy gradient, trust region policy optimization and proximal policy optimization. New general criteria are also introduced to evaluate the performance and application of motion planning algorithms by analytical comparisons. The convergence speed and stability of optimal value and policy gradient algorithms are specially analyzed. Future directions are presented analytically according to principles and analytical comparisons of motion planning algorithms. This paper provides researchers with a clear and comprehensive understanding about advantages, disadvantages, relationships, and future of motion planning algorithms in robots, and paves ways for better motion planning algorithms in academia, engineering, and manufacturing.
27

Yue, Han, Jiapeng Liu, Dongmei Tian, and Qin Zhang. "A Novel Anti-Risk Method for Portfolio Trading Using Deep Reinforcement Learning." Electronics 11, no. 9 (May 7, 2022): 1506. http://dx.doi.org/10.3390/electronics11091506.

Abstract:
In the past decade, the application of deep reinforcement learning (DRL) in portfolio management has attracted extensive attention. However, most classical RL algorithms do not consider the exogenous factors and noise of financial time series data, which may lead to treacherous trading decisions. To address this issue, we propose a novel anti-risk portfolio trading method based on deep reinforcement learning (DRL). It consists of a stacked sparse denoising autoencoder (SSDAE) network and an actor–critic based reinforcement learning (RL) agent. The SSDAE is first trained off-line, while the decoder is then used for on-line feature extraction in each state. The SSDAE network is used for the noise resistance training of financial data. The actor–critic algorithm we use is advantage actor–critic (A2C) and consists of two networks: the actor network learns and implements an investment policy, which is then evaluated by the critic network to determine the best action plan by continuously redistributing various portfolio assets, taking the Sharpe ratio as the optimization function. Through extensive experiments, the results show that our proposed method is effective and superior to the Dow Jones Industrial Average index (DJIA), several variants of our proposed method, and a state-of-the-art (SOTA) method.
28

Nakamura, Yutaka, Takeshi Mori, Yoichi Tokita, Tomohiro Shibata, and Shin Ishii. "Off-Policy Natural Policy Gradient Method for a Biped Walking Using a CPG Controller." Journal of Robotics and Mechatronics 17, no. 6 (December 20, 2005): 636–44. http://dx.doi.org/10.20965/jrm.2005.p0636.

Abstract:
Referring to the mechanism of animals’ rhythmic movements, motor control schemes using a central pattern generator (CPG) controller have been studied. We previously proposed reinforcement learning (RL) called the CPG-actor-critic model, as an autonomous learning framework for a CPG controller. Here, we propose an off-policy natural policy gradient RL algorithm for the CPG-actor-critic model, to solve the “exploration-exploitation” problem by meta-controlling “behavior policy.” We apply this RL algorithm to an automatic control problem using a biped robot simulator. Computer simulation demonstrated that the CPG controller enables the biped robot to walk stably and efficiently based on our new algorithm.
29

Hwang, Ha Jun, Jaeyeon Jang, Jongkwan Choi, Jung Ho Bae, Sung Ho Kim, and Chang Ouk Kim. "Stepwise Soft Actor–Critic for UAV Autonomous Flight Control." Drones 7, no. 9 (August 24, 2023): 549. http://dx.doi.org/10.3390/drones7090549.

Abstract:
Despite the growing demand for unmanned aerial vehicles (UAVs), the use of conventional UAVs is limited, as most of them require being remotely operated by a person who is not within the vehicle’s field of view. Recently, many studies have introduced reinforcement learning (RL) to address hurdles for the autonomous flight of UAVs. However, most previous studies have assumed overly simplified environments, and thus, they cannot be applied to real-world UAV operation scenarios. To address the limitations of previous studies, we propose a stepwise soft actor–critic (SeSAC) algorithm for efficient learning in a continuous state and action space environment. SeSAC aims to overcome the inefficiency of learning caused by attempting challenging tasks from the beginning. Instead, it starts with easier missions and gradually increases the difficulty level during training, ultimately achieving the final goal. We also control a learning hyperparameter of the soft actor–critic algorithm and implement a positive buffer mechanism during training to enhance learning effectiveness. Our proposed algorithm was verified in a six-degree-of-freedom (DOF) flight environment with high-dimensional state and action spaces. The experimental results demonstrate that the proposed algorithm successfully completed missions in two challenging scenarios, one for disaster management and another for counter-terrorism missions, while surpassing the performance of other baseline approaches.
30

Takano, Toshiaki, Haruhiko Takase, Hiroharu Kawanaka, and Shinji Tsuruoka. "Merging with Extraction Method for Transfer Learning in Actor-Critic." Journal of Advanced Computational Intelligence and Intelligent Informatics 15, no. 7 (September 20, 2011): 814–21. http://dx.doi.org/10.20965/jaciii.2011.p0814.

Abstract:
This paper aims to accelerate the learning process of the actor-critic method, which is one of the major reinforcement learning algorithms, by transfer learning. Transfer learning accelerates learning on the target task by reusing knowledge of the source policies for each source task. In general, it consists of a selection phase and a training phase. Agents select source policies that are similar to the target one without trial and error, and train on the target task by referring to the selected policies. In this paper, we discuss the training phase; the rest of the training algorithm is based on our previous method. We propose an effective transfer method that consists of the extraction method and the merging method. Agents extract action preferences that are related to reliable states, and state values that lead to preferred states. Extracted parameters are merged into the current parameters by taking a weighted average. We apply the proposed algorithm to simple maze tasks and show the effectiveness of the proposed method: it reduces episodes by 16% and failures by 55% compared with learning without transfer.
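
The weighted-average merge described above can be pictured as follows (parameter names and the fixed weight are assumptions for illustration):

    def merge_parameters(current, extracted, weight=0.3):
        """Blend extracted source-task parameters into the current target-task parameters."""
        return [(1.0 - weight) * c + weight * e for c, e in zip(current, extracted)]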
31

Zhang, Haifeng, Weizhe Chen, Zeren Huang, Minne Li, Yaodong Yang, Weinan Zhang, and Jun Wang. "Bi-Level Actor-Critic for Multi-Agent Coordination." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 7325–32. http://dx.doi.org/10.1609/aaai.v34i05.6226.

Abstract:
Coordination is one of the essential problems in multi-agent systems. Typically multi-agent reinforcement learning (MARL) methods treat agents equally and the goal is to solve the Markov game to an arbitrary Nash equilibrium (NE) when multiple equilibria exist, thus lacking a solution for NE selection. In this paper, we treat agents unequally and consider Stackelberg equilibrium as a potentially better convergence point than Nash equilibrium in terms of Pareto superiority, especially in cooperative environments. Under Markov games, we formally define the bi-level reinforcement learning problem of finding a Stackelberg equilibrium. We propose a novel bi-level actor-critic learning method that allows agents to have different knowledge bases (and thus different levels of intelligence), while their actions can still be executed simultaneously and in a distributed manner. The convergence proof is given, while the resulting learning algorithm is tested against the state of the art. We found that the proposed bi-level actor-critic algorithm successfully converged to the Stackelberg equilibria in matrix games and found an asymmetric solution in a highway merge environment.
32

Wang, Hui, Peng Zhang, and Quan Liu. "An Actor-critic Algorithm Using Cross Evaluation of Value Functions." IAES International Journal of Robotics and Automation (IJRA) 7, no. 1 (March 1, 2018): 39. http://dx.doi.org/10.11591/ijra.v7i1.pp39-47.

Abstract:
In order to overcome the difficulty of learning a globally optimal policy caused by maximization bias in a continuous space, an actor-critic algorithm using cross evaluation of double value functions is proposed. Two independent value functions bring the critic's estimate closer to the real value function, and the actor is guided by a crossover function to choose its optimal actions. Cross evaluation of the value functions avoids the policy jitter phenomenon exhibited by greedy optimization methods in continuous spaces. The algorithm is more robust than the CACLA learning algorithm, and the experimental results show that our algorithm is smoother and that the stability of the policy is clearly improved while the amount of computation remains almost unchanged.
33

Lan, Xuejing, Zhifeng Tan, Tao Zou, and Wenbiao Xu. "CACLA-Based Trajectory Tracking Guidance for RLV in Terminal Area Energy Management Phase." Sensors 21, no. 15 (July 26, 2021): 5062. http://dx.doi.org/10.3390/s21155062.

Abstract:
This paper focuses on the trajectory tracking guidance problem for the Terminal Area Energy Management (TAEM) phase of the Reusable Launch Vehicle (RLV). Considering the continuous state and action space of this guidance problem, the Continuous Actor–Critic Learning Automata (CACLA) is applied to construct the guidance strategy of RLV. Two three-layer neuron networks are used to model the critic and actor of CACLA, respectively. The weight vectors of the critic are updated by the model-free Temporal Difference (TD) learning algorithm, which is improved by eligibility trace and momentum factor. The weight vectors of the actor are updated based on the sign of TD error, and a Gauss exploration is carried out in the actor. Finally, a Monte Carlo simulation and a comparison simulation are performed to show the effectiveness of the CACLA-based guidance strategy.
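
A compact sketch of the CACLA-style update referred to above: the critic is trained by temporal-difference learning as usual, while the actor is pulled toward the executed (exploratory) action only when the TD error is positive. The helper methods are hypothetical placeholders, not the paper's code:

    def cacla_step(actor, critic, s, a_taken, r, s_next, gamma=0.99):
        delta = r + gamma * critic.value(s_next) - critic.value(s)   # TD error
        critic.update(s, delta)               # e.g. TD learning with eligibility traces
        if delta > 0:                         # the explored action beat the expectation
            actor.fit_toward(s, a_taken)      # regress the actor output toward a_taken
        return delta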
34

Zhang, Shangtong, and Hengshuai Yao. "ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 5789–96. http://dx.doi.org/10.1609/aaai.v33i01.33015789.

Abstract:
In this paper, we propose an actor ensemble algorithm, named ACE, for continuous control with a deterministic policy in reinforcement learning. In ACE, we use actor ensemble (i.e., multiple actors) to search the global maxima of the critic. Besides the ensemble perspective, we also formulate ACE in the option framework by extending the option-critic architecture with deterministic intra-option policies, revealing a relationship between ensemble and options. Furthermore, we perform a look-ahead tree search with those actors and a learned value prediction model, resulting in a refined value estimation. We demonstrate a significant performance boost of ACE over DDPG and its variants in challenging physical robot simulators.
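
The ensemble action selection described above amounts to letting the critic pick among the actors' proposals; a minimal sketch under assumed interfaces:

    def ensemble_action(actors, critic, s):
        proposals = [mu(s) for mu in actors]           # one deterministic proposal per actor
        scores = [critic(s, a) for a in proposals]     # critic evaluates each proposal
        return proposals[scores.index(max(scores))]    # act on the highest-valued proposal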
35

KIMURA, Hajime, and Shigenobu KOBAYASHI. "An Actor-Critic Algorithm Using a Binary Tree Action Selector." Transactions of the Society of Instrument and Control Engineers 37, no. 12 (2001): 1147–55. http://dx.doi.org/10.9746/sicetr1965.37.1147.

36

Zhong, Shan, Jack Tan, Husheng Dong, Xuemei Chen, Shengrong Gong, and Zhenjiang Qian. "Modeling-Learning-Based Actor-Critic Algorithm with Gaussian Process Approximator." Journal of Grid Computing 18, no. 2 (April 18, 2020): 181–95. http://dx.doi.org/10.1007/s10723-020-09512-4.

37

Borkar, Vivek S., and Vijaymohan R. Konda. "The actor-critic algorithm as multi-time-scale stochastic approximation." Sadhana 22, no. 4 (August 1997): 525–43. http://dx.doi.org/10.1007/bf02745577.

38

Itoh, Hideaki, and Kazuyuki Aihara. "Combination of an actor/critic algorithm with goal-directed reasoning." Artificial Life and Robotics 5, no. 4 (December 2001): 233–41. http://dx.doi.org/10.1007/bf02481507.

39

Su, Jie-Ying, Jia-Lin Kang, and Shi-Shang Jang. "An Actor-Critic Algorithm for the Stochastic Cutting Stock Problem." Processes 11, no. 4 (April 13, 2023): 1203. http://dx.doi.org/10.3390/pr11041203.

Abstract:
The inventory level has a significant influence on the cost of process scheduling. The stochastic cutting stock problem (SCSP) is a complicated inventory-level scheduling problem due to the existence of random variables. In this study, we applied a model-free on-policy reinforcement learning (RL) approach based on a well-known RL method, called the Advantage Actor-Critic, to solve a SCSP example. To achieve the two goals of our RL model, namely, avoiding violating the constraints and minimizing cost, we proposed a two-stage discount factor algorithm to balance these goals during different training stages and adopted the game concept of an episode ending when an action violates any constraint. Experimental results demonstrate that our proposed method obtains solutions with low costs and is good at continuously generating actions that satisfy the constraints. Additionally, the two-stage discount factor algorithm trained the model faster while maintaining a good balance between the two aforementioned goals.
40

Wu, Zhenning, Yiming Deng, and Lixing Wang. "A Pinning Actor-Critic Structure-Based Algorithm for Sizing Complex-Shaped Depth Profiles in MFL Inspection with High Degree of Freedom." Complexity 2021 (April 23, 2021): 1–12. http://dx.doi.org/10.1155/2021/9995033.

Abstract:
One of the most efficient nondestructive methods for pipeline in-line inspection is magnetic flux leakage (MFL) inspection. Estimating the size of the defect from MFL signal is one of the key problems of MFL inspection. As the inspection signal is usually contaminated by noise, sizing the defect is an ill-posed inverse problem, especially when sizing the depth as a complex shape. An actor-critic structure-based algorithm is proposed in this paper for sizing complex depth profiles. By learning with more information from the depth profile without knowing the corresponding MFL signal, the algorithm proposed saves computational costs and is robust. A pinning strategy is embedded in the reconstruction process, which highly reduces the dimension of action space. The pinning actor-critic structure (PACS) helps to make the reward for critic network more efficient when reconstructing the depth profiles with high degrees of freedom. A nonlinear FEM model is used to test the effectiveness of algorithm proposed under 20 dB noise. The results show that the algorithm reconstructs the depth profile of defects with good accuracy and is robust against noise.
41

TAHAMI, EHSAN, AMIR HOMAYOUN JAFARI, and ALI FALLAH. "APPLICATION OF AN EVOLUTIONARY ACTOR–CRITIC REINFORCEMENT LEARNING METHOD FOR THE CONTROL OF A THREE-LINK MUSCULOSKELETAL ARM DURING A REACHING MOVEMENT." Journal of Mechanics in Medicine and Biology 13, no. 02 (April 2013): 1350040. http://dx.doi.org/10.1142/s0219519413500401.

Abstract:
In this paper, the control of a planar three-link musculoskeletal arm by using an evolutionary actor–critic reinforcement learning (RL) method during a reaching movement to a stationary target is presented. The arm model used in this study included three skeletal links (wrist, forearm, and upper arm), three joints (wrist, elbow, and shoulder without redundancy), and six non-linear monoarticular muscles (with redundancy), which were based on the Hill model. The learning control system was composed of actor, critic, and genetic algorithm (GA) parts. Two single-layer neural networks were used for each part of the actor and critic. This learning control system was used to apply six activation commands to six monoarticular muscles at each instant of time. It also used a reinforcement (reward) feedback for the learning process and controlling the direction of arm movement. Also, the GA was implemented to select the best learning rates for actor–critic neural networks. The results showed that mean square error (MSE) and average episode time gradually decrease and average reward gradually increases to constant values during the learning of the control policy. Furthermore, when learning was complete, optimal values of learning rates were selected.
42

Doya, Kenji. "Reinforcement Learning in Continuous Time and Space." Neural Computation 12, no. 1 (January 1, 2000): 219–45. http://dx.doi.org/10.1162/089976600300015961.

Abstract:
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods—a continuous actor-critic method and a value-gradient-based greedy policy—are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. The advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
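
The continuous-time TD error at the heart of this framework, for a value function V with discounting time constant \tau, is

    \delta(t) = r(t) - \frac{1}{\tau} V\big(x(t)\big) + \dot{V}\big(x(t)\big),

which follows from the self-consistency condition of the exponentially discounted value integral; the update rules summarized above are obtained by minimizing a squared form of \delta(t).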
43

Chen, Haibo, Zhongwei Huang, Xiaorong Zhao, Xiao Liu, Youjun Jiang, Pinyong Geng, Guang Yang, Yewen Cao, and Deqiang Wang. "Policy Optimization of the Power Allocation Algorithm Based on the Actor–Critic Framework in Small Cell Networks." Mathematics 11, no. 7 (April 2, 2023): 1702. http://dx.doi.org/10.3390/math11071702.

Abstract:
A practical solution to the power allocation problem in ultra-dense small cell networks can be achieved by using deep reinforcement learning (DRL) methods. Unlike traditional algorithms, DRL methods are capable of achieving low latency and operating without the need for global real-time channel state information (CSI). Based on the actor–critic framework, we propose a policy optimization of the power allocation algorithm (POPA) for small cell networks in this paper. The POPA adopts the proximal policy optimization (PPO) algorithm to update the policy, which has been shown to have stable exploration and convergence effects in our simulations. Thanks to our proposed actor–critic architecture with distributed execution and centralized exploration training, the POPA can meet real-time requirements and has multi-dimensional scalability. Through simulations, we demonstrate that the POPA outperforms existing methods in terms of spectral efficiency. Our findings suggest that the POPA can be of practical value for power allocation in small cell networks.
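
For reference, the clipped surrogate objective of PPO, which the POPA adopts for its policy update (shown here in its standard form; the paper may add entropy or value-function terms), is

    L^{CLIP}(\theta) = \mathbb{E}_t\big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)},

where \hat{A}_t is an advantage estimate supplied by the critic.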
44

Pradhan, Arabinda, Sukant Kishoro Bisoy, and Mangal Sain. "Action-Based Load Balancing Technique in Cloud Network Using Actor-Critic-Swarm Optimization." Wireless Communications and Mobile Computing 2022 (June 30, 2022): 1–17. http://dx.doi.org/10.1155/2022/6456242.

Abstract:
The increasing scale of tasks in cloud networks leads to problems in load balancing and in improving its parameters. In this paper, we propose a hybrid scheduling policy, combining the Particle Swarm Optimization (PSO) algorithm and the actor-critic algorithm, named Hybrid Particle Swarm Optimization Actor Critic (HPSOAC), to solve this issue. This hybrid scheduling policy helps each agent to improve its individual learning as well as to learn by exchanging information with other agents. An experiment is carried out with the help of a Python simulator using TensorFlow. The outcomes show that our proposed scheduling policy reduces energy consumption by 5.16% and 10.86%, reduces makespan time by 7.13% and 10.04%, and has marginally better resource utilization than the Deep Q-network (DQN) and the Q-learning based on Modified Particle Swarm Optimization (QMPSO) algorithm, respectively.
45

Yang, Qisong, Thiago D. Simão, Simon H. Tindemans, and Matthijs T. J. Spaan. "WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10639–46. http://dx.doi.org/10.1609/aaai.v35i12.17272.

Abstract:
Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast it as constrained reinforcement learning, where expected long-term costs of policies are constrained. However, it can be hazardous to set constraints on the expected safety signal without considering the tail of the distribution. For instance, in safety-critical domains, worst-case analysis is required to avoid disastrous results. We present a novel reinforcement learning algorithm called Worst-Case Soft Actor Critic, which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety measure to judge the constraint satisfaction, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. The empirical analysis shows that our algorithm attains better risk control compared to expectation-based methods.
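
One way to write the safety-constrained objective the abstract describes (a sketch, not the paper's exact notation): with reward r_t, cost c_t, risk level \alpha and budget d,

    \max_\pi \ \mathbb{E}_\pi\Big[\sum_t \gamma^t r_t\Big] \quad \text{s.t.} \quad \mathrm{CVaR}_\alpha\Big(\sum_t \gamma^t c_t\Big) \le d,

so the constraint binds on the tail of the cost distribution rather than only on its expectation.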
46

Avkhimenia, Vadim, Matheus Gemignani, Tim Weis, and Petr Musilek. "Deep Reinforcement Learning-Based Operation of Transmission Battery Storage with Dynamic Thermal Line Rating." Energies 15, no. 23 (November 29, 2022): 9032. http://dx.doi.org/10.3390/en15239032.

Abstract:
It is well known that dynamic thermal line rating has the potential to use power transmission infrastructure more effectively by allowing higher currents when lines are cooler; however, it is not commonly implemented. Some of the barriers to implementation can be mitigated using modern battery energy storage systems. This paper proposes a combination of dynamic thermal line rating and battery use through the application of deep reinforcement learning. In particular, several algorithms based on deep deterministic policy gradient and soft actor critic are examined, in both single- and multi-agent settings. The selected algorithms are used to control battery energy storage systems in a 6-bus test grid. The effects of load and transmissible power forecasting on the convergence of those algorithms are also examined. The soft actor critic algorithm performs best, followed by deep deterministic policy gradient, and their multi-agent versions in the same order. One-step forecasting of the load and ampacity does not provide any significant benefit for predicting battery action.
47

Sola, Yoann, Gilles Le Chenadec, and Benoit Clement. "Simultaneous Control and Guidance of an AUV Based on Soft Actor–Critic." Sensors 22, no. 16 (August 14, 2022): 6072. http://dx.doi.org/10.3390/s22166072.

Abstract:
The marine environment is a hostile setting for robotics. It is strongly unstructured, uncertain, and includes many external disturbances that cannot be easily predicted or modeled. In this work, we attempt to control an autonomous underwater vehicle (AUV) to perform a waypoint tracking task, using a machine learning-based controller. There has been great progress in machine learning (in many different domains) in recent years; in the subfield of deep reinforcement learning, several algorithms suitable for the continuous control of dynamical systems have been designed. We implemented the soft actor–critic (SAC) algorithm, an entropy-regularized deep reinforcement learning algorithm that allows fulfilling a learning task and encourages the exploration of the environment simultaneously. We compared a SAC-based controller with a proportional integral derivative (PID) controller on a waypoint tracking task using specific performance metrics. All tests were simulated via the UUV simulator. We applied these two controllers to the RexROV 2, a six degrees of freedom cube-shaped remotely operated underwater Vehicle (ROV) converted in an AUV. We propose several interesting contributions as a result of these tests, such as making the SAC control and guiding the AUV simultaneously, outperforming the PID controller in terms of energy saving, and reducing the amount of information needed by the SAC algorithm inputs. Moreover, our implementation of this controller allows facilitating the transfer towards real-world robots. The code corresponding to this work is available on GitHub.
48

Wang, Xiao, Zhe Ma, Lei Mao, Kewu Sun, Xuhui Huang, Changchao Fan, and Jiake Li. "Accelerating Fuzzy Actor–Critic Learning via Suboptimal Knowledge for a Multi-Agent Tracking Problem." Electronics 12, no. 8 (April 13, 2023): 1852. http://dx.doi.org/10.3390/electronics12081852.

Abstract:
Multi-agent differential games usually include tracking policies and escaping policies. To obtain the proper policies in unknown environments, agents can learn through reinforcement learning. This typically requires a large amount of interaction with the environment, which is time-consuming and inefficient. However, if one can obtain an estimated model based on some prior knowledge, the control policy can be obtained based on suboptimal knowledge. Although there exists an error between the estimated model and the environment, the suboptimal guided policy will avoid unnecessary exploration; thus, the learning process can be significantly accelerated. Facing the problem of tracking policy optimization for multiple pursuers, this study proposed a new form of fuzzy actor–critic learning algorithm based on suboptimal knowledge (SK-FACL). In the SK-FACL, the information about the environment that can be obtained is abstracted as an estimated model, and the suboptimal guided policy is calculated based on the Apollonius circle. The guided policy is combined with the fuzzy actor–critic learning algorithm, improving the learning efficiency. Considering the ground game of two pursuers and one evader, the experimental results verified the advantages of the SK-FACL in reducing tracking error, adapting model error and adapting to sudden changes made by the evader compared with pure knowledge control and the pure fuzzy actor–critic learning algorithm.
49

Ali, Hamid, Hammad Majeed, Imran Usman, and Khaled A. Almejalli. "Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network." Wireless Communications and Mobile Computing 2021 (June 10, 2021): 1–13. http://dx.doi.org/10.1155/2021/9920591.

Abstract:
In reinforcement learning (RL), an agent learns an environment through trial and error. This behavior allows the agent to learn in complex and difficult environments. In RL, the agent normally learns the given environment by exploring or exploiting. Most of the algorithms suffer from under-exploration in the latter stage of the episodes. Recently, an off-policy algorithm called soft actor critic (SAC) was proposed that overcomes this problem by maximizing entropy as it learns the environment. In it, the agent tries to maximize entropy along with the expected discounted rewards. In SAC, the agent tries to be as random as possible while moving towards the maximum reward. This randomness allows the agent to explore the environment and stops it from getting stuck in local optima. We believe that maximizing the entropy causes an overestimation of the entropy term, which results in slow policy learning. This is because of the drastic change in the action distribution whenever the agent revisits similar states. To overcome this problem, we propose a dual policy optimization framework, in which two independent policies are trained. Both policies try to maximize entropy by choosing actions against the minimum entropy to reduce the overestimation. The use of two policies results in better and faster convergence. We demonstrate our approach on different well-known continuous control simulated environments. The results show that our proposed technique achieves better results than the state-of-the-art SAC algorithm and learns better policies.
50

Xi, Bao, Rui Wang, Ying-Hao Cai, Tao Lu, and Shuo Wang. "A Novel Heterogeneous Actor-critic Algorithm with Recent Emphasizing Replay Memory." International Journal of Automation and Computing 18, no. 4 (April 23, 2021): 619–31. http://dx.doi.org/10.1007/s11633-021-1296-x.
