Journal articles on the topic 'Actor-critic methods'


Consult the top 50 journal articles for your research on the topic 'Actor-critic methods.'


1

Parisi, Simone, Voot Tangkaratt, Jan Peters, and Mohammad Emtiyaz Khan. "TD-regularized actor-critic methods." Machine Learning 108, no. 8-9 (February 21, 2019): 1467–501. http://dx.doi.org/10.1007/s10994-019-05788-0.

2

Wang, Jing, Xuchu Ding, Morteza Lahijanian, Ioannis Ch Paschalidis, and Calin A. Belta. "Temporal logic motion control using actor–critic methods." International Journal of Robotics Research 34, no. 10 (May 26, 2015): 1329–44. http://dx.doi.org/10.1177/0278364915581505.

3

Grondman, I., M. Vaandrager, L. Busoniu, R. Babuska, and E. Schuitema. "Efficient Model Learning Methods for Actor–Critic Control." IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42, no. 3 (June 2012): 591–602. http://dx.doi.org/10.1109/tsmcb.2011.2170565.

4

Wang, Mingyi, Jianhao Tang, Haoli Zhao, Zhenni Li, and Shengli Xie. "Automatic Compression of Neural Network with Deep Reinforcement Learning Based on Proximal Gradient Method." Mathematics 11, no. 2 (January 9, 2023): 338. http://dx.doi.org/10.3390/math11020338.

Abstract:
In recent years, model compression techniques have become very effective for compressing deep neural networks. However, many existing model compression methods rely heavily on human experience to explore a compression strategy that balances network structure, speed, and accuracy, which is usually suboptimal and time-consuming. In this paper, we propose a framework for automatically compressing models through actor–critic structured deep reinforcement learning (DRL) which interacts with each layer in the neural network, where the actor network determines the compression strategy and the critic network ensures the decision accuracy of the actor network through predicted values, thus improving the compression quality of the network. To enhance the prediction performance of the critic network, we impose the L1 norm regularizer on the weights of the critic network to obtain a distinct activation output feature on the representation, thus enhancing the prediction accuracy of the critic network. Moreover, to improve the decision performance of the actor network, we impose the L1 norm regularizer on the weights of the actor network to improve its decision accuracy by removing redundant weights. Furthermore, to improve training efficiency, we use the proximal gradient method to optimize the weights of the actor network and the critic network, which yields an effective weight solution and thus improves the compression performance. In experiments on the MNIST dataset, the proposed method loses only 0.2% accuracy when compressing more than 70% of the neurons. Similarly, on the CIFAR-10 dataset, the proposed method compresses more than 60% of the neurons with only a 7.1% accuracy loss, which is superior to other existing methods. In terms of efficiency, the proposed method also requires the least time among the existing methods.
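To make the proximal-gradient idea in the abstract above concrete, here is a minimal, hedged sketch (not the authors' code) of an L1-regularized weight update: a plain gradient step followed by soft-thresholding. The learning rate and regularization strength are placeholder values.

```python
import numpy as np

def prox_l1(w, thresh):
    """Soft-thresholding: the proximal operator of the L1 norm."""
    return np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)

def proximal_gradient_step(weights, grad, lr=1e-3, l1_coef=1e-4):
    """One proximal-gradient update for L1-regularized network weights:
    a gradient step on the task loss, then the proximal operator of
    lr * l1_coef * ||w||_1, which zeroes out small weights."""
    return prox_l1(weights - lr * grad, lr * l1_coef)

# Hypothetical usage on a critic weight matrix:
w = np.random.randn(64, 64)
g = np.random.randn(64, 64)  # stand-in for the gradient of the critic loss
w = proximal_gradient_step(w, g)
```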
5

Su, Jianyu, Stephen Adams, and Peter Beling. "Value-Decomposition Multi-Agent Actor-Critics." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (May 18, 2021): 11352–60. http://dx.doi.org/10.1609/aaai.v35i13.17353.

Abstract:
The exploitation of extra state information has been an active research area in multi-agent reinforcement learning (MARL). QMIX represents the joint action-value using a non-negative function approximator and achieves the best performance on the StarCraft II micromanagement testbed, a common MARL benchmark. However, our experiments demonstrate that, in some cases, QMIX performs sub-optimally with the A2C framework, a training paradigm that promotes algorithm training efficiency. To obtain a reasonable trade-off between training efficiency and algorithm performance, we extend value-decomposition to actor-critic methods that are compatible with A2C and propose a novel actor-critic framework, value-decomposition actor-critic (VDAC). We evaluate VDAC on the StarCraft II micromanagement task and demonstrate that the proposed framework improves median performance over other actor-critic methods. Furthermore, we use a set of ablation experiments to identify the key factors that contribute to the performance of VDAC.
6

Saglam, Baturay, Furkan B. Mutlu, Dogan C. Cicek, and Suleyman S. Kozat. "Actor Prioritized Experience Replay." Journal of Artificial Intelligence Research 78 (November 16, 2023): 639–72. http://dx.doi.org/10.1613/jair.1.14819.

Abstract:
A widely-studied deep reinforcement learning (RL) technique known as Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error. Although it has been shown that PER is one of the most crucial components for the overall performance of deep RL methods in discrete action domains, many empirical studies indicate that it considerably underperforms off-policy actor-critic algorithms. We theoretically show that actor networks cannot be effectively trained with transitions that have large TD errors. As a result, the approximate policy gradient computed under the Q-network diverges from the actual gradient computed under the optimal Q-function. Motivated by this, we introduce a novel experience replay sampling framework for actor-critic methods, which also regards issues with stability and recent findings behind the poor empirical performance of PER. The introduced algorithm suggests a new branch of improvements to PER and schedules effective and efficient training for both actor and critic networks. An extensive set of experiments verifies our theoretical findings, showing that our method outperforms competing approaches and achieves state-of-the-art results over the standard off-policy actor-critic algorithms.
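For context, the PER sampling rule that the abstract builds on (standard in the literature, not this paper's contribution) draws transition i with probability proportional to its TD-error priority and corrects the bias with importance-sampling weights; a minimal sketch with commonly used but illustrative hyperparameters:

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha with priority p_i = |TD error_i| + eps."""
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()

def sample_batch(td_errors, batch_size=32, beta=0.4):
    """Sample indices by priority and return importance-sampling weights."""
    probs = per_probabilities(np.asarray(td_errors, dtype=float))
    idx = np.random.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    return idx, weights / weights.max()
```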
7

Seo, Kanghyeon, and Jihoon Yang. "Differentially Private Actor and Its Eligibility Trace." Electronics 9, no. 9 (September 10, 2020): 1486. http://dx.doi.org/10.3390/electronics9091486.

Abstract:
We present a differentially private actor and its eligibility trace in an actor-critic approach, wherein an actor takes actions directly interacting with an environment; however, the critic estimates only the state values that are obtained through bootstrapping. In other words, the actor reflects the more detailed information about the sequence of taken actions on its parameter than the critic. Moreover, their corresponding eligibility traces have the same properties. Therefore, it is necessary to preserve the privacy of an actor and its eligibility trace while training on private or sensitive data. In this paper, we confirm the applicability of differential privacy methods to the actors updated using the policy gradient algorithm and discuss the advantages of such an approach with regard to differentially private critic learning. In addition, we measured the cosine similarity between the differentially private applied eligibility trace and the non-differentially private eligibility trace to analyze whether their anonymity is appropriately protected in the differentially private actor or the critic. We conducted the experiments considering two synthetic examples imitating real-world problems in medical and autonomous navigation domains, and the results confirmed the feasibility of the proposed method.
8

Saglam, Baturay, Furkan Mutlu, Dogan Cicek, and Suleyman Kozat. "Actor Prioritized Experience Replay (Abstract Reprint)." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 20 (March 24, 2024): 22710. http://dx.doi.org/10.1609/aaai.v38i20.30610.

Abstract:
A widely-studied deep reinforcement learning (RL) technique known as Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error. Although it has been shown that PER is one of the most crucial components for the overall performance of deep RL methods in discrete action domains, many empirical studies indicate that it considerably underperforms off-policy actor-critic algorithms. We theoretically show that actor networks cannot be effectively trained with transitions that have large TD errors. As a result, the approximate policy gradient computed under the Q-network diverges from the actual gradient computed under the optimal Q-function. Motivated by this, we introduce a novel experience replay sampling framework for actor-critic methods, which also regards issues with stability and recent findings behind the poor empirical performance of PER. The introduced algorithm suggests a new branch of improvements to PER and schedules effective and efficient training for both actor and critic networks. An extensive set of experiments verifies our theoretical findings, showing that our method outperforms competing approaches and achieves state-of-the-art results over the standard off-policy actor-critic algorithms.
9

Hafez, Muhammad Burhan, Cornelius Weber, Matthias Kerzel, and Stefan Wermter. "Deep intrinsically motivated continuous actor-critic for efficient robotic visuomotor skill learning." Paladyn, Journal of Behavioral Robotics 10, no. 1 (January 1, 2019): 14–29. http://dx.doi.org/10.1515/pjbr-2019-0005.

Abstract:
In this paper, we present a new intrinsically motivated actor-critic algorithm for learning continuous motor skills directly from raw visual input. Our neural architecture is composed of a critic and an actor network. Both networks receive the hidden representation of a deep convolutional autoencoder which is trained to reconstruct the visual input, while the centre-most hidden representation is also optimized to estimate the state value. Separately, an ensemble of predictive world models generates, based on its learning progress, an intrinsic reward signal which is combined with the extrinsic reward to guide the exploration of the actor-critic learner. Our approach is more data-efficient and inherently more stable than the existing actor-critic methods for continuous control from pixel data. We evaluate our algorithm for the task of learning robotic reaching and grasping skills on a realistic physics simulator and on a humanoid robot. The results show that the control policies learned with our approach can achieve better performance than the compared state-of-the-art and baseline algorithms in both dense-reward and challenging sparse-reward settings.
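A hedged sketch of the reward-mixing idea described above: an intrinsic bonus derived from the learning progress of a predictive world model is added to the extrinsic reward. The progress measure and the mixing weight `beta` are illustrative assumptions, not the paper's exact formulation.

```python
from collections import deque
import numpy as np

class IntrinsicRewardMixer:
    """Adds an intrinsic bonus proportional to how fast a predictive
    model's error has been decreasing (its learning progress)."""

    def __init__(self, window=100, beta=0.5):
        self.errors = deque(maxlen=window)
        self.beta = beta  # assumed weight of the intrinsic term

    def step(self, extrinsic_reward, prediction_error):
        self.errors.append(prediction_error)
        if len(self.errors) < 2:
            return extrinsic_reward
        half = len(self.errors) // 2
        old = np.mean(list(self.errors)[:half])
        new = np.mean(list(self.errors)[half:])
        progress = max(old - new, 0.0)  # positive when error is shrinking
        return extrinsic_reward + self.beta * progress
```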
10

Kong, Minseok, and Jungmin So. "Empirical Analysis of Automated Stock Trading Using Deep Reinforcement Learning." Applied Sciences 13, no. 1 (January 3, 2023): 633. http://dx.doi.org/10.3390/app13010633.

Abstract:
There are several automated stock trading programs using reinforcement learning, one of which is an ensemble strategy. The main idea of the ensemble strategy is to train DRL agents and make an ensemble with three different actor–critic algorithms: Advantage Actor–Critic (A2C), Deep Deterministic Policy Gradient (DDPG), and Proximal Policy Optimization (PPO). This novel idea is the main concept used in this paper. However, we did not stop there; we refined the automated stock trading in two areas. First, we made another DRL-based ensemble and employed it as a new trading agent. We named it Remake Ensemble, and it combines not only A2C, DDPG, and PPO but also Actor–Critic using Kronecker-Factored Trust Region (ACKTR), Soft Actor–Critic (SAC), Twin Delayed DDPG (TD3), and Trust Region Policy Optimization (TRPO). Furthermore, we expanded the application domain of automated stock trading. Whereas the existing stock trading method treats only 30 Dow Jones stocks, ours handles KOSPI stocks, JPX stocks, and Dow Jones stocks. We conducted experiments with our modified automated stock trading system to validate its robustness in terms of cumulative return. Finally, we suggested some methods to gain relatively stable profits following the experiments.
11

Hernandez-Leal, Pablo, Bilal Kartal, and Matthew E. Taylor. "Agent Modeling as Auxiliary Task for Deep Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 15, no. 1 (October 8, 2019): 31–37. http://dx.doi.org/10.1609/aiide.v15i1.5221.

Abstract:
In this paper we explore how actor-critic methods in deep reinforcement learning, in particular Asynchronous Advantage Actor-Critic (A3C), can be extended with agent modeling. Inspired by recent works on representation learning and multiagent deep reinforcement learning, we propose two architectures to perform agent modeling: the first one based on parameter sharing, and the second one based on agent policy features. Both architectures aim to learn other agents’ policies as auxiliary tasks, besides the standard actor (policy) and critic (values). We performed experiments in both cooperative and competitive domains. The former is a problem of coordinated multiagent object transportation and the latter is a two-player mini version of the Pommerman game. Our results show that the proposed architectures stabilize learning and outperform the standard A3C architecture when learning a best response in terms of expected rewards.
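A hedged sketch of how an auxiliary agent-modeling term can be folded into an A3C-style loss, as the abstract describes: the usual policy and value losses plus a cross-entropy term for predicting the other agent's actions from shared features. The coefficient values and array shapes are assumptions for illustration.

```python
import numpy as np

def a3c_agent_modeling_loss(log_probs, advantages, values, returns,
                            opp_pred, opp_actions,
                            value_coef=0.5, aux_coef=0.1):
    """log_probs, advantages, values, returns: shape (T,).
    opp_pred: predicted opponent action distributions, shape (T, A).
    opp_actions: observed opponent actions, shape (T,)."""
    policy_loss = -np.mean(log_probs * advantages)
    value_loss = np.mean((returns - values) ** 2)
    # Auxiliary task: cross-entropy on the other agent's observed actions.
    picked = opp_pred[np.arange(len(opp_actions)), opp_actions]
    aux_loss = -np.mean(np.log(picked + 1e-8))
    return policy_loss + value_coef * value_loss + aux_coef * aux_loss
```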
12

Arvindhan, M., and D. Rajesh Kumar. "Adaptive Resource Allocation in Cloud Data Centers using Actor-Critical Deep Reinforcement Learning for Optimized Load Balancing." International Journal on Recent and Innovation Trends in Computing and Communication 11, no. 5s (May 18, 2023): 310–18. http://dx.doi.org/10.17762/ijritcc.v11i5s.6671.

Abstract:
This paper proposes a deep reinforcement learning-based actor-critic method for efficient resource allocation in cloud computing. The proposed method uses an actor network to generate the allocation strategy and a critic network to evaluate the quality of the allocation. The actor and critic networks are trained using a deep reinforcement learning algorithm to optimize the allocation strategy. The proposed method is evaluated in a simulation-based experimental study, and the results show that it outperforms several existing allocation methods in terms of resource utilization, energy efficiency, and overall cost. Some algorithms for managing workloads or virtual machines have been developed in previous works in an effort to reduce energy consumption; however, these solutions often fail to take into account the highly dynamic nature of server states and are not implemented at a sufficiently large scale. In order to guarantee the QoS of workloads while simultaneously lowering the computational energy consumption of physical servers, this study proposes the Actor Critic based Compute-Intensive Workload Allocation Scheme (AC-CIWAS). AC-CIWAS captures the dynamic feature of server states in a continuous manner and considers the influence of different workloads on energy consumption to accomplish logical task allocation. To determine how best to allocate workloads in terms of energy efficiency, AC-CIWAS uses a Deep Reinforcement Learning (DRL)-based Actor Critic (AC) algorithm to calculate the projected cumulative return over time. Through simulation, we see that the proposed AC-CIWAS can reduce the energy consumption of scheduled jobs, with QoS assurance, by around 20% compared to existing baseline allocation methods. The paper also covers the ways in which the proposed technology could be used in cloud computing and offers suggestions for future study.
13

Aws, Ahmad, Arkadij Yuschenko, and Vladimir Soloviev. "End-to-end deep reinforcement learning for control of an autonomous underwater robot with an undulating propulsor." Robotics and Technical Cybernetics 12, no. 1 (March 2024): 36–45. http://dx.doi.org/10.31776/rtcj.12105.

Abstract:
This paper focuses on the development and implementation of control algorithms for positioning an Autonomous Underwater Vehicle (AUV) with an undulating propulsor, using reinforcement learning methods. It provides an analysis and overview of works incorporating reinforcement learning methods such as Actor-only, Critic-only, and Actor-Critic. The paper primarily focuses on the Deep Deterministic Policy Gradient method and its implementation using deep neural networks to train the Actor-Critic agent. In the agent's architecture, a replay buffer and target neural networks were utilized to address the data-correlation issue that induces training instability. An adaptive architecture was proposed for training the agent to force the robot to move from the initial point to any target point. Additionally, a random target point generator was incorporated at the training stage so that the agent does not need to be retrained when the target points change. The training objective is to optimize the actor's policy by optimizing the critic and maximizing the reward function. The reward function is defined in terms of the distance from the robot's center of mass to the target point. Consequently, the reward received by the agent increases as the robot gets closer to the target point and becomes maximal when the target point is reached with an acceptable error.
14

Zhang, Haifeng, Weizhe Chen, Zeren Huang, Minne Li, Yaodong Yang, Weinan Zhang, and Jun Wang. "Bi-Level Actor-Critic for Multi-Agent Coordination." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 7325–32. http://dx.doi.org/10.1609/aaai.v34i05.6226.

Abstract:
Coordination is one of the essential problems in multi-agent systems. Typically, multi-agent reinforcement learning (MARL) methods treat agents equally and the goal is to solve the Markov game to an arbitrary Nash equilibrium (NE) when multiple equilibria exist, thus lacking a solution for NE selection. In this paper, we treat agents unequally and consider the Stackelberg equilibrium as a potentially better convergence point than the Nash equilibrium in terms of Pareto superiority, especially in cooperative environments. Under Markov games, we formally define the bi-level reinforcement learning problem of finding a Stackelberg equilibrium. We propose a novel bi-level actor-critic learning method that allows agents to have different knowledge bases (and thus different levels of intelligence), while their actions can still be executed simultaneously and in a distributed manner. A convergence proof is given, and the resulting learning algorithm is tested against the state of the art. We found that the proposed bi-level actor-critic algorithm successfully converges to the Stackelberg equilibria in matrix games and finds an asymmetric solution in a highway-merge environment.
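To illustrate the solution concept targeted above (a generic sketch, not the paper's algorithm), a Stackelberg equilibrium of a two-player matrix game can be found by brute force: for each leader action the follower best-responds, and the leader picks the action whose induced outcome maximizes its own payoff.

```python
import numpy as np

def stackelberg_solution(leader_payoff, follower_payoff):
    """Both payoff matrices have shape (n_leader_actions, n_follower_actions).
    Returns (leader action, follower best response, leader payoff)."""
    best = None
    for a_l in range(leader_payoff.shape[0]):
        a_f = int(np.argmax(follower_payoff[a_l]))   # follower best response
        value = leader_payoff[a_l, a_f]
        if best is None or value > best[2]:
            best = (a_l, a_f, value)
    return best

# Hypothetical 2x2 cooperative-style game:
L = np.array([[3.0, 0.0], [0.0, 2.0]])
F = np.array([[3.0, 0.0], [0.0, 2.0]])
print(stackelberg_solution(L, F))   # -> (0, 0, 3.0)
```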
15

Luo, Ziwei, Jing Hu, Xin Wang, Shu Hu, Bin Kong, Youbing Yin, Qi Song, Xi Wu, and Siwei Lyu. "Stochastic Planner-Actor-Critic for Unsupervised Deformable Image Registration." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 2 (June 28, 2022): 1917–25. http://dx.doi.org/10.1609/aaai.v36i2.20086.

Abstract:
Large deformations of organs, caused by diverse shapes and nonlinear shape changes, pose a significant challenge for medical image registration. Traditional registration methods need to iteratively optimize an objective function via a specific deformation model along with meticulous parameter tuning, but they have limited capabilities in registering images with large deformations. While deep learning-based methods can learn the complex mapping from input images to their respective deformation fields, they are regression-based and prone to getting stuck at local minima, particularly when large deformations are involved. To this end, we present Stochastic Planner-Actor-Critic (SPAC), a novel reinforcement learning-based framework that performs step-wise registration. The key notion is warping a moving image successively at each time step to finally align it to a fixed image. Considering that it is challenging to handle high-dimensional continuous action and state spaces in the conventional reinforcement learning (RL) framework, we introduce a new concept, 'Plan', into the standard Actor-Critic model, which is of low dimension and can facilitate the actor in generating a tractable high-dimensional action. The entire framework is based on unsupervised training and operates in an end-to-end manner. We evaluate our method on several 2D and 3D medical image datasets, some of which contain large deformations. Our empirical results highlight that our work achieves consistent, significant gains and outperforms state-of-the-art methods.
16

Aslani, Mohammad, Mohammad Saadi Mesgari, Stefan Seipel, and Marco Wiering. "Developing adaptive traffic signal control by actor–critic and direct exploration methods." Proceedings of the Institution of Civil Engineers - Transport 172, no. 5 (October 2019): 289–98. http://dx.doi.org/10.1680/jtran.17.00085.

17

Doya, Kenji. "Reinforcement Learning in Continuous Time and Space." Neural Computation 12, no. 1 (January 1, 2000): 219–45. http://dx.doi.org/10.1162/089976600300015961.

Abstract:
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods, a continuous actor-critic method and a value-gradient-based greedy policy, are formulated. As a special case of the latter, a nonlinear feedback control law using the value gradient and the model of the input gain is derived. The advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging a pendulum up with limited torque. It is shown in the simulations that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than that based on Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
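A hedged illustration of the continuous-time TD error that this framework is built on, discretized with a backward Euler step; the time step `dt` and time constant `tau` are placeholder values, and the exact formulation should be taken from the article itself.

```python
def continuous_time_td_error(r_t, v_prev, v_curr, dt=0.01, tau=1.0):
    """delta(t) = r(t) - V(t)/tau + dV/dt, with dV/dt approximated by
    (V(t) - V(t - dt)) / dt.  Rearranged, this matches a discrete TD(0)
    error with discount gamma = 1 - dt/tau, scaled by 1/dt."""
    return r_t - v_curr / tau + (v_curr - v_prev) / dt

# Example usage with made-up values:
print(continuous_time_td_error(r_t=1.0, v_prev=4.9, v_curr=5.0))
```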
18

Zhu, Qingling, Xiaoqiang Wu, Qiuzhen Lin, and Wei-Neng Chen. "Two-Stage Evolutionary Reinforcement Learning for Enhancing Exploration and Exploitation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 18 (March 24, 2024): 20892–900. http://dx.doi.org/10.1609/aaai.v38i18.30079.

Abstract:
The integration of Evolutionary Algorithm (EA) and Reinforcement Learning (RL) has emerged as a promising approach for tackling some challenges in RL, such as sparse rewards, lack of exploration, and brittle convergence properties. However, existing methods often employ actor networks as individuals of EA, which may constrain their exploratory capabilities, as the entire actor population will stop evolving when the critic network in RL falls into a local optimum. To alleviate this issue, this paper introduces a Two-stage Evolutionary Reinforcement Learning (TERL) framework that maintains a population containing both actor and critic networks. TERL divides the learning process into two stages. In the initial stage, individuals independently learn actor-critic networks, which are optimized alternately by RL and Particle Swarm Optimization (PSO). This dual optimization fosters greater exploration, curbing susceptibility to local optima. Shared information from a common replay buffer and the PSO algorithm substantially mitigates the computational load of training multiple agents. In the subsequent stage, TERL shifts to a refined exploitation phase. Here, only the best individual undergoes further refinement, while the remaining individuals continue PSO-based optimization. This allocates more computational resources to the best individual for yielding superior performance. Empirical assessments, conducted across a range of continuous control problems, validate the efficacy of the proposed TERL paradigm.
19

Jain, Arushi, Gandharv Patil, Ayush Jain, Khimya Khetarpal, and Doina Precup. "Variance Penalized On-Policy and Off-Policy Actor-Critic." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 7899–907. http://dx.doi.org/10.1609/aaai.v35i9.16964.

Abstract:
Reinforcement learning algorithms are typically geared towards optimizing the expected return of an agent. However, in many practical applications, low variance in the return is desired to ensure the reliability of an algorithm. In this paper, we propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return. Previous work uses the second moment of return to estimate the variance indirectly. Instead, we use a much simpler recently proposed direct variance estimator which updates the estimates incrementally using temporal difference methods. Using the variance-penalized criterion, we guarantee the convergence of our algorithm to locally optimal policies for finite state action Markov decision processes. We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
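For reference, the variance-penalized criterion named in the abstract combines the mean and the variance of the return; a minimal batch-based sketch is below (the paper itself estimates the variance incrementally with temporal-difference updates, and the penalty coefficient here is an assumed value).

```python
import numpy as np

def variance_penalized_objective(sampled_returns, lam=0.1):
    """J = E[G] - lam * Var[G], evaluated on a batch of sampled returns."""
    g = np.asarray(sampled_returns, dtype=float)
    return g.mean() - lam * g.var()

print(variance_penalized_objective([10.0, 9.5, 10.5, 9.0, 11.0]))
```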
20

Ryu, Heechang, Hayong Shin, and Jinkyoo Park. "Multi-Agent Actor-Critic with Hierarchical Graph Attention Network." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 7236–43. http://dx.doi.org/10.1609/aaai.v34i05.6214.

Abstract:
Most previous studies on multi-agent reinforcement learning focus on deriving decentralized and cooperative policies to maximize a common reward and rarely consider the transferability of trained policies to new tasks. This prevents such policies from being applied to more complex multi-agent tasks. To resolve these limitations, we propose a model that conducts both representation learning for multiple agents using hierarchical graph attention network and policy learning using multi-agent actor-critic. The hierarchical graph attention network is specially designed to model the hierarchical relationships among multiple agents that either cooperate or compete with each other to derive more advanced strategic policies. Two attention networks, the inter-agent and inter-group attention layers, are used to effectively model individual and group level interactions, respectively. The two attention networks have been proven to facilitate the transfer of learned policies to new tasks with different agent compositions and allow one to interpret the learned strategies. Empirically, we demonstrate that the proposed model outperforms existing methods in several mixed cooperative and competitive tasks.
21

Shi, Daming, Xudong Guo, Yi Liu, and Wenhui Fan. "Optimal Policy of Multiplayer Poker via Actor-Critic Reinforcement Learning." Entropy 24, no. 6 (May 30, 2022): 774. http://dx.doi.org/10.3390/e24060774.

Abstract:
Poker has been considered a challenging problem in both artificial intelligence and game theory because poker is characterized by imperfect information and uncertainty, which are similar to many realistic problems like auctioning, pricing, cyber security, and operations. However, it is still not clear whether playing an equilibrium policy in multi-player games would be wise, and it is infeasible to theoretically validate whether a policy is optimal. Therefore, designing an effective optimal policy learning method has more realistic significance. This paper proposes an optimal policy learning method for multi-player poker games based on Actor-Critic reinforcement learning. Firstly, this paper builds the Actor network to make decisions with imperfect information and the Critic network to evaluate policies with perfect information. Secondly, this paper proposes a novel multi-player poker policy update method: the asynchronous policy update algorithm (APU) and the dual-network asynchronous policy update algorithm (Dual-APU) for multi-player multi-policy scenarios and multi-player sharing-policy scenarios, respectively. Finally, this paper takes the most popular six-player Texas hold ’em poker to validate the performance of the proposed optimal policy learning method. The experiments demonstrate that the policies learned by the proposed methods perform well and gain steadily compared with the existing approaches. In sum, the policy learning methods for imperfect information games based on Actor-Critic reinforcement learning perform well on poker and can be transferred to other imperfect information games. Such training with perfect information and testing with imperfect information shows an effective and explainable approach to learning an approximately optimal policy.
22

Wang, Hui, Peng Zhang, and Quan Liu. "An Actor-critic Algorithm Using Cross Evaluation of Value Functions." IAES International Journal of Robotics and Automation (IJRA) 7, no. 1 (March 1, 2018): 39. http://dx.doi.org/10.11591/ijra.v7i1.pp39-47.

Abstract:
In order to overcome the difficulty of learning a globally optimal policy caused by maximization bias in a continuous space, an actor-critic algorithm with cross evaluation of double value functions is proposed. Two independent value functions make the critic's estimate closer to the real value function, and the actor is guided by a crossover function to choose its optimal actions. Cross evaluation of value functions avoids the policy jitter phenomenon exhibited by greedy optimization methods in continuous spaces. The algorithm is more robust than the CACLA learning algorithm, and the experimental results show that our algorithm is smoother and that policy stability is clearly improved while the computational cost remains almost unchanged.
23

Zhang, Zuozhen, Junzhong Ji, and Jinduo Liu. "MetaRLEC: Meta-Reinforcement Learning for Discovery of Brain Effective Connectivity." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 9 (March 24, 2024): 10261–69. http://dx.doi.org/10.1609/aaai.v38i9.28892.

Abstract:
In recent years, the discovery of brain effective connectivity (EC) networks through computational analysis of functional magnetic resonance imaging (fMRI) data has gained prominence in neuroscience and neuroimaging. However, owing to the influence of diverse factors during data collection and processing, fMRI data typically exhibit high noise and limited sample characteristics, consequently leading to suboptimal performance of current methods. In this paper, we propose a novel brain effective connectivity discovery method based on meta-reinforcement learning, called MetaRLEC. The method mainly consists of three modules: actor, critic, and meta-critic. MetaRLEC first employs an encoder-decoder framework: the encoder, utilizing a Transformer, converts noisy fMRI data into a state embedding; the decoder, employing a bidirectional LSTM, discovers brain region dependencies from the state and generates actions (EC networks). Then a critic network evaluates these actions, incentivizing the actor to learn higher-reward actions amidst the high-noise setting. Finally, a meta-critic framework facilitates online learning of historical state-action pairs, integrating an action-value neural network and supplementary training losses to enhance the model's adaptability to small-sample fMRI data. We conduct comprehensive experiments on both simulated and real-world data to demonstrate the efficacy of our proposed method.
24

Zhao, Nan, Zehua Liu, Yiqiang Cheng, and Chao Tian. "Multi-Agent Actor Critic for Channel Allocation in Heterogeneous Networks." International Journal of Mobile Computing and Multimedia Communications 11, no. 1 (January 2020): 23–41. http://dx.doi.org/10.4018/ijmcmc.2020010102.

Abstract:
Heterogeneous networks (HetNets) can equalize traffic loads and cut down the cost of deploying cells. Thus, they are regarded as a significant technique for next-generation communication networks. Due to the non-convex nature of the channel allocation problem in HetNets, it is difficult to design an optimal approach for allocating channels. To ensure the user quality of service as well as the long-term total network utility, this article proposes a new method utilizing multi-agent reinforcement learning. Moreover, to solve the computational complexity problem caused by the large action space, deep reinforcement learning is put forward to learn the optimal policy. A nearly optimal solution with high efficiency and rapid convergence speed can be obtained by this learning method. Simulation results reveal that the new method performs better than other methods.
25

Chen, Haibo, Zhongwei Huang, Xiaorong Zhao, Xiao Liu, Youjun Jiang, Pinyong Geng, Guang Yang, Yewen Cao, and Deqiang Wang. "Policy Optimization of the Power Allocation Algorithm Based on the Actor–Critic Framework in Small Cell Networks." Mathematics 11, no. 7 (April 2, 2023): 1702. http://dx.doi.org/10.3390/math11071702.

Abstract:
A practical solution to the power allocation problem in ultra-dense small cell networks can be achieved by using deep reinforcement learning (DRL) methods. Unlike traditional algorithms, DRL methods are capable of achieving low latency and operating without the need for global real-time channel state information (CSI). Based on the actor–critic framework, we propose a policy optimization of the power allocation algorithm (POPA) for small cell networks in this paper. The POPA adopts the proximal policy optimization (PPO) algorithm to update the policy, which has been shown to have stable exploration and convergence effects in our simulations. Thanks to our proposed actor–critic architecture with distributed execution and centralized exploration training, the POPA can meet real-time requirements and has multi-dimensional scalability. Through simulations, we demonstrate that the POPA outperforms existing methods in terms of spectral efficiency. Our findings suggest that the POPA can be of practical value for power allocation in small cell networks.
26

Yang, Qisong, Thiago D. Simão, Simon H. Tindemans, and Matthijs T. J. Spaan. "WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10639–46. http://dx.doi.org/10.1609/aaai.v35i12.17272.

Abstract:
Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast it as constrained reinforcement learning, where expected long-term costs of policies are constrained. However, it can be hazardous to set constraints on the expected safety signal without considering the tail of the distribution. For instance, in safety-critical domains, worst-case analysis is required to avoid disastrous results. We present a novel reinforcement learning algorithm called Worst-Case Soft Actor Critic, which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety measure to judge the constraint satisfaction, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. The empirical analysis shows that our algorithm attains better risk control compared to expectation-based methods.
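The safety measure named above, conditional Value-at-Risk, is the expected cost in the worst alpha-fraction of outcomes; a generic empirical estimator (not the paper's distributional-critic implementation, and with an assumed risk level) looks like this:

```python
import numpy as np

def empirical_cvar(costs, alpha=0.1):
    """Mean of the worst ceil(alpha * N) sampled costs."""
    c = np.sort(np.asarray(costs, dtype=float))[::-1]   # worst (largest) first
    k = max(1, int(np.ceil(alpha * len(c))))
    return float(c[:k].mean())

# A policy could be judged safe if its cost CVaR stays below a budget d:
d = 1.0
print(empirical_cvar([0.1, 0.0, 0.3, 2.0, 0.2, 0.1, 0.0, 0.4, 0.1, 0.2]) <= d)
```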
27

Wang, Zhihai, Jie Wang, Qi Zhou, Bin Li, and Houqiang Li. "Sample-Efficient Reinforcement Learning via Conservative Model-Based Actor-Critic." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8612–20. http://dx.doi.org/10.1609/aaai.v36i8.20839.

Abstract:
Model-based reinforcement learning algorithms, which aim to learn a model of the environment to make decisions, are more sample efficient than their model-free counterparts. The sample efficiency of model-based approaches relies on whether the model can well approximate the environment. However, learning an accurate model is challenging, especially in complex and noisy environments. To tackle this problem, we propose the conservative model-based actor-critic (CMBAC), a novel approach that achieves high sample efficiency without the strong reliance on accurate learned models. Specifically, CMBAC learns multiple estimates of the Q-value function from a set of inaccurate models and uses the average of the bottom-k estimates---a conservative estimate---to optimize the policy. An appealing feature of CMBAC is that the conservative estimates effectively encourage the agent to avoid unreliable “promising actions”---whose values are high in only a small fraction of the models. Experiments demonstrate that CMBAC significantly outperforms state-of-the-art approaches in terms of sample efficiency on several challenging control tasks, and the proposed method is more robust than previous methods in noisy environments.
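A minimal sketch of the conservative estimate described above: given several Q-value estimates for the same state-action pair from an ensemble of learned models, average only the k smallest. The value of k is an illustrative choice.

```python
import numpy as np

def conservative_q_estimate(q_estimates, k=3):
    """Average of the bottom-k Q-value estimates from an ensemble."""
    q = np.sort(np.asarray(q_estimates, dtype=float))
    return float(q[:k].mean())

# Five hypothetical model-based estimates for one (state, action) pair:
print(conservative_q_estimate([1.2, 0.8, 2.5, 0.9, 1.7], k=3))  # mean of 0.8, 0.9, 1.2
```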
28

Zhong, Shan, Quan Liu, and QiMing Fu. "Efficient Actor-Critic Algorithm with Hierarchical Model Learning and Planning." Computational Intelligence and Neuroscience 2016 (2016): 1–15. http://dx.doi.org/10.1155/2016/4824072.

Abstract:
To improve the convergence rate and the sample efficiency, two efficient learning methods, AC-HMLP and RAC-HMLP (AC-HMLP with l2-regularization), are proposed by combining the actor-critic algorithm with hierarchical model learning and planning. The hierarchical models, consisting of the local and the global models, which are learned at the same time as the value function and the policy, are approximated by local linear regression (LLR) and linear function approximation (LFA), respectively. Both the local model and the global model are applied to generate samples for planning; the former is used only if the state-prediction error does not surpass the threshold at each time step, while the latter is utilized at the end of each episode. The purpose of taking both models is to improve the sample efficiency and accelerate the convergence rate of the whole algorithm through fully utilizing the local and global information. Experimentally, AC-HMLP and RAC-HMLP are compared with three representative algorithms on two Reinforcement Learning (RL) benchmark problems. The results demonstrate that they perform best in terms of convergence rate and sample efficiency.
29

Wu, Zhenning, Yiming Deng, and Lixing Wang. "A Pinning Actor-Critic Structure-Based Algorithm for Sizing Complex-Shaped Depth Profiles in MFL Inspection with High Degree of Freedom." Complexity 2021 (April 23, 2021): 1–12. http://dx.doi.org/10.1155/2021/9995033.

Abstract:
One of the most efficient nondestructive methods for pipeline in-line inspection is magnetic flux leakage (MFL) inspection. Estimating the size of a defect from the MFL signal is one of the key problems of MFL inspection. As the inspection signal is usually contaminated by noise, sizing the defect is an ill-posed inverse problem, especially when sizing the depth as a complex shape. An actor-critic structure-based algorithm is proposed in this paper for sizing complex depth profiles. By learning with more information from the depth profile without knowing the corresponding MFL signal, the proposed algorithm saves computational cost and is robust. A pinning strategy is embedded in the reconstruction process, which greatly reduces the dimension of the action space. The pinning actor-critic structure (PACS) helps to make the reward for the critic network more efficient when reconstructing depth profiles with high degrees of freedom. A nonlinear FEM model is used to test the effectiveness of the proposed algorithm under 20 dB noise. The results show that the algorithm reconstructs the depth profile of defects with good accuracy and is robust against noise.
30

Liang, Kun, Guoqiang Zhang, Jinhui Guo, and Wentao Li. "An Actor-Critic Hierarchical Reinforcement Learning Model for Course Recommendation." Electronics 12, no. 24 (December 8, 2023): 4939. http://dx.doi.org/10.3390/electronics12244939.

Abstract:
Online learning platforms provide diverse course resources, but this often results in the issue of information overload. Learners always want to learn courses that are appropriate for their knowledge level and preferences quickly and accurately. Effective course recommendation plays a key role in helping learners select appropriate courses and improving the efficiency of online learning. However, when a user is enrolled in multiple courses, existing course recommendation methods face the challenge of accurately recommending the target course that is most relevant to the user because of the noise courses. In this paper, we propose a novel reinforcement learning model named Actor-Critic Hierarchical Reinforcement Learning (ACHRL). The model incorporates the actor-critic method to construct the profile reviser. This can remove noise courses and make personalized course recommendations effectively. Furthermore, we propose a policy gradient based on the temporal difference error to reduce the variance in the training process, to speed up the convergence of the model, and to improve the accuracy of the recommendation. We evaluate the proposed model using two real datasets, and the experimental results show that the proposed model significantly outperforms the existing recommendation models (improving 3.77% to 13.66% in terms of HR@5).
31

Kwon, Ki-Young, Keun-Woo Jung, Dong-Su Yang, and Jooyoung Park. "Autonomous Vehicle Path Tracking Based on Natural Gradient Methods." Journal of Advanced Computational Intelligence and Intelligent Informatics 16, no. 7 (November 20, 2012): 888–93. http://dx.doi.org/10.20965/jaciii.2012.p0888.

Abstract:
Recently, reinforcement learning and evolution strategies have become major tools in the field of machine learning and have shown excellent performance in various engineering problems. In particular, the Natural Actor-Critic (NAC) approach and the Natural Evolution Strategies (NES) have led to considerable interest in the area of natural-gradient-based machine learning methods with many successful applications. In this paper, we apply the NAC and the NES to path-tracking control problems for autonomous vehicles. Simulation results show that these methods can yield better performance compared to conventional PID controllers.
32

Li, Yarong. "Sequence Alignment with Q-Learning Based on the Actor-Critic Model." ACM Transactions on Asian and Low-Resource Language Information Processing 20, no. 5 (July 2, 2021): 1–7. http://dx.doi.org/10.1145/3433540.

Abstract:
Multiple sequence alignment methods refer to a series of algorithmic solutions for the alignment of evolutionarily related sequences while taking into account evolutionary events such as mutations, insertions, deletions, and rearrangements under certain conditions. In this article, we propose a method with Q-learning based on the Actor-Critic model for sequence alignment. We transform the sequence alignment problem into an agent's autonomous learning process. In this process, the reward of the possible next action taken is calculated, and the cumulative reward of the entire process is computed. The results show that the proposed method is better than the genetic algorithm and the dynamic programming method.
33

Jiang, Liang, Ying Nan, Yu Zhang, and Zhihan Li. "Anti-Interception Guidance for Hypersonic Glide Vehicle: A Deep Reinforcement Learning Approach." Aerospace 9, no. 8 (August 4, 2022): 424. http://dx.doi.org/10.3390/aerospace9080424.

Abstract:
Anti-interception guidance can enhance the ability of a hypersonic glide vehicle (HGV) to evade multiple interceptors. In general, anti-interception guidance for aircraft can be divided into procedural guidance, fly-around guidance, and active evading guidance. However, these guidance methods cannot be applied to an HGV's unknown real-time process due to limited intelligence information or on-board computing abilities. In this paper, an anti-interception guidance approach based on deep reinforcement learning (DRL) is proposed. First, the penetration process is conceptualized as a generalized three-body adversarial optimal (GTAO) problem. The problem is then modelled as a Markov decision process (MDP), and a DRL scheme consisting of an actor-critic architecture is designed to solve this. Reusing the same sample batch during training results in fewer serious estimation errors in the critic network (CN), which provides better gradients to the immature actor network (AN). We propose a new mechanism called repetitive batch training (RBT). In addition, the training data and test results confirm that RBT can improve on the traditional DDPG-based methods.
34

Likmeta, Amarildo, Matteo Sacco, Alberto Maria Metelli, and Marcello Restelli. "Wasserstein Actor-Critic: Directed Exploration via Optimism for Continuous-Actions Control." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 7 (June 26, 2023): 8782–90. http://dx.doi.org/10.1609/aaai.v37i7.26056.

Abstract:
Uncertainty quantification has been extensively used as a means to achieve efficient directed exploration in Reinforcement Learning (RL). However, state-of-the-art methods for continuous actions still suffer from high sample complexity requirements. Indeed, they either completely lack strategies for propagating the epistemic uncertainty throughout the updates, or they mix it with aleatoric uncertainty while learning the full return distribution (e.g., distributional RL). In this paper, we propose Wasserstein Actor-Critic (WAC), an actor-critic architecture inspired by the recent Wasserstein Q-Learning (WQL), that employs approximate Q-posteriors to represent the epistemic uncertainty and Wasserstein barycenters for uncertainty propagation across the state-action space. WAC enforces exploration in a principled way by guiding the policy learning process with the optimization of an upper bound of the Q-value estimates. Furthermore, we study some peculiar issues that arise when using function approximation coupled with uncertainty estimation, and propose a regularized loss for the uncertainty estimation. Finally, we evaluate our algorithm on standard MuJoCo tasks as well as a suite of continuous-action domains where exploration is crucial, in comparison with state-of-the-art baselines. Additional details and results can be found in the supplementary material with our Arxiv preprint.
35

Shi, Lei, Tian Li, Lin Wei, Yongcai Tao, Cuixia Li, and Yufei Gao. "FASTune: Towards Fast and Stable Database Tuning System with Reinforcement Learning." Electronics 12, no. 10 (May 10, 2023): 2168. http://dx.doi.org/10.3390/electronics12102168.

Abstract:
Configuration tuning is vital to achieving high performance for a database management system (DBMS). Recently, automatic tuning methods using Reinforcement Learning (RL) have been explored to find better configurations compared with database administrators (DBAs) and heuristics. However, existing RL-based methods still have several limitations: (1) Excessive overhead due to reliance on cloned databases; (2) trial-and-error strategy may produce dangerous configurations that lead to database failure; (3) lack the ability to handle dynamic workload. To address the above challenges, a fast and stable RL-based database tuning system, FASTune, is proposed. A virtual environment is proposed to evaluate configurations which is an equivalent yet more efficient scheme than the cloned database. To ensure stability during tuning, FASTune adopts an environment proxy to avoid dangerous configurations. In addition, a Multi-State Soft Actor–Critic (MS-SAC) model is proposed to handle dynamic workloads, which utilizes the soft actor–critic network to tune the database according to workload and database states. The experimental results indicate that, compared with the state-of-the-art methods, FASTune can achieve improvements in performance while maintaining stability in the tuning.
36

Yu, Zhiwen, Wenjie Zheng, Kaiwen Zeng, Ruifeng Zhao, Yanxu Zhang, and Mengdi Zeng. "Energy optimization management of microgrid using improved soft actor-critic algorithm." International Journal of Renewable Energy Development 13, no. 2 (February 20, 2024): 329–39. http://dx.doi.org/10.61435/ijred.2024.59988.

Abstract:
To tackle the challenges associated with variability and uncertainty in distributed power generation, as well as the complexities of solving high-dimensional energy management mathematical models in microgrid energy optimization, a microgrid energy optimization management method is proposed based on an improved soft actor-critic algorithm. In the proposed method, the improved soft actor-critic algorithm employs an entropy-based objective function to encourage target exploration without assigning significantly higher probabilities to any part of the action space, which can simplify the analysis process of distributed power generation variability and uncertainty while effectively mitigating the convergence fragility issues in solving the high-dimensional mathematical model of microgrid energy management. The effectiveness of the proposed method is validated through a case study analysis of microgrid energy optimization management. The results revealed an increase of 51.20%, 52.38%, 13.43%, 16.50%, 58.26%, and 36.33% in the total profits of a microgrid compared with the Deep Q-network algorithm, the state-action-reward-state-action algorithm, the proximal policy optimization algorithm, the ant-colony based algorithm, a microgrid energy optimization management strategy based on the genetic algorithm and the fuzzy inference system, and the theoretical retailer strategy, respectively. Additionally, compared with other methods and strategies, the proposed method can learn more optimal microgrid energy management behaviors and anticipate fluctuations in electricity prices and demand.
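For orientation, the entropy-based objective mentioned in the abstract augments each step's reward with a temperature-weighted entropy bonus, as in standard soft actor-critic methods; a minimal sketch with an assumed temperature value:

```python
import numpy as np

def entropy_regularized_return(rewards, action_log_probs, alpha=0.2):
    """Sum over a trajectory of r_t + alpha * H, with the per-step entropy
    estimated as -log pi(a_t | s_t).  alpha is the temperature (assumed)."""
    r = np.asarray(rewards, dtype=float)
    logp = np.asarray(action_log_probs, dtype=float)
    return float(np.sum(r - alpha * logp))

print(entropy_regularized_return([1.0, 0.5, 0.2], [-1.2, -0.7, -0.3]))
```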
37

Ismail, Ahmed, and Mustafa Baysal. "Dynamic Pricing Based on Demand Response Using Actor–Critic Agent Reinforcement Learning." Energies 16, no. 14 (July 19, 2023): 5469. http://dx.doi.org/10.3390/en16145469.

Abstract:
Eco-friendly technologies for sustainable energy development require the efficient utilization of energy resources. Real-time pricing (RTP), also known as dynamic pricing, offers advantages over other pricing systems by enabling demand response (DR) actions. However, existing methods for determining and controlling DR have limitations in managing an increasing demand and predicting future pricing. This paper presents a novel approach to address the limitations of existing methods for determining and controlling demand response (DR) in the context of dynamic pricing systems for sustainable energy development. By leveraging actor–critic agent reinforcement learning (RL) techniques, a dynamic pricing DR model is proposed for efficient energy management. The model’s learning framework was trained using DR and real-time pricing data extracted from the Australian Energy Market Operator (AEMO) spanning a period of 17 years. The efficacy of the RL-based dynamic pricing approach was evaluated through two predicting cases: actual-predicted demand and actual-predicted price. Initially, long short-term memory (LSTM) models were employed to predict price and demand, and the results were subsequently enhanced using the deep RL model. Remarkably, the proposed approach achieved an impressive accuracy of 99% for every 30 min future price prediction. The results demonstrated the efficiency of the proposed RL-based model in accurately predicting both demand and price for effective energy management.
38

Drechsler, M. Funk, T. A. Fiorentin, and H. Göllinger. "Actor-Critic Traction Control Based on Reinforcement Learning with Open-Loop Training." Modelling and Simulation in Engineering 2021 (December 7, 2021): 1–10. http://dx.doi.org/10.1155/2021/4641450.

Abstract:
The use of actor-critic algorithms can improve the controllers currently implemented in automotive applications. This method combines reinforcement learning (RL) and neural networks to achieve the possibility of controlling nonlinear systems with real-time capabilities. Actor-critic algorithms have already been applied with success in different controllers, including autonomous driving, antilock braking system (ABS), and electronic stability control (ESC). However, in current research, virtual environments are implemented for the training process instead of using real plants to obtain the datasets. This limitation stems from the trial-and-error methods implemented for the training process, which generate considerable risks in case the controller directly acts on the real plant. In this way, the present research proposes and evaluates an open-loop training process, which permits data acquisition without the control interaction and an open-loop training of the neural networks. The performance of the trained controllers is evaluated by a design of experiments (DOE) to understand how it is affected by the generated dataset. The results present a successful application of the open-loop training architecture. The controller can maintain the slip ratio at adequate levels during maneuvers on different floors, including grounds that are not used during the training process. The actor neural network is also able to identify the different floors and change the acceleration profile according to the characteristics of each ground.
39

Wu, Jiying, Zhong Yang, Haoze Zhuo, Changliang Xu, Chi Zhang, Naifeng He, Luwei Liao, and Zhiyong Wang. "A Supervised Reinforcement Learning Algorithm for Controlling Drone Hovering." Drones 8, no. 3 (February 20, 2024): 69. http://dx.doi.org/10.3390/drones8030069.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The application of drones carrying different devices for aerial hovering operations is becoming increasingly widespread, but there is currently very little research relying on reinforcement learning methods for hovering control, and such methods have not been implemented on physical machines. The drone's behavior space for hover control is continuous and large-scale, making it difficult for basic algorithms and value-based reinforcement learning (RL) algorithms to achieve good results. In response, this article applies a watcher-actor-critic (WAC) algorithm to drone hover control, which can quickly lock onto the exploration direction and achieve highly robust hover control while improving learning efficiency and reducing learning cost. The article first applies the actor-critic algorithm based on the behavioral value Q (QAC) and the deep deterministic policy gradient algorithm (DDPG) to drone hover control learning. It then proposes an actor-critic algorithm with an added watcher, in which the watcher, a PID controller whose parameters are provided by a neural network, acts as a dynamic monitor and transforms the learning process into supervised learning. Finally, the article uses a classic reinforcement learning environment library, Gym, and a mainstream reinforcement learning framework, PARL, for simulation, and deploys the algorithm in a practical environment. A multi-sensor-fusion-based autonomous localization method for unmanned aerial vehicles is used in the physical experiments. The simulation and experimental results show that the training episodes of WAC are reduced by 20% compared to DDPG and 55% compared to QAC, and that the proposed algorithm has higher learning efficiency, faster convergence, and a smoother hovering effect than QAC and DDPG.
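A toy sketch of how a PID "watcher" can turn actor training into supervised learning: the watcher's corrective action serves as the label for the actor network. Gains, the state layout and network sizes are invented for illustration and are not the paper's values.

import torch, torch.nn as nn

class PID:
    def __init__(self, kp=1.2, ki=0.05, kd=0.3):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_err = 0.0, 0.0
    def __call__(self, err, dt=0.02):
        self.integral += err * dt
        deriv = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

actor = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
watcher = PID()

for step in range(200):
    state = torch.randn(3)                                  # e.g. altitude error and its derivatives
    target = torch.tensor([watcher(state[0].item())])       # watcher's corrective action as the label
    loss = nn.functional.mse_loss(actor(state), target)     # supervised imitation of the PID watcher
    opt.zero_grad(); loss.backward(); opt.step()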
40

Qian, Tiancheng, Xue Mei, Pengxiang Xu, Kangqi Ge, and Zhelei Qiu. "Filtration network: A frame sampling strategy via deep reinforcement learning for video captioning." Journal of Intelligent & Fuzzy Systems 40, no. 6 (June 21, 2021): 11085–97. http://dx.doi.org/10.3233/jifs-202249.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Recently, many video captioning methods have adopted an encoder-decoder framework to translate short videos into natural language. These methods usually use equal-interval frame sampling. However, such sampling is inefficient: it introduces high temporal and spatial redundancy and thus unnecessary computational cost. In addition, existing approaches simply concatenate different visual features at the fully connected layer, so the features cannot be utilized effectively. To address these defects, we propose a filtration network (FN) that selects key frames and is trained with the deep reinforcement learning algorithm actor-double-critic. Drawing on behavioral psychology, the core idea of actor-double-critic is that the behavior of an agent is determined by both the external environment and its internal personality. This avoids unclear rewards and sparse feedback during training, because the agent receives steady feedback after each action. The key frames are sent to a combine codec network (CCN) to generate sentences. The feature-combination operation in the CCN fuses visual features through a complex-number representation for better semantic modeling. Experiments and comparisons with other methods on two datasets (MSVD/MSR-VTT) show that our approach achieves better performance on four metrics: BLEU-4, METEOR, ROUGE-L, and CIDEr.
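For illustration only, a generic keep/skip frame-selection policy trained with a REINFORCE-style signal; the paper's actor-double-critic training and combine codec network are not reproduced here, and all dimensions and the reward placeholder are assumptions.

import torch, torch.nn as nn
from torch.distributions import Bernoulli

frame_feat_dim, n_frames = 512, 30
selector = nn.Sequential(nn.Linear(frame_feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

frames = torch.randn(n_frames, frame_feat_dim)      # stand-in for per-frame CNN features
keep_prob = torch.sigmoid(selector(frames)).squeeze(-1)
dist = Bernoulli(probs=keep_prob)
mask = dist.sample()                                # 1 = key frame kept, 0 = frame discarded
reward = torch.rand(())                             # placeholder: quality of the resulting caption
loss = -(dist.log_prob(mask).sum() * reward)        # REINFORCE-style learning signal
loss.backward()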
41

Wang, Xinshui, Ke Meng, Xu Wang, Zhibin Liu, and Yuefeng Ma. "Dynamic User Resource Allocation for Downlink Multicarrier NOMA with an Actor–Critic Method." Energies 16, no. 7 (March 24, 2023): 2984. http://dx.doi.org/10.3390/en16072984.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Future wireless communication systems face ever higher performance requirements. Against this background, we study the combined optimization of power allocation and dynamic user pairing in a downlink multicarrier non-orthogonal multiple-access (NOMA) system, aiming to maximize the sum rate of the overall system. Owing to the complex coupling of the variables, obtaining an optimal solution is difficult and time-consuming, which makes it impractical in engineering. To circumvent these difficulties and obtain a sub-optimal solution, we decompose the optimization problem into two sub-problems. First, a closed-form expression for the optimal power allocation is derived for a given subchannel allocation. Then, we obtain the user-pairing scheme using the actor–critic (AC) algorithm. As a promising approach to otherwise exhaustive-search problems, deep reinforcement learning (DRL) possesses higher learning ability and better self-adaptive capability than traditional optimization methods. Simulation results demonstrate that our method has significant advantages over traditional methods and other deep-learning algorithms and improves the communication performance of NOMA transmission.
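A generic single-step sketch of a discrete actor–critic update of the kind used for the pairing sub-problem: the actor outputs a softmax over candidate pairings and the critic estimates the state value. The reward is a random placeholder standing in for the sum rate returned by the closed-form power allocation, and all dimensions are assumptions.

import torch, torch.nn as nn
from torch.distributions import Categorical

n_pairings, state_dim = 10, 6
actor  = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_pairings))
critic = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

state = torch.randn(state_dim)                  # e.g. current channel conditions of the users
dist = Categorical(logits=actor(state))         # softmax over candidate pairings
pairing = dist.sample()
reward = torch.rand(())                         # placeholder: sum rate of the sampled pairing

advantage = reward - critic(state).squeeze()
policy_loss = -dist.log_prob(pairing) * advantage.detach()
value_loss = advantage.pow(2)
opt.zero_grad()
(policy_loss + value_loss).backward()
opt.step()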
42

Melo, Francisco. "Differential Eligibility Vectors for Advantage Updating and Gradient Methods." Proceedings of the AAAI Conference on Artificial Intelligence 25, no. 1 (August 4, 2011): 441–46. http://dx.doi.org/10.1609/aaai.v25i1.7938.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In this paper we propose differential eligibility vectors (DEV) for temporal-difference (TD) learning, a new class of eligibility vectors designed to bring out the contribution of each action to the TD-error at each state. Specifically, we use DEV in TD-Q(lambda) to more accurately learn the relative value of the actions, rather than their absolute value. We identify conditions that ensure convergence with probability 1 of TD-Q(lambda) with DEV and show that this algorithm can also be used to directly approximate the advantage function associated with a given policy, without the need to compute an auxiliary function, something that, to the best of our knowledge, was not previously known to be possible. Finally, we discuss the integration of DEV in LSTDQ and actor-critic algorithms.
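For reference, the standard accumulating eligibility-trace update for TD-Q(lambda) with function approximation, and the advantage function the paper targets, in textbook form (the paper's DEV update modifies the trace and is not reproduced here):

\delta_t = r_{t+1} + \gamma\, Q_\theta(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t), \qquad
z_t = \gamma \lambda\, z_{t-1} + \nabla_\theta Q_\theta(s_t, a_t), \qquad
\theta_{t+1} = \theta_t + \alpha\, \delta_t\, z_t, \qquad
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).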
43

Lyu, Xueguang, Andrea Baisero, Yuchen Xiao, Brett Daley, and Christopher Amato. "On Centralized Critics in Multi-Agent Reinforcement Learning." Journal of Artificial Intelligence Research 77 (May 31, 2023): 295–354. http://dx.doi.org/10.1613/jair.1.14386.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Centralized Training for Decentralized Execution, where agents are trained offline in a centralized fashion and execute online in a decentralized manner, has become a popular approach in Multi-Agent Reinforcement Learning (MARL). In particular, it has become popular to develop actor-critic methods that train decentralized actors with a centralized critic, where the centralized critic is allowed access to global information about the entire system, including the true system state. Such centralized critics are possible given offline information and are not used during online execution. While these methods perform well in a number of domains and have become a de facto standard in MARL, using a centralized critic in this context has yet to be sufficiently analyzed theoretically or empirically. In this paper, we therefore formally analyze centralized and decentralized critic approaches and study the effect of using state-based critics in partially observable environments. We derive results contrary to common intuition: critic centralization is not strictly beneficial, and using state values can be harmful. In particular, we prove that state-based critics can introduce unexpected bias and variance compared to history-based critics. Finally, we demonstrate how the theory applies in practice by comparing different forms of critics on a wide range of common multi-agent benchmarks. The experiments reveal practical issues, such as the difficulty of representation learning under partial observability, which highlight why these theoretical problems are often overlooked in the literature.
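A minimal illustration of the distinction analysed in the paper: decentralized actors conditioned on local observations, paired with either a centralized critic over the joint information or per-agent critics. Shapes and names are invented for the example.

import torch, torch.nn as nn

n_agents, obs_dim, n_actions = 2, 8, 4
actors = [nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
          for _ in range(n_agents)]

# centralized critic: sees every agent's observation (or the true state) at training time
central_critic = nn.Sequential(nn.Linear(n_agents * obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# decentralized (history/observation-based) critics: one per agent, local information only
local_critics = [nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 1))
                 for _ in range(n_agents)]

obs = [torch.randn(obs_dim) for _ in range(n_agents)]
action_logits = [pi(o) for pi, o in zip(actors, obs)]      # decentralized action selection
v_central = central_critic(torch.cat(obs))                 # one value for the joint system
v_local = [c(o) for c, o in zip(local_critics, obs)]       # one value per agent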
44

Zhao, Mingjun, Haijiang Wu, Di Niu, and Xiaoli Wang. "Reinforced Curriculum Learning on Pre-Trained Neural Machine Translation Models." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 9652–59. http://dx.doi.org/10.1609/aaai.v34i05.6513.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The competitive performance of neural machine translation (NMT) critically relies on large amounts of training data. However, acquiring high-quality translation pairs requires expert knowledge and is costly. Therefore, how to best utilize a given dataset of samples with diverse quality and characteristics becomes an important yet understudied question in NMT. Curriculum learning methods have been introduced to NMT to optimize a model's performance by prescribing the data input order, based on heuristics such as the assessment of noise and difficulty levels. However, existing methods require training from scratch, while in practice most NMT models are pre-trained on big data already. Moreover, as heuristics, they do not generalize well. In this paper, we aim to learn a curriculum for improving a pre-trained NMT model by re-selecting influential data samples from the original training set and formulate this task as a reinforcement learning problem. Specifically, we propose a data selection framework based on Deterministic Actor-Critic, in which a critic network predicts the expected change of model performance due to a certain sample, while an actor network learns to select the best sample out of a random batch of samples presented to it. Experiments on several translation datasets show that our method can further improve the performance of NMT when original batch training reaches its ceiling, without using additional new training data, and significantly outperforms several strong baseline methods.
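A toy sketch of the critic-scores-samples idea, assuming simple per-sample feature vectors: the critic predicts the expected change in model performance for a candidate sample, while the actor ranks a random batch and deterministically picks one. Feature sizes and the scoring inputs are assumptions, not the paper's design.

import torch, torch.nn as nn

feat_dim, batch = 16, 64                            # per-sample features (e.g. loss, length, domain stats)
critic = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor  = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

candidates = torch.randn(batch, feat_dim)           # a random batch of candidate training pairs
scores = actor(candidates).squeeze(-1)              # actor ranks the candidates
chosen = scores.argmax()                            # deterministically pick the best one
predicted_gain = critic(candidates[chosen])         # critic's estimate of the performance change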
45

Zhao, Jun, Qingliang Zeng, and Bin Guo. "Adaptive Critic Learning-Based Robust Control of Systems with Uncertain Dynamics." Computational Intelligence and Neuroscience 2021 (November 16, 2021): 1–8. http://dx.doi.org/10.1155/2021/2952115.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Model uncertainties are usually unavoidable in control systems; they are caused by imperfect system modeling, disturbances, and nonsmooth dynamics. This paper presents a novel method for the robust control problem of uncertain systems. The original robust control problem of the uncertain system is first transformed into an optimal control problem for the nominal system by selecting an appropriate cost function. We then develop an adaptive critic learning algorithm that learns the optimal control solution online, in which only the critic neural network (NN) is used and the actor NN widely used in existing methods is removed. Finally, a feasibility analysis of the control algorithm is given. Simulation results show the effectiveness of the presented control method.
46

Yue, Longfei, Rennong Yang, Jialiang Zuo, Mengda Yan, Xiaoru Zhao, and Maolong Lv. "Factored Multi-Agent Soft Actor-Critic for Cooperative Multi-Target Tracking of UAV Swarms." Drones 7, no. 3 (February 22, 2023): 150. http://dx.doi.org/10.3390/drones7030150.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In recent years, significant progress has been made in the multi-target tracking (MTT) of unmanned aerial vehicle (UAV) swarms. Most existing MTT approaches rely on the ideal assumption of a pre-set target trajectory. In practice, however, the trajectory of a moving target cannot be known by the UAV in advance, which poses a great challenge for real-time tracking. Meanwhile, state-of-the-art multi-agent value-based methods have advanced considerably on cooperative tasks, whereas multi-agent actor-critic (MAAC) methods suffer from high variance and credit assignment issues. To address these issues, this paper proposes a learning-based factored multi-agent soft actor-critic (FMASAC) scheme under the maximum entropy framework, in which the UAV swarm learns cooperative MTT in an unknown environment. The method introduces the idea of value decomposition into the MAAC setting to reduce the variance of policy updates and learn efficient credit assignment. Moreover, to further increase the detection and tracking coverage of the UAV swarm, a spatial entropy reward (SER), inspired by the concept of spatial entropy, is introduced. Experiments demonstrate that FMASAC significantly improves the cooperative MTT performance of a UAV swarm and outperforms existing baselines in terms of mean reward and tracking success rate. The proposed scheme also scales well as the number of UAVs and targets increases.
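A compressed sketch of the value-decomposition idea the scheme builds on: a joint critic formed by summing per-UAV utilities, VDN-style, so gradients (and hence credit) flow back to each agent. The entropy-regularised soft-actor-critic machinery and the spatial entropy reward are omitted, and all shapes are assumptions.

import torch, torch.nn as nn

n_uavs, obs_dim, act_dim = 3, 10, 2
per_uav_q = [nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.ReLU(), nn.Linear(32, 1))
             for _ in range(n_uavs)]

obs  = [torch.randn(obs_dim) for _ in range(n_uavs)]
acts = [torch.randn(act_dim) for _ in range(n_uavs)]
q_i = [q(torch.cat([o, a])) for q, o, a in zip(per_uav_q, obs, acts)]
q_tot = torch.stack(q_i).sum()     # factored joint critic: Q_tot = sum_i Q_i
q_tot.backward()                   # gradients flow back to every per-UAV critic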
47

Zhou, Kun, Wenyong Wang, Teng Hu, and Kai Deng. "Application of Improved Asynchronous Advantage Actor Critic Reinforcement Learning Model on Anomaly Detection." Entropy 23, no. 3 (February 25, 2021): 274. http://dx.doi.org/10.3390/e23030274.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Anomaly detection research has traditionally relied on mathematical and statistical methods, and the topic has been widely applied in many fields. Recently, reinforcement learning has achieved exceptional success in areas such as AlphaGo's game of Go and video game playing. However, little research has applied reinforcement learning to anomaly detection. This paper therefore proposes an adaptable asynchronous advantage actor-critic (A3C) reinforcement learning model for this field. Its performance was evaluated and compared with classical machine learning models and generative adversarial models with variants. The basic principles of the related models are introduced first, followed by the problem definitions, modelling process, and testing. The proposed model distinguishes sequence and image anomalies from other anomalies by using an attention-based network and a convolutional network for the two kinds of anomalies, respectively. Finally, performance was evaluated and compared with classical models on public benchmark datasets (NSL-KDD, AWID, CICIDS-2017, and DoHBrw-2020). Experiments confirmed the effectiveness of the proposed model, with higher rewards and lower loss rates on the datasets during training and testing. Precision, recall, and F1 score were higher than, or at least comparable to, those of state-of-the-art models. We conclude that the proposed model can outperform, or at least match, existing anomaly detection models.
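For context, the generic advantage actor-critic loss that A3C optimises (a policy gradient weighted by the advantage, a value-regression term, and an entropy bonus), shown synchronously on a toy batch; the network sizes and data are illustrative, not the paper's detector architecture.

import torch, torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 20, 2                      # e.g. traffic features -> {normal, anomaly}
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
value  = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

obs = torch.randn(32, obs_dim)
returns = torch.randn(32)                       # placeholder discounted returns
dist = Categorical(logits=policy(obs))
actions = dist.sample()

advantage = returns - value(obs).squeeze(-1)
policy_loss   = -(dist.log_prob(actions) * advantage.detach()).mean()
value_loss    = advantage.pow(2).mean()
entropy_bonus = dist.entropy().mean()
loss = policy_loss + 0.5 * value_loss - 0.01 * entropy_bonus
loss.backward()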
48

Lu, Junqi, Xinning Wu, Su Cao, Xiangke Wang, and Huangchao Yu. "An Implementation of Actor-Critic Algorithm on Spiking Neural Network Using Temporal Coding Method." Applied Sciences 12, no. 20 (October 16, 2022): 10430. http://dx.doi.org/10.3390/app122010430.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Taking advantage of the faster speed, lower resource consumption, and better biological interpretability of spiking neural networks, this paper develops a novel spiking neural network reinforcement learning method using an actor-critic architecture and temporal coding. A simple improved leaky integrate-and-fire (LIF) model is used to describe the behavior of a spiking neuron. The actor-critic network structure and the update formulas using temporally encoded information are then provided. The model is examined on a decision-making task, a gridworld task, a UAV flying-through-a-window task, and a flying-basketball-avoidance task. In the 5 × 5 grid map, the learned value function was close to the ideal one and the quickest path from one state to another was found. A UAV trained by this method was able to fly through the window quickly in simulation, and an actual flight test of a UAV avoiding a flying basketball was conducted. With this model, the success rate of the test was 96% and the average decision time was 41.3 ms. The results show the effectiveness and accuracy of the temporally coded spiking neural network RL method. In conclusion, this work attempts to provide insights into developing spiking neural network reinforcement learning methods for decision-making and autonomous control of unmanned systems.
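The discrete-time leaky integrate-and-fire dynamics that such spiking actors and critics are built from, in their simplest form; the time constant, threshold and input are illustrative, and the paper's improved LIF variant is not reproduced.

import numpy as np

def lif_spike_times(input_current, dt=1e-3, tau=20e-3, v_thresh=1.0, v_reset=0.0):
    v, spikes = 0.0, []
    for t, i_t in enumerate(input_current):
        v += dt / tau * (-v + i_t)        # leaky integration of the input
        if v >= v_thresh:                 # a threshold crossing emits a spike...
            spikes.append(t * dt)         # ...whose timing carries the signal (temporal coding)
            v = v_reset                   # ...and the membrane potential resets
    return spikes

print(lif_spike_times(np.full(200, 1.5)))  # constant drive -> a regular spike train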
49

Oh, Sang Ho, Jeongyoon Kim, Jae Hoon Nah, and Jongyoul Park. "Employing Deep Reinforcement Learning to Cyber-Attack Simulation for Enhancing Cybersecurity." Electronics 13, no. 3 (January 30, 2024): 555. http://dx.doi.org/10.3390/electronics13030555.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In the current landscape, where cybersecurity threats are escalating in complexity and frequency, traditional defense mechanisms such as rule-based firewalls and signature-based detection are proving inadequate. The dynamism and sophistication of modern cyber-attacks call for advanced solutions that can evolve and adapt in real time. Deep reinforcement learning (DRL), a branch of artificial intelligence, has been effective at complex decision-making problems across various domains, including cybersecurity. In this study, we implement a DRL framework to simulate cyber-attacks, drawing on authentic scenarios to enhance the realism and applicability of the simulations. By adapting DRL algorithms to the requirements of cybersecurity contexts, such as custom reward structures and actions, adversarial training, and dynamic environments, we provide a tailored approach that improves significantly on traditional methods. We conduct a comparative analysis of three DRL algorithms, deep Q-network (DQN), actor–critic, and proximal policy optimization (PPO), against the traditional RL algorithm Q-learning, within a controlled simulation environment reflective of real-world cyber threats. The actor–critic algorithm not only outperformed its counterparts with a success rate of 0.78 but also demonstrated superior efficiency, requiring the fewest iterations (171) to complete an episode and achieving the highest average reward of 4.8; DQN, PPO, and Q-learning lagged slightly behind. These results underscore the importance of selecting the most suitable algorithm for cybersecurity simulations, as the right choice leads to more effective learning and defense strategies. The performance of the actor–critic algorithm marks a step towards adaptive, intelligent cybersecurity systems capable of countering an increasingly sophisticated threat landscape. The study contributes a robust model for simulating cyber threats and a scalable framework that can be adapted to various cybersecurity challenges.
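For reference, the textbook tabular Q-learning update used as the traditional baseline in comparisons like this one; the random "environment" below is only a stand-in, not the paper's attack simulator.

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

s = 0
for step in range(5000):
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())  # epsilon-greedy
    s_next = rng.integers(n_states)             # placeholder transition
    r = 1.0 if s_next == n_states - 1 else 0.0  # placeholder reward (reaching a goal state)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])                 # Q-learning update
    s = 0 if s_next == n_states - 1 else s_next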
50

Sun, Zhiyao, and Guifen Chen. "Enhancing Heterogeneous Network Performance: Advanced Content Popularity Prediction and Efficient Caching." Electronics 13, no. 4 (February 18, 2024): 794. http://dx.doi.org/10.3390/electronics13040794.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
With the popularity of smart devices and the growth of high-bandwidth applications, the wireless industry is facing a surge in data traffic. This challenge exposes the limitations of traditional edge-caching solutions, especially in content-caching effectiveness and network-communication latency. To address this problem, we investigate efficient caching strategies in heterogeneous network environments, where the caching decision process is complicated by the heterogeneity of the environment and the diversity of user behaviors and content requests. To counter the increased system latency caused by dynamically changing content popularity and limited cache capacity, we propose a novel content placement strategy, a long short-term memory (LSTM) content-popularity-prediction model, which captures the correlations between the request patterns of different contents and their periodicity in the time domain, thereby improving the accuracy of content-popularity prediction. To cope with the heterogeneity of the network environment, we then propose an efficient content delivery strategy, a multi-agent actor–critic collaborative caching policy, which models the edge-caching problem in heterogeneous scenarios as a Markov decision process using multi-base-station environment information. To make full use of the multi-agent information, we improve the actor–critic approach by integrating an attention mechanism into the neural network: the actor network makes decisions based on local information, while the critic network evaluates and enhances the actor's performance. Extensive simulations show that the LSTM content-popularity-prediction model is more accurate than several existing methods, improving the prediction error by 28.61%, and that the proposed multi-agent actor–critic collaborative caching policy improves the cache hit rate by up to 32.3% and reduces system latency by 1.6%, demonstrating the feasibility and effectiveness of the algorithm.
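A minimal sketch of folding an attention layer into the critic so it can weigh information from multiple base stations before valuing the joint caching state; dimensions and the feature semantics are assumptions, not the paper's.

import torch, torch.nn as nn

n_bs, feat_dim = 4, 32                                 # base stations and per-BS features
attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=4, batch_first=True)
critic_head = nn.Linear(feat_dim, 1)

bs_feats = torch.randn(1, n_bs, feat_dim)              # cache states / request statistics per BS
context, weights = attn(bs_feats, bs_feats, bs_feats)  # self-attention across base stations
value = critic_head(context.mean(dim=1))               # critic value for the joint caching state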
