Academic literature on the topic 'Actor-critic algorithm'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Actor-critic algorithm.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Journal articles on the topic "Actor-critic algorithm"

1

Wang, Jing, and Ioannis Ch Paschalidis. "An Actor-Critic Algorithm With Second-Order Actor and Critic." IEEE Transactions on Automatic Control 62, no. 6 (June 2017): 2689–703. http://dx.doi.org/10.1109/tac.2016.2616384.

2

Zheng, Liyuan, Tanner Fiez, Zane Alumbaugh, Benjamin Chasnov, and Lillian J. Ratliff. "Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 9217–24. http://dx.doi.org/10.1609/aaai.v36i8.20908.

Abstract:
The hierarchical interaction between the actor and critic in actor-critic based reinforcement learning algorithms naturally lends itself to a game-theoretic interpretation. We adopt this viewpoint and model the actor and critic interaction as a two-player general-sum game with a leader-follower structure known as a Stackelberg game. Given this abstraction, we propose a meta-framework for Stackelberg actor-critic algorithms where the leader player follows the total derivative of its objective instead of the usual individual gradient. From a theoretical standpoint, we develop a policy gradient theorem for the refined update and provide a local convergence guarantee for the Stackelberg actor-critic algorithms to a local Stackelberg equilibrium. From an empirical standpoint, we demonstrate via simple examples that the learning dynamics we study mitigate cycling and accelerate convergence compared to the usual gradient dynamics given cost structures induced by actor-critic formulations. Finally, extensive experiments on OpenAI gym environments show that Stackelberg actor-critic algorithms always perform at least as well and often significantly outperform the standard actor-critic algorithm counterparts.
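As a pointer for readers, the "total derivative" update described above can be written with the implicit function theorem. In the notation below (ours, not necessarily the paper's), the actor/leader minimizes $L_1(\theta, w)$ while the critic/follower minimizes $L_2(\theta, w)$, and the leader's update direction is roughly

$$
\frac{\mathrm{d}L_1}{\mathrm{d}\theta} = \nabla_\theta L_1 - \nabla^2_{\theta w} L_2 \,\bigl(\nabla^2_{w w} L_2\bigr)^{-1} \nabla_w L_1 ,
$$

i.e., the usual individual gradient $\nabla_\theta L_1$ plus a correction term that anticipates how the critic's best response $w^\ast(\theta)$ (the solution of $\nabla_w L_2 = 0$) shifts when $\theta$ changes.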
3

Iwaki, Ryo, and Minoru Asada. "Implicit incremental natural actor critic algorithm." Neural Networks 109 (January 2019): 103–12. http://dx.doi.org/10.1016/j.neunet.2018.10.007.

4

Kim, Gi-Soo, Jane P. Kim, and Hyun-Joon Yang. "Robust Tests in Online Decision-Making." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 9 (June 28, 2022): 10016–24. http://dx.doi.org/10.1609/aaai.v36i9.21240.

Abstract:
Bandit algorithms are widely used in sequential decision problems to maximize the cumulative reward. One potential application is mobile health, where the goal is to promote the user's health through personalized interventions based on user specific information acquired through wearable devices. Important considerations include the type of, and frequency with which data is collected (e.g. GPS, or continuous monitoring), as such factors can severely impact app performance and users’ adherence. In order to balance the need to collect data that is useful with the constraint of impacting app performance, one needs to be able to assess the usefulness of variables. Bandit feedback data are sequentially correlated, so traditional testing procedures developed for independent data cannot apply. Recently, a statistical testing procedure was developed for the actor-critic bandit algorithm. An actor-critic algorithm maintains two separate models, one for the actor, the action selection policy, and the other for the critic, the reward model. The performance of the algorithm as well as the validity of the test are guaranteed only when the critic model is correctly specified. However, misspecification is frequent in practice due to incorrect functional form or missing covariates. In this work, we propose a modified actor-critic algorithm which is robust to critic misspecification and derive a novel testing procedure for the actor parameters in this case.
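To make the actor/critic split in the bandit setting concrete, here is a minimal Python sketch of a generic actor-critic contextual bandit: a linear "critic" (reward model) paired with a softmax "actor" (action-selection policy). The environment interface (`context()`, `reward()`), the linear and softmax choices, and the step sizes are our own illustrative assumptions; this is not the robust, misspecification-tolerant procedure proposed in the paper.

```python
import numpy as np

def actor_critic_bandit(env, dim, n_actions, steps=1000,
                        lr_actor=0.05, lr_critic=0.1, seed=0):
    """Generic actor-critic bandit sketch (illustrative only)."""
    rng = np.random.default_rng(seed)
    critic = np.zeros((n_actions, dim))   # per-action linear reward model
    theta = np.zeros((n_actions, dim))    # softmax policy parameters

    for _ in range(steps):
        x = env.context()                 # assumed: returns a feature vector of length dim
        logits = theta @ x
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(n_actions, p=probs)
        r = env.reward(x, a)              # assumed: returns the observed reward

        # Critic step: move the reward model toward the observed reward.
        critic[a] += lr_critic * (r - critic[a] @ x) * x

        # Actor step: policy gradient, using the critic's estimated reward
        # of the chosen action in place of the raw (noisy) reward.
        grad_log_pi = -np.outer(probs, x)
        grad_log_pi[a] += x               # rows: (1{k == a} - pi_k) * x
        theta += lr_actor * (critic[a] @ x) * grad_log_pi

    return theta, critic
```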
5

Denisov, Sergey, and Jee-Hyong Lee. "Actor-Critic Algorithm with Transition Cost Estimation." International Journal of Fuzzy Logic and Intelligent Systems 16, no. 4 (December 25, 2016): 270–75. http://dx.doi.org/10.5391/ijfis.2016.16.4.270.

6

Ahmed, Ayman Elshabrawy M. "Controller parameter tuning using actor-critic algorithm." IOP Conference Series: Materials Science and Engineering 610 (October 11, 2019): 012054. http://dx.doi.org/10.1088/1757-899x/610/1/012054.

7

Ding, Siyuan, Shengxiang Li, Guangyi Liu, Ou Li, Ke Ke, Yijie Bai, and Weiye Chen. "Decentralized Multiagent Actor-Critic Algorithm Based on Message Diffusion." Journal of Sensors 2021 (December 8, 2021): 1–14. http://dx.doi.org/10.1155/2021/8739206.

Abstract:
The exponential explosion of joint actions and massive data collection are two main challenges in multiagent reinforcement learning algorithms with centralized training. To overcome these problems, in this paper, we propose a model-free and fully decentralized actor-critic multiagent reinforcement learning algorithm based on message diffusion. To this end, the agents are assumed to be placed in a time-varying communication network. Each agent makes limited observations regarding the global state and joint actions; therefore, it needs to obtain and share information with others over the network. In the proposed algorithm, agents hold local estimations of the global state and joint actions and update them with local observations and the messages received from neighbors. Under the hypothesis of the global value decomposition, the gradient of the global objective function to an individual agent is derived. The convergence of the proposed algorithm with linear function approximation is guaranteed according to the stochastic approximation theory. In the experiments, the proposed algorithm was applied to a passive location task multiagent environment and achieved superior performance compared to state-of-the-art algorithms.
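For context, message-diffusion (consensus-style) updates of the kind described above typically have each agent mix its local estimate with its neighbours' estimates and then correct it with its own observation; schematically (our illustration, not necessarily the paper's exact rule),

$$
\hat z_i^{(t+1)} \;=\; \sum_{j \in \mathcal N_i(t)\,\cup\,\{i\}} W_{ij}(t)\,\hat z_j^{(t)} \;+\; \alpha_t\bigl(o_i^{(t)} - \hat z_i^{(t)}\bigr),
$$

where $\hat z_i$ is agent $i$'s estimate of the global quantity (state or joint action), $o_i$ its local observation, $\mathcal N_i(t)$ its neighbours in the time-varying communication network, and $W(t)$ a stochastic mixing matrix.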
8

Hafez, Muhammad Burhan, Cornelius Weber, Matthias Kerzel, and Stefan Wermter. "Deep intrinsically motivated continuous actor-critic for efficient robotic visuomotor skill learning." Paladyn, Journal of Behavioral Robotics 10, no. 1 (January 1, 2019): 14–29. http://dx.doi.org/10.1515/pjbr-2019-0005.

Abstract:
In this paper, we present a new intrinsically motivated actor-critic algorithm for learning continuous motor skills directly from raw visual input. Our neural architecture is composed of a critic and an actor network. Both networks receive the hidden representation of a deep convolutional autoencoder which is trained to reconstruct the visual input, while the centre-most hidden representation is also optimized to estimate the state value. Separately, an ensemble of predictive world models generates, based on its learning progress, an intrinsic reward signal which is combined with the extrinsic reward to guide the exploration of the actor-critic learner. Our approach is more data-efficient and inherently more stable than the existing actor-critic methods for continuous control from pixel data. We evaluate our algorithm for the task of learning robotic reaching and grasping skills on a realistic physics simulator and on a humanoid robot. The results show that the control policies learned with our approach can achieve better performance than the compared state-of-the-art and baseline algorithms in both dense-reward and challenging sparse-reward settings.
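A minimal sketch of the reward mixing described above, assuming learning progress is measured as the drop in the world-model ensemble's prediction error between two evaluation windows; the clipping, the mixing weight `beta`, and the function names are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def learning_progress_bonus(errors_before, errors_after):
    """Intrinsic reward: how much the ensemble's prediction error dropped
    (clipped at zero so that stalled learning yields no bonus)."""
    return max(float(np.mean(errors_before) - np.mean(errors_after)), 0.0)

def combined_reward(r_extrinsic, r_intrinsic, beta=0.5):
    """Reward actually fed to the actor-critic learner."""
    return r_extrinsic + beta * r_intrinsic
```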
9

Zhang, Haifei, Jian Xu, Jian Zhang, and Quan Liu. "Network Architecture for Optimizing Deep Deterministic Policy Gradient Algorithms." Computational Intelligence and Neuroscience 2022 (November 18, 2022): 1–10. http://dx.doi.org/10.1155/2022/1117781.

Abstract:
The traditional Deep Deterministic Policy Gradient (DDPG) algorithm has been widely used in continuous action spaces, but it still suffers from the problems of easily falling into local optima and large error fluctuations. Aiming at these deficiencies, this paper proposes a dual-actor-dual-critic DDPG algorithm (DN-DDPG). First, on the basis of the original actor-critic network architecture of the algorithm, a critic network is added to assist the training, and the smallest Q value of the two critic networks is taken as the estimated value of the action in each update. Reduce the probability of local optimal phenomenon; then, introduce the idea of dual-actor network to alleviate the underestimation of value generated by dual-evaluator network, and select the action with the greatest value in the two-actor networks to update to stabilize the training of the algorithm process. Finally, the improved method is validated on four continuous action tasks provided by MuJoCo, and the results show that the improved method can reduce the fluctuation range of error and improve the cumulative return compared with the classical algorithm.
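Read literally, the abstract suggests a target in which each actor proposes a next action, the more valuable proposal is kept, and the smaller of the two critic estimates is used to curb overestimation. Below is a hedged sketch of such a target computation; the callable `actors`/`critics` interface and the exact way the two ideas are composed are our assumptions based on the abstract, not the paper's code.

```python
import numpy as np

def dual_actor_dual_critic_target(r, s_next, done, actors, critics, gamma=0.99):
    """Illustrative dual-actor / dual-critic DDPG-style target (sketch only)."""
    a1, a2 = actors[0](s_next), actors[1](s_next)      # candidate next actions
    # Pessimistic value of each candidate: minimum over the two critics.
    q_a1 = np.minimum(critics[0](s_next, a1), critics[1](s_next, a1))
    q_a2 = np.minimum(critics[0](s_next, a2), critics[1](s_next, a2))
    q_next = np.maximum(q_a1, q_a2)                    # keep the better proposal
    return r + gamma * (1.0 - done) * q_next
```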
10

Jain, Arushi, Gandharv Patil, Ayush Jain, Khimya Khetarpal, and Doina Precup. "Variance Penalized On-Policy and Off-Policy Actor-Critic." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 7899–907. http://dx.doi.org/10.1609/aaai.v35i9.16964.

Abstract:
Reinforcement learning algorithms are typically geared towards optimizing the expected return of an agent. However, in many practical applications, low variance in the return is desired to ensure the reliability of an algorithm. In this paper, we propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return. Previous work uses the second moment of return to estimate the variance indirectly. Instead, we use a much simpler recently proposed direct variance estimator which updates the estimates incrementally using temporal difference methods. Using the variance-penalized criterion, we guarantee the convergence of our algorithm to locally optimal policies for finite state action Markov decision processes. We demonstrate the utility of our algorithm in tabular and continuous MuJoCo domains. Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
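In symbols, the variance-penalized criterion is of the form

$$
\max_\theta \; J(\theta) - \lambda\,\mathrm{Var}_\pi(G),
$$

where $G$ is the return and $\lambda \ge 0$ trades mean against variance. A direct variance estimator of the kind the abstract refers to maintains a separate variance critic $\nu$ and updates it with its own TD-style rule, e.g.

$$
\nu(s_t) \leftarrow \nu(s_t) + \beta\bigl(\delta_t^2 + \gamma^2\,\nu(s_{t+1}) - \nu(s_t)\bigr),
$$

with $\delta_t$ the ordinary TD error of the value critic (this is the generic shape of such estimators; the paper's exact update may differ).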

Dissertations / Theses on the topic "Actor-critic algorithm"

1

Konda, Vijaymohan (Vijaymohan Gao) 1973. "Actor-critic algorithms." Thesis, Massachusetts Institute of Technology, 2002. http://hdl.handle.net/1721.1/8120.

Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002.
Includes bibliographical references (leaves 143-147).
Many complex decision making problems like scheduling in manufacturing systems, portfolio management in finance, admission control in communication networks etc., with clear and precise objectives, can be formulated as stochastic dynamic programming problems in which the objective of decision making is to maximize a single "overall" reward. In these formulations, finding an optimal decision policy involves computing a certain "value function" which assigns to each state the optimal reward one would obtain if the system was started from that state. This function then naturally prescribes the optimal policy, which is to take decisions that drive the system to states with maximum value. For many practical problems, the computation of the exact value function is intractable, analytically and numerically, due to the enormous size of the state space. Therefore one has to resort to one of the following approximation methods to find a good sub-optimal policy: (1) Approximate the value function. (2) Restrict the search for a good policy to a smaller family of policies. In this thesis, we propose and study actor-critic algorithms which combine the above two approaches with simulation to find the best policy among a parameterized class of policies. Actor-critic algorithms have two learning units: an actor and a critic. An actor is a decision maker with a tunable parameter. A critic is a function approximator. The critic tries to approximate the value function of the policy used by the actor, and the actor in turn tries to improve its policy based on the current approximation provided by the critic. Furthermore, the critic evolves on a faster time-scale than the actor.
(cont.) We propose several variants of actor-critic algorithms. In all the variants, the critic uses Temporal Difference (TD) learning with linear function approximation. Some of the variants are inspired by a new geometric interpretation of the formula for the gradient of the overall reward with respect to the actor parameters. This interpretation suggests a natural set of basis functions for the critic, determined by the family of policies parameterized by the actor's parameters. We concentrate on the average expected reward criterion but we also show how the algorithms can be modified for other objective criteria. We prove convergence of the algorithms for problems with general (finite, countable, or continuous) state and decision spaces. To compute the rate of convergence (ROC) of our algorithms, we develop a general theory of the ROC of two-time-scale algorithms and we apply it to study our algorithms. In the process, we study the ROC of TD learning and compare it with related methods such as Least Squares TD (LSTD). We study the effect of the basis functions used for linear function approximation on the ROC of TD. We also show that the ROC of actor-critic algorithms does not depend on the actual basis functions used in the critic but depends only on the subspace spanned by them and study this dependence. Finally, we compare the performance of our algorithms with other algorithms that optimize over a parameterized family of policies. We show that when only the "natural" basis functions are used for the critic, the rate of convergence of the actor- critic algorithms is the same as that of certain stochastic gradient descent algorithms ...
by Vijaymohan Konda.
Ph.D.
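For orientation, the basic template studied in this thesis looks roughly like the sketch below: a linear TD(0) critic updated on a faster timescale and a parameterized (here softmax) actor updated on a slower one, with the TD error serving as the policy-gradient signal. The discounted formulation, the feature maps `phi`/`psi`, the environment interface, and the step sizes are illustrative assumptions; the thesis itself concentrates on the average-reward criterion and on particular choices of critic basis functions.

```python
import numpy as np

def two_timescale_actor_critic(env, phi, psi, n_actions, episodes=500,
                               gamma=0.99, lr_critic=0.1, lr_actor=0.01, seed=0):
    """Sketch of a two-timescale actor-critic with a linear TD(0) critic."""
    rng = np.random.default_rng(seed)
    s0 = env.reset()
    v = np.zeros_like(phi(s0), dtype=float)              # critic weights
    theta = np.zeros((n_actions, psi(s0).shape[0]))      # actor weights

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            f = psi(s)
            logits = theta @ f
            p = np.exp(logits - logits.max())
            p /= p.sum()
            a = rng.choice(n_actions, p=p)
            s_next, r, done = env.step(a)                # assumed interface

            # Critic (faster timescale): TD(0) with linear function approximation.
            v_next = 0.0 if done else v @ phi(s_next)
            delta = r + gamma * v_next - v @ phi(s)
            v += lr_critic * delta * phi(s)

            # Actor (slower timescale): policy gradient with the TD error
            # standing in for the advantage.
            grad_log_pi = -np.outer(p, f)
            grad_log_pi[a] += f
            theta += lr_actor * delta * grad_log_pi
            s = s_next

    return theta, v
```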
2

Saxena, Naman. "Average Reward Actor-Critic with Deterministic Policy Search." Thesis, 2023. https://etd.iisc.ac.in/handle/2005/6175.

Abstract:
The average reward criterion is relatively less studied as most existing works in the Reinforcement Learning literature consider the discounted reward criterion. There are few recent works that present on-policy average reward actor-critic algorithms, but average reward off-policy actor-critic is relatively less explored. In this work, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) Algorithm. We first show asymptotic convergence analysis using the ODE-based method. Subsequently, we provide a finite time analysis of the resulting stochastic approximation scheme with linear function approximator and obtain an $\epsilon$-optimal stationary policy with a sample complexity of $\Omega(\epsilon^{-2.5})$. We compare the average reward performance of our proposed ARO-DDPG algorithm and observe better empirical performance compared to state-of-the-art on-policy average reward actor-critic algorithms over MuJoCo-based environments.
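For readers unfamiliar with the average-reward setting, the two standard ingredients behind an algorithm of this kind are the average-reward TD error, in which a running estimate $\hat\rho$ of the average reward replaces discounting, and the deterministic policy gradient. Schematically (generic textbook forms, not the thesis's exact off-policy corrections):

$$
\delta_t = r_t - \hat\rho_t + Q_w\bigl(s_{t+1}, \mu_\theta(s_{t+1})\bigr) - Q_w(s_t, a_t),
\qquad
\hat\rho_{t+1} = \hat\rho_t + \beta_t\,\delta_t,
$$

$$
\nabla_\theta J(\mu_\theta) \;\approx\; \mathbb{E}\bigl[\nabla_\theta \mu_\theta(s)\,\nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}\bigr].
$$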
3

Diddigi, Raghuram Bharadwaj. "Reinforcement Learning Algorithms for Off-Policy, Multi-Agent Learning and Applications to Smart Grids." Thesis, 2022. https://etd.iisc.ac.in/handle/2005/5673.

Abstract:
Reinforcement Learning (RL) algorithms are a popular class of algorithms for training an agent to learn desired behavior through interaction with an environment whose dynamics is unknown to the agent. RL algorithms combined with neural network architectures have enjoyed much success in various disciplines like games, medicine, energy management, economics and supply chain management. In our thesis, we study interesting extensions of standard single-agent RL settings, like off-policy and multi-agent settings. We discuss the motivations and importance of these settings and propose convergent algorithms to solve these problems. Finally, we consider one of the important applications of RL, namely smart grids. The goal of the smart grid is to develop a power grid model that intelligently manages its energy resources. In our thesis, we propose RL models for efficient smart grid design.

Learning the value function of a given policy (target policy) from the data samples obtained from a different policy (behavior policy) is an important problem in Reinforcement Learning (RL). This problem is studied under the setting of off-policy prediction. Temporal Difference (TD) learning algorithms are a popular class of algorithms for solving prediction problems. TD algorithms with linear function approximation are convergent when the data samples are generated from the target policy (known as on-policy prediction) itself. However, it has been well established in the literature that off-policy TD algorithms under linear function approximation may diverge. In the first part of the thesis, we propose a convergent online off-policy TD algorithm under linear function approximation. The main idea is to penalize updates of the algorithm to ensure convergence of the iterates. We provide a convergence analysis of our algorithm. Through numerical evaluations, we further demonstrate the effectiveness of our proposed scheme.

Subsequently, we consider the “off-policy control” setup in RL, where an agent’s objective is to compute an optimal policy based on the data obtained from a behavior policy. As the optimal policy can be very different from the behavior policy, learning optimal behavior is very hard in the “off-policy” setting compared to the “on-policy” setting wherein the data is collected from the new policy updates. In this work, we propose the first deep off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency. Unlike the existing natural gradient-based actor-critic algorithms that use only fixed features for policy and value function approximation, the proposed natural actor-critic algorithm can utilize a deep neural network’s power to approximate both policy and value function. We illustrate the benefit of the proposed off-policy natural gradient algorithm by comparing it with the Euclidean gradient actor-critic algorithm on benchmark RL tasks.

In the third part of the thesis, we consider the problem of two-player zero-sum games. In this setting, there are two agents, both of whom aim to optimize their payoffs. Both the agents observe the same state of the game, and the agents’ objective is to compute a strategy profile that maximizes their payoffs. However, the payoff of the second agent is the negative of the payoff obtained by the first agent. Therefore, the objective of the second agent is to minimize the total payoff obtained by the first agent. This problem is formulated as a min-max Markov game in the literature. In this work, we compute the solution of the two-player zero-sum game utilizing the technique of successive relaxation. Successive relaxation has been successfully applied in the literature to compute a faster value iteration algorithm in the context of Markov Decision Processes. We extend the concept of successive relaxation to the two-player zero-sum games. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is unknown. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm utilizing stochastic approximation techniques. Through experiments, we demonstrate the advantages of our proposed algorithm.

Next, we consider a cooperative stochastic games framework where multiple agents work towards learning optimal joint actions in an unknown environment to achieve a common goal. In many real-world applications, however, constraints are often imposed on the actions that the agents can jointly take. In such scenarios, the agents aim to learn joint actions to achieve a common goal (minimizing a specified cost function) while meeting the given constraints (specified via certain penalty functions). Our work considers the relaxation of the constrained optimization problem by constructing the Lagrangian of the cost and penalty functions. We propose a nested actor-critic solution approach to solve this relaxed problem. In this approach, an actor-critic scheme is employed to improve the policy for a given Lagrange parameter update on a faster timescale as in the classical actor-critic architecture. Using this faster timescale policy update, a meta actor-critic scheme is employed to improve the Lagrange parameters on the slower timescale. Utilizing the proposed nested actor-critic scheme, we develop three Nested Actor-Critic (N-AC) algorithms.

In recent times, actor-critic algorithms with attention mechanisms have been successfully applied to obtain optimal actions for RL agents in multi-agent environments. In the fifth part of our thesis, we extend this algorithm to the constrained multi-agent RL setting considered above. The idea here is that optimizing the common goal and satisfying the constraints may require different modes of attention. Thus, by incorporating different attention modes, the agents can select useful information required for optimizing the objective and satisfying the constraints separately, thereby yielding better actions. Through experiments on benchmark multi-agent environments, we discuss the advantages of our proposed attention-based actor-critic algorithm.

In the last part of our thesis, we study the applications of RL algorithms to Smart Grids. We consider two important problems - on the supply-side and demand-side, respectively, and study both in a unified framework. On the supply side, we study the problem of energy trading among microgrids to maximize profit obtained from selling power while at the same time satisfying the customer demand. On the demand side, we consider optimally scheduling the time-adjustable demand - i.e., of loads with flexible time windows in which they can be scheduled. While previous works have treated these two problems in isolation, we combine these problems and provide a unified Markov decision process (MDP) framework for these problems.
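The nested (Lagrangian) actor-critic construction mentioned above can be summarized as follows; the projection form of the multiplier update and the constraint thresholds $c_i$ are our illustrative notation, not the thesis's exact scheme:

$$
\mathcal L(\theta, \lambda) = J_{\text{cost}}(\theta) + \sum_i \lambda_i\,J_{\text{penalty},i}(\theta), \qquad \lambda_i \ge 0,
$$

with the policy parameters $\theta$ improved by an actor-critic scheme on the faster timescale for the current $\lambda$, and the Lagrange multipliers updated on the slower timescale, e.g. $\lambda_i \leftarrow \bigl[\lambda_i + \eta_t\bigl(\hat J_{\text{penalty},i}(\theta) - c_i\bigr)\bigr]_+$.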
4

Lakshmanan, K. "Online Learning and Simulation Based Algorithms for Stochastic Optimization." Thesis, 2012. http://etd.iisc.ac.in/handle/2005/3245.

Abstract:
In many optimization problems, the relationship between the objective and parameters is not known. The objective function itself may be stochastic, such as a long-run average over some random cost samples. In such cases finding the gradient of the objective is not possible. It is in this setting that stochastic approximation algorithms are used. These algorithms use some estimates of the gradient and are stochastic in nature. Amongst gradient estimation techniques, Simultaneous Perturbation Stochastic Approximation (SPSA) and the Smoothed Functional (SF) scheme are widely used. In this thesis we have proposed a novel multi-time scale quasi-Newton based smoothed functional (QN-SF) algorithm for unconstrained as well as constrained optimization. The algorithm uses the smoothed functional scheme for estimating the gradient and the quasi-Newton method to solve the optimization problem. The algorithm is shown to converge with probability one. We have also provided here experimental results on the problem of optimal routing in a multi-stage network of queues. Policies like Join the Shortest Queue or Least Work Left assume knowledge of the queue length values that can change rapidly or are hard to estimate. If the only information available is the expected end-to-end delay, as in our case, such policies cannot be used. The QN-SF based probabilistic routing algorithm uses only the total end-to-end delay for tuning the probabilities. We observe from the experiments that the QN-SF algorithm has better performance than the gradient and Jacobi versions of Newton based smoothed functional algorithms. Next we consider constrained routing in a similar queueing network. We extend the QN-SF algorithm to this case. We study the convergence behavior of the algorithm and observe that the constraints are satisfied at the point of convergence. We provide experimental results for the constrained routing setup as well.

Next we study reinforcement learning algorithms, which are useful for solving Markov Decision Processes (MDPs) when precise information on the transition probabilities is not known. When the state and action sets are very large, it is not possible to store all the state-action tuples. In such cases, function approximators like neural networks have been used. The popular Q-learning algorithm is known to diverge when used with linear function approximation due to the ’off-policy’ problem. Hence developing learning algorithms that are stable when used with function approximation is an important problem. We present in this thesis a variant of Q-learning with linear function approximation that is based on two-timescale stochastic approximation. The Q-value parameters for a given policy in our algorithm are updated on the slower timescale while the policy parameters themselves are updated on the faster scale. We perform a gradient search in the space of policy parameters. Since the objective function and hence the gradient are not analytically known, we employ the efficient one-simulation simultaneous perturbation stochastic approximation (SPSA) gradient estimates that employ Hadamard matrix based deterministic perturbations. Our algorithm has the advantage that, unlike Q-learning, it does not suffer from high oscillations due to the off-policy problem when using function approximators. Whereas it is difficult to prove convergence of regular Q-learning with linear function approximation because of the off-policy problem, we prove that our algorithm, which is on-policy, is convergent. Numerical results on a multi-stage stochastic shortest path problem show that our algorithm exhibits significantly better performance and is more robust as compared to Q-learning. Future work would be to compare it with other policy-based reinforcement learning algorithms.

Finally, we develop an online actor-critic reinforcement learning algorithm with function approximation for a problem of control under inequality constraints. We consider the long-run average cost Markov decision process (MDP) framework in which both the objective and the constraint functions are suitable policy-dependent long-run averages of certain sample path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also provide the results of numerical experiments on a problem of routing in a multistage queueing network with constraints on long-run average queue lengths. We observe that our algorithm exhibits good performance on this setting and converges to a feasible point.
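For reference, the standard two-measurement SPSA gradient estimate mentioned above is

$$
\hat g_i(\theta) = \frac{f(\theta + \delta\Delta) - f(\theta - \delta\Delta)}{2\,\delta\,\Delta_i}, \qquad i = 1,\dots,d,
$$

where $\Delta$ is a perturbation vector and $\delta > 0$ a small constant; the one-simulation variant used in the thesis replaces this with a single noisy measurement, $\hat g_i(\theta) = f(\theta + \delta\Delta)/(\delta\,\Delta_i)$, with the perturbations generated deterministically from Hadamard matrices.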

Book chapters on the topic "Actor-critic algorithm"

1

Kim, Chayoung, Jung-min Park, and Hye-young Kim. "An Actor-Critic Algorithm for SVM Hyperparameters." In Information Science and Applications 2018, 653–61. Singapore: Springer Singapore, 2018. http://dx.doi.org/10.1007/978-981-13-1056-0_64.

2

Zha, ZhongYi, XueSong Tang, and Bo Wang. "An Advanced Actor-Critic Algorithm for Training Video Game AI." In Neural Computing for Advanced Applications, 368–80. Singapore: Springer Singapore, 2020. http://dx.doi.org/10.1007/978-981-15-7670-6_31.

3

Melo, Francisco S., and Manuel Lopes. "Fitted Natural Actor-Critic: A New Algorithm for Continuous State-Action MDPs." In Machine Learning and Knowledge Discovery in Databases, 66–81. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. http://dx.doi.org/10.1007/978-3-540-87481-2_5.

4

Sun, Qifeng, Hui Ren, Youxiang Duan, and Yanan Yan. "The Adaptive PID Controlling Algorithm Using Asynchronous Advantage Actor-Critic Learning Method." In Simulation Tools and Techniques, 498–507. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-32216-8_48.

5

Liu, Guiliang, Xu Li, Mingming Sun, and Ping Li. "An Advantage Actor-Critic Algorithm with Confidence Exploration for Open Information Extraction." In Proceedings of the 2020 SIAM International Conference on Data Mining, 217–25. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2020. http://dx.doi.org/10.1137/1.9781611976236.25.

6

Cheng, Yuhu, Huanting Feng, and Xuesong Wang. "Actor-Critic Algorithm Based on Incremental Least-Squares Temporal Difference with Eligibility Trace." In Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence, 183–88. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. http://dx.doi.org/10.1007/978-3-642-25944-9_24.

7

Jiang, Haobo, Jianjun Qian, Jin Xie, and Jian Yang. "Episode-Experience Replay Based Tree-Backup Method for Off-Policy Actor-Critic Algorithm." In Pattern Recognition and Computer Vision, 562–73. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-03398-9_48.

8

Chuyen, T. D., Dao Huy Du, N. D. Dien, R. V. Hoa, and N. V. Toan. "Building Intelligent Navigation System for Mobile Robots Based on the Actor – Critic Algorithm." In Advances in Engineering Research and Application, 227–38. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-030-92574-1_24.

9

Zhang, Huaqing, Hongbin Ma, and Ying Jin. "An Improved Off-Policy Actor-Critic Algorithm with Historical Behaviors Reusing for Robotic Control." In Intelligent Robotics and Applications, 449–58. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-13841-6_41.

10

Park, Jooyoung, Jongho Kim, and Daesung Kang. "An RLS-Based Natural Actor-Critic Algorithm for Locomotion of a Two-Linked Robot Arm." In Computational Intelligence and Security, 65–72. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005. http://dx.doi.org/10.1007/11596448_9.


Conference papers on the topic "Actor-critic algorithm"

1

Wang, Jing, and Ioannis Ch Paschalidis. "A Hessian actor-critic algorithm." In 2014 IEEE 53rd Annual Conference on Decision and Control (CDC). IEEE, 2014. http://dx.doi.org/10.1109/cdc.2014.7039533.

2

Yaputra, Jordi, and Suyanto Suyanto. "The Effect of Discounting Actor-loss in Actor-Critic Algorithm." In 2021 4th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI). IEEE, 2021. http://dx.doi.org/10.1109/isriti54043.2021.9702883.

3

Aleixo, Everton, Juan Colonna, and Raimundo Barreto. "SVC-A2C - Actor Critic Algorithm to Improve Smart Vacuum Cleaner." In IX Simpósio Brasileiro de Engenharia de Sistemas Computacionais. Sociedade Brasileira de Computação - SBC, 2019. http://dx.doi.org/10.5753/sbesc_estendido.2019.8637.

Abstract:
This work presents a new approach to developing a smart vacuum cleaner using an actor-critic algorithm. We run tests with three other algorithms for comparison. In addition, we develop a new simulator based on Gym in which to execute the tests.
4

Prabuchandran K.J., Shalabh Bhatnagar, and Vivek S. Borkar. "An actor critic algorithm based on Grassmanian search." In 2014 IEEE 53rd Annual Conference on Decision and Control (CDC). IEEE, 2014. http://dx.doi.org/10.1109/cdc.2014.7039948.

5

Yang, Zhuoran, Kaiqing Zhang, Mingyi Hong, and Tamer Basar. "A Finite Sample Analysis of the Actor-Critic Algorithm." In 2018 IEEE Conference on Decision and Control (CDC). IEEE, 2018. http://dx.doi.org/10.1109/cdc.2018.8619440.

6

Vrushabh, D., Shalini K, and K. Sonam. "Actor-Critic Algorithm for Optimal Synchronization of Kuramoto Oscillator." In 2020 7th International Conference on Control, Decision and Information Technologies (CoDIT). IEEE, 2020. http://dx.doi.org/10.1109/codit49905.2020.9263785.

7

Paschalidis, Ioannis Ch, and Yingwei Lin. "Mobile agent coordination via a distributed actor-critic algorithm." In 2011 Mediterranean Conference on Control & Automation (MED). IEEE, 2011. http://dx.doi.org/10.1109/med.2011.5983038.

8

Diddigi, Raghuram Bharadwaj, Prateek Jain, Prabuchandran K. J, and Shalabh Bhatnagar. "Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm." In 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022. http://dx.doi.org/10.1109/ijcnn55064.2022.9892303.

9

Liu, Bo, Yue Zhang, Shupo Fu, and Xuan Liu. "Reduce UAV Coverage Energy Consumption through Actor-Critic Algorithm." In 2019 15th International Conference on Mobile Ad-Hoc and Sensor Networks (MSN). IEEE, 2019. http://dx.doi.org/10.1109/msn48538.2019.00069.

10

Zhong, Shan, Quan Liu, Shengrong Gong, Qiming Fu, and Jin Xu. "Efficient actor-critic algorithm with dual piecewise model learning." In 2017 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2017. http://dx.doi.org/10.1109/ssci.2017.8280911.
