Journal articles on the topic 'Reinforcement Learning, Multi-armed Bandits'

To see the other types of publications on this topic, follow the link: Reinforcement Learning, Multi-armed Bandits.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'Reinforcement Learning, Multi-armed Bandits.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Wan, Zongqi, Zhijie Zhang, Tongyang Li, Jialin Zhang, and Xiaoming Sun. "Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 10087–94. http://dx.doi.org/10.1609/aaai.v37i8.26202.

Abstract:
The multi-armed bandit (MAB) and the stochastic linear bandit (SLB) are important models in reinforcement learning, and it is well known that classical algorithms for bandits with time horizon T suffer regret of at least order square root of T. In this paper, we study MAB and SLB with quantum reward oracles and propose quantum algorithms for both models with regret of order polylog(T), an exponential improvement in the dependence on T. To the best of our knowledge, this is the first provable quantum speedup for the regret of bandit problems and, more generally, for exploitation in reinforcement learning. Compared to previous literature on quantum exploration algorithms for MAB and reinforcement learning, our quantum input model is simpler and only assumes quantum oracles for each individual arm.
2

Ciucanu, Radu, Pascal Lafourcade, Gael Marcadet, and Marta Soare. "SAMBA: A Generic Framework for Secure Federated Multi-Armed Bandits." Journal of Artificial Intelligence Research 73 (February 23, 2022): 737–65. http://dx.doi.org/10.1613/jair.1.13163.

Abstract:
The multi-armed bandit is a reinforcement learning model where a learning agent repeatedly chooses an action (pulls a bandit arm) and the environment responds with a stochastic outcome (reward) coming from an unknown distribution associated with the chosen arm. Bandits have a wide range of applications, such as Web recommendation systems. We address the cumulative reward maximization problem in a secure federated learning setting, where multiple data owners keep their data stored locally and collaborate under the coordination of a central orchestration server. We rely on cryptographic schemes and propose Samba, a generic framework for Secure federAted Multi-armed BAndits. Each data owner has data associated with a bandit arm and the bandit algorithm has to sequentially select which data owner is solicited at each time step. We instantiate Samba for five bandit algorithms. We show that Samba returns the same cumulative reward as the nonsecure versions of bandit algorithms, while satisfying formally proven security properties. We also show that the overhead due to cryptographic primitives is linear in the size of the input, which is confirmed by our proof-of-concept implementation.
3

Huanca-Anquise, Candy A., Ana Lúcia Cetertich Bazzan, and Anderson R. Tavares. "Multi-Objective, Multi-Armed Bandits: Algorithms for Repeated Games and Application to Route Choice." Revista de Informática Teórica e Aplicada 30, no. 1 (January 30, 2023): 11–23. http://dx.doi.org/10.22456/2175-2745.122929.

Abstract:
Multi-objective decision-making in multi-agent scenarios poses multiple challenges. Dealing with multiple objectives and non-stationarity caused by simultaneous learning are only two of them, which have been addressed separately. In this work, reinforcement learning algorithms that tackle both issues together are proposed and applied to a route choice problem, where drivers must select an action in a single-state formulation, while aiming to minimize both their travel time and toll. Hence, we deal with repeated games, now with a multi-objective approach. Advantages, limitations and differences of these algorithms are discussed. Our results show that the proposed algorithms for action selection using reinforcement learning deal with non-stationarity and multiple objectives, while providing alternative solutions to those of centralized methods.
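
A common way to make a single-state learner handle two objectives, as in the route-choice setting above, is linear scalarization of the reward. The sketch below is a generic illustration under assumed weights, reward model, and route names; it is not the algorithm proposed in the paper.

```python
import random

# Illustrative two-objective bandit: each "arm" is a route with a
# (mean travel time, toll) pair; both objectives are to be minimized.
ROUTES = {
    "A": (25.0, 4.0),   # hypothetical mean travel time (min) and toll
    "B": (18.0, 7.5),
    "C": (30.0, 1.0),
}
WEIGHTS = (0.7, 0.3)     # assumed scalarization weights for (time, toll)
EPSILON, ALPHA = 0.1, 0.1

def scalarized_reward(travel_time, toll, w=WEIGHTS):
    # Negate the weighted cost so that "higher is better" for the learner.
    return -(w[0] * travel_time + w[1] * toll)

q = {route: 0.0 for route in ROUTES}
for _ in range(5000):
    # Epsilon-greedy choice over routes in a single-state (bandit) setting.
    route = (random.choice(list(ROUTES)) if random.random() < EPSILON
             else max(q, key=q.get))
    mean_time, toll = ROUTES[route]
    travel_time = random.gauss(mean_time, 3.0)   # noisy observed travel time
    q[route] += ALPHA * (scalarized_reward(travel_time, toll) - q[route])

print(max(q, key=q.get))   # route preferred under these particular weights
```
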
4

Giachino, Chiara, Luigi Bollani, Alessandro Bonadonna, and Marco Bertetti. "Reinforcement learning for content's customization: a first step of experimentation in Skyscanner." Industrial Management & Data Systems 121, no. 6 (January 15, 2021): 1417–34. http://dx.doi.org/10.1108/imds-12-2019-0722.

Abstract:
Purpose: The aim of the paper is to test and demonstrate the potential benefits of applying reinforcement learning instead of traditional methods to optimize the content of a company's mobile application, to best help travellers find their ideal flights. To this end, two approaches were considered and compared via simulation: standard randomized experiments (A/B testing) and multi-armed bandits. Design/methodology/approach: The simulation of the two approaches to optimize the content of the mobile application and, consequently, increase flight conversions is illustrated as applied by Skyscanner, using R software. Findings: The first finding concerns the comparison between the two approaches (A/B testing and multi-armed bandits) to identify the one that achieves better results for the company. The second is the experience and suggestions gained in applying the two approaches, which are useful for other industries and companies. Research limitations/implications: The case study demonstrated, via simulation, the potential benefits of applying reinforcement learning in a company. The multi-armed bandit was eventually implemented in the company, but the period of available data was limited, and owing to its strategic relevance the company cannot disclose all the findings. Practical implications: The right algorithm can change according to the situation and industry, but would bring great benefits to the company's ability to surface content that is more relevant to users and help improve the experience for travellers. The study shows how to manage complexity and data to achieve good results. Originality/value: The paper describes the approach used by a leading European company operating in the travel sector in understanding how to adapt reinforcement learning to its strategic goals. It presents a real case study and the simulation of the application of A/B testing and multi-armed bandits at Skyscanner; moreover, it highlights practical suggestions useful to other companies.
5

Noothigattu, Ritesh, Tom Yan, and Ariel D. Procaccia. "Inverse Reinforcement Learning From Like-Minded Teachers." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 9197–204. http://dx.doi.org/10.1609/aaai.v35i10.17110.

Abstract:
We study the problem of learning a policy in a Markov decision process (MDP) based on observations of the actions taken by multiple teachers. We assume that the teachers are like-minded in that their reward functions -- while different from each other -- are random perturbations of an underlying reward function. Under this assumption, we demonstrate that inverse reinforcement learning algorithms that satisfy a certain property -- that of matching feature expectations -- yield policies that are approximately optimal with respect to the underlying reward function, and that no algorithm can do better in the worst case. We also show how to efficiently recover the optimal policy when the MDP has one state -- a setting that is akin to multi-armed bandits.
6

Xiong, Guojun, Jian Li, and Rahul Singh. "Reinforcement Learning Augmented Asymptotically Optimal Index Policy for Finite-Horizon Restless Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8726–34. http://dx.doi.org/10.1609/aaai.v36i8.20852.

Abstract:
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed as R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state and action of the corresponding MDP. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy entitled Occupancy-Measured-Reward Index Policy for the finite-horizon R(MA)^2B. Our index policy is well-defined without the requirement of indexability condition and is provably asymptotically optimal as the number of arms tends to infinity. We then adopt a learning perspective where the system parameters are unknown, and propose R(MA)^2B-UCB, a generative model based reinforcement learning augmented algorithm that can fully exploit the structure of Occupancy-Measured-Reward Index Policy. Compared to existing algorithms, R(MA)^2B-UCB performs close to offline optimum, and achieves a sub-linear regret and a low computational complexity all at once. Experimental results show that R(MA)^2B-UCB outperforms existing algorithms in both regret and running time.
7

Huo, Xiaoguang, and Feng Fu. "Risk-aware multi-armed bandit problem with application to portfolio selection." Royal Society Open Science 4, no. 11 (November 2017): 171377. http://dx.doi.org/10.1098/rsos.171377.

Abstract:
Sequential portfolio selection has attracted increasing interest in the machine learning and quantitative finance communities in recent years. As a mathematical framework for reinforcement learning policies, the stochastic multi-armed bandit problem addresses the primary difficulty in sequential decision-making under uncertainty, namely the exploration versus exploitation dilemma, and therefore provides a natural connection to portfolio selection. In this paper, we incorporate risk awareness into the classic multi-armed bandit setting and introduce an algorithm to construct portfolios. By filtering assets based on the topological structure of the financial market and combining the optimal multi-armed bandit policy with the minimization of a coherent risk measure, we achieve a balance between risk and return.
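
As a rough illustration of what a risk-aware arm score can look like, the snippet below ranks arms by empirical mean return penalized by an empirical Conditional Value-at-Risk (CVaR) of the losses. The return distributions, the penalty weight, and the scoring rule are assumptions for the example; the paper's actual policy and coherent risk measure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cvar(losses, alpha=0.95):
    # Mean of the worst (1 - alpha) fraction of losses.
    cutoff = np.quantile(losses, alpha)
    return losses[losses >= cutoff].mean()

# Hypothetical per-arm return histories (e.g., daily asset returns).
histories = {
    "asset_1": rng.normal(0.04, 0.10, 1000),
    "asset_2": rng.normal(0.06, 0.25, 1000),
    "asset_3": rng.normal(0.02, 0.03, 1000),
}
RISK_WEIGHT = 0.5   # assumed risk-aversion coefficient

scores = {
    arm: returns.mean() - RISK_WEIGHT * empirical_cvar(-returns)  # losses = -returns
    for arm, returns in histories.items()
}
print(max(scores, key=scores.get))   # arm with the best risk-adjusted score
```
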
8

Nobari, Sadegh. "DBA: Dynamic Multi-Armed Bandit Algorithm." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9869–70. http://dx.doi.org/10.1609/aaai.v33i01.33019869.

Abstract:
We introduce the Dynamic Bandit Algorithm (DBA), a practical solution that addresses a shortcoming of the pervasively employed reinforcement learning algorithm known as the multi-armed bandit, or simply Bandit. Bandit makes real-time decisions based on prior observations. However, Bandit is so heavily biased towards those priors that it cannot quickly adapt to a changing trend, and as a result it cannot make profitable decisions quickly enough when the trend shifts. Unlike Bandit, DBA focuses on detecting these trends early enough to adapt to them. Furthermore, DBA remains almost as light as Bandit in terms of computation, so it can be easily deployed in production as a lightweight process, similar to Bandit. We demonstrate how critical and beneficial DBA's main focus, namely the ability to quickly find the most profitable option in real time, is compared with its state-of-the-art competitors. Our experiments are augmented with a visualization mechanism that explains, through animations, the profitability of the decisions made by each algorithm at each step. Finally, we observe that DBA can substantially outperform the original Bandit, by close to 3 times, on a set Key Performance Indicator (KPI) in a case with 3 arms.
9

Esfandiari, Hossein, MohammadTaghi HajiAghayi, Brendan Lucier, and Michael Mitzenmacher. "Online Pandora’s Boxes and Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 1885–92. http://dx.doi.org/10.1609/aaai.v33i01.33011885.

Abstract:
We consider online variations of the Pandora’s box problem (Weitzman 1979), a standard model for understanding issues related to the cost of acquiring information for decision-making. Our problem generalizes both the classic Pandora’s box problem and the prophet inequality framework. Boxes are presented online, each with a random value and cost drawn jointly from some known distribution. Pandora chooses online whether to open each box given its cost, and then chooses irrevocably whether to keep the revealed prize or pass on it. We aim for approximation algorithms against adversaries that can choose the largest prize over any opened box, and use optimal offline policies to decide which boxes to open (without knowledge of the value inside)1. We consider variations where Pandora can collect multiple prizes subject to feasibility constraints, such as cardinality, matroid, or knapsack constraints. We also consider variations related to classic multi-armed bandit problems from reinforcement learning. Our results use a reduction-based framework where we separate the issues of the cost of acquiring information from the online decision process of which prizes to keep. Our work shows that in many scenarios, Pandora can achieve a good approximation to the best possible performance.
10

Lefebvre, Germain, Christopher Summerfield, and Rafal Bogacz. "A Normative Account of Confirmation Bias During Reinforcement Learning." Neural Computation 34, no. 2 (January 14, 2022): 307–37. http://dx.doi.org/10.1162/neco_a_01455.

Abstract:
Reinforcement learning involves updating estimates of the value of states and actions on the basis of experience. Previous work has shown that in humans, reinforcement learning exhibits a confirmatory bias: when the value of a chosen option is being updated, estimates are revised more radically following positive than negative reward prediction errors, but the converse is observed when updating the unchosen option value estimate. Here, we simulate performance on a multi-arm bandit task to examine the consequences of a confirmatory bias for reward harvesting. We report a paradoxical finding: that confirmatory biases allow the agent to maximize reward relative to an unbiased updating rule. This principle holds over a wide range of experimental settings and is most influential when decisions are corrupted by noise. We show that this occurs because on average, confirmatory biases lead to overestimating the value of more valuable bandits and underestimating the value of less valuable bandits, rendering decisions overall more robust in the face of noise. Our results show how apparently suboptimal learning rules can in fact be reward maximizing if decisions are made with finite computational precision.
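
The asymmetric ('confirmatory') update rule described in the abstract can be written down compactly. The sketch below is a schematic of that bias on a two-armed Bernoulli task with full feedback and illustrative learning rates; it is not the authors' simulation code.

```python
import random

P_REWARD = [0.7, 0.4]                        # assumed true reward probabilities
ALPHA_CONFIRM, ALPHA_DISCONFIRM = 0.3, 0.1   # illustrative asymmetric learning rates
values = [0.5, 0.5]

for _ in range(10_000):
    # Noisy value-based choice, as in decisions "corrupted by noise".
    chosen = max(range(2), key=lambda a: values[a] + random.gauss(0.0, 0.05))
    for arm in range(2):
        reward = 1.0 if random.random() < P_REWARD[arm] else 0.0
        delta = reward - values[arm]          # reward prediction error
        if arm == chosen:
            # Chosen option: positive prediction errors are weighted more heavily.
            alpha = ALPHA_CONFIRM if delta > 0 else ALPHA_DISCONFIRM
        else:
            # Unchosen option: the asymmetry is reversed.
            alpha = ALPHA_DISCONFIRM if delta > 0 else ALPHA_CONFIRM
        values[arm] += alpha * delta

print(values)   # the more valuable arm tends to be overestimated, the other underestimated
```
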
11

Koulouriotis, D. E., and A. Xanthopoulos. "Reinforcement learning and evolutionary algorithms for non-stationary multi-armed bandit problems." Applied Mathematics and Computation 196, no. 2 (March 2008): 913–22. http://dx.doi.org/10.1016/j.amc.2007.07.043.

12

Elizarov, Artem Aleksandrovich, and Evgenii Viktorovich Razinkov. "Image Classification Using Reinforcement Learning." Russian Digital Libraries Journal 23, no. 6 (May 12, 2020): 1172–91. http://dx.doi.org/10.26907/1562-5419-2020-23-6-1172-1191.

Abstract:
Reinforcement learning has recently been developing actively as a direction of machine learning. As a consequence, attempts are being made to use reinforcement learning for solving computer vision problems, in particular the problem of image classification, which is currently among the most pressing tasks of artificial intelligence. The article proposes a method for image classification in the form of a deep neural network trained using reinforcement learning. The idea of the developed method comes down to solving a contextual multi-armed bandit problem using various strategies for achieving a compromise between exploration and exploitation, together with reinforcement learning algorithms. Strategies such as ε-greedy, ε-softmax, ε-decay-softmax, and the UCB1 method, and reinforcement learning algorithms such as DQN, REINFORCE, and A2C are considered. The influence of various parameters on the efficiency of the method is analysed, and options for further development of the method are proposed.
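
For reference, the ε-greedy rule mentioned in the abstract is, in its plain non-contextual bandit form, just a coin flip between a random action and the greedy one. The following is a generic sketch with made-up arm probabilities, not the paper's deep-network classifier.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Toy usage on a 3-armed Bernoulli bandit with assumed success rates.
true_p = [0.2, 0.5, 0.8]
q, n = [0.0] * 3, [0] * 3
for _ in range(10_000):
    a = epsilon_greedy(q)
    r = 1.0 if random.random() < true_p[a] else 0.0
    n[a] += 1
    q[a] += (r - q[a]) / n[a]    # incremental sample-average update
print(q)
```
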
13

Morimoto, Juliano. "Foraging decisions as multi-armed bandit problems: Applying reinforcement learning algorithms to foraging data." Journal of Theoretical Biology 467 (April 2019): 48–56. http://dx.doi.org/10.1016/j.jtbi.2019.02.002.

14

Askhedkar, Anjali R., and Bharat S. Chaudhari. "Multi-Armed Bandit Algorithm Policy for LoRa Network Performance Enhancement." Journal of Sensor and Actuator Networks 12, no. 3 (May 4, 2023): 38. http://dx.doi.org/10.3390/jsan12030038.

Abstract:
Low-power wide-area networks (LPWANs) underpin a variety of modern-day Internet of Things (IoT) applications. Long range (LoRa) is a promising LPWAN technology with its long-range and low-power benefits. Performance enhancement of LoRa networks is one of the crucial challenges in meeting application requirements, and it primarily depends on the optimal selection of transmission parameters. The reinforcement-learning-based multi-armed bandit (MAB) is a prominent approach for optimizing the LoRa parameters and network performance. In this work, we propose a new discounted upper confidence bound (DUCB) MAB policy to maximize energy efficiency and improve the overall performance of the LoRa network. We design novel discount and exploration bonus functions to maximize the policy rewards and increase the number of successful transmissions. The results show that the proposed discount and exploration functions give better mean rewards irrespective of the number of trials, which is of significant importance for LoRa networks. The designed policy outperforms other policies reported in the literature, with lower time complexity and comparable mean rewards, improving the mean reward by a minimum of 8%.
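
Discounted UCB policies of the kind built on here weight past observations by a discount factor and add an exploration bonus computed from the discounted pull counts. The sketch below follows the textbook form of discounted UCB with assumed parameter values; it does not reproduce the authors' novel discount and exploration bonus functions.

```python
import math
import random

GAMMA, XI = 0.99, 0.6          # assumed discount factor and exploration scale
true_p = [0.6, 0.4, 0.5]       # toy Bernoulli success rates, one per arm
K = len(true_p)
disc_count = [0.0] * K         # discounted pull counts
disc_reward = [0.0] * K        # discounted reward sums

for _ in range(20_000):
    total = sum(disc_count)
    indices = []
    for i in range(K):
        if disc_count[i] == 0.0:
            indices.append(float("inf"))   # force an initial pull of every arm
        else:
            mean = disc_reward[i] / disc_count[i]
            bonus = math.sqrt(XI * math.log(max(total, 1.0)) / disc_count[i])
            indices.append(mean + bonus)
    arm = max(range(K), key=lambda i: indices[i])
    reward = 1.0 if random.random() < true_p[arm] else 0.0
    # Discount the past, then credit the chosen arm with the fresh observation.
    disc_count = [GAMMA * c for c in disc_count]
    disc_reward = [GAMMA * s for s in disc_reward]
    disc_count[arm] += 1.0
    disc_reward[arm] += reward

print([round(s / c, 3) for s, c in zip(disc_reward, disc_count)])
```
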
15

Espinosa-Leal, Leonardo, Anthony Chapman, and Magnus Westerlund. "Autonomous Industrial Management via Reinforcement Learning." Journal of Intelligent & Fuzzy Systems 39, no. 6 (December 4, 2020): 8427–39. http://dx.doi.org/10.3233/jifs-189161.

Abstract:
Industry has always been in the pursuit of becoming more economically efficient and the current focus has been to reduce human labour using modern technologies. Even with cutting edge technologies, which range from packaging robots to AI for fault detection, there is still some ambiguity on the aims of some new systems, namely, whether they are automated or autonomous. In this paper, we indicate the distinctions between automated and autonomous systems as well as review the current literature and identify the core challenges for creating learning mechanisms of autonomous agents. We discuss using different types of extended realities, such as digital twins, how to train reinforcement learning agents to learn specific tasks through generalisation. Once generalisation is achieved, we discuss how these can be used to develop self-learning agents. We then introduce self-play scenarios and how they can be used to teach self-learning agents through a supportive environment that focuses on how the agents can adapt to different environments. We introduce an initial prototype of our ideas by solving a multi-armed bandit problem using two ε-greedy algorithms. Further, we discuss future applications in the industrial management realm and propose a modular architecture for improving the decision-making process via autonomous agents.
16

Teymuri, Benyamin, Reza Serati, Nikolaos Athanasios Anagnostopoulos, and Mehdi Rasti. "LP-MAB: Improving the Energy Efficiency of LoRaWAN Using a Reinforcement-Learning-Based Adaptive Configuration Algorithm." Sensors 23, no. 4 (February 20, 2023): 2363. http://dx.doi.org/10.3390/s23042363.

Abstract:
In the Internet of Things (IoT), Low-Power Wide-Area Networks (LPWANs) are designed to provide low energy consumption while maintaining a long communication range for End Devices (EDs). LoRa is a communication protocol that can cover a wide range with low energy consumption. To evaluate the efficiency of the LoRa Wide-Area Network (LoRaWAN), three criteria can be considered, namely the Packet Delivery Rate (PDR), Energy Consumption (EC), and coverage area. A set of transmission parameters has to be configured to establish a communication link. These parameters can affect the data rate, noise resistance, receiver sensitivity, and EC. The Adaptive Data Rate (ADR) algorithm is a mechanism for configuring the transmission parameters of EDs with the aim of improving the PDR. Therefore, we introduce a new algorithm using the Multi-Armed Bandit (MAB) technique to configure the EDs' transmission parameters in a centralized manner on the Network Server (NS) side, while also improving the EC. The performance of the proposed algorithm, the Low-Power Multi-Armed Bandit (LP-MAB), is evaluated through simulation and compared with other approaches in different scenarios. The simulation results indicate that LP-MAB outperforms other algorithms in terms of EC while maintaining a relatively high PDR in various circumstances.
17

Varatharajah, Yogatheesan, and Brent Berry. "A Contextual-Bandit-Based Approach for Informed Decision-Making in Clinical Trials." Life 12, no. 8 (August 21, 2022): 1277. http://dx.doi.org/10.3390/life12081277.

Abstract:
Clinical trials are conducted to evaluate the efficacy of new treatments. Clinical trials involving multiple treatments utilize the randomization of treatment assignments to enable the evaluation of treatment efficacies in an unbiased manner. Such evaluation is performed in post hoc studies that usually use supervised-learning methods that rely on large amounts of data collected in a randomized fashion. That approach often proves to be suboptimal in that some participants may suffer and even die as a result of having not received the most appropriate treatments during the trial. Reinforcement-learning methods improve the situation by making it possible to learn the treatment efficacies dynamically during the course of the trial, and to adapt treatment assignments accordingly. Recent efforts using multi-arm bandits, a type of reinforcement-learning method, have focused on maximizing clinical outcomes for a population that was assumed to be homogeneous. However, those approaches have failed to account for the variability among participants that is becoming increasingly evident as a result of recent clinical-trial-based studies. We present a contextual-bandit-based online treatment optimization algorithm that, in choosing treatments for new participants in the study, takes into account not only the maximization of the clinical outcomes as well as the patient characteristics. We evaluated our algorithm using a real clinical trial dataset from the International Stroke Trial. We simulated the online setting by sequentially going through the data of each participant admitted to the trial. Two bandits (one for each context) were created, with four choices of treatments. For a new participant in the trial, depending on the context, one of the bandits was selected. Then, we took three different approaches to choose a treatment: (a) a random choice (i.e., the strategy currently used in clinical trial settings), (b) a Thompson sampling-based approach, and (c) a UCB-based approach. Success probabilities of each context were calculated separately by considering the participants with the same context. Those estimated outcomes were used to update the prior distributions within the bandit corresponding to the context of each participant. We repeated that process through the end of the trial and recorded the outcomes and the chosen treatments for each approach. We also evaluated a context-free multi-arm-bandit-based approach, using the same dataset, to showcase the benefits of our approach. In the context-free case, we calculated the success probabilities for the Bernoulli sampler using the whole clinical trial dataset in a context-independent manner. The results of our retrospective analysis indicate that the proposed approach performs significantly better than either a random assignment of treatments (the current gold standard) or a multi-arm-bandit-based approach, providing substantial gains in the percentage of participants who are assigned the most suitable treatments. The contextual-bandit and multi-arm bandit approaches provide 72.63% and 64.34% gains, respectively, compared to a random assignment.
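
The Thompson sampling arm of this comparison rests on the standard Beta-Bernoulli posterior update. The minimal sketch below shows that mechanic for four treatments with invented success probabilities, ignoring the two-context split used in the study; it is not the authors' trial-simulation pipeline.

```python
import random

true_success = [0.55, 0.60, 0.45, 0.50]   # invented treatment success probabilities
alpha = [1] * 4                           # Beta posterior parameters (successes + 1)
beta = [1] * 4                            # Beta posterior parameters (failures + 1)
assigned = [0] * 4

for _ in range(10_000):
    # Sample a success probability for each treatment from its posterior,
    # then assign the treatment with the highest sampled value.
    samples = [random.betavariate(alpha[t], beta[t]) for t in range(4)]
    treatment = max(range(4), key=lambda t: samples[t])
    outcome = random.random() < true_success[treatment]
    alpha[treatment] += int(outcome)
    beta[treatment] += int(not outcome)
    assigned[treatment] += 1

print(assigned)   # allocation drifts toward the better-performing treatments
```
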
18

Zhou, Jinkai, Xuebo Lai, and Joseph Y. J. Chow. "Multi-Armed Bandit On-Time Arrival Algorithms for Sequential Reliable Route Selection under Uncertainty." Transportation Research Record: Journal of the Transportation Research Board 2673, no. 10 (June 2, 2019): 673–82. http://dx.doi.org/10.1177/0361198119850457.

Abstract:
Traditionally vehicles act only as servers in transporting passengers and goods. With increasing sensor equipment in vehicles, including automated vehicles, there is a need to test algorithms that consider the dual role of vehicles as both servers and sensors. The paper formulates a sequential route selection problem as a shortest path problem with on-time arrival reliability under a multi-armed bandit setting, a type of reinforcement learning model. A decision-maker has to make a finite set of decisions sequentially on departure time and path between a fixed origin-destination pair such that on-time reliability is maximized while travel time is minimized. The upper confidence bound algorithm is extended to handle this problem. Several tests are conducted. First, simulated data successfully verifies the method, then a real-data scenario is constructed of a hotel shuttle service from midtown Manhattan in New York City providing hourly access to John F. Kennedy International Airport. Results suggest that route selection with multi-armed bandit learning algorithms can be effective but neglecting passenger scheduling constraints can have negative effects on on-time arrival reliability by as much as 4.8% and combined reliability and travel time by 66.1%.
19

Dai, Yue, Jiangang Lu, Zhiwen Yu, and Ruifeng Zhao. "High-Precision Timing Method of BeiDou-3 System Based on Reinforcement Learning." Journal of Physics: Conference Series 2401, no. 1 (December 1, 2022): 012093. http://dx.doi.org/10.1088/1742-6596/2401/1/012093.

Abstract:
All aspects of smart grid operation require precise timing technology to improve the timeliness and precision of the network. However, due to the complex and time-varying network environment and the low precision of device timing, the existing timing technology cannot meet the timing precision required by smart grid services. Therefore, according to the multi-armed bandit (MAB) theory and the upper confidence bound (UCB) algorithm, this paper proposes a high-precision timing method for power transmission and distribution park named adaptive timing-aware upper confidence bound (AUCB). The long-term average reward is maximized based on local information and historical decision-making experience. The simulation results show that the proposed high-precision timing method of AUCB in the power transmission and distribution park can effectively improve the device timing precision and realize self-adaptive timing for power system devices.
20

Dunne, Simon, Arun D'Souza, and John P. O'Doherty. "The involvement of model-based but not model-free learning signals during observational reward learning in the absence of choice." Journal of Neurophysiology 115, no. 6 (June 1, 2016): 3195–203. http://dx.doi.org/10.1152/jn.00046.2016.

Abstract:
A major open question is whether computational strategies thought to be used during experiential learning, specifically model-based and model-free reinforcement learning, also support observational learning. Furthermore, the question of how observational learning occurs when observers must learn about the value of options from observing outcomes in the absence of choice has not been addressed. In the present study we used a multi-armed bandit task that encouraged human participants to employ both experiential and observational learning while they underwent functional magnetic resonance imaging (fMRI). We found evidence for the presence of model-based learning signals during both observational and experiential learning in the intraparietal sulcus. However, unlike during experiential learning, model-free learning signals in the ventral striatum were not detectable during this form of observational learning. These results provide insight into the flexibility of the model-based learning system, implicating this system in learning during observation as well as from direct experience, and further suggest that the model-free reinforcement learning system may be less flexible with regard to its involvement in observational learning.
21

Kessler, Samuel, Jack Parker-Holder, Philip Ball, Stefan Zohren, and Stephen J. Roberts. "Same State, Different Task: Continual Reinforcement Learning without Interference." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 7 (June 28, 2022): 7143–51. http://dx.doi.org/10.1609/aaai.v36i7.20674.

Abstract:
Continual Learning (CL) considers the problem of training an agent sequentially on a set of tasks while seeking to retain performance on all previous tasks. A key challenge in CL is catastrophic forgetting, which arises when performance on a previously mastered task is reduced when learning a new task. While a variety of methods exist to combat forgetting, in some cases tasks are fundamentally incompatible with each other and thus cannot be learnt by a single policy. This can occur, in reinforcement learning (RL) when an agent may be rewarded for achieving different goals from the same observation. In this paper we formalize this "interference" as distinct from the problem of forgetting. We show that existing CL methods based on single neural network predictors with shared replay buffers fail in the presence of interference. Instead, we propose a simple method, OWL, to address this challenge. OWL learns a factorized policy, using shared feature extraction layers, but separate heads, each specializing on a new task. The separate heads in OWL are used to prevent interference. At test time, we formulate policy selection as a multi-armed bandit problem, and show it is possible to select the best policy for an unknown task using feedback from the environment. The use of bandit algorithms allows the OWL agent to constructively re-use different continually learnt policies at different times during an episode. We show in multiple RL environments that existing replay based CL methods fail, while OWL is able to achieve close to optimal performance when training sequentially.
22

Li, Xinbin, Xianglin Xu, Lei Yan, Haihong Zhao, and Tongwei Zhang. "Energy-Efficient Data Collection Using Autonomous Underwater Glider: A Reinforcement Learning Formulation." Sensors 20, no. 13 (July 4, 2020): 3758. http://dx.doi.org/10.3390/s20133758.

Abstract:
The autonomous underwater glider has attracted enormous interest for underwater activities, especially in long-term and large-scale underwater data collection. In this paper, we focus on the application of gliders gathering data from underwater sensor networks over underwater acoustic channels. However, this application suffers from a rapidly time-varying environment and limited energy. To optimize the performance of data collection and maximize the network lifetime, we propose a distributed, energy-efficient sensor scheduling algorithm based on the multi-armed bandit formulation. Besides, we design an indexable threshold policy to tradeoff between the data quality and the collection delay. Moreover, to reduce the computational complexity, we divide the proposed algorithm into off-line computation and on-line scheduling parts. Simulation results indicate that the proposed policy significantly improves the performance of the data collection and reduces the energy consumption. They prove the effectiveness of the threshold, which could reduce the collection delay by at least 10% while guaranteeing the data quality.
23

Yu, Junpu. "Thompson ε-Greedy Algorithm: An Improvement to the Regret of Thompson Sampling and ε-Greedy on Multi-Armed Bandit Problems." Applied and Computational Engineering 8, no. 1 (August 1, 2023): 525–34. http://dx.doi.org/10.54254/2755-2721/8/20230264.

Abstract:
The multi-armed bandit problem is one of the most classic reinforcement learning problems, aiming to find balanced decisions between exploration and exploitation and to increase the total reward of the actions taken in each round. To solve multi-armed bandit problems, algorithms have been designed, including some of the most typical and widely used ones, such as the Explore-Then-Commit algorithm, the Upper Confidence Bound algorithm, the Epsilon-Greedy algorithm, and the Thompson Sampling algorithm. Some of them are improvements upon others, while all of them seek to increase the total reward but have specific weaknesses. The Epsilon-Greedy algorithm, as a simple method to balance exploration and exploitation in multi-armed bandit problems, has the disadvantage of still picking an action even after it has appeared to be non-optimal for a very long time. The Thompson Sampling algorithm, though performing well in many scenarios, takes a significantly long time to update its prior distribution each round and tends to explore too much in the initial tries when the real distribution of rewards is scattered. To further fix their weaknesses and improve their performance, this paper proposes a newly designed algorithm, Thompson ε-Greedy (TEG), which seeks to utilize the advantages of both algorithms to complement each other's disadvantages. The TEG algorithm is not only shown to perform better than ε-Greedy in most cases, but also turns out to be more adaptive in environments with true reward distributions that weaken the Thompson Sampling algorithm. Beyond the comparison of regrets, the paper further analyses the time cost of applying TEG versus the two existing methods and their best-arm selection rates to illustrate the significance of the TEG algorithm.
24

Botchkaryov, Alexey. "Task sequence planning by intelligent agent with context awareness." Computer systems and network 4, no. 1 (December 16, 2022): 12–20. http://dx.doi.org/10.23939/csn2022.01.012.

Abstract:
The problem of context-aware task sequence planning for independent or weakly related tasks by an intelligent agent has been considered. The principle of matching the task to the context is analyzed. The structure of the context-aware task sequence planning module as part of an intelligent agent and the corresponding algorithm are proposed. In the structure of the planning module, three main groups of blocks are implemented: operations with tasks, operations with the context, determining the relevance of tasks to the context. The article also proposes an algorithm for calculating the dynamic priority of a task, an algorithm for determining the relevance of a task to the context, and an algorithm for adapting a set of rules for matching the task to the context. The value of the dynamic priority depends on the static priority of the task, which is assigned when a new task is added, and the degree of correspondence of the task to the context, taking into account the context vector. Two modes of starting the planning algorithm are allowed: when the intelligent agent requests a new task and when the context vector changes. The dynamic priority calculation algorithm is performed independently for each task. As a result, its software implementation has a large parallelization resource. To adapt the set of rules for matching tasks to the context, a scheme of reinforcement learning in a contextual multi-armed bandit was used. For each matching rule, a separate instance of the reinforcement learning procedure is performed. The reinforcement learning method used in the article is Upper-Confidence-Bound Action Selection. The functional scheme of the reinforcement learning procedure, implemented in the prototype of the context-aware task sequence planning module has been proposed. The functional scheme of the reinforcement learning procedure allows the use of data decomposition and functional decomposition to parallelize the corresponding calculations.
25

Amirizadeh, Khosrow, and Rajeswari Mandava. "Fast Iterative model for Sequential-Selection-Based Applications." INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY 12, no. 7 (February 14, 2014): 3689–96. http://dx.doi.org/10.24297/ijct.v12i7.3092.

Abstract:
An accelerated multi-armed bandit (MAB) model for on-line sequential selection problems in reinforcement learning is presented. This iterative model utilizes an automatic step-size calculation that improves the performance of the MAB algorithm under different conditions, such as variable reward variance and larger sets of usable actions. As a result of these modifications, the number of optimal selections is maximized and the stability of the algorithm under the mentioned conditions is improved. This adaptive model with automatic step-size computation may be attractive for on-line applications in which the variance of observations varies with time and re-tuning the step size is unavoidable, since such re-tuning is not a simple task. The proposed model is governed by an upper confidence bound (UCB) approach in iterative form with automatic step-size computation. It is called adaptive UCB (AUCB) and may be used in industrial robotics, autonomous control, and intelligent selection or prediction tasks in economic engineering applications under a lack of information.
26

Kamikokuryo, Kenta, Takumi Haga, Gentiane Venture, and Vincent Hernandez. "Adversarial Autoencoder and Multi-Armed Bandit for Dynamic Difficulty Adjustment in Immersive Virtual Reality for Rehabilitation: Application to Hand Movement." Sensors 22, no. 12 (June 14, 2022): 4499. http://dx.doi.org/10.3390/s22124499.

Abstract:
Motor rehabilitation is used to improve motor control skills to improve the patient’s quality of life. Regular adjustments based on the effect of therapy are necessary, but this can be time-consuming for the clinician. This study proposes to use an efficient tool for high-dimensional data by considering a deep learning approach for dimensionality reduction of hand movement recorded using a wireless remote control embedded with the Oculus Rift S. This latent space is created as a visualization tool also for use in a reinforcement learning (RL) algorithm employed to provide a decision-making framework. The data collected consists of motions drawn with wireless remote control in an immersive VR environment for six different motions called “Cube”, “Cylinder”, “Heart”, “Infinity”, “Sphere”, and “Triangle”. From these collected data, different artificial databases were created to simulate variations of the data. A latent space representation is created using an adversarial autoencoder (AAE), taking into account unsupervised (UAAE) and semi-supervised (SSAAE) training. Then, each test point is represented by a distance metric and used as a reward for two classes of Multi-Armed Bandit (MAB) algorithms, namely Boltzmann and Sibling Kalman filters. The results showed that AAE models can represent high-dimensional data in a two-dimensional latent space and that MAB agents can efficiently and quickly learn the distance evolution in the latent space. The results show that Sibling Kalman filter exploration outperforms Boltzmann exploration with an average cumulative weighted probability error of 7.9 versus 19.9 using the UAAE latent space representation and 8.0 versus 20.0 using SSAAE. In conclusion, this approach provides an effective approach to visualize and track current motor control capabilities regarding a target in order to reflect the patient’s abilities in VR games in the context of DDA.
27

Moy, Christophe, Lilian Besson, Guillaume Delbarre, and Laurent Toutain. "Decentralized spectrum learning for radio collision mitigation in ultra-dense IoT networks: LoRaWAN case study and experiments." Annals of Telecommunications 75, no. 11-12 (August 27, 2020): 711–27. http://dx.doi.org/10.1007/s12243-020-00795-y.

Abstract:
This paper describes the theoretical principles and experimental results of reinforcement learning algorithms embedded into Internet of Things (IoT) devices in order to tackle the problem of radio collision mitigation in unlicensed ISM bands. Multi-armed bandit (MAB) learning algorithms are used here to improve both the capability of the IoT network to support the expected massive number of objects and the energy autonomy of the IoT devices. We first illustrate the efficiency of the proposed approach in a proof of concept based on USRP software radio platforms operating on real radio signals. It shows how collisions with other RF signals are diminished for IoT devices that use MAB learning. We then describe the first implementation of such algorithms on LoRa devices operating in a real LoRaWAN network at 868 MHz. We named this solution IoTligent. IoTligent adds neither processing overhead, so it can run on the IoT devices, nor network overhead, so it requires no change to the LoRaWAN protocol. Real-life experiments in a real LoRa network show that the battery life of IoTligent devices can be extended by a factor of 2 in the scenarios we faced during our experiment. Finally, we subject IoTligent devices to the very constrained conditions expected in the future with the growing number of IoT devices, by generating artificial massive IoT radio traffic in an anechoic chamber. We show that IoTligent devices can cope with the spectrum scarcity that will occur in unlicensed bands at that time.
28

Shi, Chengshuai, and Cong Shen. "Federated Multi-Armed Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (May 18, 2021): 9603–11. http://dx.doi.org/10.1609/aaai.v35i11.17156.

Abstract:
Federated multi-armed bandits (FMAB) is a new bandit paradigm that parallels the federated learning (FL) framework in supervised learning. It is inspired by practical applications in cognitive radio and recommender systems, and enjoys features that are analogous to FL. This paper proposes a general framework of FMAB and then studies two specific federated bandit models. We first study the approximate model where the heterogeneous local models are random realizations of the global model from an unknown distribution. This model introduces a new uncertainty of client sampling, as the global model may not be reliably learned even if the finite local models are perfectly known. Furthermore, this uncertainty cannot be quantified a priori without knowledge of the suboptimality gap. We solve the approximate model by proposing Federated Double UCB (Fed2-UCB), which constructs a novel “double UCB” principle accounting for uncertainties from both arm and client sampling. We show that gradually admitting new clients is critical in achieving an O(log(T)) regret while explicitly considering the communication loss. The exact model, where the global bandit model is the exact average of heterogeneous local models, is then studied as a special case. We show that, somewhat surprisingly, the order-optimal regret can be achieved independent of the number of clients with a careful choice of the update periodicity. Experiments using both synthetic and real-world datasets corroborate the theoretical analysis and demonstrate the effectiveness and efficiency of the proposed algorithms.
29

Chai, Chengliang, Jiabin Liu, Nan Tang, Guoliang Li, and Yuyu Luo. "Selective data acquisition in the wild for model charging." Proceedings of the VLDB Endowment 15, no. 7 (March 2022): 1466–78. http://dx.doi.org/10.14778/3523210.3523223.

Abstract:
The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world supervised machine learning (ML) tasks. In this paper, we study a new problem, namely selective data acquisition in the wild for model charging: given a supervised ML task and data in the wild (e.g., enterprise data warehouses, online data repositories, data markets, and so on), the problem is to select labeled data points from the data in the wild as additional training data that can help the ML task. It consists of two steps (Fig. 1). The first step is to discover relevant datasets (e.g., tables with similar relational schema), which will result in a set of candidate datasets. Because these candidate datasets come from different sources and may follow different distributions, not all data points they contain can help. The second step is to select which data points from these candidate datasets should be used. We build an end-to-end solution. For step 1, we piggyback off-the-shelf data discovery tools. Technically, our focus is on step 2, for which we propose a solution framework called AutoData. It first clusters all data points from candidate datasets such that each cluster contains similar data points from different sources. It then iteratively picks which cluster to use, samples data points (i.e., a mini-batch) from the picked cluster, evaluates the mini-batch, and then revises the search criteria by learning from the feedback (i.e., reward) based on the evaluation. We propose a multi-armed bandit based solution and a Deep Q Networks-based reinforcement learning solution. Experiments using both relational and image datasets show the effectiveness of our solutions.
30

Sankararaman, Abishek, Ayalvadi Ganesh, and Sanjay Shakkottai. "Social Learning in Multi Agent Multi Armed Bandits." Proceedings of the ACM on Measurement and Analysis of Computing Systems 3, no. 3 (December 17, 2019): 1–35. http://dx.doi.org/10.1145/3366701.

31

Sankararaman, Abishek, Ayalvadi Ganesh, and Sanjay Shakkottai. "Social Learning in Multi Agent Multi Armed Bandits." ACM SIGMETRICS Performance Evaluation Review 48, no. 1 (July 8, 2020): 29–30. http://dx.doi.org/10.1145/3410048.3410065.

32

Flynn, Hamish, David Reeb, Melih Kandemir, and Jan Peters. "PAC-Bayesian lifelong learning for multi-armed bandits." Data Mining and Knowledge Discovery 36, no. 2 (March 2022): 841–76. http://dx.doi.org/10.1007/s10618-022-00825-4.

33

Ameen, Salem, and Sunil Vadera. "Pruning Neural Networks Using Multi-Armed Bandits." Computer Journal 63, no. 7 (September 26, 2019): 1099–108. http://dx.doi.org/10.1093/comjnl/bxz078.

Abstract:
The successful application of deep learning has led to increasing expectations of its use in embedded systems. This, in turn, has created the need to find ways of reducing the size of neural networks. Decreasing the size of a neural network requires deciding which weights should be removed without compromising accuracy, which is analogous to the kind of problems addressed by multi-armed bandits (MABs). Hence, this paper explores the use of MABs for reducing the number of parameters of a neural network. Different MAB algorithms, namely ε-greedy, win-stay, lose-shift, UCB1, KL-UCB, BayesUCB, UGapEb, successive rejects and Thompson sampling are evaluated and their performance compared to existing approaches. The results show that MAB pruning methods, especially those based on UCB, outperform other pruning methods.
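
Several of the evaluated policies are built on the UCB1 index (empirical mean plus an exploration bonus). Below is a minimal generic UCB1 implementation on a toy Bernoulli problem; how arms map to prunable weights, and how the reward is defined, are specific to the paper and are not reproduced here.

```python
import math
import random

def ucb1_select(counts, sums, t):
    """UCB1 index: empirical mean plus sqrt(2 ln t / n_i); untried arms go first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: sums[i] / counts[i]
               + math.sqrt(2.0 * math.log(t) / counts[i]))

# Toy run on Bernoulli arms with assumed success rates.
true_p = [0.3, 0.5, 0.7]
counts, sums = [0] * 3, [0.0] * 3
for t in range(1, 10_001):
    arm = ucb1_select(counts, sums, t)
    reward = 1.0 if random.random() < true_p[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward
print(counts)   # pulls should concentrate on the best arm over time
```
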
34

Xu, Xiao, and Qing Zhao. "Memory-Constrained No-Regret Learning in Adversarial Multi-Armed Bandits." IEEE Transactions on Signal Processing 69 (2021): 2371–82. http://dx.doi.org/10.1109/tsp.2021.3070201.

35

Wang, Kai, Lily Xu, Aparna Taneja, and Milind Tambe. "Optimistic Whittle Index Policy: Online Learning for Restless Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 10131–39. http://dx.doi.org/10.1609/aaai.v37i8.26207.

Abstract:
Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. Solving RMABs requires information on transition dynamics, which are often unknown upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we estimate confidence bounds of the transition probabilities and formulate a bilinear program to compute optimistic Whittle indices using these estimates. Our algorithm, UCWhittle, achieves sublinear O(H√(T log T)) frequentist regret to solve RMABs with unknown transitions in T episodes with a constant horizon H. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including one constructed from a real-world maternal and childcare dataset.
36

Zhao, Qing. "Multi-Armed Bandits: Theory and Applications to Online Learning in Networks." Synthesis Lectures on Communication Networks 12, no. 1 (November 20, 2019): 1–165. http://dx.doi.org/10.2200/s00941ed2v01y201907cnt022.

37

Weinstein, Ari, and Michael Littman. "Bandit-Based Planning and Learning in Continuous-Action Markov Decision Processes." Proceedings of the International Conference on Automated Planning and Scheduling 22 (May 14, 2012): 306–14. http://dx.doi.org/10.1609/icaps.v22i1.13507.

Abstract:
Recent research leverages results from the continuous-armed bandit literature to create a reinforcement-learning algorithm for continuous state and action spaces. Initially proposed in a theoretical setting, we provide the first examination of the empirical properties of the algorithm. Through experimentation, we demonstrate the effectiveness of this planning method when coupled with exploration and model learning and show that, in addition to its formal guarantees, the approach is very competitive with other continuous-action reinforcement learners.
38

Kaibel, Chris, and Torsten Biemann. "Rethinking the Gold Standard With Multi-armed Bandits: Machine Learning Allocation Algorithms for Experiments." Organizational Research Methods 24, no. 1 (June 11, 2019): 78–103. http://dx.doi.org/10.1177/1094428119854153.

Abstract:
In experiments, researchers commonly allocate subjects randomly and equally to the different treatment conditions before the experiment starts. While this approach is intuitive, it means that new information gathered during the experiment is not utilized until after the experiment has ended. Based on methodological approaches from other scientific disciplines such as computer science and medicine, we suggest machine learning algorithms for subject allocation in experiments. Specifically, we discuss a Bayesian multi-armed bandit algorithm for randomized controlled trials and use Monte Carlo simulations to compare its efficiency with randomized controlled trials that have a fixed and balanced subject allocation. Our findings indicate that a randomized allocation based on Bayesian multi-armed bandits is more efficient and ethical in most settings. We develop recommendations for researchers and discuss the limitations of our approach.
39

Lumbreras, Josep, Erkka Haapasalo, and Marco Tomamichel. "Multi-armed quantum bandits: Exploration versus exploitation when learning properties of quantum states." Quantum 6 (June 29, 2022): 749. http://dx.doi.org/10.22331/q-2022-06-29-749.

Abstract:
We initiate the study of tradeoffs between exploration and exploitation in online learning of properties of quantum states. Given sequential oracle access to an unknown quantum state, in each round, we are tasked to choose an observable from a set of actions aiming to maximize its expectation value on the state (the reward). Information gained about the unknown state from previous rounds can be used to gradually improve the choice of action, thus reducing the gap between the reward and the maximal reward attainable with the given action set (the regret). We provide various information-theoretic lower bounds on the cumulative regret that an optimal learner must incur, and show that it scales at least as the square root of the number of rounds played. We also investigate the dependence of the cumulative regret on the number of available actions and the dimension of the underlying space. Moreover, we exhibit strategies that are optimal for bandits with a finite number of arms and general mixed states.
40

Vial, Daniel, Sanjay Shakkottai, and R. Srikant. "Robust Multi-Agent Bandits Over Undirected Graphs." ACM SIGMETRICS Performance Evaluation Review 51, no. 1 (June 26, 2023): 67–68. http://dx.doi.org/10.1145/3606376.3593567.

Abstract:
We consider a multi-agent multi-armed bandit setting in which n honest agents collaborate over a network to minimize regret but m malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur O((m + K/n) log(T) / Δ) regret in this setting, where K is the number of arms and Δ is the arm gap. For m << K, this improves over the single-agent baseline regret of O(K log(T)/Δ). In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we provide an instance for which honest agents using the state-of-the-art algorithm suffer (nearly) linear regret until time is doubly exponential in n. In light of this negative result, we propose a new algorithm for which the i-th agent has regret O((d_mal(i) + K/n) log(T)/Δ) on any connected and undirected graph, where d_mal(i) is the number of i's neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph and show the effect of malicious agents is entirely local.
41

Ben-Porat, Omer, Lee Cohen, Liu Leqi, Zachary C. Lipton, and Yishay Mansour. "Modeling Attrition in Recommender Systems with Departing Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 6 (June 28, 2022): 6072–79. http://dx.doi.org/10.1609/aaai.v36i6.20554.

Abstract:
Traditionally, when recommender systems are formalized as multi-armed bandits, the policy of the recommender system influences the rewards accrued, but not the length of interaction. However, in real-world systems, dissatisfied users may depart (and never come back). In this work, we propose a novel multi-armed bandit setup that captures such policy-dependent horizons. Our setup consists of a finite set of user types, and multiple arms with Bernoulli payoffs. Each (user type, arm) tuple corresponds to an (unknown) reward probability. Each user's type is initially unknown and can only be inferred through their response to recommendations. Moreover, if a user is dissatisfied with their recommendation, they might depart the system. We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal. We then move forward to the more challenging case, where users are divided among two types. While naive approaches cannot handle this setting, we provide an efficient learning algorithm that achieves O(sqrt(T)ln(T)) regret, where T is the number of users.
APA, Harvard, Vancouver, ISO, and other styles
42

Vial, Daniel, Sanjay Shakkottai, and R. Srikant. "Robust Multi-Agent Bandits Over Undirected Graphs." Proceedings of the ACM on Measurement and Analysis of Computing Systems 6, no. 3 (December 2022): 1–57. http://dx.doi.org/10.1145/3570614.

Full text
Abstract:
We consider a multi-agent multi-armed bandit setting in which n honest agents collaborate over a network to minimize regret but m malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur O((m + K/n) log(T)/Δ) regret in this setting, where K is the number of arms and Δ is the arm gap. For m << K, this improves over the single-agent baseline regret of O(K log(T)/Δ). In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in K and n. In light of this negative result, we propose a new algorithm for which the i-th agent has regret O((dmal(i) + K/n) log(T)/Δ) on any connected and undirected graph, where dmal(i) is the number of i's neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where dmal(i) = m), and show the effect of malicious agents is entirely local (in the sense that only the dmal(i) malicious agents directly connected to i affect its long-term regret).
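The locality claim can be illustrated with a tiny sketch that computes the leading regret factor dmal(i) + K/n for each honest agent on a line graph versus the complete graph (the graph size, arm count, and malicious set below are arbitrary illustrative choices):

# Count of malicious neighbours dmal(i) drives the per-agent regret factor (dmal(i) + K/n).
n, K = 8, 20
malicious = {3, 6}                                  # hypothetical malicious agent ids

line_graph = {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}
complete_graph = {i: set(range(n)) - {i} for i in range(n)}

def regret_factors(graph):
    return {i: len(graph[i] & malicious) + K / n
            for i in range(n) if i not in malicious}

print("line graph:    ", regret_factors(line_graph))
print("complete graph:", regret_factors(complete_graph))

On the complete graph every honest agent pays for all m = 2 malicious agents, while on the line graph only the immediate neighbours of a malicious agent do.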
APA, Harvard, Vancouver, ISO, and other styles
43

Pacchiano, Aldo, Heinrich Jiang, and Michael I. Jordan. "Robustness Guarantees for Mode Estimation with an Application to Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 9277–84. http://dx.doi.org/10.1609/aaai.v35i10.17119.

Full text
Abstract:
Mode estimation is a classical problem in statistics with a wide range of applications in machine learning. Despite this, there is little understanding of its robustness properties under possibly adversarial data contamination. In this paper, we give precise robustness guarantees as well as privacy guarantees under simple randomization. We then introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the means. We prove regret guarantees for the problems of top arm identification, top m-arms identification, contextual modal bandits, and infinite continuous arms top arm recovery. We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences, thus rendering modal bandits an attractive choice in situations where the rewards may have outliers or adversarial corruptions.
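A minimal sketch of a mode-based bandit on discrete rewards (an empirical-mode plug-in with a heuristic UCB-style bonus; the reward distributions and the bonus are illustrative assumptions, not the paper's algorithm). Here arm 1 has the higher mean but arm 0 has the higher mode, so a modal bandit should concentrate on arm 0.

import numpy as np
from collections import Counter, defaultdict

rng = np.random.default_rng(2)

# Rewards supported on {0, 1, 2}: arm 0 has the larger mode (2), arm 1 the larger mean.
supports = [0, 1, 2]
dists = [np.array([0.30, 0.30, 0.40]),   # arm 0: mode 2, mean 1.10
         np.array([0.10, 0.60, 0.30])]   # arm 1: mode 1, mean 1.20

def empirical_mode(samples):
    return Counter(samples).most_common(1)[0][0]

K, T = 2, 3000
history, pulls = defaultdict(list), np.zeros(K)

for t in range(1, T + 1):
    if t <= K:                        # pull each arm once
        a = t - 1
    else:                             # empirical mode plus a decaying exploration bonus
        scores = [empirical_mode(history[i]) + np.sqrt(2 * np.log(t) / pulls[i]) for i in range(K)]
        a = int(np.argmax(scores))
    r = rng.choice(supports, p=dists[a])
    history[a].append(int(r)); pulls[a] += 1

print("pulls per arm:", pulls.astype(int))   # the mode-based rule favours arm 0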
APA, Harvard, Vancouver, ISO, and other styles
44

Yeh, Yi-Liang, and Po-Kai Yang. "Design and Comparison of Reinforcement-Learning-Based Time-Varying PID Controllers with Gain-Scheduled Actions." Machines 9, no. 12 (November 26, 2021): 319. http://dx.doi.org/10.3390/machines9120319.

Full text
Abstract:
This paper presents innovative reinforcement learning methods for automatically tuning the parameters of a proportional-integral-derivative controller. Conventionally, the high dimension of the Q-table is a primary drawback when implementing a reinforcement learning algorithm. To overcome this obstacle, the idea underlying the n-armed bandit problem is used in this paper. Moreover, gain-scheduled actions are presented to tune the algorithms and improve the overall system behavior, so that the proposed controllers fulfill multiple performance requirements. An experiment was conducted on a piezo-actuated stage to illustrate the effectiveness of the proposed control designs relative to competing algorithms.
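One way to read the n-armed-bandit idea here is to treat a few candidate PID gain triples as arms and to reward each trial by the negative tracking error. The sketch below uses a made-up first-order plant and an epsilon-greedy rule, so it only illustrates the mechanics, not the paper's piezo-stage design or its gain-scheduled actions.

import numpy as np

rng = np.random.default_rng(3)

# Candidate (Kp, Ki, Kd) gain triples act as bandit arms.
arms = [(2.0, 0.5, 0.0), (5.0, 1.0, 0.1), (10.0, 2.0, 0.2)]

def run_step_response(gains, n_steps=200, dt=0.01):
    """Reward = negative integrated absolute error of a PID loop on a noisy first-order plant."""
    Kp, Ki, Kd = gains
    y, integ, prev_err, iae = 0.0, 0.0, 1.0, 0.0
    for _ in range(n_steps):
        err = 1.0 - y                               # unit step reference
        integ += err * dt
        deriv = (err - prev_err) / dt
        u = Kp * err + Ki * integ + Kd * deriv
        y += dt * (-y + u) + 0.01 * rng.normal()    # first-order plant with small noise
        prev_err = err
        iae += abs(err) * dt
    return -iae

K, trials, eps = len(arms), 300, 0.1
counts, values = np.zeros(K), np.zeros(K)
for t in range(trials):
    a = rng.integers(K) if rng.random() < eps or t < K else int(np.argmax(values))
    r = run_step_response(arms[a])
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]        # incremental mean update

print("estimated value (-IAE) per gain set:", np.round(values, 3))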
APA, Harvard, Vancouver, ISO, and other styles
45

Garcelon, Evrard, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. "Improved Algorithms for Conservative Exploration in Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 3962–69. http://dx.doi.org/10.1609/aaai.v34i04.5812.

Full text
Abstract:
In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
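A minimal sketch of the conservative-exploration pattern (a budget check in front of plain UCB, with a baseline arm of known mean; this is a simplified generic illustration, not CLUCB2):

import numpy as np

rng = np.random.default_rng(4)

mu = np.array([0.55, 0.4, 0.7])      # unknown Bernoulli means; arm 0 is the trusted baseline
baseline, alpha = 0, 0.1             # performance must stay above (1 - alpha) x baseline
K, T = len(mu), 20000
counts, sums, cum_reward = np.ones(K), np.zeros(K), 0.0
baseline_mean = mu[baseline]         # assumed known, as is common in conservative bandits

for t in range(1, T + 1):
    bonus = np.sqrt(2 * np.log(t) / counts)
    ucb, lcb = sums / counts + bonus, sums / counts - bonus
    candidate = int(np.argmax(ucb))
    # Conservative check: a pessimistic estimate of the reward after this pull must not
    # fall below the (1 - alpha)-discounted baseline budget; otherwise play the baseline.
    if cum_reward + max(lcb[candidate], 0.0) >= (1 - alpha) * t * baseline_mean:
        a = candidate
    else:
        a = baseline
    r = rng.binomial(1, mu[a])
    counts[a] += 1; sums[a] += r
    cum_reward += r

print(f"fraction of baseline plays: {counts[baseline] / T:.3f}, mean reward: {cum_reward / T:.3f}")

The checkpoint relaxation discussed in the paper would amount to performing the budget check only at predefined rounds instead of at every step.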
APA, Harvard, Vancouver, ISO, and other styles
46

Du, Yihan, Siwei Wang, and Longbo Huang. "A One-Size-Fits-All Solution to Conservative Bandit Problems." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 7254–61. http://dx.doi.org/10.1609/aaai.v35i8.16891.

Full text
Abstract:
In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB). Different from previous works which consider high-probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees (T-independent additive regrets instead of T-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and variability. For this extended problem, we provide a novel algorithm with O(1/T) normalized additive regrets (T-independent in the cumulative form) and validate this result through empirical evaluation.
APA, Harvard, Vancouver, ISO, and other styles
47

Li, Yang, Jiawei Jiang, Jinyang Gao, Yingxia Shao, Ce Zhang, and Bin Cui. "Efficient Automatic CASH via Rising Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 4763–71. http://dx.doi.org/10.1609/aaai.v34i04.5910.

Full text
Abstract:
The Combined Algorithm Selection and Hyperparameter optimization (CASH) problem is one of the most fundamental problems in Automatic Machine Learning (AutoML). The existing Bayesian optimization (BO) based solutions turn the CASH problem into a Hyperparameter Optimization (HPO) problem by combining the hyperparameters of all machine learning (ML) algorithms, and use BO methods to solve it. As a result, these methods suffer from low efficiency due to the huge hyperparameter space in CASH. To alleviate this issue, we propose an alternating optimization framework, where the HPO problem for each ML algorithm and the algorithm selection problem are optimized alternately. In this framework, BO methods are used to solve the HPO problem for each ML algorithm separately, over a much smaller hyperparameter space. Furthermore, we introduce Rising Bandits, a CASH-oriented Multi-Armed Bandits (MAB) variant, to model the algorithm selection in CASH. This framework takes advantage of both BO, in solving the HPO problem over a relatively small hyperparameter space, and MABs, in accelerating the algorithm selection. Moreover, we further develop an efficient online algorithm to solve the Rising Bandits with provable theoretical guarantees. Extensive experiments on 30 OpenML datasets demonstrate the superiority of the proposed approach over competitive baselines.
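A rough sketch of the alternating structure (the per-algorithm "HPO" calls are stand-in rising performance curves and the selection rule is a naive optimistic extrapolation; this illustrates the alternation between algorithm selection and per-algorithm tuning, not the paper's Rising Bandits algorithm):

import numpy as np

rng = np.random.default_rng(5)

# Stand-ins for per-algorithm HPO: each call spends one HPO iteration and returns a
# validation score that tends to rise with diminishing returns (a "rising" arm).
def make_hpo_arm(ceiling, speed):
    state = {"it": 0}
    def step():
        state["it"] += 1
        return ceiling * (1 - np.exp(-speed * state["it"])) + 0.01 * rng.random()
    return step

arms = [make_hpo_arm(0.90, 0.05), make_hpo_arm(0.85, 0.30), make_hpo_arm(0.70, 0.50)]
K, budget = len(arms), 120
best = [0.0] * K            # best validation score seen per algorithm
last_gain = [1.0] * K       # most recent improvement, used for optimistic extrapolation

for t in range(budget):
    remaining = budget - t
    # Optimistic index: current best plus an extrapolated future gain (capped at 1.0).
    index = [min(best[i] + last_gain[i] * remaining, 1.0) for i in range(K)]
    i = int(np.argmax(index))
    score = arms[i]()                       # one alternating HPO step on the chosen algorithm
    last_gain[i] = max(score - best[i], 0.0)
    best[i] = max(best[i], score)

print("best score per algorithm:", np.round(best, 3), "-> selected:", int(np.argmax(best)))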
APA, Harvard, Vancouver, ISO, and other styles
48

Wang, Kai, Shresth Verma, Aditya Mate, Sanket Shah, Aparna Taneja, Neha Madhiwalla, Aparna Hegde, and Milind Tambe. "Scalable Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Health." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 10 (June 26, 2023): 12138–46. http://dx.doi.org/10.1609/aaai.v37i10.26431.

Full text
Abstract:
This paper studies restless multi-armed bandit (RMAB) problems with unknown arm transition dynamics but with known correlated arm features. The goal is to learn a model to predict transition dynamics given features, where the Whittle index policy solves the RMAB problems using predicted transitions. However, prior works often learn the model by maximizing the predictive accuracy instead of final RMAB solution quality, causing a mismatch between training and evaluation objectives. To address this shortcoming, we propose a novel approach for decision-focused learning in RMAB that directly trains the predictive model to maximize the Whittle index solution quality. We present three key contributions: (i) we establish differentiability of the Whittle index policy to support decision-focused learning; (ii) we significantly improve the scalability of decision-focused learning approaches in sequential problems, specifically RMAB problems; (iii) we apply our algorithm to a previously collected dataset of maternal and child health to demonstrate its performance. Indeed, our algorithm is the first for decision-focused learning in RMAB that scales to real-world problem sizes.
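To make the Whittle index machinery concrete, the sketch below computes the index of a single two-state restless arm by binary search over the passivity subsidy, using a discounted value iteration; the transition matrices, rewards, and discounting are illustrative assumptions, and the paper's contribution (a differentiable, decision-focused version trained from features) is not attempted here.

import numpy as np

# Illustrative 2-state restless arm: state 1 is the "good" state with reward 1.
# P[a][s][s'] gives transition probabilities under passive (a=0) and active (a=1) actions.
P = np.array([[[0.9, 0.1],    # passive from state 0
               [0.4, 0.6]],   # passive from state 1
              [[0.3, 0.7],    # active from state 0
               [0.1, 0.9]]])  # active from state 1
r = np.array([0.0, 1.0])
beta = 0.95

def q_values(subsidy, iters=500):
    """Value iteration on the subsidy MDP; the passive action earns an extra `subsidy`."""
    V = np.zeros(2)
    for _ in range(iters):
        q_passive = r + subsidy + beta * (P[0] @ V)
        q_active = r + beta * (P[1] @ V)
        V = np.maximum(q_passive, q_active)
    return q_passive, q_active

def whittle_index(state, lo=-1.0, hi=5.0, tol=1e-4):
    """Binary search for the smallest subsidy making the passive action optimal in `state`."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        q_passive, q_active = q_values(mid)
        if q_passive[state] >= q_active[state]:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

print("Whittle indices:", [round(whittle_index(s), 3) for s in (0, 1)])

In a full RMAB, the Whittle index policy would then activate, in each round, the k arms whose current states carry the largest indices.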
APA, Harvard, Vancouver, ISO, and other styles
49

Guo, Han, Ramakanth Pasunuru, and Mohit Bansal. "Multi-Source Domain Adaptation for Text Classification via DistanceNet-Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (April 3, 2020): 7830–38. http://dx.doi.org/10.1609/aaai.v34i05.6288.

Full text
Abstract:
Domain adaptation performance of a learning algorithm on a target domain is a function of its source-domain error and a divergence measure between the data distributions of the two domains. We present a study of various distance-based measures, in the context of NLP tasks, that characterize the dissimilarity between domains based on sample estimates. We first conduct analysis experiments to show which of these distance measures can best differentiate samples from the same versus different domains and are correlated with empirical results. Next, we develop a DistanceNet model which uses these distance measures, or a mixture of these distance measures, as an additional loss function to be minimized jointly with the task's loss function, so as to achieve better unsupervised domain adaptation. Finally, we extend this model to a novel DistanceNet-Bandit model, which employs a multi-armed bandit controller to dynamically switch between multiple source domains and allow the model to learn an optimal trajectory and mixture of domains for transfer to the low-resource target domain. We conduct experiments on popular sentiment analysis datasets with several diverse domains and show that our DistanceNet model, as well as its dynamic bandit variant, can outperform competitive baselines in the context of unsupervised domain adaptation.
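The bandit-controller component can be sketched with a standard EXP3 learner choosing a source domain at each training step, rewarded by the improvement that step produced; the domain names and the reward signal below are synthetic placeholders, not the paper's DistanceNet setup.

import numpy as np

rng = np.random.default_rng(6)

domains = ["reviews", "news", "tweets"]          # hypothetical source domains
K, T, gamma = len(domains), 500, 0.1
weights = np.ones(K)

def train_step_on(domain_idx):
    """Placeholder for one training step on the chosen source domain; returns a
    reward in {0, 1} standing in for the validation improvement it produced."""
    true_usefulness = [0.6, 0.4, 0.2]
    return float(rng.random() < true_usefulness[domain_idx])

for t in range(T):
    probs = (1 - gamma) * weights / weights.sum() + gamma / K
    a = rng.choice(K, p=probs)
    r = train_step_on(a)
    # EXP3 importance-weighted update for the pulled arm only.
    weights[a] *= np.exp(gamma * (r / probs[a]) / K)

print("final sampling distribution over source domains:",
      dict(zip(domains, np.round(weights / weights.sum(), 3))))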
APA, Harvard, Vancouver, ISO, and other styles
50

Lupu, Andrei, Audrey Durand, and Doina Precup. "Leveraging Observations in Bandits: Between Risks and Benefits." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6112–19. http://dx.doi.org/10.1609/aaai.v33i01.33016112.

Full text
Abstract:
Imitation learning has been widely used to speed up learning in novice agents by allowing them to leverage existing data from experts. Allowing an agent to be influenced by external observations can benefit the learning process, but it also puts the agent at risk of following sub-optimal behaviours. In this paper, we study this problem in the context of bandits. More specifically, we consider an agent (learner) interacting with a bandit-style decision task that can also observe a target policy interacting with the same environment. The learner observes only the target's actions, not the rewards obtained. We introduce a new bandit optimism modifier that uses conditional optimism, contingent on the actions of the target, to guide the agent's exploration. We analyze the effect of this modification on the well-known Upper Confidence Bound algorithm by proving that it preserves a regret upper bound of order O(ln T), even in the presence of a very poor target, and we derive the dependency of the expected regret on the general target policy. We provide empirical results showing both great benefits and certain limitations inherent to observational learning in the multi-armed bandit setting. Experiments are conducted using targets satisfying the theoretical assumptions with high probability, thus narrowing the gap between theory and application.
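A minimal sketch of the general idea (a UCB learner that grants an extra, decaying optimism bonus to arms it observes the target choosing; the exact form of the bonus is an illustrative stand-in for the paper's modifier):

import numpy as np

rng = np.random.default_rng(7)

mu = np.array([0.3, 0.5, 0.7])              # unknown Bernoulli means
target_policy = np.array([0.1, 0.1, 0.8])   # the observed target mostly plays the best arm
K, T, beta = len(mu), 5000, 0.5
counts, sums, obs_counts, regret = np.ones(K), np.zeros(K), np.zeros(K), 0.0

for t in range(1, T + 1):
    # The learner sees which arm the target pulls (but not the target's reward).
    obs_counts[rng.choice(K, p=target_policy)] += 1
    bonus = np.sqrt(2 * np.log(t) / counts)
    # Conditional optimism: arms the target favours get an extra bonus that fades
    # as the learner's own estimates firm up.
    imitation = beta * (obs_counts / t) * np.sqrt(np.log(t) / counts)
    a = int(np.argmax(sums / counts + bonus + imitation))
    r = rng.binomial(1, mu[a])
    counts[a] += 1; sums[a] += r
    regret += mu.max() - mu[a]

print(f"cumulative regret: {regret:.1f}, pulls: {counts.astype(int)}")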
APA, Harvard, Vancouver, ISO, and other styles