Journal articles on the topic 'Bandit algorithm'
Consult the top 50 journal articles for your research on the topic 'Bandit algorithm.'
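The entries below all study variants of the multi-armed bandit problem: a learner repeatedly pulls one of several "arms" and must balance exploring uncertain arms against exploiting the best-looking one. As a point of reference for the list, here is a minimal ε-greedy sketch of that explore/exploit loop (all names, such as `epsilon_greedy` and `reward_fns`, are illustrative and not taken from any cited paper):

```python
import random

def epsilon_greedy(reward_fns, horizon=10_000, epsilon=0.1, seed=0):
    """Run an epsilon-greedy bandit over the given arms.

    reward_fns: list of zero-argument callables, one per arm,
    each returning a stochastic reward when that arm is pulled.
    Returns the per-arm estimated mean rewards and the total reward.
    """
    rng = random.Random(seed)
    counts = [0] * len(reward_fns)    # pulls per arm
    means = [0.0] * len(reward_fns)   # running mean reward per arm
    total = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            # explore: pull a uniformly random arm
            arm = rng.randrange(len(reward_fns))
        else:
            # exploit: pull the arm with the best estimated mean so far
            arm = max(range(len(reward_fns)), key=means.__getitem__)
        r = reward_fns[arm]()
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
        total += r
    return means, total
```

Many of the cited papers replace this fixed-ε exploration rule with more refined strategies, such as upper confidence bounds (UCB) or posterior sampling (Thompson sampling), or extend the setting to contextual, combinatorial, federated, or non-stationary bandits.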
Ciucanu, Radu, Pascal Lafourcade, Gael Marcadet, and Marta Soare. "SAMBA: A Generic Framework for Secure Federated Multi-Armed Bandits." Journal of Artificial Intelligence Research 73 (February 23, 2022): 737–65. http://dx.doi.org/10.1613/jair.1.13163.
Zhou, Huozhi, Lingda Wang, Lav Varshney, and Ee-Peng Lim. "A Near-Optimal Change-Detection Based Algorithm for Piecewise-Stationary Combinatorial Semi-Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 4 (April 3, 2020): 6933–40. http://dx.doi.org/10.1609/aaai.v34i04.6176.
Azizi, Javad, Branislav Kveton, Mohammad Ghavamzadeh, and Sumeet Katariya. "Meta-Learning for Simple Regret Minimization." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (June 26, 2023): 6709–17. http://dx.doi.org/10.1609/aaai.v37i6.25823.
Kuroki, Yuko, Liyuan Xu, Atsushi Miyauchi, Junya Honda, and Masashi Sugiyama. "Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback." Neural Computation 32, no. 9 (September 2020): 1733–73. http://dx.doi.org/10.1162/neco_a_01299.
Li, Youxuan. "Improvement of the recommendation system based on the multi-armed bandit algorithm." Applied and Computational Engineering 36, no. 1 (January 22, 2024): 237–41. http://dx.doi.org/10.54254/2755-2721/36/20230453.
Liu, Zizhuo. "Investigation of progress and application related to Multi-Armed Bandit algorithms." Applied and Computational Engineering 37, no. 1 (January 22, 2024): 155–59. http://dx.doi.org/10.54254/2755-2721/37/20230496.
Agarwal, Mridul, Vaneet Aggarwal, Abhishek Kumar Umrawal, and Chris Quinn. "DART: Adaptive Accept Reject Algorithm for Non-Linear Combinatorial Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 6557–65. http://dx.doi.org/10.1609/aaai.v35i8.16812.
Xue, Bo, Ji Cheng, Fei Liu, Yimu Wang, and Qingfu Zhang. "Multiobjective Lipschitz Bandits under Lexicographic Ordering." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 15 (March 24, 2024): 16238–46. http://dx.doi.org/10.1609/aaai.v38i15.29558.
Sharaf, Amr, and Hal Daumé III. "Meta-Learning Effective Exploration Strategies for Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (May 18, 2021): 9541–48. http://dx.doi.org/10.1609/aaai.v35i11.17149.
Nobari, Sadegh. "DBA: Dynamic Multi-Armed Bandit Algorithm." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9869–70. http://dx.doi.org/10.1609/aaai.v33i01.33019869.
Qu, Jiaming. "Survey of dynamic pricing based on Multi-Armed Bandit algorithms." Applied and Computational Engineering 37, no. 1 (January 22, 2024): 160–65. http://dx.doi.org/10.54254/2755-2721/37/20230497.
Niño-Mora, José. "A Fast-Pivoting Algorithm for Whittle’s Restless Bandit Index." Mathematics 8, no. 12 (December 15, 2020): 2226. http://dx.doi.org/10.3390/math8122226.
Lamberton, Damien, and Gilles Pagès. "A penalized bandit algorithm." Electronic Journal of Probability 13 (2008): 341–73. http://dx.doi.org/10.1214/ejp.v13-489.
Cheung, Wang Chi, David Simchi-Levi, and Ruihao Zhu. "Hedging the Drift: Learning to Optimize Under Nonstationarity." Management Science 68, no. 3 (March 2022): 1696–713. http://dx.doi.org/10.1287/mnsc.2021.4024.
Chen, Panyangjie. "Investigation of selection and application of Multi-Armed Bandit algorithms in recommendation system." Applied and Computational Engineering 34, no. 1 (January 22, 2024): 185–90. http://dx.doi.org/10.54254/2755-2721/34/20230323.
Fourati, Fares, Christopher John Quinn, Mohamed-Slim Alouini, and Vaneet Aggarwal. "Combinatorial Stochastic-Greedy Bandit." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 11 (March 24, 2024): 12052–60. http://dx.doi.org/10.1609/aaai.v38i11.29093.
Oswal, Urvashi, Aniruddha Bhargava, and Robert Nowak. "Linear Bandits with Feature Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 4 (April 3, 2020): 5331–38. http://dx.doi.org/10.1609/aaai.v34i04.5980.
Tang, Qiao, Hong Xie, Yunni Xia, Jia Lee, and Qingsheng Zhu. "Robust Contextual Bandits via Bootstrapping." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (May 18, 2021): 12182–89. http://dx.doi.org/10.1609/aaai.v35i13.17446.
Li, Wenjie, Qifan Song, Jean Honorio, and Guang Lin. "Federated X-armed Bandit." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (March 24, 2024): 13628–36. http://dx.doi.org/10.1609/aaai.v38i12.29267.
Wang, Liangxu. "Investigation of frontier Multi-Armed Bandit algorithms and applications." Applied and Computational Engineering 34, no. 1 (January 22, 2024): 179–84. http://dx.doi.org/10.54254/2755-2721/34/20230322.
Du, Yihan, Siwei Wang, and Longbo Huang. "A One-Size-Fits-All Solution to Conservative Bandit Problems." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 7254–61. http://dx.doi.org/10.1609/aaai.v35i8.16891.
Esfandiari, Hossein, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. "Regret Bounds for Batched Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 7340–48. http://dx.doi.org/10.1609/aaai.v35i8.16901.
Han, Qi, Li Zhu, and Fei Guo. "Forced Exploration in Bandit Problems." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 11 (March 24, 2024): 12270–77. http://dx.doi.org/10.1609/aaai.v38i11.29117.
Chen, Xijin, Kim May Lee, Sofia S. Villar, and David S. Robertson. "Some performance considerations when using multi-armed bandit algorithms in the presence of missing data." PLOS ONE 17, no. 9 (September 12, 2022): e0274272. http://dx.doi.org/10.1371/journal.pone.0274272.
Ene, Alina, Huy L. Nguyen, and Adrian Vladu. "Projection-Free Bandit Optimization with Privacy Guarantees." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 7322–30. http://dx.doi.org/10.1609/aaai.v35i8.16899.
Chen, Tianfeng. "Empirical performances comparison for ETC algorithm." Applied and Computational Engineering 13, no. 1 (October 23, 2023): 29–36. http://dx.doi.org/10.54254/2755-2721/13/20230705.
Zhu, Zhaowei, Jingxuan Zhu, Ji Liu, and Yang Liu. "Federated Bandit." Proceedings of the ACM on Measurement and Analysis of Computing Systems 5, no. 1 (February 18, 2021): 1–29. http://dx.doi.org/10.1145/3447380.
Rangi, Anshuka, Long Tran-Thanh, Haifeng Xu, and Massimo Franceschetti. "Saving Stochastic Bandits from Poisoning Attacks via Limited Data Verification." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 7 (June 28, 2022): 8054–61. http://dx.doi.org/10.1609/aaai.v36i7.20777.
Amani, Sanae, and Christos Thrampoulidis. "Decentralized Multi-Agent Linear Bandits with Safety Constraints." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 6627–35. http://dx.doi.org/10.1609/aaai.v35i8.16820.
Huang, Wen, Lu Zhang, and Xintao Wu. "Achieving Counterfactual Fairness for Causal Bandit." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 6 (June 28, 2022): 6952–59. http://dx.doi.org/10.1609/aaai.v36i6.20653.
Narita, Yusuke, Shota Yasui, and Kohei Yata. "Efficient Counterfactual Learning from Bandit Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4634–41. http://dx.doi.org/10.1609/aaai.v33i01.33014634.
Zhao, Shanshan, Wenhai Cui, Bei Jiang, Linglong Kong, and Xiaodong Yan. "Responsible Bandit Learning via Privacy-Protected Mean-Volatility Utility." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 19 (March 24, 2024): 21815–22. http://dx.doi.org/10.1609/aaai.v38i19.30182.
Tolpin, David, and Solomon Shimony. "MCTS Based on Simple Regret." Proceedings of the International Symposium on Combinatorial Search 3, no. 1 (August 20, 2021): 193–99. http://dx.doi.org/10.1609/socs.v3i1.18221.
Li, Litao. "Exploring Multi-Armed Bandit algorithms: Performance analysis in dynamic environments." Applied and Computational Engineering 34, no. 1 (January 22, 2024): 252–59. http://dx.doi.org/10.54254/2755-2721/34/20230338.
Oh, Min-hwan, and Garud Iyengar. "Multinomial Logit Contextual Bandits: Provable Optimality and Practicality." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 9205–13. http://dx.doi.org/10.1609/aaai.v35i10.17111.
Varatharajah, Yogatheesan, and Brent Berry. "A Contextual-Bandit-Based Approach for Informed Decision-Making in Clinical Trials." Life 12, no. 8 (August 21, 2022): 1277. http://dx.doi.org/10.3390/life12081277.
Shiyan, Dmitry. "One-armed bandit problem and the mirror descent algorithm." Mathematical Game Theory and Applications 15, no. 3 (February 2, 2024): 88–106. http://dx.doi.org/10.17076/mgta_2023_3_75.
Yu, Junpu. "Thompson ε-Greedy Algorithm: An Improvement to the Regret of Thompson Sampling and ε-Greedy on Multi-Armed Bandit Problems." Applied and Computational Engineering 8, no. 1 (August 1, 2023): 525–34. http://dx.doi.org/10.54254/2755-2721/8/20230264.
Garcelon, Evrard, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. "Improved Algorithms for Conservative Exploration in Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 4 (April 3, 2020): 3962–69. http://dx.doi.org/10.1609/aaai.v34i04.5812.
Kasy, Maximilian, and Anja Sautmann. "Adaptive Treatment Assignment in Experiments for Policy Choice." Econometrica 89, no. 1 (2021): 113–32. http://dx.doi.org/10.3982/ecta17527.
Ontanon, Santiago. "The Combinatorial Multi-Armed Bandit Problem and Its Application to Real-Time Strategy Games." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 9, no. 1 (June 30, 2021): 58–64. http://dx.doi.org/10.1609/aiide.v9i1.12681.
Patil, Vishakha, Ganesh Ghalme, Vineet Nair, and Y. Narahari. "Achieving Fairness in the Stochastic Multi-Armed Bandit Problem." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 4 (April 3, 2020): 5379–86. http://dx.doi.org/10.1609/aaai.v34i04.5986.
Wang, Zhenlin, and Jonathan Scarlett. "Max-Min Grouped Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8603–11. http://dx.doi.org/10.1609/aaai.v36i8.20838.
Sakakibara, Masaya, Akira Notsu, Seiki Ubukata, and Katsuhiro Honda. "Designation of Candidate Solutions in Differential Evolution Based on Bandit Algorithm and its Evaluation." Journal of Advanced Computational Intelligence and Intelligent Informatics 23, no. 4 (July 20, 2019): 758–66. http://dx.doi.org/10.20965/jaciii.2019.p0758.
Kim, Gi-Soo, Jane P. Kim, and Hyun-Joon Yang. "Robust Tests in Online Decision-Making." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 9 (June 28, 2022): 10016–24. http://dx.doi.org/10.1609/aaai.v36i9.21240.
Mansour, Yishay, Aleksandrs Slivkins, and Vasilis Syrgkanis. "Bayesian Incentive-Compatible Bandit Exploration." Operations Research 68, no. 4 (July 2020): 1132–61. http://dx.doi.org/10.1287/opre.2019.1949.
Ding, Wenkui, Tao Qin, Xu-Dong Zhang, and Tie-Yan Liu. "Multi-Armed Bandit with Budget Constraint and Variable Costs." Proceedings of the AAAI Conference on Artificial Intelligence 27, no. 1 (June 30, 2013): 232–38. http://dx.doi.org/10.1609/aaai.v27i1.8637.
Liu, Yizhi. "An investigation of progress related to stochastic stationary bandit algorithms." Applied and Computational Engineering 34, no. 1 (January 22, 2024): 197–201. http://dx.doi.org/10.54254/2755-2721/34/20230326.
Kaibel, Chris, and Torsten Biemann. "Rethinking the Gold Standard With Multi-armed Bandits: Machine Learning Allocation Algorithms for Experiments." Organizational Research Methods 24, no. 1 (June 11, 2019): 78–103. http://dx.doi.org/10.1177/1094428119854153.
Lupu, Andrei, Audrey Durand, and Doina Precup. "Leveraging Observations in Bandits: Between Risks and Benefits." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6112–19. http://dx.doi.org/10.1609/aaai.v33i01.33016112.