Journal articles on the topic "Algorithme de bandit"

Below are the top 50 journal articles for research on the topic "Algorithme de bandit".

The abstract (summary) of each work is included below when it is present in the metadata.

The articles span many scientific areas and can be used to compile an accurate bibliography.

1

Ciucanu, Radu, Pascal Lafourcade, Gael Marcadet, and Marta Soare. "SAMBA: A Generic Framework for Secure Federated Multi-Armed Bandits". Journal of Artificial Intelligence Research 73 (23 February 2022): 737–65. http://dx.doi.org/10.1613/jair.1.13163.

Abstract:
The multi-armed bandit is a reinforcement learning model where a learning agent repeatedly chooses an action (pulls a bandit arm) and the environment responds with a stochastic outcome (reward) coming from an unknown distribution associated with the chosen arm. Bandits have a wide range of applications, such as Web recommendation systems. We address the cumulative reward maximization problem in a secure federated learning setting, where multiple data owners keep their data stored locally and collaborate under the coordination of a central orchestration server. We rely on cryptographic schemes and propose Samba, a generic framework for Secure federAted Multi-armed BAndits. Each data owner has data associated with a bandit arm, and the bandit algorithm has to sequentially select which data owner is solicited at each time step. We instantiate Samba for five bandit algorithms. We show that Samba returns the same cumulative reward as the nonsecure versions of bandit algorithms, while satisfying formally proven security properties. We also show that the overhead due to cryptographic primitives is linear in the size of the input, which is confirmed by our proof-of-concept implementation.
2

Azizi, Javad, Branislav Kveton, Mohammad Ghavamzadeh, and Sumeet Katariya. "Meta-Learning for Simple Regret Minimization". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (26 June 2023): 6709–17. http://dx.doi.org/10.1609/aaai.v37i6.25823.

Abstract:
We develop a meta-learning framework for simple regret minimization in bandits. In this framework, a learning agent interacts with a sequence of bandit tasks, which are sampled i.i.d. from an unknown prior distribution, and learns its meta-parameters to perform better on future tasks. We propose the first Bayesian and frequentist meta-learning algorithms for this setting. The Bayesian algorithm has access to a prior distribution over the meta-parameters and its meta simple regret over m bandit tasks with horizon n is mere O(m / √n). On the other hand, the meta simple regret of the frequentist algorithm is O(n√m + m/ √n). While its regret is worse, the frequentist algorithm is more general because it does not need a prior distribution over the meta-parameters. It can also be analyzed in more settings. We instantiate our algorithms for several classes of bandit problems. Our algorithms are general and we complement our theory by evaluating them empirically in several environments.
3

Zhou, Huozhi, Lingda Wang, Lav Varshney, and Ee-Peng Lim. "A Near-Optimal Change-Detection Based Algorithm for Piecewise-Stationary Combinatorial Semi-Bandits". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (3 April 2020): 6933–40. http://dx.doi.org/10.1609/aaai.v34i04.6176.

Abstract:
We investigate the piecewise-stationary combinatorial semi-bandit problem. Compared to the original combinatorial semi-bandit problem, our setting assumes the reward distributions of base arms may change in a piecewise-stationary manner at unknown time steps. We propose an algorithm, GLR-CUCB, which incorporates an efficient combinatorial semi-bandit algorithm, CUCB, with an almost parameter-free change-point detector, the Generalized Likelihood Ratio Test (GLRT). Our analysis shows that the regret of GLR-CUCB is upper bounded by O(√NKT log T), where N is the number of piecewise-stationary segments, K is the number of base arms, and T is the number of time steps. As a complement, we also derive a nearly matching regret lower bound on the order of Ω(√NKT), for both piecewise-stationary multi-armed bandits and combinatorial semi-bandits, using information-theoretic techniques and judiciously constructed piecewise-stationary bandit instances. Our lower bound is tighter than the best available regret lower bound, which is Ω(√T). Numerical experiments on both synthetic and real-world datasets demonstrate the superiority of GLR-CUCB compared to other state-of-the-art algorithms.
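For readers who want a concrete picture of the change-point machinery underlying GLR-CUCB, the following minimal Python sketch implements a Bernoulli generalized likelihood ratio test over one base arm's reward history. It is an illustrative sketch under the assumption of Bernoulli rewards: the function names are hypothetical and the alarm threshold is left as a tuning parameter rather than the calibrated value used in the paper.

```python
import math

def kl_bernoulli(p: float, q: float, eps: float = 1e-12) -> float:
    """Binary KL divergence KL(Ber(p) || Ber(q)), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def glr_change_detected(rewards: list[float], threshold: float) -> bool:
    """Scan every split point of one arm's reward history and flag a change
    if the GLR statistic exceeds the (user-chosen) threshold."""
    n = len(rewards)
    if n < 2:
        return False
    total = sum(rewards)
    overall_mean = total / n
    prefix = 0.0
    for s in range(1, n):
        prefix += rewards[s - 1]
        mean_left = prefix / s
        mean_right = (total - prefix) / (n - s)
        stat = (s * kl_bernoulli(mean_left, overall_mean)
                + (n - s) * kl_bernoulli(mean_right, overall_mean))
        if stat >= threshold:
            return True
    return False
```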
4

Li, Youxuan. "Improvement of the recommendation system based on the multi-armed bandit algorithm". Applied and Computational Engineering 36, no. 1 (22 January 2024): 237–41. http://dx.doi.org/10.54254/2755-2721/36/20230453.

Abstract:
In order to effectively solve common problems of recommendation systems, such as the cold start problem and the dynamic data modeling problem, the multi-armed bandit (MAB) algorithm, the collaborative filtering (CF) algorithm, and user information feedback are applied by researchers to update the recommendation model online and in time. In other words, the cold start problem of the recommendation system is transformed into an issue of exploration and exploitation. The MAB algorithm is used, user features are introduced as content, and the synergy between users is further considered. In this paper, the author studies the improvement of the recommendation system based on the multi-armed bandit algorithm. The Linear Upper Confidence Bound (LinUCB), Collaborative Filtering Bandits (COFIBA), and Context-Aware clustering of Bandits (CAB) algorithms are analyzed. It is found that the MAB algorithm can achieve a good maximum total revenue regardless of the content value after going through the cold start stage. In the case of a particularly large amount of content, the CAB algorithm achieves the greatest effect.
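As background for the LinUCB algorithm analyzed above, here is a minimal Python sketch of the disjoint LinUCB selection rule. The class layout, variable names, and the constant alpha are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class LinUCBArm:
    """One arm of disjoint LinUCB; alpha controls the confidence bonus width."""

    def __init__(self, n_features: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = np.eye(n_features)      # ridge-regularised Gram matrix
        self.b = np.zeros(n_features)    # accumulated reward-weighted contexts

    def ucb(self, x: np.ndarray) -> float:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b           # ridge-regression estimate of the arm
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)
        self.b += reward * x

def choose(arms: list[LinUCBArm], context: np.ndarray) -> int:
    """Pick the arm with the highest upper confidence bound for this context."""
    return int(np.argmax([arm.ucb(context) for arm in arms]))
```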
5

Kuroki, Yuko, Liyuan Xu, Atsushi Miyauchi, Junya Honda, and Masashi Sugiyama. "Polynomial-Time Algorithms for Multiple-Arm Identification with Full-Bandit Feedback". Neural Computation 32, no. 9 (September 2020): 1733–73. http://dx.doi.org/10.1162/neco_a_01299.

Abstract:
We study the problem of stochastic multiple-arm identification, where an agent sequentially explores a size-[Formula: see text] subset of arms (also known as a super arm) from given [Formula: see text] arms and tries to identify the best super arm. Most work so far has considered the semi-bandit setting, where the agent can observe the reward of each pulled arm or assumed each arm can be queried at each round. However, in real-world applications, it is costly or sometimes impossible to observe a reward of individual arms. In this study, we tackle the full-bandit setting, where only a noisy observation of the total sum of a super arm is given at each pull. Although our problem can be regarded as an instance of the best arm identification in linear bandits, a naive approach based on linear bandits is computationally infeasible since the number of super arms [Formula: see text] is exponential. To cope with this problem, we first design a polynomial-time approximation algorithm for a 0-1 quadratic programming problem arising in confidence ellipsoid maximization. Based on our approximation algorithm, we propose a bandit algorithm whose computation time is [Formula: see text](log [Formula: see text]), thereby achieving an exponential speedup over linear bandit algorithms. We provide a sample complexity upper bound that is still worst-case optimal. Finally, we conduct experiments on large-scale data sets with more than 10[Formula: see text] super arms, demonstrating the superiority of our algorithms in terms of both the computation time and the sample complexity.
6

Niño-Mora, José. "A Fast-Pivoting Algorithm for Whittle’s Restless Bandit Index". Mathematics 8, no. 12 (15 December 2020): 2226. http://dx.doi.org/10.3390/math8122226.

Abstract:
The Whittle index for restless bandits (two-action semi-Markov decision processes) provides an intuitively appealing optimal policy for controlling a single generic project that can be active (engaged) or passive (rested) at each decision epoch, and which can change state while passive. It further provides a practical heuristic priority-index policy for the computationally intractable multi-armed restless bandit problem, which has been widely applied over the last three decades in multifarious settings, yet mostly restricted to project models with a one-dimensional state. This is due in part to the difficulty of establishing indexability (existence of the index) and of computing the index for projects with large state spaces. This paper draws on the author’s prior results on sufficient indexability conditions and an adaptive-greedy algorithmic scheme for restless bandits to obtain a new fast-pivoting algorithm that computes the n Whittle index values of an n-state restless bandit by performing, after an initialization stage, n steps that entail (2/3)n³ + O(n²) arithmetic operations. This algorithm also draws on the parametric simplex method, and is based on elucidating the pattern of parametric simplex tableaux, which makes it possible to exploit special structure to substantially simplify and reduce the complexity of simplex pivoting steps. A numerical study demonstrates substantial runtime speed-ups versus alternative algorithms.
7

Oswal, Urvashi, Aniruddha Bhargava, and Robert Nowak. "Linear Bandits with Feature Feedback". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (3 April 2020): 5331–38. http://dx.doi.org/10.1609/aaai.v34i04.5980.

Abstract:
This paper explores a new form of the linear bandit problem in which the algorithm receives the usual stochastic rewards as well as stochastic feedback about which features are relevant to the rewards, the latter feedback being the novel aspect. The focus of this paper is the development of new theory and algorithms for linear bandits with feature feedback which can achieve regret over time horizon T that scales like k√T, without prior knowledge of which features are relevant nor the number k of relevant features. In comparison, the regret of traditional linear bandits is d√T, where d is the total number of (relevant and irrelevant) features, so the improvement can be dramatic if k ≪ d. The computational complexity of the algorithm is proportional to k rather than d, making it much more suitable for real-world applications compared to traditional linear bandits. We demonstrate the performance of the algorithm with synthetic and real human-labeled data.
8

Agarwal, Mridul, Vaneet Aggarwal, Abhishek Kumar Umrawal, and Chris Quinn. "DART: Adaptive Accept Reject Algorithm for Non-Linear Combinatorial Bandits". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (18 May 2021): 6557–65. http://dx.doi.org/10.1609/aaai.v35i8.16812.

Abstract:
We consider the bandit problem of selecting K out of N arms at each time step. The joint reward can be a non-linear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among all possible combinations, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for top-K subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in N. Further, DART achieves a regret bound of Õ(K√KNT) for a time horizon T, which matches the lower bound in bandit feedback up to a factor of √log 2NT. When applied to the problem of cross-selling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that of state-of-the-art algorithms. We also show that DART significantly outperforms existing methods for both linear and non-linear joint reward environments.
9

Qu, Jiaming. "Survey of dynamic pricing based on Multi-Armed Bandit algorithms". Applied and Computational Engineering 37, no. 1 (22 January 2024): 160–65. http://dx.doi.org/10.54254/2755-2721/37/20230497.

Abstract:
Dynamic pricing seeks to determine the most optimal selling price for a product or service, taking into account factors like limited supply and uncertain demand. This study aims to provide a comprehensive exploration of dynamic pricing using the multi-armed bandit problem framework in various contexts. The investigation highlights the prevalence of Thompson sampling in dynamic pricing scenarios with a Bayesian backdrop, where the seller possesses prior knowledge of demand functions. On the other hand, in non-Bayesian situations, the Upper Confidence Bound (UCB) algorithm family gains traction due to their favorable regret bounds. As markets often exhibit temporal fluctuations, the domain of non-stationary multi-armed bandits within dynamic pricing emerges as crucial. Future research directions include enhancing traditional multi-armed bandit algorithms to suit online learning settings, especially those involving dynamic reward distributions. Additionally, merging prior insights into demand functions with contextual multi-armed bandit approaches holds promise for advancing dynamic pricing strategies. In conclusion, this study sheds light on dynamic pricing through the lens of multi-armed bandit problems, offering insights and pathways for further exploration.
10

Wan, Zongqi, Zhijie Zhang, Tongyang Li, Jialin Zhang, and Xiaoming Sun. "Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (26 June 2023): 10087–94. http://dx.doi.org/10.1609/aaai.v37i8.26202.

Abstract:
Multi-arm bandit (MAB) and stochastic linear bandit (SLB) are important models in reinforcement learning, and it is well-known that classical algorithms for bandits with time horizon T suffer from the regret of at least the square root of T. In this paper, we study MAB and SLB with quantum reward oracles and propose quantum algorithms for both models with the order of the polylog T regrets, exponentially improving the dependence in terms of T. To the best of our knowledge, this is the first provable quantum speedup for regrets of bandit problems and in general exploitation in reinforcement learning. Compared to previous literature on quantum exploration algorithms for MAB and reinforcement learning, our quantum input model is simpler and only assumes quantum oracles for each individual arm.
11

Du, Yihan, Siwei Wang, and Longbo Huang. "A One-Size-Fits-All Solution to Conservative Bandit Problems". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (18 May 2021): 7254–61. http://dx.doi.org/10.1609/aaai.v35i8.16891.

Abstract:
In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB). Different from previous works which consider high probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees (T-independent additive regrets instead of T-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and variability. For this extended problem, we provide a novel algorithm with O(1/T) normalized additive regrets (T-independent in the cumulative form) and validate this result through empirical evaluation.
12

Xue, Bo, Ji Cheng, Fei Liu, Yimu Wang, and Qingfu Zhang. "Multiobjective Lipschitz Bandits under Lexicographic Ordering". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 15 (24 March 2024): 16238–46. http://dx.doi.org/10.1609/aaai.v38i15.29558.

Abstract:
This paper studies the multiobjective bandit problem under lexicographic ordering, wherein the learner aims to simultaneously maximize m objectives hierarchically. The only existing algorithm for this problem considers the multi-armed bandit model, and its regret bound is O((KT)^(2/3)) under a metric called priority-based regret. However, this bound is suboptimal, as the lower bound for single objective multi-armed bandits is Omega(K log T). Moreover, this bound becomes vacuous when the arm number K is infinite. To address these limitations, we investigate the multiobjective Lipschitz bandit model, which allows for an infinite arm set. Utilizing a newly designed multi-stage decision-making strategy, we develop an improved algorithm that achieves a general regret bound of O(T^((d_z^i+1)/(d_z^i+2))) for the i-th objective, where d_z^i is the zooming dimension for the i-th objective, with i in {1,2,...,m}. This bound matches the lower bound of the single objective Lipschitz bandit problem in terms of T, indicating that our algorithm is almost optimal. Numerical experiments confirm the effectiveness of our algorithm.
13

Liu, Zizhuo. "Investigation of progress and application related to Multi-Armed Bandit algorithms". Applied and Computational Engineering 37, no. 1 (22 January 2024): 155–59. http://dx.doi.org/10.54254/2755-2721/37/20230496.

Abstract:
This paper discusses four Multi-armed Bandit algorithms: Explore-then-Commit (ETC), Epsilon-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling. The ETC algorithm aims to spend the majority of rounds on the best arm, but it can lead to a suboptimal outcome if the environment changes rapidly. The Epsilon-Greedy algorithm is designed to explore and exploit simultaneously, although it keeps trying sub-optimal arms even after it has found the best arm; as a result, it performs well when the environment changes continuously. The UCB algorithm is one of the most widely used Multi-armed Bandit algorithms because it can rapidly narrow down the potentially optimal decisions in a wide range of scenarios; however, it can be affected by specific patterns in the reward distribution or by noise present in the environment. Thompson Sampling is also one of the most common Multi-armed Bandit algorithms due to its simplicity, effectiveness, and adaptability to various reward distributions. It performs well in many scenarios because it explores and exploits simultaneously, but its variance is greater than that of the three algorithms mentioned above. Today, Multi-armed Bandit algorithms are widely used in advertising, health care, and website and app optimization. Finally, Multi-armed Bandit algorithms are rapidly replacing traditional approaches; in the future, a more advanced variant, the contextual Multi-armed Bandit algorithm, is expected to gradually replace the basic one.
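Two of the four selection rules discussed above fit in a few lines of Python. This is an illustrative sketch (hypothetical function names, not the paper's code) of the Epsilon-Greedy and UCB1 rules operating on per-arm pull counts and empirical means; Explore-then-Commit and Thompson Sampling expose the same kind of interface.

```python
import math
import random

def epsilon_greedy(means: list[float], epsilon: float = 0.1) -> int:
    """With probability epsilon explore a random arm, otherwise exploit
    the arm with the best empirical mean."""
    if random.random() < epsilon:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda a: means[a])

def ucb1(counts: list[int], means: list[float], t: int) -> int:
    """Play each arm once, then pick the arm maximising
    empirical mean + sqrt(2 ln t / n_a)."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(means)),
               key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
```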
14

Sharaf, Amr, and Hal Daumé III. "Meta-Learning Effective Exploration Strategies for Contextual Bandits". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (18 May 2021): 9541–48. http://dx.doi.org/10.1609/aaai.v35i11.17149.

Abstract:
In contextual bandits, an algorithm must choose actions given observed contexts, learning from a reward signal that is observed only for the action chosen. This leads to an exploration/exploitation trade-off: the algorithm must balance taking actions it already believes are good with taking new actions to potentially discover better choices. We develop a meta-learning algorithm, Mêlée, that learns an exploration policy based on simulated, synthetic contextual bandit tasks. Mêlée uses imitation learning against these simulations to train an exploration policy that can be applied to true contextual bandit tasks at test time. We evaluate Mêlée on both a natural contextual bandit problem derived from a learning to rank dataset as well as hundreds of simulated contextual bandit problems derived from classification tasks. Mêlée outperforms seven strong baselines on most of these datasets by leveraging a rich feature representation for learning an exploration strategy.
15

Dimakopoulou, Maria, Zhengyuan Zhou, Susan Athey, and Guido Imbens. "Balanced Linear Contextual Bandits". Proceedings of the AAAI Conference on Artificial Intelligence 33 (17 July 2019): 3445–53. http://dx.doi.org/10.1609/aaai.v33i01.33013445.

Abstract:
Contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models, which can lead to difficult estimation problems along the path of learning. We develop algorithms for contextual bandits with linear payoffs that integrate balancing methods from the causal inference literature in their estimation to make it less prone to problems of estimation bias. We provide the first regret bound analyses for linear contextual bandits with balancing and show that our algorithms match the state of the art theoretical guarantees. We demonstrate the strong practical advantage of balanced contextual bandits on a large number of supervised learning datasets and on a synthetic example that simulates model misspecification and prejudice in the initial training data.
16

Tolpin, David, and Solomon Shimony. "MCTS Based on Simple Regret". Proceedings of the International Symposium on Combinatorial Search 3, no. 1 (20 August 2021): 193–99. http://dx.doi.org/10.1609/socs.v3i1.18221.

Abstract:
UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS), is based on UCB, a policy for the Multi-armed Bandit problem (MAB) that minimizes the cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final "arm pull" that collects a reward, rather than all "arm pulls". Therefore, it makes more sense to minimize the simple, rather than cumulative, regret. We introduce policies for multi-armed bandits with lower simple regret than UCB and develop a two-stage scheme (SR+CR) for MCTS which outperforms UCT empirically. We also propose a sampling scheme based on value of information (VOI), achieving an algorithm that empirically outperforms other proposed algorithms.
17

Wang, Liangxu. "Investigation of frontier Multi-Armed Bandit algorithms and applications". Applied and Computational Engineering 34, no. 1 (22 January 2024): 179–84. http://dx.doi.org/10.54254/2755-2721/34/20230322.

Abstract:
Since the amount of content online is growing exponentially and people's time is limited, there is an urgent need for a high-performance algorithm that can make effective recommendations. This paper will introduce a recommendation system model, a sequential decision model, which is called Multi-Armed Bandit. The main idea of the Multi-Armed Bandit model is that at the beginning of the algorithm, all the recommended items are set to the same weight. In the subsequent recommendation process, the model explores the distribution of each item while changing the weight of each item according to the average revenue of each item, and selects more items with larger weight. This paper will introduce three cutting-edge Multi-Armed Bandit algorithms, their algorithmic ideas and their respective characteristics. The idea of Explore-Then-Commit (ETC) algorithm is to explore each item a certain number of times, and then select the best item for subsequent recommendation. The idea of the Upper Confidence Bound (UCB) algorithm is to represent the "exploration" and "exploitation" of each item by numerical values and add them to the UCB value, and select the item with the largest UCB value each time. The idea of TS is to first assume the distribution of each item, and then change the parameters of the distribution of each item based on the reward. At the end, this paper will introduce several scenarios where Multi-Armed Bandit algorithms can be used to give the reader an idea of how to use Multi-Armed Bandit algorithms.
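The Thompson Sampling idea described in this abstract can be written compactly for binary rewards with Beta priors. The sketch below is illustrative (hypothetical class and attribute names), not the paper's code, and assumes each recommendation produces a 0/1 reward.

```python
import random

class BernoulliThompson:
    """Thompson Sampling with Beta priors: assume a distribution per item and
    update its parameters from each observed reward."""

    def __init__(self, n_arms: int):
        self.alpha = [1.0] * n_arms   # prior successes + 1
        self.beta = [1.0] * n_arms    # prior failures + 1

    def select(self) -> int:
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm: int, reward: int) -> None:
        self.alpha[arm] += reward        # reward of 1 counts as a success
        self.beta[arm] += 1 - reward     # reward of 0 counts as a failure
```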
18

Amani, Sanae, and Christos Thrampoulidis. "Decentralized Multi-Agent Linear Bandits with Safety Constraints". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (18 May 2021): 6627–35. http://dx.doi.org/10.1609/aaai.v35i8.16820.

Abstract:
We study decentralized stochastic linear bandits, where a network of N agents acts cooperatively to efficiently solve a linear bandit-optimization problem over a d-dimensional space. For this problem, we propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network. At each round of the algorithm each agent chooses its actions following an upper confidence bound (UCB) strategy and agents share information with their immediate neighbors through a carefully designed consensus procedure that repeats over cycles. Our analysis adjusts the duration of these communication cycles, ensuring near-optimal regret performance O(d log(NT) √(NT)) at a communication rate of O(d N²) per round. The structure of the network affects the regret performance via a small additive term – coined the regret of delay – that depends on the spectral gap of the underlying graph. Notably, our results apply to arbitrary network topologies without a requirement for a dedicated agent acting as a server. In consideration of situations with high communication cost, we propose RC-DLUCB: a modification of DLUCB with rare communication among agents. The new algorithm trades off regret performance for a significantly reduced total communication cost of O(d³ N^(5/2)) over all T rounds. Finally, we show that our ideas extend naturally to the emerging, albeit more challenging, setting of safe bandits. For the recently studied problem of linear bandits with unknown linear safety constraints, we propose the first safe decentralized algorithm. Our study contributes towards applying bandit techniques in safety-critical distributed systems that repeatedly deal with unknown stochastic environments. We present numerical simulations for various network topologies that corroborate our theoretical findings.
19

Wang, Zhiyong, Xutong Liu, Shuai Li, and John C. S. Lui. "Efficient Explorative Key-Term Selection Strategies for Conversational Contextual Bandits". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (26 June 2023): 10288–95. http://dx.doi.org/10.1609/aaai.v37i8.26225.

Abstract:
Conversational contextual bandits elicit user preferences by occasionally querying for explicit feedback on key-terms to accelerate learning. However, there are aspects of existing approaches which limit their performance. First, information gained from key-term-level conversations and arm-level recommendations is not appropriately incorporated to speed up learning. Second, it is important to ask explorative key-terms to quickly elicit the user's potential interests in various domains to accelerate the convergence of user preference estimation, which has never been considered in existing works. To tackle these issues, we first propose ``ConLinUCB", a general framework for conversational bandits with better information incorporation, combining arm-level and key-term-level feedback to estimate user preference in one step at each time. Based on this framework, we further design two bandit algorithms with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR. We prove tighter regret upper bounds of our proposed algorithms. Particularly, ConLinUCB-BS achieves a better regret bound than the previous result. Extensive experiments on synthetic and real-world data show significant advantages of our algorithms in learning accuracy (up to 54% improvement) and computational efficiency (up to 72% improvement), compared to the classic ConUCB algorithm, showing the potential benefit to recommender systems.
20

Wang, Zhenlin, and Jonathan Scarlett. "Max-Min Grouped Bandits". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (28 June 2022): 8603–11. http://dx.doi.org/10.1609/aaai.v36i8.20838.

Abstract:
In this paper, we introduce a multi-armed bandit problem termed max-min grouped bandits, in which the arms are arranged in possibly-overlapping groups, and the goal is to find a group whose worst arm has the highest mean reward. This problem is of interest in applications such as recommendation systems, and is also closely related to widely-studied robust optimization problems. We present two algorithms based on successive elimination and robust optimization, and derive upper bounds on the number of samples to guarantee finding a max-min optimal or near-optimal group, as well as an algorithm-independent lower bound. We discuss the degree of tightness of our bounds in various cases of interest, and the difficulties in deriving uniformly tight bounds.
21

Li, Litao. "Exploring Multi-Armed Bandit algorithms: Performance analysis in dynamic environments". Applied and Computational Engineering 34, no. 1 (22 January 2024): 252–59. http://dx.doi.org/10.54254/2755-2721/34/20230338.

Abstract:
The Multi-armed Bandit algorithm, a proficient solver of the exploration-and-exploitation trade-off predicament, furnishes businesses with a robust tool for resource allocation that predominantly aligns with customer preferences. However, varying Multi-armed Bandit algorithm types exhibit dissimilar performance characteristics based on contextual variations. Hence, a series of experiments is imperative, involving alterations to input values across distinct algorithms. Within this study, three specific algorithms were applied, Explore-then-commit (ETC), Upper Confidence Bound (UCB) and its asymptotically optimal variant, and Thompson Sampling (TS), to the extensively utilized MovieLens dataset. This application aimed to gauge their effectiveness comprehensively. The algorithms were translated into executable code, and their performance was visually depicted through multiple figures. Through cumulative regret tracking within defined conditions, algorithmic performance was scrutinized, laying the groundwork for subsequent parameter-based comparisons. A dedicated experimentation framework was devised to evaluate the robustness of each algorithm, involving deliberate parameter adjustments and tailored experiments to elucidate distinct performance nuances. The ensuing graphical depictions distinctly illustrated Thompson Sampling's persistent minimal regrets across most scenarios. UCB algorithms displayed steadfast stability. ETC exhibited excellent performance with a low number of runs, but its regret escalated significantly as the number of runs grew; it also warrants constraints on the exploratory phase to mitigate regret. This investigation underscores the efficacy of Multi-armed Bandit algorithms while elucidating their nuanced behaviors within diverse contextual contingencies.
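The cumulative-regret tracking used in comparisons like the one above can be reproduced with a short simulation loop. The sketch below is illustrative rather than the paper's experimental code: it assumes Bernoulli arms and any policy object exposing select() and update(), such as the Thompson Sampling sketch shown earlier.

```python
import random

def run_bandit(policy, true_means: list[float], horizon: int) -> list[float]:
    """Simulate one run and record cumulative pseudo-regret per round,
    i.e. the gap between the best arm's mean and the played arm's mean."""
    best = max(true_means)
    curve, cumulative = [], 0.0
    for _ in range(horizon):
        arm = policy.select()
        reward = 1 if random.random() < true_means[arm] else 0
        policy.update(arm, reward)
        cumulative += best - true_means[arm]
        curve.append(cumulative)
    return curve
```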
22

Yang, Luting, Jianyi Yang, and Shaolei Ren. "Contextual Bandits with Delayed Feedback and Semi-supervised Learning (Student Abstract)". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 18 (18 May 2021): 15943–44. http://dx.doi.org/10.1609/aaai.v35i18.17968.

Abstract:
Contextual multi-armed bandit (MAB) is a classic online learning problem, where a learner/agent selects actions (i.e., arms) given contextual information and discovers optimal actions based on reward feedback. Applications of contextual bandit have been increasingly expanding, including advertisement, personalization, resource allocation in wireless networks, among others. Nonetheless, the reward feedback is delayed in many applications (e.g., a user may only provide service ratings after a period of time), creating challenges for contextual bandits. In this paper, we address delayed feedback in contextual bandits by using semi-supervised learning — incorporate estimates of delayed rewards to improve the estimation of future rewards. Concretely, the reward feedback for an arm selected at the beginning of a round is only observed by the agent/learner with some observation noise and provided to the agent after some a priori unknown but bounded delays. Motivated by semi-supervised learning that produces pseudo labels for unlabeled data to further improve the model performance, we generate fictitious estimates of rewards that are delayed and have yet to arrive based on already-learnt reward functions. Thus, by combining semi-supervised learning with online contextual bandit learning, we propose a novel extension and design two algorithms, which estimate the values for currently unavailable reward feedbacks to minimize the maximum estimation error and average estimation error, respectively.
23

Yu, Baosheng, Meng Fang, and Dacheng Tao. "Per-Round Knapsack-Constrained Linear Submodular Bandits". Neural Computation 28, no. 12 (December 2016): 2757–89. http://dx.doi.org/10.1162/neco_a_00887.

Abstract:
Linear submodular bandits has been proven to be effective in solving the diversification and feature-based exploration problem in information retrieval systems. Considering there is inevitably a budget constraint in many web-based applications, such as news article recommendations and online advertising, we study the problem of diversification under a budget constraint in a bandit setting. We first introduce a budget constraint to each exploration step of linear submodular bandits as a new problem, which we call per-round knapsack-constrained linear submodular bandits. We then define an [Formula: see text]-approximation unit-cost regret considering that the submodular function maximization is NP-hard. To solve this new problem, we propose two greedy algorithms based on a modified UCB rule. We prove these two algorithms with different regret bounds and computational complexities. Inspired by the lazy evaluation process in submodular function maximization, we also prove that a modified lazy evaluation process can be used to accelerate our algorithms without losing their theoretical guarantee. We conduct a number of experiments, and the experimental results confirm our theoretical analyses.
24

Roy Chaudhuri, Arghya, and Shivaram Kalyanakrishnan. "Regret Minimisation in Multi-Armed Bandits Using Bounded Arm Memory". Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 06 (3 April 2020): 10085–92. http://dx.doi.org/10.1609/aaai.v34i06.6566.

Abstract:
Regret minimisation in stochastic multi-armed bandits is a well-studied problem, for which several optimal algorithms have been proposed. Such algorithms depend on (sufficient statistics of) the empirical reward distributions of the arms to decide which arm to pull next. In this paper, we consider the design of algorithms that are constrained to store statistics from only a bounded number of arms. For bandits with a finite set of arms, we derive a sub-linear upper bound on the regret that decreases with the “arm memory” size M. For instances with a large, possibly infinite, set of arms, we show a sub-linear bound on the quantile regret. Our problem formulation generalises that of Liau et al. (2018), who fix M = O(1), and so do not obtain bounds that depend on M. More importantly, our algorithms keep exploration and exploitation tightly coupled, without a dedicated exploration phase as employed by Liau et al. (2018). Although this choice makes our analysis harder, it leads to much-improved practical performance. For bandits with a large number of arms and no known structure on the rewards, our algorithms serve as a viable option. Unlike many other approaches to restrict the memory of bandit algorithms, our algorithms do not need any additional technical assumptions.
25

Kaibel, Chris, and Torsten Biemann. "Rethinking the Gold Standard With Multi-armed Bandits: Machine Learning Allocation Algorithms for Experiments". Organizational Research Methods 24, no. 1 (11 June 2019): 78–103. http://dx.doi.org/10.1177/1094428119854153.

Abstract:
In experiments, researchers commonly allocate subjects randomly and equally to the different treatment conditions before the experiment starts. While this approach is intuitive, it means that new information gathered during the experiment is not utilized until after the experiment has ended. Based on methodological approaches from other scientific disciplines such as computer science and medicine, we suggest machine learning algorithms for subject allocation in experiments. Specifically, we discuss a Bayesian multi-armed bandit algorithm for randomized controlled trials and use Monte Carlo simulations to compare its efficiency with randomized controlled trials that have a fixed and balanced subject allocation. Our findings indicate that a randomized allocation based on Bayesian multi-armed bandits is more efficient and ethical in most settings. We develop recommendations for researchers and discuss the limitations of our approach.
26

Xi, Guangyu, Chao Tao, and Yuan Zhou. "Near-Optimal MNL Bandits Under Risk Criteria". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (18 May 2021): 10397–404. http://dx.doi.org/10.1609/aaai.v35i12.17245.

Abstract:
We study MNL bandits, which is a variant of the traditional multi-armed bandit problem, under risk criteria. Unlike the ordinary expected revenue, risk criteria are more general goals widely used in industries and business. We design algorithms for a broad class of risk criteria, including but not limited to the well-known conditional value-at-risk, Sharpe ratio, and entropy risk, and prove that they suffer a near-optimal regret. As a complement, we also conduct experiments with both synthetic and real data to show the empirical performance of our proposed algorithms.
27

Niño-Mora, José. "Restless bandits, partial conservation laws and indexability". Advances in Applied Probability 33, no. 1 (March 2001): 76–98. http://dx.doi.org/10.1017/s0001867800010648.

Abstract:
We show that if performance measures in a general stochastic scheduling problem satisfy partial conservation laws (PCL), which extend the generalized conservation laws (GCL) introduced by Bertsimas and Niño-Mora (1996), then the problem is solved optimally by a priority-index policy under a range of admissible linear performance objectives, with both this range and the optimal indices being determined by a one-pass adaptive-greedy algorithm that extends Klimov's: we call such scheduling problems PCL-indexable. We further apply the PCL framework to investigate the indexability property of restless bandits (two-action finite-state Markov decision chains) introduced by Whittle, obtaining the following results: (i) we present conditions on model parameters under which a single restless bandit is PCL-indexable, and hence indexable; membership of the class of PCL-indexable bandits is tested through a single run of the adaptive-greedy algorithm, which further computes the Whittle indices when the test is positive; this provides a tractable sufficient condition for indexability; (ii) we further introduce the subclass of GCL-indexable bandits (including classical bandits), which are indexable under arbitrary linear rewards. Our analysis is based on the achievable region approach to stochastic optimization, as the results follow from deriving and exploiting a new linear programming reformulation for single restless bandits.
28

Cheung, Wang Chi, David Simchi-Levi, and Ruihao Zhu. "Hedging the Drift: Learning to Optimize Under Nonstationarity". Management Science 68, no. 3 (March 2022): 1696–713. http://dx.doi.org/10.1287/mnsc.2021.4024.

Abstract:
We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for a collection of nonstationary stochastic bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and traffic network routing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) nonstationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Beginning with the linear bandit setting, we design and analyze a sliding window-upper confidence bound algorithm that achieves the optimal dynamic regret bound when the underlying variation budget is known. This budget quantifies the total amount of temporal variation of the latent environments. Boosted by the novel bandit-over-bandit framework that adapts to the latent changes, our algorithm can further enjoy nearly optimal dynamic regret bounds in a (surprisingly) parameter-free manner. We extend our results to other related bandit problems, namely the multiarmed bandit, generalized linear bandit, and combinatorial semibandit settings, which model a variety of operations research applications. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the “forgetting principle” in the learning processes, which is vital in changing environments. Extensive numerical experiments with synthetic datasets and a dataset of an online auto-loan company during the severe acute respiratory syndrome (SARS) epidemic period demonstrate that our proposed algorithms achieve superior performance compared with existing algorithms. This paper was accepted by George J. Shanthikumar, Management Science Special Section on Data-Driven Prescriptive Analytics.
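A minimal sketch of the sliding-window UCB idea described above (arm statistics computed only over a recent window of plays, so old observations are forgotten) is given below. The window length, bonus constant, and class names are illustrative assumptions, not the authors' calibrated algorithm.

```python
import math
from collections import deque

class SlidingWindowUCB:
    """UCB indices computed over the most recent `window` plays only."""

    def __init__(self, n_arms: int, window: int = 500, c: float = 2.0):
        self.history = deque()              # (arm, reward) pairs, newest last
        self.window = window
        self.n_arms = n_arms
        self.c = c

    def select(self, t: int) -> int:
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward

        def index(a: int) -> float:
            if counts[a] == 0:
                return float("inf")         # force untried arms to be played
            bonus = math.sqrt(self.c * math.log(min(t, self.window)) / counts[a])
            return sums[a] / counts[a] + bonus

        return max(range(self.n_arms), key=index)

    def update(self, arm: int, reward: float) -> None:
        self.history.append((arm, reward))
        if len(self.history) > self.window:
            self.history.popleft()          # forget the oldest observation
```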
29

Varatharajah, Yogatheesan, and Brent Berry. "A Contextual-Bandit-Based Approach for Informed Decision-Making in Clinical Trials". Life 12, no. 8 (21 August 2022): 1277. http://dx.doi.org/10.3390/life12081277.

Abstract:
Clinical trials are conducted to evaluate the efficacy of new treatments. Clinical trials involving multiple treatments utilize the randomization of treatment assignments to enable the evaluation of treatment efficacies in an unbiased manner. Such evaluation is performed in post hoc studies that usually use supervised-learning methods that rely on large amounts of data collected in a randomized fashion. That approach often proves to be suboptimal in that some participants may suffer and even die as a result of having not received the most appropriate treatments during the trial. Reinforcement-learning methods improve the situation by making it possible to learn the treatment efficacies dynamically during the course of the trial, and to adapt treatment assignments accordingly. Recent efforts using multi-arm bandits, a type of reinforcement-learning method, have focused on maximizing clinical outcomes for a population that was assumed to be homogeneous. However, those approaches have failed to account for the variability among participants that is becoming increasingly evident as a result of recent clinical-trial-based studies. We present a contextual-bandit-based online treatment optimization algorithm that, in choosing treatments for new participants in the study, takes into account not only the maximization of the clinical outcomes but also the patient characteristics. We evaluated our algorithm using a real clinical trial dataset from the International Stroke Trial. We simulated the online setting by sequentially going through the data of each participant admitted to the trial. Two bandits (one for each context) were created, with four choices of treatments. For a new participant in the trial, depending on the context, one of the bandits was selected. Then, we took three different approaches to choose a treatment: (a) a random choice (i.e., the strategy currently used in clinical trial settings), (b) a Thompson sampling-based approach, and (c) a UCB-based approach. Success probabilities of each context were calculated separately by considering the participants with the same context. Those estimated outcomes were used to update the prior distributions within the bandit corresponding to the context of each participant. We repeated that process through the end of the trial and recorded the outcomes and the chosen treatments for each approach. We also evaluated a context-free multi-arm-bandit-based approach, using the same dataset, to showcase the benefits of our approach. In the context-free case, we calculated the success probabilities for the Bernoulli sampler using the whole clinical trial dataset in a context-independent manner. The results of our retrospective analysis indicate that the proposed approach performs significantly better than either a random assignment of treatments (the current gold standard) or a multi-arm-bandit-based approach, providing substantial gains in the percentage of participants who are assigned the most suitable treatments. The contextual-bandit and multi-arm bandit approaches provide 72.63% and 64.34% gains, respectively, compared to a random assignment.
30

Chen, Panyangjie. "Investigation of selection and application of Multi-Armed Bandit algorithms in recommendation system". Applied and Computational Engineering 34, no. 1 (22 January 2024): 185–90. http://dx.doi.org/10.54254/2755-2721/34/20230323.

Abstract:
The Multi-Armed Bandit (MAB) algorithm holds significant prominence as a recommendation system technique, effectively presenting user-centric content preferences based on the analysis of collected data. However, the application of the basic MAB algorithm in real-world recommendation systems is not without challenges, including issues related to data volume and data processing accuracy. Therefore, optimization algorithms based on the MAB algorithm are more widely used in recommendation systems. This paper briefly introduces the multi-armed bandit algorithm, its use in recommendation systems, and the problems of the basic MAB algorithm. To address these problems, it introduces MAB-based optimization algorithms used in recommendation systems, and analyzes and summarizes such algorithms. Two different MAB-based optimization algorithms are presented, namely Dynamic clustering based contextual combinatorial multi-armed bandit (DC3MAB) and Binary Upper Confidence Bound (BiUCB). In addition, the paper discusses the application of these algorithms in recommendation systems. Finally, it summarizes the introduced algorithms and outlines future prospects for MAB optimization algorithms.
31

Chen, Xijin, Kim May Lee, Sofia S. Villar, and David S. Robertson. "Some performance considerations when using multi-armed bandit algorithms in the presence of missing data". PLOS ONE 17, no. 9 (12 September 2022): e0274272. http://dx.doi.org/10.1371/journal.pone.0274272.

Abstract:
When comparing the performance of multi-armed bandit algorithms, the potential impact of missing data is often overlooked. In practice, missing data also affects their implementation, where the simplest approach to overcome this is to continue to sample according to the original bandit algorithm, ignoring missing outcomes. We investigate the impact on performance of this approach to dealing with missing data for several bandit algorithms through an extensive simulation study assuming the rewards are missing at random. We focus on two-armed bandit algorithms with binary outcomes in the context of patient allocation for clinical trials with relatively small sample sizes. However, our results apply to other applications of bandit algorithms where missing data is expected to occur. We assess the resulting operating characteristics, including the expected reward. Different probabilities of missingness in both arms are considered. The key finding of our work is that when using the simplest strategy of ignoring missing data, the impact on the expected performance of multi-armed bandit strategies varies according to the way these strategies balance the exploration-exploitation trade-off. Algorithms that are geared towards exploration continue to assign samples to the arm with more missing responses (which, being perceived as the arm with less observed information, is deemed more appealing by the algorithm than it would otherwise be). In contrast, algorithms that are geared towards exploitation rapidly assign a high value to samples from the arms with a currently high mean, irrespective of the level of observations per arm. Furthermore, for algorithms focusing more on exploration, we illustrate that the problem of missing responses can be alleviated using a simple mean imputation approach.
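The simple mean-imputation strategy discussed in this abstract amounts to a one-line change in the arm-update step. The sketch below is illustrative (hypothetical function and variable names), not the authors' simulation code; it assumes running counts and means are kept per arm and that a missing outcome is represented by None.

```python
def update_with_possible_missing(counts: list[int], means: list[float],
                                 arm: int, reward: float | None) -> None:
    """Update one arm's running statistics when the outcome may be missing.
    A missing reward is imputed with the arm's current empirical mean;
    the very first pull of an arm with a missing outcome is simply skipped."""
    if reward is None:
        if counts[arm] == 0:
            return                      # nothing observed yet, nothing to impute
        reward = means[arm]             # mean imputation
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]
```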
32

Huang, Wen, Lu Zhang, and Xintao Wu. "Achieving Counterfactual Fairness for Causal Bandit". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 6 (28 June 2022): 6952–59. http://dx.doi.org/10.1609/aaai.v36i6.20653.

Abstract:
In online recommendation, customers arrive in a sequential and stochastic manner from an underlying distribution and the online decision model recommends a chosen item for each arriving individual based on some strategy. We study how to recommend an item at each step to maximize the expected reward while achieving user-side fairness for customers, i.e., customers who share similar profiles will receive a similar reward regardless of their sensitive attributes and items being recommended. By incorporating causal inference into bandits and adopting soft intervention to model the arm selection strategy, we first propose the d-separation based UCB algorithm (D-UCB) to explore the utilization of the d-separation set in reducing the amount of exploration needed to achieve low cumulative regret. Based on that, we then propose the fair causal bandit (F-UCB) for achieving the counterfactual individual fairness. Both theoretical analysis and empirical evaluation demonstrate effectiveness of our algorithms.
33

Esfandiari, Hossein, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. "Regret Bounds for Batched Bandits". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (18 May 2021): 7340–48. http://dx.doi.org/10.1609/aaai.v35i8.16901.

Abstract:
We present simple algorithms for batched stochastic multi-armed bandit and batched stochastic linear bandit problems. We prove bounds for their expected regrets that improve and extend the best known regret bounds of Gao, Han, Ren, and Zhou (NeurIPS 2019), for any number of batches. In particular, our algorithms in both settings achieve the optimal expected regrets by using only a logarithmic number of batches. We also study the batched adversarial multi-armed bandit problem for the first time and provide the optimal regret, up to logarithmic factors, of any algorithm with predetermined batch sizes.
34

Yan, Xue, Yali Du, Binxin Ru, Jun Wang, Haifeng Zhang, and Xu Chen. "Learning to Identify Top Elo Ratings: A Dueling Bandits Approach". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (28 June 2022): 8797–805. http://dx.doi.org/10.1609/aaai.v36i8.20860.

Abstract:
The Elo rating system is widely adopted to evaluate the skills of (chess) game and sports players. Recently it has been also integrated into machine learning algorithms in evaluating the performance of computerised AI agents. However, an accurate estimation of the Elo rating (for the top players) often requires many rounds of competitions, which can be expensive to carry out. In this paper, to minimize the number of comparisons and to improve the sample efficiency of the Elo evaluation (for top players), we propose an efficient online match scheduling algorithm. Specifically, we identify and match the top players through a dueling bandits framework and tailor the bandit algorithm to the gradient-based update of Elo. We show that it reduces the per-step memory and time complexity to constant, compared to the traditional likelihood maximization approaches requiring O(t) time. Our algorithm has a regret guarantee that is sublinear in the number of competition rounds and has been extended to the multidimensional Elo ratings for handling intransitive games. We empirically demonstrate that our method achieves superior convergence speed and time efficiency on a variety of gaming tasks.
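The gradient-based Elo update that the proposed match scheduler feeds can be written in a few lines. The sketch below uses the standard Elo rule with an illustrative step size k; it is background for the abstract above, not the authors' implementation.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update after a match between players A and B.
    score_a is 1 for an A win, 0 for a loss, 0.5 for a draw."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)   # gradient step on the match outcome
    return rating_a + delta, rating_b - delta
```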
35

Ontañón, Santiago. "Combinatorial Multi-armed Bandits for Real-Time Strategy Games". Journal of Artificial Intelligence Research 58 (29 March 2017): 665–702. http://dx.doi.org/10.1613/jair.5398.

Abstract:
Games with large branching factors pose a significant challenge for game tree search algorithms. In this paper, we address this problem with a sampling strategy for Monte Carlo Tree Search (MCTS) algorithms called "naive sampling", based on a variant of the Multi-armed Bandit problem called "Combinatorial Multi-armed Bandits" (CMAB). We analyze the theoretical properties of several variants of naive sampling, and empirically compare it against the other existing strategies in the literature for CMABs. We then evaluate these strategies in the context of real-time strategy (RTS) games, a genre of computer games characterized by their very large branching factors. Our results show that as the branching factor grows, naive sampling outperforms the other sampling strategies.
36

Tang, Qiao, Hong Xie, Yunni Xia, Jia Lee, and Qingsheng Zhu. "Robust Contextual Bandits via Bootstrapping". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (18 May 2021): 12182–89. http://dx.doi.org/10.1609/aaai.v35i13.17446.

Abstract:
Upper confidence bound (UCB) based contextual bandit algorithms require one to know the tail property of the reward distribution. Unfortunately, such a tail property is usually unknown or difficult to specify in real-world applications. Using a tail property heavier than the ground truth leads to a slow learning speed of the contextual bandit algorithm, while using a lighter one may cause the algorithm to diverge. To address this fundamental problem, we develop an estimator (evaluated from historical rewards) for the contextual bandit UCB based on the multiplier bootstrapping technique. We first establish sufficient conditions under which our estimator converges asymptotically to the ground truth of the contextual bandit UCB. We further derive a second-order correction for our estimator so as to obtain its confidence level with a finite number of rounds. To demonstrate the versatility of the estimator, we apply it to design a BootLinUCB algorithm for the contextual bandit. We prove that BootLinUCB has a sub-linear regret upper bound and also conduct extensive experiments to validate its superior performance.
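To illustrate the bootstrapping idea on a single arm (a generic multiplier-bootstrap sketch, not the paper's estimator or its second-order correction; the Gaussian multipliers, the number of resamples, and the quantile level are assumptions):

```python
import numpy as np

def bootstrap_ucb(rewards, n_boot=200, level=0.95, rng=None):
    """Sketch: estimate an upper confidence bound for one arm from its observed
    rewards via the multiplier bootstrap, instead of plugging in a known tail
    bound (e.g., sub-Gaussian) as standard UCB does.
    """
    rng = np.random.default_rng() if rng is None else rng
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    mean = rewards.mean()
    centered = rewards - mean
    # Perturb the centered sample with i.i.d. Gaussian multipliers and record
    # the resulting fluctuation of the empirical mean.
    weights = rng.standard_normal((n_boot, n))
    fluctuations = (weights * centered).mean(axis=1)
    bonus = np.quantile(fluctuations, level)
    return mean + max(bonus, 0.0)

# Example: heavy-ish tailed rewards for a single arm.
rng = np.random.default_rng(0)
ucb = bootstrap_ucb(rng.standard_t(df=3, size=50), rng=rng)
```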
37

Tornede, Alexander, Viktor Bengs and Eyke Hüllermeier. "Machine Learning for Online Algorithm Selection under Censored Feedback". Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 9 (June 28, 2022): 10370–80. http://dx.doi.org/10.1609/aaai.v36i9.21279.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms. For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime. As the latter is known to exhibit a heavy-tailed distribution, an algorithm is normally stopped when it exceeds a predefined upper time limit. As a consequence, machine learning methods used to optimize an algorithm selection strategy in a data-driven manner need to deal with right-censored samples, a problem that has received little attention in the literature so far. In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem. Moreover, we adapt them towards runtime-oriented losses, allowing for partially censored data while keeping space and time complexity independent of the time horizon. In an extensive experimental evaluation on an adapted version of the ASlib benchmark, we demonstrate that theoretically well-founded methods based on Thompson sampling perform particularly strongly and improve on existing methods.
38

Fourati, Fares, Christopher John Quinn, Mohamed-Slim Alouini and Vaneet Aggarwal. "Combinatorial Stochastic-Greedy Bandit". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 11 (March 24, 2024): 12052–60. http://dx.doi.org/10.1609/aaai.v38i11.29093.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
We propose a novel combinatorial stochastic-greedy bandit (SGB) algorithm for combinatorial multi-armed bandit problems in which nothing is observed beyond the joint reward of the selected set of arms at each time step t in [T]. SGB adopts an optimized stochastic-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms. Unlike existing methods that explore the entire set of unselected base arms during each selection step, our SGB algorithm samples only an optimized proportion of unselected arms and selects actions from this subset. We prove that our algorithm achieves a (1-1/e)-regret bound of O(n^(1/3) k^(2/3) T^(2/3) log(T)^(2/3)) for monotone stochastic submodular rewards, which outperforms the state of the art in terms of the cardinality constraint k. Furthermore, we empirically evaluate the performance of our algorithm in the context of online constrained social influence maximization. Our results demonstrate that our proposed approach consistently outperforms the other algorithms, with the performance gap increasing as k grows.
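The "sample only an optimized proportion of unselected arms" idea follows the stochastic-greedy template for submodular maximization; here is a sketch of that template in the full-information case (the marginal-gain oracle `gain`, the sample-size formula, and `eps` are assumptions, and the bandit layer that estimates gains from noisy joint rewards is omitted):

```python
import math
import random

def stochastic_greedy(gain, ground_set, k, eps=0.1, rng=random):
    """Sketch of stochastic-greedy selection of k items out of n.

    gain(S, e) returns the marginal gain of adding element e to the current
    set S. Instead of scanning all unselected items at every step, we score
    only a random subset of size (n/k) * log(1/eps), which is the source of
    the speed-up the SGB abstract refers to.
    """
    n = len(ground_set)
    sample_size = max(1, int(math.ceil((n / k) * math.log(1.0 / eps))))
    selected = []
    remaining = set(ground_set)
    for _ in range(k):
        candidates = rng.sample(sorted(remaining), min(sample_size, len(remaining)))
        best = max(candidates, key=lambda e: gain(selected, e))
        selected.append(best)
        remaining.remove(best)
    return selected
```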
39

Barman, Siddharth, Arindam Khan, Arnab Maiti and Ayush Sawarni. "Fairness and Welfare Quantification for Regret in Multi-Armed Bandits". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 6 (June 26, 2023): 6762–69. http://dx.doi.org/10.1609/aaai.v37i6.25829.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
We extend the notion of regret with a welfarist perspective. Focussing on the classic multi-armed bandit (MAB) framework, the current work quantifies the performance of bandit algorithms by applying a fundamental welfare function, namely the Nash social welfare (NSW) function. This corresponds to equating the algorithm's performance to the geometric mean of its expected rewards, and leads us to the study of Nash regret, defined as the difference between the (a priori unknown) optimal mean among the arms and the algorithm's performance. Since NSW is known to satisfy fairness axioms, our approach complements the utilitarian considerations of average (cumulative) regret, wherein the algorithm is evaluated via the arithmetic mean of its expected rewards. This work develops an algorithm that, given the horizon of play T, achieves a Nash regret of O(sqrt((k log T)/T)), where k denotes the number of arms in the MAB instance. Since, for any algorithm, the Nash regret is at least as large as its average regret (by the AM-GM inequality), the known lower bound on average regret holds for Nash regret as well. Therefore, our Nash regret guarantee is essentially tight. In addition, we develop an anytime algorithm with a Nash regret guarantee of O(sqrt((k log T)/T) log T).
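In symbols (our notation, not taken verbatim from the paper: mu* is the unknown optimal mean and I_t the arm pulled at round t over a horizon T), the two regret notions compared in the abstract are:

```latex
\[
\underbrace{\mu^{\ast} - \Bigl(\textstyle\prod_{t=1}^{T} \mathbb{E}\bigl[\mu_{I_t}\bigr]\Bigr)^{1/T}}_{\text{Nash regret}}
\;\ge\;
\underbrace{\mu^{\ast} - \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\bigl[\mu_{I_t}\bigr]}_{\text{average regret}}
\]
% The inequality is the AM--GM inequality applied to the expected rewards,
% which is why lower bounds for average regret transfer to Nash regret.
```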
40

Li, Wenjie, Qifan Song, Jean Honorio and Guang Lin. "Federated X-armed Bandit". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (March 24, 2024): 13628–36. http://dx.doi.org/10.1609/aaai.v38i12.29267.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
This work establishes the first framework for the federated X-armed bandit, where different clients face heterogeneous local objective functions defined on the same domain and must collaboratively identify the global optimum. We propose the first federated algorithm for such problems, named Fed-PNE. By exploiting the topological structure of the global objective inside the hierarchical partitioning and the weak smoothness property, our algorithm achieves sublinear cumulative regret with respect to both the number of clients and the evaluation budget. Meanwhile, it requires only logarithmic communication between the central server and the clients, protecting client privacy. Experimental results on synthetic functions and real datasets validate the advantages of Fed-PNE over various centralized and federated baseline algorithms.
41

Tolpin, David, and Solomon Shimony. "MCTS Based on Simple Regret". Proceedings of the AAAI Conference on Artificial Intelligence 26, no. 1 (September 20, 2021): 570–76. http://dx.doi.org/10.1609/aaai.v26i1.8126.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
UCT, a state-of-the-art algorithm for Monte Carlo tree search (MCTS) in games and Markov decision processes, is based on UCB, a sampling policy for the multi-armed bandit problem (MAB) that minimizes cumulative regret. However, search differs from MAB in that in MCTS it is usually only the final "arm pull" (the actual move selection) that collects a reward, rather than all "arm pulls". Therefore, it makes more sense to minimize simple regret rather than cumulative regret. We begin by introducing policies for multi-armed bandits with lower finite-time and asymptotic simple regret than UCB, and use them to develop a two-stage scheme (SR+CR) for MCTS that outperforms UCT empirically. Optimizing the sampling process is itself a metareasoning problem, a solution of which can use value of information (VOI) techniques. Although a theory of VOI for search exists, applying it to MCTS is non-trivial, as typical myopic assumptions fail. Lacking a complete working VOI theory for MCTS, we nevertheless propose a sampling scheme that is "aware" of VOI, yielding an algorithm that outperforms both UCT and the other proposed algorithms in empirical evaluation.
42

Ene, Alina, Huy L. Nguyen and Adrian Vladu. "Projection-Free Bandit Optimization with Privacy Guarantees". Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 7322–30. http://dx.doi.org/10.1609/aaai.v35i8.16899.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
We design differentially private algorithms for the bandit convex optimization problem in the projection-free setting. This setting is important whenever the decision set has a complex geometry and can be accessed efficiently only through a linear optimization oracle, so that Euclidean projections are unavailable (e.g., the matroid polytope or the submodular base polytope). This is the first differentially private algorithm for projection-free bandit optimization, and in fact our bound matches the best known non-private projection-free algorithm and the best known private algorithm, even in the weaker setting where projections are available.
43

Wu, Chenyue. "Comparative analysis of the KL-UCB and UCB algorithms: Delving into complexity and performance". Applied and Computational Engineering 53, no. 1 (March 28, 2024): 39–47. http://dx.doi.org/10.54254/2755-2721/53/20241221.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
This paper embarks on a meticulous comparative exploration of two venerable algorithms often invoked in multi-armed bandit problems: the Kullback-Leibler Upper Confidence Bound (KL-UCB) and the generic Upper Confidence Bound (UCB) algorithms. Initially, a comprehensive discourse is presented, elucidating the definition, evolution, and real-world applications of both algorithms. The crux of the study then shifts to a side-by-side comparison, weighing regret performance and time complexity when applied to a quintessential movie-rating dataset. In practical implementations, addressing multi-armed bandit problems invariably demands extensive training; consequently, even seemingly minor variations in algorithmic complexity can produce pronounced differences in computation time and resource utilization. This inherent intricacy prompts introspection: is the potency of a given algorithm in addressing diverse practical quandaries commensurate with its inherent complexity? By juxtaposing the KL-UCB and UCB algorithms, this study not only highlights their relative merits and demerits but also furnishes insights that could serve as catalysts for further refinement and optimization. The overarching aim is to cultivate an informed perspective, guiding practitioners in choosing or fine-tuning algorithms tailored to specific applications without incurring undue computational overhead.
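To ground the comparison, the two indices can be computed as follows for Bernoulli rewards (textbook formulas, not the paper's experimental code; the bisection tolerance and the simplified log(t) exploration budget are assumptions):

```python
import math

def ucb_index(mean, n_pulls, t):
    """Classic UCB1 index: empirical mean plus a sub-Gaussian exploration bonus."""
    return mean + math.sqrt(2.0 * math.log(t) / n_pulls)

def kl_bernoulli(p, q, eps=1e-12):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n_pulls, t, tol=1e-6):
    """KL-UCB index: the largest q >= mean with n_pulls * KL(mean, q) <= log(t).

    Found by bisection, which is the extra per-arm cost relative to UCB1.
    """
    budget = math.log(t) / n_pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# Example: an arm with empirical mean 0.6 after 20 pulls at round 500.
print(ucb_index(0.6, 20, 500), kl_ucb_index(0.6, 20, 500))
```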
44

Herlihy, Christine, and John P. Dickerson. "Networked Restless Bandits with Positive Externalities". Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 10 (June 26, 2023): 11997–2004. http://dx.doi.org/10.1609/aaai.v37i10.26415.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
Restless multi-armed bandits are often used to model budget-constrained resource allocation tasks where receipt of the resource is associated with an increased probability of a favorable state transition. Prior work assumes that individual arms only benefit if they receive the resource directly. However, many allocation tasks occur within communities and can be characterized by positive externalities that allow arms to derive partial benefit when their neighbor(s) receive the resource. We thus introduce networked restless bandits, a novel multi-armed bandit setting in which arms are both restless and embedded within a directed graph. We then present Greta, a graph-aware, Whittle index-based heuristic algorithm that can be used to efficiently construct a constrained reward-maximizing action vector at each timestep. Our empirical results demonstrate that Greta outperforms comparison policies across a range of hyperparameter values and graph topologies. Code and appendices are available at https://github.com/crherlihy/networked_restless_bandits.
45

Nobari, Sadegh. "DBA: Dynamic Multi-Armed Bandit Algorithm". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9869–70. http://dx.doi.org/10.1609/aaai.v33i01.33019869.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
We introduce the Dynamic Bandit Algorithm (DBA), a practical solution to a shortcoming of the pervasively employed reinforcement learning algorithm known as the multi-armed bandit, or simply Bandit. Bandit makes real-time decisions based on prior observations. However, it is so heavily biased towards those priors that it cannot quickly adapt to a changing trend, and therefore fails to make profitable decisions quickly enough when the trend shifts. Unlike Bandit, DBA focuses on adapting quickly enough to detect such trends early. Furthermore, DBA remains almost as light as Bandit in terms of computation, so it can be deployed in production as a lightweight process, much like Bandit. We demonstrate how critical and beneficial DBA's main focus, the ability to quickly find the most profitable option in real time, is compared with its state-of-the-art competitors. Our experiments are augmented with a visualization mechanism that explains, through animations, the profitability of the decisions made by each algorithm at each step. Finally, we observe that DBA can outperform the original Bandit by close to 3 times on a given Key Performance Indicator (KPI) in a three-arm setting.
46

Mintz, Yonatan, Anil Aswani, Philip Kaminsky, Elena Flowers and Yoshimi Fukuoka. "Nonstationary Bandits with Habituation and Recovery Dynamics". Operations Research 68, no. 5 (September 2020): 1493–516. http://dx.doi.org/10.1287/opre.2019.1918.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
In many sequential decision-making settings where there is uncertainty about the reward of each action, frequent selection of specific actions may reduce expected reward, while choosing less frequently selected actions could lead to an increase. These effects are commonly observed in settings ranging from personalized healthcare interventions to targeted online advertising. To address this problem, the authors propose a new class of models called ROGUE (reducing or gaining unknown efficacy) multiarmed bandits. The authors present a maximum likelihood approach to estimate the parameters of these models and show that these estimates can be used to construct upper confidence bound algorithms and epsilon-greedy algorithms for optimizing these models with strong theoretical guarantees. The authors conclude with a simulation study showing that these algorithms perform better than current nonstationary bandit algorithms in terms of both cumulative regret and average reward.
47

Han, Qi, Li Zhu and Fei Guo. "Forced Exploration in Bandit Problems". Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 11 (March 24, 2024): 12270–77. http://dx.doi.org/10.1609/aaai.v38i11.29117.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
The multi-armed bandit (MAB) is a classical sequential decision problem. Most work requires assumptions about the reward distribution (e.g., boundedness), while practitioners may have difficulty obtaining information about these distributions to design models for their problems, especially in non-stationary MAB problems. This paper aims to design a multi-armed bandit algorithm that can be implemented without using information about the reward distribution while still achieving substantial regret upper bounds. To this end, we propose a novel algorithm alternating between a greedy rule and forced exploration. Our method can be applied to Gaussian, Bernoulli, and other sub-Gaussian distributions, and its implementation does not require additional information. We employ a unified analysis method for different forced-exploration strategies and provide problem-dependent regret upper bounds for stationary and piecewise-stationary settings. Furthermore, we compare our algorithm with popular bandit algorithms on different reward distributions.
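A minimal reading of "alternating between a greedy rule and forced exploration" (an illustrative sketch, not the paper's exploration schedule; the fixed block length is an assumption, whereas the paper analyzes several forced-exploration strategies):

```python
import random

def greedy_with_forced_exploration(pull, n_arms, horizon, block=50):
    """Sketch: play greedily, but force one pull of every arm at the start of
    each block so no arm is starved, without assuming anything about the
    reward distribution (bounded, sub-Gaussian, ...).
    """
    block = max(block, n_arms)          # each block must fit one forced pull per arm
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(horizon):
        pos = t % block
        if pos < n_arms:
            arm = pos                   # forced round-robin exploration
        else:
            arm = max(range(n_arms), key=lambda a: sums[a] / counts[a])
        r = pull(arm)
        sums[arm] += r
        counts[arm] += 1
    return max(range(n_arms), key=lambda a: sums[a] / counts[a])

# Example: three Bernoulli arms.
means = [0.2, 0.5, 0.6]
best = greedy_with_forced_exploration(
    lambda a: 1.0 if random.random() < means[a] else 0.0, n_arms=3, horizon=10_000)
```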
48

Lupu, Andrei, Audrey Durand and Doina Precup. "Leveraging Observations in Bandits: Between Risks and Benefits". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6112–19. http://dx.doi.org/10.1609/aaai.v33i01.33016112.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
Imitation learning has been widely used to speed up learning in novice agents by allowing them to leverage existing data from experts. Allowing an agent to be influenced by external observations can benefit the learning process, but it also puts the agent at risk of following sub-optimal behaviours. In this paper, we study this problem in the context of bandits. More specifically, we consider an agent (the learner) interacting with a bandit-style decision task while also observing a target policy interacting with the same environment. The learner observes only the target's actions, not the rewards obtained. We introduce a new bandit optimism modifier that uses conditional optimism, contingent on the actions of the target, to guide the agent's exploration. We analyze the effect of this modification on the well-known Upper Confidence Bound algorithm by proving that it preserves a regret upper bound of order O(ln T), even in the presence of a very poor target, and we derive the dependency of the expected regret on the general target policy. We provide empirical results showing both great benefits and certain limitations inherent to observational learning in the multi-armed bandit setting. Experiments are conducted using targets satisfying the theoretical assumptions with high probability, thus narrowing the gap between theory and application.
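One way to picture such a conditional-optimism modifier (an illustrative sketch only, not the authors' exact modifier or analysis; the bonus scale `beta` and the form of the observational bonus are assumptions):

```python
import math

def ucb_with_target_observations(pull, observe_target, n_arms, horizon, beta=1.0):
    """Sketch: UCB1 where each arm's index gets an extra bonus proportional to
    how often the observed target policy has chosen it (actions only, no rewards).
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    target_counts = [0] * n_arms
    for t in range(1, horizon + 1):
        target_counts[observe_target()] += 1   # watch the target act once per round
        def index(a):
            if counts[a] == 0:
                return float("inf")
            bonus = math.sqrt(2 * math.log(t) / counts[a])
            obs_bonus = beta * target_counts[a] / t   # optimism contingent on the target
            return sums[a] / counts[a] + bonus + obs_bonus
        arm = max(range(n_arms), key=index)
        r = pull(arm)
        sums[arm] += r
        counts[arm] += 1
    return sums, counts
```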
49

Ontanon, Santiago. "The Combinatorial Multi-Armed Bandit Problem and Its Application to Real-Time Strategy Games". Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 9, no. 1 (June 30, 2021): 58–64. http://dx.doi.org/10.1609/aiide.v9i1.12681.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
Game tree search in games with large branching factors is a notoriously hard problem. In this paper, we address this problem with a new sampling strategy for Monte Carlo Tree Search (MCTS) algorithms, called "Naive Sampling", based on a variant of the Multi-armed Bandit problem called the "Combinatorial Multi-armed Bandit" (CMAB) problem. We present a new MCTS algorithm based on Naive Sampling called NaiveMCTS, and evaluate it in the context of real-time strategy (RTS) games. Our results show that as the branching factor grows, NaiveMCTS performs significantly better than other algorithms.
50

Narita, Yusuke, Shota Yasui and Kohei Yata. "Efficient Counterfactual Learning from Bandit Feedback". Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4634–41. http://dx.doi.org/10.1609/aaai.v33i01.33014634.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract (summary):
What is the most statistically efficient way to do off-policy optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators of the expected reward of a counterfactual policy. Our estimators are shown to have the lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design at a major advertising company. Consistent with the theoretical results, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence than a state-of-the-art benchmark.
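For reference, the standard baseline in this setting is the inverse-propensity-weighted (IPW) estimate of a counterfactual policy's value (a textbook sketch, not the paper's variance-optimal estimator; the array names are assumptions):

```python
import numpy as np

def ipw_value(rewards, actions, logging_propensities, target_policy_probs):
    """Sketch: estimate the expected reward of a counterfactual (target) policy
    from logs collected by a different logging policy.

    rewards[i]               observed reward of the logged action
    actions[i]               logged action index (kept only for alignment)
    logging_propensities[i]  probability the logging policy gave to that action
    target_policy_probs[i]   probability the target policy gives to that action
    """
    w = np.asarray(target_policy_probs) / np.asarray(logging_propensities)
    return float(np.mean(w * np.asarray(rewards)))

# Example with three logged rounds.
v_hat = ipw_value(rewards=[1.0, 0.0, 1.0],
                  actions=[2, 0, 1],
                  logging_propensities=[0.5, 0.25, 0.25],
                  target_policy_probs=[0.9, 0.05, 0.6])
```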

Go to bibliography