Journal articles on the topic 'Bandit learning'


Consult the top 50 journal articles for your research on the topic 'Bandit learning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Ciucanu, Radu, Pascal Lafourcade, Gael Marcadet, and Marta Soare. "SAMBA: A Generic Framework for Secure Federated Multi-Armed Bandits." Journal of Artificial Intelligence Research 73 (February 23, 2022): 737–65. http://dx.doi.org/10.1613/jair.1.13163.

Abstract:
The multi-armed bandit is a reinforcement learning model where a learning agent repeatedly chooses an action (pulling a bandit arm) and the environment responds with a stochastic outcome (reward) coming from an unknown distribution associated with the chosen arm. Bandits have a wide range of applications, such as Web recommendation systems. We address the cumulative reward maximization problem in a secure federated learning setting, where multiple data owners keep their data stored locally and collaborate under the coordination of a central orchestration server. We rely on cryptographic schemes and propose Samba, a generic framework for Secure federAted Multi-armed BAndits. Each data owner has data associated with a bandit arm and the bandit algorithm has to sequentially select which data owner is solicited at each time step. We instantiate Samba for five bandit algorithms. We show that Samba returns the same cumulative reward as the nonsecure versions of bandit algorithms, while satisfying formally proven security properties. We also show that the overhead due to cryptographic primitives is linear in the size of the input, which is confirmed by our proof-of-concept implementation.
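
For readers unfamiliar with the model, the interaction loop described in this abstract (an agent repeatedly pulls an arm and the environment returns a stochastic reward from an unknown distribution) can be sketched in a few lines. This is a generic, non-secure ε-greedy simulation with made-up Bernoulli arm means, not the Samba protocol or its cryptographic machinery:

```python
import random

def epsilon_greedy_bandit(true_means, horizon=10_000, epsilon=0.1, seed=0):
    """Simulate the basic stochastic multi-armed bandit loop with an
    epsilon-greedy learner. Rewards are Bernoulli(true_means[arm])."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms          # number of pulls per arm
    means = [0.0] * n_arms         # empirical mean reward per arm
    total_reward = 0.0

    for t in range(horizon):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: means[a])
        # Environment responds with a stochastic reward from the chosen arm.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        total_reward += reward

    return means, total_reward

if __name__ == "__main__":
    est, cum = epsilon_greedy_bandit([0.2, 0.5, 0.7])
    print("estimated means:", [round(m, 3) for m in est], "cumulative reward:", cum)
```
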
2

Sharaf, Amr, and Hal Daumé III. "Meta-Learning Effective Exploration Strategies for Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (May 18, 2021): 9541–48. http://dx.doi.org/10.1609/aaai.v35i11.17149.

Abstract:
In contextual bandits, an algorithm must choose actions given observed contexts, learning from a reward signal that is observed only for the action chosen. This leads to an exploration/exploitation trade-off: the algorithm must balance taking actions it already believes are good with taking new actions to potentially discover better choices. We develop a meta-learning algorithm, Mêlée, that learns an exploration policy based on simulated, synthetic contextual bandit tasks. Mêlée uses imitation learning against these simulations to train an exploration policy that can be applied to true contextual bandit tasks at test time. We evaluate Mêlée on both a natural contextual bandit problem derived from a learning to rank dataset as well as hundreds of simulated contextual bandit problems derived from classification tasks. Mêlée outperforms seven strong baselines on most of these datasets by leveraging a rich feature representation for learning an exploration strategy.
3

Yang, Luting, Jianyi Yang, and Shaolei Ren. "Contextual Bandits with Delayed Feedback and Semi-supervised Learning (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 18 (May 18, 2021): 15943–44. http://dx.doi.org/10.1609/aaai.v35i18.17968.

Abstract:
Contextual multi-armed bandit (MAB) is a classic online learning problem, where a learner/agent selects actions (i.e., arms) given contextual information and discovers optimal actions based on reward feedback. Applications of contextual bandits have been expanding rapidly, including advertisement, personalization, and resource allocation in wireless networks, among others. Nonetheless, the reward feedback is delayed in many applications (e.g., a user may only provide service ratings after a period of time), creating challenges for contextual bandits. In this paper, we address delayed feedback in contextual bandits by using semi-supervised learning, incorporating estimates of delayed rewards to improve the estimation of future rewards. Concretely, the reward feedback for an arm selected at the beginning of a round is only observed by the agent/learner with some observation noise and provided to the agent after a priori unknown but bounded delays. Motivated by semi-supervised learning, which produces pseudo labels for unlabeled data to further improve model performance, we generate fictitious estimates of rewards that are delayed and have yet to arrive based on already-learnt reward functions. Thus, by combining semi-supervised learning with online contextual bandit learning, we propose a novel extension and design two algorithms, which estimate the values of currently unavailable reward feedback to minimize the maximum estimation error and the average estimation error, respectively.
4

Kapoor, Sayash, Kumar Kshitij Patel, and Purushottam Kar. "Corruption-tolerant bandit learning." Machine Learning 108, no. 4 (August 29, 2018): 687–715. http://dx.doi.org/10.1007/s10994-018-5758-5.

5

Cheung, Wang Chi, David Simchi-Levi, and Ruihao Zhu. "Hedging the Drift: Learning to Optimize Under Nonstationarity." Management Science 68, no. 3 (March 2022): 1696–713. http://dx.doi.org/10.1287/mnsc.2021.4024.

Abstract:
We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for a collection of nonstationary stochastic bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and traffic network routing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) nonstationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Beginning with the linear bandit setting, we design and analyze a sliding-window upper confidence bound algorithm that achieves the optimal dynamic regret bound when the underlying variation budget is known. This budget quantifies the total amount of temporal variation of the latent environments. Boosted by the novel bandit-over-bandit framework that adapts to the latent changes, our algorithm can further enjoy nearly optimal dynamic regret bounds in a (surprisingly) parameter-free manner. We extend our results to other related bandit problems, namely the multiarmed bandit, generalized linear bandit, and combinatorial semibandit settings, which model a variety of operations research applications. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the “forgetting principle” in the learning processes, which is vital in changing environments. Extensive numerical experiments with synthetic datasets and a dataset of an online auto-loan company during the severe acute respiratory syndrome (SARS) epidemic period demonstrate that our proposed algorithms achieve superior performance compared with existing algorithms. This paper was accepted by George J. Shanthikumar, Management Science Special Section on Data-Driven Prescriptive Analytics.
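
A minimal sketch of the sliding-window idea, shown here for the plain multi-armed bandit rather than the linear bandit analysed in the paper; the window length, exploration constant, and drifting arm means are illustrative assumptions, not the authors' settings:

```python
import math
import random
from collections import deque

def sliding_window_ucb(pull, n_arms, horizon, window=500, c=2.0):
    """Sliding-window UCB for a (possibly nonstationary) multi-armed bandit.
    Statistics use only the last `window` rounds, so stale observations are
    forgotten. `pull(arm, t)` returns a reward in [0, 1]."""
    history = deque()                      # (arm, reward) for recent rounds
    chosen, total = [], 0.0
    for t in range(1, horizon + 1):
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for a, r in history:
            counts[a] += 1
            sums[a] += r
        if min(counts) == 0:               # try every arm once inside the window
            arm = counts.index(0)
        else:
            ucb = [sums[a] / counts[a]
                   + math.sqrt(c * math.log(len(history)) / counts[a])
                   for a in range(n_arms)]
            arm = max(range(n_arms), key=ucb.__getitem__)
        r = pull(arm, t)
        history.append((arm, r))
        if len(history) > window:
            history.popleft()              # the "forgetting" step
        chosen.append(arm)
        total += r
    return chosen, total

if __name__ == "__main__":
    rng = random.Random(0)
    # Arm means drift: the best arm switches halfway through the horizon.
    def pull(arm, t):
        means = [0.7, 0.3] if t <= 5000 else [0.3, 0.7]
        return 1.0 if rng.random() < means[arm] else 0.0
    _, total = sliding_window_ucb(pull, n_arms=2, horizon=10000)
    print("cumulative reward:", total)
```
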
6

Du, Yihan, Siwei Wang, and Longbo Huang. "A One-Size-Fits-All Solution to Conservative Bandit Problems." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 7254–61. http://dx.doi.org/10.1609/aaai.v35i8.16891.

Abstract:
In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB), and conservative contextual combinatorial bandits (CCCB). Different from previous works, which consider high-probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees (T-independent additive regrets instead of T-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and variability. For this extended problem, we provide a novel algorithm with O(1/T) normalized additive regrets (T-independent in the cumulative form) and validate this result through empirical evaluation.
7

Narita, Yusuke, Shota Yasui, and Kohei Yata. "Efficient Counterfactual Learning from Bandit Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4634–41. http://dx.doi.org/10.1609/aaai.v33i01.33014634.

Abstract:
What is the most statistically efficient way to do off-policy optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have the lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence compared to a state-of-the-art benchmark.
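
For context, the standard baseline that such off-policy estimators refine is inverse propensity scoring (IPS). The sketch below shows only that textbook baseline on toy logged data with hypothetical field names, not the variance-optimal estimators proposed in the paper:

```python
def ips_value(logged_data, target_policy):
    """Standard inverse-propensity-scoring (IPS) estimate of a target policy's
    expected reward from logged bandit feedback. Each log record holds the
    context, the action taken by the logging policy, the observed reward, and
    the logging propensity of that action. `target_policy(context, action)`
    returns the target policy's probability of taking `action` in `context`."""
    total = 0.0
    for record in logged_data:
        weight = target_policy(record["context"], record["action"]) / record["propensity"]
        total += weight * record["reward"]
    return total / len(logged_data)

# Toy usage: uniform logging over 2 actions, target policy always picks action 1.
logs = [
    {"context": 0, "action": 0, "reward": 0.0, "propensity": 0.5},
    {"context": 0, "action": 1, "reward": 1.0, "propensity": 0.5},
    {"context": 1, "action": 1, "reward": 1.0, "propensity": 0.5},
    {"context": 1, "action": 0, "reward": 0.0, "propensity": 0.5},
]
print(ips_value(logs, lambda ctx, a: 1.0 if a == 1 else 0.0))  # -> 1.0
```
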
8

Lupu, Andrei, Audrey Durand, and Doina Precup. "Leveraging Observations in Bandits: Between Risks and Benefits." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6112–19. http://dx.doi.org/10.1609/aaai.v33i01.33016112.

Abstract:
Imitation learning has been widely used to speed up learning in novice agents, by allowing them to leverage existing data from experts. Allowing an agent to be influenced by external observations can benefit the learning process, but it also puts the agent at risk of following sub-optimal behaviours. In this paper, we study this problem in the context of bandits. More specifically, we consider that an agent (learner) is interacting with a bandit-style decision task, but can also observe a target policy interacting with the same environment. The learner observes only the target’s actions, not the rewards obtained. We introduce a new bandit optimism modifier that uses conditional optimism contingent on the actions of the target in order to guide the agent’s exploration. We analyze the effect of this modification on the well-known Upper Confidence Bound algorithm by proving that it preserves a regret upper bound of order O(ln T), even in the presence of a very poor target, and we derive the dependency of the expected regret on the general target policy. We provide empirical results showing both great benefits as well as certain limitations inherent to observational learning in the multi-armed bandit setting. Experiments are conducted using targets satisfying theoretical assumptions with high probability, thus narrowing the gap between theory and application.
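
The Upper Confidence Bound baseline that this paper modifies is the classical UCB1 index: empirical mean plus an exploration bonus. The sketch below is vanilla UCB1 on made-up Bernoulli arms, without the paper's observational optimism modifier:

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Vanilla UCB1: pull the arm maximizing empirical mean + confidence bonus.
    `pull(arm)` returns a reward in [0, 1]."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:                          # initialise: pull each arm once
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return means, counts

if __name__ == "__main__":
    rng = random.Random(1)
    true = [0.4, 0.5, 0.8]
    means, counts = ucb1(lambda a: 1.0 if rng.random() < true[a] else 0.0,
                         n_arms=3, horizon=5000)
    print("pull counts:", counts)   # the best arm should dominate
```
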
9

Lopez, Romain, Inderjit S. Dhillon, and Michael I. Jordan. "Learning from eXtreme Bandit Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 8732–40. http://dx.doi.org/10.1609/aaai.v35i10.17058.

Abstract:
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure---named Policy Optimization for eXtreme Models (POXM)---for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.
10

Caro, Felipe, and Onesun Steve Yoo. "INDEXABILITY OF BANDIT PROBLEMS WITH RESPONSE DELAYS." Probability in the Engineering and Informational Sciences 24, no. 3 (April 23, 2010): 349–74. http://dx.doi.org/10.1017/s0269964810000021.

Abstract:
This article considers an important class of discrete-time restless bandits, given by the discounted multiarmed bandit problems with response delays. The delays in each period are independent random variables, in which the delayed responses do not cross over. For a bandit arm in this class, we use a coupling argument to show that in each state there is a unique subsidy that equates the pulling and nonpulling actions (i.e., the bandit satisfies the indexability criterion introduced by Whittle (1988)). The result allows for infinite or finite horizon and holds for arbitrary delay lengths and infinite state spaces. We compute the resulting marginal productivity indexes (MPI) for the Beta-Bernoulli Bayesian learning model, formulate and compute a tractable upper bound, and compare the suboptimality gap of the MPI policy to those of other heuristics derived from different closed-form indexes. The MPI policy performs near optimally and provides a theoretical justification for the use of the other heuristics.
11

Zhu, Zhaowei, Jingxuan Zhu, Ji Liu, and Yang Liu. "Federated Bandit." Proceedings of the ACM on Measurement and Analysis of Computing Systems 5, no. 1 (February 18, 2021): 1–29. http://dx.doi.org/10.1145/3447380.

Abstract:
In this paper, we study Federated Bandit, a decentralized Multi-Armed Bandit problem with a set of N agents, who can only communicate their local data with neighbors described by a connected graph G. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet they only have access to local and potentially biased feedback/evaluation of the true reward for each action taken. Learning only locally will lead agents to sub-optimal actions while converging to a no-regret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim for a solution with which agents will never share their local observations with a central entity, and will be allowed to only share a private copy of his/her own information with their neighbors. We first propose a decentralized bandit algorithm Gossip_UCB, which is a coupling of variants of both the classical gossiping algorithm and the celebrated Upper Confidence Bound (UCB) bandit algorithm. We show that Gossip_UCB successfully adapts local bandit learning into a global gossiping process for sharing information among connected agents, and achieves guaranteed regret at the order of O(max{poly(N,M) log T, poly(N,M) log_{λ_2^{-1}} N}) for all N agents, where λ_2 ∈ (0,1) is the second largest eigenvalue of the expected gossip matrix, which is a function of G. We then propose Fed_UCB, a differentially private version of Gossip_UCB, in which the agents preserve ε-differential privacy of their local data while achieving O(max{(poly(N,M)/ε) log^{2.5} T, poly(N,M)(log_{λ_2^{-1}} N + log T)}) regret.
12

Varatharajah, Yogatheesan, and Brent Berry. "A Contextual-Bandit-Based Approach for Informed Decision-Making in Clinical Trials." Life 12, no. 8 (August 21, 2022): 1277. http://dx.doi.org/10.3390/life12081277.

Abstract:
Clinical trials are conducted to evaluate the efficacy of new treatments. Clinical trials involving multiple treatments utilize the randomization of treatment assignments to enable the evaluation of treatment efficacies in an unbiased manner. Such evaluation is performed in post hoc studies that usually use supervised-learning methods that rely on large amounts of data collected in a randomized fashion. That approach often proves to be suboptimal in that some participants may suffer and even die as a result of having not received the most appropriate treatments during the trial. Reinforcement-learning methods improve the situation by making it possible to learn the treatment efficacies dynamically during the course of the trial, and to adapt treatment assignments accordingly. Recent efforts using multi-arm bandits, a type of reinforcement-learning method, have focused on maximizing clinical outcomes for a population that was assumed to be homogeneous. However, those approaches have failed to account for the variability among participants that is becoming increasingly evident as a result of recent clinical-trial-based studies. We present a contextual-bandit-based online treatment optimization algorithm that, in choosing treatments for new participants in the study, takes into account not only the maximization of the clinical outcomes but also the patient characteristics. We evaluated our algorithm using a real clinical trial dataset from the International Stroke Trial. We simulated the online setting by sequentially going through the data of each participant admitted to the trial. Two bandits (one for each context) were created, with four choices of treatments. For a new participant in the trial, depending on the context, one of the bandits was selected. Then, we took three different approaches to choose a treatment: (a) a random choice (i.e., the strategy currently used in clinical trial settings), (b) a Thompson sampling-based approach, and (c) a UCB-based approach. Success probabilities of each context were calculated separately by considering the participants with the same context. Those estimated outcomes were used to update the prior distributions within the bandit corresponding to the context of each participant. We repeated that process through the end of the trial and recorded the outcomes and the chosen treatments for each approach. We also evaluated a context-free multi-arm-bandit-based approach, using the same dataset, to showcase the benefits of our approach. In the context-free case, we calculated the success probabilities for the Bernoulli sampler using the whole clinical trial dataset in a context-independent manner. The results of our retrospective analysis indicate that the proposed approach performs significantly better than either a random assignment of treatments (the current gold standard) or a multi-arm-bandit-based approach, providing substantial gains in the percentage of participants who are assigned the most suitable treatments. The contextual-bandit and multi-arm bandit approaches provide 72.63% and 64.34% gains, respectively, compared to a random assignment.
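
The Thompson-sampling arm of this comparison follows the textbook Beta-Bernoulli recipe: keep a Beta posterior per treatment, sample from each posterior, and assign the treatment with the largest sample. A generic sketch with invented success probabilities (not the authors' trial pipeline or its context handling):

```python
import random

def thompson_sampling(success_prob, horizon=1000, seed=0):
    """Beta-Bernoulli Thompson sampling over `len(success_prob)` treatments.
    Each arm keeps a Beta(alpha, beta) posterior over its success probability;
    at every round we sample from each posterior, assign the treatment with the
    largest sampled value, and update that arm's posterior."""
    rng = random.Random(seed)
    n = len(success_prob)
    alpha = [1.0] * n   # Beta(1, 1) uniform priors
    beta = [1.0] * n
    assigned = [0] * n
    for _ in range(horizon):
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n)]
        arm = max(range(n), key=samples.__getitem__)
        outcome = 1 if rng.random() < success_prob[arm] else 0   # observed success/failure
        alpha[arm] += outcome
        beta[arm] += 1 - outcome
        assigned[arm] += 1
    return assigned

if __name__ == "__main__":
    # Hypothetical success probabilities of four treatments.
    print(thompson_sampling([0.55, 0.60, 0.62, 0.75]))
```
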
13

Asanov, Igor. "Bandit cascade: A test of observational learning in the bandit problem." Journal of Economic Behavior & Organization 189 (September 2021): 150–71. http://dx.doi.org/10.1016/j.jebo.2021.06.006.

14

Dimakopoulou, Maria, Zhengyuan Zhou, Susan Athey, and Guido Imbens. "Balanced Linear Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 3445–53. http://dx.doi.org/10.1609/aaai.v33i01.33013445.

Abstract:
Contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models, which can lead to difficult estimation problems along the path of learning. We develop algorithms for contextual bandits with linear payoffs that integrate balancing methods from the causal inference literature in their estimation to make them less prone to problems of estimation bias. We provide the first regret bound analyses for linear contextual bandits with balancing and show that our algorithms match state-of-the-art theoretical guarantees. We demonstrate the strong practical advantage of balanced contextual bandits on a large number of supervised learning datasets and on a synthetic example that simulates model misspecification and prejudice in the initial training data.
15

Yang, Jianyi, and Shaolei Ren. "Robust Bandit Learning with Imperfect Context." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10594–602. http://dx.doi.org/10.1609/aaai.v35i12.17267.

Abstract:
A standard assumption in the contextual multi-arm bandit setting is that the true context is perfectly known before arm selection. Nonetheless, in many practical applications (e.g., cloud resource management), prior to arm selection, the context information can only be acquired by prediction subject to errors or adversarial modification. In this paper, we study a novel contextual bandit setting in which only imperfect context is available for arm selection while the true context is revealed at the end of each round. We propose two robust arm selection algorithms: MaxMinUCB (Maximize Minimum UCB), which maximizes the worst-case reward, and MinWD (Minimize Worst-case Degradation), which minimizes the worst-case regret. Importantly, we analyze the robustness of MaxMinUCB and MinWD by deriving both regret and reward bounds compared to an oracle that knows the true context. Our results show that as time goes on, MaxMinUCB and MinWD both perform asymptotically as well as their optimal counterparts that know the reward function. Finally, we apply MaxMinUCB and MinWD to online edge datacenter selection, and run synthetic simulations to validate our theoretical analysis.
16

Truong, Quoc-Tuan, and Hady W. Lauw. "Variational learning from implicit bandit feedback." Machine Learning 110, no. 8 (July 9, 2021): 2085–105. http://dx.doi.org/10.1007/s10994-021-06028-0.

17

Lai, Tze-Leung, and S. Yakowitz. "Machine learning and nonparametric bandit theory." IEEE Transactions on Automatic Control 40, no. 7 (July 1995): 1199–209. http://dx.doi.org/10.1109/9.400491.

18

Tran, Alasdair, Cheng Soon Ong, and Christian Wolf. "Combining active learning suggestions." PeerJ Computer Science 4 (July 23, 2018): e157. http://dx.doi.org/10.7717/peerj-cs.157.

Abstract:
We study the problem of combining active learning suggestions to identify informative training examples by empirically comparing methods on benchmark datasets. Many active learning heuristics for classification problems have been proposed to help us pick which instance to annotate next. But what is the optimal heuristic for a particular source of data? Motivated by the success of methods that combine predictors, we combine active learners with bandit algorithms and rank aggregation methods. We demonstrate that a combination of active learners outperforms passive learning in large benchmark datasets and removes the need to pick a particular active learner a priori. We discuss challenges to finding good rewards for bandit approaches and show that rank aggregation performs well.
19

Shi, Chengshuai, and Cong Shen. "Federated Multi-Armed Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (May 18, 2021): 9603–11. http://dx.doi.org/10.1609/aaai.v35i11.17156.

Abstract:
Federated multi-armed bandits (FMAB) is a new bandit paradigm that parallels the federated learning (FL) framework in supervised learning. It is inspired by practical applications in cognitive radio and recommender systems, and enjoys features that are analogous to FL. This paper proposes a general framework of FMAB and then studies two specific federated bandit models. We first study the approximate model where the heterogeneous local models are random realizations of the global model from an unknown distribution. This model introduces a new uncertainty of client sampling, as the global model may not be reliably learned even if the finite local models are perfectly known. Furthermore, this uncertainty cannot be quantified a priori without knowledge of the suboptimality gap. We solve the approximate model by proposing Federated Double UCB (Fed2-UCB), which constructs a novel “double UCB” principle accounting for uncertainties from both arm and client sampling. We show that gradually admitting new clients is critical in achieving an O(log(T)) regret while explicitly considering the communication loss. The exact model, where the global bandit model is the exact average of heterogeneous local models, is then studied as a special case. We show that, somewhat surprisingly, the order-optimal regret can be achieved independent of the number of clients with a careful choice of the update periodicity. Experiments using both synthetic and real-world datasets corroborate the theoretical analysis and demonstrate the effectiveness and efficiency of the proposed algorithms.
20

Nobari, Sadegh. "DBA: Dynamic Multi-Armed Bandit Algorithm." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9869–70. http://dx.doi.org/10.1609/aaai.v33i01.33019869.

Abstract:
We introduce the Dynamic Bandit Algorithm (DBA), a practical solution that addresses a shortcoming of the pervasively employed reinforcement learning algorithm called Multi-Arm Bandit, aka Bandit. Bandit makes real-time decisions based on prior observations. However, Bandit is so heavily biased toward the priors that it cannot quickly adapt itself to a changing trend. As a result, Bandit cannot make profitable decisions quickly enough when the trend is changing. Unlike Bandit, DBA focuses on quickly adapting itself to detect these trends early enough. Furthermore, DBA remains almost as light as Bandit in terms of computation. Therefore, DBA can be easily deployed in production as a light process, similar to Bandit. We demonstrate how critical and beneficial the main focus of DBA, i.e., the ability to quickly find the most profitable option in real time, is compared with its state-of-the-art competitors. Our experiments are augmented with a visualization mechanism that explains, through animations, the profitability of the decisions made by each algorithm at each step. Finally, we observe that DBA can substantially outperform the original Bandit, by close to 3 times for a set Key Performance Indicator (KPI), in a case with 3 arms.
21

Karpov, Nikolai, and Qin Zhang. "Instance-Sensitive Algorithms for Pure Exploration in Multinomial Logit Bandit." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 7 (June 28, 2022): 7096–103. http://dx.doi.org/10.1609/aaai.v36i7.20669.

Abstract:
Motivated by real-world applications such as fast fashion retailing and online advertising, the Multinomial Logit Bandit (MNL-bandit) is a popular model in online learning and operations research, and has attracted much attention in the past decade. In this paper, we give efficient algorithms for pure exploration in MNL-bandit. Our algorithms achieve instance-sensitive pull complexities. We also complement the upper bounds by an almost matching lower bound.
22

Gao, Xuefeng, and Tianrun Xu. "Order scoring, bandit learning and order cancellations." Journal of Economic Dynamics and Control 134 (January 2022): 104287. http://dx.doi.org/10.1016/j.jedc.2021.104287.

23

Xu, Yiming, Vahid Keshavarzzadeh, Robert M. Kirby, and Akil Narayan. "A Bandit-Learning Approach to Multifidelity Approximation." SIAM Journal on Scientific Computing 44, no. 1 (January 18, 2022): A150–A175. http://dx.doi.org/10.1137/21m1408312.

24

Brezzi, Monica, and Tze Leung Lai. "Optimal learning and experimentation in bandit problems." Journal of Economic Dynamics and Control 27, no. 1 (November 2002): 87–108. http://dx.doi.org/10.1016/s0165-1889(01)00028-8.

25

Rosenberg, Dinah, Eilon Solan, and Nicolas Vieille. "Social Learning in One-Arm Bandit Problems." Econometrica 75, no. 6 (November 2007): 1591–611. http://dx.doi.org/10.1111/j.1468-0262.2007.00807.x.

26

Lefebvre, Germain, Christopher Summerfield, and Rafal Bogacz. "A Normative Account of Confirmation Bias During Reinforcement Learning." Neural Computation 34, no. 2 (January 14, 2022): 307–37. http://dx.doi.org/10.1162/neco_a_01455.

Abstract:
Reinforcement learning involves updating estimates of the value of states and actions on the basis of experience. Previous work has shown that in humans, reinforcement learning exhibits a confirmatory bias: when the value of a chosen option is being updated, estimates are revised more radically following positive than negative reward prediction errors, but the converse is observed when updating the unchosen option value estimate. Here, we simulate performance on a multi-arm bandit task to examine the consequences of a confirmatory bias for reward harvesting. We report a paradoxical finding: that confirmatory biases allow the agent to maximize reward relative to an unbiased updating rule. This principle holds over a wide range of experimental settings and is most influential when decisions are corrupted by noise. We show that this occurs because on average, confirmatory biases lead to overestimating the value of more valuable bandits and underestimating the value of less valuable bandits, rendering decisions overall more robust in the face of noise. Our results show how apparently suboptimal learning rules can in fact be reward maximizing if decisions are made with finite computational precision.
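
The confirmatory update rule being simulated can be written compactly: the chosen option is updated with a larger learning rate after positive reward prediction errors than after negative ones, and the asymmetry is reversed for the unchosen option. The sketch below is a schematic of that rule on a two-armed bandit with arbitrary learning rates and arm means, assuming the unchosen arm's outcome is also observed; it is not the authors' simulation code:

```python
import random

def confirmation_bias_learner(true_means, horizon=1000,
                              lr_confirm=0.3, lr_disconfirm=0.1,
                              epsilon=0.1, seed=0):
    """Two-armed bandit learner with a confirmatory update rule.
    For the chosen arm, positive prediction errors use the larger learning
    rate; for the unchosen (counterfactually updated) arm the asymmetry is
    reversed."""
    rng = random.Random(seed)
    q = [0.5, 0.5]                                  # value estimates
    for _ in range(horizon):
        choice = (max(range(2), key=q.__getitem__)
                  if rng.random() > epsilon else rng.randrange(2))
        other = 1 - choice
        rewards = [1.0 if rng.random() < m else 0.0 for m in true_means]
        # Chosen arm: confirm good news, discount bad news.
        delta_c = rewards[choice] - q[choice]
        q[choice] += (lr_confirm if delta_c > 0 else lr_disconfirm) * delta_c
        # Unchosen arm (its outcome is assumed visible): reversed asymmetry.
        delta_u = rewards[other] - q[other]
        q[other] += (lr_disconfirm if delta_u > 0 else lr_confirm) * delta_u
    return q

if __name__ == "__main__":
    print(confirmation_bias_learner([0.7, 0.3]))
```
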
27

Huo, Xiaoguang, and Feng Fu. "Risk-aware multi-armed bandit problem with application to portfolio selection." Royal Society Open Science 4, no. 11 (November 2017): 171377. http://dx.doi.org/10.1098/rsos.171377.

Abstract:
Sequential portfolio selection has attracted increasing interest in the machine learning and quantitative finance communities in recent years. As a mathematical framework for reinforcement learning policies, the stochastic multi-armed bandit problem addresses the primary difficulty in sequential decision-making under uncertainty, namely the exploration versus exploitation dilemma, and therefore provides a natural connection to portfolio selection. In this paper, we incorporate risk awareness into the classic multi-armed bandit setting and introduce an algorithm to construct portfolios. Through filtering assets based on the topological structure of the financial market and combining the optimal multi-armed bandit policy with the minimization of a coherent risk measure, we achieve a balance between risk and return.
28

Zhu, Zhaowei, Jingxuan Zhu, Ji Liu, and Yang Liu. "Federated Bandit: A Gossiping Approach." ACM SIGMETRICS Performance Evaluation Review 49, no. 1 (June 22, 2022): 3–4. http://dx.doi.org/10.1145/3543516.3453919.

Abstract:
We study Federated Bandit, a decentralized Multi-Armed Bandit (MAB) problem with a set of N agents, who can only communicate their local data with neighbors described by a connected graph G. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet they only have access to local and potentially biased feedback/evaluation of the true reward for each action taken. Learning only locally will lead agents to sub-optimal actions while converging to a no-regret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim for a solution with which agents will never share their local observations with a central entity, and will be allowed to only share a private copy of his/her own information with their neighbors. We first propose a decentralized bandit algorithm GossipUCB, which is a coupling of variants of both the classical gossiping algorithm and the celebrated Upper Confidence Bound (UCB) bandit algorithm. We show that GossipUCB successfully adapts local bandit learning into a global gossiping process for sharing information among connected agents, and achieves guaranteed regret at the order of O(max{poly(N,M) log T, poly(N,M) log_{λ_2^{-1}} N}) for all N agents, where λ_2 ∈ (0,1) is the second largest eigenvalue of the expected gossip matrix, which is a function of G. We then propose FedUCB, a differentially private version of GossipUCB, in which the agents preserve ε-differential privacy of their local data while achieving O(max{(poly(N,M)/ε) log^{2.5} T, poly(N,M)(log_{λ_2^{-1}} N + log T)}) regret.
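
The decentralized ingredient here is a gossip (consensus-averaging) step over the communication graph G. The sketch below shows only that generic step applied to each agent's local mean-reward estimates, using a hypothetical doubly stochastic weight matrix on a three-agent path graph; it is not the full GossipUCB algorithm and omits the UCB and differential-privacy components:

```python
def gossip_step(local_estimates, weights):
    """One synchronous gossip round: every agent replaces its estimate vector
    with a weighted average of its neighbours' estimates. `weights` is a
    doubly stochastic N x N matrix whose sparsity pattern matches the graph G;
    repeated rounds drive all agents toward the network-wide average."""
    n_agents = len(local_estimates)
    n_arms = len(local_estimates[0])
    new = [[0.0] * n_arms for _ in range(n_agents)]
    for i in range(n_agents):
        for j in range(n_agents):
            if weights[i][j] != 0.0:
                for k in range(n_arms):
                    new[i][k] += weights[i][j] * local_estimates[j][k]
    return new

# Toy usage: 3 agents on a path graph, 2 arms, symmetric doubly stochastic weights.
W = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.5]]
estimates = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.6]]
for _ in range(50):
    estimates = gossip_step(estimates, W)
print(estimates)   # every row converges to the network average [0.533..., 0.4]
```
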
29

Kaibel, Chris, and Torsten Biemann. "Rethinking the Gold Standard With Multi-armed Bandits: Machine Learning Allocation Algorithms for Experiments." Organizational Research Methods 24, no. 1 (June 11, 2019): 78–103. http://dx.doi.org/10.1177/1094428119854153.

Abstract:
In experiments, researchers commonly allocate subjects randomly and equally to the different treatment conditions before the experiment starts. While this approach is intuitive, it means that new information gathered during the experiment is not utilized until after the experiment has ended. Based on methodological approaches from other scientific disciplines such as computer science and medicine, we suggest machine learning algorithms for subject allocation in experiments. Specifically, we discuss a Bayesian multi-armed bandit algorithm for randomized controlled trials and use Monte Carlo simulations to compare its efficiency with randomized controlled trials that have a fixed and balanced subject allocation. Our findings indicate that a randomized allocation based on Bayesian multi-armed bandits is more efficient and ethical in most settings. We develop recommendations for researchers and discuss the limitations of our approach.
30

Wu, Wen, Nan Cheng, Ning Zhang, Peng Yang, Weihua Zhuang, and Xuemin Shen. "Fast mmwave Beam Alignment via Correlated Bandit Learning." IEEE Transactions on Wireless Communications 18, no. 12 (December 2019): 5894–908. http://dx.doi.org/10.1109/twc.2019.2940454.

31

He, Di, Wei Chen, Liwei Wang, and Tie-Yan Liu. "Online learning for auction mechanism in bandit setting." Decision Support Systems 56 (December 2013): 379–86. http://dx.doi.org/10.1016/j.dss.2013.07.004.

32

Cayci, Semih, Atilla Eryilmaz, and R. Srikant. "Learning to Control Renewal Processes with Bandit Feedback." ACM SIGMETRICS Performance Evaluation Review 47, no. 1 (December 17, 2019): 41–42. http://dx.doi.org/10.1145/3376930.3376957.

33

Cayci, Semih, Atilla Eryilmaz, and R. Srikant. "Learning to Control Renewal Processes with Bandit Feedback." Proceedings of the ACM on Measurement and Analysis of Computing Systems 3, no. 2 (June 19, 2019): 1–32. http://dx.doi.org/10.1145/3341617.3326158.

34

Tang, Qiao, Hong Xie, Yunni Xia, Jia Lee, and Qingsheng Zhu. "Robust Contextual Bandits via Bootstrapping." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (May 18, 2021): 12182–89. http://dx.doi.org/10.1609/aaai.v35i13.17446.

Abstract:
Upper confidence bound (UCB) based contextual bandit algorithms require one to know the tail property of the reward distribution. Unfortunately, such a tail property is usually unknown or difficult to specify in real-world applications. Using a tail property heavier than the ground truth leads to a slow learning speed of the contextual bandit algorithm, while using a lighter one may cause the algorithm to diverge. To address this fundamental problem, we develop an estimator (evaluated from historical rewards) for the contextual bandit UCB based on the multiplier bootstrapping technique. We first establish sufficient conditions under which our estimator converges asymptotically to the ground truth of the contextual bandit UCB. We further derive a second-order correction for our estimator so as to obtain its confidence level with a finite number of rounds. To demonstrate the versatility of the estimator, we apply it to design a BootLinUCB algorithm for the contextual bandit. We prove that BootLinUCB has a sub-linear regret upper bound and also conduct extensive experiments to validate its superior performance.
35

Garcelon, Evrard, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. "Improved Algorithms for Conservative Exploration in Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 3962–69. http://dx.doi.org/10.1609/aaai.v34i04.5812.

Abstract:
In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
36

Giachino, Chiara, Luigi Bollani, Alessandro Bonadonna, and Marco Bertetti. "Reinforcement learning for content's customization: a first step of experimentation in Skyscanner." Industrial Management & Data Systems 121, no. 6 (January 15, 2021): 1417–34. http://dx.doi.org/10.1108/imds-12-2019-0722.

Abstract:
Purpose: The aim of the paper is to test and demonstrate the potential benefits of applying reinforcement learning instead of traditional methods to optimize the content of a company's mobile application to best help travellers find their ideal flights. To this end, two approaches were considered and compared via simulation: standard randomized experiments or A/B testing and multi-armed bandits. Design/methodology/approach: The simulation of the two approaches to optimize the content of its mobile application and, consequently, increase flight conversions is illustrated as applied by Skyscanner, using R software. Findings: The first finding concerns the comparison between the two approaches, A/B testing and multi-armed bandits, to identify the one that achieves better results for the company. The second is the experience and suggestions gained in applying the two approaches, which are useful for other industries and companies. Research limitations/implications: The case study demonstrated, via simulation, the potential benefits of applying reinforcement learning in a company. Finally, the multi-armed bandit was implemented in the company, but the period of the available data was limited, and due to its strategic relevance, the company cannot show all the findings. Practical implications: The right algorithm can change according to the situation and industry but would bring great benefits to the company's ability to surface content that is more relevant to users and help improve the experience for travellers. The study shows how to manage complexity and data to achieve good results. Originality/value: The paper describes the approach used by a leading European company operating in the travel sector in understanding how to adapt reinforcement learning to its strategic goals. It presents a real case study and the simulation of the application of A/B testing and multi-armed bandits in Skyscanner; moreover, it highlights practical suggestions useful to other companies.
37

Metelli, Alberto Maria, Matteo Papini, Pierluca D'Oro, and Marcello Restelli. "Policy Optimization as Online Learning with Mediator Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 8958–66. http://dx.doi.org/10.1609/aaai.v35i10.17083.

Abstract:
Policy Optimization (PO) is a widely used approach to address continuous control tasks. In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space. The additional available information, compared to the standard bandit feedback, allows reusing samples generated by one policy to estimate the performance of other policies. Based on this observation, we propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization in PO, that employs a randomized exploration strategy, differently from the existing optimistic approaches. When the policy space is finite, we show that under certain circumstances, it is possible to achieve constant regret, while always enjoying logarithmic regret. We also derive problem-dependent regret lower bounds. Then, we extend RANDOMIST to compact policy spaces. Finally, we provide numerical simulations on finite and compact policy spaces, in comparison with PO and bandit baselines.
38

Shi, Zheyuan Ryan, Zhiwei Steven Wu, Rayid Ghani, and Fei Fang. "Bandit Data-Driven Optimization for Crowdsourcing Food Rescue Platforms." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 11 (June 28, 2022): 12154–62. http://dx.doi.org/10.1609/aaai.v36i11.21475.

Abstract:
Food waste and insecurity are two societal challenges that coexist in many parts of the world. A prominent force to combat these issues, food rescue platforms match food donations to organizations that serve underprivileged communities, and then rely on external volunteers to transport the food. Previous work has developed machine learning models for food rescue volunteer engagement. However, having long worked with domain practitioners to deploy AI tools to help with food rescues, we understand that there are four main pain points that keep such a machine learning model from being actually useful in practice: small data, data collected only under the default intervention, unmodeled objectives due to communication gaps, and unforeseen consequences of the intervention. In this paper, we introduce bandit data-driven optimization, which not only helps address these pain points in food rescue, but also is applicable to other nonprofit domains that share similar challenges. Bandit data-driven optimization combines the advantages of online bandit learning and offline predictive analytics in an integrated framework. We propose PROOF, a novel algorithm for this framework, and formally prove that it is no-regret. We show that PROOF performs better than the existing baseline on food rescue volunteer recommendation.
39

Ding, Qinxu, Yong Liu, Chunyan Miao, Fei Cheng, and Haihong Tang. "A Hybrid Bandit Framework for Diversified Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 5 (May 18, 2021): 4036–44. http://dx.doi.org/10.1609/aaai.v35i5.16524.

Abstract:
Interactive recommender systems involve users in the recommendation procedure by receiving timely user feedback to update the recommendation policy. Therefore, they are widely used in real application scenarios. Previous interactive recommendation methods primarily focus on learning users' personalized preferences on the relevance properties of an item set. However, the investigation of users' personalized preferences on the diversity properties of an item set is usually ignored. To overcome this problem, we propose the Linear Modular Dispersion Bandit (LMDB) framework, which is an online learning setting for optimizing a combination of modular functions and dispersion functions. Specifically, LMDB employs modular functions to model the relevance properties of each item, and dispersion functions to describe the diversity properties of an item set. Moreover, we also develop a learning algorithm, called Linear Modular Dispersion Hybrid (LMDH), to solve the LMDB problem and derive a gap-free bound on its n-step regret. Extensive experiments on real datasets are performed to demonstrate the effectiveness of the proposed LMDB framework in balancing the recommendation accuracy and diversity.
40

Modaresi, Sajad, Denis Sauré, and Juan Pablo Vielma. "Learning in Combinatorial Optimization: What and How to Explore." Operations Research 68, no. 5 (September 2020): 1585–604. http://dx.doi.org/10.1287/opre.2019.1926.

Abstract:
When moving from the traditional to combinatorial multiarmed bandit setting, addressing the classical exploration versus exploitation trade-off is a challenging task. In “Learning in Combinatorial Optimization: What and How to Explore,” Modaresi, Sauré, and Vielma show that the combinatorial setting has salient features that distinguish it from the traditional bandit. In particular, combinatorial structure induces correlation between cost of different solutions, thus raising the questions of what parameters to estimate and how to collect and combine information. The authors answer such questions by developing a novel optimization problem called the lower-bound problem (LBP). They establish a fundamental limit on asymptotic performance of any admissible policy and propose near-optimal LBP-based policies. Because LBP is likely intractable in practice, they propose policies that instead solve a proxy for LBP, which they call the optimality cover problem (OCP). They provide strong evidence of practical tractability of OCP and illustrate the markedly superior performance of OCP-based policies numerically.
41

Yan, Cairong, Junli Xian, Yongquan Wan, and Pengwei Wang. "Modeling implicit feedback based on bandit learning for recommendation." Neurocomputing 447 (August 2021): 244–56. http://dx.doi.org/10.1016/j.neucom.2021.03.072.

42

Chen, Lixing, Jie Xu, Shaolei Ren, and Pan Zhou. "Spatio–Temporal Edge Service Placement: A Bandit Learning Approach." IEEE Transactions on Wireless Communications 17, no. 12 (December 2018): 8388–401. http://dx.doi.org/10.1109/twc.2018.2876823.

43

Xia, Wenchao, Tony Q. S. Quek, Kun Guo, Wanli Wen, Howard H. Yang, and Hongbo Zhu. "Multi-Armed Bandit-Based Client Scheduling for Federated Learning." IEEE Transactions on Wireless Communications 19, no. 11 (November 2020): 7108–23. http://dx.doi.org/10.1109/twc.2020.3008091.

44

Kaufmann, Emilie, and Aurélien Garivier. "Learning the distribution with largest mean: two bandit frameworks." ESAIM: Proceedings and Surveys 60 (2017): 114–31. http://dx.doi.org/10.1051/proc/201760114.

45

Agrawal, Shipra, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. "MNL-Bandit: A Dynamic Learning Approach to Assortment Selection." Operations Research 67, no. 5 (September 2019): 1453–85. http://dx.doi.org/10.1287/opre.2018.1832.

Abstract:
We consider a dynamic assortment selection problem where in every round the retailer offers a subset (assortment) of N substitutable products to a consumer, who selects one of these products according to a multinomial logit (MNL) choice model. The retailer observes this choice, and the objective is to dynamically learn the model parameters while optimizing cumulative revenues over a selling horizon of length T. We refer to this exploration–exploitation formulation as the MNL-Bandit problem. Existing methods for this problem follow an explore-then-exploit approach, which estimates parameters to a desired accuracy and then, treating these estimates as if they are the correct parameter values, offers the optimal assortment based on these estimates. These approaches require certain a priori knowledge of “separability,” determined by the true parameters of the underlying MNL model, and this in turn is critical in determining the length of the exploration period. (Separability refers to the distinguishability of the true optimal assortment from the other suboptimal alternatives.) In this paper, we give an efficient algorithm that simultaneously explores and exploits, without a priori knowledge of any problem parameters. Furthermore, the algorithm is adaptive in the sense that its performance is near optimal in the “well-separated” case as well as the general parameter setting where this separation need not hold.
46

Liu, Keqin, and Qing Zhao. "Distributed Learning in Multi-Armed Bandit With Multiple Players." IEEE Transactions on Signal Processing 58, no. 11 (November 2010): 5667–81. http://dx.doi.org/10.1109/tsp.2010.2062509.

47

Han, Kai, Yuntian He, Alex X. Liu, Shaojie Tang, and He Huang. "Differentially Private and Budget-Limited Bandit Learning over Matroids." INFORMS Journal on Computing 32, no. 3 (July 2020): 790–804. http://dx.doi.org/10.1287/ijoc.2019.0903.

Abstract:
We propose the first budget-limited multi-armed bandit (BMAB) algorithm subject to a union of matroid constraints in arm pulling, while at the same time achieving differential privacy. Our model generalizes the arm-pulling models studied in prior BMAB schemes, and it can be used to address many practical problems such as network backbone construction and dynamic pricing in crowdsourcing. We handle the exploitation versus exploration tradeoff in our BMAB problem by exploiting the combinatorial structures of matroids, and reduce the searching complexity of arm selection based on a divide-and-conquer approach. Our algorithm achieves a uniform logarithmic regret bound with respect to B and ɛ-differential privacy, where B is the budget for pulling the arms with random costs. Without differential privacy, our algorithm achieves a uniform logarithmic regret bound with respect to B, which advances the asymptotic regret bounds achieved by prior BMAB algorithms. We performed side-by-side comparisons with prior schemes in our experiments. Experimental results show that our purely-combinatorial algorithm not only achieves significantly better regret performance, but also is more than 20 times faster than prior BMAB schemes, which use time-consuming LP-solving techniques.
48

Jain, Shweta, Satyanath Bhat, Ganesh Ghalme, Divya Padmanabhan, and Y. Narahari. "Mechanisms with learning for stochastic multi-armed bandit problems." Indian Journal of Pure and Applied Mathematics 47, no. 2 (June 2016): 229–72. http://dx.doi.org/10.1007/s13226-016-0186-3.

49

Meng, Hao, Wasswa Shafik, S. Mojtaba Matinkhah, and Zubair Ahmad. "A 5G Beam Selection Machine Learning Algorithm for Unmanned Aerial Vehicle Applications." Wireless Communications and Mobile Computing 2020 (August 1, 2020): 1–16. http://dx.doi.org/10.1155/2020/1428968.

Abstract:
Unmanned aerial vehicles (UAVs) have emerged as a promising research trend in recent years, with current and future networks expected to use enhanced connectivity in different fields such as medicine, communication, and search-and-rescue operations, among others. Current technologies use fixed base stations that operate onsite and off-site from a fixed position, with associated problems such as poor connectivity. This opens the door for UAV technology to be used as a mobile alternative that increases accessibility with fifth-generation (5G) connectivity, which focuses on increased availability and connectivity. Wireless technologies have so far seen limited use in the medical field. This paper first presents a study of deep learning applied to the medical field in general and provides detailed steps involved in the multiarmed bandit (MAB) approach to solving the exploration-versus-exploitation dilemma for UAV-based biomedical engineering devices and medical applications. The paper further presents a detailed description of the applicability of bandit networks to achieving close-to-optimal performance and efficiency for medical engineered devices. The simulated results show that a multiarmed bandit approach can be applied to optimize the performance of any networked medical device, compared with the Thompson sampling, Bayesian, and ε-greedy algorithms. The results obtained further illustrate the optimized utilization of biomedical engineering technology systems, achieving close-to-optimal performance on average through deep learning of realistic medical situations.
50

Yan, Xue, Yali Du, Binxin Ru, Jun Wang, Haifeng Zhang, and Xu Chen. "Learning to Identify Top Elo Ratings: A Dueling Bandits Approach." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8797–805. http://dx.doi.org/10.1609/aaai.v36i8.20860.

Abstract:
The Elo rating system is widely adopted to evaluate the skills of (chess) game and sports players. Recently it has been also integrated into machine learning algorithms in evaluating the performance of computerised AI agents. However, an accurate estimation of the Elo rating (for the top players) often requires many rounds of competitions, which can be expensive to carry out. In this paper, to minimize the number of comparisons and to improve the sample efficiency of the Elo evaluation (for top players), we propose an efficient online match scheduling algorithm. Specifically, we identify and match the top players through a dueling bandits framework and tailor the bandit algorithm to the gradient-based update of Elo. We show that it reduces the per-step memory and time complexity to constant, compared to the traditional likelihood maximization approaches requiring O(t) time. Our algorithm has a regret guarantee that is sublinear in the number of competition rounds and has been extended to the multidimensional Elo ratings for handling intransitive games. We empirically demonstrate that our method achieves superior convergence speed and time efficiency on a variety of gaming tasks.