Academic literature on the topic 'Bandit learning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Bandit learning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Bandit learning"

1

Ciucanu, Radu, Pascal Lafourcade, Gael Marcadet, and Marta Soare. "SAMBA: A Generic Framework for Secure Federated Multi-Armed Bandits." Journal of Artificial Intelligence Research 73 (February 23, 2022): 737–65. http://dx.doi.org/10.1613/jair.1.13163.

Full text
Abstract:
The multi-armed bandit is a reinforcement learning model where a learning agent repeatedly chooses an action (pulls a bandit arm) and the environment responds with a stochastic outcome (reward) coming from an unknown distribution associated with the chosen arm. Bandits have a wide range of applications, such as Web recommendation systems. We address the cumulative reward maximization problem in a secure federated learning setting, where multiple data owners keep their data stored locally and collaborate under the coordination of a central orchestration server. We rely on cryptographic schemes and propose Samba, a generic framework for Secure federAted Multi-armed BAndits. Each data owner has data associated with a bandit arm, and the bandit algorithm has to sequentially select which data owner is solicited at each time step. We instantiate Samba for five bandit algorithms. We show that Samba returns the same cumulative reward as the nonsecure versions of bandit algorithms, while satisfying formally proven security properties. We also show that the overhead due to cryptographic primitives is linear in the size of the input, which is confirmed by our proof-of-concept implementation.
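To make the interaction model described in this abstract concrete, here is a minimal sketch (not taken from the paper) of the basic bandit loop: at each step the learner pulls one arm and observes a stochastic reward drawn from that arm's unknown distribution. The arm means, epsilon parameter, and horizon are illustrative assumptions; SAMBA itself adds a cryptographic, federated layer on top of such algorithms.

```python
import random

def pull(arm_mean):
    """Environment response: a Bernoulli reward drawn from the arm's unknown distribution."""
    return 1.0 if random.random() < arm_mean else 0.0

def epsilon_greedy(arm_means, horizon=10_000, epsilon=0.1):
    """Cumulative-reward maximization with a simple epsilon-greedy learner (illustrative only)."""
    n_arms = len(arm_means)
    counts = [0] * n_arms          # number of pulls per arm
    estimates = [0.0] * n_arms     # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                          # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])    # exploit
        reward = pull(arm_means[arm])
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]   # running mean
        total_reward += reward
    return total_reward, estimates

if __name__ == "__main__":
    # Hypothetical arm means standing in for the data owners' unknown distributions.
    print(epsilon_greedy([0.2, 0.5, 0.7]))
```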
APA, Harvard, Vancouver, ISO, and other styles
2

Sharaf, Amr, and Hal Daumé III. "Meta-Learning Effective Exploration Strategies for Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (May 18, 2021): 9541–48. http://dx.doi.org/10.1609/aaai.v35i11.17149.

Full text
Abstract:
In contextual bandits, an algorithm must choose actions given observed contexts, learning from a reward signal that is observed only for the action chosen. This leads to an exploration/exploitation trade-off: the algorithm must balance taking actions it already believes are good with taking new actions to potentially discover better choices. We develop a meta-learning algorithm, Mêlée, that learns an exploration policy based on simulated, synthetic contextual bandit tasks. Mêlée uses imitation learning against these simulations to train an exploration policy that can be applied to true contextual bandit tasks at test time. We evaluate Mêlée on both a natural contextual bandit problem derived from a learning to rank dataset as well as hundreds of simulated contextual bandit problems derived from classification tasks. Mêlée outperforms seven strong baselines on most of these datasets by leveraging a rich feature representation for learning an exploration strategy.
APA, Harvard, Vancouver, ISO, and other styles
3

Yang, Luting, Jianyi Yang, and Shaolei Ren. "Contextual Bandits with Delayed Feedback and Semi-supervised Learning (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 18 (May 18, 2021): 15943–44. http://dx.doi.org/10.1609/aaai.v35i18.17968.

Full text
Abstract:
Contextual multi-armed bandit (MAB) is a classic online learning problem, where a learner/agent selects actions (i.e., arms) given contextual information and discovers optimal actions based on reward feedback. Applications of contextual bandits have been expanding steadily, including advertisement, personalization, and resource allocation in wireless networks, among others. Nonetheless, the reward feedback is delayed in many applications (e.g., a user may only provide service ratings after a period of time), creating challenges for contextual bandits. In this paper, we address delayed feedback in contextual bandits by using semi-supervised learning, incorporating estimates of delayed rewards to improve the estimation of future rewards. Concretely, the reward feedback for an arm selected at the beginning of a round is observed by the agent/learner only with some observation noise and is provided to the agent after some a priori unknown but bounded delay. Motivated by semi-supervised learning, which produces pseudo labels for unlabeled data to further improve model performance, we generate fictitious estimates of rewards that are delayed and have yet to arrive based on already-learnt reward functions. Thus, by combining semi-supervised learning with online contextual bandit learning, we propose a novel extension and design two algorithms, which estimate the values of currently unavailable reward feedback to minimize the maximum estimation error and the average estimation error, respectively.
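As a rough illustration of the imputation idea sketched above (and not the authors' two algorithms), the following sketch fills in rewards that have not yet arrived with predictions from the learner's current reward model, then corrects the update when the true, delayed feedback finally arrives. The linear reward model, the greedy arm choice, the delay distribution, and all constants are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, horizon, max_delay = 5, 3, 2000, 10

# Hypothetical true arm parameters (unknown to the learner).
theta_true = rng.normal(size=(n_arms, d))

# Per-arm ridge-regression statistics for the learner's reward model.
A = np.stack([np.eye(d) for _ in range(n_arms)])   # design matrices
b = np.zeros((n_arms, d))
pending = []   # (arrival_time, arm, context, true_reward, imputed_reward)

for t in range(horizon):
    x = rng.normal(size=d)
    theta_hat = np.stack([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
    arm = int(np.argmax(theta_hat @ x))        # greedy choice (exploration omitted for brevity)
    true_r = float(theta_true[arm] @ x + 0.1 * rng.normal())
    delay = int(rng.integers(1, max_delay + 1))

    # Semi-supervised idea: update immediately with a pseudo-reward from the current model,
    # then swap it for the real reward once the delayed feedback arrives.
    pseudo_r = float(theta_hat[arm] @ x)
    A[arm] += np.outer(x, x)
    b[arm] += pseudo_r * x
    pending.append((t + delay, arm, x, true_r, pseudo_r))

    arrived = [p for p in pending if p[0] <= t]
    pending = [p for p in pending if p[0] > t]
    for _, a_, x_, r_, imputed in arrived:
        b[a_] += (r_ - imputed) * x_           # replace the earlier imputation with the real reward
```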
APA, Harvard, Vancouver, ISO, and other styles
4

Kapoor, Sayash, Kumar Kshitij Patel, and Purushottam Kar. "Corruption-tolerant bandit learning." Machine Learning 108, no. 4 (August 29, 2018): 687–715. http://dx.doi.org/10.1007/s10994-018-5758-5.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Cheung, Wang Chi, David Simchi-Levi, and Ruihao Zhu. "Hedging the Drift: Learning to Optimize Under Nonstationarity." Management Science 68, no. 3 (March 2022): 1696–713. http://dx.doi.org/10.1287/mnsc.2021.4024.

Full text
Abstract:
We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for a collection of nonstationary stochastic bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and traffic network routing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) nonstationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Beginning with the linear bandit setting, we design and analyze a sliding window-upper confidence bound algorithm that achieves the optimal dynamic regret bound when the underlying variation budget is known. This budget quantifies the total amount of temporal variation of the latent environments. Boosted by the novel bandit-over-bandit framework that adapts to the latent changes, our algorithm can further enjoy nearly optimal dynamic regret bounds in a (surprisingly) parameter-free manner. We extend our results to other related bandit problems, namely the multiarmed bandit, generalized linear bandit, and combinatorial semibandit settings, which model a variety of operations research applications. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the “forgetting principle” in the learning processes, which is vital in changing environments. Extensive numerical experiments with synthetic datasets and a dataset of an online auto-loan company during the severe acute respiratory syndrome (SARS) epidemic period demonstrate that our proposed algorithms achieve superior performance compared with existing algorithms. This paper was accepted by George J. Shanthikumar, Management Science Special Section on Data-Driven Prescriptive Analytics.
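The sliding-window idea mentioned in this abstract can be sketched as follows, in a simplified multi-armed bandit version rather than the paper's linear-bandit algorithm: statistics are computed only over the last w observations, so older and possibly stale rewards are forgotten. The window size, bonus constant, and drifting Bernoulli environment are illustrative assumptions.

```python
import math
import random
from collections import deque

def sliding_window_ucb(reward_fn, n_arms, horizon, window=200, c=2.0):
    """Simplified sliding-window UCB for a nonstationary multi-armed bandit (illustrative sketch)."""
    history = deque()   # (arm, reward) pairs for the most recent `window` rounds
    choices = []
    for t in range(1, horizon + 1):
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for arm, reward in history:
            counts[arm] += 1
            sums[arm] += reward
        if any(cnt == 0 for cnt in counts):
            arm = counts.index(0)                           # play any arm missing from the window
        else:
            window_len = len(history)
            ucb = [sums[a] / counts[a] + math.sqrt(c * math.log(window_len) / counts[a])
                   for a in range(n_arms)]
            arm = max(range(n_arms), key=lambda a: ucb[a])
        reward = reward_fn(arm, t)                          # environment may drift with t
        history.append((arm, reward))
        if len(history) > window:
            history.popleft()                               # forget observations outside the window
        choices.append(arm)
    return choices

# Example: arm means swap halfway through the horizon (a simple nonstationary environment).
def drifting_bernoulli(arm, t, means_before=(0.8, 0.4), means_after=(0.3, 0.9)):
    means = means_before if t <= 1500 else means_after
    return 1.0 if random.random() < means[arm] else 0.0

if __name__ == "__main__":
    sliding_window_ucb(drifting_bernoulli, n_arms=2, horizon=3000)
```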
APA, Harvard, Vancouver, ISO, and other styles
6

Du, Yihan, Siwei Wang, and Longbo Huang. "A One-Size-Fits-All Solution to Conservative Bandit Problems." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 7254–61. http://dx.doi.org/10.1609/aaai.v35i8.16891.

Full text
Abstract:
In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB). Different from previous works, which consider high-probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees (T-independent additive regrets instead of T-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and variability. For this extended problem, we provide a novel algorithm with O(1/T) normalized additive regrets (T-independent in the cumulative form) and validate this result through empirical evaluation.
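The sample-path constraint described above admits a compact formulation; the version below is a standard conservative-bandit form (the learner's cumulative reward must stay above a (1 − α) fraction of the baseline's at every round) and is given only to illustrate the constraint type, not necessarily the paper's exact definition.

```latex
% Sample-path conservative constraint (illustrative standard form):
% r_s is the reward actually received at round s, r^base_s the baseline's reward,
% and alpha in (0, 1) the allowed slack.
\sum_{s=1}^{t} r_{s} \;\ge\; (1-\alpha)\sum_{s=1}^{t} r^{\mathrm{base}}_{s},
\qquad \text{for all } t = 1, \dots, T.
```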
APA, Harvard, Vancouver, ISO, and other styles
7

Narita, Yusuke, Shota Yasui, and Kohei Yata. "Efficient Counterfactual Learning from Bandit Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4634–41. http://dx.doi.org/10.1609/aaai.v33i01.33014634.

Full text
Abstract:
What is the most statistically efficient way to do off-policy optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have the lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence compared to a state-of-the-art benchmark.
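The offline estimators discussed here build on inverse propensity scoring (IPS); the sketch below shows a plain IPS estimate of a counterfactual policy's value from logged bandit feedback, as a baseline for the variance-reduced estimators the paper develops. The field names and toy log data are assumptions for illustration.

```python
def ips_value(logged_data, target_policy):
    """
    Plain inverse-propensity-scoring estimate of a target policy's expected reward
    from logged bandit feedback (a baseline sketch, not the paper's efficient estimator).

    logged_data: iterable of (context, action, reward, logging_prob) tuples, where
                 logging_prob is the probability the logging policy gave to `action`.
    target_policy: function (context, action) -> probability under the counterfactual policy.
    """
    total, n = 0.0, 0
    for context, action, reward, logging_prob in logged_data:
        weight = target_policy(context, action) / logging_prob   # importance weight
        total += weight * reward
        n += 1
    return total / n if n else 0.0

# Tiny hypothetical log: two actions, uniform logging policy.
log = [("u1", 0, 1.0, 0.5), ("u2", 1, 0.0, 0.5), ("u3", 0, 1.0, 0.5)]
always_action_0 = lambda context, action: 1.0 if action == 0 else 0.0
print(ips_value(log, always_action_0))   # estimated value of always playing action 0
```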
APA, Harvard, Vancouver, ISO, and other styles
8

Lupu, Andrei, Audrey Durand, and Doina Precup. "Leveraging Observations in Bandits: Between Risks and Benefits." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6112–19. http://dx.doi.org/10.1609/aaai.v33i01.33016112.

Full text
Abstract:
Imitation learning has been widely used to speed up learning in novice agents, by allowing them to leverage existing data from experts. Allowing an agent to be influenced by external observations can benefit the learning process, but it also puts the agent at risk of following sub-optimal behaviours. In this paper, we study this problem in the context of bandits. More specifically, we consider that an agent (learner) is interacting with a bandit-style decision task, but can also observe a target policy interacting with the same environment. The learner observes only the target's actions, not the rewards obtained. We introduce a new bandit optimism modifier that uses conditional optimism contingent on the actions of the target in order to guide the agent's exploration. We analyze the effect of this modification on the well-known Upper Confidence Bound algorithm by proving that it preserves a regret upper bound of order O(ln T), even in the presence of a very poor target, and we derive the dependency of the expected regret on the general target policy. We provide empirical results showing both great benefits and certain limitations inherent to observational learning in the multi-armed bandit setting. Experiments are conducted using targets satisfying theoretical assumptions with high probability, thus narrowing the gap between theory and application.
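For reference, the Upper Confidence Bound algorithm that the authors modify scores each arm by an empirical mean plus an exploration bonus. The sketch below is standard UCB1 with a purely illustrative extra bonus on the arm an observed target plays; this bonus stands in for, but does not reproduce, the paper's conditional-optimism modifier, and the fixed target arm and constants are assumptions.

```python
import math
import random

def ucb1_with_observations(arm_means, target_arm, horizon=5000, obs_bonus=0.05):
    """UCB1 with an illustrative extra optimism term for the arm an observed target plays.
    The bonus form is an assumption for this sketch, not the modifier analyzed in the paper."""
    n_arms = len(arm_means)
    counts, means = [0] * n_arms, [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                                   # initialization: play each arm once
        else:
            scores = [means[a] + math.sqrt(2 * math.log(t) / counts[a]) for a in range(n_arms)]
            scores[target_arm] += obs_bonus               # optimism toward the target's choice
            arm = max(range(n_arms), key=lambda a: scores[a])
        reward = 1.0 if random.random() < arm_means[arm] else 0.0   # learner sees only its own reward
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means

if __name__ == "__main__":
    # A possibly sub-optimal target that always plays arm 0.
    print(ucb1_with_observations([0.4, 0.6, 0.5], target_arm=0))
```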
APA, Harvard, Vancouver, ISO, and other styles
9

Lopez, Romain, Inderjit S. Dhillon, and Michael I. Jordan. "Learning from eXtreme Bandit Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 8732–40. http://dx.doi.org/10.1609/aaai.v35i10.17058.

Full text
Abstract:
We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces. Learning from extreme bandit feedback is ubiquitous in recommendation systems, in which billions of decisions are made over sets consisting of millions of choices in a single day, yielding massive observational data. In these large-scale real-world applications, supervised learning frameworks such as eXtreme Multi-label Classification (XMC) are widely used despite the fact that they incur significant biases due to the mismatch between bandit feedback and supervised labels. Such biases can be mitigated by importance sampling techniques, but these techniques suffer from impractical variance when dealing with a large number of actions. In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime. The sIS estimator is obtained by performing importance sampling on the conditional expectation of the reward with respect to a small subset of actions for each instance (a form of Rao-Blackwellization). We employ this estimator in a novel algorithmic procedure---named Policy Optimization for eXtreme Models (POXM)---for learning from bandit feedback on XMC tasks. In POXM, the selected actions for the sIS estimator are the top-p actions of the logging policy, where p is adjusted from the data and is significantly smaller than the size of the action space. We use a supervised-to-bandit conversion on three XMC datasets to benchmark our POXM method against three competing methods: BanditNet, a previously applied partial matching pruning strategy, and a supervised learning baseline. Whereas BanditNet sometimes improves marginally over the logging policy, our experiments show that POXM systematically and significantly improves over all baselines.
APA, Harvard, Vancouver, ISO, and other styles
10

Caro, Felipe, and Onesun Steve Yoo. "INDEXABILITY OF BANDIT PROBLEMS WITH RESPONSE DELAYS." Probability in the Engineering and Informational Sciences 24, no. 3 (April 23, 2010): 349–74. http://dx.doi.org/10.1017/s0269964810000021.

Full text
Abstract:
This article considers an important class of discrete time restless bandits, given by the discounted multiarmed bandit problems with response delays. The delays in each period are independent random variables, in which the delayed responses do not cross over. For a bandit arm in this class, we use a coupling argument to show that in each state there is a unique subsidy that equates the pulling and nonpulling actions (i.e., the bandit satisfies the indexability criterion introduced by Whittle (1988)). The result allows for infinite or finite horizon and holds for arbitrary delay lengths and infinite state spaces. We compute the resulting marginal productivity indexes (MPI) for the Beta-Bernoulli Bayesian learning model, formulate and compute a tractable upper bound, and compare the suboptimality gap of the MPI policy to those of other heuristics derived from different closed-form indexes. The MPI policy performs near optimally and provides a theoretical justification for the use of the other heuristics.
APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic "Bandit learning"

1

Liu, Fang. "Efficient Online Learning with Bandit Feedback." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587680990430268.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Klein, Nicolas. "Learning and Experimentation in Strategic Bandit Problems." Diss., LMU, 2010. http://nbn-resolving.de/urn:nbn:de:bvb:19-122728.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Talebi, Mazraeh Shahi Mohammad Sadegh. "Online Combinatorial Optimization under Bandit Feedback." Licentiate thesis, KTH, Reglerteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-181321.

Full text
Abstract:
Multi-Armed Bandits (MAB) constitute the most fundamental model for sequential decision making problems with an exploration vs. exploitation trade-off. In such problems, the decision maker selects an arm in each round and observes a realization of the corresponding unknown reward distribution. Each decision is based on past decisions and observed rewards. The objective is to maximize the expected cumulative reward over some time horizon by balancing exploitation (arms with higher observed rewards should be selected often) and exploration (all arms should be explored to learn their average rewards). Equivalently, the performance of a decision rule or algorithm can be measured through its expected regret, defined as the gap between the expected reward achieved by the algorithm and that achieved by an oracle algorithm always selecting the best arm. This thesis investigates stochastic and adversarial combinatorial MAB problems, where each arm is a collection of several basic actions taken from a set of $d$ elements, in a way that the set of arms has a certain combinatorial structure. Examples of such sets include the set of fixed-size subsets, matchings, spanning trees, paths, etc. These problems are specific forms of online linear optimization, where the decision space is a subset of the $d$-dimensional hypercube. Due to the combinatorial nature, the number of arms generically grows exponentially with $d$. Hence, treating arms as independent and applying classical sequential arm selection policies would yield a prohibitive regret. It may then be crucial to exploit the combinatorial structure of the problem to design efficient arm selection algorithms. As the first contribution of this thesis, in Chapter 3 we investigate combinatorial MABs in the stochastic setting and with Bernoulli rewards. We derive asymptotic (i.e., when the time horizon grows large) lower bounds on the regret of any algorithm under bandit and semi-bandit feedback. The proposed lower bounds are problem-specific and tight in the sense that there exists an algorithm that achieves these regret bounds. Our derivation leverages some theoretical results in adaptive control of Markov chains. Under semi-bandit feedback, we further discuss the scaling of the proposed lower bound with the dimension of the underlying combinatorial structure. For the case of semi-bandit feedback, we propose ESCB, an algorithm that efficiently exploits the structure of the problem, and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms, and significantly outperforms these algorithms in practice. In the fourth chapter, we consider stochastic combinatorial MAB problems where the underlying combinatorial structure is a matroid. Specializing the results of Chapter 3 to matroids, we provide explicit regret lower bounds for this class of problems. For the case of semi-bandit feedback, we propose KL-OSM, a computationally efficient greedy-based algorithm that exploits the matroid structure. Through a finite-time analysis, we prove that the regret upper bound of KL-OSM matches the proposed lower bound, thus making it the first asymptotically optimal algorithm for this class of problems. Numerical experiments validate that KL-OSM outperforms state-of-the-art algorithms in practice, as well. In the fifth chapter, we investigate the online shortest-path routing problem, which is an instance of combinatorial MABs with geometric rewards. We consider and compare three different types of online routing policies, depending (i) on where routing decisions are taken (at the source or at each node), and (ii) on the received feedback (semi-bandit or bandit). For each case, we derive the asymptotic regret lower bound. These bounds help us to understand the performance improvements we can expect when (i) taking routing decisions at each hop rather than at the source only, and (ii) observing per-link delays rather than end-to-end path delays. In particular, we show that (i) is of no use while (ii) can have a spectacular impact. For source routing under semi-bandit feedback, we then propose two algorithms with a trade-off between computational complexity and performance. The regret upper bounds of these algorithms improve over those of the existing algorithms, and they significantly outperform state-of-the-art algorithms in numerical experiments. Finally, we discuss combinatorial MABs in the adversarial setting and under bandit feedback. We concentrate on the case where arms have the same number of basic actions but are otherwise arbitrary. We propose CombEXP, an algorithm that has the same regret scaling as state-of-the-art algorithms. Furthermore, we show that CombEXP admits lower computational complexity for some combinatorial problems.
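The notion of expected regret used throughout this abstract (the gap between an oracle that always plays the best arm and the algorithm) can be written explicitly; the formula below is the standard stochastic-bandit definition and is given only to make the verbal definition above concrete.

```latex
% Expected cumulative regret after T rounds, for arm means \mu_1,\dots,\mu_K
% and arms A_1,\dots,A_T selected by the algorithm:
R(T) \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{A_t}\right],
\qquad \mu^{*} \;=\; \max_{1 \le a \le K} \mu_a .
```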


APA, Harvard, Vancouver, ISO, and other styles
4

Lomax, S. E. "Cost-sensitive decision tree learning using a multi-armed bandit framework." Thesis, University of Salford, 2013. http://usir.salford.ac.uk/29308/.

Full text
Abstract:
Decision tree learning is one of the main methods of learning from data. It has been applied to a variety of different domains over the past three decades. In the real world, accuracy is not enough; there are costs involved, those of obtaining the data and those when classification errors occur. A comprehensive survey of cost-sensitive decision tree learning has identified over 50 algorithms, developing a taxonomy in order to classify the algorithms by the way in which cost has been incorporated, and a recent comparison shows that many cost-sensitive algorithms can process balanced, two class datasets well, but produce lower accuracy rates in order to achieve lower costs when the dataset is less balanced or has multiple classes. This thesis develops a new framework and algorithm concentrating on the view that cost-sensitive decision tree learning involves a trade-off between costs and accuracy. Decisions arising from these two viewpoints can often be incompatible resulting in the reduction of the accuracy rates. The new framework builds on a specific Game Theory problem known as the multi-armed bandit. This problem concerns a scenario whereby exploration and exploitation are required to solve it. For example, a player in a casino has to decide which slot machine (bandit) from a selection of slot machines is likely to pay out the most. Game Theory proposes a solution of this problem which is solved by a process of exploration and exploitation in which reward is maximized. This thesis utilizes these concepts from the multi-armed bandit game to develop a new algorithm by viewing the rewards as a reduction in costs, utilizing the exploration and exploitation techniques so that a compromise between decisions based on accuracy and decisions based on costs can be found. The algorithm employs the adapted multi-armed bandit game to select the attributes during decision tree induction, using a look-ahead methodology to explore potential attributes and exploit the attributes which maximizes the reward. The new algorithm is evaluated on fifteen datasets and compared to six well-known algorithms J48, EG2, MetaCost, AdaCostM1, ICET and ACT. The results obtained show that the new multi-armed based algorithm can produce more cost-effective trees without compromising accuracy. The thesis also includes a critical appraisal of the limitations of the developed algorithm and proposes avenues for further research.
APA, Harvard, Vancouver, ISO, and other styles
5

Jedor, Matthieu. "Bandit algorithms for recommender system optimization." Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASM027.

Full text
Abstract:
In this PhD thesis, we study the optimization of recommender systems with the objective of providing more refined suggestions of items to the user. The task is modeled using the multi-armed bandit framework. In a first part, we address two problems that commonly occur in recommender systems: the large number of items to handle and the management of sponsored content. In a second part, we investigate the empirical performance of bandit algorithms, and in particular how to tune conventional algorithms to improve results in the stationary and non-stationary environments that arise in practice. This leads us to analyze, both theoretically and empirically, the greedy algorithm which, in some cases, outperforms the state of the art.
APA, Harvard, Vancouver, ISO, and other styles
6

Louëdec, Jonathan. "Stratégies de bandit pour les systèmes de recommandation." Thesis, Toulouse 3, 2016. http://www.theses.fr/2016TOU30257/document.

Full text
Abstract:
Current recommender systems need to recommend items that are relevant to users (exploitation), but they must also be able to continuously obtain new information about items and users that are still little known (exploration). This is the exploration/exploitation dilemma. Such an environment falls within the scope of what is called reinforcement learning. In the statistical literature, bandit strategies are known to offer solutions to this dilemma. The contributions of this multidisciplinary thesis adapt these strategies to address several issues arising in recommender systems, such as recommending several items simultaneously, taking into account the aging of an item's popularity, and recommending in real time.
APA, Harvard, Vancouver, ISO, and other styles
7

Nakhe, Paresh [author], Martin Hoefer, and Georg Schnitger [reviewers]. "On bandit learning and pricing in markets." Frankfurt am Main: Universitätsbibliothek Johann Christian Senckenberg, 2018. http://d-nb.info/1167856740/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Besson, Lilian. "Multi-Players Bandit Algorithms for Internet of Things Networks." Thesis, CentraleSupélec, 2019. http://www.theses.fr/2019CSUP0005.

Full text
Abstract:
In this PhD thesis, we study wireless networks and reconfigurable end-devices that can access Cognitive Radio networks, in unlicensed bands and without central control. We focus in particular on current and future Internet of Things (IoT) networks, with the objective of extending the devices' battery life by equipping them with low-cost but efficient machine learning algorithms that let them automatically improve the efficiency of their wireless communications. We propose two models of IoT networks, and we show empirically, through numerical simulations and a realistic real-world validation, the gain our methods, based on Reinforcement Learning, can bring. The different network access problems are modeled as Multi-Armed Bandits (MAB), but analyzing the convergence of a large number of devices playing a collaborative game without communication or coordination remains difficult when the devices all follow random activation patterns. The rest of the manuscript therefore studies two restricted models: first multi-player bandits in stationary problems, then non-stationary single-player bandits. We also detail another contribution, SMPyBandits, our open-source Python library for numerical MAB simulations, which covers the studied models and more.
APA, Harvard, Vancouver, ISO, and other styles
9

Racey, Deborah Elaine. "EFFECTS OF RESPONSE FREQUENCY CONSTRAINTS ON LEARNING IN A NON-STATIONARY MULTI-ARMED BANDIT TASK." OpenSIUC, 2009. https://opensiuc.lib.siu.edu/dissertations/86.

Full text
Abstract:
An n-armed bandit task was used to investigate the trade-off between exploratory (choosing lesser-known options) and exploitive (choosing options with the greatest probability of reinforcement) human choice in a trial-and-error learning problem. In Experiment 1 a different probability of reinforcement was assigned to each of 8 response options using random-ratios (RRs), and participants chose by clicking buttons in a circular display on a computer screen using a computer mouse. Relative frequency thresholds (ranging from .10 to 1.0) were randomly assigned to each participant and acted as task constraints limiting the proportion of total responses that could be attributed to any response option. Preference for the richer keys was shown, and those with greater constraints explored more and earned less reinforcement. Those with the highest constraints showed no preference, distributing their responses among the options with equal probability. In Experiment 2 the payoff probabilities changed partway through, for some the leanest options increased to richest, and for others the richest became leanest. When the RRs changed, the decrease participants with moderate and low constraints showed immediate increases in exploration and change in preference to the new richest keys, while increase participants showed no increase in exploration, and more gradual changes in preference. For Experiment 3 the constraint was held constant at .85, and the two richest options were decreased midway through the task by varying amounts (0 to .60). Decreases were detected early for participants in all but the smallest decrease conditions, and exploration increased.
APA, Harvard, Vancouver, ISO, and other styles
10

Hren, Jean-Francois. "Planification Optimiste pour Systèmes Déterministes." Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2012. http://tel.archives-ouvertes.fr/tel-00845898.

Full text
Abstract:
In the field of reinforcement learning, planning in Markov decision processes is an online approach used to control a system for which a generative model is available. We address this problem in the deterministic case, with either a discrete or a continuous action space. Chapter 2 of this thesis briefly presents Markov decision processes and then reinforcement learning, in particular three central algorithms: value iteration, policy iteration, and Q-learning. In Chapter 3, we explain how planning in Markov decision processes is used to control systems online. We assume that a generative model of the system to be controlled is available and use it, at each time step, to decide which action to apply so as to drive the system into a state maximizing the future sum of discounted rewards. We treat the generative model as a black box which, given a state and an action, returns a successor state and the associated reward. The optimistic approach is detailed, both in its philosophy and in its application to the exploration-exploitation dilemma, through various techniques from the literature. We review several algorithms from the literature that apply to planning in Markov decision processes, focusing in particular on algorithms that perform a forward search by building a look-ahead tree. These algorithms are presented and related to one another. The A* shortest-path search algorithm is presented so as to be connected to our first contribution, the optimistic planning algorithm. We detail this first contribution in Chapter 4. We first present in detail the setting of planning under computational resource constraints, as well as the notion of regret. We then present the uniform planning algorithm and analyze its regret to obtain a baseline for comparison with the optimistic planning algorithm, which is presented and analyzed in turn. The analysis is extended to a class of problems defined by the proportion of near-optimal paths, which yields an upper bound on the regret of the optimistic planning algorithm that improves on that of the uniform planning algorithm in the worst case. Experiments are carried out to validate the theory and quantify the performance of the optimistic planning algorithm on problems from the literature such as cart-pole, acrobot, and mountain car, in comparison with the uniform planning algorithm, the UCT algorithm, and random search. We will see that, as suggested by the upper bound on its regret, the optimistic planning algorithm is sensitive to the branching factor, which leads us to consider the case of a continuous action space. This is the subject of our two other contributions, in Chapter 5.
Our second contribution is the Lipschitzian planning algorithm, which relies on a regularity assumption on the rewards, namely that the transition and reward functions of the Markov decision process modeling the controlled system are Lipschitz continuous. From this assumption we derive a bound over subsets of the continuous action space, allowing us to explore it through successive discretizations. The algorithm does, however, require knowledge of the Lipschitz constant associated with the controlled system. Experiments are carried out to evaluate this approach for different Lipschitz constants on problems from the literature such as cart-pole, acrobot, and the magnetic levitation of a steel ball. The results show that estimating the Lipschitz constant is difficult and does not account for the local reward landscape. Our third contribution is the sequential planning algorithm, which follows an intuitive approach in which a sequence of instances of a global optimization algorithm is used to construct sequences of actions from the continuous action space. Experiments evaluate this intuitive approach with different global optimization algorithms on problems from the literature such as cart-pole, the boat, and the swimmer. The results obtained are encouraging and validate the intuitive approach. Finally, we conclude by summarizing the various contributions and by opening up new perspectives and extensions.
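To make the optimistic planning idea above more tangible, here is a minimal sketch of planning with a look-ahead tree in a deterministic system: at each step, expand the leaf whose optimistic value (accumulated discounted reward plus the best possible discounted future) is largest, then recommend the first action of the best-explored branch. This is an illustrative reconstruction under assumed rewards in [0, 1], not the thesis' exact algorithm.

```python
def optimistic_planning(state, actions, step, n_expansions=200, gamma=0.9):
    """Minimal optimistic planning sketch for a deterministic generative model.
    `step(state, action)` returns (next_state, reward), with rewards assumed in [0, 1]."""
    # Each leaf: (state, depth, discounted_reward_sum, first_action_from_root)
    leaves = []
    for a in actions:
        s2, r = step(state, a)
        leaves.append((s2, 1, r, a))
    for _ in range(n_expansions):
        # Optimistic value: rewards collected so far plus an upper bound on all future rewards.
        def b_value(node):
            _, depth, u, _ = node
            return u + gamma ** depth / (1.0 - gamma)
        best = max(leaves, key=b_value)
        leaves.remove(best)
        s, depth, u, first = best
        for a in actions:                      # expand the most promising leaf one level deeper
            s2, r = step(s, a)
            leaves.append((s2, depth + 1, u + gamma ** depth * r, first))
    # Recommend the first action of the branch with the highest guaranteed discounted return.
    return max(leaves, key=lambda node: node[2])[3]

if __name__ == "__main__":
    # Toy deterministic chain: moving "right" pays 1, moving "left" pays 0.
    step = lambda s, a: (s + (1 if a == "right" else -1), 1.0 if a == "right" else 0.0)
    print(optimistic_planning(0, ["right", "left"], step))   # -> "right"
```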
APA, Harvard, Vancouver, ISO, and other styles

Books on the topic "Bandit learning"

1

Garofalo, Robert Joseph. Chorale and Shaker dance by John P. Zdechlik: A teaching-learning unit. Ft. Lauderdale, FL: Meredith Music Publications, 1999.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
2

Garofalo, Robert Joseph. Suite française by Darius Milhaud: A teaching-learning unit. Ft. Lauderdale, FL: Meredith Music Publications, 1998.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
3

Garofalo, Robert Joseph. On a hymnsong of Philip Bliss by David R. Holsinger: A teaching/learning unit. Galesville, MD: Meredith Music Publications, 2000.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
4

Mirror mind. Toronto, Ont: [T. Woollcott], 2009.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
5

Patterson, James. Retour au collège: Le pire endroit du monde! Vanves: Hachette romans, 2016.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
6

Bubeck, Sébastien, and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems. Now Publishers, 2012.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
7

Zhao, Qing, and R. Srikant. Multi-Armed Bandits: Theory and Applications to Online Learning in Networks. Morgan & Claypool Publishers, 2019.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
8

Zhao, Qing. Multi-Armed Bandits: Theory and Applications to Online Learning in Networks. Springer International Publishing AG, 2019.

Find full text
APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "Bandit learning"

1

Kakas, Antonis C., David Cohn, Sanjoy Dasgupta, Andrew G. Barto, Gail A. Carpenter, Stephen Grossberg, Geoffrey I. Webb, et al. "Associative Bandit Problem." In Encyclopedia of Machine Learning, 49. Boston, MA: Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-30164-8_39.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Mannor, Shie, Xin Jin, Jiawei Han, and Xinhua Zhang. "k-Armed Bandit." In Encyclopedia of Machine Learning, 561–63. Boston, MA: Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-30164-8_424.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Fürnkranz, Johannes, Philip K. Chan, Susan Craw, Claude Sammut, William Uther, Adwait Ratnaparkhi, Xin Jin, et al. "Multi-Armed Bandit." In Encyclopedia of Machine Learning, 699. Boston, MA: Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-30164-8_565.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Fürnkranz, Johannes, Philip K. Chan, Susan Craw, Claude Sammut, William Uther, Adwait Ratnaparkhi, Xin Jin, et al. "Multi-Armed Bandit Problem." In Encyclopedia of Machine Learning, 699. Boston, MA: Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-30164-8_566.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Mannor, Shie. "k-Armed Bandit." In Encyclopedia of Machine Learning and Data Mining, 687–90. Boston, MA: Springer US, 2017. http://dx.doi.org/10.1007/978-1-4899-7687-1_424.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Madani, Omid, Daniel J. Lizotte, and Russell Greiner. "The Budgeted Multi-armed Bandit Problem." In Learning Theory, 643–45. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004. http://dx.doi.org/10.1007/978-3-540-27819-1_46.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Munro, Paul, Hannu Toivonen, Geoffrey I. Webb, Wray Buntine, Peter Orbanz, Yee Whye Teh, Pascal Poupart, et al. "Bandit Problem with Side Information." In Encyclopedia of Machine Learning, 73. Boston, MA: Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-30164-8_54.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Munro, Paul, Hannu Toivonen, Geoffrey I. Webb, Wray Buntine, Peter Orbanz, Yee Whye Teh, Pascal Poupart, et al. "Bandit Problem with Side Observations." In Encyclopedia of Machine Learning, 73. Boston, MA: Springer US, 2011. http://dx.doi.org/10.1007/978-0-387-30164-8_55.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Agarwal, Mudit, and Naresh Manwani. "ALBIF: Active Learning with BandIt Feedbacks." In Advances in Knowledge Discovery and Data Mining, 353–64. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-05981-0_28.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Vermorel, Joannès, and Mehryar Mohri. "Multi-armed Bandit Algorithms and Empirical Evaluation." In Machine Learning: ECML 2005, 437–48. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005. http://dx.doi.org/10.1007/11564096_42.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Bandit learning"

1

das Dores, Silvia Cristina Nunes, Carlos Soares, and Duncan Ruiz. "Bandit-Based Automated Machine Learning." In 2018 7th Brazilian Conference on Intelligent Systems (BRACIS). IEEE, 2018. http://dx.doi.org/10.1109/bracis.2018.00029.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Xie, Miao, Wotao Yin, and Huan Xu. "AutoBandit: A Meta Bandit Online Learning System." In Thirtieth International Joint Conference on Artificial Intelligence {IJCAI-21}. California: International Joint Conferences on Artificial Intelligence Organization, 2021. http://dx.doi.org/10.24963/ijcai.2021/719.

Full text
Abstract:
Recently, online multi-armed bandit (MAB) research has been growing rapidly, as novel problem settings and algorithms motivated by various practical applications are being studied, building on top of the classic bandit problem. However, identifying the best bandit algorithm from many potential candidates for a given application is not only time-consuming but also reliant on human expertise, which hinders the practicality of MAB. To alleviate this problem, this paper outlines an intelligent system called AutoBandit, equipped with many out-of-the-box MAB algorithms, for automatically and adaptively choosing the best one with suitable hyper-parameters online. It effectively helps a growing application continuously maximize cumulative rewards over its whole life-cycle. With a flexible architecture and user-friendly web-based interfaces, it is very convenient for the user to integrate and monitor online bandits in a business system. At the time of publication, AutoBandit has been deployed for various industrial applications.
APA, Harvard, Vancouver, ISO, and other styles
3

Deng, Kun, Chris Bourke, Stephen Scott, Julie Sunderman, and Yaling Zheng. "Bandit-Based Algorithms for Budgeted Learning." In 2007 7th IEEE International Conference on Data Mining (ICDM '07). IEEE, 2007. http://dx.doi.org/10.1109/icdm.2007.91.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Zong, Jun, Ting Liu, Zhaowei Zhu, Xiliang Luo, and Hua Qian. "Social Bandit Learning: Strangers Can Help." In 2020 International Conference on Wireless Communications and Signal Processing (WCSP). IEEE, 2020. http://dx.doi.org/10.1109/wcsp49889.2020.9299725.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Yang, Luting, Jianyi Yang, and Shaolei Ren. "Multi-Feedback Bandit Learning with Probabilistic Contexts." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/427.

Full text
Abstract:
Contextual bandit is a classic multi-armed bandit setting, where side information (i.e., context) is available before arm selection. A standard assumption is that exact contexts are perfectly known prior to arm selection and only single feedback is returned. In this work, we focus on multi-feedback bandit learning with probabilistic contexts, where a bundle of contexts are revealed to the agent along with their corresponding probabilities at the beginning of each round. This models scenarios where contexts are drawn from the probability output of a neural network and the reward function is jointly determined by multiple feedback signals. We propose a kernelized learning algorithm based on the upper confidence bound to choose the optimal arm in a reproducing kernel Hilbert space for each context bundle. Moreover, we theoretically establish an upper bound on the cumulative regret with respect to an oracle that knows the optimal arm given probabilistic contexts, and show that the bound grows sublinearly with time. Our simulation on machine learning model recommendation further validates the sub-linearity of our cumulative regret and demonstrates that our algorithm outperforms the approach that selects arms based on the most probable context.
APA, Harvard, Vancouver, ISO, and other styles
6

Strehl, Alexander L., Chris Mesterharm, Michael L. Littman, and Haym Hirsh. "Experience-efficient learning in associative bandit problems." In the 23rd international conference. New York, New York, USA: ACM Press, 2006. http://dx.doi.org/10.1145/1143844.1143956.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

"LEARNING TO PLAY K-ARMED BANDIT PROBLEMS." In International Conference on Agents and Artificial Intelligence. SciTePress - Science and and Technology Publications, 2012. http://dx.doi.org/10.5220/0003733500740081.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Zhao, Zibo, Kiyoshi Nakayama, and Ratnesh Sharma. "Decentralized Transactive Energy Auctions with Bandit Learning." In 2019 IEEE PES Transactive Energy Systems Conference (TESC). IEEE, 2019. http://dx.doi.org/10.1109/tesc.2019.8843371.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Du, Wenbin, Huaqing Jin, Chao Yu, and Guosheng Yin. "Deep Reinforcement Learning for Bandit Arm Localization." In 2022 IEEE International Conference on Big Data (Big Data). IEEE, 2022. http://dx.doi.org/10.1109/bigdata55660.2022.10020647.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Zhang, Xiaoying, Hong Xie, and John C. S. Lui. "Heterogeneous Information Assisted Bandit Learning: Theory and Application." In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2021. http://dx.doi.org/10.1109/icde51399.2021.00213.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Reports on the topic "Bandit learning"

1

Shum, Matthew, Yingyao Hu, and Yutaka Kayaba. Nonparametric learning rules from bandit experiments: the eyes have it! Institute for Fiscal Studies, June 2010. http://dx.doi.org/10.1920/wp.cem.2010.1510.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Liu, Haoyang, Keqin Liu, and Qing Zhao. Learning in A Changing World: Non-Bayesian Restless Multi-Armed Bandit. Fort Belvoir, VA: Defense Technical Information Center, October 2010. http://dx.doi.org/10.21236/ada554798.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Olivier, Jason, and Sally Shoop. Imagery classification for autonomous ground vehicle mobility in cold weather environments. Engineer Research and Development Center (U.S.), November 2021. http://dx.doi.org/10.21079/11681/42425.

Full text
Abstract:
Autonomous ground vehicle (AGV) research for military applications is important for developing ways to remove soldiers from harm's way. Current AGV research tends toward operations in warm climates, and this leaves the vehicle at risk of failing in cold climates. To ensure AGVs can fulfill a military vehicle's role of being able to operate on- or off-road in all conditions, consideration needs to be given to terrain of all types to inform the on-board machine learning algorithms. This research aims to correlate real-time vehicle performance data with snow and ice surfaces derived from multispectral imagery, with the goal of aiding in the development of a truly all-terrain AGV. Using the image data that correlated most closely to vehicle performance, the images were classified into terrain units of most interest to mobility. The best image classification results were obtained when using Short Wave InfraRed (SWIR) band values and a supervised classification scheme, resulting in over 95% accuracy.
APA, Harvard, Vancouver, ISO, and other styles
4

Becker, Sarah, Megan Maloney, and Andrew Griffin. A multi-biome study of tree cover detection using the Forest Cover Index. Engineer Research and Development Center (U.S.), September 2021. http://dx.doi.org/10.21079/11681/42003.

Full text
Abstract:
Tree cover maps derived from satellite and aerial imagery directly support civil and military operations. However, distinguishing tree cover from other vegetative land covers is an analytical challenge. While the commonly used Normalized Difference Vegetation Index (NDVI) can identify vegetative cover, it does not consistently distinguish between tree and low-stature vegetation. The Forest Cover Index (FCI) algorithm was developed to take the multiplicative product of the red and near infrared bands and apply a threshold to separate tree cover from non-tree cover in multispectral imagery (MSI). Previous testing focused on one study site using 2-m resolution commercial MSI from WorldView-2 and 30-m resolution imagery from Landsat-7. New testing in this work used 3-m imagery from PlanetScope and 10-m imagery from Sentinel-2 at sites across 12 biomes in South and Central America and North Korea. Overall accuracy ranged between 23% and 97% for Sentinel-2 imagery and between 51% and 98% for PlanetScope imagery. Future research will focus on automating the identification of the threshold that separates tree from other land covers, exploring use of the output for machine learning applications, and incorporating ancillary data such as digital surface models and existing tree cover maps.
APA, Harvard, Vancouver, ISO, and other styles