Academic literature on the topic 'Reinforcement Learning, Multi-armed Bandits'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Reinforcement Learning, Multi-armed Bandits.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.

Journal articles on the topic "Reinforcement Learning, Multi-armed Bandits"

1

Wan, Zongqi, Zhijie Zhang, Tongyang Li, Jialin Zhang, and Xiaoming Sun. "Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 10087–94. http://dx.doi.org/10.1609/aaai.v37i8.26202.

Full text
Abstract:
Multi-arm bandit (MAB) and stochastic linear bandit (SLB) are important models in reinforcement learning, and it is well-known that classical algorithms for bandits with time horizon T suffer from the regret of at least the square root of T. In this paper, we study MAB and SLB with quantum reward oracles and propose quantum algorithms for both models with the order of the polylog T regrets, exponentially improving the dependence in terms of T. To the best of our knowledge, this is the first provable quantum speedup for regrets of bandit problems and in general exploitation in reinforcement learning. Compared to previous literature on quantum exploration algorithms for MAB and reinforcement learning, our quantum input model is simpler and only assumes quantum oracles for each individual arm.
APA, Harvard, Vancouver, ISO, and other styles
2

Ciucanu, Radu, Pascal Lafourcade, Gael Marcadet, and Marta Soare. "SAMBA: A Generic Framework for Secure Federated Multi-Armed Bandits." Journal of Artificial Intelligence Research 73 (February 23, 2022): 737–65. http://dx.doi.org/10.1613/jair.1.13163.

Full text
Abstract:
The multi-armed bandit is a reinforcement learning model where a learning agent repeatedly chooses an action (pull a bandit arm) and the environment responds with a stochastic outcome (reward) coming from an unknown distribution associated with the chosen arm. Bandits have a wide-range of application such as Web recommendation systems. We address the cumulative reward maximization problem in a secure federated learning setting, where multiple data owners keep their data stored locally and collaborate under the coordination of a central orchestration server. We rely on cryptographic schemes and propose Samba, a generic framework for Secure federAted Multi-armed BAndits. Each data owner has data associated to a bandit arm and the bandit algorithm has to sequentially select which data owner is solicited at each time step. We instantiate Samba for five bandit algorithms. We show that Samba returns the same cumulative reward as the nonsecure versions of bandit algorithms, while satisfying formally proven security properties. We also show that the overhead due to cryptographic primitives is linear in the size of the input, which is confirmed by our proof-of-concept implementation.
APA, Harvard, Vancouver, ISO, and other styles
3

Huanca-Anquise, Candy A., Ana Lúcia Cetertich Bazzan, and Anderson R. Tavares. "Multi-Objective, Multi-Armed Bandits: Algorithms for Repeated Games and Application to Route Choice." Revista de Informática Teórica e Aplicada 30, no. 1 (January 30, 2023): 11–23. http://dx.doi.org/10.22456/2175-2745.122929.

Full text
Abstract:
Multi-objective decision-making in multi-agent scenarios poses multiple challenges. Dealing with multiple objectives and non-stationarity caused by simultaneous learning are only two of them, which have been addressed separately. In this work, reinforcement learning algorithms that tackle both issues together are proposed and applied to a route choice problem, where drivers must select an action in a single-state formulation, while aiming to minimize both their travel time and toll. Hence, we deal with repeated games, now with a multi-objective approach. Advantages, limitations and differences of these algorithms are discussed. Our results show that the proposed algorithms for action selection using reinforcement learning deal with non-stationarity and multiple objectives, while providing alternative solutions to those of centralized methods.
APA, Harvard, Vancouver, ISO, and other styles
4

Giachino, Chiara, Luigi Bollani, Alessandro Bonadonna, and Marco Bertetti. "Reinforcement learning for content's customization: a first step of experimentation in Skyscanner." Industrial Management & Data Systems 121, no. 6 (January 15, 2021): 1417–34. http://dx.doi.org/10.1108/imds-12-2019-0722.

Full text
Abstract:
PurposeThe aim of the paper is to test and demonstrate the potential benefits in applying reinforcement learning instead of traditional methods to optimize the content of a company's mobile application to best help travellers finding their ideal flights. To this end, two approaches were considered and compared via simulation: standard randomized experiments or A/B testing and multi-armed bandits.Design/methodology/approachThe simulation of the two approaches to optimize the content of its mobile application and, consequently, increase flights conversions is illustrated as applied by Skyscanner, using R software.FindingsThe first results are about the comparison between the two approaches – A/B testing and multi-armed bandits – to identify the best one to achieve better results for the company. The second one is to gain experiences and suggestion in the application of the two approaches useful for other industries/companies.Research limitations/implicationsThe case study demonstrated, via simulation, the potential benefits to apply the reinforcement learning in a company. Finally, the multi-armed bandit was implemented in the company, but the period of the available data was limited, and due to its strategic relevance, the company cannot show all the findings.Practical implicationsThe right algorithm can change according to the situation and industry but would bring great benefits to the company's ability to surface content that is more relevant to users and help improving the experience for travellers. The study shows how to manage complexity and data to achieve good results.Originality/valueThe paper describes the approach used by an European leading company operating in the travel sector in understanding how to adapt reinforcement learning to its strategic goals. It presents a real case study and the simulation of the application of A/B testing and multi-armed bandit in Skyscanner; moreover, it highlights practical suggestion useful to other companies.
APA, Harvard, Vancouver, ISO, and other styles
5

Noothigattu, Ritesh, Tom Yan, and Ariel D. Procaccia. "Inverse Reinforcement Learning From Like-Minded Teachers." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 9197–204. http://dx.doi.org/10.1609/aaai.v35i10.17110.

Full text
Abstract:
We study the problem of learning a policy in a Markov decision process (MDP) based on observations of the actions taken by multiple teachers. We assume that the teachers are like-minded in that their reward functions -- while different from each other -- are random perturbations of an underlying reward function. Under this assumption, we demonstrate that inverse reinforcement learning algorithms that satisfy a certain property -- that of matching feature expectations -- yield policies that are approximately optimal with respect to the underlying reward function, and that no algorithm can do better in the worst case. We also show how to efficiently recover the optimal policy when the MDP has one state -- a setting that is akin to multi-armed bandits.
APA, Harvard, Vancouver, ISO, and other styles
6

Xiong, Guojun, Jian Li, and Rahul Singh. "Reinforcement Learning Augmented Asymptotically Optimal Index Policy for Finite-Horizon Restless Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 8726–34. http://dx.doi.org/10.1609/aaai.v36i8.20852.

Full text
Abstract:
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed as R(MA)^2B. The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state and action of the corresponding MDP. Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy entitled Occupancy-Measured-Reward Index Policy for the finite-horizon R(MA)^2B. Our index policy is well-defined without the requirement of indexability condition and is provably asymptotically optimal as the number of arms tends to infinity. We then adopt a learning perspective where the system parameters are unknown, and propose R(MA)^2B-UCB, a generative model based reinforcement learning augmented algorithm that can fully exploit the structure of Occupancy-Measured-Reward Index Policy. Compared to existing algorithms, R(MA)^2B-UCB performs close to offline optimum, and achieves a sub-linear regret and a low computational complexity all at once. Experimental results show that R(MA)^2B-UCB outperforms existing algorithms in both regret and running time.
APA, Harvard, Vancouver, ISO, and other styles
7

Huo, Xiaoguang, and Feng Fu. "Risk-aware multi-armed bandit problem with application to portfolio selection." Royal Society Open Science 4, no. 11 (November 2017): 171377. http://dx.doi.org/10.1098/rsos.171377.

Full text
Abstract:
Sequential portfolio selection has attracted increasing interest in the machine learning and quantitative finance communities in recent years. As a mathematical framework for reinforcement learning policies, the stochastic multi-armed bandit problem addresses the primary difficulty in sequential decision-making under uncertainty, namely the exploration versus exploitation dilemma, and therefore provides a natural connection to portfolio selection. In this paper, we incorporate risk awareness into the classic multi-armed bandit setting and introduce an algorithm to construct portfolio. Through filtering assets based on the topological structure of the financial market and combining the optimal multi-armed bandit policy with the minimization of a coherent risk measure, we achieve a balance between risk and return.
APA, Harvard, Vancouver, ISO, and other styles
8

Nobari, Sadegh. "DBA: Dynamic Multi-Armed Bandit Algorithm." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9869–70. http://dx.doi.org/10.1609/aaai.v33i01.33019869.

Full text
Abstract:
We introduce Dynamic Bandit Algorithm (DBA), a practical solution to improve the shortcoming of the pervasively employed reinforcement learning algorithm called Multi-Arm Bandit, aka Bandit. Bandit makes real-time decisions based on the prior observations. However, Bandit is heavily biased to the priors that it cannot quickly adapt itself to a trend that is interchanging. As a result, Bandit cannot, quickly enough, make profitable decisions when the trend is changing. Unlike Bandit, DBA focuses on quickly adapting itself to detect these trends early enough. Furthermore, DBA remains as almost as light as Bandit in terms of computations. Therefore, DBA can be easily deployed in production as a light process similar to The Bandit. We demonstrate how critical and beneficial is the main focus of DBA, i.e. the ability to quickly finding the most profitable option in real-time, over its stateof-the-art competitors. Our experiments are augmented with a visualization mechanism that explains the profitability of the decisions made by each algorithm in each step by animations. Finally we observe that DBA can substantially outperform the original Bandit by close to 3 times for a set Key Performance Indicator (KPI) in a case of having 3 arms.
APA, Harvard, Vancouver, ISO, and other styles
9

Esfandiari, Hossein, MohammadTaghi HajiAghayi, Brendan Lucier, and Michael Mitzenmacher. "Online Pandora’s Boxes and Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 1885–92. http://dx.doi.org/10.1609/aaai.v33i01.33011885.

Full text
Abstract:
We consider online variations of the Pandora’s box problem (Weitzman 1979), a standard model for understanding issues related to the cost of acquiring information for decision-making. Our problem generalizes both the classic Pandora’s box problem and the prophet inequality framework. Boxes are presented online, each with a random value and cost drawn jointly from some known distribution. Pandora chooses online whether to open each box given its cost, and then chooses irrevocably whether to keep the revealed prize or pass on it. We aim for approximation algorithms against adversaries that can choose the largest prize over any opened box, and use optimal offline policies to decide which boxes to open (without knowledge of the value inside)1. We consider variations where Pandora can collect multiple prizes subject to feasibility constraints, such as cardinality, matroid, or knapsack constraints. We also consider variations related to classic multi-armed bandit problems from reinforcement learning. Our results use a reduction-based framework where we separate the issues of the cost of acquiring information from the online decision process of which prizes to keep. Our work shows that in many scenarios, Pandora can achieve a good approximation to the best possible performance.
APA, Harvard, Vancouver, ISO, and other styles
10

Lefebvre, Germain, Christopher Summerfield, and Rafal Bogacz. "A Normative Account of Confirmation Bias During Reinforcement Learning." Neural Computation 34, no. 2 (January 14, 2022): 307–37. http://dx.doi.org/10.1162/neco_a_01455.

Full text
Abstract:
Abstract Reinforcement learning involves updating estimates of the value of states and actions on the basis of experience. Previous work has shown that in humans, reinforcement learning exhibits a confirmatory bias: when the value of a chosen option is being updated, estimates are revised more radically following positive than negative reward prediction errors, but the converse is observed when updating the unchosen option value estimate. Here, we simulate performance on a multi-arm bandit task to examine the consequences of a confirmatory bias for reward harvesting. We report a paradoxical finding: that confirmatory biases allow the agent to maximize reward relative to an unbiased updating rule. This principle holds over a wide range of experimental settings and is most influential when decisions are corrupted by noise. We show that this occurs because on average, confirmatory biases lead to overestimating the value of more valuable bandits and underestimating the value of less valuable bandits, rendering decisions overall more robust in the face of noise. Our results show how apparently suboptimal learning rules can in fact be reward maximizing if decisions are made with finite computational precision.
APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic "Reinforcement Learning, Multi-armed Bandits"

1

Magureanu, Stefan. "Structured Stochastic Bandits." Licentiate thesis, KTH, Reglerteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-182816.

Full text
Abstract:
In this thesis we address the multi-armed bandit (MAB) problem with stochastic rewards and correlated arms. Particularly, we investigate the case when the expected rewards are a Lipschitz function of the arm, and the learning to rank problem, as viewed from a MAB perspective. For the former, we derive a problem specific lower bound and propose both an asymptotically optimal algorithm (OSLB) and a (pareto)optimal, algorithm (POSLB). For the latter, we construct the regret lower bound and determine its closed form for some particular settings, as well as propose two asymptotically optimal algorithms PIE and PIE-C. For all algorithms mentioned above, we present performance analysis in the form of theoretical regret guarantees as well as numerical evaluation on artificial datasets as well as real-world datasets, in the case of PIE and PIE-C.

QC 20160223

APA, Harvard, Vancouver, ISO, and other styles
2

Talebi, Mazraeh Shahi Mohammad Sadegh. "Minimizing Regret in Combinatorial Bandits and Reinforcement Learning." Doctoral thesis, KTH, Reglerteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-219970.

Full text
Abstract:
This thesis investigates sequential decision making tasks that fall in the framework of reinforcement learning (RL). These tasks involve a decision maker repeatedly interacting with an environment modeled by an unknown finite Markov decision process (MDP), who wishes to maximize a notion of reward accumulated during her experience. Her performance can be measured through the notion of regret, which compares her accumulated expected reward against that achieved by an oracle algorithm always following an optimal behavior. In order to maximize her accumulated reward, or equivalently to minimize the regret, she needs to face a trade-off between exploration and exploitation. The first part of this thesis investigates combinatorial multi-armed bandit (MAB) problems, which are RL problems whose state-space is a singleton. It also addresses some applications that can be cast as combinatorial MAB problems. The number of arms in such problems generically grows exponentially with the number of basic actions, but the rewards of various arms are correlated. Hence, the challenge in such problems is to exploit the underlying combinatorial structure.For these problems, we derive asymptotic (i.e., when the time horizon grows large) lower bounds on the regret of any admissible algorithm and investigate how these bounds scale with the dimension of the underlying combinatorial structure. We then propose several algorithms and provide finite-time analyses of their regret. The proposed algorithms efficiently exploit the structure of the problem, provide better performance guarantees than existing algorithms, and significantly outperform these algorithms in practice. The second part of the thesis concerns RL in an unknown and discrete MDP under the average-reward criterion. We develop some variations of the transportation lemma that could serve as novel tools for the regret analysis of RL algorithms. Revisiting existing regret lower bounds allows us to derive alternative bounds, which motivate that the local variance of the bias function of the MDP, i.e., the variance with respect to next-state transition laws, could serve as a notion of problem complexity for regret minimization in RL. Leveraging these tools also allows us to report a novel regret analysis of the KL-UCRL algorithm for ergodic MDPs. The leading term in our regret bound depends on the local variance of the bias function, thus coinciding with observations obtained from our presented lower bounds. Numerical evaluations in some benchmark MDPs indicate that the leading term of the derived bound can provide an order of magnitude improvement over previously known results for this algorithm.

QC 20171215

APA, Harvard, Vancouver, ISO, and other styles
3

Hauser, Kristen. "Hyperparameter Tuning for Reinforcement Learning with Bandits and Off-Policy Sampling." Case Western Reserve University School of Graduate Studies / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=case1613034993418088.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Olkhovskaya, Julia. "Large-scale online learning under partial feedback." Doctoral thesis, Universitat Pompeu Fabra, 2022. http://hdl.handle.net/10803/673926.

Full text
Abstract:
La toma secuencial de decisiones con incertidumbre cubre una amplia clase de problemas. Las aplicaciones reales requieren que los algoritmos sean computacionalmente eficientes y escalables. Nosotros estudiamos un rango de problemas de toma secuencial de decisiones donde el agente obtiene información parcial de las recompensas, y desarrollamos algoritmos que son robustos y computacionalmente eficientes en problemas de grandes dimensiones. El primer problema que consideramos es un problema de maximización de la influencia en línea en el que el agente selecciona secuencialmente los nodos del grafo con el objetivo de esparcir la información a través de éste poniéndola en los nodos elegidos. El feedback disponible es solo alguna información sobre los vecinos más cercanos del vértice seleccionado. Nuestros resultados muestran que esas observaciones locales y parciales pueden ser suficientes para maximizar la influencia global. Proponemos algoritmos de aprendizaje secuencial que intentan maximizar la influencia y presentamos un análisis teórico tanto en los regímenes subcrítico y supercrítico de modelos gráficos. Estos son los primeros algoritmos de maximización secuencial de influencia que son eficientes en grafos con gran cantidad de nodos. En otra línea de trabajo, estudiamos el problema de bandidos multibrazos contextuales, donde la función recompensa puede cambiar de modo adverso y el agente solo puede observar las recompensas asociadas a sus acciones. Asumimos que el número de brazos es finito y que la dimensión del espacio de contextos puede ser infinita. Desarrollamos un algoritmo computacionalmente eficiente bajo la asunción que los contextos d-dimensionales son generados aleatoriamente de una distribución conocida, de forma idéntica e independiente. También proponemos un algoritmo robusto frente a errores de especificación en el caso en el que la función recompensa real es lineal con un error aditivo no lineal. Hasta nuestro conocimiento, nuestras garantías de performance constituyen los primeros resultados en este problema. También mostramos una extensión para cuando el contexto es un elemento de un espacio de hilbert con kernel reproducible. Finalmente, consideramos una extensión del problema de las máquinas tragamonedas contextuales descrito previamente. Estudiamos el caso en el que el agente interactúa con un proceso de decisión de Markov en una secuencia de episodios, donde un adversario escoge la función recompensa y solo observamos las observaciones de las recompensas de la acción elegida. Consideramos un espacio de estados arbitrariamente grande, pero asumimos que todas las funciones de acción-valor pueden ser representadas como funciones lineales en términos de un mapa de características de baja dimensión conocido, y que el agente tiene acceso a un simulador de trayectorias en el MDP. Nuestra principal contribución es el desarrollo de los primeros algoritmos que se han probado robustos y eficientes en este problema.
Sequential decision making under uncertainty covers a broad class of problems. Real-world applications require the algorithms to be computationally efficient and scalable. We study a range of sequential learning problems, where the learner observe only partial information about the rewards we develop the algorithms that are robust and computationally efficient in large-scale settings. First problem that we consider is an online influence maximization problem in which a decision maker sequentiaonally selects a node in the graph in order to spread the information throughout the graph by placing the information in the chosen node. The available feedback is only some information about a small neighbourhood of the selected vertex. Our results show that such partial local observations can be sufficient for maximizing global influence. We propose sequential learning algorithms that aim at maximizing influence, and provide their theoretical analysis in both the subcritical and supercritical regimes of broadly studied graph models. Thus this is the first algorithms in the sequential influence maximization setting, that perform efficiently in the graph with a huge number of nodes. In another line of work, we study the contextual bandit problem, where the reward function is allowed to change in an adversarial manner and the learner only gets to observe the rewards associated with its actions. We assume that the number of arms is finite and the context space can be infinite. We develop a computationally efficient algorithm under the assumption that the d-dimensional contexts are generated i.i.d. at random from a known distribution. We also propose an algorithm that is shown to be robust to misspecification in the setting where the true reward function is linear up to an additive nonlinear error. To our knowledge, our performance guarantees constitute the very first results on this problem setting. We also provide an extension when the context is an element of a reproducing kernel Hilbert space. Finally, we consider an extension of the contextual bandit problem described above. We study a setting where the learner interacts with a Markov decision process in a sequence of episodes, where an adversary chooses the reward function and the reward observations are available only for the selected action. We allow the state space to be arbitrarily large, but we assume that all action-value functions can be represented as linear functions in terms of a known low-dimensional feature map, and that the learner at least has access to the simulator of the trajectories in the MDP. Our main contributions are the first algorithms that are shown to be robust and efficient in this problem setting.
APA, Harvard, Vancouver, ISO, and other styles
5

Racey, Deborah Elaine. "EFFECTS OF RESPONSE FREQUENCY CONSTRAINTS ON LEARNING IN A NON-STATIONARY MULTI-ARMED BANDIT TASK." OpenSIUC, 2009. https://opensiuc.lib.siu.edu/dissertations/86.

Full text
Abstract:
An n-armed bandit task was used to investigate the trade-off between exploratory (choosing lesser-known options) and exploitive (choosing options with the greatest probability of reinforcement) human choice in a trial-and-error learning problem. In Experiment 1 a different probability of reinforcement was assigned to each of 8 response options using random-ratios (RRs), and participants chose by clicking buttons in a circular display on a computer screen using a computer mouse. Relative frequency thresholds (ranging from .10 to 1.0) were randomly assigned to each participant and acted as task constraints limiting the proportion of total responses that could be attributed to any response option. Preference for the richer keys was shown, and those with greater constraints explored more and earned less reinforcement. Those with the highest constraints showed no preference, distributing their responses among the options with equal probability. In Experiment 2 the payoff probabilities changed partway through, for some the leanest options increased to richest, and for others the richest became leanest. When the RRs changed, the decrease participants with moderate and low constraints showed immediate increases in exploration and change in preference to the new richest keys, while increase participants showed no increase in exploration, and more gradual changes in preference. For Experiment 3 the constraint was held constant at .85, and the two richest options were decreased midway through the task by varying amounts (0 to .60). Decreases were detected early for participants in all but the smallest decrease conditions, and exploration increased.
APA, Harvard, Vancouver, ISO, and other styles
6

Besson, Lilian. "Multi-Players Bandit Algorithms for Internet of Things Networks." Thesis, CentraleSupélec, 2019. http://www.theses.fr/2019CSUP0005.

Full text
Abstract:
Dans cette thèse de doctorat, nous étudions les réseaux sans fil et les appareils reconfigurables qui peuvent accéder à des réseaux de type radio intelligente, dans des bandes non licenciées et sans supervision centrale. Nous considérons notamment des réseaux actuels ou futurs de l’Internet des Objets (IoT), avec l’objectif d’augmenter la durée de vie de la batterie des appareils, en les équipant d’algorithmes d’apprentissage machine peu coûteux mais efficaces, qui leur permettent d’améliorer automatiquement l’efficacité de leurs communications sans fil. Nous proposons deux modèles de réseaux IoT, et nous montrons empiriquement, par des simulations numériques et une validation expérimentale réaliste, le gain que peuvent apporter nos méthodes, qui se reposent sur l’apprentissage par renforcement. Les différents problèmes d’accès au réseau sont modélisés avec des Bandits Multi-Bras (MAB), mais l’analyse de la convergence d’un grand nombre d’appareils jouant à un jeu collaboratif sans communication ni aucune coordination reste délicate, lorsque les appareils suivent tous un modèle d’activation aléatoire. Le reste de ce manuscrit étudie donc deux modèles restreints, d’abord des banditsmulti-joueurs dans des problèmes stationnaires, puis des bandits mono-joueur non stationnaires. Nous détaillons également une autre contribution, la bibliothèque Python open-source SMPyBandits, qui permet des simulations numériques de problèmes MAB, qui couvre les modèles étudiés et d’autres
In this PhD thesis, we study wireless networks and reconfigurable end-devices that can access Cognitive Radio networks, in unlicensed bands and without central control. We focus on Internet of Things networks (IoT), with the objective of extending the devices’ battery life, by equipping them with low-cost but efficient machine learning algorithms, in order to let them automatically improve the efficiency of their wireless communications. We propose different models of IoT networks, and we show empirically on both numerical simulations and real-world validation the possible gain of our methods, that use Reinforcement Learning. The different network access problems are modeled as Multi-Armed Bandits (MAB), but we found that analyzing the realistic models was intractable, because proving the convergence of many IoT devices playing a collaborative game, without communication nor coordination is hard, when they all follow random activation patterns. The rest of this manuscript thus studies two restricted models, first multi-players bandits in stationary problems, then non-stationary single-player bandits. We also detail another contribution, SMPyBandits, our open-source Python library for numerical MAB simulations, that covers all the studied models and more
APA, Harvard, Vancouver, ISO, and other styles
7

Achab, Mastane. "Ranking and risk-aware reinforcement learning." Electronic Thesis or Diss., Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAT020.

Full text
Abstract:
Les travaux de cette thèse se situent à l’interface de deux thématiques de l'apprentissage automatique : l’apprentissage de préférences d'une part, et l’apprentissage par renforcement de l'autre. La première consiste à percoler différents classements d’un même ensemble d’objets afin d’en extraire un ordre général, la seconde à identifier séquentiellement une stratégie optimale en observant des récompenses sanctionnant chaque action essayée. La structure de la thèse suit ce découpage thématique. En première partie, le paradigme de minimisation du risque empirique est utilisé à des fins d'ordonnancement. Partant du problème d’apprentissage supervisé de règles d’ordonnancement à partir de données étiquetées de façon binaire, une extension est proposée au cas où les étiquettes prennent des valeurs continues. Les critères de performance usuels dans le cas binaire, à savoir la courbe caractéristique de l’opérateur de réception (COR) et l’aire sous la courbe COR (ASC), sont étendus au cas continu : les métriques COR intégrée (CORI) et ASC intégrée (ASCI) sont introduites à cet effet. Le second problème d'ordonnancement étudié est celui de l'agrégation de classements à travers l'identification du consensus de Kemeny. En particulier, une relaxation au problème plus général de la réduction de la dimensionnalité dans l'espace des distributions sur le groupe symétrique est formulée à l'aide d'outils mathématiques empruntés à la théorie du transport optimal. La seconde partie de cette thèse s'intéresse à l'apprentissage par renforcement. Des problèmes de bandit manchot sont analysés dans des contextes où la performance moyenne n'est pas pertinente et où la gestion du risque prévaut. Enfin, le problème plus général de l'apprentissage par renforcement distributionnel, dans lequel le décideur cherche à connaître l'entière distribution de sa performance et non pas uniquement sa valeur moyenne, est considéré. De nouveaux opérateurs de programmation dynamique ainsi que leurs pendants atomiques mènent à de nouveaux algorithmes stochastiques distributionnels
This thesis divides into two parts: the first part is on ranking and the second on risk-aware reinforcement learning. While binary classification is the flagship application of empirical risk minimization (ERM), the main paradigm of machine learning, more challenging problems such as bipartite ranking can also be expressed through that setup. In bipartite ranking, the goal is to order, by means of scoring methods, all the elements of some feature space based on a training dataset composed of feature vectors with their binary labels. This thesis extends this setting to the continuous ranking problem, a variant where the labels are taking continuous values instead of being simply binary. The analysis of ranking data, initiated in the 18th century in the context of elections, has led to another ranking problem using ERM, namely ranking aggregation and more precisely the Kemeny's consensus approach. From a training dataset made of ranking data, such as permutations or pairwise comparisons, the goal is to find the single "median permutation" that best corresponds to a consensus order. We present a less drastic dimensionality reduction approach where a distribution on rankings is approximated by a simpler distribution, which is not necessarily reduced to a Dirac mass as in ranking aggregation.For that purpose, we rely on mathematical tools from the theory of optimal transport such as Wasserstein metrics. The second part of this thesis focuses on risk-aware versions of the stochastic multi-armed bandit problem and of reinforcement learning (RL), where an agent is interacting with a dynamic environment by taking actions and receiving rewards, the objective being to maximize the total payoff. In particular, a novel atomic distributional RL approach is provided: the distribution of the total payoff is approximated by particles that correspond to trimmed means
APA, Harvard, Vancouver, ISO, and other styles
8

Barkino, Iliam. "Summary Statistic Selection with Reinforcement Learning." Thesis, Uppsala universitet, Avdelningen för beräkningsvetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-390838.

Full text
Abstract:
Multi-armed bandit (MAB) algorithms could be used to select a subset of the k most informative summary statistics, from a pool of m possible summary statistics, by reformulating the subset selection problem as a MAB problem. This is suggested by experiments that tested five MAB algorithms (Direct, Halving, SAR, OCBA-m, and Racing) on the reformulated problem and comparing the results to two established subset selection algorithms (Minimizing Entropy and Approximate Sufficiency). The MAB algorithms yielded errors at par with the established methods, but in only a fraction of the time. Establishing MAB algorithms as a new standard for summary statistics subset selection could therefore save numerous scientists substantial amounts of time when selecting summary statistics for approximate bayesian computation.
APA, Harvard, Vancouver, ISO, and other styles
9

Jedor, Matthieu. "Bandit algorithms for recommender system optimization." Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASM027.

Full text
Abstract:
Dans cette thèse de doctorat, nous étudions l'optimisation des systèmes de recommandation dans le but de fournir des suggestions de produits plus raffinées pour un utilisateur.La tâche est modélisée à l'aide du cadre des bandits multi-bras.Dans une première partie, nous abordons deux problèmes qui se posent fréquemment dans les systèmes de recommandation : le grand nombre d'éléments à traiter et la gestion des contenus sponsorisés.Dans une deuxième partie, nous étudions les performances empiriques des algorithmes de bandit et en particulier comment paramétrer les algorithmes traditionnels pour améliorer les résultats dans les environnements stationnaires et non stationnaires qui l'on rencontre en pratique.Cela nous amène à analyser à la fois théoriquement et empiriquement l'algorithme glouton qui, dans certains cas, est plus performant que l'état de l'art
In this PhD thesis, we study the optimization of recommender systems with the objective of providing more refined suggestions of items for a user to benefit.The task is modeled using the multi-armed bandit framework.In a first part, we look upon two problems that commonly occured in recommendation systems: the large number of items to handle and the management of sponsored contents.In a second part, we investigate the empirical performance of bandit algorithms and especially how to tune conventional algorithm to improve results in stationary and non-stationary environments that arise in practice.This leads us to analyze both theoretically and empirically the greedy algorithm that, in some cases, outperforms the state-of-the-art
APA, Harvard, Vancouver, ISO, and other styles
10

Couetoux, Adrien. "Monte Carlo Tree Search for Continuous and Stochastic Sequential Decision Making Problems." Thesis, Paris 11, 2013. http://www.theses.fr/2013PA112192.

Full text
Abstract:
Dans cette thèse, nous avons étudié les problèmes de décisions séquentielles, avec comme application la gestion de stocks d'énergie. Traditionnellement, ces problèmes sont résolus par programmation dynamique stochastique. Mais la grande dimension, et la non convexité du problème, amènent à faire des simplifications sur le modèle pour pouvoir faire fonctionner ces méthodes.Nous avons donc étudié une méthode alternative, qui ne requiert pas de simplifications du modèle: Monte Carlo Tree Search (MCTS). Nous avons commencé par étendre le MCTS classique (qui s’applique aux domaines finis et déterministes) aux domaines continus et stochastiques. Pour cela, nous avons utilisé la méthode de Double Progressive Widening (DPW), qui permet de gérer le ratio entre largeur et profondeur de l’arbre, à l’aide de deux méta paramètres. Nous avons aussi proposé une heuristique nommée Blind Value (BV) pour améliorer la recherche de nouvelles actions, en utilisant l’information donnée par les simulations passées. D’autre part, nous avons étendu l’heuristique RAVE aux domaines continus. Enfin, nous avons proposé deux nouvelles méthodes pour faire remonter l’information dans l’arbre, qui ont beaucoup amélioré la vitesse de convergence sur deux cas tests.Une part importante de notre travail a été de proposer une façon de mêler MCTS avec des heuristiques rapides pré-existantes. C’est une idée particulièrement intéressante dans le cas de la gestion d’énergie, car ces problèmes sont pour le moment résolus de manière approchée. Nous avons montré comment utiliser Direct Policy Search (DPS) pour rechercher une politique par défaut efficace, qui est ensuite utilisée à l’intérieur de MCTS. Les résultats expérimentaux sont très encourageants.Nous avons aussi appliqué MCTS à des processus markoviens partiellement observables (POMDP), avec comme exemple le jeu de démineur. Dans ce cas, les algorithmes actuels ne sont pas optimaux, et notre approche l’est, en transformant le POMDP en MDP, par un changement de vecteur d’état.Enfin, nous avons utilisé MCTS dans un cadre de méta-bandit, pour résoudre des problèmes d’investissement. Le choix d’investissement est fait par des algorithmes de bandits à bras multiples, tandis que l’évaluation de chaque bras est faite par MCTS.Une des conclusions importantes de ces travaux est que MCTS en continu a besoin de très peu d’hypothèses (uniquement un modèle génératif du problème), converge vers l’optimum, et peut facilement améliorer des méthodes suboptimales existantes
In this thesis, we study sequential decision making problems, with a focus on the unit commitment problem. Traditionally solved by dynamic programming methods, this problem is still a challenge, due to its high dimension and to the sacrifices made on the accuracy of the model to apply state of the art methods. We investigate on the applicability of Monte Carlo Tree Search methods for this problem, and other problems that are single player, stochastic and continuous sequential decision making problems. We started by extending the traditional finite state MCTS to continuous domains, with a method called Double Progressive Widening (DPW). This method relies on two hyper parameters, and determines the ratio between width and depth in the nodes of the tree. We developed a heuristic called Blind Value (BV) to improve the exploration of new actions, using the information from past simulations. We also extended the RAVE heuristic to continuous domain. Finally, we proposed two new ways of backing up information through the tree, that improved the convergence speed considerably on two test cases.An important part of our work was to propose a way to mix MCTS with existing powerful heuristics, with the application to energy management in mind. We did so by proposing a framework that allows to learn a good default policy by Direct Policy Search (DPS), and to include it in MCTS. The experimental results are very positive.To extend the reach of MCTS, we showed how it could be used to solve Partially Observable Markovian Decision Processes, with an application to game of Mine Sweeper, for which no consistent method had been proposed before.Finally, we used MCTS in a meta-bandit framework to solve energy investment problems: the investment decision was handled by classical bandit algorithms, while the evaluation of each investment was done by MCTS.The most important take away is that continuous MCTS has almost no assumption (besides the need for a generative model), is consistent, and can easily improve existing suboptimal solvers by using a method similar to what we proposed with DPS
APA, Harvard, Vancouver, ISO, and other styles

Books on the topic "Reinforcement Learning, Multi-armed Bandits"

1

Zhao, Qing, and R. Srikant. Multi-Armed Bandits: Theory and Applications to Online Learning in Networks. Morgan & Claypool Publishers, 2019.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
2

Zhao, Qing, and R. Srikant. Multi-Armed Bandits: Theory and Applications to Online Learning in Networks. Morgan & Claypool Publishers, 2019.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
3

Zhao, Qing. Multi-Armed Bandits: Theory and Applications to Online Learning in Networks. Springer International Publishing AG, 2019.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
4

Zhao, Qing, and R. Srikant. Multi-Armed Bandits: Theory and Applications to Online Learning in Networks. Morgan & Claypool Publishers, 2019.

Find full text
APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "Reinforcement Learning, Multi-armed Bandits"

1

Rao, Ashwin, and Tikhon Jelvis. "Multi-Armed Bandits: Exploration versus Exploitation." In Foundations of Reinforcement Learning with Applications in Finance, 411–38. Boca Raton: Chapman and Hall/CRC, 2022. http://dx.doi.org/10.1201/9781003229193-15.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Kubat, Miroslav. "Reinforcement Learning: N-Armed Bandits and Episodes." In An Introduction to Machine Learning, 353–76. Cham: Springer International Publishing, 2012. http://dx.doi.org/10.1007/978-3-030-81935-4_17.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Roijers, Diederik M., Luisa M. Zintgraf, Pieter Libin, Mathieu Reymond, Eugenio Bargiacchi, and Ann Nowé. "Interactive Multi-objective Reinforcement Learning in Multi-armed Bandits with Gaussian Process Utility Models." In Machine Learning and Knowledge Discovery in Databases, 463–78. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-67664-3_28.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Combrink, Herkulaas MvE, Vukosi Marivate, and Benjamin Rosman. "Reinforcement Learning in Education: A Multi-armed Bandit Approach." In Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, 3–16. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-35883-8_1.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Antos, András, Varun Grover, and Csaba Szepesvári. "Active Learning in Multi-armed Bandits." In Lecture Notes in Computer Science, 287–302. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. http://dx.doi.org/10.1007/978-3-540-87987-9_25.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Qureshi, Ubaid, Mehreen Mushtaq, Juveeryah Qureshi, Mir Aiman, Mansha Ali, and Shahnawaz Ali. "Dynamic Pricing for Electric Vehicle Charging at a Commercial Charging Station in Presence of Uncertainty: A Multi-armed Bandit Reinforcement Learning Approach." In Proceedings of International Conference on Data Science and Applications, 625–35. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-19-6634-7_44.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Baransi, Akram, Odalric-Ambrym Maillard, and Shie Mannor. "Sub-sampling for Multi-armed Bandits." In Machine Learning and Knowledge Discovery in Databases, 115–31. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014. http://dx.doi.org/10.1007/978-3-662-44848-9_8.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Carpentier, Alexandra, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, and Peter Auer. "Upper-Confidence-Bound Algorithms for Active Learning in Multi-armed Bandits." In Lecture Notes in Computer Science, 189–203. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011. http://dx.doi.org/10.1007/978-3-642-24412-4_17.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Mendonça, Vânia, Luísa Coheur, and Alberto Sardinha. "One Arm to Rule Them All: Online Learning with Multi-armed Bandits for Low-Resource Conversational Agents." In Progress in Artificial Intelligence, 625–34. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-86230-5_49.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Zhang, Lili, Ruben Mukherjee, Piyush Wadhai, Willie Muehlhausen, and Tomas Ward. "Computational Phenotyping of Decision-Making over Voice Interfaces." In Communications in Computer and Information Science, 475–87. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-26438-2_37.

Full text
Abstract:
AbstractResearch on human reinforcement learning and decision-making behaviour has traditionally used visual-based symbols and graphics in the experimental paradigms. Such research leads to improved understanding of human decision-making and has application in fundamental research in cognitive neuroscience. In clinical domains, the approach holds out the possibility for the development of computationally-derived biomarkers suitable for use in psychiatry. Scaling this experimental approach through pervasive computing can help create larger datasets which will be necessary for normative studies. This will require the expansion of these experimental approaches beyond conventional visual representations. People receive information and interact with their environments through various senses. In particular, our sense of hearing in conjunction with speech represents a ubiquitous modality for learning and for updating our knowledge of the world. Consequently, it represents an important path for the investigation of human decision-making which is now experimentally accessible via rapid advances in voice-enabled intelligent personal assistants (IPAs). Examples include Amazon’s Alexa technology and Google’s Voice Assistant. However, to date no studies have demonstrated the feasibility of delivering such experimental paradigms over such voice technologies. Consequently in this study, we compared the performance of the same group of participants on the traditional visual-based and for the first time, a conversational voice-based, two-armed bandit task. Reinforcement learning models were fitted to the data to represent the characteristics of the underlying cognitive mechanisms in the task. Both model-independent behavioural measures and model-derived parameters were compared. The results suggest that participants demonstrated higher shifting rates in the voice-based version of the task. The computational modelling analysis revealed that participants adopted similar learning rates under the two versions of the interfaces, but more decision noise was introduced in the voice-based task as reflected by the decreased value of the inverse temperature parameter. We suggest that the elevated shifting rate is derived from the increased noise in the voice interface instead of a change in the learning strategy of the participants. Higher intensity of the control adjustments (click touch versus speak) might be one of the sources of noise, thus it is important to think further regarding the design of the voice interface if we wish to apply voice-enabled IPAs to measure human decision-making in their daily environments in the future.
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Reinforcement Learning, Multi-armed Bandits"

1

Liu, Yi-Pei, Kuo Li, Xi Cao, Qing-Shan Jia, and Xu Wang. "Quantum Reinforcement Learning for Multi-Armed Bandits." In 2022 41st Chinese Control Conference (CCC). IEEE, 2022. http://dx.doi.org/10.23919/ccc55666.2022.9902595.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Zhang, Junzhe, and Elias Bareinboim. "Transfer Learning in Multi-Armed Bandits: A Causal Approach." In Twenty-Sixth International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization, 2017. http://dx.doi.org/10.24963/ijcai.2017/186.

Full text
Abstract:
Reinforcement learning (RL) agents have been deployed in complex environments where interactions are costly, and learning is usually slow. One prominent task in these settings is to reuse interactions performed by other agents to accelerate the learning process. Causal inference provides a family of methods to infer the effects of actions from a combination of data and qualitative assumptions about the underlying environment. Despite its success of transferring invariant knowledge across domains in the empirical sciences, causal inference has not been fully realized in the context of transfer learning in interactive domains. In this paper, we use causal inference as a basis to support a principled and more robust transfer of knowledge in RL settings. In particular, we tackle the problem of transferring knowledge across bandit agents in settings where causal effects cannot be identified by do-calculus [Pearl, 2000] and standard learning techniques. Our new identification strategy combines two steps -- first, deriving bounds over the arm’s distribution based on structural knowledge; second, incorporating these bounds in a dynamic allocation procedure so as to guide the search towards more promising actions. We formally prove that our strategy dominates previously known algorithms and achieves orders of magnitude faster convergence rates than these algorithms. Finally, we perform simulations and empirically demonstrate that our strategy is consistently more efficient than the current (non-causal) state-of-the-art methods
APA, Harvard, Vancouver, ISO, and other styles
3

Yahyaa, Saba Q., Madalina M. Drugan, and Bernard Manderick. "Annealing-pareto multi-objective multi-armed bandit algorithm." In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). IEEE, 2014. http://dx.doi.org/10.1109/adprl.2014.7010619.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Jiang, Daniel, Haipeng Luo, Chu Wang, and Yingfei Wang. "Multi-Armed Bandits and Reinforcement Learning: Advancing Decision Making in E-Commerce and Beyond." In KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2021. http://dx.doi.org/10.1145/3447548.3469457.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

"EXPLOITING SIMILARITY INFORMATION IN REINFORCEMENT LEARNING - Similarity Models for Multi-Armed Bandits and MDPs." In 2nd International Conference on Agents and Artificial Intelligence. SciTePress - Science and and Technology Publications, 2010. http://dx.doi.org/10.5220/0002703002030210.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Tariq, Zain Ul Abideen, Emna Baccour, Aiman Erbad, Mohsen Guizani, and Mounir Hamdi. "Network Intrusion Detection for Smart Infrastructure using Multi-armed Bandit based Reinforcement Learning in Adversarial Environment." In 2022 International Conference on Cyber Warfare and Security (ICCWS). IEEE, 2022. http://dx.doi.org/10.1109/iccws56285.2022.9998440.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

ElSayed, Karim A., Ilias Bilionis, and Jitesh H. Panchal. "Evaluating Heuristics in Engineering Design: A Reinforcement Learning Approach." In ASME 2021 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference. American Society of Mechanical Engineers, 2021. http://dx.doi.org/10.1115/detc2021-70425.

Full text
Abstract:
Abstract Heuristics are essential for addressing the complexities of engineering design processes. The goodness of heuristics is context-dependent. Appropriately tailored heuristics can enable designers to find good solutions efficiently, and inappropriate heuristics can result in cognitive biases and inferior design outcomes. While there have been several efforts at understanding which heuristics are used by designers, there is a lack of normative understanding about when different heuristics are suitable. Towards addressing this gap, this paper presents a reinforcement learning-based approach to evaluate the goodness of heuristics for three sub-problems commonly faced by designers: (1) learning the map between the design space and the performance space, (2) acquiring sequential information, and (3) stopping the information acquisition process. Using a multi-armed bandit formulation and simulation studies, we learn the suitable heuristics for these individual sub-problems under different resource constraints and problem complexities. Additionally, we learn the optimal heuristics for the combined problem (i.e., the one composing all three sub-problems), and we compare them to ones learned at the sub-problem level. The results of our simulation study indicate that the proposed reinforcement learning-based approach can be effective for determining the quality of heuristics for different problems, and how the effectiveness of the heuristics changes as a function of the designer’s preference (e.g., performance versus cost), the complexity of the problem, and the resources available.
APA, Harvard, Vancouver, ISO, and other styles
8

Kalathil, Dileep, Naumaan Nayyar, and Rahul Jain. "Decentralized learning for multi-player multi-armed bandits." In 2012 IEEE 51st Annual Conference on Decision and Control (CDC). IEEE, 2012. http://dx.doi.org/10.1109/cdc.2012.6426587.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Sankararaman, Abishek, Ayalvadi Ganesh, and Sanjay Shakkottai. "Social Learning in Multi Agent Multi Armed Bandits." In SIGMETRICS '20: ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer Systems. New York, NY, USA: ACM, 2020. http://dx.doi.org/10.1145/3393691.3394217.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Radlinski, Filip, Robert Kleinberg, and Thorsten Joachims. "Learning diverse rankings with multi-armed bandits." In the 25th international conference. New York, New York, USA: ACM Press, 2008. http://dx.doi.org/10.1145/1390156.1390255.

Full text
APA, Harvard, Vancouver, ISO, and other styles
We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!

To the bibliography