Dissertations / Theses: 'Reinforcement Learning, Multi-armed Bandits'

1

Magureanu, Stefan. "Structured Stochastic Bandits." Licentiate thesis, KTH, Reglerteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-182816.

Full text

Abstract:

In this thesis we address the multi-armed bandit (MAB) problem with stochastic rewards and correlated arms. Particularly, we investigate the case when the expected rewards are a Lipschitz function of the arm, and the learning to rank problem, as viewed from a MAB perspective. For the former, we derive a problem specific lower bound and propose both an asymptotically optimal algorithm (OSLB) and a (pareto)optimal, algorithm (POSLB). For the latter, we construct the regret lower bound and determine its closed form for some particular settings, as well as propose two asymptotically optimal algorithms PIE and PIE-C. For all algorithms mentioned above, we present performance analysis in the form of theoretical regret guarantees as well as numerical evaluation on artificial datasets as well as real-world datasets, in the case of PIE and PIE-C.

QC 20160223

APA, Harvard, Vancouver, ISO, and other styles

2

Talebi, Mazraeh Shahi Mohammad Sadegh. "Minimizing Regret in Combinatorial Bandits and Reinforcement Learning." Doctoral thesis, KTH, Reglerteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-219970.

Full text

Abstract:

This thesis investigates sequential decision making tasks that fall in the framework of reinforcement learning (RL). These tasks involve a decision maker repeatedly interacting with an environment modeled by an unknown finite Markov decision process (MDP), who wishes to maximize a notion of reward accumulated during her experience. Her performance can be measured through the notion of regret, which compares her accumulated expected reward against that achieved by an oracle algorithm always following an optimal behavior. In order to maximize her accumulated reward, or equivalently to minimize the regret, she needs to face a trade-off between exploration and exploitation. The first part of this thesis investigates combinatorial multi-armed bandit (MAB) problems, which are RL problems whose state-space is a singleton. It also addresses some applications that can be cast as combinatorial MAB problems. The number of arms in such problems generically grows exponentially with the number of basic actions, but the rewards of various arms are correlated. Hence, the challenge in such problems is to exploit the underlying combinatorial structure.For these problems, we derive asymptotic (i.e., when the time horizon grows large) lower bounds on the regret of any admissible algorithm and investigate how these bounds scale with the dimension of the underlying combinatorial structure. We then propose several algorithms and provide finite-time analyses of their regret. The proposed algorithms efficiently exploit the structure of the problem, provide better performance guarantees than existing algorithms, and significantly outperform these algorithms in practice. The second part of the thesis concerns RL in an unknown and discrete MDP under the average-reward criterion. We develop some variations of the transportation lemma that could serve as novel tools for the regret analysis of RL algorithms. Revisiting existing regret lower bounds allows us to derive alternative bounds, which motivate that the local variance of the bias function of the MDP, i.e., the variance with respect to next-state transition laws, could serve as a notion of problem complexity for regret minimization in RL. Leveraging these tools also allows us to report a novel regret analysis of the KL-UCRL algorithm for ergodic MDPs. The leading term in our regret bound depends on the local variance of the bias function, thus coinciding with observations obtained from our presented lower bounds. Numerical evaluations in some benchmark MDPs indicate that the leading term of the derived bound can provide an order of magnitude improvement over previously known results for this algorithm.

QC 20171215

APA, Harvard, Vancouver, ISO, and other styles

3

Hauser, Kristen. "Hyperparameter Tuning for Reinforcement Learning with Bandits and Off-Policy Sampling." Case Western Reserve University School of Graduate Studies / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=case1613034993418088.

Full text

APA, Harvard, Vancouver, ISO, and other styles

4

Olkhovskaya, Julia. "Large-scale online learning under partial feedback." Doctoral thesis, Universitat Pompeu Fabra, 2022. http://hdl.handle.net/10803/673926.

Full text

Abstract:

La toma secuencial de decisiones con incertidumbre cubre una amplia clase de problemas. Las aplicaciones reales requieren que los algoritmos sean computacionalmente eficientes y escalables. Nosotros estudiamos un rango de problemas de toma secuencial de decisiones donde el agente obtiene información parcial de las recompensas, y desarrollamos algoritmos que son robustos y computacionalmente eficientes en problemas de grandes dimensiones. El primer problema que consideramos es un problema de maximización de la influencia en línea en el que el agente selecciona secuencialmente los nodos del grafo con el objetivo de esparcir la información a través de éste poniéndola en los nodos elegidos. El feedback disponible es solo alguna información sobre los vecinos más cercanos del vértice seleccionado. Nuestros resultados muestran que esas observaciones locales y parciales pueden ser suficientes para maximizar la influencia global. Proponemos algoritmos de aprendizaje secuencial que intentan maximizar la influencia y presentamos un análisis teórico tanto en los regímenes subcrítico y supercrítico de modelos gráficos. Estos son los primeros algoritmos de maximización secuencial de influencia que son eficientes en grafos con gran cantidad de nodos. En otra línea de trabajo, estudiamos el problema de bandidos multibrazos contextuales, donde la función recompensa puede cambiar de modo adverso y el agente solo puede observar las recompensas asociadas a sus acciones. Asumimos que el número de brazos es finito y que la dimensión del espacio de contextos puede ser infinita. Desarrollamos un algoritmo computacionalmente eficiente bajo la asunción que los contextos d-dimensionales son generados aleatoriamente de una distribución conocida, de forma idéntica e independiente. También proponemos un algoritmo robusto frente a errores de especificación en el caso en el que la función recompensa real es lineal con un error aditivo no lineal. Hasta nuestro conocimiento, nuestras garantías de performance constituyen los primeros resultados en este problema. También mostramos una extensión para cuando el contexto es un elemento de un espacio de hilbert con kernel reproducible. Finalmente, consideramos una extensión del problema de las máquinas tragamonedas contextuales descrito previamente. Estudiamos el caso en el que el agente interactúa con un proceso de decisión de Markov en una secuencia de episodios, donde un adversario escoge la función recompensa y solo observamos las observaciones de las recompensas de la acción elegida. Consideramos un espacio de estados arbitrariamente grande, pero asumimos que todas las funciones de acción-valor pueden ser representadas como funciones lineales en términos de un mapa de características de baja dimensión conocido, y que el agente tiene acceso a un simulador de trayectorias en el MDP. Nuestra principal contribución es el desarrollo de los primeros algoritmos que se han probado robustos y eficientes en este problema.
Sequential decision making under uncertainty covers a broad class of problems. Real-world applications require the algorithms to be computationally efficient and scalable. We study a range of sequential learning problems, where the learner observe only partial information about the rewards we develop the algorithms that are robust and computationally efficient in large-scale settings. First problem that we consider is an online influence maximization problem in which a decision maker sequentiaonally selects a node in the graph in order to spread the information throughout the graph by placing the information in the chosen node. The available feedback is only some information about a small neighbourhood of the selected vertex. Our results show that such partial local observations can be sufficient for maximizing global influence. We propose sequential learning algorithms that aim at maximizing influence, and provide their theoretical analysis in both the subcritical and supercritical regimes of broadly studied graph models. Thus this is the first algorithms in the sequential influence maximization setting, that perform efficiently in the graph with a huge number of nodes. In another line of work, we study the contextual bandit problem, where the reward function is allowed to change in an adversarial manner and the learner only gets to observe the rewards associated with its actions. We assume that the number of arms is finite and the context space can be infinite. We develop a computationally efficient algorithm under the assumption that the d-dimensional contexts are generated i.i.d. at random from a known distribution. We also propose an algorithm that is shown to be robust to misspecification in the setting where the true reward function is linear up to an additive nonlinear error. To our knowledge, our performance guarantees constitute the very first results on this problem setting. We also provide an extension when the context is an element of a reproducing kernel Hilbert space. Finally, we consider an extension of the contextual bandit problem described above. We study a setting where the learner interacts with a Markov decision process in a sequence of episodes, where an adversary chooses the reward function and the reward observations are available only for the selected action. We allow the state space to be arbitrarily large, but we assume that all action-value functions can be represented as linear functions in terms of a known low-dimensional feature map, and that the learner at least has access to the simulator of the trajectories in the MDP. Our main contributions are the first algorithms that are shown to be robust and efficient in this problem setting.

APA, Harvard, Vancouver, ISO, and other styles

5

Racey, Deborah Elaine. "EFFECTS OF RESPONSE FREQUENCY CONSTRAINTS ON LEARNING IN A NON-STATIONARY MULTI-ARMED BANDIT TASK." OpenSIUC, 2009. https://opensiuc.lib.siu.edu/dissertations/86.

Full text

Abstract:

An n-armed bandit task was used to investigate the trade-off between exploratory (choosing lesser-known options) and exploitive (choosing options with the greatest probability of reinforcement) human choice in a trial-and-error learning problem. In Experiment 1 a different probability of reinforcement was assigned to each of 8 response options using random-ratios (RRs), and participants chose by clicking buttons in a circular display on a computer screen using a computer mouse. Relative frequency thresholds (ranging from .10 to 1.0) were randomly assigned to each participant and acted as task constraints limiting the proportion of total responses that could be attributed to any response option. Preference for the richer keys was shown, and those with greater constraints explored more and earned less reinforcement. Those with the highest constraints showed no preference, distributing their responses among the options with equal probability. In Experiment 2 the payoff probabilities changed partway through, for some the leanest options increased to richest, and for others the richest became leanest. When the RRs changed, the decrease participants with moderate and low constraints showed immediate increases in exploration and change in preference to the new richest keys, while increase participants showed no increase in exploration, and more gradual changes in preference. For Experiment 3 the constraint was held constant at .85, and the two richest options were decreased midway through the task by varying amounts (0 to .60). Decreases were detected early for participants in all but the smallest decrease conditions, and exploration increased.

APA, Harvard, Vancouver, ISO, and other styles

6

Besson, Lilian. "Multi-Players Bandit Algorithms for Internet of Things Networks." Thesis, CentraleSupélec, 2019. http://www.theses.fr/2019CSUP0005.

Full text

Abstract:

Dans cette thèse de doctorat, nous étudions les réseaux sans fil et les appareils reconfigurables qui peuvent accéder à des réseaux de type radio intelligente, dans des bandes non licenciées et sans supervision centrale. Nous considérons notamment des réseaux actuels ou futurs de l’Internet des Objets (IoT), avec l’objectif d’augmenter la durée de vie de la batterie des appareils, en les équipant d’algorithmes d’apprentissage machine peu coûteux mais efficaces, qui leur permettent d’améliorer automatiquement l’efficacité de leurs communications sans fil. Nous proposons deux modèles de réseaux IoT, et nous montrons empiriquement, par des simulations numériques et une validation expérimentale réaliste, le gain que peuvent apporter nos méthodes, qui se reposent sur l’apprentissage par renforcement. Les différents problèmes d’accès au réseau sont modélisés avec des Bandits Multi-Bras (MAB), mais l’analyse de la convergence d’un grand nombre d’appareils jouant à un jeu collaboratif sans communication ni aucune coordination reste délicate, lorsque les appareils suivent tous un modèle d’activation aléatoire. Le reste de ce manuscrit étudie donc deux modèles restreints, d’abord des banditsmulti-joueurs dans des problèmes stationnaires, puis des bandits mono-joueur non stationnaires. Nous détaillons également une autre contribution, la bibliothèque Python open-source SMPyBandits, qui permet des simulations numériques de problèmes MAB, qui couvre les modèles étudiés et d’autres
In this PhD thesis, we study wireless networks and reconfigurable end-devices that can access Cognitive Radio networks, in unlicensed bands and without central control. We focus on Internet of Things networks (IoT), with the objective of extending the devices’ battery life, by equipping them with low-cost but efficient machine learning algorithms, in order to let them automatically improve the efficiency of their wireless communications. We propose different models of IoT networks, and we show empirically on both numerical simulations and real-world validation the possible gain of our methods, that use Reinforcement Learning. The different network access problems are modeled as Multi-Armed Bandits (MAB), but we found that analyzing the realistic models was intractable, because proving the convergence of many IoT devices playing a collaborative game, without communication nor coordination is hard, when they all follow random activation patterns. The rest of this manuscript thus studies two restricted models, first multi-players bandits in stationary problems, then non-stationary single-player bandits. We also detail another contribution, SMPyBandits, our open-source Python library for numerical MAB simulations, that covers all the studied models and more

APA, Harvard, Vancouver, ISO, and other styles

7

Achab, Mastane. "Ranking and risk-aware reinforcement learning." Electronic Thesis or Diss., Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAT020.

Full text

Abstract:

Les travaux de cette thèse se situent à l’interface de deux thématiques de l'apprentissage automatique : l’apprentissage de préférences d'une part, et l’apprentissage par renforcement de l'autre. La première consiste à percoler différents classements d’un même ensemble d’objets afin d’en extraire un ordre général, la seconde à identifier séquentiellement une stratégie optimale en observant des récompenses sanctionnant chaque action essayée. La structure de la thèse suit ce découpage thématique. En première partie, le paradigme de minimisation du risque empirique est utilisé à des fins d'ordonnancement. Partant du problème d’apprentissage supervisé de règles d’ordonnancement à partir de données étiquetées de façon binaire, une extension est proposée au cas où les étiquettes prennent des valeurs continues. Les critères de performance usuels dans le cas binaire, à savoir la courbe caractéristique de l’opérateur de réception (COR) et l’aire sous la courbe COR (ASC), sont étendus au cas continu : les métriques COR intégrée (CORI) et ASC intégrée (ASCI) sont introduites à cet effet. Le second problème d'ordonnancement étudié est celui de l'agrégation de classements à travers l'identification du consensus de Kemeny. En particulier, une relaxation au problème plus général de la réduction de la dimensionnalité dans l'espace des distributions sur le groupe symétrique est formulée à l'aide d'outils mathématiques empruntés à la théorie du transport optimal. La seconde partie de cette thèse s'intéresse à l'apprentissage par renforcement. Des problèmes de bandit manchot sont analysés dans des contextes où la performance moyenne n'est pas pertinente et où la gestion du risque prévaut. Enfin, le problème plus général de l'apprentissage par renforcement distributionnel, dans lequel le décideur cherche à connaître l'entière distribution de sa performance et non pas uniquement sa valeur moyenne, est considéré. De nouveaux opérateurs de programmation dynamique ainsi que leurs pendants atomiques mènent à de nouveaux algorithmes stochastiques distributionnels
This thesis divides into two parts: the first part is on ranking and the second on risk-aware reinforcement learning. While binary classification is the flagship application of empirical risk minimization (ERM), the main paradigm of machine learning, more challenging problems such as bipartite ranking can also be expressed through that setup. In bipartite ranking, the goal is to order, by means of scoring methods, all the elements of some feature space based on a training dataset composed of feature vectors with their binary labels. This thesis extends this setting to the continuous ranking problem, a variant where the labels are taking continuous values instead of being simply binary. The analysis of ranking data, initiated in the 18th century in the context of elections, has led to another ranking problem using ERM, namely ranking aggregation and more precisely the Kemeny's consensus approach. From a training dataset made of ranking data, such as permutations or pairwise comparisons, the goal is to find the single "median permutation" that best corresponds to a consensus order. We present a less drastic dimensionality reduction approach where a distribution on rankings is approximated by a simpler distribution, which is not necessarily reduced to a Dirac mass as in ranking aggregation.For that purpose, we rely on mathematical tools from the theory of optimal transport such as Wasserstein metrics. The second part of this thesis focuses on risk-aware versions of the stochastic multi-armed bandit problem and of reinforcement learning (RL), where an agent is interacting with a dynamic environment by taking actions and receiving rewards, the objective being to maximize the total payoff. In particular, a novel atomic distributional RL approach is provided: the distribution of the total payoff is approximated by particles that correspond to trimmed means

APA, Harvard, Vancouver, ISO, and other styles

8

Barkino, Iliam. "Summary Statistic Selection with Reinforcement Learning." Thesis, Uppsala universitet, Avdelningen för beräkningsvetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-390838.

Full text

Abstract:

Multi-armed bandit (MAB) algorithms could be used to select a subset of the k most informative summary statistics, from a pool of m possible summary statistics, by reformulating the subset selection problem as a MAB problem. This is suggested by experiments that tested five MAB algorithms (Direct, Halving, SAR, OCBA-m, and Racing) on the reformulated problem and comparing the results to two established subset selection algorithms (Minimizing Entropy and Approximate Sufficiency). The MAB algorithms yielded errors at par with the established methods, but in only a fraction of the time. Establishing MAB algorithms as a new standard for summary statistics subset selection could therefore save numerous scientists substantial amounts of time when selecting summary statistics for approximate bayesian computation.

APA, Harvard, Vancouver, ISO, and other styles

9

Jedor, Matthieu. "Bandit algorithms for recommender system optimization." Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASM027.

Full text

Abstract:

Dans cette thèse de doctorat, nous étudions l'optimisation des systèmes de recommandation dans le but de fournir des suggestions de produits plus raffinées pour un utilisateur.La tâche est modélisée à l'aide du cadre des bandits multi-bras.Dans une première partie, nous abordons deux problèmes qui se posent fréquemment dans les systèmes de recommandation : le grand nombre d'éléments à traiter et la gestion des contenus sponsorisés.Dans une deuxième partie, nous étudions les performances empiriques des algorithmes de bandit et en particulier comment paramétrer les algorithmes traditionnels pour améliorer les résultats dans les environnements stationnaires et non stationnaires qui l'on rencontre en pratique.Cela nous amène à analyser à la fois théoriquement et empiriquement l'algorithme glouton qui, dans certains cas, est plus performant que l'état de l'art
In this PhD thesis, we study the optimization of recommender systems with the objective of providing more refined suggestions of items for a user to benefit.The task is modeled using the multi-armed bandit framework.In a first part, we look upon two problems that commonly occured in recommendation systems: the large number of items to handle and the management of sponsored contents.In a second part, we investigate the empirical performance of bandit algorithms and especially how to tune conventional algorithm to improve results in stationary and non-stationary environments that arise in practice.This leads us to analyze both theoretically and empirically the greedy algorithm that, in some cases, outperforms the state-of-the-art

APA, Harvard, Vancouver, ISO, and other styles

10

Couetoux, Adrien. "Monte Carlo Tree Search for Continuous and Stochastic Sequential Decision Making Problems." Thesis, Paris 11, 2013. http://www.theses.fr/2013PA112192.

Full text

Abstract:

Dans cette thèse, nous avons étudié les problèmes de décisions séquentielles, avec comme application la gestion de stocks d'énergie. Traditionnellement, ces problèmes sont résolus par programmation dynamique stochastique. Mais la grande dimension, et la non convexité du problème, amènent à faire des simplifications sur le modèle pour pouvoir faire fonctionner ces méthodes.Nous avons donc étudié une méthode alternative, qui ne requiert pas de simplifications du modèle: Monte Carlo Tree Search (MCTS). Nous avons commencé par étendre le MCTS classique (qui s’applique aux domaines finis et déterministes) aux domaines continus et stochastiques. Pour cela, nous avons utilisé la méthode de Double Progressive Widening (DPW), qui permet de gérer le ratio entre largeur et profondeur de l’arbre, à l’aide de deux méta paramètres. Nous avons aussi proposé une heuristique nommée Blind Value (BV) pour améliorer la recherche de nouvelles actions, en utilisant l’information donnée par les simulations passées. D’autre part, nous avons étendu l’heuristique RAVE aux domaines continus. Enfin, nous avons proposé deux nouvelles méthodes pour faire remonter l’information dans l’arbre, qui ont beaucoup amélioré la vitesse de convergence sur deux cas tests.Une part importante de notre travail a été de proposer une façon de mêler MCTS avec des heuristiques rapides pré-existantes. C’est une idée particulièrement intéressante dans le cas de la gestion d’énergie, car ces problèmes sont pour le moment résolus de manière approchée. Nous avons montré comment utiliser Direct Policy Search (DPS) pour rechercher une politique par défaut efficace, qui est ensuite utilisée à l’intérieur de MCTS. Les résultats expérimentaux sont très encourageants.Nous avons aussi appliqué MCTS à des processus markoviens partiellement observables (POMDP), avec comme exemple le jeu de démineur. Dans ce cas, les algorithmes actuels ne sont pas optimaux, et notre approche l’est, en transformant le POMDP en MDP, par un changement de vecteur d’état.Enfin, nous avons utilisé MCTS dans un cadre de méta-bandit, pour résoudre des problèmes d’investissement. Le choix d’investissement est fait par des algorithmes de bandits à bras multiples, tandis que l’évaluation de chaque bras est faite par MCTS.Une des conclusions importantes de ces travaux est que MCTS en continu a besoin de très peu d’hypothèses (uniquement un modèle génératif du problème), converge vers l’optimum, et peut facilement améliorer des méthodes suboptimales existantes
In this thesis, we study sequential decision making problems, with a focus on the unit commitment problem. Traditionally solved by dynamic programming methods, this problem is still a challenge, due to its high dimension and to the sacrifices made on the accuracy of the model to apply state of the art methods. We investigate on the applicability of Monte Carlo Tree Search methods for this problem, and other problems that are single player, stochastic and continuous sequential decision making problems. We started by extending the traditional finite state MCTS to continuous domains, with a method called Double Progressive Widening (DPW). This method relies on two hyper parameters, and determines the ratio between width and depth in the nodes of the tree. We developed a heuristic called Blind Value (BV) to improve the exploration of new actions, using the information from past simulations. We also extended the RAVE heuristic to continuous domain. Finally, we proposed two new ways of backing up information through the tree, that improved the convergence speed considerably on two test cases.An important part of our work was to propose a way to mix MCTS with existing powerful heuristics, with the application to energy management in mind. We did so by proposing a framework that allows to learn a good default policy by Direct Policy Search (DPS), and to include it in MCTS. The experimental results are very positive.To extend the reach of MCTS, we showed how it could be used to solve Partially Observable Markovian Decision Processes, with an application to game of Mine Sweeper, for which no consistent method had been proposed before.Finally, we used MCTS in a meta-bandit framework to solve energy investment problems: the investment decision was handled by classical bandit algorithms, while the evaluation of each investment was done by MCTS.The most important take away is that continuous MCTS has almost no assumption (besides the need for a generative model), is consistent, and can easily improve existing suboptimal solvers by using a method similar to what we proposed with DPS

APA, Harvard, Vancouver, ISO, and other styles

11

Bouneffouf, Djallel. "DRARS, A Dynamic Risk-Aware Recommender System." Phd thesis, Institut National des Télécommunications, 2013. http://tel.archives-ouvertes.fr/tel-01026136.

Full text

Abstract:

L'immense quantité d'information générée et gérée au quotidien par les systèmes d'information et leurs utilisateurs conduit inéluctablement ?a la problématique de surcharge d'information. Dans ce contexte, les systèmes de recommandation traditionnels fournissent des informations pertinentes aux utilisateurs. Néanmoins, avec la propagation récente des dispositifs mobiles (Smartphones et tablettes), nous constatons une migration progressive des utilisateurs vers la manipulation d'environnements pérvasifs. Le problème avec les approches traditionnelles de recommandation est qu'elles n'utilisent pas toute l'information disponible pour produire des recommandations. Davantage d'informations contextuelles pourraient être utilisées dans le processus de recommandation pour aboutir à des recommandations plus précises. Les systèmes de recommandations sensibles au contexte (CARS) combinent les caractéristiques des systèmes sensibles au contexte et des systèmes de recommandation an de fournir des informations personnalisées aux utilisateurs dans des environnements ubiquitaires. Dans cette perspective ou tout ce qui concerne l'utilisateur est dynamique, les contenus qu'il manipule et son environnement, deux questions principales doivent être adressées : i) Comment prendre en compte la dynamicité des contenus de l'utilisateur ? et ii ) Comment éviter d'être intrusif en particulier dans des situations critiques ?. En réponse ?a ces questions, nous avons développé un système de recommandation dynamique et sensible au risque appelé DRARS (Dynamic Risk-Aware Recommender System), qui modélise la recommandation sensible au contexte comme un problème de bandit. Ce système combine une technique de filtrage basée sur le contenu et un algorithme de bandit contextuel. Nous avons montré que DRARS améliore la stratégie de l'algorithme UCB (Upper Con dence Bound), le meilleur algorithme actuellement disponible, en calculant la valeur d'exploration la plus optimale pour maintenir un compromis entre exploration et exploitation basé sur le niveau de risque de la situation courante de l'utilisateur. Nous avons mené des expériences dans un contexte industriel avec des données réelles et des utilisateurs réels et nous avons montré que la prise en compte du niveau de risque de la situation de l'utilisateur augmentait significativement la performance du système de recommandation.

APA, Harvard, Vancouver, ISO, and other styles

12

Ameen, S. A. "Optimizing deep learning networks using multi-armed bandits." Thesis, University of Salford, 2017. http://usir.salford.ac.uk/45018/.

Full text

Abstract:

Deep learning has gained significant attention recently following their successful use for applications such as computer vision, speech recognition, and natural language processing. These deep learning models are based on very large neural networks, which can require a significant amount of memory and hence limit the range of applications. Hence, this study explores methods for pruning deep learning models as a way of reducing their size, and computational time, but without sacrificing their accuracy. A literature review was carried out, revealing existing approaches for pruning, their strengths, and weaknesses. A key issue emerging from this review is that there is a trade-off between removing a weight or neuron and the potential reduction in accuracy. Thus, this study develops new algorithms for pruning that utilize a framework, known as a multi-armed bandit, which has been successfully applied in applications where there is a need to learn which option to select given the outcome of trials. There are several different multi-arm bandit methods, and these have been used to develop new algorithms including those based on the following types of multi-arm bandits: (i) Epsilon-Greedy (ii) Upper Confidence Bounds (UCB) (iii) Thompson Sampling and (iv) Exponential Weight Algorithm for Exploration and Exploitation (EXP3). The algorithms were implemented in Python and a comprehensive empirical evaluation of their performance was carried out in comparison to both the original neural network models and existing algorithms for pruning. The existing methods that are compared include: Random Pruning, Greedy Pruning, Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS). The thesis also includes an empirical comparison with a number of other learning methods such as KNN, decision trees, SVM, Naïve Bayes, LDA, QDA, logistic regression, Gaussian process classifier, kernel ridge regression, LASSO regression, linear regression, Bayesian Ridge regression, boosting, bagging and random forests. The results on the data sets show that some of the new methods (i) generalize better than the original model and most of the other methods such as KNN and decision trees (ii) outperform OBS and OBD in terms of reduction in size, generalization, and computational time (iii) outperform the greedy algorithm in terms of accuracy.

APA, Harvard, Vancouver, ISO, and other styles

13

Paraschakis, Dimitris. "Algorithmic and Ethical Aspects of Recommender Systems in e-Commerce." Licentiate thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-7792.

Full text

Abstract:

Recommender systems have become an integral part of virtually every e-commerce application on the web. These systems enable users to quickly discover relevant products, at the same time increasing business value. Over the past decades, recommender systems have been modeled using numerous machine learning techniques. However, the adoptability of these models by commercial applications remains unclear. We assess the receptiveness of the industrial sector to algorithmic contributions of the research community by surveying more than 30 e-commerce platforms, and experimenting with various recommenders on proprietary e-commerce datasets. Another overlooked but important factor that complicates the design and use of recommender systems is their ethical implications. We identify and summarize these issues in our ethical recommendation framework, which also enables users to control sensitive moral aspects of recommendations via the proposed “ethical toolbox”. The feasibility of this tool is supported by the results of our user study. Because of moral implications associated with user profiling, we investigate algorithms capable of generating user-agnostic recommendations. We propose an ensemble learning scheme based on Thompson Sampling bandit policy, which models arms as base recommendation functions. We show how to adapt this algorithm to realistic situations when neither arm availability nor reward stationarity is guaranteed.

APA, Harvard, Vancouver, ISO, and other styles

14

Audibert, Jean-Yves. "PAC-Bayesian aggregation and multi-armed bandits." Habilitation à diriger des recherches, Université Paris-Est, 2010. http://tel.archives-ouvertes.fr/tel-00843972.

Full text

Abstract:

This habilitation thesis presents several contributions to (1) the PAC-Bayesian analysis of statistical learning, (2) the three aggregation problems: given d functions, how to predict as well as (i) the best of these d functions (model selection type aggregation), (ii) the best convex combination of these d functions, (iii) the best linear combination of these d functions, (3) the multi-armed bandit problems.

APA, Harvard, Vancouver, ISO, and other styles

15

Gutowski, Nicolas. "Recommandation contextuelle de services : application à la recommandation d'évènements culturels dans la ville intelligente." Thesis, Angers, 2019. http://www.theses.fr/2019ANGE0030.

Full text

Abstract:

Les algorithmes de bandits-manchots pour les systèmes de recommandation sensibles au contexte font aujourd’hui l’objet de nombreuses études. Afin de répondre aux enjeux de cette thématique, les contributions de cette thèse sont organisées autour de 3 axes : 1) les systèmes de recommandation ; 2) les algorithmes de bandits-manchots (contextuels et non contextuels) ; 3) le contexte. La première partie de nos contributions a porté sur les algorithmes de bandits-manchots pour la recommandation. Elle aborde la diversification des recommandations visant à améliorer la précision individuelle. La seconde partie a porté sur la capture de contexte, le raisonnement contextuel pour les systèmes de recommandation d’événements culturels dans la ville intelligente, et l’enrichissement dynamique de contexte pour les algorithmes de bandits-manchots contextuels
Nowadays, Multi-Armed Bandit algorithms for context-aware recommendation systems are extensively studied. In order to meet challenges underlying this field of research, our works and contributions have been organised according to three research directions : 1) recommendation systems ; 2) Multi-Armed Bandit (MAB) and Contextual Multi-Armed Bandit algorithms (CMAB) ; 3) context.The first part of our contributions focuses on MAB and CMAB algorithms for recommendation. It particularly addresses diversification of recommendations for improving individual accuracy. The second part is focused on contextacquisition, on context reasoning for cultural events recommendation systems for Smart Cities, and on dynamic context enrichment for CMAB algorithms

APA, Harvard, Vancouver, ISO, and other styles

16

Galichet, Nicolas. "Contributions to Multi-Armed Bandits : Risk-Awareness and Sub-Sampling for Linear Contextual Bandits." Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112242/document.

Full text

Abstract:

Cette thèse s'inscrit dans le domaine de la prise de décision séquentielle en environnement inconnu, et plus particulièrement dans le cadre des bandits manchots (multi-armed bandits, MAB), défini par Robbins et Lai dans les années 50. Depuis les années 2000, ce cadre a fait l'objet de nombreuses recherches théoriques et algorithmiques centrées sur le compromis entre l'exploration et l'exploitation : L'exploitation consiste à répéter le plus souvent possible les choix qui se sont avérés les meilleurs jusqu'à présent. L'exploration consiste à essayer des choix qui ont rarement été essayés, pour vérifier qu'on a bien identifié les meilleurs choix. Les applications des approches MAB vont du choix des traitements médicaux à la recommandation dans le contexte du commerce électronique, en passant par la recherche de politiques optimales de l'énergie. Les contributions présentées dans ce manuscrit s'intéressent au compromis exploration vs exploitation sous deux angles spécifiques. Le premier concerne la prise en compte du risque. Toute exploration dans un contexte inconnu peut en effet aboutir à des conséquences indésirables ; par exemple l'exploration des comportements d'un robot peut aboutir à des dommages pour le robot ou pour son environnement. Dans ce contexte, l'objectif est d'obtenir un compromis entre exploration, exploitation, et prise de risque (EER). Plusieurs algorithmes originaux sont proposés dans le cadre du compromis EER. Sous des hypothèses fortes, l'algorithme MIN offre des garanties de regret logarithmique, à l'état de l'art ; il offre également une grande robustesse, contrastant avec la forte sensibilité aux valeurs des hyper-paramètres de e.g. (Auer et al. 2002). L'algorithme MARAB s'intéresse à un critère inspiré de la littérature économique(Conditional Value at Risk), et montre d'excellentes performances empiriques comparées à (Sani et al. 2012), mais sans garanties théoriques. Enfin, l'algorithme MARABOUT modifie l'estimation du critère CVaR pour obtenir des garanties théoriques, tout en obtenant un bon comportement empirique. Le second axe de recherche concerne le bandit contextuel, où l'on dispose d'informations additionnelles relatives au contexte de la décision ; par exemple, les variables d'état du patient dans un contexte médical ou de l'utilisateur dans un contexte de recommandation. L'étude se focalise sur le choix entre bras qu'on a tirés précédemment un nombre de fois différent. Le choix repose en général sur la notion d'optimisme, comparant les bornes supérieures des intervalles de confiance associés aux bras considérés. Une autre approche appelée BESA, reposant sur le sous-échantillonnage des valeurs tirées pour les bras les plus visités, et permettant ainsi de se ramener au cas où tous les bras ont été tirés un même nombre de fois, a été proposée par (Baransi et al. 2014)
This thesis focuses on sequential decision making in unknown environment, and more particularly on the Multi-Armed Bandit (MAB) setting, defined by Lai and Robbins in the 50s. During the last decade, many theoretical and algorithmic studies have been aimed at cthe exploration vs exploitation tradeoff at the core of MABs, where Exploitation is biased toward the best options visited so far while Exploration is biased toward options rarely visited, to enforce the discovery of the the true best choices. MAB applications range from medicine (the elicitation of the best prescriptions) to e-commerce (recommendations, advertisements) and optimal policies (e.g., in the energy domain). The contributions presented in this dissertation tackle the exploration vs exploitation dilemma under two angles. The first contribution is centered on risk avoidance. Exploration in unknown environments often has adverse effects: for instance exploratory trajectories of a robot can entail physical damages for the robot or its environment. We thus define the exploration vs exploitation vs safety (EES) tradeoff, and propose three new algorithms addressing the EES dilemma. Firstly and under strong assumptions, the MIN algorithm provides a robust behavior with guarantees of logarithmic regret, matching the state of the art with a high robustness w.r.t. hyper-parameter setting (as opposed to, e.g. UCB (Auer 2002)). Secondly, the MARAB algorithm aims at optimizing the cumulative 'Conditional Value at Risk' (CVar) rewards, originated from the economics domain, with excellent empirical performances compared to (Sani et al. 2012), though without any theoretical guarantees. Finally, the MARABOUT algorithm modifies the CVar estimation and yields both theoretical guarantees and a good empirical behavior. The second contribution concerns the contextual bandit setting, where additional informations are provided to support the decision making, such as the user details in the ontent recommendation domain, or the patient history in the medical domain. The study focuses on how to make a choice between two arms with different numbers of samples. Traditionally, a confidence region is derived for each arm based on the associated samples, and the 'Optimism in front of the unknown' principle implements the choice of the arm with maximal upper confidence bound. An alternative, pioneered by (Baransi et al. 2014), and called BESA, proceeds instead by subsampling without replacement the larger sample set. In this framework, we designed a contextual bandit algorithm based on sub-sampling without replacement, relaxing the (unrealistic) assumption that all arm reward distributions rely on the same parameter. The CL-BESA algorithm yields both theoretical guarantees of logarithmic regret and good empirical behavior

APA, Harvard, Vancouver, ISO, and other styles

17

CELLA, LEONARDO. "EFFICIENCY AND REALISM IN STOCHASTIC BANDITS." Doctoral thesis, Università degli Studi di Milano, 2021. http://hdl.handle.net/2434/807862.

Full text

Abstract:

This manuscript is dedicated to the analysis of the application of stochastic bandits to the recommender systems domain. Here a learning agent sequentially recommends one item from a catalog of available alternatives. Consequently, the environment returns a reward that is a noisy observation of the rating associated to the suggested item. The peculiarity of the bandit setting is that no information is given about not recommended products, and the collected rewards are the only information available to the learning agent. By relying on them the learner adapts his strategy towards reaching its learning objective, that is, maximizing the cumulative reward collected over all the interactions. In this dissertation we cover the investigation of two main research directions: the development of efficient learning algorithms and the introduction of a more realistic learning setting. In addressing the former objective we propose two approaches to speedup the learning process. The first solution aims to reduce the computational costs associated to the learning procedure, while the second's goal is to boost the learning phase by relying on data corresponding to terminated recommendation sessions. Regarding the latter research line, we propose a novel setting representing use-cases that do not fit in the standard bandit model.

APA, Harvard, Vancouver, ISO, and other styles

18

Liu, Fang. "Efficient Online Learning with Bandit Feedback." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587680990430268.

Full text

APA, Harvard, Vancouver, ISO, and other styles

19

Allesiardo, Robin. "Bandits Manchots sur Flux de Données Non Stationnaires." Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLS334/document.

Full text

Abstract:

Le problème des bandits manchots est un cadre théorique permettant d'étudier le compromis entre exploration et exploitation lorsque l'information observée est partielle. Dans celui-ci, un joueur dispose d'un ensemble de K bras (ou actions), chacun associé à une distribution de récompenses D(µk) de moyenne µk Є [0, 1] et de support [0, 1]. A chaque tour t Є [1, T], il choisit un bras kt et observe la récompense y kt tirée depuis D (µkt). La difficulté du problème vient du fait que le joueur observe uniquement la récompense associée au bras joué; il ne connaît pas celle qui aurait pu être obtenue en jouant un autre bras. À chaque choix, il est ainsi confronté au dilemme entre l'exploration et l'exploitation; explorer lui permet d'affiner sa connaissance des distributions associées aux bras explorés tandis qu'exploiter lui permet d'accumuler davantage de récompenses en jouant le meilleur bras empirique (sous réserve que le meilleur bras empirique soit effectivement le meilleur bras). Dans la première partie de la thèse nous aborderons le problème des bandits manchots lorsque les distributions générant les récompenses sont non-stationnaires. Nous étudierons dans un premier temps le cas où même si les distributions varient au cours du temps, le meilleur bras ne change pas. Nous étudierons ensuite le cas où le meilleur bras peut aussi changer au cours du temps. La seconde partie est consacrée aux algorithmes de bandits contextuels où les récompenses dépendent de l'état de l'environnement. Nous étudierons l'utilisation des réseaux de neurones et des forêts d'arbres dans le cas des bandits contextuels puis les différentes approches à base de méta-bandits permettant de sélectionner en ligne l'expert le plus performant durant son apprentissage
The multi-armed bandit is a framework allowing the study of the trade-off between exploration and exploitation under partial feedback. At each turn t Є [1,T] of the game, a player has to choose an arm kt in a set of K and receives a reward ykt drawn from a reward distribution D(µkt) of mean µkt and support [0,1]. This is a challeging problem as the player only knows the reward associated with the played arm and does not know what would be the reward if she had played another arm. Before each play, she is confronted to the dilemma between exploration and exploitation; exploring allows to increase the confidence of the reward estimators and exploiting allows to increase the cumulative reward by playing the empirical best arm (under the assumption that the empirical best arm is indeed the actual best arm).In the first part of the thesis, we will tackle the multi-armed bandit problem when reward distributions are non-stationary. Firstly, we will study the case where, even if reward distributions change during the game, the best arm stays the same. Secondly, we will study the case where the best arm changes during the game. The second part of the thesis tacles the contextual bandit problem where means of reward distributions are now dependent of the environment's current state. We will study the use of neural networks and random forests in the case of contextual bandits. We will then propose meta-bandit based approach for selecting online the most performant expert during its learning

APA, Harvard, Vancouver, ISO, and other styles

20

McInerney, Robert E. "Decision making under uncertainty." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:a34e87ad-8330-42df-8ba6-d55f10529331.

Full text

Abstract:

Operating and interacting in an environment requires the ability to manage uncertainty and to choose definite courses of action. In this thesis we look to Bayesian probability theory as the means to achieve the former, and find that through rigorous application of the rules it prescribes we can, in theory, solve problems of decision making under uncertainty. Unfortunately such methodology is intractable in realworld problems, and thus approximation of one form or another is inevitable. Many techniques make use of heuristic procedures for managing uncertainty. We note that such methods suffer unreliable performance and rely on the specification of ad-hoc variables. Performance is often judged according to long-term asymptotic performance measures which we also believe ignores the most complex and relevant parts of the problem domain. We therefore look to develop principled approximate methods that preserve the meaning of Bayesian theory but operate with the scalability of heuristics. We start doing this by looking at function approximation in continuous state and action spaces using Gaussian Processes. We develop a novel family of covariance functions which allow tractable inference methods to accommodate some of the uncertainty lost by not following full Bayesian inference. We also investigate the exploration versus exploitation tradeoff in the context of the Multi-Armed Bandit, and demonstrate that principled approximations behave close to optimal behaviour and perform significantly better than heuristics on a range of experimental test beds.

APA, Harvard, Vancouver, ISO, and other styles

21

Talebi, Mazraeh Shahi Mohammad Sadegh. "Online Combinatorial Optimization under Bandit Feedback." Licentiate thesis, KTH, Reglerteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-181321.

Full text

Abstract:

Multi-Armed Bandits (MAB) constitute the most fundamental model for sequential decision making problems with an exploration vs. exploitation trade-off. In such problems, the decision maker selects an arm in each round and observes a realization of the corresponding unknown reward distribution. Each decision is based on past decisions and observed rewards. The objective is to maximize the expected cumulative reward over some time horizon by balancing exploitation (arms with higher observed rewards should be selectedoften) and exploration (all arms should be explored to learn their average rewards). Equivalently, the performanceof a decision rule or algorithm can be measured through its expected regret, defined as the gap betweenthe expected reward achieved by the algorithm and that achieved by an oracle algorithm always selecting the bestarm. This thesis investigates stochastic and adversarial combinatorial MAB problems, where each arm is a collection of several basic actions taken from a set of $d$ elements, in a way that the set of arms has a certain combinatorial structure. Examples of such sets include the set of fixed-size subsets, matchings, spanning trees, paths, etc. These problems are specific forms of online linear optimization, where the decision space is a subset of $d$-dimensional hypercube.Due to the combinatorial nature, the number of arms generically grows exponentially with $d$. Hence, treating arms as independent and applying classical sequential arm selection policies would yield a prohibitive regret. It may then be crucial to exploit the combinatorial structure of the problem to design efficient arm selection algorithms.As the first contribution of this thesis, in Chapter 3 we investigate combinatorial MABs in the stochastic setting and with Bernoulli rewards. We derive asymptotic (i.e., when the time horizon grows large) lower bounds on the regret of any algorithm under bandit and semi-bandit feedback. The proposed lower bounds are problem-specific and tight in the sense that there exists an algorithm that achieves these regret bounds. Our derivation leverages some theoretical results in adaptive control of Markov chains. Under semi-bandit feedback, we further discuss the scaling of the proposed lower bound with the dimension of the underlying combinatorial structure. For the case of semi-bandit feedback, we propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms, and significantly outperforms these algorithms in practice. In the fourth chapter, we consider stochastic combinatorial MAB problems where the underlying combinatorial structure is a matroid. Specializing the results of Chapter 3 to matroids, we provide explicit regret lower bounds for this class of problems. For the case of semi-bandit feedback, we propose KL-OSM, a computationally efficient greedy-based algorithm that exploits the matroid structure. Through a finite-time analysis, we prove that the regret upper bound of KL-OSM matches the proposed lower bound, thus making it the first asymptotically optimal algorithm for this class of problems. Numerical experiments validate that KL-OSM outperforms state-of-the-art algorithms in practice, as well.In the fifth chapter, we investigate the online shortest-path routing problem which is an instance of combinatorial MABs with geometric rewards. We consider and compare three different types of online routing policies, depending (i) on where routing decisions are taken (at the source or at each node), and (ii) on the received feedback (semi-bandit or bandit). For each case, we derive the asymptotic regret lower bound. These bounds help us to understand the performance improvements we can expect when (i) taking routing decisions at each hop rather than at the source only, and (ii) observing per-link delays rather than end-to-end path delays. In particular, we show that (i) is of no use while (ii) can have a spectacular impact.For source routing under semi-bandit feedback, we then propose two algorithms with a trade-off betweencomputational complexity and performance. The regret upper bounds of these algorithms improve over those ofthe existing algorithms, and they significantly outperform state-of-the-art algorithms in numerical experiments. Finally, we discuss combinatorial MABs in the adversarial setting and under bandit feedback. We concentrate on the case where arms have the same number of basic actions but are otherwise arbitrary. We propose CombEXP, an algorithm that has the same regret scaling as state-of-the-art algorithms. Furthermore, we show that CombEXP admits lower computational complexity for some combinatorial problems.

QC 20160201

APA, Harvard, Vancouver, ISO, and other styles

22

Chafaa, Irched. "Machine learning for beam alignment in mmWave networks." Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG044.

Full text

Abstract:

Pour faire face à la croissance exponentielle du trafic des données mobiles, une solution possible est d'exploiter les larges bandes spectrales disponibles dans la partie millimétrique du spectre électromagnétique. Cependant, le signal transmis est fortement atténué, impliquant une portée de propagation limitée et un faible nombre des trajets de propagation (canal parcimonieux). Par conséquent, des faisceaux directifs doivent être utilisés pour focaliser l'énergie du signal transmis vers son utilisateur et compenser les pertes de propagation. Ces faisceaux ont besoin d'être dirigés convenablement pour garantir la fiabilité du lien de communication. Ceci représente le problème d'alignement des faisceaux pour les systèmes de communication à onde millimétrique. En effet, les faisceaux de l'émetteur et du récepteur doivent être constamment ajustés et alignés pour combattre les conditions de propagation difficiles de la bande millimétrique. De plus, les techniques d'alignement des faisceaux doivent prendre en compte la mobilité des utilisateurs et la dynamique imprévisible du réseau. Ceci mène à un fort coût de signalisation et d'entraînement qui impacte les performances des réseaux. Dans la première partie de cette thèse, nous reformulons le problème d'alignement des faisceaux en utilisant les bandits manchots (ou multi-armed bandits), pertinents dans le cas d'une dynamique du réseau imprévisibles et arbitraire (non-stationnaire ou même antagoniste). Nous proposons des méthodes en ligne et adaptatives pour aligner indépendamment les faisceaux des deux nœuds du lien de communication en utilisant seulement un seul bit de feedback. En se basant sur l'algorithme des poids exponentiels (EXP3) et le caractère parcimonieux du canal à onde millimétrique, nous proposons une version modifiée de l'algorithme original (MEXP3) avec des garanties théoriques en fonction du regret asymptotique. En outre, pour un horizon du temps fini, notre borne supérieure du regret est plus serrée que celle de l'algorithme EXP3, indiquant une meilleure performance en pratique. Nous introduisons également une deuxième modification qui utilise les corrélations temporelles entre des choix successifs des faisceaux dans une nouvelle technique d'alignement des faisceaux (NBT-MEXP3). Dans la deuxième partie de cette thèse, des outils de l'apprentissage profond sont examinés pour choisir des faisceaux dans un lien point d'accès -- utilisateur. Nous exploitons l'apprentissage profond non supervisé pour utiliser l'information des canaux au-dessous de 6 GHz afin de prédire des faisceaux dans la bande millimétrique; cette fonction canal-faisceau complexe est apprise en utilisant des données non-annotés du dataset DeepMIMO. Nous discutons aussi le choix d'une taille optimale pour le réseau de neurones en fonction du nombre des antennes de transmission et de réception au point d'accès. De plus, nous étudions l'impact de la disponibilité des données d'entraînement et introduisons une approche basée sur l'apprentissage fédéré pour prédire des faisceaux dans un réseau à plusieurs liens en partageant uniquement les paramètres des réseaux de neurones entrainés localement (et non pas les données locales). Nous envisageons les méthodes synchrones et asynchrones de l'approche par apprentissage fédéré. Nos résultats numériques montrent le potentiel de notre approche particulièrement au cas où les données d'entrainement sont peu abondantes ou imparfaites (bruitées). Enfin, nous comparons nos méthodes basées sur l'apprentissage profond avec celles de la première partie. Les simulations montrent que le choix d'une méthode convenable pour aligner les faisceaux dépend de la nature de l'application et présente un compromis entre le débit obtenu et la complexité du calcul
To cope with the ever increasing mobile data traffic, an envisioned solution for future wireless networks is to exploit the large available spectrum in the millimeter wave (mmWave) band. However, communicating at these high frequencies is very challenging as the transmitted signal suffers from strong attenuation, which leads to a limited propagation range and few multipath components (sparse mmWave channels). Hence, highly-directional beams have to be employed to focus the signal energy towards the intended user and compensate all those losses. Such beams need to be steered appropriately to guarantee a reliable communication link. This represents the so called beam alignment problem where the beams of the transmitter and the receiver need to be constantly aligned. Moreover, beam alignment policies need to support devices mobility and the unpredicted dynamics of the network, which result in significant signaling and training overhead affecting the overall performance. In the first part of the thesis, we formulate the beam alignment problem via the adversarial multi-armed bandit framework, which copes with arbitrary network dynamics including non-stationary or adversarial components. We propose online and adaptive beam alignment policies relying only on one-bit feedback to steer the beams of both nodes of the communication link in a distributed manner. Building on the well-known exponential weights algorithm (EXP3) and by exploiting the sparse nature of mmWave channels, we propose a modified policy (MEXP3), which comes with optimal theoretical guarantees in terms of asymptotic regret. Moreover, for finite horizons, our regret upper-bound is tighter than that of the original EXP3 suggesting better performance in practice. We then introduce an additional modification that accounts for the temporal correlation between successive beams and propose another beam alignment policy (NBT-MEXP3). In the second part of the thesis, deep learning tools are investigated to select mmWave beams in an access point -- user link. We leverage unsupervised deep learning to exploit the channel knowledge at sub-6 GHz and predict beamforming vectors in the mmWave band; this complex channel-beam mapping is learned via data issued from the DeepMIMO dataset and lacking the ground truth. We also show how to choose an optimal size of our neural network depending on the number of transmit and receive antennas at the access point. Furthermore, we investigate the impact of training data availability and introduce a federated learning (FL) approach to predict the beams of multiple links by sharing only the parameters of the locally trained neural networks (and not the local data). We investigate both synchronous and asynchronous FL methods. Our numerical simulations show the high potential of our approach, especially when the local available data is scarce or imperfect (noisy). At last, we compare our proposed deep learning methods with reinforcement learning methods derived in the first part. Simulations show that choosing an appropriate beam steering method depends on the target application and is a tradeoff between rate performance and computational complexity

APA, Harvard, Vancouver, ISO, and other styles

23

Wilhelmi, Roca Francesc. "Towards spatial reuse in future wireless local area networks: a sequential learning approach." Doctoral thesis, Universitat Pompeu Fabra, 2020. http://hdl.handle.net/10803/669970.

Full text

Abstract:

The Spatial Reuse (SR) operation is gaining momentum in the latest IEEE 802.11 family of standards due to the overwhelming requirements posed by next-generation wireless networks. In particular, the rising traffic requirements and the number of concurrent devices compromise the efficiency of increasingly crowded Wireless Local Area Networks (WLANs) and throw into question their decentralized nature. The SR operation, initially introduced by the IEEE~802.11ax-2021 amendment and further studied in IEEE 802.11be-2024, aims to increase the number of concurrent transmissions in an Overlapping Basic Service Set (OBSS) using sensitivity adjustment and transmit power control, thus improving spectral efficiency. Our analysis of the SR operation shows outstanding potential in improving the number of concurrent transmissions in crowded deployments, which contributed to enabling low-latency next-generation applications. However, the potential gains of SR are currently limited by the rigidity of the mechanism introduced for the 11ax, and the lack of coordination among BSSs implementing it. The SR operation is evolving towards coordinated schemes where different BSSs cooperate. Nevertheless, coordination entails communication and synchronization overhead, which impact on the performance of WLANs remains unknown. Moreover, the coordinated approach is incompatible with devices using previous IEEE 802.11 versions, potentially leading to degrading the performance of legacy networks. For those reasons, in this thesis, we start assessing the viability of decentralized SR, and thoroughly examine the main impediments and shortcomings that may result from it. We aim to shed light on the future shape of WLANs concerning SR optimization and whether their decentralized nature should be kept, or it is preferable to evolve towards coordinated and centralized deployments. To address the SR problem in a decentralized manner, we focus on Artificial Intelligence (AI) and propose using a class of sequential learning-based methods, referred to as Multi-Armed Bandits (MABs). The MAB framework suits the SR problem because it addresses the uncertainty caused by the concurrent operation of multiple devices (i.e., multi-player setting) and the lack of information in decentralized deployments. MABs can potentially overcome the complexity of the spatial interactions that result from devices modifying their sensitivity and transmit power. In this regard, our results indicate significant performance gains (up to 100\% throughput improvement) in highly dense WLAN deployments. Nevertheless, the multi-agent setting raises several concerns that may compromise network devices' performance (definition of joint goals, time-horizon convergence, scalability aspects, or non-stationarity). Besides, our analysis of multi-agent SR encompasses an in-depth study of infrastructure aspects for next-generation AI-enabled networking.
L'operació de reutilització espacial (SR) està guanyant impuls per a la darrera família d'estàndards IEEE 802.11 a causa dels aclaparadors requisits que presenten les xarxes sense fils de nova generació. En particular, la creixent necessitat de tràfic i el nombre de dispositius concurrents comprometen l'eficiència de les xarxes d'àrea local sense fils (WLANs) cada cop més concorregudes i posen en dubte la seva naturalesa descentralitzada. L'operació SR, inicialment introduïda per l'estàndard IEEE 802.11ax-2021 i estudiada posteriorment a IEEE 802.11be-2024, pretén augmentar el nombre de transmissions concurrents en un conjunt bàsic de serveis superposats (OBSS) mitjançant l'ajustament de la sensibilitat i el control de potència de transmissió, millorant així l'eficiència espectral. El nostre estudi sobre el funcionament de SR mostra un potencial destacat per millorar el nombre de transmissions simultànies en desplegaments multitudinaris, contribuint així al desenvolupament d'aplicacions de nova generació de baixa latència. Tot i això, els beneficis potencials de SR són actualment limitats per la rigidesa del mecanisme introduït per a l'11ax, i la manca de coordinació entre els BSS que ho implementen. L'operació SR evoluciona cap a esquemes coordinats on cooperen diferents BSS. En canvi, la coordinació comporta una sobrecàrrega de comunicació i sincronització, el qual té un impacte en el rendiment de les WLAN. D'altra banda, l'esquema coordinat és incompatible amb els dispositius que utilitzen versions anteriors IEEE 802.11, la qual cosa podria deteriorar el rendiment de les xarxes ja existents. Per aquests motius, en aquesta tesi s'avalua la viabilitat de mecanismes descentralitzats per a SR i s'analitzen minuciosament els principals impediments i mancances que se'n poden derivar. El nostre objectiu és donar llum a la futura forma de les WLAN pel que fa a l?optimització de SR i si s'ha de mantenir el seu caràcter descentralitzat, o bé és preferible evolucionar cap a desplegaments coordinats i centralitzats. Per abordar SR de forma descentralitzada, ens centrem en la Intel·ligència Artificial (AI) i ens proposem utilitzar una classe de mètodes seqüencials basats en l'aprenentatge, anomenats Multi-Armed Bandits (MAB). L'esquema MAB s'adapta al problema descentralitzat de SR perquè aborda la incertesa causada pel funcionament simultani de diversos dispositius (és a dir, un entorn multi-jugador) i la falta d'informació que se'n deriva. Els MAB poden fer front a la complexitat darrera les interaccions espacials entre dispositius que resulten de modificar la seva sensibilitat i potència de transmissió. En aquest sentit, els nostres resultats indiquen guanys importants de rendiment (fins al 100 \%) en desplegaments altament densos. Tot i això, l'aplicació d'aprenentatge automàtic amb múltiples agents planteja diversos problemes que poden comprometre el rendiment dels dispositius d'una xarxa (definició d'objectius conjunts, horitzó de convergència, aspectes d'escalabilitat o manca d'estacionarietat). A més, el nostre estudi d'aprenentatge multi-agent per a SR multi-agent inclou aspectes d'infraestructura per a xarxes de nova generació que integrin AI de manera intrínseca.

APA, Harvard, Vancouver, ISO, and other styles

24

Alotaibi, Faisal F. "CONTENT TRADING AND PRIVACY-AWARE PRICING FOR EFFICIENT SPECTRUM UTILIZATION." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1574698784641394.

Full text

APA, Harvard, Vancouver, ISO, and other styles

25

Cayci, Semih. "Online Learning for Optimal Control of Communication and Computing Systems." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1595516470389826.

Full text

APA, Harvard, Vancouver, ISO, and other styles

26

Selent, Douglas A. "Creating Systems and Applying Large-Scale Methods to Improve Student Remediation in Online Tutoring Systems in Real-time and at Scale." Digital WPI, 2017. https://digitalcommons.wpi.edu/etd-dissertations/308.

Full text

Abstract:

"A common problem shared amongst online tutoring systems is the time-consuming nature of content creation. It has been estimated that an hour of online instruction can take up to 100-300 hours to create. Several systems have created tools to expedite content creation, such as the Cognitive Tutors Authoring Tool (CTAT) and the ASSISTments builder. Although these tools make content creation more efficient, they all still depend on the efforts of a content creator and/or past historical. These tools do not take full advantage of the power of the crowd. These issues and challenges faced by online tutoring systems provide an ideal environment to implement a solution using crowdsourcing. I created the PeerASSIST system to provide a solution to the challenges faced with tutoring content creation. PeerASSIST crowdsources the work students have done on problems inside the ASSISTments online tutoring system and redistributes that work as a form of tutoring to their peers, who are in need of assistance. Multi-objective multi-armed bandit algorithms are used to distribute student work, which balance exploring which work is good and exploiting the best currently known work. These policies are customized to run in a real-world environment with multiple asynchronous reward functions and an infinite number of actions. Inspired by major companies such as Google, Facebook, and Bing, PeerASSIST is also designed as a platform for simultaneous online experimentation in real-time and at scale. Currently over 600 teachers (grades K-12) are requiring students to show their work. Over 300,000 instances of student work have been collected from over 18,000 students across 28,000 problems. From the student work collected, 2,000 instances have been redistributed to over 550 students who needed help over the past few months. I conducted a randomized controlled experiment to evaluate the effectiveness of PeerASSIST on student performance. Other contributions include representing learning maps as Bayesian networks to model student performance, creating a machine-learning algorithm to derive student incorrect processes from their incorrect answer and the inputs of the problem, and applying Bayesian hypothesis testing to A/B experiments. We showed that learning maps can be simplified without practical loss of accuracy and that time series data is necessary to simplify learning maps if the static data is highly correlated. I also created several interventions to evaluate the effectiveness of the buggy messages generated from the machine-learned incorrect processes. The null results of these experiments demonstrate the difficulty of creating a successful tutoring and suggest that other methods of tutoring content creation (i.e. PeerASSIST) should be explored."

APA, Harvard, Vancouver, ISO, and other styles

27

Degenne, Rémy. "Impact of structure on the design and analysis of bandit algorithms." Thesis, Université de Paris (2019-....), 2019. http://www.theses.fr/2019UNIP7179.

Full text

Abstract:

Cette thèse porte sur des problèmes d'apprentissage statistique séquentiel, dits bandits stochastiques à plusieurs bras. Dans un premier temps un algorithme de bandit est présenté. L'analyse de cet algorithme, comme la majorité des preuves usuelles de bornes de regret pour algorithmes de bandits, utilise des intervalles de confiance pour les moyennes des bras. Dans un cadre paramétrique,on prouve des inégalités de concentration quantifiant la déviation entre le paramètre d'une distribution et son estimation empirique, afin d'obtenir de tels intervalles. Ces inégalités sont exprimées en fonction de la divergence de Kullback-Leibler. Trois extensions du problème de bandits sont ensuite étudiées. Premièrement on considère le problème dit de semi-bandit combinatoire, dans lequel un algorithme choisit un ensemble de bras et la récompense de chaque bras est observée. Le regret minimal atteignable dépend alors de la corrélation entre les bras. On considère ensuite un cadre où on change le mécanisme d'obtention des observations provenant des différents bras. Une source de difficulté du problème de bandits est la rareté de l'information: seul le bras choisi est observé. On montre comment on peut tirer parti de la disponibilité d'observations supplémentaires gratuites, ne participant pas au regret. Enfin, une nouvelle famille d'algorithmes est présentée afin d'obtenir à la fois des guaranties de minimisation de regret et d'identification du meilleur bras. Chacun des algorithmes réalise un compromis entre regret et temps d'identification. On se penche dans un deuxième temps sur le problème dit d'exploration pure, dans lequel un algorithme n'est pas évalué par son regret mais par sa probabilité d'erreur quant à la réponse à une question posée sur le problème. On détermine la complexité de tels problèmes et on met au point des algorithmes approchant cette complexité
In this Thesis, we study sequential learning problems called stochastic multi-armed bandits. First a new bandit algorithm is presented. The analysis of that algorithm uses confidence intervals on the mean of the arms reward distributions, as most bandit proofs do. In a parametric setting, we derive concentration inequalities which quantify the deviation between the mean parameter of a distribution and its empirical estimation in order to obtain confidence intervals. These inequalities are presented as bounds on the Kullback-Leibler divergence. Three extensions of the stochastic multi-armed bandit problem are then studied. First we study the so-called combinatorial semi-bandit problem, in which an algorithm chooses a set of arms and the reward of each of these arms is observed. The minimal attainable regret then depends on the correlation between the arm distributions. We consider then a setting in which the observation mechanism changes. One source of difficulty of the bandit problem is the scarcity of information: only the arm pulled is observed. We show how to use efficiently eventual supplementary free information (which do not influence the regret). Finally a new family of algorithms is introduced to obtain both regret minimization and est arm identification regret guarantees. Each algorithm of the family realizes a trade-off between regret and time needed to identify the best arm. In a second part we study the so-called pure exploration problem, in which an algorithm is not evaluated on its regret but on the probability that it returns a wrong answer to a question on the arm distributions. We determine the complexity of such problems and design with performance close to that complexity

APA, Harvard, Vancouver, ISO, and other styles

28

Jouini, Wassim. "Contribution to learning and decision making under uncertainty for Cognitive Radio." Thesis, Supélec, 2012. http://www.theses.fr/2012SUPL0010/document.

Full text

Abstract:

L’allocation des ressources spectrales à des services de communications sans fil, sans cesse plus nombreux et plus gourmands, a récemment mené la communauté radio à vouloir remettre en question la stratégie de répartition des bandes de fréquences imposée depuis plus d’un siècle. En effet une étude rendue publique en 2002 par la commission fédérale des communications aux Etats-Unis (Federal Communications Commission - FCC) mit en évidence une pénurie des ressources spectrales dans une large bande de fréquences comprise entre quelques mégahertz à plusieurs gigahertz. Cependant, cette même étude expliqua cette pénurie par une allocation statique des ressources aux différents services demandeurs plutôt que par une saturation des bandes de fréquences. Cette explication fut par la suite corroborée par de nombreuses mesures d’occupation spectrale, réalisées dans plusieurs pays, qui montrèrent une forte sous-utilisation des bandes de fréquences en fonction du temps et de l’espace, représentant par conséquent autant d’opportunité spectrale inexploitée. Ces constations donnèrent naissance à un domaine en plein effervescence connu sous le nom d’Accès Opportuniste au Spectre (Opportunistic Spectrum Access). Nos travaux suggèrent l’étude de mécanismes d’apprentissage pour la radio intelligente (Cognitive Radio) dans le cadre de l’Accès Opportuniste au Spectre (AOS) afin de permettre à des équipements radio d’exploiter ces opportunités de manière autonome. Pour cela, nous montrons que les problématiques d’AOS peuvent être fidèlement représentées par des modèles d’apprentissage par renforcement. Ainsi, l’équipement radio est modélisé par un agent intelligent capable d’interagir avec son environnement afin d’en collecter des informations. Ces dernières servent à reconnaître, au fur et à mesure des expériences, les meilleurs choix (bandes de fréquences, configurations, etc.) qui s’offrent au système de communication. Nous nous intéressons au modèle particulier des bandits manchots (Multi-Armed Bandit appliqué à l’AOS). Nous discutons, lors d’une phase préliminaire, différentes solutions empruntées au domaine de l’apprentissage machine (Machine Learning). Ensuite, nous élargissons ces résultats à des cadres adaptés à la radio intelligente. Notamment, nous évaluons les performances de ces algorithmes dans le cas de réseaux d’équipements qui collaborent en prenant en compte, dans le modèle suggéré, les erreurs d’observations. On montre de plus que ces algorithmes n’ont pas besoin de connaître la fréquence des erreurs d’observation afin de converger. La vitesse de convergence dépend néanmoins de ces fréquences. Dans un second temps nous concevons un nouvel algorithme d’apprentissage destiné à répondre à des problèmes d’exploitation des ressources spectrales dans des conditions dites de fading. Tous ces travaux présupposent néanmoins la capacité de l’équipement intelligent à détecter efficacement l’activité d’autres utilisateurs sur la bande (utilisateurs prioritaires dits utilisateurs primaires). La principale difficulté réside dans le fait que l’équipement intelligent ne suppose aucune connaissance a priori sur son environnement (niveau du bruit notamment) ou sur les utilisateurs primaires. Afin de lever le doute sur l’efficacité de l’approche suggérée, nous analysons l’impact de ces incertitudes sur le détecteur d’énergie. Ce dernier prend donc le rôle d’observateur et envoie ses observations aux algorithmes d’apprentissage. Nous montrons ainsi qu’il est possible de quantifier les performances de ce détecteur dans des conditions d’incertitude sur le niveau du bruit ce qui le rend utilisable dans le contexte de la radio intelligente. Par conséquent, les algorithmes d’apprentissage utilisés pourront exploiter les résultats du détecteur malgré l’incertitude inhérente liée à l’environnement considéré et aux hypothèses (sévères) d’incertitude liées au problème analysé
During the last century, most of the meaningful frequency bands were licensed to emerging wireless applications. Because of the static model of frequency allocation, the growing number of spectrum demanding services led to a spectrum scarcity. However, recently, series of measurements on the spectrum utilization showed that the different frequency bands were underutilized (sometimes even unoccupied) and thus that the scarcity of the spectrum resource is virtual and only due to the static allocation of the different bands to specific wireless services. Moreover, the underutilization of the spectrum resource varies on different scales in time and space offering many opportunities to an unlicensed user or network to access the spectrum. Cognitive Radio (CR) and Opportunistic Spectrum Access (OSA) were introduced as possible solutions to alleviate the spectrum scarcity issue.In this dissertation, we aim at enabling CR equipments to exploit autonomously communication opportunities found in their vicinity. For that purpose, we suggest decision making mechanisms designed and/or adapted to answer CR related problems in general, and more specifically, OSA related scenarios. Thus, we argue that OSA scenarios can be modeled as Multi-Armed Bandit (MAB) problems. As a matter of fact, within OSA contexts, CR equipments are assumed to have no prior knowledge on their environment. Acquiring the necessary information relies on a sequential interaction between the CR equipment and its environment. Finally, the CR equipment is modeled as a cognitive agent whose purpose is to learn while providing an improving service to its user. Thus, firstly we analyze the performance of UCB1 algorithm when dealing with OSA problems with imperfect sensing. More specifically, we show that UCB1 can efficiently cope with sensing errors. We prove its convergence to the optimal channel and quantify its loss of performance compared to the case with perfect sensing. Secondly, we combine UCB1 algorithm with collaborative and coordination mechanism to model a secondary network (i.e. several SUs). We show that within this complex scenario, a coordinated learning mechanism can lead to efficient secondary networks. These scenarios assume that a SU can efficiently detect incumbent users’ activity while having no prior knowledge on their characteristics. Usually, energy detection is suggested as a possible approach to handle such task. Unfortunately, energy detection in known to perform poorly when dealing with uncertainty. Consequently, we ventured in this Ph.D. to revisit the problem of energy detection limits under uncertainty. We present new results on its performances as well as its limits when the noise level is uncertain and the uncertainty is modeled by a log-normal distribution (as suggested by Alexander Sonnenschein and Philip M. Fishman in 1992). Within OSA contexts, we address a final problem where a sensor aims at quantifying the quality of a channel in fading environments. In such contexts, UCB1 algorithms seem to fail. Consequently, we designed a new algorithm called Multiplicative UCB (UCB) and prove its convergence. Moreover, we prove that MUCB algorithms are order optimal (i.e., the order of their learning rate is optimal). This last work provides a contribution that goes beyond CR and OSA. As a matter of fact, MUCB algorithms are introduced and solved within a general MAB framework

APA, Harvard, Vancouver, ISO, and other styles

29

Collet, Timothé. "Méthodes optimistes d’apprentissage actif pour la classification." Thesis, Université de Lorraine, 2016. http://www.theses.fr/2016LORR0084/document.

Full text

Abstract:

La classification se base sur un jeu de données étiquetées par un expert. Plus le jeu de données est grand, meilleure est la performance de classification. Pourtant, la requête à un expert peut parfois être coûteuse. Le but de l'apprentissage actif est alors de minimiser le nombre de requêtes à l'expert. La collection des données non-étiquetées reste aisée cependant et illimitée, il est donc nécessaire de faire un choix sur les données à annoter, l'idée est alors de profiter de ce choix pour maximiser les performances en ne lui fournissant que les données les plus informatives à étiqueter. Pourtant, le niveau d'informativité de chaque donnée ne peut pas être calculé exactement et ne peut être estimé qu'à une incertitude près. Améliorer la précision de l'estimation nécessite d'annoter de nouvelles données. Il y a donc un dilemme entre utiliser le budget d'annotations disponible pour améliorer la performance du classifieur selon l'estimation actuelle du critère ou pour améliorer la précision sur le critère. Ce dilemme est bien connu dans le cadre de l'optimisation en budget fini sous le nom de dilemme entre exploration et exploitation. Les solutions usuelles pour résoudre ce dilemme dans ce contexte font usage du principe d'Optimisme Face à l'Incertitude. Dans cette thèse, nous montrons donc qu'il est possible d'adapter ce principe au problème d'apprentissage actif pour la classification. Pour cela, plusieurs algorithmes ont été être développés pour des classifieurs de complexité croissante, chacun utilisant le principe de l'Optimisme Face à l'Incertitude, et leurs résultats ont été évalués empiriquement
A Classification problem makes use of a training set consisting of data labeled by an oracle. The larger the training set, the best the performance. However, requesting the oracle may be costly. The goal of Active Learning is thus to minimize the number of requests to the oracle while achieving the best performance. To do so, the data that are presented to the oracle must be carefully selected among a large number of unlabeled instances acquired at no cost. However, the true profitability of labeling a particular instance may not be known perfectly. It can therefore be estimated along with a measure of uncertainty. To Increase the precision on the estimate, we need to label more data. Thus, there is a dilemma between labeling data in order to increase the performance of the classifier or to better know how to select data. This dilemma is well studied in the context of finite budget optimization under the name of exploration versus exploitation dilemma. The most famous solutions make use of the principle of Optimism in the Face of Uncertainty. In this thesis, we show that it is possible to adapt this principle to the active learning problem for classification. Several algorithms have been developed for classifiers of increasing complexity, each one of them using the principle of Optimism in the Face of Uncertainty, and their performances have been empirically evaluated

APA, Harvard, Vancouver, ISO, and other styles

30

Guillou, Frédéric. "On recommendation systems in a sequential context." Thesis, Lille 3, 2016. http://www.theses.fr/2016LIL30041/document.

Full text

Abstract:

Cette thèse porte sur l'étude des Systèmes de Recommandation dans un cadre séquentiel, où les retours des utilisateurs sur des articles arrivent dans le système l'un après l'autre. Après chaque retour utilisateur, le système doit le prendre en compte afin d'améliorer les recommandations futures. De nombreuses techniques de recommandation ou méthodologies d'évaluation ont été proposées par le passé pour les problèmes de recommandation. Malgré cela, l'évaluation séquentielle, qui est pourtant plus réaliste et se rapproche davantage du cadre d'évaluation d'un vrai système de recommandation, a été laissée de côté. Le contexte séquentiel nécessite de prendre en considération différents aspects non visibles dans un contexte fixe. Le premier de ces aspects est le dilemme dit d'exploration vs. exploitation: le modèle effectuant les recommandations doit trouver le bon compromis entre recueillir de l'information sur les goûts des utilisateurs à travers des étapes d'exploration, et exploiter la connaissance qu'il a à l'heure actuelle pour maximiser le feedback reçu. L'importance de ce premier point est mise en avant à travers une première évaluation, et nous proposons une approche à la fois simple et efficace, basée sur la Factorisation de Matrice et un algorithme de Bandit Manchot, pour produire des recommandations appropriées. Le second aspect pouvant apparaître dans le cadre séquentiel surgit dans le cas où une liste ordonnée d'articles est recommandée au lieu d'un seul article. Dans cette situation, le feedback donné par l'utilisateur est multiple: la partie explicite concerne la note donnée par l'utilisateur concernant l'article choisi, tandis que la partie implicite concerne les articles cliqués (ou non cliqués) parmi les articles de la liste. En intégrant les deux parties du feedback dans un modèle d'apprentissage, nous proposons une approche basée sur la Factorisation de Matrice, qui peut recommander de meilleures listes ordonnées d'articles, et nous évaluons cette approche dans un contexte séquentiel particulier pour montrer son efficacité
This thesis is dedicated to the study of Recommendation Systems under a sequential setting, where the feedback given by users on items arrive one after another in the system. After each feedback, the system has to integrate it and try to improve future recommendations. Many techniques or evaluation methods have already been proposed to study the recommendation problem. Despite that, such sequential setting, which is more realistic and represent a closer framework to a real Recommendation System evaluation, has surprisingly been left aside. Under a sequential context, recommendation techniques need to take into consideration several aspects which are not visible for a fixed setting. The first one is the exploration-exploitation dilemma: the model making recommendations needs to find a good balance between gathering information about users' tastes or items through exploratory recommendation steps, and exploiting its current knowledge of the users and items to try to maximize the feedback received. We highlight the importance of this point through the first evaluation study and propose a simple yet efficient approach to make effective recommendation, based on Matrix Factorization and Multi-Armed Bandit algorithms. The second aspect emphasized by the sequential context appears when a list of items is recommended to the user instead of a single item. In such a case, the feedback given by the user includes two parts: the explicit feedback as the rating, but also the implicit feedback given by clicking (or not clicking) on other items of the list. By integrating both feedback into a Matrix Factorization model, we propose an approach which can suggest better ranked list of items, and we evaluate it in a particular setting

APA, Harvard, Vancouver, ISO, and other styles

31

Ray, Chowdhury Sayak. "Reinforcement Learning in Large and Structured Environments." Thesis, 2021. https://etd.iisc.ac.in/handle/2005/5744.

Full text

Abstract:

In a reinforcement learning (RL) problem, a learner takes actions to control the state of an initially unknown environment so as to maximize the sum of the rewards he obtains. This has several applications in many practical sequential decision-making systems like recommendation systems, sequential investment and portfolio allocation, dynamic resource allocation in communication systems, targeted advertising and pricing, to name a few. In these online learning settings, there is no separate time available to purely exploring the unknown environment; rather, one must carefully trade-off exploration and exploitation. The learner seeks to minimize the regret or shortfall in total reward earned from all its decisions, compared to an optimal decision. In this thesis, we study online RL algorithms for large and structured environments, where generalization across unseen states and actions is crucial to achieving significantly low regret. In the first part of this thesis, we study non-parametric approaches to model reward structure in the multi-armed bandit problem -- a stateless i.i.d. non-Markovian environment -- with a very large (potentially infinite) number of arms. This problem can be formulated as learning to optimize a fixed but unknown function over a continuous domain, with the function evaluations being noisy and expensive. This is a generic problem appearing in several applications like hyper-parameter tuning in machine learning models, sensor selection, experimental design, to name a few. We model the function's uncertainty using the reproducing kernel Hilbert space (RKHS) compatible with a kernel on the function's domain. Adapting techniques from Gaussian process regression, we develop upper confidence bound and Thompson sampling-based algorithms for continuous bandit optimization and derive corresponding regret guarantees assuming the reward distribution to be sub-Gaussian. Along the way, we derive a self-normalized concentration inequality for RKHS-valued martingales of arbitrary, possibly infinite, dimension, which might be of independent interest. We further extend these results to kernelized bandit environments with heavy-tailed reward distributions. By employing an efficient reward-truncation strategy, we design an algorithm which is robust to heavy-tailed corruptions in rewards. We show that the regret performance of our algorithm is (almost) optimal in a widely used setting by deriving a fundamental lower bound for this problem. We conclude this part of this thesis by extending the kernelized bandit problem to the multi-objective optimization setting and designing low-regret strategies for the same. In the second part of the thesis, we study regret minimization strategies for episodic, average-reward Markov decision processes (MDPs) with infinite states and actions. We consider both parametric and non-parametric MDPs. In the non-parametric setting, taking inspiration from kernelized bandits, we model uncertainty in mean reward and transitions using RKHSs compatible with kernels defined jointly on state-action spaces. In the parametric setting, we introduce an MDP transition model defined by bilinear exponential families with an unknown finite-dimensional parameter. For both settings, we design variants of upper confidence RL and posterior sampling RL strategies. We prove finite-time regret guarantees for the proposed algorithms that depend on the structures of the underlying models. For linear exponential family MDPs, the regret bounds are proportional to the dimension of the parameter vector and for kernelizable MDPs, the bounds are expressed using effective dimensions of respective RKHSs. Moreover, all the results represent a generalization of existing approaches for tabular finite-state, finite-action MDPs, and parametric linear quadratic regulators.
Google India PhD Fellowship

APA, Harvard, Vancouver, ISO, and other styles

32

Lattimore, Finnian Rachel. "Learning how to act: making good decisions with machine learning." Phd thesis, 2017. http://hdl.handle.net/1885/144602.

Full text

Abstract:

This thesis is about machine learning and statistical approaches to decision making. How can we learn from data to anticipate the consequence of, and optimally select, interventions or actions? Problems such as deciding which medication to prescribe to patients, who should be released on bail, and how much to charge for insurance are ubiquitous, and have far reaching impacts on our lives. There are two fundamental approaches to learning how to act: reinforcement learning, in which an agent directly intervenes in a system and learns from the outcome, and observational causal inference, whereby we seek to infer the outcome of an intervention from observing the system. The goal of this thesis to connect and unify these key approaches. I introduce causal bandit problems: a synthesis that combines causal graphical models, which were developed for observational causal inference, with multi-armed bandit problems, which are a subset of reinforcement learning problems that are simple enough to admit formal analysis. I show that knowledge of the causal structure allows us to transfer information learned about the outcome of one action to predict the outcome of an alternate action, yielding a novel form of structure between bandit arms that cannot be exploited by existing algorithms. I propose an algorithm for causal bandit problems and prove bounds on the simple regret demonstrating it is close to mini-max optimal and better than algorithms that do not use the additional causal information.

APA, Harvard, Vancouver, ISO, and other styles

33

Palanisamy, Sundar Agnideven. "Learning-based Attack and Defense on Recommender Systems." Thesis, 2021. http://dx.doi.org/10.7912/C2/65.

Full text

Abstract:

Indiana University-Purdue University Indianapolis (IUPUI)
The internet is the home for massive volumes of valuable data constantly being created, making it difficult for users to find information relevant to them. In recent times, online users have been relying on the recommendations made by websites to narrow down the options. Online reviews have also become an increasingly important factor in the final choice of a customer. Unfortunately, attackers have found ways to manipulate both reviews and recommendations to mislead users. A Recommendation System is a special type of information filtering system adapted by online vendors to provide suggestions to their customers based on their requirements. Collaborative filtering is one of the most widely used recommendation systems; unfortunately, it is prone to shilling/profile injection attacks. Such attacks alter the recommendation process to promote or demote a particular product. On the other hand, many spammers write deceptive reviews to change the credibility of a product/service. This work aims to address these issues by treating the review manipulation and shilling attack scenarios independently. For the shilling attacks, we build an efficient Reinforcement Learning-based shilling attack method. This method reduces the uncertainty associated with the item selection process and finds the most optimal items to enhance attack reach while treating the recommender system as a black box. Such practical online attacks open new avenues for research in building more robust recommender systems. When it comes to review manipulations, we introduce a method to use a deep structure embedding approach that preserves highly nonlinear structural information and the dynamic aspects of user reviews to identify and cluster the spam users. It is worth mentioning that, in the experiment with real datasets, our method captures about 92\% of all spam reviewers using an unsupervised learning approach.

APA, Harvard, Vancouver, ISO, and other styles

34

(11190282), Agnideven Palanisamy Sundar. "Learning-based Attack and Defense on Recommender Systems." Thesis, 2021.

Find full text

Abstract:

The internet is the home for massive volumes of valuable data constantly being created, making it difficult for users to find information relevant to them. In recent times, online users have been relying on the recommendations made by websites to narrow down the options. Online reviews have also become an increasingly important factor in the final choice of a customer. Unfortunately, attackers have found ways to manipulate both reviews and recommendations to mislead users. A Recommendation System is a special type of information filtering system adapted by online vendors to provide suggestions to their customers based on their requirements. Collaborative filtering is one of the most widely used recommendation systems; unfortunately, it is prone to shilling/profile injection attacks. Such attacks alter the recommendation process to promote or demote a particular product. On the other hand, many spammers write deceptive reviews to change the credibility of a product/service. This work aims to address these issues by treating the review manipulation and shilling attack scenarios independently. For the shilling attacks, we build an efficient Reinforcement Learning-based shilling attack method. This method reduces the uncertainty associated with the item selection process and finds the most optimal items to enhance attack reach while treating the recommender system as a black box. Such practical online attacks open new avenues for research in building more robust recommender systems. When it comes to review manipulations, we introduce a method to use a deep structure embedding approach that preserves highly nonlinear structural information and the dynamic aspects of user reviews to identify and cluster the spam users. It is worth mentioning that, in the experiment with real datasets, our method captures about 92\% of all spam reviewers using an unsupervised learning approach.

APA, Harvard, Vancouver, ISO, and other styles

35

Siddartha, Y. R. "Learning Tournament Solutions from Preference-based Multi-Armed Bandits." Thesis, 2017. https://etd.iisc.ac.in/handle/2005/4698.

Full text

Abstract:

We consider the dueling bandits problem, a sequential decision task where the goal is to learn to pick `good' arms out of an available pool by actively querying for and observing relative preferences between selected pairs of arms. The noisy observed preferences are assumed to be generated by a fixed but unknown stochastic preference model. Motivated by applications in information retrieval, e-commerce, crowdsourcing, etc., a number of bandit algorithms have been proposed in recent years for this task. These have mostly addressed restricted settings wherein the underlying preference model satisfies various structural assumptions. such as being based on a random utility function or a feature-space embedding, or satisfying transitivity or sparsity properties, or at least possessing a Condorcet winner { a single `best ‘arm that is preferred to all others. In seeking to move beyond such restricted settings, there has been a recent shift towards alternative notions of `good' arms (including Borda, Copeland and von Neumann winners). In this work, we extend the dueling bandits problem by adopting, as the desired target set of good (or 'winning') arms, a number of tournament solutions that have been proposed in social choice and voting theory literature as embodying principled and natural criteria for identifying good arms based on preference relations. We then propose a family of upper confidence bound (UCB) based dueling bandit algorithms that learn to play winning arms from several popular tournament solutions, the top cycle, uncovered set, Banks set and Copeland set. We derive these algorithms by first proposing a generic UCB-based framework algorithm that can be instantiated for different tournament solutions by means of appropriately designed `selection procedures. We show sufficiency conditions for the resulting dueling bandit algorithms to satisfy distribution-dependent, horizon-free bounds on natural regret measures defined w.r.t. the target tournament solutions. In contrast to previous work, these bounds do not require restrictive structural assumptions on the preference model and hold for a range of different tournament solutions. We develop selection procedures that satisfy the sufficiency conditions for a number of popular tournament solutions, yielding dueling bandit algorithms UCB-TC, UCB-UC, UCB-BA and UCB-CO for the top cycle, uncovered set, Banks set and the Copeland set respectively. The O_K2 ln T g2 _ bounds we derive are optimal in their dependence on the time horizon T. We show that for all of these tournament solutions, the distribution-dependent `margin' g is lower bounded by the separation or the relative advantage of top cycle arms over non-top cycle arms. While O(K ln T) bounds are known for Condorcet models, our O(K2 ln T) bounds extend to more general models as well as other tournament solutions. We empirically validate these claims and evaluate the proposed algorithms, comparing them to dueling bandit algorithms RUCB, SAVAGE and BTMB over synthetic and real-world preference models. We show that the UCB-TS algorithms perform competitively over models that possess a Condorcet winner, but out-perform the other algorithms over more general models that do not possess a Condorcet winner.

APA, Harvard, Vancouver, ISO, and other styles

36

"Sample-efficient learning algorithms for multi-armed bandits and tensor completion: 基於有限樣本的機器學習算法." 2015. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1291960.

Full text

Abstract:

Chen, Shouyuan.
Thesis Ph.D. Chinese University of Hong Kong 2015.
Includes bibliographical references (leaves 211-228).
Abstracts also in Chinese.
Title from PDF title page (viewed on 07, December, 2016).
Chen, Shouyuan.

APA, Harvard, Vancouver, ISO, and other styles

37

Narayanaprasad, Karthik Periyapattana. "Sequential Controlled Sensing to Detect an Anomalous Process." Thesis, 2021. https://etd.iisc.ac.in/handle/2005/5514.

Full text

Abstract:

In this thesis, we study the problem of identifying an anomalous arm in a multi-armed bandit as quickly as possible, subject to an upper bound on the error probability. Also known as odd arm identification, this problem falls within the class of optimal stopping problems in decision theory and can be embedded within the framework of active sequential hypothesis testing. Prior works on odd arm identification dealt with independent and identically distributed observations from each arm. We provide the first known extension to the case of Markov observations from each arm. Our analysis and results are in the asymptotic regime of vanishing error probability. We associate with each arm an ergodic discrete time Markov process that evolves on a common, finite state space. The notion of anomaly is that the transition probability matrix (TPM) of the Markov process of one of the arms (the {\em odd arm}) is some $P_{1}$, and that of each non-odd arm is a different $P_{2}$, $P_{2}\neq P_{1}$. A learning agent whose goal it is to find the odd arm samples the arms sequentially, one at any given time, and observes the state of the Markov process of the sampled arm. The Markov processes of the unobserved arms may either remain frozen at their last observed states until their next sampling instant ({\em rested arms}) or continue to undergo state evolution ({\em restless arms}). The TPMs $P_{1}$ and $P_{2}$ may be known to the learning agent beforehand or unknown. We analyse the following cases: (a) rested arms with TPMs unknown, (b) restless arms with TPMs known, and (c) restless arms with TPMs unknown. For each of these cases, we first present an asymptotic lower bound on the {\color{black} growth rate of the }expected time required to find the odd arm, and subsequently devise a sequential arm selection policy which we show achieves the lower bound and is therefore asymptotically optimal. A key ingredient in our analysis of the setting of rested arms is the observation that for the Markov process of each arm, the long term fraction of entries into a state is equal to the long term fraction of exits from the state (global balance). When the arms are restless, it is necessary for the learning agent to keep track of the time since each arm was last sampled (arm's {\em delay}) and the state of each arm when it was last sampled (arm's {\em last observed state}). We show that the arm delays and the last observed states form a controlled Markov process which is ergodic under any stationary arm selection policy that picks each arm with a strictly positive probability. Our approach of considering the delays and the last observed states of all the arms jointly offers a global perspective of the arms and serves as a `lift' from the local perspective of dealing with the delay and the last observed state of each arm separately, one that is suggested by the prior works. Lastly, when the TPMs are unknown and have to be estimated along the way, it is important to ensure that the estimates converge almost surely to their true values asymptotically, i.e., the system is {\em identifiable}. We show identifiability follows from the ergodic theorem in the rested case, and provide sufficient conditions for it in the restless case.
Department of Science and Technology, Ministry of Human Resource Development, Center for Networked Intelligence, Robert Bosch Center for Cyber-Physical Systems

APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic 'Reinforcement Learning, Multi-armed Bandits'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles