Journal articles on the topic 'Bandit Contextuel'

Consult the top 50 journal articles for your research on the topic 'Bandit Contextuel.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Gisselbrecht, Thibault, Sylvain Lamprier, and Patrick Gallinari. "Collecte ciblée à partir de flux de données en ligne dans les médias sociaux. Une approche de bandit contextuel." Document numérique 19, no. 2-3 (December 30, 2016): 11–30. http://dx.doi.org/10.3166/dn.19.2-3.11-30.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Dimakopoulou, Maria, Zhengyuan Zhou, Susan Athey, and Guido Imbens. "Balanced Linear Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 3445–53. http://dx.doi.org/10.1609/aaai.v33i01.33013445.

Full text
Abstract:
Contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models, which can lead to difficult estimation problems along the path of learning. We develop algorithms for contextual bandits with linear payoffs that integrate balancing methods from the causal inference literature in their estimation to make them less prone to problems of estimation bias. We provide the first regret bound analyses for linear contextual bandits with balancing and show that our algorithms match state-of-the-art theoretical guarantees. We demonstrate the strong practical advantage of balanced contextual bandits on a large number of supervised learning datasets and on a synthetic example that simulates model misspecification and prejudice in the initial training data.
APA, Harvard, Vancouver, ISO, and other styles
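
The balancing idea described in the abstract above can be illustrated with a small sketch: each logged observation is weighted by the inverse of the (estimated) probability that the logging policy chose the corresponding arm, and the arm's linear reward model is then fit by weighted ridge regression. This is only an illustration of that weighting step on synthetic logged data with assumed known propensities, not the authors' exact estimator; the function name balanced_ridge and the clipping threshold are placeholders.

import numpy as np

def balanced_ridge(X, y, propensities, lam=1.0):
    """Inverse-propensity-weighted ridge estimate of one arm's linear reward model.

    X            : (n, d) contexts observed when this arm was chosen
    y            : (n,)   rewards observed for this arm
    propensities : (n,)   probability the logging policy assigned to this arm
                   at each of those rounds (assumed known or estimated)
    """
    w = 1.0 / np.clip(propensities, 1e-3, None)   # balancing weights, clipped for stability
    d = X.shape[1]
    A = X.T @ (X * w[:, None]) + lam * np.eye(d)  # weighted Gram matrix
    b = X.T @ (w * y)
    theta_hat = np.linalg.solve(A, b)             # arm's coefficient estimate
    return theta_hat, A                           # A can also drive a UCB-style bonus

# Toy usage with made-up logged data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
theta_true = np.array([0.5, -0.2, 0.0, 0.3, 0.1])
y = X @ theta_true + 0.1 * rng.normal(size=200)
prop = rng.uniform(0.05, 0.9, size=200)           # pretend logging probabilities
theta_hat, _ = balanced_ridge(X, y, prop)
print(theta_hat)
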
3

Tong, Ruoyi. "A survey of the application and technical improvement of the multi-armed bandit." Applied and Computational Engineering 77, no. 1 (July 16, 2024): 25–31. http://dx.doi.org/10.54254/2755-2721/77/20240631.

Full text
Abstract:
In recent years, the multi-armed bandit (MAB) model has been widely used and has shown excellent performance. This article provides an overview of the applications and technical improvements of the multi-armed bandit problem. First, an overview of the multi-armed bandit problem is presented, including the explanation of a general modeling approach and several existing common algorithms, such as ε-greedy, ETC, UCB, and Thompson sampling. Then, the real-life applications of the multi-armed bandit model are explored, covering the fields of recommender systems, healthcare, and finance. Next, some improved algorithms and models are summarized by addressing the problems encountered in different application domains, including the multi-armed bandit considering multiple objectives, the mortal multi-armed bandits, the multi-armed bandit considering contextual side information, and combinatorial multi-armed bandits. Finally, the characteristics, trends of changes among different algorithms, and applicable scenarios are summarized and discussed.
APA, Harvard, Vancouver, ISO, and other styles
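
For readers unfamiliar with the algorithms this survey names, here is a minimal sketch of two of them, ε-greedy and UCB1, run on simulated Bernoulli arms. The arm probabilities, horizon, and exploration rate are arbitrary illustrative choices rather than anything taken from the article.

import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]       # illustrative Bernoulli arms
T = 5000                           # horizon (arbitrary)

def eps_greedy(eps=0.1):
    counts = np.zeros(len(true_means))
    means = np.zeros(len(true_means))
    total = 0.0
    for t in range(T):
        if rng.random() < eps:
            a = rng.integers(len(true_means))      # explore
        else:
            a = int(np.argmax(means))              # exploit
        r = float(rng.random() < true_means[a])    # Bernoulli reward
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]     # running average
        total += r
    return total

def ucb1():
    counts = np.ones(len(true_means))              # each arm played once to initialize
    means = np.array([float(rng.random() < p) for p in true_means])
    total = means.sum()
    for t in range(len(true_means), T):
        bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
        a = int(np.argmax(means + bonus))          # optimism in the face of uncertainty
        r = float(rng.random() < true_means[a])
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        total += r
    return total

print("eps-greedy total reward:", eps_greedy())
print("UCB1 total reward      :", ucb1())
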
4

Yang, Luting, Jianyi Yang, and Shaolei Ren. "Contextual Bandits with Delayed Feedback and Semi-supervised Learning (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 18 (May 18, 2021): 15943–44. http://dx.doi.org/10.1609/aaai.v35i18.17968.

Full text
Abstract:
Contextual multi-armed bandit (MAB) is a classic online learning problem, where a learner/agent selects actions (i.e., arms) given contextual information and discovers optimal actions based on reward feedback. Applications of contextual bandits have been expanding, including advertisement, personalization, and resource allocation in wireless networks, among others. Nonetheless, the reward feedback is delayed in many applications (e.g., a user may only provide service ratings after a period of time), creating challenges for contextual bandits. In this paper, we address delayed feedback in contextual bandits by using semi-supervised learning: incorporating estimates of delayed rewards to improve the estimation of future rewards. Concretely, the reward feedback for an arm selected at the beginning of a round is observed by the agent/learner only with some observation noise, and is provided to the agent after some a priori unknown but bounded delay. Motivated by semi-supervised learning, which produces pseudo labels for unlabeled data to further improve model performance, we generate fictitious estimates of rewards that are delayed and have yet to arrive, based on already-learnt reward functions. Thus, by combining semi-supervised learning with online contextual bandit learning, we propose a novel extension and design two algorithms, which estimate the values of currently unavailable reward feedback to minimize the maximum estimation error and the average estimation error, respectively.
APA, Harvard, Vancouver, ISO, and other styles
5

Sharaf, Amr, and Hal Daumé III. "Meta-Learning Effective Exploration Strategies for Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (May 18, 2021): 9541–48. http://dx.doi.org/10.1609/aaai.v35i11.17149.

Full text
Abstract:
In contextual bandits, an algorithm must choose actions given observed contexts, learning from a reward signal that is observed only for the action chosen. This leads to an exploration/exploitation trade-off: the algorithm must balance taking actions it already believes are good with taking new actions to potentially discover better choices. We develop a meta-learning algorithm, Mêlée, that learns an exploration policy based on simulated, synthetic contextual bandit tasks. Mêlée uses imitation learning against these simulations to train an exploration policy that can be applied to true contextual bandit tasks at test time. We evaluate Mêlée on both a natural contextual bandit problem derived from a learning to rank dataset as well as hundreds of simulated contextual bandit problems derived from classification tasks. Mêlée outperforms seven strong baselines on most of these datasets by leveraging a rich feature representation for learning an exploration strategy.
APA, Harvard, Vancouver, ISO, and other styles
6

Du, Yihan, Siwei Wang, and Longbo Huang. "A One-Size-Fits-All Solution to Conservative Bandit Problems." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 7254–61. http://dx.doi.org/10.1609/aaai.v35i8.16891.

Full text
Abstract:
In this paper, we study a family of conservative bandit problems (CBPs) with sample-path reward constraints, i.e., the learner's reward performance must be at least as good as a given baseline at any time. We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems, i.e., conservative multi-armed bandits (CMAB), conservative linear bandits (CLB) and conservative contextual combinatorial bandits (CCCB). Different from previous works which consider high probability constraints on the expected reward, we focus on a sample-path constraint on the actually received reward, and achieve better theoretical guarantees (T-independent additive regrets instead of T-dependent) and empirical performance. Furthermore, we extend the results and consider a novel conservative mean-variance bandit problem (MV-CBP), which measures the learning performance with both the expected reward and variability. For this extended problem, we provide a novel algorithm with O(1/T) normalized additive regrets (T-independent in the cumulative form) and validate this result through empirical evaluation.
APA, Harvard, Vancouver, ISO, and other styles
7

Varatharajah, Yogatheesan, and Brent Berry. "A Contextual-Bandit-Based Approach for Informed Decision-Making in Clinical Trials." Life 12, no. 8 (August 21, 2022): 1277. http://dx.doi.org/10.3390/life12081277.

Full text
Abstract:
Clinical trials are conducted to evaluate the efficacy of new treatments. Clinical trials involving multiple treatments utilize the randomization of treatment assignments to enable the evaluation of treatment efficacies in an unbiased manner. Such evaluation is performed in post hoc studies that usually use supervised-learning methods that rely on large amounts of data collected in a randomized fashion. That approach often proves to be suboptimal in that some participants may suffer and even die as a result of having not received the most appropriate treatments during the trial. Reinforcement-learning methods improve the situation by making it possible to learn the treatment efficacies dynamically during the course of the trial, and to adapt treatment assignments accordingly. Recent efforts using multi-arm bandits, a type of reinforcement-learning method, have focused on maximizing clinical outcomes for a population that was assumed to be homogeneous. However, those approaches have failed to account for the variability among participants that is becoming increasingly evident as a result of recent clinical-trial-based studies. We present a contextual-bandit-based online treatment optimization algorithm that, in choosing treatments for new participants in the study, takes into account not only the maximization of the clinical outcomes but also the patient characteristics. We evaluated our algorithm using a real clinical trial dataset from the International Stroke Trial. We simulated the online setting by sequentially going through the data of each participant admitted to the trial. Two bandits (one for each context) were created, with four choices of treatments. For a new participant in the trial, depending on the context, one of the bandits was selected. Then, we took three different approaches to choose a treatment: (a) a random choice (i.e., the strategy currently used in clinical trial settings), (b) a Thompson sampling-based approach, and (c) a UCB-based approach. Success probabilities of each context were calculated separately by considering the participants with the same context. Those estimated outcomes were used to update the prior distributions within the bandit corresponding to the context of each participant. We repeated that process through the end of the trial and recorded the outcomes and the chosen treatments for each approach. We also evaluated a context-free multi-arm-bandit-based approach, using the same dataset, to showcase the benefits of our approach. In the context-free case, we calculated the success probabilities for the Bernoulli sampler using the whole clinical trial dataset in a context-independent manner. The results of our retrospective analysis indicate that the proposed approach performs significantly better than either a random assignment of treatments (the current gold standard) or a multi-arm-bandit-based approach, providing substantial gains in the percentage of participants who are assigned the most suitable treatments. The contextual-bandit and multi-arm bandit approaches provide 72.63% and 64.34% gains, respectively, compared to a random assignment.
APA, Harvard, Vancouver, ISO, and other styles
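
The Thompson-sampling variant described in the abstract above, two context-specific Beta-Bernoulli bandits over four treatments, can be sketched as follows. The success probabilities, number of patients, and context model here are synthetic placeholders, not the International Stroke Trial data used in the paper.

import numpy as np

rng = np.random.default_rng(42)
n_contexts, n_arms = 2, 4                      # two patient contexts, four treatments
alpha = np.ones((n_contexts, n_arms))          # Beta posterior: successes + 1
beta = np.ones((n_contexts, n_arms))           # Beta posterior: failures + 1

# Illustrative "true" success probabilities per context and treatment
p_true = np.array([[0.55, 0.60, 0.40, 0.50],
                   [0.35, 0.70, 0.45, 0.50]])

for patient in range(2000):                    # patients arrive sequentially
    ctx = rng.integers(n_contexts)             # observed patient context
    samples = rng.beta(alpha[ctx], beta[ctx])  # Thompson sampling: draw from posteriors
    arm = int(np.argmax(samples))              # assign the treatment with the best draw
    success = rng.random() < p_true[ctx, arm]  # observed outcome
    if success:
        alpha[ctx, arm] += 1
    else:
        beta[ctx, arm] += 1

print("posterior mean success rates:\n", alpha / (alpha + beta))
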
8

Li, Jialian, Chao Du, and Jun Zhu. "A Bayesian Approach for Subset Selection in Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 8384–91. http://dx.doi.org/10.1609/aaai.v35i9.17019.

Full text
Abstract:
Subset selection in Contextual Bandits (CB) is an important task in various applications such as advertisement recommendation. In CB, arms are attached with contexts and thus correlated in the context space. Proper exploration for subset selection in CB should carefully consider the contexts. Previous works mainly concentrate on identifying the single best arm in linear bandit problems, where the expected rewards are linearly dependent on the contexts. However, these methods rely heavily on linearity and cannot be easily extended to more general cases. We propose a novel Bayesian approach for subset selection in general CB where the reward functions can be nonlinear. Our method provides a principled way to employ contextual information and efficiently explore the arms. For cases with relatively smooth posteriors, we give theoretical results that are comparable to previous works. For general cases, we provide a calculable approximate variant. Empirical results show the effectiveness of our method on both linear bandits and general CB.
APA, Harvard, Vancouver, ISO, and other styles
9

Qu, Jiaming. "Survey of dynamic pricing based on Multi-Armed Bandit algorithms." Applied and Computational Engineering 37, no. 1 (January 22, 2024): 160–65. http://dx.doi.org/10.54254/2755-2721/37/20230497.

Full text
Abstract:
Dynamic pricing seeks to determine the optimal selling price for a product or service, taking into account factors like limited supply and uncertain demand. This study aims to provide a comprehensive exploration of dynamic pricing using the multi-armed bandit problem framework in various contexts. The investigation highlights the prevalence of Thompson sampling in dynamic pricing scenarios with a Bayesian backdrop, where the seller possesses prior knowledge of demand functions. On the other hand, in non-Bayesian situations, the Upper Confidence Bound (UCB) algorithm family gains traction due to its favorable regret bounds. As markets often exhibit temporal fluctuations, the domain of non-stationary multi-armed bandits within dynamic pricing emerges as crucial. Future research directions include enhancing traditional multi-armed bandit algorithms to suit online learning settings, especially those involving dynamic reward distributions. Additionally, merging prior insights into demand functions with contextual multi-armed bandit approaches holds promise for advancing dynamic pricing strategies. In conclusion, this study sheds light on dynamic pricing through the lens of multi-armed bandit problems, offering insights and pathways for further exploration.
APA, Harvard, Vancouver, ISO, and other styles
10

Atsidakou, Alexia, Constantine Caramanis, Evangelia Gergatsouli, Orestis Papadigenopoulos, and Christos Tzamos. "Contextual Pandora’s Box." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 10 (March 24, 2024): 10944–52. http://dx.doi.org/10.1609/aaai.v38i10.28969.

Full text
Abstract:
Pandora’s Box is a fundamental stochastic optimization problem, where the decision-maker must find a good alternative, while minimizing the search cost of exploring the value of each alternative. In the original formulation, it is assumed that accurate distributions are given for the values of all the alternatives, while recent work studies the online variant of Pandora’s Box where the distributions are originally unknown. In this work, we study Pandora’s Box in the online setting, while incorporating context. At each round, we are presented with a number of alternatives each having a context, an exploration cost and an unknown value drawn from an unknown distribution that may change at every round. Our main result is a no-regret algorithm that performs comparably well against the optimal algorithm which knows all prior distributions exactly. Our algorithm works even in the bandit setting where the algorithm never learns the values of the alternatives that were not explored. The key technique that enables our result is a novel modification of the realizability condition in contextual bandits that connects a context to a sufficient statistic of each alternative’s distribution (its reservation value) rather than its mean.
APA, Harvard, Vancouver, ISO, and other styles
11

Zhang, Qianqian. "Real-world Applications of Bandit Algorithms: Insights and Innovations." Transactions on Computer Science and Intelligent Systems Research 5 (August 12, 2024): 753–58. http://dx.doi.org/10.62051/ge4sk783.

Full text
Abstract:
In the rapidly evolving landscape of decision-making systems, the significance of Multi-Armed Bandit (MAB) algorithms has surged, showcasing a remarkable ability to address the exploration-exploitation dilemma across diverse domains. Originating from the probabilistic and statistical decision-making framework, MAB algorithms have established a critical role by offering a systematic approach to making choices in uncertain environments with limited information. These algorithms ingeniously balance the trade-off between exploiting known resources for immediate gains and exploring new possibilities for future benefits. The spectrum of MAB algorithms ranges from Stochastic Stationary Bandits, dealing with static reward distributions, to more complex forms like Restless and Contextual Bandits, each tailored to the dynamism and specificity of real-world challenges. Further, Structured Bandits explore the underlying patterns in reward distributions, providing strategic insights into decision-making processes. The practical applications of these algorithms span several fields, including healthcare, content recommendation, and education, demonstrating their versatility and efficacy in addressing specific contextual challenges. This paper aims to provide a comprehensive overview of the development, nuances, and practical applications of MAB algorithms, highlighting their pivotal role in advancing decision-making processes amidst uncertainty.
APA, Harvard, Vancouver, ISO, and other styles
12

Wang, Zhiyong, Xutong Liu, Shuai Li, and John C. S. Lui. "Efficient Explorative Key-Term Selection Strategies for Conversational Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 10288–95. http://dx.doi.org/10.1609/aaai.v37i8.26225.

Full text
Abstract:
Conversational contextual bandits elicit user preferences by occasionally querying for explicit feedback on key-terms to accelerate learning. However, there are aspects of existing approaches which limit their performance. First, information gained from key-term-level conversations and arm-level recommendations is not appropriately incorporated to speed up learning. Second, it is important to ask explorative key-terms to quickly elicit the user's potential interests in various domains to accelerate the convergence of user preference estimation, which has never been considered in existing works. To tackle these issues, we first propose "ConLinUCB", a general framework for conversational bandits with better information incorporation, combining arm-level and key-term-level feedback to estimate user preference in one step at each time. Based on this framework, we further design two bandit algorithms with explorative key-term selection strategies, ConLinUCB-BS and ConLinUCB-MCR. We prove tighter regret upper bounds of our proposed algorithms. Particularly, ConLinUCB-BS achieves a better regret bound than the previous result. Extensive experiments on synthetic and real-world data show significant advantages of our algorithms in learning accuracy (up to 54% improvement) and computational efficiency (up to 72% improvement), compared to the classic ConUCB algorithm, showing the potential benefit to recommender systems.
APA, Harvard, Vancouver, ISO, and other styles
13

Bansal, Nipun, Manju Bala, and Kapil Sharma. "FuzzyBandit An Autonomous Personalized Model Based on Contextual Multi Arm Bandits Using Explainable AI." Defence Science Journal 74, no. 4 (April 26, 2024): 496–504. http://dx.doi.org/10.14429/dsj.74.19330.

Full text
Abstract:
In the era of artificial cognizance, context-aware decision-making problems have attracted significant attention. Contextual bandits address these problems by solving the exploration-versus-exploitation dilemma in order to provide customized solutions tailored to the user's preferences. However, a high level of accountability is required, and there is a need to understand the underlying mechanism of the black-box nature of the contextual bandit algorithms proposed in the literature. To overcome these shortcomings, an explainable AI (XAI) based FuzzyBandit model is proposed, which maximizes the cumulative reward by optimizing the decision at each trial based on the rewards received in previous observations and, at the same time, generates explanations for the decision made. The proposed model uses an adaptive neuro-fuzzy inference system (ANFIS) to address the vague nature of arm selection in contextual bandits and uses a feedback mechanism to adjust its parameters based on the relevance and diversity of the features to maximize reward generation. The FuzzyBandit model has also been empirically compared with seven of the most popular state-of-the-art models in the literature on four benchmark datasets over nine criteria, namely recall, specificity, precision, prevalence, F1 score, Matthews Correlation Coefficient (MCC), Fowlkes–Mallows index (FM), Critical Success Index (CSI) and accuracy.
APA, Harvard, Vancouver, ISO, and other styles
14

Tang, Qiao, Hong Xie, Yunni Xia, Jia Lee, and Qingsheng Zhu. "Robust Contextual Bandits via Bootstrapping." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (May 18, 2021): 12182–89. http://dx.doi.org/10.1609/aaai.v35i13.17446.

Full text
Abstract:
Upper confidence bound (UCB) based contextual bandit algorithms require one to know the tail property of the reward distribution. Unfortunately, such tail property is usually unknown or difficult to specify in real-world applications. Using a tail property heavier than the ground truth leads to a slow learning speed of the contextual bandit algorithm, while using a lighter one may cause the algorithm to diverge. To address this fundamental problem, we develop an estimator (evaluated from historical rewards) for the contextual bandit UCB based on the multiplier bootstrapping technique. We first establish sufficient conditions under which our estimator converges asymptotically to the ground truth of contextual bandit UCB. We further derive a second order correction for our estimator so as to obtain its confidence level with a finite number of rounds. To demonstrate the versatility of the estimator, we apply it to design a BootLinUCB algorithm for the contextual bandit. We prove that the BootLinUCB has a sub-linear regret upper bound and also conduct extensive experiments to validate its superior performance.
APA, Harvard, Vancouver, ISO, and other styles
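
A greatly simplified, non-contextual illustration of the multiplier-bootstrap idea behind the estimator described above: perturb the historical rewards of an arm with zero-mean random multipliers and read the confidence width off the quantiles of the perturbed means, rather than plugging in an assumed tail bound. The Gaussian multipliers, quantile level, and single-arm setting below are assumptions for exposition and do not reproduce the paper's BootLinUCB construction.

import numpy as np

def bootstrap_ucb(rewards, n_boot=500, level=0.95, rng=None):
    """Data-driven UCB index for one arm via a multiplier bootstrap (simplified sketch)."""
    rng = rng or np.random.default_rng()
    r = np.asarray(rewards, dtype=float)
    mu = r.mean()
    # Multiplier-bootstrap replicates of the centered sample mean
    w = rng.normal(0.0, 1.0, size=(n_boot, len(r)))          # zero-mean multipliers
    boot_means = mu + (w * (r - mu)).mean(axis=1)
    width = np.quantile(boot_means, level) - mu               # empirical confidence width
    return mu + width

rewards = np.random.default_rng(3).normal(0.4, 0.8, size=60)  # toy reward history
print("bootstrapped UCB index:", bootstrap_ucb(rewards))
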
15

Wu, Jiazhen. "In-depth Exploration and Implementation of Multi-Armed Bandit Models Across Diverse Fields." Highlights in Science, Engineering and Technology 94 (April 26, 2024): 201–5. http://dx.doi.org/10.54097/d3ez0n61.

Full text
Abstract:
This paper presents an in-depth analysis of the Multi-Armed Bandit (MAB) problem, tracing its evolution from its origins in the gambling domain of the 1940s to its current prominence in machine learning and artificial intelligence. The analysis begins with a historical overview, noting key developments like Herbert Robbins' probabilistic framework and the expansion of the problem into strategic decision-making in the 1970s. The emergence of algorithms like the Upper Confidence Bound (UCB) and Thompson Sampling in the late 20th century is highlighted, demonstrating the MAB problem's transition to practical applications. The integration of MAB algorithms with machine learning, particularly in the era of reinforcement learning, is explored, emphasizing their application in various domains such as online advertising, financial market trading, and clinical trials. The paper discusses the critical role of decision theory and probabilistic models in MAB problems, focusing on the balance between exploration and exploitation strategies. Recent advancements in Contextual Bandits, non-stationary reward distributions, and Multi-agent Bandits are examined, showcasing the ongoing evolution and adaptability of MAB problems.
APA, Harvard, Vancouver, ISO, and other styles
16

Wang, Kun. "Conservative Contextual Combinatorial Cascading Bandit." IEEE Access 9 (2021): 151434–43. http://dx.doi.org/10.1109/access.2021.3124416.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Elwood, Adam, Marco Leonardi, Ashraf Mohamed, and Alessandro Rozza. "Maximum Entropy Exploration in Contextual Bandits with Neural Networks and Energy Based Models." Entropy 25, no. 2 (January 18, 2023): 188. http://dx.doi.org/10.3390/e25020188.

Full text
Abstract:
Contextual bandits can solve a huge range of real-world problems. However, current popular algorithms to solve them either rely on linear models or unreliable uncertainty estimation in non-linear models, which are required to deal with the exploration–exploitation trade-off. Inspired by theories of human cognition, we introduce novel techniques that use maximum entropy exploration, relying on neural networks to find optimal policies in settings with both continuous and discrete action spaces. We present two classes of models, one with neural networks as reward estimators, and the other with energy based models, which model the probability of obtaining an optimal reward given an action. We evaluate the performance of these models in static and dynamic contextual bandit simulation environments. We show that both techniques outperform standard baseline algorithms, such as NN HMC, NN Discrete, Upper Confidence Bound, and Thompson Sampling, where energy based models have the best overall performance. This provides practitioners with new techniques that perform well in static and dynamic settings, and are particularly well suited to non-linear scenarios with continuous action spaces.
APA, Harvard, Vancouver, ISO, and other styles
18

Baheri, Ali. "Multilevel Constrained Bandits: A Hierarchical Upper Confidence Bound Approach with Safety Guarantees." Mathematics 13, no. 1 (January 3, 2025): 149. https://doi.org/10.3390/math13010149.

Full text
Abstract:
The multi-armed bandit (MAB) problem is a foundational model for sequential decision-making under uncertainty. While MAB has proven valuable in applications such as clinical trials and online advertising, traditional formulations have limitations; specifically, they struggle to handle three key real-world scenarios: (1) when decisions must follow a hierarchical structure (as in autonomous systems where high-level strategy guides low-level actions); (2) when there are constraints at multiple levels of decision-making (such as both system-wide and component-level resource limits); and (3) when available actions depend on previous choices or context. To address these challenges, we introduce the hierarchical constrained bandits (HCB) framework, which extends contextual bandits to incorporate both hierarchical decisions and multilevel constraints. We propose the HC-UCB (hierarchical constrained upper confidence bound) algorithm to solve the HCB problem. The algorithm uses confidence bounds within a hierarchical setting to balance exploration and exploitation while respecting constraints at all levels. Our theoretical analysis establishes that HC-UCB achieves sublinear regret, guarantees constraint satisfaction at all hierarchical levels, and is near-optimal in terms of achievable performance. Simple experimental results demonstrate the algorithm’s effectiveness in balancing reward maximization with constraint satisfaction.
APA, Harvard, Vancouver, ISO, and other styles
19

Strong, Emily, Bernard Kleynhans, and Serdar Kadıoğlu. "MABWISER: Parallelizable Contextual Multi-armed Bandits." International Journal on Artificial Intelligence Tools 30, no. 04 (June 2021): 2150021. http://dx.doi.org/10.1142/s0218213021500214.

Full text
Abstract:
Contextual multi-armed bandit algorithms are an effective approach for online sequential decision-making problems. However, there are limited tools available to support their adoption in the community. To fill this gap, we present an open-source Python library with context-free, parametric and non-parametric contextual multi-armed bandit algorithms. The MABWiser library is designed to be user-friendly and supports custom bandit algorithms for specific applications. Our design provides built-in parallelization to speed up training and testing for scalability with special attention given to ensuring the reproducibility of results. The API makes hybrid strategies possible that combine non-parametric policies with parametric ones, an area that is not explored in the literature. As a practical application, we demonstrate using the library in both batch and online simulations for context-free, parametric and non-parametric contextual policies with the well-known MovieLens data set. Finally, we quantify the performance benefits of built-in parallelization.
APA, Harvard, Vancouver, ISO, and other styles
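
A brief sketch of the kind of workflow such a library supports, based on MABWiser's documented fit/predict interface with a contextual LinUCB policy and built-in parallelism. Exact parameter names (for example n_jobs) and defaults may differ across library versions, and the arms, rewards, and contexts below are made up.

# Assumes MABWiser is installed; interface details may vary between versions.
from mabwiser.mab import MAB, LearningPolicy

arms = ["article_a", "article_b"]
decisions = ["article_a", "article_a", "article_b", "article_b"]   # logged arm choices
rewards = [1, 0, 1, 1]                                             # logged click feedback
contexts = [[0.1, 0.2], [0.7, 0.1], [0.3, 0.9], [0.8, 0.4]]        # user features

# Parametric contextual policy (LinUCB) with built-in parallel training
mab = MAB(arms=arms, learning_policy=LearningPolicy.LinUCB(alpha=1.25), n_jobs=2)
mab.fit(decisions=decisions, rewards=rewards, contexts=contexts)

# Recommend an arm for a new user context
print(mab.predict([[0.5, 0.5]]))
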
20

Lee, Kyungbok, Myunghee Cho Paik, Min-hwan Oh, and Gi-Soo Kim. "Mixed-Effects Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (March 24, 2024): 13409–17. http://dx.doi.org/10.1609/aaai.v38i12.29243.

Full text
Abstract:
We study a novel variant of a contextual bandit problem with multi-dimensional reward feedback formulated as a mixed-effects model, where the correlations between multiple feedback are induced by sharing stochastic coefficients called random effects. We propose a novel algorithm, Mixed-Effects Contextual UCB (ME-CUCB), achieving an Õ(d√(mT)) regret bound after T rounds, where d is the dimension of contexts and m is the dimension of outcomes, with either known or unknown covariance structure. This is a tighter regret bound than that of the naive canonical linear bandit algorithm ignoring the correlations among rewards. We prove a lower bound of Ω(d√(mT)) matching the upper bound up to logarithmic factors. To our knowledge, this is the first work providing a regret analysis for mixed-effects models and algorithms involving weighted least-squares estimators. Our theoretical analysis faces a significant technical challenge in that the error terms do not constitute martingales since the weights depend on the rewards. We overcome this challenge by using covering numbers, of theoretical interest in its own right. We provide numerical experiments demonstrating the advantage of our proposed algorithm, supporting the theoretical claims.
APA, Harvard, Vancouver, ISO, and other styles
21

Oh, Min-hwan, and Garud Iyengar. "Multinomial Logit Contextual Bandits: Provable Optimality and Practicality." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 10 (May 18, 2021): 9205–13. http://dx.doi.org/10.1609/aaai.v35i10.17111.

Full text
Abstract:
We consider a sequential assortment selection problem where the user choice is given by a multinomial logit (MNL) choice model whose parameters are unknown. In each period, the learning agent observes d-dimensional contextual information about the user and the N available items, offers an assortment of size K to the user, and observes the bandit feedback of the item chosen from the assortment. We propose upper confidence bound based algorithms for this MNL contextual bandit. The first algorithm is a simple and practical method that achieves an O(d√T) regret over T rounds. Next, we propose a second algorithm which achieves an O(√(dT)) regret. This matches the lower bound for the MNL bandit problem, up to logarithmic terms, and improves on the best-known result by a √d factor. To establish this sharper regret bound, we present a non-asymptotic confidence bound for the maximum likelihood estimator of the MNL model that may be of independent interest as its own theoretical contribution. We then revisit the simpler, significantly more practical, first algorithm and show that a simple variant of the algorithm achieves the optimal regret for a broad class of important applications.
APA, Harvard, Vancouver, ISO, and other styles
22

Zhao, Yisen. "Enhancing conversational recommendation systems through the integration of KNN with ConLinUCB contextual bandits." Applied and Computational Engineering 68, no. 1 (June 6, 2024): 8–16. http://dx.doi.org/10.54254/2755-2721/68/20241388.

Full text
Abstract:
In recommender system research, contextual multi-armed bandits have shown promise in delivering tailored recommendations by utilizing contextual data. However, their effectiveness is often curtailed by the cold start problem, arising from the lack of initial user data. This necessitates extensive exploration to ascertain user preferences, consequently impeding the speed of learning. The advent of conversational recommendation systems offers a solution. Through these systems, the conversational contextual bandit algorithm swiftly learns user preferences for specific key-terms via interactive dialogues, thereby enhancing the learning pace. Despite these advancements, there are limitations in current methodologies. A primary issue is the suboptimal integration of data from key-term-centric dialogues and arm-level recommendations, which could otherwise expedite the learning process. Another crucial aspect is the strategic suggestion of exploratory key phrases. These phrases are essential in quickly uncovering users' potential interests in various domains, thus accelerating the convergence of accurate user preference models. Addressing these challenges, the ConLinUCB framework emerges as a groundbreaking solution. It ingeniously combines feedback from both arm-level and key-term-level interactions, significantly optimizing the learning trajectory. Building upon this, the framework integrates a K-nearest neighbour (KNN) approach to refine key-term selection and arm recommendations. This integration hinges on the similarity of user preferences, further hastening the convergence of the parameter vectors.
APA, Harvard, Vancouver, ISO, and other styles
23

Chen, Qiufan. "A survey on contextual multi-armed bandits." Applied and Computational Engineering 53, no. 1 (March 28, 2024): 287–95. http://dx.doi.org/10.54254/2755-2721/53/20241593.

Full text
Abstract:
As a powerful reinforcement learning framework, Contextual Multi-Armed Bandits have extensive applications in various domains. The models of Contextual Multi-Armed Bandits enable decision-makers to make intelligent choices in situations with uncertainty, and they find utility in fields such as online advertising, medical treatment optimization, resource allocation, and more. This paper reviews the evolution of algorithms for Contextual Multi-Armed Bandits, including traditional Bayesian approaches and the latest deep learning techniques. Successful case studies are summarized in different application domains, such as online ad click-through rate optimization and medical decision support. Furthermore, the author discusses future research directions, including more sophisticated context modeling, interpretability, fairness issues, and ethical considerations in the context of automated decision-making.
APA, Harvard, Vancouver, ISO, and other styles
24

Mohaghegh Neyshabouri, Mohammadreza, Kaan Gokcesu, Hakan Gokcesu, Huseyin Ozkan, and Suleyman Serdar Kozat. "Asymptotically Optimal Contextual Bandit Algorithm Using Hierarchical Structures." IEEE Transactions on Neural Networks and Learning Systems 30, no. 3 (March 2019): 923–37. http://dx.doi.org/10.1109/tnnls.2018.2854796.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Gu, Haoran, Yunni Xia, Hong Xie, Xiaoyu Shi, and Mingsheng Shang. "Robust and efficient algorithms for conversational contextual bandit." Information Sciences 657 (February 2024): 119993. http://dx.doi.org/10.1016/j.ins.2023.119993.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Narita, Yusuke, Shota Yasui, and Kohei Yata. "Efficient Counterfactual Learning from Bandit Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4634–41. http://dx.doi.org/10.1609/aaai.v33i01.33014634.

Full text
Abstract:
What is the most statistically efficient way to do off-policy optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have the lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence compared to a state-of-the-art benchmark.
APA, Harvard, Vancouver, ISO, and other styles
27

Li, Zhaoyu, and Qian Ai. "Managing Considerable Distributed Resources for Demand Response: A Resource Selection Strategy Based on Contextual Bandit." Electronics 12, no. 13 (June 23, 2023): 2783. http://dx.doi.org/10.3390/electronics12132783.

Full text
Abstract:
The widespread adoption of distributed energy resources (DERs) leads to resource redundancy in grid operation and increases computation complexity, which underscores the need for effective resource management strategies. In this paper, we present a novel resource management approach that decouples the resource selection and power dispatch tasks. The resource selection task determines the subset of resources designated to participate in the demand response service, while the power dispatch task determines the power output of the selected candidates. A solution strategy based on contextual bandit with DQN structure is then proposed. Concretely, an agent determines the resource selection action, while the power dispatch task is solved in the environment. The negative value of the operational cost is used as feedback to the agent, which links the two tasks in a closed-loop manner. Moreover, to cope with the uncertainty in the power dispatch problem, distributionally robust optimization (DRO) is applied for the reserve settlement to satisfy the reliability requirement against this uncertainty. Numerical studies demonstrate that the DQN-based contextual bandit approach can achieve a profit enhancement ranging from 0.35% to 46.46% compared to the contextual bandit with policy gradient approach under different resource selection quantities.
APA, Harvard, Vancouver, ISO, and other styles
28

Huang, Wen, and Xintao Wu. "Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 18 (March 24, 2024): 20438–46. http://dx.doi.org/10.1609/aaai.v38i18.30027.

Full text
Abstract:
This paper studies bandit problems where an agent has access to offline data that might be utilized to potentially improve the estimation of each arm’s reward distribution. A major obstacle in this setting is the existence of compound biases from the observational data. Ignoring these biases and blindly fitting a model with the biased data could even negatively affect the online learning phase. In this work, we formulate this problem from a causal perspective. First, we categorize the biases into confounding bias and selection bias based on the causal structure they imply. Next, we extract the causal bound for each arm that is robust towards compound biases from biased observational data. The derived bounds contain the ground truth mean reward and can effectively guide the bandit agent to learn a nearly-optimal decision policy. We also conduct regret analysis in both contextual and non-contextual bandit settings and show that prior causal bounds could help consistently reduce the asymptotic regret.
APA, Harvard, Vancouver, ISO, and other styles
29

Spieker, Helge, and Arnaud Gotlieb. "Adaptive metamorphic testing with contextual bandits." Journal of Systems and Software 165 (July 2020): 110574. http://dx.doi.org/10.1016/j.jss.2020.110574.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Jagerman, Rolf, Ilya Markov, and Maarten De Rijke. "Safe Exploration for Optimizing Contextual Bandits." ACM Transactions on Information Systems 38, no. 3 (June 26, 2020): 1–23. http://dx.doi.org/10.1145/3385670.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Kakadiya, Ashutosh, Sriraam Natarajan, and Balaraman Ravindran. "Relational Boosted Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 13 (May 18, 2021): 12123–30. http://dx.doi.org/10.1609/aaai.v35i13.17439.

Full text
Abstract:
Contextual bandits algorithms have become essential in real-world user interaction problems in recent years. However, these algorithms represent context as attribute value representation, which makes them infeasible for real world domains like social networks, which are inherently relational. We propose Relational Boosted Bandits (RB2), a contextual bandits algorithm for relational domains based on (relational) boosted trees. RB2 enables us to learn interpretable and explainable models due to the more descriptive nature of the relational representation. We empirically demonstrate the effectiveness and interpretability of RB2 on tasks such as link prediction, relational classification, and recommendation.
APA, Harvard, Vancouver, ISO, and other styles
32

Seifi, Farshad, and Seyed Taghi Akhavan Niaki. "Optimizing contextual bandit hyperparameters: A dynamic transfer learning-based framework." International Journal of Industrial Engineering Computations 15, no. 4 (2024): 951–64. http://dx.doi.org/10.5267/j.ijiec.2024.6.003.

Full text
Abstract:
The stochastic contextual bandit problem, recognized for its effectiveness in navigating the classic exploration-exploitation dilemma through ongoing player-environment interactions, has found broad applications across various industries. This utility largely stems from the algorithms’ ability to accurately forecast reward functions and maintain an optimal balance between exploration and exploitation, contingent upon the precise selection and calibration of hyperparameters. However, the inherently dynamic and real-time nature of bandit environments significantly complicates hyperparameter tuning, rendering traditional offline methods inadequate. While specialized methods have been developed to overcome these challenges, they often face three primary issues: difficulty in adaptively learning hyperparameters in ever-changing environments, inability to simultaneously optimize multiple hyperparameters for complex models, and inefficiencies in data utilization and knowledge transfer from analogous tasks. To tackle these hurdles, this paper introduces an innovative transfer learning-based approach designed to harness past task knowledge for accelerated optimization and dynamically optimize multiple hyperparameters, making it well-suited for fluctuating environments. The method employs a dual Gaussian meta-model strategy—one for transfer learning and the other for assessing hyperparameters’ performance within the current task —enabling it to leverage insights from previous tasks while quickly adapting to new environmental changes. Furthermore, the framework’s meta-model-centric architecture enables simultaneous optimization of multiple hyperparameters. Experimental evaluations demonstrate that this approach markedly outperforms competing methods in scenarios with perturbations and exhibits superior performance in 70% of stationary cases while matching performance in the remaining 30%. This superiority in performance, coupled with its computational efficiency on par with existing alternatives, positions it as a superior and practical solution for optimizing hyperparameters in contextual bandit settings.
APA, Harvard, Vancouver, ISO, and other styles
33

Zhao, Yafei, and Long Yang. "Constrained contextual bandit algorithm for limited-budget recommendation system." Engineering Applications of Artificial Intelligence 128 (February 2024): 107558. http://dx.doi.org/10.1016/j.engappai.2023.107558.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Yang, Jianyi, and Shaolei Ren. "Robust Bandit Learning with Imperfect Context." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 10594–602. http://dx.doi.org/10.1609/aaai.v35i12.17267.

Full text
Abstract:
A standard assumption in contextual multi-arm bandit is that the true context is perfectly known before arm selection. Nonetheless, in many practical applications (e.g., cloud resource management), prior to arm selection, the context information can only be acquired by prediction subject to errors or adversarial modification. In this paper, we study a novel contextual bandit setting in which only imperfect context is available for arm selection while the true context is revealed at the end of each round. We propose two robust arm selection algorithms: MaxMinUCB (Maximize Minimum UCB) which maximizes the worst-case reward, and MinWD (Minimize Worst-case Degradation) which minimizes the worst-case regret. Importantly, we analyze the robustness of MaxMinUCB and MinWD by deriving both regret and reward bounds compared to an oracle that knows the true context. Our results show that as time goes on, MaxMinUCB and MinWD both perform as asymptotically well as their optimal counterparts that know the reward function. Finally, we apply MaxMinUCB and MinWD to online edge datacenter selection, and run synthetic simulations to validate our theoretical analysis.
APA, Harvard, Vancouver, ISO, and other styles
35

Liu, Zizhuo. "Investigation of progress and application related to Multi-Armed Bandit algorithms." Applied and Computational Engineering 37, no. 1 (January 22, 2024): 155–59. http://dx.doi.org/10.54254/2755-2721/37/20230496.

Full text
Abstract:
This paper discusses four Multi-armed Bandit algorithms: Explore-then-Commit (ETC), Epsilon-Greedy, Upper Confidence Bound (UCB), and Thompson Sampling. The ETC algorithm aims to spend the majority of rounds on the best arm, but it can lead to a suboptimal outcome if the environment changes rapidly. The Epsilon-Greedy algorithm is designed to explore and exploit simultaneously, but it often tries sub-optimal arms even after it has found the best arm; thus, the Epsilon-Greedy algorithm performs well when the environment continuously changes. The UCB algorithm is one of the most used Multi-armed Bandit algorithms because it can rapidly narrow down the potentially optimal decisions in a wide range of scenarios; however, the algorithm can be influenced by specific patterns of reward distribution or noise present in the environment. The Thompson Sampling algorithm is also one of the most common Multi-armed Bandit algorithms due to its simplicity, effectiveness, and adaptability to various reward distributions. The Thompson Sampling algorithm performs well in multiple scenarios because it explores and exploits simultaneously, but its variance is greater than that of the three algorithms mentioned above. Today, Multi-armed Bandit algorithms are widely used in advertisement, health care, and website and app optimization. Finally, Multi-armed Bandit algorithms are rapidly replacing traditional algorithms; in the future, the more advanced contextual Multi-armed Bandit algorithm will gradually replace the older ones.
APA, Harvard, Vancouver, ISO, and other styles
36

Semenov, Alexander, Maciej Rysz, Gaurav Pandey, and Guanglin Xu. "Diversity in news recommendations using contextual bandits." Expert Systems with Applications 195 (June 2022): 116478. http://dx.doi.org/10.1016/j.eswa.2021.116478.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Sui, Guoxin, and Yong Yu. "Bayesian Contextual Bandits for Hyper Parameter Optimization." IEEE Access 8 (2020): 42971–79. http://dx.doi.org/10.1109/access.2020.2977129.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Tekin, Cem, and Mihaela van der Schaar. "Distributed Online Learning via Cooperative Contextual Bandits." IEEE Transactions on Signal Processing 63, no. 14 (July 2015): 3700–3714. http://dx.doi.org/10.1109/tsp.2015.2430837.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Qin, Yuzhen, Yingcong Li, Fabio Pasqualetti, Maryam Fazel, and Samet Oymak. "Stochastic Contextual Bandits with Long Horizon Rewards." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 9525–33. http://dx.doi.org/10.1609/aaai.v37i8.26140.

Full text
Abstract:
The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most s prior actions and contexts (not necessarily consecutive), up to a time horizon of h. In order to avoid polynomial dependence on h, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor (T < h) and data-rich (T ≥ h) regimes and derive respective regret upper bounds O(d√(sT) + min(q, T)) and O(√(sdT)), with sparsity s, feature dimension d, total time horizon T, and q that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon h. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
APA, Harvard, Vancouver, ISO, and other styles
40

Xu, Xiao, Fang Dong, Yanghua Li, Shaojian He, and Xin Li. "Contextual-Bandit Based Personalized Recommendation with Time-Varying User Interests." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 6518–25. http://dx.doi.org/10.1609/aaai.v34i04.6125.

Full text
Abstract:
A contextual bandit problem is studied in a highly non-stationary environment, which is ubiquitous in various recommender systems due to the time-varying interests of users. Two models with disjoint and hybrid payoffs are considered to characterize the phenomenon that users' preferences towards different items vary differently over time. In the disjoint payoff model, the reward of playing an arm is determined by an arm-specific preference vector, which is piecewise-stationary with asynchronous and distinct changes across different arms. An efficient learning algorithm that is adaptive to abrupt reward changes is proposed and theoretical regret analysis is provided to show that a sublinear scaling of regret in the time length T is achieved. The algorithm is further extended to a more general setting with hybrid payoffs where the reward of playing an arm is determined by both an arm-specific preference vector and a joint coefficient vector shared by all arms. Empirical experiments are conducted on real-world datasets to verify the advantages of the proposed learning algorithms against baseline ones in both settings.
APA, Harvard, Vancouver, ISO, and other styles
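
The disjoint payoff model referenced in the abstract above is the standard per-arm linear construction: each arm keeps its own ridge statistics and preference vector, and arms are scored with an optimistic bonus. The sketch below shows only that baseline construction, with fixed synthetic preference vectors; the paper's actual contribution, adapting to asynchronous and abrupt changes in those vectors, is not implemented here, and all parameters are illustrative.

import numpy as np

class DisjointLinUCB:
    """LinUCB with a separate (disjoint) linear payoff model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]     # per-arm ridge Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # per-arm reward-weighted sums

    def select(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                           # arm-specific preference vector
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # exploration bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy loop with fixed synthetic preference vectors (illustrative only)
rng = np.random.default_rng(0)
bandit = DisjointLinUCB(n_arms=3, dim=4)
thetas = rng.normal(size=(3, 4))
for t in range(1000):
    x = rng.normal(size=4)
    a = bandit.select(x)
    r = thetas[a] @ x + 0.1 * rng.normal()
    bandit.update(a, x, r)
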
41

Tekin, Cem, and Eralp Turgay. "Multi-objective Contextual Multi-armed Bandit With a Dominant Objective." IEEE Transactions on Signal Processing 66, no. 14 (July 15, 2018): 3799–813. http://dx.doi.org/10.1109/tsp.2018.2841822.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Yoon, Gyugeun, and Joseph Y. J. Chow. "Contextual Bandit-Based Sequential Transit Route Design under Demand Uncertainty." Transportation Research Record: Journal of the Transportation Research Board 2674, no. 5 (May 2020): 613–25. http://dx.doi.org/10.1177/0361198120917388.

Full text
Abstract:
While public transit network design has a wide literature, the study of line planning and route generation under uncertainty is not so well covered. Such uncertainty is present in planning for emerging transit technologies or operating models in which demand data is largely unavailable to make predictions on. In such circumstances, this paper proposes a sequential route generation process in which an operator periodically expands the route set and receives ridership feedback. Using this sensor loop, a reinforcement learning-based route generation methodology is proposed to support line planning for emerging technologies. The method makes use of contextual bandit problems to explore different routes to invest in while optimizing the operating cost or demand served. Two experiments are conducted. They (1) prove that the algorithm is better than random choice; and (2) show good performance with a gap of 3.7% relative to a heuristic solution to an oracle policy.
APA, Harvard, Vancouver, ISO, and other styles
43

Li, Litao. "Exploring Multi-Armed Bandit algorithms: Performance analysis in dynamic environments." Applied and Computational Engineering 34, no. 1 (January 22, 2024): 252–59. http://dx.doi.org/10.54254/2755-2721/34/20230338.

Full text
Abstract:
The Multi-armed Bandit algorithm, a proficient solver of the exploration-and-exploitation trade-off predicament, furnishes businesses with a robust tool for resource allocation that predominantly aligns with customer preferences. However, varying Multi-armed Bandit algorithm types exhibit dissimilar performance characteristics based on contextual variations. Hence, a series of experiments is imperative, involving alterations to input values across distinct algorithms. Within this study, three specific algorithms were applied, Explore-then-commit (ETC), Upper Confidence Bound (UCB) and its asymptotically optimal variant, and Thompson Sampling (TS), to the extensively utilized MovieLens dataset. This application aimed to gauge their effectiveness comprehensively. The algorithms were translated into executable code, and their performance was visually depicted through multiple figures. Through cumulative regret tracking within defined conditions, algorithmic performance was scrutinized, laying the groundwork for subsequent parameter-based comparisons. A dedicated experimentation framework was devised to evaluate the robustness of each algorithm, involving deliberate parameter adjustments and tailored experiments to elucidate distinct performance nuances. The ensuing graphical depictions distinctly illustrated Thompson Sampling's persistently minimal regrets across most scenarios. UCB algorithms displayed steadfast stability. ETC manifested excellent performance with a low number of runs, but its regret escalated significantly as the number of runs grew, warranting constraints on exploratory phases to mitigate regrets. This investigation underscores the efficacy of Multi-armed Bandit algorithms while elucidating their nuanced behaviors within diverse contextual contingencies.
APA, Harvard, Vancouver, ISO, and other styles
44

Zhu, Tan, Guannan Liang, Chunjiang Zhu, Haining Li, and Jinbo Bi. "An Efficient Algorithm for Deep Stochastic Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 12 (May 18, 2021): 11193–201. http://dx.doi.org/10.1609/aaai.v35i12.17335.

Full text
Abstract:
In stochastic contextual bandit (SCB) problems, an agent selects an action based on certain observed context to maximize the cumulative reward over iterations. Recently there have been a few studies using a deep neural network (DNN) to predict the expected reward for an action, where the DNN is trained by a stochastic gradient based method. However, convergence analysis has been largely neglected, leaving open whether and where these methods converge. In this work, we formulate the SCB that uses a DNN reward function as a non-convex stochastic optimization problem, and design a stage-wise stochastic gradient descent algorithm to optimize the problem and determine the action policy. We prove that with high probability, the action sequence chosen by our algorithm converges to a greedy action policy respecting a local optimal reward function. Extensive experiments have been performed to demonstrate the effectiveness and efficiency of the proposed algorithm on multiple real-world datasets.
APA, Harvard, Vancouver, ISO, and other styles
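
The general setup described above, a neural network predicting per-action rewards from context and trained online with stochastic gradient steps, can be sketched as follows. An ε-greedy choice stands in for the paper's stage-wise policy, and the architecture, optimizer, learning rate, and synthetic environment are placeholders rather than the authors' design.

import torch
import torch.nn as nn

torch.manual_seed(0)
dim, n_actions = 8, 4

# Small MLP reward model: context -> predicted reward for every action
model = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Synthetic environment (placeholder for real bandit feedback)
true_w = torch.randn(n_actions, dim)

eps = 0.1
for t in range(3000):
    x = torch.randn(dim)
    with torch.no_grad():
        q = model(x)                                   # predicted rewards per action
    if torch.rand(1).item() < eps:
        a = torch.randint(n_actions, (1,)).item()      # explore
    else:
        a = int(torch.argmax(q).item())                # exploit current reward model
    r = (true_w[a] @ x + 0.1 * torch.randn(1)).item()  # observed bandit reward

    # One stochastic gradient step on the chosen action's prediction
    pred = model(x)[a]
    loss = loss_fn(pred, torch.tensor(r))
    opt.zero_grad()
    loss.backward()
    opt.step()
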
45

Martín H., José Antonio, and Ana M. Vargas. "Linear Bayes policy for learning in contextual-bandits." Expert Systems with Applications 40, no. 18 (December 2013): 7400–7406. http://dx.doi.org/10.1016/j.eswa.2013.07.041.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Raghavan, Manish, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. "Greedy Algorithm Almost Dominates in Smoothed Contextual Bandits." SIAM Journal on Computing 52, no. 2 (April 12, 2023): 487–524. http://dx.doi.org/10.1137/19m1247115.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Ayala-Romero, Jose A., Andres Garcia-Saavedra, and Xavier Costa-Perez. "Risk-Aware Continuous Control with Neural Contextual Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 19 (March 24, 2024): 20930–38. http://dx.doi.org/10.1609/aaai.v38i19.30083.

Full text
Abstract:
Recent advances in learning techniques have garnered attention for their applicability to a diverse range of real-world sequential decision-making problems. Yet, many practical applications have critical constraints for operation in real environments. Most learning solutions often neglect the risk of failing to meet these constraints, hindering their implementation in real-world contexts. In this paper, we propose a risk-aware decision-making framework for contextual bandit problems, accommodating constraints and continuous action spaces. Our approach employs an actor multi-critic architecture, with each critic characterizing the distribution of performance and constraint metrics. Our framework is designed to cater to various risk levels, effectively balancing constraint satisfaction against performance. To demonstrate the effectiveness of our approach, we first compare it against state-of-the-art baseline methods in a synthetic environment, highlighting the impact of intrinsic environmental noise across different risk configurations. Finally, we evaluate our framework in a real-world use case involving a 5G mobile network where only our approach satisfies consistently the system constraint (a signal processing reliability target) with a small performance toll (8.5% increase in power consumption).
APA, Harvard, Vancouver, ISO, and other styles
48

Pilani, Akshay, Kritagya Mathur, Himanshu Agrawal, Deeksha Chandola, Vinay Anand Tikkiwal, and Arun Kumar. "Contextual Bandit Approach-based Recommendation System for Personalized Web-based Services." Applied Artificial Intelligence 35, no. 7 (April 6, 2021): 489–504. http://dx.doi.org/10.1080/08839514.2021.1883855.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Li, Xinbin, Jiajia Liu, Lei Yan, Song Han, and Xinping Guan. "Relay Selection in Underwater Acoustic Cooperative Networks: A Contextual Bandit Approach." IEEE Communications Letters 21, no. 2 (February 2017): 382–85. http://dx.doi.org/10.1109/lcomm.2016.2625300.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Gisselbrecht, Thibault, Sylvain Lamprier, and Patrick Gallinari. "Dynamic Data Capture from Social Media Streams: A Contextual Bandit Approach." Proceedings of the International AAAI Conference on Web and Social Media 10, no. 1 (August 4, 2021): 131–40. http://dx.doi.org/10.1609/icwsm.v10i1.14734.

Full text
Abstract:
Social media usually provide streaming data access that enables dynamic capture of the social activity of their users. Leveraging such APIs for collecting social data that satisfy a given pre-defined need may constitute a complex task that implies careful stream selection. With user-centered streams, it indeed comes down to the problem of choosing which users to follow in order to maximize the utility of the collected data w.r.t. the need. On large social media, this represents a very challenging task due to the huge number of potential targets and restricted access to the data. Because of the intrinsic non-stationarity of users' behavior, a relevant target today might be irrelevant tomorrow, which represents a major difficulty to apprehend. In this paper, we propose a new approach that anticipates which profiles are likely to publish relevant contents - given a predefined need - in the future, and dynamically selects a subset of accounts to follow at each iteration. Our method has the advantage of taking into account both API restrictions and the dynamics of users' behaviors. We formalize the task as a contextual bandit problem with multiple-action selection. We finally conduct experiments on Twitter, which demonstrate the empirical effectiveness of our approach in real-world settings.
APA, Harvard, Vancouver, ISO, and other styles