Academic literature on the topic 'Offline Contextual Bandit'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Offline Contextual Bandit.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Offline Contextual Bandit":

1

Huang, Wen, and Xintao Wu. "Robustly Improving Bandit Algorithms with Confounded and Selection Biased Offline Data: A Causal Approach." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 18 (March 24, 2024): 20438–46. http://dx.doi.org/10.1609/aaai.v38i18.30027.

Abstract:
This paper studies bandit problems where an agent has access to offline data that might be utilized to potentially improve the estimation of each arm’s reward distribution. A major obstacle in this setting is the existence of compound biases from the observational data. Ignoring these biases and blindly fitting a model with the biased data could even negatively affect the online learning phase. In this work, we formulate this problem from a causal perspective. First, we categorize the biases into confounding bias and selection bias based on the causal structure they imply. Next, we extract the causal bound for each arm that is robust towards compound biases from biased observational data. The derived bounds contain the ground truth mean reward and can effectively guide the bandit agent to learn a nearly-optimal decision policy. We also conduct regret analysis in both contextual and non-contextual bandit settings and show that prior causal bounds could help consistently reduce the asymptotic regret.
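
For readers unfamiliar with causal bounds of this kind, here is a minimal sketch of generic no-assumption (Manski-style) bounds on an arm's mean reward computed from confounded logged data with rewards in [0, 1]. It illustrates the general idea only; it is not the specific bounds derived in this paper, and the function and variable names are illustrative.

    import numpy as np

    def causal_bounds(logged_arms, logged_rewards, arm):
        # Generic no-assumption (Manski-style) bounds on E[reward | do(arm)]
        # from confounded logged data, assuming rewards lie in [0, 1]: the
        # unobserved counterfactual rewards are bounded below by 0, above by 1.
        logged_arms = np.asarray(logged_arms)
        logged_rewards = np.asarray(logged_rewards)
        p_arm = float(np.mean(logged_arms == arm))
        mean_given_arm = float(logged_rewards[logged_arms == arm].mean())
        lower = mean_given_arm * p_arm
        upper = mean_given_arm * p_arm + (1.0 - p_arm)
        return lower, upper
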
2

Narita, Yusuke, Shota Yasui, and Kohei Yata. "Efficient Counterfactual Learning from Bandit Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4634–41. http://dx.doi.org/10.1609/aaai.v33i01.33014634.

Abstract:
What is the most statistically efficient way to do off-policy optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators for the expected reward from a counterfactual policy. Our estimators are shown to have lowest variance in a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with more statistical confidence compared to a state-of-the-art benchmark.
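
As background for the class of estimators discussed above, here is a minimal sketch of the standard inverse-propensity-scoring (IPS) estimator and its self-normalized variant for offline evaluation of a target policy from logged bandit feedback. The function names and the target_policy interface are illustrative assumptions; this is not the estimator proposed in the paper.

    import numpy as np

    def ips_estimate(contexts, actions, rewards, propensities, target_policy):
        # Inverse-propensity-scoring estimate of a target policy's value from
        # logged bandit feedback (x_i, a_i, r_i, p_i), where p_i is the logging
        # policy's probability of the logged action a_i in context x_i.
        # target_policy(x, a) returns the target policy's probability of a in x.
        weights = np.array([target_policy(x, a) / p
                            for x, a, p in zip(contexts, actions, propensities)])
        return float(np.mean(weights * np.asarray(rewards)))

    def snips_estimate(contexts, actions, rewards, propensities, target_policy):
        # Self-normalized IPS: trades a small bias for a large variance reduction.
        weights = np.array([target_policy(x, a) / p
                            for x, a, p in zip(contexts, actions, propensities)])
        return float(np.sum(weights * np.asarray(rewards)) / np.sum(weights))
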
3

Degroote, Hans, Patrick De Causmaecker, Bernd Bischl, and Lars Kotthoff. "A Regression-Based Methodology for Online Algorithm Selection." Proceedings of the International Symposium on Combinatorial Search 9, no. 1 (September 1, 2021): 37–45. http://dx.doi.org/10.1609/socs.v9i1.18458.

Abstract:
Algorithm selection approaches have achieved impressive performance improvements in many areas of AI. Most of the literature considers the offline algorithm selection problem, where the initial selection model is never updated after training. However, new data from running algorithms on instances becomes available when algorithms are selected and run. We investigate how this online data can be used to improve the selection model over time. This is especially relevant when insufficient training instances were used, but potentially improves the performance of algorithm selection in all cases. We formally define the online algorithm selection problem and model it as a contextual multi-armed bandit problem, propose a methodology for solving it, and empirically demonstrate performance improvements. We also show that our online algorithm selection method can be used when no training data whatsoever is available, a setting where offline algorithm selection cannot be used. Our experiments indicate that a simple greedy approach achieves the best performance.
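
To make the regression-based, greedy flavour of this approach concrete, the sketch below keeps one incremental regression model per algorithm and greedily selects the algorithm with the best predicted performance. It is a hedged illustration under the assumption that higher performance values are better; it is not the authors' exact implementation, and all names are illustrative.

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    class GreedyAlgorithmSelector:
        # One incremental regression model per algorithm predicts performance
        # from instance features; the selector greedily picks the algorithm with
        # the best prediction and updates only that model with the observation.
        def __init__(self, n_algorithms, n_features):
            self.models = [SGDRegressor() for _ in range(n_algorithms)]
            for m in self.models:  # seed so predict() works before any real update
                m.partial_fit(np.zeros((1, n_features)), np.zeros(1))

        def select(self, features):
            preds = [m.predict(features.reshape(1, -1))[0] for m in self.models]
            return int(np.argmax(preds))  # assumes higher performance is better

        def update(self, algorithm, features, performance):
            self.models[algorithm].partial_fit(features.reshape(1, -1),
                                               np.array([performance]))
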
4

Li, Zhao, Junshuai Song, Zehong Hu, Zhen Wang, and Jun Gao. "Constrained Dual-Level Bandit for Personalized Impression Regulation in Online Ranking Systems." ACM Transactions on Knowledge Discovery from Data 16, no. 2 (July 21, 2021): 1–23. http://dx.doi.org/10.1145/3461340.

Abstract:
Impression regulation plays an important role in various online ranking systems, e.g., e-commerce ranking systems always need to achieve local commercial demands on some pre-labeled target items like fresh item cultivation and fraudulent item counteracting while maximizing their global revenue. However, local impression regulation may cause “butterfly effects” on the global scale, e.g., in e-commerce, the price preference fluctuation in initial conditions (overpriced or underpriced items) may create a significantly different outcome, thus affecting shopping experience and bringing economic losses to platforms. To prevent “butterfly effects”, some researchers define their regulation objectives with global constraints, using a contextual bandit at the page level that requires all items on one page to share the same regulation action, which fails to conduct impression regulation on individual items. To address this problem, in this article, we propose a personalized impression regulation method that can directly make regulation decisions for each user-item pair. Specifically, we model the regulation problem as a Constrained Dual-level Bandit (CDB) problem, where the local regulation action and reward signals are at the item level while the global effect constraint on the platform impression can be calculated at the page level only. To handle the asynchronous signals, we first expand the page-level constraint to the item level and then derive the policy updating as a second-order cone optimization problem. Our CDB approaches the optimal policy by iteratively solving the optimization problem. Experiments are performed on both offline and online datasets, and the results, theoretically and empirically, demonstrate that CDB outperforms state-of-the-art algorithms.
5

Vera, Alberto, Siddhartha Banerjee, and Itai Gurvich. "Online Allocation and Pricing: Constant Regret via Bellman Inequalities." Operations Research 69, no. 3 (May 2021): 821–40. http://dx.doi.org/10.1287/opre.2020.2061.

Abstract:
We develop a framework for designing simple and efficient policies for a family of online allocation and pricing problems that includes online packing, budget-constrained probing, dynamic pricing, and online contextual bandits with knapsacks. In each case, we evaluate the performance of our policies in terms of their regret (i.e., additive gap) relative to an offline controller that is endowed with more information than the online controller. Our framework is based on Bellman inequalities, which decompose the loss of an algorithm into two distinct sources of error: (1) arising from computational tractability issues, and (2) arising from estimation/prediction of random trajectories. Balancing these errors guides the choice of benchmarks, and leads to policies that are both tractable and have strong performance guarantees. In particular, in all our examples, we demonstrate constant-regret policies that only require resolving a linear program in each period, followed by a simple greedy action-selection rule; thus, our policies are practical as well as provably near optimal.
6

Ayle, Morgane, Jimmy Tekli, Julia El-Zini, Boulos El-Asmar, and Mariette Awad. "BAR — A Reinforcement Learning Agent for Bounding-Box Automated Refinement." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 03 (April 3, 2020): 2561–68. http://dx.doi.org/10.1609/aaai.v34i03.5639.

Abstract:
Research has shown that deep neural networks are able to help and assist human workers throughout the industrial sector via different computer vision applications. However, such data-driven learning approaches require a very large number of labeled training images in order to generalize well and achieve high accuracies that meet industry standards. Gathering and labeling large amounts of images is both expensive and time consuming, specifically for industrial use-cases. In this work, we introduce BAR (Bounding-box Automated Refinement), a reinforcement learning agent that learns to correct inaccurate bounding-boxes that are weakly generated by certain detection methods, or wrongly annotated by a human, using either an offline training method with Deep Reinforcement Learning (BAR-DRL), or an online one using Contextual Bandits (BAR-CB). Our agent limits the human intervention to correcting or verifying a subset of bounding-boxes instead of re-drawing new ones. Results on a car industry-related dataset and on the PASCAL VOC dataset show a consistent increase of up to 0.28 in the Intersection-over-Union of bounding-boxes with their desired ground-truths, while saving 30%-82% of human intervention time in either correcting or re-drawing inaccurate proposals.
7

Simchi-Levi, David, and Yunzong Xu. "Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits Under Realizability." Mathematics of Operations Research, December 9, 2021. http://dx.doi.org/10.1287/moor.2021.1193.

Abstract:
We consider the general (stochastic) contextual bandit problem under the realizability assumption, that is, the expected reward, as a function of contexts and actions, belongs to a general function class F. We design a fast and simple algorithm that achieves the statistically optimal regret with only O(log T) calls to an offline regression oracle across all T rounds. The number of oracle calls can be further reduced to O(log log T) if T is known in advance. Our results provide the first universal and optimal reduction from contextual bandits to offline regression, solving an important open problem in the contextual bandit literature. A direct consequence of our results is that any advances in offline regression immediately translate to contextual bandits, statistically and computationally. This leads to faster algorithms and improved regret guarantees for broader classes of contextual bandit problems.
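
The oracle-based reduction described here selects actions by converting the regression oracle's predictions into a sampling distribution. Below is a minimal sketch of the inverse-gap-weighting rule used in this family of algorithms; the learning-rate parameter gamma (which in the actual algorithms grows over rounds) and the function name are illustrative assumptions, not the paper's exact procedure.

    import numpy as np

    def inverse_gap_weighting(predicted_rewards, gamma):
        # Convert a regression oracle's reward predictions for K actions into a
        # sampling distribution: non-greedy actions get probability inversely
        # proportional to their predicted reward gap; the remaining mass goes to
        # the greedy action.
        predicted_rewards = np.asarray(predicted_rewards, dtype=float)
        K = len(predicted_rewards)
        best = int(np.argmax(predicted_rewards))
        probs = np.zeros(K)
        for a in range(K):
            if a != best:
                gap = predicted_rewards[best] - predicted_rewards[a]
                probs[a] = 1.0 / (K + gamma * gap)
        probs[best] = 1.0 - probs.sum()
        return probs
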
8

Soemers, Dennis, Tim Brys, Kurt Driessens, Mark Winands, and Ann Nowé. "Adapting to Concept Drift in Credit Card Transaction Data Streams Using Contextual Bandits and Decision Trees." Proceedings of the AAAI Conference on Artificial Intelligence 32, no. 1 (April 27, 2018). http://dx.doi.org/10.1609/aaai.v32i1.11411.

Abstract:
Credit card transactions predicted to be fraudulent by automated detection systems are typically handed over to human experts for verification. To limit costs, it is standard practice to select only the most suspicious transactions for investigation. We claim that a trade-off between exploration and exploitation is imperative to enable adaptation to changes in behavior (concept drift). Exploration consists of the selection and investigation of transactions with the purpose of improving predictive models, and exploitation consists of investigating transactions detected to be suspicious. Modeling the detection of fraudulent transactions as rewarding, we use an incremental Regression Tree learner to create clusters of transactions with similar expected rewards. This enables the use of a Contextual Multi-Armed Bandit (CMAB) algorithm to provide the exploration/exploitation trade-off. We introduce a novel variant of a CMAB algorithm that makes use of the structure of this tree, and use Semi-Supervised Learning to grow the tree using unlabeled data. The approach is evaluated on a real dataset and data generated by a simulator that adds concept drift by adapting the behavior of fraudsters to avoid detection. It outperforms frequently used offline models in terms of cumulative rewards, in particular in the presence of concept drift.
9

Cao, Junyu, and Wei Sun. "Tiered Assortment: Optimization and Online Learning." Management Science, October 4, 2023. http://dx.doi.org/10.1287/mnsc.2023.4940.

Abstract:
Due to the sheer number of available choices, online retailers frequently use tiered assortment to present products. In this case, groups of products are arranged across multiple pages or stages, and a customer clicks on “next” or “load more” to access them sequentially. Despite the prevalence of such assortments in practice, this topic has not received much attention in the existing literature. In this work, we focus on a sequential choice model that captures customers’ behavior when product recommendations are presented in tiers. We analyze multiple variants of tiered assortment optimization (TAO) problems by imposing “no-duplication” and capacity constraints. For the offline setting involving known customers’ preferences, we characterize the properties of the optimal tiered assortment and propose algorithms that improve the computational efficiency over the existing benchmarks. To the best of our knowledge, we are the first to study the online setting of the TAO problem. A unique characteristic of our online setting, absent from the one-tier MNL bandit, is partial feedback. Products in the lower priority tiers are not shown when a customer has purchased a product or has chosen to exit at an earlier tier. Such partial feedback, along with product interdependence across tiers, increases the learning complexity. For both the noncontextual and contextual problems, we propose online algorithms and quantify their respective regrets. Moreover, we are able to construct tighter uncertainty sets for model parameters in the contextual case and thus improve the performance. We demonstrate the efficacy of our proposed algorithms through numerical experiments. This paper was accepted by J. George Shanthikumar, data science. Supplemental Material: The data files and e-companion are available at https://doi.org/10.1287/mnsc.2023.4940.
10

Zeng, Yingyan, Xiaoyu Chen, and Ran Jin. "Ensemble Active Learning by Contextual Bandits for AI Incubation in Manufacturing." ACM Transactions on Intelligent Systems and Technology, October 25, 2023. http://dx.doi.org/10.1145/3627821.

Abstract:
An Industrial Cyber-physical System (ICPS) provides a digital foundation for data-driven decision-making by artificial intelligence (AI) models. However, the poor data quality (e.g., inconsistent distribution, imbalanced classes) of high-speed, large-volume data streams poses significant challenges to the online deployment of offline-trained AI models. As an alternative, updating AI models online based on streaming data enables continuous improvement and resilient modeling performance. However, for a supervised learning model (i.e., a base learner), it is labor-intensive to annotate all streaming samples to update the model. Hence, a data acquisition method is needed to select the data for annotation to ensure data quality while saving annotation efforts. In the literature, active learning methods have been proposed to acquire informative samples. Different acquisition criteria were developed for exploration of under-represented regions in the input variable space or exploitation of the well-represented regions for optimal estimation of base learners. However, it remains a challenge to balance the exploration-exploitation trade-off under different online annotation scenarios. On the other hand, an acquisition criterion learned by AI adapts itself to a scenario dynamically, but the ambiguous consideration of the trade-off limits its performance in frequently changing manufacturing contexts. To overcome these limitations, we propose an ensemble active learning method by contextual bandits (CbeAL). CbeAL incorporates a set of active learning agents (i.e., acquisition criteria) explicitly designed for exploration or exploitation by a weighted combination of their acquisition decisions. The weight of each agent will be dynamically adjusted based on the usefulness of its decisions to improve the performance of the base learner. With adaptive and explicit consideration of both objectives, CbeAL efficiently guides the data acquisition process by selecting informative samples to reduce the human annotation efforts. Furthermore, we characterize the exploration and exploitation capability of the proposed agents theoretically. The evaluation results in a numerical simulation study and a real case study demonstrate the effectiveness and efficiency of CbeAL in manufacturing process modeling of the ICPS.
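
As a rough illustration of how an ensemble of acquisition agents can be combined with dynamically adjusted weights, the sketch below applies a generic exponential-weights update that rewards agents whose acquisition decisions turned out to be useful. It is not CbeAL's exact updating rule; all names and the eta parameter are assumptions.

    import numpy as np

    def update_agent_weights(weights, agent_decisions, utility, eta=0.1):
        # Generic exponential-weights update for an ensemble of acquisition
        # agents: agents whose acquire/skip decisions proved useful for the base
        # learner gain weight. agent_decisions holds 0/1 votes, utility is the
        # realized gain (assumed to lie in [0, 1]) of the acquired sample.
        gains = np.asarray(agent_decisions, dtype=float) * utility
        new_weights = np.asarray(weights, dtype=float) * np.exp(eta * gains)
        return new_weights / new_weights.sum()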

Dissertations / Theses on the topic "Offline Contextual Bandit":

1

Sakhi, Otmane. "Offline Contextual Bandit : Theory and Large Scale Applications." Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAG011.

Abstract:
This thesis presents contributions to the problem of learning from logged interactions using the offline contextual bandit framework. We are interested in two related topics: (1) offline policy learning with performance certificates, and (2) fast and efficient policy learning applied to large-scale, real-world recommendation. For (1), we first leverage results from the distributionally robust optimisation framework to construct asymptotic, variance-sensitive bounds to evaluate policies' performances. These bounds lead to new, more practical learning objectives thanks to their composite nature and straightforward calibration. We then analyse the problem from the PAC-Bayesian perspective, and provide tighter, non-asymptotic bounds on the performance of policies. Our results motivate new strategies that offer performance certificates before deploying the policies online. The newly derived strategies rely on composite learning objectives that do not require additional tuning. For (2), we first propose a hierarchical Bayesian model, combining different signals, to efficiently estimate the quality of recommendation. We provide proper computational tools to scale the inference to real-world problems, and demonstrate empirically the benefits of the approach in multiple scenarios. We then address the question of accelerating common policy optimisation approaches, particularly focusing on recommendation problems with catalogues of millions of items. We derive optimisation routines, based on new gradient approximations, computed in logarithmic time with respect to the catalogue size. Our approach improves on common, linear-time gradient computations, yielding fast optimisation with no loss on the quality of the learned policies.
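
For context, offline (counterfactual) policy learning of the kind studied in this thesis typically maximizes an importance-weighted objective over policy parameters. The sketch below shows a generic clipped-IPS objective; the clipping constant, the policy_prob interface, and the function name are illustrative assumptions, not the thesis's actual learning objectives.

    import numpy as np

    def clipped_ips_objective(theta, contexts, actions, rewards, propensities,
                              policy_prob, clip=10.0):
        # Importance-weighted (clipped IPS) estimate of the value of the policy
        # parametrized by theta, the quantity typically maximized in offline
        # policy learning. policy_prob(theta, x, a) returns the policy's
        # probability of logged action a in context x; clip caps the weights.
        weights = np.array([min(policy_prob(theta, x, a) / p, clip)
                            for x, a, p in zip(contexts, actions, propensities)])
        return float(np.mean(weights * np.asarray(rewards)))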

Conference papers on the topic "Offline Contextual Bandit":

1

Li, Lihong, Wei Chu, John Langford, and Xuanhui Wang. "Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms." In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11). New York, New York, USA: ACM Press, 2011. http://dx.doi.org/10.1145/1935826.1935878.

2

Bouneffouf, Djallel, Srinivasan Parthasarathy, Horst Samulowitz, and Martin Wistuba. "Optimal Exploitation of Clustering and History Information in Multi-armed Bandit." In Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19). California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/279.

Abstract:
We consider the stochastic multi-armed bandit problem and the contextual bandit problem with historical observations and pre-clustered arms. The historical observations can contain any number of instances for each arm, and the pre-clustering information is a fixed clustering of arms provided as part of the input. We develop a variety of algorithms which incorporate this offline information effectively during the online exploration phase and derive their regret bounds. In particular, we develop the META algorithm which effectively hedges between two other algorithms: one which uses both historical observations and clustering, and another which uses only the historical observations. The former outperforms the latter when the clustering quality is good, and vice versa. Extensive experiments on synthetic and real world datasets on Warfarin drug dosage and web server selection for latency minimization validate our theoretical insights and demonstrate that META is a robust strategy for optimally exploiting the pre-clustering information.
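
As a simple illustration of how historical observations can be folded into an online bandit algorithm, the sketch below warm-starts plain UCB1 with per-arm offline reward lists. It is not the META algorithm from the paper; the class and parameter names are illustrative.

    import math

    class WarmStartUCB:
        # UCB1 warm-started with historical (offline) observations:
        # offline_rewards[a] is a list of past rewards observed for arm a.
        def __init__(self, offline_rewards):
            self.counts = [len(r) for r in offline_rewards]
            self.means = [sum(r) / len(r) if r else 0.0 for r in offline_rewards]
            self.t = sum(self.counts)

        def select(self):
            self.t += 1
            for a, c in enumerate(self.counts):
                if c == 0:
                    return a  # play each arm with no data once
            return max(range(len(self.counts)),
                       key=lambda a: self.means[a]
                       + math.sqrt(2.0 * math.log(self.t) / self.counts[a]))

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
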
3

Degroote, Hans. "Online Algorithm Selection." In Twenty-Sixth International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization, 2017. http://dx.doi.org/10.24963/ijcai.2017/746.

Abstract:
Algorithm selection approaches have achieved impressive performance improvements in many areas of AI. Most of the literature considers the offline algorithm selection problem, where the initial selection model is never updated after training. However, new data from running algorithms on instances becomes available while an algorithm selection method is in use. In this extended abstract, the online algorithm selection problem is considered. In online algorithm selection, additional data can be processed, and the selection model can change over time. This abstract details the online algorithm selection setting, shows that it is a contextual multi-armed bandit, proposes a solution methodology, and empirically validates it.
4

Januszewski, Piotr, Dominik Grzegorzek, and Paweł Czarnul. "Dataset Characteristics and Their Impact on Offline Policy Learning of Contextual Multi-Armed Bandits." In 16th International Conference on Agents and Artificial Intelligence. SCITEPRESS - Science and Technology Publications, 2024. http://dx.doi.org/10.5220/0012311000003636.

5

Ameko, Mawulolo K., Miranda L. Beltzer, Lihua Cai, Mehdi Boukhechba, Bethany A. Teachman, and Laura E. Barnes. "Offline Contextual Multi-armed Bandits for Mobile Health Interventions: A Case Study on Emotion Regulation." In RecSys '20: Fourteenth ACM Conference on Recommender Systems. New York, NY, USA: ACM, 2020. http://dx.doi.org/10.1145/3383313.3412244.

