Journal articles on the topic 'Stochastic Multi-armed Bandit'


Consult the top 50 journal articles for your research on the topic 'Stochastic Multi-armed Bandit.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Xiong, Guojun, and Jian Li. "Decentralized Stochastic Multi-Player Multi-Armed Walking Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 9 (June 26, 2023): 10528–36. http://dx.doi.org/10.1609/aaai.v37i9.26251.

Abstract:
Multi-player multi-armed bandit is an increasingly relevant decision-making problem, motivated by applications to cognitive radio systems. Most research on this problem focuses exclusively on settings where players have full access to all arms and receive no reward when pulling the same arm. Hence all players solve the same bandit problem with the goal of maximizing their cumulative reward. However, these settings neglect several important factors in many real-world applications, where players have limited access to a dynamic local subset of arms (i.e., an arm could sometimes be "walking" and not accessible to the player). To this end, this paper proposes a multi-player multi-armed walking bandits model, aiming to address the aforementioned modeling issues. The goal is still to maximize the cumulative reward, but players can only pull arms from their local subset and only collect a full reward if no other player pulls the same arm. We adopt the Upper Confidence Bound (UCB) approach to deal with the exploration-exploitation tradeoff and employ distributed optimization techniques to properly handle collisions. By carefully integrating these two techniques, we propose a decentralized algorithm with a near-optimal regret guarantee that can be easily implemented to obtain competitive empirical performance.
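To illustrate the restricted-availability setting described above, here is a minimal, hypothetical sketch of a UCB-style selection rule applied only to a player's currently available ("non-walking") arms; it is not the authors' decentralized algorithm, and the function and variable names are illustrative.

```python
import math

def select_from_available(t, available, counts, mean_estimates):
    """Illustrative UCB-style choice restricted to the arms currently
    available to this player; untried available arms are played first."""
    untried = [a for a in available if counts[a] == 0]
    if untried:
        return untried[0]
    return max(available,
               key=lambda a: mean_estimates[a]
               + math.sqrt(2.0 * math.log(t) / counts[a]))
```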
2

Ciucanu, Radu, Pascal Lafourcade, Gael Marcadet, and Marta Soare. "SAMBA: A Generic Framework for Secure Federated Multi-Armed Bandits." Journal of Artificial Intelligence Research 73 (February 23, 2022): 737–65. http://dx.doi.org/10.1613/jair.1.13163.

Abstract:
The multi-armed bandit is a reinforcement learning model where a learning agent repeatedly chooses an action (pulls a bandit arm) and the environment responds with a stochastic outcome (reward) coming from an unknown distribution associated with the chosen arm. Bandits have a wide range of applications, such as Web recommendation systems. We address the cumulative reward maximization problem in a secure federated learning setting, where multiple data owners keep their data stored locally and collaborate under the coordination of a central orchestration server. We rely on cryptographic schemes and propose Samba, a generic framework for Secure federAted Multi-armed BAndits. Each data owner has data associated with a bandit arm and the bandit algorithm has to sequentially select which data owner is solicited at each time step. We instantiate Samba for five bandit algorithms. We show that Samba returns the same cumulative reward as the non-secure versions of the bandit algorithms, while satisfying formally proven security properties. We also show that the overhead due to cryptographic primitives is linear in the size of the input, which is confirmed by our proof-of-concept implementation.
3

Wan, Zongqi, Zhijie Zhang, Tongyang Li, Jialin Zhang, and Xiaoming Sun. "Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 8 (June 26, 2023): 10087–94. http://dx.doi.org/10.1609/aaai.v37i8.26202.

Abstract:
Multi-armed bandits (MAB) and stochastic linear bandits (SLB) are important models in reinforcement learning, and it is well known that classical algorithms for bandits with time horizon T suffer a regret of at least order √T. In this paper, we study MAB and SLB with quantum reward oracles and propose quantum algorithms for both models with regret of order polylog(T), exponentially improving the dependence on T. To the best of our knowledge, this is the first provable quantum speedup for regrets of bandit problems and, in general, for exploitation in reinforcement learning. Compared to previous literature on quantum exploration algorithms for MAB and reinforcement learning, our quantum input model is simpler and only assumes quantum oracles for each individual arm.
4

Lesage-Landry, Antoine, and Joshua A. Taylor. "The Multi-Armed Bandit With Stochastic Plays." IEEE Transactions on Automatic Control 63, no. 7 (July 2018): 2280–86. http://dx.doi.org/10.1109/tac.2017.2765501.

5

Esfandiari, Hossein, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. "Regret Bounds for Batched Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 8 (May 18, 2021): 7340–48. http://dx.doi.org/10.1609/aaai.v35i8.16901.

Abstract:
We present simple algorithms for batched stochastic multi-armed bandit and batched stochastic linear bandit problems. We prove bounds for their expected regrets that improve and extend the best known regret bounds of Gao, Han, Ren, and Zhou (NeurIPS 2019), for any number of batches. In particular, our algorithms in both settings achieve the optimal expected regrets by using only a logarithmic number of batches. We also study the batched adversarial multi-armed bandit problem for the first time and provide the optimal regret, up to logarithmic factors, of any algorithm with predetermined batch sizes.
6

Dzhoha, A. S. "Sequential resource allocation in a stochastic environment: an overview and numerical experiments." Bulletin of Taras Shevchenko National University of Kyiv. Series: Physics and Mathematics, no. 3 (2021): 13–25. http://dx.doi.org/10.17721/1812-5409.2021/3.1.

Abstract:
In this paper, we consider policies for sequential resource allocation under the multi-armed bandit problem in a stochastic environment. In this model, an agent sequentially selects an action from a given set and the environment reveals a reward in return. In the stochastic setting, each action is associated with a probability distribution whose parameters are not known in advance. The agent makes decisions based on the history of chosen actions and obtained rewards. The objective is to maximize the total cumulative reward, which is equivalent to loss minimization. We provide a brief overview of sequential analysis and of how the multi-armed bandit problem arose as a formulation within sequential resource allocation theory. A classification of multi-armed bandit problems is given, along with an analysis of existing policies for the stochastic setting. Two different approaches to the multi-armed bandit problem are presented: in the frequentist view, a confidence interval is used to express the exploration-exploitation trade-off; in the Bayesian approach, the parameter to be estimated is treated as a random variable. We also show how this model can be formulated as a Markov decision process. Finally, we provide numerical experiments to study the effectiveness of these policies.
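As a concrete reference point for the frequentist, confidence-interval approach surveyed here, the following is a minimal UCB1 sketch; the Bernoulli toy environment and horizon are illustrative assumptions, not material from the paper.

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """Minimal UCB1 loop; pull(arm) must return a reward in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                      # play every arm once first
        else:
            arm = max(range(n_arms),         # empirical mean + confidence bonus
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts, sums

# Toy Bernoulli environment with illustrative arm means.
means = [0.3, 0.5, 0.7]
counts, sums = ucb1(lambda a: 1.0 if random.random() < means[a] else 0.0,
                    n_arms=len(means), horizon=10000)
print(counts)  # the best arm should dominate the pull counts
```

The Bayesian alternative mentioned in the abstract (e.g., Thompson sampling) would instead draw an index for each arm from its posterior, such as a Beta distribution for Bernoulli rewards, and play the arm with the largest draw.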
7

Juditsky, A., A. V. Nazin, A. B. Tsybakov, and N. Vayatis. "Gap-free Bounds for Stochastic Multi-Armed Bandit." IFAC Proceedings Volumes 41, no. 2 (2008): 11560–63. http://dx.doi.org/10.3182/20080706-5-kr-1001.01959.

8

Allesiardo, Robin, Raphaël Féraud, and Odalric-Ambrym Maillard. "The non-stationary stochastic multi-armed bandit problem." International Journal of Data Science and Analytics 3, no. 4 (March 30, 2017): 267–83. http://dx.doi.org/10.1007/s41060-017-0050-5.

9

Huo, Xiaoguang, and Feng Fu. "Risk-aware multi-armed bandit problem with application to portfolio selection." Royal Society Open Science 4, no. 11 (November 2017): 171377. http://dx.doi.org/10.1098/rsos.171377.

Abstract:
Sequential portfolio selection has attracted increasing interest in the machine learning and quantitative finance communities in recent years. As a mathematical framework for reinforcement learning policies, the stochastic multi-armed bandit problem addresses the primary difficulty in sequential decision-making under uncertainty, namely the exploration versus exploitation dilemma, and therefore provides a natural connection to portfolio selection. In this paper, we incorporate risk awareness into the classic multi-armed bandit setting and introduce an algorithm to construct portfolios. Through filtering assets based on the topological structure of the financial market and combining the optimal multi-armed bandit policy with the minimization of a coherent risk measure, we achieve a balance between risk and return.
10

Xu, Lily, Elizabeth Bondi, Fei Fang, Andrew Perrault, Kai Wang, and Milind Tambe. "Dual-Mandate Patrols: Multi-Armed Bandits for Green Security." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 17 (May 18, 2021): 14974–82. http://dx.doi.org/10.1609/aaai.v35i17.17757.

Abstract:
Conservation efforts in green security domains to protect wildlife and forests are constrained by the limited availability of defenders (i.e., patrollers), who must patrol vast areas to protect from attackers (e.g., poachers or illegal loggers). Defenders must choose how much time to spend in each region of the protected area, balancing exploration of infrequently visited regions and exploitation of known hotspots. We formulate the problem as a stochastic multi-armed bandit, where each action represents a patrol strategy, enabling us to guarantee the rate of convergence of the patrolling policy. However, a naive bandit approach would compromise short-term performance for long-term optimality, resulting in animals poached and forests destroyed. To speed up performance, we leverage smoothness in the reward function and decomposability of actions. We show a synergy between Lipschitz-continuity and decomposition as each aids the convergence of the other. In doing so, we bridge the gap between combinatorial and Lipschitz bandits, presenting a no-regret approach that tightens existing guarantees while optimizing for short-term performance. We demonstrate that our algorithm, LIZARD, improves performance on real-world poaching data from Cambodia.
11

Patil, Vishakha, Ganesh Ghalme, Vineet Nair, and Y. Narahari. "Achieving Fairness in the Stochastic Multi-Armed Bandit Problem." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 5379–86. http://dx.doi.org/10.1609/aaai.v34i04.5986.

Abstract:
We study an interesting variant of the stochastic multi-armed bandit problem, which we call the Fair-MAB problem, where, in addition to the objective of maximizing the sum of expected rewards, the algorithm also needs to ensure that at any time, each arm is pulled at least a pre-specified fraction of times. We investigate the interplay between learning and fairness in terms of a pre-specified vector denoting the fractions of guaranteed pulls. We define a fairness-aware regret, which we call r-Regret, that takes into account the above fairness constraints and extends the conventional notion of regret in a natural way. Our primary contribution is to obtain a complete characterization of a class of Fair-MAB algorithms via two parameters: the unfairness tolerance and the learning algorithm used as a black-box. For this class of algorithms, we provide a fairness guarantee that holds uniformly over time, irrespective of the choice of the learning algorithm. Further, when the learning algorithm is UCB1, we show that our algorithm achieves constant r-Regret for a large enough time horizon. Finally, we analyze the cost of fairness in terms of the conventional notion of regret. We conclude by experimentally validating our theoretical results.
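A toy sketch of the general idea of guaranteeing each arm a minimum fraction of pulls while otherwise deferring to a black-box learner is shown below; this illustrates the concept only and is not the Fair-MAB algorithm analyzed in the paper.

```python
def fair_select(t, counts, fractions, base_select):
    """Illustrative rule: if some arm has fallen behind its guaranteed
    fraction of pulls, play the most-behind such arm; otherwise defer
    to the black-box bandit learner."""
    behind = [a for a in range(len(counts)) if counts[a] < fractions[a] * t]
    if behind:
        return min(behind, key=lambda a: counts[a] - fractions[a] * t)
    return base_select()
```

Here base_select would be the black-box learning algorithm (for example UCB1) and fractions[a] the pre-specified guaranteed pull fraction of arm a.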
12

Jain, Shweta, Satyanath Bhat, Ganesh Ghalme, Divya Padmanabhan, and Y. Narahari. "Mechanisms with learning for stochastic multi-armed bandit problems." Indian Journal of Pure and Applied Mathematics 47, no. 2 (June 2016): 229–72. http://dx.doi.org/10.1007/s13226-016-0186-3.

13

Cowan, Wesley, and Michael N. Katehakis. "MULTI-ARMED BANDITS UNDER GENERAL DEPRECIATION AND COMMITMENT." Probability in the Engineering and Informational Sciences 29, no. 1 (October 10, 2014): 51–76. http://dx.doi.org/10.1017/s0269964814000217.

Abstract:
Generally, the multi-armed bandit problem has been studied under the setting that, at each time step over an infinite horizon, a controller chooses to activate a single process or bandit out of a finite collection of independent processes (statistical experiments, populations, etc.) for a single period, receiving a reward that is a function of the activated process, and in doing so advancing the chosen process. Classically, rewards are discounted by a constant factor β ∈ (0, 1) per round. In this paper, we present a solution to the problem, with potentially non-Markovian, uncountable state space reward processes, under a framework in which, first, the discount factors may be non-uniform and vary over time, and second, the periods of activation of each bandit may not be fixed or uniform, subject instead to a possibly stochastic duration of activation before a change to a different bandit is allowed. The solution is based on generalized restart-in-state indices, and it utilizes a view of the problem not as “decisions over state space” but rather “decisions over time”.
14

Dunn, R. T., and K. D. Glazebrook. "The performance of index-based policies for bandit problems with stochastic machine availability." Advances in Applied Probability 33, no. 2 (June 2001): 365–90. http://dx.doi.org/10.1017/s0001867800010843.

Abstract:
We consider generalisations of two classical stochastic scheduling models, namely the discounted branching bandit and the discounted multi-armed bandit, to the case where the collection of machines available for processing is itself a stochastic process. Under rather mild conditions on the machine availability process we obtain performance guarantees for a range of controls based on Gittins indices. Various forms of asymptotic optimality are established for index-based limit policies as the discount rate approaches 0.
15

Bubeck, Sébastien. "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems." Foundations and Trends® in Machine Learning 5, no. 1 (2012): 1–122. http://dx.doi.org/10.1561/2200000024.

16

Zayas-Cabán, Gabriel, Stefanus Jasin, and Guihua Wang. "An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits." Advances in Applied Probability 51, no. 03 (September 2019): 745–72. http://dx.doi.org/10.1017/apr.2019.29.

Abstract:
We propose an asymptotically optimal heuristic, which we term randomized assignment control (RAC) for a restless multi-armed bandit problem with discrete-time and finite states. It is constructed using a linear programming relaxation of the original stochastic control formulation. In contrast to most of the existing literature, we consider a finite-horizon problem with multiple actions and time-dependent (i.e. nonstationary) upper bound on the number of bandits that can be activated at each time period; indeed, our analysis can also be applied in the setting with nonstationary transition matrix and nonstationary cost function. The asymptotic setting is obtained by letting the number of bandits and other related parameters grow to infinity. Our main contribution is that the asymptotic optimality of RAC in this general setting does not require indexability properties or the usual stability conditions of the underlying Markov chain (e.g. unichain) or fluid approximation (e.g. global stable attractor). Moreover, our multi-action setting is not restricted to the usual dominant action concept. Finally, we show that RAC is also asymptotically optimal for a dynamic population, where bandits can randomly arrive and depart the system.
17

Roy Chaudhuri, Arghya, and Shivaram Kalyanakrishnan. "Regret Minimisation in Multi-Armed Bandits Using Bounded Arm Memory." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 06 (April 3, 2020): 10085–92. http://dx.doi.org/10.1609/aaai.v34i06.6566.

Abstract:
Regret minimisation in stochastic multi-armed bandits is a well-studied problem, for which several optimal algorithms have been proposed. Such algorithms depend on (sufficient statistics of) the empirical reward distributions of the arms to decide which arm to pull next. In this paper, we consider the design of algorithms that are constrained to store statistics from only a bounded number of arms. For bandits with a finite set of arms, we derive a sub-linear upper bound on the regret that decreases with the “arm memory” size M. For instances with a large, possibly infinite, set of arms, we show a sub-linear bound on the quantile regret. Our problem formulation generalises that of Liau et al. (2018), who fix M = O(1), and so do not obtain bounds that depend on M. More importantly, our algorithms keep exploration and exploitation tightly coupled, without a dedicated exploration phase as employed by Liau et al. (2018). Although this choice makes our analysis harder, it leads to much-improved practical performance. For bandits with a large number of arms and no known structure on the rewards, our algorithms serve as a viable option. Unlike many other approaches to restrict the memory of bandit algorithms, our algorithms do not need any additional technical assumptions.
18

O'Flaherty, Brendan. "Some results on two-armed bandits when both projects vary." Journal of Applied Probability 26, no. 3 (September 1989): 655–58. http://dx.doi.org/10.2307/3214424.

Abstract:
In the multi-armed bandit problem, the decision-maker must choose each period a single project to work on. From the chosen project she receives an immediate reward that depends on the current state of the project. Next period the chosen project makes a stochastic transition to a new state, but projects that are not chosen remain in the same state. What happens in a two-armed bandit context if projects not chosen do not remain in the same state? We derive two sufficient conditions for the optimal policy to be myopic: either the transition function for chosen projects has in a certain sense uniformly stronger stochastic dominance than the transition function for unchosen projects, or both transition processes are normal martingales, the variance of which is independent of the history of process choices.
19

O'Flaherty, Brendan. "Some results on two-armed bandits when both projects vary." Journal of Applied Probability 26, no. 03 (September 1989): 655–58. http://dx.doi.org/10.1017/s0021900200038262.

Abstract:
In the multi-armed bandit problem, the decision-maker must choose each period a single project to work on. From the chosen project she receives an immediate reward that depends on the current state of the project. Next period the chosen project makes a stochastic transition to a new state, but projects that are not chosen remain in the same state. What happens in a two-armed bandit context if projects not chosen do not remain in the same state? We derive two sufficient conditions for the optimal policy to be myopic: either the transition function for chosen projects has in a certain sense uniformly stronger stochastic dominance than the transition function for unchosen projects, or both transition processes are normal martingales, the variance of which is independent of the history of process choices.
20

Wang, Siwei, Haoyun Wang, and Longbo Huang. "Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 11 (May 18, 2021): 10210–17. http://dx.doi.org/10.1609/aaai.v35i11.17224.

Abstract:
We study the multi-armed bandit (MAB) problem with composite and anonymous feedback. In this model, the reward of pulling an arm spreads over a period of time (we call this period the reward interval) and the player receives partial rewards of the action, convoluted with rewards from pulling other arms, successively. Existing results on this model require prior knowledge about the reward interval size as an input to their algorithms. In this paper, we propose adaptive algorithms for both the stochastic and the adversarial cases, without requiring any prior information about the reward interval. For the stochastic case, we prove that our algorithm guarantees a regret that matches the lower bounds (in order). For the adversarial case, we propose the first algorithm to jointly handle a non-oblivious adversary and unknown reward interval size. We also conduct simulations based on a real-world dataset. The results show that our algorithms outperform existing benchmarks.
21

Zuo, Jinhang, Xiaoxi Zhang, and Carlee Joe-Wong. "Observe Before Play: Multi-Armed Bandit with Pre-Observations." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 7023–30. http://dx.doi.org/10.1609/aaai.v34i04.6187.

Abstract:
We consider the stochastic multi-armed bandit (MAB) problem in a setting where a player can pay to pre-observe arm rewards before playing an arm in each round. Apart from the usual trade-off between exploring new arms to find the best one and exploiting the arm believed to offer the highest reward, we encounter an additional dilemma: pre-observing more arms gives a higher chance to play the best one, but incurs a larger cost. For the single-player setting, we design an Observe-Before-Play Upper Confidence Bound (OBP-UCB) algorithm for K arms with Bernoulli rewards, and prove a T-round regret upper bound of O(K² log T). In the multi-player setting, collisions will occur when players select the same arm to play in the same round. We design a centralized algorithm, C-MP-OBP, and prove that its T-round regret relative to an offline greedy strategy is upper bounded by O(K⁴/M² log T) for K arms and M players. We also propose distributed versions of the C-MP-OBP policy, called D-MP-OBP and D-MP-Adapt-OBP, achieving logarithmic regret with respect to collision-free target policies. Experiments on synthetic data and wireless channel traces show that C-MP-OBP and D-MP-OBP outperform random heuristics and offline optimal policies that do not allow pre-observations.
22

Namba, Hiroyuki. "Non-stationary Stochastic Multi-armed Bandit Problems with External Information on Stationarity." Transactions of the Japanese Society for Artificial Intelligence 36, no. 3 (May 1, 2021): D-K84_1–11. http://dx.doi.org/10.1527/tjsai.36-3_d-k84.

23

Auer, Peter, and Ronald Ortner. "UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem." Periodica Mathematica Hungarica 61, no. 1-2 (September 2010): 55–65. http://dx.doi.org/10.1007/s10998-010-3055-6.

24

Niño-Mora, José. "Multi-Gear Bandits, Partial Conservation Laws, and Indexability." Mathematics 10, no. 14 (July 18, 2022): 2497. http://dx.doi.org/10.3390/math10142497.

Abstract:
This paper considers what we propose to call multi-gear bandits, which are Markov decision processes modeling a generic dynamic and stochastic project fueled by a single resource and which admit multiple actions representing gears of operation naturally ordered by their increasing resource consumption. The optimal operation of a multi-gear bandit aims to strike a balance between project performance costs or rewards and resource usage costs, which depend on the resource price. A computationally convenient and intuitive optimal solution is available when such a model is indexable, meaning that its optimal policies are characterized by a dynamic allocation index (DAI), a function of state–action pairs representing critical resource prices. Motivated by the lack of general indexability conditions and efficient index-computing schemes, and focusing on the infinite-horizon finite-state and -action discounted case, we present a verification theorem ensuring that, if a model satisfies two proposed PCL-indexability conditions with respect to a postulated family of structured policies, then it is indexable and such policies are optimal, with its DAI being given by a marginal productivity index computed by a downshift adaptive-greedy algorithm in AN steps, with A+1 actions and N states. The DAI is further used as the basis of a new index policy for the multi-armed multi-gear bandit problem.
25

Ou, Mingdong, Nan Li, Cheng Yang, Shenghuo Zhu, and Rong Jin. "Semi-Parametric Sampling for Stochastic Bandits with Many Arms." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 7933–40. http://dx.doi.org/10.1609/aaai.v33i01.33017933.

Abstract:
We consider the stochastic bandit problem with a large candidate arm set. In this setting, classic multi-armed bandit algorithms, which assume independence among arms and adopt a non-parametric reward model, are inefficient, due to the large number of arms. By exploiting arm correlations based on a parametric reward model with arm features, contextual bandit algorithms are more efficient, but they can also suffer from large regret in practical applications, due to the reward estimation bias from a mis-specified model assumption or incomplete features. In this paper, we propose a novel Bayesian framework, called Semi-Parametric Sampling (SPS), for this problem, which employs a semi-parametric function as the reward model. Specifically, the parametric part of SPS, which models the expected reward as a parametric function of the arm features, can efficiently eliminate poor arms from the candidate set. The non-parametric part of SPS, which adopts a non-parametric reward model, revises the parametric estimation to avoid estimation bias, especially on the remaining candidate arms. We give an implementation of SPS, Linear SPS (LSPS), which utilizes a linear function as the parametric part. In a semi-parametric environment, theoretical analysis shows that LSPS achieves a better regret bound (i.e., Õ(√(N^(1−α)) d^α √T) with α ∈ [0, 1]) than existing approaches. Also, experiments demonstrate the superiority of the proposed approach.
26

Feldman, Zohar, and Carmel Domshlak. "On MABs and Separation of Concerns in Monte-Carlo Planning for MDPs." Proceedings of the International Conference on Automated Planning and Scheduling 24 (May 10, 2014): 120–27. http://dx.doi.org/10.1609/icaps.v24i1.13631.

Abstract:
Linking online planning for MDPs with their special case of stochastic multi-armed bandit problems, we analyze three state-of-the-art Monte-Carlo tree search algorithms: UCT, BRUE, and MaxUCT. Using the outcome, we (i) introduce two new MCTS algorithms, MaxBRUE, which combines uniform sampling with Bellman backups, and MpaUCT, which combines UCB1 with a novel backup procedure, (ii) analyze them formally and empirically, and (iii) show how MCTS algorithms can be further stratified by an exploration control mechanism that improves their empirical performance without harming the formal guarantees.
27

Ottens, Brammert, Christos Dimitrakakis, and Boi Faltings. "DUCT: An Upper Confidence Bound Approach to Distributed Constraint Optimization Problems." Proceedings of the AAAI Conference on Artificial Intelligence 26, no. 1 (September 20, 2021): 528–34. http://dx.doi.org/10.1609/aaai.v26i1.8129.

Abstract:
The Upper Confidence Bounds (UCB) algorithm is a well-known near-optimal strategy for the stochastic multi-armed bandit problem. Its extensions to trees, such as the Upper Confidence Tree (UCT) algorithm, have resulted in good solutions to the problem of Go. This paper introduces DUCT, a distributed algorithm inspired by UCT, for solving Distributed Constraint Optimization Problems (DCOP). Bounds on the solution quality are provided, and experiments show that, compared to existing DCOP approaches, DUCT is able to solve very large problems much more efficiently, or to find significantly higher quality solutions.
28

Guan, Ziwei, Kaiyi Ji, Donald J. Bucci Jr., Timothy Y. Hu, Joseph Palombo, Michael Liston, and Yingbin Liang. "Robust Stochastic Bandit Algorithms under Probabilistic Unbounded Adversarial Attack." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 4036–43. http://dx.doi.org/10.1609/aaai.v34i04.5821.

Abstract:
The multi-armed bandit formalism has been extensively studied under various attack models, in which an adversary can modify the reward revealed to the player. Previous studies focused on scenarios where the attack value either is bounded at each round or has a vanishing probability of occurrence. These models do not capture powerful adversaries that can catastrophically perturb the revealed reward. This paper investigates the attack model where an adversary attacks with a certain probability at each round, and its attack value can be arbitrary and unbounded if it attacks. Furthermore, the attack value does not necessarily follow a statistical distribution. We propose a novel sample median-based and exploration-aided UCB algorithm (called med-E-UCB) and a median-based ϵ-greedy algorithm (called med-ϵ-greedy). Both of these algorithms are provably robust to the aforementioned attack model. More specifically we show that both algorithms achieve O(log T) pseudo-regret (i.e., the optimal regret without attacks). We also provide a high probability guarantee of O(log T) regret with respect to random rewards and random occurrence of attacks. These bounds are achieved under arbitrary and unbounded reward perturbation as long as the attack probability does not exceed a certain constant threshold. We provide multiple synthetic simulations of the proposed algorithms to verify these claims and showcase the inability of existing techniques to achieve sublinear regret. We also provide experimental results of the algorithm operating in a cognitive radio setting using multiple software-defined radios.
29

Cowan, Wesley, and Michael N. Katehakis. "EXPLORATION–EXPLOITATION POLICIES WITH ALMOST SURE, ARBITRARILY SLOW GROWING ASYMPTOTIC REGRET." Probability in the Engineering and Informational Sciences 34, no. 3 (January 26, 2019): 406–28. http://dx.doi.org/10.1017/s0269964818000529.

Abstract:
The purpose of this paper is to provide further understanding into the structure of the sequential allocation (“stochastic multi-armed bandit”) problem by establishing probability one finite horizon bounds and convergence rates for the sample regret associated with two simple classes of allocation policies. For any slowly increasing function g, subject to mild regularity constraints, we construct two policies (the g-Forcing, and the g-Inflated Sample Mean) that achieve a measure of regret of order O(g(n)) almost surely as n → ∞, bounded from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function g effectively controls the “exploration” of the classical “exploration/exploitation” tradeoff.
30

Niño-Mora, José. "Markovian Restless Bandits and Index Policies: A Review." Mathematics 11, no. 7 (March 28, 2023): 1639. http://dx.doi.org/10.3390/math11071639.

Abstract:
The restless multi-armed bandit problem is a paradigmatic modeling framework for optimal dynamic priority allocation in stochastic models of wide-ranging applications that has been widely investigated and applied since its inception in a seminal paper by Whittle in the late 1980s. The problem has generated a vast and fast-growing literature from which a significant sample is thematically organized and reviewed in this paper. While the main focus is on priority-index policies due to their intuitive appeal, tractability, asymptotic optimality properties, and often strong empirical performance, other lines of work are also reviewed. Theoretical and algorithmic developments are discussed, along with diverse applications. The main goals are to highlight the remarkable breadth of work that has been carried out on the topic and to stimulate further research in the field.
31

Papagiannis, Tasos, Georgios Alexandridis, and Andreas Stafylopatis. "Pruning Stochastic Game Trees Using Neural Networks for Reduced Action Space Approximation." Mathematics 10, no. 9 (May 1, 2022): 1509. http://dx.doi.org/10.3390/math10091509.

Abstract:
Monte Carlo Tree Search has proved to be very efficient in the broad domain of Game AI, though it suffers from high dimensionality in cases of large branching factors. Several pruning techniques have been proposed to tackle this problem, most of which require explicit domain knowledge. In this study, an approach using neural networks to determine the number of actions to be pruned, depending on the iterations run and the total number of possible actions, is proposed. Multi-armed bandit simulations with the UCB1 formula are employed to generate suitable datasets for the networks’ training and a specifically designed process is followed to select the best combination of the number of iterations and actions for pruning. Two pruning Monte Carlo Tree Search variants are investigated, based on different actions’ expected rewards’ distributions, and they are evaluated in the collectible card game Hearthstone. The proposed technique improves the performance of the Monte Carlo Tree Search algorithm in different setups of computational limitations regarding the available number of tree search iterations and is significantly boosted when combined with supervised learning trained-state value predicting models.
32

György, A., and L. Kocsis. "Efficient Multi-Start Strategies for Local Search Algorithms." Journal of Artificial Intelligence Research 41 (July 29, 2011): 407–44. http://dx.doi.org/10.1613/jair.3313.

Abstract:
Local search algorithms applied to optimization problems often suffer from getting trapped in a local optimum. The common solution for this deficiency is to restart the algorithm when no progress is observed. Alternatively, one can start multiple instances of a local search algorithm, and allocate computational resources (in particular, processing time) to the instances depending on their behavior. Hence, a multi-start strategy has to decide (dynamically) when to allocate additional resources to a particular instance and when to start new instances. In this paper we propose multi-start strategies motivated by works on multi-armed bandit problems and Lipschitz optimization with an unknown constant. The strategies continuously estimate the potential performance of each algorithm instance by supposing a convergence rate of the local search algorithm up to an unknown constant, and in every phase allocate resources to those instances that could converge to the optimum for a particular range of the constant. Asymptotic bounds are given on the performance of the strategies. In particular, we prove that at most a quadratic increase in the number of times the target function is evaluated is needed to achieve the performance of a local search algorithm started from the attraction region of the optimum. Experiments are provided using SPSA (Simultaneous Perturbation Stochastic Approximation) and k-means as local search algorithms, and the results indicate that the proposed strategies work well in practice, and, in all cases studied, need only logarithmically more evaluations of the target function as opposed to the theoretically suggested quadratic increase.
33

Trovo, Francesco, Stefano Paladino, Marcello Restelli, and Nicola Gatti. "Sliding-Window Thompson Sampling for Non-Stationary Settings." Journal of Artificial Intelligence Research 68 (May 26, 2020): 311–64. http://dx.doi.org/10.1613/jair.1.11407.

Abstract:
Multi-Armed Bandit (MAB) techniques have been successfully applied to many classes of sequential decision problems in the past decades. However, non-stationary settings -- very common in real-world applications -- received little attention so far, and theoretical guarantees on the regret are known only for some frequentist algorithms. In this paper, we propose an algorithm, namely Sliding-Window Thompson Sampling (SW-TS), for nonstationary stochastic MAB settings. Our algorithm is based on Thompson Sampling and exploits a sliding-window approach to tackle, in a unified fashion, two different forms of non-stationarity studied separately so far: abruptly changing and smoothly changing. In the former, the reward distributions are constant during sequences of rounds, and their change may be arbitrary and happen at unknown rounds, while, in the latter, the reward distributions smoothly evolve over rounds according to unknown dynamics. Under mild assumptions, we provide regret upper bounds on the dynamic pseudo-regret of SW-TS for the abruptly changing environment, for the smoothly changing one, and for the setting in which both the non-stationarity forms are present. Furthermore, we empirically show that SW-TS dramatically outperforms state-of-the-art algorithms even when the forms of non-stationarity are taken separately, as previously studied in the literature.
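For intuition, a minimal sliding-window Thompson sampling sketch for Bernoulli rewards might look like the following; the window size, Beta(1, 1) priors, and reward model are illustrative assumptions rather than the authors' implementation.

```python
import random
from collections import deque

def sliding_window_thompson(pull, n_arms, horizon, window=1000):
    """Thompson sampling with Beta posteriors computed only from the
    last `window` observations; pull(arm) must return 0 or 1."""
    history = deque()
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(horizon):
        arm = max(range(n_arms),
                  key=lambda a: random.betavariate(successes[a] + 1,
                                                   failures[a] + 1))
        reward = pull(arm)
        history.append((arm, reward))
        successes[arm] += reward
        failures[arm] += 1 - reward
        if len(history) > window:            # forget the oldest observation
            old_arm, old_reward = history.popleft()
            successes[old_arm] -= old_reward
            failures[old_arm] -= 1 - old_reward
```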
34

Killian, Jackson A., Arpita Biswas, Lily Xu, Shresth Verma, Vineet Nair, Aparna Taneja, Aparna Hegde, et al. "Robust Planning over Restless Groups: Engagement Interventions for a Large-Scale Maternal Telehealth Program." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 12 (June 26, 2023): 14295–303. http://dx.doi.org/10.1609/aaai.v37i12.26672.

Abstract:
In 2020, maternal mortality in India was estimated to be as high as 130 deaths per 100K live births, nearly twice the UN's target. To improve health outcomes, the non-profit ARMMAN sends automated voice messages to expecting and new mothers across India. However, 38% of mothers stop listening to these calls, missing critical preventative care information. To improve engagement, ARMMAN employs health workers to intervene by making service calls, but workers can only call a fraction of the 100K enrolled mothers. Partnering with ARMMAN, we model the problem of allocating limited interventions across mothers as a restless multi-armed bandit (RMAB), where the realities of large scale and model uncertainty present key new technical challenges. We address these with GROUPS, a double oracle–based algorithm for robust planning in RMABs with scalable grouped arms. Robustness over grouped arms requires several methodological advances. First, to adversarially select stochastic group dynamics, we develop a new method to optimize Whittle indices over transition probability intervals. Second, to learn group-level RMAB policy best responses to these adversarial environments, we introduce a weighted index heuristic. Third, we prove a key theoretical result that planning over grouped arms achieves the same minimax regret--optimal strategy as planning over individual arms, under a technical condition. Finally, using real-world data from ARMMAN, we show that GROUPS produces robust policies that reduce minimax regret by up to 50%, halving the number of preventable missed voice messages to connect more mothers with life-saving maternal health information.
35

WATANABE, Ryo, Junpei KOMIYAMA, Atsuyoshi NAKAMURA, and Mineichi KUDO. "KL-UCB-Based Policy for Budgeted Multi-Armed Bandits with Stochastic Action Costs." IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E100.A, no. 11 (2017): 2470–86. http://dx.doi.org/10.1587/transfun.e100.a.2470.

36

Youssef, Marie-Josepha, Venugopal V. Veeravalli, Joumana Farah, Charbel Abdel Nour, and Catherine Douillard. "Resource Allocation in NOMA-Based Self-Organizing Networks Using Stochastic Multi-Armed Bandits." IEEE Transactions on Communications 69, no. 9 (September 2021): 6003–17. http://dx.doi.org/10.1109/tcomm.2021.3092767.

37

Sledge, Isaac, and José Príncipe. "An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits." Entropy 20, no. 3 (February 28, 2018): 155. http://dx.doi.org/10.3390/e20030155.

38

Pokhrel, Shiva Raj, and Michel Mandjes. "Internet of Drones: Improving Multipath TCP over WiFi with Federated Multi-Armed Bandits for Limitless Connectivity." Drones 7, no. 1 (December 31, 2022): 30. http://dx.doi.org/10.3390/drones7010030.

Abstract:
We consider multipath TCP (MPTCP) flows over the data networking dynamics of IEEE 802.11ay for drone surveillance of areas using high-definition video streaming. Mobility-induced handoffs are critical in IEEE 802.11ay (because of the smaller coverage of mmWaves), which adversely affects the performance of such data streaming flows. As a result of the enhanced 802.11ay network events and features (triggered by beamforming, channel bonding, MIMO, mobility-induced handoffs, channel sharing, retransmissions, etc.), the time taken for packets to travel end-to-end in 802.11ay are inherently time-varying. Several fundamental assumptions inherent in stochastic TCP models, including Poisson arrivals of packets, Gaussian process, and parameter certainty, are challenged by the improved data traffic dynamics over IEEE 802.11ay networks. The MPTCP model’s state estimation differs largely from the actual network values. We develop a new data-driven stochastic framework to address current deficiencies of MPTCP models and design a foundational architecture for intelligent multipath scheduling (at the transport layer) considering lower layer (hybrid) beamforming. At the heart of our cross-layer architecture is an intelligent learning agent for actuating and interfacing, which learns from experience optimal packet cloning, scheduling, aggregation, and beamforming using successful features of multi-armed bandits and federated learning. We demonstrate that the proposed framework can estimate and optimize jointly (explore–exploit) and is more practicable for designing the next generation of low-delay and robust MPTCP models.
39

Painter, Michael, Bruno Lacerda, and Nick Hawes. "Convex Hull Monte-Carlo Tree-Search." Proceedings of the International Conference on Automated Planning and Scheduling 30 (June 1, 2020): 217–25. http://dx.doi.org/10.1609/icaps.v30i1.6664.

Abstract:
This work investigates Monte-Carlo planning for agents in stochastic environments, with multiple objectives. We propose the Convex Hull Monte-Carlo Tree-Search (CHMCTS) framework, which builds upon Trial Based Heuristic Tree Search and Convex Hull Value Iteration (CHVI), as a solution to multi-objective planning in large environments. Moreover, we consider how to pose the problem of approximating multi-objective planning solutions as a contextual multi-armed bandits problem, giving a principled motivation for how to select actions from the view of contextual regret. This leads us to the use of Contextual Zooming for action selection, yielding Zooming CHMCTS. We evaluate our algorithm using the Generalised Deep Sea Treasure environment, demonstrating that Zooming CHMCTS can achieve a sublinear contextual regret and scales better than CHVI on a given computational budget.
40

Gasnikov, A. V., E. A. Krymova, A. A. Lagunovskaya, I. N. Usmanova, and F. A. Fedorenko. "Stochastic online optimization. Single-point and multi-point non-linear multi-armed bandits. Convex and strongly-convex case." Automation and Remote Control 78, no. 2 (February 2017): 224–34. http://dx.doi.org/10.1134/s0005117917020035.

41

Ciucanu, Radu, Pascal Lafourcade, Marius Lombard-Platet, and Marta Soare. "Secure protocols for cumulative reward maximization in stochastic multi-armed bandits." Journal of Computer Security, February 2, 2022, 1–27. http://dx.doi.org/10.3233/jcs-210051.

Abstract:
We consider the problem of cumulative reward maximization in multi-armed bandits. We address the security concerns that occur when data and computations are outsourced to an honest-but-curious cloud, i.e., one that executes tasks dutifully, but tries to gain as much information as possible. We consider situations where the data used in bandit algorithms is sensitive and has to be protected, e.g., commercial or personal data. We rely on cryptographic schemes and propose UCB-MS, a secure multi-party protocol based on the UCB algorithm. We prove that UCB-MS computes the same cumulative reward as UCB while satisfying desirable security properties. In particular, cloud nodes cannot learn the cumulative reward or the sum of rewards for more than one arm. Moreover, by analyzing messages exchanged among cloud nodes, an external observer cannot learn the cumulative reward or the sum of rewards produced by some arm. We show that the overhead due to cryptographic primitives is linear in the size of the input. Our implementation confirms the linear-time behavior and the practical feasibility of our protocol, on both synthetic and real-world data.
42

Amakasu, Takashi, Nicolas Chauvet, Guillaume Bachelier, Serge Huant, Ryoichi Horisaki, and Makoto Naruse. "Conflict-free collective stochastic decision making by orbital angular momentum of photons through quantum interference." Scientific Reports 11, no. 1 (October 26, 2021). http://dx.doi.org/10.1038/s41598-021-00493-2.

Abstract:
In recent cross-disciplinary studies involving both optics and computing, single-photon-based decision-making has been demonstrated by utilizing the wave-particle duality of light to solve multi-armed bandit problems. Furthermore, entangled-photon-based decision-making has managed to solve a competitive multi-armed bandit problem in such a way that conflicts of decisions among players are avoided while ensuring equality. However, as these studies are based on the polarization of light, the number of available choices is limited to two, corresponding to two orthogonal polarization states. Here we propose a scalable principle to solve competitive decision-making situations by using the orbital angular momentum of photons based on its high dimensionality, which theoretically allows an unlimited number of arms. Moreover, by extending the Hong-Ou-Mandel effect to more than two states, we theoretically establish an experimental configuration able to generate multi-photon states with orbital angular momentum and conditions that provide conflict-free selections at every turn. We numerically examine total rewards regarding three-armed bandit problems, for which the proposed strategy accomplishes almost the theoretical maximum, which is greater than a conventional mixed strategy intending to realize Nash equilibrium. This is thanks to the quantum interference effect that achieves no-conflict selections, even in the exploring phase to find the best arms.
43

Immorlica, Nicole, Karthik Abinav Sankararaman, Robert Schapire, and Aleksandrs Slivkins. "Adversarial Bandits with Knapsacks." Journal of the ACM, August 18, 2022. http://dx.doi.org/10.1145/3557045.

Abstract:
We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic version, we pioneer the other extreme in which the outcomes can be chosen adversarially. This is a considerably harder problem, compared to both the stochastic version and the “classic” adversarial bandits, in that regret minimization is no longer feasible. Instead, the objective is to minimize the competitive ratio: the ratio of the benchmark reward to the algorithm’s reward. We design an algorithm with competitive ratio O(log T) relative to the best fixed distribution over actions, where T is the time horizon; we also prove a matching lower bound. The key conceptual contribution is a new perspective on the stochastic version of the problem. We suggest a new algorithm for the stochastic version, which builds on the framework of regret minimization in repeated games and admits a substantially simpler analysis compared to prior work. We then analyze this algorithm for the adversarial version, and use it as a subroutine to solve the latter. Our algorithm is the first “black-box reduction” from bandits to BwK: it takes an arbitrary bandit algorithm and uses it as a subroutine. We use this reduction to derive several extensions.
44

Zhou, Datong, and Claire Tomlin. "Budget-Constrained Multi-Armed Bandits With Multiple Plays." Proceedings of the AAAI Conference on Artificial Intelligence 32, no. 1 (April 29, 2018). http://dx.doi.org/10.1609/aaai.v32i1.11629.

Abstract:
We study the multi-armed bandit problem with multiple plays and a budget constraint for both the stochastic and the adversarial setting. At each round, exactly K out of N possible arms have to be played (with 1 ≤ K ≤ N). In addition to observing the individual rewards for each arm played, the player also learns a vector of costs which has to be covered with an a-priori defined budget B. The game ends when the sum of current costs associated with the played arms exceeds the remaining budget. Firstly, we analyze this setting for the stochastic case, for which we assume each arm to have an underlying cost and reward distribution with support [c_min, 1] and [0, 1], respectively. We derive an Upper Confidence Bound (UCB) algorithm which achieves O(NK⁴ log B) regret. Secondly, for the adversarial case in which the entire sequence of rewards and costs is fixed in advance, we derive an upper bound on the regret of order O(√NB log(N/K)) utilizing an extension of the well-known Exp3 algorithm. We also provide upper bounds that hold with high probability and a lower bound of order Ω((1 – K/N) √NB/K).
45

Fernandez-Tapia, Joaquin, and Charles Monzani. "Stochastic Multi-Armed Bandit Algorithm for Optimal Budget Allocation in Programmatic Advertising." SSRN Electronic Journal, 2015. http://dx.doi.org/10.2139/ssrn.2600473.

46

Gopalan, Aditya, Prashanth L. A., Michael Fu, and Steve Marcus. "Weighted Bandits or: How Bandits Learn Distorted Values That Are Not Expected." Proceedings of the AAAI Conference on Artificial Intelligence 31, no. 1 (February 13, 2017). http://dx.doi.org/10.1609/aaai.v31i1.10922.

Abstract:
Motivated by models of human decision making proposed to explain commonly observed deviations from conventional expected value preferences, we formulate two stochastic multi-armed bandit problems with distorted probabilities on the cost distributions: the classic K-armed bandit and the linearly parameterized bandit. In both settings, we propose algorithms that are inspired by Upper Confidence Bound (UCB) algorithms, incorporate cost distortions, and exhibit sublinear regret assuming Hölder continuous weight distortion functions. For the K-armed setting, we show that the algorithm, called W-UCB, achieves problem-dependent regret O(L² M² log n / Δ^(2/α − 1)), where n is the number of plays, Δ is the gap in distorted expected value between the best and next best arm, L and α are the Hölder constants for the distortion function, and M is an upper bound on costs, and a problem-independent regret bound of O((KL²M²)^(α/2) n^((2 − α)/2)). We also present a matching lower bound on the regret, showing that the regret of W-UCB is essentially unimprovable over the class of Hölder-continuous weight distortions. For the linearly parameterized setting, we develop a new algorithm, a variant of the Optimism in the Face of Uncertainty Linear bandit (OFUL) algorithm called WOFUL (Weight-distorted OFUL), and show that it has regret O(d√n polylog(n)) with high probability, for sub-Gaussian cost distributions.
47

Mandel, Travis, Yun-En Liu, Emma Brunskill, and Zoran Popović. "The Queue Method: Handling Delay, Heuristics, Prior Data, and Evaluation in Bandits." Proceedings of the AAAI Conference on Artificial Intelligence 29, no. 1 (February 21, 2015). http://dx.doi.org/10.1609/aaai.v29i1.9604.

Abstract:
Current algorithms for the standard multi-armed bandit problem have good empirical performance and optimal regret bounds. However, real-world problems often differ from the standard formulation in several ways. First, feedback may be delayed instead of arriving immediately. Second, the real world often contains structure which suggests heuristics, which we wish to incorporate while retaining the best-known theoretical guarantees. Third, we may wish to make use of an arbitrary prior dataset without negatively impacting performance. Fourth, we may wish to efficiently evaluate algorithms using a previously collected dataset. Surprisingly, these seemingly-disparate problems can be addressed using algorithms inspired by a recently-developed queueing technique. We present the Stochastic Delayed Bandits (SDB) algorithm as a solution to these four problems, which takes black-box bandit algorithms (including heuristic approaches) as input while achieving good theoretical guarantees. We present empirical results from both synthetic simulations and real-world data drawn from an educational game. Our results show that SDB outperforms state-of-the-art approaches to handling delay, heuristics, prior data, and evaluation.
48

Liu, Fang, Swapna Buccapatnam, and Ness Shroff. "Information Directed Sampling for Stochastic Bandits With Graph Feedback." Proceedings of the AAAI Conference on Artificial Intelligence 32, no. 1 (April 29, 2018). http://dx.doi.org/10.1609/aaai.v32i1.11751.

Abstract:
We consider stochastic multi-armed bandit problems with graph feedback, where the decision maker is allowed to observe the neighboring actions of the chosen action. We allow the graph structure to vary with time and consider both deterministic and Erdős-Rényi random graph models. For such a graph feedback model, we first present a novel analysis of Thompson sampling that leads to a tighter performance bound than existing work. Next, we propose new Information Directed Sampling based policies that are graph-aware in their decision making. Under the deterministic graph case, we establish a Bayesian regret bound for the proposed policies that scales with the clique cover number of the graph instead of the number of actions. Under the random graph case, we provide a Bayesian regret bound for the proposed policies that scales with the ratio of the number of actions over the expected number of observations per iteration. To the best of our knowledge, this is the first analytical result for stochastic bandits with random graph feedback. Finally, using numerical evaluations, we demonstrate that our proposed IDS policies outperform existing approaches, including adaptations of the upper confidence bound, epsilon-greedy and Exp3 algorithms.
49

Hashima, Sherief, Mostafa M. Fouda, Sadman Sakib, Zubair Md Fadlullah, Kohei Hatano, Ehab Mahmoud Mohamed, and Xuemin Shen. "Energy-Aware Hybrid RF-VLC Multi-Band Selection in D2D Communication: A Stochastic Multi-Armed Bandit Approach." IEEE Internet of Things Journal, 2022, 1. http://dx.doi.org/10.1109/jiot.2022.3162135.

50

Li, Bo, and Chi Ho Yeung. "Understanding the stochastic dynamics of sequential decision-making processes: A path-integral analysis of multi-armed bandits." Chaos: An Interdisciplinary Journal of Nonlinear Science 33, no. 6 (June 1, 2023). http://dx.doi.org/10.1063/5.0120076.

Abstract:
The multi-armed bandit (MAB) model is one of the most classical models to study decision-making in an uncertain environment. In this model, a player chooses one of K possible arms of a bandit machine to play at each time step, and the corresponding arm returns a random reward to the player, potentially from a specific unknown distribution. The target of the player is to collect as many rewards as possible during the process. Despite its simplicity, the MAB model offers an excellent playground for studying the trade-off between exploration and exploitation and for designing effective algorithms for sequential decision-making under uncertainty. Although many asymptotically optimal algorithms have been established, the finite-time behaviors of the stochastic dynamics of the MAB model appear much more challenging to analyze, due to the interplay between the decision-making and the rewards being collected. In this paper, we employ techniques from statistical physics to analyze the MAB model, which facilitates the characterization of the distribution of cumulative regrets at a finite, short time, the central quantity of interest in an MAB algorithm, as well as the intricate dynamical behaviors of the model. Our analytical results, in good agreement with simulations, point to the emergence of an interesting multimodal regret distribution, with large regrets resulting from excess exploitation of sub-optimal arms due to an initial unlucky output from the optimal one.
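The finite-time regret distribution discussed here can also be explored empirically with a simple Monte Carlo experiment; the sketch below uses an epsilon-greedy player on Bernoulli arms purely as an illustrative stand-in and is unrelated to the paper's path-integral analysis.

```python
import random

def regret_samples(means, horizon=200, runs=5000, eps=0.1):
    """Empirical distribution of cumulative pseudo-regret for an
    epsilon-greedy player on Bernoulli arms; purely illustrative."""
    best = max(means)
    samples = []
    for _ in range(runs):
        counts = [0] * len(means)
        sums = [0.0] * len(means)
        regret = 0.0
        for t in range(horizon):
            if t < len(means):
                arm = t                                   # play each arm once
            elif random.random() < eps:
                arm = random.randrange(len(means))        # explore
            else:
                arm = max(range(len(means)),              # exploit empirical best
                          key=lambda a: sums[a] / counts[a])
            reward = 1.0 if random.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            regret += best - means[arm]
        samples.append(regret)
    return samples

# A histogram of regret_samples([0.4, 0.5, 0.6]) shows the shape of the
# finite-time regret distribution across independent runs.
```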