Theses on the topic "Apprentissage par reinforcement"
Create an accurate citation in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 theses for your research on the topic "Apprentissage par reinforcement".
Next to each source in the list of references there is an "Add to bibliography" button. Press this button, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Explore theses on a wide variety of disciplines and organise your bibliography correctly.
Carrara, Nicolas. "Reinforcement learning for dialogue systems optimization with user adaptation". Thesis, Lille 1, 2019. http://www.theses.fr/2019LIL1I071/document.
Full text: The most powerful artificial intelligence systems are now based on learned statistical models. In order to build efficient models, these systems must collect a huge amount of data on their environment. Personal assistants, smart homes, voice servers and other dialogue applications are no exceptions to this statement. A specificity of those systems is that they are designed to interact with humans, and as a consequence, their training data has to be collected from interactions with these humans. As the number of interactions with a single person is often too scarce to train a proper model, the usual approach to maximise the amount of data consists in mixing data collected from different users into a single corpus. However, one limitation of this approach is that, by construction, the trained models are only efficient with an "average" human and do not include any sort of adaptation; this lack of adaptation makes the service unusable for some specific groups of persons and leads to a restricted customer base and inclusiveness problems. This thesis proposes solutions to construct Dialogue Systems that are robust to this problem by combining Transfer Learning and Reinforcement Learning. It explores two main ideas. The first idea consists in incorporating adaptation in the very first dialogues with a new user. To that extent, we use the knowledge gathered with previous users. But how to scale such systems with a growing database of user interactions? The first proposed approach involves clustering of Dialogue Systems (tailored for their respective users) based on their behaviours. We demonstrate through handcrafted and real user-model experiments how this method improves the dialogue quality for new and unknown users. The second approach extends the Deep Q-learning algorithm with a continuous transfer process. The second idea states that before using a dedicated Dialogue System, the first interactions with a user should be handled carefully by a safe Dialogue System common to all users. The underlying approach is divided in two steps. The first step consists in learning a safe strategy through Reinforcement Learning. To that extent, we introduce a budgeted Reinforcement Learning framework for continuous state spaces and the underlying extensions of classic Reinforcement Learning algorithms. In particular, the safe version of the Fitted-Q algorithm has been validated, in terms of safety and efficiency, on a dialogue system task and an autonomous driving problem. The second step consists in using those safe strategies when facing new users; this method is an extension of the classic ε-greedy algorithm.
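The safe strategy above is deployed through an extension of ε-greedy, a standard exploration rule. As a minimal sketch of the classic rule only (assuming tabular Q-values; the thesis's budgeted, safety-aware extension is not reproduced here):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Classic epsilon-greedy: explore uniformly with probability
    epsilon, otherwise exploit the current greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # exploratory action
    return int(np.argmax(q_values))              # greedy action

rng = np.random.default_rng(0)
print(epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng))
```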
Akrour, Riad. "Robust Preference Learning-based Reinforcement Learning". Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112236/document.
Full text: The thesis contributions revolve around sequential decision making and more precisely Reinforcement Learning (RL). Taking its root in Machine Learning in the same way as supervised and unsupervised learning, RL quickly grew in popularity within the last two decades due to a handful of achievements on both the theoretical and applicative fronts. RL supposes that the learning agent and its environment follow a stochastic Markovian decision process over a state and action space. The process is said to be a decision process as the agent is asked to choose an action to take at each time step. It is said to be stochastic as the effect of selecting a given action in a given state does not systematically yield the same state but rather defines a distribution over the state space. It is said to be Markovian as this distribution only depends on the current state-action pair. Consequent to the choice of an action, the agent receives a reward. The RL goal is then to solve the underlying optimization problem of finding the behaviour that maximizes the sum of rewards all along the interaction of the agent with its environment. From an applicative point of view, a large spectrum of problems can be cast as an RL one, from Backgammon (TD-Gammon, one of Machine Learning's first successes, gave rise to a world-class player of advanced level) to decision problems in the industrial and medical worlds. However, the optimization problem solved by RL depends on the prior definition of a reward function that requires a certain level of domain expertise and also knowledge of the internal quirks of RL algorithms. As such, the first contribution of the thesis was to propose a learning framework that lightens the requirements made of the user. The latter no longer needs to know the exact solution of the problem but only to be able to choose, between two behaviours exhibited by the agent, the one that matches the solution more closely. Learning is interactive between the agent and the user and revolves around the following three main points: i) the agent demonstrates a behaviour, ii) the user compares it w.r.t. the current best one, iii) the agent uses this feedback to update its preference model of the user and uses it to find the next behaviour to demonstrate. To reduce the number of interactions required before finding the optimal behaviour, the second contribution of the thesis was to define a theoretically sound criterion making the trade-off between the sometimes contradictory desires of complying with the user's preferences and demonstrating sufficiently different behaviours. The last contribution was to ensure the robustness of the algorithm w.r.t. the feedback errors that the user might make, which happen more often than not in practice, especially in the initial phase of the interaction, when all the behaviours are far from the expected solution.
Fournier, Pierre. "Intrinsically Motivated and Interactive Reinforcement Learning : a Developmental Approach". Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS634.
Full text: Reinforcement learning (RL) is today more popular than ever, but certain basic skills are still out of reach of this paradigm: object manipulation, sensorimotor control, natural interaction with other agents. A possible approach to address these challenges consists in taking inspiration from human development, or even trying to reproduce it. In this thesis, we study the intersection of two crucial topics in developmental sciences and how to apply them to RL in order to tackle the aforementioned challenges: interactive learning and intrinsic motivation. Interactive learning and intrinsic motivation have already been studied, separately, in combination with RL, but with the aim of quantitatively improving the performance of existing agents rather than of learning in a developmental fashion. We thus focus our efforts on the developmental aspect of these subjects. Our work touches on the self-organisation of learning into developmental trajectories through an intrinsic motivation for learning progress, and on the interaction of this organisation with goal-directed learning and imitation learning. We show that these mechanisms, when implemented in open-ended environments with no predefined task, can interact to produce learning behaviors that are sound from a developmental standpoint, and richer than those produced by each mechanism separately.
Blier, Léonard. "Some Principled Methods for Deep Reinforcement Learning". Electronic Thesis or Diss., université Paris-Saclay, 2022. http://www.theses.fr/2022UPASG040.
Full text: This thesis develops and studies some principled methods for Deep Learning (DL) and deep Reinforcement Learning (RL). In Part II, we study the efficiency of DL models in the context of the Minimum Description Length principle, which formalizes Occam's razor and holds that a good model of data is a model that is good at losslessly compressing the data, including the cost of describing the model itself. Deep neural networks might seem to go against this principle given the large number of parameters to be encoded. Surprisingly, we demonstrate experimentally the ability of deep neural networks to compress the training data even when accounting for parameter encoding, hence showing that DL approaches are well principled from this information-theory viewpoint. In Part III, we tackle two limitations of standard approaches in DL and RL, and develop principled methods that empirically improve robustness. The first one concerns the optimisation of deep learning models with SGD, and the cost of finding the optimal learning rate, which prevents using a new method out of the box without hyperparameter tuning. We design a principled optimisation method for DL, 'All Learning Rates At Once' (Alrao): each unit or feature in the network gets its own learning rate sampled from a random distribution spanning several orders of magnitude. Perhaps surprisingly, Alrao performs close to SGD with an optimally tuned learning rate, for various architectures and problems. The second one tackles near-continuous-time RL environments (such as robotics or control environments): we show that time discretization (the number of actions per second) is a critical factor, and that empirically, Q-learning-based approaches collapse with small time steps. Formally, we prove that Q-learning does not exist in continuous time. We detail a principled way to build an off-policy RL algorithm that yields similar performances over a wide range of time discretizations, and confirm this robustness empirically. The main part of this thesis (Part IV) studies the Successor States Operator in RL, and how it can improve the sample efficiency of policy evaluation. In an environment with a very sparse reward, learning the value function is a hard problem. At the beginning of training, no learning will occur until a reward is observed. This highlights the fact that not all the observed information is used. Leveraging this information might lead to better sample efficiency. The Successor States Operator is an object that expresses the value functions of all possible reward functions for a given, fixed policy. Learning the successor states operator can be done without reward signals, and can extract information from every observed transition, illustrating an unsupervised reinforcement learning approach. We offer a formal treatment of these objects in both finite and continuous spaces with function approximators. We present several learning algorithms and associated results. Similarly to the value function, the successor states operator satisfies a Bellman equation. Additionally, it also satisfies two other fixed-point equations: a backward Bellman equation and a Bellman-Newton equation, expressing path compositionality in the Markov process. These new relations allow us to generalize from observed trajectories in several ways, potentially leading to more sample efficiency. Each of these equations leads to corresponding algorithms for any function approximator, such as neural networks. Finally (Part V), the study of the successor states operator and its algorithms allows us to derive unbiased methods in the setting of multi-goal RL, dealing with the issue of extremely sparse rewards. We additionally show that the popular Hindsight Experience Replay algorithm, known to be biased, is actually unbiased in the large class of deterministic environments.
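In the tabular case, the successor states operator reduces to the classic successor representation, whose forward Bellman fixed point is M = I + γPM. A minimal TD sketch under that simplification (hypothetical sizes; the thesis works with function approximators and also exploits the backward Bellman and Bellman-Newton equations):

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
M = np.zeros((n_states, n_states))  # tabular successor-state matrix, fixed policy

def sr_td_update(s, s_next):
    """One TD step towards the Bellman fixed point M = I + gamma * P @ M,
    learnable from transitions alone, without any reward signal."""
    target = np.eye(n_states)[s] + gamma * M[s_next]
    M[s] += alpha * (target - M[s])

# Once M is learned, the value function of *any* reward vector r
# follows without further interaction: V = M @ r
```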
Chatzilygeroudis, Konstantinos. "Micro-Data Reinforcement Learning for Adaptive Robots". Thesis, Université de Lorraine, 2018. http://www.theses.fr/2018LORR0276/document.
Full text: Robots have to face the real world, in which trying something might take seconds, hours, or even days. Unfortunately, current state-of-the-art reinforcement learning algorithms (e.g., deep reinforcement learning) require long interaction times to find effective policies. In this thesis, we explored approaches that tackle the challenge of learning by trial-and-error in a few minutes on physical robots. We call this challenge “micro-data reinforcement learning”. In our first contribution, we introduced a novel learning algorithm called “Reset-free Trial-and-Error” that allows complex robots to quickly recover from unknown circumstances (e.g., damage or different terrain) while completing their tasks and taking the environment into account; in particular, a physically damaged hexapod robot recovered most of its locomotion abilities in an environment with obstacles, and without any human intervention. In our second contribution, we introduced a novel model-based reinforcement learning algorithm, called Black-DROPS, that: (1) does not impose any constraint on the reward function or the policy (they are treated as black boxes), (2) is as data-efficient as the state-of-the-art algorithm for data-efficient RL in robotics, and (3) is as fast (or faster) than analytical approaches when several cores are available. We additionally proposed Multi-DEX, a model-based policy search approach that takes inspiration from novelty-based ideas and effectively solved several sparse-reward scenarios. In our third contribution, we introduced a new model learning procedure in Black-DROPS (we call it GP-MI) that leverages parameterized black-box priors to scale up to high-dimensional systems; for instance, it found high-performing walking policies for a physically damaged hexapod robot (48D state and 18D action space) in less than 1 minute of interaction time. Finally, in the last part of the thesis, we explored a few ideas on how to incorporate safety constraints, improve robustness and leverage multiple priors in Bayesian optimization in order to tackle the micro-data reinforcement learning challenge. Throughout this thesis, our goal was to design algorithms that work on physical robots, and not only in simulation. Consequently, all the proposed approaches have been evaluated on at least one physical robot. Overall, this thesis aimed at providing methods and algorithms that will allow physical robots to be more autonomous and able to learn in a handful of trials.
Achab, Mastane. "Ranking and risk-aware reinforcement learning". Electronic Thesis or Diss., Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAT020.
Full text: This thesis is divided into two parts: the first part is on ranking and the second on risk-aware reinforcement learning. While binary classification is the flagship application of empirical risk minimization (ERM), the main paradigm of machine learning, more challenging problems such as bipartite ranking can also be expressed through that setup. In bipartite ranking, the goal is to order, by means of scoring methods, all the elements of some feature space based on a training dataset composed of feature vectors with their binary labels. This thesis extends this setting to the continuous ranking problem, a variant where the labels take continuous values instead of being simply binary. The analysis of ranking data, initiated in the 18th century in the context of elections, has led to another ranking problem using ERM, namely ranking aggregation and more precisely Kemeny's consensus approach. From a training dataset made of ranking data, such as permutations or pairwise comparisons, the goal is to find the single "median permutation" that best corresponds to a consensus order. We present a less drastic dimensionality-reduction approach where a distribution on rankings is approximated by a simpler distribution, which is not necessarily reduced to a Dirac mass as in ranking aggregation. For that purpose, we rely on mathematical tools from the theory of optimal transport, such as Wasserstein metrics. The second part of this thesis focuses on risk-aware versions of the stochastic multi-armed bandit problem and of reinforcement learning (RL), where an agent interacts with a dynamic environment by taking actions and receiving rewards, the objective being to maximize the total payoff. In particular, a novel atomic distributional RL approach is provided: the distribution of the total payoff is approximated by particles that correspond to trimmed means.
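A trimmed mean simply discards the tails of a sample before averaging, which is what makes the particle-based summary above risk-aware. A minimal sketch with a hypothetical trim level (not the thesis's full distributional RL algorithm):

```python
import numpy as np

def trimmed_mean(returns, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fractions:
    a robust, risk-aware summary of a return distribution."""
    x = np.sort(np.asarray(returns))
    k = int(trim * len(x))
    return x[k:len(x) - k].mean()

samples = np.random.default_rng(0).normal(1.0, 2.0, size=1000)
print(trimmed_mean(samples, trim=0.2))
```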
Zimmer, Matthieu. "Apprentissage par renforcement développemental". Thesis, Université de Lorraine, 2018. http://www.theses.fr/2018LORR0008/document.
Full text: Reinforcement learning allows an agent to learn a behaviour that has never been previously defined by humans. The agent discovers the environment and the different consequences of its actions through interaction: it learns from its own experience, without having pre-established knowledge of the goals or effects of its actions. This thesis tackles how deep learning can help reinforcement learning handle continuous spaces and environments with many degrees of freedom, in order to solve problems closer to reality. Indeed, neural networks scale well and have good representational capacity. They make it possible to approximate functions on continuous spaces and allow a developmental approach, because they require little a priori knowledge of the domain. We seek to reduce the amount of interaction the agent needs to achieve acceptable behaviour. To do so, we proposed the Neural Fitted Actor-Critic framework, which defines several data-efficient actor-critic algorithms. We examine how the agent can fully exploit the transitions generated by previous behaviours by integrating off-policy data into the proposed framework. Finally, we study how the agent can learn faster by taking advantage of the development of its body, in particular by proceeding with a gradual increase in the dimensionality of its sensorimotor space.
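For reference, one incremental update of a vanilla actor-critic with linear function approximation is sketched below (a generic baseline; the Neural Fitted Actor-Critic of the thesis replaces such incremental steps with neural networks and batch fitting):

```python
import numpy as np

def actor_critic_step(theta, w, phi_s, phi_s_next, a_onehot, r,
                      gamma=0.99, lr_actor=1e-3, lr_critic=1e-2):
    """One TD(0) actor-critic step with V(s) = w . phi(s) and a
    softmax policy; grad log pi(a|s) = (onehot(a) - pi(.|s)) x phi(s)."""
    td_error = r + gamma * (w @ phi_s_next) - (w @ phi_s)
    w = w + lr_critic * td_error * phi_s                  # critic update
    logits = theta @ phi_s
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    theta = theta + lr_actor * td_error * np.outer(a_onehot - probs, phi_s)
    return theta, w
```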
Tréca, Maxime. "Designing traffic signal control systems using reinforcement learning". Electronic Thesis or Diss., université Paris-Saclay, 2022. http://www.theses.fr/2022UPASG043.
Full text: This thesis studies the problem of traffic optimization through traffic light signals on road networks. Traffic optimization is achieved in our case through the use of reinforcement learning, a branch of machine learning in which an agent solves a given task in an environment by maximizing its reward signals. First, we present the fields of traffic signal control (TSC) and reinforcement learning (RL) separately, before presenting how the latter is applied to the former (RL-TSC). Then, we define a mathematical model of traffic based on graph theory, before introducing the reinforcement learning model, traffic simulator and deep reinforcement learning library created for our research work. Finally, these definitions allow us to build an efficient traffic signal control method based on reinforcement learning. We first study multiple classical reinforcement learning techniques on an isolated traffic intersection. Multiple classes of RL algorithms are compared (e.g. Q-learning, LRP, actor-critic) to deterministic TSC methods used as a baseline. We then introduce function approximation methods using deep neural networks, allowing for significant performance improvements on isolated intersections. These experiments allow us to single out dueling deep Q-learning as the best isolated RL-TSC method for our model. On this basis, we introduce the concept of agent coordination in multi-agent reinforcement learning systems (MARL). We compare multiple modes of coordination to the isolated baseline that we previously defined. These experiments allow us to define the DEC-DQN coordination method, which allows multiple agents of a POMDP to communicate in order to better optimize traffic. DEC-DQN uses a deep neural network shared by all agents of the network, allowing them to learn a common communication protocol from scratch. In order to correctly reward communication actions, which are entirely distinct from the traffic optimization actions taken by agents, DEC-DQN defines a special reward function allowing each agent to directly estimate the impact of its communications on neighboring agents of the network. Communication action rewards are directly estimated on the traffic optimization neural networks of neighboring intersections. Finally, this novel coordination method is compared to other methods of the literature on a large-scale simulation. The DEC-DQN algorithm results in faster agent learning, as well as increased performance and stability thanks to agent coordination.
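Dueling deep Q-learning, singled out above, differs from plain DQN only in how two output streams are recombined into Q-values: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'). A minimal sketch of that aggregation with hypothetical linear streams standing in for network heads:

```python
import numpy as np

def dueling_q(phi, w_value, w_adv):
    """Dueling aggregation: a state-value stream and an advantage
    stream combined as Q = V + A - mean(A), which identifies V."""
    v = phi @ w_value          # shape (1,): state value
    a = phi @ w_adv            # shape (n_actions,): advantages
    return v + a - a.mean()    # shape (n_actions,): Q-values

rng = np.random.default_rng(0)
phi = rng.normal(size=8)       # state features, e.g. queue lengths per lane
print(dueling_q(phi, rng.normal(size=(8, 1)), rng.normal(size=(8, 4))))
```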
Tarbouriech, Jean. "Goal-oriented exploration for reinforcement learning". Electronic Thesis or Diss., Université de Lille (2022-....), 2022. http://www.theses.fr/2022ULILB014.
Full text: Learning to reach goals is a competence of high practical relevance for intelligent agents to acquire. For instance, this encompasses many navigation tasks ("go to target X"), robotic manipulation ("attain position Y of the robotic arm"), or game-playing scenarios ("win the game by fulfilling objective Z"). As a living being interacting with the world, I am constantly driven by goals to reach, varying in scope and difficulty. Reinforcement Learning (RL) holds the promise to frame and learn goal-oriented behavior. Goals can be modeled as specific configurations of the environment that must be attained via sequential interaction and exploration of the unknown environment. Although various deep RL algorithms have been proposed for goal-oriented RL, existing methods often lack principled understanding, sample efficiency and general-purpose effectiveness. In fact, very limited theoretical analysis of goal-oriented RL was available, even in the basic scenario of finitely many states and actions. We first focus on a supervised scenario of goal-oriented RL, where a goal state to be reached in minimum total expected cost is provided as part of the problem definition. After formalizing the online learning problem in this setting, often known as Stochastic Shortest Path (SSP), we introduce two no-regret algorithms (one is the first available in the literature, the other attains nearly optimal guarantees). Beyond training our RL agent to solve only one task, we then aspire for it to learn to autonomously solve a wide variety of tasks, in the absence of any reward supervision. In this challenging unsupervised RL scenario, we advocate to "Set Your Own Goals" (SYOG), which suggests that the agent learn the ability to intrinsically select and reach its own goal states. We derive finite-time guarantees of this popular heuristic in various settings, each with its specific learning objective and technical challenges. As an illustration, we propose a rigorous analysis of the algorithmic principle of targeting "uncertain" goals, which we also anchor in deep RL. The main focus and contribution of this thesis are to instigate a principled analysis of goal-oriented exploration in RL, both in the supervised and unsupervised scenarios. We hope that it helps suggest promising research directions to improve the interpretability and sample efficiency of goal-oriented RL algorithms in practical applications.
Théro, Héloïse. "Contrôle, agentivité et apprentissage par renforcement". Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEE028/document.
Full text: Sense of agency, or subjective control, can be defined as the feeling that we control our actions and, through them, effects in the outside world. This cluster of experiences depends on the ability to learn action-outcome contingencies, and a classical algorithm to model this learning originates in the field of human reinforcement learning. In this PhD thesis, we used the cognitive modeling approach to further investigate the interaction between perceived control and reinforcement learning. First, we saw that participants undergoing a reinforcement-learning task experienced higher agency; this influence of reinforcement learning on agency comes as no surprise, because reinforcement learning relies on linking a voluntary action and its outcome. But our results also suggest that agency influences reinforcement learning in two ways. We found that people learn action-outcome contingencies based on a default assumption: their actions make a difference to the world. Finally, we also found that the mere fact of choosing freely shapes the learning processes following that decision. Our general conclusion is that agency and reinforcement learning, two fundamental fields of human psychology, are deeply intertwined. Contrary to machines, humans do care about being in control, or about making the right choice, and this results in integrating information in a one-sided way.
Darwiche Domingues, Omar. "Exploration en apprentissage par renforcement : au-delà des espaces d'états finis". Thesis, Université de Lille (2022-....), 2022. http://www.theses.fr/2022ULILB002.
Full text: Reinforcement learning (RL) is a powerful machine learning framework to design algorithms that learn to make decisions and to interact with the world. Algorithms for RL can be classified as offline or online. In the offline case, the algorithm is given a fixed dataset, based on which it needs to compute a good decision-making strategy. In the online case, an agent needs to efficiently collect data by itself, by interacting with the environment: that is the problem of exploration in reinforcement learning. This thesis presents theoretical and practical contributions to online RL. We investigate the worst-case performance of online RL algorithms in finite environments, that is, those that can be modeled with a finite number of states, and where the set of actions that can be taken by an agent is also finite. Such performance degrades as the number of states increases, whereas in real-world applications the state set can be arbitrarily large or continuous. To tackle this issue, we propose kernel-based algorithms for exploration that can be implemented in general state spaces, and for which we provide theoretical results under weak assumptions on the environment. Those algorithms rely on a kernel function that measures the similarity between different states, which can be defined on arbitrary state spaces, including discrete sets and Euclidean spaces, for instance. Additionally, we show that our kernel-based algorithms are able to handle non-stationary environments by using time-dependent kernel functions, and we propose and analyze approximate versions of our methods to reduce their computational complexity. Finally, we introduce a scalable approximation of our kernel-based methods that can be implemented with deep reinforcement learning and integrates different representation-learning methods to define a kernel function.
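As a rough illustration of the role such a kernel can play, an exploration bonus can shrink with a kernel-weighted "soft count" of visits to similar states (a generic sketch under a Euclidean-embedding assumption, not the thesis's exact algorithm):

```python
import numpy as np

def rbf_kernel(s1, s2, bandwidth=1.0):
    """Similarity between two states in any Euclidean embedding."""
    return np.exp(-np.sum((s1 - s2) ** 2) / (2 * bandwidth ** 2))

def exploration_bonus(state, visited_states, beta=1.0):
    """Bonus inversely proportional to a kernel soft count, so novel
    (dissimilar) states receive larger exploration incentives."""
    soft_count = sum(rbf_kernel(state, s) for s in visited_states)
    return beta / np.sqrt(1.0 + soft_count)
```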
Moturu, Krishna Priya Darsini. "Application of reinforcement learning algorithms to software verification". Master's thesis, Québec : Université Laval, 2006. http://www.theses.ulaval.ca/2006/23583/23583.pdf.
Zhioua, Sami. "Stochastic Systems Divergence through Reinforcement Learning". Thesis, Université Laval, 2008. http://www.theses.ulaval.ca/2008/25167/25167.pdf.
Full text: Modelling real-life systems and phenomena using mathematically based formalisms is ubiquitous in science and engineering. The reason is that mathematics offers a suitable framework to carry out formal and rigorous analysis of these systems. For instance, in software engineering, formal methods are among the most efficient tools to identify flaws in software. The behavior of many real-life systems is inherently stochastic, which requires stochastic models such as labelled Markov processes (LMPs), Markov decision processes (MDPs), predictive state representations (PSRs), etc. This thesis is about quantifying the difference between stochastic systems. The main contributions are: 1. a new approach to quantify the divergence between pairs of stochastic systems based on reinforcement learning, 2. a new family of equivalence notions which lies between trace equivalence and bisimulation, and 3. a refined testing framework to define equivalence notions. The important point of the thesis is that reinforcement learning (RL), a branch of artificial intelligence particularly efficient in the presence of uncertainty, can be used to quantify efficiently the divergence between stochastic systems. The key idea is to define an MDP out of the systems to be compared and then to interpret the optimal value of the MDP as the divergence between them. The most appealing feature of the proposed approach is that it does not rely on knowledge of the internal structure of the systems; only the possibility of interacting with them is required. Because of this, the approach can be extended to different types of stochastic systems. The second contribution is a new family of equivalence notions, moment, which constitutes a good compromise between trace equivalence (too weak) and bisimulation (too strong). This family has a natural definition using coincidence of moments of random variables but, more importantly, it has a simple testing characterization. Moment turns out to be part of a bigger framework called test-observation-equivalence (TOE), which we propose as the third contribution of this thesis. It is a refined testing framework to define equivalence notions with more flexibility.
Vincent, Marc. "Reinforcement Learning for Multi-Function Radar Resource Management". Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS305.
Full text: In the wake of recent advances in the field of machine learning, much progress has been accomplished in one of its sub-fields, reinforcement learning, whose aim is to solve sequential decision problems under uncertainty. Radar resource management seems to represent an ideal application case for this type of technique. Indeed, a radar emits signals, called dwells, whose echoes are used to measure the state of surrounding objects; these dwells vary according to numerous parameters (duration, beam width...) and must be executed sequentially. The surveillance strategy of a multi-function radar thus consists in continuously selecting the dwells to perform, with the aim of searching the surrounding space while tracking already-detected targets. The methods currently used to address this problem are largely heuristic, and are likely to run into difficulties in a range of complex situations involving hyper-velocity or hyper-maneuvering targets. First, we propose applications of reinforcement learning techniques adapted to the current architecture of multi-function radars. These contributions focus on two aspects: dwell scheduling on the antenna using model-based methods, and active tracking dwell optimization using model-free methods. Secondly, we highlight the limitations of current resource management architectures, which leads us to consider an alternative architecture for which we propose new reinforcement learning algorithms designed to address the problems it raises. These contributions focus both on the multi-objective aspect, which is useful in multi-function radars to reflect the trade-offs to be made between different functions, and on the combinatorial aspect, which is due to the large number of tasks that the radar must carry out in parallel.
Cuvelier, Thibaut. "Polynomial-Time Algorithms for Combinatorial Semibandits : Computationally Tractable Reinforcement Learning in Complex Environments". Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG020.
Full text: Sequential decision making is a core component of many real-world applications, from computer-network operations to online ads. The major tool for this purpose is reinforcement learning: an agent takes a sequence of decisions in order to achieve its goal, with typically noisy measurements of the evolution of the environment. For instance, a self-driving car can be controlled by such an agent; the environment is the city in which the car manoeuvres. Bandit problems are a class of reinforcement learning problems for which very strong theoretical properties can be shown. The focus of bandit algorithms is on the exploration-exploitation dilemma: in order to perform well, the agent must acquire a deep knowledge of its environment (exploration); however, it should also play actions that bring it closer to its goal (exploitation). In this dissertation, we focus on combinatorial bandits, which are bandits whose decisions are highly structured (a "combinatorial" structure). These include cases where the learning agent determines a path to follow (on a road, in a computer network, etc.) or ads to display on a website. Such situations share the same computational complexity: while it is often easy to determine the optimum decision when the parameters are known (the time to cross a road, the monetary gain of displaying an ad at a given place), the bandit variant (when the parameters must be determined through interactions with the environment) is more complex. We propose two new algorithms to tackle these problems using mathematical-optimisation techniques. Based on weak hypotheses, they have polynomial time complexity, and yet perform well compared to state-of-the-art algorithms for the same problems. They also enjoy excellent statistical properties, meaning that they find a balance between exploration and exploitation that is close to the theoretical optimum. Previous work on combinatorial bandits had to make a choice between computational burden and statistical performance; our algorithms show that there is no need for such a quandary.
Chenu, Alexandre. "Leveraging sequentiality in Robot Learning : Application of the Divide & Conquer paradigm to Neuro-Evolution and Deep Reinforcement Learning". Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS342.
Full text: “To succeed, planning alone is insufficient. One must improvise as well.” This quote from Isaac Asimov, founding father of robotics and author of the Three Laws of Robotics, emphasizes the importance of being able to adapt and think on one's feet to achieve success. Although robots can nowadays resolve highly complex tasks, they still need to gain those crucial adaptability skills to be deployed on a larger scale. Robot Learning uses learning algorithms to tackle this lack of adaptability and to enable robots to solve complex tasks autonomously. Two types of learning algorithms are particularly suitable for robots to learn controllers autonomously: Deep Reinforcement Learning and Neuro-Evolution. However, both classes of algorithms often cannot solve Hard Exploration Problems, that is, problems with a long horizon and a sparse reward signal, unless they are guided in their learning process. One can consider different approaches to tackle those problems. One option is to search for a diversity of behaviors rather than a specific one. The idea is that among this diversity, some behaviors will be able to solve the task. We call these algorithms Diversity Search algorithms. A second option consists in guiding the learning process using demonstrations provided by an expert. This is called Learning from Demonstration. However, searching for diverse behaviors or learning from demonstration can be inefficient in some contexts. Indeed, finding diverse behaviors can be tedious if the environment is complex. On the other hand, learning from demonstration can be very difficult if only one demonstration is available. This thesis attempts to improve the effectiveness of Diversity Search and Learning from Demonstration when applied to Hard Exploration Problems. To do so, we assume that complex robotics behaviors can be decomposed into reaching simpler sub-goals. Based on this sequential bias, we try to improve the sample efficiency of Diversity Search and Learning from Demonstration algorithms by adopting Divide & Conquer strategies, which are well known for their efficiency when the problem is composable. Throughout the thesis, we propose two main strategies. First, after identifying some limitations of Diversity Search algorithms based on Neuro-Evolution, we propose Novelty Search Skill Chaining. This algorithm combines Diversity Search with Skill-Chaining to efficiently navigate maze environments that are difficult to explore for state-of-the-art Diversity Search. In a second set of contributions, we propose the Divide & Conquer Imitation Learning algorithms. The key intuition behind those methods is to decompose the complex task of learning from a single demonstration into several simpler goal-reaching sub-tasks. DCIL-II, the most advanced variant, can learn walking behaviors for under-actuated humanoid robots with unprecedented efficiency. Beyond underlining the effectiveness of the Divide & Conquer paradigm in Robot Learning, this work also highlights the difficulties that can arise when composing behaviors, even in elementary environments. One will inevitably have to address these difficulties before applying these algorithms directly to real robots. This may be necessary for the success of the next generations of robots, as outlined by Asimov.
Paumard, Marie-Morgane. "Résolution automatique de puzzles par apprentissage profond". Thesis, CY Cergy Paris Université, 2020. http://www.theses.fr/2020CYUN1067.
Full text: The objective of this thesis is to develop semantic methods of reassembly in the complicated framework of heritage collections, where some blocks are eroded or missing. The reassembly of archaeological remains is an important task for heritage sciences: it helps improve the understanding and conservation of ancient vestiges and artifacts. However, some sets of fragments cannot be reassembled with techniques using contour information or visual continuities. It is then necessary to extract semantic information from the fragments and to interpret them. These tasks can be performed automatically thanks to deep learning techniques coupled with a solver, i.e., a constrained decision-making algorithm. This thesis proposes two semantic reassembly methods for 2D fragments with erosion, as well as a new dataset and evaluation metrics. The first method, Deepzzle, proposes a neural network followed by a solver. The neural network is composed of two Siamese convolutional networks trained to predict the relative position of two fragments: it is a 9-class classification problem. The solver uses Dijkstra's algorithm to maximize the joint probability. Deepzzle can address the case of missing and supernumerary fragments, is capable of processing about 15 fragments per puzzle, and performs 25% better than the state of the art. The second method, Alphazzle, is based on AlphaZero and single-player Monte Carlo Tree Search (MCTS). It is an iterative method that uses deep reinforcement learning: at each step, a fragment is placed on the current reassembly. Two neural networks guide MCTS: an action predictor, which uses the fragment and the current reassembly to propose a strategy, and an evaluator, which is trained to predict the quality of the future result from the current reassembly. Alphazzle takes into account the relationships between all fragments and adapts to puzzles larger than those solved by Deepzzle. Moreover, Alphazzle is compatible with constraints imposed by a heritage framework: at the end of reassembly, MCTS does not access the reward, unlike AlphaZero. Indeed, the reward, which indicates whether a puzzle is well solved or not, can only be estimated by the algorithm, because only a conservator can be sure of the quality of a reassembly.
Asri, Layla El. "Learning the Parameters of Reinforcement Learning from Data for Adaptive Spoken Dialogue Systems". Thesis, Université de Lorraine, 2016. http://www.theses.fr/2016LORR0350/document.
Full text: This document proposes to learn the behaviour of the dialogue manager of a spoken dialogue system from a set of rated dialogues. This learning is performed through reinforcement learning. Our method requires neither the definition of a representation of the state space nor a reward function. These two high-level parameters are learnt from the corpus of rated dialogues. It is shown that the spoken dialogue designer can optimise dialogue management by simply defining the dialogue logic and a criterion to maximise (e.g., user satisfaction). The methodology suggested in this thesis first considers the dialogue parameters that are necessary to compute a representation of the state space relevant for the criterion to be maximised. For instance, if the chosen criterion is user satisfaction, then it is important to account for parameters such as dialogue duration and the average speech recognition confidence score. The state space is represented as a sparse distributed memory. The Genetic Sparse Distributed Memory for Reinforcement Learning (GSDMRL) accommodates many dialogue parameters and selects the parameters which are most important for learning through genetic evolution. The resulting state space and the policy learnt on it are easily interpretable by the system designer. Secondly, the rated dialogues are used to learn a reward function which teaches the system to optimise the criterion. Two algorithms, reward shaping and distance minimisation, are proposed to learn the reward function. These two algorithms consider the criterion to be the return for the entire dialogue. The resulting functions are discussed and compared on simulated dialogues, and it is shown that they enable faster learning than using the criterion directly as the final reward. A spoken dialogue system for appointment scheduling was designed during this thesis, based on previous systems, and a corpus of rated dialogues with this system was collected. This corpus illustrates the scaling capability of the state space representation and is a good example of an industrial spoken dialogue system upon which the methodology could be applied.
Léon, Aurélia. "Apprentissage séquentiel budgétisé pour la classification extrême et la découverte de hiérarchie en apprentissage par renforcement". Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS226.
Full text: This thesis deals with the notion of budget to study problems of complexity (whether computational complexity, a complex task for an agent, or complexity due to a small amount of data). Indeed, the main goal of current techniques in machine learning is usually to obtain the best accuracy, without worrying about the cost of the task. The concept of budget makes it possible to take this parameter into account while maintaining good performance. We first focus on classification problems with a large number of classes: the complexity of those algorithms can be reduced thanks to the use of decision trees (here learned through budgeted reinforcement learning techniques) or the association of each class with a (binary) code. We then deal with reinforcement learning problems and the discovery of a hierarchy that breaks down a (complex) task into simpler tasks to facilitate learning and generalization. Here, this discovery is done by reducing the cognitive effort of the agent (considered in this work as equivalent to the use of an additional observation). Finally, we address problems of understanding and generating instructions in natural language, where data are available only in small quantities: for this purpose, we test the simultaneous use of an agent that understands the instructions and an agent that generates them.
Matheron, Guillaume. "Integrating motion planning into reinforcement learning to solve hard exploration problems". Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS348.
Full text: Motion planning is able to solve robotics problems much more quickly than any reinforcement learning algorithm by efficiently searching for a viable trajectory. Indeed, while the main object of interest in the field of Reinforcement Learning is the behavior of an agent, Motion Planning is concerned with the geometry and properties of the state space, and uses a different set of primitives to achieve more efficient exploration. Some of these primitives require a model of the system and are not studied in this work; others, such as reset-anywhere, are only available in simulated environments. In contrast, Motion Planning approaches do not benefit from the same generalization properties as the policies produced by reinforcement learning. In this thesis, we study the ways in which techniques inspired by motion planning can speed up the solving of hard exploration problems for reinforcement learning without sacrificing the advantages of model-free learning and generalization. We identify a deadlock that can occur when applying reinforcement learning to seemingly trivial sparse-reward problems, and contribute an exploration algorithm inspired by motion planning but specifically designed for reinforcement learning environments, as well as a framework to use the collected data to train a reinforcement learning algorithm in previously intractable scenarios.
Kamienny, Pierre-Alexandre. "Efficient adaptation of reinforcement learning agents : from model-free exploration to symbolic world models". Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS412.
Full text: Reinforcement Learning (RL) encompasses a range of techniques employed to train autonomous agents to interact with environments with the purpose of maximizing their returns across various training tasks. To ensure successful deployment of RL agents in real-world scenarios, achieving generalization and adaptation to unfamiliar situations is crucial. Although neural networks have shown promise in facilitating in-domain generalization by enabling agents to interpolate desired behaviors, their limitations in generalizing beyond the training distribution often lead to suboptimal performance on out-of-distribution data. These challenges are further amplified in RL settings characterized by non-stationary environments and constant distribution shifts during deployment. This thesis presents novel strategies within the framework of Meta-Reinforcement Learning, aiming to equip RL agents with the ability to adapt at test time to out-of-domain tasks. The first part of the thesis focuses on model-free techniques to learn effective exploration strategies. We consider two scenarios: one where the agent is provided with a set of training tasks, enabling it to explicitly model and learn generalizable task representations; and another where the agent learns without rewards to maximize its state coverage. In the second part, we investigate the application of symbolic regression, a powerful tool for developing predictive models that offer interpretability and exhibit enhanced robustness against distribution shifts. These models are subsequently integrated within model-based RL agents to improve their performance. Furthermore, this research contributes to the field of symbolic regression by introducing a collection of techniques that leverage Transformer models, enhancing their accuracy and effectiveness. In summary, by addressing the challenges of adaptation and generalization in RL, this thesis focuses on the understanding and application of Meta-Reinforcement Learning strategies. It provides insights and techniques for enabling RL agents to adapt seamlessly to out-of-domain tasks, ultimately facilitating their successful deployment in real-world scenarios.
Fruit, Ronan. "Exploration-exploitation dilemma in reinforcement learning under various form of prior knowledge". Thesis, Lille 1, 2019. http://www.theses.fr/2019LIL1I086.
Full text: In combination with Deep Neural Networks (DNNs), several Reinforcement Learning (RL) algorithms such as "Q-learning" or "Policy Gradient" are now able to achieve super-human performances on most Atari games as well as the game of Go. Despite these outstanding and promising achievements, such Deep Reinforcement Learning (DRL) algorithms require millions of samples to perform well, thus limiting their deployment in applications where data acquisition is costly. The lack of sample efficiency of DRL can partly be attributed to the use of DNNs, which are known to be data-intensive in the training phase. But more importantly, it can be attributed to the type of Reinforcement Learning algorithm used, which performs only a very inefficient, undirected exploration of the environment. For instance, Q-learning and Policy Gradient rely on randomization for exploration. In most cases, this strategy turns out to be very ineffective at properly balancing the exploration needed to discover unknown and potentially highly rewarding regions of the environment with the exploitation of rewarding regions already identified as such. Other RL approaches with theoretical guarantees on the exploration-exploitation trade-off have been investigated. It is sometimes possible to formally prove that their performance almost matches the theoretical optimum. This line of research is inspired by the Multi-Armed Bandit literature, with many algorithms relying on the same underlying principle, often referred to as "optimism in the face of uncertainty". Even if a significant effort has been made towards understanding the exploration-exploitation dilemma generally, many questions still remain open. In this thesis, we generalize existing work on exploration-exploitation to different contexts with different amounts of prior knowledge on the learning problem. We introduce several algorithmic improvements to current state-of-the-art approaches and derive a new theoretical analysis which allows us to answer several open questions of the literature. We then relax the (very common although not very realistic) assumption that a path between any two distinct regions of the environment should always exist. Relaxing this assumption highlights the impact of prior knowledge on the intrinsic limitations of the exploration-exploitation dilemma. Finally, we show how some prior knowledge, such as the range of the value function or a set of macro-actions, can be efficiently exploited to speed up learning. In this thesis, we always strive to take the algorithmic complexity of the proposed algorithms into account. Although all these algorithms are somehow computationally "efficient", they all require a planning phase and therefore suffer from the well-known "curse of dimensionality", which limits their applicability to real-world problems. Nevertheless, the main focus of this work is to derive general principles that may be combined with more heuristic approaches to help overcome current DRL flaws.
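The "optimism in the face of uncertainty" principle mentioned above is easiest to state in the bandit setting, e.g. the classic UCB1 index of Auer et al. (2002); a minimal sketch for illustration:

```python
import numpy as np

def ucb1_arm(counts, means, t):
    """UCB1: pick the arm maximizing its empirical mean plus an
    optimism bonus that shrinks as the arm is pulled more often."""
    index = means + np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1))
    index[counts == 0] = np.inf  # pull every arm at least once
    return int(np.argmax(index))

print(ucb1_arm(np.array([3, 1, 5]), np.array([0.4, 0.9, 0.5]), t=9))
```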
Khouzaimi, Hatim. "Turn-taking enhancement in spoken dialogue systems with reinforcement learning". Thesis, Avignon, 2016. http://www.theses.fr/2016AVIG0213/document.
Full text: Incremental dialogue systems are able to process the user's speech as it is spoken (without waiting for the end of a sentence before starting to process it). This makes them able to take the floor whenever they decide to (the user can also speak whenever she wants, even if the system is still holding the floor). As a consequence, they are able to perform a richer set of turn-taking behaviours compared to traditional systems. Several contributions are described in this thesis with the aim of showing that dialogue systems' turn-taking capabilities can be automatically improved from data. First, human-human dialogue is analysed and a new taxonomy of turn-taking phenomena in human conversation is established. Based on this work, the different phenomena are analysed and some of them are selected for replication in a human-machine context (the ones that are most likely to improve a dialogue system's efficiency). Then, a new architecture for incremental dialogue systems is introduced, with the aim of transforming a traditional dialogue system into an incremental one at a low cost (also separating the turn-taking manager from the dialogue manager). To be able to perform the first tests, a simulated environment has been designed and implemented. It is able to replicate user and ASR behaviours that are specific to incremental processing, unlike existing simulators. Combined, these contributions led to the establishment of a rule-based incremental dialogue strategy that is shown to improve dialogue efficiency in a task-oriented situation and in simulation. A new reinforcement learning strategy has also been proposed. It is able to autonomously learn optimal turn-taking behaviours throughout the interactions. The simulated environment has been used for training and for a first evaluation, where the new data-driven strategy is shown to outperform both the non-incremental and rule-based incremental strategies. In order to validate these results in real dialogue conditions, a prototype through which users can interact in order to control their smart home has been developed. At the beginning of each interaction, the turn-taking strategy is randomly chosen among the non-incremental, rule-based incremental and reinforcement learning (learned in simulation) strategies. A corpus of 206 dialogues has been collected. The results show that the reinforcement learning strategy significantly improves dialogue efficiency without hurting the user experience (slightly improving it, in fact).
Paolo, Giuseppe. "Learning in Sparse Rewards setting through Quality Diversity algorithms". Electronic Thesis or Diss., Sorbonne université, 2021. http://www.theses.fr/2021SORUS400.
Full text: Embodied agents, both natural and artificial, can learn to interact with the environment they are in through a process of trial and error. This process can be formalized through the Reinforcement Learning framework, in which the agent performs an action in the environment and observes its outcome through an observation and a reward signal. It is the reward signal that tells the agent how good the performed action is with respect to the task. This means that the more often a reward is given, the easier it is to improve on the current solution. When this is not the case, and the reward is given sparingly, the agent finds itself in a situation of sparse rewards. This requires a big focus on exploration, that is, on testing different things in order to discover which action, or set of actions, leads to the reward. RL agents usually struggle with this. Exploration is the focus of Quality-Diversity methods, a family of evolutionary algorithms that searches for a set of policies whose behaviors are as different as possible, while also improving their performance. In this thesis, we approach the problem of sparse rewards with these algorithms, and in particular with Novelty Search. This is a method that, contrary to many other Quality-Diversity approaches, does not improve on the performance of the discovered solutions, but only on their diversity. Thanks to this, it can quickly explore the whole space of possible policy behaviors. The first part of the thesis focuses on autonomously learning a representation of the search space in which the algorithm evaluates the discovered policies. In this regard, we propose the Task Agnostic eXploration of Outcome spaces through Novelty and Surprise (TAXONS) algorithm. This method learns a low-dimensional representation of the search space in situations in which it is not easy to hand-design said representation. TAXONS has proven effective in three different environments but still requires information on when to capture the observation used to learn the search space. This limitation is addressed by performing a study on multiple ways to encode, into the search space, information about the whole trajectory of observations generated during a policy evaluation. Among the studied methods, we analyze in particular the mathematical transform called the signature and its relevance to building trajectory-level representations. The manuscript continues with the study of a problem complementary to the one addressed by TAXONS: how to focus on the most interesting parts of the search space. Novelty Search is limited by the fact that all information about any reward discovered during the exploration process is ignored. In our second contribution, we introduce the Sparse Reward Exploration via Novelty Search and Emitters (SERENE) algorithm. This method separates the exploration of the search space from the exploitation of the reward through a two-alternating-steps approach. The exploration is performed through Novelty Search, but whenever a reward is discovered, it is exploited by instances of reward-based methods - called emitters - that perform local optimization of the reward. Experiments on different environments show how SERENE can quickly obtain high-rewarding solutions without hindering the exploration performance of the method. In our third and final contribution, we combine the two ideas presented with TAXONS and SERENE into a single approach: SERENE augmented TAXONS (STAX).
This algorithm can autonomously learn a low-dimensional representation of the search space while quickly optimizing any discovered reward through emitters. Experiments conducted on various environments show how the method can i) learn a representation allowing the discovery of all rewards and ii) quickly [...]
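The core of Novelty Search, on which the work above builds, is simple enough to sketch. Below is a minimal Python illustration, under the usual assumption that each policy is summarized by a behavior descriptor and scored by its mean distance to its k nearest neighbours in an archive of past behaviors; it is a hypothetical simplification, not the code of the thesis.

```python
import numpy as np

def novelty_scores(behaviors, archive, k=15):
    # Novelty of each descriptor: mean Euclidean distance to its k
    # nearest neighbours among the archive and the current population.
    reference = np.vstack([behaviors, archive]) if len(archive) else behaviors
    scores = []
    for b in behaviors:
        dists = np.linalg.norm(reference - b, axis=1)
        dists.sort()
        scores.append(dists[1:k + 1].mean())  # dists[0] == 0 is b itself
    return np.array(scores)

# Toy usage: policies are selected for where they end up, not for reward.
rng = np.random.default_rng(0)
population = rng.normal(size=(20, 2))  # hypothetical 2-D behavior descriptors
archive = rng.normal(size=(100, 2))    # descriptors of previously seen policies
most_novel = population[np.argmax(novelty_scores(population, archive))]
```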
Chaffre, Thomas. "Reinforcement learning and sim-to-real transfer for adaptive control of AUV". Electronic Thesis or Diss., Brest, École nationale supérieure de techniques avancées Bretagne, 2022. http://www.theses.fr/2022ENTA0010.
Texto completoAutopilots for unmanned systems are usually designed based on the feedback provided by velocity and orientation sensors. In the case of autopilot systems for autonomous underwater vehicles (AUVs), the main design objective is to compensate for the wave- and current-induced disturbing forces acting on their body. Existing AUV autopilots are, however, only able to compensate for the low-frequency components of sea-induced disturbances. It seems natural to assume that AUV performance could be improved by taking the nature of the disturbances into account in the design of the autopilot. Adaptive control provides what seems to be an ideal framework to this end. The objective of this technique is to automatically adjust the control parameters when facing unknown or time-varying processes, such that the desired performance threshold is met. Developed in the late 1950s, adaptive control frameworks have been considerably expanded and used in various fields; their application has been facilitated by the rapid progress in microelectronics and the increasing interaction between laboratories and companies, from aerospace to maritime industries. As a result, adaptive controllers started to be widely adopted in industry in the early 1980s. It was established at that time that robust designs with fixed parameters are too limited to handle complex regimes. The study of adaptive controllers for AUV maneuvering is associated with various challenges; this thesis focused on external disturbances, including: - Unknown dynamics: the uncertainty associated with precisely describing the states of waves or currents is high. This, together with their dynamic nature, prevents linear feedback control methods from achieving optimal performance of the plant. This becomes more critical in the presence of changes in weather conditions, which impose a multiplicative factor on the components of the induced forces. The disturbance period will also vary with the speed of the vehicle and its orientation relative to the waves; - Nonlinearity: the controller response at some operating points must be overly conservative to satisfy the specification at other operating points. This is difficult to achieve with fixed parameters obtained through local linearization, which do not encompass the entire regime envelope. In this thesis, we considered the case where the AUV has limited observability of the process, so that the aforementioned uncertainties are not measured by the system. A class of adaptive control methods, known as learning-based adaptive controllers, has been developed to tackle some of these limitations. This family of solutions uses model-free optimization methods capable of compensating for the unknown part of a process while maintaining optimal control of its known part using traditional model-based control structures. Among the various model-free methods, deep reinforcement learning is currently leading the field. It exploits powerful statistical tools that give control systems the ability to automatically learn and improve from experience without being explicitly told how to. The objective of this thesis was to formalize a novel learning-based adaptive control scheme using deep reinforcement learning and adaptive pole-placement control. In addition, we proposed a novel experience replay mechanism that takes into account the characteristics of the biological replay mechanism.
The methods were validated in simulation and in real life, demonstrating the benefits of combining both theories compared with using them separately
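The general pattern described in this abstract (a model-based law for the known part of the plant plus a learned term for the unmodelled disturbances) can be sketched as follows. This is a hedged illustration with hypothetical names (`HybridController`, the PD gains, the `policy` callable), not the pole-placement/DRL coupling actually formalized in the thesis.

```python
import numpy as np

class HybridController:
    """Sketch: a fixed model-based law handles the known dynamics,
    a learned policy adds a correction for unmodelled disturbances."""
    def __init__(self, kp, kd, policy):
        self.kp, self.kd = kp, kd
        self.policy = policy                    # e.g. a trained actor network

    def command(self, error, d_error, observation):
        u_model = self.kp * error + self.kd * d_error  # model-based term
        u_learned = self.policy(observation)           # learned residual
        return u_model + u_learned

# A do-nothing policy recovers the pure model-based controller.
ctrl = HybridController(kp=2.0, kd=0.5, policy=lambda obs: 0.0)
print(ctrl.command(error=0.3, d_error=-0.1, observation=np.zeros(4)))
```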
Xia, Chen. "Apprentissage Intelligent des Robots Mobiles dans la Navigation Autonome". Thesis, Ecole centrale de Lille, 2015. http://www.theses.fr/2015ECLI0026/document.
Texto completoModern robots are designed for assisting or replacing human beings in complicated planning and control operations, and the capability of autonomous navigation in a dynamic environment is an essential requirement for mobile robots. In order to alleviate the tedious task of manually programming a robot, this dissertation contributes to the design of intelligent robot control to endow mobile robots with a learning ability in autonomous navigation tasks. First, we consider robot learning from expert demonstrations. A neural network framework is proposed as the inference mechanism to learn a policy offline from a dataset extracted from experts. Then we are interested in the robot's self-learning ability without expert demonstrations. We apply reinforcement learning techniques to acquire and optimize a control strategy during the interaction process between the learning robot and the unknown environment. A neural network is also incorporated to allow fast generalization, and it helps the learning converge in a number of episodes far smaller than with traditional methods. Finally, we study robot learning of the rewards underlying the states from optimal or suboptimal expert demonstrations. We propose an algorithm based on inverse reinforcement learning. A nonlinear policy representation is designed and the max-margin method is applied to refine the rewards and generate an optimal control policy. The three proposed methods have been successfully implemented on autonomous navigation tasks for mobile robots in unknown and dynamic environments
Najar, Anis. "Shaping robot behaviour with unlabeled human instructions". Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066152.
Texto completoMost current interactive learning systems rely on predefined protocols that constrain the interaction with the user. Relaxing the constraints of interaction protocols can therefore improve the usability of these systems. This thesis tackles the question of interpreting human instructions, in order to relax the constraint of predetermining their meanings. We propose a framework that enables a human teacher to shape a robot's behaviour by interactively providing it with unlabeled instructions. Our approach consists in grounding the meaning of instruction signals in the task learning process, while simultaneously using them to guide that process. This approach has a two-fold advantage. First, it gives the teacher more freedom in choosing their preferred signals. Second, it reduces the required engineering effort by removing the need to encode the meaning of each instruction signal. We implement our framework as a modular architecture, named TICS, that offers the possibility of combining different information sources: a predefined reward function, evaluative feedback and unlabeled instructions. This allows for more flexibility in the teaching process, by enabling the teacher to switch between different learning modes. In particular, we propose several methods for interpreting instructions, and a new method for combining evaluative feedback with a predefined reward function. We evaluate our framework through a series of experiments, performed both in simulation and with real robots. The experimental results demonstrate the effectiveness of our framework in accelerating the task learning process and in reducing the number of required interactions with the teacher
Renaudo, Erwan. "Des comportements flexibles aux comportements habituels : meta-apprentissage neuro-inspiré pour la robotique autonome". Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066508/document.
Texto completoIn this work, we study how the notion of behavioral habit, inspired from the study of biology, can benefit robots. Robot control architectures allow the robot to plan to reach long-term goals while staying reactive to events happening in the environment (Kortenkamp and Simmons, 2008). However, these architectures are rarely provided with learning capabilities that would allow them to acquire knowledge from experience. On the other hand, learning has been shown to be an essential ability for behavioral adaptation in mammals. It permits flexible adaptation to new contexts but also efficient behavior in known contexts (Dickinson, 1985). The learning mechanisms are modeled as model-based (planning) and model-free (habitual) reinforcement learning algorithms (Sutton and Barto, 1998), which are combined into a global model of behavior (Daw et al., 2005). We propose a robotic control architecture that takes inspiration from this model of behavior and embeds the two kinds of algorithms, and we study its performance in a simulated robotic task. None of the several methods we studied for combining the algorithms gave satisfying results; however, the study allowed us to identify some properties required of the planning process in a robotic task. We extended our study to two other tasks (one on a real robot) and confirmed that combining the algorithms improves the learning of the robot's behavior
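The dual-system pattern of Daw et al. (2005) that this abstract refers to can be sketched in a few lines: a model-free learner caches action values, a model-based learner plans in a learned model, and an arbitration signal picks which one acts. The selection criterion below (a running average of the model-free TD error) is a crude hypothetical stand-in, not one of the combination methods evaluated in the thesis.

```python
from collections import defaultdict

class DualSystemAgent:
    def __init__(self, actions, alpha=0.1, gamma=0.95):
        self.actions, self.alpha, self.gamma = actions, alpha, gamma
        self.q_mf = defaultdict(float)  # habitual (model-free) values
        self.model = {}                 # learned (s, a) -> (r, s'), deterministic world
        self.mf_error = 1.0             # running |TD error| of the habitual system

    def act(self, state):
        # Plan while the habitual system is still unreliable, else use habits.
        return self.plan(state) if self.mf_error > 0.2 else self.greedy_mf(state)

    def greedy_mf(self, s):
        return max(self.actions, key=lambda a: self.q_mf[(s, a)])

    def plan(self, s, depth=3):
        # Depth-limited lookahead in the learned model; unknown transitions
        # are treated as zero-reward self-loops.
        def value(state, d):
            if d == 0:
                return 0.0
            return max(r + self.gamma * value(s2, d - 1)
                       for a in self.actions
                       for (r, s2) in [self.model.get((state, a), (0.0, state))])
        return max(self.actions,
                   key=lambda a: (self.model.get((s, a), (0.0, s))[0]
                                  + self.gamma
                                  * value(self.model.get((s, a), (0.0, s))[1],
                                          depth - 1)))

    def learn(self, s, a, r, s2):
        td = (r + self.gamma * max(self.q_mf[(s2, b)] for b in self.actions)
              - self.q_mf[(s, a)])
        self.q_mf[(s, a)] += self.alpha * td
        self.mf_error = 0.9 * self.mf_error + 0.1 * abs(td)
        self.model[(s, a)] = (r, s2)
```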
De La Bourdonnaye, François. "Learning sensori-motor mappings using little knowledge : application to manipulation robotics". Thesis, Université Clermont Auvergne (2017-2020), 2018. http://www.theses.fr/2018CLFAC037/document.
Texto completoThe thesis is focused on learning a complex manipulation robotics task using little knowledge. More precisely, the task concerned consists in reaching an object with a serial arm, and the objective is to learn it without camera calibration parameters, forward kinematics, handcrafted features, or expert demonstrations. Deep reinforcement learning algorithms are well suited to this objective. Indeed, reinforcement learning allows learning sensori-motor mappings while dispensing with a dynamics model. Besides, deep learning allows dispensing with handcrafted features for the state space representation. However, it is difficult to specify the objectives of the learned task without requiring human supervision. Some solutions imply expert demonstrations or shaping rewards to guide robots towards their objective. The latter are generally computed using forward kinematics and handcrafted visual modules. Another class of solutions consists in decomposing the complex task. Learning from easy missions can be used, but this requires knowledge of a goal state. Decomposing the whole complex task into simpler sub-tasks can also be utilized (hierarchical learning) but does not necessarily imply a lack of human supervision. Alternative approaches which use several agents in parallel to increase the probability of success can be used but are costly. In our approach, we decompose the whole reaching task into three simpler sub-tasks while taking inspiration from human behavior. Indeed, humans first look at an object before reaching for it. The first learned task is an object fixation task, aimed at localizing the object in 3D space. This is learned using deep reinforcement learning and a weakly supervised reward function. The second task consists in jointly learning end-effector binocular fixations and a hand-eye coordination function. This is also learned using a similar set-up and is aimed at localizing the end-effector in 3D space. The third task uses the two previously learned skills to learn to reach an object, with the same requirements as the two previous tasks: it requires hardly any supervision. In addition, without using additional priors, an object reachability predictor is learned in parallel. The main contribution of this thesis is the learning of a complex robotic task with weak supervision
Bouzid, Salah Eddine. "Optimisation multicritères des performances de réseau d’objets communicants par méta-heuristiques hybrides et apprentissage par renforcement". Thesis, Le Mans, 2020. http://cyberdoc-int.univ-lemans.fr/Theses/2020/2020LEMA1026.pdf.
Texto completoThe deployment of Communicating Things Networks (CTNs), with continuously increasing densities, needs to be optimal in terms of quality of service, energy consumption and lifetime. Determining the optimal placement of the nodes of these networks, relative to the different quality criteria, is an NP-hard problem. Faced with this NP-hardness, especially for indoor environments, existing approaches focus on the optimization of a single objective while neglecting the other criteria, or adopt an expensive manual solution. Finding new approaches to solve this problem is therefore required. Accordingly, in this thesis, we propose a new approach which automatically generates a deployment that guarantees optimality in terms of performance and robustness to possible topological failures and instabilities. The proposed approach is based, on the one hand, on modeling the deployment problem as a constrained multi-objective optimization problem and solving it using a hybrid algorithm combining genetic multi-objective optimization with weighted-sum optimization and, on the other hand, on the integration of reinforcement learning to guarantee the optimization of energy consumption and the extension of the network lifetime. To apply this approach, two tools are developed. The first, called MOONGA (Multi-Objective Optimization of wireless Network approach based on Genetic Algorithm), automatically generates the placement of nodes while optimizing the metrics that define the QoS of the CTN: connectivity, m-connectivity, coverage, k-coverage, coverage redundancy and cost. The MOONGA tool considers constraints related to the architecture of the deployment space, the network topology, the specificities of the application and the preferences of the network designer. The second optimization tool is named R2LTO (Reinforcement Learning for Life-Time Optimization), a new routing protocol for CTNs based on distributed reinforcement learning that determines the optimal routing path in order to guarantee energy efficiency and to extend the network lifetime while maintaining the required QoS
Chandramohan, Senthilkumar. "Revisiting user simulation in dialogue systems : do we still need them ? : will imitation play the role of simulation ?" Phd thesis, Université d'Avignon, 2012. http://tel.archives-ouvertes.fr/tel-00875229.
Ferreira, Emmanuel. "Apprentissage automatique en ligne pour un dialogue homme-machine situé". Thesis, Avignon, 2015. http://www.theses.fr/2015AVIG0206/document.
Texto completoA dialogue system should give the machine the ability to interact naturally and efficiently with humans. In this thesis, we focus on the issue of the development of stochastic dialogue systems. Thus, we especially consider the Partially Observable Markov Decision Process (POMDP) framework, which yields state-of-the-art performance on goal-oriented dialogue management tasks. This model enables the system to cope with communication ambiguities due to the noisy channel and to optimize its dialogue management strategy directly from data with Reinforcement Learning (RL) methods. Statistical approaches often require the availability of a large amount of training data to reach good performance. However, corpora of interest are seldom readily available, and collecting such data is both time-consuming and expensive. For instance, it may require a working prototype to initiate preliminary experiments with the support of expert users, or the consideration of other alternatives such as user simulation techniques. Very few studies to date have considered learning a dialogue strategy from scratch by interacting with real users, yet this solution is of great interest. Indeed, considering the learning process as part of the life cycle of a system offers a principled framework to dynamically adapt the system to new conditions in an online and seamless fashion. In this thesis, we endeavour to provide solutions that make such a dialogue system cold start (nearly from scratch) possible, but also to improve its ability to adapt to new conditions in operation (domain extension, new user profile, etc.). First, we investigate the conditions under which initial expert knowledge (such as expert rules) can be used to accelerate the policy optimization of a learning agent. Similarly, we study how polarized user appraisals gathered throughout the course of the interaction can be integrated into a reinforcement learning-based dialogue manager. More specifically, we discuss how this information can be cast into socially-inspired rewards to speed up the policy optimisation for both efficient task completion and user adaptation in an online learning setting. The results obtained on a reference task demonstrate that a (quasi-)optimal policy can be learnt in just a few hundred dialogues, but also that the considered additional information is able to significantly accelerate the learning as well as improve noise tolerance. Second, we focus on reducing the development cost of the spoken language understanding module. For this, we exploit recent word embedding models (projections of words into a continuous vector space representing syntactic and semantic properties) to generalize from limited initial knowledge about the dialogue task, enabling the machine to instantly understand user utterances. We also propose to dynamically enrich this knowledge with both active learning techniques and state-of-the-art statistical methods. Our experimental results show that state-of-the-art performance can be obtained with a very limited amount of in-domain and in-context data.
We also show that we are able to refine the proposed model by exploiting user feedback on the system outputs, as well as to optimize our adaptive learning with an adversarial bandit algorithm that successfully balances the trade-off between user effort and module performance. Finally, we study how the physical embodiment of a dialogue system in a humanoid robot can help the interaction in a dedicated Human-Robot application where dialogue system learning and testing are carried out with real users. Indeed, in this thesis we propose an extension of the previously considered decision-making techniques to take into account the robot's awareness of the users' beliefs (perspective taking) in an RL-based situated dialogue management optimisation procedure
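For the adversarial bandit mentioned at the end of this abstract, one standard choice is EXP3 (Auer et al., 2002). The sketch below is a generic EXP3 implementation, not necessarily the exact variant used in the thesis; the arms and the reward are hypothetical placeholders.

```python
import math, random

class Exp3:
    """EXP3: exponential weights with importance-weighted reward
    estimates; rewards are assumed to lie in [0, 1]."""
    def __init__(self, n_arms, gamma=0.1):
        self.n, self.gamma = n_arms, gamma
        self.w = [1.0] * n_arms

    def probs(self):
        total = sum(self.w)
        return [(1 - self.gamma) * wi / total + self.gamma / self.n
                for wi in self.w]

    def draw(self):
        return random.choices(range(self.n), weights=self.probs())[0]

    def update(self, arm, reward):
        x_hat = reward / self.probs()[arm]   # importance-weighted estimate
        self.w[arm] *= math.exp(self.gamma * x_hat / self.n)

# Hypothetical usage: arms are alternative adaptation strategies, the
# reward mixes task success with a penalty for the user effort required.
bandit = Exp3(n_arms=3)
arm = bandit.draw()
bandit.update(arm, reward=0.7)
```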
Lecerf, Ugo. "Robust learning for autonomous agents in stochastic environments". Electronic Thesis or Diss., Sorbonne université, 2022. http://www.theses.fr/2022SORUS253.
Texto completoIn this work we explore data-driven deep reinforcement learning (RL) approaches for an autonomous agent to robustly perform a navigation task and act correctly in the face of risk and uncertainty. We investigate the effects that sudden changes in environment conditions have on autonomous agents, and explore methods which allow an agent to generalize to unforeseen, sudden modifications of its environment that it was not explicitly trained to handle. Taking inspiration from the human dopamine circuit, RL measures and optimizes an agent's performance in terms of the rewards and penalties it receives for desirable or undesirable behaviour. Our initial approach is to learn to estimate the distribution of expected rewards of the agent, and to use information about the modes of this distribution to gain nuanced information about how an agent can act in a high-risk situation. Later, we show that we are able to achieve the same robustness objective with respect to uncertainties in the environment by learning the most effective contingency policies in a 'divide and conquer' approach, where the computational complexity of the learning task is shared between multiple policy models. We then combine this approach with a hierarchical planning module which is used to effectively schedule the different policy models in such a way that the collection of contingency plans is highly robust to unanticipated environment changes. This combination of learning and planning enables us to make the most of the adaptability of deep learning models, as well as of the stricter and more explicit constraints that can be implemented and measured by means of a hierarchical planner
Jneid, Khoder. "Apprentissage par Renforcement Profond pour l'Optimisation du Contrôle et de la Gestion des Bâtiment". Electronic Thesis or Diss., Université Grenoble Alpes, 2023. http://www.theses.fr/2023GRALM062.
Texto completoHeating, ventilation, and air-conditioning (HVAC) systems account for a high share of the energy consumption of buildings. Conventional approaches to controlling HVAC systems rely on rule-based control (RBC), which consists of predefined rules set by an expert. Model-predictive control (MPC), widely explored in the literature, is not adopted in industry since it is a model-based approach that first requires building models of the building for use in the optimization phase, and is thus time-consuming and expensive. In this PhD, we investigate reinforcement learning (RL) to optimize the energy consumption of HVAC systems while maintaining good thermal comfort and good air quality. Specifically, we focus on model-free RL algorithms that learn through interaction with the environment (the building including its HVAC) and thus do not require accurate models of the environment. In addition, online approaches are considered. The main challenge of online model-free RL is the number of days necessary for the algorithm to acquire enough data and action feedback to start acting properly. Hence, the research subject of this PhD is boosting model-free RL algorithms to converge faster, so as to make them applicable in real-world applications such as HVAC control. Two approaches were explored during the PhD to achieve our objective: the first combines RBC with value-based RL, and the second combines fuzzy rules with policy-based RL. Both approaches aim to boost the convergence of RL by guiding the RL policy, but they are completely different: the first exploits RBC rules during training, while in the second the fuzzy rules are injected directly into the policy. Tests are performed on a simulated office during winter. This simulated office is a replica of a real office at Grenoble INP
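The first approach (RBC combined with value-based RL) suggests a simple pattern worth sketching: during ε-exploration, follow the expert rules instead of acting uniformly at random. The rules, thresholds and discretized state below are hypothetical, and the actual coupling in the thesis may differ.

```python
import random
from collections import defaultdict

ACTIONS = ["heat", "ventilate", "idle"]
Q = defaultdict(float)   # tabular stand-in; the thesis uses deep RL

def rbc_action(state):
    # Hypothetical expert rules on a discretized (temperature, co2) state.
    temp, co2 = state
    if temp < 20:
        return "heat"
    if co2 > 800:
        return "ventilate"
    return "idle"

def act(state, eps=0.2):
    if random.random() < eps:
        return rbc_action(state)   # rules replace blind random exploration
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(s, a, r, s2, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```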
Keramati, Mohammadmahdi. "A homeostatic reinforcement learning theory, and its implications in cocaine addiction". Thesis, Paris 5, 2013. http://www.theses.fr/2013PA05T027/document.
Texto completoThis thesis is composed of two parts. In the first part, we propose a theory for the interaction between reinforcement learning and homeostatic regulation processes. Efficient regulation of internal homeostasis, and defending it against perturbations, requires complex behavioral strategies for obtaining physiologically-depleted resources. In this respect, it is essential that the brain's homeostatic regulation and associative learning processes work in concert. We propose a normative computational theory for homeostatically-regulated reinforcement learning (HRL), where physiological stability and reward acquisition prove to be identical objectives achievable simultaneously. Theoretically, the framework resolves the long-standing question of how overt behavior is modulated by internal state, and how animals learn to act predictively to preclude prospective homeostatic challenges (anticipatory responding). It further provides a normative explanation for temporal discounting of reward, and accounts for risk-averse behavior, competition between motivational systems, taste-induced overeating, and the lack of motivation for intravenous injection of food. Neurobiologically, the theory suggests a computational explanation for the role of the orexin-based interaction between the hypothalamic circuitry and the midbrain dopaminergic nuclei, as an interface between internal states and motivated behaviors. In the second part of the thesis, we use the HRL model presented in the first part as the cornerstone for developing an Allostatic Reinforcement Learning (ARL) theory of cocaine addiction. We argue that cocaine addiction arises from the HRL system being hijacked by the pharmacological effects of cocaine on the brain. We demonstrate that the model can successfully capture a wide range of behavioral and neurobiological aspects of cocaine addiction, namely escalation of cocaine self-administration under long- but not short-access conditions, the U-shaped dose-response function for cocaine, priming-induced reinstatement of cocaine seeking, and the interaction between dopamine D2 receptor availability and cocaine seeking
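The drive-reduction idea at the core of HRL admits a compact formulation: with internal state h and setpoint h*, define a drive D(h) as a distance to the setpoint, and reward each outcome by the drive reduction it causes. The sketch below follows the general form used in the homeostatic RL literature; the exponents and numbers are illustrative, not the thesis's fitted values.

```python
import numpy as np

def drive(h, h_star, m=3.0, n=4.0):
    # Drive as a distance between internal state h and setpoint h*;
    # the (m, n) exponents here are illustrative.
    return np.sum(np.abs(h_star - h) ** n) ** (m / n)

def homeostatic_reward(h_before, h_after, h_star):
    # Reward = reduction of the drive caused by the outcome.
    return drive(h_before, h_star) - drive(h_after, h_star)

h_star = np.array([100.0])                  # e.g. a desired energy level
print(homeostatic_reward(np.array([80.0]), np.array([90.0]), h_star))    # > 0
print(homeostatic_reward(np.array([100.0]), np.array([110.0]), h_star))  # < 0: overshoot is punished
```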
Chareyre, Maxime. "Apprentissage non-supervisé pour la découverte de propriétés d'objets par découplage entre interaction et interprétation". Electronic Thesis or Diss., Université Clermont Auvergne (2021-...), 2023. http://www.theses.fr/2023UCFA0122.
Texto completoRobots are increasingly used to carry out tasks in controlled environments. However, their use in open environments is still fraught with difficulties. Robotic agents are likely to encounter objects whose behaviour and function they are unaware of. In some cases, the robot must interact with these elements to carry out its mission by collecting or moving them, but without knowledge of their dynamic properties it is not possible to implement an effective strategy for completing the mission. In this thesis, we present a method for teaching an autonomous robot a physical interaction strategy with unknown objects, without any a priori knowledge, the aim being to extract information about as many of the object's physical properties as possible from the interactions observed by its sensors. Existing methods for characterising objects through physical interactions do not fully satisfy these criteria. Indeed, the interactions established only provide an implicit representation of the object's dynamics, requiring supervision to identify their properties; furthermore, the proposed solutions are based on unrealistic agent-free scenarios. Our approach differs from the state of the art by proposing a generic method for learning interaction that is independent of the object and its properties, and can therefore be decoupled from the prediction phase. In particular, this leads to a completely unsupervised global pipeline. In the first phase, we propose to learn an interaction strategy with the object via an unsupervised reinforcement learning method, using an intrinsic motivation signal based on the idea of maximising variations in a state vector of the object. The aim is to obtain a set of interactions containing information that is highly correlated with the object's physical properties. This method has been tested on a simulated robot interacting by pushing and has enabled properties such as the object's mass, shape and friction to be accurately identified. In a second phase, we make the assumption that the true physical properties define a latent space that explains the object's behaviours, and that this space can be identified from observations collected through the agent's interactions. We set up a self-supervised prediction task in which we adapt a state-of-the-art architecture to create this latent space. Our simulations confirm that combining the behavioural model with this architecture leads to the emergence of a representation of the object's properties whose principal components are shown to be strongly correlated with the object's physical properties. Once the properties of the objects have been extracted, the agent can use them to improve its efficiency in tasks involving these objects. We conclude this study by highlighting the performance gains achieved by the agent through training via reinforcement learning on a simplified object-repositioning task where the properties are perfectly known. All the work carried out in simulation confirms the effectiveness of an innovative method aimed at autonomously discovering the physical properties of an object through the physical interactions of a robot. The prospects for extending this work involve transferring it to a real robot in a cluttered environment
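The intrinsic motivation signal of the first phase, rewarding interactions in proportion to the variation they induce in the object's state vector, can be written in one line. The state vector components below are hypothetical.

```python
import numpy as np

def intrinsic_reward(obj_state_before, obj_state_after):
    # Reward = magnitude of the variation induced in the object's state.
    return float(np.linalg.norm(obj_state_after - obj_state_before))

before = np.array([0.0, 0.0, 0.0])                          # hypothetical (x, y, yaw)
print(intrinsic_reward(before, np.array([0.2, 0.1, 0.3])))  # a push that moves the object
print(intrinsic_reward(before, before))                     # a miss earns nothing
```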
Alami, Réda. "Bandits à Mémoire pour la prise de décision en environnement dynamique. Application à l'optimisation des réseaux de télécommunications". Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG063.
Texto completoIn this PhD thesis, we study the non-stationary multi-armed bandit problem, where the non-stationary behavior of the environment is characterized by several abrupt changes called "change-points". We propose Memory Bandits: a combination of an algorithm for the stochastic multi-armed bandit and the Bayesian Online Change-Point Detector (BOCPD). The analysis of the latter has long been an open problem in the statistical and sequential learning theory community. For this reason, we derive a variant of the Bayesian online change-point detector which is easier to analyze mathematically in terms of false alarm rate and detection delay (the most common criteria for online change-point detection). Then, we introduce the decentralized exploration problem in the multi-armed bandit paradigm, where a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. We propose a first generic solution called decentralized elimination, which uses any best-arm identification algorithm as a subroutine, with the guarantee that the algorithm preserves privacy at a low communication cost. Finally, we evaluate multi-armed bandit strategies in two different telecommunication-network contexts. First, in the LoRaWAN (Long Range Wide Area Network) context, we propose to use multi-armed bandit algorithms instead of the default ADR (Adaptive Data Rate) algorithm in order to minimize the energy consumption and the packet losses of end-devices. Then, in an IEEE 802.15.4-TSCH context, we evaluate 9 multi-armed bandit algorithms in order to select those that choose high-performance channels, using data collected through the FIT IoT-LAB platform. The performance evaluation suggests that our proposal can significantly improve the packet delivery ratio compared to the default TSCH operation, thereby increasing the reliability and energy efficiency of the transmissions
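The Memory Bandits pattern (a stochastic bandit whose statistics are reset when a change-point detector fires) can be sketched as follows. The detector below is a crude mean-shift test standing in for the Bayesian online change-point detector actually analyzed in the thesis; the window and gap parameters are hypothetical.

```python
import math

class RestartUCB:
    """UCB1 statistics, fully reset when a crude mean-shift test fires
    (the thesis uses a Bayesian online change-point detector instead)."""
    def __init__(self, n_arms, window=50, gap=0.15):
        self.n_arms, self.window, self.gap = n_arms, window, gap
        self.reset_all()

    def reset_all(self):
        self.counts = [0] * self.n_arms
        self.means = [0.0] * self.n_arms
        self.recent = [[] for _ in range(self.n_arms)]
        self.t = 0

    def select(self):
        self.t += 1
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a
        return max(range(self.n_arms),
                   key=lambda a: self.means[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
        self.recent[arm].append(reward)
        del self.recent[arm][:-self.window]   # keep a sliding window
        recent = sum(self.recent[arm]) / len(self.recent[arm])
        if self.counts[arm] > 2 * self.window and abs(recent - self.means[arm]) > self.gap:
            self.reset_all()   # abrupt change suspected: forget the past
```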
Ho, Vinh Thanh. "Techniques avancées d'apprentissage automatique basées sur la programmation DC et DCA". Thesis, Université de Lorraine, 2017. http://www.theses.fr/2017LORR0289/document.
Texto completoIn this dissertation, we develop some advanced machine learning techniques in the framework of online learning and reinforcement learning (RL). The backbones of our approaches are DC (Difference of Convex functions) programming and DCA (DC Algorithm), and their online versions, which are best known as powerful nonsmooth, nonconvex optimization tools. This dissertation is composed of two parts: the first part studies some online machine learning techniques and the second part concerns RL in both batch and online modes. The first part includes two chapters corresponding to online classification (Chapter 2) and prediction with expert advice (Chapter 3). These two chapters present a unified DC approximation approach to different online learning algorithms where the observed objective functions are 0-1 loss functions. We thoroughly study how to develop efficient online DCA algorithms from both theoretical and computational standpoints. The second part consists of four chapters (Chapters 4, 5, 6, 7). After a brief introduction of RL and its related works in Chapter 4, Chapter 5 aims to provide effective RL techniques in batch mode based on DC programming and DCA. In particular, we first consider four different DC optimization formulations for which corresponding attractive DCA-based algorithms are developed, then carefully address the key issues of DCA, and finally show the computational efficiency of these algorithms through various experiments. Continuing this study, in Chapter 6 we develop DCA-based RL techniques in online mode and propose their alternating versions. As an application, we tackle the stochastic shortest path (SSP) problem in Chapter 7. In particular, a class of SSP problems can be reformulated in two ways: as a cardinality minimization formulation and as an RL formulation. First, the cardinality formulation involves the zero-norm in the objective and binary variables. We propose a DCA-based algorithm by exploiting a DC approximation approach for the zero-norm and an exact penalty technique for the binary variables. Second, we make use of the aforementioned DCA-based batch RL algorithm. All proposed algorithms are tested on artificial road networks
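The DCA principle the dissertation builds on fits in a toy example: to minimize f = g − h with g and h convex, repeatedly linearize h at the current point and minimize the convex surrogate. For f(x) = x⁴ − 4x² (g = x⁴, h = 4x²) the surrogate minimizer has a closed form, so the whole scheme is a few lines; this is a didactic sketch, far from the nonsmooth formulations treated in the dissertation.

```python
# DCA on f(x) = g(x) - h(x), with g(x) = x**4 and h(x) = 4*x**2 (both convex).
# Each iteration solves min_x g(x) - h'(x_k) * x, i.e. 4 x**3 = 8 x_k.

def dca(x0, iters=30):
    x = x0
    for _ in range(iters):
        grad_h = 8.0 * x                    # h'(x_k)
        # Closed-form minimizer of x**4 - grad_h * x (signed cube root).
        x = (grad_h / 4.0) ** (1 / 3) if grad_h >= 0 else -((-grad_h / 4.0) ** (1 / 3))
    return x

print(dca(0.5))    # ->  sqrt(2) ~  1.414, a local minimum of f
print(dca(-0.5))   # -> -sqrt(2), the symmetric local minimum
```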
Père, Valentin. "Contributions au contrôle et au dimensionnement des micro-réseaux par apprentissage par renforcement : application aux systèmes avec production renouvelable et stockage hybride batterie-hydrogène". Electronic Thesis or Diss., Ecole nationale des Mines d'Albi-Carmaux, 2023. http://www.theses.fr/2023EMAC0018.
Texto completoCombining photovoltaic panels with an electrochemical battery reduces the daily phase difference between electricity production and demand in a microgrid. For long-term electricity storage, the combined use of an electrolyzer, hydrogen storage and a fuel cell offers the possibility of conserving electricity produced in summer to meet increased winter demand. Optimal real-time control of microgrid storage units is hampered by random data and the non-linear dynamic behavior of the units over long time horizons. This work presents a methodology for sizing and controlling a microgrid comprising photovoltaic electricity production, a lithium-ion battery and hydrogen storage, based on economic, environmental and technical objectives. The sizing of the units in a microgrid establishes their constraints of use, while the criteria to be optimized for the sizing (such as the cost of energy, the rate of self-consumption, or the probability of breakdowns) depend on the management of these units. This interdependence justifies the development of a sizing methodology coupled with a long-term energy management algorithm. The management of a microgrid is influenced by random variables such as demand and the energy produced at any given time. Reinforcement learning is a sequential decision-making methodology, based on a dynamic model of the system, that can adapt its strategy to random data. As a first step, a reinforcement learning control methodology is adopted, integrating non-linearities such as the aging of the storage system. Reinforcement learning enabled the energy management system to maintain an effective unit control policy with respect to the targeted criteria. This effectiveness is maintained despite data that differ from, and a time horizon longer than, those on which the model was built. The control strategies developed suggest that the advantages of long-term electricity storage depend on the characteristics of the microgrid, in particular on the amplitude of demand and the capacity of the battery. The study shows that a compromise must be found between the economic profitability of the microgrid and the guarantee of its autonomy. A bi-level optimization method is developed to achieve optimal unit sizing and energy management. The control of the microgrid by reinforcement learning forms the inner loop, while unit sizing is carried out using a simulated-annealing algorithm in the main loop. Particular attention is paid to minimizing computing time, by developing a method for transferring the control policy from one iteration of the main loop to another. Offline reinforcement learning is used to learn unit control strategies without random interaction with the microgrid simulation: the strategies are learned by observing the control decisions made by a model trained on other sizings in previous iterations. The computation time is reduced by over 50% and the quality of the learned control policy is not affected. The results are analyzed with regard to the objectives considered, the control strategy and the data incorporated into the microgrid simulation
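The bi-level structure described above can be summarized as a skeleton: an outer simulated-annealing loop proposes sizings, and an inner loop trains (or transfers) the RL energy-management policy for each candidate. Everything below is a hypothetical placeholder interface, not the thesis's implementation.

```python
import math, random

def bilevel_sizing(initial_sizing, evaluate, neighbour,
                   t0=1.0, cooling=0.95, iters=100):
    """Outer simulated-annealing loop; `evaluate(sizing, warm_start)`
    stands for the inner RL training and returns (cost, policy), and
    `neighbour` perturbs a sizing. Warm-starting from the previous
    policy mimics the transfer mechanism described above."""
    sizing = initial_sizing
    cost, policy = evaluate(sizing, None)
    best_sizing, best_cost, temp = sizing, cost, t0
    for _ in range(iters):
        candidate = neighbour(sizing)
        cand_cost, cand_policy = evaluate(candidate, warm_start=policy)
        if cand_cost < cost or random.random() < math.exp((cost - cand_cost) / temp):
            sizing, cost, policy = candidate, cand_cost, cand_policy
            if cost < best_cost:
                best_sizing, best_cost = sizing, cost
        temp *= cooling
    return best_sizing, best_cost

# Toy usage with a dummy inner loop: cost = |battery_kwh - 10|.
best, cost = bilevel_sizing(5.0,
                            lambda s, warm_start=None: (abs(s - 10.0), None),
                            lambda s: s + random.uniform(-1.0, 1.0))
```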
Pinault, Florian. "Apprentissage par renforcement pour la généralisation des approches automatiques dans la conception des systèmes de dialogue oral". Phd thesis, Université d'Avignon, 2011. http://tel.archives-ouvertes.fr/tel-00933937.
Texto completoDaher, Tony. "Gestion cognitive des réseaux radio auto-organisant de cinquième génération". Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLT023/document.
Texto completoThe pressure on operators to improve network management efficiency is constantly growing for many reasons: user traffic is increasing very fast, end-user expectations are higher, and emerging services have very specific requirements. The Self-Organizing Networks (SON) concept was introduced by the 3rd Generation Partnership Project as a promising solution to simplify the operation and management of complex networks. Many SON modules are already being deployed in today's networks. Such networks are known as SON-enabled networks, and they have proved useful in reducing the complexity of network management. However, SON-enabled networks are still far from realizing a network that is autonomous and self-managed as a whole. In fact, the behavior of the SON functions depends on the parameters of their algorithms, as well as on the network environment where they are deployed. Besides, SON objectives and actions might conflict with each other, leading to incompatible parameter tuning in the network. Each SON function hence still needs to be manually configured, depending on the network environment and the objectives of the operator. In this thesis, we propose an integrated SON management system based on a Cognitive Policy Based SON Management (C-PBSM) approach, itself based on Reinforcement Learning (RL). The C-PBSM autonomously translates high-level operator objectives, formulated as target Key Performance Indicators (KPIs), into configurations of the SON functions. Furthermore, through its cognitive capabilities, the C-PBSM is able to build its knowledge by interacting with the real network. It is also capable of adapting to environment changes. We investigate different RL approaches, analyze convergence time and scalability, and propose adapted solutions. We tackle the problem of non-stationarity in the network, notably traffic variations, as well as the different contexts present in a network. We also propose an approach for transfer learning and collaborative learning. Practical aspects of deploying RL agents in real networks are investigated under a Software Defined Network (SDN) architecture
Esteves, José Jurandir Alves. "Optimization of network slice placement in distributed large-scale infrastructures : from heuristics to controlled deep reinforcement learning". Electronic Thesis or Diss., Sorbonne université, 2021. http://www.theses.fr/2021SORUS325.
Texto completoThis PhD thesis investigates how to optimize Network Slice Placement in distributed large-scale infrastructures, focusing on online heuristics and Deep Reinforcement Learning (DRL) based approaches. First, we rely on Integer Linear Programming (ILP) to propose a data model enabling on-Edge and on-Network Slice Placement. Contrary to most studies related to placement in the NFV context, the proposed ILP model considers complex Network Slice topologies and pays special attention to the geographic location of Network Slice users and its impact on End-to-End (E2E) latency. Extensive numerical experiments show the relevance of taking the user location constraints into account. Then, we rely on an approach called the "Power of Two Choices" (P2C) to propose an online heuristic algorithm for the problem, adapted to support placement on large-scale distributed infrastructures while integrating Edge-specific constraints. The evaluation results show the good performance of the heuristic, which solves the problem in a few seconds under a large-scale scenario. The heuristic also improves the acceptance ratio of Network Slice Placement Requests when compared against a deterministic online ILP-based solution. Finally, we investigate the use of ML methods, more specifically DRL, for increasing the scalability and automation of Network Slice Placement, considering a multi-objective optimization approach to the problem. We first propose a DRL algorithm for Network Slice Placement which relies on the Advantage Actor Critic algorithm for fast learning and on Graph Convolutional Networks for feature-extraction automation. Then, we propose an approach we call Heuristically Assisted Deep Reinforcement Learning (HA-DRL), which uses heuristics to control the learning and execution of the DRL agent. We evaluate this solution through simulations under stationary, cycle-stationary and non-stationary network load conditions. The evaluation results show that heuristic control is an efficient way of speeding up the learning process of DRL, achieving a substantial gain in resource utilization, reducing performance degradation, and being more reliable under unpredictable changes in network load than non-controlled DRL algorithms
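The "Power of Two Choices" principle the heuristic relies on is itself very simple: sample two candidate hosts at random and keep the less loaded one. The sketch below illustrates only this load-balancing core; the actual heuristic also handles capacity, latency and topology constraints.

```python
import random

def p2c_place(load, demand, rng=random):
    # Sample two candidate nodes, place on the one with the lower load.
    a, b = rng.sample(range(len(load)), 2)
    chosen = a if load[a] <= load[b] else b
    load[chosen] += demand
    return chosen

load = [0.0] * 10
for _ in range(100):
    p2c_place(load, demand=1.0)
print(max(load) - min(load))   # much more balanced than a single random choice
```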
Martinez, Coralie. "Classification précoce de séquences temporelles par de l'apprentissage par renforcement profond". Thesis, Université Grenoble Alpes (ComUE), 2019. http://www.theses.fr/2019GREAT123.
Texto completoEarly classification (EC) of time series is a recent research topic in the field of sequential data analysis. It consists in assigning a label to data that is collected sequentially, with new data points arriving over time, where the prediction of a label has to be made using as few data points as possible. The EC problem is of paramount importance for supporting decision makers in many real-world applications, ranging from process control to fraud detection. It is particularly interesting for applications concerned with the costs induced by the acquisition of data points, or for applications which seek rapid label prediction in order to take early actions. This is for example the case in the field of health, where it is necessary to provide a medical diagnosis as soon as possible from the sequence of medical observations collected over time. Another example is predictive maintenance, with the objective of anticipating the breakdown of a machine from its sensor signals. In this doctoral work, we developed a new approach to this problem, based on the formulation of a sequential decision-making problem: the EC model has to decide between classifying an incomplete sequence and delaying the prediction to collect additional data points. Specifically, we describe this problem as a Partially Observable Markov Decision Process, denoted EC-POMDP. The approach consists in training an EC agent with Deep Reinforcement Learning (DRL) in an environment characterized by the EC-POMDP. The main motivation for this approach was to offer an end-to-end model for EC which is able to simultaneously learn optimal patterns in the sequences for classification and optimal strategic decisions for the time of prediction. The method also allows setting the importance of time against classification accuracy in the definition of rewards, according to the application and its willingness to make this compromise. In order to solve the EC-POMDP and model the policy of the EC agent, we applied an existing DRL algorithm, the Double Deep Q-Network algorithm, whose general principle is to update the policy of the agent during training episodes using a replay memory of past experiences. We showed that the application of the original algorithm to the EC problem led to imbalanced-memory issues which can weaken the training of the agent. Consequently, to cope with those issues and offer more robust training of the agent, we adapted the algorithm to the EC-POMDP specificities and introduced strategies for memory management and episode management. In experiments, we showed that these contributions improved the performance of the agent over the original algorithm, and that we were able to train an EC agent which compromised between speed and accuracy on each sequence individually. We were also able to train EC agents on public datasets for which we have no expertise, showing that the method is applicable to various domains. Finally, we proposed some strategies to interpret the decisions of the agent and to validate or reject them. In experiments, we showed how these solutions can help gain insight into the agent's choice of action
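The sequential decision problem can be made concrete with a toy transition function: at each step the agent either waits (revealing the next data point at a small cost) or commits to a label (ending the episode with an accuracy reward). The costs and labels below are hypothetical, not the reward scheme tuned in the thesis.

```python
def ec_step(t, sequence, true_label, action, wait_cost=0.01):
    """One transition of a sketch EC-(PO)MDP: `action` is "wait" or a label."""
    if action == "wait" and t + 1 < len(sequence):
        return t + 1, -wait_cost, False           # next point, small penalty
    label = action if action != "wait" else None  # forced decision at the end
    return t, (1.0 if label == true_label else -1.0), True

t, done = 0, False
sequence, y = [0.1, 0.4, 0.9], "anomaly"
while not done:
    action = "wait" if t < 2 else "anomaly"       # stand-in for the DDQN policy
    t, reward, done = ec_step(t, sequence, y, action)
```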
Nicart, Esther. "Qualitative reinforcement for man-machine interactions". Thesis, Normandie, 2017. http://www.theses.fr/2017NORMC206/document.
Texto completoInformation extraction (IE) is defined as the identification and extraction of elements of interest, such as named entities, their relationships, and their roles in events. For example, a web-crawler might collect open-source documents, which are then processed by an IE treatment chain to produce a summary of the information contained in them. We model such an IE document treatment chain as a Markov Decision Process, and use reinforcement learning to allow the agent to learn to construct custom-made chains "on the fly", and to continuously improve them. We build a platform, BIMBO (Benefiting from Intelligent and Measurable Behaviour Optimisation), which enables us to measure the impact on the learning of various models, algorithms, parameters, etc. We apply this in an industrial setting, specifically to a document treatment chain which extracts events from massive volumes of web pages and other open-source documents. Our emphasis is on minimising the burden on the human analysts, from whom the agent learns to improve, guided by their feedback on the events extracted. For this, we investigate different types of feedback, from numerical rewards, which require a lot of user effort and tuning, to partially and even fully qualitative feedback, which is much more intuitive and demands little to no user intervention. We carry out experiments, first with numerical rewards, then demonstrate that intuitive feedback still allows the agent to learn effectively. Motivated by the need to rapidly propagate the rewards learnt at the final states back to the initial ones, even during exploration, we propose Dora: an improved version of Q-Learning
Sola, Yoann. "Contributions to the development of deep reinforcement learning-based controllers for AUV". Thesis, Brest, École nationale supérieure de techniques avancées Bretagne, 2021. http://www.theses.fr/2021ENTA0015.
Texto completoThe marine environment is a very hostile setting for robotics. It is strongly unstructured, very uncertain and includes a lot of external disturbances which cannot be easily predicted or modelled. In this work, we try to control an autonomous underwater vehicle (AUV) in order to perform a waypoint tracking task, using a machine learning-based controller. Machine learning has enabled impressive progress in many different domains in recent years, and the subfield of deep reinforcement learning has produced several algorithms well suited to the continuous control of dynamical systems. We chose to implement the Soft Actor-Critic (SAC) algorithm, an entropy-regularized deep reinforcement learning algorithm that simultaneously fulfills a learning task and encourages the exploration of the environment. We compared a SAC-based controller with a Proportional-Integral-Derivative (PID) controller on a waypoint tracking task, using specific performance metrics. All the tests were performed in simulation thanks to the UUV Simulator. We applied these two controllers to the RexROV 2, a six-degrees-of-freedom cube-shaped remotely operated underwater vehicle (ROV) converted into an AUV. Through these tests, we propose several contributions, such as achieving end-to-end control of the AUV with SAC, outperforming the PID controller in terms of energy saving, and reducing the amount of information needed by the SAC algorithm. Moreover, we propose a methodology for training deep reinforcement learning algorithms on control tasks, as well as a discussion about the absence of guidance algorithms for our end-to-end AUV controller
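As a stand-in for the set-up described above, the snippet below trains the stable-baselines3 implementation of SAC on a generic continuous-control Gymnasium task; Pendulum-v1 is a placeholder for the RexROV 2 environment in the UUV Simulator, which cannot be reproduced in a few lines here.

```python
import gymnasium as gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v1")              # placeholder for the AUV environment
model = SAC("MlpPolicy", env, verbose=0)   # entropy-regularized actor-critic
model.learn(total_timesteps=20_000)

# Evaluate the learned policy deterministically.
obs, _ = env.reset()
for _ in range(200):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```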
Brenon, Alexis. "Modèle profond pour le contrôle vocal adaptatif d'un habitat intelligent". Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAM057/document.
Texto completoSmart homes, resulting from the merger of home automation, ubiquitous computing and artificial intelligence, support inhabitants in their activities of daily living to improve their quality of life. By allowing dependent and aged people to live at home longer, these homes provide a first answer to societal problems such as the dependency tied to the aging population. In a voice-controlled home, the home has to answer the user's requests covering a range of automated actions (lights, blinds, multimedia control, etc.). To achieve this, the control system of the home needs to be aware of the context in which a request has been made, but also to know the user's habits and preferences. Thus, the system must be able to aggregate information from a heterogeneous home-automation sensor network and take the (variable) user behavior into account. The development of smart-home control systems is hard due to the huge variability of home topologies and user habits. Furthermore, the whole set of contextual information needs to be represented in a common space in order to reason about it and make decisions. To address these problems, we propose to develop a system which continuously updates its model to adapt itself to the user, and which uses raw data from the sensors through a graphical representation. This new method is particularly interesting because it does not require any prior inference step to extract the context. Thus, our system uses deep reinforcement learning: a convolutional neural network to extract contextual information, and reinforcement learning for decision-making. This thesis then presents two systems: a first one based only on reinforcement learning, showing the limits of this approach in real environments with thousands of possible states, and a second one, ARCADES, whose development was enabled by the introduction of deep learning and whose good performance proves that the approach is relevant, opening many avenues for improvement
Pesquerel, Fabien. "Information per unit of interaction in stochastic sequential decision making". Electronic Thesis or Diss., Université de Lille (2022-....), 2023. https://pepite-depot.univ-lille.fr/LIBRE/EDMADIS/2023/2023ULILB048.pdf.
Texto completoIn this thesis, we study the rate at which one can solve an unknown stochastic problem. To this end, we introduce two research fields known as bandits and reinforcement learning. In these two settings, a learner must sequentially make decisions that affect a reward signal the learner receives. The learner does not know the environment with which it is interacting, yet wishes to maximize its average reward in the long run. More specifically, we are interested in studying a form of stochastic decision problem under the average-reward criterion, in which a learning algorithm interacts sequentially with a dynamical system, without any reset, in a single and infinite sequence of observations, actions, and rewards, while trying to maximize its total accumulated reward over time. We first introduce bandits, in which the set of decisions is constant, and define what is meant by solving the problem. Amongst learners, some are better than all the others, and are called optimal. We first focus on how to make the most out of each interaction with the system by revisiting an optimal algorithm and reducing its numerical complexity. The information extracted from each sample, per time step, is therefore larger, since optimality is preserved. Then we study an interesting structured problem in which one can exploit the structure without estimating it. Afterwards, we introduce reinforcement learning, in which the decisions a learner can make depend on a notion of state. Each time a learner makes a decision, it receives a reward and the state changes according to a transition law on the set of states. In some settings, known as ergodic, an optimal rate of solving is known, and we introduce a new algorithm that we prove to be optimal and show to be numerically efficient. In a final chapter, we make a step towards removing the ergodic assumption by considering the a priori simpler problem in which the transitions are known. Yet, correctly understanding the rate at which information can be acquired about an optimal solution is already not easy
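The "rate of solving" discussed above is classically illustrated by index policies whose regret grows logarithmically with time. Below is UCB1 on Bernoulli arms as the standard reference point; the thesis studies algorithms with sharper, distribution-dependent optimal indexes, which this sketch does not reproduce.

```python
import math, random

def ucb1(means, horizon=5000, rng=random):
    # Play each arm once, then pull the arm maximizing mean + sqrt(2 ln t / n).
    n = [0] * len(means)
    mu = [0.0] * len(means)
    for t in range(1, horizon + 1):
        if t <= len(means):
            a = t - 1
        else:
            a = max(range(len(means)),
                    key=lambda i: mu[i] + math.sqrt(2 * math.log(t) / n[i]))
        r = 1.0 if rng.random() < means[a] else 0.0   # Bernoulli reward
        n[a] += 1
        mu[a] += (r - mu[a]) / n[a]
    return n

print(ucb1([0.4, 0.5, 0.6]))   # pulls concentrate on the best (last) arm
```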
Merckling, Astrid. "Unsupervised pretraining of state representations in a rewardless environment". Electronic Thesis or Diss., Sorbonne université, 2021. http://www.theses.fr/2021SORUS141.
Texto completoThis thesis seeks to extend the capabilities of state representation learning (SRL) to help scale deep reinforcement learning (DRL) algorithms to continuous control tasks with high-dimensional sensory observations (such as images). SRL improves the performance of DRL by providing it with better inputs than the embeddings learned from scratch with end-to-end strategies. Specifically, this thesis addresses the problem of performing state estimation through deep unsupervised pretraining of state representations without reward. These representations must verify certain properties to allow for the correct application of bootstrapping and other decision-making mechanisms common to supervised learning, such as being low-dimensional and guaranteeing the local consistency and topology (or connectivity) of the environment, which we seek to achieve through the models pretrained with the two SRL algorithms proposed in this thesis
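A minimal form of such reward-free pretraining is an autoencoder that compresses observations into a low-dimensional state for a downstream DRL agent. The PyTorch sketch below is a baseline illustration only: the thesis's two SRL algorithms enforce additional properties (local consistency, topology preservation) that a plain reconstruction loss does not.

```python
import torch
import torch.nn as nn

obs_dim, state_dim = 64 * 64, 16          # e.g. flattened images -> 16-D state
encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                        nn.Linear(256, state_dim))
decoder = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                        nn.Linear(256, obs_dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3)

observations = torch.rand(512, obs_dim)   # stand-in for collected camera frames
for _ in range(100):
    batch = observations[torch.randint(0, 512, (64,))]
    loss = nn.functional.mse_loss(decoder(encoder(batch)), batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
# `encoder` now provides low-dimensional inputs for the DRL agent.
```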