Theses / dissertations on the topic "Apprentissage par renforcement multiagent"
See the top 50 works (theses / dissertations) for research on the topic "Apprentissage par renforcement multiagent".
Dinneweth, Joris. "Vers des approches hybrides fondées sur l'émergence et l'apprentissage : prise en compte des véhicules autonomes dans le trafic". Electronic Thesis or Diss., université Paris-Saclay, 2024. http://www.theses.fr/2024UPASG099.
According to the World Health Organization, road accidents cause almost 1.2 million deaths and 40 million injuries each year. In wealthy countries, safety standards prevent a large proportion of accidents. The remaining accidents are caused by human behavior. For this reason, some are planning to automate road traffic, i.e., to replace humans as drivers of their vehicles. However, automating road traffic can hardly be achieved overnight. Thus, driving robots (DRs) and human drivers could cohabit in mixed traffic. Our thesis focuses on the safety issues that may arise due to behavioral differences between DRs and human drivers. DRs are designed to respect formal norms, those of the Highway Code. Human drivers, on the other hand, are opportunistic, not hesitating to break formal norms and adopt new, informal ones. The emergence of new behaviors in traffic can make it more heterogeneous and encourage accidents caused by misinterpretation of these new behaviors. We believe that minimizing this behavioral heterogeneity would reduce the above risks. Therefore, our thesis proposes a DR decision-making model whose behavior is intended to be close to non-hazardous human practices, in order to minimize the heterogeneity between DR and human driver behavior, and with the aim of promoting their acceptance by the latter. To achieve this, we will adopt a multidisciplinary approach, inspired by studies in driving psychology and combining traffic simulation and multi-agent reinforcement learning (MARL). MARL consists of learning a behavior by trial and error guided by a utility function. Thanks to its ability to generalize, especially via neural networks, MARL can be adapted to any environment, including traffic. We will use it to teach our decision model robust behavior in the face of the diversity of traffic situations. To avoid incidents, DR manufacturers could design relatively homogeneous and defensive behaviors rather than opportunistic ones. However, this approach risks making DRs predictable and, therefore, vulnerable to opportunistic behavior by human drivers. The consequences could then be detrimental to both traffic fluidity and safety. Our first contribution aims at reproducing heterogeneous traffic, i.e., traffic in which each vehicle exhibits a unique behavior. We assume that by making the behavior of DRs heterogeneous, their predictability will be reduced and opportunistic human drivers will be less able to anticipate their actions. Therefore, this paradigm considers the behavioral heterogeneity of DRs as a critical feature for the safety and fluidity of mixed traffic. In an experimental phase, we will demonstrate the ability of our model to produce heterogeneous behavior while meeting some of the challenges of MARL. Our second contribution will be the integration of informal norms into the decision processes of our DR decision model. We will focus exclusively on integrating the notion of social orientation value, which describes individuals' social behaviors such as altruism or selfishness. Starting with a highway merging scenario, we will evaluate the impact of social orientation on the fluidity and safety of merging vehicles. We will show that altruism can improve safety, but that its actual impact is highly dependent on traffic density.
Zimmer, Matthieu. "Apprentissage par renforcement développemental". Thesis, Université de Lorraine, 2018. http://www.theses.fr/2018LORR0008/document.
Reinforcement learning allows an agent to learn a behavior that has never been previously defined by humans. The agent discovers the environment and the different consequences of its actions through its interaction: it learns from its own experience, without having pre-established knowledge of the goals or effects of its actions. This thesis tackles how deep learning can help reinforcement learning to handle continuous spaces and environments with many degrees of freedom in order to solve problems closer to reality. Indeed, neural networks offer good scalability and representational power. They make it possible to approximate functions on continuous spaces and allow a developmental approach, because they require little a priori knowledge of the domain. We seek to reduce the amount of interaction the agent needs to achieve acceptable behavior. To do so, we proposed the Neural Fitted Actor-Critic framework, which defines several data-efficient actor-critic algorithms. We examine how the agent can fully exploit the transitions generated by previous behaviors by integrating off-policy data into the proposed framework. Finally, we study how the agent can learn faster by taking advantage of the development of its body, in particular, by proceeding with a gradual increase in the dimensionality of its sensorimotor space.
Zimmer, Matthieu. "Apprentissage par renforcement développemental". Electronic Thesis or Diss., Université de Lorraine, 2018. http://www.theses.fr/2018LORR0008.
Kozlova, Olga. "Apprentissage par renforcement hiérarchique et factorisé". Phd thesis, Université Pierre et Marie Curie - Paris VI, 2010. http://tel.archives-ouvertes.fr/tel-00632968.
Filippi, Sarah. "Stratégies optimistes en apprentissage par renforcement". Phd thesis, Ecole nationale supérieure des telecommunications - ENST, 2010. http://tel.archives-ouvertes.fr/tel-00551401.
Théro, Héloïse. "Contrôle, agentivité et apprentissage par renforcement". Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEE028/document.
Sense of agency, or subjective control, can be defined as the feeling that we control our actions and, through them, effects in the outside world. This cluster of experiences depends on the ability to learn action-outcome contingencies, and a classical algorithm for modeling this learning originates in the field of human reinforcement learning. In this PhD thesis, we used the cognitive modeling approach to investigate further the interaction between perceived control and reinforcement learning. First, we saw that participants undergoing a reinforcement-learning task experienced higher agency; this influence of reinforcement learning on agency comes as no surprise, because reinforcement learning relies on linking a voluntary action and its outcome. But our results also suggest that agency influences reinforcement learning in two ways. We found that people learn action-outcome contingencies based on a default assumption: their actions make a difference to the world. Finally, we also found that the mere fact of choosing freely shapes the learning processes following that decision. Our general conclusion is that agency and reinforcement learning, two fundamental fields of human psychology, are deeply intertwined. Contrary to machines, humans do care about being in control, or about making the right choice, and this results in integrating information in a one-sided way.
Munos, Rémi. "Apprentissage par renforcement, étude du cas continu". Paris, EHESS, 1997. http://www.theses.fr/1997EHESA021.
Lesner, Boris. "Planification et apprentissage par renforcement avec modèles d'actions compacts". Caen, 2011. http://www.theses.fr/2011CAEN2074.
We study Markov Decision Processes represented with Probabilistic STRIPS action models. The first part of our work is about solving those processes in a compact way. To that end, we propose two algorithms. The first one, based on propositional formula manipulation, yields approximate solutions in tractable propositional fragments such as Horn and 2-CNF. The second algorithm solves problems represented in PPDDL exactly and efficiently, using a new notion of extended value functions. The second part is about learning such action models. We propose different approaches to solve the problem of ambiguous observations occurring while learning. Firstly, a heuristic method based on Linear Programming gives good results in practice, yet without theoretical guarantees. We next describe a learning algorithm in the "Know What It Knows" framework. This approach gives strong theoretical guarantees on the quality of the learned models as well as on the sample complexity. These two approaches are then put into a Reinforcement Learning setting to allow an empirical evaluation of their respective performances.
Maillard, Odalric-Ambrym. "APPRENTISSAGE SÉQUENTIEL : Bandits, Statistique et Renforcement". Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2011. http://tel.archives-ouvertes.fr/tel-00845410.
Klein, Édouard. "Contributions à l'apprentissage par renforcement inverse". Thesis, Université de Lorraine, 2013. http://www.theses.fr/2013LORR0185/document.
This thesis, "Contributions à l'apprentissage par renforcement inverse", brings three major contributions to the community. The first one is a method for estimating the feature expectation, a quantity involved in most state-of-the-art approaches, which were thus extended to a batch off-policy setting. The second major contribution is an Inverse Reinforcement Learning algorithm, structured classification for inverse reinforcement learning (SCIRL), which relaxes a standard constraint in the field, the repeated solving of a Markov Decision Process, by introducing the temporal structure (using the feature expectation) of this process into a structured margin classification algorithm. The associated theoretical guarantee and the good empirical performance it exhibited allowed it to be presented at a good international conference: NIPS. Finally, the third contribution is cascaded supervised learning for inverse reinforcement learning (CSI), a method that consists of learning the expert's behavior via a supervised learning approach, and then introducing the temporal structure of the MDP via a regression involving the score function of the classifier. This method presents the same type of theoretical guarantee as SCIRL, but uses standard components for classification and regression, which makes its use simpler. This work will be presented at another good international conference: ECML.
Gelly, Sylvain. "Une contribution à l'apprentissage par renforcement : application au Computer Go". Paris 11, 2007. http://www.theses.fr/2007PA112227.
Reinforcement Learning (RL) is at the interface of control theory, supervised and unsupervised learning, optimization and cognitive sciences. While RL addresses many objectives with major economic impact, it raises deep theoretical and practical difficulties. This thesis brings some contributions to RL, mainly along three axes. The first axis corresponds to environment modeling, i.e., learning the transition function between two time steps. Factored approaches provide an efficient framework for learning and using this model. Bayesian Networks are a tool to represent such a model, and this work brings new learning criteria, both for parametric learning (conditional probabilities) and non-parametric learning (structure). The second axis is a study of RL in continuous state and action spaces, using the dynamic programming algorithm. This analysis tackles three fundamental steps: optimization (action choice from the value function), supervised learning (regression) of the value function and choice of the learning examples (active learning). The third axis tackles the applicative domain of the game of Go, as a high-dimensional discrete control problem, one of the greatest challenges in Machine Learning. The presented algorithms, with their improvements, made the resulting program, MoGo, win numerous international competitions, becoming for example the first Go program playing at an amateur dan level on 9x9.
Degris, Thomas. "Apprentissage par renforcement dans les processus de décision Markoviens factorisés". Paris 6, 2007. http://www.theses.fr/2007PA066594.
Zaidenberg, Sofia. "Apprentissage par renforcement de modèles de contexte pour l'informatique ambiante". Grenoble INPG, 2009. http://www.theses.fr/2009INPG0088.
This thesis studies the automatic acquisition by machine learning of a context model for a user in a ubiquitous environment. In such an environment, devices can communicate and cooperate in order to create a consistent computerized space. Some devices possess perceptual capabilities. The environment uses them to detect the user's situation, i.e., the user's context. Other devices are able to execute actions. Our problem consists in determining the optimal associations, for a given user, between situations and actions. Machine learning seems to be a sound approach since it results in a customized environment without requiring an explicit specification from the user. Lifelong learning lets the environment adapt itself continuously to changes in the world and in user preferences. Reinforcement learning can be a solution to this problem, as long as it is adapted to some particular constraints due to our application setting.
Klein, Édouard. "Contributions à l'apprentissage par renforcement inverse". Electronic Thesis or Diss., Université de Lorraine, 2013. http://www.theses.fr/2013LORR0185.
Darwiche, Domingues Omar. "Exploration en apprentissage par renforcement : au-delà des espaces d'états finis". Thesis, Université de Lille (2022-....), 2022. http://www.theses.fr/2022ULILB002.
Reinforcement learning (RL) is a powerful machine learning framework to design algorithms that learn to make decisions and to interact with the world. Algorithms for RL can be classified as offline or online. In the offline case, the algorithm is given a fixed dataset, based on which it needs to compute a good decision-making strategy. In the online case, an agent needs to efficiently collect data by itself, by interacting with the environment: that is the problem of exploration in reinforcement learning. This thesis presents theoretical and practical contributions to online RL. We investigate the worst-case performance of online RL algorithms in finite environments, that is, those that can be modeled with a finite number of states, and where the set of actions that can be taken by an agent is also finite. Such performance degrades as the number of states increases, whereas in real-world applications the state set can be arbitrarily large or continuous. To tackle this issue, we propose kernel-based algorithms for exploration that can be implemented for general state spaces, and for which we provide theoretical results under weak assumptions on the environment. Those algorithms rely on a kernel function that measures the similarity between different states, which can be defined on arbitrary state spaces, including discrete sets and Euclidean spaces, for instance. Additionally, we show that our kernel-based algorithms are able to handle non-stationary environments by using time-dependent kernel functions, and we propose and analyze approximate versions of our methods to reduce their computational complexity. Finally, we introduce a scalable approximation of our kernel-based methods that can be implemented with deep reinforcement learning and can integrate different representation learning methods to define a kernel function.
Garcia, Pascal. "Exploration guidée et induction de comportements génériques en apprentissage par renforcement". Rennes, INSA, 2004. http://www.theses.fr/2004ISAR0010.
Reinforcement learning is a general framework in which an autonomous agent learns which actions to choose in particular situations (states) in order to optimize some reinforcements (rewards or punishments) in the long run. Even if a lot of tasks can be formulated in this framework, there are two problems with the standard reinforcement learning algorithms: 1. Due to the learning time of those algorithms, in practice, tasks with a moderately large state space are not solvable in reasonable time. 2. Given several problems to solve in some domains, a standard reinforcement learning agent learns an optimal policy from scratch for each problem. It would be far more useful to have systems that can solve several problems over time, using the knowledge obtained from previous problem instances to guide learning on new problems. We propose some methods to address those issues: 1. We define two formalisms to introduce a priori knowledge to guide the agent on a given task. The agent has an initial behaviour which can be modified during the learning process. 2. We define a method to induce generic behaviours, based on the previously solved tasks and on basic building blocks. Those behaviours will be added to the primitive actions of a new related task to help the agent solve it.
Vasileiadis, Athanasios. "Apprentissage par renforcement à champ moyen : une perspective de contrôle optimal". Electronic Thesis or Diss., Université Côte d'Azur, 2024. http://www.theses.fr/2024COAZ5005.
The goal of the PhD will be to implement a similar mean field approach to handle MARL (Multi-Agent Reinforcement Learning). This idea was investigated, at least for individual agents, in several recent papers. In all of them, not only does the mean field approach allow for a significant decrease of complexity, but it also provides distributed (or decentralized) solutions, which are very convenient to use in practice. Numerical implementation using either on- or off-policy learning is discussed in the literature. The first part of the thesis will consist in revisiting the former works from a mathematical point of view. In particular, this will call for a careful stability analysis addressing both the passage from a finite to an infinite system of agents and the use of approximated (instead of exact) policies. We may expect monotonicity to play a key role in the overall analysis; another, but more prospective, direction is to discuss the influence of a stochastic environment on the behavior of the algorithms themselves. Another part of the thesis will be dedicated to the cooperative case, the analysis of which will rely upon mean field control theory. Potential structures may make it possible to connect the individual and cooperative cases. The connection between the two may indeed play an important role for incentive design or, equivalently, for mimicking a cooperative system with individual agents. In this regard, the connection with distributional reinforcement learning may be an interesting question as well.
Zhang, Ping. "Etudes de différents aspects de l'apprentissage par renforcement". Compiègne, 1997. http://www.theses.fr/1997COMP0993.
This dissertation deals with research on three important aspects of reinforcement learning: temporal differences (TD(λ)), Q-learning and the exploration/exploitation dilemma. We propose algorithms and techniques based on new concepts that allow a better understanding of, and ultimately a solution to, the problem of reinforcement learning. The first part of this work deals with a method that optimizes the choice of the parameter of TD(λ) and then solves a real problem, the evaluation of a person's ability, using different methods based on the principle of TD(λ). In the second part, we introduce the notion of "confidence" and propose a new version of Q-learning, SCIQ, which generalizes and improves Q-learning. We point out that this algorithm can overcome the over-estimation problem of Q-values associated with non-optimal actions. Contrary to other versions of Q-learning, our algorithm is adaptive thanks to its evolving capacity to modify the Q-values. It is also robust and faster than Q-learning. In the last part, in order to solve the exploration/exploitation dilemma, the notion of "entropy" is introduced as the measure of information on the system state. We present two methods for estimating an approximation of the entropy and two types of techniques for exploration by means of these estimations. We note that, by using the entropy approximation instead of the entropy itself, we can define an efficient algorithm without counters or extra structure.
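For readers unfamiliar with the baseline that the abstract above refers to, the following is a minimal sketch of the textbook tabular Q-learning update. It is not the SCIQ algorithm from the thesis, whose details are not given here; the function name, the state and action labels, and the values of `alpha` and `gamma` are illustrative assumptions. The `max` over next-state values is the step usually associated with over-estimation of Q-values.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Apply one textbook tabular Q-learning update in place.

    q maps (state, action) pairs to estimated values; alpha is the learning
    rate and gamma the discount factor (both chosen here for illustration).
    """
    # Greedy bootstrap: value of the best action in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference error between the bootstrapped target and the estimate.
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


# Hypothetical two-action problem; all Q-values start at 0.0.
q = defaultdict(float)
actions = ["left", "right"]
new_value = q_learning_update(q, state=0, action="right", reward=1.0,
                              next_state=1, actions=actions)
```

With all values initialised to zero, the single update above moves Q(0, "right") a fraction `alpha` of the way towards the observed reward.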
Léon, Aurélia. "Apprentissage séquentiel budgétisé pour la classification extrême et la découverte de hiérarchie en apprentissage par renforcement". Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS226.
This thesis deals with the notion of budget to study problems of complexity (it can be computational complexity, a complex task for an agent, or complexity due to a small amount of data). Indeed, the main goal of current techniques in machine learning is usually to obtain the best accuracy, without worrying about the cost of the task. The concept of budget makes it possible to take this parameter into account while maintaining good performance. We first focus on classification problems with a large number of classes: the complexity in those algorithms can be reduced thanks to the use of decision trees (here learned through budgeted reinforcement learning techniques) or the association of each class with a (binary) code. We then deal with reinforcement learning problems and the discovery of a hierarchy that breaks down a (complex) task into simpler tasks to facilitate learning and generalization. Here, this discovery is done by reducing the cognitive effort of the agent (considered in this work as equivalent to the use of an additional observation). Finally, we address problems of understanding and generating instructions in natural language, where data are available in small quantities: we test for this purpose the simultaneous use of an agent that understands and of an agent that generates the instructions.
Daoudi, Paul. "Apprentissage par renforcement sur des systèmes réels : exploitation de différents contextes industriels". Electronic Thesis or Diss., Université Grenoble Alpes, 2024. http://www.theses.fr/2024GRALT047.
Many industrial infrastructures require complex control that plays a crucial role. Traditionally, this problem is addressed by automatic control and optimal control methods. These methods require a model of the system dynamics, which can be inaccurate in complex systems. Machine learning offers an alternative solution to this problem, where the model of the system under consideration is obtained by extrapolation from input/output data while being agnostic to the underlying physics of the system. Reinforcement learning is the machine learning answer to these decision problems in an uncertain environment. According to this paradigm, the objective is to design a policy, i.e. a mapping from system states to controls that maximizes a certain success criterion taking into account the future states of the system. This paradigm has been very successful in the last few years on different tasks such as Atari video games, the game of Go and Dota 2. However, its application to real systems faces many difficulties: safety constraints, learning with limited data, offline learning, partially observable systems, etc. The objective of this thesis will be to take advantage of the power of reinforcement learning for the control of real systems by focusing on the first two points mentioned above: learning a policy while taking into account safety constraints in a framework with little data.
Mesnard, Thomas. "Attribution de crédit pour l'apprentissage par renforcement dans des réseaux profonds". Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAX155.
Deep reinforcement learning has been at the heart of many revolutionary results in artificial intelligence in the last few years. These agents are based on credit assignment techniques that try to establish correlations between past actions and future events and use these correlations to become effective in a given task. This problem is at the heart of the current limitations of deep reinforcement learning, and the credit assignment techniques used today remain relatively rudimentary and incapable of inductive reasoning. This thesis therefore focuses on the study and formulation of new credit assignment methods for deep reinforcement learning. Such techniques could speed up learning, improve generalization when agents are trained on multiple tasks, and perhaps even allow the emergence of abstraction and reasoning.
Martinez, Coralie. "Classification précoce de séquences temporelles par de l'apprentissage par renforcement profond". Thesis, Université Grenoble Alpes (ComUE), 2019. http://www.theses.fr/2019GREAT123.
Early classification (EC) of time series is a recent research topic in the field of sequential data analysis. It consists in assigning a label to data that is collected sequentially, with new data points arriving over time, and the label has to be predicted using as few data points as possible in the sequence. The EC problem is of paramount importance for supporting decision makers in many real-world applications, ranging from process control to fraud detection. It is particularly interesting for applications concerned with the costs induced by the acquisition of data points, or for applications which seek rapid label prediction in order to take early actions. This is for example the case in the field of health, where it is necessary to provide a medical diagnosis as soon as possible from the sequence of medical observations collected over time. Another example is predictive maintenance, with the objective of anticipating the breakdown of a machine from its sensor signals. In this doctoral work, we developed a new approach for this problem, based on the formulation of a sequential decision making problem, that is, the EC model has to decide between classifying an incomplete sequence or delaying the prediction to collect additional data points. Specifically, we described this problem as a Partially Observable Markov Decision Process, denoted EC-POMDP. The approach consists in training an EC agent with Deep Reinforcement Learning (DRL) in an environment characterized by the EC-POMDP. The main motivation for this approach was to offer an end-to-end model for EC which is able to simultaneously learn optimal patterns in the sequences for classification and optimal strategic decisions for the time of prediction. Also, the method makes it possible to set the importance of time against accuracy of the classification in the definition of rewards, according to the application and its willingness to make this compromise. In order to solve the EC-POMDP and model the policy of the EC agent, we applied an existing DRL algorithm, the Double Deep Q-Network algorithm, whose general principle is to update the policy of the agent during training episodes, using a replay memory of past experiences. We showed that the application of the original algorithm to the EC problem led to imbalanced memory issues which can weaken the training of the agent. Consequently, to cope with those issues and offer a more robust training of the agent, we adapted the algorithm to the EC-POMDP specificities and we introduced strategies of memory management and episode management. In experiments, we showed that these contributions improved the performance of the agent over the original algorithm, and that we were able to train an EC agent which compromised between speed and accuracy, on each sequence individually. We were also able to train EC agents on public datasets for which we have no expertise, showing that the method is applicable to various domains. Finally, we proposed some strategies to interpret the decisions of the agent and to validate or reject them. In experiments, we showed how these solutions can help gain insight into the choice of action made by the agent.
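As background for the abstract above, here is a minimal sketch of the two generic ingredients it names: a replay memory of past experiences and the standard Double DQN target, in which the online network selects the next action and the target network evaluates it. This illustrates the general algorithm only, not the thesis's adapted memory and episode management strategies; `ReplayMemory`, `double_dqn_target` and all numeric values are hypothetical, and plain lists of Q-values stand in for neural network outputs.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of past transitions (hypothetical helper)."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement, capped at the buffer size.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Standard Double DQN target for a single transition.

    next_q_online and next_q_target are the Q-values of the next state under
    the online and target networks (plain lists here, for illustration).
    """
    if done:
        return reward  # no bootstrapping from a terminal state
    # The online network chooses the action; the target network scores it.
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


memory = ReplayMemory(capacity=2)
for step in range(3):
    memory.push(("state", step))  # the first transition is evicted
batch = memory.sample(5)

target = double_dqn_target(1.0, [0.2, 0.8, 0.5], [0.3, 0.4, 0.9],
                           done=False, gamma=0.5)
```

Decoupling action selection from action evaluation is what distinguishes this target from the single-network DQN target, which takes the maximum of the target network's own values.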
Laurent, Guillaume. "Synthèse de comportements par apprentissages par renforcement parallèles : application à la commande d'un micromanipulateur plan". Phd thesis, Université de Franche-Comté, 2002. http://tel.archives-ouvertes.fr/tel-00008761.
Bouzid, Salah Eddine. "Optimisation multicritères des performances de réseau d’objets communicants par méta-heuristiques hybrides et apprentissage par renforcement". Thesis, Le Mans, 2020. http://cyberdoc-int.univ-lemans.fr/Theses/2020/2020LEMA1026.pdf.
The deployment of Communicating Things Networks (CTNs), with continuously increasing densities, needs to be optimal in terms of quality of service, energy consumption and lifetime. Determining the optimal placement of the nodes of these networks, relative to the different quality criteria, is an NP-hard problem. Faced with this NP-hardness, especially for indoor environments, existing approaches focus on the optimization of a single objective while neglecting the other criteria, or adopt an expensive manual solution. New approaches to solve this problem are required. Accordingly, in this thesis, we propose a new approach which automatically generates the deployment that guarantees optimality in terms of performance and robustness with respect to possible topological failures and instabilities. The proposed approach is based, on the one hand, on the modeling of the deployment problem as a constrained multi-objective optimization problem and its resolution using a hybrid algorithm combining genetic multi-objective optimization with weighted sum optimization and, on the other hand, on the integration of reinforcement learning to guarantee the optimization of energy consumption and the extension of the network lifetime. To apply this approach, two tools are developed. The first, called MOONGA (Multi-Objective Optimization of wireless Network approach based on Genetic Algorithm), automatically generates the placement of nodes while optimizing the metrics that define the QoS of the CTN: connectivity, m-connectivity, coverage, k-coverage, coverage redundancy and cost. The MOONGA tool considers constraints related to the architecture of the deployment space, the network topology, the specifics of the application and the preferences of the network designer.
The second optimization tool is named R2LTO (Reinforcement Learning for Life-Time Optimization). It is a new routing protocol for CTNs, based on distributed reinforcement learning, that determines the optimal routing path in order to guarantee energy efficiency and to extend the network lifetime while maintaining the required QoS.
Buffet, Olivier. "Une double approche modulaire de l'apprentissage par renforcement pour des agents intelligents adaptatifs". Phd thesis, Université Henri Poincaré - Nancy I, 2003. http://tel.archives-ouvertes.fr/tel-00509349.
Dutech, Alain. "Apprentissage par Renforcement : Au delà des Processus Décisionnels de Markov (Vers la cognition incarnée)". Habilitation à diriger des recherches, Université Nancy II, 2010. http://tel.archives-ouvertes.fr/tel-00549108.
Coulom, Rémi. "Apprentissage par renforcement utilisant des réseaux de neurones avec des applications au contrôle moteur". Phd thesis, Grenoble INPG, 2002. http://tel.archives-ouvertes.fr/tel-00004386.
Jneid, Khoder. "Apprentissage par Renforcement Profond pour l'Optimisation du Contrôle et de la Gestion des Bâtiment". Electronic Thesis or Diss., Université Grenoble Alpes, 2023. http://www.theses.fr/2023GRALM062.
Heating, ventilation, and air-conditioning (HVAC) systems account for high energy consumption in buildings. Conventional approaches used to control HVAC systems rely on rule-based control (RBC), which consists of predefined rules set by an expert. Model-predictive control (MPC), widely explored in the literature, is not adopted in industry since it is a model-based approach that requires building models of the building at a first stage, to be used in the optimization phase, and is thus time-consuming and expensive. During the PhD, we investigate reinforcement learning (RL) to optimize the energy consumption of HVAC systems while maintaining good thermal comfort and good air quality. Specifically, we focus on model-free RL algorithms that learn through interaction with the environment (the building including the HVAC) and thus do not require accurate models of the environment. In addition, online approaches are considered. The main challenge of online model-free RL is the number of days that are necessary for the algorithm to acquire enough data and action feedback to start acting properly. Hence, the research subject of the PhD is boosting model-free RL algorithms so that they converge faster, to make them applicable in real-world applications such as HVAC control. Two approaches have been explored during the PhD to achieve our objective: the first approach combines RBC with value-based RL, and the second approach combines fuzzy rules with policy-based RL. Both approaches aim to boost the convergence of RL by guiding the RL policy, but they are completely different. The first approach exploits RBC rules during training, while in the second approach the fuzzy rules are injected directly into the policy. Tests are performed on a simulated office during winter. This simulated office is a replica of a real office at Grenoble INP.
Gueguen, Maëlle. "Dynamique intracérébrale de l'apprentissage par renforcement chez l'humain". Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAS042/document.
We make decisions every waking day of our life. Facing our options, we tend to pick the one most likely to yield our expected outcome. Taking into account our past experiences and their outcomes is necessary to identify the best option. This cognitive process is called reinforcement learning. To date, the underlying neural mechanisms are debated. Despite a consensus on the role of dopaminergic neurons in reward processing, several hypotheses on the neural bases of reinforcement learning coexist: either two distinct opposite systems covering cortical and subcortical areas, or a segregation of neurons within brain regions to process reward-based and punishment-avoidance learning. This PhD work aimed to identify the brain dynamics of human reinforcement learning. To unravel the neural mechanisms involved, we used intracerebral recordings in refractory epileptic patients during a probabilistic learning task. In the first study, we used a computational model to tackle the brain dynamics of reinforcement signal encoding, especially the encoding of reward and punishment prediction errors. Local field potentials exhibited the central role of high frequency gamma activity (50-150 Hz) in these encodings. We report a role of the ventromedial prefrontal cortex in reward prediction error encoding, while the anterior insula and the dorsolateral prefrontal cortex encoded punishment prediction errors. In addition, the magnitude of the neural response in the insula predicted behavioral learning and trial-to-trial behavioral adaptations. These results are consistent with the existence of two distinct opposite cortical systems processing reward and punishments during reinforcement learning. In a second study, we recorded the neural activity of the anterior and dorsomedial nuclei of the thalamus during the same cognitive task.
Local field potential recordings highlighted the role of low frequency theta activity in punishment processing, supporting an implication of these nuclei during punishment-avoidance learning. In a third behavioral study, we investigated the influence of risk on reinforcement learning. We observed risk aversion during punishment avoidance, affecting performance, as well as risk-seeking behavior during reward seeking, revealed by an increased reaction time towards appetitive risky choices. Taken together, these results suggest we are risk-seeking when we have something to gain and risk-averse when we have something to lose, in contrast to the prediction of prospect theory. Improving our common knowledge of the brain dynamics of human reinforcement learning could improve the understanding of cognitive deficits of neurological patients, but also of the decision biases all human beings can exhibit.
Robledo, Relaño Francisco. "Algorithmes d'apprentissage par renforcement avancé pour les problèmes bandits multi-arches". Electronic Thesis or Diss., Pau, 2024. http://www.theses.fr/2024PAUU3021.
This thesis presents advances in Reinforcement Learning (RL) algorithms for resource and policy management in Restless Multi-Armed Bandit (RMAB) problems. We develop algorithms through two approaches in this area. First, for problems with discrete and binary actions, which is the original case of RMAB, we have developed QWI and QWINN. These algorithms compute Whittle indices, a heuristic that decouples the different RMAB processes, thereby simplifying the policy determination. Second, for problems with continuous actions, which generalize to Weakly Coupled Markov Decision Processes (MDPs), we propose LPCA. This algorithm employs a Lagrangian relaxation to decouple the different MDPs. The QWI and QWINN algorithms are introduced as two-timescale methods for computing Whittle indices for RMAB problems. In our results, we show mathematically that the estimates of Whittle indices of QWI converge to the theoretical values. QWINN, an extension of QWI, incorporates neural networks to compute the Q-values used to compute the Whittle indices. Through our results, we present the local convergence properties of the neural network used in QWINN. Our results show how QWINN outperforms QWI in terms of convergence rates and scalability. In the continuous action case, the LPCA algorithm applies a Lagrangian relaxation to decouple the linked decision processes, allowing for efficient computation of optimal policies under resource constraints. We propose two different optimization methods, differential evolution and greedy optimization strategies, to efficiently handle resource allocation.
In our results, LPCA shows superior performance over other contemporary RL approaches. Empirical results from different simulated environments validate the effectiveness of the proposed algorithms. These algorithms represent a significant contribution to the field of resource allocation in RL and pave the way for future research into more generalized and scalable reinforcement learning frameworks.
Godbout, Mathieu. "Approches par bandit pour la génération automatique de résumés de textes". Master's thesis, Université Laval, 2021. http://hdl.handle.net/20.500.11794/69488.
This thesis discusses the use of bandit methods to solve the problem of training extractive summary generation models. Extractive models, which build summaries by selecting sentences from an original document, are difficult to train because the target summary of a document is usually not built in an extractive way. It is for this purpose that we propose to see the production of extractive summaries as different bandit problems, for which there exist algorithms that can be leveraged for training summarization models. In this work, BanditSum is first presented, an approach drawn from the literature that sees the generation of the summaries of a set of documents as a contextual bandit problem. Next, we introduce CombiSum, a new algorithm which formulates the generation of the summary of a single document as a combinatorial bandit. By exploiting the combinatorial formulation, CombiSum manages to incorporate the notion of the extractive potential of each sentence of a document in its training. Finally, we propose LinCombiSum, the linear variant of CombiSum, which exploits the similarities between sentences in a document and uses the linear combinatorial bandit formulation instead.
Montagne, Fabien. "Une architecture logicielle pour aider un agent apprenant par renforcement". Littoral, 2008. http://www.theses.fr/2008DUNK0198.
This thesis deals with reinforcement learning. One of the main advantages of this form of learning is that it does not require explicit knowledge of the expected behavior. During its learning, the agent perceives states, receives a set of rewards and selects actions to carry out. The agent adjusts its behavior by optimizing the amount of reward. Nevertheless, the computing time required quickly becomes prohibitive. This is mainly due to the agent’s need to explore its environment. The approach considered here consists in using external knowledge to “guide” the agent during its exploration. This knowledge constitutes a form of help which can, for example, be expressed by trajectories that make up a knowledge database. These trajectories are used to limit the exploration of the environment while allowing the agent to build a good quality behavior. Helping an agent requires neither knowing the actions chosen in all states nor having the same perceptions as the agent. The critic-critic architecture was devised to address this problem. It combines a standard reinforcement learning algorithm with help given through potentials. The potentials associate a value with each transition of the trajectories. The agent's estimate of the value function and the potential of the help are combined during training. Adjusting this combination dynamically makes it possible to call the assistance into question while quickly guaranteeing an optimal or almost optimal policy. It is formally proved that the proposed algorithm converges under certain conditions. Moreover, empirical work shows that the agent is able to benefit from help even without these conditions.
Geist, Matthieu. "Optimisation des chaînes de production dans l'industrie sidérurgique : une approche statistique de l'apprentissage par renforcement". Phd thesis, Université de Metz, 2009. http://tel.archives-ouvertes.fr/tel-00441557.
Matignon, Laëtitia. "Synthèse d'agents adaptatifs et coopératifs par apprentissage par renforcement : application à la commande d'un système distribué de micromanipulation". Besançon, 2008. http://www.theses.fr/2008BESA2041.
Numerous applications can be formulated in terms of distributed systems, whether out of necessity, given a physical distribution of entities (networks, mobile robotics), or as a means of handling the complexity of solving a problem globally. The objective is to use reinforcement learning methods and multi-agent systems together. Thus, cooperative and autonomous agents can learn to solve complex problems in a decentralized way by adapting to them so as to achieve a joint objective. Reinforcement learning methods do not need any a priori knowledge about the dynamics of the system, which can be stochastic and nonlinear. In order to improve the learning speed, knowledge incorporation methods are studied within the context of goal-directed tasks. A generic goal bias function is also proposed. We then turn to independent learners in team Markov games. In this framework, agents learning by reinforcement must overcome several difficulties, such as coordination or the impact of exploration. The study of these issues first makes it possible to synthesize the characteristics of existing decentralized reinforcement learning methods. Then, given the difficulties encountered by this approach, two algorithms are proposed. The first one, called hysteretic Q-learning, is based on agents with an "adjustable optimistic tendency". The second one is the Swing between Optimistic or Neutral (SOoN) algorithm, in which independent agents can adapt automatically to the environment stochasticity. Experiments on various team Markov games notably show that SOoN overcomes the main factors of non-coordination and is robust to the exploration of the other agents. An extension of this work to the decentralized control of a distributed micromanipulation system (smart surface) in a partially observable case is finally proposed.
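To make the "adjustable optimistic tendency" of hysteretic Q-learning more concrete, here is a minimal sketch of the update rule as it is commonly described in the literature: an independent learner uses a larger learning rate when the temporal-difference error is non-negative and a smaller one when it is negative. The abstract itself does not give the formula, so the function name, the dictionary-based Q-table, and the parameter values below are assumptions made for illustration; the thesis should be consulted for the exact formulation.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def hysteretic_update(
    q: dict,
    state: Hashable,
    action: Hashable,
    reward: float,
    next_state: Hashable,
    actions: Sequence[Hashable],
    alpha: float = 0.1,   # rate for non-negative TD errors (hypothetical value)
    beta: float = 0.01,   # smaller rate for negative TD errors (hypothetical value)
    gamma: float = 0.9,   # discount factor (hypothetical value)
) -> float:
    """Apply one hysteretic Q-learning update in place and return the new value."""
    best_next = max(q[(next_state, a)] for a in actions)
    delta = reward + gamma * best_next - q[(state, action)]
    # Optimistic bias: increases are learned quickly, decreases slowly, so
    # penalties caused by teammates' exploration are partly discounted.
    rate = alpha if delta >= 0 else beta
    q[(state, action)] += rate * delta
    return q[(state, action)]


# Each independent agent keeps its own table over its own actions only.
q_table = defaultdict(float)
hysteretic_update(q_table, "s0", "left", 1.0, "s1", ["left", "right"])
```

Setting `beta` equal to `alpha` recovers ordinary Q-learning for an independent learner, while `beta = 0` makes the agent fully optimistic.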
Geist, Matthieu. "Optimisation des chaînes de production dans l'industrie sidérurgique : une approche statistique de l'apprentissage par renforcement". Electronic Thesis or Diss., Metz, 2009. http://www.theses.fr/2009METZ023S.
Reinforcement learning is the response of machine learning to the problem of optimal control. In this paradigm, an agent learns to control an environment by interacting with it. It regularly receives a numeric reward (or reinforcement signal), which is local information about the quality of the control. The agent's objective is to maximize a cumulative function of these rewards, generally modelled as a so-called value function. A policy specifies the action to be chosen in a particular configuration of the environment to be controlled, and thus the value function quantifies the quality of this policy. This paradigm is very general, and it accommodates many applications. In this manuscript, we apply it to a gas flow management problem in the iron and steel industry. However, its application can be quite difficult. Notably, if the environment description is too large, an exact representation of the value function (or of the policy) is not possible. This problem is known as generalization (or value function approximation): on the one hand, one has to design algorithms with low computational complexity, and on the other hand, one has to infer the behaviour the agent should have in an unknown configuration of the environment when close configurations have been experienced. This is the main problem we address in this manuscript, by introducing a family of algorithms inspired by Kalman filtering.
Zennir, Youcef. "Apprentissage par renforcement et systèmes distribués : application à l'apprentissage de la marche d'un robot hexapode". Lyon, INSA, 2004. http://theses.insa-lyon.fr/publication/2004ISAL0034/these.pdf.
The goal of this thesis is to study and to develop reinforcement learning techniques so that a hexapod robot can learn to walk. The main assumption on which this work is based is that effective gaits can be obtained when the control of the movements is distributed over each leg rather than centralised in a single decision centre. A distributed approach to the Q-learning technique is adopted, in which the agents contributing to the same global objective perform their own learning process, taking the other agents into account or not. The centralised and distributed approaches are compared. Different simulations and tests are carried out so as to generate stable periodic gaits. The influence of the learning parameters on the quality of the gaits is studied. The walk appears as an emerging phenomenon from the individual movements of the legs. Problems of fault tolerance and lack of state information are investigated. Finally, it is verified that with the developed algorithm the simulated robot learns how to reach a desired trajectory while controlling its posture.
Leurent, Edouard. "Apprentissage par renforcement sûr et efficace pour la prise de décision comportementale en conduite autonome". Thesis, Lille 1, 2020. http://www.theses.fr/2020LIL1I049.
In this Ph.D. thesis, we study how autonomous vehicles can learn to act safely and avoid accidents, despite sharing the road with human drivers whose behaviors are uncertain. To explicitly account for this uncertainty, informed by online observations of the environment, we construct a high-confidence region over the system dynamics, which we propagate through time to bound the possible trajectories of nearby traffic. To ensure safety under such uncertainty, we resort to robust decision-making and act by always considering the worst-case outcomes. This approach guarantees that the performance reached during planning is at least achieved for the true system, and we show by end-to-end analysis that the overall sub-optimality is bounded. Tractability is preserved at all stages, by leveraging sample-efficient tree-based planning algorithms. Another contribution is motivated by the observation that this pessimistic approach tends to produce overly conservative behaviors: imagine you wish to overtake a vehicle, what certainty do you have that they will not change lane at the very last moment, causing an accident? Such reasoning makes it difficult for robots to drive amidst other drivers, merge into a highway, or cross an intersection, an issue colloquially known as the “freezing robot problem”. Thus, the presence of uncertainty induces a trade-off between two contradictory objectives: safety and efficiency. How to arbitrate this conflict? The question can be temporarily circumvented by reducing uncertainty as much as possible. For instance, we propose an attention-based neural network architecture that better accounts for interactions between traffic participants to improve predictions. But to actively embrace this trade-off, we draw on constrained decision-making to consider both the task completion and safety objectives independently.
Rather than a unique driving policy, we train a whole continuum of behaviors, ranging from conservative to aggressive. This provides the system designer with a slider allowing them to adjust the level of risk assumed by the vehicle in real time.
Zennir, Youcef Bétemps Maurice. "Apprentissage par renforcement et systèmes distribués application à l'apprentissage de la marche d'un robot hexapode /". Villeurbanne : Doc'INSA, 2005. http://docinsa.insa-lyon.fr/these/pont.php?id=zennir.
Rodrigues, Christophe. "Apprentissage incrémental des modèles d'action relationnels". Paris 13, 2013. http://scbd-sto.univ-paris13.fr/secure/edgalilee_th_2013_rodrigues.pdf.
In this thesis, we study machine learning for action. Our work covers both reinforcement learning (RL) and inductive logic programming (ILP). We focus on learning action models. An action model describes the preconditions and effects of possible actions in an environment. It enables anticipating the consequences of the agent’s actions and may also be used by a planner. We specifically work on a relational representation of environments. Such representations allow states and actions to be described by means of objects and the relations between the various objects that compose them. We present the IRALe method, which incrementally learns relational action models. First, we presume that states are fully observable and the consequences of actions are deterministic. We provide a proof of convergence for this method. Then, we develop an active exploration approach which focuses the agent’s experience on actions that are supposedly not covered by the model. Finally, we generalize the approach by introducing a noisy perception of the environment in order to make our learning framework more realistic. We empirically illustrate each approach’s importance on various planning problems. The results obtained show that the number of interactions with the environment that is required is very small compared to the size of the considered state spaces. Moreover, active learning significantly improves these results.
Gabillon, Victor. "Algorithmes budgétisés d'itérations sur les politiques obtenues par classification". Thesis, Lille 1, 2014. http://www.theses.fr/2014LIL10032/document.
This dissertation is motivated by the study of a class of reinforcement learning (RL) algorithms called classification-based policy iteration (CBPI). Contrary to standard RL methods, CBPI methods do not use an explicit representation of the value function. Instead, they use rollouts and estimate the action-value function of the current policy at a collection of states. Using a training set built from these rollout estimates, the greedy policy is learned as the output of a classifier. Thus, the policy generated at each iteration of the algorithm is no longer defined by an (approximate) value function, but instead by a classifier. In this thesis, we propose new algorithms that improve the performance of the existing CBPI methods, especially when they have a fixed budget of interaction with the environment. Our improvements address the following two shortcomings of the existing CBPI algorithms: 1) the rollouts that are used to estimate the action-value functions have to be truncated and their number is limited, and thus we have to deal with a bias-variance tradeoff in the rollout estimates, and 2) the rollouts are allocated uniformly over the states in the rollout set and the available actions, while a smarter allocation strategy could guarantee a more accurate training set for the classifier. We propose CBPI algorithms that address these issues, respectively, by: 1) the use of a value function approximation to improve the accuracy (balancing the bias and variance) of the rollout estimates, and 2) adaptively sampling the rollouts over the state-action pairs.
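The truncated rollout estimate discussed in the abstract above can be sketched as follows. This is only an illustration of the generic Monte Carlo estimate, assuming a discounted return and reward sequences that have already been collected by following the current policy after a given state-action pair; the function names and arguments are hypothetical, and the sketch does not reproduce the thesis's algorithms (in particular, neither the value-function correction nor the adaptive rollout allocation).

```python
from typing import Sequence


def truncated_return(rewards: Sequence[float], gamma: float, horizon: int) -> float:
    """Discounted sum of at most `horizon` rewards from one rollout."""
    return sum(gamma ** t * r for t, r in enumerate(rewards[:horizon]))


def rollout_q_estimate(
    rollouts: Sequence[Sequence[float]], gamma: float, horizon: int
) -> float:
    """Average truncated return over several rollouts of one state-action pair.

    A short horizon adds bias (later rewards are ignored); few rollouts add
    variance. With a fixed interaction budget the two must be traded off.
    """
    returns = [truncated_return(r, gamma, horizon) for r in rollouts]
    return sum(returns) / len(returns)
```

With a budget of B environment steps per state-action pair, one can afford roughly B / horizon rollouts, which is where the bias-variance tradeoff mentioned in the abstract comes from.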
Langlois, Thibault. "Algorithmes d'apprentissage par renforcement pour la commande adaptative : Texte imprimé". Compiègne, 1992. http://www.theses.fr/1992COMPD530.
Tournaire, Thomas. "Model-based reinforcement learning for dynamic resource allocation in cloud environments". Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAS004.
The emergence of new technologies (Internet of Things, smart cities, autonomous vehicles, health, industrial automation, ...) requires efficient resource allocation to satisfy the demand. These new offers are compatible with the new 5G network infrastructure, since it can provide low latency and reliability. However, these new needs require high computational power to fulfill the demand, implying more energy consumption, in particular in cloud infrastructures and more particularly in data centers. Therefore, it is critical to find new solutions that can satisfy these needs while still reducing the power usage of resources in cloud environments. In this thesis we propose and compare new AI solutions (Reinforcement Learning) to orchestrate virtual resources in virtual network environments such that performance is guaranteed and operational costs are minimised. We consider queuing systems as a model for cloud IaaS infrastructures and bring learning methodologies to efficiently allocate the right number of resources for the users. Our objective is to minimise a cost function considering performance costs and operational costs. We go through different types of reinforcement learning algorithms (from model-free to relational model-based) to learn the best policy. Reinforcement learning is concerned with how a software agent ought to take actions in an environment to maximise some cumulative reward. We first develop a queuing model of a cloud system with one physical node hosting several virtual resources. In this first part we assume the agent perfectly knows the model (dynamics of the environment and the cost function), giving it the opportunity to apply dynamic programming methods for optimal policy computation. Since the model is known in this part, we also concentrate on the properties of the optimal policies, which are threshold-based and hysteresis-based rules. This allows us to integrate the structural property of the policies into MDP algorithms.
After providing a concrete cloud model with exponential arrivals, with real intensities and energy data for a cloud provider, we compare in this first approach the efficiency and computation time of MDP algorithms against heuristics built on top of the stationary distributions of the queuing Markov chain. In a second part we consider that the agent does not have access to the model of the environment and concentrate our work on reinforcement learning techniques, especially model-based reinforcement learning. We first develop model-based reinforcement learning methods where the agent can re-use its experience replay to update its value function. We also consider online MDP techniques where the autonomous agent approximates the environment model to perform dynamic programming. This part is evaluated in a larger network environment with two physical nodes in tandem, and we assess the convergence time and accuracy of different reinforcement learning methods, mainly model-based techniques versus state-of-the-art model-free methods (e.g. Q-Learning). The last part focuses on model-based reinforcement learning techniques with relational structure between environment variables. As these tandem networks have structural properties due to their infrastructure shape, we investigate factored and causal approaches built into reinforcement learning methods to integrate this information. We provide the autonomous agent with a relational knowledge of the environment where it can understand how variables are related to each other. The main goal is to accelerate convergence by: first, having a more compact representation with factorisation, where we devise a factored online MDP algorithm that we evaluate and compare with model-free and model-based reinforcement learning algorithms; second, integrating causal and counterfactual reasoning that can tackle environments with partial observations and unobserved confounders.
Jouffe, Lionel. "Apprentissage de systèmes d'inférence floue par des méthodes de renforcement : application à la régulation d'ambiance dans un bâtiment d'élevage porcin". Rennes 1, 1997. http://www.theses.fr/1997REN10071.
Roberty, Adrien. "Ordonnancer le trafic dans des réseaux déterministes grâce à l’apprentissage par renforcement". Electronic Thesis or Diss., Chasseneuil-du-Poitou, Ecole nationale supérieure de mécanique et d'aérotechnique, 2024. http://www.theses.fr/2024ESMA0001.
One of the most disruptive changes brought by Industry 4.0 is the networking of production facilities. Furthermore, the discussions on Industry 5.0 show the need for an integrated industrial ecosystem, combining AI and the digital twin. In this environment, industrial equipment will work seamlessly with human workers, requiring minimal latency and high-speed connectivity for real-time monitoring. In order to meet this requirement, the Time-Sensitive Networking (TSN) set of standards was introduced. However, configuring TSN in a complex industrial network poses new challenges. For example, the TSN standards allow some flexibility and modularity in the data plane; however, the mechanisms defined by these standards depend on many parameters (such as network topology, routing, etc.), which makes the design work difficult. IEEE 802.1Q is one of the main TSN standards that provides several mechanisms to achieve deterministic latency. One of them is called Time-Aware Shaper (TAS). A switch with a TAS function divides the data traffic, through multiple priorities, into multiple queues arranged in a regular schedule. The main way to organize this process is based on exact or heuristic methods. These are good for closed networks (when all streams are identified in advance and the network topology is fixed). However, in an open network (where more streams are added to the network and the network topology is dynamic), scheduling in TSN can lead to NP-hard problems. The goal of this thesis is to propose a solution to perform scheduling in TSN using Deep Reinforcement Learning, with the use of simulations to train and evaluate the configuration agent.
Pamponet, Machado Aydano. "Le transfert adaptatif en apprentissage par renforcement : application à la simulation de schéma de jeux tactiques". Phd thesis, Université Pierre et Marie Curie - Paris VI, 2009. http://tel.archives-ouvertes.fr/tel-00814207.
Gérard, Pierre. "Systèmes de classeurs : étude de l'apprentissage latent". Paris 6, 2002. http://www.theses.fr/2002PA066155.
Fouladi, Karan. "Recommandation multidimensionnelle d’émissions télévisées par apprentissage : Une interface de visualisation intelligente pour la télévision numérique". Paris 6, 2013. http://www.theses.fr/2013PA066040.
Due to the wealth of entertainment content provided by digital mass media, and in particular by Digital Television (satellite, cable, terrestrial or IP), choosing a program has become more and more difficult. Far from having a user-friendly environment, Digital Television (DTV) users face a huge choice of content, assisted only by off-putting interfaces, the classical "Electronic Program Guide" (EPG). This scatters users' attention and discourages them from actively searching for and choosing programs. The central topic of this thesis is the development of a recommendation system interfaced with an interactive map of TV content. To do this, we chose a content-based recommendation system and adapted it to the field of television. This adaptation is carried out in several specific steps. In particular, we worked on processing the metadata associated with television content and on developing an expert system that provides a unique categorization of television programs. We also took the initiative to model the context of use and to integrate it into our model of the television viewing environment. Integrating the context allowed us to obtain a sufficiently fine and stable model of this environment, on which we could implement our recommendation system. The detailed categorization of the metadata associated with television content and the modeling and integration of the context of television use are the main contributions of this thesis. To assess and improve our developments, we equipped a panel of nine homes divided into three specific types of families. This gave us the means to assess the contribution of our work to the ease of use of television in real conditions. Through an implicit approach, we captured the behavior of the families involved in our project with respect to television content. A syntactic-semantic analyzer provided, for each family, a graded measure of interest in the content.
We also developed an interactive mapping interface based on the idea of "Island of memory", so that the interface is consistent with the recommendation system in place. Our content-based recommendation system, assisted by learning (reinforcement learning), produced the best results known to the scientific community in the field.
Carrara, Nicolas. "Reinforcement learning for dialogue systems optimization with user adaptation". Thesis, Lille 1, 2019. http://www.theses.fr/2019LIL1I071/document.
The most powerful artificial intelligence systems are now based on learned statistical models. In order to build efficient models, these systems must collect a huge amount of data on their environment. Personal assistants, smart homes, voice servers and other dialogue applications are no exception. A specificity of those systems is that they are designed to interact with humans and, as a consequence, their training data has to be collected from interactions with these humans. As the number of interactions with a single person is often too small to train a proper model, the usual approach to maximise the amount of data consists in mixing data collected from different users into a single corpus. However, one limitation of this approach is that, by construction, the trained models are only efficient with an "average" human and do not include any sort of adaptation; this lack of adaptation makes the service unusable for some specific groups of people and leads to a restricted customer base and inclusiveness problems. This thesis proposes solutions to construct Dialogue Systems that are robust to this problem by combining Transfer Learning and Reinforcement Learning. It explores two main ideas. The first idea consists in incorporating adaptation in the very first dialogues with a new user. To that end, we use the knowledge gathered with previous users. But how can such systems scale with a growing database of user interactions? The first proposed approach involves clustering Dialogue Systems (each tailored to its respective user) based on their behaviours. We demonstrated, through experiments with handcrafted and real user models, how this method improves the dialogue quality for new and unknown users. The second approach extends the Deep Q-learning algorithm with a continuous transfer process.
The second idea states that, before using a dedicated Dialogue System, the first interactions with a user should be handled carefully by a safe Dialogue System common to all users. The underlying approach is divided into two steps. The first step consists in learning a safe strategy through Reinforcement Learning. To that end, we introduced a budgeted Reinforcement Learning framework for continuous state spaces and the underlying extensions of classic Reinforcement Learning algorithms. In particular, the safe version of the Fitted-Q algorithm has been validated, in terms of safety and efficiency, on a dialogue system task and an autonomous driving problem. The second step consists in using those safe strategies when facing new users; this method is an extension of the classic ε-greedy algorithm.
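For readers unfamiliar with the classic ε-greedy rule mentioned at the end of the abstract above, a minimal Python sketch follows. It shows only the textbook rule (explore uniformly at random with probability ε, otherwise pick the action with the highest estimated value), not the safe extension proposed in the thesis. The function name `epsilon_greedy` and the example values are illustrative assumptions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen by the classic epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value
    # (the first such action if several are tied).
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    estimates = [0.1, 0.9, 0.3]  # hypothetical action-value estimates
    print(epsilon_greedy(estimates, epsilon=0.1))
```

With `epsilon=0.0` the rule is purely greedy, and with `epsilon=1.0` it is purely random; values in between trade exploration against exploitation.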
Fournier, Pierre. "Intrinsically Motivated and Interactive Reinforcement Learning : a Developmental Approach". Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS634.
Reinforcement learning (RL) is today more popular than ever, but certain basic skills are still out of reach of this paradigm: object manipulation, sensorimotor control, natural interaction with other agents. A possible approach to address these challenges consists in taking inspiration from human development, or even trying to reproduce it. In this thesis, we study the intersection of two crucial topics in developmental sciences, interactive learning and intrinsic motivation, and how to apply them to RL in order to tackle the aforementioned challenges. Interactive learning and intrinsic motivation have already been studied, separately, in combination with RL, but in order to quantitatively improve the performance of existing agents rather than to learn in a developmental fashion. We thus focus our efforts on the developmental aspect of these subjects. Our work touches on the self-organisation of learning in developmental trajectories through an intrinsic motivation for learning progress, and on the interaction of this organisation with goal-directed learning and imitation learning. We show that these mechanisms, when implemented in open-ended environments with no predefined task, can interact to produce learning behaviors that are sound from a developmental standpoint, and richer than those produced by each mechanism separately.
Islas, Ramírez Omar Adair. "Learning Robot Interactive Behaviors in Presence of Humans and Groups of Humans". Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066632/document.
In recent years, robots have become part of our everyday lives. Even when we do not see them, we depend on them to build our computers, mobile phones, cars and more. They are also being used to organize stock in warehouses. And, with the growth of autonomous cars, we see them driving autonomously on highways and in cities. Another area of growth is social robotics. There are many studies on, for example, robots helping children with autism. Other robots are being used to receive people in hotels or to interact with people in shopping centers. In the latter examples, robots need to understand people's behavior. In addition, mobile robots need to know how to navigate in human environments. In this context, this thesis explores socially acceptable navigation of robots towards people. For example, when a robot approaches a person, it must by no means treat her as an obstacle, because it would then get very close to her and interfere with her personal space. The human is an entity that needs to be considered according to the social norms that we (humans) use on a daily basis. First, we explore how a robot can approach one person. A person can be bothered if someone or something approaches and invades her personal space. She will also feel distressed when approached from behind. These social norms have to be respected by the robot. For this reason, we decided to model the behavior of the robot through learning algorithms. We manually drive a robot towards a person several times and the robot learns how to reproduce this behavior. Second, we present how a robot can understand what a group of people is. We, humans, have the ability to do this intuitively. However, for a robot, a mathematical model is essential. Lastly, we address how a robot can approach a group of people. We use exemplary demonstrations to teach this behavior to the robot.
We then evaluate the robot's movements by, for example, observing whether the robot invades people's personal space along its trajectory.