Doctoral dissertations on the topic "Apprentissage automatique sur données confidentielles"
Create an accurate reference in APA, MLA, Chicago, Harvard, and many other citation styles
Consult the top 50 doctoral dissertations on the topic "Apprentissage automatique sur données confidentielles".
An "Add to bibliography" button is available next to each work in the bibliography. Use it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the publication as a .pdf file and read its abstract online, when these are available in the metadata.
Browse doctoral dissertations from a wide range of disciplines and compile an accurate bibliography.
Saadeh, Angelo. "Applications of secure multi-party computation in Machine Learning". Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAT022.
Privacy preservation in machine learning and data analysis is becoming increasingly important as the amount of sensitive personal information collected and used by organizations continues to grow. This growth poses the risk of exposing sensitive personal information to malicious third parties, which can lead to identity theft, financial fraud, or other types of cybercrime. Laws against the use of private data are important to protect individuals from having their information used and shared. However, by doing so, data protection laws limit the applications of machine learning models, some of which could be life-saving, as in the medical field. Secure multi-party computation (MPC) allows multiple parties to jointly compute a function over their inputs without having to reveal or exchange the data itself. This tool can be used to train collaborative machine learning models when there are privacy concerns about exchanging sensitive datasets between different entities. In this thesis, we (I) use existing and develop new secure multi-party computation algorithms, (II) introduce cryptography-friendly approximations of common machine learning functions, and (III) complement secure multi-party computation with other privacy tools, with the goal of implementing privacy-preserving machine learning and data analysis algorithms. Our work and experimental results show that executing the algorithms with secure multi-party computation satisfies both security and correctness: no party has access to another's information, yet the parties are still able to collaboratively train machine learning models with high accuracy and to collaboratively evaluate data analysis algorithms, with results matching those obtained on non-encrypted datasets. Overall, this thesis provides a comprehensive view of secure multi-party computation for machine learning, demonstrating its potential to revolutionize the field, and contributes to the deployment and acceptability of secure multi-party computation in machine learning and data analysis.
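For readers new to the area, the basic primitive behind such protocols is additive secret sharing, sketched below in Python. This illustrates the primitive only, not the protocols of the thesis, which additionally require fixed-point encodings, secure multiplication and a real communication layer.

```python
# Minimal sketch of additive secret sharing, a building block of secure
# multi-party computation (MPC). Illustrative only.
import secrets

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(value, n_parties):
    """Split `value` into n additive shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two parties privately hold x = 42 and y = 58, and distribute shares.
x_shares, y_shares = share(42, 2), share(58, 2)
# Each party adds its local shares; nobody ever sees x or y in the clear.
sum_shares = [(a + b) % PRIME for a, b in zip(x_shares, y_shares)]
assert reconstruct(sum_shares) == 100  # the sum is revealed, the inputs are not
```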
Girard, Régis. "Classification conceptuelle sur des données arborescentes et imprécises". La Réunion, 1997. http://elgebar.univ-reunion.fr/login?url=http://thesesenligne.univ.run/97_08_Girard.pdf.
Allesiardo, Robin. "Bandits Manchots sur Flux de Données Non Stationnaires". Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLS334/document.
The multi-armed bandit is a framework allowing the study of the trade-off between exploration and exploitation under partial feedback. At each turn t ∈ [1, T] of the game, a player has to choose an arm k_t in a set of K arms and receives a reward y_{k_t} drawn from a reward distribution D(µ_{k_t}) of mean µ_{k_t} and support [0,1]. This is a challenging problem as the player only knows the reward associated with the played arm and does not know what the reward would have been had she played another arm. Before each play, she is confronted with the dilemma between exploration and exploitation: exploring increases the confidence of the reward estimators, while exploiting increases the cumulative reward by playing the empirically best arm (under the assumption that the empirically best arm is indeed the actual best arm). In the first part of the thesis, we tackle the multi-armed bandit problem when reward distributions are non-stationary. Firstly, we study the case where, even if reward distributions change during the game, the best arm stays the same. Secondly, we study the case where the best arm changes during the game. The second part of the thesis tackles the contextual bandit problem, where the means of the reward distributions now depend on the environment's current state. We study the use of neural networks and random forests in the case of contextual bandits. We then propose a meta-bandit-based approach for selecting online the best-performing expert during its learning.
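For readers unfamiliar with the framework, the standard stationary answer to this dilemma is an optimism bonus such as UCB1, sketched below with made-up arm means; the non-stationary and contextual algorithms of the thesis are not reproduced here.

```python
# UCB1 on a 3-armed Bernoulli bandit: play the arm maximizing the empirical
# mean plus an exploration bonus that shrinks as the arm is pulled more often.
import math, random

means = [0.3, 0.5, 0.7]              # unknown to the player (illustrative)
counts = [0] * len(means)
sums = [0.0] * len(means)

def ucb_choice(t):
    for k in range(len(means)):       # play every arm once first
        if counts[k] == 0:
            return k
    return max(range(len(means)),
               key=lambda k: sums[k] / counts[k]
                             + math.sqrt(2 * math.log(t) / counts[k]))

random.seed(0)
for t in range(1, 10001):
    k = ucb_choice(t)
    reward = 1.0 if random.random() < means[k] else 0.0   # support [0, 1]
    counts[k] += 1
    sums[k] += reward
print(counts)   # most pulls should concentrate on the best arm (index 2)
```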
Bascol, Kevin. "Adaptation de domaine multisource sur données déséquilibrées : application à l'amélioration de la sécurité des télésièges". Thesis, Lyon, 2019. http://www.theses.fr/2019LYSES062.
Bluecime has designed a camera-based system to monitor the boarding station of chairlifts in ski resorts, which aims at increasing the safety of all passengers. This already successful system does not use any machine learning component and requires an expensive configuration step. Machine learning is a subfield of artificial intelligence which deals with studying and designing algorithms that can learn and acquire knowledge from examples for a given task. Such a task could be classifying safe or unsafe situations on chairlifts from examples of images already labeled with these two categories, called the training examples. The machine learning algorithm learns a model able to predict one of these two categories on unseen cases. Since 2012, it has been shown that deep learning models are the machine learning models best suited to image classification problems when large amounts of training data are available. In this context, this PhD thesis, funded by Bluecime, aims at improving both the cost and the effectiveness of Bluecime's current system using deep learning.
Vandromme, Maxence. "Optimisation combinatoire et extraction de connaissances sur données hétérogènes et temporelles : application à l’identification de parcours patients". Thesis, Lille 1, 2017. http://www.theses.fr/2017LIL10044.
Hospital data exhibit numerous specificities that make traditional data mining tools hard to apply. In this thesis, we focus on the heterogeneity associated with hospital data and on their temporal aspect. This work is done within the framework of the ANR ClinMine research project and a CIFRE partnership with the Alicante company. In this thesis, we propose two new knowledge discovery methods suited to hospital data, each able to perform a variety of tasks: classification, prediction, discovering patient profiles, etc. In the first part, we introduce MOSC (Multi-Objective Sequence Classification), an algorithm for supervised classification on heterogeneous, numeric and temporal data. In addition to binary and symbolic terms, this method uses numeric terms and sequences of temporal events to form sets of classification rules. MOSC is the first classification algorithm able to handle these types of data simultaneously. In the second part, we introduce HBC (Heterogeneous BiClustering), a biclustering algorithm for heterogeneous data, a problem that had not been studied before. This algorithm is extended to support temporal data of various types: temporal events and unevenly-sampled time series. HBC is used for a case study on a set of hospital data, whose goal is to identify groups of patients sharing a similar profile. The results make sense from a medical viewpoint; they indicate that relevant, and sometimes new, knowledge is extracted from the data. These results also lead to further, more precise case studies. The integration of HBC into a software product is also under way, with the implementation of a parallel version and a visualization tool for biclustering results.
Jaillet, Simon. "Catégorisation automatique de documents textuels : D'une représentation basée sur les concepts aux motifs séquentiels". Montpellier 2, 2005. http://www.theses.fr/2005MON20030.
Pełny tekst źródłaAllart, Thibault. "Apprentissage statistique sur données longitudinales de grande taille et applications au design des jeux vidéo". Electronic Thesis or Diss., Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1136.
This thesis focuses on longitudinal time-to-event data that may be large along three axes: number of individuals, observation frequency and number of covariates. We introduce a penalized estimator based on the Cox complete likelihood with data-driven weights, together with proximal optimization algorithms to fit model coefficients efficiently. We have implemented those methods in C++ and in the R package coxtv to allow everyone to analyse data sets bigger than RAM, using data streaming and online learning algorithms such as proximal stochastic gradient descent with adaptive learning rates. We illustrate the performance on simulations and benchmark against existing models. Finally, we investigate the issue of video game design. We show that using our model on the large datasets available in the video game industry allows us to bring to light ways of improving the design of the studied games. First, we look at low-level covariates, such as equipment choices through time, and show that the model allows us to quantify the effect of each game element, giving designers ways to improve the game design. Finally, we show that the model can be used to extract more general design recommendations, such as the influence of difficulty on player motivation.
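As a rough illustration of the optimization machinery named here (not the coxtv implementation, and with a squared loss standing in for the Cox likelihood), a proximal stochastic gradient step with an AdaGrad-style adaptive rate looks like this:

```python
# Proximal stochastic gradient descent with an L1 penalty on a simulated
# stream: AdaGrad-style steps, then soft-thresholding (the proximal operator).
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1 (elementwise)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

rng = np.random.default_rng(0)
p, lam = 20, 0.05
w_true = np.zeros(p); w_true[:3] = [2.0, -1.0, 0.5]
w, g2 = np.zeros(p), np.zeros(p)          # g2 accumulates squared gradients

for _ in range(10000):                    # one pass over a simulated stream
    x = rng.normal(size=p)
    y = x @ w_true + 0.1 * rng.normal()
    grad = (x @ w - y) * x                # gradient of 0.5 * (x @ w - y)**2
    g2 += grad**2
    step = 0.5 / (np.sqrt(g2) + 1e-8)     # adaptive per-coordinate rate
    w = soft_threshold(w - step * grad, step * lam)

print(np.round(w, 2))                     # sparse and close to w_true
```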
Dragoni, Laurent. "Tri de potentiels d'action sur des données neurophysiologiques massives : stratégie d’ensemble actif par fenêtre glissante pour l’estimation de modèles convolutionnels en grande dimension". Thesis, Université Côte d'Azur, 2022. http://www.theses.fr/2022COAZ4016.
In the nervous system, cells called neurons are specialized in the communication of information. Through the generation and propagation of electrical currents named action potentials, neurons are able to transmit information in the body. Given the importance of the neurons, a wide range of methods have been proposed for studying those cells in order to better understand the functioning of the nervous system. In this thesis, we focus on the analysis of signals recorded by electrodes, and more specifically tetrodes and multi-electrode arrays (MEA). Since those devices usually record the activity of a set of neurons, the recorded signals are often a mixture of the activity of several neurons. In order to gain more knowledge from this type of data, a crucial pre-processing step called spike sorting is required to separate the activity of each neuron. Nowadays, spike sorting generally consists of three steps: thresholding, feature extraction and clustering. Unfortunately, this methodology requires a large number of manual operations, and it becomes even more difficult when treating massive volumes of data, especially MEA recordings, which also tend to feature more neuronal synchronizations. In this thesis, we present a spike sorting strategy allowing the analysis of large volumes of data and requiring few manual operations. This strategy makes use of a convolutional model which aims at breaking down the recorded signals as temporal convolutions between two factors: neuron activations and action potential shapes. The estimation of these two factors is usually treated through alternating optimization. Being the most difficult task, we focus here only on the estimation of the activations, assuming that the action potential shapes are known. Estimating the activations is traditionally referred to as convolutional sparse coding. The well-known Lasso estimator features interesting mathematical properties for the resolution of such problems; however, its computation remains challenging on high-dimensional problems. We propose an algorithm based on the working-set strategy in order to compute the Lasso efficiently. This algorithm takes advantage of the particular structure of the problem, derived from biological properties, by using temporal sliding windows, allowing it to scale in high dimension. Furthermore, we adapt theoretical results about the Lasso to show that, under reasonable assumptions, our estimator recovers the support of the true activation vector with high probability. We also propose models for both the spatial distribution and the activation times of the neurons, which allow us to quantify the size of our problem and deduce the theoretical complexity of our algorithm. In particular, we obtain a quasi-linear complexity with respect to the size of the recorded signal. Finally, we present numerical results illustrating both the theoretical results and the performance of our approach.
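The working-set principle invoked above is easy to illustrate outside the convolutional setting: grow a small active set from KKT violations and solve the restricted Lasso on it. The sketch below uses a global candidate scan rather than the temporal sliding windows of the thesis.

```python
# Illustrative working-set Lasso: min_w 0.5*||y - Xw||^2 + lam*||w||_1.
# Repeatedly (i) add the coordinates most violating the KKT conditions,
# then (ii) run coordinate descent restricted to that small active set.
import numpy as np

def lasso_working_set(X, y, lam, n_outer=20, n_inner=100):
    p = X.shape[1]
    w, active = np.zeros(p), []
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_outer):
        r = y - X @ w
        grad = -X.T @ r
        violations = np.abs(grad) - lam        # KKT: |grad_j| <= lam if w_j = 0
        for j in np.argsort(violations)[::-1][:5]:
            if violations[j] > 1e-10 and j not in active:
                active.append(j)
        if not active:
            break
        for _ in range(n_inner):               # coordinate descent on active set
            for j in active:
                r += X[:, j] * w[j]            # partial residual without j
                rho = X[:, j] @ r
                w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
                r -= X[:, j] * w[j]
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 500))
w_true = np.zeros(500); w_true[[10, 100, 400]] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=200)
w_hat = lasso_working_set(X, y, lam=20.0)
print(sorted(np.argsort(np.abs(w_hat))[-3:]))  # strongest coefficients: 10, 100, 400
```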
Roudiere, Gilles. "Détection d'attaques sur les équipements d'accès à Internet". Thesis, Toulouse, INSA, 2018. http://www.theses.fr/2018ISAT0017/document.
Network anomalies, and specifically distributed denial of service attacks, are still an important threat to Internet stakeholders. Detecting such anomalies requires dedicated tools, able not only to perform an accurate detection but also to meet the several constraints of industrial operation. Such constraints include, among others, the ability to run autonomously or to operate on sampled traffic. Unlike supervised or signature-based approaches, unsupervised detection does not require any kind of knowledge database on the monitored traffic. Such approaches rely on an autonomous characterization of the traffic in production and require the intervention of the network administrator only a posteriori, when a deviation from the usual shape of the traffic is detected. The main problem with unsupervised detection lies in the fact that building such a characterization is complex and might require significant amounts of computing resources. This requirement might be a deterrent, especially when the detection should run on network devices that already have a significant workload. As a consequence, we propose a new unsupervised detection algorithm that aims at reducing the computing power required to run the detection. It focuses on distributed denial of service attacks, and its processing is based upon the creation, at regular intervals, of traffic snapshots, which helps the diagnosis of detected anomalies. We evaluate the performance of the detector over two datasets to check its ability to accurately detect anomalies and to operate, in real time, with limited computing power resources. We also evaluate its performance over sampled traffic. The results we obtained are compared with those of FastNetMon and UNADA.
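A toy version of the snapshot idea, with made-up features and thresholds (not those of the thesis detector or of FastNetMon), might look like this:

```python
# At a regular interval, summarize traffic into a small snapshot (here, the
# share of packets per destination) and alarm when one destination
# concentrates an abnormal share, as in a DDoS.
import random
from collections import Counter

def snapshot_alarm(packets, threshold=0.5):
    counts = Counter(packets)                    # packets = destination IPs
    top_dst, top = counts.most_common(1)[0]
    share = top / len(packets)
    return (top_dst, share) if share > threshold else None

random.seed(0)
normal = [f"10.0.0.{random.randint(1, 50)}" for _ in range(1000)]
attack = normal + ["10.0.0.7"] * 3000            # flood toward one destination
print(snapshot_alarm(normal))                    # None: traffic is spread out
print(snapshot_alarm(attack))                    # ('10.0.0.7', ~0.76)
```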
Eude, Thibaut. "Forage des données et formalisation des connaissances sur un accident : Le cas Deepwater Horizon". Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEM079/document.
Data drilling, the method and means developed in this thesis, redefines the process of data extraction and the formalization and enrichment of knowledge, particularly for the elucidation of events that have been only sparsely documented. The Deepwater Horizon disaster, the drilling platform operated for BP in the Gulf of Mexico that suffered a blowout on April 20, 2010, is our case study for the implementation of our proof of concept for data drilling. This accident is the result of an unprecedented discrepancy between the state of the art of drilling engineers' heuristics and that of pollution response engineers. The loss of control of the MC 252-1 well is therefore an engineering failure, and it took the response party eighty-seven days to regain control of the wild well and halt the pollution. Deepwater Horizon is in this sense a case of engineering facing an extreme situation, as defined by Guarnieri and Travadel. First, we propose to return to the overall concept of accident by means of an in-depth linguistic analysis presenting the semantic spaces in which the accident takes place. This makes it possible to enrich its "core meaning" and broaden the shared acceptance of its definition. Then, we argue that the literature review must be systematically supported by algorithmic assistance to process the data, taking into account the available volume, the heterogeneity of the sources and the requirements of quality and relevance standards. In fact, more than eight hundred scientific articles mentioning this accident have been published to date, and some twenty investigation reports, constituting our research material, have been produced. Our method demonstrates the limitations of accident models when dealing with a case like Deepwater Horizon and the urgent need to look for an appropriate way to formalize knowledge. As a result, the use of upper-level ontologies should be encouraged. The DOLCE ontology has shown great value in formalizing knowledge about this accident and especially in elucidating, very accurately, a decision-making process at a critical moment of the intervention. The population, that is the creation of instances, is the heart of the exploitation of an ontology and its main interest, but the process is still largely manual and not without mistakes. This thesis proposes a partial answer to this problem with an original NER algorithm for the automatic population of an ontology. Finally, the study of accidents involves determining the causes and examining "socially constructed facts". This thesis presents the original plans of a "semantic pipeline" built with a series of algorithms that extract the causality expressed in a document and produce a graph representing the "causal path" underlying the document. It is significant for scientific or industrial research to highlight the reasoning behind the findings of the investigation team. To do this, this work leverages developments in machine learning and question answering, and especially natural language processing tools. In conclusion, this thesis is the work of a fitter, an architect, which both offers a prime insight into the Deepwater Horizon case and proposes data drilling, an original method and means to address an event, in order to uncover, from the research material, answers to questions that had previously escaped understanding.
Bordes, Antoine. "Nouveaux Algorithmes pour l'Apprentissage de Machines à Vecteurs Supports sur de Grandes Masses de Données". Phd thesis, Université Pierre et Marie Curie - Paris VI, 2010. http://tel.archives-ouvertes.fr/tel-00464007.
Simon, Franck. "Découverte causale sur des jeux de données classiques et temporels. Application à des modèles biologiques". Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS528.
Pełny tekst źródłaThis thesis focuses on the field of causal discovery : the construction of causal graphs from observational data, and in particular, temporal causal discovery and the reconstruction of large gene regulatory networks. After a brief history, this thesis introduces the main concepts, hypotheses and theorems underlying causal graphs as well as the two main approaches: score-based and constraint-based methods. The MIIC (Multivariate Information-based Inductive Causation) method, developed in our laboratory, is then described with its latest improvements: Interpretable MIIC. The issues and solutions implemented to construct a temporal version (tMIIC) are presented as well as benchmarks reflecting the advantages of tMIIC compared to other state-of-the-art methods. The application to sequences of images taken with a microscope of a tumor environment reconstituted on microchips illustrates the capabilities of tMIIC to recover, solely from data, known and new relationships. Finally, this thesis introduces the use of a consequence a priori to apply causal discovery to the reconstruction of gene regulatory networks. By assuming that all genes, except transcription factors, are only consequence genes, it becomes possible to reconstruct graphs with thousands of genes. The ability to identify key transcription factors de novo is illustrated by an application to single cell RNA sequencing data with the discovery of two transcription factors likely to be involved in the biological process of interest
Durand, Maëva. "Alimentation sur mesure et estimation du bien-être des truies gestantes à partir de données hétérogènes". Electronic Thesis or Diss., Rennes, Agrocampus Ouest, 2023. http://www.theses.fr/2023NSARC169.
New technologies are increasingly being deployed in pig farming to help farmers in their labour tasks. They allow the distribution of tailored diets to gestating sows and better monitoring of animal behaviour. The aim of this thesis is to improve the estimation of the daily nutritional requirements and of the individual welfare status of gestating sows using behavioural and environmental data collected automatically. The first objective was to evaluate experimentally the effects of environmental disturbances on behaviour and nutritional requirements. To achieve this, two groups of sows were followed during two consecutive gestations, during which several events were induced. A database containing a variety of sows' behavioural data was built from these experiments. The results of the thesis highlighted the influence of environmental conditions on the behaviour and nutritional requirements of sows during gestation, as well as an important individual variability. The second part involved estimating individual daily requirements and welfare from behavioural and environmental data recorded by sensors. The individual estimation of nutritional requirements and welfare status can be carried out accurately using machine learning algorithms and data produced by the automatic feeder. Using these innovative methods, this thesis opens the way to the design of a decision-support tool aimed at adjusting feeding and improving the welfare of gestating sows.
Mahmoudysepehr, Mehdi. "Modélisation du comportement du tunnelier et impact sur son environnement". Thesis, Centrale Lille Institut, 2020. http://www.theses.fr/2020CLIL0028.
This PhD thesis research work consists in understanding the behavior of the tunnel boring machine (TBM) according to the environment encountered, in order to propose safe, durable and high-quality solutions for the digging of the tunnel. The main objective is to better understand the behavior of the TBM according to its environment. Thus, we explore how the TBM reacts to the different types of terrain and how it acts on the various elements of the tunnel structure (segments, or voussoirs). This makes it possible to propose an intelligent and optimal dimensioning of the segments and adapted piloting instructions.
Loeffel, Pierre-Xavier. "Algorithmes de machine learning adaptatifs pour flux de données sujets à des changements de concept". Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066496/document.
In this thesis, we investigate the problem of supervised classification on a data stream subject to concept drifts. In order to learn in this environment, we claim that a successful learning algorithm must combine several characteristics. It must be able to learn and adapt continuously, it shouldn't make any assumption on the nature of the concept or on the expected type of drifts, and it should be allowed to abstain from prediction when necessary. On-line learning algorithms are the obvious choice to handle data streams: their update mechanism allows them to continuously update the learned model by always making use of the latest data. The instance-based (IB) structure also has properties which make it extremely well suited to handle data streams with drifting concepts. Indeed, IB algorithms make very few assumptions about the nature of the concept they are trying to learn, which grants them a great flexibility and makes them likely to be able to learn a wide range of concepts. Another strength is that storing some of the past observations in memory can bring valuable meta-information which can be used by the algorithm. Furthermore, the IB structure allows the adaptation process to rely on hard evidence of obsolescence and, by doing so, adaptation to concept changes can happen without the need to explicitly detect the drifts. Finally, in this thesis we stress the importance of allowing the learning algorithm to abstain from prediction in this framework, because the drifts can generate a lot of uncertainty and, at times, an algorithm might lack the necessary information to predict accurately.
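The abstention mechanism argued for above admits a very small instance-based illustration: predict only when the neighbourhood vote is sufficiently unanimous. The agreement threshold below is arbitrary, not the thesis's rule.

```python
# A kNN-style predictor that abstains (returns None) when evidence is weak.
import numpy as np
from collections import Counter

def knn_predict_or_abstain(X_mem, y_mem, x, k=5, min_agreement=0.8):
    """Majority label of the k nearest stored examples, or None to abstain."""
    dists = np.linalg.norm(X_mem - x, axis=1)
    votes = Counter(y_mem[np.argsort(dists)[:k]])
    label, count = votes.most_common(1)[0]
    return label if count / k >= min_agreement else None

rng = np.random.default_rng(0)
X_mem = rng.normal(size=(100, 2)) + np.array([[2, 0]] * 50 + [[-2, 0]] * 50)
y_mem = np.array([1] * 50 + [0] * 50)
print(knn_predict_or_abstain(X_mem, y_mem, np.array([2.1, 0.0])))  # 1
print(knn_predict_or_abstain(X_mem, y_mem, np.array([0.0, 0.0])))  # likely None
```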
Irain, Malik. "Plateforme d'analyse de performances des méthodes de localisation des données dans le cloud basées sur l'apprentissage automatique exploitant des délais de messages". Thesis, Toulouse 3, 2019. http://www.theses.fr/2019TOU30195.
Cloud usage is a necessity today, as the data produced and used by all types of users (individuals, companies, administrative structures) has become too large to be stored otherwise. It requires signing, explicitly or not, a contract with a cloud storage provider. This contract specifies the required quality-of-service levels for various criteria, among which is the location of the data. However, this criterion is not easily verifiable by a user, which is why research in the field of data location verification has produced several studies in recent years; still, the proposed solutions can be improved. The work proposed in this thesis consists in studying solutions for location verification by a user, i.e. solutions that estimate the data location and operate using landmarks. The implemented approach can be summarized as follows: exploiting communication delays and using network time models to estimate, with some distance error, the data location. To this end, the work carried out is as follows: • a survey of the state of the art on the different methods used to provide users with location information; • the design of a unified notation for the methods studied in the survey, with a proposal of two scores to assess methods; • the implementation of a network measurement collection platform, thanks to which two datasets were collected, at the national and international levels, and used to evaluate the different methods presented in the state-of-the-art survey; • the implementation of an evaluation architecture based on the two datasets and the defined scores, which allows us to establish the quality of the methods (success rate) and the quality of the results (accuracy) thanks to the proposed scores.
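As a toy illustration of the landmark principle, a multilateration sketch with an invented linear time model and constants (not those of any surveyed method):

```python
# Each landmark converts a measured round-trip delay into a distance estimate
# (dist ~ (rtt - overhead) * speed / 2), then the data location is taken as
# the point minimizing the residuals.
import numpy as np
from scipy.optimize import least_squares

landmarks = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])  # known positions
true_pos = np.array([40.0, 30.0])
SPEED, OVERHEAD = 1.0, 2.0                     # made-up time-model parameters

rng = np.random.default_rng(0)
dists = np.linalg.norm(landmarks - true_pos, axis=1)
rtts = 2 * dists / SPEED + OVERHEAD + rng.normal(0, 1.0, size=3)  # noisy delays

est_d = (rtts - OVERHEAD) * SPEED / 2          # delay -> distance estimates
sol = least_squares(lambda p: np.linalg.norm(landmarks - p, axis=1) - est_d,
                    x0=np.array([50.0, 50.0]))
print(sol.x)                                   # close to (40, 30), with some error
```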
Ghrissi, Amina. "Ablation par catheter de fibrillation atriale persistante guidée par dispersion spatiotemporelle d’électrogrammes : Identification automatique basée sur l’apprentissage statistique". Thesis, Université Côte d'Azur, 2021. http://www.theses.fr/2021COAZ4026.
Catheter ablation is increasingly used to treat atrial fibrillation (AF), the most common sustained cardiac arrhythmia encountered in clinical practice. A recent patient-tailored AF ablation therapy, with a 95% procedural success rate, is based on the use of a multipolar mapping catheter called PentaRay. It targets areas of spatiotemporal dispersion (STD) in the atria as potential AF drivers. STD stands for a delay of the cardiac activation observed in intracardiac electrograms (EGMs) across contiguous leads. In practice, interventional cardiologists localize STD sites visually using the PentaRay multipolar mapping catheter. This thesis aims to automatically characterize and identify ablation sites in STD-based ablation of persistent AF using machine learning (ML), including deep learning (DL), techniques. In the first part, EGM recordings are classified into STD vs. non-STD groups. However, the highly imbalanced dataset hampers the classification performance. We tackle this issue by using adapted data augmentation techniques that help achieve good classification; the overall performance is high, with accuracy and AUC values around 90%. First, two approaches are benchmarked: feature engineering and automatic feature extraction from a time series called maximal voltage absolute values at any of the bipoles (VAVp). Statistical features are extracted and fed to ML classifiers, but no important dissimilarity is obtained between the STD and non-STD categories. Results show that the supervised classification of raw VAVp time series into the same categories is promising, with accuracy, AUC, sensitivity and specificity values around 90%. Second, the classification of raw multichannel EGM recordings is performed. Shallow convolutional arithmetic circuits are investigated for their promising theoretical interest, but experimental results on synthetic data are unsuccessful. We then move to more conventional supervised ML tools: we design a selection of data representations adapted to different ML and DL models and benchmark their performance in terms of classification and computational cost. Transfer learning is also assessed. The best performance is achieved with a convolutional neural network (CNN) model for classifying raw EGM matrices; the average performance over cross-validation reaches 94% accuracy and AUC, with an F1-score of 60%. In the second part, EGM recordings acquired during mapping are labeled ablated vs. non-ablated according to their proximity to the ablation sites, then classified into the same categories. STD labels, previously defined by interventional cardiologists during the ablation procedure, are also aggregated as a prior probability in the classification task. Classification results on the test set show that a shallow CNN gives the best performance, with an F1-score of 76%; aggregating the STD label does not improve the model's performance. Overall, this work is among the first attempts to apply statistical analysis and ML tools to automatically identify successful ablation areas in STD-based ablation. By providing interventional cardiologists with a real-time objective measure of STD, the proposed solution offers the potential to improve the efficiency and effectiveness of this fully patient-tailored catheter ablation approach for treating persistent AF.
Ahmia, Oussama. "Veille stratégique assistée sur des bases de données d’appels d’offres par traitement automatique de la langue naturelle et fouille de textes". Thesis, Lorient, 2020. http://www.theses.fr/2020LORIS555.
This thesis, carried out within the framework of a CIFRE contract with the OctopusMind company, focuses on developing a set of automated tools dedicated and optimized to assist in the processing of call-for-tender databases, for the purpose of strategic intelligence monitoring. Our contribution is divided into three chapters. The first chapter is about developing a partially comparable multilingual corpus built from the European calls for tender published by TED (Tenders Electronic Daily); it contains more than 2 million documents translated into 24 languages, published over the last 9 years. The second chapter presents a study on the embedding of words, sentences and documents, likely to capture semantic features at different scales. We propose two approaches: the first is based on a combination of word embeddings (word2vec) and latent semantic analysis (LSA); the second is based on a novel artificial neural network architecture built on two-level convolutional attention mechanisms. These embedding methods are evaluated on text classification and clustering tasks. The third chapter concerns the extraction of semantic relationships in calls for tenders, in particular linking buildings to areas, lots to budgets, and so on. The supervised approaches developed in this part of the thesis are essentially based on Conditional Random Fields. The end of the third chapter concerns the application aspect, in particular the implementation of solutions deployed within OctopusMind's software environment, including information extraction, a recommender system, and the combination of these different modules to solve more complex problems.
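The first combination mentioned, LSA plus averaged word vectors, can be sketched with scikit-learn; the tiny corpus and the word_vecs lookup below are placeholders for word2vec embeddings trained on the TED corpus.

```python
# LSA document vectors concatenated with averaged word vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["road construction tender", "bridge construction works",
        "software maintenance services", "it support tender"]
tfidf = TfidfVectorizer().fit(docs)
X = tfidf.transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)  # LSA part

rng = np.random.default_rng(0)
vocab = tfidf.get_feature_names_out()
word_vecs = {w: rng.normal(size=8) for w in vocab}    # stand-in for word2vec
avg = np.array([np.mean([word_vecs[w] for w in d.split() if w in word_vecs],
                        axis=0) for d in docs])
doc_repr = np.hstack([lsa, avg])                      # combined representation
print(doc_repr.shape)                                 # (4, 10)
```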
Kerrouche, Abdelali. "Routage des données dans les réseaux centrés sur les contenus". Thesis, Paris Est, 2017. http://www.theses.fr/2017PESC1119/document.
Information-Centric Networking (ICN) represents a new paradigm that is increasingly developed within the Internet world. It brings forward new content-centric approaches in order to design a new architecture for the future Internet, whose usage today is shifting from machine-oriented communication (hosts) to large-scale content distribution and retrieval. In this context, several ICN architectures have been proposed by the scientific community within several international projects: DONA, PURSUIT, SAIL, COMET, CONVERGENCE, Named Data Networking (NDN), etc. Our thesis work focuses on the problems of routing in such networks, through the NDN architecture, which represents one of the most advanced ICN architectures nowadays. In particular, we were interested in designing and implementing routing solutions that integrate quality-of-service (QoS) metrics into the NDN architecture, given current Internet usage, which is characterized by heterogeneous connections and highly dynamic traffic conditions. In this type of architecture, packet forwarding is organized on two levels: the routing plane and the forwarding plane. The latter is responsible for routing packets on all available paths through an identified upstream strategy, while the routing plane is used only to support the forwarding plane. Our solutions consist of new adaptive QoS routing strategies that can transmit packets over multiple paths while taking into account QoS parameters related to the state of the network and collected in real time. The first proposed approach is designed on the basis of an online Q-learning-type inductive learning method, used to estimate the information collected on the dynamic state of the network. The second contribution is an adaptive routing strategy designed for NDN architectures that considers QoS-related metrics. It is based on the similarities between the packet forwarding process in the NDN architecture and the behavior of ants finding the shortest path between their nest and food sources; the techniques used to design this strategy are based on optimization approaches from ant colony algorithms. Finally, in the last part of the thesis, we generalize the approach described above to take several QoS parameters into account simultaneously. Based on these principles, this approach was later extended to solve congestion-related problems. The results show the effectiveness of the proposed solutions in an NDN architecture, making it possible to consider QoS parameters in packet delivery mechanisms and paving the way for various content-oriented applications on this architecture.
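The ant-colony intuition behind the second contribution reduces, in caricature, to pheromone-weighted path choice with evaporation and QoS-driven reinforcement; the face names, delays and gains below are invented, not the thesis's strategy.

```python
# Each candidate next hop ("face") keeps a pheromone value; packets pick a
# face with probability proportional to it, pheromones evaporate over time
# and are reinforced in inverse proportion to the measured delay.
import random

pheromone = {"faceA": 1.0, "faceB": 1.0, "faceC": 1.0}
EVAPORATION, REINFORCE = 0.1, 1.0

def pick_face():
    faces, weights = zip(*pheromone.items())
    return random.choices(faces, weights=weights)[0]

def measured_delay(face):                        # stand-in for real measurements
    return {"faceA": 10.0, "faceB": 30.0, "faceC": 60.0}[face] * random.uniform(0.8, 1.2)

random.seed(0)
for _ in range(2000):
    face = pick_face()
    delay = measured_delay(face)
    for f in pheromone:                          # evaporation on every face
        pheromone[f] = max(pheromone[f] * (1 - EVAPORATION), 1e-6)
    pheromone[face] += REINFORCE / delay         # better QoS -> stronger trail

print(pheromone)                                 # faceA (lowest delay) dominates
```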
Coelho, Rodrigues Pedro Luiz. "Exploration des invariances de séries temporelles multivariées via la géométrie Riemannienne : validation sur des données EEG". Electronic Thesis or Diss., Université Grenoble Alpes (ComUE), 2019. http://www.theses.fr/2019GREAT095.
Multivariate time series are the standard tool for describing and analysing measurements from multiple sensors during an experiment. In this work, we discuss different aspects of such representations that are invariant to transformations occurring in practical situations. The main source of inspiration for our investigations are experiments with neural signals from electroencephalography (EEG), but the ideas that we present are amenable to other kinds of time series. The first invariance that we consider concerns the dimensionality of the multivariate time series. Very often, signals recorded from neighbouring sensors present strong statistical dependency between them. We present techniques for disposing of the redundancy of these correlated signals and obtaining new multivariate time series that represent the same phenomenon in a smaller dimension. The second invariance that we treat is related to time series describing the same phenomena but recorded under different experimental conditions, for instance signals recorded with the same experimental apparatus but on different days of the week, on different test subjects, etc. In such cases, despite an underlying variability, the multivariate time series share certain commonalities that can be exploited for joint analysis. Moreover, reusing information already available from other datasets is a very appealing idea and allows for "data-efficient" machine learning methods. We present an original transfer learning procedure that transforms these time series so that their statistical distributions become aligned and can be pooled together for further statistical analysis. Finally, we extend the previous case to time series obtained from different experimental conditions and also different experimental setups. A practical example is having EEG recordings from subjects executing the same cognitive task but with the electrodes positioned differently. We present an original method that transforms these multivariate time series so that they become compatible in terms of dimensionality and also in terms of statistical distributions. We illustrate the techniques described above on EEG epochs recorded during brain-computer interface (BCI) experiments. We show examples where the reduction of the multivariate time series does not affect the performance of statistical classifiers used to distinguish their classes, as well as instances where our transfer learning and dimension-matching proposals provide remarkable classification results in cross-session and cross-subject settings. For exploring the invariances presented above, we rely on a framework that parametrizes the statistics of the multivariate time series via Hermitian positive definite (HPD) matrices. We manipulate these matrices by considering them in a Riemannian manifold in which an adequate metric is chosen. We use concepts from Riemannian geometry to define notions such as geodesic distance, center of mass, and statistical classifiers for time series. This approach is rooted in fundamental results of differential geometry for Hermitian positive definite matrices and has links with other well-established areas in applied mathematics, such as information geometry and signal processing.
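Two of the ingredients assumed in this summary can be sketched in the real symmetric positive definite (SPD) case: the affine-invariant Riemannian distance between covariance matrices, and a re-centering step whose invariance property is what makes cross-session alignment possible. The sketch below is illustrative, not the thesis's full procedure.

```python
# Affine-invariant Riemannian distance between SPD matrices, plus a
# "re-centering" map that whitens a session by a reference matrix; the
# distance is preserved under this congruence (affine invariance).
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def airm_distance(A, B):
    """d(A, B) = ||log(A^{-1/2} B A^{-1/2})||_F."""
    A_inv_sqrt = fractional_matrix_power(A, -0.5)
    M = A_inv_sqrt @ B @ A_inv_sqrt
    return np.linalg.norm(logm(M), "fro")

def recenter(covs, reference):
    """Transport covariances so the reference (e.g. a session mean) maps to I."""
    R = fractional_matrix_power(reference, -0.5)
    return [R @ C @ R.T for C in covs]

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 200))
C1 = X @ X.T / 200                      # a sample covariance (SPD)
C2 = C1 + 0.1 * np.eye(4)
print(airm_distance(C1, C2))
print(airm_distance(*recenter([C1, C2], C1)))   # same value: invariance
```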
Meghnoudj, Houssem. "Génération de caractéristiques à partir de séries temporelles physiologiques basée sur le contrôle optimal parcimonieux : application au diagnostic de maladies et de troubles humains". Electronic Thesis or Diss., Université Grenoble Alpes, 2024. http://www.theses.fr/2024GRALT003.
In this thesis, a novel methodology for feature generation from physiological signals (EEG, ECG) is proposed and used for the diagnosis of a variety of brain and heart diseases. Based on sparse optimal control, the generation of Sparse Dynamical Features (SDFs) is inspired by the functioning of the brain. The method's fundamental concept revolves around sparsely decomposing the signal into dynamical modes that can be switched on and off at the appropriate time instants with the appropriate amplitudes. This decomposition provides a new point of view on the data, which gives access to informative features that are faithful to the brain's functioning. Nevertheless, the method remains generic and versatile, as it can be applied to a wide range of signals. The methodology's performance was evaluated on three use cases using openly accessible real-world data: (1) Parkinson's disease, (2) schizophrenia, and (3) various cardiac diseases. For all three applications, the results are highly conclusive, comparable to state-of-the-art methods while using only a few features (one or two for the brain applications) and a simple linear classifier, supporting the significance and reliability of the findings. It is worth highlighting that special attention has been given to achieving significant and meaningful results with an underlying explainability.
Qamar, Ali Mustafa. "Mesures de similarité et cosinus généralisé : une approche d'apprentissage supervisé fondée sur les k plus proches voisins". Phd thesis, Grenoble, 2010. http://www.theses.fr/2010GRENM083.
Almost all machine learning problems depend heavily on the metric used. Many works have shown that it is a far better approach to learn the metric structure from the data rather than assuming a simple geometry based on the identity matrix. This has paved the way for a new research theme called metric learning. Most works in this domain have based their approaches on distance learning only; however, some other works have shown that similarity should be preferred over distance metrics when dealing with textual datasets as well as with non-textual ones. Being able to efficiently learn appropriate similarity measures, as opposed to distances, is thus of high importance for various collections. If several works have partially addressed this problem for different applications, no previous work is known to have fully addressed it in the context of learning similarity metrics for kNN classification. This is exactly the focus of the current study. In the case of information filtering systems, where the aim is to filter an incoming stream of documents into a set of predefined topics with little supervision, cosine-based category-specific thresholds can be learned. Learning such thresholds can be seen as a first step towards learning a complete similarity measure. This strategy was used to develop online and batch algorithms for information filtering during the INFILE (Information Filtering) track of the CLEF (Cross Language Evaluation Forum) campaign in 2008 and 2009. However, provided enough supervised information is available, as is the case in classification settings, it is usually beneficial to learn a complete metric as opposed to learning thresholds. To this end, we developed numerous algorithms for learning complete similarity metrics for kNN classification. An unconstrained similarity learning algorithm called SiLA is developed, in which the normalization is independent of the similarity matrix. SiLA encompasses, among others, the standard cosine measure, as well as the Dice and Jaccard coefficients. SiLA is an extension of the voted perceptron algorithm and allows learning different types of similarity functions (based on diagonal, symmetric or asymmetric matrices). We then compare SiLA with RELIEF, a well-known feature re-weighting algorithm. It has recently been suggested by Sun and Wu that RELIEF can be seen as a distance metric learning algorithm optimizing a cost function which is an approximation of the 0-1 loss. We show here that this approximation is loose and propose a stricter version, closer to the 0-1 loss, leading to a new and better RELIEF-based algorithm for classification. We then focus on a direct extension of the cosine similarity measure, defined as a normalized scalar product in a projected space. The associated algorithm is called the generalized Cosine simiLarity Algorithm (gCosLA). All of the algorithms are tested on many different datasets. A statistical test, the s-test, is employed to assess whether the results are significantly different. gCosLA performed statistically much better than SiLA on many of the datasets. Furthermore, SiLA and gCosLA were compared with many state-of-the-art algorithms, illustrating their well-foundedness.
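A schematic, heavily simplified version of a SiLA-style update (bilinear similarity, perceptron correction on margin violations, with the voting and normalization details omitted) could read:

```python
# Perceptron-style learning of a bilinear similarity s_A(x, y) = x^T A y:
# when an example is at least as similar to its closest imposter as to its
# closest same-class target, push A toward the target pair and away from
# the imposter pair.
import numpy as np

def train_similarity(X, y, epochs=10, lr=0.1):
    A = np.eye(X.shape[1])                       # start from a cosine-like identity
    for _ in range(epochs):
        for i, x in enumerate(X):
            same = [j for j in range(len(X)) if j != i and y[j] == y[i]]
            diff = [j for j in range(len(X)) if y[j] != y[i]]
            t = max(same, key=lambda j: x @ A @ X[j])   # closest target
            m = max(diff, key=lambda j: x @ A @ X[j])   # closest imposter
            if x @ A @ X[t] <= x @ A @ X[m]:            # margin violated
                A += lr * (np.outer(x, X[t]) - np.outer(x, X[m]))
    return A

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 0], 0.5, (20, 2)), rng.normal([0, 2], 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
A = train_similarity(X, y)
correct = sum(y[max((j for j in range(len(X)) if j != i),
                    key=lambda j: X[i] @ A @ X[j])] == y[i] for i in range(len(X)))
print(correct / len(X))              # leave-one-out 1-NN accuracy under s_A
```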
Vo, Nguyen Dang Khoa. "Compression vidéo basée sur l'exploitation d'un décodeur intelligent". Thesis, Nice, 2015. http://www.theses.fr/2015NICE4136/document.
This Ph.D. thesis studies the novel concept of Smart Decoder (SDec), where the decoder is given the ability to simulate the encoder and to conduct the rate-distortion (R-D) competition just as the encoder does. The proposed technique aims to reduce the signaling of competing coding modes and parameters. The general SDec coding scheme and several practical applications are proposed, followed by a long-term approach exploiting machine learning concepts in video coding. The SDec coding scheme exploits a complex decoder able to reproduce the choice of the encoder based on causal references, thus eliminating the need to signal coding modes and associated parameters. Several practical applications of the general outline of the SDec scheme are tested, using different coding modes during the competition on the reference blocks. Although the choice of the SDec reference block remains simple and limited, interesting gains are observed. The long-term research presents an innovative method that further makes use of the processing capacity of the decoder. Machine learning techniques are exploited in video coding with the purpose of reducing the signaling overhead. Practical applications are given, using a classifier based on support vector machines to predict the coding modes of a block. The block classification uses causal descriptors consisting of different types of histograms. Significant bit rate savings are obtained, which confirms the potential of the approach.
Frouin, Arthur. "Lien entre héritabilité et prédiction de phénotypes complexes chez l’humain : une approche du problème par la régression ridge sur des données de population". Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASL027.
This thesis studies the contribution of machine learning methods to the prediction of complex and heritable human phenotypes from population genetic data. Indeed, genome-wide association studies (GWAS) generally explain only a small fraction of the heritability observed in family data. However, heritability can be approximated on population data by genomic heritability, which estimates the phenotypic variance explained by the set of single nucleotide polymorphisms (SNPs) of the genome using mixed models. This thesis therefore approaches heritability from a machine learning perspective and examines the close link between mixed models and ridge regression. Our contribution is twofold. First, we propose to estimate genomic heritability using a predictive approach via ridge regression and generalized cross-validation (GCV). Second, we derive simple formulas that express the precision of the ridge regression prediction as a function of the size of the population and the total number of SNPs, showing that high heritability does not necessarily imply accurate prediction. The heritability estimation via GCV and the prediction precision formulas are validated using simulated data and real data from UK Biobank. The last part of the thesis presents results on qualitative phenotypes. These results allow a better understanding of the biases of heritability estimation methods.
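The GCV route to choosing the ridge penalty fits in a few lines; the simulated "SNP" design below is purely illustrative, not the thesis's estimator for heritability itself.

```python
# Ridge regression with generalized cross-validation (GCV):
# GCV(lam) = n * ||(I - H)y||^2 / (n - trace(H))^2, with hat matrix
# H = X (X^T X + lam I)^{-1} X^T; pick the penalty minimizing it.
import numpy as np

def gcv_ridge(X, y, lambdas):
    n = len(y)
    best = None
    for lam in lambdas:
        H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
        resid = y - H @ y
        gcv = n * (resid @ resid) / (n - np.trace(H)) ** 2
        if best is None or gcv < best[0]:
            best = (gcv, lam)
    return best  # (gcv score, chosen lambda)

rng = np.random.default_rng(0)
n, p = 200, 500                                   # more "SNPs" than individuals
X = rng.normal(size=(n, p))
beta = rng.normal(scale=0.1, size=p)              # small polygenic effects
y = X @ beta + rng.normal(size=n)
print(gcv_ridge(X, y, lambdas=[10, 100, 1000, 10000]))
```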
Muzeau, Julien. "Système de vision pour la sécurité des personnes sur les remontées mécaniques". Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALT075.
With the increase in the number of visitors to mountain ranges and the multiplication of accidents on ski lifts attributed to human behavior, safety has become a major issue for resort managers. To fight this phenomenon, the Grenoble start-up Bluecime has developed a computer vision system, named SIVAO, able to detect hazardous situations at the boarding point of a chairlift. The operation of the system breaks down into three steps. First, the chair (or vehicle) is detected in the image. Then, the presence of passengers is confirmed or invalidated. Finally, the position of the safety railing is determined. If passengers are present on the vehicle and the safety railing is not down, the situation is considered hazardous. In that case, an alarm is triggered to inform the skiers or the operator, who can slow down the ski lift to secure the vehicle. Despite convincing results, numerous difficulties have to be overcome by SIVAO: various sources of variability (vehicle size, boarding orientation, meteorological conditions, number of passengers), camera vibration, complex configuration for a new installation, etc. The MIVAO project, in partnership with the Hubert Curien laboratory, the Bluecime start-up and the Sofival company, was born to overcome these challenges. The goal is to build an artificial intelligence able to detect, or even anticipate, hazardous situations on the vehicles of a ski lift, in order to guarantee the safety of passengers. Within this project, the general goal of Gipsa-lab is the automatic annotation, in the least supervised way possible, of chairlift videos. Firstly, we present a classification method whose aim is to confirm or invalidate the presence of passengers on each vehicle. In fact, this preliminary information is critical for the analysis of a potential danger. The proposed technique is based on hand-crafted features which have a physical interpretation. We show that, by including a priori knowledge, the obtained results are competitive with those from complex neural networks, while allowing real-time operation as well. Then, we detail a process for counting passengers on each vehicle in the most unsupervised way possible. This pipeline consists of a dimensionality reduction step followed by a data clustering stage. The latter aims, in the context of our project, at gathering tracks whose vehicles carry the same number of passengers. One can then deduce, from a small number of labels obtained by hand, the number of people present during each track. In particular, we detail two algorithms developed during this thesis. The first proposes a generalization of the density-based clustering method DBSCAN, via the introduction of the concept of ellipsoidal neighborhood. The second reconciles Gaussian mixture and spectral clustering so as to discover non-convex data groups. Finally, we address the problem of automatic extraction of vehicles from camera images, as well as the modeling of their trajectory. To do this, we propose a first method which consists in removing the noise from the optical flow by means of the optical strain. We also present a technique for automatically determining the duration of a vehicle track via frequency analysis. Moreover, we detail an annotation effort whose objective is to define pixel-wise clipping paths over the passengers and vehicles in sequences of forty consecutive images.
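The ellipsoidal neighborhood mentioned above amounts to replacing DBSCAN's spherical eps-query with a Mahalanobis-style test; the shape matrix in this sketch is made up, not the one derived in the thesis.

```python
# Ellipsoidal neighbourhood query: points j with
# (x_j - x_i)^T S^{-1} (x_j - x_i) <= eps^2, so elongated groups
# (e.g. tracks stretched along one axis) are captured.
import numpy as np

def ellipsoidal_neighbors(X, i, shape_matrix, eps=1.0):
    diff = X - X[i]
    S_inv = np.linalg.inv(shape_matrix)
    d2 = np.einsum("nd,dk,nk->n", diff, S_inv, diff)
    return np.where(d2 <= eps**2)[0]

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 10, 100), rng.normal(0, 0.1, 100)])
S = np.diag([25.0, 0.04])           # long axis along dim 0, short along dim 1
print(len(ellipsoidal_neighbors(X, 0, S)))          # many along-track points
print(len(ellipsoidal_neighbors(X, 0, np.eye(2))))  # far fewer with a ball
```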
Shahzad, Atif. "Une Approche Hybride de Simulation-Optimisation Basée sur la fouille de Données pour les problèmes d'ordonnancement". Phd thesis, Université de Nantes, 2011. http://tel.archives-ouvertes.fr/tel-00647353.
Sellami, Akrem. "Interprétation sémantique d'images hyperspectrales basée sur la réduction adaptative de dimensionnalité". Thesis, Ecole nationale supérieure Mines-Télécom Atlantique Bretagne Pays de la Loire, 2017. http://www.theses.fr/2017IMTA0037/document.
Hyperspectral imagery allows acquiring rich spectral information about a scene in several hundred or even thousands of narrow and contiguous spectral bands. However, with the high number of spectral bands, the strong inter-band spectral correlation and the redundancy of spectro-spatial information, the interpretation of these massive hyperspectral data is one of the major challenges for the remote sensing scientific community. In this context, the major challenge is to reduce the number of unnecessary spectral bands, that is, to reduce the redundancy and high correlation of spectral bands while preserving the relevant information. Therefore, projection approaches aim to transform the hyperspectral data into a reduced subspace by combining all original spectral bands, whereas band selection approaches attempt to find a subset of relevant spectral bands. In this thesis, we first focus on hyperspectral image classification, attempting to integrate spectro-spatial information into dimension reduction in order to improve classification performance and to overcome the loss of spatial information in projection approaches. Therefore, we propose a hybrid model to preserve the spectro-spatial information, exploiting the tensor model in the locality preserving projection approach (TLPP), and use constraint band selection (CBS) as an unsupervised approach to select the discriminant spectral bands. To model the uncertainty and imperfection of these reduction approaches and classifiers, we propose an evidential approach based on Dempster-Shafer Theory (DST). In a second step, we extend the hybrid model by exploiting the semantic knowledge extracted from the features obtained by the previously proposed TLPP approach to enrich the CBS technique. Indeed, the proposed approach makes it possible to select relevant spectral bands which are at the same time informative, discriminant, distinctive and not very redundant. In fact, this approach selects the discriminant and distinctive spectral bands using the CBS technique, injecting the rules obtained with knowledge extraction techniques to automatically and adaptively select the optimal subset of relevant spectral bands. The performance of our approach is evaluated on several real hyperspectral datasets.
Derksen, Dawa. "Classification contextuelle de gros volumes de données d'imagerie satellitaire pour la production de cartes d'occupation des sols sur de grandes étendues". Thesis, Toulouse 3, 2019. http://www.theses.fr/2019TOU30290.
Pełny tekst źródła
This work studies the application of supervised classification to the production of land cover maps from time series of satellite images at high spatial, spectral, and temporal resolutions. For this problem, certain classes, such as urban cover, depend more on the context of a pixel than on its content. The issue addressed in this Ph.D. work is therefore how to take the neighborhood of the pixel into account so as to improve the recognition rates of these classes. This research first leads to questioning the definition of context, and to imagining the different shapes it may take. The next step is describing the context, that is to say, creating a representation or model of it that allows the target classes to be recognized. The combinations of these two aspects are evaluated on two experimental data sets, one of Sentinel-2 images and the other of SPOT-7 images
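The abstract contrasts pixel content with pixel context but does not fix a representation; a minimal baseline is to augment each pixel's spectral features with statistics of a square neighborhood before training any supervised classifier. The window size and the use of a mean filter below are illustrative assumptions, not the contextual models studied in the thesis.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def add_context(image: np.ndarray, window: int = 9) -> np.ndarray:
    """image: (H, W, B) spectral bands. Returns (H, W, 2B) features where
    each pixel's bands are stacked with their local neighborhood means."""
    context = np.stack(
        [uniform_filter(image[..., b], size=window) for b in range(image.shape[-1])],
        axis=-1,
    )
    return np.concatenate([image, context], axis=-1)

# Example: a 100x100 image with 4 bands -> 8 features per pixel
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 100, 4))
feats = add_context(img)        # feed feats.reshape(-1, 8) to a classifier
print(feats.shape)              # (100, 100, 8)
```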
Zhao, Zilong. "Extracting knowledge from macroeconomic data, images and unreliable data". Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALT074.
Pełny tekst źródła
System identification and machine learning are two similar concepts used independently in the automatic control and computer science communities. System identification uses statistical methods to build mathematical models of dynamical systems from measured data. Machine learning algorithms build a mathematical model from sample data, known as "training data" (clean or not), in order to make predictions or decisions without being explicitly programmed to do so. Besides prediction accuracy, convergence speed and stability are two other key factors in evaluating the training process, especially in the online learning scenario, and these properties have already been well studied in control theory. This thesis therefore carries out interdisciplinary research on the following topics: 1) System identification and optimal control of macroeconomic data: we first model Chinese macroeconomic data with a Vector Auto-Regression (VAR) model, then identify the cointegration relation between variables and use a Vector Error Correction Model (VECM) to study the short-term fluctuations around the long-term equilibrium; Granger causality is also studied with the VECM. This work reveals the trend of China's economic growth transition from export-oriented to consumption-oriented. Owing to the limitations of the Chinese economic data, we turn to French macroeconomic data in a second study: we represent the model in state space and place it in a feedback control framework, where the controller is designed as a Linear-Quadratic Regulator (LQR). The system can apply the control law to bring itself to a desired state. We can also impose perturbations on the outputs and constraints on the inputs, which emulates the real-world situation of an economic crisis. Economists can then observe the recovery trajectory of the economy, which yields meaningful implications for policy-making. 2) Using control theory to improve the online learning of deep neural networks: we propose a performance-based learning rate algorithm, E (Exponential)/PD (Proportional Derivative) feedback control, which considers the Convolutional Neural Network (CNN) as the plant, the learning rate as the control signal and the loss value as the error signal. Results show that E/PD outperforms the state of the art in final accuracy, final loss and convergence speed, and that the results are also more stable. One observation from the E/PD experiments, however, is that the learning rate decreases while the loss keeps decreasing. Yet a decreasing loss means the model is approaching an optimum, so the learning rate should not necessarily be lowered. To prevent this, we propose an event-based E/PD; results show that it improves on E/PD in final accuracy, final loss and convergence speed. Another observation is that online learning fixes a constant number of training epochs for each batch. Since E/PD converges fast, the significant improvement only comes from the first epochs, so we propose a second event-based E/PD which inspects the historical loss and moves to the next batch when the training progress falls below a certain threshold. Results show that it can save up to 67% of the epochs on the CIFAR-10 dataset without much performance degradation. 3) Machine learning from unreliable data: we propose a generic framework, Robust Anomaly Detector (RAD), whose data selection part is a two-layer framework in which the first layer filters out suspicious data and the second layer detects anomaly patterns in the remaining data. We also derive three variations of RAD, namely voting, active learning and slim, which use additional information, e.g., the opinions of conflicting classifiers and queries to oracles. We iteratively update the historically selected data to improve the accumulated data quality. Results show that RAD can continuously improve the model's performance in the presence of label noise; the three variations all improve on the original setting, and RAD Active Learning performs almost as well as the case with no label noise
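The exact E/PD controller form and gains are not given in this abstract; the sketch below only shows the general idea of a performance-driven learning rate, treating the training loss as the error signal of a PD controller during plain gradient descent. The gains, the rate formula and the quadratic toy loss are all illustrative assumptions, not the thesis's E/PD law.

```python
import numpy as np

def pd_rate(loss_prev, loss_curr, kp=0.05, kd=0.2):
    """PD-style learning rate: proportional to the current loss, plus a
    derivative term rewarding recent improvement (illustrative gains)."""
    return kp * loss_curr + kd * max(0.0, loss_prev - loss_curr)

# Gradient descent on a toy quadratic loss f(w) = ||w||^2 / 2
rng = np.random.default_rng(0)
w = rng.normal(size=5)
loss_prev = 0.5 * w @ w
for step in range(50):
    grad = w                       # gradient of the quadratic
    loss_curr = 0.5 * w @ w
    lr = pd_rate(loss_prev, loss_curr)
    w = w - lr * grad
    loss_prev = loss_curr
print(round(float(0.5 * w @ w), 6))  # loss after PD-scheduled descent, near zero
```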
Jacques, Julie. "Classification sur données médicales à l'aide de méthodes d'optimisation et de datamining, appliquée au pré-screening dans les essais cliniques". Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2013. http://tel.archives-ouvertes.fr/tel-00919876.
Pełny tekst źródła
Hamdan, Hani. "Développement de méthodes de classification pour le contrôle par émission acoustique d'appareils à pression". Compiègne, 2005. http://www.theses.fr/2005COMP1583.
Pełny tekst źródła
This PhD thesis deals with real-time computer-aided decision for the acoustic-emission-based control of pressure equipment. The problem addressed is taking into account the location uncertainty of acoustic emission signals in mixture model-based clustering. Two new algorithms (EM and CEM for uncertain data) are developed. These algorithms are based solely on uncertainty-zone data, and their development is carried out by optimizing new likelihood criteria adapted to this kind of data. In order to speed up data processing when the data size becomes very large, we also develop a new method for the discretization of uncertainty-zone data, which is compared with the traditional method applied to imprecise data. An experimental study using simulated and real data shows the efficiency of the various approaches developed
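The uncertain-data variants of EM and CEM are specific to this thesis; as a reference point, the sketch below implements the classical EM recursion for a one-dimensional two-component Gaussian mixture, the algorithm these variants extend. All initial values and data are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

# Parameters: mixing weights, means, standard deviations
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(100):
    # E step: posterior responsibility of each component for each point
    dens = pi * norm.pdf(x[:, None], mu, sigma)      # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: weighted maximum-likelihood updates
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
print(np.round(mu, 2), np.round(pi, 2))  # close to [-2, 3] and [0.6, 0.4]
```

The CEM variant replaces the soft E step by a hard assignment of each point to its most probable component before the M step; the thesis's algorithms further rework the likelihood so that only uncertainty zones, not exact locations, enter these updates.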
Hosni, Nadia. "De l’analyse en composantes principales fonctionnelle à l’autoencodeur convolutif profond sur les trajectoires de formes de Kendall pour l’analyse et la reconnaissance de la démarche en 3D". Thesis, Lille 1, 2020. http://www.theses.fr/2020LIL1I066.
Pełny tekst źródła
In the field of Computer Vision and Pattern Recognition, human behavior understanding has attracted the attention of several research groups and specialized companies. Successful intelligent solutions will play an important role in applications involving human-robot or human-computer interaction, biometric recognition (security), and physical performance assessment (healthcare and well-being), since they will assist human beings where their limited cognitive capabilities do not perform well. In this thesis project, we investigate the problem of 3D gait recognition and analysis, as gait is a user-friendly and well-accepted technology, especially given the availability of RGB-D sensors and algorithms for detecting and tracking human landmarks in video streams. Unlike other biometrics such as fingerprints, face or iris, gait can be acquired at a large distance and does not require any collaboration from the end user. This makes gait recognition suitable for intelligent video surveillance, used for example in the security field as a behavioral biometric, or in healthcare for assessing physical patterns. However, using tracked 3D human body landmarks to analyze such motions faces many challenges, such as spatial and temporal variations and high dimensionality. Hence, in this thesis, we propose novel frameworks for processing 3D skeletal sequences for the purpose of 3D gait analysis and recognition. They are based on viewing the above-cited sequences as time-parameterized trajectories on the Kendall shape space S, obtained by modding out shape-preserving transformations, i.e., scaling, translation and rotation. Considering the non-linear structure of the manifold on which these shape trajectories lie, the use of conventional machine learning and standard computational tools is not straightforward, so we make use of geometric steps related to Riemannian geometry in order to handle the problem of non-linearity. Our first contribution is a geometric-functional framework for 3D gait analysis with a direct application to behavioral biometric recognition and physical performance assessment. We opt for an extension of functional Principal Component Analysis to the underlying space. This functional analysis of trajectories, grounded in the geometry of the representation space, allows compact and efficient biometric signatures to be extracted. In addition, we propose a geometric deep convolutional auto-encoder (DCAE) for gait recognition from time-varying 3D skeletal data. To accommodate neural network architectures to the manifold-valued trajectories on the underlying non-linear space S, these trajectories are mapped to a vector space by means of Riemannian geometry tools prior to the encoding-decoding scheme. Without applying any prior temporal alignment step (e.g., Dynamic Time Warping) or modeling (e.g., HMM, RNN), they are then fed to a convolutional auto-encoder to build an identity-relevant latent space that shows discriminating capacities for identifying persons even when no temporal alignment is applied to the time-parameterized gait trajectories: efficient gait patterns are extracted. Both approaches were tested on several publicly available datasets and show promising results
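Kendall shape analysis starts by removing translation and scale (giving the preshape) and then rotation; the sketch below normalises a single 3D skeleton frame this way and aligns one frame to another with the classical orthogonal Procrustes rotation. It is a preprocessing illustration under these standard definitions, not the thesis's full trajectory pipeline.

```python
import numpy as np

def preshape(landmarks: np.ndarray) -> np.ndarray:
    """Center and unit-scale a (k, 3) landmark configuration
    (Kendall preshape: translation and scale removed)."""
    centered = landmarks - landmarks.mean(axis=0)
    return centered / np.linalg.norm(centered)

def align(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Rotate preshape a onto preshape b (orthogonal Procrustes via SVD)."""
    u, _, vt = np.linalg.svd(a.T @ b)
    return a @ (u @ vt)

rng = np.random.default_rng(0)
skel = rng.normal(size=(20, 3))                   # 20 tracked joints
theta = np.pi / 3                                 # arbitrary rotation about z
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
a, b = preshape(skel @ rot.T + 5.0), preshape(skel)
print(np.linalg.norm(align(a, b) - b) < 1e-8)     # True: same Kendall shape
```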
Jiao, Yunlong. "Pronostic moléculaire basé sur l'ordre des gènes et découverte de biomarqueurs guidé par des réseaux pour le cancer du sein". Thesis, Paris Sciences et Lettres (ComUE), 2017. http://www.theses.fr/2017PSLEM027/document.
Pełny tekst źródła
Breast cancer is the second most common cancer worldwide and the leading cause of women's death from cancer. Improving cancer prognosis has been one of the problems of primary interest for better clinical management and treatment decision-making for cancer patients. With the rapid advancement of genomic profiling technologies in the past decades, the easy availability of a substantial amount of genomic data for medical research has been motivating the currently popular trend of using computational tools, especially machine learning in the era of data science, to discover molecular biomarkers for prognosis improvement. This thesis is conceived along two lines of approach intended to address two major challenges arising in genomic data analysis for breast cancer prognosis, from the methodological standpoint of machine learning: rank-based approaches for improved molecular prognosis, and network-guided approaches for enhanced biomarker discovery. Furthermore, the methodologies developed and investigated in this thesis, pertaining respectively to learning with rank data and learning on graphs, make a significant contribution to several branches of machine learning, with applications including, but not limited to, cancer biology and social choice theory
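Rank-based molecular prognosis replaces raw expression values by their within-sample ordering, which makes predictors invariant to monotone, sample-specific distortions of the measurements. A minimal sketch of this representation follows; the data are synthetic, and the thesis's actual rank-learning models go well beyond this transformation.

```python
import numpy as np
from scipy.stats import rankdata

# Synthetic expression matrix: 4 patients x 6 genes
rng = np.random.default_rng(0)
expr = rng.lognormal(mean=2.0, sigma=1.0, size=(4, 6))

# Within-sample ranks: each row becomes a rank vector, invariant to
# any increasing transform applied per patient (normalization, log, ...)
ranks = np.apply_along_axis(rankdata, 1, expr)
print(ranks)

# Sanity check: a monotone distortion of one patient leaves ranks unchanged
assert np.allclose(rankdata(np.log1p(expr[0])), ranks[0])
```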
Malherbe, Emmanuel. "Standardization of textual data for comprehensive job market analysis". Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLC058/document.
Pełny tekst źródła
With so many job adverts and candidate profiles available online, e-recruitment constitutes a rich object of study. All this information is, however, textual data, which from a computational point of view is unstructured. The large number and heterogeneity of recruitment websites also mean that there are many vocabularies and nomenclatures. One of the difficulties when dealing with this type of raw textual data is being able to grasp the concepts contained in it, which is the problem of standardization tackled in this thesis. The aim of standardization is to create a unified process providing values in a nomenclature. A nomenclature is, by definition, a finite set of meaningful concepts, which means that the attributes resulting from standardization are a structured representation of the information. Several questions are raised: Are the websites' structured data usable for a unified standardization? What structure of nomenclature is best suited for standardization, and how can it be leveraged? Is it possible to automatically build such a nomenclature from scratch, or to manage the standardization process without one? To illustrate the various obstacles of standardization, the examples we study include inferring the skills or the category of a job advert, or the level of training of a candidate profile. One of the challenges of e-recruitment is that the concepts are continuously evolving, which means that the standardization must stay up to date with job market trends. In light of this, we propose a set of machine learning models that require minimal supervision and can easily adapt to the evolution of the nomenclatures. The questions raised found partial answers using case-based reasoning, semi-supervised learning-to-rank, latent variable models, and the evolving sources of the semantic web and social media. The different models proposed were tested on real-world data before being implemented in an industrial environment. The resulting standardization is at the core of SmartSearch, a project which provides a comprehensive analysis of the job market
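A simple baseline for standardization, in the spirit of the case-based reasoning mentioned above, maps a free-text job title to the closest entry of a nomenclature by TF-IDF cosine similarity. The tiny nomenclature and the character n-gram settings below are invented for illustration; they are not the thesis's models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative nomenclature of standardized job categories
nomenclature = ["software engineer", "data scientist",
                "sales representative", "project manager"]

# Character n-grams make the matching robust to spelling variants
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
concept_matrix = vec.fit_transform(nomenclature)

def standardize(raw_title: str) -> str:
    """Return the nomenclature entry closest to the raw title."""
    sims = cosine_similarity(vec.transform([raw_title]), concept_matrix)
    return nomenclature[int(sims.argmax())]

print(standardize("senior sofware engneer (java)"))  # -> software engineer
```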
Celikkanat, Abdulkadir. "Graph Representation Learning with Random Walk Diffusions". Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG030.
Pełny tekst źródła
Graph representation learning aims to embed nodes in a low-dimensional space. In this thesis, we tackle various challenging problems arising in the field. Firstly, we study how to leverage the inherent local community structure of graphs while learning node representations, and learn enhanced community-aware representations by combining the latent information with the embeddings. Moreover, we concentrate on the expressiveness of node representations, emphasizing exponential family distributions to capture rich interaction patterns, and propose a model that combines random walks with kernelized matrix factorization. In the last part of the thesis, we study models balancing the trade-off between efficiency and accuracy, and propose a scalable embedding model which computes binary node representations
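Random-walk embedding models of this family typically generate truncated walks over the graph and feed them to a word2vec-style learner; the sketch below shows that generic recipe (a DeepWalk-style baseline, not the thesis's diffusion models), with networkx and gensim as assumed dependencies.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(g: nx.Graph, num_walks: int = 10, length: int = 20):
    """Generate truncated random walks, one node sequence per walk."""
    walks = []
    for _ in range(num_walks):
        for start in g.nodes:
            walk = [start]
            while len(walk) < length:
                nbrs = list(g.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append([str(n) for n in walk])
    return walks

g = nx.karate_club_graph()
walks = random_walks(g)
# Skip-gram over the walks yields node embeddings
model = Word2Vec(walks, vector_size=32, window=5, min_count=0, sg=1, epochs=5)
print(model.wv["0"].shape)  # (32,)
```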
Geuens, Stijn. "Personalization in e-commerce : a procedure to create and evaluate business relevant recommendation systems". Thesis, Lille 1, 2017. http://www.theses.fr/2017LIL12016/document.
Pełny tekst źródła
Recommendation systems are heavily investigated in the machine learning literature, resulting in the creation of many algorithms. This doctoral dissertation goes beyond merely proposing new recommendation algorithms: it leverages state-of-the-art techniques and investigates how these techniques interact with different data sources having distinct characteristics. The focus lies on the creation of frameworks guiding both marketers and academics in developing, evaluating, and testing recommendation systems in an e-commerce context. Concretely, this dissertation adds to the literature in seven distinct ways. First, a framework evaluating collaborative filtering algorithms is designed and validated on real-life offline data sets of a large European e-tailer, La Redoute. Second, a five-step framework to develop and evaluate hybrid recommendation systems combining different data sources is proposed and validated on real-life historical data in Chapter II. Third, Chapter II introduces feature importance into the recommendation-systems literature. Fourth, the best-performing algorithms in the offline tests are leveraged as the basis for creating two revenue-maximization recommendation systems in Chapter III. Fifth, a framework investigating three effects of (revenue-maximization) recommendation systems on business metrics throughout the purchase funnel is proposed in Chapter III. Sixth, the framework is validated in a large-scale field experiment executed in collaboration with La Redoute. Finally, a business case shows the added value of the best-performing recommendation systems
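As a concrete reference for the collaborative filtering algorithms evaluated in the first framework, the sketch below scores items with a basic item-item cosine model on an invented interaction matrix; a real evaluation, as in the dissertation, would use held-out transactions and business metrics.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Implicit-feedback matrix: rows = users, columns = items (1 = interaction)
R = np.array([[1, 1, 0, 0, 1],
              [0, 1, 1, 0, 0],
              [1, 0, 0, 1, 1],
              [0, 1, 1, 1, 0]], dtype=float)

item_sim = cosine_similarity(R.T)        # (items, items) similarity
np.fill_diagonal(item_sim, 0.0)          # an item should not recommend itself

scores = R @ item_sim                    # aggregate similarity of seen items
scores[R > 0] = -np.inf                  # mask already-seen items
print(scores.argmax(axis=1))             # top-1 recommendation per user
```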
Lacombe, Théo. "Statistiques sur les descripteurs topologiques à base de transport optimal". Thesis, Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAX036.
Pełny tekst źródła
Topological data analysis (TDA) allows one to extract rich information from the structured data (such as graphs or time series) that occur in modern machine learning problems. This information is represented as descriptors such as persistence diagrams, which can be described as point measures supported on a half-plane. While persistence diagrams are not elements of a vector space, they can still be compared using partial matching metrics. The similarities between these metrics and those routinely used in optimal transport (another field of mathematics) have long been known, but a formal connection between the two fields was still missing. The purpose of this thesis is to clarify this connection and to develop new theoretical and computational tools for manipulating persistence diagrams, targeting statistical applications. First, we show how optimal partial transport with boundary, a variation of classical optimal transport theory, provides a formalism that encompasses the standard metrics of TDA. We then showcase the benefits of this connection in different situations: a theoretical study and the development of an algorithm for the fast estimation of barycenters of persistence diagrams, the characterization of continuous linear representations of persistence diagrams and how to learn such representations using a neural network, and finally a stability result in the context of linearly averaging random persistence diagrams
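The partial-matching metrics mentioned above allow points of one diagram to be matched either to points of the other or to the diagonal; the standard small-scale computation augments each diagram with the diagonal projections of the other's points and solves a square assignment problem. The sketch below does this for the 2-Wasserstein cost, assuming scipy's Hungarian solver; production code would use dedicated TDA libraries.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein2(d1: np.ndarray, d2: np.ndarray) -> float:
    """2-Wasserstein distance between persistence diagrams given as
    (n, 2) arrays of (birth, death) points, diagonal matching allowed."""
    proj = lambda d: np.column_stack([d.sum(1) / 2, d.sum(1) / 2])
    a = np.vstack([d1, proj(d2)])      # d1 points + projections of d2
    b = np.vstack([d2, proj(d1)])      # d2 points + projections of d1
    cost = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    n1, n2 = len(d1), len(d2)
    cost[n1:, n2:] = 0.0               # diagonal-to-diagonal matches are free
    rows, cols = linear_sum_assignment(cost)
    return float(np.sqrt(cost[rows, cols].sum()))

d1 = np.array([[0.0, 1.0], [0.2, 0.5]])
d2 = np.array([[0.0, 1.1]])
print(round(wasserstein2(d1, d2), 4))  # ~0.2345 for this toy pair
```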
Yang, Tong. "Constitution et exploitation d’une base de données pour l’enseignement/apprentissage des phrasèmes NAdj du domaine culinaire français auprès d’apprenants non-natifs". Thesis, Paris 3, 2019. http://www.theses.fr/2019PA030049.
Pełny tekst źródła
This thesis project studies the teaching of FOS (Français sur Objectifs Spécifiques, French for specific purposes) to foreign cooks who come to work in French restaurants or who have chosen catering as a specialty. The objective of our research is therefore to teach culinary NAdj phrasemes to foreign A2-level learners. The teaching/learning of phraseology is required in specialty languages, and the high frequency of NAdj phrasemes caught our attention. Several questions are then addressed: Where can this specific lexicon be found? How can these phrasemes be extracted? With which approach should the selected phrasemes be taught? To answer these questions, we built our own corpus, Cuisitext (written and oral), and then used NooJ to extract the NAdj phrasemes from the corpus. Finally, we propose three approaches to the use of corpora for the teaching/learning of NAdj phrasemes: the guided inductive approach, the deductive approach, and the pure inductive approach
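NooJ extracts such sequences with local grammars; an equivalent quick experiment in Python is to POS-tag a culinary text and keep noun-adjective bigrams. The sketch assumes spaCy with its small French model (fr_core_news_sm) installed; the sample sentence and expected output are illustrative.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download fr_core_news_sm
nlp = spacy.load("fr_core_news_sm")

def nadj_phrasemes(text: str) -> list[str]:
    """Return Noun+Adjective bigrams, the NAdj pattern studied here."""
    doc = nlp(text)
    return [f"{a.text} {b.text}"
            for a, b in zip(doc, doc[1:])
            if a.pos_ == "NOUN" and b.pos_ == "ADJ"]

texte = "Ajouter la crème fraîche et le vin blanc, puis les herbes fines."
print(nadj_phrasemes(texte))  # e.g. ['crème fraîche', 'vin blanc', 'herbes fines']
```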
Ben, Chaabene Nour El Houda. "Détection d'utilisateurs violents et de menaces dans les réseaux sociaux". Electronic Thesis or Diss., Institut polytechnique de Paris, 2022. http://www.theses.fr/2022IPPAS001.
Pełny tekst źródła
Online social networks are an integral part of people's daily social activity. They provide platforms connecting people from all over the world to share their interests. Recent statistics indicate that 56% of the world's population use these social media. However, these network services have also had many negative impacts, and the existence of aggression and intimidation in these spaces is inevitable and must therefore be addressed. Exploring the complex structure of social networks to detect violent behavior and threats is a challenge for data mining, machine learning, and artificial intelligence. In this thesis, we propose new approaches for the detection of violent behavior in social networks. Our approaches address several practical requirements. First, different people have different ways of expressing the same violent behavior; given this variety of behaviors and of the ways they are expressed, it is desirable to design an approach that works for everyone. Second, the approaches must be able to detect potential unseen abnormal behaviors and automatically add them to the training set. Third, the multimodality and multidimensionality of the data available on social networking sites must be taken into account when developing data mining solutions capable of extracting the relevant information needed to detect violent behavior. Finally, the approaches must consider the time-varying nature of networks, so as to process new users and links and automatically update the models built. In light of this, the main contributions of this thesis are as follows. The first contribution proposes a model for detecting violent behavior on Twitter; this model supports the dynamic nature of the network and is capable of extracting and analyzing heterogeneous data. The second contribution introduces an approach for detecting atypical behaviors on a multidimensional network, based on exploring and analyzing the relationships between the individuals present in this multidimensional social structure. The third contribution presents an intelligent framework for identifying abnormal users: it exploits a multidimensional model taking multimodal data from several sources as input, automatically enriches the training set with the violent behaviors detected, and takes the dynamicity of the data into account in order to detect new violent behaviors appearing on the network. This thesis thus combines data mining techniques with new machine learning techniques, and the approaches were evaluated on real data taken from three popular social networks
Barré, Anthony. "Analyse statistique de données issues de batteries en usage réel sur des véhicules électriques, pour la compréhension, l’estimation et la gestion des phénomènes de vieillissement". Thesis, Grenoble, 2014. http://www.theses.fr/2014GRENT064/document.
Pełny tekst źródła
The electric vehicle market is undergoing important developments, but the limits associated with performance remain a major obstacle to increasing sales further. Battery performance and lifetime are the main concerns of EV users. Batteries are subject to performance loss due to complex phenomena involving interactions between the different life conditions of the battery. In order to improve the understanding and estimation of battery aging, the studies were based on datasets from batteries in real use on electric vehicles. More precisely, this work consists of adapting and applying statistical approaches to the available data in order to highlight the interactions between variables, as well as creating methods for the estimation of battery performance. The results obtained illustrate the value of a statistical approach, for example by showing that the signals coming from the battery contain information useful for estimating its state of health
Blachon, David. "Reconnaissance de scènes multimodale embarquée". Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM001/document.
Pełny tekst źródła
Context: This PhD takes place in the context of Ambient Intelligence and (Mobile) Context/Scene Awareness. Historically, the project comes from the company ST-Ericsson. It was framed as the need to develop and embed a "context server" on the smartphone that would acquire and provide context information to the applications requiring it. One use case was given for illustration: when someone involved in a meeting receives a call, then, thanks to the understanding of the current scene (meeting at work), the smartphone is able to act automatically and, in this case, switch to vibrate mode in order not to disturb the meeting. The main problems consist of i) proposing a definition of what a scene is and what examples of scenes would suit the use case, ii) acquiring a corpus of data to be exploited with machine learning approaches, and iii) proposing algorithmic solutions to the problem of scene recognition. Data collection: After a review of existing databases, it appeared that none fitted the criteria I had set (long continuous records; multi-source synchronized records necessarily including audio; relevant labels). Hence, I developed an Android application for collecting data. The application, called RecordMe, was successfully tested on 10+ devices running Android 2.3 and 4.0, and was used for 3 different campaigns, including the one for scenes. This resulted in 500+ hours recorded from 25+ volunteers, mostly in the Grenoble area but also abroad (Dublin, Singapore, Budapest). The application and the collection protocol both include features for protecting the volunteers' privacy: for instance, raw audio is not saved; instead, MFCCs are saved, and sensitive strings (GPS coordinates, device ids) are hashed on the phone. Scene definition: The study of existing work on scene recognition, along with the analysis of the annotations provided by the volunteers during data collection, allowed me to propose a definition of a scene as a generalisation of a situation, composed of a place and an action performed by one person (the smartphone owner). Examples of scenes include taking transportation, being involved in a work meeting, and walking in the street. The composition makes it possible to provide different kinds of information on the current scene. However, the definition is still too generic, and I think it might be completed with additional information, integrated as new elements of the composition. Algorithmics: I performed experiments involving machine learning techniques, both supervised and unsupervised. The supervised part concerns classification, with a fairly standard method: find relevant descriptors of the data through an attribute selection method, then train and test several classifiers (in my case, J48 and Random Forest trees; GMMs; HMMs; and DNNs). I also tried a 2-stage system composed of a first layer of classifiers trained to identify intermediate concepts, whose predictions are merged in order to estimate the most likely scene. The unsupervised part of the work aimed at extracting information from the data without labels. For this purpose, I applied bottom-up hierarchical clustering, based on the EM algorithm, to acceleration and audio data, taken separately and together. One of the results is the separation of acceleration data into groups based on the amount of agitation
Kassab, Randa. "Analyse des propriétés stationnaires et des propriétés émergentes dans les flux d'informations changeant au cours du temps". Phd thesis, Université Henri Poincaré - Nancy I, 2009. http://tel.archives-ouvertes.fr/tel-00402644.
Pełny tekst źródła
The main contribution of this thesis work lies in the development of a learning model - named ILoNDF - founded on the principle of novelty detection. Unlike its initial version, the learning of this model is guided not only by the novelty that an input datum brings but also by the datum itself. As a result, the ILoNDF model can constantly acquire new knowledge about the occurrence frequencies of the data and of their variables, which makes it less sensitive to noise. Moreover, operating online without repeated training, this model meets the strongest requirements associated with data stream processing.
We first focus on studying the behavior of the ILoNDF model in the general setting of one-class classification, working with highly multidimensional and noisy data. This study allowed us to highlight the pure learning capacities of the ILoNDF model compared with the methods proposed so far. We then turn more specifically to the fine adaptation of the model to the precise setting of information filtering. Our objective is to set up a user-oriented rather than system-oriented filtering strategy, following two types of directions. The first direction concerns user modeling with the ILoNDF model. This modeling provides a new way of looking at the user profile in terms of specificity, exhaustivity and contradiction criteria. Among other things, this makes it possible to optimize the filtering threshold by taking into account the importance the user may attach to precision and recall. The second direction, complementary to the first, concerns refining the functionality of the ILoNDF model by endowing it with the capacity to adapt to the drift of the user's need over time. Finally, we generalize our earlier work to the case where streaming data can be divided into multiple classes.
Kassab, Randa. "Analyse des propriétés stationnaires et des propriétés émergentes dans les flux d'information changeant au cours du temps". Thesis, Nancy 1, 2009. http://www.theses.fr/2009NAN10027/document.
Pełny tekst źródła
Many applications produce and receive continuous, unlimited, high-speed data streams. This raises obvious problems of storage, treatment and analysis of the data, which are only just beginning to be addressed in the domain of data streams. On the one hand, it is a question of processing data streams on the fly without having to memorize all the data. On the other hand, it is also a question of analyzing, in a simultaneous and concurrent manner, both the regularities inherent in the data stream and the novelties, exceptions, or changes occurring in the stream over time. The main contribution of this thesis is the development of a new machine learning approach - called ILoNDF - based on the novelty detection principle. The learning of this model is, unlike that of its original version, driven not only by the novelty part of the input data but also by the data itself. Thereby, ILoNDF can continuously extract new knowledge relating to the relative frequencies of the data and their variables, which makes it more robust against noise. Operating in an online mode without repeated training, ILoNDF can further address the primary challenges of managing data streams. Firstly, we focus on the study of ILoNDF's behavior for one-class classification when dealing with high-dimensional noisy data. This study enabled us to highlight the pure learning capacities of ILoNDF with respect to the key classification methods suggested until now. Next, we address the adaptation of ILoNDF to the specific context of information filtering. Our goal is to set up user-oriented rather than system-oriented filtering strategies, following two types of directions. The first direction concerns user modeling relying on ILoNDF, which provides a new way of looking at the user's need in terms of specificity, exhaustivity and contradiction, profile-contributing criteria that can be used to estimate the relative importance the user might attach to precision and recall. The filtering threshold can then be adjusted taking this knowledge about the user's need into account. The second direction, complementary to the first one, concerns the refinement of ILoNDF's functionality in order to confer on it the capacity to track the user's drifting need over time. Finally, we consider the generalization of our previous work to the case where streaming data can be divided into multiple classes
Aziz, Usama. "Détection des défauts des éoliennes basée sur la courbe de puissance : Comparaison critique des performances et proposition d'une approche multi-turbines". Thesis, Université Grenoble Alpes, 2020. https://tel.archives-ouvertes.fr/tel-03066125.
Pełny tekst źródła
Since wind turbines are electricity generators, the electrical power produced by a machine is a relevant variable for monitoring it and detecting possible faults. In the framework of this thesis, an in-depth literature review was first performed on fault detection methods for wind turbines that use the electrical power produced. It showed that, although many methods have been proposed in the literature, it is very difficult to compare their performance objectively, due to the lack of reference data that would allow all these methods to be implemented and evaluated on the same basis. To address this problem, a new realistic simulation approach was first proposed in this thesis. It makes it possible to create simulated data streams, coupling the power output, wind speed and temperature, in normal conditions and in fault situations, indefinitely. The faults that can be simulated are those that affect the shape of the power curve. The simulated data are generated from real data recorded on several French wind farms located on different geographical sites. In a second step, a method for evaluating the performance of fault detection methods based on the power produced was proposed. This new simulation method was applied to 4 different fault situations affecting the power curve, using data from 5 geographically distant wind farms. A total of 1875 years of 10-minute SCADA data was generated and used to compare the detection performance of 3 fault detection methods proposed in the literature, allowing a rigorous comparison of their performance. In the second part of this research, the proposed simulation method was extended to a multi-turbine configuration. Indeed, several multi-turbine strategies have been published in the literature with the objective of reducing the impact of environmental conditions on the performance of fault detection methods using temperature as a variable. In order to evaluate the performance gain that a multi-turbine strategy could bring, a hybrid mono-multi-turbine implementation of fault detection methods based on the power curve was first proposed. Then, the simulation framework proposed for evaluating mono-turbine methods was extended to multi-turbine approaches, and a numerical experimental analysis of the performance of this hybrid mono-multi-turbine implementation was performed
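The simulated coupling of wind speed and produced power rests on the machine's power curve, an S-shaped relation between cut-in and rated wind speed; the sketch below generates such a stream and distorts the curve to mimic a fault affecting its shape. All constants (rated power, cut-in/cut-out speeds, fault magnitude) are illustrative assumptions, not those of the thesis benchmark.

```python
import numpy as np

def power_curve(v, rated_kw=2000.0, cut_in=3.0, rated_v=12.0, cut_out=25.0):
    """Idealised S-shaped power curve (kW) as a function of wind speed (m/s)."""
    p = rated_kw / (1.0 + np.exp(-(v - (cut_in + rated_v) / 2)))
    return np.where((v < cut_in) | (v > cut_out), 0.0, p)

rng = np.random.default_rng(0)
wind = np.clip(rng.weibull(2.0, size=52560) * 8.0, 0, 30)  # one year of 10-min samples
healthy = power_curve(wind) + rng.normal(0, 30, wind.size)  # measurement noise

# Fault emulation: a downward shift of the curve above a given speed,
# the kind of shape change the detection methods are compared on
faulty = np.where(wind > 9.0, 0.85, 1.0) * healthy
print(healthy.mean().round(1), faulty.mean().round(1))
```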
Selmane, Sid Ali. "Détection et analyse des communautés dans les réseaux sociaux : approche basée sur l'analyse formelle de concepts". Thesis, Lyon 2, 2015. http://www.theses.fr/2015LYO22004.
Pełny tekst źródła
The study of community structure in networks has become an increasingly important issue. Knowledge of the core modules (communities) of networks helps us understand how they work and behave, and to assess the performance of these systems. A community in a graph (network) is defined as a set of nodes that are strongly linked to each other but weakly linked to the rest of the graph; members of the same community share the same interests. The originality of our research is to show that, unlike conventional graph-based approaches, it is relevant to use formal concept analysis for community detection. We studied several problems related to community detection in social networks: (1) the evaluation of community detection methods in the literature, (2) the detection of disjoint and overlapping communities, and (3) the modelling and analysis of heterogeneous social networks of three-dimensional data. To assess the community detection methods proposed in the literature, we first studied the state of the art, which allowed us to present a classification of community detection methods by evaluating each of the best-known methods. In the second part, we developed an approach for detecting disjoint and overlapping communities in homogeneous social networks from adjacency matrices (one-mode, one-dimensional data) by exploiting techniques from formal concept analysis. We also paid special attention to methods for modeling heterogeneous social networks, focusing in particular on three-dimensional data, for which we proposed an approach to modeling and analyzing social networks based on a methodological framework designed to better grasp the three-dimensional aspect of these data. In addition, the analysis concerns the discovery of communities and hidden relationships between the different types of individuals in these networks. The main idea lies in mining communities and triadic association rules from these heterogeneous networks in order to simplify and reduce the computational complexity of this process. The results are then used in an application recommending links and content to the individuals of a social network
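Formal concept analysis works on a binary context relating objects (e.g., users) to attributes (e.g., the interests or groups they share); a formal concept is a maximal rectangle of this relation, and communities can be read off such concepts. The brute-force sketch below enumerates all concepts of a tiny invented context; real contexts require dedicated algorithms such as NextClosure, and the thesis's community extraction goes beyond this enumeration.

```python
from itertools import combinations

# Tiny binary context: users x interests (invented for illustration)
objects = {"u1": {"sport", "music"},
           "u2": {"sport", "music", "tech"},
           "u3": {"tech"},
           "u4": {"sport", "music"}}
attributes = set().union(*objects.values())

def extent(attrs):  # objects having all the given attributes
    return frozenset(o for o, a in objects.items() if attrs <= a)

def intent(objs):   # attributes shared by all the given objects
    if not objs:
        return frozenset(attributes)
    return frozenset(set.intersection(*(objects[o] for o in objs)))

concepts = set()
for r in range(len(attributes) + 1):
    for attrs in combinations(sorted(attributes), r):
        e = extent(set(attrs))
        concepts.add((e, intent(e)))   # closure gives a formal concept
for e, i in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(e), "<->", sorted(i))
```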
Chen, Xiangtuo. "Statistical Learning Methodology to Leverage the Diversity of Environmental Scenarios in Crop Data : Application to the prediction of crop production at large-scale". Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLC055.
Pełny tekst źródła
Crop yield prediction is a paramount issue in agriculture, and considerable research relying on various methodologies has been performed with this objective. Generally, these methodologies can be classified into model-driven and data-driven approaches. The model-driven approaches are based on mechanistic crop modelling: they describe crop growth in interaction with the environment as dynamical systems. Since these models rest on a mechanistic description of biophysical processes, they potentially involve a large number of state variables and parameters, whose estimation is not straightforward. In particular, the resulting parameter estimation problems are typically non-linear, leading to non-convex optimisation problems in multi-dimensional spaces; moreover, data acquisition is very challenging and requires heavy, specific experimental work in order to obtain the appropriate data for model identification. The data-driven approaches to yield prediction, on the other hand, require data from a large number of environmental scenarios, but data that are quite easy to obtain: climatic records and final yields. However, the scope of this type of model is mostly limited to prediction purposes. An original contribution of this thesis consists in proposing a statistical methodology for the parameterisation of potentially complex mechanistic models when datasets with different environmental scenarios and large-scale production records are available, named the Multi-scenario Parameter Estimation Methodology (MuScPE). The main steps are the following: first, we take advantage of prior knowledge on the parameters to assign them relevant prior distributions, and perform a global sensitivity analysis of the model parameters to screen the most important ones, which will be estimated in priority; then, we implement an efficient non-convex optimisation method, parallel particle swarm optimisation, to search for the MAP (maximum a posteriori) estimator of the parameters; finally, we choose the best configuration regarding the number of estimated parameters by model selection criteria, because, when more parameters are estimated, the calibrated model can in theory explain more of the output variance, but the optimisation also becomes harder, which increases the uncertainty of the calibration. This methodology is first tested with the CORNFLO model, a functional crop model for corn. A second contribution of the thesis is the comparison of this model-driven method with classical data-driven methods. For this purpose, according to how they handle model complexity, we consider two classes of regression methods: first, statistical methods derived from generalized linear regression, which are good at simplifying the model by dimension reduction, such as ridge and lasso regression, principal components regression and partial least squares regression; second, machine learning regression methods such as random forest, k-nearest neighbour, artificial neural network and support vector machine (SVM) regression. Finally, a weighted regression is applied to large-scale yield prediction, taking soft wheat production in France as an example. Model-driven and data-driven approaches are also compared on their performance in achieving this goal, which can be regarded as the third contribution of this thesis
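The second MuScPE step searches for the MAP estimate with parallel particle swarm optimisation; the sketch below is a plain global-best PSO loop on a stand-in objective, with standard inertia and attraction coefficients chosen for illustration. The crop model's negative log-posterior would replace the toy quadratic; this is not the thesis's parallel implementation.

```python
import numpy as np

def pso(objective, dim, n_particles=30, iters=200, seed=0):
    """Minimal particle swarm optimisation (global-best variant)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))     # particle positions
    v = np.zeros_like(x)                           # particle velocities
    pbest, pbest_f = x.copy(), np.apply_along_axis(objective, 1, x)
    gbest = pbest[pbest_f.argmin()].copy()
    w, c1, c2 = 0.7, 1.5, 1.5                      # inertia / attraction weights
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        f = np.apply_along_axis(objective, 1, x)
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest

# Toy negative log-posterior: quadratic bowl centred at (1, -2, 3)
target = np.array([1.0, -2.0, 3.0])
print(np.round(pso(lambda p: ((p - target) ** 2).sum(), dim=3), 3))  # close to target
```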
Désoyer, Adèle. "Appariement de contenus textuels dans le domaine de la presse en ligne : développement et adaptation d'un système de recherche d'information". Thesis, Paris 10, 2017. http://www.theses.fr/2017PA100119/document.
Pełny tekst źródła
The goal of this thesis, conducted within an industrial framework, is to pair textual media content. Specifically, the aim is to pair online news articles with relevant videos for which we have a textual description. The main issue is thus a matter of textual analysis; no image or spoken-language analysis was undertaken in the present study. The question that arises is how to compare these particular objects, the texts, and what criteria to use in order to estimate their degree of similarity. We consider that one of these criteria is the topical similarity of their content, in other words, the fact that two documents have to deal with the same topic to form a relevant pair. This problem falls within the field of information retrieval (IR), which is the main strategy called upon in this research. Furthermore, when dealing with news content, the time dimension is of prime importance; to address this aspect, the field of topic detection and tracking (TDT) is also explored. The pairing system developed in this thesis comprises several complementary steps. In the first step, the system uses natural language processing (NLP) methods to index both articles and videos, in order to go beyond the traditional bag-of-words representation of texts. In the second step, two scores are calculated for an article-video pair: the first reflects their topical similarity and is based on a vector space model; the second expresses their proximity in time, based on an empirical function. At the end of the algorithm, a classification model learned from manually annotated document pairs is used to rank the results. The evaluation of the system's performance raised further questions in this doctoral research: the constraints imposed both by the data and by the specific needs of the partner company led us to adapt the evaluation protocol traditionally used in IR, namely the Cranfield paradigm. We therefore propose an alternative solution for evaluating the system that takes all our constraints into account
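The two scores described above can be prototyped in a few lines: a TF-IDF cosine for topical similarity and an exponential time decay as one possible empirical proximity function. The half-life, the multiplicative combination and the sample texts are assumptions for illustration, not the thesis's exact functions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article = ("Heavy flooding hit the region on Tuesday, forcing hundreds "
           "of residents to evacuate their homes.")
videos = [("Flood evacuation footage from the region", 0.5),   # (text, age gap in days)
          ("Annual flower show opens downtown", 0.5),
          ("Archive: last year's flood in the region", 300.0)]

vec = TfidfVectorizer().fit([article] + [t for t, _ in videos])
topical = cosine_similarity(vec.transform([article]),
                            vec.transform([t for t, _ in videos]))[0]

half_life = 3.0                                   # days; illustrative
temporal = np.array([0.5 ** (gap / half_life) for _, gap in videos])

score = topical * temporal                        # one way to combine both scores
print(np.round(score, 3), "best:", int(score.argmax()))  # best: 0 (recent flood video)
```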