Theses on the topic "Élagage de forêts aléatoires" (random forest pruning)
Create an accurate citation in APA, MLA, Chicago, Harvard and other styles
Consult the 49 best theses for your research on the topic "Élagage de forêts aléatoires".
Next to each source in the list of references there is an "Add to bibliography" button. Press this button, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.
Explore theses on a wide variety of disciplines and organise your bibliography correctly.
Cherfaoui, Farah. "Echantillonnage pour l'accélération des méthodes à noyaux et sélection gloutonne pour les représentations parcimonieuses". Electronic Thesis or Diss., Aix-Marseille, 2022. http://www.theses.fr/2022AIXM0256.
Full text: The contributions of this thesis are divided into two parts. The first part is dedicated to the acceleration of kernel methods and the second to optimization under sparsity constraints. Kernel methods are widely known and used in machine learning. However, the complexity of their implementation is high and they become unusable when the number of data points is large. We first propose an approximation of ridge leverage scores. We then use these scores to define a probability distribution for the sampling process of the Nyström method in order to speed up the kernel methods. We then propose a new kernel-based framework for representing and comparing discrete probability distributions. We then exploit the link between our framework and the maximum mean discrepancy to propose an accurate and fast approximation of the latter. The second part of this thesis is devoted to optimization with sparsity constraints for signal optimization and random forest pruning. First, we prove, under certain conditions on the coherence of the dictionary, the reconstruction and convergence properties of the Frank-Wolfe algorithm. Then, we use the OMP algorithm to reduce the size of random forests and thus the storage they require. The pruned forest consists of a subset of trees from the initial forest, selected and weighted by OMP in order to minimize its empirical prediction error.
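Since this abstract describes the pruning step quite concretely (a subset of trees selected and weighted by OMP so that the reduced forest fits the data), a minimal sketch may help. It assumes scikit-learn's RandomForestRegressor and OrthogonalMatchingPursuit, which are not named in the thesis, and is only an illustration of the idea, not the author's implementation.

# Sketch: prune a random forest by selecting/weighting trees with OMP.
# Assumes scikit-learn; illustration only, not the thesis code.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import OrthogonalMatchingPursuit

X, y = make_regression(n_samples=500, n_features=20, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Matrix of per-tree predictions: one column per tree.
P = np.column_stack([tree.predict(X) for tree in forest.estimators_])

# OMP selects a small subset of trees and weights them to fit y.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10).fit(P, y)
kept = np.flatnonzero(omp.coef_)
print(f"kept {kept.size} trees out of {forest.n_estimators}")

# Pruned-forest prediction = weighted sum of the selected trees' outputs.
y_pruned = P[:, kept] @ omp.coef_[kept] + omp.intercept_

In practice the per-tree predictions fed to OMP would be computed on held-out data rather than on the training set, to avoid rewarding overfitted trees.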
Zirakiza, Brice. "Forêts Aléatoires PAC-Bayésiennes". Thesis, Université Laval, 2013. http://www.theses.ulaval.ca/2013/29815/29815.pdf.
Full text: In this master's thesis, we first present a state-of-the-art algorithm called Random Forests, introduced by Léo Breiman. This algorithm constructs a uniformly weighted majority vote of decision trees built using the CART algorithm without pruning. Thereafter, we introduce an algorithm that we call SORF. The SORF algorithm is based on the PAC-Bayes approach, which, in order to minimize the risk of the Bayes classifier, minimizes the risk of the Gibbs classifier with a regularizer. The risk of the Gibbs classifier is indeed a convex function that upper-bounds the risk of the Bayes classifier. To find the distribution that would be optimal, the SORF algorithm reduces to a simple quadratic program minimizing the quadratic risk of the Gibbs classifier in order to find a distribution Q over the base classifiers, which are the trees of the forest. Empirical results show that SORF is generally almost as efficient as Random Forests and, in some cases, can even outperform them.
Zirakiza, Brice. "Forêts Aléatoires PAC-Bayésiennes". Master's thesis, Université Laval, 2013. http://hdl.handle.net/20.500.11794/24036.
Full text: In this master's thesis, we first present a state-of-the-art algorithm called Random Forests, introduced by Léo Breiman. This algorithm constructs a uniformly weighted majority vote of decision trees built using the CART algorithm without pruning. Thereafter, we introduce an algorithm that we call SORF. The SORF algorithm is based on the PAC-Bayes approach, which, in order to minimize the risk of the Bayes classifier, minimizes the risk of the Gibbs classifier with a regularizer. The risk of the Gibbs classifier is indeed a convex function that upper-bounds the risk of the Bayes classifier. To find the distribution that would be optimal, the SORF algorithm reduces to a simple quadratic program minimizing the quadratic risk of the Gibbs classifier in order to find a distribution Q over the base classifiers, which are the trees of the forest. Empirical results show that SORF is generally almost as efficient as Random Forests and, in some cases, can even outperform them.
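The PAC-Bayes reasoning summarised in this abstract can be written schematically. The inequality and the quadratic program below are only a generic sketch of the pattern described (majority-vote risk bounded by the Gibbs risk, then a quadratic surrogate minimised over the tree weights); the exact objective, regularizer $\Omega$ and constraints used by SORF are those defined in the thesis, not reproduced here.

$R(B_Q) \le 2\,R(G_Q)$, where $B_Q$ is the $Q$-weighted majority vote and $G_Q$ the Gibbs classifier. Writing $q_t$ for the weight of tree $h_t$ and $(x_i, y_i)$, $i=1,\dots,n$, for the training examples, the surrogate problem takes the schematic form

$\min_{q}\ \frac{1}{n}\sum_{i=1}^{n}\Big(1 - y_i \sum_{t=1}^{T} q_t\, h_t(x_i)\Big)^2 + \lambda\,\Omega(q) \quad \text{s.t.}\quad q_t \ge 0,\ \sum_{t=1}^{T} q_t = 1.$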
Scornet, Erwan. "Apprentissage et forêts aléatoires". Thesis, Paris 6, 2015. http://www.theses.fr/2015PA066533/document.
Full text: This thesis is devoted to a nonparametric estimation method called random forests, introduced by Breiman in 2001. Extensively used in a variety of areas, random forests exhibit good empirical performance and can handle massive data sets. However, the mathematical forces driving the algorithm remain largely unknown. After reviewing the theoretical literature, we focus on the link between infinite forests (theoretically analyzed) and finite forests (used in practice), aiming at narrowing the gap between theory and practice. In particular, we propose a way to select the number of trees such that the errors of finite and infinite forests are similar. On the other hand, we study quantile forests, a type of algorithm close in spirit to Breiman's forests. In this context, we prove the benefit of tree aggregation: while each tree of the quantile forest is not consistent, with a proper subsampling step, the forest is. Next, we show the connection between forests and some particular kernel estimates, which can be made explicit in some cases. We also establish upper bounds on the rate of convergence for these kernel estimates. Then we demonstrate two theorems on the consistency of both pruned and unpruned Breiman forests. We stress the importance of subsampling to demonstrate the consistency of unpruned Breiman forests. At last, we present the results of a DREAM challenge whose goal was to predict the toxicity of several compounds for several patients based on their genetic profile.
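One practical reading of "select the number of trees such that the errors of finite and infinite forests are similar" is to keep adding trees until the error estimate stabilises. The sketch below does this with scikit-learn's out-of-bag score and an arbitrary tolerance; it illustrates the idea only, not the criterion derived in the thesis.

# Sketch: grow the forest until the out-of-bag score stops changing,
# i.e. until the finite forest behaves like the "infinite" one.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=1.0, random_state=0)

forest = RandomForestRegressor(warm_start=True, oob_score=True, random_state=0)
previous, chosen = None, None
for m in range(50, 1001, 50):
    forest.set_params(n_estimators=m)
    forest.fit(X, y)                      # warm_start only adds the new trees
    if previous is not None and abs(forest.oob_score_ - previous) < 1e-3:
        chosen = m
        break
    previous = forest.oob_score_
print("number of trees retained:", chosen or 1000)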
Genuer, Robin. "Forêts aléatoires : aspects théoriques, sélection de variables et applications". PhD thesis, Université Paris Sud - Paris XI, 2010. http://tel.archives-ouvertes.fr/tel-00550989.
Poterie, Audrey. "Arbres de décision et forêts aléatoires pour variables groupées". Thesis, Rennes, INSA, 2018. http://www.theses.fr/2018ISAR0011/document.
Full text: In many problems in supervised learning, inputs have a known and/or obvious group structure. In this context, elaborating a prediction rule that takes the group structure into account can be more relevant than using an approach based only on the individual variables, for both prediction accuracy and interpretation. The goal of this thesis is to develop tree-based methods adapted to grouped variables. Here, we propose two new tree-based approaches which use the group structure to build decision trees. The first approach allows one to build binary decision trees for classification problems. A split of a node is defined according to the choice of both a splitting group and a linear combination of the inputs belonging to the splitting group. The second method, which can be used for prediction problems in both regression and classification, builds a non-binary tree in which each split is a binary tree. These two approaches build a maximal tree which is then pruned. To this end, we propose two pruning strategies, one of which is a generalization of the minimal cost-complexity pruning algorithm. Since decision trees are known to be unstable, we introduce a random forest method that deals with groups of inputs. In addition to the prediction purpose, these new methods can also be used to perform group variable selection thanks to the introduction of some measures of group importance. This thesis work is supplemented by an independent part in which we consider the unsupervised framework. We introduce a new clustering algorithm. Under some classical regularity and sparsity assumptions, we obtain the rate of convergence of the clustering risk for the proposed algorithm.
Ciss, Saïp. "Forêts uniformément aléatoires et détection des irrégularités aux cotisations sociales". Thesis, Paris 10, 2014. http://www.theses.fr/2014PA100063/document.
Full text: We present in this thesis an application of machine learning to irregularities in the case of social contributions. These are, in France, all contributions due by employees and companies to the "Sécurité sociale", the French system of social welfare (replacement income in case of unemployment, health insurance, pensions, ...). Social contributions are paid by companies to the URSSAF network, which is in charge of collecting them. Our main goal was to build a model able to detect irregularities with a low false-positive rate. We first begin the thesis by presenting the URSSAF, how irregularities can appear, how we can handle them and what data we can use. Then, we present a new machine learning algorithm we have developed for this purpose, "random uniform forests" (and its R package "randomUniformForest"), a variant of Breiman's "random Forests" (tm): they share the same principles but implement them in a different way. We present the theoretical background of the model and provide several examples. Then, we use it to show, when irregularities are fraud, how the financial situation of firms can affect their propensity for fraud. In the last chapter, we provide a full evaluation of the declarations of social contributions of all firms in Ile-de-France for the year 2013, using the model to predict whether declarations present irregularities or not.
Mourtada, Jaouad. "Contributions à l'apprentissage statistique : estimation de densité, agrégation d'experts et forêts aléatoires". Thesis, Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAX014.
Full text: Statistical machine learning is a general framework to study predictive problems, where one aims to predict unobserved quantities using examples. The first part of this thesis is devoted to random forests, a family of methods which are widely used in practice, but whose theoretical analysis has proved challenging. Our main contribution is the precise analysis of a simplified variant called Mondrian forests, for which we establish minimax nonparametric rates of convergence and an advantage of forests over trees. We also study an online variant of Mondrian forests. The second part is about prediction with expert advice, where one aims to sequentially combine different sources of predictions (experts) so as to perform almost as well as the best one in retrospect. We analyze the standard exponential weights algorithm on favorable stochastic instances, showing in particular that it exhibits some adaptivity to the hardness of the problem. We also study a variant of the problem with a growing expert class. The third part deals with regression and density estimation problems. Our first main contribution is a detailed minimax analysis of linear least squares prediction, as a function of the distribution of covariates; our upper bounds rely on a control of the lower tail of empirical covariance matrices. Our second main contribution is a general procedure for density estimation under entropy risk, which achieves optimal excess risk rates that do not degrade under model misspecification. When applied to logistic regression, this procedure has a simple form and achieves fast rates of convergence, bypassing some intrinsic limitations of plug-in estimators.
Bernard, Simon. "Forêts aléatoires : de l’analyse des mécanismes de fonctionnement à la construction dynamique". PhD thesis, Rouen, 2009. http://www.theses.fr/2009ROUES011.
Full text: This research work is related to machine learning and more particularly deals with the parametrization of Random Forests, which are classifier ensemble methods that use decision trees as base classifiers. We focus on two important parameters of the forest induction: the number of features randomly selected at each node and the number of trees. We first show that the number of random features has to be chosen with regard to the properties of the feature space, and we hence propose a new algorithm called Forest-RK that exploits those properties. We then show that a static induction process implies that some of the trees of the forest make the ensemble generalisation error decrease, by deteriorating the strength/correlation compromise. We finally propose an original random forest dynamic induction algorithm that compares favorably to static induction processes.
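The role of the number of features drawn at each node can be explored quickly with the out-of-bag score, as in the sketch below. It assumes scikit-learn and a stock dataset; Forest-RK itself, which redraws this number at each node, is not implemented here.

# Sketch: effect of the number of features drawn at each node, via OOB score.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
for k in (1, 2, 4, 8, 16, X.shape[1]):
    forest = RandomForestClassifier(n_estimators=300, max_features=k,
                                    oob_score=True, random_state=0).fit(X, y)
    print(f"max_features={k:2d}  OOB accuracy={forest.oob_score_:.3f}")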
Bernard, Simon. "Forêts Aléatoires: De l'Analyse des Mécanismes de Fonctionnement à la Construction Dynamique". PhD thesis, Université de Rouen, 2009. http://tel.archives-ouvertes.fr/tel-00598441.
Téphany, Hervé. "Modèles expérimentaux de combustion sur milieux hétérogènes aléatoires". Poitiers, 1997. http://www.theses.fr/1997POIT2309.
Goehry, Benjamin. "Prévision multi-échelle par agrégation de forêts aléatoires. Application à la consommation électrique". Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS461/document.
Full text: This thesis has two objectives. The first objective concerns the forecasting of a total load in the context of Smart Grids, using approaches based on the bottom-up forecasting method. The second objective is the study of random forests when observations are dependent, more precisely in the case of time series. In this context, we extend the consistency results of Breiman's random forests as well as the convergence rates of a simplified random forest, which had both hitherto only been established for independent and identically distributed observations. The last contribution on random forests describes a new methodology that incorporates the time-dependent structure in the construction of the forests and thus yields a gain in performance in the case of time series, illustrated with an application to the load forecasting of a building.
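As a point of comparison for the time-series setting discussed here, the sketch below shows the plain baseline of a random forest trained on lagged values of a series, with a chronological train/test split. The data are simulated, and the block-based construction proposed in the thesis is not reproduced.

# Sketch: plain random-forest forecaster on lagged values of a simulated load.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
load = np.sin(np.arange(2000) * 2 * np.pi / 48) + 0.1 * rng.standard_normal(2000)

n_lags = 48                                     # one day of half-hourly lags
X = np.column_stack([load[i:i + len(load) - n_lags] for i in range(n_lags)])
y = load[n_lags:]

split = int(0.8 * len(y))                       # keep the time order in the split
model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X[:split], y[:split])
print("test RMSE:", np.sqrt(np.mean((model.predict(X[split:]) - y[split:]) ** 2)))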
Caron, Maxime. "Données confidentielles : génération de jeux de données synthétisés par forêts aléatoires pour des variables catégoriques". Master's thesis, Université Laval, 2015. http://hdl.handle.net/20.500.11794/25935.
Full text: Confidential data are very common in statistics nowadays. One way to treat them is to create partially synthetic datasets for data sharing. We present an algorithm based on random forests to generate such datasets for categorical variables. We are interested in the formula used to make inference from multiple synthetic datasets. We show that the order of the synthesis has an impact on the estimation of the variance with the formula. We propose a variant of the algorithm inspired by differential privacy, and show that we are then not able to estimate a regression coefficient or its variance. We show the impact of synthetic datasets on structural equation modeling. One conclusion is that the synthetic dataset does not really affect the coefficients between latent variables and measured variables.
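The kind of sequential synthesis described here can be sketched as follows: each categorical variable is re-drawn from the class probabilities estimated by a forest fitted on the preceding variables. The code assumes scikit-learn and pandas, uses invented data, keeps the first column unsynthesised for simplicity, and is not the algorithm studied in the thesis.

# Sketch: sequential synthesis of categorical variables with random forests.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
data = pd.DataFrame({
    "a": rng.integers(0, 3, n),
    "b": rng.integers(0, 2, n),
    "c": rng.integers(0, 4, n),
})

synthetic = data.copy()
for j, col in enumerate(data.columns[1:], start=1):
    predictors = synthetic.iloc[:, :j]          # already-synthesised columns
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(data.iloc[:, :j], data[col])
    proba = model.predict_proba(predictors)
    # Re-draw the column from the estimated class probabilities.
    synthetic[col] = [rng.choice(model.classes_, p=p) for p in proba]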
Gregorutti, Baptiste. "Forêts aléatoires et sélection de variables : analyse des données des enregistreurs de vol pour la sécurité aérienne". Thesis, Paris 6, 2015. http://www.theses.fr/2015PA066045/document.
Full text: New recommendations require airlines to establish a safety management strategy to keep reducing the number of accidents. The flight data recorders have to be systematically analysed in order to identify, measure and monitor the risk evolution. The aim of this thesis is to propose methodological tools to answer the issue of flight data analysis. Our work revolves around two statistical topics: variable selection in supervised learning and functional data analysis. Random forests are used as they implement importance measures which can be embedded in selection procedures. First, we study the permutation importance measure when the variables are correlated. This criterion is extended to groups of variables and a new selection algorithm for functional variables is introduced. These methods are applied to the risks of long landing and hard landing, which are two important questions for airlines. Finally, we present the integration of the proposed methods in the software FlightScanner implemented by Safety Line. This new solution in air transport helps safety managers to monitor the risks and identify the contributing factors.
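Since the abstract mentions extending the permutation importance to groups of variables, a minimal sketch of a grouped permutation importance is given below: all columns of a group are permuted jointly and the increase in test error is recorded. Data and group definitions are placeholders rather than the flight-data study, and this only illustrates the principle, not the thesis's criterion.

# Sketch: grouped permutation importance (joint permutation of a group's columns).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=800, n_features=12, noise=1.0, random_state=0)
groups = {"g1": [0, 1, 2, 3], "g2": [4, 5, 6, 7], "g3": [8, 9, 10, 11]}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
base = mean_squared_error(y_te, forest.predict(X_te))

rng = np.random.default_rng(0)
for name, cols in groups.items():
    X_perm = X_te.copy()
    # Permute the rows of the whole group at once, preserving within-group links.
    X_perm[:, cols] = X_perm[rng.permutation(len(X_te))][:, cols]
    importance = mean_squared_error(y_te, forest.predict(X_perm)) - base
    print(f"{name}: {importance:.2f}")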
Rancourt, Marie-Pierre. "Programmes d'aide à l'emploi et solidarité sociale : analyse causale des effets de la participation par l'approche des forêts aléatoires". Master's thesis, Université Laval, 2020. http://hdl.handle.net/20.500.11794/67007.
Full text: In this thesis, we assess the effect of employment assistance programs on the number of exits from social assistance and the cumulative duration spent outside of it among beneficiaries living with severe constraints. It is obvious that not all beneficiaries will derive the same benefits from participating in a program, and for this reason it is useful to assess treatment effects conditional on the characteristics of each individual. To answer the research question, we need a flexible method that allows us to estimate differentiated treatment effects based on individual characteristics. To do this, we use a machine learning technique called generalized random forests (grf), which allows us to evaluate heterogeneous treatment effects by conditioning on the characteristics of individuals. We used a database provided by the Ministère du Travail, de l’Emploi et de la Solidarité sociale (MTESS) containing monthly observations of all recipients of social assistance between 1999 and 2018 in Quebec. Using the grf method and the MTESS database, we found that beneficiaries with the longest cumulative durations on social assistance had lower treatment effects than those with shorter durations. We also observed that the younger and more educated beneficiaries benefited more from program participation than the others. This is also the case for individuals who have an auditory diagnosis and those who do not have an organic diagnosis.
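To make the idea of characteristic-dependent treatment effects concrete, the sketch below estimates them with two separate forests (a simple "T-learner") on invented data. It is only a stand-in for the generalized random forests (grf) used in the thesis, whose construction is different.

# Sketch: conditional treatment effects with two forests (T-learner), toy data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 4000
age = rng.uniform(18, 65, n)
education = rng.integers(0, 4, n)
X = np.column_stack([age, education])
treated = rng.integers(0, 2, n).astype(bool)
# Simulated outcome: younger, more educated individuals benefit more.
effect = 0.5 * education - 0.02 * (age - 18)
outcome = 1.0 + 0.01 * age + effect * treated + rng.normal(0, 1, n)

m1 = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[treated], outcome[treated])
m0 = RandomForestRegressor(n_estimators=300, random_state=0).fit(X[~treated], outcome[~treated])
cate = m1.predict(X) - m0.predict(X)            # estimated individual effects
print("average effect for under-30s:", cate[age < 30].mean())
print("average effect for over-50s :", cate[age > 50].mean())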
Boucekine, Mohamed. "Caractérisation de l'effet response shift par l'approche des forêts aléatoires : application à la sclérose en plaques et à la schizophrénie". Thesis, Aix-Marseille, 2015. http://www.theses.fr/2015AIXM5062.
Full text: To assess quality of life, patients are often asked to evaluate their well-being using a self-report instrument to document patient-reported outcome (PRO) measures. The data are often collected on multiple domains, such as physical function, social health and emotional health. However, longitudinal PROs, which are collected at multiple occasions from the same individual, may be affected by adaptation or "response shift" effects, which may lead to under- or overestimation of the treatment effects. Response shift is the phenomenon by which an individual's self-evaluation of a construct changes due to a change in internal standards of measurement (recalibration), a change in values or priorities (reprioritization), or a personal redefinition of the target construct (reconceptualisation). If response shift is present in the data, the interpretation of change is altered and the conventional difference between post-test and pre-test may not be able to detect true change in PRO measures. The aim of this work is to propose an innovative method, based on the random forest method, to highlight the response shift effect.
Emprin, Gustave. "Une topologie pour les arbres labellés, application aux arbres aléatoires s-compacts". Thesis, Paris Est, 2019. http://www.theses.fr/2019PESC1032.
Full text: In this thesis, we develop a new space for the study of measured labelled metric spaces, ultimately designed to represent genealogical trees with a root at generation minus infinity. The time in the genealogical tree is represented by a 1-Lipschitz label function. We define the notion of S-compact measured labelled metric space, that is, a metric space E equipped with a measure nu and a 1-Lipschitz label function from E to R, with the additional condition that each slice (the set of points with labels in a compact of R) must be compact and have finite measure. On the space XS of measured labelled metric spaces (up to isometry), we define a distance dLGHP by comparing the slices and study the resulting metric space, which we find to be Polish. We proceed with the study of the set T of all elements of XS that are real trees in which the label function decreases at rate 1 when we go toward the root (which can be infinitely far). Each possible value of the label function corresponds to a generation in the genealogical tree. We prove that (T, dLGHP) is Polish as well. We define a number of measurable operations on T, including a way to randomly graft a forest on a tree. We use this operation to build a particular random tree generalizing Aldous' Brownian motion conditioned on its local time.
Sun, Wangru. "Modèle de forêts enracinées sur des cycles et modèle de perles via les dimères". Thesis, Sorbonne université, 2018. http://www.theses.fr/2018SORUS007/document.
Full text: The dimer model, also known as the perfect matching model, is a probabilistic model originally introduced in statistical mechanics. A dimer configuration of a graph is a subset of the edges such that every vertex is incident to exactly one edge of the subset. A weight is assigned to every edge, and the probability of a configuration is proportional to the product of the weights of the edges present. In this thesis we mainly study two related models and in particular their limiting behavior. The first one is the model of cycle-rooted spanning forests (CRSF) on tori, which are in bijection with toroidal dimer configurations via Temperley's bijection. This gives rise to a measure on CRSFs. In the limit where the size of the torus tends to infinity, the CRSF measure tends to an ergodic Gibbs measure on the whole plane. We study the connectivity property of the limiting object, prove that it is determined by the average height change of the limiting ergodic Gibbs measure and give a phase diagram. The second one is the bead model, a random point field on $\mathbb{Z}\times\mathbb{R}$ which can be viewed as a scaling limit of the dimer model on a hexagonal lattice. We formulate and prove a variational principle similar to that of the dimer model \cite{CKP01}, which states that in the scaling limit, the normalized height function of a uniformly chosen random bead configuration lies in an arbitrarily small neighborhood of a surface $h_0$ that maximizes some functional which we call entropy. We also prove that the limit shape $h_0$ is a scaling limit of the limit shapes of a properly chosen sequence of dimer models. There is a map from bead configurations to standard tableaux of a (skew) Young diagram, and the map is measure preserving if both sides take uniform measures. The variational principle of the bead model yields the existence of the limit shape of a random standard Young tableau, which generalizes the result of \cite{PR}. We also derive the existence of an arctic curve of a discrete point process that encodes the standard tableaux, raised in \cite{Rom}.
Morvan, Ludivine. "Prédiction de la progression du myélome multiple par imagerie TEP : Adaptation des forêts de survie aléatoires et de réseaux de neurones convolutionnels". Thesis, Ecole centrale de Nantes, 2021. http://www.theses.fr/2021ECDN0045.
Full text: The aim of this work is to provide a model for survival prediction and biomarker identification in the context of multiple myeloma (MM) using PET (Positron Emission Tomography) imaging and clinical data. This PhD is divided into two parts: the first part provides a model based on Random Survival Forests (RSF); the second part is based on the adaptation of deep learning to survival and to our data. The main contributions are the following: 1) Production of a model based on RSF and PET images allowing the prediction of a risk group for multiple myeloma patients. 2) Determination of biomarkers using this model. 3) Demonstration of the interest of PET radiomics. 4) Extension of the state of the art of methods for the adaptation of deep learning to a small database and small images. 5) Study of the cost functions used in survival. In addition, we are, to our knowledge, the first to investigate the use of RSFs in the context of MM and PET images, to use self-supervised pre-training with PET images, and, with a survival task, to fit the triplet cost function to survival and to fit a convolutional neural network to MM survival from PET lesions.
Jabot, Franck. "Marches aléatoires en forêt tropicale : contribution à la théorie de la biodiversité". Toulouse 3, 2009. http://thesesups.ups-tlse.fr/641/.
Full text: Tropical forests contain a huge diversity of trees, even at small spatial scales. This diversity challenges the idea that, in given environmental conditions, one species should be better suited to this particular environment and progressively exclude all other species. Ecologists have proposed various hypotheses to explain diversity maintenance. One element prevents the test of these hypotheses: the lack of robust methods to link available theories and knowledge on tropical forests to field data, so as to compare different hypotheses. This thesis thus aims at developing more efficient tests of coexistence mechanisms. It is shown that the environment filters tree communities at both the regional and local scales. This rejects, for the first time rigorously, the neutrality hypothesis, which aims at explaining the local coexistence of species by assuming their functional equivalence. This finding stimulates the development of a new dynamical model describing environmental filtering on the basis of species characteristics, such as functional traits. Applications to field data are discussed. Finally, evolutionary relationships among coexisting species contain potentially useful information on their ability to coexist. In this vein, it is shown how to integrate these evolutionary relationships in the test of the neutral theory of biodiversity. The dynamical models studied during this thesis are called, in mathematical terms, random walks. They have been mainly studied here thanks to a statistical technique called Approximate Bayesian Computation, which opens new perspectives for the study of dynamical models in ecology.
Desbordes, Paul. "Méthode de sélection de caractéristiques pronostiques et prédictives basée sur les forêts aléatoires pour le suivi thérapeutique des lésions tumorales par imagerie fonctionnelle TEP". Thesis, Normandie, 2017. http://www.theses.fr/2017NORMR030/document.
Full text: Radiomics proposes to combine image features with those extracted from other modalities (clinical, genomic, proteomic) to set up a personalized medicine in the management of cancer. From an initial exam, the objective is to anticipate the survival rate of the patient or the treatment response probability. In medicine, classical statistical methods are generally used, such as the Mann-Whitney analysis for predictive studies and the analysis of Kaplan-Meier survival curves for prognostic studies. However, the increasing number of studied features limits the use of these statistics. We have focused our work on machine learning algorithms and feature selection methods. These methods are robust to large dimensions as well as to non-linear relations between features. We propose two feature selection strategies based on random forests. Our methods allowed the selection of subsets of predictive and prognostic features on two databases (oesophagus and lung cancers). Our algorithms showed the best classification performances compared to classical statistical methods and the other feature selection strategies studied.
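A generic version of the importance-based filtering step mentioned here can be sketched with scikit-learn, as below. The thresholding rule and data are placeholders; the two selection strategies actually proposed in the thesis are not reproduced.

# Sketch: keep the features whose random-forest importance is above the median.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=60, n_informative=8,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
selector = SelectFromModel(forest, threshold="median", prefit=True)
X_selected = selector.transform(X)
print("features kept:", X_selected.shape[1], "out of", X.shape[1])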
Bouaziz, Ameni. "Méthodes d’apprentissage interactif pour la classification des messages courts". Thesis, Université Côte d'Azur (ComUE), 2017. http://www.theses.fr/2017AZUR4039/document.
Full text: Automatic short text classification is used more and more nowadays in various applications like sentiment analysis or spam detection. Short texts like tweets or SMS are more challenging than traditional texts; their classification is more difficult owing to their shortness, sparsity and lack of contextual information. We present two new approaches to improve short text classification. Our first approach is the "Semantic Forest". The first step of this approach proposes a new enrichment method that uses an external source of enrichment built in advance. The idea is to transform a short text from a few words to a larger text containing more information, in order to improve its quality before building the classification model. Contrary to the methods proposed in the literature, the second step of our approach does not use a traditional learning algorithm but proposes a new one based on the semantic links among words in the Random Forest classifier. Our second contribution is "IGLM" (Interactive Generic Learning Method). It is a new interactive approach that recursively updates the classification model by considering the new data arriving over time and by leveraging user intervention to correct misclassified data. An abstraction method is then combined with the update mechanism to improve short text quality. The experiments performed on these two methods show their efficiency and how they outperform traditional algorithms in short text classification. Finally, the last part of the thesis is a complete and argued comparative study of the two proposed methods taking into account various criteria such as accuracy, speed, etc.
Raynal, Louis. "Bayesian statistical inference for intractable likelihood models". Thesis, Montpellier, 2019. http://www.theses.fr/2019MONTS035/document.
Full text: In a statistical inferential process, when the calculation of the likelihood function is not possible, approximations need to be used. This is a fairly common case in some application fields, especially for population genetics models. To address this issue, we are interested in approximate Bayesian computation (ABC) methods. These are solely based on simulated data, which are then summarised and compared to the observed ones. The comparisons are performed depending on a distance, a similarity threshold and a set of low-dimensional summary statistics, which must be carefully chosen. In a parameter inference framework, we propose an approach combining ABC simulations and the random forest machine learning algorithm. We use different strategies depending on the parameter posterior quantity we would like to approximate. Our proposal avoids the usual ABC difficulties in terms of tuning, while providing good results and interpretation tools for practitioners. In addition, we introduce posterior measures of error (i.e., conditionally on the observed data of interest) computed by means of forests. In a model choice setting, we present a strategy based on groups of models to determine, in population genetics, which events of an evolutionary scenario are more or less well identified. All these approaches are implemented in the R package abcrf. In addition, we investigate how to build local random forests, taking into account the observation to predict during their learning phase to improve the prediction accuracy. Finally, using our previous developments, we present two case studies dealing with the reconstruction of the evolutionary history of Pygmy populations, as well as of two subspecies of the desert locust Schistocerca gregaria.
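The ABC-plus-random-forest idea summarised here (forests trained on simulated summary statistics, then applied to the observed ones) can be sketched in a few lines. The toy model, prior and statistics below are invented, and the sketch uses scikit-learn rather than the thesis's R package abcrf.

# Sketch: ABC random forest for parameter inference on a toy model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_sim, n_obs = 10000, 50

theta = rng.uniform(0, 10, n_sim)                     # draws from the prior
sims = rng.normal(theta[:, None], 1.0, (n_sim, n_obs))
stats = np.column_stack([sims.mean(axis=1), sims.var(axis=1),
                         np.median(sims, axis=1)])    # summary statistics

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(stats, theta)

observed = rng.normal(4.2, 1.0, n_obs)                # pretend field data
obs_stats = np.array([[observed.mean(), observed.var(), np.median(observed)]])
print("posterior expectation estimate:", forest.predict(obs_stats)[0])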
Etourneau, Thomas. "Les forêts Lyman alpha du relevé eBOSS : comprendre les fonctions de corrélation et les systématiques". Thesis, université Paris-Saclay, 2020. http://www.theses.fr/2020UPASP029.
Full text: This PhD thesis is part of the eBOSS and DESI projects. These projects, among other tracers, use the Lyman-α (Lyα) absorption to probe the matter distribution in the universe and measure the baryon acoustic oscillations (BAO) scale. The measurement of the ratio of the BAO scale to the sound horizon allows one to constrain the expansion of the universe and thus the ΛCDM model, the standard model of cosmology. This thesis presents the development of mock data sets used in order to check the BAO analyses carried out by the Lyα group within the eBOSS and DESI collaborations. These mocks make use of Gaussian random fields (GRF). GRFs allow one to generate a density field δ. From this density field, quasar (QSO) positions are drawn. From each quasar, a line of sight is constructed. Then, the density field δ is interpolated along each line of sight. Finally, the fluctuating Gunn-Peterson approximation (FGPA) is used to convert the interpolated density into the optical depth τ, and then into the Lyα absorption. Thanks to a program developed by the DESI community, a continuum is added to each Lyα forest in order to produce synthetic quasar spectra. The mocks presented in the manuscript provide a survey of quasars whose Lyα forests have the correct Lyα×Lyα auto-correlation and Lyα×QSO cross-correlation, as well as the correct QSO×QSO and HCD×HCD (High Column Density systems) auto-correlation functions. The study of these mocks shows that the BAO analysis run on the whole Lyα eBOSS data set produces an unbiased measurement of the BAO parameters α∥ and α⊥. In addition, the analysis of the model used to fit the correlation functions shows that the shape of the Lyα×Lyα auto-correlation, which is linked to the bias bLyα and the redshift space distortions (RSD) parameter βLyα, is understood up to 80%. The systematics affecting the measurement of the Lyα parameters (bLyα and βLyα) come from two different effects. The first one originates from the distortion matrix, which does not capture all the distortions produced by the quasar continuum fitting procedure. The second one is linked to the HCD modelling: the modelling of these strong absorbers is not perfect and affects the measurement of the Lyα parameters, especially the RSD parameter βLyα. Thus, the analysis of these mocks validates the control of systematics in the BAO analyses done with the Lyα. However, a better understanding of the measurement of the Lyα parameters is required in order to consider using the Lyα, which means combining the Lyα×Lyα auto-correlation and the Lyα×QSO cross-correlation, to do an RSD analysis.
Toussile, Wilson. "Sélection de variable : structure génétique d'une population et transmission de Plasmodium à travers le moustique". PhD thesis, Université Paris Sud - Paris XI, 2010. http://tel.archives-ouvertes.fr/tel-00553674.
Pisetta, Vincent. "New Insights into Decision Trees Ensembles". Thesis, Lyon 2, 2012. http://www.theses.fr/2012LYO20018/document.
Full text: Decision tree ensembles are among the most popular tools in machine learning. Nevertheless, their theoretical properties as well as their empirical performances are still the subject of active investigation. In this thesis, we propose to shed light on these methods. More precisely, after having described the current theoretical aspects of three main ensemble schemes (chapter 1), we give an analysis supporting the existence of common reasons for the success of these three principles (chapter 2). The latter takes into account the first two moments of the margin as an essential ingredient to obtain strong learning abilities. Starting from this observation, we propose a new ensemble algorithm called OSS (Oriented Sub-Sampling) whose steps are in perfect accordance with the point of view we introduce. The empirical performances of OSS are superior to those of currently popular algorithms such as Random Forests and AdaBoost. In a third chapter (chapter 3), we analyze Random Forests by adopting a "kernel" point of view. This allows us to understand and observe the underlying regularization mechanism of these kinds of methods. Adopting the kernel point of view also enables us to improve the predictive performance of Random Forests using popular post-processing techniques such as SVM and multiple kernel learning. In conjunction with Random Forests, they show greatly improved performances and are able to prune the ensemble by conserving only a small fraction of the initial base learners.
Dramé, Ibrahima. "Processus d'exploration des arbres aléatoires en temps continu à branchement non binaire : limite en grande population". Thesis, Aix-Marseille, 2017. http://www.theses.fr/2017AIXM0110.
Full text: In this thesis, we study the convergence of the exploration process of the non-binary tree associated with a continuous-time branching process, in the limit of a large population. In the first part, we give a precise description of the exploration process of the non-binary tree. We then describe a bijection between exploration processes and Galton-Watson non-binary trees. After some renormalization, we present the results of convergence of the population process and the exploration process, in the limit of a large population. In the second part, we first establish the convergence of the population process to a continuous-state branching process (CSBP) with jumps. We then show the convergence of the (rescaled) exploration process of the corresponding genealogical tree towards the continuous height process recently defined by Li, Pardoux and Wakolbinger. In the last part, we consider a population model with interaction defined with a more general non-linear function $f$. We proceed to a renormalization of the model parameters and obtain in the limit a generalized CSBP. We then renormalize the height process of the associated genealogical tree, and take the weak limit as the size of the population tends to infinity.
Beguet, Benoît. "Caractérisation et cartographie de la structure forestière à partir d'images satellitaires à très haute résolution spatiale". Thesis, Bordeaux 3, 2014. http://www.theses.fr/2014BOR30041/document.
Full text: Very High spatial Resolution (VHR) images like Pléiades imagery (50 cm panchromatic, 2 m multispectral) allow a detailed description of forest structure (tree distribution and size) at the stand level, by exploiting the spatial relationship between tree structure and image texture when the pixel size is smaller than the tree dimensions. This information meets the strong expected need for spatial inventory of forest resources at the stand level and of its changes due to forest management, land use or catastrophic events. The aim is twofold: (1) assess the potential of VHR satellite images to estimate the main variables of forest structure from the image texture: crown diameter, stem diameter, height, density or tree spacing; (2) on these bases, perform a pixel-based image classification of forest structure in order to produce the finest possible spatial information. The main developments concern parameter optimization, variable selection, multivariate regression modelling and ensemble-based classification (Random Forests). They are tested and evaluated on the Landes maritime pine forest with three Pléiades images and a QuickBird image acquired under different conditions (season, sun angle, view angle). The method is generic. The robustness of the proposed method to image acquisition parameters is evaluated. Results show that fine variations of texture characteristics related to those of forest structure are clearly identifiable. Performances in terms of forest variable estimation (RMSE: ~1.1 m for crown diameter, ~3 m for tree height and ~0.9 m for tree spacing), as well as forest structure mapping (~82% overall accuracy for the classification of the five main forest structure classes), are satisfactory from an operational perspective. Their application to multi-annual images will assess their ability to detect and map forest changes such as clear cuts, urban sprawl or storm damage.
Ospina, Arango Juan David. "Predictive models for side effects following radiotherapy for prostate cancer". Thesis, Rennes 1, 2014. http://www.theses.fr/2014REN1S046/document.
Full text: External beam radiotherapy (EBRT) is one of the cornerstones of prostate cancer treatment. The objectives of radiotherapy are, firstly, to deliver a high dose of radiation to the tumor (prostate and seminal vesicles) in order to achieve maximal local control and, secondly, to spare the neighboring organs (mainly the rectum and the bladder) to avoid normal tissue complications. Normal tissue complication probability (NTCP) models are then needed to assess the feasibility of the treatment and inform the patient about the risk of side effects, to derive dose-volume constraints and to compare different treatments. In the context of EBRT, the objectives of this thesis were to find predictors of bladder and rectal complications following treatment; to develop new NTCP models that allow for the integration of both dosimetric and patient parameters; to compare the predictive capabilities of these new models to the classic NTCP models; and to develop new methodologies to identify dose patterns correlated with normal tissue complications following EBRT for prostate cancer treatment. A large cohort of patients treated by conformal EBRT for prostate cancer under several prospective French clinical trials was used for the study. In a first step, the incidence of the main genitourinary and gastrointestinal symptoms has been described. With another classical approach, namely logistic regression, some predictors of genitourinary and gastrointestinal complications were identified. The logistic regression models were then graphically represented to obtain nomograms, a graphical tool that enables clinicians to rapidly assess the complication risks associated with a treatment and to inform patients. This information can be used by patients and clinicians to select a treatment among several options (e.g. EBRT or radical prostatectomy). In a second step, we proposed the use of random forests, a machine-learning technique, to predict the risk of complications following EBRT for prostate cancer. The superiority of the random forest NTCP model, assessed by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, was established. In a third step, the 3D dose distribution was studied. A 2D population value decomposition (PVD) technique was extended to a tensorial framework to be applied to 3D volume image analysis. Using this tensorial PVD, a population analysis was carried out to find a pattern of dose possibly correlated with a normal tissue complication following EBRT. Also in the context of 3D image population analysis, a spatio-temporal nonparametric mixed-effects model was developed. This model was applied to find an anatomical region where the dose could be correlated with a normal tissue complication following EBRT.
Benoumechiara, Nazih. "Traitement de la dépendance en analyse de sensibilité pour la fiabilité industrielle". Thesis, Sorbonne université, 2019. http://www.theses.fr/2019SORUS047.
Full text: Structural reliability studies use probabilistic approaches to quantify the risk of an accidental event occurring. The dependence between the random input variables of a model can have a significant impact on the results of the reliability study. This thesis contributes to the treatment of dependence in structural reliability studies. The two main topics covered in this document are sensitivity analysis for dependent variables when the dependence is known, and the assessment of a reliability risk when the dependence is unknown. First, we propose an extension of the permutation-based importance measures of the random forest algorithm to the case of dependent data. We also adapt the Shapley index estimation algorithm, used in game theory, to take into account the index estimation error. Secondly, in the case where the dependence structure is unknown, we propose a conservative estimate of the reliability risk based on dependence modelling to determine the most penalizing dependence structure. The proposed methodology is applied to an example of structural reliability to obtain a conservative estimate of the risk.
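For reference, the standard (unconditional) permutation importance that this thesis extends to dependent inputs looks as follows with scikit-learn's permutation_importance helper; the dependent-data extension and the Shapley-index estimation discussed above are not reproduced here.

# Sketch: standard permutation importance of a random forest on held-out data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=8, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
result = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
for j, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"x{j}: {mean:.3f} +/- {std:.3f}")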
Jouganous, Julien. "Modélisation et simulation de la croissance de métastases pulmonaires". Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0154/document.
Full text: This thesis deals with the mathematical modeling and simulation of lung metastasis growth. We first present a partial differential equation model to simulate the growth, and possibly the response to some types of treatments, of metastases to the lung. This model must be personalized to be used on individual clinical cases. Consequently, we developed a calibration technique based on medical images of the tumor. Several applications on clinical cases are presented. Then we introduce a simplification of the first model and of the calibration algorithm. This new method, which is more robust, is tested on 36 clinical cases. The results are presented in the third chapter. To finish, a machine learning algorithm
Fouemkeu, Norbert. "Modélisation de l'incertitude sur les trajectoires d'avions". PhD thesis, Université Claude Bernard - Lyon I, 2010. http://tel.archives-ouvertes.fr/tel-00710595.
Feng, Wei. "Investigation of training data issues in ensemble classification based on margin concept : application to land cover mapping". Thesis, Bordeaux 3, 2017. http://www.theses.fr/2017BOR30016/document.
Full text: Classification has been widely studied in machine learning. Ensemble methods, which build a classification model by integrating multiple component learners, achieve higher performances than a single classifier. The classification accuracy of an ensemble is directly influenced by the quality of the training data used. However, real-world data often suffer from class noise and class imbalance problems. Ensemble margin is a key concept in ensemble learning. It has been applied to both the theoretical analysis and the design of machine learning algorithms. Several studies have shown that the generalization performance of an ensemble classifier is related to the distribution of its margins on the training examples. This work focuses on exploiting the margin concept to improve the quality of the training set, and therefore increase the classification accuracy of noise-sensitive classifiers, and to design effective ensemble classifiers that can handle imbalanced datasets. A novel ensemble margin definition is proposed. It is an unsupervised version of a popular ensemble margin: indeed, it does not involve the class labels. Mislabeled training data is a challenge to face in order to build a robust classifier, whether it is an ensemble or not. To handle the mislabeling problem, we propose an ensemble margin-based class noise identification and elimination method based on an existing margin-based class noise ordering. This method can achieve a high mislabeled instance detection rate while keeping the false detection rate as low as possible. It relies on the margin values of misclassified data, considering four different ensemble margins, including the novel proposed margin. This method is extended to tackle class noise correction, which is a more challenging issue. The instances with low margins are more important than safe samples, which have high margins, for building a reliable classifier. A novel bagging algorithm based on a data importance evaluation function relying again on the ensemble margin is proposed to deal with the class imbalance problem. In our algorithm, the emphasis is placed on the lowest-margin samples. This method is evaluated, again using four different ensemble margins, in addressing the imbalance problem, especially on multi-class imbalanced data. In remote sensing, where training data are typically ground-based, mislabeled training data is inevitable. Imbalanced training data is another problem frequently encountered in remote sensing. Both proposed ensemble methods, involving the best margin definition for handling these two major training data issues, are applied to the mapping of land covers.
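A minimal sketch of the margin-based screening of possibly mislabeled training points described here is given below: instances that the forest classifies against their given label with a strongly negative margin are flagged. The four margin definitions compared in the thesis are not reproduced; the sketch uses the usual supervised vote margin and injected label noise on toy data.

# Sketch: flag low-margin, misclassified training points as possible label noise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
noisy = rng.choice(len(y), size=25, replace=False)
y_noisy = y.copy()
y_noisy[noisy] = 1 - y_noisy[noisy]               # inject label noise

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y_noisy)
proba = forest.predict_proba(X)[np.arange(len(y_noisy)), y_noisy]
margin = 2 * proba - 1                            # binary vote margin for the given label
flagged = np.where(margin < -0.2)[0]
print("flagged", len(flagged), "instances;",
      len(set(flagged) & set(noisy)), "are truly mislabeled")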
Wallard, Henri. "Analyse des leviers : effets de colinéarité et hiérarchisation des impacts dans les études de marché et sociales". Thesis, Paris, CNAM, 2015. http://www.theses.fr/2015CNAM1019/document.
Full text: Linear regression is used in market research but faces difficulties due to multicollinearity. Other methods have been considered. A demonstration of the equality between the lmg-Shapley and Johnson methods for variance decomposition has been proposed. This research has also shown that the decomposition proposed by Fabbris is not identical to those proposed by Genizi and Johnson, and that the CAR scores of two predictors do not equalize when their correlation tends towards 1. A new method, weifila (weighted first last), has been proposed and published in 2015. We have also shown that permutation importance using Random Forests makes it possible to take into account non-linear relationships and deserves broader usage in marketing research. Regarding Bayesian networks, there are multiple solutions available, and expert-driven restrictions and decisions support the recommendation to be careful in their usage and presentation, even if they allow one to explore possible structures and run simulations. In the end, weifila or random forests are recommended instead of lmg-Shapley, knowing that the benefit of structural and conceptual models should not be underestimated. Keywords: linear regression, variable importance, Shapley value, random forests, Bayesian networks.
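The lmg-Shapley decomposition of R^2 discussed in this abstract can be computed directly for a handful of predictors, as in the sketch below: each predictor's share is its incremental R^2 averaged over all orderings. The data are simulated; the equivalence with the Johnson weights and the weifila method are not re-derived here.

# Sketch: lmg-Shapley decomposition of R^2 by averaging over all orderings.
from itertools import permutations
from math import factorial
import numpy as np

def r2(cols, X, y):
    # R^2 of the OLS regression of y on the listed columns (0 if no column).
    if not cols:
        return 0.0
    Z = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return 1.0 - ((y - Z @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.normal(size=(n, p))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]           # make two predictors correlated
y = X @ np.array([1.0, 0.5, 0.0, 0.3]) + rng.normal(size=n)

shares = np.zeros(p)
for order in permutations(range(p)):
    for k, j in enumerate(order):
        shares[j] += r2(list(order[:k + 1]), X, y) - r2(list(order[:k]), X, y)
shares /= factorial(p)
print("lmg shares:", np.round(shares, 3), "total R^2:", round(shares.sum(), 3))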
Taillardat, Maxime. "Méthodes Non-Paramétriques de Post-Traitement des Prévisions d'Ensemble". Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLV072/document.
Full text: In numerical weather prediction, ensemble forecast systems have become an essential tool to quantify forecast uncertainty and to provide probabilistic forecasts. Unfortunately, these models are not perfect and a simultaneous correction of their bias and their dispersion is needed. This thesis presents new statistical post-processing methods for ensemble forecasting. These are based on random forest algorithms, which are non-parametric. Contrary to state-of-the-art procedures, random forests can take into account non-linear features of atmospheric states. They easily allow the addition of covariables (such as other weather variables, seasonal or geographic predictors) through a self-selection of the most useful predictors for the regression. Moreover, we do not make assumptions on the distribution of the variable of interest. This new approach outperforms the existing methods for variables such as surface temperature and wind speed. For variables well known to be tricky to calibrate, such as six-hour accumulated rainfall, hybrid versions of our techniques have been created. We show that these versions (and our original methods) are better than existing ones. In particular, they provide added value for extreme precipitation. The last part of this thesis deals with the verification of ensemble forecasts for extreme events. We have shown several properties of the Continuous Ranked Probability Score (CRPS) for extreme values. We have also defined a new index combining the CRPS and extreme value theory, whose consistency is investigated on both simulations and real cases. The contributions of this work are intended to be inserted into the forecasting and verification chain at Météo-France.
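To make the post-processing idea concrete, the sketch below turns a forest into a probabilistic forecast by taking empirical quantiles of the per-tree predictions. This is only a crude stand-in for the quantile regression forests used in the thesis; the raw ensemble statistics and observations below are simulated.

# Sketch: calibrated predictive quantiles from per-tree forest predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 3000
raw_mean = rng.normal(15, 5, n)                   # raw ensemble mean (simulated)
raw_spread = rng.uniform(0.5, 3, n)               # raw ensemble spread (simulated)
X = np.column_stack([raw_mean, raw_spread])
observed = raw_mean + 1.5 + raw_spread * rng.standard_normal(n)

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X[:2000], observed[:2000])
per_tree = np.column_stack([t.predict(X[2000:]) for t in forest.estimators_])
q10 = np.quantile(per_tree, 0.10, axis=1)
q90 = np.quantile(per_tree, 0.90, axis=1)
coverage = np.mean((observed[2000:] >= q10) & (observed[2000:] <= q90))
print("empirical coverage of the 10-90% band:", round(coverage, 3))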
Duroux, Roxane. "Inférence pour les modèles statistiques mal spécifiés, application à une étude sur les facteurs pronostiques dans le cancer du sein". Thesis, Paris 6, 2016. http://www.theses.fr/2016PA066224/document.
Full text: The thesis focuses on inference for misspecified statistical models. Every result finds its application in a prognostic factors study for breast cancer, thanks to the data collection of Institut Curie. We first consider non-proportional hazards models, and make use of the marginal survival of the failure time. This model allows a time-varying regression coefficient, and therefore generalizes the proportional hazards model. Secondly, we study step regression models. We propose an inference method for the changepoint of a two-step regression model, and an estimation method for a multiple-step regression model. Then, we study the influence of the subsampling rate on the performance of median forests and try to extend the results to random survival forests through an application. Finally, we present a new dose-finding method for phase I clinical trials in the case of partial ordering.
Elghazel, Wiem. "Wireless sensor networks for Industrial health assessment based on a random forest approach". Thesis, Besançon, 2015. http://www.theses.fr/2015BESA2055/document.
Full text: Efficient predictive maintenance is based on the reliability of the monitoring data. In some cases, the monitoring activity cannot be ensured with individual or wired sensors. Wireless sensor networks (WSN) are then an alternative. Considering the wireless communication, data loss becomes highly probable. Therefore, we study certain aspects of WSN reliability. We propose a distributed algorithm for network resiliency and data survival while optimizing energy consumption. This fault-tolerant algorithm reduces the risks of data loss and ensures the continuity of data transfer. We also simulated different network topologies in order to evaluate their impact on data completeness at the sink level. Thereafter, we propose an approach to evaluate the system's state of health using the random forest algorithm. In an offline phase, the random forest algorithm selects the parameters holding the most information about the system's health state. These parameters are used to construct the decision trees that make up the forest. By injecting the random aspect into the training set, the algorithm (the trees) will have different starting points. In an online phase, the algorithm evaluates the current health state using the sensor data. Each tree provides a decision, and the final class is the result of the majority vote of all trees. When sensors start to break down, the data describing a health indicator become incomplete or unavailable. Considering that the trees have different starting points, the absence of some data will not necessarily result in the interruption of the prediction process.
Wallard, Henri. "Analyse des leviers : effets de colinéarité et hiérarchisation des impacts dans les études de marché et sociales". Electronic Thesis or Diss., Paris, CNAM, 2015. http://www.theses.fr/2015CNAM1019.
Texto completoLinear regression is used in market research but faces difficulties due to multicollinearity, so other methods have been considered. A demonstration of the equality between the lmg-Shapley and Johnson methods for variance decomposition has been proposed. This research has also shown that the decomposition proposed by Fabbris is not identical to those proposed by Genizi and Johnson, and that the CAR scores of two predictors do not equalize when their correlation tends towards 1. A new method, weifila (weighted first last), has been proposed and published in 2015. We have also shown that permutation importance computed with random forests takes non-linear relationships into account and deserves broader usage in marketing research. Regarding Bayesian networks, multiple solutions are available, and the need for expert-driven restrictions and decisions supports the recommendation to be careful in their usage and presentation, even if they allow possible structures to be explored and simulations to be run. In the end, weifila or random forests are recommended instead of lmg-Shapley, keeping in mind that the benefit of structural and conceptual models should not be underestimated. Keywords: Linear regression, Variable Importance, Shapley Value, Random Forests, Bayesian Networks
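A minimal sketch of the permutation importance recommended here, computed with a random forest on synthetic data (scikit-learn's permutation_importance; illustrative only, not the thesis study):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic "drivers analysis" data: 6 candidate drivers, 3 of them informative
X, y = make_regression(n_samples=1000, n_features=6, n_informative=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

rf = RandomForestRegressor(n_estimators=300, random_state=1).fit(X_tr, y_tr)

# importance = drop in score when a driver's values are randomly permuted
imp = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=1)
for i, (m, s) in enumerate(zip(imp.importances_mean, imp.importances_std)):
    print(f"driver {i}: importance {m:.3f} +/- {s:.3f}")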
Duhalde, Jean-Pierre. "Sur des propriétés fractales et trajectorielles de processus de branchement continus". Thesis, Paris 6, 2015. http://www.theses.fr/2015PA066029/document.
Texto completoThis thesis investigates some fractal and pathwise properties of branching processes with continuous time and state space. Informally, this kind of process can be described by considering the evolution of a population where individuals reproduce and die randomly over time. The first chapter deals with the class of continuous branching processes with immigration. We provide a semi-explicit formula for the hitting times and a necessary and sufficient condition for the process to be recurrent or transient. These two results illustrate the competition between branching and immigration. The second chapter deals with the Brownian tree and its local time measures: the level-set measures. We show that they can be obtained as the restriction, with an explicit multiplicative constant, of a Hausdorff measure on the tree. The result holds uniformly for all levels. The third chapter studies the super-Brownian motion associated with a general branching mechanism. Its total occupation measure is obtained as the restriction to the total range of a given packing measure on Euclidean space. The result is valid in large dimensions. The condition on the dimension is discussed by computing the packing dimension of the total range. This is done under a weak assumption on the regularity of the branching mechanism
Geremia, Ezequiel. "Spatial random forests for brain lesions segmentation in MRIs and model-based tumor cell extrapolation". Phd thesis, Université Nice Sophia Antipolis, 2013. http://tel.archives-ouvertes.fr/tel-00838795.
Texto completoTaillardat, Maxime. "Méthodes Non-Paramétriques de Post-Traitement des Prévisions d'Ensemble". Electronic Thesis or Diss., Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLV072.
Texto completoIn numerical weather prediction, ensemble forecast systems have become an essential tool to quantify forecast uncertainty and to provide probabilistic forecasts. Unfortunately, these models are not perfect, and a simultaneous correction of their bias and their dispersion is needed. This thesis presents new statistical post-processing methods for ensemble forecasting, based on random forest algorithms, which are non-parametric. Contrary to state-of-the-art procedures, random forests can take into account non-linear features of atmospheric states. They easily allow the addition of covariates (such as other weather variables, seasonal or geographic predictors) through a self-selection of the most useful predictors for the regression. Moreover, we make no assumptions on the distribution of the variable of interest. This new approach outperforms existing methods for variables such as surface temperature and wind speed. For variables well known to be difficult to calibrate, such as six-hour accumulated rainfall, hybrid versions of our techniques have been created. We show that these versions (and our original methods) are better than existing ones; in particular, they provide added value for extreme precipitation. The last part of this thesis deals with the verification of ensemble forecasts for extreme events. We show several properties of the Continuous Ranked Probability Score (CRPS) for extreme values. We also define a new index combining the CRPS and extreme value theory, whose consistency is investigated on both simulations and real cases. The contributions of this work are intended to be inserted into the forecasting and verification chain at Météo-France
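For reference, the empirical CRPS of an M-member ensemble against an observation y can be computed with the standard formula mean|x_i - y| - 0.5 * mean|x_i - x_j|; the following short Python sketch (illustrative only, not the Météo-France verification chain) implements it:

import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of an ensemble `members` against a scalar observation `obs`."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - obs))                              # mean |x_i - y|
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))  # 0.5 * mean |x_i - x_j|
    return term1 - term2

# e.g. a 35-member 2 m temperature forecast (synthetic values)
ensemble = np.random.default_rng(0).normal(loc=15.0, scale=2.0, size=35)
print(f"CRPS = {crps_ensemble(ensemble, obs=14.2):.3f}")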
De, Moliner Anne. "Estimation robuste de courbes de consommmation électrique moyennes par sondage pour de petits domaines en présence de valeurs manquantes". Thesis, Bourgogne Franche-Comté, 2017. http://www.theses.fr/2017UBFCK021/document.
Texto completoIn this thesis, we address the problem of robust estimation of mean or total electricity consumption curves by sampling in a finite population, both for the entire population and for small areas. We are also interested in estimating mean curves by sampling in the presence of partially missing trajectories. Indeed, many studies carried out by the French electricity company EDF, for marketing or power grid management purposes, are based on the analysis of mean or total electricity consumption curves at a fine time scale, for different groups of clients sharing common characteristics. Because of privacy issues and financial costs, it is not possible to measure the electricity consumption curve of each customer, so these mean curves are estimated from samples. In this thesis, we extend the work of Lardin (2012) on mean curve estimation by sampling by focusing on specific aspects of this problem such as robustness to influential units, small area estimation, and estimation in the presence of partially or totally unobserved curves. In order to build robust estimators of mean curves, we adapt the unified approach to robust estimation in finite populations proposed by Beaumont et al. (2013) to the context of functional data. To that end, we propose three approaches: application of the usual method for real variables to discretised curves, projection onto functional spherical principal components or onto a wavelet basis, and functional truncation of conditional biases based on the notion of depth. These methods are tested and compared on real datasets, and mean squared error estimators are also proposed. Secondly, we address the problem of small area estimation for functional means or totals. We introduce three methods: a unit-level linear mixed model applied to the scores of functional principal component analysis or to wavelet coefficients, functional regression, and aggregation of individual curve predictions by functional regression trees or functional random forests. Robust versions of these estimators are then proposed by following the approach to robust estimation based on conditional biases presented before. Finally, we suggest four estimators of mean curves by sampling in the presence of partially or totally unobserved trajectories. The first is a reweighting estimator where the weights are determined using non-parametric temporal kernel smoothing adapted to the context of finite populations and missing data; the other three rely on imputation of missing data. The missing parts of the curves are reconstructed either by using the smoothing estimator presented before, by nearest-neighbour imputation adapted to functional data, or by a variant of linear interpolation which takes into account the mean trajectory of the entire sample. Variance approximations are proposed for each method and all the estimators are compared on real datasets for various missing data scenarios
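A hedged, much simplified sketch of the nearest-neighbour imputation idea for partially observed curves, followed by a design-weighted mean curve (synthetic curves and weights; not the EDF data or the refined estimators of the thesis):

import numpy as np

rng = np.random.default_rng(0)
T, n = 48, 200                                    # half-hourly daily curve, sample size
curves = np.sin(np.linspace(0, 2 * np.pi, T)) + rng.normal(0, 0.3, size=(n, T))
weights = np.full(n, 5000 / n)                    # sampling weights (population size / sample size)

# simulate partially missing trajectories: a 6-hour gap in 40 curves
mask = np.ones_like(curves, dtype=bool)
for i in rng.choice(n, size=40, replace=False):
    start = rng.integers(0, T - 12)
    mask[i, start:start + 12] = False

imputed = curves.copy()
complete = np.where(mask.all(axis=1))[0]
for i in np.where(~mask.all(axis=1))[0]:
    obs = mask[i]
    # nearest complete curve, with distance computed on the observed instants only
    d = np.linalg.norm(curves[complete][:, obs] - curves[i, obs], axis=1)
    donor = complete[np.argmin(d)]
    imputed[i, ~obs] = curves[donor, ~obs]

mean_curve = np.average(imputed, axis=0, weights=weights)  # design-weighted mean curve
print(mean_curve[:5])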
Chaibou, Salaou Mahaman Sani. "Segmentation d'image par intégration itérative de connaissances". Thesis, Ecole nationale supérieure Mines-Télécom Atlantique Bretagne Pays de la Loire, 2019. http://www.theses.fr/2019IMTA0140.
Texto completoImage processing has been a very active area of research for years. The interpretation of images is one of its most important branches because of its socio-economic and scientific applications. However, interpretation, like most image processing tasks, requires a segmentation phase to delimit the regions to be analyzed. In fact, interpretation is a process that gives meaning to the regions detected by the segmentation phase; thus, the interpretation phase can only analyze the regions detected during segmentation. Although the ultimate objective of automatic interpretation is to produce the same result as a human, the logic of classical techniques in this field does not match that of human interpretation. Most conventional approaches to this task separate the segmentation phase from the interpretation phase: the images are first segmented and then the detected regions are interpreted. In addition, conventional segmentation techniques scan images sequentially, in the order in which the pixels appear. This does not necessarily reflect the way an expert explores an image. Indeed, a human usually starts by scanning the image for possible regions of interest. When he finds a potential area, he analyzes it from three viewpoints, trying to recognize what object it is. First, he analyzes the area on the basis of its physical characteristics. Then he considers the region's surroundings, and finally he zooms out to the whole image in order to have a wider view while considering both the information local to the region and that of its neighbors. Beyond the information gathered directly from the physical characteristics of the image, the expert uses several sources of information that he merges to interpret the image; these sources include knowledge acquired through professional experience, known constraints between objects in the images, and so on. The idea of the approach proposed in this manuscript is that simulating the visual activity of the expert would allow a better compatibility between the results of the interpretation and those of the expert. From the analysis of the expert's behavior, we retain three important aspects of the image interpretation process that we model in this work: 1. Unlike what most segmentation techniques suggest, the segmentation process is not necessarily sequential, but rather a series of decisions, each of which may question the results of its predecessors; the main objective is to produce the best possible classification of the regions. 2. The process of characterizing a region of interest is not one-way: the expert can go from a local view restricted to the region of interest to a wider view including its neighbors, and vice versa. 3. Several information sources are gathered and merged for greater certainty when deciding how to characterize a region. The proposed model of these three levels places particular emphasis on the knowledge used and the reasoning behind image segmentation
Fouemkeu, Norbert. "Modélisation de l’incertitude sur les trajectoires d’avions". Thesis, Lyon 1, 2010. http://www.theses.fr/2010LYO10217/document.
Texto completoIn this thesis we propose probabilistic and statistical models based on multidimensional data for forecasting the uncertainty on aircraft trajectories. Assuming that during the flight the aircraft follows the 3D trajectory contained in its initial flight plan, we use all the characteristics of the flight environment as predictors to explain the crossing time of the aircraft at given points on its planned trajectory. These characteristics are: weather and atmospheric conditions, current flight parameters, information contained in the flight plans, and air traffic complexity. In this study, the dependent variable is the difference between the actual time observed during the flight and the planned time at which the aircraft was due to cross the planned trajectory points; this variable is called the temporal difference. We built four models using methods based on recursive partitioning of the sample. The first, called classical CART, is based on Breiman's CART method. Here, we use regression trees to build a typology of points on aircraft trajectories based on the previous characteristics and to forecast the crossing time of aircraft at these points. The second model, called amended CART, improves on the previous one: the forecast given by the mean of the dependent variable inside each terminal node of classical CART is replaced by a forecast given by a multiple regression fitted inside that node. This new model, developed using a stepwise algorithm, is parsimonious because for each terminal node it explains the flight time with the most relevant predictors within the node. The third model is built with the MARS (multivariate adaptive regression splines) method. Besides the continuity of the estimator of the dependent variable, this model allows the direct and interaction effects of the explanatory variables on the crossing time at flight trajectory points to be assessed. The fourth model uses bootstrap sampling: it is a random forest, where for each bootstrap sample drawn from the initial data a regression tree is built as in the CART method, and the overall forecast is obtained by aggregating the forecasts over the set of trees. Despite the overfitting observed with this model, it is robust and addresses the instability of the regression trees obtained with CART. The models we built have been assessed and validated on test data. Using them to forecast the sector load, in terms of the number of aircraft entering the sector, showed that a forecast horizon of about 20 minutes, with a time interval larger than 20 minutes, yields forecasts with relative errors below 10%. Among these models, classical CART and random forests are the most powerful; hence, for the regulatory authority, these models can be a very good aid for managing the load of controlled airspace sectors
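A hedged sketch of the amended-CART idea described in this abstract: a CART regression tree whose constant leaf predictions are replaced by linear regressions fitted inside each leaf (Python with scikit-learn; synthetic data, and plain least squares standing in for the stepwise selection used in the thesis):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1500, n_features=8, noise=5.0, random_state=0)

# classical CART step: a regression tree partitions the sample
tree = DecisionTreeRegressor(max_leaf_nodes=20, min_samples_leaf=40, random_state=0)
tree.fit(X, y)

# amended step: one linear model per terminal node
leaf_models = {}
leaves = tree.apply(X)
for leaf in np.unique(leaves):
    idx = leaves == leaf
    leaf_models[leaf] = LinearRegression().fit(X[idx], y[idx])

def predict_amended(X_new):
    leaf_ids = tree.apply(X_new)
    preds = np.empty(len(X_new))
    for leaf in np.unique(leaf_ids):
        idx = leaf_ids == leaf
        preds[idx] = leaf_models[leaf].predict(X_new[idx])
    return preds

print(predict_amended(X[:3]), y[:3])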
Desir, Chesner. "Classification Automatique d'Images, Application à l'Imagerie du Poumon Profond". Phd thesis, Université de Rouen, 2013. http://tel.archives-ouvertes.fr/tel-00879356.
Texto completoDesir, Chesner. "Classification automatique d'images, application à l'imagerie du poumon profond". Phd thesis, Rouen, 2013. http://www.theses.fr/2013ROUES053.
Texto completoThis thesis deals with automated image classification, applied to images acquired with alveoscopy, a new imaging technique of the distal lung. The aim is to propose and develop a computer-aided diagnosis system to help the clinician analyze these previously unseen images. Our contributions lie in the development of effective, robust and generic methods to classify images of healthy and pathological patients. Our first classification system is based on a rich local characterization of the images, an ensemble-of-random-trees approach for classification, and a rejection mechanism, providing the medical expert with tools to enhance the reliability of the system. Because of the complexity of alveoscopy images and the lack of expertise on the pathological cases (unlike the healthy ones), we adopt the one-class learning paradigm, which allows a classifier to be learned from healthy data only. We propose a one-class approach taking advantage of the combination and randomization mechanisms of ensemble methods to address common issues such as the curse of dimensionality. Our method is shown to be effective, robust to high dimensionality, and competitive with, or even better than, state-of-the-art methods on various public datasets. It has proved to be particularly relevant to our medical problem
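A hedged stand-in for the one-class ensemble idea, not the thesis's one-class random forest: a random-subspace ensemble of one-class SVMs trained on "healthy" data only, whose averaged decision score also provides a natural rejection threshold (Python with scikit-learn, synthetic data):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
healthy = rng.normal(size=(300, 50))          # high-dimensional healthy samples only

ensemble = []
for _ in range(25):
    feats = rng.choice(healthy.shape[1], size=8, replace=False)   # random subspace
    model = OneClassSVM(gamma="scale", nu=0.1).fit(healthy[:, feats])
    ensemble.append((feats, model))

def score(X):
    # average signed distance to the one-class boundary over the ensemble
    return np.mean([m.decision_function(X[:, f]) for f, m in ensemble], axis=0)

test = np.vstack([rng.normal(size=(5, 50)),               # healthy-like
                  rng.normal(loc=3.0, size=(5, 50))])     # pathological-like
s = score(test)
print(np.where(s < 0, "reject / pathological", "healthy"))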
Cabrol, Sébastien. "Les crises économiques et financières et les facteurs favorisant leur occurrence". Thesis, Paris 9, 2013. http://www.theses.fr/2013PA090019.
Texto completoThe aim of this thesis is to analyze, from an empirical point of view, both the different varieties of economic and financial crises (typological analysis) and the characteristics of the contexts that can be associated with the likely occurrence of such events. Consequently, we analyze both years in which a crisis occurs and years preceding such events (leading-context analysis, forecasting). This study contributes to the empirical literature by focusing exclusively on crises in advanced economies over the last 30 years, by considering several theoretical types of crises, and by taking into account a large number of both economic and financial explanatory variables. As part of this research, we also analyze stylized facts related to the 2007/2008 subprime turmoil and, from an epistemological perspective, our ability to foresee crises. Our empirical results are based on binary classification trees built with the CART (Classification And Regression Trees) methodology. This non-parametric and non-linear statistical technique allows us to handle large data sets and is suited to identifying threshold effects and complex interactions among variables. Furthermore, this methodology allows crises (or contexts preceding a crisis) to be characterized by several distinct sets of independent variables. We identify as leading indicators of economic and financial crises the variation and volatility of both gold prices and nominal exchange rates, as well as the current account balance (as a percentage of GDP) and the change in the openness ratio. Regarding the typological analysis, we identify two main empirical varieties of crises. First, we highlight «global type» crises, characterized by a slowdown in US economic activity (stressing the role and influence of the USA in global economic conditions) and low GDP growth in the countries affected by the turmoil. Second, we find that country-specific high levels of both inflation and exchange-rate volatility can be considered evidence of «idiosyncratic type» crises
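A hedged sketch of a CART-style analysis in this spirit: a shallow binary classification tree relating macro-financial indicators to a crisis flag. The data and variable names below are synthetic placeholders, not the thesis dataset:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "gold_price_volatility": rng.gamma(2.0, 1.0, n),
    "fx_rate_volatility":    rng.gamma(2.0, 1.0, n),
    "current_account_gdp":   rng.normal(0, 3, n),
    "openness_change":       rng.normal(0, 1, n),
})
# synthetic threshold-style rule so the tree has a structure to recover
crisis = ((df["fx_rate_volatility"] > 3) & (df["current_account_gdp"] < -2)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=30, random_state=0)
tree.fit(df, crisis)

# print the threshold rules found by CART
print(export_text(tree, feature_names=list(df.columns)))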
Chaibou, salaou Mahaman Sani. "Segmentation d'image par intégration itérative de connaissances". Thesis, 2019. http://www.theses.fr/2019IMTA0140/document.
Texto completoImage processing has been a very active area of research for years. The interpretation of images is one of its most important branches because of its socio-economic and scientific applications. However, interpretation, like most image processing tasks, requires a segmentation phase to delimit the regions to be analyzed. In fact, interpretation is a process that gives meaning to the regions detected by the segmentation phase; thus, the interpretation phase can only analyze the regions detected during segmentation. Although the ultimate objective of automatic interpretation is to produce the same result as a human, the logic of classical techniques in this field does not match that of human interpretation. Most conventional approaches to this task separate the segmentation phase from the interpretation phase: the images are first segmented and then the detected regions are interpreted. In addition, conventional segmentation techniques scan images sequentially, in the order in which the pixels appear. This does not necessarily reflect the way an expert explores an image. Indeed, a human usually starts by scanning the image for possible regions of interest. When he finds a potential area, he analyzes it from three viewpoints, trying to recognize what object it is. First, he analyzes the area on the basis of its physical characteristics. Then he considers the region's surroundings, and finally he zooms out to the whole image in order to have a wider view while considering both the information local to the region and that of its neighbors. Beyond the information gathered directly from the physical characteristics of the image, the expert uses several sources of information that he merges to interpret the image; these sources include knowledge acquired through professional experience, known constraints between objects in the images, and so on. The idea of the approach proposed in this manuscript is that simulating the visual activity of the expert would allow a better compatibility between the results of the interpretation and those of the expert. From the analysis of the expert's behavior, we retain three important aspects of the image interpretation process that we model in this work: 1. Unlike what most segmentation techniques suggest, the segmentation process is not necessarily sequential, but rather a series of decisions, each of which may question the results of its predecessors; the main objective is to produce the best possible classification of the regions. 2. The process of characterizing a region of interest is not one-way: the expert can go from a local view restricted to the region of interest to a wider view including its neighbors, and vice versa. 3. Several information sources are gathered and merged for greater certainty when deciding how to characterize a region. The proposed model of these three levels places particular emphasis on the knowledge used and the reasoning behind image segmentation
Fromont, Lauren A. "Verbing and nouning in French : toward an ecologically valid approach to sentence processing". Thèse, 2019. http://hdl.handle.net/1866/23521.
Texto completoThe present thesis uses event-related potentials (ERPs) to investigate neurocognitive mechanisms underlying sentence comprehension. In particular, these two experiments seek to clarify the interplay between syntactic and semantic processes in native speakers and second language learners. Friederici's (2002, 2011) "syntax-first" model predicts that syntactic categories are analyzed at the earliest stages of speech perception, as reflected by the ELAN (early left anterior negativity) reported for syntactic category violations. Further, syntactic category violations seem to prevent the appearance of N400s (linked to lexical-semantic processing), a phenomenon known as "semantic blocking" (Friederici et al., 1999). However, a review article by Steinhauer and Drury (2012) argued that most ELAN studies used flawed designs, where pre-target context differences may have caused ELAN-like artifacts as well as the absence of N400s. The first study reevaluates syntax-first approaches to sentence processing by implementing a novel paradigm in French that included correct sentences, pure syntactic category violations, lexical-semantic anomalies, and combined anomalies. This balanced design systematically controlled for the target word (noun vs. verb) and the context immediately preceding it. Group results from native speakers of Quebec French revealed an N400-P600 complex in response to all anomalous conditions, providing strong evidence against the syntax-first and semantic-blocking hypotheses. Additive effects of syntactic category and lexical-semantic anomalies on the N400 may reflect the detection of a mismatch between a predicted word stem and the actual target, in parallel with lexical-semantic retrieval. An interactive rather than additive effect on the P600 reveals that the same neurocognitive resources are recruited for syntactic and semantic integration. Analyses of individual data showed that participants did not rely on a single cognitive mechanism reflected by either the N400 or the P600 effect, but on both, suggesting that the biphasic N400-P600 ERP response can indeed be considered an index of phrase-structure violation processing in most individuals. The second study investigates the underlying mechanisms of phrase-structure building in late second language learners of French. The convergence hypothesis (Green, 2003; Steinhauer, 2014) predicts that second language learners can achieve native-like online processing with sufficient proficiency. However, considering together different factors that relate to proficiency, exposure, and age of acquisition has proven challenging. This study further explores individual data modeling using a random forests approach. It revealed that daily usage and proficiency are the most reliable predictors of the ERP responses, with N400 and P600 effects getting larger as these variables increased, partly confirming and extending the convergence hypothesis. This thesis demonstrates that the "syntax-first" model is not viable and should be replaced. A new account is suggested, based on predictive approaches, where semantic and syntactic information are first used in parallel to facilitate retrieval, and then controlled mechanisms are recruited to analyze sentences at the interface of syntax and semantics. Those mechanisms are mediated by inter-individual abilities reflected by language exposure and performance.