Dissertations / theses on the topic "Sélection de variables bayésienne"
Below are the 50 best dissertations / theses for research on the topic "Sélection de variables bayésienne".
Baragatti, Meïli. "Sélection bayésienne de variables et méthodes de type Parallel Tempering avec et sans vraisemblance". Thesis, Aix-Marseille 2, 2011. http://www.theses.fr/2011AIX22100/document.
This thesis is divided into two main parts. In the first part, we propose a Bayesian variable selection method for probit mixed models. The objective is to select few relevant variables among tens of thousands while taking into account the design of a study, and in particular the fact that several datasets are merged together. The probit mixed model used is considered as part of a larger hierarchical Bayesian model, and the dataset is introduced as a random effect. The proposed method extends a work of Lee et al. (2003). The first step is to specify the model and prior distributions. In particular, we use the g-prior of Zellner (1986) for the fixed regression coefficients. In a second step, we use a Metropolis-within-Gibbs algorithm combined with the grouping (or blocking) technique of Liu (1994). This choice has both theoretical and practical advantages. The method developed is applied to merged microarray datasets of patients with breast cancer. However, this method has a limit: the covariance matrix involved in the g-prior should not be singular. But there are two standard cases in which it is singular: if the number of observations is lower than the number of variables, or if some variables are linear combinations of others. In such situations we propose to modify the g-prior by introducing a ridge parameter, and a simple way to choose the associated hyper-parameters. The prior obtained is a compromise between the conditional independence case for the regression coefficients and the automatic scaling advantage offered by the g-prior, and can be linked to the work of Gupta and Ibrahim (2007). In the second part, we develop two new population-based MCMC methods. In the case of complex models with several parameters, but whose likelihood can be computed, the Equi-Energy Sampler (EES) of Kou et al. (2006) seems to be more efficient than the Parallel Tempering (PT) algorithm introduced by Geyer (1991). However, it is difficult to use in combination with a Gibbs sampler, and it requires increased storage. We propose an algorithm combining the PT with the principle of exchange moves between chains with the same level of energy, in the spirit of the EES. This adaptation, which we call Parallel Tempering with Equi-Energy Moves (PTEEM), keeps the original idea of the EES method while ensuring good theoretical properties and practical use in combination with a Gibbs sampler. Then, in some complex models whose likelihood is analytically or computationally intractable, inference can be difficult. Several likelihood-free methods (or Approximate Bayesian Computation methods) have been developed. We propose a new algorithm, the Likelihood-Free Parallel Tempering, based on MCMC theory and on a population of chains, using an analogy with the Parallel Tempering algorithm.
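The ridge-modified g-prior described in the abstract above can be made concrete with a short sketch (Python/NumPy, purely illustrative and not code from the thesis; the hyper-parameter values are hypothetical):

    import numpy as np

    def g_prior_cov(X, g, sigma2, ridge=0.0):
        # Covariance of the (ridge-modified) Zellner g-prior:
        # g * sigma2 * (X'X + ridge * I)^{-1}; ridge = 0 recovers the standard
        # g-prior, which requires X'X to be non-singular.
        p = X.shape[1]
        return g * sigma2 * np.linalg.inv(X.T @ X + ridge * np.eye(p))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 50))   # n = 10 < p = 50, so X'X is singular
    cov = g_prior_cov(X, g=10.0, sigma2=1.0, ridge=1.0)   # ridge > 0 makes it usable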
Viallefont, Valérie. "Analyses bayesiennes du choix de modèles en épidémiologie : sélection de variables et modélisation de l'hétérogénéité pour des évènements". Paris 11, 2000. http://www.theses.fr/2000PA11T023.
This dissertation has two separate parts. In the first part, we compare different strategies for variable selection in a multivariate logistic regression model. Covariate and confounder selection in case-control studies is often carried out using either a two-step method or a stepwise variable selection method. Inference is then carried out conditionally on the selected model, but this ignores the model uncertainty implicit in the variable selection process, and so underestimates uncertainty about relative risks. It is well known, and shown again in our study, that the p-values computed after variable selection can greatly overstate the strength of conclusions. We propose Bayesian Model Averaging as a formal way of taking account of model uncertainty in a logistic regression context. The BMA method, which takes into account several models, each being associated with its posterior probability, yields an easily interpreted summary: the posterior probability that a variable is a risk factor, and its estimate averaged over the set of models. We conduct two comparative simulation studies: the first one has a simple design including only one risk factor and one confounder; the second one mimics an epidemiological cohort study dataset, with a large number of potential risk factors. Our criteria are the mean bias, the rate of type I and type II errors, and the assessment of uncertainty in the results, which is both more accurate and more explicit under the BMA analysis. The methods are applied and compared in the context of a previously published case-control study of cervical cancer. The choice of the prior distributions is discussed. In the second part, we focus on the modelling of rare events via a Poisson distribution, which sometimes reveals substantial over-dispersion, indicating that some unexplained discontinuity arises in the data. We suggest modelling this over-dispersion by a Poisson mixture. In a hierarchical Bayesian model, the posterior distributions of the unknown quantities in the mixture (number of components, weights, and Poisson parameters) can be estimated by MCMC algorithms, including reversible jump algorithms, which allow the dimension of the mixture to vary. We focus on the difficulty of finding a weakly informative prior for the Poisson parameters: different priors are detailed and compared. Then, the performances of different moves created for changing dimension are investigated. The model is extended by the introduction of covariates, with homogeneous or heterogeneous effect. Simulated data sets are designed for the different comparisons, and the model is finally illustrated in two different contexts: an ecological analysis of digestive cancer mortality along the coasts of France, and a dataset concerning counts of accidents at road junctions.
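The BMA summary mentioned above, the posterior probability that a variable is a risk factor, is obtained by summing the posterior probabilities of the models that contain it; a minimal illustrative sketch (Python, with hypothetical model posteriors):

    def inclusion_probabilities(model_posteriors):
        # model_posteriors maps a frozenset of included variables to the model's
        # posterior probability; returns each variable's posterior inclusion probability.
        probs = {}
        for variables, post in model_posteriors.items():
            for v in variables:
                probs[v] = probs.get(v, 0.0) + post
        return probs

    posterior = {frozenset({"smoking"}): 0.5,
                 frozenset({"smoking", "age"}): 0.3,
                 frozenset({"age"}): 0.2}
    print(inclusion_probabilities(posterior))   # {'smoking': 0.8, 'age': 0.5}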
Bouhamed, Heni. "L'Apprentissage automatique : de la sélection de variables à l'apprentissage de structure d'un classifieur bayésien". Rouen, 2013. http://www.theses.fr/2013ROUES037.
The work developed in this thesis deals with the problem of processing large amounts of data when learning a model from a database of examples; the constructed model then serves as a tool for classifying new cases. We first focus on the concept of variable selection, presenting its major strategies and highlighting their shortcomings; a new filter method is developed in this work to remedy the identified shortcomings. Secondly, we study the super-exponential increase in the computational complexity of learning a Bayesian classifier structure when general algorithms with no special restrictions are used. Indeed, referring to the formula of Robinson (1977), the number of directed acyclic graphs (DAGs) grows super-exponentially with the number of variables. We therefore propose a new approach to reduce the number of possible DAGs in structure learning, without losing information. Reducing the number of candidate DAGs reduces the computational complexity of the process and therefore the execution time, which allows larger information systems to be modelled with the same quality of exploitation.
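Robinson's (1977) recursion, cited above, makes the super-exponential growth explicit; a direct transcription (Python, illustrative only):

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def n_dags(n):
        # Number of labelled directed acyclic graphs on n nodes (Robinson, 1977).
        if n == 0:
            return 1
        return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * n_dags(n - k)
                   for k in range(1, n + 1))

    print([n_dags(n) for n in range(1, 6)])   # [1, 3, 25, 543, 29281]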
Guin, Ophélie. "Méthodes bayésiennes semi-paramétriques d'extraction et de sélection de variables dans le cadre de la dendroclimatologie". Phd thesis, Université Paris Sud - Paris XI, 2011. http://tel.archives-ouvertes.fr/tel-00636704.
Mattei, Pierre-Alexandre. "Sélection de modèles parcimonieux pour l’apprentissage statistique en grande dimension". Thesis, Sorbonne Paris Cité, 2017. http://www.theses.fr/2017USPCB051/document.
The numerical surge that characterizes the modern scientific era led to the rise of new kinds of data united in one common immoderation: the simultaneous acquisition of a large number of measurable quantities. Whether coming from DNA microarrays, mass spectrometers, or nuclear magnetic resonance, these data, usually called high-dimensional, are now ubiquitous in scientific and technological worlds. Processing these data calls for an important renewal of the traditional statistical toolset, unfit for such frameworks that involve a large number of variables. Indeed, when the number of variables exceeds the number of observations, most traditional statistics becomes inefficient. First, we give a brief overview of the statistical issues that arise with high-dimensional data. Several popular solutions are presented, and we present some arguments in favor of the method utilized and advocated in this thesis: Bayesian model uncertainty. This chosen framework is the subject of a detailed review that insists on several recent developments. After these surveys come three original contributions to high-dimensional model selection. A new algorithm for high-dimensional sparse regression called SpinyReg is presented. It compares favorably to state-of-the-art methods on both real and synthetic data sets. A new data set for high-dimensional regression is also described: it involves predicting the number of visitors in the Orsay museum in Paris using bike-sharing data. We focus next on model selection for high-dimensional principal component analysis (PCA). Using a new theoretical result, we derive the first closed-form expression of the marginal likelihood of a PCA model. This allows us to propose two algorithms for model selection in PCA: a first one called globally sparse probabilistic PCA (GSPPCA) that allows to perform scalable variable selection, and a second one called normal-gamma probabilistic PCA (NGPPCA) that estimates the intrinsic dimensionality of a high-dimensional data set. Both methods are competitive with other popular approaches. In particular, using unlabeled DNA microarray data, GSPPCA is able to select genes that are more biologically relevant than several popular approaches.
Naveau, Marion. "Procédures de sélection de variables en grande dimension dans les modèles non-linéaires à effets mixtes. Application en amélioration des plantes". Electronic Thesis or Diss., université Paris-Saclay, 2024. http://www.theses.fr/2024UPASM031.
Mixed-effects models analyze observations collected repeatedly from several individuals, attributing variability to different sources (intra-individual, inter-individual, residual). Accounting for this variability is essential to characterize the underlying biological mechanisms without bias. These models use covariates and random effects to describe variability among individuals: covariates explain differences due to observed characteristics, while random effects represent the variability not attributable to measured covariates. In a high-dimensional context, where the number of covariates exceeds the number of individuals, identifying influential covariates is challenging, as selection focuses on latent variables in the model. Many procedures have been developed for linear mixed-effects models, but contributions for non-linear models are rare and lack theoretical foundations. This thesis aims to develop a high-dimensional covariate selection procedure for non-linear mixed-effects models by studying their practical implementations and theoretical properties. This procedure is based on the use of a Gaussian spike-and-slab prior and the SAEM algorithm (Stochastic Approximation of Expectation Maximisation Algorithm). Posterior contraction rates around true parameter values in a non-linear mixed-effects model under a discrete spike-and-slab prior have been obtained, comparable to those observed in linear models. The work in this thesis is motivated by practical questions in plant breeding, where these models describe plant development as a function of their genotypes and environmental conditions. The considered covariates are generally numerous since varieties are characterized by thousands of genetic markers, most of which have no effect on certain phenotypic traits. The statistical method developed in the thesis is applied to a real dataset related to this application.
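The Gaussian spike-and-slab prior mentioned above is simply a two-component normal mixture whose spike variance is much smaller than its slab variance; a minimal sketch (Python, with illustrative hyper-parameter values, not the thesis implementation):

    import numpy as np
    from scipy.stats import norm

    def spike_and_slab_pdf(beta, p_slab=0.1, v_spike=1e-4, v_slab=1.0):
        # Density of (1 - p_slab) * N(0, v_spike) + p_slab * N(0, v_slab):
        # coefficients drawn from the spike are essentially zero, those from the
        # slab are free to be large, which encodes sparsity of the covariate effects.
        return ((1.0 - p_slab) * norm.pdf(beta, scale=np.sqrt(v_spike))
                + p_slab * norm.pdf(beta, scale=np.sqrt(v_slab)))

    print(spike_and_slab_pdf(np.array([0.0, 0.5, 2.0])))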
Prestat, Emmanuel. "Les réseaux bayésiens : classification et recherche de réseaux locaux en cancérologie". Phd thesis, Université Claude Bernard - Lyon I, 2010. http://tel.archives-ouvertes.fr/tel-00707732.
Texto completo da fonteJebreen, Kamel. "Modèles graphiques pour la classification et les séries temporelles". Thesis, Aix-Marseille, 2017. http://www.theses.fr/2017AIXM0248/document.
First, in this dissertation, we will show that Bayesian network classifiers are very accurate models when compared to other classical machine learning methods. Discretising input variables often increases the performance of Bayesian network classifiers, as does a feature selection procedure. Different types of Bayesian networks may be used for supervised classification. We combine such approaches together with feature selection and discretisation to show that such a combination gives rise to powerful classifiers. A large selection of data sets from the UCI machine learning repository is used in our experiments, and the application to epilepsy type prediction based on PET scan data confirms the efficiency of our approach. Second, in this dissertation we also consider modelling interaction between a set of variables in the context of time series and high dimension. We suggest two approaches: the first is similar to the neighbourhood lasso, where the lasso model is replaced by Support Vector Machines (SVMs); the second is a restricted Bayesian network for time series. We demonstrate the efficiency of our approaches through simulations using linear and nonlinear data sets and a mixture of both.
Dangauthier, Pierre-Charles. "Fondations, méthode et applications de l'apprentissage bayésien". Phd thesis, Grenoble INPG, 2007. http://tel.archives-ouvertes.fr/tel-00267643.
Texto completo da fonteBedenel, Anne-Lise. "Appariement de descripteurs évoluant dans le temps : application à la comparaison d'assurance". Thesis, Lille 1, 2019. http://www.theses.fr/2019LIL1I011/document.
Most classical learning methods require the data descriptors to be identical in the learning and test samples. However, in the online insurance comparison field, the forms and features from which data come are often modified. These constant modifications of the data descriptors lead us to work with a small amount of data and make analyses more complex. The goal is therefore to use data generated before the modification of the feature descriptors, thereby increasing the size of the sample observed after the modification. We intend to perform a transfer of learning between data observed before and after the feature modification. The links between the data descriptors of the feature before and after the modification are totally unknown, which brings a missing-data problem. We propose a model of the joint distribution of the feature before and after the modification of the data descriptors. The problem becomes an estimation problem in a graph, where some business and technical constraints ensure the identifiability of the model and we have to work with a reduced set of very parsimonious models. Two estimation methods relying on EM algorithms are proposed. The set of constraints leads us to work with a collection of models, so a model selection step is required. For this step, two criteria are proposed: an asymptotic criterion and a non-asymptotic criterion relying on Bayesian analysis, which involves importance sampling combined with a Gibbs algorithm. An exhaustive search and a non-exhaustive search based on a genetic algorithm, combining estimation and selection, are proposed in order to obtain a method that is optimal in terms of both results and execution time. The thesis concludes with an application to real data.
Schäfer, Christian. "Monte Carlo methods for sampling high-dimensional binary vectors". Phd thesis, Université Paris Dauphine - Paris IX, 2012. http://tel.archives-ouvertes.fr/tel-00767163.
Texto completo da fonteBontemps, Dominique. "Statistiques discrètes et Statistiques bayésiennes en grande dimension". Phd thesis, Université Paris Sud - Paris XI, 2010. http://tel.archives-ouvertes.fr/tel-00561749.
Texto completo da fonteTayeb, Arafat. "Estimation bayésienne des modèles à variables latentes". Paris 9, 2006. https://portail.bu.dauphine.fr/fileviewer/index.php?doc=2006PA090061.
In this thesis, we study some models with latent variables. Given a set of data, we suppose that there is a hidden variable such that the conditional distribution of the data given this variable belongs to a known class and often depends on a (multidimensional) parameter. This parameter can depend on time and on the latent variable; when it does not, the notation simplifies accordingly. Depending on the model, the latent variable represents the observation allocation, the observation component, the observation state or its regime. The aim of this work is to estimate the parameter and the hidden variable. Bayesian inference about the parameter is given by its posterior distribution. Precisely, we wish either to produce an efficient sample (approximately) following this distribution or to approximate some of its properties, such as the mean, median or modes. Different methods for sampling and/or deriving such posterior properties are used in this thesis. Five main models are studied and, for each situation, specific techniques are used.
El Anbari, Mohammed. "Regularisation and variable selection using penalized likelihood". Phd thesis, Université Paris Sud - Paris XI, 2011. http://tel.archives-ouvertes.fr/tel-00661689.
Texto completo da fonteCaron, François. "Inférence bayésienne pour la détermination et la sélection de modèles stochastiques". Ecole Centrale de Lille, 2006. http://www.theses.fr/2006ECLI0012.
We are interested in the addition of uncertainty in hidden Markov models. The inference is made in a Bayesian framework based on Monte Carlo methods. We consider multiple sensors that may switch between several states of work. An original jump model is developed for different kinds of situations, including synchronous/asynchronous data and the binary valid/invalid case. The model/algorithm is applied to the positioning of a land vehicle equipped with three sensors. One of them is a GPS receiver, whose data are potentially corrupted due to multipath phenomena. We consider the estimation of the probability density function of the evolution and observation noises in hidden Markov models. First, the case of linear models is addressed, and MCMC and particle filter algorithms are developed and applied to three different applications. Then the case of the estimation of probability density functions in nonlinear models is addressed. For that purpose, time-varying Dirichlet processes are defined for the online estimation of time-varying probability density functions.
Choiruddin, Achmad. "Sélection de variables pour des processus ponctuels spatiaux". Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAM045/document.
Recent applications such as forestry datasets involve the observation of spatial point pattern data combined with the observation of many spatial covariates. We consider in this thesis the problem of estimating a parametric form of the intensity function in such a context. This thesis develops feature selection procedures and gives some guarantees on their validity. In particular, we propose two different feature selection approaches: the lasso-type methods and the Dantzig selector-type procedures. For the methods considering lasso-type techniques, we derive asymptotic properties of the estimates obtained from estimating functions derived from Poisson and logistic regression likelihoods penalized by a large class of penalties. We prove that the estimates obtained from such procedures satisfy consistency, sparsity, and asymptotic normality. For the Dantzig selector part, we develop a modified version of the Dantzig selector, which we call the adaptive linearized Dantzig selector (ALDS), to obtain the intensity estimates. More precisely, the ALDS estimates are defined as the solution to an optimization problem which minimizes the sum of coefficients of the estimates subject to a linear approximation of the score vector as a constraint. We find that the estimates obtained from such methods have asymptotic properties similar to the ones proposed previously using an adaptive lasso regularization term. We investigate the computational aspects of the methods developed using either lasso-type procedures or the Dantzig selector-type approaches. We make links between spatial point process intensity estimation and generalized linear models (GLMs), so we only have to deal with feature selection procedures for GLMs. Thus, easier computational procedures are implemented and computationally fast algorithms are proposed. Simulation experiments are conducted to highlight the finite sample performances of the estimates from each of the two proposed approaches. Finally, our methods are applied to model the spatial locations of a tree species in a forest observed together with a large number of environmental factors.
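As a rough illustration of the lasso-type estimating functions above, here is an ISTA loop for an l1-penalised Poisson regression log-likelihood (Python/NumPy; a simplified sketch that ignores the quadrature weights specific to point-process intensity estimation, with a hypothetical fixed step size):

    import numpy as np

    def soft_threshold(x, t):
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

    def lasso_poisson(X, y, lam, step=1e-3, n_iter=5000):
        # Minimise sum_i exp(x_i'b) - y_i * x_i'b + lam * ||b||_1 by proximal
        # gradient (ISTA); the step size is kept fixed for brevity.
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = X.T @ (np.exp(X @ beta) - y)
            beta = soft_threshold(beta - step * grad, step * lam)
        return beta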
Sidi, Zakari Ibrahim. "Sélection de variables et régression sur les quantiles". Thesis, Lille 1, 2013. http://www.theses.fr/2013LIL10081/document.
This work is a contribution to the selection of statistical models, and more specifically to the selection of variables in penalized linear quantile regression when the dimension is high. It focuses on two points in the selection process: the stability of selection and the inclusion of variables through the grouping effect. As a first contribution, we propose a transition from penalized least squares regression to quantile regression (QR). A bootstrap approach based on the frequency of selection of each variable is proposed for the construction of linear models (LM). In most cases, the QR approach provides more significant coefficients. A second contribution is to adapt some algorithms of the "Random" LASSO (Least Absolute Shrinkage and Selection Operator) family in connection with QR and to propose methods of selection stability. Examples from food security illustrate the obtained results. As part of penalized QR in high dimension, the grouping effect property is established under weak conditions, together with the oracle properties. Two examples of real and simulated data illustrate the regularization paths of the proposed algorithms. The last contribution deals with variable selection for generalized linear models (GLMs) using nonconcave penalized likelihood. We propose an algorithm to maximize the penalized likelihood for a broad class of non-convex penalty functions. The convergence property of the algorithm and the oracle property of the estimator obtained after one iteration are established. Simulations and an application to real data are also presented.
Harroue, Benjamin. "Approche bayésienne pour la sélection de modèles : application à la restauration d’image". Thesis, Bordeaux, 2020. http://www.theses.fr/2020BORD0127.
The main goal of inversion is to reconstruct objects from data. Here, we focus on the special case of image restoration in convolution problems. The data are acquired through an altering observation system and are additionally distorted by errors. The problem becomes ill-posed due to the loss of information. One way to tackle it is to exploit a Bayesian approach in order to regularize the problem. Introducing prior information about the unknown quantities offsets the loss, and it relies on stochastic models. We have to test all the candidate models in order to select the best one. But some questions remain: how do you choose the best model? Which features or quantities should we rely on? In this work, we propose a method to automatically compare and choose the model, based on Bayesian decision theory: objectively compare the models based on their posterior probabilities. These probabilities directly depend on the marginal likelihood or "evidence" of the models. The evidence comes from the marginalization of the joint law with respect to the unknown image and the unknown hyperparameters. This is a difficult integral calculation because of the complex dependencies between the quantities and the high dimension of the image. Therefore, we have to work with computational methods and approximations. Several methods are on the test stand, such as the harmonic mean, the Laplace method, discrete integration, the Chib approach from a Gibbs approximation, and the power posteriors. Comparing those methods is a significant step towards determining which ones are the most competent for image restoration. As a first lead of research, we focus on the family of Gaussian models with circulant covariance matrices to reduce some of the difficulties.
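Among the evidence estimators listed above, the harmonic mean is the simplest to state; a numerically stabilised sketch (Python, illustrative only, fed with posterior samples' log-likelihoods):

    import numpy as np
    from scipy.special import logsumexp

    def log_evidence_harmonic_mean(log_likelihoods):
        # Harmonic mean estimator of the log marginal likelihood:
        # log p(y) ~ -log( (1/N) * sum_i 1 / p(y | theta_i) ), theta_i ~ posterior.
        # Known to have high variance in practice, which is one reason the
        # comparison with other estimators above matters.
        ll = np.asarray(log_likelihoods)
        return -(logsumexp(-ll) - np.log(ll.size))

    print(log_evidence_harmonic_mean([-100.2, -101.5, -99.8, -100.9]))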
Genuer, Robin. "Forêts aléatoires : aspects théoriques, sélection de variables et applications". Phd thesis, Université Paris Sud - Paris XI, 2010. http://tel.archives-ouvertes.fr/tel-00550989.
Texto completo da fonteGrimonprez, Quentin. "Sélection de groupes de variables corrélées en grande dimension". Thesis, Lille 1, 2016. http://www.theses.fr/2016LIL10165/document.
This thesis takes place in the context of variable selection in the high-dimensional setting using penalized regression in the presence of redundancy between explanatory variables. Among all variables, we suppose that only a small number is relevant for predicting the response variable. In this high-dimensional setting, the performance of classical lasso-based approaches decreases when redundancy increases, as they do not take it into account. Aggregating variables can overcome this problem but generally requires the calibration of additional parameters. The proposed approach combines variable aggregation and selection in order to improve interpretability and performance. First, a hierarchical clustering procedure provides at each level a partition of the variables into groups. Then the Group-lasso is used with the set of groups of variables from the different levels of the hierarchical clustering and a fixed regularization parameter. Choosing this parameter provides a list of candidate groups potentially coming from different levels. The final choice of groups is made by a multiple testing procedure. The proposed procedure exploits the hierarchical structure from the hierarchical clustering and some weights in the Group-lasso. This allows us to greatly reduce the algorithmic complexity induced by the possibility of choosing groups coming from different levels of the hierarchical clustering.
Ros, Mathieu. "Sélection canalisante et modélisation bayésienne de variances hétérogènes : application à Helix Aspersa Müller". Rennes, Agrocampus, 2005. http://www.theses.fr/2005NSARB164.
Texto completo da fonteHebiri, Mohamed. "Quelques questions de sélection de variables autour de l'estimateur LASSO". Phd thesis, Université Paris-Diderot - Paris VII, 2009. http://tel.archives-ouvertes.fr/tel-00408737.
Texto completo da fonteCasarin, Roberto. "Méthodes de simulation pour l'estimation bayésienne des modèles à variables latentes". Paris 9, 2007. https://portail.bu.dauphine.fr/fileviewer/index.php?doc=2007PA090056.
Latent variable models are now very common in econometrics and statistics. This thesis mainly focuses on the use of latent variables in mixture modelling, time series analysis and continuous-time models. We follow a Bayesian inference framework based on simulation methods. In the third chapter we propose alpha-stable mixtures in order to account for skewness, heavy tails and multimodality in financial modelling. Chapter four proposes a Markov-Switching Stochastic-Volatility model with a heavy-tailed observable process. We follow a Bayesian approach and make use of particle filters in order to filter the state and estimate the parameters. Chapter five deals with parameter estimation and the extraction of the latent structure in the volatilities of the US business cycle and stock market valuations. We propose a new regularised SMC procedure for doing Bayesian inference. In chapter six we employ a Bayesian inference procedure, based on Population Monte Carlo, to estimate the parameters in the drift and diffusion terms of a stochastic differential equation (SDE) from discretely observed data.
Mbina, Mbina Alban. "Contributions à la sélection des variables en statistique multidimensionnelle et fonctionnelle". Thesis, Lille 1, 2017. http://www.theses.fr/2017LIL10102/document.
This thesis focuses on variable selection in linear models and in the additive functional linear model. More precisely, we propose three variable selection methods. The first one concerns the selection of continuous variables in a multidimensional linear model; a comparative study based on prediction loss shows that our method performs better than the method of An et al. (2013). Secondly, we propose a new selection method for mixed variables (a mixture of discrete and continuous variables). This method is based on a generalization to the mixed framework of the method of NKIET (2012); more precisely, it is based on a generalization of the linear canonical invariance criterion to the framework of discrimination with mixed variables. A comparative study based on the rate of correct classification shows that our method is equivalent to the method of MAHAT et al. (2007) in the case of two groups. In the third method, we propose an approach for variable selection in an additive functional linear model. A simulation study based on the Hausdorff distance illustrates our approach.
Meynet, Caroline. "Sélection de variables pour la classification non supervisée en grande dimension". Phd thesis, Université Paris Sud - Paris XI, 2012. http://tel.archives-ouvertes.fr/tel-00752613.
Texto completo da fonteSauvé, Marie. "Sélection de modèles en régression non gaussienne : applications à la sélection de variables et aux tests de survie accélérés". Paris 11, 2006. http://www.theses.fr/2006PA112201.
This thesis deals with model selection in non-Gaussian regression. Our aim is to obtain information on a function s given only some of its values perturbed by noise that is not necessarily Gaussian. In a first part, we consider histogram models (i.e. classes of piecewise constant functions) associated with a collection of partitions of the set on which s is defined. We determine a penalized least squares criterion which selects a partition whose associated estimator satisfies an oracle inequality. Selecting a histogram model does not always lead to an accurate estimation of s, but allows, for example, the change-points of s to be detected. In order to perform variable selection, we also propose a nonlinear method which relies on the use of CART and on histogram model selection. In a second part, we consider piecewise polynomial models, whose approximation properties are better. We aim at estimating s with a piecewise polynomial whose degree can vary from region to region. We determine a penalized criterion which selects a partition and a series of degrees whose associated piecewise polynomial estimator satisfies an oracle inequality. We also apply this result to detect the change-points of a piecewise affine function. The aim of this last work is to provide an adequate stress interval for accelerated life tests.
Comminges, Laëtitia. "Quelques contributions à la sélection de variables et aux tests non-paramétriques". Phd thesis, Université Paris-Est, 2012. http://pastel.archives-ouvertes.fr/pastel-00804979.
Texto completo da fonteComminges, Laëtitia. "Quelques contributions à la sélection de variables et aux tests non-paramétriques". Thesis, Paris Est, 2012. http://www.theses.fr/2012PEST1068/document.
Real-world data are often extremely high-dimensional, severely underconstrained and interspersed with a large number of irrelevant or redundant features. Relevant variable selection is a compelling approach for addressing statistical issues in the scenario of high-dimensional and noisy data with small sample size. First, we address the issue of variable selection in the regression model when the number of variables is very large. The main focus is on the situation where the number of relevant variables is much smaller than the ambient dimension. Without assuming any parametric form of the underlying regression function, we get tight conditions making it possible to consistently estimate the set of relevant variables. Secondly, we consider the problem of testing a particular type of composite null hypothesis under a nonparametric multivariate regression model. For a given quadratic functional $Q$, the null hypothesis states that the regression function $f$ satisfies the constraint $Q[f] = 0$, while the alternative corresponds to the functions for which $Q[f]$ is bounded away from zero. We provide minimax rates of testing and the exact separation constants, along with a sharp-optimal testing procedure, for diagonal and nonnegative quadratic functionals. We can apply this to testing the relevance of a variable. Studying minimax rates for quadratic functionals which are neither positive nor negative reveals two different regimes: "regular" and "irregular". We apply this to the issue of testing the equality of norms of two functions observed in noisy environments.
Lê, Cao Kim-Anh. "Outils statistiques pour la sélection de variables et l'intégration de données "omiques"". Toulouse, INSA, 2008. http://eprint.insa-toulouse.fr/archive/00000225/.
Recent advances in biotechnology allow the monitoring of large quantities of biological data of various types, such as genomics, proteomics, metabolomics, phenotypes..., that are often characterized by a small number of samples or observations. The aim of this thesis was to develop, or adapt, appropriate statistical methodologies to analyse high-dimensional data, and to present efficient tools to biologists for selecting the most biologically relevant variables. In the first part, we focus on microarray data in a classification framework, and on the selection of discriminative genes. In the second part, in the context of data integration, we focus on the selection of different types of variables with two-block omics data. Firstly, we propose a wrapper method, which aggregates two classifiers (CART or SVM) to select discriminative genes for binary or multiclass biological conditions. Secondly, we develop a PLS variant called sparse PLS that adapts l1 penalization and allows for the selection of a subset of variables, which are measured from the same biological samples. Either a regression or a canonical analysis framework is proposed to answer biological questions correctly. We assess each of the proposed approaches by comparing them to similar methods known in the literature on numerous real data sets. The statistical criteria that we use are often limited by the small number of samples. We always try, therefore, to combine statistical assessments with a thorough biological interpretation of the results. The approaches that we propose are easy to apply and give relevant results that answer the biologists' needs.
Maria, Sébastien. "Modélisation parcimonieuse : application à la sélection de variables et aux données STAP". Rennes 1, 2006. http://www.theses.fr/2006REN1S153.
Texto completo da fonteLevrard, Clément. "Quantification vectorielle en grande dimension : vitesses de convergence et sélection de variables". Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112214/document.
The distortion of the quantizer built from an n-sample of a probability distribution over a vector space with the famous k-means algorithm is firstly studied in this thesis report. To be more precise, this report aims to give oracle inequalities on the difference between the distortion of the k-means quantizer and the minimum distortion achievable by a k-point quantizer, where the influence of the natural parameters of the quantization issue should be precisely described. For instance, some natural parameters are the distribution support, the size k of the quantizer set of images, the dimension of the underlying Euclidean space, and the sample size n. After a brief summary of the previous works on this topic, an equivalence between the conditions previously stated for the excess distortion to decrease fast with respect to the sample size and a technical condition is stated, in the continuous density case. Interestingly, this condition looks like a technical condition required in statistical learning to achieve fast rates of convergence. Then, it is proved that the excess distortion achieves a fast convergence rate of 1/n in expectation, provided that this technical condition is satisfied. Next, a so-called margin condition is introduced, which is easier to understand, and it is established that this margin condition implies the technical condition mentioned above. Some examples of distributions satisfying this margin condition are presented, such as Gaussian mixtures, which are classical distributions in the clustering framework. Then, provided that this margin condition is satisfied, an oracle inequality on the excess distortion of the k-means quantizer is given. This convergence result shows that the excess distortion decreases with a rate 1/n and depends on natural geometric properties of the probability distribution with respect to the size of the set of images k. Surprisingly, the dimension of the underlying Euclidean space seems to play no role in the convergence rate of the distortion. Following the latter point, the results are directly extended to the case where the underlying space is a Hilbert space, which is the adapted framework when dealing with curve quantization. However, high-dimensional quantization often needs, in practice, a dimension reduction step before proceeding to a quantization algorithm. This motivates the following study of a variable selection procedure adapted to the quantization issue. To be more precise, a Lasso-type procedure adapted to the quantization framework is studied. The Lasso-type penalty applies to the set of image points of the quantizer, in order to obtain sparse image points. The outcome of this procedure is called the Lasso k-means quantizer, and some theoretical results on this quantizer are established under the margin condition introduced above. First, it is proved that the image points of such a quantizer are close to the image points of a sparse quantizer, achieving a kind of tradeoff between excess distortion and the size of the support of the image points. Then an oracle inequality on the excess distortion of the Lasso k-means quantizer is given, providing a convergence rate of 1/n^(1/2) in expectation. Moreover, the dependency of this convergence rate on various other parameters is precisely described. These theoretical predictions are illustrated with numerical experiments, showing that the Lasso k-means procedure mainly behaves as expected. However, the numerical experiments also shed light on some drawbacks concerning the practical implementation of such an algorithm.
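To make the Lasso k-means idea concrete, here is a small Lloyd-type iteration in which each centroid update soft-thresholds the cluster mean (Python/NumPy; an illustrative sketch under a squared-Euclidean distortion and an l1 penalty on the image points, not the procedure analysed in the thesis):

    import numpy as np

    def soft_threshold(x, t):
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

    def lasso_kmeans(X, k, lam, n_iter=50, seed=0):
        # Alternate assignments and centroid updates; each centroid solves
        # min_c sum_{i in cluster} ||x_i - c||^2 + lam * ||c||_1,
        # i.e. the cluster mean soft-thresholded at lam / (2 * cluster size).
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    centers[j] = soft_threshold(members.mean(axis=0), lam / (2 * len(members)))
        return centers, labels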
Mallein, Bastien. "Marches aléatoires branchantes, temps inhomogène, sélection". Thesis, Paris 6, 2015. http://www.theses.fr/2015PA066104/document.
In this thesis, we take interest in the branching random walk, a particle system in which particles move and reproduce independently. The aim is to study the rate at which these particles invade their environment, a quantity which often reveals information on the past of the extremal individuals. We consider two particular variants of the branching random walk, described below. In the first variant, the way individuals behave evolves with time. This model was introduced by Fang and Zeitouni in 2010. This time-dependence can be a slow evolution of the reproduction mechanism of individuals, at macroscopic scale, in which case the maximal displacement is obtained through the resolution of a convex optimization problem. A second kind of time-dependence is to sample at random, at each generation, the way individuals behave. This model was introduced and studied in an article in collaboration with Piotr Miłos. In the second variant, individuals undergo a Darwinian selection mechanism. The position of an individual is understood as its fitness, and the displacement of a child with respect to its parent is associated with the process of heredity. In such a process, the total size of the population is fixed to some integer N, and at each step only the N fittest individuals survive. This model was introduced by Brunet, Derrida, Mueller and Munier. We first consider a reproduction mechanism which allows some large jumps. In the second model we considered, the total size N of the population may depend on time.
Dubois, Jean-François. "Quelques pièges cachés des méthodes de sélection de variables en régression linéaire multiple". Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp03/MQ67260.pdf.
Texto completo da fonteBécu, Jean-Michel. "Contrôle des fausses découvertes lors de la sélection de variables en grande dimension". Thesis, Compiègne, 2016. http://www.theses.fr/2016COMP2264/document.
In the regression framework, many studies focus on the high-dimensional problem where the number of measured explanatory variables is very large compared to the sample size. Although variable selection is a classical question, the usual methods are not applicable in the high-dimensional case. In this manuscript, we therefore develop the transposition of statistical tests to the high-dimensional setting. These tests operate on estimates of regression coefficients obtained by penalized linear regression, which is applicable in high dimension. The main objective of these tests is the control of false discoveries. The first contribution of this manuscript provides a quantification of the uncertainty for regression coefficients estimated by ridge regression in high dimension; ridge regression penalizes the coefficients through their l2 norm. To do this, we devise a statistical test based on permutations. The second contribution is based on a two-step selection approach. A first step is dedicated to the screening of variables, based on the parsimonious Lasso regression. The second step consists in cleaning the resulting set by testing the relevance of the pre-selected variables. These tests are carried out on adaptive-ridge estimates, where the penalty is constructed on the Lasso estimates learned during the screening step. A final contribution consists in transposing this approach to the selection of groups of variables.
Grelaud, Aude. "Méthodes sans vraisemblance appliquées à l'étude de la sélection naturelle et à la prédiction de structure tridimensionnelle des protéines". Paris 9, 2009. https://portail.bu.dauphine.fr/fileviewer/index.php?doc=2009PA090048.
Texto completo da fonteTuleau, Christine. "Sélection de variables pour la discrimination en grande dimension et classification de données fonctionnelles". Paris 11, 2005. https://tel.archives-ouvertes.fr/tel-00012008.
This thesis deals with nonparametric statistics and is related to classification and discrimination in high dimension, and more particularly to variable selection. A first part is devoted to variable selection through CART, in both the regression and binary classification frameworks. The proposed exhaustive procedure is based on model selection, which leads to "oracle" inequalities and allows variable selection to be performed by penalized empirical contrast. A second part is motivated by an industrial problem. It consists in determining, among the temporal signals measured during experiments, those able to explain the subjective drivability, and then in identifying the ranges responsible for this relevance. The adopted methodology is articulated around the preprocessing of the signals, dimensionality reduction by compression using a common wavelet basis, and selection of useful variables involving CART and a stepwise strategy. A last part deals with functional data classification with k-nearest neighbors. The procedure consists of applying k-nearest neighbors to the coordinates of the projections of the data onto a suitably chosen finite-dimensional space. The procedure involves selecting simultaneously the space dimension and the number of neighbors. The traditional version of k-nearest neighbors and a slightly penalized version are theoretically considered. A study on real and simulated data shows that the introduction of a small penalty term stabilizes the selection while preserving good performance.
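A rough analogue of the joint selection of projection dimension and number of neighbors described above (Python/scikit-learn; PCA scores and cross-validation are used here purely for illustration, whereas the thesis projects onto a suitably chosen basis and selects via a penalized criterion):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def select_dimension_and_k(X, y, dims, ks):
        # Return the (dimension, k) pair maximising cross-validated accuracy of
        # k-NN applied to the first `dimension` principal-component scores.
        best = (None, None, -np.inf)
        for d in dims:
            scores = PCA(n_components=d).fit_transform(X)
            for k in ks:
                acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), scores, y, cv=5).mean()
                if acc > best[2]:
                    best = (d, k, acc)
        return best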
Robineau, Jean-François. "Méthodes de sélection de variables, parmi un grand nombre, dans un cadre de discrimation". Université Joseph Fourier (Grenoble), 2004. http://www.theses.fr/2004GRE19009.
The purpose of this document is the development of a practical framework for feature selection in supervised learning tasks. The issue of feature selection is mainly known from data mining, where one has to deal with many irrelevant variables. We want to develop an environment, both theoretical and applied, in order to implement feature selection methods independent of any probabilistic model or discriminant algorithm. We propose supervised quantization methods based upon information measures. These methods perform discretization of continuous attributes according to the class variable distribution. Following this pre-processing, feature selection methods use similar criteria to generate relevant variable subsets. Several methods are proposed, shedding light on the impossible quest for the ideal subset.
Hindawi, Mohammed. "Sélection de variables pour l’analyse des données semi-supervisées dans les systèmes d’Information décisionnels". Thesis, Lyon, INSA, 2013. http://www.theses.fr/2013ISAL0015/document.
Feature selection is an important task in data mining and machine learning processes. This task is well known in both supervised and unsupervised contexts. Semi-supervised feature selection is still under development and far from being mature. In general, machine learning has been well developed in order to deal with partially-labeled data, so feature selection has gained special importance in the semi-supervised context. It is well suited to real-world applications, where the labeling process is costly. In this thesis, we present a literature review on semi-supervised feature selection, with regard to the supervised and unsupervised contexts. The goal is to show the importance of compromising between the structure from the unlabeled part of the data and the background information from its labeled part. In particular, we are interested in the so-called «small labeled-sample problem», where the difference between the two data parts is very large. In order to deal with the problem of semi-supervised feature selection, we propose two groups of approaches. The first group is of «Filter» type, in which we propose algorithms that evaluate the relevance of features by a scoring function. In our case, this function is based on spectral-graph theory and the integration of pairwise constraints which can be extracted from the data at hand. The second group of methods is of «Embedded» type, where feature selection becomes an internal function integrated in the learning process. In order to realize embedded feature selection, we propose algorithms based on feature weighting. The proposed methods rely on constrained clustering. In this sense, we propose two visions: (1) a global vision, based on relaxed satisfaction of the pairwise constraints, achieved by integrating the constraints into the objective function of the proposed clustering model; and (2) a local vision, based on strict control of constraint violation. Both approaches evaluate the relevance of features by weights which are learned during the construction of the clustering model. In addition to the main task, which is feature selection, we are interested in redundancy elimination. In order to tackle this problem, we propose a novel algorithm based on combining mutual information with a maximum-spanning-tree-based algorithm. We construct this tree from the relevant features in order to optimize the number of selected features at the end. Finally, all methods proposed in this thesis are analyzed and their complexities are studied. Furthermore, they are validated on high-dimensional data against other representative methods in the literature.
Tuleau, Christine. "SELECTION DE VARIABLES POUR LA DISCRIMINATION EN GRANDE DIMENSION ET CLASSIFICATION DE DONNEES FONCTIONNELLES". Phd thesis, Université Paris Sud - Paris XI, 2005. http://tel.archives-ouvertes.fr/tel-00012008.
Texto completo da fonteLaporte, Léa. "La sélection de variables en apprentissage d'ordonnancement pour la recherche d'information : vers une approche contextuelle". Toulouse 3, 2013. http://thesesups.ups-tlse.fr/2170/.
Learning-to-rank aims at automatically optimizing a ranking function learned on training data by a machine learning algorithm. Existing approaches have two major drawbacks. Firstly, the ranking functions can use several thousand features, which is an issue since algorithms have to deal with large-scale data. This can also have a negative impact on the ranking quality. Secondly, algorithms learn a unique function for all queries. Thus, neither the kind of user need nor the context of the query is taken into account in the ranking process. Our work focuses on solving the large-scale issue and the context-aware issue by using feature selection methods dedicated to learning-to-rank. We propose five feature selection algorithms based on sparse Support Vector Machines (SVMs). Three perform feature selection by reweighting the L2 norm, one solves an L1-regularized problem, and the last algorithm considers nonconvex regularizations. Our methods are faster and sparser than state-of-the-art algorithms on benchmark datasets, while providing similar performances in terms of IR measures. We also evaluate our approaches on a commercial dataset. The experiments confirm the previous results. In this context, we propose a relevance model based on user clicks, in the special case of multi-clickable documents. Finally, we propose an adaptive and query-dependent ranking system based on feature selection. This system considers several clusters of queries, each group defining a context. For each cluster, the system selects a group of features to learn a context-aware ranking function.
Donnet, Sophie. "Inversion de données IRMf : estimation et sélection de modèles". Paris 11, 2006. http://www.theses.fr/2006PA112193.
This thesis is devoted to the analysis of functional Magnetic Resonance Imaging (fMRI) data. In the framework of standard convolution models, we test a model that allows for variation of the magnitudes of the hemodynamic response. To estimate the parameters of this model, we resort to the Expectation-Maximisation algorithm. We test this model against the standard one, with constant magnitudes, on several real data sets by a likelihood ratio test. The linear model suffers from a lack of biological basis, hence we consider a physiological model. In this framework, we describe the data as the sum of a regression term, defined as the non-analytical solution of an ordinary differential equation (ODE) depending on random parameters, and a Gaussian observation noise. We develop a general method to estimate the parameters of a statistical model defined by an ODE with non-observed parameters. This method, integrating a numerical resolution of the ODE, relies on a stochastic version of the EM algorithm. The convergence of the algorithm is proved and the error induced by the numerical solving method is controlled. We apply this method to simulated and real data sets. Subsequently, we consider statistical models defined by stochastic differential equations (SDEs) depending on random parameters. We approximate the diffusion process by a numerical scheme and propose a general estimation method. Results of a pharmacokinetic mixed-model study (on simulated and real data sets) illustrate the accuracy of the estimation and the relevance of the SDE approach. Finally, the identifiability of statistical models defined by SDEs with random parameters is studied.
Chastaing, Gaëlle. "Indices de Sobol généralisés pour variables dépendantes". Phd thesis, Université de Grenoble, 2013. http://tel.archives-ouvertes.fr/tel-00930229.
Texto completo da fonteVandewalle, Vincent. "Estimation et sélection en classification semi-supervisée". Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2009. http://tel.archives-ouvertes.fr/tel-00447141.
Texto completo da fonteDu, Jardin Philippe. "Prévision de la défaillance et réseaux de neurones : l'apport des méthodes numériques de sélection de variables". Phd thesis, Université de Nice Sophia-Antipolis, 2007. http://tel.archives-ouvertes.fr/tel-00475200.
Texto completo da fonteHamon, Julie. "Optimisation combinatoire pour la sélection de variables en régression en grande dimension : Application en génétique animale". Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2013. http://tel.archives-ouvertes.fr/tel-00920205.
Texto completo da fonteAygalinc, Pascal. "Application de la reconnaissance des formes à l'aide au diagnostic médical : sélection multicritère de variables explicatives". Lille 1, 1986. http://www.theses.fr/1986LIL10083.
Texto completo da fonteRohart, Florian. "Prédiction phénotypique et sélection de variables en grande dimension dans les modèles linéaires et linéaires mixtes". Thesis, Toulouse, INSA, 2012. http://www.theses.fr/2012ISAT0027/document.
Recent technologies have provided scientists with genomics and post-genomics high-dimensional data; there are always more measured variables than individuals. These high-dimensional datasets usually need additional assumptions in order to be analyzed, such as a sparsity condition, which means that only a small subset of the variables is supposed to be relevant. In this high-dimensional context, we worked on a real dataset from the pig species obtained with high-throughput biotechnologies. Metabolomic data were measured with NMR spectroscopy and phenotypic data were mainly obtained post-mortem. There are two objectives. On the one hand, we aim at obtaining good predictions for the production phenotypes; on the other hand, we want to pinpoint the metabolomic data that explain the phenotype under study. Thanks to the Lasso method applied in a linear model, we show that metabolomic data have real predictive power for some phenotypes that are important for livestock production, such as the lean meat percentage and the daily food consumption. The second objective is a problem of variable selection. Classic statistical tools such as the Lasso method or the FDR procedure are investigated and new powerful methods are developed. We propose a variable selection method based on multiple hypothesis testing. This procedure is designed to perform in linear models, and non-asymptotic results are given under a condition on the signal. Since supplementary data, such as the batch or the family relationships between the animals, are available for the real dataset, linear mixed models are considered. A new algorithm for fixed-effects selection is developed, and this algorithm turns out to be faster than the usual ones. Thanks to its structure, it can be combined with any variable selection method built for linear models. However, the convergence property of this algorithm depends on the method that is used. The multiple hypothesis testing procedure shows good empirical results. All the mentioned methods are applied to the real data, and biological relationships are highlighted.
Dernoncourt, David. "Stabilité de la sélection de variables sur des données haute dimension : une application à l'expression génique". Thesis, Paris 6, 2014. http://www.theses.fr/2014PA066317/document.
Texto completo da fonte
High-throughput technologies allow us to measure very large numbers of variables in patients: DNA sequence, gene expression, lipid profile… Knowledge discovery can be performed on such data using, for instance, classification methods. However, these data contain a very high number of variables, which are measured, in the best cases, on a few hundred patients. This makes feature selection a necessary first step to reduce the risk of overfitting, reduce computation time, and improve model interpretability. When the number of observations is low, feature selection tends to be unstable: it is common to observe that two selections obtained from two different datasets dealing with the same problem barely overlap. Yet obtaining a stable selection seems important if we want to be confident that the selected variables are really relevant, in a knowledge discovery perspective. In this work, we first tried to determine which factors have the most influence on feature selection stability. We then proposed a feature selection method, specific to microarray data, that uses functional annotations from Gene Ontology to assist usual feature selection methods by adding a priori knowledge to the data. We then worked on two aspects of ensemble methods: the choice of the aggregation method, and hybrid ensemble methods. In the final chapter, we applied the methods studied in the thesis to a dataset from our lab, dealing with the prediction of weight regain after a diet, from microarray data, in obese patients.
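The notion of selection stability discussed above can be illustrated with a short Python sketch: a simple univariate selection is repeated on bootstrap resamples and the pairwise Jaccard overlap of the selected sets is reported. This is not the thesis's procedure; the simulated data, the selector, and the sample sizes are illustrative assumptions.

# Minimal sketch: quantify feature selection stability by repeating a
# univariate selection on bootstrap resamples and averaging the pairwise
# Jaccard overlap of the selected sets. Simulated data only.
import numpy as np
from itertools import combinations
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(2)
n_patients, n_genes, k = 80, 2000, 50
X = rng.normal(size=(n_patients, n_genes))
y = rng.integers(0, 2, size=n_patients)

selections = []
for _ in range(20):                                   # bootstrap resamples
    idx = rng.integers(0, n_patients, size=n_patients)
    selector = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
    selections.append(set(np.flatnonzero(selector.get_support())))

jaccard = [len(a & b) / len(a | b) for a, b in combinations(selections, 2)]
print("mean pairwise Jaccard stability:", np.mean(jaccard))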
Pressat-Laffouilhère, Thibaut. "Modèle ontologique formel, un appui à la sélection des variables pour la construction des modèles multivariés". Electronic Thesis or Diss., Normandie, 2023. http://www.theses.fr/2023NORMR104.
Texto completo da fonte
Responding to a causal research question in the context of observational studies requires the selection of confounding variables. Integrating them into a multivariate model as co-variables helps reduce bias in estimating the true causal effect of exposure on the outcome. Identification is achieved through causal diagrams (CDs) or directed acyclic graphs (DAGs). These representations, composed of nodes and directed arcs, prevent the selection of variables that would introduce bias, such as mediating and colliding variables. However, existing methods for constructing CDs lack systematic approaches and exhibit limitations in terms of formalism, expressiveness, and completeness. To offer a formal and comprehensive framework capable of representing all necessary information for variable selection on an enriched CD, analyzing this CD, and, most importantly, explaining the analysis results, we propose utilizing an ontological model enriched with inference rules. An ontological model allows for representing knowledge in the form of an expressive and formal graph consisting of classes and relations similar to the nodes and arcs of CDs. We developed the OntoBioStat (OBS) ontology based on a list of competency questions about variable selection and an analysis of the scientific literature on CDs and ontologies. The construction framework of OBS is richer than that of a CD, incorporating implicit elements like necessary causes, study context, uncertainty in knowledge, and data quality. To evaluate the contribution of OBS, we used it to represent variables from a published observational study and compared its conclusions with those of a CD. OBS identified new confounding variables owing to its different construction framework and its axioms and inference rules. OBS was also used to represent an ongoing retrospective study analysis. The model explained statistical correlations found between study variables and highlighted potential confounding variables and their possible substitutes (proxies). Information on data quality and causal relation uncertainty facilitated proposing sensitivity analyses, enhancing the robustness of the study's conclusions. Finally, inferences were explained through the reasoning capabilities provided by OBS's formal representation. Ultimately, OBS will be integrated into statistical analysis tools to leverage existing libraries for variable selection, making it accessible to epidemiologists and biostatisticians.
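As a hedged illustration of the causal-diagram reasoning that OntoBioStat formalizes, the following Python sketch encodes a toy causal diagram as a directed graph and classifies candidate covariates as confounders, mediators, or colliders with simple structural rules. It is not the OBS ontology and not a full back-door analysis; the toy graph, the variable names, and the rules are illustrative assumptions.

# Minimal sketch: a causal diagram as a directed graph, with simple path-based
# rules to flag candidate confounders, mediators, and colliders relative to an
# exposure/outcome pair. Toy example only, not a complete adjustment criterion.
import networkx as nx

dag = nx.DiGraph([
    ("Age", "Exposure"), ("Age", "Outcome"),                    # confounder
    ("Exposure", "Biomarker"), ("Biomarker", "Outcome"),        # mediator
    ("Exposure", "Hospitalized"), ("Outcome", "Hospitalized"),  # collider
])

exposure, outcome = "Exposure", "Outcome"
for v in dag.nodes:
    if v in (exposure, outcome):
        continue
    causes_both = nx.has_path(dag, v, exposure) and nx.has_path(dag, v, outcome)
    on_causal_path = nx.has_path(dag, exposure, v) and nx.has_path(dag, v, outcome)
    common_effect = nx.has_path(dag, exposure, v) and nx.has_path(dag, outcome, v)
    if causes_both:
        print(v, "-> candidate confounder (adjust for it)")
    elif on_causal_path:
        print(v, "-> mediator (do not adjust)")
    elif common_effect:
        print(v, "-> collider (do not adjust)")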
Pluntz, Matthieu. "Sélection de variables en grande dimension par le Lasso et tests statistiques - application à la pharmacovigilance". Electronic Thesis or Diss., université Paris-Saclay, 2024. http://www.theses.fr/2024UPASR002.
Texto completo da fonte
Variable selection in high-dimensional regressions is a classic problem in health data analysis. It aims to identify a limited number of factors associated with a given health event among a large number of candidate variables such as genetic factors or environmental or drug exposures. The Lasso regression (Tibshirani, 1996) provides a series of sparse models in which variables appear one after another depending on the value of the regularization parameter. It requires a procedure for choosing this parameter and thus the associated model. In this thesis, we propose procedures for selecting one of the models on the Lasso path which belong to, or are inspired by, the statistical testing paradigm. We thus aim to control the risk of selecting at least one false positive (Family-Wise Error Rate, FWER), unlike most existing post-processing methods of the Lasso, which accept false positives more easily. Our first proposal is a generalization of the Akaike Information Criterion (AIC) which we call the Extended AIC (EAIC). We penalize the log-likelihood of the model under consideration by its number of parameters weighted by a function of the total number of candidate variables and the targeted FWER level, but not of the number of observations. We obtain this function by observing the relationship between comparing the information criteria of nested sub-models of a high-dimensional regression and performing multiple likelihood ratio tests, about which we prove an asymptotic property. Our second proposal is a test of the significance of a variable appearing on the Lasso path. Its null hypothesis depends on a set A of already selected variables and states that A contains all the active variables. As the test statistic, we aim to use the regularization parameter value at which a first variable outside A is selected by the Lasso. This choice faces the fact that the null hypothesis is not specific enough to define the distribution of this statistic and thus its p-value. We solve this by replacing the statistic with its conditional p-value, defined conditionally on the non-penalized estimated coefficients of the model restricted to A. We estimate the conditional p-value with an algorithm we call simulation-calibration, in which we simulate outcome vectors and then calibrate them on the observed outcome's estimated coefficients. We adapt the calibration heuristically to the case of generalized linear models (binary and Poisson), in which it becomes an iterative and stochastic procedure. We prove that using our test controls the risk of selecting a false positive in linear models, both when the null hypothesis is verified and, under a correlation condition, when the set A does not contain all active variables. We evaluate the performance of both procedures through extensive simulation studies, covering both the potential selection of a variable under the null hypothesis (or its equivalent for the EAIC) and the overall model selection procedure. We observe that our proposals compare well to their closest existing counterparts: the BIC and its extended versions for the EAIC, and Lockhart et al.'s (2014) covariance test for the simulation-calibration test. We also illustrate both procedures on the detection of exposures associated with drug-induced liver injury (DILI) in the French national pharmacovigilance database (BNPV), measuring their performance using the DILIrank reference set of known associations.
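To give a flavour of the first proposal, here is a Python sketch that scores the models on a Lasso path with an EAIC-style criterion: the log-likelihood is penalized per parameter by a weight depending on the number of candidate variables and a target FWER level. The weight used here (a Bonferroni-style chi-square quantile) and the unpenalized refit of each support are placeholder assumptions, not the function and procedure derived in the thesis.

# Minimal sketch: choose a model on the Lasso path with a criterion of the
# EAIC type. The per-parameter penalty weight f(p, alpha) below is a
# placeholder assumption; the thesis derives its own function.
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LinearRegression, lasso_path

rng = np.random.default_rng(3)
n, p, alpha = 200, 300, 0.05
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -1.0, 0.8]
y = X @ beta + rng.normal(size=n)

_, coefs, _ = lasso_path(X, y, n_alphas=50)          # shape (p, n_alphas)
penalty_per_param = chi2.ppf(1 - alpha / p, df=1)    # placeholder weight f(p, alpha)

best = None
for j in range(coefs.shape[1]):
    support = np.flatnonzero(coefs[:, j])
    if support.size == 0:
        rss = np.sum((y - y.mean()) ** 2)
    else:
        fit = LinearRegression().fit(X[:, support], y)
        rss = np.sum((y - fit.predict(X[:, support])) ** 2)
    # -2 * Gaussian log-likelihood up to constants, plus the EAIC-style penalty
    criterion = n * np.log(rss / n) + penalty_per_param * support.size
    if best is None or criterion < best[0]:
        best = (criterion, support)
print("selected variables:", best[1])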