Dissertations / Theses on the topic 'Régressions pénalisées'
Consult the top 19 dissertations / theses for your research on the topic 'Régressions pénalisées.'
Gnanguenon Guesse, Girault. "Modélisation et visualisation des liens entre cinétiques de variables agro-environnementales et qualité des produits dans une approche parcimonieuse et structurée." Electronic Thesis or Diss., Montpellier, 2021. http://www.theses.fr/2021MONTS139.
The development of digital agriculture makes it possible to observe, at high frequency, the dynamics of production as a function of climate. Data from these dynamic observations can be considered functional data. To analyze this new type of data, the usual statistical tools must be extended to the functional case, or new ones developed.
In this thesis, we propose a new approach (SpiceFP: Sparse and Structured Procedure to Identify Combined Effects of Functional Predictors) to explain the variations of a scalar response variable by two or three functional predictors in a context of joint influence of these predictors. Particular attention was paid to the interpretability of the results through the use of combined interval classes defining a partition of the observation domain of the explanatory factors. Recent developments around LASSO (Least Absolute Shrinkage and Selection Operator) models have been adapted to estimate the areas of influence in the partition via a generalized penalized regression. The approach also integrates a double selection, of models (among the possible partitions) and of variables (areas inside a given partition), based on the AIC and BIC information criteria. The methodological description of the approach, its study through simulations, and a case study based on real data are presented in chapter 2 of this thesis.
The real data used in this thesis were obtained from a vineyard experiment aimed at understanding the impact of climate change on anthocyanin accumulation in berries. Analysis of these data in chapter 3 using SpiceFP and one extension identified a negative impact of morning combinations of low irradiance (lower than about 100 µmol/s/m² or 45 µmol/s/m², depending on the advanced-delayed state of the berries) and high temperature (higher than about 25°C). A slight difference associated with overnight temperature occurred between these effects identified in the morning.
In chapter 4 of this thesis, we propose an implementation of the approach as an R package. This implementation provides a set of functions for building the class intervals on linear or logarithmic scales, for transforming the functional predictors using the joint class intervals, and finally for running the approach in two or three dimensions. Other functions help with post-processing or allow the user to explore models other than those selected by the approach, such as an average of different models.
Keywords: Penalized regressions, Interaction, information criteria, scalar-on-function, interpretable coefficients, grapevine microclimate
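The double selection by AIC and BIC mentioned in this abstract can be illustrated with a minimal sketch. It is not the SpiceFP implementation: the candidate names, log-likelihood values, and parameter counts below are hypothetical, and only the textbook definitions of the two criteria are used.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k log(n) - 2 log L."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical candidates: (maximized log-likelihood, number of nonzero coefficients).
candidates = {
    "coarse partition": (-120.0, 3),
    "fine partition": (-112.0, 9),
}
n_obs = 50  # hypothetical sample size

best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_bic = min(candidates, key=lambda m: bic(*candidates[m], n_obs))
```

With these made-up numbers the two criteria disagree: BIC charges log(n) per parameter instead of 2, so it favors the sparser candidate once n exceeds about 7.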
Mansiaux, Yohann. "Analyse d'un grand jeu de données en épidémiologie : problématiques et perspectives méthodologiques." Thesis, Paris 6, 2014. http://www.theses.fr/2014PA066272/document.
The increasing size of datasets is a growing issue in epidemiology. The CoPanFlu-France cohort (1450 subjects), intended to study H1N1 pandemic influenza infection risk as a combination of biological, environmental, socio-demographic and behavioral factors, and in which hundreds of covariates are collected for each patient, is a good example. The statistical methods usually employed to explore associations have many limits in this context. We compare the contribution of data-driven exploratory methods, which assume the absence of a priori hypotheses, to that of hypothesis-driven methods, which require the development of preliminary hypotheses.
Firstly, a data-driven study is presented, assessing the ability to detect influenza infection determinants of two data mining methods, random forests (RF) and boosted regression trees (BRT), of the conventional logistic regression framework (Univariate Followed by Multivariate Logistic Regression - UFMLR), and of the Least Absolute Shrinkage and Selection Operator (LASSO), which applies an L1 penalty in multivariate logistic regression to achieve a sparse selection of covariates. A simulation approach was used to estimate the True (TPR) and False (FPR) Positive Rates associated with these methods. Between three and twenty-four determinants of infection were identified, the pre-epidemic antibody titer being the only covariate selected by all methods. The mean TPR was highest for RF (85%) and BRT (80%), followed by the LASSO (up to 78%), while the UFMLR methodology was inefficient (below 50%). A slight increase of alpha risk (mean FPR up to 9%) was observed for logistic regression-based models, LASSO included, while the mean FPR was 4% for the data-mining methods.
Secondly, we propose a hypothesis-driven causal analysis of the infection risk, with a structural equation model (SEM). We exploited the ability of SEM to model latent variables in order to study very diverse factors, their relative impact on the infection, and their possible relationships. Only the latent variables describing host susceptibility (modeled by the pre-epidemic antibody titer) and compliance with preventive behaviors were directly associated with infection. The behavioral factors describing risk perception and perception of preventive measures positively influenced compliance with preventive behaviors. The intensity (number and duration) of social contacts was not associated with the infection.
This thesis shows the necessity of considering novel statistical approaches for the analysis of large datasets in epidemiology. Data mining and LASSO are credible alternatives to the tools generally used to explore associations with a high number of variables. SEM allows the integration of variables describing diverse dimensions and the explicit modeling of their relationships; these models are therefore of major interest in a multidisciplinary study such as CoPanFlu.
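The true and false positive rates used to compare selection methods in the simulation study can be computed as in the following sketch. The covariate names and the selected set are hypothetical; the abstract does not give the simulation design, so this only illustrates the two definitions.

```python
def selection_rates(selected, true_determinants, all_covariates):
    """Return (TPR, FPR) of a variable-selection result.

    TPR: share of true determinants that were selected.
    FPR: share of noise covariates that were selected.
    """
    selected = set(selected)
    truth = set(true_determinants)
    noise = set(all_covariates) - truth
    tpr = len(selected & truth) / len(truth) if truth else float("nan")
    fpr = len(selected & noise) / len(noise) if noise else float("nan")
    return tpr, fpr

# Hypothetical simulated data set: 10 covariates, 4 of them truly associated.
covariates = [f"x{i}" for i in range(1, 11)]
true_vars = ["x1", "x2", "x3", "x4"]
picked = ["x1", "x2", "x3", "x7"]  # output of some selection method

tpr, fpr = selection_rates(picked, true_vars, covariates)
```

Averaging these two numbers over many simulated data sets gives mean rates of the kind reported in the abstract.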
Detais, Amélie. "Maximum de vraisemblance et moindre carrés pénalisés dans des modèles de durée de vie censurées." Toulouse 3, 2008. http://thesesups.ups-tlse.fr/820/.
Lifetime data analysis is used in various application fields, and different methods have been proposed for modelling such data. In this thesis, we are interested in two distinct types of model: the stratified Cox model with randomly missing strata indicators and the right-censored linear regression model. We propose methods for estimating the parameters and establish the asymptotic properties of the resulting estimators in each of these models. First, we consider a generalization of the Cox model that allows different groups of the population, named strata, to have distinct baseline intensity functions, whereas the regression parameter is shared by all the strata. In this stratified proportional intensity model, we are interested in parameter estimation when the strata indicator is missing for some individuals in the population. Nonparametric maximum likelihood estimators are proposed for the model parameters, and their consistency and asymptotic normality are established. We show the efficiency of the estimator of the regression parameter and obtain consistent estimators of its variance. The Expectation-Maximization algorithm is proposed and developed for computing the estimators of the model parameters. Second, we are interested in the linear regression model when the response data are randomly right-censored. We introduce a new estimator of the regression parameter, which minimizes a Kaplan-Meier-weighted penalized least squares criterion. Consistency and asymptotic normality results are obtained, and a simulation study is conducted to investigate the small-sample properties of this LASSO-type estimator. The bootstrap method is used to estimate the asymptotic variance.
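A Kaplan-Meier-weighted least squares criterion replaces the uniform weights 1/n by the jumps of the Kaplan-Meier estimator at the observed responses. The abstract does not spell out the weights, so the sketch below assumes the standard Stute-type construction; the function name and the sample data are illustrative only.

```python
def kaplan_meier_weights(times, events):
    """Kaplan-Meier jump sizes (Stute-type weights), in the original order.

    times:  observed responses (possibly right-censored)
    events: 1 if the response is fully observed, 0 if censored
    """
    n = len(times)
    # Sort by time; at ties, observed failures come before censored points.
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    weights = [0.0] * n
    surv = 1.0  # Kaplan-Meier survival just before the current point
    for rank, idx in enumerate(order, start=1):
        at_risk = n - rank + 1
        # Censored points get weight 0; observed points get the KM jump.
        weights[idx] = events[idx] * surv / at_risk
        if events[idx]:
            surv *= (at_risk - 1) / at_risk
    return weights

# Hypothetical sample: the second observation is censored.
times = [2.0, 5.0, 3.0, 8.0]
events = [1, 0, 1, 1]
weights = kaplan_meier_weights(times, events)
```

These weights would then multiply the squared residuals in the least squares criterion, to which the LASSO-type penalty is added. Without censoring every weight equals 1/n, which recovers ordinary least squares; a censored point passes its mass on to later observed points.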
Soret, Perrine. "Régression pénalisée de type Lasso pour l’analyse de données biologiques de grande dimension : application à la charge virale du VIH censurée par une limite de quantification et aux données compositionnelles du microbiote." Thesis, Bordeaux, 2019. http://www.theses.fr/2019BORD0254.
In clinical studies, and thanks to technological progress, the amount of information collected on each patient continues to grow, leading to situations where the number of explanatory variables is greater than the number of individuals. The Lasso method has proved appropriate for circumventing overfitting problems in high-dimensional settings.
This thesis is devoted to the application and development of Lasso-penalized regression for clinical data with particular structures.
First, in patients with the human immunodeficiency virus, mutations in the virus's genetic structure may be related to the development of drug resistance. Predicting the viral load from a (potentially large) set of mutations helps guide treatment choice.
Below a threshold, the viral load is undetectable, so the data are left-censored. We propose two new Lasso approaches based on the Buckley-James algorithm, which imputes censored values by a conditional expectation. By reversing the response, we obtain a right-censored problem, for which non-parametric estimates of the conditional expectation have been proposed in survival analysis. Finally, we propose a parametric estimation based on a Gaussian hypothesis.
Secondly, we are interested in the role of the microbiota in the deterioration of respiratory health. Microbiota data are presented as relative abundances (the proportion of each species per individual, known as compositional data) and have a phylogenetic structure.
We have established a state of the art of statistical methods for analysing microbiota data. Because the field is new, few recommendations exist on the applicability and effectiveness of the proposed methods. A simulation study allowed us to compare the selection capacity of penalization methods proposed specifically for this type of data.
We then apply this research to the analysis of the association between bacteria/fungi and the decline of pulmonary function in patients with cystic fibrosis from the MucoFong project.
Sorba, Olivier. "Pénalités minimales pour la sélection de modèle." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLS043/document.
L. Birgé and P. Massart proved that the minimum penalty phenomenon occurs in Gaussian model selection when the model family arises from complete variable selection among independent variables. We extend some of their results to discrete Gaussian signal segmentation when the model family corresponds to a sufficiently rich family of partitions of the signal's support. This is the case of regression trees. We show that the same phenomenon occurs in the context of density estimation. The richness of the model family can be related to a certain form of isotropy. In this respect the minimum penalty phenomenon is intrinsic. To corroborate this point of view, we show that the minimum penalty phenomenon occurs when the models are chosen randomly under an isotropic law.
Gannaz, Irène. "Estimation par ondelettes dans les modèles partiellement linéaires." Phd thesis, Grenoble 1, 2007. http://www.theses.fr/2007GRE10281.
This dissertation is concerned with the use of wavelet methods in semiparametric partially linear models. These models are composed of a linear component with unknown regression coefficients and an unknown nonparametric function. The aim is to estimate both components, possibly in the presence of correlation. A procedure based on wavelet thresholding is built to estimate the nonparametric part of the model using a penalized least squares criterion. We establish a connection between different thresholding schemes and M-estimators in linear models with outliers, where the wavelet coefficients of the nonparametric part of the model are treated as outliers. We also propose an estimate of the noise variance. Some asymptotic results are given for the estimates of both the parametric and the nonparametric parts. Their behavior is close to optimal, up to a logarithmic factor, under the usual restrictions on the correlation between variables. Simulations illustrate the properties of the proposed methodology and compare it with existing methods. An application to real functional MRI data is also presented. The last part of this work deals with the extension to nonequidistant observations for the nonparametric part, in particular comparing nonparametric estimation procedures via simulations.
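The link between thresholding and penalized least squares can be checked numerically in the simplest case: for a single coefficient z, soft thresholding is the minimizer of the L1-penalized criterion 0.5 (z - theta)^2 + lam |theta|. The sketch below is a generic illustration with arbitrary values, not the estimator studied in the thesis.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def penalized_objective(theta, z, lam):
    """One-dimensional L1-penalized least squares criterion."""
    return 0.5 * (z - theta) ** 2 + lam * abs(theta)

# Brute-force check on a fine grid (illustrative values).
z, lam = 1.3, 0.5
grid = [i / 1000 for i in range(-3000, 3001)]
best_on_grid = min(grid, key=lambda t: penalized_objective(t, z, lam))
closed_form = soft_threshold(z, lam)
```

The grid minimizer and the closed form agree up to the grid resolution, and any coefficient with absolute value at most lam is set exactly to zero, which is what makes the estimate sparse.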
Moumouni, Kairou. "Etude et conception d'un modèle mixte semiparamétrique stochastique pour l'analyse des données longitudinales environnementales." Phd thesis, Université Rennes 2, 2005. http://tel.archives-ouvertes.fr/tel-00012164.
In a second part, an extension of Cook's local influence method to the modified mixed model is proposed; it provides a sensitivity analysis for detecting the effects of certain perturbations on the structural components of the model. Some asymptotic properties of the local influence matrix are established.
Finally, the proposed model is applied to two real data sets: an analysis of nitrate concentration data from several monitoring stations in a watershed, followed by an analysis of the bacteriological pollution of bathing waters.
Gannaz, Irène. "Estimation par ondelettes dans les modèles partiellement linéaires." Phd thesis, Université Joseph Fourier (Grenoble), 2007. http://tel.archives-ouvertes.fr/tel-00197146.
Nguyen, Thi Le Thu. "Sequential Monte-Carlo sampler for Bayesian inference in complex systems." Thesis, Lille 1, 2014. http://www.theses.fr/2014LIL10058/document.
In many problems, complex non-Gaussian and/or nonlinear models are required to accurately describe a physical system of interest. In such cases, Monte Carlo algorithms are remarkably flexible and extremely powerful tools for solving the resulting inference problems. However, when the posterior distribution is high-dimensional and/or multimodal, standard Monte Carlo techniques can perform poorly. This thesis focuses on the Sequential Monte Carlo (SMC) sampler, a more robust and efficient Monte Carlo algorithm. Although this approach has many advantages over traditional Monte Carlo methods, the potential of this emergent technique remains largely underexploited in signal processing. We therefore propose novel strategies that improve the efficiency of the SMC sampler and facilitate its practical implementation. Firstly, we propose an automatic and adaptive strategy that selects the sequence of distributions within the SMC sampler so as to approximately minimize the asymptotic variance of the estimator of the posterior normalization constant. Secondly, we present an original contribution that improves the global efficiency of the SMC sampler by introducing correction mechanisms that allow the use of the particles generated through all the iterations of the algorithm (instead of only the particles from the last iteration). Finally, to illustrate the usefulness of such approaches, we apply the SMC sampler with our proposed improvement strategies to two challenging practical problems: multiple source localization in wireless sensor networks and Bayesian penalized regression.
Ternes, Nils. "Identification de biomarqueurs prédictifs de la survie et de l'effet du traitement dans un contexte de données de grande dimension." Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLS278/document.
With the recent revolution in genomics and in stratified medicine, the development of molecular signatures is becoming more and more important for predicting the prognosis (prognostic biomarkers) and the treatment effect (predictive biomarkers) of each patient. However, the large quantity of information has made false positives more and more frequent in biomedical research. The high-dimensional space (i.e. number of biomarkers ≫ sample size) leads to several statistical challenges, such as the identifiability of the models, the instability of the selected coefficients, and the multiple testing issue.
The aim of this thesis was to propose and evaluate statistical methods for the identification of these biomarkers and for the individual predicted survival probability of new patients, in the context of the Cox regression model. For variable selection in a high-dimensional setting, the lasso penalty is commonly used. In the prognostic setting, an empirical extension of the lasso penalty has been proposed that is more stringent in the estimation of the tuning parameter λ, in order to select fewer false positives. In the predictive setting, the focus has been on biomarker-by-treatment interactions in a randomized clinical trial. Twelve approaches have been proposed for selecting these interactions, such as lasso (standard, adaptive, grouped or ridge+lasso), boosting, dimension reduction of the main effects, and a model incorporating arm-specific biomarker effects. Finally, several strategies were studied to obtain an individual survival prediction, with a corresponding confidence interval, for a future patient from a penalized regression model, while limiting potential overfitting.
The performance of the approaches was evaluated through simulation studies combining null and alternative scenarios. The methods were also illustrated on several data sets containing gene expression data in breast cancer.
Courtois, Émeline. "Score de propension en grande dimension et régression pénalisée pour la détection automatisée de signaux en pharmacovigilance Propensity Score-Based Approaches in High Dimension for Pharmacovigilance Signal Detection: an Empirical Comparison on the French Spontaneous Reporting Database New adaptive lasso approaches for variable selection in automated pharmacovigilance signal detection." Thesis, Université Paris-Saclay, 2020. http://www.theses.fr/2020UPASR009.
Post-marketing pharmacovigilance aims to detect adverse effects of marketed drugs as early as possible. It relies on large databases of individual case safety reports of adverse events suspected to be drug-induced. Several automated signal detection tools have been developed to mine these large amounts of data in order to highlight suspicious adverse event-drug combinations. Classical signal detection methods are based on disproportionality analyses of counts aggregating patients' reports. Recently, multiple regression-based methods have been proposed to account for multiple drug exposures. In chapter 2, we propose a signal detection method based on the high-dimensional propensity score (HDPS). An empirical study, conducted on the French pharmacovigilance database with a reference signal set pertaining to drug-induced liver injury (DILIrank), is carried out to compare the performance of this method (in 12 modalities) with that of methods based on lasso penalized regressions. In this work, the influence of the score estimation method is minimal, unlike that of the score integration method. In particular, HDPS weighting with matching weights performs well, comparably to lasso-based methods. In chapter 3, we propose a method based on an extension of the lasso: the adaptive lasso, which allows a specific penalty to be applied to each variable through adaptive weights. We propose two new weights adapted to spontaneous report data, as well as the use of the BIC for the choice of the penalty term. An extensive simulation study is performed to compare the performance of our proposals with other implementations of the adaptive lasso, a disproportionality method, lasso-based methods and HDPS-based methods. The proposed methods show overall better results in terms of false discoveries and sensitivity than competing methods. An empirical study similar to the one conducted in chapter 2 completes the evaluation. All the evaluated methods are implemented in the R package "adapt4pv", available on CRAN. Alongside methodological developments in spontaneous reporting, there has been growing interest in the use of medico-administrative databases for signal detection in pharmacovigilance. Methodological research in this area is still to be developed. In chapter 4, we explore detection strategies exploiting spontaneous reports and the national health insurance permanent sample (Échantillon Généraliste des Bénéficiaires, EGB). We first evaluate the performance of detection on the EGB using DILIrank. Then, we consider detection conducted on spontaneous reports based on an adaptive lasso that integrates, through its weights, information on the drug exposure of a control group measured in the EGB. In both cases, the contribution of medico-administrative data is difficult to evaluate because of the relatively small size of the EGB.
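For orientation, the adaptive lasso and a BIC-based choice of the penalty term can be written generically as follows. This is the textbook form, not the thesis's own notation: the log-likelihood ℓ (assumed here to be that of a logistic regression of the adverse event on drug exposures), the initial estimate β̃, the exponent γ, and the use of the number of nonzero coefficients as degrees of freedom are all assumptions. The two new weights proposed in the thesis are not given in the abstract and are not reproduced here.

```latex
% Adaptive lasso: one penalty weight w_j per variable
\hat{\beta}(\lambda) = \arg\min_{\beta}
  \Big\{ -\ell(\beta) + \lambda \sum_{j=1}^{p} w_j \, |\beta_j| \Big\},
\qquad
w_j = \frac{1}{|\tilde{\beta}_j|^{\gamma}} \quad (\gamma > 0).

% BIC along the regularization path, minimized over \lambda
\mathrm{BIC}(\lambda) = -2\,\ell\big(\hat{\beta}(\lambda)\big)
  + \log(n)\,\big|\{\, j : \hat{\beta}_j(\lambda) \neq 0 \,\}\big| .
```

Setting every w_j to 1 recovers the standard lasso, so the weights are the only place where variable-specific information enters the penalty.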
Thouvenot, Vincent. "Estimation et sélection pour les modèles additifs et application à la prévision de la consommation électrique." Thesis, Université Paris-Saclay (ComUE), 2015. http://www.theses.fr/2015SACLS184/document.
French electricity load forecasting has undergone major changes over the past decade. These changes are due, among other things, to the opening of the electricity market (and the economic crisis), which calls for the development of new automatic, time-adaptive prediction methods. The advent of innovative technologies also requires automatic methods, because thousands or tens of thousands of time series have to be studied. For prediction we adopt a semi-parametric approach based on additive models. We present an automatic procedure for covariate selection in an additive model. We combine Group LASSO, which is selection consistent, with P-Splines, which are estimation consistent. Our estimation and model selection results are valid without assuming that the norm of each of the true non-zero components is bounded away from zero; they require only that the norms of the non-zero components converge to zero at a certain rate. Real applications to local and aggregate load forecasting are provided.
Keywords: Additive Model, Group LASSO, Load Forecasting, Multi-stage estimator, P-Splines, Variables selection
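In an additive model each covariate contributes a block of spline coefficients, and the Group LASSO keeps or discards each block as a whole. The sketch below shows the group soft-thresholding operator that produces this behaviour (the proximal operator of lam times the Euclidean norm of a block). It is a generic building block with illustrative numbers, and the abstract does not say which algorithm the thesis uses.

```python
import math

def group_soft_threshold(block, lam):
    """Shrink a coefficient block towards zero; drop it if its norm is <= lam."""
    norm = math.sqrt(sum(v * v for v in block))
    if norm <= lam:
        return [0.0] * len(block)  # the whole covariate is removed
    scale = 1.0 - lam / norm
    return [scale * v for v in block]

# Two hypothetical blocks of spline coefficients, one per covariate.
strong_block = group_soft_threshold([3.0, 4.0], lam=1.0)  # norm 5: kept, shrunk
weak_block = group_soft_threshold([0.3, 0.4], lam=1.0)    # norm 0.5: removed
```

Because the decision depends on the norm of the whole block, a covariate is either selected with all of its spline coefficients or removed entirely, which is what makes the penalty suitable for covariate selection.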
Lavarde, Marc. "Fiabilité des semi-conducteurs, tests accélérés, sélection de modèles définis par morceaux et détection de sur-stress." Paris 11, 2007. http://www.theses.fr/2007PA112266.
This thesis deals with the use of accelerated-test data and regression model selection in a high-technology field: semiconductor chips. Accelerated trials provide a regression framework. The aim of an accelerated test is to fit the logarithm of the lifetime by means of a function f, called the acceleration function. However, accelerated-test data may behave in complex and misleading ways. To adapt the model to such data, we propose to detect changes in the behavior of the acceleration function. We consider a collection of candidate piecewise acceleration models. For each candidate model we compute the least-squares estimator, and we select the final estimator using a penalized criterion. The penalized estimator is an optimal approximation of reality in the sense that its quadratic risk is bounded by the minimal risk over all the candidate least-squares estimators. Moreover, this oracle inequality is non-asymptotic. Furthermore, we consider two classical reliability cases: the lognormal case, associated with fatigue failure, and the Weibull case, associated with shock failure. Lastly, we have implemented model selection tools in order to carry out studies without a priori assumptions on the acceleration models and to make use of overstress trials.
Jardillier, Rémy. "Evaluation de différentes variantes du modèle de Cox pour le pronostic de patients atteints de cancer à partir de données publiques de séquençage et cliniques." Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALS008.
Cancer has been the leading cause of premature mortality (death before the age of 65) in France since 2004. For the same organ, each cancer is unique, and personalized prognosis is therefore an important aspect of patient management and follow-up. The decrease in sequencing costs over the last decade has made it possible to measure the molecular profiles of many tumors on a large scale. Thus, the TCGA database provides RNA-seq data of tumors, clinical data (age, sex, grade, stage, etc.), and follow-up times of the associated patients over several years (including patient survival, possible recurrence, etc.). New discoveries are thus made possible in terms of biomarkers built from transcriptomic data, with individualized prognoses. These advances require the development of large-scale data analysis methods that take into account survival data (right-censored), clinical characteristics, and molecular profiles of patients together. In this context, the main goal of the thesis is to compare and adapt methodologies for constructing prognostic risk scores for survival or recurrence of patients with cancer from sequencing and clinical data.
The (semi-parametric) Cox model is widely used to model these survival data and allows them to be linked to explanatory variables. The RNA-seq data from TCGA contain more than 20,000 genes for only a few hundred patients. The number p of variables then exceeds the number n of patients, and parameter estimation is subject to the "curse of dimensionality". The two main strategies to overcome this issue are penalization methods and gene pre-filtering. Thus, the first objective of this thesis is to compare the classical penalization methods for the Cox model (i.e. ridge, lasso, elastic net, adaptive elastic net). To this end, we use real and simulated data to control the amount of information contained in the transcriptomic data. The second issue addressed concerns the univariate pre-filtering of genes before using a multivariate Cox model. We propose a methodology to increase the stability of the selected genes and to choose the filtering thresholds by optimizing the predictions. Finally, although the cost of sequencing (RNA-seq) has decreased drastically over the last decade, it remains too high for routine use in practice. In a final section, we show that the sequencing depth of miRNAs can be reduced without degrading the quality of predictions for some TCGA cancers, but not for others.
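The penalized Cox models compared in this abstract share one generic form, written below in a common elastic net parameterization. The notation is an assumption, not taken from the thesis: (t_i, δ_i, x_i) denote the follow-up time, the event indicator, and the covariate vector of patient i, and ties between event times are ignored.

```latex
% Cox partial log-likelihood (no tied event times)
\ell(\beta) = \sum_{i \,:\, \delta_i = 1}
  \Big[ x_i^{\top}\beta
        - \log \sum_{j \,:\, t_j \ge t_i} \exp\!\big(x_j^{\top}\beta\big) \Big]

% Elastic net penalized estimator
\hat{\beta} = \arg\max_{\beta}
  \Big\{ \ell(\beta)
         - \lambda \Big[ \alpha \,\|\beta\|_1
                         + \tfrac{1-\alpha}{2}\,\|\beta\|_2^2 \Big] \Big\},
\qquad \lambda \ge 0,\; \alpha \in [0, 1].
```

Taking α = 1 gives the lasso and α = 0 gives ridge; adaptive variants replace the L1 term by a weighted sum with one weight per gene.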
Li, Weiyu. "Quelques contributions à l'estimation des modèles définis par des équations estimantes conditionnelles." Thesis, Rennes 1, 2015. http://www.theses.fr/2015REN1S065/document.
In this dissertation we study statistical models defined by conditional estimating equations. Many statistical models can be stated in this form (mean regression, quantile regression, transformation models, instrumental variable models, etc.). We consider models with a finite-dimensional unknown parameter, as well as semiparametric models involving an additional infinite-dimensional parameter. In the latter case, we focus on single-index models, which offer an appealing compromise between parametric specifications, which are simple and lead to accurate estimates but are too restrictive and likely to be misspecified, and nonparametric approaches, which are flexible but suffer from the curse of dimensionality. In particular, we study single-index models in the presence of random censoring. The guiding thread of our study is a U-statistic that makes it possible to estimate the unknown parameters in a wide spectrum of models.
Shehzad, Muhammad Ahmed. "Pénalisation et réduction de la dimension des variables auxiliaires en théorie des sondages." Phd thesis, Université de Bourgogne, 2012. http://tel.archives-ouvertes.fr/tel-00812880.
Alquier, Pierre. "Contributions à l'apprentissage statistique dans les modèles parcimonieux." Habilitation à diriger des recherches, Université Pierre et Marie Curie - Paris VI, 2013. http://tel.archives-ouvertes.fr/tel-00915505.
Vasseur, Yann. "Inférence de réseaux de régulation orientés pour les facteurs de transcription d'Arabidopsis thaliana et création de groupes de co-régulation." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLS475/document.
This thesis deals with the characterisation of key genes in gene expression regulation, called transcription factors, in the plant Arabidopsis thaliana. Using expression data, our biological goal is to cluster transcription factors into groups of co-regulator transcription factors and into groups of co-regulated transcription factors. To do so, we propose a two-step procedure. First, we infer the network of regulation between transcription factors. Second, we cluster transcription factors based on their patterns of connection to other transcription factors.
From a statistical point of view, the transcription factors are the variables and the samples are the observations. The regulatory network between the transcription factors is modelled using a directed graph, where variables are nodes. The estimation of the nodes can be interpreted as a variable selection problem. To infer the network, we perform LASSO-type penalised linear regression. A preliminary approach selects a set of variables along the regularisation path using a penalised likelihood criterion. However, this approach is unstable and selects too many variables. To overcome this difficulty, we propose to put two selection procedures in competition, both designed to deal with high-dimensional data and combining penalised linear regression with subsampling. The parameters of the two procedures are estimated so as to select a stable set of variables. The stability of the results is evaluated on data simulated under a graphical model. Subsequently, we use an unsupervised clustering method on each inferred directed graph to detect groups of co-regulators and groups of co-regulated transcription factors. To evaluate the proximity between the two classifications, we have developed an index for comparing pairs of partitions, whose relevance is tested and argued for. From a practical point of view, we propose a cascade simulation method, required to respect the model complexity and inspired by the parametric bootstrap, to simulate data under our model. We have validated our model by inspecting the proximity between the two classifications on simulated and real data.
St-Onge, Pascal. "Détection et caractérisation des interactions dans les maladies complexes." Thèse, 2007. http://hdl.handle.net/1866/7963.