Theses on the topic "Imputation de données manquantes" (imputation of missing data)
Consult the 50 best theses for your research on the topic "Imputation de données manquantes".
Bernard, Francis. "Méthodes d'analyse des données incomplètes incorporant l'incertitude attribuable aux valeurs manquantes". Mémoire, Université de Sherbrooke, 2013. http://hdl.handle.net/11143/6571.
Audigier, Vincent. "Imputation multiple par analyse factorielle : Une nouvelle méthodologie pour traiter les données manquantes". Thesis, Rennes, Agrocampus Ouest, 2015. http://www.theses.fr/2015NSARG015/document.
This thesis proposes new multiple imputation methods based on principal component methods, which were initially used for exploratory analysis and visualisation of continuous, categorical and mixed multidimensional data. The study of principal component methods for imputation, never previously attempted, offers the possibility to deal with many types and sizes of data, because dimensionality reduction limits the number of estimated parameters. First, we describe a single imputation method based on factor analysis of mixed data. We study its properties and focus on its ability to handle complex relationships between variables, as well as infrequent categories. Its high prediction quality is highlighted with respect to the state-of-the-art single imputation method based on random forests. Next, a multiple imputation method for continuous data using principal component analysis (PCA) is presented, based on a Bayesian treatment of the PCA model. Unlike standard methods based on Gaussian models, it can still be used when the number of variables is larger than the number of individuals and when correlations between variables are strong. Finally, a multiple imputation method for categorical data using multiple correspondence analysis (MCA) is proposed. The variability of prediction of missing values is introduced via a non-parametric bootstrap approach, which helps to tackle the combinatorial issues arising from the large number of categories and variables. We show that multiple imputation using MCA outperforms the best current methods.
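Methods from this family are available in the missMDA R package (which includes the MIMCA procedure proposed in this line of work). A minimal sketch, assuming missMDA's documented interface; `X` is a hypothetical data frame of continuous variables containing NAs:

```r
# Multiple imputation with principal component methods (missMDA).
library(missMDA)

ncp <- estim_ncpPCA(X)$ncp               # number of dimensions chosen by cross-validation
mi  <- MIPCA(X, ncp = ncp, nboot = 100)  # 100 completed tables via PCA-based imputation
imputed_list <- mi$res.MI                # list of imputed datasets for downstream analyses
```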
Héraud, Bousquet Vanina. "Traitement des données manquantes en épidémiologie : application de l’imputation multiple à des données de surveillance et d’enquêtes". Thesis, Paris 11, 2012. http://www.theses.fr/2012PA11T017/document.
The management of missing values is a common and widespread problem in epidemiology. The most commonly used technique restricts the analysis to subjects with complete information on the variables of interest, which can substantially reduce statistical power and precision and may also result in biased estimates. This thesis investigates the application of multiple imputation methods to manage missing values in epidemiological studies and surveillance systems for infectious diseases. The study designs to which multiple imputation was applied were diverse: a risk analysis of HIV transmission through blood transfusion, a case-control study on risk factors for Campylobacter infection, and a capture-recapture study to estimate the number of new HIV diagnoses among children. We then performed multiple imputation analysis on data from a surveillance system for chronic hepatitis C (HCV) to assess risk factors for severe liver disease among HCV-infected patients who reported drug use. Within this study on HCV, we proposed guidelines for applying a sensitivity analysis in order to test the hypotheses underlying multiple imputation. Finally, we describe how we elaborated and applied an ongoing multiple imputation process for the French national HIV surveillance database, and evaluated and attempted to validate the multiple imputation procedures. Based on these practical applications, we worked out a strategy for handling missing data in surveillance databases, including the thorough examination of the incomplete database, the building of the imputation model, and the procedure to validate imputation models and examine the underlying multiple imputation hypotheses.
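For reference, a skeleton of the kind of multiple-imputation workflow used in such epidemiological analyses, using the mice R package; the dataset `surv_data`, the outcome `case` and the covariates are hypothetical placeholders:

```r
# Multiple imputation, per-imputation analysis, and pooling via Rubin's rules.
library(mice)

imp  <- mice(surv_data, m = 20, method = "pmm", seed = 1)       # 20 imputed datasets
fits <- with(imp, glm(case ~ exposure + age + sex, family = binomial))
pool(fits)                                                      # combined estimates
```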
Croiseau, Pascal. "Influence et traitement des données manquantes dans les études d'association sur trios : application à des données sur la sclérose en plaques". Paris 11, 2008. http://www.theses.fr/2008PA112021.
To test for association between a set of markers and a disease, or to estimate disease risks, different methods have been developed. Several of these methods require that all individuals be genotyped for all markers. When this is not the case, individuals with missing data are discarded. We have shown that this solution, which leads to a strong decrease in sample size, can involve a loss of power to detect an association and can also lead to false conclusions. In this work, we adapted to genetic data a method of "multiple imputation" that consists in replacing missing data by plausible values. Results obtained from simulated data show that this approach is promising for the search for disease susceptibility genes. It is simple to use and very flexible in terms of the genetic models that can be tested. We applied our method to a sample of 450 multiple sclerosis family trios (an affected child and both parents). Recent works have detected an association between a polymorphism of the CTLA4 gene and multiple sclerosis. However, CTLA4 belongs to a cluster of three genes, CD28, CTLA4 and ICOS, all involved in the immune response. Consequently, this association could be due to another marker in linkage disequilibrium with CTLA4. Our method allowed us to detect the association with the CTLA4 polymorphism and also provided a new candidate to explore: a CD28 polymorphism which could be involved in multiple sclerosis in interaction with the CTLA4 polymorphism.
Etourneau, Lucas. "Contrôle du FDR et imputation de valeurs manquantes pour l'analyse de données de protéomiques par spectrométrie de masse". Electronic Thesis or Diss., Université Grenoble Alpes, 2024. http://www.theses.fr/2024GRALS001.
Proteomics involves characterizing the proteome of a biological sample, that is, the set of proteins it contains, as exhaustively as possible. By identifying and quantifying protein fragments analyzable by mass spectrometry (known as peptides), proteomics provides access to the level of gene expression at a given moment. This is crucial information for improving the understanding of molecular mechanisms at play within living organisms. These experiments produce large amounts of data, often complex to interpret and subject to various biases. They require reliable data processing methods that ensure a certain level of quality control, so as to guarantee the relevance of the resulting biological conclusions. The work of this thesis focuses on improving this data processing, specifically on the following two major points. The first is controlling the false discovery rate (FDR) when identifying either (1) peptides or (2) quantitatively differential biomarkers between a tested biological condition and its negative control. Our contributions focus on establishing links between the empirical methods stemming from proteomic practice and other theoretically supported methods. This notably allows us to provide directions for improving the FDR control methods used for peptide identification. The second point focuses on managing missing values, which are often numerous and complex in nature, making them impossible to ignore. Specifically, we have developed a new imputation algorithm that leverages the specificities of proteomics data. Our algorithm has been tested and compared to other methods on multiple datasets and according to various metrics, and it generally achieves the best performance. Moreover, it is the first algorithm that allows imputation following the trending "multi-omics" paradigm: when relevant to the experiment, it can impute more reliably by relying on transcriptomic information, which quantifies the level of messenger RNA expression present in the sample. Finally, the resulting algorithm, Pirat, is implemented in a freely available software package, making it easy to use for the proteomics community.
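As an aside, the generic ingredient of FDR control — limiting the expected proportion of false discoveries in a list of p-values — can be illustrated with the Benjamini-Hochberg procedure in base R. This is a textbook illustration only, not the target-decoy procedures studied in the thesis, and the p-values are made up:

```r
# Benjamini-Hochberg FDR control on hypothetical peptide-level p-values.
pvals <- c(0.0002, 0.004, 0.019, 0.03, 0.41, 0.55, 0.88)
qvals <- p.adjust(pvals, method = "BH")   # BH-adjusted p-values
which(qvals <= 0.05)                      # discoveries at a 5% FDR threshold
```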
Héraud, Bousquet Vanina. "Traitement des données manquantes en épidémiologie : Application de l'imputation multiple à des données de surveillance et d'enquêtes". Phd thesis, Université Paris Sud - Paris XI, 2012. http://tel.archives-ouvertes.fr/tel-00713926.
Lorga, Da Silva Ana. "Tratamento de dados omissos e métodos de imputação em classificação". Doctoral thesis, Instituto Superior de Economia e Gestão, 2005. http://hdl.handle.net/10400.5/3849.
In this work we aimed to study the effect of missing data in the classification of variables, mainly in ascending hierarchical classification, according to the following factors: amount of missing data, imputation techniques, similarity coefficients and classification criteria. As techniques in the presence of missing data we used listwise and pairwise deletion; as simple imputation methods, the EM algorithm, the OLS regression method, the NIPALS algorithm and a PLS regression method. For multiple imputation, we used a method based on OLS regression and a new one based on PLS, combined by the mean of the similarity matrices and an ordinal consensus. As hierarchical methods we used classical and probabilistic approaches, the latter based on the VL family. The hierarchical methods used were single, complete and average linkage, AVL and AVB. For the similarity matrices we used the basic affinity coefficient (for continuous data), which corresponds to the Ochiai index for binary data; Pearson's correlation coefficient; and the probabilistic approach of the affinity coefficient, centered and reduced by the W-method. The study was based mainly on simulated data, complemented by real data. We used the Spearman coefficient between the associated ultrametrics to compare the structures of the hierarchical classifications and, for the non-hierarchical classifications, Rand's index.
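A small base-R sketch of the comparison criterion used in this study: correlate, with Spearman's coefficient, the ultrametrics (cophenetic distances) of hierarchies built on complete versus imputed data. `X_full` and `X_imp` are hypothetical data matrices, and 1 - |correlation| stands in for the similarity coefficients listed above:

```r
# Compare variable hierarchies from complete vs. imputed data.
d_full <- as.dist(1 - abs(cor(X_full)))        # dissimilarity between variables
d_imp  <- as.dist(1 - abs(cor(X_imp)))
h_full <- hclust(d_full, method = "average")   # average linkage, as in the study
h_imp  <- hclust(d_imp,  method = "average")

cor(as.vector(cophenetic(h_full)),             # Spearman between ultrametrics
    as.vector(cophenetic(h_imp)), method = "spearman")
```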
Marti, Soler Helena. "Modélisation des données d'enquêtes cas-cohorte par imputation multiple : Application en épidémiologie cardio-vasculaire". Phd thesis, Université Paris Sud - Paris XI, 2012. http://tel.archives-ouvertes.fr/tel-00779739.
Geronimi, Julia. "Contribution à la sélection de variables en présence de données longitudinales : application à des biomarqueurs issus d'imagerie médicale". Thesis, Paris, CNAM, 2016. http://www.theses.fr/2016CNAM1114/document.
Clinical studies enable us to measure many longitudinal variables. When our goal is to find a link between a response and some covariates, one can use regularization methods such as the LASSO, which has been extended to Generalized Estimating Equations (GEE). These allow us to select a subgroup of variables of interest while taking into account intra-patient correlations. Databases often have unfilled fields and measurement problems resulting in inevitable missing data. The objective of this thesis is to integrate missing data into variable selection in the presence of longitudinal data. We use multiple imputation and introduce a new imputation function for the specific case of variables under a detection limit. We provide a new variable selection method for correlated data that integrates missing data: the Multiple Imputation Penalized Generalized Estimating Equations (MI-PGEE). Our operator applies the group-LASSO penalty to the group of estimated regression coefficients of the same variable across multiply-imputed datasets. Our method provides a consistent selection across multiply-imputed datasets, where the optimal shrinkage parameter is chosen by minimizing a BIC-like criterion. We then present an application on knee osteoarthritis aiming to select the subset of biomarkers that best explain the differences in joint space width over time.
Mehanna, Souheir. "Data quality issues in mobile crowdsensing environments". Electronic Thesis or Diss., université Paris-Saclay, 2023. http://www.theses.fr/2023UPASG053.
Mobile crowdsensing has emerged as a powerful paradigm for harnessing the collective sensing capabilities of mobile devices to gather diverse data in real-world settings. However, ensuring the quality of the data collected in mobile crowdsensing (MCS) environments remains a challenge, because low-cost nomadic sensors can be prone to malfunctions, faults, and points of failure. The quality of the collected data can significantly impact the results of subsequent analyses, so monitoring the quality of sensor data is crucial for effective analytics. In this thesis, we have addressed some of the issues related to data quality in mobile crowdsensing environments. First, we have explored issues related to data completeness. The mobile crowdsensing context has specific characteristics that are not all captured by existing factors and metrics. We have proposed a set of quality factors of data completeness suitable for mobile crowdsensing environments, together with a set of metrics to evaluate each of these factors. In order to improve data completeness, we have tackled the problem of generating missing values. Existing data imputation techniques generate missing values by relying on existing measurements without considering the disparate quality levels of these measurements. We propose a quality-aware data imputation approach that extends existing data imputation techniques by taking into account the quality of the measurements. In the second part of our work, we have focused on anomaly detection, another major problem that sensor data face. Existing anomaly detection approaches use available data measurements to detect anomalies and are oblivious to the quality of the measurements. In order to improve the detection of anomalies, we propose an approach relying on clustering algorithms that detects pattern anomalies while integrating the quality of the sensor into the algorithm. Finally, we have studied how data quality could be taken into account when analyzing sensor data. We have proposed contributions which are a first step towards quality-aware sensor data analytics, consisting of quality-aware aggregation operators and an approach that evaluates the quality of a given aggregate considering the data used in its computation.
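A toy sketch of the quality-aware idea on a single reading: weight each neighboring sensor's value by a quality score in [0, 1]. Names and data are hypothetical; the thesis extends full imputation techniques in this spirit, not a plain weighted mean:

```r
# Quality-weighted estimate of one missing sensor reading.
impute_quality_aware <- function(neighbor_values, quality_scores) {
  ok <- !is.na(neighbor_values)
  sum(neighbor_values[ok] * quality_scores[ok]) / sum(quality_scores[ok])
}

impute_quality_aware(neighbor_values = c(21.3, 22.1, NA, 35.0),
                     quality_scores  = c(0.9, 0.8, 0.7, 0.1))  # low-quality outlier downweighted
```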
Chion, Marie. "Développement de nouvelles méthodologies statistiques pour l'analyse de données de protéomique quantitative". Thesis, Strasbourg, 2021. http://www.theses.fr/2021STRAD025.
Proteomic analysis consists of studying all the proteins expressed by a given biological system, at a given time and under given conditions. Recent technological advances in mass spectrometry and liquid chromatography make it possible to envisage large-scale and high-throughput proteomic studies. This thesis work focuses on developing statistical methodologies for the analysis of quantitative proteomics data and presents three main contributions. The first part proposes to use monotone spline regression models to estimate the amounts of all peptides detected in a sample using internal standards labelled for a subset of targeted peptides. The second part presents a strategy to account for the uncertainty induced by the multiple imputation process in the differential analysis, also implemented in the mi4p R package. Finally, the third part proposes a Bayesian framework for differential analysis, which notably makes it possible to consider the correlations between the intensities of peptides.
Phan, Thi-Thu-Hong. "Elastic matching for classification and modelisation of incomplete time series". Thesis, Littoral, 2018. http://www.theses.fr/2018DUNK0483/document.
Missing data are a prevalent problem in many domains of pattern recognition and signal processing. Most of the existing techniques in the literature suffer from one major drawback: their inability to process incomplete datasets. Missing data produce a loss of information and thus yield inaccurate data interpretation, biased results or unreliable analyses, especially for large missing sub-sequences. This thesis therefore focuses on dealing with large consecutive missing values in univariate and low- or un-correlated multivariate time series. We begin by investigating an imputation method to overcome these issues in univariate time series, based on the combination of a shape-feature extraction algorithm and the Dynamic Time Warping method. A new R package, namely DTWBI, is then developed. In the following work, the DTWBI approach is extended to complete large successive missing data in low- or un-correlated multivariate time series (called DTWUMI), and a DTWUMI R package is also established. The key idea of these two methods is to use elastic matching to retrieve similar values in the series before and/or after the missing values. This preserves as much as possible the dynamics and shape of the known data, while the shape-feature extraction algorithm reduces the computing time. Subsequently, we introduce a new method for filling large successive missing values in low- or un-correlated multivariate time series, namely FSMUMI, which enables a high level of uncertainty to be managed; to this end, we propose novel fuzzy grades of basic similarity measures and fuzzy logic rules. Finally, we employ DTWBI to (i) complete the MAREL Carnot dataset and perform a detection of rare/extreme events in this database, and (ii) forecast various meteorological univariate time series collected in Vietnam.
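A conceptual sketch of DTW-based gap filling in the spirit of DTWBI, using the dtw R package: find the past window most similar (under elastic matching) to the window preceding the gap, then copy the segment that followed it. The actual DTWBI package adds shape-feature pre-selection and other refinements; this simplified version assumes the gap does not start near the series boundary:

```r
# Fill a gap of gap_len consecutive NAs starting at gap_start in series x.
library(dtw)

fill_gap_dtw <- function(x, gap_start, gap_len, win = gap_len) {
  query  <- x[(gap_start - win):(gap_start - 1)]            # window just before the gap
  starts <- which(!is.na(x))                                # candidate window starts
  starts <- starts[starts + win + gap_len - 1 < gap_start]  # search only the past
  d <- sapply(starts, function(s) {
    ref <- x[s:(s + win - 1)]
    if (anyNA(ref)) return(Inf)
    dtw(query, ref)$distance                                # elastic similarity
  })
  best <- starts[which.min(d)]                              # most similar past window
  x[gap_start:(gap_start + gap_len - 1)] <- x[(best + win):(best + win + gap_len - 1)]
  x
}
```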
Dupuy, Mariette. "Analyse des caractéristiques électriques pour la détection des sujets à risque de mort subite cardiaque". Electronic Thesis or Diss., Bordeaux, 2025. http://www.theses.fr/2025BORD0002.
Sudden cardiac death (SCD) accounts for 30% of adult mortality in industrialized countries. The majority of SCD cases are the result of an arrhythmia called ventricular fibrillation, which itself results from structural abnormalities in the heart muscle. Despite the existence of effective therapies, most individuals at risk for SCD are not identified preventively due to the lack of available testing. Developing specific markers on electrocardiographic recordings would enable the identification and stratification of SCD risk. Over the past six years, the Liryc Institute has recorded surface electrical signals from over 800 individuals (both healthy and pathological) using a high-resolution 128-electrode device. Features were calculated from these signals (signal duration per electrode, frequency, amplitude fractionation, etc.); in total, more than 1,500 electrical features are available per patient. During acquisition with the 128-electrode system in a hospital setting, noise or poor positioning of specific electrodes sometimes prevents calculating the intended features, leading to an incomplete database. This thesis is organized around two main axes. First, we developed a method for imputing missing data to address the problem of faulty electrodes. Then, we developed a risk score for sudden-death risk stratification. The most commonly used family of methods for handling missing data is imputation, ranging from simple completion by averaging to local aggregation methods, local regressions, optimal transport, or even modifications of generative models. Recently, Autoencoders (AE) and, more specifically, Denoising AutoEncoders (DAE) have performed well in this task. AEs are neural networks used to learn a representation of data in a reduced-dimensional space; DAEs are AEs that have been proposed to reconstruct original data from noisy data. In this work, we propose a new DAE-based methodology, the modified Denoising AutoEncoder (mDAE), for the imputation of missing data. The second research axis of the thesis focused on developing a risk score for sudden cardiac death. DAEs can model and reconstruct complex data. We trained DAEs to model the distribution of healthy individuals based on a selected subset of electrical features, and then used these DAEs to discriminate pathological patients from healthy individuals by analyzing the imputation quality of the DAE on partially masked features. We also compared different classification methods to establish a risk score for sudden death.
Moreno, Betancur Margarita. "Regression modeling with missing outcomes : competing risks and longitudinal data". Thesis, Paris 11, 2013. http://www.theses.fr/2013PA11T076/document.
Missing data are a common occurrence in medical studies. In regression modeling, missing outcomes limit our capability to draw inferences about the covariate effects of medical interest, which are those describing the distribution of the entire set of planned outcomes. In addition to losing precision, the validity of any method used to draw inferences from the observed data will require that some assumption about the mechanism leading to missing outcomes holds. Rubin (1976, Biometrika, 63:581-592) called the missingness mechanism MAR (for "missing at random") if the probability of an outcome being missing does not depend on missing outcomes when conditioning on the observed data, and MNAR (for "missing not at random") otherwise. This distinction has important implications regarding the modeling requirements to draw valid inferences from the available data, but generally it is not possible to assess from these data whether the missingness mechanism is MAR or MNAR. Hence, sensitivity analyses should be routinely performed to assess the robustness of inferences to assumptions about the missingness mechanism. In the field of incomplete multivariate data, in which the outcomes are gathered in a vector for which some components may be missing, MAR methods are widely available and increasingly used, and several MNAR modeling strategies have also been proposed. On the other hand, although some sensitivity analysis methodology has been developed, this is still an active area of research. The first aim of this dissertation was to develop a sensitivity analysis approach for continuous longitudinal data with drop-outs, that is, continuous outcomes that are ordered in time and completely observed for each individual up to a certain time-point, at which the individual drops out so that all the subsequent outcomes are missing. The proposed approach consists in assessing the inferences obtained across a family of MNAR pattern-mixture models indexed by a so-called sensitivity parameter that quantifies the departure from MAR. The approach was prompted by a randomized clinical trial investigating the benefits of a treatment for sleep-maintenance insomnia, from which 22% of the individuals had dropped out before the study end. The second aim was to build on the existing theory for incomplete multivariate data to develop methods for competing risks data with missing causes of failure. The competing risks model is an extension of the standard survival analysis model in which failures from different causes are distinguished. Strategies for modeling competing risks functionals, such as the cause-specific hazards (CSH) and the cumulative incidence function (CIF), generally assume that the cause of failure is known for all patients, but this is not always the case. Some methods for regression with missing causes under the MAR assumption have already been proposed, especially for semi-parametric modeling of the CSH. But other useful models have received little attention, and MNAR modeling and sensitivity analysis approaches have never been considered in this setting. We propose a general framework for semi-parametric regression modeling of the CIF under MAR using inverse probability weighting and multiple imputation ideas. Also under MAR, we propose a direct likelihood approach for parametric regression modeling of the CSH and the CIF. Furthermore, we consider MNAR pattern-mixture models in the context of sensitivity analyses.
In the competing risks literature, a starting point for methodological developments for handling missing causes was a stage II breast cancer randomized clinical trial in which 23% of the deceased women had a missing cause of death. We use these data to illustrate the practical value of the proposed approaches.
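A sketch of the delta-adjustment flavor of pattern-mixture sensitivity analysis described above, using the mice R package: impute under MAR, shift the imputed outcomes of drop-outs by a sensitivity parameter delta, and track how the inference changes. The data frame `dat` and the variables `y`, `dropout` and `treatment` are hypothetical placeholders:

```r
# Pattern-mixture sensitivity analysis via delta adjustment.
library(mice)

imp <- mice(dat, m = 20, seed = 1, printFlag = FALSE)   # MAR imputations
for (delta in c(0, -2, -4, -8)) {                       # delta = 0 corresponds to MAR
  est <- sapply(1:20, function(k) {
    comp  <- complete(imp, k)
    shift <- is.na(dat$y) & dat$dropout == 1            # cells imputed for drop-outs
    comp$y[shift] <- comp$y[shift] + delta              # MNAR departure of size delta
    coef(lm(y ~ treatment, data = comp))["treatment"]
  })
  cat("delta =", delta, " pooled effect =", mean(est), "\n")  # point estimates only
}
```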
Faucheux, Lilith. "Learning from incomplete biomedical data : guiding the partition toward prognostic information". Electronic Thesis or Diss., Université Paris Cité, 2021. http://www.theses.fr/2021UNIP5242.
The topic of this thesis is partition learning analyses in the context of incomplete data. Two methodological developments are presented, with two medical and biomedical applications. The first methodological development concerns the implementation of unsupervised partition learning in the presence of incomplete data. Two types of incomplete data were considered: missing data and left-censored data (that is, values "lower than some detection threshold"), handled through the multiple imputation (MI) framework. Multivariate imputation by chained equations (MICE) was used to perform tailored imputations for each type of incomplete data. Then, for each imputed dataset, unsupervised learning was performed, with a data-based selected number of clusters. Last, a consensus clustering algorithm was used to pool the partitions, as an alternative to Rubin's rules. The second methodological development concerns the implementation of semi-supervised partition learning in an incomplete dataset, to combine data structure and patient survival. This aimed at identifying patient profiles that relate both to differences in the group structure extracted from the data and to differences in the patients' prognosis. The supervised (prognostic value) and unsupervised (group structure) objectives were combined through Pareto multi-objective optimization. Missing data were handled, as above, through MI, with Rubin's rules used to combine the supervised and unsupervised objectives across the imputations, and the optimal partitions pooled using consensus clustering. Two applications are provided: one on the immunological landscape of the breast tumor microenvironment, and another on COVID-19 infection in the context of a hematological disease.
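A sketch of the pooling idea used in the first development: cluster each multiply-imputed dataset and derive the final partition from a co-association (consensus) matrix rather than from Rubin's rules. `dat` (continuous variables) and the number of clusters are hypothetical:

```r
# Multiple imputation followed by consensus clustering.
library(mice)

m   <- 10
imp <- mice(dat, m = m, printFlag = FALSE)
n   <- nrow(dat)
co  <- matrix(0, n, n)                           # co-association matrix

for (k in 1:m) {
  cl <- kmeans(scale(complete(imp, k)), centers = 3, nstart = 20)$cluster
  co <- co + outer(cl, cl, "==")                 # 1 when subjects i and j co-cluster
}

final <- cutree(hclust(as.dist(1 - co / m)), k = 3)   # consensus partition
```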
Nadif, Mohamed. "Classification automatique et données manquantes". Metz, 1991. http://docnum.univ-lorraine.fr/public/UPV-M/Theses/1991/Nadif.Mohamed.SMZ912.pdf.
Silva, Gonçalves da Costa Lorga da Ana Isabel. "Données manquantes et méthodes d'imputation en classification". Paris, CNAM, 2005. http://www.theses.fr/2005CNAM0719.
Bahamonde, Natalia. "Estimation de séries chronologiques avec données manquantes". Paris 11, 2007. http://www.theses.fr/2007PA112115.
El-Taib, El-Rafehi Ahmed. "Estimation des données manquantes dans les séries chronologiques". Montpellier 2, 1992. http://www.theses.fr/1992MON20239.
Barhoumi, Mohamed Adel. "Traitement des données manquantes dans les données de panel : cas des variables dépendantes dichotomiques". Thesis, Université Laval, 2006. http://www.theses.ulaval.ca/2006/23619/23619.pdf.
Fiot, Céline. "Extraction de séquences fréquentes : des données numériques aux valeurs manquantes". Phd thesis, Montpellier 2, 2007. http://www.theses.fr/2007MON20056.
Fiot, Céline. "Extraction de séquences fréquentes : des données numériques aux valeurs manquantes". Phd thesis, Université Montpellier II - Sciences et Techniques du Languedoc, 2007. http://tel.archives-ouvertes.fr/tel-00179506.
Gu, Co Weila Vila. "Méthodes statistiques et informatiques pour le traitement des données manquantes". Phd thesis, Conservatoire national des arts et métiers - CNAM, 1997. http://tel.archives-ouvertes.fr/tel-00808585.
Ladjouze, Salim. "Problèmes d'estimation dans les séries temporelles stationnaires avec données manquantes". Phd thesis, Université Joseph Fourier (Grenoble ; 1971-2015), 1986. http://tel.archives-ouvertes.fr/tel-00319946.
Demange, Sébastien. "Contributions à la reconnaissance automatique de la parole avec données manquantes". Phd thesis, Université Henri Poincaré - Nancy I, 2007. http://tel.archives-ouvertes.fr/tel-00187953.
Resseguier, Noémie. "Méthodes de gestion des données manquantes en épidémiologie : Application en cancérologie". Thesis, Aix-Marseille, 2013. http://www.theses.fr/2013AIXM5063.
Texto completoThe issue of how to deal with missing data in epidemiological studies is a topic which concerns every researcher involved in the analysis of collected data and in the interpretation of the results produced by these analyses. And even if the issue of the handling of missing data and of their impact on the validity of the results is often discussed, simple, but not always appropriate methods to deal with missing data are commonly used. The use of each of these methods is based on some hypotheses under which the obtained results are valid, but it is not always possible to test these hypotheses. The objective of this work was (i) to propose a review of various methods to handle missing data used in the field of epidemiology, and to discuss the advantages and disadvantages of each of these methods, (ii) to propose a strategy of analysis in order to study the robustness of the results obtained via classical methods to handle missing data to the departure from hypotheses which are required for the validity of these results, although they are not testable, and (iii) to propose some applications on real data of the issues discussed in the first two sections
Vidal, Vincent. "Échantillonnage de Gibbs avec augmentation de données et imputation multiple". Thesis, Université Laval, 2006. http://www.theses.ulaval.ca/2006/23906/23906.pdf.
Dellagi, Hatem. "Estimations paramétrique et non paramétrique des données manquantes : application à l'agro-climatologie". Paris 6, 1994. http://www.theses.fr/1994PA066546.
Ben, Othman Amroussi Leila. "Conception et validation d'une méthode de complétion des valeurs manquantes fondée sur leurs modèles d'apparition". Caen, 2011. http://www.theses.fr/2011CAEN2067.
Texto completoKnowledge Discovery from incomplete databases is a thriving research area. In this thesis, the main focus is put on the proposal of a missing values completion method. We start approaching this issue by defining the appearing models of the missing values. We thus propose a new typology according to the given data and we characterize these missing values in a non-redundant manner defined by means of the basis of proper implications. An algorithm computing this basis of rules, heavily relying on the hypergraph theory battery of results, is also introduced in this thesis. We then explore the information provided during the characterization stage in order to propose a new contextual completion method. The latter completes the missing values with respect to their type as well as to their appearance context. The non-random missing values are completed with special values intrinsically containing the explanation defined by the characterization schemes. Finally, we investigate the evaluation techniques of the missing values completion methods and we introduce a new technique based on the stability of a clustering, when applied on reference data and completed ones
Yuan, Shuning. "Méthodes d'analyse de données GPS dans les enquêtes sur la mobilité des personnes : les données manquantes et leur estimation". Paris 1, 2010. http://www.theses.fr/2010PA010074.
Nguyen, Dinh Tuan. "Propriétés asymptotiques et inférence avec des données manquantes pour les modèles de maintenance imparfaite". Thesis, Troyes, 2015. http://www.theses.fr/2015TROY0034/document.
Texto completoThe thesis analyses imperfect maintenance processes of industrial systems by statistical models. Imperfect maintenance is an intermediate situation of two extremes ones: minimal maintenance where the system is restored to the state immediately prior to failure, and perfect maintenance where the system is renewed after the failure. Analytical expressions of reliability quantities of an imperfect maintenance model are developed. The convergence of the model is highlighted and the asymptotic expressions are proposed. The results are applied to build some preventive maintenance policies that contain only imperfect maintenances. The second part of the thesis consists of analyzing failure data contained in observation windows. An observation window is a period of the entire functioning history that only the events occurring in this period are recorded. The modelling and the inference are based on the convergence property or the modelling of initial age. Finally, Bayesian inference of an imperfect maintenance model is presented. The impact of the choices of a priori distributions is analyzed by numerical simulations. A selection method of imperfect maintenance models using the Bayes factor is also introduced.The statistical modelling in each section is applied to real data
Rioult, François. "Extraction de connaissances dans les bases de données comportant des valeurs manquantes ou un grand nombre d'attributs". Caen, 2005. http://www.theses.fr/2005CAEN2035.
El, Abed Abir. "Suivi multi-objets par filtrage particulaire dans un contexte de données incomplètes et/ou manquantes". Paris 6, 2008. http://www.theses.fr/2008PA066304.
Morisot, Adeline. "Méthodes d'analyse de survie, valeurs manquantes et fractions attribuables temps dépendantes : application aux décès par cancer de la prostate". Thesis, Montpellier, 2015. http://www.theses.fr/2015MONTT010/document.
The term survival analysis refers to methods for modeling the time of occurrence of one or more events, taking censoring into account. The event of interest may be the onset or the recurrence of a disease, or death. The causes of death may have missing values, which may be handled by imputation methods. In the first section of this thesis we review the methods used to deal with these missing data, and then detail the procedures that enable multiple imputation of causes of death. We developed these methods in a subset of the ERSPC (European Randomized Study of Screening for Prostate Cancer), which studied screening and mortality for prostate cancer. We propose a theoretical formulation of Rubin's rules after a complementary log-log transformation to combine estimates of survival, and we provide the related R code. In a second section, we present survival analysis methods, proposing a unified formulation based on the definitions of crude and net survival, while considering either all-cause or cause-specific death. This involves consideration of censoring, which can then be informative. We consider the traditional methods (Kaplan-Meier, Nelson-Aalen, Cox and parametric models), competing-risks methods (considering either a multistate model or a latent failure time model), cause-specific methods corrected using IPCW (Inverse Probability of Censoring Weighting), and relative survival methods. The classical methods rest on a non-informative censoring assumption. When we are interested in deaths from all causes, this assumption is often valid. However, for a particular cause of death, the other causes of death act as censoring, and this censoring is generally considered informative. We introduce an approach based on the IPCW method to correct this informative censoring, and we provide an R function to apply this approach directly. All methods presented in this chapter were applied to datasets completed by multiple imputation. Finally, in the last part, we sought to determine the percentage of deaths explained by one or more variables using attributable fractions. We present the theoretical formulations of time-independent and time-dependent attributable fractions, expressed in terms of survival. We illustrate these concepts using all the survival methods presented in section 2 and compare the results. Estimates obtained with the different methods were very similar.
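A sketch of the proposed pooling, with Rubin's rules applied on the complementary log-log scale: for m imputed datasets, pool theta_k = log(-log S_k(t)) with total variance W + (1 + 1/m)B, then back-transform. The survival estimates and standard errors below are hypothetical, and a normal approximation is used for brevity:

```r
# Rubin's rules after a cloglog transformation of survival estimates.
S_hat <- c(0.81, 0.79, 0.83, 0.80, 0.82)   # per-imputation estimates of S(t)
se_S  <- c(0.030, 0.032, 0.029, 0.031, 0.030)
m     <- length(S_hat)

theta <- log(-log(S_hat))                  # cloglog scale
se_th <- se_S / abs(S_hat * log(S_hat))    # delta-method standard errors
qbar  <- mean(theta)                       # pooled point estimate
W     <- mean(se_th^2)                     # within-imputation variance
B     <- var(theta)                        # between-imputation variance
Tvar  <- W + (1 + 1/m) * B                 # total variance (Rubin)

ci <- qbar + c(-1, 1) * 1.96 * sqrt(Tvar)
exp(-exp(rev(ci)))                         # back-transformed 95% CI for S(t)
```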
Tzompanaki, Aikaterini. "Réponses manquantes : Débogage et Réparation de requêtes". Thesis, Université Paris-Saclay (ComUE), 2015. http://www.theses.fr/2015SACLS223/document.
With the increasing amount of available data and data transformations, typically specified by queries, the need to understand them also increases. "Why are there medicine books in my sales report?" or "Why are there not any database books?" For the first question we need to find the origins, or provenance, of the result tuples in the source data. However, reasoning about missing query results, specified by Why-Not questions such as the latter, had until recently not received the attention it deserves. Why-Not questions can be answered by providing explanations for the missing tuples. These explanations identify why and how data pertinent to the missing tuples were not properly combined by the query. Essentially, the causes lie either in the input data (e.g., erroneous or incomplete data) or at the query level (e.g., a query operator like a join). Assuming that the source data contain all the necessary relevant information, we can identify the responsible query operators, forming query-based explanations. This information can then be used to propose query refinements, modifying the responsible operators of the initial query such that the refined query result contains the expected data. This thesis proposes a framework for SQL query debugging and fixing to recover missing query results based on query-based explanations and query refinements. Our contribution to query debugging consists of two different approaches. The first is a tree-based approach. First, we provide the formal framework around Why-Not questions, missing from the state of the art. Then, we review the state of the art in detail, showing how it can lead to inaccurate explanations or fail to provide an explanation at all. We further propose the NedExplain algorithm, which computes correct explanations for SPJA queries and unions thereof, thus considering more operators (aggregation) than the state of the art. Finally, we show experimentally that NedExplain improves on prior work in terms of both time performance and explanation quality. However, we show that this approach yields explanations that differ for equivalent query trees, thus providing incomplete information about what is wrong with the query. We address this issue by introducing a more general notion of explanations, using polynomials. The polynomial captures all the combinations in which the query conditions should be fixed in order for the missing tuples to appear in the result. This method targets conjunctive queries with inequalities. We further propose two algorithms: Ted, which naively interprets the definitions of polynomial explanations, and the optimized Ted++. We show that Ted does not scale well with the size of the database. Ted++, on the other hand, is capable of efficiently computing the polynomial, relying on schema and data partitioning and on the advantageous replacement of expensive database evaluations by mathematical calculations. Finally, we experimentally evaluate the quality of the polynomial explanations and the efficiency of Ted++, including a comparative evaluation. For query fixing we propose a new approach for refining a query by leveraging polynomial explanations. Based on the input data, we propose how to change the query conditions pinpointed by the explanations by adjusting the constant values of the selection conditions. In the case of joins, we introduce a novel type of query refinement using outer joins.
We further devise the techniques to compute query refinements in the FixTed algorithm, and discuss how our method has the potential to be more efficient and effective than the related work. Finally, we have implemented both Ted++ and FixTed in a system prototype. This query debugging and fixing platform, EFQ for short, allows users to interactively debug and fix their queries when they have Why-Not questions.
Bouges, Pierre. "Gestion de données manquantes dans des cascades de boosting : application à la détection de visages". Phd thesis, Université Blaise Pascal - Clermont-Ferrand II, 2012. http://tel.archives-ouvertes.fr/tel-00840842.
Bock, Dumas Élodie de. "Identification de stratégies d'analyse de variables latentes longitudinales en présence de données manquantes potentiellement informatives". Nantes, 2014. http://archive.bu.univ-nantes.fr/pollux/show.action?id=ed3dcb7e-dec1-4506-b99d-50e3448d1ce4.
Texto completoThe purpose of this study was to identify the most adequate strategy to analyse longitudinal latent variables (patient reported outcomes) when potentially informative missing data are observed. Models coming from classical test theory and Rasch-family were compared. In order to obtain an objective comparison of these methods, simulation studies were used. Moreover, illustrative examples were analysed. This research work showed that the method that comes from Rasch-family models performs better than the other in some circumstances, mainly for power. However, limitations were highlighted. Moreover, some results were obtained about personal mean score imputation
Imbert, Alyssa. "Intégration de données hétérogènes complexes à partir de tableaux de tailles déséquilibrées". Thesis, Toulouse 1, 2018. http://www.theses.fr/2018TOU10022/document.
The development of high-throughput sequencing technologies has led to a massive acquisition of high-dimensional and complex datasets. Several features make these datasets hard to analyze: high dimensionality, heterogeneity at the biological level or at the data-type level, noise in the data (due to biological heterogeneity or to errors in the data) and the presence of missing data (for given values or for an entire individual). The integration of heterogeneous data is thus an important challenge for computational biology. This thesis is part of a large clinical research project on obesity, DiOGenes, in which we have developed methods for data analysis and integration. The project is based on a dietary intervention conducted in eight European centers, which investigated the effect of macronutrient composition on weight-loss maintenance and on metabolic and cardiovascular risk factors after a phase of calorie restriction in obese individuals. My work has mainly focused on transcriptomic data analysis (RNA-Seq) with missing individuals, and on the integration of transcriptomic (new QuantSeq protocol) and clinical datasets. The first part focuses on missing data and network inference from RNA-Seq datasets. In a longitudinal study, some observations may be missing at some time steps. In order to take advantage of external information measured simultaneously with the RNA-Seq data, we propose an imputation method, hot-deck multiple imputation (hd-MI), that improves the reliability of network inference. The second part deals with an integrative study of clinical data and transcriptomic data, measured by QuantSeq, based on a network approach. The new protocol is shown to be efficient for transcriptome measurement, and we propose an analysis based on network inference that is linked to clinical variables of interest.
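A toy sketch of the hot-deck idea behind hd-MI: for an individual whose expression profile is missing, draw donors among the nearest individuals on auxiliary clinical variables, yielding several completed datasets. All object names are hypothetical placeholders:

```r
# Hot-deck multiple imputation of a missing expression profile.
hot_deck_mi <- function(expr, clinical, missing_id, n_donors = 5, m = 10) {
  d <- as.matrix(dist(scale(clinical)))[missing_id, ]   # distance on clinical data
  d[missing_id] <- Inf                                  # exclude the recipient itself
  donors <- order(d)[1:n_donors]                        # pool of nearest neighbors
  lapply(1:m, function(k) {
    expr[missing_id, ] <- expr[sample(donors, 1), ]     # borrow a random donor profile
    expr                                                # one completed dataset
  })
}
```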
Gouba, Elisée. "Identification de paramètres dans les systèmes distribuées à données manquantes : modèles mathématiques de la performance en sport". Antilles Guyane, 2010. http://www.theses.fr/2010AGUY0330.
Two topics were studied in this thesis: parameter identification in distributed systems with missing data in the first part, and mathematical models of performance in sports in the second part. The aim of the first part is to identify the permeability parameter of an oil reservoir in monophasic flow. The nonlinear model we have is a system with incomplete data, in the sense that the initial condition, the boundary conditions and some petro-physical parameters of the model are only partially known. Two approaches are possible: one using the classical method of least squares, and the other, more targeted, using the sentinel method developed by J. L. Lions. In this work, we first show that the sentinel problem is equivalent to a null controllability problem, and we then solve the null controllability problem by a variational method made possible by Carleman inequalities. The second part of this thesis is devoted to the mathematical model of performance in sports proposed by Banister in 1975. We first apply this model to physiological data of monofin swimmers, and we propose a model that improves on Banister's model.
Kezouit, Omar Abdelaziz. "Bases de données relationnelles et analyse de données : conception et réalisation d'un système intégré". Paris 11, 1987. http://www.theses.fr/1987PA112130.
Ben, Othman Leila. "Conception et validation d'une méthode de complétion des valeurs manquantes fondée sur leurs modèles d'apparition". Phd thesis, Université de Caen, 2011. http://tel.archives-ouvertes.fr/tel-01017941.
Picard, Jacques. "Structure, classification et discrimination des profils évolutifs incomplets et asynchrones". Lyon 1, 1987. http://www.theses.fr/1987LYO19044.
Peng, Tao. "Analyse de données IoT en flux". Electronic Thesis or Diss., Aix-Marseille, 2021. http://www.theses.fr/2021AIXM0649.
Since the advent of the IoT (Internet of Things), we have witnessed an unprecedented growth in the amount of data generated by sensors. To exploit these data, we first need to model them, and then to develop analytical algorithms to process them. For the imputation of missing data from a sensor f, we propose ISTM (Incremental Space-Time Model), an incremental multiple linear regression model adapted to non-stationary data streams. ISTM updates its model by selecting: 1) data from sensors located in the neighborhood of f, and 2) the most recent near-past data gathered from f. To evaluate data trustworthiness, we propose DTOM (Data Trustworthiness Online Model), a prediction model that relies on online regression ensemble methods such as AddExp (Additive Expert) and BNNRW (Bagging NNRW) to assign a trust score in real time. DTOM consists of: 1) an initialization phase, 2) an estimation phase, and 3) a heuristic update phase. Finally, we are interested in predicting multiple-output STS in the presence of imbalanced data, i.e. when there are more instances in one value interval than in another. We propose MORSTS, an online regression ensemble method with specific features: 1) the sub-models have multiple outputs, 2) the adoption of a cost-sensitive strategy, i.e. an incorrectly predicted instance receives a higher weight, and 3) the management of over-fitting by means of k-fold cross-validation. Experimentation with real data has been conducted and the results were compared with well-known techniques.
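A sketch of the incremental least-squares update underlying a model like ISTM: maintain X'X and X'y and refresh the regression coefficients as each observation streams in, with exponential forgetting for non-stationarity. Names, the regularization and the forgetting factor are illustrative, not the thesis's exact algorithm:

```r
# Incremental, forgetting-factor least squares for streaming regression.
new_istm <- function(p) list(XtX = diag(1e-6, p), Xty = numeric(p))

istm_update <- function(state, xt, yt, forget = 0.99) {
  state$XtX  <- forget * state$XtX + tcrossprod(xt)   # running xt %*% t(xt)
  state$Xty  <- forget * state$Xty + xt * yt          # running xt * yt
  state$beta <- solve(state$XtX, state$Xty)           # refreshed coefficients
  state
}

s <- new_istm(p = 3)
s <- istm_update(s, xt = c(1, 20.5, 21.1), yt = 20.9)  # intercept + 2 neighbor sensors
```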
Tabouy, Timothée. "Impact de l’échantillonnage sur l’inférence de structures dans les réseaux : application aux réseaux d’échanges de graines et à l’écologie". Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS289/document.
In this thesis we are interested in studying the stochastic block model (SBM) in the presence of missing data. We propose a classification of missing data into two categories, Missing At Random and Not Missing At Random, for latent variable models, following the framework described by D. Rubin. In addition, we describe several network sampling strategies and their distributions. The inference of SBMs with missing data is carried out through an adaptation of the EM algorithm: the EM algorithm with variational approximation. The identifiability of several SBM models with missing data is demonstrated, as well as the consistency and asymptotic normality of the maximum likelihood estimators and of the variational approximation estimators, in the case where each dyad (pair of nodes) is sampled independently and with equal probability. We also consider SBMs with covariates, their inference in the presence of missing data, and how to proceed when covariates are not available to conduct the inference. Finally, all our methods have been implemented in an R package available on CRAN, with complete documentation on its use.
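This line of work is implemented in the missSBM R package on CRAN; a minimal sketch, with the interface assumed from the package documentation (an assumption, not verified here), where `A` is a hypothetical adjacency matrix in which NAs mark unobserved dyads:

```r
# SBM inference under missing dyads (interface assumed from missSBM docs).
library(missSBM)

fits <- estimateMissSBM(A, vBlocks = 1:5, sampling = "dyad")  # variational EM per block number
best <- fits$bestModel                                        # model selected by ICL
best$fittedSBM$memberships                                    # estimated block memberships
```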
Rioult, François. "Extraction de connaissances dans les bases de données comportant des valeurs manquantes ou un grand nombre d'attributs". Phd thesis, Université de Caen, 2005. http://tel.archives-ouvertes.fr/tel-00252089.
Badran, Hussein. "Contribution à la mesure en analyse factorielle des données et applications". Aix-Marseille 3, 2001. http://www.theses.fr/2001AIX30035.
This thesis brings together a number of articles and studies grouped into two parts. The first part, mostly theoretical, concerns studies in the framework of factorial analysis. First, several questions related to probability distribution functions appearing in factorial analysis are considered, mainly concerning the evaluation and characterization of missing data. New results are then given on projective transformations that allow probability laws on compact sets to be approached. Finally, another result concerns the measure (in the sense of a given mass distribution) of two complementary subsets of convex sets defined by hyperplanes passing through the center of gravity. The second part presents a number of applications of Correspondence Factorial Analysis, showing the diversity of concrete problems that can be addressed. It offers the results of many studies conducted in France as well as in Lebanon, in the framework of several research projects that have facilitated the discovery of new information in very different sectors, ranging from the experimental and earth sciences to the economic, political and social sciences.
Ly, Birama Apho. "Prévalence et facteurs associés aux données manquantes des registres de consultations médicales des médecins des centres de santé communautaires de Bamako". Thesis, Université Laval, 2012. http://www.theses.ulaval.ca/2012/28555/28555.pdf.
Objective: This study aims to estimate the prevalence of missing data in the medical consultation registries held by physicians working in Bamako community health centers (COMHC), and to identify the factors which predict physicians' intention to collect the data in their registries completely, based on the Theory of Planned Behaviour (TPB). Method: An exploratory cross-sectional study was conducted, including a random sample of 3072 medical consultations and 32 physicians. Data were collected between January and February 2011 through a standardized extraction form and a questionnaire measuring physicians' sociodemographic and professional characteristics as well as constructs from the Theory of Planned Behaviour (TPB). Descriptive statistics, correlations and linear regression were performed. Results: All the variables contained in the medical consultation registries have missing data. However, only four variables (symptom, diagnosis, treatment and observation) have a high prevalence of missing data; the variable observation has the highest prevalence, with 95.6% missing. Physicians' intention to collect the data completely is predicted by their subjective norm and their number of years of practice. Conclusion: The results of this study should contribute to advancing knowledge on the prevalence of missing data and on possible strategies to improve the quality of the health information collected from the COMHC, which may in turn better inform decisions concerning resource allocation.
Jebri, Mohamed Ali. "Estimation des données manquantes par la métrologie virtuelle pour l'amélioration du régulateur Run-To-Run dans le domaine des semi-conducteurs". Thesis, Aix-Marseille, 2018. http://www.theses.fr/2018AIXM0028.
This work addresses virtual metrology (VM) for estimating missing data during semiconductor manufacturing processes. A virtual metrology tool makes it possible to provide software measurements (estimates) of the outputs to feed the run-to-run (R2R) controllers set up for the quality control of the manufactured products. To address the measurement delays caused by the static sampling imposed by the strategy and the equipment in place, the contribution of this thesis is to introduce the notion of dynamic sampling. This strategy is based on an algorithm that considers a neighborhood condition to avoid an actual measurement even when the static sampling requires it. This reduces the number of actual measurements, the cycle time and the cost of production. The approach is provided by a virtual metrology (VM) module that we have developed, which can be integrated into an R2R control loop. The results obtained were validated on academic examples and on real data provided by our partner STMicroelectronics of Rousset, from a chemical mechanical planarization (CMP) process. These real data also enabled the results obtained from the virtual metrology to be validated and then supplied to the R2R regulators, which need estimates of these data.
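A toy sketch of an EWMA run-to-run update of the kind such controllers use, fed either by a real measurement or by the VM estimate when the run is not actually measured. The process model y = a + b*u, the gains and all values are illustrative:

```r
# One EWMA R2R step: update the model intercept, compute the next recipe.
r2r_step <- function(a_hat, y_k, u_k, b = 1.2, target = 50, lambda = 0.3) {
  a_hat  <- lambda * (y_k - b * u_k) + (1 - lambda) * a_hat  # EWMA intercept update
  u_next <- (target - a_hat) / b                             # recipe for the next run
  list(a_hat = a_hat, u_next = u_next)
}

st <- r2r_step(a_hat = 5, y_k = 52.4, u_k = 38)   # y_k: measured or VM-estimated output
st$u_next
```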
Dantan, Etienne. "Modèles conjoints pour données longitudinales et données de survie incomplètes appliqués à l'étude du vieillissement cognitif". Thesis, Bordeaux 2, 2009. http://www.theses.fr/2009BOR21658/document.
In cognitive ageing studies, older people are highly selected by a risk of death associated with poor cognitive performance. Modeling the natural history of cognitive decline is difficult in the presence of incomplete longitudinal and survival data. Moreover, the unobserved acceleration of cognitive decline beginning before the diagnosis of dementia is difficult to evaluate. Cognitive decline is highly heterogeneous, i.e. there are various patterns associated with different risks of the survival event. The objective is to study joint models for incomplete longitudinal and survival data to describe cognitive evolution in older people. Latent variable approaches were used to take into account the unobserved mechanisms, i.e. heterogeneity and decline acceleration. First, we compared two approaches to account for missing data in longitudinal data analysis. Second, we propose a joint model with a latent state to model cognitive evolution and its pre-dementia acceleration, dementia risk and death risk.