Dissertations / Theses on the topic 'Missing Value Imputation'

To see the other types of publications on this topic, follow the link: Missing Value Imputation.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 40 dissertations / theses for your research on the topic 'Missing Value Imputation.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Aslan, Sipan. "Comparison Of Missing Value Imputation Methods For Meteorological Time Series Data." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/12612426/index.pdf.

Abstract:
Dealing with missing data in spatio-temporal time series constitutes an important branch of the general missing data problem. Since the statistical properties of time-dependent data are characterized by the sequentiality of observations, any interruption of consecutiveness in a time series will cause severe problems. In order to make reliable analyses in this case, missing data must be handled cautiously without disturbing the statistical properties of the series, chiefly its temporal and spatial dependencies. In this study we aimed to compare several imputation methods for the appropriate completion of missing values in spatio-temporal meteorological time series. For this purpose, several imputation methods were assessed on their performance for artificially created missing data in monthly total precipitation and monthly mean temperature series obtained from the climate stations of the Turkish State Meteorological Service. The artificially created missing data were estimated using six methods. Single Arithmetic Average (SAA), Normal Ratio (NR) and NR Weighted with Correlations (NRWC) are the three simple methods used in the study. In addition, we used two computationally intensive methods for missing data imputation: a Multi-Layer Perceptron type Neural Network (MLPNN) and Monte Carlo Markov Chain based on the Expectation-Maximization algorithm (EM-MCMC). We also propose a modification of the EM-MCMC method in which the results of the simple imputation methods are used as auxiliary variables. Besides using an accuracy measure based on squared errors, we propose the Correlation Dimension (CD) technique, an important subject of nonlinear dynamic time series analysis, for an appropriate evaluation of imputation performance.
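To make the two simplest methods above concrete, here is a minimal Python sketch of Single Arithmetic Average and Normal Ratio imputation for a single missing monthly value, assuming a target station plus neighbor stations with known long-term means (all station values below are hypothetical):

```python
import numpy as np

def single_arithmetic_average(neighbors):
    """SAA: the plain average of the neighbor stations' simultaneous
    observations (NaNs, i.e. neighbors also missing, are skipped)."""
    return float(np.nanmean(neighbors))

def normal_ratio(neighbors, neighbor_means, target_mean):
    """NR: weight each neighbor observation by the ratio of the target
    station's long-term mean to that neighbor's long-term mean."""
    obs = np.asarray(neighbors, dtype=float)
    ratios = target_mean / np.asarray(neighbor_means, dtype=float)
    mask = ~np.isnan(obs)
    return float(np.mean(ratios[mask] * obs[mask]))

# Hypothetical monthly precipitation (mm) at three neighbor stations.
neighbors = [82.0, np.nan, 95.0]
neighbor_means = [80.0, 70.0, 100.0]   # long-term means (hypothetical)
target_mean = 90.0                     # target station's long-term mean

print(single_arithmetic_average(neighbors))                  # SAA estimate
print(normal_ratio(neighbors, neighbor_means, target_mean))  # NR estimate
```

NR scales each neighbor according to how its climate normal compares with the target station's, which matters when nearby stations sit at different elevations or exposures.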
2

Andersson, Joacim, and Henrik Falk. "Missing Data in Value-at-Risk Analysis : Conditional Imputation in Optimal Portfolios Using Regression." Thesis, KTH, Matematisk statistik, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-122276.

Abstract:
A regression-based method is presented in order to regenerate missing data points in stock return time series. The method uses only complete time series of assets in optimal portfolios, in which the returns of the underlying tend to correlate inadequately with each other. The study shows that the method is able to replicate empirical VaR-backtesting results where all data are available, even when up to 90% of the time series in half of the assets in the portfolios have been removed.
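The abstract does not give the exact regression specification, but the general shape of such a conditional imputation can be sketched as follows: regress the incomplete asset's observed returns on the complete series and predict the gaps (all data and coefficients below are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily returns: two complete assets and one with gaps.
n = 500
complete = rng.normal(0, 0.01, size=(n, 2))
target = 0.6 * complete[:, 0] + 0.2 * complete[:, 1] + rng.normal(0, 0.005, n)
target[rng.choice(n, size=n // 2, replace=False)] = np.nan  # 50% removed

# Fit OLS on the observed rows: target ~ [1, complete assets].
obs = ~np.isnan(target)
X = np.column_stack([np.ones(n), complete])
beta, *_ = np.linalg.lstsq(X[obs], target[obs], rcond=None)

# Regenerate the missing returns from the fitted conditional mean.
filled = target.copy()
filled[~obs] = X[~obs] @ beta
```

Adding a draw from the residual distribution to each prediction would turn this conditional mean into a stochastic imputation.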
3

Bischof, Stefan, Andreas Harth, Benedikt Kämpgen, Axel Polleres, and Patrik Schneider. "Enriching integrated statistical open city data by combining equational knowledge and missing value imputation." Elsevier, 2017. http://dx.doi.org/10.1016/j.websem.2017.09.003.

Abstract:
Several institutions collect statistical data about cities, regions, and countries for various purposes. Yet, while access to high-quality and recent data of this kind is both crucial for decision makers and a means for achieving transparency to the public, all too often such collections of data remain isolated and not re-usable, let alone comparable or properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and re-publish the resulting dataset in a re-usable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular, extensible, always up-to-date fashion; (ii) we use both Machine Learning techniques and reasoning over equational background knowledge to enrich the data by imputing missing values; (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data, including links to external data sources such as DBpedia, available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV. Apart from contributing to the growing collection of data available as Linked Data, our enrichment process for missing values also contributes a novel methodology for combining rule-based inference about equational knowledge with inferences obtained from statistical Machine Learning approaches. While most existing work on inference in Linked Data has focused on ontological reasoning in RDFS and OWL, we believe that these complementary methods, and particularly their combination, could be fruitfully applied in many other domains for integrating Statistical Linked Data, independently of our concrete use case of integrating city data.
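A toy illustration of the enrichment idea, combining a rule derived from equational background knowledge (here the standard identity density = population / area) with a statistical fallback; the indicator names and toy training data are illustrative, not from the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical city indicators; names are illustrative only.
cities = [
    {"population": 1.8e6, "area": 415.0, "gdp_pc": 52.0, "density": np.nan},
    {"population": np.nan, "area": 310.0, "gdp_pc": 38.0, "density": np.nan},
]

# Fallback model trained on cities where density is known (toy data).
train_x = np.array([[30.0], [45.0], [60.0]])   # GDP per capita
train_y = np.array([2500.0, 4000.0, 5200.0])   # observed densities
model = LinearRegression().fit(train_x, train_y)

for row in cities:
    if np.isnan(row["density"]):
        if not np.isnan(row["population"]) and not np.isnan(row["area"]):
            # Rule-based inference from equational knowledge.
            row["density"] = row["population"] / row["area"]
        else:
            # Statistical (ML) fallback from other indicators.
            row["density"] = float(model.predict([[row["gdp_pc"]]])[0])
```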
4

Jagirdar, Suresh. "Investigation into Regression Analysis of Multivariate Additional Value and Missing Value Data Models Using Artificial Neural Networks and Imputation Techniques." Ohio University / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1219343139.

5

Bala, Abdalla. "Impact analysis of a multiple imputation technique for handling missing value in the ISBSG repository of software projects." Mémoire, École de technologie supérieure, 2013. http://espace.etsmtl.ca/1236/1/BALA_Abdalla.pdf.

Abstract:
Until the early 2000s, most empirical studies building software project estimation models were carried out with very small samples (fewer than 20 projects), while only a few studies used larger samples (between 60 and 90 projects). With the establishment of a software project repository by the International Software Benchmarking Standards Group (ISBSG), a much larger dataset is now available for building estimation models: release 12 (2013) of the ISBSG repository contains more than 6,000 projects, which constitutes a more adequate basis for statistical studies. However, in the ISBSG repository a large number of values are missing for a substantial number of variables, which makes it rather difficult to use for research projects. To improve the development of estimation models, the goal of this research project is to tackle the new problems raised by access to larger databases in software engineering, using the multiple imputation technique to account for missing data and outliers in the analyses.
6

Etourneau, Lucas. "Contrôle du FDR et imputation de valeurs manquantes pour l'analyse de données de protéomiques par spectrométrie de masse." Electronic Thesis or Diss., Université Grenoble Alpes, 2024. http://www.theses.fr/2024GRALS001.

Abstract:
Proteomics involves characterizing the proteome of a biological sample, that is, the set of proteins it contains, as exhaustively as possible. By identifying and quantifying protein fragments that are analyzable by mass spectrometry (known as peptides), proteomics provides access to the level of gene expression at a given moment, which is crucial information for improving the understanding of the molecular mechanisms at play within living organisms. These experiments produce large amounts of data, often complex to interpret and subject to various biases. They require reliable data processing methods that ensure a certain level of quality control, so as to guarantee the relevance of the resulting biological conclusions. The work of this thesis focuses on improving this data processing, and specifically on the following two major points. The first is controlling the false discovery rate (FDR) when identifying (1) peptides and (2) quantitatively differential biomarkers between a tested biological condition and its negative control. Our contributions focus on establishing links between empirical methods stemming from proteomic practice and other theoretically supported methods. This notably allows us to provide directions for improving the FDR control methods used for peptide identification. The second point focuses on managing missing values, which are often numerous and complex in nature, making them impossible to ignore. Specifically, we have developed a new algorithm, Pirat, for imputing them that leverages the specificities of proteomics data. Our algorithm has been tested and compared to other methods on multiple datasets and according to various metrics, and it generally achieves the best performance. Moreover, it is the first algorithm that allows imputation following the trending "multi-omics" paradigm: when relevant to the experiment, it can impute more reliably by relying on transcriptomic information, which quantifies the level of messenger RNA expression in the sample. Finally, Pirat is implemented in a freely available software package, making it easy to use for the proteomics community.
7

Gheyas, Iffat A. "Novel computationally intelligent machine learning algorithms for data mining and knowledge discovery." Thesis, University of Stirling, 2009. http://hdl.handle.net/1893/2152.

Abstract:
This thesis addresses three major issues in data mining: feature subset selection in large-dimensionality domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm, SAGA. SAGA combines the ability of Simulated Annealing to avoid being trapped in local minima with the very high convergence rate of the crossover operator of Genetic Algorithms, the strong local search ability of greedy algorithms and the high computational efficiency of generalized regression neural networks (GRNNs). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of GRNNs trained on different subsets of features generated by SAGA, with the predictions of the base classifiers combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features which make it stand out amongst ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNNs are used both as base classifiers and as the top-level combiner. Because of the GRNN, the proposed ensemble is a dynamic weighting scheme, in contrast to existing ensemble approaches that follow simple voting or static weighting strategies. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new one. The simulation results demonstrate the validity of the proposed ensemble model.
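A GRNN is essentially Nadaraya-Watson kernel regression, so the base predictor can be sketched in a few lines; this is a generic GRNN on simulated data, not the thesis's SAGA-optimized ensemble:

```python
import numpy as np

def grnn_predict(X_train, y_train, x_new, sigma=0.5):
    """GRNN prediction: a Gaussian-kernel-weighted average of the
    training targets (the Nadaraya-Watson estimator)."""
    d2 = np.sum((X_train - x_new) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return float(np.sum(w * y_train) / np.sum(w))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 200)
print(grnn_predict(X, y, X[0]))   # close to y[0] for small sigma
```

Because every prediction re-weights the stored training cases by their distance to the query, the ensemble built on GRNNs can weight scenarios dynamically at prediction time, as described above.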
8

Alarcon, Sergio Arciniegas. "Imputação de dados em experimentos multiambientais: novos algoritmos utilizando a decomposição por valores singulares." Universidade de São Paulo, 2016. http://www.teses.usp.br/teses/disponiveis/11/11134/tde-10052016-130506/.

Abstract:
Biplot analyses using additive main effects and multiplicative interaction (AMMI) models require a complete data matrix, but multi-environment trials often have missing values. This thesis proposes new single and multiple imputation methods that can be used to analyze unbalanced data in experiments with genotype-by-environment (G×E) interaction. The first is a new extension of the cross-validation by eigenvector method (Bro et al., 2008). The second corresponds to a new non-parametric algorithm obtained through modifications of the single imputation method developed by Yan (2013). A study is also included that considers imputation systems recently reported in the literature and compares them with the classic procedure recommended for imputation in (G×E) trials, that is, the combination of the Expectation-Maximization (EM) algorithm with the AMMI model, or EM-AMMI. Finally, generalizations are supplied of the single imputation described by Arciniegas-Alarcón et al. (2010), which combines regression with a lower-rank approximation of a matrix. All the methodologies are based on the singular value decomposition (SVD) and are therefore free of any distributional or structural assumptions. In order to determine the performance of the new imputation schemes, simulations were performed based on real data sets from different species, with values deleted randomly at different percentages, and the quality of the imputations was evaluated using different statistics. It was concluded that the SVD provides a useful and flexible tool for the construction of efficient techniques that circumvent the problem of missing data in experimental matrices.
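Although the thesis's specific algorithms differ, the common core of SVD-based imputation is an EM-style loop that alternates a low-rank SVD reconstruction with re-imputation of the missing cells; a minimal sketch on a toy matrix:

```python
import numpy as np

def svd_impute(Y, rank=2, n_iter=100, tol=1e-6):
    """Alternate between a rank-k SVD reconstruction and re-imputing
    the missing cells until the imputed values stabilize."""
    Y = np.asarray(Y, dtype=float)
    miss = np.isnan(Y)
    X = Y.copy()
    if not miss.any():
        return X
    X[miss] = np.nanmean(Y)            # crude starting values
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        delta = np.max(np.abs(X[miss] - approx[miss]))
        X[miss] = approx[miss]
        if delta < tol:
            break
    return X

# Toy genotype-by-environment matrix with two missing cells.
Y = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 6.1],
              [3.0, 6.0, np.nan]])
print(svd_impute(Y, rank=1))
```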
9

Bengtsson, Fanny, and Klara Lindblad. "Methods for handling missing values : A simulation study comparing imputation methods for missing values on a Poisson distributed explanatory variable." Thesis, Uppsala universitet, Statistiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-432467.

10

Huo, Zhao. "A Comparison of Multiple Imputation Methods for Missing Covariate Values in Recurrent Event Data." Thesis, Uppsala universitet, Statistiska institutionen, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-256602.

Abstract:
Multiple imputation (MI) is a commonly used approach to impute missing data. This thesis studies missing covariates in recurrent event data and discusses ways to include the survival outcomes in the imputation model. The MI methods under consideration combine the event indicator D with, respectively, the right-censored event times T, the logarithm of T, and the cumulative baseline hazard H0(T). After imputation, we can then proceed to the complete data analysis. The Cox proportional hazards (PH) model and the PWP model are chosen as the analysis models, and the coefficient estimates are of substantive interest. A Monte Carlo simulation study is conducted to compare the different MI methods; relative bias and mean square error are used in the evaluation process. Furthermore, an empirical study is conducted, based on cardiovascular disease event data containing missing values. Overall, the results show that MI based on the Nelson-Aalen estimate of H0(T) is preferred in most circumstances.
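The preferred method includes the Nelson-Aalen estimate of H0(T) as a covariate in the imputation model; a minimal sketch of that estimate, evaluated at each subject's own follow-up time (toy data, with ties handled per-subject for simplicity):

```python
import numpy as np

def nelson_aalen(times, events):
    """Nelson-Aalen estimate of the cumulative hazard, evaluated at
    each subject's own (possibly censored) time."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    order = np.argsort(times)
    H, H_at = 0.0, np.zeros(len(times))
    for i in order:
        at_risk = np.sum(times >= times[i])   # size of the risk set
        if events[i] == 1:
            H += 1.0 / at_risk                # hazard increment at an event
        H_at[i] = H
    return H_at

# Toy data: T = follow-up time, D = event indicator. H(T) would then be
# included as a covariate in the MI model, as the thesis recommends.
T = [5.0, 8.0, 8.0, 12.0, 15.0]
D = [1, 1, 0, 1, 0]
print(nelson_aalen(T, D))
```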
11

Chan, Pui-shan, and 陳佩珊. "On the use of multiple imputation in handling missing values in longitudinal studies." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B45009879.

12

Raoufi-Danner, Torrin. "Effects of Missing Values on Neural Network Survival Time Prediction." Thesis, Linköpings universitet, Statistik och maskininlärning, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-150339.

Abstract:
Data sets with missing values are a pervasive problem within medical research. Building lifetime prediction models based solely upon complete-case data can bias the results, so imputation is preferred over listwise deletion. In this thesis, artificial neural networks (ANNs) are used as a prediction model on simulated data with which to compare various imputation approaches. The construction and optimization of ANNs is discussed in detail, and some guidelines are presented for activation functions, the number of hidden layers and other tunable parameters. For the simulated data, binary lifetime prediction at five years was examined. The ANNs here performed best with tanh activation, binary cross-entropy loss with softmax output and three hidden layers of between 15 and 25 nodes. The imputation methods examined are random, mean, missing forest, multivariate imputation by chained equations (MICE), pooled MICE with imputed target and pooled MICE with non-imputed target. Random and mean imputation performed poorly compared to the others and were used as a baseline comparison case. The other algorithms all performed well up to 50% missingness. There were no statistical differences between these methods below 30% missingness; however, missing forest had the best performance above this amount. It is therefore the recommendation of this thesis that the missing forest algorithm be used to impute missing data when constructing ANNs to predict breast cancer patient survival at the five-year mark.
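Of the methods compared, MICE-style chained equations can be sketched with scikit-learn's IterativeImputer; with sample_posterior=True, repeated runs give multiple imputations that can be pooled, loosely mirroring the pooled-MICE variants above (simulated data; the thesis's exact settings are not reproduced here):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
X[rng.random(X.shape) < 0.2] = np.nan   # roughly 20% of values missing

# sample_posterior=True draws from the predictive distribution, so
# re-running with different random_state values yields multiple
# imputed datasets that can be pooled, as in MICE.
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(5)
]
X_pooled = np.mean(imputed_sets, axis=0)
```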
13

Meister, Romy. "Evaluation verschiedener Imputationsverfahren zur Aufbereitung großer Datenbestände am Beispiel der SrV-Studie von 2013." Master's thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-198852.

Abstract:
Missing values are a serious problem in surveys. The literature suggests replacing them with realistic values using imputation methods. This master's thesis examines four different imputation techniques with respect to their ability to handle missing data: mean imputation, conditional mean imputation, the Expectation-Maximization algorithm and the Markov-Chain-Monte-Carlo method. The first three methods were additionally simulated using a large real data set. To analyse the quality of these techniques, a metric variable of the original data set was chosen and missing values were generated in it, considering different percentages of missingness and common missing data mechanisms. After the replacement of the simulated missing values, several statistical parameters, such as quantiles, the arithmetic mean and the variance of all completed data sets, were calculated in order to compare them with the parameters of the original data set. The results of the empirical data analysis show that the Expectation-Maximization algorithm estimates all considered statistical parameters of the complete data set far better than the other imputation methods analysed, although the assumption of a multivariate normal distribution could not be met. The mean as well as the conditional mean imputation produce reliable estimates of the arithmetic mean under the assumption of missing completely at random, whereas other parameters such as the variance do not show this behaviour. Generally, the accuracy of all estimators from the three imputation methods decreases with an increasing percentage of missingness. The results lead to the conclusion that the Expectation-Maximization algorithm should be preferred over the mean and the conditional mean imputation.
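A minimal sketch of the EM idea for imputation in the simplest setting, a bivariate model with one fully observed variable: the E-step fills the missing values with their conditional expectations, the M-step refits the parameters. This simplified point-estimate version omits the variance bookkeeping of a full EM:

```python
import numpy as np

def em_impute_y(x, y, n_iter=50):
    """E-step: replace missing y by E[y | x] from the current fit;
    M-step: refit the regression on the completed data."""
    y = np.asarray(y, dtype=float).copy()
    miss = np.isnan(y)
    y[miss] = np.nanmean(y)                       # initial fill
    for _ in range(n_iter):
        slope, intercept = np.polyfit(x, y, 1)    # M-step
        y[miss] = intercept + slope * x[miss]     # E-step
    return y

rng = np.random.default_rng(2)
x = rng.normal(size=400)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, 400)
y[rng.random(400) < 0.3] = np.nan                 # 30% MCAR missingness
print(em_impute_y(x, y)[:5])
```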
14

Silva, Jonathan de Andrade. "Substituição de valores ausentes: uma abordagem baseada em um algoritmo evolutivo para agrupamento de dados." Universidade de São Paulo, 2010. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-07062010-144250/.

Abstract:
The substitution of missing values, also called imputation, is an important data preparation task for data mining applications. This work proposes and evaluates an algorithm for missing value imputation that is based on an evolutionary algorithm for clustering. The algorithm relies on the assumption that (previously unknown) clusters of data can provide useful information for the imputation process. In order to experimentally assess the proposed method, simulations of missing values were performed on six classification datasets, using two missingness mechanisms widely used in controlled experiments: MCAR and MAR. Imputation algorithms have traditionally been assessed by measures of prediction capability. However, this traditional approach does not allow inferring the influence of imputed values on the ultimate modeling task (e.g., classification). This work describes experimental results obtained from both the prediction and the insertion-of-bias perspectives in classification problems. The results cover different scenarios in which the proposed algorithm performs, in general, similarly to six other imputation algorithms reported in the literature. Finally, the reported statistical analyses suggest that better prediction results do not necessarily imply less classification bias.
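The cluster-then-impute idea can be sketched generically: cluster a roughly completed copy of the data, then impute each gap from its own cluster's statistics. Here k-means stands in for the thesis's evolutionary clustering algorithm, and the data are simulated:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan      # 10% of values missing

# Cluster on a crude mean-filled copy, then impute inside each cluster
# with that cluster's own feature means.
X_rough = SimpleImputer(strategy="mean").fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_rough)

X_filled = X.copy()
for k in np.unique(labels):
    rows = labels == k
    col_means = np.nanmean(X[rows], axis=0)
    for j in range(X.shape[1]):
        gap = rows & np.isnan(X[:, j])
        X_filled[gap, j] = col_means[j]
```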
15

Assunção, Fernando. "Estratégias para tratamento de variáveis com dados faltantes durante o desenvolvimento de modelos preditivos." Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/45/45133/tde-15082012-203206/.

Abstract:
Predictive models have been increasingly used by industry in order to assist companies in risk mitigation, portfolio growth, customer retention, fraud prevention, and other goals. During model development, however, it is common for some of the predictive variables to have unfilled entries (missing values), so a procedure must be adopted to treat these variables. Given this scenario, the aim of this study is to discuss frameworks for dealing with missing data in predictive models, encouraging the use of some that are already known in academia but not yet used by the market. This work describes seven methods, which were submitted to an empirical application using a credit score dataset. Each framework described resulted in a predictive model, and the results were evaluated and compared through a series of widely used performance metrics (KS, Gini, ROC curve, approval curve). In this application, the frameworks that presented the best performance were the one that treats missing data as a separate category (a technique already used by the market) and the framework that groups the missing data into the conceptually most similar category. The framework with the worst performance was the one that simply discards the variable containing missing values, another procedure commonly seen in the market.
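The best-performing treatment, missing data as a category of its own, amounts to one line before encoding; a minimal pandas sketch with illustrative column names:

```python
import pandas as pd

# Toy credit-scoring frame; column names are illustrative only.
df = pd.DataFrame({
    "housing": ["own", None, "rent", "own", None],
    "default": [0, 1, 0, 0, 1],
})

# Treat missing data as a category of its own, then one-hot encode.
df["housing"] = df["housing"].fillna("MISSING")
X = pd.get_dummies(df[["housing"]])
y = df["default"]
```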
16

Moreno, Betancur Margarita. "Regression modeling with missing outcomes : competing risks and longitudinal data." Thesis, Paris 11, 2013. http://www.theses.fr/2013PA11T076/document.

Abstract:
Missing data are a common occurrence in medical studies. In regression modeling, missing outcomes limit our capability to draw inferences about the covariate effects of medical interest, which are those describing the distribution of the entire set of planned outcomes. In addition to losing precision, the validity of any method used to draw inferences from the observed data will require that some assumption about the mechanism leading to missing outcomes holds. Rubin (1976, Biometrika, 63:581-592) called the missingness mechanism MAR (for "missing at random") if the probability of an outcome being missing does not depend on missing outcomes when conditioning on the observed data, and MNAR (for "missing not at random") otherwise. This distinction has important implications regarding the modeling requirements for drawing valid inferences from the available data, but generally it is not possible to assess from these data whether the missingness mechanism is MAR or MNAR. Hence, sensitivity analyses should be routinely performed to assess the robustness of inferences to assumptions about the missingness mechanism. In the field of incomplete multivariate data, in which the outcomes are gathered in a vector for which some components may be missing, MAR methods are widely available and increasingly used, and several MNAR modeling strategies have also been proposed. On the other hand, although some sensitivity analysis methodology has been developed, this is still an active area of research. The first aim of this dissertation was to develop a sensitivity analysis approach for continuous longitudinal data with drop-outs, that is, continuous outcomes that are ordered in time and completely observed for each individual up to a certain time-point, at which the individual drops out so that all the subsequent outcomes are missing. The proposed approach consists in assessing the inferences obtained across a family of MNAR pattern-mixture models indexed by a so-called sensitivity parameter that quantifies the departure from MAR. The approach was prompted by a randomized clinical trial investigating the benefits of a treatment for sleep-maintenance insomnia, from which 22% of the individuals had dropped out before the study end. The second aim was to build on the existing theory for incomplete multivariate data to develop methods for competing risks data with missing causes of failure. The competing risks model is an extension of the standard survival analysis model in which failures from different causes are distinguished. Strategies for modeling competing risks functionals, such as the cause-specific hazards (CSH) and the cumulative incidence function (CIF), generally assume that the cause of failure is known for all patients, but this is not always the case. Some methods for regression with missing causes under the MAR assumption have already been proposed, especially for semi-parametric modeling of the CSH, but other useful models have received little attention, and MNAR modeling and sensitivity analysis approaches have never been considered in this setting. We propose a general framework for semi-parametric regression modeling of the CIF under MAR using inverse probability weighting and multiple imputation ideas. Also under MAR, we propose a direct likelihood approach for parametric regression modeling of the CSH and the CIF. Furthermore, we consider MNAR pattern-mixture models in the context of sensitivity analyses. In the competing risks literature, a starting point for the methodological developments for handling missing causes was a stage II breast cancer randomized clinical trial in which 23% of the deceased women had a missing cause of death. We use these data to illustrate the practical value of the proposed approaches.
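The inverse-probability-weighting idea for missing causes can be sketched as follows: model the probability that the cause is observed given covariates, then weight the complete cases by its inverse (simulated MAR data; the dissertation's estimators for the CIF are more elaborate):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 1000
age = rng.normal(60, 8, n)
cause = rng.integers(1, 3, n).astype(float)    # two competing causes
p_obs = 1.0 / (1.0 + np.exp(-0.1 * (age - 60)))
observed = rng.random(n) < p_obs               # cause observed, MAR in age
cause[~observed] = np.nan

# Model P(cause observed | covariates) and weight complete cases by 1/p.
model = LogisticRegression().fit(age.reshape(-1, 1), observed)
p_hat = model.predict_proba(age.reshape(-1, 1))[:, 1]
w = 1.0 / p_hat[observed]
share_cause1 = np.sum(w * (cause[observed] == 1.0)) / np.sum(w)
print(share_cause1)   # roughly 0.5, the true share, despite missingness
```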
17

Chion, Marie. "Développement de nouvelles méthodologies statistiques pour l'analyse de données de protéomique quantitative." Thesis, Strasbourg, 2021. http://www.theses.fr/2021STRAD025.

Abstract:
Proteomic analysis consists of studying all the proteins expressed by a given biological system, at a given time and under given conditions. Recent technological advances in mass spectrometry and liquid chromatography make it possible to envisage large-scale and high-throughput proteomic studies. This thesis focuses on developing statistical methodologies for the analysis of quantitative proteomics data and presents three main contributions. The first part proposes the use of monotone spline regression models to estimate the amounts of all peptides detected in a sample, using internal standards labelled for a subset of targeted peptides. The second part presents a strategy to account for the uncertainty induced by the multiple-imputation process in the differential analysis, also implemented in the mi4p R package. Finally, the third part proposes a Bayesian framework for differential analysis, which notably makes it possible to account for the correlations between peptide intensities.
18

Peña, Marisol Garcia. "Alternativas de análise para experimentos G × E multiatributo." Universidade de São Paulo, 2016. http://www.teses.usp.br/teses/disponiveis/11/11134/tde-04052016-111857/.

Abstract:
In genotype-by-environment (G×E) experiments it is common to observe the behaviour of genotypes with respect to several attributes in the environments considered. The analysis of such experiments has been widely discussed for the case of a single attribute. This thesis presents some alternative analyses that consider genotypes, environments and attributes simultaneously. The first is based on the mixture maximum likelihood clustering method (Mixclus) and three-mode principal component analysis (3MPCA), which allow the analysis of three-way tables; these two methods have been used extensively in psychology and chemistry, but little in agriculture. The second is a methodology that combines the additive main effects and multiplicative interaction (AMMI) model, an efficient model for the analysis of single-attribute (G×E) experiments, with generalised Procrustes analysis, which allows configurations of points to be compared and provides a numerical measure of how much they differ. Finally, an alternative for performing data imputation in (G×E) experiments is presented, since the presence of missing values is a very frequent situation in these experiments. It is concluded that the proposed methodologies are useful tools for the analysis of multi-attribute (G×E) experiments.
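Pairwise Procrustes analysis, the building block of the generalised version, is available in SciPy; a toy sketch comparing two hypothetical genotype configurations obtained under two attributes:

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(5)
# Hypothetical 2-D configurations of 10 genotypes under two attributes.
conf_a = rng.normal(size=(10, 2))
conf_b = 1.5 * conf_a @ np.array([[0.0, -1.0], [1.0, 0.0]]) \
    + rng.normal(0, 0.1, (10, 2))

# disparity = sum of squared differences after optimal translation,
# scaling and rotation; small values mean similar configurations.
mtx_a, mtx_b, disparity = procrustes(conf_a, conf_b)
print(disparity)
```

Generalised Procrustes analysis iterates this alignment over more than two configurations until a consensus configuration stabilises.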
19

Li, Yun-Jie, and 李昀潔. "The Effect of Instance Selection on Missing Value Imputation." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/u8nt9j.

Abstract:
Master's thesis
National Central University
Department of Information Management
Academic year 103 (2014-15)
In data mining, the collected datasets are usually incomplete, containing some missing attribute values, and it is difficult to effectively develop a learning model from incomplete datasets. In the literature, missing value imputation is the usual approach to the problem: its aim is to provide estimates for the missing values based on the (observed) complete data samples. However, some of the complete data may contain noisy information, which can be regarded as outliers; if these noisy data were used for missing value imputation, the quality of the imputation results would suffer. To solve this problem, we propose to perform instance selection over the complete data before the imputation step. The aim of instance selection is to filter out unrepresentative data from a given dataset. This research therefore focuses on examining the effect of performing instance selection on missing value imputation, as in the sketch following this abstract. The experimental setup is based on 33 UCI datasets comprising categorical, numerical, and mixed types of data. Three instance selection methods are compared: IB3 (instance-based learning), DROP3 (Decremental Reduction Optimization Procedure), and GA (genetic algorithm). Likewise, three imputation methods, KNNI (K-nearest neighbor imputation), SVM (support vector machine), and MLP (multi-layer perceptron), are employed individually. The comparative results allow us to understand which combination of instance selection and imputation methods performs best, and whether combining instance selection with missing value imputation is a better choice than performing missing value imputation alone on incomplete datasets. According to the results, we suggest that combinations of instance selection and imputation methods can be more suitable than the imputation methods alone for numerical datasets. In particular, the DROP3 instance selection method is more suitable for numerical and mixed datasets, though not for categorical datasets, especially when the number of features is large. Of the other two instance selection methods, GA provides more stable reduction performance than IB3.
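The pipeline's shape, instance selection on the complete data followed by imputation, can be sketched with a simple edited-nearest-neighbour filter standing in for IB3/DROP3/GA, and scikit-learn's KNNImputer playing the role of KNNI (all data simulated):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import KNNImputer

def enn_filter(X, y, k=3):
    """Edited nearest neighbours: drop instances misclassified by their
    k neighbours; a simple stand-in for IB3/DROP3/GA selection."""
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        others = np.arange(len(X)) != i
        knn = KNeighborsClassifier(n_neighbors=k).fit(X[others], y[others])
        keep[i] = knn.predict(X[i:i + 1])[0] == y[i]
    return keep

rng = np.random.default_rng(6)
X_complete = rng.normal(size=(150, 4))
y = (X_complete[:, 0] > 0).astype(int)
keep = enn_filter(X_complete, y)

# Impute an incomplete sample using only the retained instances.
X_incomplete = np.array([[0.2, np.nan, -0.4, 1.1]])
imputer = KNNImputer(n_neighbors=5).fit(X_complete[keep])
print(imputer.transform(X_incomplete))
```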
20

Lin, Ying-Siou, and 林盈秀. "The relationship between missing value, imputation and data pre-processing." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/21777269840849835589.

Abstract:
Master's thesis
National Central University
Department of Information Management
Academic year 101 (2012-13)
With the rapid development of information technology, computers can process and store huge amounts of data, which makes finding useful content in large amounts of data important in data mining. However, many datasets collected for data mining contain missing values, which are likely to degrade data mining performance. For incomplete data, a common and simple approach is case deletion, ignoring the data samples with missing values when the missing rate is reasonably small. Another approach is imputation, for which various methods have been proposed. Generally speaking, imputation algorithms aim at estimating missing values through a reasoning process over the observed data. However, there is no established answer to the question of when to use case deletion versus imputation over different kinds of datasets. Another question is whether performing data pre-processing, i.e. feature and instance selection, affects the final imputation result. This thesis used 37 different datasets containing categorical, numerical, and mixed types of data, with missing rates per dataset ranging from 5% to 50% in 5% intervals. The research is divided into two parts. The experimental results indicate that there are some specific patterns under which case deletion can be applied over different datasets without significant performance degradation. A decision tree model is then constructed to extract useful rules that recommend when to use the case deletion approach. Furthermore, we found that imputation after instance selection can produce better classification performance than imputation alone, whereas imputation after feature selection does not have a positive impact on the imputation result.
21

Wu, Yi-Jing, and 吳宜靜. "A Shrinkage Least Square Imputation Method for Microarray Missing Value Estimation." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/65919355846085221875.

Abstract:
Master's thesis
National Chiao Tung University
Institute of Statistics
Academic year 99 (2010-11)
Microarray data analysis has been widely used in biological studies. However, missing values are common in microarray data and affect the results of analyses. As many downstream analysis methods require complete datasets, missing value estimation has become an important pre-processing step in microarray analysis. Among existing missing value imputation approaches, regression-based methods are very popular, and many algorithms have been developed for reconstructing missing values. In this study, we propose a James-Stein type modified estimator for the regression coefficients. We compare the performance of conventional imputations with the James-Stein type adjusted imputation method; our approach shows better performance than the others on various datasets.
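A James-Stein type shrinkage of least-squares regression coefficients can be sketched as below; the shrinkage factor shown is a standard variant of the James-Stein construction, and the thesis's exact modification may differ:

```python
import numpy as np

def js_shrunk_coefficients(X, y):
    """OLS coefficients shrunk by a James-Stein type positive-part
    factor 1 - (p - 2) * sigma^2 / (beta' X'X beta)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    quad = beta @ (X.T @ X) @ beta            # scale-appropriate norm
    shrink = max(0.0, 1.0 - (p - 2) * sigma2 / quad)
    return shrink * beta

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
beta_true = np.array([0.5, -0.3, 0.2, 0.0, 0.1])
y = X @ beta_true + rng.normal(0, 1.0, 100)
print(js_shrunk_coefficients(X, y))   # OLS pulled toward zero
```

The appeal in the imputation setting is that shrinking noisy coefficient estimates reduces the mean squared error of the regression predictions used to fill the gaps.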
22

Dai, Yu-Ting, and 戴郁庭. "Missing value imputation for class imbalance data: a dynamic warping approach." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/6cax3v.

Abstract:
Master's thesis
National Central University
Department of Information Management
Academic year 107 (2018-19)
In a world full of information, more and more companies want to use that information to improve their competitiveness. However, class imbalance and missing values have always been important issues in the real world. Class-imbalanced datasets occur in many fields, such as medical diagnosis and bankruptcy prediction: the number of samples of the majority class is much larger than that of the minority class, so the data distribution is skewed. In pursuit of a high overall accuracy rate, a model built by a general classifier tends, under the influence of the skewed distribution, to classify minority samples into the majority class. If the precious minority class also contains missing data, the usable samples become rarer still. In this thesis, dynamic time warping (DTW) is used as the core of the missing value imputation task. DTW's alignment of features is used to address missing data in a minority class containing few samples, and the method does not require complete reference samples. In the experiments, missing rates of 10%, 30%, 50%, 70%, and 90% in the minority-class data are simulated over 17 KEEL datasets, two classification models (SVM and decision tree) are constructed, and the AUC (area under the curve) is examined for the different methods. The experimental results show that dynamic time warping performs well at missing rates of 50% to 90%, outperforming the KNN imputation method.
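One plausible reading of the approach, DTW used to select the most similar complete donor before filling the gaps, can be sketched as follows (toy data; the thesis's exact procedure may differ):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_impute(target, donors):
    """Pick the donor whose DTW distance to the target's observed part
    is smallest, then copy its values into the target's gaps."""
    obs = ~np.isnan(target)
    dists = [dtw_distance(target[obs], d[obs]) for d in donors]
    best = donors[int(np.argmin(dists))]
    filled = target.copy()
    filled[~obs] = best[~obs]
    return filled

# Toy minority-class feature vectors; one has missing entries.
target = np.array([1.0, 2.0, np.nan, 4.0, np.nan])
donors = [np.array([1.1, 2.2, 2.9, 4.1, 5.2]),
          np.array([5.0, 4.0, 3.0, 2.0, 1.0])]
print(dtw_impute(target, donors))
```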
23

Tsai, Chun-Hui, and 蔡純卉. "A Study on Applying Visual Analytics to the Imputation of Missing Value." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/v8x92z.

Abstract:
Master's thesis
Chung Hua University
Department of Information Management
Academic year 107 (2018-19)
With the gradual increase in demand for data utilization in recent years and the diversification of source channels, such as open data, the amount of data people can collect has grown steadily, and with it the demand for data analysis. Data analysis extracts potentially valuable information from data whose value was previously unknown; governments and enterprises already use it to predict future trends and develop business opportunities. If unprocessed data are used for analysis, the results may be distorted, leading to errors and failing to provide effective and reliable information. Therefore, many researchers have proposed methods to process missing values, ensuring that the data are as complete as possible before analysis so that final decisions are not affected. Although the issue of missing value processing has gradually gained attention, the methods are diverse, since different types of missing data cannot all be processed by a single method, and when confronted with miscellaneous data it is difficult for a human to quickly decide which method to use. According to research, the human brain absorbs about 80% of graphical information but only 20% of text. Hence, with data visualization, complex abstract information can be presented as images, helping us observe regularities, trends, and correlations in the data; this not only saves time but also makes the key points in the data easier and clearer to grasp. This study proposes a research framework for applying visual analytics to missing value imputation. It consists of six functional modules: a data upload module, a missing-data analysis module, an imputation setting module, a missing value processing module, an imputation preview module, and a data output module. The framework was implemented as a system, and examples were used to verify its feasibility. The platform integrates many common imputation methods, giving analysts more choices. Additionally, through visual auxiliary views of the data, analysts can inspect the original data, quickly understand and grasp the pattern of missingness, and preview imputation results, with graphics presenting the outcome of missing value filling and the differences among the various imputation methods. In this way analysts are assisted in choosing a more suitable imputation method, enhancing data quality and completeness.
24

Huang, Jing-Ya, and 黃靖雅. "Evaluation of missing value imputation methods for the helpfulness of online reviews." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/3tq957.

Abstract:
Master's thesis
National Central University
Department of Information Management
Academic year 106 (2017-18)
In today's world, anyone can comment publicly on many things, including the newspapers, magazines and books they have read. Online reviews are considered trustworthy, and users can provide them in several forms, such as star ratings, text, images, and videos. Most users also browse the reviews on websites before purchasing goods or experiences. Because the Internet contains so much information, users face a constant state of information overload; data mining techniques can be employed to address this problem. This thesis studies the helpfulness of online hotel reviews. During data preprocessing, we found that real-world review datasets commonly contain missing attribute values. In the literature, no study has focused on examining the performance of different types of techniques for handling incomplete online review datasets. The experiment is composed of two studies. In the first study, the dataset is collected from TripAdvisor, where some reviewer-related information is missing, such as reviewer level, age, and sex. Three types of techniques are compared: case deletion; imputation methods including mean/mode, KNN, and SVM; and directly handling the incomplete dataset without imputation using C5.0. In the second study, missing values are simulated at rates from 10% to 50% of the dataset. The results of the two studies show that the C5.0 decision tree algorithm is the better choice for dealing with missing values in online review datasets.
25

Cheng, Han-De, and 程瀚德. "Missing Value Estimation for Microarray Gene Expression Data by Hybrid Local Least Squares Imputation." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/06546326894975258066.

26

Zhan, Shi-Yao, and 詹士瑤. "A comprehensive study on comparison of missing value imputation methods for microarray data." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/19111541983723560460.

Abstract:
Master's thesis
National Cheng Kung University
Department of Electrical Engineering (master's and doctoral program)
Academic year 101 (2012-13)
Microarray data frequently contain missing values for various reasons, yet most downstream analyses of microarray data require complete datasets. Therefore, algorithms for missing value estimation must be developed to improve downstream analysis. Since 2001 many algorithms have been proposed, but comparisons among them have been insufficient in the number of benchmark datasets, the number of algorithms included, the performance measures used, and the rounds of simulation performed. In this research, we used (I) nine algorithms, (II) thirteen microarray datasets, (III) 110 independent runs of the simulation procedure, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have a different impact on the performance of the algorithms. To assess each algorithm fairly, all evaluations were performed using three types of measures: in addition to the statistical measure, two other indices with more biological meaning are useful for reflecting the impact of missing value imputation on downstream data analysis. Based on our studies, we suggest that local-least-squares-based and least-squares methods are better choices for handling missing values in most datasets. In this work we carried out a comprehensive comparison of microarray missing value imputation algorithms; based on such a comparison, researchers can easily choose an optimal algorithm for their datasets, and new imputation algorithms can be compared with existing ones using this comparison strategy as a standard protocol.
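The statistical measure standard in this literature is the normalized root mean squared error over the artificially removed entries; a minimal sketch with toy values:

```python
import numpy as np

def nrmse(y_true, y_imputed):
    """Normalized RMSE between true and imputed values, computed over
    the entries that were artificially removed."""
    err = np.asarray(y_imputed) - np.asarray(y_true)
    return float(np.sqrt(np.mean(err ** 2)) / np.std(y_true))

true_vals = np.array([0.8, 1.2, -0.5, 2.1])
imputed_vals = np.array([0.7, 1.0, -0.2, 1.9])
print(nrmse(true_vals, imputed_vals))   # lower is better
```

Dividing by the standard deviation of the true values makes the error comparable across genes and datasets with different expression scales.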
27

Wu, Che-Wei, and 吳哲維. "The Study on Missing Value Imputation for Modeling the Data of Next Generation Sequence." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/29628471344179278525.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
National Cheng Kung University
Department of Statistics
Academic year 103 (ROC calendar)
As science progresses, DNA sequencing technologies and platforms continue to evolve: soon after laboratories adopt one platform, a newer one follows. When a laboratory buys a new platform, the old one is not necessarily retired immediately, so data analysts must often analyze read-count data of gene sequences produced on different platforms, and this platform effect is likely to influence the analysis results. In addition, gene chips may generate missing values because of insufficient machine resolution, image corruption, and other causes, making standard statistical methods inapplicable. In this study, gene-alignment read-count data of colorectal cancer patients were provided by Professor H. Sunny Sun of the Institute of Molecular Medicine, National Cheng Kung University Medical College, and the Center for Genomic Medicine. The data were taken from normal and tumor cells of 12 colorectal cancer patients on two different platforms. Because the data contain missing values, this thesis proposes imputation methods, fits a generalized estimating equation model, and uses statistical simulation to compare the behavior of the imputation methods under several different parameter settings.
28

Huang, Hao-Hsuan, and 黃浩軒. "A Nearest Neighbors Field Method Based on Distance for Missing Value Imputation in Medical Application." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/3a3f4d.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
National Yunlin University of Science and Technology
Department of Information Management
Academic year 106 (ROC calendar)
In the medical field, missing data are common and affect the analyses and predictions made by doctors and researchers. Most current medical studies focus on maximizing the accuracy of prediction models, but they do not consider the stability of the models under different missing degrees and missing types. Aiming at data completeness and ease of operation, this study proposes a distance-based nearest-neighbors method for imputing missing values. In the experiments, several UCI datasets were given different missing degrees and missing types, and the training accuracy of the proposed method was compared with that of popular imputation methods. Moreover, the Stroke dataset from the International Stroke Trial (IST) was used to verify whether the proposed method can be effectively used in practice. The results show that the proposed method performs well under different simulated missing degrees, missing types, and datasets. In addition, it reaches 90% accuracy on the Stroke dataset, indicating that it can be used effectively on practical data.
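A hedged sketch of the general idea, not the author's exact algorithm: each incomplete record is completed by a distance-weighted average over the k nearest complete records, so that closer donors contribute more.

# A hedged sketch of distance-based nearest-neighbor imputation in the spirit
# of the abstract (not the author's exact method): each missing entry is a
# distance-weighted average over the k nearest complete records.
import numpy as np

def nn_distance_impute(X, k=5):
    X = X.astype(float).copy()
    donors = X[~np.isnan(X).any(axis=1)]            # complete records only
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        d = np.sqrt(((donors[:, obs] - X[i, obs]) ** 2).sum(axis=1))
        near = np.argsort(d)[:k]
        w = 1.0 / (d[near] + 1e-9)                  # closer donors weigh more
        X[i, ~obs] = w @ donors[near][:, ~obs] / w.sum()
    return X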
29

Li, Yi-Cheng, and 李俋澄. "An Empirical Study of Missing-value Imputation Methods on The Accuracy of DNN-based Classification." Thesis, 2019. http://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/login?o=dnclcdr&s=id=%22107NCHU5394045%22.&searchmode=basic.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
National Chung Hsing University
Department of Computer Science and Engineering
Academic year 107 (ROC calendar)
Missing data is a common problem in statistical analysis, and as the proportion of missing data increases, its effect on the analysis can become severe: a large amount of missing data may make the results deviate from the facts, so handling missing values is extremely important. Previous studies of missing value handling focused on traditional machine learning methods. This study instead uses a deep neural network (DNN) as the classification model and investigates the effect of various imputation methods on the accuracy of DNN-based classification. Eight imputation methods are selected for comparison. The simulation experiment has three parts: missing values occur in the training data, in the test data, and in both the training and test data. For each of these cases, simulation experiments are conducted to observe the classification accuracy of the eight imputation methods under different missing ratios. Experimental results show that the deep neural network with KNN imputation has the best classification accuracy among the eight imputation methods across missing ratios and datasets. When the missing ratio is between 5% and 40%, the accuracy of KNN imputation in the three experiments exceeds that of the other imputation methods by an average of 6.81%.
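The winning pipeline reported above can be approximated as follows; MLPClassifier is a small stand-in for the thesis's deep network, and the dataset, layer sizes, and 20% missing ratio are illustrative assumptions.

# A minimal sketch of the best pipeline reported above: KNN imputation feeding
# a neural-network classifier. MLPClassifier is a small stand-in for the
# thesis's deep network; dataset, layer sizes, and missing ratio are assumed.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.2] = np.nan               # 20% missing ratio

pipe = make_pipeline(KNNImputer(n_neighbors=5),
                     StandardScaler(),
                     MLPClassifier(hidden_layer_sizes=(64, 32),
                                   max_iter=500, random_state=1))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
print("accuracy:", pipe.fit(X_tr, y_tr).score(X_te, y_te))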
30

Jhou, Meng-Jhun, and 周孟諄. "Construction of a web tool for comprehensively evaluating the performance of a new microarray missing value imputation algorithm." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/72194686550011204177.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
National Cheng Kung University
Department of Electrical Engineering
Academic year 104 (ROC calendar)
Imputing missing values is very important for microarray analyses because missing values significantly reduce the performance and effectiveness of downstream analyses. Although many missing value imputation algorithms now exist, an objective and comprehensive performance comparison framework has been lacking. In our previously published study, we constructed a comprehensive and objective framework for comparing existing algorithms, which can also be used when developing a new imputation algorithm. However, building this framework is not an easy task for researchers. To save them time and effort, we present an easy-to-use web tool named MVIAeval (Missing Value Imputation Algorithm evaluator). MVIAeval provides a convenient interface through which users upload the code of their new algorithm and then make five selections: (1) the simulation test data from among 20 microarray datasets; (2) the comparison algorithms from among 12 existing algorithms; (3) the performance indices from the three available ones; (4) the comparison method, using one of two performance scores; and (5) the number of simulation runs. Finally, the results of the simulated performance comparison are shown as figures and tables. MVIAeval is therefore a very useful tool with which researchers can easily obtain a comprehensive and objective performance evaluation of a newly developed microarray missing value imputation algorithm.
31

Kuo, Yi-Ru, and 郭奕汝. "Applying Classification in Missing Values Imputation." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/23462033623115348014.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
Ming Chuan University
Master's Program, Department of Computer Science and Information Engineering
Academic year 98 (ROC calendar)
Data mining is widely used to find useful and potential information in large datasets. Classification, clustering, and association rules are core data mining techniques with applications in many areas: in banking, classification is used to decide who should be granted a loan; in medicine, it helps identify which patients will need treatment. For classification, the quality of the training data is the key factor, and missing values in real data can bias the resulting classifier, so building a good model requires collecting high-quality data. Missing values arise in many ways, for example from data conflicts when matching records or from respondents leaving questionnaire items blank. Imputation denotes a procedure that replaces the missing values in a dataset with plausible values. Statistical methods such as mode or mean imputation are commonly used, but they are affected by the distribution of the data; association rules, which are widely believed to mine relationships among different attributes, can avoid this problem. This thesis proposes a method that builds a classification model by integrating clustering and association rules: clustering first groups data with similar features, association rules mined from each group are used to impute the missing values, and the completed, higher-quality data are then used to construct the model.
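A minimal sketch of the cluster-first idea, with per-cluster mode imputation standing in for the association-rule step described above; it assumes numerically coded responses so that K-means can run.

# A minimal sketch of the cluster-first idea, with per-cluster mode imputation
# standing in for the association-rule step; assumes numerically coded data.
import pandas as pd
from sklearn.cluster import KMeans

def cluster_mode_impute(df, n_clusters=3):
    rough = df.fillna(df.mode().iloc[0])       # rough fill so K-means can run
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(rough)
    out = df.copy()
    for c in range(n_clusters):
        part = df[labels == c]
        modes = part.mode()
        if not modes.empty:                    # fill from the record's own cluster
            out.loc[labels == c] = part.fillna(modes.iloc[0])
    return out.fillna(rough)                   # fallback for unresolved cells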
32

Oh, Sohae. "Multiple Imputation on Missing Values in Time Series Data." Thesis, 2015. http://hdl.handle.net/10161/10447.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:

Financial stock market data, for various reasons, frequently contain missing values. One reason is that markets close for holidays, so daily stock prices are not always observed. This creates gaps in information, making it difficult to predict the following day's stock prices. In this situation, information during the holiday can be "borrowed" from other countries' stock markets, since global stock prices tend to show similar movements and are in fact highly correlated. The main goal of this study is to combine stock index data from various markets around the world and to develop an algorithm that imputes the missing values in an individual stock index through "information-sharing" between the different time series. To accommodate time-series-specific features, we take a multiple imputation approach using a dynamic linear model for time series and panel data. The algorithm assumes an ignorable missing-data mechanism, which holds for missingness due to holidays. The posterior distribution of the parameters, including the missing values, is simulated using Markov chain Monte Carlo (MCMC) methods, and estimates from the sets of draws are combined using Rubin's combination rule to render the final inference for the dataset. Specifically, we use the Gibbs sampler and Forward Filtering and Backward Sampling (FFBS) to simulate the joint posterior distribution and the posterior predictive distribution of the latent variables and other parameters. A simulation study is conducted to check the validity and performance of the algorithm using two error-based measurements: Root Mean Square Error (RMSE) and Normalized Root Mean Square Error (NRMSE). We compared the overall trend of the imputed time series with the complete dataset and inspected the in-sample predictability of the algorithm using the Last Value Carried Forward (LVCF) method as a benchmark. The algorithm is applied to real stock price index data from the US, Japan, Hong Kong, the UK, and Germany. From both the simulation and the application, we conclude that the imputation algorithm performs well enough to achieve the original goal, predicting the opening price after a holiday, and outperforms the benchmark method. We believe this multiple imputation algorithm can be used in many applications dealing with time series with missing values, such as financial, economic, and biomedical data.
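Rubin's combination rule, which this algorithm uses to pool the analyses of the m imputed datasets, reduces to a few lines: the pooled estimate is the average of the per-imputation estimates, and the pooled variance adds the between-imputation spread of those estimates to the average within-imputation variance. The numbers below are made-up illustrations.

# A small sketch of Rubin's combination rule: pool the m completed-data
# estimates by averaging, and add the between-imputation spread to the average
# within-imputation variance. The numbers are made-up illustrations.
import numpy as np

def rubin_pool(estimates, variances):
    q = np.asarray(estimates, dtype=float)      # one estimate per imputed set
    u = np.asarray(variances, dtype=float)      # its variance in each set
    m = len(q)
    q_bar = q.mean()                            # pooled point estimate
    t = u.mean() + (1 + 1 / m) * q.var(ddof=1)  # total (pooled) variance
    return q_bar, t

print(rubin_pool([1.02, 0.98, 1.05], [0.040, 0.050, 0.045]))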


33

Fei, Shih Yuan, and 費詩元. "Multiple imputation for missing covariates in contingent valuation survey." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/65283401003442026240.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
National Chengchi University
Graduate Institute of Statistics
Academic year 97 (ROC calendar)
Studies of willingness to pay (WTP) most often simply ignore missing values and treat them as if they were missing completely at random. It is well known that such a practice can cause serious bias and lead to incorrect results. Income is one of the most influential variables in contingent valuation (CV) studies and is also the variable that respondents most often fail to report. In this study, we evaluate the performance of multiple imputation (MI) on missing income in the analysis of WTP through a series of simulation experiments. Several approaches, such as complete-case analysis, single imputation, and MI, are considered and compared. We show that MI always outperforms complete-case analysis, especially when the missing rate is high, and that MI is more stable and reliable than single imputation. As an illustration, we use data from the Cardio Vascular Disease risk FACtor Two-township Study (CVDFACTS). We demonstrate how to determine the missing-data mechanism by comparing survival curves and by fitting a logistic regression model. Based on the empirical study, we find that discarding cases with missing income can lead to results different from those obtained with multiple imputation: if the discarded cases are not missing completely at random, the remaining sample will be biased, which can be a serious problem in CV research. To conclude, MI is a useful method for dealing with missing value problems, and it is worth trying in CV studies.
34

Hung, Chi-Lan, and 洪啟嵐. "Methods for imputation of missing values in air quality data sets." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/11587976008880936667.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
National Chung Hsing University
Department of Environmental Engineering
Academic year 96 (ROC calendar)
To study air quality problems, understand the pollution and its transport in each region, and design strategies to control and improve contamination, the Environmental Protection Administration in Taiwan in 1994 divided the country into seven air quality districts according to their pollution features, terrain, and climate, and has since set up more than 70 air quality monitoring stations. Because these stations continuously record local air quality, a large historical database is available from which useful information can be extracted. However, around 10% of the data are lost during transmission alone, and missing values affect data analysis, so resolving the missing value problem is very important. This research uses the inverse square distance weighting method and the Kriging method to impute the missing values, then discusses, analyzes, and compares the results of the two methods; Monte Carlo simulation is used to verify which method yields the more accurate replacement values. After verification, the absolute errors of inverse square distance weighting and Kriging are 25% and 19%, respectively, showing that Kriging is the better imputation method for these air quality data.
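Inverse square distance weighting, the simpler of the two methods compared above, can be sketched directly; the station coordinates and PM10 readings are made-up illustrations.

# A minimal sketch of inverse square distance weighting: a gap at one station
# is filled from simultaneous readings at its neighbors, weighted by 1/d^2.
# Coordinates and PM10 readings are made-up illustrations.
import numpy as np

def idw_impute(target_xy, station_xy, values, power=2):
    d2 = ((station_xy - target_xy) ** 2).sum(axis=1)    # squared distances
    w = 1.0 / np.maximum(d2, 1e-12) ** (power / 2)      # inverse-distance weights
    return float(w @ values / w.sum())

stations = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 8.0]])   # station x, y in km
pm10 = np.array([42.0, 55.0, 47.0])                          # ug/m3 at one hour
print(idw_impute(np.array([3.0, 2.0]), stations, pm10))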
35

Huang, Hsiang-Chi, and 黃纕淇. "Imputation of Missing Values of Regional Trial Data by EM-AMMI." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/30909519506673627860.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Agronomy
Academic year 104 (ROC calendar)
The purpose of regional trials is to confirm that the yield and agronomic traits of breeding lines perform well and stably in different environments. Several statistical methods have been proposed to explain the patterns of genotype-by-environment interaction in regional trial data. In particular, the Additive Main Effect and Multiplicative Interaction (AMMI) model uses singular value decomposition (SVD) to decompose the genotype-by-environment interaction into singular values, genotype eigenvectors, and environment eigenvectors, allowing a stability analysis of the tested genotypes. A major limitation of the AMMI model, however, is that SVD requires a complete two-way table of genotype and environment mean yields, whereas a typical multi-year or multi-location regional trial dataset is highly unbalanced, which restricts the investigation of genotype-by-environment interaction across years. In this study we impute the missing values by the expectation-maximization AMMI (EM-AMMI) method. The results for simulated data suggest conducting EM-AMMI with one principal component when the proportion of missing values is less than 50%, and with the first three principal components when it exceeds 50%. We also imputed the missing values of a vegetable soybean regional trial dataset by EM-AMMI. In conclusion, completing regional trial data with an appropriate EM-AMMI model can help plant breeders better understand genotype-by-environment interaction.
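A hedged sketch of the EM-AMMI iteration described above: initialize the missing cells, fit the additive main effects plus a truncated-SVD interaction term, refill the missing cells from the fit, and repeat until the refilled values stabilize. The initialization and convergence details are assumptions, not the thesis's exact implementation.

# A hedged sketch of the EM-AMMI iteration: fit additive main effects plus a
# truncated-SVD interaction term, refill the missing cells from the fit, and
# repeat until they stabilize. Convergence details are assumptions.
import numpy as np

def em_ammi_impute(Y, n_pc=1, max_iter=200, tol=1e-8):
    Y = Y.astype(float).copy()
    miss = np.isnan(Y)
    if not miss.any():
        return Y
    Y[miss] = np.nanmean(Y)                         # crude starting values
    for _ in range(max_iter):
        mu = Y.mean()
        g = Y.mean(axis=1, keepdims=True) - mu      # genotype main effects
        e = Y.mean(axis=0, keepdims=True) - mu      # environment main effects
        U, s, Vt = np.linalg.svd(Y - mu - g - e, full_matrices=False)
        fit = mu + g + e + (U[:, :n_pc] * s[:n_pc]) @ Vt[:n_pc]
        if np.abs(Y[miss] - fit[miss]).max() < tol:
            break
        Y[miss] = fit[miss]                         # refill missing cells only
    return Y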
36

Lin, Yu-shiang, and 林鈺翔. "A Study on Using Temporal/Spatial Imputation for Vehicle Detector Missing Values." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/12725294089432672692.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
National Central University
Graduate Institute of Civil Engineering
Academic year 98 (ROC calendar)
The main purpose of this study is to use temporal and spatial detector data to impute missing values and to find the combination with the best imputation performance. Spatially, two modes were adopted, single-detector data and accumulated-detector data, giving three spatial imputation schemes (upstream, downstream, and upstream plus downstream detectors) from which an optimal spatial range for imputation is established. Temporally, two modes were assigned, excluding or including the interpolated detector's own data, and the historical data were processed into three types: a single time interval, accumulated time intervals, and moving averages over time intervals. The best temporal/spatial combination was then evaluated. Before analyzing imputation performance, the detector data were clustered with the K-means method, and the information (flow, speed, and occupancy) was imputed and analyzed with a recurrent neural network. The results show that, for flow, using the cumulative upstream and downstream detectors up to set no. 6 together with the detector's own 20-minute mean historical data gives the best imputation performance; for speed, the cumulative detectors up to set no. 7 with the 20-minute mean historical data perform best; and for occupancy, the cumulative detectors up to set no. 6 with the 15-minute mean historical data perform best.
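A toy version of the temporal/spatial blend (the study itself used a recurrent neural network): the gap at a detector is filled from the mean of its own previous 20 minutes combined with the concurrent mean of its upstream and downstream neighbors. The column names, the 5-minute interval, and the 50/50 blend are assumptions.

# A toy temporal/spatial blend, not the study's recurrent-network model:
# fill gaps at one detector from its own recent history plus the concurrent
# mean of its neighbors. Column names and the 50/50 blend are assumptions.
import pandas as pd

def fill_detector(df, target="vd_06", neighbors=("vd_05", "vd_07"),
                  window=4):                        # 4 x 5-minute bins = 20 min
    temporal = df[target].rolling(window, min_periods=1).mean().shift(1)
    spatial = df[list(neighbors)].mean(axis=1)      # upstream/downstream mean
    return df[target].fillna(0.5 * temporal + 0.5 * spatial)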
37

Sun, Cheng-Bin, and 孫承彬. "The comparison of two missing value imputation methods in single and multiple choice questions." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/32249826382402737245.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
National Chiao Tung University
Institute of Statistics
Academic year 103 (ROC calendar)
Questionnaires are a common way to collect data and usually consist of single-response and multiple-response questions. Collected questionnaire data may contain missing values, and to increase the accuracy of the survey results, statistical methods can be used to impute them. This thesis discusses two widely used methods for imputing missing values in single-response and multiple-response questions: the K-nearest neighbors algorithm and the linear regression approach. We compare the accuracy rates of the two methods under different conditions, such as different missing rates, numbers of questions, and numbers of choices. In addition, we use a real data example to compare the two methods and to contrast the real-data results with the simulation results.
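The regression approach compared in the thesis can be sketched as follows, assuming numerically coded answers and at least one fully observed question; predictions are rounded to the nearest coded choice.

# A sketch of the regression approach compared in the thesis: regress each
# incomplete question on the fully observed ones and predict its missing
# answers, rounding to the nearest coded choice. Assumes numeric coding and
# at least one fully observed question.
import numpy as np
from sklearn.linear_model import LinearRegression

def regression_impute(X):
    X = X.astype(float).copy()
    full = ~np.isnan(X).any(axis=0)                 # fully observed questions
    for j in np.where(~full)[0]:
        miss = np.isnan(X[:, j])
        model = LinearRegression().fit(X[~miss][:, full], X[~miss, j])
        X[miss, j] = np.rint(model.predict(X[miss][:, full]))
    return X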
38

Lin, Yue-Wei, and 林岳威. "Searching the Optimal Data Imputation Method for Missing Values of Vehicle Detector Using Linked List." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/18897154871480374324.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Master's thesis
National Central University
Graduate Institute of Civil Engineering
Academic year 100 (ROC calendar)
To search for the optimal imputation for vehicle detectors (VDs), this thesis uses the linked list data structure to find the optimal imputation for VDs with missing values. Three essential VD lists and a number of data-imputation lists are established and searched for the optimal solution. First, 35 VDs are used to build the original VD list, whose data are copied into a complete-data list. The VDs with missing values are then deleted from the complete-data list and used to build a missing-data list. Combinations of statistics from the complete-data list define the imputation modes used to build the data-imputation lists, with the MAPE of each mode assigned at random in the simulation. Finally, local optimal solutions are retrieved from the data-imputation lists, and the optimal solution is selected from among them. The results show that the MAPE of the local optimal solutions is about 5% for every VD with missing values, because the number of imputation modes is large and the MAPE is randomly assigned. Since a single optimal solution must still be chosen, following the literature we take the imputation that uses the most VD data as the optimal solution.
39

Carrillo, Garcia Ivan Adolfo. "Analysis of Longitudinal Surveys with Missing Responses." Thesis, 2008. http://hdl.handle.net/10012/3971.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Longitudinal surveys have emerged in recent years as an important data collection tool for population studies where the primary interest is to examine population changes over time at the individual level. The National Longitudinal Survey of Children and Youth (NLSCY), a large-scale survey with a complex sampling design conducted by Statistics Canada, follows a large group of children and youth over time and collects measurements on various indicators related to their educational, behavioral, and psychological development. One of the major objectives of the study is to explore how such development is related to or affected by familial, environmental, and economic factors. The generalized estimating equation approach, sometimes better known as the GEE method, is the most popular statistical inference tool for longitudinal studies. The vast majority of the existing literature on the GEE method, however, uses the method in non-survey settings, and issues related to complex sampling designs are ignored. This thesis develops methods for the analysis of longitudinal surveys when the response variable contains missing values. Our methods are built within the GEE framework, with a major focus on using the GEE method when missing responses are handled through hot-deck imputation. We first argue why, and further show how, the survey weights can be incorporated into the so-called pseudo GEE method under a joint randomization framework. The consistency of the resulting pseudo GEE estimators with complete responses is established under the proposed framework. The main focus of this research is to extend the proposed pseudo GEE method to cover cases where the missing responses are imputed through the hot-deck method. Both weighted and unweighted hot-deck imputation procedures are considered. The consistency of the pseudo GEE estimators under imputation for missing responses is established for both procedures. Linearization variance estimators are developed for the pseudo GEE estimators under the assumption that the finite population sampling fraction is small or negligible, a scenario that often holds for large-scale population surveys. Finite sample performances of the proposed estimators are investigated through an extensive simulation study. The results show that the pseudo GEE estimators and the linearization variance estimators perform well under several sampling designs and for both continuous and binary responses.
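Unweighted random hot-deck imputation, one of the two procedures studied in the thesis, can be sketched in a few lines: within each imputation class, every missing response is replaced by the response of a randomly drawn donor. The class variable and toy data below are illustrations, and each class is assumed to contain at least one donor.

# A small sketch of unweighted random hot-deck imputation: within each
# imputation class, a missing response is replaced by the response of a
# randomly drawn donor. Toy data; each class is assumed to have donors.
import numpy as np

def hot_deck(y, classes, seed=0):
    y = y.astype(float).copy()
    rng = np.random.default_rng(seed)
    for c in np.unique(classes):
        in_class = classes == c
        donors = y[in_class & ~np.isnan(y)]
        gaps = in_class & np.isnan(y)
        y[gaps] = rng.choice(donors, size=gaps.sum(), replace=True)
    return y

print(hot_deck(np.array([3.0, np.nan, 4.0, np.nan, 5.0, 4.0]),
               np.array([0, 0, 0, 1, 1, 1])))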
40

Chagra, Djamila. "Sélection de modèle d'imputation à partir de modèles bayésiens hiérarchiques linéaires multivariés." Thèse, 2009. http://hdl.handle.net/1866/3936.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The technique known as multiple imputation seems to be the most suitable technique for solving the problem of non-response. The literature mentions methods that model the nature and structure of missing values. One of the most popular methods is the PAN algorithm of Schafer and Yucel (2002). The imputations yielded by this method are based on a multivariate linear mixed-effects model for the response variable. A Bayesian hierarchical, clustered, and more flexible extension of PAN is given by the BHLC model of Murua et al. (2005). The main goal of this work is to study the problem of model selection for multiple imputation in terms of efficiency and accuracy of missing-value predictions. We propose a measure of performance linked to the prediction of missing values. The measure is a mean squared error, and hence, in addition to the variance associated with the multiple imputations, it includes a measure of bias in the prediction. We show that this measure is more objective than Rubin's commonly used variance measure. Our measure is computed by treating a small additional proportion of the observed values in the data as missing; the performance of the imputation model is then assessed through the prediction error associated with these pseudo missing values. In order to study the problem objectively, we have devised several simulations. Data were generated according to different explicit models that assumed particular error structures, and several missing-value prior distributions as well as error-term distributions were hypothesized. Our study investigates whether the true error structure of the data has an effect on the performance of the different hypothesized choices for the imputation model. We concluded that the answer is yes. Moreover, the choice of missing-value prior distribution seems to be the most important factor for accuracy of predictions. In general, the most effective choices for good imputations are a Student-t distribution with different cluster variances for the error term, and a missing-value Normal prior with data-driven mean and variance, or a missing-value regularizing Normal prior with large variance (a ridge-regression-like prior). Finally, we have applied our ideas to a real problem dealing with health outcome observations associated with a large number of countries around the world. Keywords: Missing values, multiple imputation, Bayesian hierarchical linear model, mixed effects model.
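The proposed performance measure can be sketched directly from its description: hide a small extra fraction of the observed entries, impute, and score the imputation model by the squared prediction error on those pseudo-missing cells. Here imputer is any assumed function mapping an incomplete matrix to a completed one.

# A sketch of the proposed performance measure: hide a small extra fraction of
# the observed entries, impute, and score the model by the squared prediction
# error on those pseudo-missing cells. "imputer" is any assumed function that
# maps an incomplete matrix to a completed one.
import numpy as np

def pseudo_missing_mse(X, imputer, frac=0.05, seed=0):
    rng = np.random.default_rng(seed)
    obs = np.argwhere(~np.isnan(X))                      # observed (row, col)
    pick = obs[rng.choice(len(obs), int(frac * len(obs)), replace=False)]
    truth = X[pick[:, 0], pick[:, 1]]
    X_test = X.copy()
    X_test[pick[:, 0], pick[:, 1]] = np.nan              # pseudo-missing cells
    pred = imputer(X_test)[pick[:, 0], pick[:, 1]]
    return float(np.mean((pred - truth) ** 2))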
The software used were Splus and R.

To the bibliography