Dissertations / Theses on the topic 'Missing Value Imputation'
Consult the top 40 dissertations / theses for your research on the topic 'Missing Value Imputation.'
Aslan, Sipan. "Comparison Of Missing Value Imputation Methods For Meteorological Time Series Data." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/12612426/index.pdf.
Andersson, Joacim, and Henrik Falk. "Missing Data in Value-at-Risk Analysis : Conditional Imputation in Optimal Portfolios Using Regression." Thesis, KTH, Matematisk statistik, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-122276.
Bischof, Stefan, Andreas Harth, Benedikt Kämpgen, Axel Polleres, and Patrik Schneider. "Enriching integrated statistical open city data by combining equational knowledge and missing value imputation." Elsevier, 2017. http://dx.doi.org/10.1016/j.websem.2017.09.003.
Jagirdar, Suresh. "Investigation into Regression Analysis of Multivariate Additional Value and Missing Value Data Models Using Artificial Neural Networks and Imputation Techniques." Ohio University / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1219343139.
Bala, Abdalla. "Impact analysis of a multiple imputation technique for handling missing value in the ISBSG repository of software projects." Mémoire, École de technologie supérieure, 2013. http://espace.etsmtl.ca/1236/1/BALA_Abdalla.pdf.
Etourneau, Lucas. "Contrôle du FDR et imputation de valeurs manquantes pour l'analyse de données de protéomiques par spectrométrie de masse." Electronic Thesis or Diss., Université Grenoble Alpes, 2024. http://www.theses.fr/2024GRALS001.
Proteomics involves characterizing the proteome of a biological sample, that is, the set of proteins it contains, as exhaustively as possible. By identifying and quantifying the protein fragments that are analyzable by mass spectrometry (known as peptides), proteomics provides access to the level of gene expression at a given moment. This is crucial information for improving our understanding of the molecular mechanisms at play within living organisms. These experiments produce large amounts of data, often complex to interpret and subject to various biases, and they require reliable data-processing methods with a level of quality control that guarantees the relevance of the resulting biological conclusions. The work of this thesis focuses on improving this data processing, specifically on the following two major points. The first is controlling the false discovery rate (FDR) when identifying either (1) peptides or (2) quantitatively differential biomarkers between a tested biological condition and its negative control. Our contributions focus on establishing links between the empirical methods stemming from proteomic practice and other theoretically supported methods. This notably allows us to provide directions for improving the FDR control methods used for peptide identification. The second point focuses on managing missing values, which are often numerous and complex in nature, making them impossible to ignore. Specifically, we developed a new imputation algorithm, Pirat, that leverages the specificities of proteomics data. It has been tested and compared with other methods on multiple datasets and according to various metrics, and it generally achieves the best performance.
Moreover, it is the first algorithm that supports imputation under the trending "multi-omics" paradigm: when relevant to the experiment, it can impute more reliably by drawing on transcriptomic information, which quantifies the level of messenger RNA expression present in the sample. Finally, Pirat is implemented in a freely available software package, making it easy to use for the proteomics community.
Gheyas, Iffat A. "Novel computationally intelligent machine learning algorithms for data mining and knowledge discovery." Thesis, University of Stirling, 2009. http://hdl.handle.net/1893/2152.
Alarcon, Sergio Arciniegas. "Imputação de dados em experimentos multiambientais: novos algoritmos utilizando a decomposição por valores singulares." Universidade de São Paulo, 2016. http://www.teses.usp.br/teses/disponiveis/11/11134/tde-10052016-130506/.
Biplot analysis using the additive main effects and multiplicative interaction (AMMI) model requires a complete data matrix, but multi-environment trials often have missing values. This thesis proposes new single and multiple imputation methods that can be used to analyze unbalanced data in experiments with genotype-by-environment (G×E) interaction. The first is a new extension of the cross-validation-by-eigenvector method (Bro et al., 2008). The second corresponds to a new non-parametric algorithm obtained through modifications of the simple imputation method developed by Yan (2013). Also included is a study that considers imputation systems recently reported in the literature and compares them with the classic procedure recommended for imputation in G×E trials, namely the combination of the expectation-maximization (EM) algorithm with the AMMI model, or EM-AMMI. Finally, generalizations are provided of the simple imputation described by Arciniegas-Alarcón et al. (2010), which combines regression with a lower-rank approximation of a matrix. All methodologies are based on the singular value decomposition (SVD) and are therefore free of any distributional or structural assumptions. To determine the performance of the new imputation schemes, simulations were performed based on real data sets from different species, with values deleted randomly at different percentages, and the quality of the imputations was evaluated using different statistics. It is concluded that the SVD provides a useful and flexible tool for the construction of efficient techniques that circumvent the problem of missing data in experimental matrices.
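The SVD-based imputation family discussed in this abstract can be illustrated with a minimal iterative low-rank scheme, in the spirit of EM-style SVD imputation (a generic sketch, not the thesis's exact algorithms; the function name, rank, and tolerance are illustrative choices):

```python
import numpy as np

def svd_impute(X, rank=2, n_iter=50, tol=1e-6):
    """Iteratively impute missing entries (NaN) of X with a rank-r SVD
    approximation: initialise with column means, then alternate between
    a truncated SVD of the completed matrix and refreshing the missing
    cells from the low-rank reconstruction."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    # start from column means of the observed values
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(miss, col_means, X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_filled, full_matrices=False)
        X_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
        # largest change in the imputed cells this iteration
        delta = np.abs(X_filled[miss] - X_hat[miss]).max() if miss.any() else 0.0
        X_filled[miss] = X_hat[miss]
        if delta < tol:
            break
    return X_filled
```

Observed entries are never altered; only the missing cells are refreshed, which is what makes the scheme free of distributional assumptions beyond the low-rank structure.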
Bengtsson, Fanny, and Klara Lindblad. "Methods for handling missing values : A simulation study comparing imputation methods for missing values on a Poisson distributed explanatory variable." Thesis, Uppsala universitet, Statistiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-432467.
Huo, Zhao. "A Comparsion of Multiple Imputation Methods for Missing Covariate Values in Recurrent Event Data." Thesis, Uppsala universitet, Statistiska institutionen, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-256602.
Chan, Pui-shan, and 陳佩珊. "On the use of multiple imputation in handling missing values in longitudinal studies." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B45009879.
Raoufi-Danner, Torrin. "Effects of Missing Values on Neural Network Survival Time Prediction." Thesis, Linköpings universitet, Statistik och maskininlärning, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-150339.
Meister, Romy. "Evaluation verschiedener Imputationsverfahren zur Aufbereitung großer Datenbestände am Beispiel der SrV-Studie von 2013." Master's thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-198852.
Silva, Jonathan de Andrade. "Substituição de valores ausentes: uma abordagem baseada em um algoritmo evolutivo para agrupamento de dados." Universidade de São Paulo, 2010. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-07062010-144250/.
The substitution of missing values, also called imputation, is an important data preparation task for data mining applications. This work proposes and evaluates an algorithm for missing value imputation that is based on an evolutionary algorithm for clustering. The algorithm rests on the assumption that clusters of (partially unknown) data can provide useful information for the imputation process. To assess the proposed method experimentally, simulations of missing values were performed on six classification datasets, with two missingness mechanisms widely encountered in practice: MCAR and MAR. Imputation algorithms have traditionally been assessed by measures of prediction capability, but this traditional approach does not allow inferring the influence of imputed values on the ultimate modeling tasks (e.g., classification). This work describes the experimental results obtained from both the prediction and insertion-bias perspectives in classification problems. The results illustrate different scenarios in which the proposed algorithm performs similarly to six other imputation algorithms reported in the literature. Finally, statistical analyses suggest that the best prediction results do not necessarily imply less classification bias.
Assunção, Fernando. "Estratégias para tratamento de variáveis com dados faltantes durante o desenvolvimento de modelos preditivos." Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/45/45133/tde-15082012-203206/.
Predictive models have been increasingly used by the market to assist companies in risk mitigation, portfolio growth, customer retention, fraud prevention, and other tasks. During model development, however, it is common for some of the predictive variables to have unfilled entries (missing values), so a procedure must be adopted to treat these variables. Given this scenario, the aim of this study is to discuss frameworks for dealing with missing data in predictive models, encouraging the use of some approaches already known in academia that are still not used by the market. This work describes seven methods, which were submitted to an empirical application using a credit score data set. Each framework resulted in a developed predictive model, and the results were evaluated and compared through a series of widely used performance metrics (KS, Gini, ROC curve, approval curve). In this application, the best-performing frameworks were the one that treats missing data as a separate category (a technique already used by the market) and the one that groups the missing data into the conceptually most similar category. Conversely, the worst-performing framework was the one that simply ignores any variable containing missing values, another procedure commonly used by the market.
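The best-performing treatment, keeping missing data as a separate category, amounts to encoding missingness as its own level before fitting the scorecard. A minimal sketch (the function name and missing token are illustrative, not from the thesis):

```python
def encode_with_missing_category(values, missing_token="MISSING"):
    """Replace unfilled entries (None) with an explicit category so a
    scorecard/classifier can learn a separate effect for missingness,
    then one-hot encode with the missing token as just another level."""
    cleaned = [v if v is not None else missing_token for v in values]
    categories = sorted(set(cleaned))
    rows = [[1 if v == c else 0 for c in categories] for v in cleaned]
    return rows, categories
```

Because missingness gets its own indicator column, the fitted model can assign it a distinct score weight instead of discarding the variable or the observation.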
Moreno, Betancur Margarita. "Regression modeling with missing outcomes : competing risks and longitudinal data." Thesis, Paris 11, 2013. http://www.theses.fr/2013PA11T076/document.
Missing data are a common occurrence in medical studies. In regression modeling, missing outcomes limit our capability to draw inferences about the covariate effects of medical interest, which are those describing the distribution of the entire set of planned outcomes. In addition to losing precision, the validity of any method used to draw inferences from the observed data will require that some assumption about the mechanism leading to missing outcomes holds. Rubin (1976, Biometrika, 63:581-592) called the missingness mechanism MAR (for "missing at random") if the probability of an outcome being missing does not depend on missing outcomes when conditioning on the observed data, and MNAR (for "missing not at random") otherwise. This distinction has important implications for the modeling requirements to draw valid inferences from the available data, but generally it is not possible to assess from these data whether the missingness mechanism is MAR or MNAR. Hence, sensitivity analyses should be routinely performed to assess the robustness of inferences to assumptions about the missingness mechanism. In the field of incomplete multivariate data, in which the outcomes are gathered in a vector for which some components may be missing, MAR methods are widely available and increasingly used, and several MNAR modeling strategies have also been proposed. On the other hand, although some sensitivity analysis methodology has been developed, this is still an active area of research. The first aim of this dissertation was to develop a sensitivity analysis approach for continuous longitudinal data with drop-outs, that is, continuous outcomes that are ordered in time and completely observed for each individual up to a certain time-point, at which the individual drops out so that all subsequent outcomes are missing.
The proposed approach consists of assessing the inferences obtained across a family of MNAR pattern-mixture models indexed by a so-called sensitivity parameter that quantifies the departure from MAR. The approach was prompted by a randomized clinical trial investigating the benefits of a treatment for sleep-maintenance insomnia, in which 22% of the individuals had dropped out before the study end. The second aim was to build on the existing theory for incomplete multivariate data to develop methods for competing risks data with missing causes of failure. The competing risks model is an extension of the standard survival analysis model in which failures from different causes are distinguished. Strategies for modeling competing risks functionals, such as the cause-specific hazards (CSH) and the cumulative incidence function (CIF), generally assume that the cause of failure is known for all patients, but this is not always the case. Some methods for regression with missing causes under the MAR assumption have already been proposed, especially for semi-parametric modeling of the CSH, but other useful models have received little attention, and MNAR modeling and sensitivity analysis approaches have never been considered in this setting. We propose a general framework for semi-parametric regression modeling of the CIF under MAR using inverse probability weighting and multiple imputation ideas. Also under MAR, we propose a direct likelihood approach for parametric regression modeling of the CSH and the CIF. Furthermore, we consider MNAR pattern-mixture models in the context of sensitivity analyses. In the competing risks literature, a starting point for methodological developments for handling missing causes was a stage II breast cancer randomized clinical trial in which 23% of the deceased women had a missing cause of death. We use these data to illustrate the practical value of the proposed approaches.
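The pattern-mixture sensitivity analysis described above hinges on re-running inference across a grid of sensitivity-parameter values. A crude, self-contained sketch of that idea, shifting MAR-style imputed draws by a delta and re-estimating a simple mean (the thesis's actual models are far richer; all names here are illustrative):

```python
import numpy as np

def delta_sensitivity(observed, n_missing, deltas, seed=0):
    """For each delta on the grid, draw bootstrap-style MAR imputations
    from the observed values, shift them by delta (the assumed departure
    from MAR), and report the resulting overall mean estimate."""
    rng = np.random.default_rng(seed)
    results = {}
    for d in deltas:
        draws = rng.choice(observed, size=n_missing, replace=True) + d
        results[d] = (np.sum(observed) + np.sum(draws)) / (len(observed) + n_missing)
    return results
```

Plotting the estimate against delta shows how large a departure from MAR would be needed before the substantive conclusion changes.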
Chion, Marie. "Développement de nouvelles méthodologies statistiques pour l'analyse de données de protéomique quantitative." Thesis, Strasbourg, 2021. http://www.theses.fr/2021STRAD025.
Proteomic analysis consists of studying all the proteins expressed by a given biological system, at a given time and under given conditions. Recent technological advances in mass spectrometry and liquid chromatography make it possible to envisage large-scale and high-throughput proteomic studies. This thesis work focuses on developing statistical methodologies for the analysis of quantitative proteomics data and presents three main contributions. The first part proposes using monotone spline regression models to estimate the amounts of all peptides detected in a sample, using internal standards labelled for a subset of targeted peptides. The second part presents a strategy for accounting for the uncertainty induced by the multiple imputation process in the differential analysis, also implemented in the mi4p R package. Finally, the third part proposes a Bayesian framework for differential analysis that notably makes it possible to account for the correlations between peptide intensities.
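Propagating multiple-imputation uncertainty into downstream inference, as the second contribution does, ultimately rests on Rubin's combining rules. A minimal, generic pooling sketch (not the mi4p implementation; the function name is illustrative):

```python
import math

def rubin_pool(estimates, variances):
    """Pool point estimates and within-imputation variances from M
    multiply-imputed analyses using Rubin's rules: total variance is the
    within-imputation variance plus a (1 + 1/M)-inflated between-imputation
    variance."""
    m = len(estimates)
    qbar = sum(estimates) / m                              # pooled estimate
    ubar = sum(variances) / m                              # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = ubar + (1 + 1 / m) * b                             # total variance
    return qbar, t, math.sqrt(t)
```

Ignoring the between-imputation term (as naive single imputation implicitly does) understates the standard error, which is exactly the uncertainty the thesis's strategy is designed to carry into the differential analysis.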
Peña, Marisol Garcia. "Alternativas de análise para experimentos G × E multiatributo." Universidade de São Paulo, 2016. http://www.teses.usp.br/teses/disponiveis/11/11134/tde-04052016-111857/.
In genotype-by-environment (G×E) experiments it is common to observe the behaviour of genotypes with respect to several attributes across the environments considered. The analysis of such experiments has been widely discussed for the case of a single attribute. This thesis presents some alternatives of analysis that consider genotypes, environments, and attributes simultaneously. The first is based on the mixture maximum likelihood method (Mixclus) and three-mode principal component analysis, two methods that have been widely used in psychology and chemistry but little in agriculture. The second is a methodology that combines the additive main effects and multiplicative interaction (AMMI) model, an efficient model for the analysis of G×E experiments with one attribute, with generalised Procrustes analysis, which allows comparing configurations of points and provides a numerical measure of how much they differ. Finally, an alternative for performing data imputation in G×E experiments is presented, since the presence of missing values is a very frequent situation in these experiments. It is concluded that the proposed methodologies are useful tools for the analysis of multi-attribute G×E experiments.
Li, Yun-Jie, and 李昀潔. "The Effect of Instance Selection on Missing Value Imputation." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/u8nt9j.
National Central University
Department of Information Management
103
In data mining, the collected datasets are usually incomplete, containing missing attribute values, and it is difficult to develop an effective learning model from incomplete data. In the literature, missing value imputation is the usual approach to the problem of incomplete datasets; its aim is to estimate the missing values from the (observed) complete data samples. However, some of the complete data may contain noisy information and can be regarded as outliers. If these noisy data are used for missing value imputation, the quality of the imputation results suffers. To solve this problem, we propose to perform instance selection over the complete data before the imputation step. The aim of instance selection is to filter out unrepresentative data from a given dataset. This research therefore focuses on examining the effect of performing instance selection before missing value imputation. The experimental setup is based on 33 UCI datasets, composed of categorical, numerical, and mixed types of data. Three instance selection methods are compared: IB3 (instance-based learning), DROP3 (Decremental Reduction Optimization Procedure), and GA (genetic algorithm). Similarly, three imputation methods are employed individually: KNNI (k-nearest neighbor imputation), SVM (support vector machine), and MLP (multilayer perceptron). The comparative results allow us to understand which combination of instance selection and imputation methods performs best, and whether combining instance selection with missing value imputation is a better choice than performing missing value imputation alone on incomplete datasets. According to the results of this research, we suggest that combinations of instance selection and imputation methods may be more suitable than imputation alone over numerical datasets.
In particular, the DROP3 instance selection method is more suitable for numerical and mixed datasets (but not categorical ones), especially when the number of features is large. Of the other two instance selection methods, GA provides more stable reduction performance than IB3.
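The instance-selection-then-imputation pipeline studied here can be sketched with an edited-nearest-neighbour-style filter standing in for IB3/DROP3/GA, followed by a simple KNN-style imputation (all function names and parameter choices below are illustrative stand-ins, not the thesis's exact methods):

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Edited-nearest-neighbour style filter: keep a complete sample only
    if the majority of its k nearest complete neighbours shares its label,
    discarding likely noise before imputation."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        nn = np.argsort(d)[:k]
        if np.sum(y[nn] == y[i]) > k // 2:
            keep.append(i)
    return np.array(keep)

def knn_impute_from(X_complete, x_partial, k=3):
    """Impute the NaNs of x_partial with the mean of its k nearest complete
    samples, measuring distance on the observed features only."""
    obs = ~np.isnan(x_partial)
    d = np.linalg.norm(X_complete[:, obs] - x_partial[obs], axis=1)
    nn = np.argsort(d)[:k]
    filled = x_partial.copy()
    filled[~obs] = X_complete[nn][:, ~obs].mean(axis=0)
    return filled
```

Running the filter first means the imputer never borrows values from samples the editing rule judged unrepresentative, which is the effect the thesis measures.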
Lin, Ying-Siou, and 林盈秀. "The relationship between missing value, imputation and data pre-processing." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/21777269840849835589.
National Central University
Department of Information Management
101
With the rapid development of information technology, computers can process and store huge amounts of data, which makes finding useful content in large amounts of data important in data mining. However, many datasets collected for data mining contain missing values, which are likely to degrade data mining performance. For incomplete data, a common and simple approach is case deletion, ignoring the data samples with missing values when the missing rate is sufficiently small. Another approach is imputation, for which various algorithms have been proposed. Generally speaking, imputation algorithms aim to estimate missing values through a reasoning process over the observed data. However, there is no established answer to the question of when to use case deletion versus imputation over different kinds of datasets, nor to whether performing data pre-processing (i.e., feature and instance selection) affects the final imputation result. This thesis used 37 different data sets containing categorical, numerical, and mixed types of data, with missing rates varied per dataset in 5% intervals from 5% to 50%. The research is divided into two parts. The experimental results indicate that there are specific patterns under which case deletion can be applied over different datasets without significant performance degradation; a decision tree model is then constructed to extract useful rules recommending when to use the case deletion approach. Furthermore, we found that imputation after instance selection can produce better classification performance than imputation alone, whereas imputation after feature selection does not have a positive impact on the imputation result.
Wu, Yi-Jing, and 吳宜靜. "A Shrinkage Least Square Imputation Method for Microarray Missing Value Estimation." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/65919355846085221875.
National Chiao Tung University
Institute of Statistics
99
Microarray data analysis has been widely used in biological studies. However, missing values are common in microarray data and affect the results of analysis. As many downstream analysis methods require complete datasets, missing value estimation has become an important pre-processing step in microarray analysis. Among existing missing value imputation methods, regression-based methods are very popular, and many algorithms have been developed for reconstructing missing values. In this study, we propose a James-Stein-type modified estimator for the regression coefficients. We compare the performance of conventional imputations with the James-Stein-adjusted imputation method; our approach shows better performance than the others on various datasets.
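A James-Stein-type shrinkage of regression coefficients for imputation can be sketched as follows. This is a generic positive-part shrinkage toward zero under an assumed noise variance, assuming the missing gene's expression is regressed on similar genes; the thesis's exact modified estimator may differ, and all names are illustrative:

```python
import numpy as np

def js_shrunk_regression_impute(Z, y_obs, z_new, sigma2=1.0):
    """Regression-based imputation with James-Stein-type shrinkage.
    Z: covariate rows (similar genes) for samples where the target gene is
    observed; y_obs: the observed target values; z_new: covariates of the
    sample whose target value is missing; sigma2: assumed noise variance."""
    beta, *_ = np.linalg.lstsq(Z, y_obs, rcond=None)   # ordinary least squares
    p = len(beta)
    norm2 = float(beta @ beta)
    if p > 2 and norm2 > 0:
        # positive-part James-Stein factor shrinking the coefficients to zero
        shrink = max(0.0, 1.0 - (p - 2) * sigma2 / norm2)
        beta = shrink * beta
    return float(z_new @ beta)
```

The shrinkage trades a small bias for lower variance in the estimated coefficients, which is the usual motivation for James-Stein adjustments in regression-based imputation.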
Dai, Yu-Ting, and 戴郁庭. "Missing value imputation for class imbalance data: a dynamic warping approach." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/6cax3v.
National Central University
Department of Information Management
107
In a world full of information, more and more companies want to use data to improve their competitiveness. However, class imbalance and missing values have always been important issues in the real world. Class-imbalanced datasets occur in fields such as medical diagnosis and bankruptcy prediction: the number of samples of the majority class is much larger than that of the minority class, so the data distribution is skewed. To maximize overall classification accuracy, a model built by a standard classifier tends to misclassify minority-class samples as the majority class under the influence of the skewed distribution. If the precious minority class also contains missing data, the available data are rarer still. In this thesis, dynamic time warping (DTW) is used as the core of the missing value imputation task. DTW's alignment capability is used to address missing data in a minority class containing small numbers of samples, and the method does not require complete data samples. In the experiments, missing rates of 10%, 30%, 50%, 70%, and 90% of the minority-class data are simulated on 17 KEEL datasets, two classification models (SVM and decision tree) are constructed, and the AUC (area under the curve) is examined for the different methods. The experimental results show that dynamic time warping performs well at missing rates of 50%-90%, outperforming the KNN imputation method.
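The DTW machinery at the core of this approach can be sketched in plain Python; the alignment and imputation choices below are simplifications for illustration, not the thesis's exact method:

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between two
    (possibly unequal-length) numeric sequences."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def dtw_impute(partial, complete_rows):
    """Fill the None entries of `partial` from the DTW-closest complete row,
    comparing only the observed values of the partial sample."""
    observed = [v for v in partial if v is not None]
    best = min(complete_rows, key=lambda r: dtw_distance(observed, r))
    return [best[i] if v is None else v for i, v in enumerate(partial)]
```

Because DTW tolerates sequences of unequal length, the observed portion of an incomplete sample can be matched against full-length donors, which is why the method does not require complete data samples for the search.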
Tsai, Chun-Hui, and 蔡純卉. "A Study on Applying Visual Analytics to the Imputation of Missing Value." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/v8x92z.
Chung Hua University
Department of Information Management
107
With the gradual increase in demand for data utilization in recent years and the diversification of sources such as open data, the amount of data people can collect has grown, and with it the demand for data analysis. Data analysis extracts potentially valuable information from data whose value was previously unknown; governments and enterprises now use it to predict future trends and develop business opportunities. If unprocessed data are used for analysis, the results may be affected, leading to errors and failing to provide effective and reliable information. Therefore, many researchers have proposed methods for processing missing values so that the data are as complete as possible before analysis, avoiding effects on final decisions. Although missing value processing has gradually gained attention, the methods are diverse, since different types of missing data cannot all be handled by a single method, and when confronted with miscellaneous data it is difficult for a human to decide quickly which method to use. According to research, the human brain can absorb 80% of graphic information but only 20% of text. Hence, with data visualization, complex abstract information can be presented as images that help us observe regularities, trends, and correlations in the data, saving time and making the key points easier and clearer to understand. In this study, a research framework for applying visual analytics to missing value imputation is proposed. It consists of six functional modules: data upload, missingness analysis, imputation settings, missing value processing, imputation preview, and data output. The framework was implemented as a working system.
Finally, real instances were used to verify the feasibility of the framework. The platform integrates many common imputation methods, giving analysts more choices. Additionally, through visual auxiliary views, analysts can inspect the original data, quickly grasp the pattern of missingness, and preview imputation results, with graphics presenting the outcome after missing value filling and the differences among the various imputation methods. This assists analysts in choosing a more suitable imputation method, thereby enhancing data quality and completeness.
Huang, Jing-Ya, and 黃靖雅. "Evaluation of missing value imputation methods for the helpfulness of online reviews." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/3tq957.
National Central University
Department of Information Management
106
In today's world, everyone can comment publicly on many things, including the newspapers, magazines, and books they have read. Online reviews are considered trustworthy, and users can provide them in several ways, such as star ratings, text, images, and videos. Most users browse reviews on websites before purchasing goods or services. The constant information overload caused by the Internet can be addressed with data mining techniques. This thesis studies the helpfulness of online hotel reviews. During data preprocessing, we found it very common for real-world review datasets to contain missing attribute values, yet no study in the literature has focused on examining how different types of techniques perform on incomplete online review datasets. The experiment comprises two studies. In the first, the dataset is collected from TripAdvisor, where some reviewer-related information is missing, such as reviewer level, age, and sex. Three types of techniques are compared: case deletion; imputation methods including mean/mode, KNN, and SVM; and directly handling the incomplete dataset without imputation using C5.0. In the second study, missing values are simulated at rates from 10% to 50% of the dataset. The results of the two studies show that the C5.0 decision tree algorithm is the better choice for dealing with missing values in online review datasets.
Cheng, Han-De, and 程瀚德. "Missing Value Estimation for Microarray Gene Expression Data by Hybrid Local Least Squares Imputation." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/06546326894975258066.
Zhan, Shi-Yao, and 詹士瑤. "A comprehensive study on comparison of missing value imputation methods for microarray data." Thesis, 2013. http://ndltd.ncl.edu.tw/handle/19111541983723560460.
National Cheng Kung University
Department of Electrical Engineering (MS/PhD Program)
101
Microarray data frequently contain missing values for various reasons, yet most downstream analyses require complete datasets, so missing value estimation algorithms must be developed to support downstream analysis. Since 2001, many algorithms have been proposed, but comparisons among them have been insufficient in the number of benchmark datasets, the number of algorithms included, the performance measures used, and the rounds of simulation performed. In this research, we used (I) nine algorithms, (II) thirteen microarray datasets, (III) 110 independent runs of the simulation procedure, and (IV) three types of measures to evaluate the performance of each imputation algorithm fairly. First, the effects of different types of datasets on the performance of each imputation algorithm were evaluated. Second, we discussed whether datasets from different species have different impacts on the performance of different algorithms. To assess each algorithm fairly, all evaluations were performed using three types of measures: in addition to the statistical measure, two other indices with more biological meaning are useful for reflecting the impact of missing value imputation on downstream data analysis. Through our studies, we suggest that local-least-squares-based and least-squares methods are better choices for handling missing values in most datasets. In this work we carried out a comprehensive comparison of microarray missing value imputation algorithms; based on such a comparison, researchers can easily choose an optimal algorithm for their datasets, and new imputation algorithms can be compared with existing algorithms using this comparison strategy as a standard protocol in the future.
Wu, Che-Wei, and 吳哲維. "The Study on Missing Value Imputation for Modeling the Data of Next Generation Sequence." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/29628471344179278525.
National Cheng Kung University
Department of Statistics
103
As science progresses, DNA sequencing technologies and platforms keep evolving. When sequencing staff adopt a new platform, the old platform is not necessarily retired immediately, so data analysts must analyze read-count data of gene sequences coming from different platforms, and the platform effect is likely to affect the analysis results. In addition, gene chips may generate missing values due to insufficient machine resolution, image corruption, and other causes, making some statistical methods unusable. In this study, gene alignment read-count data of colorectal cancer patients were provided by Professor H. Sunny Sun of the Institute of Molecular Medicine, National Cheng Kung University College of Medicine, and the Center for Genomic Medicine. The data comprise normal and tumor cells from 12 colorectal cancer patients, sequenced on two different platforms. Because the data contain missing values, this thesis proposes imputation methods, applies a generalized estimating equation (GEE) model, and uses statistical simulation to compare the behavior of the imputation methods under several different parameter settings.
Huang, Hao-Hsuan, and 黃浩軒. "A Nearest Neighbors Field Method Based on Distance for Missing Value Imputation in Medical Application." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/3a3f4d.
National Yunlin University of Science and Technology
Department of Information Management
106
In the medical field, missing data are common and affect the analyses and predictions made by doctors and researchers. Most current medical studies focus on building the prediction model with the highest accuracy, but they do not consider the stability of models under different missing degrees and missing types. Aiming at data completeness and ease of use, this study proposes a distance-based nearest-neighbors method for imputing missing values. In the experiments, several UCI datasets were given different missing degrees and missing types, and training accuracy was compared against popular imputation methods. Moreover, the Stroke dataset from the International Stroke Trial (IST) was used to verify whether the proposed method can be applied effectively in practice. The results show that the proposed method performs well across the simulated missing degrees, missing types, and datasets, and reaches 90% accuracy on the Stroke dataset, indicating that it can be used effectively on real data.
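A generic distance-based nearest-neighbors imputation, in the spirit of the method described above, can be sketched as follows (this is an illustrative baseline, not the thesis's exact algorithm; the squared-distance measure and column-mean fill are choices made here):

```python
import numpy as np

def knn_impute(X, k=3):
    """Distance-based k-nearest-neighbor imputation sketch: each missing
    cell (i, j) is filled with the mean of column j over the k rows that
    observe column j and lie closest to row i on its observed columns."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        obs = ~np.isnan(X[i])                        # columns row i observes
        cand = np.where(~np.isnan(X[:, j]) & (np.arange(len(X)) != i))[0]
        # mean squared distance on jointly available columns
        d = np.array([np.nanmean((X[c, obs] - X[i, obs]) ** 2) for c in cand])
        nearest = cand[np.argsort(d)[:k]]
        out[i, j] = X[nearest, j].mean()
    return out
```

Such a method needs no model fitting, which is one reason nearest-neighbor imputers tend to be stable across missing degrees and missing types.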
Li, Yi-Cheng, and 李俋澄. "An Empirical Study of Missing-value Imputation Methods on The Accuracy of DNN-based Classification." Thesis, 2019. http://ndltd.ncl.edu.tw/cgi-bin/gs32/gsweb.cgi/login?o=dnclcdr&s=id=%22107NCHU5394045%22.&searchmode=basic.
National Chung Hsing University
Department of Computer Science and Engineering
107
Missing data is a common problem in statistical analysis. As the proportion of missing data increases, its effect on data analysis can become very serious: a large amount of missing data may make the analysis results deviate from the facts, so missing value handling is extremely important. Previous studies of missing-value handling focused on traditional machine learning methods; this study instead uses a deep neural network (DNN) as the classification model and investigates the effect of various imputation methods on the accuracy of DNN-based classification. Eight different imputation methods are selected for comparison. The simulation experiment includes three parts: missing values occur in the training data, in the test data, and in both the training and test data. For each case, simulation experiments measure the classification accuracy of the eight imputation methods under different missing ratios. Experimental results show that the deep neural network with KNN imputation has the best classification accuracy among the eight methods across missing ratios and datasets. When the missing ratio is between 5% and 40%, the accuracy of KNN imputation in the three experiments exceeds that of the other imputation methods by an average of 6.81%.
Jhou, Meng-Jhun, and 周孟諄. "Construction of a web tool for comprehensively evaluating the performance of a new microarray missing value imputation algorithm." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/72194686550011204177.
National Cheng Kung University
Department of Electrical Engineering
104
Imputing missing values is important for microarray dataset analyses because missing values significantly reduce the performance and effectiveness of downstream analyses. Although many missing value imputation algorithms now exist, an objective and comprehensive evaluation framework is still lacking. In our previously published study, we constructed such a comprehensive and objective framework for comparing existing algorithms, which can also be used in the development of new imputation algorithms. However, building this framework is not an easy task for individual researchers. To save their time and effort, we publish an easy-to-use web tool named MVIAeval (Missing Value Imputation Algorithm evaluator). MVIAeval provides a convenient user interface: users upload the code of their new algorithm and then make five selections: (1) the simulation test data, from 20 microarray datasets; (2) the comparison algorithms, from 12 existing algorithms; (3) the performance indices, from three options; (4) the method for comparing all algorithms, from two performance scores; and (5) the number of simulation runs. Finally, the results of the simulated performance comparison are shown in figures and tables. MVIAeval is thus a very useful tool with which researchers can easily obtain a comprehensive and objective performance comparison of a newly developed microarray missing value imputation algorithm.
Kuo, Yi-Ru, and 郭奕汝. "Applying Classification in Missing Values Imputation." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/23462033623115348014.
Ming Chuan University
Master's Program, Department of Computer Science and Information Engineering
98
Data mining is widely used to find useful, latent information in large datasets. Classification, clustering, and association rules are core data mining techniques with applications in many areas: in banking, classification is used to predict who will apply for a loan; in medicine, it identifies which kinds of people will need treatment. For classification, the quality of the training data is the key factor, and real data often contain missing values that bias the resulting classifier, so building a good model requires collecting high-quality data. Missing values arise in many ways, such as conflicts when matching data or respondents leaving questionnaire items blank. Imputation denotes a procedure that replaces the missing values in a dataset with plausible values. Statistical methods such as mode or mean imputation are commonly used, but they are affected by the distribution of the data; association rules, which are widely believed to mine relationships among attributes, can avoid this problem. This paper proposes a method that builds a classification model by integrating clustering and association rules: clustering groups the data with similar features, association rules mined from each group impute the missing values, and the completed, higher-quality data are then used to construct the model.
Oh, Sohae. "Multiple Imputation on Missing Values in Time Series Data." Thesis, 2015. http://hdl.handle.net/10161/10447.
Financial stock market data frequently contain missing values for various reasons. One reason is that markets close for holidays, so daily stock prices are not always observed; the resulting gaps in information make it difficult to predict the following day's prices. In this situation, information during the holiday can be "borrowed" from other countries' stock markets, since global stock prices tend to show similar movements and are in fact highly correlated. The main goal of this study is to combine stock index data from various markets around the world and develop an algorithm that imputes the missing values in an individual stock index through "information-sharing" between the different time series. To accommodate time-series-specific features, we take a multiple imputation approach using a dynamic linear model for time-series and panel data. The algorithm assumes an ignorable missing-data mechanism, such as missingness due to holidays. The posterior distribution of the parameters, including the missing values, is simulated using Markov chain Monte Carlo (MCMC) methods, and estimates from the sets of draws are combined using Rubin's combination rule, rendering the final inference for the dataset. Specifically, we use the Gibbs sampler and Forward Filtering and Backward Sampling (FFBS) to simulate the joint posterior distribution and the posterior predictive distribution of the latent variables and other parameters. A simulation study checks the validity and performance of the algorithm using two error-based measurements: Root Mean Square Error (RMSE) and Normalized Root Mean Square Error (NRMSE). We compared the overall trend of the imputed time series with the complete dataset and inspected the in-sample predictability of the algorithm using the Last Value Carried Forward (LVCF) method as a benchmark. The algorithm is applied to real stock price index data from the US, Japan, Hong Kong, the UK, and Germany.
From both the simulation and the application, we conclude that the imputation algorithm performs well enough to achieve our original goal, predicting the opening stock price after a holiday, and that it outperforms the benchmark method. We believe this multiple imputation algorithm can be used in many applications that deal with time series with missing values, such as financial, economic, and biomedical data.
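Rubin's combination rule, which the abstract uses to pool the multiply-imputed analyses, is compact enough to sketch directly (the scalar-parameter version; variable names are illustrative):

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Rubin's rules for pooling m multiply-imputed analyses of a scalar
    parameter: the pooled estimate is the mean of the per-imputation
    estimates, and the total variance adds the mean within-imputation
    variance to the (1 + 1/m)-inflated between-imputation variance."""
    q = np.asarray(estimates, dtype=float)   # m point estimates
    u = np.asarray(variances, dtype=float)   # m estimated variances
    m = len(q)
    qbar = q.mean()                          # pooled point estimate
    ubar = u.mean()                          # within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance
    t = ubar + (1 + 1 / m) * b               # total variance
    return qbar, t
```

The between-imputation term is what lets the final inference reflect the extra uncertainty introduced by the missing data.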
Fei, Shih Yuan, and 費詩元. "Multiple imputation for missing covariates in contingent valuation survey." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/65283401003442026240.
National Chengchi University
Graduate Institute of Statistics
97
Most studies of willingness to pay (WTP) simply ignore missing values and treat them as if they were missing completely at random. It is well known that this practice can cause serious bias and lead to incorrect results. Income is one of the most influential variables in contingent valuation (CV) studies and is also the variable that respondents most often fail to report. In the present study, we evaluate the performance of multiple imputation (MI) on missing income in the analysis of WTP through a series of simulation experiments. Several approaches, including complete-case analysis, single imputation, and MI, are considered and compared. We show that MI always performs better than complete-case analysis, especially when the missing rate is high, and that MI is more stable and reliable than single imputation. As an illustration, we use data from the Cardio Vascular Disease risk FACtor Two-township Study (CVDFACTS) and demonstrate how to determine the missing-data mechanism by comparing survival curves and fitting a logistic regression model. Based on the empirical study, we find that discarding cases with missing income can yield results different from those obtained with multiple imputation: if the discarded cases are not missing completely at random, the remaining sample is biased, which can be a serious problem in CV research. To conclude, MI is a useful method for dealing with missing value problems and is worth trying in CV studies.
Hung, Chi-Lan, and 洪啟嵐. "Methods for imputation of missing values in air quality data sets." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/11587976008880936667.
National Chung Hsing University
Department of Environmental Engineering
96
To study air quality problems, to understand pollution and its transport between regions, and to design strategies for controlling and improving contamination, the Environmental Protection Administration in Taiwan divided the island in 1994 into seven air quality districts according to their pollution features, terrain, and climate, and has since set up over 70 air quality monitoring stations. Because these stations monitor the air quality of each place, historical air quality records are available everywhere, forming a huge database from which useful information can be extracted. However, around 10% of the data are lost during the transmission process alone, and missing values in the database affect data analysis, so resolving the missing value problem is very important. This research uses the inverse square distance weighting method and the Kriging method to impute the missing values, and discusses, analyzes, and compares the results of the two methods; the Monte Carlo method is then used to verify which method yields the more accurate replacements for the missing values. After verification, the absolute errors of inverse square distance weighting and Kriging are 25% and 19%, respectively, showing that Kriging is the better imputation method.
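The inverse-square-distance-weighting idea used above is straightforward to sketch for a single missing station reading (an illustrative sketch, not the thesis's implementation; `power=2` gives the inverse-square weighting):

```python
import numpy as np

def idw_impute(coords, values, target, power=2):
    """Inverse distance weighting: estimate the value at a station with a
    missing reading from neighboring stations, weighting each neighbor
    by 1 / distance**power."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    d = np.linalg.norm(coords - np.asarray(target, dtype=float), axis=1)
    if np.any(d == 0):                       # co-located station: use it directly
        return float(values[d == 0][0])
    w = 1.0 / d ** power
    return float(np.sum(w * values) / np.sum(w))
```

Kriging replaces these fixed geometric weights with weights fitted from a variogram model of the spatial correlation, which is why it can outperform IDW.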
Huang, Hsiang-Chi, and 黃纕淇. "Imputation of Missing Values of Regional Trial Data by EM-AMMI." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/30909519506673627860.
National Taiwan University
Graduate Institute of Agronomy
104
The purpose of regional trials is to confirm that the yield and agronomic traits of candidate lines are stable and perform well across environments. Several statistical methods have been proposed to explain the patterns of genotype-by-environment interaction in regional trial data. In particular, the Additive Main effects and Multiplicative Interaction (AMMI) model uses singular value decomposition (SVD) to decompose the genotype-by-environment interaction into singular values, genotype eigenvectors, and environment eigenvectors, enabling stability analysis of the tested genotypes. A major limitation of the AMMI model, however, is that SVD requires a complete two-way table of genotype and environment mean yields, whereas typical multi-year or multi-location regional trial data are highly unbalanced, restricting the investigation of genotype-by-environment interaction across years. In this study we impute the missing values with the expectation-maximization AMMI (EM-AMMI) method. The results on simulated data suggest running EM-AMMI with one principal component when the proportion of missing values is below 50%, and with the first three principal components when it exceeds 50%. We also imputed the missing values of vegetable soybean regional trial data by EM-AMMI. In conclusion, completing regional trial data with an appropriate EM-AMMI model can help plant breeders better understand genotype-by-environment interaction.
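The EM-style alternation at the heart of EM-AMMI can be illustrated with a stripped-down low-rank sketch (a simplification: full EM-AMMI also fits additive genotype and environment main effects, which are omitted here; the column-mean initialization and iteration count are choices made for illustration):

```python
import numpy as np

def em_svd_impute(Y, rank=1, n_iter=500):
    """EM-style low-rank imputation of a two-way table: missing cells
    start at their column means, then are repeatedly refilled from a
    rank-`rank` truncated SVD of the completed table until convergence."""
    Y = np.asarray(Y, dtype=float)
    miss = np.isnan(Y)
    col_means = np.nanmean(Y, axis=0)
    X = np.where(miss, col_means, Y)                 # initial fill
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # rank-k reconstruction
        X = np.where(miss, low, Y)                   # refill only missing cells
    return X
```

Observed cells are never altered; only the missing cells move toward the value implied by the leading multiplicative terms, mirroring the role of the chosen number of principal components in EM-AMMI.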
Lin, Yu-shiang, and 林鈺翔. "A Study on Using Temporal/Spatial Imputation for Vehicle Detector Missing Values." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/12725294089432672692.
National Central University
Graduate Institute of Civil Engineering
98
The main purpose of this study is to impute missing vehicle detector values from temporal and spatial detector data and to find the combination giving the best imputation performance. For the spatial dimension, two modes were adopted, imputing from a single detector and from accumulated detectors, in three configurations (upstream, downstream, and upstream plus downstream) to establish an optimal spatial imputation range. For the temporal dimension, two modes were assigned, excluding and including the data of the detector being imputed, and the historical data were organized into three types: a single time interval, accumulated time intervals, and a moving average over time intervals. The best temporal/spatial combinations were then evaluated. Before analyzing imputation performance, the study clustered the detector data with the K-means method and then imputed and analyzed the flow, speed, and occupancy information with a recurrent neural network. The results show that for flow, combining the accumulated upstream and downstream detectors up to set no. 6 with the detector's own 20-minute historical mean gives the best imputation performance; for speed, the accumulated detectors up to set no. 7 with the 20-minute historical mean perform best; and for occupancy, the accumulated detectors up to set no. 6 with the 15-minute historical mean perform best.
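The temporal/spatial blending idea above can be hinted at with a toy sketch. This is emphatically not the thesis's recurrent-network model; the equal-weight spatial average, the blend weight `alpha`, and the function name are all invented for illustration:

```python
import numpy as np

def spatio_temporal_impute(upstream, downstream, history, alpha=0.5):
    """Toy imputation of one missing detector reading: blend a spatial
    estimate (mean of the upstream and downstream readings at the same
    time) with a temporal estimate (the detector's own historical moving
    average, e.g. over the previous 20 minutes)."""
    spatial = (float(upstream) + float(downstream)) / 2.0
    temporal = float(np.mean(np.asarray(history, dtype=float)))
    return alpha * spatial + (1 - alpha) * temporal
```

In the thesis, a recurrent neural network learns this spatial/temporal weighting from clustered detector data instead of fixing it by hand.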
Sun, Cheng-Bin, and 孫承彬. "The comparison of two missing values imputation methods in single and multiple choice question." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/32249826382402737245.
National Chiao Tung University
Institute of Statistics
103
Questionnaires are a common way to collect data, and a questionnaire usually consists of single-response and multiple-response questions. Collected questionnaire data may contain missing values, and to increase the accuracy of the survey results, statistical methods can be used to impute them. This paper mainly discusses two widely used methods for imputing missing values in single-response and multiple-response questions, respectively: the K-nearest neighbors algorithm and the linear regression approach. We compare the accuracy rates of the two methods under different conditions, such as different missing rates, numbers of questions, and numbers of choices. In addition, we apply both methods to a real dataset and compare the real-data results with the simulation results.
Lin, Yue-Wei, and 林岳威. "Searching the Optimal Data Imputation Method for Missing Values of Vehicle Detector Using Linked List." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/18897154871480374324.
National Central University
Graduate Institute of Civil Engineering
100
To find the optimal imputation for vehicle detector (VD) data, this thesis uses linked lists, a classic data structure, to search for the optimal imputation of missing VD values. Three essential VD lists and a number of data-imputation lists are built and searched. First, the 35 VDs are used to build the original VD list, which is copied to build a second, complete-data list. The VDs with missing values are then removed from the complete-data list and, at the same time, used to build a missing-data list. Combinations of statistics from the complete-data list define the imputation modes, with the MAPE of each mode decided at random, and these modes are used to build the data-imputation lists. Finally, a local optimal solution is found in each data-imputation list, and the overall optimal solution is chosen from among these local optima. The results show that the local optimal solutions for every VD with missing values have a MAPE of 5%; because there are many imputation modes and the MAPE is decided randomly, the imputation mode shared by the most VDs is chosen as the optimal solution, following the literature.
Carrillo, Garcia Ivan Adolfo. "Analysis of Longitudinal Surveys with Missing Responses." Thesis, 2008. http://hdl.handle.net/10012/3971.
Chagra, Djamila. "Sélection de modèle d'imputation à partir de modèles bayésiens hiérarchiques linéaires multivariés." Thèse, 2009. http://hdl.handle.net/1866/3936.
Abstract: The technique known as multiple imputation seems to be the most suitable for solving the problem of non-response. The literature mentions methods that model the nature and structure of missing values. One of the most popular is the PAN algorithm of Schafer and Yucel (2002), whose imputations are based on a multivariate linear mixed-effects model for the response variable. A Bayesian hierarchical, clustered, and more flexible extension of PAN is given by the BHLC model of Murua et al. (2005). The main goal of this work is to study the problem of model selection for multiple imputation in terms of efficiency and accuracy of missing-value predictions. We propose a measure of performance linked to the prediction of missing values: a mean squared error that, in addition to the variance associated with the multiple imputations, includes a measure of bias in the prediction. We show that this measure is more objective than Rubin's commonly used variance measure. Our measure is computed by treating a small additional proportion of the observed values as if they were also missing; the performance of the imputation model is then assessed through the prediction error on these pseudo-missing values. To study the problem objectively, we devised several simulations in which data were generated according to different explicit models with particular error structures, and several prior distributions for the missing values as well as error-term distributions were hypothesized. Our study investigates whether the true error structure of the data affects the performance of the different hypothesized choices for the imputation model; we conclude that it does. Moreover, the choice of prior distribution for the missing values appears to be the most important factor for prediction accuracy.
In general, the most effective choices for good imputations are a Student-t distribution with different cluster variances for the error term, together with either a Normal prior for the missing values with data-driven mean and variance, or a regularizing Normal prior with large variance (a ridge-regression-like prior). Finally, we applied our ideas to a real problem dealing with health outcome observations from a large number of countries around the world. Keywords: missing values, multiple imputation, Bayesian hierarchical linear model, mixed effects model.
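The pseudo-missing evaluation procedure described above can be sketched generically (a simplified illustration: `impute_fn`, the hidden fraction, and the cell-wise mean squared error are choices made here, not the thesis's exact protocol):

```python
import numpy as np

def pseudo_missing_mse(X, impute_fn, frac=0.05, seed=0):
    """Score an imputation routine by hiding a small fraction of the
    observed cells, running the imputer, and computing the mean squared
    error on those pseudo-missing cells. `impute_fn` maps an incomplete
    array (np.nan marks gaps) to a completed array of the same shape."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    obs_idx = np.argwhere(~np.isnan(X))                  # observed cells
    n_hide = max(1, int(frac * len(obs_idx)))
    hide = obs_idx[rng.choice(len(obs_idx), n_hide, replace=False)]
    X_holed = X.copy()
    X_holed[hide[:, 0], hide[:, 1]] = np.nan             # punch pseudo-holes
    filled = impute_fn(X_holed)
    truth = X[hide[:, 0], hide[:, 1]]
    pred = filled[hide[:, 0], hide[:, 1]]
    return float(np.mean((pred - truth) ** 2))
```

Because the hidden cells have known true values, this squared-error score captures both the variance and the bias of the predictions, which is the point of the measure proposed in the thesis.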
The software used is S-PLUS and R.