Journal articles on the topic 'Missing Value Imputation'

Consult the top 50 journal articles for your research on the topic 'Missing Value Imputation.'

1. Zhao, Yuxuan, Eric Landgrebe, Eliot Shekhtman, and Madeleine Udell. "Online Missing Value Imputation and Change Point Detection with the Gaussian Copula." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 9199–207. http://dx.doi.org/10.1609/aaai.v36i8.20906.

Abstract:
Missing value imputation is crucial for real-world data science workflows. Imputation is harder in the online setting, as it requires the imputation method itself to be able to evolve over time. For practical applications, imputation algorithms should produce imputations that match the true data distribution, handle data of mixed types, including ordinal, boolean, and continuous variables, and scale to large datasets. In this work we develop a new online imputation algorithm for mixed data using the Gaussian copula. The online Gaussian copula model meets all the desiderata: its imputations match the data distribution even for mixed data, and it improves over its offline counterpart both in accuracy when the streaming data has a changing distribution and in speed (by up to an order of magnitude), especially on large-scale datasets. By fitting the copula model to online data, we also provide a new method to detect change points in the multivariate dependence structure for mixed data with missing values. Experimental results on synthetic and real-world data validate the performance of the proposed methods.

2. Lu, Kaifeng. "Number of imputations needed to stabilize estimated treatment difference in longitudinal data analysis." Statistical Methods in Medical Research 26, no. 2 (October 10, 2014): 674–90. http://dx.doi.org/10.1177/0962280214554439.

Abstract:
Multiple imputation procedures replace each missing value with a set of plausible values based on the posterior predictive distribution of missing data given observed data. In many applications, as few as five imputations are adequate to achieve high efficiency relative to an infinite number of imputations. However, substantially more imputations are often needed to stabilize imputation-based inference at the analysis stage. Imputation-based inference at the analysis stage is considered stable if the conditional variability of the multiple imputation estimator, half-width of 95% confidence interval, test statistic, and estimated fraction of missing information given observed data is within specified thresholds for simulation error. For the estimation of treatment difference at study end for normally distributed responses in longitudinal trials, we calculate the multiple imputation quantities for an infinite number of imputations analytically and use simulations to assess the variability of the number of imputations needed at the analysis stage in repeated sampling.
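The stability discussed above depends on how the m per-imputation results are combined. A minimal sketch of the standard pooling step (Rubin's rules) follows; the numbers and the `pool_rubin` helper are illustrative, not taken from the paper:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m multiple-imputation results with Rubin's rules."""
    m = len(estimates)
    q_bar = np.mean(estimates)        # pooled point estimate
    w = np.mean(variances)            # within-imputation variance
    b = np.var(estimates, ddof=1)     # between-imputation variance
    t = w + (1 + 1 / m) * b           # total variance
    fmi = (1 + 1 / m) * b / t         # approx. fraction of missing information
    return q_bar, t, fmi

est = [1.02, 0.98, 1.05, 0.95, 1.00]   # treatment-difference estimates, one per imputation
var = [0.04, 0.05, 0.045, 0.05, 0.04]  # their squared standard errors
q, t, fmi = pool_rubin(est, var)
```

The between-imputation term `(1 + 1/m) * b` is exactly the quantity that shrinks as the number of imputations m grows, which is why more imputations stabilize the pooled interval and test statistic.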

3. Hameed, Wafaa Mustafa, and Nzar A. Ali. "Missing value imputation Techniques: A Survey." UHD Journal of Science and Technology 7, no. 1 (March 28, 2023): 72–81. http://dx.doi.org/10.21928/uhdjst.v7n1y2023.pp72-81.

Abstract:
Vast amounts of data are accumulated and stored every day, and a large number of missing entries in a dataset can be a serious problem for analysts, since it causes numerous issues in quantitative research. To handle such missing values, many methods have been proposed. This paper reviews the techniques available for imputing unknown values, such as median imputation, hot (cold) deck imputation, regression imputation, expectation maximization, support vector machine imputation, multivariate imputation by chained equations, the SICE technique, reinforcement programming, non-parametric iterative imputation algorithms, and multilayer perceptrons. The paper also explores good choices of methods for estimating missing values for other researchers in this field of study. Furthermore, it aims to help them determine which approaches are commonly used now; the overview presents each technique along with its advantages and limitations for consideration in future studies. It can be taken as a baseline for answering which techniques have been used and which is the most popular.

4. Das, Dipalika, Maya Nayak, and Subhendu Kumar Pani. "Missing Value Imputation-A Review." International Journal of Computer Sciences and Engineering 7, no. 4 (April 30, 2019): 548–58. http://dx.doi.org/10.26438/ijcse/v7i4.548558.

5. Seu, Kimseth, Mi-Sun Kang, and HwaMin Lee. "An Intelligent Missing Data Imputation Techniques: A Review." JOIV : International Journal on Informatics Visualization 6, no. 1-2 (May 31, 2022): 278. http://dx.doi.org/10.30630/joiv.6.1-2.935.

Abstract:
An incomplete dataset is an unavoidable problem in data preprocessing, since most machine learning algorithms cannot train a model on data with missing values. Various data imputation approaches have been proposed and benchmarked against each other to resolve this problem. These methods predict the most appropriate replacement values using different machine learning algorithms and concepts. Furthermore, accurate imputation is exceptionally critical for some datasets, especially medical data. The purpose of this paper is to compare distinguished state-of-the-art benchmarks: the K-nearest Neighbors imputation (KNNImputer) method, the Bayesian Principal Component Analysis (BPCA) imputation method, the Multiple Imputation by Chained Equations (MICE) method, and Multiple Imputation with Denoising Autoencoders (MIDAS). These methods offer achievable ways to select and evaluate appropriate data points for imputing missing values. We evaluate all of these imputation techniques on the same four datasets, collected from a hospital. Both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are used to measure the outcomes and compare the methods, in order to identify a robust and appropriate method for overcoming missing data problems. In our experiments, KNNImputer and MICE performed better than BPCA and MIDAS, and BPCA performed better than MIDAS.
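Two of the benchmarked methods have widely used scikit-learn counterparts: `KNNImputer` and the MICE-style experimental `IterativeImputer`. A hedged sketch comparing them by RMSE on synthetic data (not the hospital datasets used in the paper):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))
X_true[:, 1] += 0.8 * X_true[:, 0]      # correlated columns make imputation feasible

# Knock out ~20% of column 1 completely at random.
X = X_true.copy()
mask = np.zeros(X.shape, dtype=bool)
mask[rng.random(len(X)) < 0.2, 1] = True
X[mask] = np.nan

def rmse(imputer):
    X_hat = imputer.fit_transform(X)
    return float(np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2)))

scores = {"knn": rmse(KNNImputer(n_neighbors=5)),
          "mice": rmse(IterativeImputer(max_iter=10, random_state=0))}
```

Because the ground truth is known here, RMSE at the masked positions directly mirrors the evaluation protocol the abstract describes.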

6. Huang, Min-Wei, Wei-Chao Lin, and Chih-Fong Tsai. "Outlier Removal in Model-Based Missing Value Imputation for Medical Datasets." Journal of Healthcare Engineering 2018 (2018): 1–9. http://dx.doi.org/10.1155/2018/1817479.

Abstract:
Many real-world medical datasets contain some proportion of missing (attribute) values. In general, missing value imputation can be performed to solve this problem: estimates for the missing values are produced by a reasoning process based on the (complete) observed data. However, if the observed data contain noisy information or outliers, the estimates of the missing values may not be reliable, or may even be quite different from the real values. The aim of this paper is to examine whether a combination of instance selection from the observed data and missing value imputation offers better performance than missing value imputation alone. In particular, three instance selection algorithms (DROP3, GA, and IB3) and three imputation algorithms (KNNI, MLP, and SVM) are combined in order to find the best pairing. The experimental results show that performing instance selection can have a positive impact on missing value imputation for medical datasets with numerical data, and that specific combinations of instance selection and imputation methods can improve the imputation results for medical datasets with mixed data types. However, instance selection does not have a clearly positive impact on the imputation results for categorical medical datasets.

7. Kumar, Nishith, Md Aminul Hoque, Md Shahjaman, S. M. Shahinul Islam, and Md Nurul Haque Mollah. "A New Approach of Outlier-robust Missing Value Imputation for Metabolomics Data Analysis." Current Bioinformatics 14, no. 1 (December 6, 2018): 43–52. http://dx.doi.org/10.2174/1574893612666171121154655.

Abstract:
Background: Metabolomics data generation and quantification differ from other types of molecular "omics" data in bioinformatics. Mass spectrometry (MS) based metabolomics data (gas chromatography mass spectrometry (GC-MS), liquid chromatography mass spectrometry (LC-MS), etc.) frequently contain missing values that make some quantitative analyses complex. Typically, metabolomics datasets contain 10% to 20% missing values that originate from several causes: analytical, computational, as well as biological. Imputation of missing values is therefore an important issue for further metabolomics data analysis. Objective: This paper introduces a new algorithm for missing value imputation in the presence of outliers for metabolomics data analysis. Method: Currently, the best-known missing value imputation techniques in metabolomics are k-nearest neighbours (kNN), random forest (RF), and zero imputation. However, these techniques are sensitive to outliers. In this paper, we propose an outlier-robust missing value imputation technique that minimizes a two-way empirical mean absolute error (MAE) loss function for imputing missing values in metabolomics data. Results: We investigate the performance of the proposed imputation technique in comparison with the other traditional imputation techniques using both simulated and real data, in the absence and presence of outliers. Conclusion: Both simulated and real data analyses show that the proposed outlier-robust technique outperforms the traditional missing value imputation methods both in the absence and in the presence of outliers.

8. Zimmermann, Pavel, Petr Mazouch, and Klára Hulíková Tesárková. "Missing Categorical Data Imputation and Individual Observation Level Imputation." Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis 62, no. 6 (2014): 1527–34. http://dx.doi.org/10.11118/actaun201462061527.

Abstract:
Traditional missing data imputation schemes focus on predicting the missing value from the other observed values. For continuous missing data, imputation often relies on regression models; for categorical data, the usual techniques are classification methods that set the missing value to the ‘most likely’ category. This, however, leads to overrepresentation of the categories that are in general observed more often, and hence can bias results in many tasks, especially in the presence of dominant categories. We present an original methodology for imputing missing values that yields the most likely structure (distribution) of the missing data conditional on the observed values. The methodology is based on the assumption that the categorical variable containing the missing values has a multinomial distribution, whose parameters are then estimated using multinomial logistic regression. An illustrative example of missing values and their reconstruction for the highest education level of persons in a population is described.

9. Mohamed, Marghny H., Abdel-Rahiem A. Hashem, and Mohammed M. Abdelsamea. "Scalable Algorithms for Missing Value Imputation." International Journal of Computer Applications 87, no. 11 (February 14, 2014): 35–42. http://dx.doi.org/10.5120/15255-4019.

10. Gashler, Michael S., Michael R. Smith, Richard Morris, and Tony Martinez. "Missing Value Imputation with Unsupervised Backpropagation." Computational Intelligence 32, no. 2 (July 1, 2014): 196–215. http://dx.doi.org/10.1111/coin.12048.

11. Mohamed, Marghny H. "Scalable Algorithms for Missing Value Imputation." International Journal of Computer Applications 28, no. 11 (August 31, 2011): 1–7. http://dx.doi.org/10.5120/3431-4669.

12. Yan, Xiaobo, Weiqing Xiong, Liang Hu, Feng Wang, and Kuo Zhao. "Missing Value Imputation Based on Gaussian Mixture Model for the Internet of Things." Mathematical Problems in Engineering 2015 (2015): 1–8. http://dx.doi.org/10.1155/2015/548605.

Abstract:
This paper addresses missing value imputation for the Internet of Things (IoT). Nowadays, the IoT is used widely across a variety of domains, such as transportation and logistics and healthcare. However, missing values are very common in the IoT for a variety of reasons, leaving experimental data incomplete. As a result, some work that depends on IoT data cannot be carried out normally, and the accuracy and reliability of data analysis results are reduced. Based on the characteristics of the data itself and the features of missing data in the IoT, this paper divides missing data into three types and defines three corresponding missing value imputation problems. We then propose three new models to solve the corresponding problems: a model of missing value imputation based on context and linear mean (MCL), a model of missing value imputation based on binary search (MBS), and a model of missing value imputation based on the Gaussian mixture model (MGI). Experimental results show that the three models greatly and effectively improve the accuracy, reliability, and stability of missing value imputation.
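For intuition only: a common baseline for the kind of context-based gap filling described above is linear interpolation of a sensor series over time. This sketch illustrates that baseline, not the paper's MCL, MBS, or MGI models:

```python
import numpy as np
import pandas as pd

# Hourly sensor readings with a two-step gap (values are made up).
readings = pd.Series([20.0, 20.5, np.nan, np.nan, 22.0, 21.5],
                     index=pd.date_range("2024-01-01", periods=6, freq="h"))

# Fill the gap linearly with respect to the timestamps on either side.
filled = readings.interpolate(method="time")
```

With evenly spaced timestamps this reduces to ordinary linear interpolation: the gap between 20.5 and 22.0 is filled with 21.0 and 21.5.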

13. Choi, Yoon-Young, Heeseung Shon, Young-Ji Byon, Dong-Kyu Kim, and Seungmo Kang. "Enhanced Application of Principal Component Analysis in Machine Learning for Imputation of Missing Traffic Data." Applied Sciences 9, no. 10 (May 26, 2019): 2149. http://dx.doi.org/10.3390/app9102149.

Abstract:
Missing value imputation approaches have been widely used to support and maintain the quality of traffic data. Although spatiotemporal dependency-based approaches can improve imputation performance for large and continuous missing patterns, additionally considering traffic states can lead to more reliable results, and a section-based approach is needed to improve imputation performance further. This study proposes a novel approach that identifies the traffic states of different spots along road sections, termed section-based traffic states (SBTS), and determines their spatiotemporal dependencies customized for each SBTS for missing value imputation. A principal component analysis (PCA) was employed, and angles obtained from the first principal component were used to identify the SBTSs. This pre-processing was combined with a support vector machine to develop the imputation model. Segmenting the SBTS using the angles and considering the spatiotemporal dependency of each state allowed the proposed approach to outperform other existing models.

14. Lin, Yiming, and Sharad Mehrotra. "ZIP: Lazy Imputation during Query Processing." Proceedings of the VLDB Endowment 17, no. 1 (September 2023): 28–40. http://dx.doi.org/10.14778/3617838.3617841.

Abstract:
This paper develops a query-time missing value imputation framework, entitled ZIP, that modifies relational operators to be imputation aware in order to minimize the joint cost of imputing and query processing. The modified operators use a cost-based decision function to determine whether to invoke imputation or to defer to downstream operators to resolve missing values. The modified query processing logic ensures results with deferred imputations are identical to those produced if all missing values were imputed first. ZIP includes a novel outer-join based approach to preserve missing values during execution, and a bloom filter based index to optimize the space and running overhead. Extensive experiments on both real and synthetic data sets demonstrate 10 to 25 times improvement when augmenting the state-of-the-art technology, ImputeDB, with ZIP-based deferred imputation. ZIP also outperforms the offline approach by up to 19607 times in a real data set.

15. Gardner, Miranda L., and Michael A. Freitas. "Multiple Imputation Approaches Applied to the Missing Value Problem in Bottom-Up Proteomics." International Journal of Molecular Sciences 22, no. 17 (September 6, 2021): 9650. http://dx.doi.org/10.3390/ijms22179650.

Abstract:
Analysis of differential abundance in proteomics data sets requires careful application of missing value imputation. Missing abundance values vary widely when performing comparisons across different sample treatments. For example, one would expect a consistent rate of "missing at random" (MAR) across batches of samples and varying rates of "missing not at random" (MNAR) depending on the inherent differences in sample treatments within the study. An imputation strategy must thus be selected that accounts for both MAR and MNAR simultaneously. Several important issues must be considered when deciding on the appropriate strategy: (1) when it is appropriate to impute data; (2) how to choose a method that reflects the combination of MAR and MNAR that occurs in an experiment. This paper evaluates missing value imputation strategies used in proteomics and presents a case for the use of hybrid left-censored imputation approaches that can handle the MNAR problem common to proteomics data.
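As an illustration of the left-censored idea, MNAR values in proteomics are often replaced by draws from a normal distribution shifted below the observed intensity distribution. The sketch below uses conventional width and downshift factors; they are assumptions for illustration, not values prescribed by the paper:

```python
import numpy as np

def impute_left_censored(x, width=0.3, shift=1.8, seed=0):
    """Replace NaNs with draws from a narrow normal shifted below the observed data."""
    rng = np.random.default_rng(seed)
    obs = x[~np.isnan(x)]
    mu, sd = obs.mean(), obs.std()          # population ddof=0 std of observed values
    out = x.copy()
    miss = np.isnan(x)
    out[miss] = rng.normal(mu - shift * sd, width * sd, size=miss.sum())
    return out

# Log2 intensities with two MNAR-style dropouts (made-up numbers).
intensities = np.array([25.1, 24.8, np.nan, 26.0, np.nan, 25.5])
imputed = impute_left_censored(intensities)
```

The imputed values land in the left tail of the observed distribution, encoding the assumption that the peptide was missing because it was below the detection limit.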

16. Raja Kumaran, Shamini, Mohd Shahizan Othman, and Lizawati Mi Yusuf. "Estimation of Missing Values Using Optimised Hybrid Fuzzy C-Means and Majority Vote for Microarray Data." Journal of Information and Communication Technology 19, no. 4 (August 20, 2020): 459–82. http://dx.doi.org/10.32890/jict2020.19.4.1.

Abstract:
Missing values are a huge constraint in microarray technologies towards improving and identifying disease-causing genes. Estimating missing values is an undeniable scenario faced by field experts. The imputation method is an effective way to impute the proper values to proceed with the next process in microarray technology. Missing value imputation methods may increase the classification accuracy. Although these methods might predict the values, classification accuracy rates prove the ability of the methods to identify the missing values in gene expression data. In this study, a novel method, Optimised Hybrid of Fuzzy C-Means and Majority Vote (opt-FCMMV), was proposed to identify the missing values in the data. Using the Majority Vote (MV) and optimisation through Particle Swarm Optimisation (PSO), this study predicted missing values in the data to form more informative and solid data. In order to verify the effectiveness of opt-FCMMV, several experiments were carried out on two publicly available microarray datasets (i.e. Ovary and Lung Cancer) under three missing value mechanisms with five different percentage values in the biomedical domain using Support Vector Machine (SVM) classifier. The experimental results showed that the proposed method functioned efficiently by showcasing the highest accuracy rate as compared to the one without imputations, with imputation by Fuzzy C-Means (FCM), and imputation by Fuzzy C-Means with Majority Vote (FCMMV). For example, the accuracy rates for Ovary Cancer data with 5% missing values were 64.0% for no imputation, 81.8% (FCM), 90.0% (FCMMV), and 93.7% (opt-FCMMV). Such an outcome indicates that the opt-FCMMV may also be applied in different domains in order to prepare the dataset for various data mining tasks.

17. Tada, Mayu, Natsumi Suzuki, and Yoshifumi Okada. "Missing Value Imputation Method for Multiclass Matrix Data Based on Closed Itemset." Entropy 24, no. 2 (February 16, 2022): 286. http://dx.doi.org/10.3390/e24020286.

Abstract:
Handling missing values in matrix data is an important step in data analysis. To date, many methods to estimate missing values based on data pattern similarity have been proposed. Most previously proposed methods perform missing value imputation based on data trends over the entire feature space. However, individual missing values are likely to show similarity to data patterns in local feature space. In addition, most existing methods focus on single class data, while multiclass analysis is frequently required in various fields. Missing value imputation for multiclass data must consider the characteristics of each class. In this paper, we propose two methods based on closed itemsets, CIimpute and ICIimpute, to achieve missing value imputation using local feature space for multiclass matrix data. CIimpute estimates missing values using closed itemsets extracted from each class. ICIimpute is an improved method of CIimpute in which an attribute reduction process is introduced. Experimental results demonstrate that attribute reduction considerably reduces computational time and improves imputation accuracy. Furthermore, it is shown that, compared to existing methods, ICIimpute provides superior imputation accuracy but requires more computational time.

18. Pettersson, Nicklas. "Bias reduction of finite population imputation by kernel methods." Statistics in Transition New Series 14, no. 1 (March 4, 2013): 139–60. http://dx.doi.org/10.59170/stattrans-2013-009.

Abstract:
Missing data is a nuisance in statistics. Real donor imputation can be used with item nonresponse. A pool of donor units with similar values on auxiliary variables is matched to each unit with missing values. The missing value is then replaced by a copy of the corresponding observed value from a randomly drawn donor. Such methods can to some extent protect against nonresponse bias. But bias also depends on the estimator and the nature of the data. We adopt techniques from kernel estimation to combat this bias. Motivated by Pólya urn sampling, we sequentially update the set of potential donors with units already imputed, and use multiple imputations via Bayesian bootstrap to account for imputation uncertainty. Simulations with a single auxiliary variable show that our imputation method performs almost as well as competing methods with linear data, but better when data is nonlinear, especially with large samples.
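A stripped-down sketch of the real-donor (hot-deck) step described above: each missing value is replaced by a copy from a randomly drawn donor whose auxiliary value is close. The Pólya urn updating and Bayesian bootstrap machinery of the paper are omitted, and `donor_impute` is an illustrative helper:

```python
import numpy as np

def donor_impute(aux, y, k=2, seed=0):
    """Hot-deck imputation: draw each replacement from the k nearest donors on aux."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    donors = np.where(~np.isnan(y))[0]          # fixed donor set (no sequential update here)
    for i in np.where(np.isnan(y))[0]:
        # Donor pool: the k responding units closest on the auxiliary variable.
        pool = donors[np.argsort(np.abs(aux[donors] - aux[i]))[:k]]
        y[i] = y[rng.choice(pool)]              # copy an observed value at random
    return y

aux = np.array([1.0, 1.1, 1.2, 5.0, 5.1, 5.2])          # auxiliary variable (fully observed)
y = np.array([10.0, 11.0, np.nan, 50.0, np.nan, 52.0])  # study variable with nonresponse
y_imp = donor_impute(aux, y)
```

Because every imputed value is a copy of a real observed value, the method preserves the support of the data, which is one reason real-donor imputation offers some protection against nonresponse bias.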

19. Hameed, Wafaa Mustafa, and Nzar A. Ali. "Comparison of Seventeen Missing Value Imputation Techniques." Journal of Hunan University Natural Sciences 49, no. 7 (July 30, 2022): 26–36. http://dx.doi.org/10.55463/issn.1674-2974.49.7.4.

Abstract:
Copious data are collected and stored each day, and that information can be used to extract interesting patterns. However, the data we collect are ordinarily incomplete, and using them to extract information may give misleading results, so we preprocess the data to eliminate anomalies. With a low rate of missing values, the affected instances can be ignored, but with large amounts, ignoring them will not give the desired results. Many missing entries in a dataset are a huge issue faced by analysts, as they can lead to numerous problems in quantitative investigations. So, before performing any data mining procedure to extract useful information from a dataset, some pre-processing can be done to avoid such anomalies and thereby improve the quality of the data. For handling such missing values, many methods have been proposed since 1980. The simplest procedure is to discard the records containing missing values. Another is imputation, which replaces missing entries with estimates obtained through computation; this increases the quality of the data and improves prediction results. This paper reviews methods for handling missing data, such as median imputation (MDI), hot (cold) deck imputation, regression imputation, expectation maximization (EM), support vector machine imputation (SVMI), multivariate imputation by chained equations (MICE), the SICE technique, reinforcement programming, nonparametric iterative imputation algorithms (NIIA), and multilayer perceptrons. The paper also explores good options for estimating missing values for other researchers in this field of study, and aims to help them figure out which methods are commonly used now.
The overview may also provide insight into each method and its advantages and limitations to consider for future research in this field of study. It can be a baseline to answer the questions of which techniques have been used and which is the most popular.

20. Salem, Awsan, Nurul Akmar Emran, Azah Kamilah Muda, Zahriah Sahri, and Abdulrazzak Ali. "Missing values imputation in Arabic datasets using enhanced robust association rules." Indonesian Journal of Electrical Engineering and Computer Science 28, no. 2 (November 1, 2022): 1067. http://dx.doi.org/10.11591/ijeecs.v28.i2.pp1067-1075.

Abstract:
Missing value (MV) is one form of data completeness problem in massive datasets. To deal with missing values, data imputation methods were proposed with the aim to improve the completeness of the datasets concerned. Data imputation's accuracy is a common indicator of a data imputation technique's efficiency. However, the efficiency of data imputation can be affected by the nature of the language in which the dataset is written. To overcome this problem, it is necessary to normalize the data, especially in non-Latin languages such as the Arabic language. This paper proposes a method that will address the challenge inherent in Arabic datasets by extending the enhanced robust association rules (ERAR) method with Arabic detection and correction functions. Iterative and Decision Tree methods were used to evaluate the proposed method in an experiment. Experiment results show that the proposed method offers a higher data imputation accuracy than the Iterative and Decision Tree methods.

21. Fouad, Khaled M., Mahmoud M. Ismail, Ahmad Taher Azar, and Mona M. Arafa. "Advanced methods for missing values imputation based on similarity learning." PeerJ Computer Science 7 (July 21, 2021): e619. http://dx.doi.org/10.7717/peerj-cs.619.

Abstract:
Real-world data analysis and processing with data mining techniques often face observations that contain missing values; indeed, the main challenge of mining datasets is the existence of missing values. The missing values in a dataset should be imputed to improve the accuracy and performance of data mining methods. Some existing techniques use the k-nearest neighbors algorithm to impute missing values, but determining an appropriate k value can be a challenging task. Other existing imputation techniques are based on hard clustering algorithms; when records are not well separated, as in the case of missing data, hard clustering often provides a poor description. In general, imputation based on similar records is more accurate than imputation based on all of a dataset's records, so improving the similarity among records can improve imputation performance. This paper proposes two numerical missing data imputation methods. A hybrid method, called KI, is proposed first; it incorporates the k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each incomplete record is discovered through record similarity using the k-nearest neighbors algorithm (kNN). To improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method is then used to impute the missing values of the incomplete records using the global correlation structure among the selected records. An enhanced hybrid method, called FCKI, extends KI by integrating fuzzy c-means, k-nearest neighbors, and iterative imputation to impute the missing data in a dataset. Fuzzy c-means is selected because records can belong to multiple clusters at the same time, which can further improve similarity. FCKI searches a cluster, instead of the whole dataset, to find the best k nearest neighbors, applying two levels of similarity to achieve higher imputation accuracy. The performance of the proposed imputation techniques is assessed on fifteen datasets with varying missing ratios for three types of missing data (MCAR, MAR, and MNAR), generated in this work; datasets of different sizes are used to validate the model. The proposed techniques are compared with other missing data imputation methods by means of three measures: the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.
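A rough sketch of the KI idea under simplifying assumptions: locate similar records with kNN on a crudely mean-filled copy (with a fixed neighbourhood size rather than the paper's automatic choice of k), then run iterative imputation on that neighbourhood only:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before import)
from sklearn.impute import IterativeImputer
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)  # column 2 tracks column 0
X[0, 2] = np.nan                                # one incomplete record

# Crude mean-fill so distances can be computed despite the missing entry.
X_filled = np.where(np.isnan(X), np.nanmean(X, axis=0), X)

# Neighbourhood of the incomplete record (includes the record itself at distance 0).
nbrs = NearestNeighbors(n_neighbors=20).fit(X_filled)
_, idx = nbrs.kneighbors(X_filled[[0]])

# Iterative imputation restricted to the neighbourhood, per the KI idea.
local_imputed = IterativeImputer(random_state=0).fit_transform(X[idx[0]])
x0 = local_imputed[0]                           # record 0 with its gap filled
```

Restricting the imputation model to similar records is the point of the hybrid: the regression inside `IterativeImputer` is fit only on the neighbourhood's correlation structure.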

22. Batra, Shivani, Rohan Khurana, Mohammad Zubair Khan, Wadii Boulila, Anis Koubaa, and Prakash Srivastava. "A Pragmatic Ensemble Strategy for Missing Values Imputation in Health Records." Entropy 24, no. 4 (April 10, 2022): 533. http://dx.doi.org/10.3390/e24040533.

Abstract:
Pristine and trustworthy data are required for efficient computer modelling for medical decision-making, yet data in medical care are frequently missing. As a result, missing values may occur not just in training data but also in testing data that might contain a single undiagnosed episode or participant. This study evaluates different imputation and regression procedures, identified based on regressor performance and computational expense, to fix the issues of missing values in both training and testing datasets. In the context of healthcare, several procedures have been introduced for dealing with missing values; however, there is still debate over which imputation strategies are better in specific cases. This research proposes an ensemble imputation model that learns to use a combination of simple mean imputation, k-nearest neighbour imputation, and iterative imputation methods, and then leverages them so that the ideal imputation strategy is selected among them based on attribute correlations on missing value features. We introduce a unique Ensemble Strategy for Missing Value to analyse healthcare data with considerable missing values and identify unbiased and accurate prediction statistical modelling. The performance metrics have been generated using the eXtreme gradient boosting regressor, random forest regressor, and support vector regressor. The current study uses real-world healthcare data to conduct experiments and simulations with varying feature-wise missing frequencies, indicating that the proposed technique surpasses standard missing value imputation approaches, as well as the approach of dropping records holding missing values, in terms of accuracy.
23

Zhang, Shichao. "Estimating Semi-Parametric Missing Values with Iterative Imputation." International Journal of Data Warehousing and Mining 6, no. 3 (July 2010): 1–10. http://dx.doi.org/10.4018/jdwm.2010070101.

Abstract:
In this paper, the author designs an efficient method for iteratively imputing missing target values with semi-parametric kernel regression imputation, known as the semi-parametric iterative imputation algorithm (SIIA). While there is little prior knowledge of the datasets, the proposed iterative imputation method, which imputes each missing value several times until the algorithm converges in each model, utilizes a substantial amount of useful information. This information includes instances containing missing values, and the method captures the real dataset distribution more easily than parametric or nonparametric imputation techniques. Experimental results show that the author’s imputation methods outperform existing methods in terms of imputation accuracy, particularly in situations with a high missing ratio.
24

Bansal, Parikshit, Prathamesh Deshpande, and Sunita Sarawagi. "Missing value imputation on multidimensional time series." Proceedings of the VLDB Endowment 14, no. 11 (July 2021): 2533–45. http://dx.doi.org/10.14778/3476249.3476300.

Abstract:
We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets. Missing values are commonplace in decision support platforms that aggregate data over long time stretches from disparate sources, whereas reliable data analytics calls for careful handling of missing data. One strategy is imputing the missing values, and a wide variety of algorithms exist spanning simple interpolation, matrix factorization methods like SVD, statistical models like Kalman filters, and recent deep learning methods. We show that often these provide worse results on aggregate analytics compared to just excluding the missing data. DeepMVI expresses the distribution of each missing value conditioned on coarse and fine-grained signals along a time series, and signals from correlated series at the same time. Instead of resorting to linearity assumptions of conventional matrix factorization methods, DeepMVI harnesses a flexible deep network to extract and combine these signals in an end-to-end manner. To prevent over-fitting with high-capacity neural networks, we design a robust parameter training with labeled data created using synthetic missing blocks around available indices. Our neural network uses a modular design with a novel temporal transformer with convolutional features, and kernel regression with learned embeddings. Experiments across ten real datasets, five different missing scenarios, comparing seven conventional and three deep learning methods show that DeepMVI is significantly more accurate, reducing error by more than 50% in more than half the cases, compared to the best existing method. Although slower than simpler matrix factorization methods, we justify the increased time overheads by showing that DeepMVI provides significantly more accurate imputation that finally impacts quality of downstream analytics.
25

Mohamed, Marghny H., Abdel-Rahiem A. Hashem, and M. M. AbdelSamea. "Data Mining Techniques for Missing Value Imputation." JES. Journal of Engineering Sciences 38, no. 4 (July 1, 2010): 1001–12. http://dx.doi.org/10.21608/jesaun.2010.125559.

26

Armitage, Emily Grace, Joanna Godzien, Vanesa Alonso-Herranz, Ángeles López-Gonzálvez, and Coral Barbas. "Missing value imputation strategies for metabolomics data." ELECTROPHORESIS 36, no. 24 (October 20, 2015): 3050–60. http://dx.doi.org/10.1002/elps.201500352.

27

Aziz, RZ Abdul, Sri Lestari, Fitria Fitria, and Febri Arianto. "Imputation missing value to overcome sparsity problems." TELKOMNIKA (Telecommunication Computing Electronics and Control) 22, no. 4 (August 1, 2024): 949. http://dx.doi.org/10.12928/telkomnika.v22i4.25940.

28

Alade, Oyekale Abel, Ali Selamat, and Roselina Sallehuddin. "The Effects of Missing Data Characteristics on the Choice of Imputation Techniques." Vietnam Journal of Computer Science 07, no. 02 (March 20, 2020): 161–77. http://dx.doi.org/10.1142/s2196888820500098.

Abstract:
One major characteristic of data is completeness. Missing data is a significant problem in medical datasets: it leads to incorrect classification of patients and endangers patient health management. Many factors lead to missing values in medical datasets. In this paper, we propose examining the causes of missing data in a medical dataset to ensure that the right imputation method is used in solving the problem. The mechanism of missingness was studied to determine the missing pattern of the datasets and a suitable imputation technique for generating complete datasets. The pattern shows that the missingness of the dataset used in this study is not monotone. Also, because single imputation techniques underestimate variance and ignore relationships among the variables, we used a multiple imputation technique that runs five iterations for the imputation of each missing value. All missing values in the dataset were regenerated. The imputed datasets were validated using an extreme learning machine (ELM) classifier. The results show improvement in the accuracy of the imputed datasets. The work can, however, be extended to compare the accuracy of the imputed datasets with the original dataset using different classifiers, such as support vector machines (SVM), radial basis function (RBF) networks, and ELMs.
29

Thomas, Tressy, and Enayat Rajabi. "A systematic review of machine learning-based missing value imputation techniques." Data Technologies and Applications 55, no. 4 (April 2, 2021): 558–85. http://dx.doi.org/10.1108/dta-12-2020-0298.

Abstract:
Purpose: The primary aim of this study is to review the studies from different dimensions, including the type of methods, experimentation setup and evaluation metrics used in the novel approaches proposed for data imputation, particularly in the machine learning (ML) area. This ultimately provides an understanding of how well the proposed frameworks are evaluated and what types and ratios of missingness are addressed in the proposals. The review questions in this study are: (1) What ML-based imputation methods were studied and proposed during 2010–2020? (2) How are the experimentation setup, characteristics of data sets and missingness employed in these studies? (3) What metrics were used for the evaluation of the imputation methods?
Design/methodology/approach: The review process went through the standard identification, screening and selection process. The initial search on electronic databases for missing value imputation (MVI) based on ML algorithms returned a large number of papers, totaling 2,883. Most of the papers at this stage were not exactly an MVI technique relevant to this study. The papers were first scanned by title for relevancy, and 306 were identified as appropriate. Upon reviewing the abstract text, 151 papers not eligible for this study were dropped. This resulted in 155 research papers suitable for full-text review, of which 117 papers were used in the assessment of the review questions.
Findings: This study shows that clustering- and instance-based algorithms are the most proposed MVI methods. Percentage of correct prediction (PCP) and root mean square error (RMSE) are the most used evaluation metrics in these studies. For experimentation, the majority of the studies sourced their data sets from publicly available repositories. A common approach is to set the complete data set as the baseline and evaluate the effectiveness of imputation on test data sets with artificially induced missingness. The data set size and missingness ratio varied across the experimentations, while missing datatype and mechanism pertain to the capability of imputation. Computational expense is a concern, and experimentation using large data sets appears to be a challenge.
Originality/value: It is understood from the review that there is no single universal solution to the missing data problem. Variants of ML approaches work well with missingness depending on the characteristics of the data set. Most of the methods reviewed lack generalization with regard to applicability. Another concern related to applicability is the complexity of the formulation and implementation of the algorithm. Imputations based on k-nearest neighbors (kNN) and clustering algorithms, which are simple and easy to implement, are popular across various domains.
30

Li, Xiao, Huan Li, Hua Lu, Christian S. Jensen, Varun Pandey, and Volker Markl. "Missing Value Imputation for Multi-Attribute Sensor Data Streams via Message Propagation." Proceedings of the VLDB Endowment 17, no. 3 (November 2023): 345–58. http://dx.doi.org/10.14778/3632093.3632100.

Abstract:
Sensor data streams occur widely in various real-time applications in the context of the Internet of Things (IoT). However, sensor data streams feature missing values due to factors such as sensor failures, communication errors, or depleted batteries. Missing values can compromise the quality of real-time analytics tasks and downstream applications. Existing imputation methods either make strong assumptions about streams or have low efficiency. In this study, we aim to accurately and efficiently impute missing values in data streams that satisfy only general characteristics in order to benefit real-time applications more widely. First, we propose a message propagation imputation network (MPIN) that is able to recover the missing values of data instances in a time window. We give a theoretical analysis of why MPIN is effective. Second, we present a continuous imputation framework that consists of data update and model update mechanisms to enable MPIN to perform continuous imputation both effectively and efficiently. Extensive experiments on multiple real datasets show that MPIN can outperform the existing data imputers by wide margins and that the continuous imputation framework is efficient and accurate.
31

Raudhatunnisa, Tsasya, and Nori Wilantika. "Performance Comparison of Hot-Deck Imputation, K-Nearest Neighbor Imputation, and Predictive Mean Matching in Missing Value Handling, Case Study: March 2019 SUSENAS Kor Dataset." Proceedings of The International Conference on Data Science and Official Statistics 2021, no. 1 (January 4, 2022): 753–70. http://dx.doi.org/10.34123/icdsos.v2021i1.93.

Abstract:
Missing values can cause bias and make a dataset unrepresentative of the actual situation. The selection of methods for handling missing values is important because it affects the resulting estimates. Therefore, this study compares three imputation methods for handling missing values: Hot-Deck Imputation, K-Nearest Neighbor Imputation (KNNI), and Predictive Mean Matching (PMM). Because the three methods work differently, their estimates differ. The criteria used to compare them are the root mean squared error (RMSE), unsupervised classification error (UCE), supervised classification error (SCE), and the time taken to run the algorithm. This study uses two analyses: a comparative analysis and a scoring analysis. The comparative analysis applies a simulation that accounts for the missing-value mechanism: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). The scoring analysis then narrows down the results of the comparative analysis by scoring the imputation results of the three methods. Based on the scores, the results suggest that Hot-Deck Imputation deals with missing values best.
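The evaluation protocol this abstract describes (delete values under a known mechanism, impute, and score against the ground truth with RMSE) can be sketched generically. Column-mean imputation stands in for the three compared methods, which are not implemented here; only the MCAR mechanism is shown.

```python
import numpy as np

def mcar_mask(shape, frac=0.1, seed=0):
    """Boolean mask of cells to delete, Missing Completely At Random:
    each cell is deleted independently of the data values."""
    rng = np.random.default_rng(seed)
    return rng.random(shape) < frac

def rmse(true, imputed, mask):
    """Root mean squared error computed only over the deleted cells."""
    return float(np.sqrt(np.mean((true[mask] - imputed[mask]) ** 2)))

# Demo: column-mean imputation as a placeholder for hot-deck / KNNI / PMM.
truth = np.random.default_rng(0).normal(size=(50, 4))
mask = mcar_mask(truth.shape, frac=0.2)
damaged = np.where(mask, np.nan, truth)
imputed = np.where(mask, np.nanmean(damaged, axis=0), damaged)
score = rmse(truth, imputed, mask)
```

Simulating MAR or MNAR would only change how `mask` is generated (making deletion probability depend on observed or unobserved values, respectively); the scoring step stays the same.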
32

Lee, Do-Hoon, and Han-joon Kim. "A Self-Attention-Based Imputation Technique for Enhancing Tabular Data Quality." Data 8, no. 6 (June 4, 2023): 102. http://dx.doi.org/10.3390/data8060102.

Abstract:
Recently, data-driven decision-making has attracted great interest; this requires high-quality datasets. However, real-world datasets often feature missing values for unknown or intentional reasons, rendering data-driven decision-making inaccurate. If a machine learning model is trained using incomplete datasets with missing values, the inferred results may be biased. In this case, a commonly used technique is the missing value imputation (MVI), which fills missing data with possible values estimated based on observed values. Various data imputation methods using machine learning, statistical inference, and relational database theories have been developed. Among them, conventional machine learning based imputation methods that handle tabular data can deal with only numerical columns or are time-consuming and cumbersome because they create an individualized predictive model for each column. Therefore, we have developed a novel imputational neural network that we term the Denoising Self-Attention Network (DSAN). Our proposed DSAN can deal with tabular datasets containing both numerical and categorical columns; it considers discretized numerical values as categorical values for embedding and self-attention layers. Furthermore, the DSAN learns robust feature expression vectors by combining self-attention and denoising techniques, and can predict multiple, appropriate substituted values simultaneously (via multi-task learning). To verify the validity of the method, we performed data imputation experiments after arbitrarily generating missing values for several real-world tabular datasets. We evaluated both imputational and downstream task performances, and we have seen that the DSAN outperformed the other models, especially in terms of category variable imputation.
33

Lenz, Michael, Andreas Schulz, Thomas Koeck, Steffen Rapp, Markus Nagler, Madeleine Sauer, Lisa Eggebrecht, et al. "Missing value imputation in proximity extension assay-based targeted proteomics data." PLOS ONE 15, no. 12 (December 14, 2020): e0243487. http://dx.doi.org/10.1371/journal.pone.0243487.

Abstract:
Targeted proteomics utilizing antibody-based proximity extension assays provides sensitive and highly specific quantifications of plasma protein levels. Multivariate analysis of this data is hampered by frequent missing values (random or left censored), calling for imputation approaches. While appropriate missing-value imputation methods exist, benchmarks of their performance in targeted proteomics data are lacking. Here, we assessed the performance of two methods for imputation of values missing completely at random, the previously top-benchmarked ‘missForest’ and the recently published ‘GSimp’ method. Evaluation was accomplished by comparing imputed with remeasured relative concentrations of 91 inflammation related circulating proteins in 86 samples from a cohort of 645 patients with venous thromboembolism. The median Pearson correlation between imputed and remeasured protein expression values was 69.0% for missForest and 71.6% for GSimp (p = 5.8e-4). Imputation with missForest resulted in stronger reduction of variance compared to GSimp (median relative variance of 25.3% vs. 68.6%, p = 2.4e-16) and undesired larger bias in downstream analyses. Irrespective of the imputation method used, the 91 imputed proteins revealed large variations in imputation accuracy, driven by differences in signal to noise ratio and information overlap between proteins. In summary, GSimp outperformed missForest, while both methods show good overall imputation accuracy with large variations between proteins.
34

Cihan, Pinar, and Zeynep Banu Ozger. "A New Heuristic Approach for Treating Missing Value: ABCimp." Elektronika ir Elektrotechnika 25, no. 6 (December 6, 2019): 48–54. http://dx.doi.org/10.5755/j01.eie.25.6.24826.

Abstract:
Missing values in datasets present an important problem for traditional and modern statistical methods. Many statistical methods have been developed to analyze complete datasets; however, most real-world datasets contain missing values, so in recent years many methods have been developed to overcome the missing value problem. Heuristic methods have become popular in this field due to their superior performance in many other optimization problems. This paper introduces an Artificial Bee Colony algorithm-based approach for missing value imputation in four real-world discrete datasets. In the proposed Artificial Bee Colony Imputation (ABCimp) method, Bayesian optimization is integrated into the Artificial Bee Colony algorithm. The performance of the proposed technique is compared with six other well-known methods: mean, median, k-nearest neighbor (k-NN), Multivariate Imputation by Chained Equations (MICE), singular value decomposition (SVD), and MissForest (MF). The classification error and root mean square error are used as the evaluation criteria of imputation performance, and the Naive Bayes algorithm is used as the classifier. The empirical results show that ABCimp performs better than the other popular imputation methods at missing rates ranging from 3% to 15%.
35

Md Soom, Afiqah Bazlla, Aisyah Mat Jasin, Aszila Asmat, Roger Canda, and Juhaida Ismail. "A BAD IDEA OF USING MODE IMPUTATION METHOD." Journal of Information System and Technology Management 7, no. 29 (December 1, 2022): 01–09. http://dx.doi.org/10.35631/jistm.729001.

Abstract:
Missing data is a recurring issue in psychology questionnaires when respondents do not answer questions for personal reasons. In general, two types of imputation techniques are used to replace missing data: single imputation and multiple imputation (MI). A single imputation technique generates one value to fill each missing datum; the simplest such methods are the mean, mode, and median. In contrast, a multiple imputation technique imputes each missing datum several times, resulting in multiple complete datasets. The most popular MI method that can deal with both numerical and categorical data types is predictive mean matching (PMM). The aim of this article is to compare and visualize how the mode imputation method leads to a biased data distribution and how the PMM method reduces this issue. Both methods are often considered when dealing with categorical data. Mode imputation replaces a missing datum with the most frequent value of an item in a survey, while predictive mean matching is an extension of the regression model that applies a donor-selection strategy to replace a missing datum. Bar charts show that multiple imputation yields less discrepancy between the original and imputed distributions. Thus, it can be concluded that the PMM method produces a less biased distribution than the mode imputation method. A comparison of imputation methods with different missing rates on a survey dataset should be considered for future work.
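A toy illustration (synthetic answers, not the paper's survey data) of the bias the article describes: mode imputation funnels every missing response into the single most frequent category, inflating its share relative to the true distribution.

```python
from collections import Counter
import random

random.seed(1)
# True answer distribution: 50% agree, 30% neutral, 20% disagree.
answers = ["agree"] * 50 + ["neutral"] * 30 + ["disagree"] * 20
# Delete ~30% of responses completely at random.
observed = [a if random.random() > 0.3 else None for a in answers]

# Mode imputation: every missing answer becomes the most frequent category.
mode = Counter(a for a in observed if a is not None).most_common(1)[0][0]
imputed = [a if a is not None else mode for a in observed]

print(Counter(answers))   # original distribution
print(Counter(imputed))   # mode-imputed: modal category over-represented
```

PMM would instead draw each replacement from an observed donor whose predicted value is close, which preserves the spread of categories rather than collapsing them onto the mode.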
36

Li, Cong, Xupeng Ren, and Guohui Zhao. "Machine-Learning-Based Imputation Method for Filling Missing Values in Ground Meteorological Observation Data." Algorithms 16, no. 9 (September 2, 2023): 422. http://dx.doi.org/10.3390/a16090422.

Abstract:
Ground meteorological observation data (GMOD) are the core of research on earth-related disciplines and an important reference for societal production and life. Unfortunately, due to operational issues or equipment failures, missing values may occur in GMOD. Hence, the imputation of missing data is a prevalent issue during the pre-processing of GMOD. Although a large number of machine-learning methods have been applied to the field of meteorological missing value imputation and have achieved good results, they are usually aimed at specific meteorological elements, and few studies discuss imputation when multiple elements are randomly missing in the dataset. This paper designed a machine-learning-based multidimensional meteorological data imputation framework (MMDIF), which can use the predictions of machine-learning methods to impute the GMOD with random missing values in multiple attributes, and tested the effectiveness of 20 machine-learning methods on imputing missing values within 124 meteorological stations across six different climatic regions based on the MMDIF. The results show that MMDIF-RF was the most effective missing value imputation method; it is better than other methods for imputing 11 types of hourly meteorological elements. Although this paper applied MMDIF to the imputation of missing values in meteorological data, the method can also provide guidance for dataset reconstruction in other industries.
37

Purwar, Archana, and Sandeep Kumar Singh. "DBSCANI: Noise-Resistant Method for Missing Value Imputation." Journal of Intelligent Systems 25, no. 3 (July 1, 2016): 431–40. http://dx.doi.org/10.1515/jisys-2014-0172.

Abstract:
The quality of data is an important concern in data mining: the validity of mining algorithms is reduced if the data are not of good quality. Data quality can be assessed in terms of missing values (MVs) as well as the noise present in the data set. Various imputation techniques have been studied for MVs, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel groups objects according to their density in spatial databases: high-density regions form clusters, and low-density regions contain the noise objects in the data set. Extensive experiments have been performed on the Iris data set from the life-science domain and Jain’s (2D) data set from the shape data sets. The performance of the proposed method is evaluated using the root mean square error (RMSE) and compared with the existing K-means imputation (KMI). Results show that our method is more noise-resistant than KMI on the data sets under study.
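One possible reading of the DBSCANI idea, sketched with scikit-learn's DBSCAN (this is our illustration, not the authors' exact algorithm): cluster the complete records, ignore noise points (label -1), and fill an incomplete record from the centroid of its nearest cluster. The `eps` and `min_samples` values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_impute(X, eps=0.8, min_samples=4):
    """Impute missing cells from the nearest DBSCAN cluster centroid,
    so that noise points never contribute to the imputed values."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[complete])
    # Per-cluster centroids, excluding the noise label -1.
    centroids = {k: X[complete][labels == k].mean(axis=0)
                 for k in set(labels) if k != -1}
    fallback = np.nanmean(X, axis=0)          # used when every point is noise
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        best = min(centroids.values(),
                   key=lambda c: np.linalg.norm(c[obs] - X[i][obs]),
                   default=fallback)
        out[i, ~obs] = best[~obs]
    return out
```

Because noise points are excluded from every centroid, an outlier among the complete records cannot drag the imputed values toward itself, which is the noise-resistance property the abstract claims over K-means imputation.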
38

Hina, Ayub, and Jamil Harun. "Enhancing Missing Values Imputation through Transformer-Based Predictive Modeling." IgMin Research 2, no. 1 (January 23, 2024): 025–31. http://dx.doi.org/10.61927/igmin140.

Abstract:
This paper tackles the vital issue of missing value imputation in data preprocessing, where traditional techniques like zero, mean, and KNN imputation fall short in capturing intricate data relationships. This often results in suboptimal outcomes, and discarding records with missing values leads to significant information loss. Our innovative approach leverages advanced transformer models renowned for handling sequential data. The proposed predictive framework trains a transformer model to predict missing values, yielding a marked improvement in imputation accuracy. Comparative analysis against traditional methods—zero, mean, and KNN imputation—consistently favors our transformer model. Importantly, LSTM validation further underscores the superior performance of our approach. In hourly data, our model achieves a remarkable R2 score of 0.96, surpassing KNN imputation by 0.195. For daily data, the R2 score of 0.806 outperforms KNN imputation by 0.015 and exhibits a notable superiority of 0.25 over mean imputation. Additionally, in monthly data, the proposed model’s R2 score of 0.796 excels, showcasing a significant improvement of 0.1 over mean imputation. These compelling results highlight the proposed model’s ability to capture underlying patterns, offering valuable insights for enhancing missing values imputation in data analyses.
39

Bergamo, Genevile Carife, Carlos Tadeu dos Santos Dias, and Wojtek Janusz Krzanowski. "Distribution-free multiple imputation in an interaction matrix through singular value decomposition." Scientia Agricola 65, no. 4 (2008): 422–27. http://dx.doi.org/10.1590/s0103-90162008000400015.

Abstract:
Some techniques of multivariate statistical analysis can only be conducted on a complete data matrix, but the process of data collection often misses some elements. Imputation is a technique by which the missing elements are replaced by plausible values, so that a valid analysis can be performed on the completed data set. A multiple imputation method is proposed based on a modification to the singular value decomposition (SVD) method for single imputation, developed by Krzanowski. The method was evaluated on a genotype × environment (G × E) interaction matrix obtained from a randomized blocks experiment on Eucalyptus grandis grown in multienvironments. Values of E. grandis heights in the G × E complete interaction matrix were deleted randomly at three different rates (5%, 10%, 30%) and were then imputed by the proposed methodology. The results were assessed by means of a general measure of performance (Tacc), and showed a small bias when compared to the original data. However, bias values were greater than the variability of imputations relative to their mean, indicating a smaller accuracy of the proposed method in relation to its precision. The proposed methodology uses the maximum amount of available information, does not have any restrictions regarding the pattern or mechanism of the missing values, and is free of assumptions on the data distribution or structure.
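The SVD single-imputation scheme that this multiple-imputation method builds on can be sketched as the generic EM-style iteration: fill missing cells with column means, reconstruct the matrix with a low-rank SVD, re-impute only the missing cells, and repeat. This is a minimal sketch of the Krzanowski-style base method, not the authors' multiple-imputation extension.

```python
import numpy as np

def svd_impute(X, rank=2, n_iter=50):
    """Single imputation by iterative low-rank SVD reconstruction."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # rank-r reconstruction
        filled[miss] = approx[miss]                     # update missing cells only
    return filled
```

On a genotype × environment matrix, the rank would typically be chosen from the number of dominant interaction axes; here it is simply a parameter.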
40

G, Madhu, and Nagachandrika G. "A New Paradigm for Development of Data Imputation Approach for Missing Value Estimation." International Journal of Electrical and Computer Engineering (IJECE) 6, no. 6 (December 1, 2016): 3222. http://dx.doi.org/10.11591/ijece.v6i6.10632.

Abstract:
Missing data values are a common issue in data analysis across many real-world applications, such as wireless sensor networks, medical applications, and psychology, and handling them is a challenging task. Learning and prediction in the presence of missing values can be treacherous in machine learning, data mining, and statistical analysis, yet a missing value can signify important information about the dataset in the mining process. In this paper, we propose a new paradigm for the development of a data imputation approach for missing value estimation based on centroids and nearest neighbours. Firstly, we identify clusters with the k-means algorithm and calculate the centroids and nearest-neighbour data records. Secondly, we compute the nearest distances of both complete and incomplete records from the centroids and estimate the nearest data record, which tends to suffer from the curse of dimensionality. Finally, we impute each missing value from the nearest-neighbour record using the statistical measure called the z-score. The experimental study demonstrates the strength of the proposed paradigm for imputation of missing data values. Tests have been run using different types of datasets in order to validate our approach and compare the results with other imputation methods such as KNNI, SVMI, WKNNI, KMI, and FKNNI. The proposed approach is geared towards maximizing the utility of imputation with respect to missing data value estimation.
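The three steps in this abstract can be sketched as follows. This is our reading, not the authors' implementation: standardising features with the z-score before computing distances is how we interpret the paper's z-score step, and `k` is an assumed parameter.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_nn_impute(X, k=2, seed=0):
    """Step 1: k-means on complete records. Step 2: assign each incomplete
    record to its nearest centroid over observed features. Step 3: copy the
    missing entries from the closest record in that cluster."""
    X = np.asarray(X, dtype=float)
    mu, sd = np.nanmean(X, axis=0), np.nanstd(X, axis=0) + 1e-9
    Z = (X - mu) / sd                                  # z-score standardisation
    complete = ~np.isnan(X).any(axis=1)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z[complete])
    out = X.copy()
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])
        c = np.argmin(np.linalg.norm(km.cluster_centers_[:, obs] - Z[i, obs],
                                     axis=1))
        members = Z[complete][km.labels_ == c]
        donor = members[np.argmin(np.linalg.norm(members[:, obs] - Z[i, obs],
                                                 axis=1))]
        out[i, ~obs] = donor[~obs] * sd[~obs] + mu[~obs]  # back to original scale
    return out
```

Restricting the donor search to one cluster is what keeps the neighbour lookup cheap relative to a full kNN scan over the dataset.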
41

G, Madhu, and Nagachandrika G. "A New Paradigm for Development of Data Imputation Approach for Missing Value Estimation." International Journal of Electrical and Computer Engineering (IJECE) 6, no. 6 (December 1, 2016): 3222. http://dx.doi.org/10.11591/ijece.v6i6.pp3222-3228.

Abstract:
Missing data values are a common issue in data analysis across many real-world applications, such as wireless sensor networks, medical applications, and psychology, and handling them is a challenging task. Learning and prediction in the presence of missing values can be treacherous in machine learning, data mining, and statistical analysis, yet a missing value can signify important information about the dataset in the mining process. In this paper, we propose a new paradigm for the development of a data imputation approach for missing value estimation based on centroids and nearest neighbours. Firstly, we identify clusters with the k-means algorithm and calculate the centroids and nearest-neighbour data records. Secondly, we compute the nearest distances of both complete and incomplete records from the centroids and estimate the nearest data record, which tends to suffer from the curse of dimensionality. Finally, we impute each missing value from the nearest-neighbour record using the statistical measure called the z-score. The experimental study demonstrates the strength of the proposed paradigm for imputation of missing data values. Tests have been run using different types of datasets in order to validate our approach and compare the results with other imputation methods such as KNNI, SVMI, WKNNI, KMI, and FKNNI. The proposed approach is geared towards maximizing the utility of imputation with respect to missing data value estimation.
42

Liu, Chia-Hui, Chih-Fong Tsai, Kuen-Liang Sue, and Min-Wei Huang. "The Feature Selection Effect on Missing Value Imputation of Medical Datasets." Applied Sciences 10, no. 7 (March 29, 2020): 2344. http://dx.doi.org/10.3390/app10072344.

Abstract:
In practice, many medical domain datasets are incomplete, containing a proportion of incomplete data with missing attribute values. Missing value imputation can be performed to solve the problem of incomplete datasets. To impute missing values, some of the observed data (i.e., complete data) are generally used as the reference or training set, and then the relevant statistical and machine learning techniques are employed to produce estimations to replace the missing values. Since the collected dataset usually contains a certain number of feature dimensions, it is useful to perform feature selection for better pattern recognition. Therefore, the aim of this paper is to examine the effect of performing feature selection on missing value imputation of medical datasets. Experiments are carried out on five different medical domain datasets containing various feature dimensions. In addition, three different types of feature selection methods and imputation techniques are employed for comparison. The results show that combining feature selection and imputation is a better choice for many medical datasets. However, the feature selection algorithm should be carefully chosen in order to produce the best result. Particularly, the genetic algorithm and information gain models are suitable for lower dimensional datasets, whereas the decision tree model is a better choice for higher dimensional datasets.
43

Rodgers, Danielle M., Ross Jacobucci, and Kevin J. Grimm. "A Multiple Imputation Approach for Handling Missing Data in Classification and Regression Trees." Journal of Behavioral Data Science 1, no. 1 (May 2021): 127–53. http://dx.doi.org/10.35566/jbds/v1n1/p6.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Decision trees (DTs) are a machine learning technique that searches the predictor space for the variable and observed value that lead to the best prediction when the data are split into two nodes based on that variable and splitting value. The algorithm repeats its search within each partition of the data until a stopping rule ends the search. Missing data can be problematic in DTs because an observation with a missing value cannot be placed into a node based on the chosen splitting variable. Moreover, missing data can alter the variable selection process for the same reason. Simple missing data approaches (e.g., listwise deletion, majority rule, and surrogate splits) have been implemented in DT algorithms; however, more sophisticated missing data techniques have not been thoroughly examined. We propose a modified multiple imputation approach to handling missing data in DTs, and compare this approach with simple missing data approaches as well as single imputation and multiple imputation with prediction averaging via Monte Carlo simulation. This study evaluated the performance of each missing data approach when data were missing at random (MAR) or missing completely at random (MCAR). The proposed multiple imputation approach and surrogate splits had superior performance, with the proposed multiple imputation approach performing best in the more severe missing data conditions. We conclude with recommendations for handling missing data in DTs.
44

Dueck, A., P. Atherton, A. Tan, and J. Sloan. "How much missing data is too much? A single study exploration." Journal of Clinical Oncology 24, no. 18_suppl (June 20, 2006): 6116. http://dx.doi.org/10.1200/jco.2006.24.18_suppl.6116.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
6116 Background: Analyses of patient-reported outcomes rely on the dependability of patients to complete and submit assessments in a timely manner – not all data are obtained. In recent work focusing on quality of life (QOL) data and imputation, it has been found that most methods do not alter study results. But how much data can be missing before study results are affected? Methods: Missing data were investigated using a 2-arm study of 109 patients who completed Linear Analogue Self Assessments at 4 intervals; 11% of patients had missing data at the second interval. Existing data were analysed for differences in scores between arms, then cases were randomly deleted to create increasing percentages (12%–20%) of missing data. Ten simulations were conducted per percentage. The imputation methods applied were carrying forward the last value (LVCF), average value (AVCF), and maximum value (MVCF). Student’s t-tests were performed between arms for each simulation. Results: Imputation did not alter the results of our study data, which were statistically significant (SS) between arms for overall QOL (p=0.036) and spiritual well-being (SWB) (p=0.006), and not statistically significant (NS) for mental well-being (MWB) (p=0.174). After data deletion and t-test calculations, AVCF did not impact results. For overall QOL, data deletion changed the p-value to NS in 1 of 10 simulations starting at 12% missing data and in 5 of 10 simulations starting at 16% missing data. No matter what percentage of data was missing, imputation produced a SS p-value over 80% of the time. Data deletion and subsequent imputation did not affect the study decision for SWB. For MWB, all differences between arms were NS prior to imputation; after imputation, there was at most a 7% disagreement in conclusions. LVCF and MVCF performed equally in all simulations.
Conclusions: For this particular study, when p-values are close to the study-defined alpha, increasing missing data can change the study results, and imputation methods are more likely to find SS differences. The further the p-values are from the study alpha, the less effect increasing missing data or applying imputation has. These results are for one particular study and further research is needed. No significant financial relationships to disclose.
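The three carried-forward rules compared in this abstract (LVCF, AVCF, MVCF) are simple enough to sketch. The function name and the treatment of a series with a leading missing value are assumptions, not the study's exact implementation:

```python
def carry_forward(values, method="LVCF"):
    """Impute None entries in a longitudinal score series.
    LVCF: last observed value carried forward;
    AVCF: average of the observed values;
    MVCF: maximum of the observed values."""
    observed = [v for v in values if v is not None]
    out, last = [], None
    for v in values:
        if v is None:
            if method == "LVCF":
                v = last  # previous observed (or imputed) value
            elif method == "AVCF":
                v = sum(observed) / len(observed)
            else:  # MVCF
                v = max(observed)
        out.append(v)
        last = v
    return out
```

For a series with the second interval missing, e.g. `[80, None, 60, 70]`, LVCF fills 80, AVCF fills 70.0, and MVCF fills 80.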
45

Wang, Shuyu, Wengen Li, Siyun Hou, Jihong Guan, and Jiamin Yao. "STA-GAN: A Spatio-Temporal Attention Generative Adversarial Network for Missing Value Imputation in Satellite Data." Remote Sensing 15, no. 1 (December 23, 2022): 88. http://dx.doi.org/10.3390/rs15010088.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Satellite data are of high importance for ocean environment monitoring and protection. However, missing values in satellite data, caused by various force majeure factors such as cloud cover, bad weather, and sensor failure, greatly reduce data quality and hinder the applications of satellite data in practice. Therefore, a variety of methods have been proposed to impute missing data in satellite data and improve its quality. However, these methods cannot effectively learn the short-term temporal dependence and dynamic spatial dependence in satellite data, resulting in poor imputation performance when the data missing rate is large. To address this issue, we propose the Spatio-Temporal Attention Generative Adversarial Network (STA-GAN) for missing value imputation in satellite data. First, we develop the Spatio-Temporal Attention (STA) mechanism based on the Graph Attention Network (GAT) to learn features capturing both short-term temporal dependence and dynamic spatial dependence in satellite data. Then, the learned features from STA are fused to enrich the spatio-temporal information for training the generator and discriminator of STA-GAN. Finally, the imputation data generated by the trained generator of STA-GAN are used to fill the missing values in satellite data. Experimental results on real datasets show that STA-GAN largely outperforms the baseline data imputation methods, especially for filling satellite data with large missing rates.
46

CAI, ZHIPENG, MAYSAM HEYDARI, and GUOHUI LIN. "ITERATED LOCAL LEAST SQUARES MICROARRAY MISSING VALUE IMPUTATION." Journal of Bioinformatics and Computational Biology 04, no. 05 (October 2006): 935–57. http://dx.doi.org/10.1142/s0219720006002302.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Microarray gene expression data often contain multiple missing values due to various reasons. However, most gene expression data analysis algorithms require complete expression data; accurate estimation of the missing values is therefore critical to further data analysis. In this paper, an Iterated Local Least Squares Imputation (ILLSimpute) method is proposed for estimating missing values. Two unique features of the ILLSimpute method are: firstly, it does not fix a common number of coherent genes for all target genes, but instead defines coherent genes as those within a distance threshold of the target gene; secondly, estimated values from one iteration are used for missing value estimation in the next, and the method terminates after a certain number of iterations or once the imputed values converge. Experimental results on six real microarray datasets showed that ILLSimpute performed at least as well as, and most of the time much better than, five of the most recent imputation methods.
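The iterative scheme described above can be given as a heavily simplified sketch, assuming a genes-by-samples matrix and a multiplicative distance cutoff for selecting coherent genes; the paper's exact threshold and convergence rules are not reproduced here:

```python
import numpy as np

def ills_impute(X, n_iter=3, threshold=2.0):
    """Simplified sketch of iterated local least squares imputation.
    X: genes-by-samples array with np.nan marking missing values.
    Coherent genes for a target are those within `threshold` times the
    minimum distance to it (an assumed reading of the distance rule)."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    # initialize missing entries with each gene's mean expression
    filled = np.where(miss, np.nanmean(X, axis=1, keepdims=True), X)
    for _ in range(n_iter):
        for g in np.flatnonzero(miss.any(axis=1)):
            obs = ~miss[g]
            d = np.linalg.norm(filled - filled[g], axis=1)
            d[g] = np.inf
            coherent = np.flatnonzero(d <= threshold * d.min())
            # least-squares fit of the target gene on its coherent genes,
            # using only the target's observed samples
            A = filled[coherent][:, obs].T
            w, *_ = np.linalg.lstsq(A, X[g, obs], rcond=None)
            # re-estimate the missing samples; feeds the next iteration
            filled[g, ~obs] = filled[coherent][:, ~obs].T @ w
    return filled
```

For a gene that is a copy of another gene with one sample missing, the local least-squares fit recovers the missing entry from that coherent gene.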
47

Di Lena, Pietro, Claudia Sala, Andrea Prodi, and Christine Nardini. "Missing value estimation methods for DNA methylation data." Bioinformatics 35, no. 19 (February 23, 2019): 3786–93. http://dx.doi.org/10.1093/bioinformatics/btz134.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Abstract Motivation DNA methylation is a stable epigenetic mark with major implications in both physiological (development, aging) and pathological conditions (cancers and numerous diseases). Recent research involving methylation focuses on the development of molecular age estimation methods based on DNA methylation levels (mAge). An increasing number of studies indicate that divergences between mAge and chronological age may be associated with age-related diseases. Current advances in high-throughput technologies have allowed the characterization of DNA methylation levels throughout the human genome. However, experimental methylation profiles often contain multiple missing values that can affect the analysis of the data and also mAge estimation. Although several imputation methods exist, a major deficiency lies in the inability to cope with large datasets, such as DNA methylation chips. Specific methods for imputing missing methylation data are therefore needed. Results We present a simple and computationally efficient imputation method, methyLImp, based on linear regression. The rationale of the approach lies in the observation that methylation levels show a high degree of inter-sample correlation. We performed a comparative study of our approach against other imputation methods on DNA methylation data from healthy and disease samples from different tissues. Performance has been assessed both in terms of imputation accuracy and in terms of the impact imputed values have on mAge estimation. In comparison to existing methods, our linear regression model proves to perform equally well or better, with good computational efficiency. The results of our analysis provide recommendations for accurate estimation of missing methylation values. Availability and implementation The R-package methyLImp is freely available at https://github.com/pdilena/methyLImp. Supplementary information Supplementary data are available at Bioinformatics online.
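The rationale described above (predicting one sample's missing sites from the other, highly correlated samples by linear regression) can be sketched as follows. This mirrors the idea rather than methyLImp's exact model, and the simplification that the other samples are fully observed at the predicted sites is an assumption:

```python
import numpy as np

def regression_impute(X):
    """Regression-based imputation exploiting inter-sample correlation:
    each sample's missing sites are predicted by a linear model fitted
    on the other samples over fully observed sites.
    X: sites-by-samples array with np.nan for missing values."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete_rows = ~np.isnan(X).any(axis=1)
    for s in range(X.shape[1]):
        miss = np.isnan(X[:, s])
        if not miss.any():
            continue
        others = [j for j in range(X.shape[1]) if j != s]
        # fit on sites observed in every sample, with an intercept column
        A = np.column_stack([np.ones(complete_rows.sum()),
                             X[complete_rows][:, others]])
        coef, *_ = np.linalg.lstsq(A, X[complete_rows, s], rcond=None)
        # predict the missing sites of sample s from the other samples
        B = np.column_stack([np.ones(miss.sum()), X[miss][:, others]])
        out[miss, s] = B @ coef
    return out
```

When two samples are perfectly linearly related across the observed sites, the fitted model recovers the missing value exactly.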
48

Kim, Taesung, Jinhee Kim, Wonho Yang, Hunjoo Lee, and Jaegul Choo. "Missing Value Imputation of Time-Series Air-Quality Data via Deep Neural Networks." International Journal of Environmental Research and Public Health 18, no. 22 (November 20, 2021): 12213. http://dx.doi.org/10.3390/ijerph182212213.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
To prevent severe air pollution, it is important to analyze time-series air quality data, but this is often challenging because the time-series data are usually partially missing, especially when collected from multiple locations simultaneously. To solve this problem, various deep-learning-based missing value imputation models have been proposed. However, they are often barely interpretable, which makes it difficult to analyze the imputed data. Thus, we propose a novel deep-learning-based imputation model that achieves high interpretability as well as great performance in missing value imputation for spatio-temporal data. We verify the effectiveness of our method through quantitative and qualitative results on a publicly available air-quality dataset.
49

Pan, Hu, Zhiwei Ye, Qiyi He, Chunyan Yan, Jianyu Yuan, Xudong Lai, Jun Su, and Ruihan Li. "Discrete Missing Data Imputation Using Multilayer Perceptron and Momentum Gradient Descent." Sensors 22, no. 15 (July 28, 2022): 5645. http://dx.doi.org/10.3390/s22155645.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Data are a strategic resource for industrial production, and an efficient data-mining process will increase productivity. However, there exist many missing values in data collected in real life due to various problems. Because the missing data may reduce productivity, missing value imputation is an important research topic in data mining. At present, most studies mainly focus on imputation methods for continuous missing data, while a few concentrate on discrete missing data. In this paper, a discrete missing value imputation method based on a multilayer perceptron (MLP) is proposed, which employs a momentum gradient descent algorithm, and some prefilling strategies are utilized to improve the convergence speed of the MLP. To verify the effectiveness of the method, experiments are conducted to compare the classification accuracy with eight common imputation methods, such as the mode, random, hot-deck, KNN, autoencoder, and MLP, under different missing mechanisms and missing proportions. Experimental results verify that the improved MLP model (IMLP) can effectively impute discrete missing values in most situations under three missing patterns.
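One common prefilling strategy for discrete data of the kind mentioned above is mode filling, sketched below; whether IMLP uses exactly this rule is an assumption, and the function name is illustrative:

```python
from collections import Counter

def mode_prefill(rows):
    """Prefill strategy for discrete data: replace each missing entry
    (None) with the most frequent observed value of its column, giving
    a complete matrix from which MLP training can start."""
    cols = range(len(rows[0]))
    modes = {
        c: Counter(r[c] for r in rows if r[c] is not None).most_common(1)[0][0]
        for c in cols
    }
    return [[modes[c] if r[c] is None else r[c] for c in cols] for r in rows]
```

The prefilled matrix is only a starting point; an imputation model such as the paper's MLP would then refine these initial values.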
50

Sallaby, Achmad Fikri, and Azlan Azlan. "Analysis of Missing Value Imputation Application with K-Nearest Neighbor (K-NN) Algorithm in Dataset." IJICS (International Journal of Informatics and Computer Science) 5, no. 2 (August 1, 2021): 141. http://dx.doi.org/10.30865/ijics.v5i2.3185.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Missing values are a problem still often found in many studies. A missing value arises where data or data features are not available completely and intact, which still happens frequently in datasets used in research. Missing values are caused by many factors such as human error, unavailable data, or even a virus in the database. Data are important for research, and incomplete data will affect the results obtained. Data mining is a process that depends heavily on the data, including the classification process; classification in data mining can only be done if the data are complete. These problems can be overcome by an imputation process combined with the K-Nearest Neighbor algorithm, known as K-Nearest Neighbor Imputation (K-NNI). In the research conducted, the K-Nearest Neighbor Imputation algorithm was able to overcome the problem of missing values in the dataset. This can be seen from the accuracy obtained: the accuracy of the classification process before handling missing values was 77.01%, while after the imputation process the accuracy was 78.31%.
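A minimal sketch of K-Nearest Neighbor Imputation (K-NNI) as described above, assuming numeric features, Euclidean distance on the co-observed features, and mean aggregation over the k neighbours (the paper's exact settings are not specified here):

```python
import math

def knn_impute(rows, k=2):
    """K-Nearest Neighbor Imputation sketch: each missing entry (None)
    is replaced by the mean of that feature over the k complete rows
    nearest on the features the incomplete row does have."""
    complete = [r for r in rows if None not in r]
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        obs = [i for i, v in enumerate(r) if v is not None]
        # rank complete rows by distance on the co-observed features
        neighbours = sorted(
            complete,
            key=lambda c: math.dist([c[i] for i in obs], [r[i] for i in obs]),
        )[:k]
        out.append([
            v if v is not None else sum(n[i] for n in neighbours) / len(neighbours)
            for i, v in enumerate(r)
        ])
    return out
```

An imputed dataset produced this way can then be fed to a classifier, which is how the abstract's before/after accuracy comparison would be obtained.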
