Journal articles on the topic 'Big data with missingness'

1

Elleman, Lorien G., Sarah K. McDougald, David M. Condon, and William Revelle. "That Takes the BISCUIT." European Journal of Psychological Assessment 36, no. 6 (November 2020): 948–58. http://dx.doi.org/10.1027/1015-5759/a000590.

Abstract:
The predictive accuracy of personality-criterion regression models may be improved with statistical learning (SL) techniques. This study introduced a novel SL technique, BISCUIT (Best Items Scale that is Cross-validated, Unit-weighted, Informative, and Transparent). The predictive accuracy and parsimony of BISCUIT were compared with three established SL techniques (the lasso, elastic net, and random forest) and regression using two sets of scales, for five criteria, across five levels of data missingness. BISCUIT’s predictive accuracy was competitive with other SL techniques at higher levels of data missingness. BISCUIT most frequently produced the most parsimonious SL model. In terms of predictive accuracy, the elastic net and lasso dominated other techniques in the complete data condition and in conditions with up to 50% data missingness. Regression using 27 narrow traits was an intermediate choice for predictive accuracy. For most criteria and levels of data missingness, regression using the Big Five had the worst predictive accuracy. Overall, loss in predictive accuracy due to data missingness was modest, even at 90% data missingness. Findings suggest that personality researchers should consider incorporating planned data missingness and SL techniques into their designs and analyses.
2

Neuenschwander, Beat, and Michael Branson. "Modeling Missingness for Time-to-Event Data: A Case Study in Osteoporosis." Journal of Biopharmaceutical Statistics 14, no. 4 (December 31, 2004): 1005–19. http://dx.doi.org/10.1081/bip-200035478.

3

Nwakuya, M. T., and B. O. Onyegbuchulam. "Quantile Regression-based Multiple Imputation of Skewed Data with Different Percentages of Missingness." Scholars Journal of Physics, Mathematics and Statistics 9, no. 4 (May 10, 2022): 41–45. http://dx.doi.org/10.36347/sjpms.2022.v09i04.002.

Abstract:
This study investigates Quantile Regression-Based Multiple Imputation (QR-based MI) on simulated right-skewed data with 5% and 25% missing data points. Quantile regression analysis was performed at the 0.25, 0.5, 0.75 and 0.95 quantiles on three data sets: the complete skewed data without missing values, a data set with 5% missing values and a data set with 25% missing values. The data sets with 5% and 25% missing values were imputed using the QR-based MI technique, giving rise to two complete data sets. The analysis was performed on both transformed and untransformed versions of the three data sets. The transformation was carried out by applying the Yeo-Johnson transformation technique, and comparison of results was based on the Mean Square Error (MSE), Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). The results from the original complete right-skewed data show that the untransformed data gave better results at the 0.25 and 0.50 quantiles, while the transformed data gave better results at the 0.75 and 0.95 quantiles. This result is attributed to the fact that the data were right skewed, so the transformation benefits the heavy tail on the right while the lighter tail on the left need not be transformed. For the imputed complete data sets from the 5% and 25% missingness conditions, the untransformed data produced better results than the transformed data at all quantiles considered. This led us to conclude that QR-based MI is not distribution dependent and hence not sensitive to skewness. Therefore, based on the results, QR-based MI is robust to skewness and can be applied to skewed data sets.
4

Poyatos, Rafael, Oliver Sus, Llorenç Badiella, Maurizio Mencuccini, and Jordi Martínez-Vilalta. "Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information." Biogeosciences 15, no. 9 (May 4, 2018): 2601–17. http://dx.doi.org/10.5194/bg-15-2601-2018.

Abstract:
The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km²). We simulated gaps at different missingness levels (10–80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits.
MICE informed by relevant ecological variables allowed us to fill the gaps in the IEFC incomplete dataset (5495 plots) and quantify imputation uncertainty. Resulting spatial patterns of the studied traits in Catalan forests were broadly similar when using species means, regression kriging or the best-performing MICE application, but some important discrepancies were observed at the local level. Our results highlight the need to assess imputation quality beyond just imputation accuracy and show that including environmental information in statistical imputation approaches yields more plausible imputations in spatially explicit plant trait datasets.
5

Ghazali, Shamihah Muhammad, Norshahida Shaadan, and Zainura Idrus. "Missing data exploration in air quality data set using R-package data visualisation tools." Bulletin of Electrical Engineering and Informatics 9, no. 2 (April 1, 2020): 755–63. http://dx.doi.org/10.11591/eei.v9i2.2088.

Abstract:
Missing values occur in data sets across many research areas. This is recognized as a data quality problem because missing values can affect the performance of analysis results. To overcome the problem, the incomplete data set needs to be treated, or the missing values replaced, using an imputation method. Thus, the pattern of missing values must be explored beforehand to determine a suitable method. This paper discusses the application of data visualisation as a smart technique for missing data exploration, aiming to increase understanding of missing data behaviour, which includes the missing data mechanism (MCAR, MAR and MNAR) and the distribution pattern of missingness in terms of percentage as well as gap size. This paper presents the application of several data visualisation tools from five R packages (visdat, VIM, ggplot2, Amelia and UpSetR) for data missingness exploration. As an illustration, based on an air quality data set in Malaysia, several graphics were produced and discussed to illustrate the contribution of the visualisation tools in providing input and insight on the pattern of data missingness. The results show that missing values in the air quality data set of the chosen sites in Malaysia behave as missing at random (MAR), with a small percentage of missingness, and do contain long gaps of missingness.
6

Beesley, Lauren J., Irina Bondarenko, Michael R. Elliot, Allison W. Kurian, Steven J. Katz, and Jeremy MG Taylor. "Multiple imputation with missing data indicators." Statistical Methods in Medical Research 30, no. 12 (October 13, 2021): 2685–700. http://dx.doi.org/10.1177/09622802211047346.

Abstract:
Multiple imputation is a well-established general technique for analyzing data with missing values. A convenient way to implement multiple imputation is sequential regression multiple imputation, also called chained equations multiple imputation. In this approach, we impute missing values using regression models for each variable, conditional on the other variables in the data. This approach, however, assumes that the missingness mechanism is missing at random, and it is not well-justified under not-at-random missingness without additional modification. In this paper, we describe how we can generalize the sequential regression multiple imputation procedure to handle missingness not at random in the setting where missingness may depend on other variables that are also missing but not on the missing variable itself, conditioning on fully observed variables. We provide algebraic justification for several generalizations of standard sequential regression multiple imputation using Taylor series and other approximations of the target imputation distribution under missingness not at random. Resulting regression model approximations include indicators for missingness, interactions, or other functions of the not-at-random missingness model and observed data. In a simulation study, we demonstrate that the proposed sequential regression multiple imputation modifications result in reduced bias in the final analysis compared to standard sequential regression multiple imputation, with an approximation strategy involving inclusion of an offset in the imputation model performing the best overall. The method is illustrated in a breast cancer study, where the goal is to estimate the prevalence of a specific genetic pathogenic variant.
8

Zhang, Wen, Ye Yang, and Qing Wang. "A Comparative Study of Absent Features and Unobserved Values in Software Effort Data." International Journal of Software Engineering and Knowledge Engineering 22, no. 02 (March 2012): 185–202. http://dx.doi.org/10.1142/s0218194012400025.

Abstract:
Software effort data contain a large number of missing values of project attributes. The problem of absent features, which has recently emerged in machine learning, is often neglected by software engineering researchers when handling missingness in software effort data. In essence, absent features (structural missingness) and unobserved values (unstructured missingness) are different cases of missingness, although their appearance in the data set is the same. This paper attempts to clarify the root cause of missingness in software effort data. When regarding missingness as absent features, we develop Max-margin regression to predict the real effort of software projects. When regarding missingness as unobserved values, we use existing imputation techniques to impute missing values. Then, ε-SVR is used to predict the real effort of software projects with the input data sets. Experiments on the ISBSG (International Software Benchmarking Standard Group) and CSBSG (Chinese Software Benchmarking Standard Group) data sets demonstrate that, for the task of effort prediction, treating missingness in software effort data as unobserved values can produce better performance than treating it as absent features. This paper is the first to introduce the concept of absent features to deal with missingness in software effort data.
9

De Raadt, Alexandra, Matthijs J. Warrens, Roel J. Bosker, and Henk A. L. Kiers. "Kappa Coefficients for Missing Data." Educational and Psychological Measurement 79, no. 3 (January 16, 2019): 558–76. http://dx.doi.org/10.1177/0013164418823249.

Abstract:
Cohen’s kappa coefficient is commonly used for assessing agreement between classifications of two raters on a nominal scale. Three variants of Cohen’s kappa that can handle missing data are presented. Data are considered missing if one or both ratings of a unit are missing. We study how well the variants estimate the kappa value for complete data under two missing data mechanisms, namely missingness completely at random and a form of missingness not at random. The kappa coefficient considered in Gwet (Handbook of Inter-rater Reliability, 4th ed.) and the kappa coefficient based on listwise deletion of units with missing ratings were found to have virtually no bias and mean squared error if missingness is completely at random, and small bias and mean squared error if missingness is not at random. Furthermore, the kappa coefficient that treats missing ratings as a regular category appears to be rather heavily biased and has a substantial mean squared error in many of the simulations. Because it performs well and is easy to compute, we recommend using the kappa coefficient that is based on listwise deletion of missing ratings if it can be assumed that missingness is completely at random or not at random.
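As an illustration of the listwise-deletion variant described in this abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes missing ratings are encoded as `None`; the function name `cohen_kappa_listwise` is hypothetical. Units with one or both ratings missing are dropped, and the usual Cohen's kappa formula is applied to the remaining units.

```python
from collections import Counter


def cohen_kappa_listwise(ratings_a, ratings_b):
    """Cohen's kappa after listwise deletion of units with a missing rating.

    Assumption of this sketch: a missing rating is encoded as None.
    """
    # Keep only units for which both raters supplied a rating.
    pairs = [(a, b) for a, b in zip(ratings_a, ratings_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no units with both ratings present")

    # Observed agreement: share of retained units with identical ratings.
    p_observed = sum(a == b for a, b in pairs) / n

    # Chance agreement from the two raters' marginal category counts.
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    p_expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


rater_1 = ["x", "x", "y", "y", None, "x"]
rater_2 = ["x", "y", "y", "y", "x", None]
print(cohen_kappa_listwise(rater_1, rater_2))
```

In this example the last two units are dropped, leaving four complete pairs with observed agreement 0.75 and chance agreement 0.5.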
10

Arioli, Angelica, Arianna Dagliati, Bethany Geary, Niels Peek, Philip A. Kalra, Anthony D. Whetton, and Nophar Geifman. "OptiMissP: A dashboard to assess missingness in proteomic data-independent acquisition mass spectrometry." PLOS ONE 16, no. 4 (April 15, 2021): e0249771. http://dx.doi.org/10.1371/journal.pone.0249771.

Abstract:
Background: Missing values are a key issue in the statistical analysis of proteomic data. Defining the strategy to address missing values is a complex task in each study, potentially affecting the quality of statistical analyses. Results: We have developed OptiMissP, a dashboard to visually and qualitatively evaluate missingness and guide decision making in the handling of missing values in proteomics studies that use data-independent acquisition mass spectrometry. It provides a set of visual tools to retrieve information about missingness through protein densities and topology-based approaches, and facilitates exploration of different imputation methods and missingness thresholds. Conclusions: OptiMissP provides support for researchers’ and clinicians’ qualitative assessment of missingness in proteomic datasets in order to define study-specific strategies for the handling of missing values. OptiMissP considers biases in protein distributions related to the choice of imputation method and helps analysts to balance the information loss caused by low missingness thresholds and the noise introduced by selecting high missingness thresholds. This is complemented by topological data analysis, which provides additional insight into the structure of the data and their missingness. We use an example in Chronic Kidney Disease to illustrate the main functionalities of OptiMissP.
11

Babcock, Ben, Peter E. L. Marks, Yvonne H. M. van den Berg, and Antonius H. N. Cillessen. "Implications of systematic nominator missingness for peer nomination data." International Journal of Behavioral Development 42, no. 1 (August 19, 2016): 148–54. http://dx.doi.org/10.1177/0165025416664431.

Abstract:
Missing data are a persistent problem in psychological research. Peer nomination data present a unique missing data problem, because a nominator’s nonparticipation results in missing data for other individuals in the study. This study examined the range of effects of systematic nonparticipation on the correlations between peer nomination data when nominators with various levels of popularity and social preference are missing. Results showed that, compared to completely random nominator missingness, systematic missingness of raters based on popularity had a significant impact on the correlations between various peer nomination variables. Systematic missingness based on social preference had a smaller impact. These results demonstrate varying (and potentially large) effects of systematically missing nominators on studies using nomination data. It is important that researchers using peer nomination data explore whether nominators are missing in any sort of systematic way and include these results as part of each study. Future research into the nature of systematic nominator missingness could make it possible to use advanced methodologies, such as multiple imputation, in an attempt to minimize the issues associated with systematic missingness.
12

Spineli, Loukia M., Chrysostomos Kalyvas, and Katerina Papadimitropoulou. "Continuous(ly) missing outcome data in network meta-analysis: A one-stage pattern-mixture model approach." Statistical Methods in Medical Research 30, no. 4 (January 6, 2021): 958–75. http://dx.doi.org/10.1177/0962280220983544.

Abstract:
Appropriate handling of aggregate missing outcome data is necessary to minimise bias in the conclusions of systematic reviews. The two-stage pattern-mixture model has already been proposed to address aggregate missing continuous outcome data. While this approach is more appropriate than excluding missing continuous outcome data or using simple imputation methods, it does not offer flexible modelling of missing continuous outcome data to investigate their implications for the conclusions thoroughly. Therefore, we propose a one-stage pattern-mixture model approach under the Bayesian framework to address missing continuous outcome data in a network of interventions and gain knowledge about the missingness process in different trials and interventions. We extend the hierarchical network meta-analysis model for one aggregate continuous outcome to incorporate a missingness parameter that measures the departure from the missing at random assumption. We consider various effect size estimates for continuous data, and two informative missingness parameters, the informative missingness difference of means and the informative missingness ratio of means. We incorporate our prior belief about the missingness parameters while allowing for several possibilities of prior structures to account for the fact that the missingness process may differ in the network. The method is exemplified in two networks from published reviews comprising different amounts of missing continuous outcome data.
13

Xie, Hui. "Analyzing longitudinal clinical trial data with nonignorable missingness and unknown missingness reasons." Computational Statistics & Data Analysis 56, no. 5 (May 2012): 1287–300. http://dx.doi.org/10.1016/j.csda.2010.11.021.

14

Gerstung, Moritz, Elli Papaemmanuil, Inigo Martincorena, Lars Bullinger, Verena I. Gaidzik, Peter Paschka, Michael Heuser, et al. "Personally Tailored Risk Prediction of AML Based on Comprehensive Genomic and Clinical Data." Blood 126, no. 23 (December 3, 2015): 85. http://dx.doi.org/10.1182/blood.v126.23.85.85.

Abstract:
Over the past years it has emerged that acute myeloid leukemia (AML) is a disease often driven by multiple co-occurring genomic lesions. It is a great challenge to understand the logic of these mutational patterns and how the particular constellation of genomic risk factors affects a patient's outcome in conjunction with common clinical variables such as blood counts. Here we present a novel prognostic framework based on genomic sequencing data of 111 cancer genes matched with detailed diagnostic, treatment and survival data from 1,540 patients with AML enrolled in three different trials run by the German-Austrian AML Study Group (AML-HD 98A, AML-HD 98B, and AMLSG 07-04). A systematic evaluation of risk modeling strategies reveals that much of the risk determining overall survival is captured in our comprehensive panel of genomic and prognostic clinical variables. Cox proportional hazards models with random effects achieved the highest cross-validated prognostic accuracy (Harrell's concordance C=0.72), better than models with variable selection (C=0.70 for AIC and BIC), and clearly superior to the ELN risk classification (C=0.63). It emerges that patient risk is the aggregate of many small and few large factors, such as previously established mutations in NPM1, CEBPA-/-, FLT3ITD and TP53; fusion genes generated by t(15;17), inv(16), and inv(3) rearrangements; and complex karyotype, del(5q) and trisomy 21. Multiple risk factors act mostly additively, with the exception of gene-gene interaction terms, including NPM1:FLT3ITD:DNMT3A (n=93; HR=1.50; P<0.03; Wald test, Benjamini-Yekutieli adjusted) that indicate the presence of epistatic effects on outcome. We found substantial heterogeneity in the presence of risk factors with almost unique constellations for each patient.
We observed that approximately 2/3 of the predicted inter-patient risk variation was related to genomic factors (balanced rearrangements, copy number changes and point mutations), the remainder being mostly attributed to diagnostic blood counts, demographic data and treatment. Hence a large share, but not all, of the prognostic information seems to be determined by genomic factors. Using multistage models with random effects we have assessed differential effects of prognostic variables at different stages of therapy. These models yield detailed predictions about the probability of being alive in induction, first complete remission and after relapse, as well as the mortality during each of the three stages. Importantly, our model computes how these probabilities change depending on a patient's constellation of risk factors. The resulting personalized predictions provide a quantitative risk assessment and allow evaluating the effect of treatment decisions such as allogeneic stem cell transplant versus standard chemotherapy in first complete remission. Our analysis shows that detailed and accurate predictions can be made based on knowledge banks of genomic and clinical data. As a proof of principle we have implemented our prediction framework into a web portal to explore risk predictions. Our method is able to impute missing variables and quantify the uncertainty due to missingness and finite training data. Power calculations show that cohorts of 10,000 patients will be needed for precise clinical decision support.
Disclosures: McDermott: 14M Genomics: Other: co-founder, stock-holder and consultant. Stratton: 14M Genomics: Other: co-founder, stock-holder and consultant. Schlenk: Janssen: Membership on an entity's Board of Directors or advisory committees; Daiichi Sankyo: Membership on an entity's Board of Directors or advisory committees; Arog: Honoraria, Research Funding; Teva: Honoraria, Research Funding; Novartis: Honoraria, Research Funding; Pfizer: Honoraria, Research Funding; Boehringer-Ingelheim: Honoraria. Campbell: 14M Genomics: Other: co-founder and consultant.
15

McGurk, Kathryn A., Arianna Dagliati, Davide Chiasserini, Dave Lee, Darren Plant, Ivona Baricevic-Jones, Janet Kelsall, et al. "The use of missing values in proteomic data-independent acquisition mass spectrometry to enable disease activity discrimination." Bioinformatics 36, no. 7 (December 2, 2019): 2217–23. http://dx.doi.org/10.1093/bioinformatics/btz898.

Abstract:
Motivation: Data-independent acquisition mass spectrometry allows for more comprehensive peptide detection and relative quantification than standard data-dependent approaches. While it is less prone to missing values, these still exist. Current approaches for handling the so-called missingness have challenges. We hypothesized that non-random missingness is a useful biological measure and demonstrate the importance of analysing missingness for proteomic discovery within a longitudinal study of disease activity. Results: The magnitude of missingness did not correlate with mean peptide concentration. The magnitude of missingness for each protein strongly correlated between collection time points (baseline, 3 months, 6 months; R = 0.95–0.97, confidence interval = 0.94–0.97), indicating little time-dependent effect. This allowed for the identification of proteins with outlier levels of missingness that differentiate between the patient groups characterized by different patterns of disease activity. The association of these proteins with disease activity was confirmed by machine learning techniques. Our novel approach complements analyses on complete observations and other missing value strategies in biomarker prediction of disease activity. Supplementary information: Supplementary data are available at Bioinformatics online.
16

Rhemtulla, Mijke, Fan Jia, Wei Wu, and Todd D. Little. "Planned missing designs to optimize the efficiency of latent growth parameter estimates." International Journal of Behavioral Development 38, no. 5 (January 23, 2014): 423–34. http://dx.doi.org/10.1177/0165025413514324.

Abstract:
We examine the performance of planned missing (PM) designs for correlated latent growth curve models. Using simulated data from a model where latent growth curves are fitted to two constructs over five time points, we apply three kinds of planned missingness. The first is item-level planned missingness using a three-form design at each wave such that 25% of data are missing. The second is wave-level planned missingness such that each participant is missing up to two waves of data. The third combines both forms of missingness. We find that three-form missingness results in high convergence rates, little parameter estimate or standard error bias, and high efficiency relative to the complete data design for almost all parameter types. In contrast, wave missingness and the combined design result in dramatically lowered efficiency for parameters measuring individual variability in rates of change (e.g., latent slope variances and covariances), and bias in both estimates and standard errors for these same parameters. We conclude that wave missingness should not be used except with large effect sizes and very large samples.
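The item-level three-form design mentioned in this abstract can be sketched in a few lines of Python. This is an illustrative assumption, not the simulation code from the study: items are split into four equal blocks (a common block `X` plus blocks `A`, `B`, `C`), each form omits one of `A`, `B` or `C`, and the names `FORMS` and `assign_three_form` are hypothetical. With four equal blocks, every participant skips 25% of the items by design.

```python
import random

# Each form administers the common block X plus two of the three other blocks.
FORMS = {
    1: ("X", "A", "B"),  # omits block C
    2: ("X", "A", "C"),  # omits block B
    3: ("X", "B", "C"),  # omits block A
}


def assign_three_form(n_participants, items_by_block, seed=0):
    """Return one dict per participant mapping item -> True if administered."""
    rng = random.Random(seed)
    all_items = [item for block in items_by_block.values() for item in block]
    design = []
    for _ in range(n_participants):
        form = rng.choice(sorted(FORMS))  # random form assignment
        administered = {item
                        for block in FORMS[form]
                        for item in items_by_block[block]}
        design.append({item: item in administered for item in all_items})
    return design


blocks = {"X": ["x1", "x2"], "A": ["a1", "a2"],
          "B": ["b1", "b2"], "C": ["c1", "c2"]}
design = assign_three_form(5, blocks)
for row in design:
    skipped = [item for item, given in row.items() if not given]
    print(skipped)
```

Because the form is assigned at random, the resulting missingness is under the researcher's control and unrelated to participants' responses.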
17

Fernstad, Sara Johansson. "To identify what is not there: A definition of missingness patterns and evaluation of missing value visualization." Information Visualization 18, no. 2 (July 25, 2018): 230–50. http://dx.doi.org/10.1177/1473871618785387.

Abstract:
While missing data is a commonly occurring issue in many domains, it is a topic that has been greatly overlooked by visualization scientists. Missing data values reduce the reliability of analysis results. A range of methods exist to replace the missing values with estimated values, but their appropriateness often depends on the patterns of missingness. Increased understanding of the missingness patterns and the distribution of missing values in data may greatly improve reliability, as well as provide valuable insight into potential problems in data gathering and analysis processes, and better understanding of the data as a whole. Visualization methods have a unique possibility to support investigation and understanding of missingness patterns by making the missing values and their relationship to recorded values visible. This article provides an overview of visualization of missing data values and defines a set of three missingness patterns of relevance for understanding missingness in data. It also contributes a usability evaluation which compares visualization methods representing missing values and how well they help users identify missingness patterns. The results indicate differences in performance depending on the visualization method as well as missingness pattern. Recommendations for future design of missing data visualization are provided based on the outcome of the study.
18

Mitra, Robin, Sarah F. McGough, Tapabrata Chakraborti, Chris Holmes, Ryan Copping, Niels Hagenbuch, Stefanie Biedermann, et al. "Learning from data with structured missingness." Nature Machine Intelligence 5, no. 1 (January 25, 2023): 13–23. http://dx.doi.org/10.1038/s42256-022-00596-z.

19

Forna, Alpha, Ilaria Dorigatti, Pierre Nouvellet, and Christl A. Donnelly. "Comparison of machine learning methods for estimating case fatality ratios: An Ebola outbreak simulation study." PLOS ONE 16, no. 9 (September 15, 2021): e0257005. http://dx.doi.org/10.1371/journal.pone.0257005.

Abstract:
Background: Machine learning (ML) algorithms are now increasingly used in infectious disease epidemiology. Epidemiologists should understand how ML algorithms behave within the context of outbreak data where missingness of data is almost ubiquitous. Methods: Using simulated data, we use an ML algorithmic framework to evaluate data imputation performance and the resulting case fatality ratio (CFR) estimates, focusing on the scale and type of data missingness: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Results: Across ML methods, dataset sizes and proportions of training data used, the area under the receiver operating characteristic curve decreased by 7% (median, range: 1%–16%) when missingness was increased from 10% to 40%. Overall reduction in CFR bias for MAR across methods, proportion of missingness, outbreak size and proportion of training data was 0.5% (median, range: 0%–11%). Conclusion: ML methods could reduce bias and increase the precision in CFR estimates at low levels of missingness. However, no method is robust to high percentages of missingness. Thus, a datacentric approach is recommended in outbreak settings: patient survival outcome data should be prioritised for collection and random-sample follow-ups should be implemented to ascertain missing outcomes.
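The three missingness types named in this abstract can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings: the logistic missingness model, the 30% baseline rate, the variable names, and the helper functions are all hypothetical and are not taken from the study. It deletes values of y with a probability that is constant (MCAR), depends on a fully observed covariate x (MAR), or depends on y itself (MNAR).

```python
import math
import random


def make_missing(x, y, mechanism, rng, rate=0.3):
    """Return a copy of y with some values replaced by None.

    x is a fully observed covariate; y is the variable that loses values.
    """
    def logistic(z):
        return 1.0 / (1.0 + math.exp(-z))

    offset = math.log(rate / (1 - rate))  # probability equals `rate` when the driver is 0
    out = []
    for xi, yi in zip(x, y):
        if mechanism == "MCAR":    # same probability for every record
            p = rate
        elif mechanism == "MAR":   # depends only on the observed covariate
            p = logistic(xi + offset)
        elif mechanism == "MNAR":  # depends on the value that goes missing
            p = logistic(yi + offset)
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append(None if rng.random() < p else yi)
    return out


def observed_mean(values):
    """Mean of the values that were not deleted."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)


rng = random.Random(0)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + rng.gauss(0, 0.6) for xi in x]  # y is correlated with x

for mech in ("MCAR", "MAR", "MNAR"):
    print(mech, round(observed_mean(make_missing(x, y, mech, rng)), 3))
```

In this setup the observed mean of y stays close to the full-sample mean under MCAR and is pulled downward under MNAR, because larger values are deleted more often. Under MAR the naive mean is also shifted, since y is correlated with x, but x is fully observed and can be used to adjust for it.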
20

Goel, Naman, Alfonso Amayuelas, Amit Deshpande, and Amit Sharma. "The Importance of Modeling Data Missingness in Algorithmic Fairness: A Causal Perspective." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 9 (May 18, 2021): 7564–73. http://dx.doi.org/10.1609/aaai.v35i9.16926.

Abstract:
Training datasets for machine learning often have some form of missingness. For example, to learn a model for deciding whom to give a loan, the available training data includes individuals who were given a loan in the past, but not those who were not. This missingness, if ignored, nullifies any fairness guarantee of the training procedure when the model is deployed. Using causal graphs, we characterize the missingness mechanisms in different real-world scenarios. We show conditions under which various distributions, used in popular fairness algorithms, can or cannot be recovered from the training data. Our theoretical results imply that many of these algorithms cannot guarantee fairness in practice. Modeling missingness also helps to identify correct design principles for fair algorithms. For example, in multi-stage settings where decisions are made in multiple screening rounds, we use our framework to derive the minimal distributions required to design a fair algorithm. Our proposed algorithm also decentralizes the decision-making process and still achieves similar performance to the optimal algorithm that requires centralization and non-recoverable distributions.
21

Lu, Zhenqiu, and Zhiyong Zhang. "Bayesian Approach to Non-ignorable Missingness in Latent Growth Models." Journal of Behavioral Data Science 1, no. 2 (May 2021): 1–30. http://dx.doi.org/10.35566/jbds/v1n2/p1.

Abstract:
Latent growth curve models (LGCMs) are becoming increasingly important among growth models because they can effectively capture individuals' latent growth trajectories and also explain the factors that influence such growth by analyzing the repeatedly measured manifest variables. However, with the increase in complexity of LGCMs, there is an increase in issues with model estimation. This research proposes a Bayesian approach to LGCMs to address the perennial problem of almost all longitudinal research, namely, missing data. First, different missingness models are formulated. We focus on non-ignorable missingness in this article. Specifically, these models include the latent intercept dependent missingness, the latent slope dependent missingness, and the potential outcome dependent missingness. To implement the model estimation, this study proposes a full Bayesian approach through a data augmentation algorithm and Gibbs sampling procedure. Simulation studies are conducted and results show that the proposed method accurately recovers model parameters and that mis-specified missingness may result in severely misleading conclusions. Finally, the implications of the approach and future research directions are discussed.
22

Ribeiro, Silvana Mara, and Cristiano Leite Castro. "Missing Data in Time Series: A Review of Imputation Methods and Case Study." Learning and Nonlinear Models 20, no. 1 (October 13, 2022): 31–46. http://dx.doi.org/10.21528/lnlm-vol20-no1-art3.

Abstract:
Dealing with missingness in time series data is a very important, but oftentimes overlooked, step in data analysis. In this paper, the nature of time series data and missingness mechanisms are described to help identify which imputation method should be used to impute missing data, along with a review of imputation methods and how they work. Recommended methods from literature are used to impute synthetic data of different nature and the results are discussed. In addition, a case study concerning the prediction (classification) of US market instability (BEAR or BULL) using a data set with mixed missingness mechanisms and mixed nature is presented to evaluate how different types of imputation methods can affect the final results of the classification task.
23

St-Louis, Etienne, Daniel Roizblatt, Dan L. Deckelbaum, Robert Baird, César V. Millán, and Alicia Ebensperger. "Identifying Pediatric Trauma Data Gaps at a Large Urban Trauma Referral Center in Santiago, Chile." Panamerican Journal of Trauma, Critical Care & Emergency Surgery 6, no. 3 (2017): 169–76. http://dx.doi.org/10.5005/jp-journals-10030-1188.

Abstract:
Background: Trauma registries contribute to improving trauma care, but their impact is highly dependent on the quality of the data. A simplified point-of-care pediatric trauma registry (PTR) was developed at the Centre for Global Surgery from the McGill University Health Centre (MUHC) for implementation in low- and middle-income countries (LMICs). Pilot deployment was launched at a large urban trauma center in May 2016 in Santiago, Chile. Prior to deployment, we sought to identify missing data in existing trauma records in order to optimize PTR practicality and user benefit. Materials and methods: The project was approved by the local Institutional Review Board. Retrospective chart review was conducted on trauma patients below the age of 15 who were evaluated at the emergency room (ER) of Hospital Dr. Sotero del Rio (HSR) between January 1st and June 30th 2015. Data missingness was evaluated for each component of the PTR (demographics, mechanism, injury and outcomes). Potential independent predictors of data missingness were evaluated using multiple linear regression. Results: A total of 351 patients were included. Demographic data missingness ranged from 0% (age) to 95% (mode of arrival). Mechanism data missingness ranged from 6% (cause of injury) to 42% (site of injury). Injury physiology data missingness ranged from 37% (oxygen saturation) to 99% (respiratory rate). Interestingly, mean missingness for injury anatomy data was significantly lower than for physiology data (0.6% vs. 78.6%, p < 0.05). Outcome data missingness reached 54% at 2 weeks. Conclusion: In resource-limited settings, high quality data is essential to guide responsible resource allocation. We believe implementation of a simplified trauma registry has the potential to reduce data gaps for pediatric trauma patients by streamlining trauma data collection at point of care. This should include streamlined data collection with a short per-patient completion time, and should forego attempts to collect data at 2 weeks, which has proven unsuccessful.
How to cite this article: St-Louis E, Roizblatt D, Deckelbaum DL, Baird R, Millán CV, Ebensperger A, Razek T. Identifying Pediatric Trauma Data Gaps at a Large Urban Trauma Referral Center in Santiago, Chile. Panam J Trauma Crit Care Emerg Surg 2017;6(3):169-176.
24

Sadinle, Mauricio, and Jerome P. Reiter. "Sequentially additive nonignorable missing data modelling using auxiliary marginal information." Biometrika 106, no. 4 (October 26, 2019): 889–911. http://dx.doi.org/10.1093/biomet/asz054.

Abstract:
We study a class of missingness mechanisms, referred to as sequentially additive nonignorable, for modelling multivariate data with item nonresponse. These mechanisms explicitly allow the probability of nonresponse for each variable to depend on the value of that variable, thereby representing nonignorable missingness mechanisms. These missing data models are identified by making use of auxiliary information on marginal distributions, such as marginal probabilities for multivariate categorical variables or moments for numeric variables. We prove identification results and illustrate the use of these mechanisms in an application.
25

Plancade, Sandra, Magali Berland, Mélisande Blein-Nicolas, Olivier Langella, Ariane Bassignani, and Catherine Juste. "A combined test for feature selection on sparse metaproteomics data—an alternative to missing value imputation." PeerJ 10 (June 24, 2022): e13525. http://dx.doi.org/10.7717/peerj.13525.

Abstract:
One of the difficulties encountered in the statistical analysis of metaproteomics data is the high proportion of missing values, which are usually treated by imputation. Nevertheless, imputation methods are based on restrictive assumptions regarding missingness mechanisms, namely “at random” or “not at random”. To circumvent these limitations in the context of feature selection in a multi-class comparison, we propose a univariate selection method that combines a test of association between missingness and classes, and a test for difference of observed intensities between classes. This approach implicitly handles both missingness mechanisms. We performed a quantitative and qualitative comparison of our procedure with imputation-based feature selection methods on two experimental data sets, as well as simulated data with various scenarios regarding the missingness mechanisms and the nature of the difference of expression (differential intensity or differential presence). Whereas we observed similar performances in terms of prediction on the experimental data set, the feature ranking and selection from various imputation-based methods were strongly divergent. We showed that the combined test reaches a compromise by correlating reasonably with other methods, and remains efficient in all simulated scenarios unlike imputation-based feature selection methods.
26

Zhou, Sherry, and Anne Corinne Huggins-Manley. "The Performance of the Semigeneralized Partial Credit Model for Handling Item-Level Missingness." Educational and Psychological Measurement 80, no. 6 (May 15, 2020): 1196–215. http://dx.doi.org/10.1177/0013164420918392.

Abstract:
The semi-generalized partial credit model (Semi-GPCM) has been proposed as a unidimensional modeling method for handling not applicable scale responses and neutral scale responses, and it has been suggested that the model may be of use in handling missing data in scale items. The purpose of this study is to evaluate the ability of the unidimensional Semi-GPCM to aid in the recovery of person parameters from item response data in the presence of item-level missingness, and to compare the performance of the model with two other proposed methods for handling such missingness: a multidimensional modeling approach for missingness and full information maximum likelihood estimation. The results indicate that the Semi-GPCM performs acceptably in an absolute sense when less than 30% of the item data is missing but does not outperform the other two methods under any particular conditions. We conclude with a discussion about when practitioners may or may not want to use the Semi-GPCM to recover person parameters from item response data with missingness.
27

Derks, Eske M., Conor V. Dolan, and Dorret I. Boomsma. "Statistical Power to Detect Genetic and Environmental Influences in the Presence of Data Missing at Random." Twin Research and Human Genetics 10, no. 1 (February 1, 2007): 159–67. http://dx.doi.org/10.1375/twin.10.1.159.

Abstract:
We study the situation in which a cheap measure (X) is observed in a large, representative twin sample, and a more expensive measure (Y) is observed in a selected subsample. The aim of this study is to investigate the optimal selection design in terms of the statistical power to detect genetic and environmental influences on the variance of Y and on the covariance of X and Y. Data were simulated for 4000 dizygotic and 2000 monozygotic twins. Missingness (87% vs. 97%) was then introduced in accordance with 7 selection designs: (i) concordant low + individual high design; (ii) extreme concordant design; (iii) extreme concordant and discordant design (EDAC); (iv) extreme discordant design; (v) individual score selection design; (vi) selection of an optimal number of MZ and DZ twins; and (vii) missing completely at random. The statistical power to detect the influence of additive and dominant genetic and shared environmental effects on the variance of Y and on the covariance between X and Y was investigated. The best selection design is the individual score selection design. The power to detect additive genetic effects is high irrespective of the percentage of missingness or selection design. The power to detect shared environmental effects is acceptable when the percentage of missingness is 87%, but is low when the percentage of missingness is 97%, except for the individual score selection design, in which the power remains acceptable. The power to detect D is low, irrespective of selection design or percentage of missingness. The individual score selection design is therefore the best design for detecting genetic and environmental influences on the variance of Y and on the covariance of X and Y. However, the EDAC design may be preferred when an additional purpose of a study is to detect quantitative trait loci effects.
28

Yu, Yue, Emily J. Smith, and Carter T. Butts. "Retrospective Network Imputation from Life History Data: The Impact of Designs." Sociological Methodology 50, no. 1 (February 26, 2020): 131–67. http://dx.doi.org/10.1177/0081175020905624.

Abstract:
Retrospective life history designs are among the few practical approaches for collecting longitudinal network information from large populations, particularly in the context of relationships like sexual partnerships that cannot be measured via digital traces or documentary evidence. While all such designs afford the ability to “peer into the past” vis-à-vis the point of data collection, little is known about the impact of the specific design parameters on the time horizon over which such information is useful. In this article, we investigate the effect of two different survey designs on retrospective network imputation: (1) intervalN, where subjects are asked to provide information on all partners within the past N time units; and (2) lastK, where subjects are asked to provide information about their K most recent partners. We simulate a “ground truth” sexual partnership network using a published model of Krivitsky (2012), and we then sample this data using the two retrospective designs under various choices of N and K. We examine the accumulation of missingness as a function of time prior to interview, and we investigate the impact of this missingness on model-based imputation of the state of the network at prior time points via conditional ERGM prediction. We quantitatively show that, even setting aside problems of alter identification and informant accuracy, choice of survey design and parameters used can drastically change the amount of missingness in the dataset. These differences in missingness have a large impact on the quality of retrospective parameter estimation and network imputation, including important effects on properties related to disease transmission.
29

Imai, Takumi. "Methodology of Semiparametric Estimation for Data with Missingness." Japanese Journal of Applied Statistics 46, no. 2 (2017): 87–106. http://dx.doi.org/10.5023/jappstat.46.87.

30

Molenberghs, Geert, Els J. T. Goetghebeur, Stuart R. Lipsitz, and Michael G. Kenward. "Nonrandom Missingness in Categorical Data: Strengths and Limitations." American Statistician 53, no. 2 (May 1999): 110. http://dx.doi.org/10.2307/2685728.

31

Cho Paik, Myunghee. "Nonignorable Missingness in Matched Case-Control Data Analyses." Biometrics 60, no. 2 (June 2004): 306–14. http://dx.doi.org/10.1111/j.0006-341x.2004.00174.x.

32

Molenberghs, Geert, Els J. T. Goetghebeur, Stuart R. Lipsitz, and Michael G. Kenward. "Nonrandom Missingness in Categorical Data: Strengths and Limitations." American Statistician 53, no. 2 (May 1999): 110–18. http://dx.doi.org/10.1080/00031305.1999.10474442.

33

Chaimani, Anna, Dimitris Mavridis, Georgia Salanti, Julian P. T. Higgins, and Ian R. White. "Allowing for Informative Missingness in Aggregate Data Meta-Analysis with Continuous or Binary Outcomes: Extensions to Metamiss." Stata Journal: Promoting communications on statistics and Stata 18, no. 3 (September 2018): 716–40. http://dx.doi.org/10.1177/1536867x1801800310.

Abstract:
Missing outcome data can invalidate the results of randomized trials and their meta-analysis. However, addressing missing data is often a challenging issue because it requires untestable assumptions. The impact of missing outcome data on the meta-analysis summary effect can be explored by assuming a relationship between the outcome in the observed and the missing participants via an informative missingness parameter. The informative missingness parameters cannot be estimated from the observed data, but they can be specified, with associated uncertainty, using evidence external to the meta-analysis, such as expert opinion. The use of informative missingness parameters in pairwise meta-analysis of aggregate data with binary outcomes has been previously implemented in Stata by the metamiss command. In this article, we present the new command metamiss2, which is an extension of metamiss for binary or continuous data in pairwise or network meta-analysis. The command can be used to explore the robustness of results to different assumptions about the missing data via sensitivity analysis.
34

Alade, Oyekale Abel, Ali Selamat, and Roselina Sallehuddin. "The Effects of Missing Data Characteristics on the Choice of Imputation Techniques." Vietnam Journal of Computer Science 07, no. 02 (March 20, 2020): 161–77. http://dx.doi.org/10.1142/s2196888820500098.

Abstract:
One major characteristic of data is completeness. Missing data is a significant problem in medical datasets. It leads to incorrect classification of patients and is dangerous to the health management of patients. Many factors lead to missing values in medical datasets. In this paper, we argue for examining the causes of missing data in a medical dataset to ensure that the right imputation method is used to solve the problem. The mechanism of missingness in datasets was studied to identify the missing pattern of the datasets and determine a suitable imputation technique to generate complete datasets. The pattern shows that the missingness of the dataset used in this study is not a monotone missing pattern. Also, single imputation techniques underestimate variance and ignore relationships among the variables; therefore, we used a multiple imputation technique that runs in five iterations for the imputation of each missing value. All missing values in the dataset were regenerated. The imputed datasets were validated using an extreme learning machine (ELM) classifier. The results show improvement in the accuracy of the imputed datasets. The work can, however, be extended to compare the accuracy of the imputed datasets with the original dataset with different classifiers like support vector machine (SVM), radial basis function (RBF), and ELMs.
35

Eicher, A. "Big business with Big Data Big Business mit Big Data." GIS Business 12, no. 3 (June 12, 2019): 20–25. http://dx.doi.org/10.26643/gis.v12i3.5173.

36

Franks, Alexander M., Edoardo M. Airoldi, and Donald B. Rubin. "Nonstandard conditionally specified models for nonignorable missing data." Proceedings of the National Academy of Sciences 117, no. 32 (July 28, 2020): 19045–53. http://dx.doi.org/10.1073/pnas.1815563117.

Abstract:
Data analyses typically rely upon assumptions about the missingness mechanisms that lead to observed versus missing data, assumptions that are typically unassessable. We explore an approach where the joint distribution of observed data and missing data is specified in a nonstandard way. In this formulation, which traces back to a representation of the joint distribution of the data and missingness mechanism, apparently first proposed by J. W. Tukey, the modeling assumptions about the distributions are either assessable or are designed to allow relatively easy incorporation of substantive knowledge about the problem at hand, thereby offering a possibly realistic portrayal of the data, both observed and missing. We develop Tukey’s representation for exponential-family models, propose a computationally tractable approach to inference in this class of models, and offer some general theoretical comments. We then illustrate the utility of this approach with an example in systems biology.
37

Fang, Zhou, Tianzhou Ma, Gong Tang, Li Zhu, Qi Yan, Ting Wang, Juan C. Celedón, Wei Chen, and George C. Tseng. "Bayesian integrative model for multi-omics data with missingness." Bioinformatics 34, no. 22 (September 1, 2018): 3801–8. http://dx.doi.org/10.1093/bioinformatics/bty775.

38

Khoshgoftaar, Taghi M., and Jason Van Hulse. "Imputation techniques for multivariate missingness in software measurement data." Software Quality Journal 16, no. 4 (June 11, 2008): 563–600. http://dx.doi.org/10.1007/s11219-008-9054-7.

39

McNeish, Daniel. "Missing data methods for arbitrary missingness with small samples." Journal of Applied Statistics 44, no. 1 (March 22, 2016): 24–39. http://dx.doi.org/10.1080/02664763.2016.1158246.

40

Park, Soomin, Mari Palta, Jun Shao, and Lei Shen. "Bias adjustment in analysing longitudinal data with informative missingness." Statistics in Medicine 21, no. 2 (2001): 277–91. http://dx.doi.org/10.1002/sim.992.

41

Bartlett, Jonathan W., James R. Carpenter, Kate Tilling, and Stijn Vansteelandt. "Improving upon the efficiency of complete case analysis when covariates are MNAR." Biostatistics 15, no. 4 (June 6, 2014): 719–30. http://dx.doi.org/10.1093/biostatistics/kxu023.

Abstract:
Missing values in covariates of regression models are a pervasive problem in empirical research. Popular approaches for analyzing partially observed datasets include complete case analysis (CCA), multiple imputation (MI), and inverse probability weighting (IPW). In the case of missing covariate values, these methods (as typically implemented) are valid under different missingness assumptions. In particular, CCA is valid under missing not at random (MNAR) mechanisms in which missingness in a covariate depends on the value of that covariate, but is conditionally independent of outcome. In this paper, we argue that in some settings such an assumption is more plausible than the missing at random assumption underpinning most implementations of MI and IPW. When the former assumption holds, although CCA gives consistent estimates, it does not make use of all observed information. We therefore propose an augmented CCA approach which makes the same conditional independence assumption for missingness as CCA, but which improves efficiency through specification of an additional model for the probability of missingness, given the fully observed variables. The new method is evaluated using simulations and illustrated through application to data on reported alcohol consumption and blood pressure from the US National Health and Nutrition Examination Survey, in which data are likely MNAR independent of outcome.
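The abstract's claim that complete case analysis stays valid when covariate missingness depends only on the covariate itself can be checked with a short simulation. The sketch below is a hypothetical illustration, not the augmented CCA method proposed in the paper: the linear model, the true coefficients, the logistic missingness rule, and the function names are all assumptions made for the example.

```python
import math
import random


def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of y on a single covariate x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def simulate_complete_cases(n, rng, true_intercept=1.0, true_slope=2.0):
    """Simulate y = a + b*x + noise, then drop records where x goes missing.

    The probability that x is missing depends on x itself (MNAR) but not on y
    once x is known. Returns (all x values, complete-case x, complete-case y).
    """
    xs_all, xs_cc, ys_cc = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = true_intercept + true_slope * x + rng.gauss(0, 1)
        xs_all.append(x)
        p_missing = 1.0 / (1.0 + math.exp(-(x - 0.5)))  # larger x is missing more often
        if rng.random() >= p_missing:
            xs_cc.append(x)
            ys_cc.append(y)
    return xs_all, xs_cc, ys_cc


rng = random.Random(0)
xs_all, xs_cc, ys_cc = simulate_complete_cases(20000, rng)
slope, intercept = ols_slope_intercept(xs_cc, ys_cc)
print("complete-case slope and intercept:", round(slope, 2), round(intercept, 2))
print("mean of x, all vs complete cases:",
      round(sum(xs_all) / len(xs_all), 2), round(sum(xs_cc) / len(xs_cc), 2))
```

In this setup the complete-case regression coefficients land near the assumed true values even though the complete cases are a distorted sample of x. That is the distinction the abstract draws: the marginal distribution of the covariate is biased, while the outcome regression is not.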
42

Plichta, Jennifer Kay, Christel N. Rushing, Holly C. Lewis, Dan G. Blazer, Terry Hyslop, and Rachel Adams Greenup. "Missing data in breast cancer: Relationship with survival in national databases." Journal of Clinical Oncology 38, no. 15_suppl (May 20, 2020): e19114-e19114. http://dx.doi.org/10.1200/jco.2020.38.15_suppl.e19114.

Abstract:
Background: National cancer registries are valuable tools used to analyze patterns of care and clinical oncology outcomes; yet, patients with missing data may impact the accuracy and generalizability of these data. We sought to evaluate the association between missing data and overall survival (OS). Methods: Using the NCDB and SEER, we compared data missingness among patients diagnosed with invasive breast cancer from 2010-2014. Key variables included: demographic variables (age, race, ethnicity, insurance, education, income), tumor variables (grade, ER, PR, HER2, TNM stage), and treatment variables (surgery in both databases; chemotherapy and radiation in NCDB). OS was compared between those with and without missing data via Cox proportional hazards models. Results: Overall, 775,996 patients in the NCDB and 263,016 in SEER were identified; missingness of at least 1 key variable was 29% and 13%, respectively. Of those, the majority were missing a tumor variable (NCDB 80%; SEER 88%), while demographic and treatment variables were missing less often. When compared to patients with complete data, missingness was associated with a greater risk of death; NCDB 17% vs. 14% (HR 1.23, 99% CI 1.21-1.25) and SEER 27% vs 14% (HR 2.11, 99% CI 2.05-2.18). Rate of death was similar whether the patient was missing 1 or ≥2 variables. When stratified by the type of missing variable, differences in OS between those with and without missing data in the NCDB were small. In SEER, reductions in OS were largest for those missing tumor variables (HR 2.26, 99% CI 2.19-2.33) or surgery data (HR 3.84, 99% CI 3.32-4.45). Among the tumor variables specifically, few clinically meaningful differences in OS were noted in the NCDB, while the most significant differences in SEER were noted in T and N stage (table). Conclusions: Missingness of select variables is associated with a worse OS and is not uncommon within large national cancer registries. Therefore, researchers must use caution when choosing inclusion/exclusion criteria for outcomes studies. Future research is needed to elucidate which patients are most often missing data and why OS differences are observed. [Table: see text]
43

Sizov, Ivan Aleksandrovich. "BIG DATA – BIG DATA IN BUSINESS." Economy. Business. Computer science, no. 3 (January 1, 2016): 8–23. http://dx.doi.org/10.19075/2500-2074-2016-3-8-23.

44

Habegger, Benjamin. "Big Data vs. Privacy Big Data." Services Transactions on Big Data 1, no. 1 (January 2014): 25–35. http://dx.doi.org/10.29268/stbd.2014.1.1.3.

45

Singh, Janmajay, Masahiro Sato, and Tomoko Ohkuma. "On Missingness Features in Machine Learning Models for Critical Care: Observational Study." JMIR Medical Informatics 9, no. 12 (December 8, 2021): e25022. http://dx.doi.org/10.2196/25022.

Abstract:
Background: Missing data in electronic health records is inevitable and considered to be nonrandom. Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient’s health and advocate for their inclusion in clinical prediction models. But their effectiveness has not been comprehensively evaluated. Objective: The goal of the research is to study the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and explore robustness of these features across patient subgroups and task settings. Methods: A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated on various criteria and across population subgroups evaluating discriminative ability and calibration. Results: Model performance in retrospective tasks generally improved when missingness features were included. Extent of improvement depended on the outcome of interest (area under the curve of the receiver operating characteristic [AUROC] improved from 1.2% to 7.7%) and even patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9% difference in AUROC) by the model relying only on pathological features. This was despite leading to earlier detection of disease (true positives), since including these features led to a concomitant rise in false positive detections. Conclusions: This study comprehensively evaluated effectiveness of missingness features on machine learning models. A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings, especially for administrative tasks like length of stay prediction where they present the greatest benefit. While missingness features, representative of health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further.
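To make the idea of "missingness features" concrete, the following Python sketch turns one irregularly recorded measurement series into model inputs that carry both a filled-in value and explicit indicators of what was missing. It is an assumption-laden illustration: the forward-fill rule, the default value of 0.0 before the first observation, the steps-since-observed counter, and the function name are hypothetical choices, not the feature construction used in the study.

```python
def add_missingness_features(series):
    """Build (filled_value, missing_flag, steps_since_observed) for each time step.

    series: list of floats, with None where no measurement was recorded.
    """
    features = []
    last_value = 0.0  # assumed placeholder before the first observation
    steps_since = 0
    for value in series:
        if value is None:
            steps_since += 1
            features.append((last_value, 1, steps_since))  # carry last value forward
        else:
            last_value = value
            steps_since = 0
            features.append((value, 0, 0))
    return features


heart_rate = [None, 82.0, None, None, 90.0]  # hypothetical hourly readings
for row in add_missingness_features(heart_rate):
    print(row)
```

A sequence model such as a gated recurrent unit could take these tuples as per-step inputs, so that the pattern of when a measurement was taken is available alongside the measured value.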
46

Mager, Astrid. "The politics of big data. Big data, big brother?" Information, Communication & Society 22, no. 10 (January 22, 2019): 1523–25. http://dx.doi.org/10.1080/1369118x.2019.1567804.

47

Viceconti, Marco, Peter Hunter, and Rod Hose. "Big Data, Big Knowledge: Big Data for Personalized Healthcare." IEEE Journal of Biomedical and Health Informatics 19, no. 4 (July 2015): 1209–15. http://dx.doi.org/10.1109/jbhi.2015.2406883.

48

Lesk, Michael. "Big Data, Big Brother, Big Money." IEEE Security & Privacy 11, no. 4 (July 2013): 85–89. http://dx.doi.org/10.1109/msp.2013.81.

49

Meisner, Jonas, Siyang Liu, Mingxi Huang, and Anders Albrechtsen. "Large-scale inference of population structure in presence of missingness using PCA." Bioinformatics 37, no. 13 (January 18, 2021): 1868–75. http://dx.doi.org/10.1093/bioinformatics/btab027.

Abstract:
Motivation: Principal component analysis (PCA) is a commonly used tool in genetics to capture and visualize population structure. Due to technological advances in sequencing, such as the widely used non-invasive prenatal test, massive datasets of ultra-low coverage sequencing are being generated. These datasets are characterized by having a large amount of missing genotype information. Results: We present EMU, a method for inferring population structure in the presence of rampant non-random missingness. We show through simulations that several commonly used PCA methods cannot handle missing data arisen from various sources, which leads to biased results as individuals are projected into the PC space based on their amount of missingness. In terms of accuracy, EMU outperforms an existing method that also accommodates missingness while being competitively fast. We further tested EMU on around 100K individuals of the Phase 1 dataset of the Chinese Millionome Project, that were shallowly sequenced to around 0.08×. From this data we are able to capture the population structure of the Han Chinese and to reproduce previous analysis in a matter of CPU hours instead of CPU years. EMU’s capability to accurately infer population structure in the presence of missingness will be of increasing importance with the rising number of large-scale genetic datasets. Availability and implementation: EMU is written in Python and is freely available at https://github.com/rosemeis/emu. Supplementary information: Supplementary data are available at Bioinformatics online.
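To make the problem of PCA with missing entries concrete, here is a minimal sketch of a generic technique, iterative low-rank imputation: missing cells start at the column mean and are repeatedly replaced by a truncated-SVD reconstruction. This is not EMU's code, and the abstract does not say whether EMU works this way; the function name `iterative_pca`, its parameters, and the assumption that every column has at least one observed value are all illustrative.

```python
import numpy as np

def iterative_pca(g, n_components=2, n_iter=50):
    """PCA on a (samples, variables) matrix with NaN for missing entries.

    Assumes every column has at least one observed value.
    Returns (scores, filled): the principal-component scores and the
    matrix with missing cells replaced by their low-rank estimates.
    """
    g = np.asarray(g, dtype=float)
    missing = np.isnan(g)
    # Start by filling each missing cell with its column mean.
    filled = np.where(missing, np.nanmean(g, axis=0), g)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        u, s, vt = np.linalg.svd(filled - mu, full_matrices=False)
        low_rank = (u[:, :n_components] * s[:n_components]) @ vt[:n_components] + mu
        # Only missing cells are updated; observed values are never changed.
        filled = np.where(missing, low_rank, g)
    mu = filled.mean(axis=0)
    u, s, vt = np.linalg.svd(filled - mu, full_matrices=False)
    return u[:, :n_components] * s[:n_components], filled

rng = np.random.default_rng(0)
full = rng.normal(size=(20, 1)) @ rng.normal(size=(1, 8))
observed = full.copy()
observed[rng.random(full.shape) < 0.2] = np.nan  # hide about 20% of the cells
scores, filled = iterative_pca(observed, n_components=1)
```

A full SVD per iteration is only workable for small matrices; data at the scale described in the abstract would need a far more economical decomposition.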
50

Martin, Joseph. "Big data, big future." BioTechniques 68, no. 4 (April 2020): 166–68. http://dx.doi.org/10.2144/btn-2020-0027.
