Academic literature on the topic 'Big data with missingness'

Journal articles on the topic "Big data with missingness"

1

Elleman, Lorien G., Sarah K. McDougald, David M. Condon, and William Revelle. "That Takes the BISCUIT." European Journal of Psychological Assessment 36, no. 6 (November 2020): 948–58. http://dx.doi.org/10.1027/1015-5759/a000590.

Abstract:
The predictive accuracy of personality-criterion regression models may be improved with statistical learning (SL) techniques. This study introduced a novel SL technique, BISCUIT (Best Items Scale that is Cross-validated, Unit-weighted, Informative, and Transparent). The predictive accuracy and parsimony of BISCUIT were compared with three established SL techniques (the lasso, elastic net, and random forest) and regression using two sets of scales, for five criteria, across five levels of data missingness. BISCUIT’s predictive accuracy was competitive with other SL techniques at higher levels of data missingness. BISCUIT most frequently produced the most parsimonious SL model. In terms of predictive accuracy, the elastic net and lasso dominated other techniques in the complete data condition and in conditions with up to 50% data missingness. Regression using 27 narrow traits was an intermediate choice for predictive accuracy. For most criteria and levels of data missingness, regression using the Big Five had the worst predictive accuracy. Overall, loss in predictive accuracy due to data missingness was modest, even at 90% data missingness. Findings suggest that personality researchers should consider incorporating planned data missingness and SL techniques into their designs and analyses.
2

Neuenschwander, Beat, and Michael Branson. "Modeling Missingness for Time-to-Event Data: A Case Study in Osteoporosis." Journal of Biopharmaceutical Statistics 14, no. 4 (December 31, 2004): 1005–19. http://dx.doi.org/10.1081/bip-200035478.

3

Nwakuya, M. T., and B. O. Onyegbuchulam. "Quantile Regression-based Multiple Imputation of Skewed Data with Different Percentages of Missingness." Scholars Journal of Physics, Mathematics and Statistics 9, no. 4 (May 10, 2022): 41–45. http://dx.doi.org/10.36347/sjpms.2022.v09i04.002.

Abstract:
This study investigates Quantile Regression-Based Multiple Imputation (QR-based MI) on simulated right-skewed data with 5% and 25% missing data points. Quantile regression analysis was performed at the 0.25, 0.5, 0.75 and 0.95 quantiles on three data sets: the complete skewed data without missing values, a data set with 5% missing values and a data set with 25% missing values. The data sets with 5% and 25% missing values were imputed using the QR-based MI technique, giving rise to two complete data sets. The analysis was performed using both transformed and untransformed versions of the three data sets. The transformation was carried out by applying the Yeo-Johnson transformation technique, and comparison of results was based on the Mean Square Error (MSE), Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). The results from the original complete right-skewed data show that the untransformed data gave better results at the 0.25 and 0.50 quantiles, while the transformed data gave better results at the 0.75 and 0.95 quantiles. This is attributed to the fact that the data were right skewed, so the transformation benefits the heavy tail on the right while the lighter tail on the left need not be transformed. For the imputed complete data sets from the 5% and 25% missingness conditions, the untransformed data produced better results than the transformed data at all quantiles considered. This led us to conclude that QR-based MI is not distribution dependent and hence is not sensitive to skewness. Based on these results, QR-based MI is robust to skewness and can therefore be applied to skewed data sets.
4

Poyatos, Rafael, Oliver Sus, Llorenç Badiella, Maurizio Mencuccini, and Jordi Martínez-Vilalta. "Gap-filling a spatially explicit plant trait database: comparing imputation methods and different levels of environmental information." Biogeosciences 15, no. 9 (May 4, 2018): 2601–17. http://dx.doi.org/10.5194/bg-15-2601-2018.

Abstract:
The ubiquity of missing data in plant trait databases may hinder trait-based analyses of ecological patterns and processes. Spatially explicit datasets with information on intraspecific trait variability are rare but offer great promise in improving our understanding of functional biogeography. At the same time, they offer specific challenges in terms of data imputation. Here we compare statistical imputation approaches, using varying levels of environmental information, for five plant traits (leaf biomass to sapwood area ratio, leaf nitrogen content, maximum tree height, leaf mass per area and wood density) in a spatially explicit plant trait dataset of temperate and Mediterranean tree species (Ecological and Forest Inventory of Catalonia, IEFC, dataset for Catalonia, north-east Iberian Peninsula, 31 900 km²). We simulated gaps at different missingness levels (10–80 %) in a complete trait matrix, and we used overall trait means, species means, k nearest neighbours (kNN), ordinary and regression kriging, and multivariate imputation using chained equations (MICE) to impute missing trait values. We assessed these methods in terms of their accuracy and of their ability to preserve trait distributions, multi-trait correlation structure and bivariate trait relationships. The relatively good performance of mean and species mean imputations in terms of accuracy masked a poor representation of trait distributions and multivariate trait structure. Species identity improved MICE imputations for all traits, whereas forest structure and topography improved imputations for some traits. No method performed best consistently for the five studied traits, but, considering all traits and performance metrics, MICE informed by relevant ecological variables gave the best results. However, at higher missingness (> 30 %), species mean imputations and regression kriging tended to outperform MICE for some traits.
MICE informed by relevant ecological variables allowed us to fill the gaps in the IEFC incomplete dataset (5495 plots) and quantify imputation uncertainty. Resulting spatial patterns of the studied traits in Catalan forests were broadly similar when using species means, regression kriging or the best-performing MICE application, but some important discrepancies were observed at the local level. Our results highlight the need to assess imputation quality beyond just imputation accuracy and show that including environmental information in statistical imputation approaches yields more plausible imputations in spatially explicit plant trait datasets.
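To make the two simplest baselines named in this abstract concrete, here is a minimal Python sketch of species-mean imputation with a fallback to the overall trait mean. It is not the authors' code: the function name `impute_species_mean`, the record layout and the sample values are assumptions made for illustration only.

```python
from collections import defaultdict


def impute_species_mean(records):
    """Fill missing trait values with the species mean.

    `records` is a list of (species, value) pairs where a missing value
    is None. Species with no observed value fall back to the overall mean.
    This layout is a hypothetical one chosen for the example.
    """
    observed = [value for _, value in records if value is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    overall_mean = sum(observed) / len(observed)

    # Collect observed values per species.
    by_species = defaultdict(list)
    for species, value in records:
        if value is not None:
            by_species[species].append(value)
    species_mean = {s: sum(v) / len(v) for s, v in by_species.items()}

    # Keep observed values; replace missing ones.
    return [
        value if value is not None else species_mean.get(species, overall_mean)
        for species, value in records
    ]


# Illustrative values only, not data from the study.
records = [
    ("Pinus", 10.0),
    ("Pinus", 14.0),
    ("Pinus", None),
    ("Quercus", 20.0),
    ("Quercus", None),
    ("Fagus", None),
]
filled = impute_species_mean(records)
```

As the abstract notes, this kind of mean imputation can look accurate while flattening trait distributions, since every missing value within a species receives the same number.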
5

Ghazali, Shamihah Muhammad, Norshahida Shaadan, and Zainura Idrus. "Missing data exploration in air quality data set using R-package data visualisation tools." Bulletin of Electrical Engineering and Informatics 9, no. 2 (April 1, 2020): 755–63. http://dx.doi.org/10.11591/eei.v9i2.2088.

Abstract:
Missing values occur in data sets across many research areas. This is recognized as a data quality problem because missing values can affect the performance of analysis results. To overcome the problem, the incomplete data set needs to be treated or replaced using an imputation method. Exploring the pattern of missing values must therefore be conducted beforehand to determine a suitable method. This paper discusses the application of data visualisation as a smart technique for missing data exploration, aiming to increase understanding of missing data behaviour, which includes the missing data mechanism (MCAR, MAR and MNAR) and the distribution pattern of missingness in terms of percentage as well as gap size. This paper presents the application of several data visualisation tools from five R packages (visdat, VIM, ggplot2, Amelia and UpSetR) for data missingness exploration. As an illustration, based on an air quality data set in Malaysia, several graphics were produced and discussed to illustrate the contribution of the visualisation tools in providing input and insight on the pattern of data missingness. The results show that missing values in the air quality data set of the chosen sites in Malaysia behave as missing at random (MAR), with a small percentage of missingness, and do contain long gaps of missingness.
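The two numeric summaries this abstract mentions, percentage of missingness and gap size, can be computed before any plotting. The following is a small Python sketch rather than the R workflow used in the paper; the function name `missing_summary` and the sample readings are hypothetical.

```python
def missing_summary(series):
    """Summarise missingness in a time-ordered series.

    Missing observations are represented as None. Returns the percentage
    of missing values and the length of each run of consecutive missing
    values (the "gap sizes").
    """
    gaps = []
    run = 0
    for value in series:
        if value is None:
            run += 1
        elif run:
            gaps.append(run)  # a gap just ended
            run = 0
    if run:
        gaps.append(run)  # series ends inside a gap

    total = len(series)
    n_missing = sum(gaps)
    percent = 100.0 * n_missing / total if total else 0.0
    return {
        "percent_missing": percent,
        "gap_lengths": gaps,
        "longest_gap": max(gaps, default=0),
    }


# Illustrative readings only, not data from the study.
readings = [41.0, None, None, 38.5, 40.2, None, 39.9, None, None, None]
summary = missing_summary(readings)
```

A low percentage combined with a large `longest_gap` is the situation the abstract describes, and it matters because methods suited to isolated missing points may behave poorly across long gaps.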
6

Beesley, Lauren J., Irina Bondarenko, Michael R. Elliot, Allison W. Kurian, Steven J. Katz, and Jeremy MG Taylor. "Multiple imputation with missing data indicators." Statistical Methods in Medical Research 30, no. 12 (October 13, 2021): 2685–700. http://dx.doi.org/10.1177/09622802211047346.

Abstract:
Multiple imputation is a well-established general technique for analyzing data with missing values. A convenient way to implement multiple imputation is sequential regression multiple imputation, also called chained equations multiple imputation. In this approach, we impute missing values using regression models for each variable, conditional on the other variables in the data. This approach, however, assumes that the missingness mechanism is missing at random, and it is not well-justified under not-at-random missingness without additional modification. In this paper, we describe how we can generalize the sequential regression multiple imputation procedure to handle missingness not at random in the setting where missingness may depend on other variables that are also missing but not on the missing variable itself, conditioning on fully observed variables. We provide algebraic justification for several generalizations of standard sequential regression multiple imputation using Taylor series and other approximations of the target imputation distribution under missingness not at random. Resulting regression model approximations include indicators for missingness, interactions, or other functions of the missingness not at random missingness model and observed data. In a simulation study, we demonstrate that the proposed sequential regression multiple imputation modifications result in reduced bias in the final analysis compared to standard sequential regression multiple imputation, with an approximation strategy involving inclusion of an offset in the imputation model performing the best overall. The method is illustrated in a breast cancer study, where the goal is to estimate the prevalence of a specific genetic pathogenic variant.
8

ZHANG, WEN, YE YANG, and QING WANG. "A COMPARATIVE STUDY OF ABSENT FEATURES AND UNOBSERVED VALUES IN SOFTWARE EFFORT DATA." International Journal of Software Engineering and Knowledge Engineering 22, no. 02 (March 2012): 185–202. http://dx.doi.org/10.1142/s0218194012400025.

Abstract:
Software effort data contains a large number of missing values of project attributes. The problem of absent features, which has emerged recently in machine learning, is often neglected by software engineering researchers when handling missingness in software effort data. In essence, absent features (structural missingness) and unobserved values (unstructured missingness) are different cases of missingness, although their appearance in the data set is the same. This paper attempts to clarify the root cause of missingness in software effort data. When regarding missingness as absent features, we develop Max-margin regression to predict the real effort of software projects. When regarding missingness as unobserved values, we use existing imputation techniques to impute missing values. Then, ε-SVR is used to predict the real effort of software projects with the input data sets. Experiments on the ISBSG (International Software Benchmarking Standard Group) and CSBSG (Chinese Software Benchmarking Standard Group) data sets demonstrate that, for effort prediction tasks, treating missingness in software effort data as unobserved values can produce more desirable performance than treating it as absent features. This paper is the first to introduce the concept of absent features to deal with missingness in software effort data.
9

De Raadt, Alexandra, Matthijs J. Warrens, Roel J. Bosker, and Henk A. L. Kiers. "Kappa Coefficients for Missing Data." Educational and Psychological Measurement 79, no. 3 (January 16, 2019): 558–76. http://dx.doi.org/10.1177/0013164418823249.

Abstract:
Cohen’s kappa coefficient is commonly used for assessing agreement between classifications of two raters on a nominal scale. Three variants of Cohen’s kappa that can handle missing data are presented. Data are considered missing if one or both ratings of a unit are missing. We study how well the variants estimate the kappa value for complete data under two missing data mechanisms, namely missingness completely at random and a form of missingness not at random. The kappa coefficient considered in Gwet (Handbook of Inter-rater Reliability, 4th ed.) and the kappa coefficient based on listwise deletion of units with missing ratings were found to have virtually no bias and mean squared error if missingness is completely at random, and small bias and mean squared error if missingness is not at random. Furthermore, the kappa coefficient that treats missing ratings as a regular category appears to be rather heavily biased and has a substantial mean squared error in many of the simulations. Because it performs well and is easy to compute, we recommend using the kappa coefficient that is based on listwise deletion of missing ratings if it can be assumed that missingness is completely at random or not at random.
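The listwise-deletion variant recommended in this abstract is simple to state: drop every unit where either rating is missing, then apply the usual formula κ = (p_o − p_e) / (1 − p_e) to the remaining units. The Python sketch below follows that description; it is not taken from the paper, and the function name `kappa_listwise` and the sample ratings are assumptions for illustration.

```python
from collections import Counter


def kappa_listwise(ratings_a, ratings_b):
    """Cohen's kappa after listwise deletion of incomplete units.

    A unit is dropped when either rater's rating is None. Observed
    agreement p_o and chance agreement p_e are computed from the
    remaining complete pairs.
    """
    pairs = [
        (a, b)
        for a, b in zip(ratings_a, ratings_b)
        if a is not None and b is not None
    ]
    n = len(pairs)
    if n == 0:
        raise ValueError("no units with both ratings present")

    # Proportion of units on which the raters agree.
    p_o = sum(a == b for a, b in pairs) / n

    # Chance agreement from each rater's marginal category counts.
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Illustrative ratings only; the last two units are dropped.
rater_a = ["yes", "yes", "no", "no", None, "yes"]
rater_b = ["yes", "no", "no", "no", "yes", None]
kappa = kappa_listwise(rater_a, rater_b)
```

In this example four complete units remain, with p_o = 0.75 and p_e = 0.5, so kappa is 0.5. Treating None as its own category instead would correspond to the variant the abstract reports as heavily biased.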
10

Arioli, Angelica, Arianna Dagliati, Bethany Geary, Niels Peek, Philip A. Kalra, Anthony D. Whetton, and Nophar Geifman. "OptiMissP: A dashboard to assess missingness in proteomic data-independent acquisition mass spectrometry." PLOS ONE 16, no. 4 (April 15, 2021): e0249771. http://dx.doi.org/10.1371/journal.pone.0249771.

Abstract:
Background: Missing values are a key issue in the statistical analysis of proteomic data. Defining the strategy to address missing values is a complex task in each study, potentially affecting the quality of statistical analyses. Results: We have developed OptiMissP, a dashboard to visually and qualitatively evaluate missingness and guide decision making in the handling of missing values in proteomics studies that use data-independent acquisition mass spectrometry. It provides a set of visual tools to retrieve information about missingness through protein densities and topology-based approaches, and facilitates exploration of different imputation methods and missingness thresholds. Conclusions: OptiMissP provides support for researchers’ and clinicians’ qualitative assessment of missingness in proteomic datasets in order to define study-specific strategies for the handling of missing values. OptiMissP considers biases in protein distributions related to the choice of imputation method and helps analysts to balance the information loss caused by low missingness thresholds and the noise introduced by selecting high missingness thresholds. This is complemented by topological data analysis which provides additional insight to the structure of the data and their missingness. We use an example in Chronic Kidney Disease to illustrate the main functionalities of OptiMissP.

Dissertations / Theses on the topic "Big data with missingness"

1

Cao, Yu. "Bayesian nonparametric analysis of longitudinal data with non-ignorable non-monotone missingness." VCU Scholars Compass, 2019. https://scholarscompass.vcu.edu/etd/5750.

Abstract:
In longitudinal studies, outcomes are measured repeatedly over time, but in reality clinical studies are full of missing data points of both monotone and non-monotone nature. Often this missingness is related to the unobserved data, so that it is non-ignorable. In this context, the pattern-mixture model (PMM) is one popular tool for analyzing the joint distribution of outcome and missingness patterns. The unobserved outcomes are then imputed using the distribution of observed outcomes, conditioned on missing patterns. However, the existing methods suffer from model identification issues if data are sparse in specific missing patterns, which is very likely to happen with a small sample size or a large number of repetitions. We extend the existing methods using latent class analysis (LCA) and a shared-parameter PMM. The LCA groups patterns of missingness with similar features, and the shared-parameter PMM allows a subset of parameters to differ among latent classes when fitting a model, thus restoring model identifiability. A novel imputation method is also developed using the distribution of observed data conditioned on latent classes. We develop this model for continuous response data and extend it to handle ordinal rating scale data. Our model performs better than existing methods for data with small sample size. The method is applied to two datasets: one from a phase II clinical trial that studies the quality of life for patients with prostate cancer receiving radiation therapy, and another used to study the relationship between the perceived neighborhood condition in adolescence and the drinking habit in adulthood.
2

Hansen, Simon, and Erik Markow. "Big Data : Implementation av Big Data i offentlig verksamhet." Thesis, Högskolan i Halmstad, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-38756.

3

Deng, Wei. "Multiple imputation for marginal and mixed models in longitudinal data with informative missingness." Connect to resource, 2005. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1126890027.

Abstract:
Thesis (Ph. D.)--Ohio State University, 2005.
Title from first page of PDF file. Document formatted into pages; contains xiii, 108 p.; also includes graphics. Includes bibliographical references (p. 104-108). Available online via OhioLINK's ETD Center
4

Lundvall, Helena. "Big data = Big money? : En kvantitativ studie om big data, förtroende och köp online." Thesis, Uppsala universitet, Företagsekonomiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-451065.

Abstract:
Previous research has consistently shown that increased customer trust in purchase situations increases customers' willingness to complete purchases. The factors that affect customer trust have also been studied extensively, and factors linked to the handling of customer data are increasingly identified as decisive. However, these factors are often treated at a general level, and studies that examine in depth which underlying factors related to data handling affect customer trust are lacking. By collecting quantitative data on how customers relate to companies' collection and use of big data, their trust in e-commerce companies, and their willingness to make purchases online, this study aims to examine the effect of companies' collection and use of big data on customers' trust in e-commerce companies, and to examine the effect of customers' trust on their willingness to complete purchases. The results show that companies' collection of big data has a significant negative effect on customer trust, and that customer trust has a significant positive relationship with customers' purchase intention. For companies' use of big data, however, a significant negative effect on customer trust could not be demonstrated.
5

Rizk, Raya. "Big Data Validation." Thesis, Uppsala universitet, Informationssystem, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-353850.

Abstract:
With the explosion in usage of big data, stakes are high for companies to develop workflows that translate the data into business value. Those data transformations are continuously updated and refined in order to meet the evolving business needs, and it is imperative to ensure that a new version of a workflow still produces the correct output. This study focuses on the validation of big data in a real-world scenario, and implements a validation tool that compares two databases that hold the results produced by different versions of a workflow in order to detect and prevent potential unwanted alterations, with row-based and column-based statistics being used to validate the two versions. The tool was shown to provide accurate results in test scenarios, providing leverage to companies that need to validate the outputs of the workflows. In addition, by automating this process, the risk of human error is eliminated, and it has the added benefit of improved speed compared to the more labour-intensive manual alternative. All this allows for a more agile way of performing updates on the data transformation workflows by improving on the turnaround time of the validation process.
6

Jaber, Carolin. "Big data visualisering." Thesis, Örebro universitet, Institutionen för naturvetenskap och teknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-79898.

Abstract:
Presenting data in graphical form is important in many fields, because it makes the information and relationships in collected data easier to understand. The amount of data is growing quickly to scales that are hard to manage, which brings new challenges for visualizing data graphically. Systems depend on data visualization to detect defects and faults in production. Improving the performance of time series data visualization increases the ability to detect such faults and defects. This report covers methods for visualizing time series data with fast performance and discusses how multivariate big data can be visualized with PCA.
7

Blahová, Leontýna. "Big Data Governance." Master's thesis, Vysoká škola ekonomická v Praze, 2016. http://www.nusl.cz/ntk/nusl-203994.

Abstract:
This master's thesis is about Big Data Governance and the software used for this purpose. Because Big Data is both a huge opportunity and a risk, I wanted to map products that can easily be used for Data Quality and Big Data Governance in one platform. This thesis does not stay at the level of theoretical knowledge; it also evaluates five products that I consider key. I defined requirements for each domain and then set the weights and points. The main objective is to evaluate the software capabilities and compare them.
8

Kämpe, Gabriella. "How Big Data Affects User Experience: Reducing cognitive load in big data applications." Thesis, Umeå universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163995.

Abstract:
We have entered the age of big data. Massive data sets are common in enterprises, government, and academia. Interpreting data at such scales is still hard for the human mind. This thesis investigates how proper design can decrease the cognitive load in data-heavy applications. It focuses on numeric data describing economic growth in retail organizations. It aims to answer two questions: "What is important to keep in mind when designing an interface that holds large amounts of data?" and "How can the cognitive load in complex user interfaces be decreased without reducing functionality?" It does so by comparing two user interfaces in terms of efficiency, structure, ease of use and navigation. Each interface holds the same functionality and amount of data, but one is designed to improve user experience by reducing cognitive load. The design choices in the second application are based on the theory found in the literature study in the thesis.
9

Hafez, Mai. "Analysis of multivariate longitudinal categorical data subject to nonrandom missingness : a latent variable approach." Thesis, London School of Economics and Political Science (University of London), 2015. http://etheses.lse.ac.uk/3184/.

Abstract:
Longitudinal data are collected for studying changes across time. In social sciences, interest is often in theoretical constructs, such as attitudes, behaviour or abilities, which cannot be directly measured. In that case, multiple related manifest (observed) variables, for example survey questions or items in an ability test, are used as indicators for the constructs, which are themselves treated as latent (unobserved) variables. In this thesis, multivariate longitudinal data is considered where multiple observed variables, measured at each time point, are used as indicators for theoretical constructs (latent variables) of interest. The observed items and the latent variables are linked together via statistical latent variable models. A common problem in longitudinal studies is missing data, where missingness can be classified into one of two forms. Dropout occurs when subjects exit the study prematurely, while intermittent missingness takes place when subjects miss one or more occasions but show up on a subsequent wave of the study. Ignoring the missingness mechanism can lead to biased estimates, especially when the missingness is nonrandom. The approach proposed in this thesis uses latent variable models to capture the evolution of a latent phenomenon over time, while incorporating a missingness mechanism to account for possibly nonrandom forms of missingness. Two model specifications are presented, the first of which incorporates dropout only in the missingness mechanism, while the other accounts for both dropout and intermittent missingness allowing them to be informative by being modelled as functions of the latent variables and possibly observed covariates. Models developed in this thesis consider ordinal and binary observed items, because such variables are often met in social surveys, while the underlying latent variables are assumed to be continuous.
The proposed models are illustrated by analysing people's perceptions on women's work using three questions from five waves of the British Household Panel Survey.
10

Andersson, Oscar, and Tim Andersson. "AI applications on healthcare data." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-44752.

Abstract:
The purpose of this research is to get a better understanding of how different machine learning algorithms work with different amounts of data corruption. This is important since data corruption is an overbearing issue within data collection and thus, in extension, any work that relies on the collected data. The questions we were looking at were: What feature is the most important? How significant is the correlation of features? What algorithms should be used given the data available? And, How much noise (inaccurate or unhelpful captured data) is acceptable?  The study is structured to introduce AI in healthcare, data missingness, and the machine learning algorithms we used in the study. In the method section, we give a recommended workflow for handling data with machine learning in mind. The results show us that when a dataset is filled with random values, the run-time of algorithms increases since many patterns are lost. Randomly removing values also caused less of a problem than first anticipated since we ran multiple trials, evening out any problems caused by the lost values. Lastly, imputation is a preferred way of handling missing data since it retained many dataset structures. One has to keep in mind if the imputation is done on categories or numerical values. However, there is no easy "best-fit" for any dataset. It is hard to give a concrete answer when choosing a machine learning algorithm that fits any dataset. Nevertheless, since it is easy to simply plug-and-play with many algorithms, we would recommend any user try different ones before deciding which one fits a project the best.

Books on the topic "Big data with missingness"

1

Mei, Hong, Weiguo Zhang, Wenfei Fan, Zili Zhang, Yihua Huang, Jiajun Bu, Yang Gao, and Li Wang, eds. Big Data. Singapore: Springer Singapore, 2021. http://dx.doi.org/10.1007/978-981-16-0705-9.

2

Liao, Xiangke, Wei Zhao, Enhong Chen, Nong Xiao, Li Wang, Yang Gao, Yinghuan Shi, Changdong Wang, and Dan Huang, eds. Big Data. Singapore: Springer Singapore, 2022. http://dx.doi.org/10.1007/978-981-16-9709-8.

3

Xu, Zongben, Xinbo Gao, Qiguang Miao, Yunquan Zhang, and Jiajun Bu, eds. Big Data. Singapore: Springer Singapore, 2018. http://dx.doi.org/10.1007/978-981-13-2922-7.

4

King, Stefanie. Big Data. Wiesbaden: Springer Fachmedien Wiesbaden, 2014. http://dx.doi.org/10.1007/978-3-658-06586-7.

5

Fasel, Daniel, and Andreas Meier, eds. Big Data. Wiesbaden: Springer Fachmedien Wiesbaden, 2016. http://dx.doi.org/10.1007/978-3-658-11589-0.

6

Mohanty, Hrushikesha, Prachet Bhuyan, and Deepak Chenthati, eds. Big Data. New Delhi: Springer India, 2015. http://dx.doi.org/10.1007/978-81-322-2494-5.

7

Jin, Hai, Xuemin Lin, Xueqi Cheng, Xuanhua Shi, Nong Xiao, and Yihua Huang, eds. Big Data. Singapore: Springer Singapore, 2019. http://dx.doi.org/10.1007/978-981-15-1899-7.

8

König, Christian, Jette Schröder, and Erich Wiegand, eds. Big Data. Wiesbaden: Springer Fachmedien Wiesbaden, 2018. http://dx.doi.org/10.1007/978-3-658-20083-1.

9

Gottlob, Georg, Giovanni Grasso, Dan Olteanu, and Christian Schallhart, eds. Big Data. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-39467-6.

10

Chen, Min, Shiwen Mao, Yin Zhang, and Victor C. M. Leung. Big Data. Cham: Springer International Publishing, 2014. http://dx.doi.org/10.1007/978-3-319-06245-7.


Book chapters on the topic "Big data with missingness"

1

Laaksonen, Seppo. "Missingness, Its Reasons and Treatment." In Survey Methodology and Missing Data, 99–110. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-79011-4_7.

2

Laaksonen, Seppo. "Sampling Principles, Missingness Mechanisms, and Design Weighting." In Survey Methodology and Missing Data, 49–76. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-79011-4_4.

3

Rodrigues de Morais, Sérgio, and Alex Aussem. "Exploiting Data Missingness in Bayesian Network Modeling." In Advances in Intelligent Data Analysis VIII, 35–46. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009. http://dx.doi.org/10.1007/978-3-642-03915-7_4.

4

Bautista, Elizabeth, Cary Whitney, and Thomas Davis. "Big Data Behind Big Data." In Conquering Big Data with High Performance Computing, 163–89. Cham: Springer International Publishing, 2016. http://dx.doi.org/10.1007/978-3-319-33742-5_8.

5

Estrada, Raul, and Isaac Ruiz. "Big Data, Big Challenges." In Big Data SMACK, 3–7. Berkeley, CA: Apress, 2016. http://dx.doi.org/10.1007/978-1-4842-2175-4_1.

6

Estrada, Raul, and Isaac Ruiz. "Big Data, Big Solutions." In Big Data SMACK, 9–16. Berkeley, CA: Apress, 2016. http://dx.doi.org/10.1007/978-1-4842-2175-4_2.

7

Amirian, Pouria, Francois van Loggerenberg, and Trudie Lang. "Big Data and Big Data Technologies." In Big Data in Healthcare, 39–58. Cham: Springer International Publishing, 2017. http://dx.doi.org/10.1007/978-3-319-62990-2_3.

8

Shi, Yong. "Big Data and Big Data Analytics." In Advances in Big Data Analytics, 3–21. Singapore: Springer Singapore, 2022. http://dx.doi.org/10.1007/978-981-16-3607-3_1.

9

Anderson, Billie. "Big Data." In Intelligent Credit Scoring, 149–72. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2016. http://dx.doi.org/10.1002/9781119282396.ch9.

10

Richter, Philipp. "Big Data." In Handbuch Medien- und Informationsethik, 210–16. Stuttgart: J.B. Metzler, 2016. http://dx.doi.org/10.1007/978-3-476-05394-7_28.


Conference papers on the topic "Big data with missingness"

1

Ghorbani, Amirata, and James Y. Zou. "Embedding for Informative Missingness: Deep Learning With Incomplete Data." In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2018. http://dx.doi.org/10.1109/allerton.2018.8636008.

2

Oltjen, William C., Yangxin Fan, Jiqi Liu, Liangyi Huang, Xuanji Yu, Mengjie Li, Hubert Seigneur, et al. "FAIRification, Quality Assessment, and Missingness Pattern Discovery for Spatiotemporal Photovoltaic Data." In 2022 IEEE 49th Photovoltaics Specialists Conference (PVSC). IEEE, 2022. http://dx.doi.org/10.1109/pvsc48317.2022.9938523.

3

Mohan, Karthika, Felix Thoemmes, and Judea Pearl. "Estimation with Incomplete Data: The Linear Case." In Twenty-Seventh International Joint Conference on Artificial Intelligence {IJCAI-18}. California: International Joint Conferences on Artificial Intelligence Organization, 2018. http://dx.doi.org/10.24963/ijcai.2018/705.

Abstract:
Traditional methods for handling incomplete data, including Multiple Imputation and Maximum Likelihood, require that the data be Missing At Random (MAR). In most cases, however, missingness in a variable depends on the underlying value of that variable. In this work, we devise model-based methods to consistently estimate mean, variance and covariance given data that are Missing Not At Random (MNAR). While previous work on MNAR data requires variables to be discrete, we extend the analysis to continuous variables drawn from Gaussian distributions. We demonstrate the merits of our techniques by comparing them empirically with state-of-the-art software packages.
4

Becker, David, Trish Dunn King, and Bill McMullen. "Big data, big data quality problem." In 2015 IEEE International Conference on Big Data (Big Data). IEEE, 2015. http://dx.doi.org/10.1109/bigdata.2015.7364064.

5

Gopalkrishnan, Vivekanand, David Steier, Harvey Lewis, and James Guszcza. "Big data, big business." In the 1st International Workshop. New York, New York, USA: ACM Press, 2012. http://dx.doi.org/10.1145/2351316.2351318.

6

Gote, Christoph, Pavlin Mavrodiev, Frank Schweitzer, and Ingo Scholtes. "Big data = big insights?" In ICSE '22: 44th International Conference on Software Engineering. New York, NY, USA: ACM, 2022. http://dx.doi.org/10.1145/3510003.3510619.

7

Wang, Wei. "Big Data, Big Challenges." In 2014 IEEE International Conference on Semantic Computing (ICSC). IEEE, 2014. http://dx.doi.org/10.1109/icsc.2014.65.

8

Callahan, Sequoia. "Ecologically Plausible? Comparing the Independent and Paired Samples t-Test With Nonrandom Missingness and Skewed Data." In 2021 AERA Annual Meeting. Washington DC: AERA, 2021. http://dx.doi.org/10.3102/1691413.

9

Shamsuddin, Siti Mariyam, and Shafaatunnur Hasan. "Data science vs big data @ UTM big data centre." In 2015 International Conference on Science in Information Technology (ICSITech). IEEE, 2015. http://dx.doi.org/10.1109/icsitech.2015.7407766.

10

Benjamins, V. Richard. "Big Data." In the 4th International Conference. New York, New York, USA: ACM Press, 2014. http://dx.doi.org/10.1145/2611040.2611042.


Reports on the topic "Big data with missingness"

1

Zwitter, Andrej J., and Amelia Hadfield. Governing Big Data. Librello, January 2014. http://dx.doi.org/10.12924/pag2014.02010001.

2

Gildea, Timothy R. Big Data health Physics. Office of Scientific and Technical Information (OSTI), March 2020. http://dx.doi.org/10.2172/1603973.

3

Goldstein, Itay, Chester Spatt, and Mao Ye. Big Data in Finance. Cambridge, MA: National Bureau of Economic Research, March 2021. http://dx.doi.org/10.3386/w28615.

4

Big data en salud digital. Chair Alberto Urueña López and José María San Segundo Encinar. ONTSI : Fundación Vodafone España, March 2017. http://dx.doi.org/10.30923/5896-8.

5

Alewijn, M. Big data - Banana origin determination. Wageningen: Wageningen Food Safety Research, 2020. http://dx.doi.org/10.18174/516096.

6

Doucet, Rachel A., Deyan M. Dontchev, Javon S. Burden, and Thomas L. Skoff. Big Data Analytics Test Bed. Fort Belvoir, VA: Defense Technical Information Center, September 2013. http://dx.doi.org/10.21236/ada589903.

7

Farboodi, Maryam, Roxana Mihet, Thomas Philippon, and Laura Veldkamp. Big Data and Firm Dynamics. Cambridge, MA: National Bureau of Economic Research, January 2019. http://dx.doi.org/10.3386/w25515.

8

Ahrens, James, Jim M. Brase, Bill Hart, Dimitri Kusnezov, and John Shalf. Where Big Data and Prediction Meet. Office of Scientific and Technical Information (OSTI), September 2014. http://dx.doi.org/10.2172/1169890.

9

Rathinam, Francis, P. Thissen, and M. Gaarder. Using big data for impact evaluations. Centre of Excellence for Development Impact and Learning (CEDIL), February 2021. http://dx.doi.org/10.51744/cmb2.

Abstract:
The amount of big data available has exploded with recent innovations in satellites, sensors, mobile devices, call detail records, social media applications, and digital business records. Big data offers great potential for examining whether programmes and policies work, particularly in contexts where traditional methods of data collection are challenging. During pandemics, conflicts, and humanitarian emergency situations, data collection can be challenging or even impossible. This CEDIL Methods Brief takes a step-by-step, practical approach to guide researchers designing impact evaluations based on big data. This brief is based on the CEDIL Methods Working Paper on ‘Using big data for evaluating development outcomes: a systematic map’.
10

Francke, Angela, Sven Lißner, and Anke Juliane. Big Data im Radverkehr : Teil II. Technische Universität Dresden, September 2021. http://dx.doi.org/10.26128/2021.241.

Abstract:
Using available GPS-based cycling data is an inexpensive way for municipalities to gain an overview of how their cyclists use the network. The present results are intended to close a gap in the interpretation of GPS-based data. A cyclist typology based on stated behaviour can help to interpret GPS data more precisely, even without detailed knowledge of the underlying user groups. This will allow municipalities to make more effective use of emerging or existing sources of GPS cycling data and to align their cycling infrastructure with them more closely. In a first step, an empirically supported and scientifically derived multidimensional typology of cyclists was developed on the basis of a survey. Subsequently, a large, heterogeneous group of participants with varying sociodemographic characteristics was equipped with devices for recording their cycling routes. The cycling behaviour collected in this way was evaluated, supported by continuous accompanying surveys, and described using various indicators. This identified the preferences of individual groups, e.g. with regard to speed, route length, type of cycling infrastructure, trip purpose, or route choice. Based on an online survey, four different types of cyclists were described, which differ in frequency of use, distances travelled, riding behaviour, perceived safety, identification as a cyclist, weather dependence, and motivational aspects. Based on their differing profiles on these characteristics, they are referred to as the ambitious, the functional, the pragmatic, and the passionate cyclists.
In terms of travel behaviour, frequency of use increases from ambitious through passionate to pragmatic cyclists. Functional cyclists report by far the lowest bicycle use of all four types. In terms of the reported distances travelled, passionate, pragmatic, and functional cyclists lie close together. Ambitious cyclists, by contrast, reported covering considerably greater distances. The survey results appeared in attenuated form in a subsequent field study. The ambitious type in particular can be distinguished from the other types by higher daily kilometres, speeds, and accelerations. Among the other types the distinction is less pronounced. Here it emerged that membership of a particular age group, above all, influences riding behaviour. In line with previous findings, riding tends to become somewhat slower and steadier with increasing age. Likewise, women ride somewhat more slowly and steadily than male cyclists. The user survey showed small differences between the types in infrastructure preferences, e.g. among functional cyclists, who prefer separated facilities alongside the carriageway. This was also examined in the field study, and here too only small differences emerged. The results are also discussed in light of the possibility that the riding behaviour of the participating cyclists was altered by the study situation. In particular, a high frequency of use was observed that exceeded the values reported in the typology survey.
For the use of GPS data in cycling planning, the results suggest that scaling or weighting the data along sociodemographic factors offers the greatest potential.