Dissertations / Theses on the topic 'Statistical data science'



Consult the top 50 dissertations / theses for your research on the topic 'Statistical data science.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Alarcón, Soto Yovaninna. "Data science in HIV : statistical approaches for therapeutic HIV vaccine data." Doctoral thesis, Universitat Politècnica de Catalunya, 2021. http://hdl.handle.net/10803/672179.

Full text
Abstract:
The present dissertation contributes to Data Science in the Human Immunodeficiency Virus (HIV) field, addressing specific issues related to the modelling of data coming from three different clinical trials based on the development of HIV therapeutic vaccines. The biological questions that these studies raise are: identifying biomarkers that predict HIV viral rebound; explaining the time to viral rebound after antiretroviral therapy (cART) interruption while accounting for the variability of data sources; and finding the relationship between spot size and spot count in Enzyme-Linked Immunosorbent spot (ELISpot) assay data. To handle these problems from a statistical perspective, in this thesis we adapt the elastic-net penalization to the accelerated failure time model with interval-censored data, fit a mixed-effects Cox model with interval-censored data, and improve statistical methodologies to deal with ELISpot assay data and a binary response, respectively. In order to address variable selection among a vast number of predictors to explain the time to viral rebound, we consider an elastic-net penalization approach within the accelerated failure time model. Elastic-net regularization accounts for a possible correlation structure among covariates, which is the case for messenger RNA (mRNA) data. For this purpose, we derive the expression of the penalized log-likelihood function for the special case of an interval-censored response (time to viral rebound). We then maximize this function using distinct approaches and optimization methods. Finally, we apply these approaches to the Dendritic Cell-Based Vaccine clinical trial, and we discuss different numerical methods for the maximization of the log-likelihood. To explain the time to viral rebound in the context of another study with data from several clinical trials, we use a mixed-effects Cox model to account for the data heterogeneity. This model allows us to handle the heterogeneity between the Analytical Treatment Interruption (ATI) studies and the fact that patients had different numbers of ATI episodes. Our method uses a multiple imputation approach based on a truncated Weibull distribution to replace the interval-censored survival times with imputed ones. Our simulation studies show that our method has desirable properties in terms of accuracy and precision of the estimators of the fixed-effects parameters. Concerning the clinical results, the higher the pre-cART VL, the larger the instantaneous risk of a viral rebound. Our method could be applied to any data set that presents both interval-censored survival times and a grouped data structure that can be treated as a random effect. We finally address two different issues that arose when analyzing the BCN02 clinical trial. On one hand, we fit univariate log-binomial models as an alternative to the usual logistic regression. On the other hand, we use one/two-way unbalanced ANOVA to analyze the variability of the main outcomes from the ELISpot assays across time. Although these assays are widely used in HIV research, the relationship between spot size or spot count and other variables has not been studied until now. In this thesis, we propose, develop, and apply different statistical approaches that contribute to answering diverse clinical questions that are relevant in several clinical trials.
We have tried to highlight that, in order to choose the appropriate methodology, make correct clinical interpretations and contribute to meaningful scientific progress, a close collaboration with scientists is necessary. We expect that the original results from this thesis will contribute to the path of development and evaluation of a therapeutic HIV vaccine, helping to improve the quality of life of people living with HIV.
The present thesis contributes to data science by addressing biological problems relevant to the development of therapeutic vaccines for the Human Immunodeficiency Virus (HIV) through the modelling of data from three different clinical trials. Some of the questions raised in these studies and addressed in this thesis are: identifying biomarkers to study the risk factors of HIV viral rebound, explaining the time to viral rebound following the interruption of antiretroviral therapy (cART) while accounting for the variability of the data sources, and studying the relationship between the variables spot size and spot count in immunosorbent (ELISpot) assays. To address each of these questions from a statistical perspective, in this thesis we have adapted an elastic-net penalization for the accelerated failure time (AFT) model with interval-censored data, fitted a mixed-effects Cox model with interval-censored data, and improved existing statistical methodologies for handling ELISpot assay data and a binary response, respectively. First, we addressed the problem of having more than five thousand messenger RNAs (mRNAs) to explain the time to viral rebound. For this, we considered an elastic-net penalization approach within the accelerated failure time model. This regularization accounts for a possible correlation structure among the covariates, as is the case with mRNAs. To this end, we first derive the expression of the penalized likelihood function for an interval-censored response (time to viral rebound). Next, we maximize this function using different approaches and optimization methods. Finally, we apply these methods to the DCV2 clinical trial and discuss different numerical approaches for maximizing the likelihood. Second, to explain the time to viral rebound we propose fitting a mixed-effects Cox model. Since the time to viral rebound is interval-censored, we use multiple imputation based on a truncated Weibull distribution. This model allows us to control for the heterogeneity between the analytical treatment interruption (ATI) studies and the fact that patients have different numbers of ATI episodes. According to the simulation study we carried out, our method has desirable properties in terms of accuracy and precision of the estimators of the fixed-effects parameters. Finally, we address two different problems within the BCN02 clinical trial. On the one hand, we fit univariate log-binomial models as an alternative to classical logistic regression. On the other hand, we use an unbalanced ANOVA model to analyze the variability of the main ELISpot outcomes over time. Although ELISpot assays are often used in HIV research, the relationship between variables such as spot size, spot count and others had not been studied until now. In this thesis we have proposed and developed different statistical approaches that have answered biological questions raised in three clinical trials.
This work highlights the importance of close collaboration among the members of a multidisciplinary scientific team, in order to determine the appropriate methodology, make correct clinical interpretations of its results and, in this way, contribute to meaningful scientific progress. We hope that the original results of this thesis will contribute to the development and evaluation of a therapeutic HIV vaccine, which would notably help to improve the quality of life of people infected with HIV.
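To make the first contribution above concrete, the sketch below sets up a penalised log-likelihood for an interval-censored response under a Weibull accelerated failure time model with an elastic-net penalty. It is a minimal illustration on simulated data, not the author's implementation: the Weibull form, the smoothed absolute value (used so a generic quasi-Newton optimiser can be applied) and names such as neg_penalized_loglik are all assumptions of the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulated covariates and interval-censored times from a Weibull AFT model.
n, p = 200, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.0, 0.0, 0.3])
sigma_true = 0.5
T = np.exp(X @ beta_true + sigma_true * np.log(rng.exponential(size=n)))  # Weibull event times
L, R = np.floor(T), np.floor(T) + 1.0          # times observed only as intervals (L, R]

def surv(t, X, beta, log_sigma):
    """Weibull AFT survival function S(t | x)."""
    sigma = np.exp(log_sigma)
    return np.exp(-np.exp((np.log(np.maximum(t, 1e-12)) - X @ beta) / sigma))

def neg_penalized_loglik(params, lam=0.1, alpha=0.5, eps=1e-6):
    beta, log_sigma = params[:p], params[p]
    # Interval-censored contribution: P(L < T <= R) = S(L) - S(R).
    lik = np.maximum(surv(L, X, beta, log_sigma) - surv(R, X, beta, log_sigma), 1e-300)
    # Elastic net; the L1 term is smoothed as sqrt(b^2 + eps) so BFGS can be used.
    pen = lam * (alpha * np.sum(np.sqrt(beta**2 + eps)) + 0.5 * (1 - alpha) * np.sum(beta**2))
    return -np.sum(np.log(lik)) + n * pen

fit = minimize(neg_penalized_loglik, x0=np.zeros(p + 1), method="BFGS")
print("estimated beta:", np.round(fit.x[:p], 2))
```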
2

Bruno, Rexanne Marie. "Statistical Analysis of Survival Data." UNF Digital Commons, 1994. http://digitalcommons.unf.edu/etd/150.

Full text
Abstract:
The terminology and ideas involved in the statistical analysis of survival data are explained, including the survival function, the probability density function, the hazard function, censored observations, parametric and nonparametric estimation of these functions, the product-limit estimation of the survival function, and the proportional hazards estimation of the hazard function with explanatory variables. In Appendix A these ideas are applied to the actual analysis of the survival data for 54 cervical cancer patients.
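As a concrete companion to the product-limit (Kaplan-Meier) estimator mentioned in the abstract, here is a minimal sketch with a small made-up right-censored sample; it is not tied to the cervical cancer data analysed in the thesis.

```python
import numpy as np

def kaplan_meier(time, event):
    """Product-limit estimate of S(t) from right-censored data.

    time  : observed times (event or censoring)
    event : 1 if the event was observed, 0 if censored
    """
    time, event = np.asarray(time, float), np.asarray(event, int)
    t_sorted = np.unique(time[event == 1])           # distinct event times
    surv, s = [], 1.0
    for t in t_sorted:
        at_risk = np.sum(time >= t)                   # still under observation just before t
        deaths = np.sum((time == t) & (event == 1))
        s *= 1.0 - deaths / at_risk                   # product-limit update
        surv.append(s)
    return t_sorted, np.array(surv)

# Small worked example (times in months; 0 marks a censored observation).
t, s = kaplan_meier([6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1])
for ti, si in zip(t, s):
    print(f"S({ti:g}) = {si:.3f}")
```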
3

Ramaboa, Kutlwano K. K. M. "A comparative evaluation of data mining classification techniques on medical trauma data." Master's thesis, University of Cape Town, 2004. http://hdl.handle.net/11427/5973.

Full text
Abstract:
Includes bibliographical references (leaves 109-113).
The purpose of this research was to determine the extent to which a selection of data mining classification techniques (specifically, Discriminant Analysis, Decision Trees, and three artificial neural network models - Backpropagation, Probabilistic Neural Networks, and the Radial Basis Function) are able to correctly classify cases into the different categories of an outcome measure from a given set of input variables (i.e. estimate their classification accuracy) on a common database.
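A sketch of the kind of comparison described, using scikit-learn on a stand-in public data set rather than the medical trauma database: discriminant analysis, a decision tree and a multilayer-perceptron neural network are scored by cross-validated classification accuracy. The specific models and data set are illustrative assumptions, not the thesis setup.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# Stand-in data set; the thesis uses a medical trauma database instead.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "Discriminant analysis": LinearDiscriminantAnalysis(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Neural network (MLP)": make_pipeline(StandardScaler(),
                                          MLPClassifier(max_iter=2000, random_state=0)),
}

for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name:25s} mean accuracy = {acc.mean():.3f}")
```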
4

Yuan, Yinyin. "Statistical inference from large-scale genomic data." Thesis, University of Warwick, 2009. http://wrap.warwick.ac.uk/1066/.

Full text
Abstract:
This thesis explores the potential of statistical inference methodologies in their applications in functional genomics. In essence, it summarises algorithmic findings in this field, providing step-by-step analytical methodologies for deciphering biological knowledge from large-scale genomic data, mainly microarray gene expression time series. This thesis covers a range of topics in the investigation of complex multivariate genomic data. One focus involves using clustering as a method of inference and another is cluster validation to extract meaningful biological information from the data. Information gained from the application of these various techniques can then be used conjointly in the elucidation of gene regulatory networks, the ultimate goal of this type of analysis. First, a new tight clustering method for gene expression data is proposed to obtain tighter and potentially more informative gene clusters. Next, to fully utilise biological knowledge in clustering validation, a validity index is defined based on one of the most important ontologies within the Bioinformatics community, Gene Ontology. The method bridges a gap in current literature, in the sense that it takes into account not only the variations of Gene Ontology categories in biological specificities and their significance to the gene clusters, but also the complex structure of the Gene Ontology. Finally, Bayesian probability is applied to making inference from heterogeneous genomic data, integrated with previous efforts in this thesis, for the aim of large-scale gene network inference. The proposed system comes with a stochastic process to achieve robustness to noise, yet remains efficient enough for large-scale analysis. Ultimately, the solutions presented in this thesis serve as building blocks of an intelligent system for interpreting large-scale genomic data and understanding the functional organisation of the genome.
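The thesis proposes its own tight clustering method; the sketch below only illustrates the generic resampling idea behind "tightness": genes whose pairwise co-clustering frequency across subsamples is high form tight, reproducible clusters. The use of k-means, the subsample fraction and the helper name comembership_stability are assumptions of the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def comembership_stability(X, k=4, n_resamples=30, frac=0.7, seed=0):
    """Average co-clustering frequency of point pairs across random subsamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    co, counts = np.zeros((n, n)), np.zeros((n, n))
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        km = KMeans(n_clusters=k, n_init=10, random_state=int(rng.integers(1_000_000)))
        labels = km.fit_predict(X[idx])
        same = labels[:, None] == labels[None, :]     # co-membership on this subsample
        co[np.ix_(idx, idx)] += same
        counts[np.ix_(idx, idx)] += 1
    return np.divide(co, counts, out=np.zeros_like(co), where=counts > 0)

# Toy expression-like data: three well-separated gene groups plus scattered noise genes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(30, 10)) for m in (-2, 0, 2)]
              + [rng.normal(0, 3, size=(20, 10))])
stab = comembership_stability(X)
print("mean pairwise stability:", round(stab[np.triu_indices_from(stab, 1)].mean(), 3))
```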
5

Guo, Danni. "Contributions to spatial uncertainty modelling in GIS : small sample data." Doctoral thesis, University of Cape Town, 2007. http://hdl.handle.net/11427/19031.

Full text
Abstract:
Includes bibliographical references.
Environmental data are very costly and difficult to collect and are often vague (subjective) or imprecise in nature (e.g. hazard levels of pollutants are classified as "harmful for human beings"). These realities in practice (fuzziness and small datasets) lead to uncertainty, which is addressed by my research objective: "To model spatial environmental data with fuzzy uncertainty, and to explore the use of small sample data in spatial modelling predictions, within Geographic Information Systems (GIS)." The methodologies underlying the theoretical foundations for spatial modelling are examined, such as geostatistics, fuzzy mathematics, Grey System Theory, and (V,·) Credibility Measure Theory. Fifteen papers, including three journal papers, were written in contribution to the developments of spatial fuzzy and grey uncertainty modelling, to which I contributed a portion of 50 to 65%. The methods and theories have been merged together in these papers, and they are applied to two datasets, PM10 air pollution data and soil dioxin data. The papers can be classified into two broad categories: fuzzy spatial GIS modelling and grey spatial GIS modelling. In fuzzy spatial GIS modelling, the fuzzy uncertainty (Zadeh, 1965) in environmental data is addressed. The thesis develops a fuzzy membership grades kriging approach by converting fuzzy subsets spatial modelling into membership grade spatial modelling. As this method develops, fuzzy membership grades kriging is placed on the foundation of credibility measure theory, and approaches a fully data-assimilated membership function in terms of the maximum fuzzy entropy principle. The variable modelling method for dealing with fuzzy data is a unique contribution to the fuzzy spatial GIS modelling literature. In grey spatial GIS modelling, spatial prediction using small sample data is addressed. The thesis develops a Grey GIS modelling approach, in which two-dimensional, order-less spatial observations are converted into two one-dimensional ordered data sequences. The thesis papers also explore foundational problems within the grey differential equation models (Deng, 1985). It is discovered that the coupling feature of grey differential equations, together with the help of an e-similarity measure, generalises the classical GM(1,1) model into more classes of extended GM(1,1) models, in order to fully assimilate the sample data information. The development of grey spatial GIS modelling is a creative contribution to handling small sample data.
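For readers unfamiliar with grey modelling, the sketch below implements the classical GM(1,1) model that the thesis extends: an accumulated generating operation, a least-squares fit of the grey coefficients, and prediction back on the original scale. The toy data sequence is invented; the thesis's extended GM(1,1) classes are not reproduced here.

```python
import numpy as np

def gm11(x0, n_forecast=3):
    """Classical GM(1,1) grey model for a short, positive data sequence."""
    x0 = np.asarray(x0, float)
    x1 = np.cumsum(x0)                                  # accumulated generating operation (AGO)
    z1 = 0.5 * (x1[1:] + x1[:-1])                       # background values
    B = np.column_stack([-z1, np.ones_like(z1)])
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]    # grey development and input coefficients
    k = np.arange(len(x0) + n_forecast)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a
    return np.diff(x1_hat, prepend=x1_hat[0] - x0[0])   # inverse AGO: back to the original scale

# Small sample (e.g. yearly pollutant measurements); forecasts are appended at the end.
print(np.round(gm11([2.87, 3.28, 3.34, 3.57, 3.72]), 3))
```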
6

Smith, Jeremy Stewart. "A statistical approach to automated detection of multi-component radio sources." Master's thesis, Faculty of Science, 2021. http://hdl.handle.net/11427/32986.

Full text
Abstract:
Advances in radio astronomy are allowing for deeper and wider areas of the sky to be observed than ever before. Source counts of future radio surveys are expected to number in the tens of millions. Source finding techniques are used to identify sources in a radio image, however, these techniques identify single distinct sources and are challenged to identify multi-component sources, that is to say, where two or more distinct sources belong to the same underlying physical phenomenon, such as a radio galaxy. Identification of such phenomena is an important step in generating catalogues from surveys on which much of the radio astronomy science is based. Historically, identifying multi-component sources was conducted by visual inspection, however, the size of future surveys makes manual identification prohibitive. An algorithm to automate this process using statistical techniques is proposed. The algorithm is demonstrated on two radio images. The output of the algorithm is a catalogue where nearest neighbour source pairs are assigned a probability score of being a component of the same physical object. By applying several selection criteria, pairs of sources which are likely to be multi-component sources can be determined. Radio image cutouts are then generated from this selection and may be used as input into radio source classification techniques. Successful identification of multi-component sources using this method is demonstrated.
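A rough sketch of the mechanical part of such an algorithm: forming nearest-neighbour source pairs with a k-d tree and attaching a score to each pair. The exponential, separation-based score used here is purely an illustrative placeholder; the thesis derives its pair probabilities statistically and that form is not reproduced.

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy catalogue of (RA, Dec)-like positions; a real catalogue would come from a
# source finder run on the radio image.
rng = np.random.default_rng(2)
positions = rng.uniform(0, 1, size=(500, 2))

tree = cKDTree(positions)
dist, idx = tree.query(positions, k=2)        # k=2 returns each source plus its nearest neighbour
sep = dist[:, 1]                              # separation to the nearest neighbour

# Placeholder probability score: pairs much closer than the typical separation
# score near 1. This exponential form is an assumption for illustration only.
score = np.exp(-(sep / np.median(sep)) ** 2)

pairs = np.column_stack([np.arange(len(positions)), idx[:, 1]])
likely = pairs[score > 0.9]
print(f"{len(likely)} candidate multi-component pairs out of {len(positions)} sources")
```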
7

Rao, Ashwani Pratap. "Statistical information retrieval models| Experiments, evaluation on real time data." Thesis, University of Delaware, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1567821.

Full text
Abstract:

We are all aware of the rise of the information age: heterogeneous sources of information and the ability to publish rapidly and indiscriminately are responsible for information chaos. In this work, we are interested in a system which can separate the "wheat" of vital information from the chaff within this information chaos. An efficient filtering system can accelerate meaningful utilization of knowledge. Consider Wikipedia, an example of community-driven knowledge synthesis. Facts about topics on Wikipedia are continuously being updated by users interested in a particular topic. Consider an automatic system (or an invisible robot) to which a topic such as "President of the United States" can be fed. This system will work ceaselessly, filtering new information created on the web in order to provide the small set of documents about the "President of the United States" that are vital to keeping the Wikipedia page relevant and up-to-date. In this work, we present an automatic information filtering system for this task. While building such a system, we have encountered issues related to scalability, retrieval algorithms, and system evaluation; we describe our efforts to understand and overcome these issues.
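A minimal sketch of topic-based filtering in the spirit described above: incoming documents are ranked by cosine similarity between their TF-IDF vectors and a topic profile. The toy strings and the choice of TF-IDF are assumptions for illustration; the thesis's retrieval models and evaluation are considerably richer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Topic profile (e.g. text of the current Wikipedia article) and an incoming
# stream of documents to be filtered. The strings here are toy stand-ins.
topic = "The President of the United States is the head of state and head of government."
stream = [
    "The president signed an executive order at the White House today.",
    "A new species of frog was discovered in the Amazon rainforest.",
    "Congress debated the president's proposed budget for the United States.",
]

vec = TfidfVectorizer(stop_words="english")
mat = vec.fit_transform([topic] + stream)
scores = cosine_similarity(mat[0], mat[1:]).ravel()

# Documents most similar to the topic profile are the candidates worth keeping.
for doc, s in sorted(zip(stream, scores), key=lambda pair: -pair[1]):
    print(f"{s:.2f}  {doc}")
```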

8

Dikkala, Sai Nishanth. "Statistical inference from dependent data : networks and Markov chains." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/127016.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May, 2020
Cataloged from the official PDF of thesis.
Includes bibliographical references (pages 259-270).
In recent decades, the study of high-dimensional probability has taken centerstage within many research communities including Computer Science, Statistics and Machine Learning. Very often, due to the process according to which data is collected, the samples in a dataset have implicit correlations amongst them. Such correlations are commonly ignored as a first approximation when trying to analyze statistical and computational aspects of an inference task. In this thesis, we explore how to model such dependences between samples using structured high-dimensional distributions which result from imposing a Markovian property on the joint distribution of the data, namely Markov Random Fields (MRFs) and Markov chains. On MRFs, we explore a quantification for the amount of dependence and we strengthen previously known measure concentration results under a certain weak dependence condition on an MRF called the high-temperature regime. We then go on to apply our novel measure concentration bounds to improve the accuracy of samples computed according to a certain Markov Chain Monte Carlo procedure. We then show how to extend some classical results from statistical learning theory on PAC-learnability and uniform convergence to training data which is dependent under the high temperature condition. Then, we explore the task of regression on data which is dependent according to an MRF under a stronger amount of dependence than is allowed by the high-temperature condition. We then shift our focus to Markov chains where we explore the question of testing whether a certain trajectory we observe corresponds to a chain P or not. We discuss what is a reasonable formulation of this problem and provide a tester which works without observing a trajectory whose length contains multiplicative factors of the mixing or covering time of the chain P. We finally conclude with some broad directions for further research on statistical inference under data dependence.
by Sai Nishanth Dikkala.
Ph. D.
Ph.D. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
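To give a concrete picture of the kind of dependent data considered, the sketch below draws a sample from an Ising-type Markov random field with a Gibbs (Glauber) sampler; a small coupling parameter corresponds to the weak-dependence, high-temperature regime discussed in the thesis. The grid size and coupling value are arbitrary choices for illustration.

```python
import numpy as np

def glauber_ising(n=30, beta=0.2, sweeps=200, seed=0):
    """Gibbs (Glauber) sampling of an Ising MRF on an n x n torus.

    A small beta corresponds to the weak-dependence / high-temperature regime.
    """
    rng = np.random.default_rng(seed)
    s = rng.choice([-1, 1], size=(n, n))
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                nb = (s[(i - 1) % n, j] + s[(i + 1) % n, j]
                      + s[i, (j - 1) % n] + s[i, (j + 1) % n])
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * nb))   # P(s_ij = +1 | neighbours)
                s[i, j] = 1 if rng.random() < p_plus else -1
    return s

sample = glauber_ising()
print("mean spin of the weakly dependent sample:", sample.mean().round(3))
```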
9

Hennessey, Anthony. "Statistical shape analysis of large molecular data sets." Thesis, University of Nottingham, 2018. http://eprints.nottingham.ac.uk/52088/.

Full text
Abstract:
Protein classification databases are widely used in the prediction of protein structure and function, and amongst these databases the manually-curated Structural Classification of Proteins database (SCOP) is considered to be a gold standard. In SCOP, functional relationships are described by hyperfamily and superfamily categories and structural relationships are described by family, species and protein categories. We present a method to calculate a difference measure between pairs of proteins that can be used to reproduce SCOP2 structural relationship classifications, and that can also be used to reproduce a subset of functional relationship classifications at the superfamily level. Calculating the difference measure requires first finding the best correspondence between atoms in two protein configurations. The problem of finding the best correspondence is known as the unlabelled, partial matching problem. We consider the unlabelled, partial matching problem through a detailed analysis of the approach presented in Green and Mardia (2006). Using this analysis, and applying domain-specific constraints, we develop a new algorithm called GProtA for protein structure alignment. The proposed difference measure is constructed from the root mean squared deviation of the aligned protein structures and a binary similarity measure, where the binary similarity measure takes into account the proportions of atoms matching from each configuration. The GProtA algorithm and difference measure are applied to protein structure data taken from the Protein Data Bank. The difference measure is shown to correctly classify 62 of a set of 72 proteins into the correct SCOP family categories when clustered. Of the remaining 9 proteins, 2 are assigned incorrectly and 7 are considered indeterminate. In addition, a method for deriving characteristic signatures for categories is proposed. The signatures offer a mechanism by which a single comparison can be made to judge similarity to a particular category. Comparison using characteristic signatures is shown to correctly delineate proteins at the family level, including the identification of both families for a subset of proteins described by two family level categories.
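The thesis's GProtA algorithm solves the harder unlabelled, partial matching problem; the sketch below shows only the simpler labelled step that underlies the difference measure, namely the root mean squared deviation after optimal superposition (the Kabsch algorithm) for two configurations whose correspondence is known. The toy structures are simulated.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two point sets with known correspondence, after optimal
    rotation and translation (Kabsch superposition)."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against an improper rotation
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return float(np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1))))

# Toy example: a rotated, translated and slightly perturbed copy of a structure.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
B = A @ rot.T + np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=A.shape)
print("RMSD after superposition:", round(kabsch_rmsd(A, B), 3))
```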
10

Chaudhuri, Abon. "Geometric and Statistical Summaries for Big Data Visualization." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1382235351.

Full text
11

Chavali, Krishna Kumar. "Integration of statistical and neural network method for data analysis." Morgantown, W. Va. : [West Virginia University Libraries], 2006. https://eidr.wvu.edu/etd/documentdata.eTD?documentid=4749.

Full text
Abstract:
Thesis (M.S.)--West Virginia University, 2006.
Title from document title page. Document formatted into pages; contains viii, 68 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 50-51).
12

Lai, Ian 1980. "A Web-based tutorial for statistical analysis of fMRI data." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/29669.

Full text
Abstract:
Thesis (M.Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.
Includes bibliographical references (p. 59-63).
A dearth of educational material exists for functional magnetic resonance imaging (fMRI), a relatively new tool used in neuroscience research. A computer demonstration for understanding statistical analysis in fMRI was developed in Matlab, along with an accompanying tutorial for its users. The demo makes use of Dview, an existing software package for viewing 3D brain data, and utilizes precomputed data to improve interactivity. The demo and client were used in an HST graduate course in methods for acquisition and analysis of fMRI data. For wider accessibility, a Web-based version of the demo was designed with a client/server architecture. The Java client has a layered design for flexibility, and the Matlab server interfaces with Dview to take advantage of its functionality. The client and server communicate via a simple protocol through the Matlab Web Server. The Web-based version of the demo was implemented successfully. Future work includes implementation of additional demo features and expansion of the tutorial before dissemination to a wider group of medical and neuroscience researchers.
by Ian Lai.
M.Eng.and S.B.
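The demo itself is written in Matlab and uses Dview, but the statistical step it teaches can be sketched generically: a voxel-wise general linear model with a task regressor built by convolving the stimulus with a haemodynamic response function, followed by a t-test on the task coefficient. The HRF shape, design and noise below are illustrative assumptions, not the course materials.

```python
import numpy as np
from scipy.stats import gamma, t as t_dist

n_scans, tr = 120, 2.0
times = np.arange(n_scans) * tr

# Block design (20 s task / 20 s rest) convolved with a simple gamma-based HRF.
boxcar = ((times // 20) % 2 == 1).astype(float)
hrf_t = np.arange(0, 30, tr)
hrf = gamma.pdf(hrf_t, a=6) - 0.35 * gamma.pdf(hrf_t, a=12)   # canonical-like shape (assumed)
regressor = np.convolve(boxcar, hrf)[:n_scans]

# Simulated voxel time series = task effect + slow drift + noise.
rng = np.random.default_rng(0)
y = 2.0 * regressor + 0.01 * times + rng.normal(scale=1.0, size=n_scans)

# GLM: y = X b + e, with a t-statistic for the task regressor.
X = np.column_stack([regressor, times, np.ones(n_scans)])
b, res, *_ = np.linalg.lstsq(X, y, rcond=None)
dof = n_scans - X.shape[1]
sigma2 = res[0] / dof
c = np.array([1.0, 0.0, 0.0])
t_stat = c @ b / np.sqrt(sigma2 * c @ np.linalg.inv(X.T @ X) @ c)
print(f"task effect = {b[0]:.2f}, t = {t_stat:.1f}, p = {2 * t_dist.sf(abs(t_stat), dof):.2g}")
```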
13

Hong, Xinting. "INTEGRATED DATA INTEGRATION AND STATISTICAL ANALYSIS PLATFORM FOR MULTI-CENTER EPILEPSY RESEARCH." Case Western Reserve University School of Graduate Studies / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=case1562864784609067.

Full text
14

Malherbe, Chanel. "Fourier method for the measurement of univariate and multivariate volatility in the presence of high frequency data." Master's thesis, University of Cape Town, 2007. http://hdl.handle.net/11427/4386.

Full text
15

Matteusson, Theodor, and Niclas Persson. "Statistical Modelling of Plug-In Hybrid Fuel Consumption : A study using data science methods on test fleet driving data." Thesis, Umeå universitet, Institutionen för matematik och matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-172812.

Full text
Abstract:
The automotive industry is undertaking major technological steps in an effort to reduce emissions and fight climate change. To reduce the reliance on fossil fuels, a lot of research is invested into electric motors (EM) and their applications. One such application is plug-in hybrid electric vehicles (PHEV), in which internal combustion engines (ICE) and EM are used in combination, and take turns to propel the vehicle based on driving conditions. The main optimization problem of PHEV is to decide when to use which motor. If this optimization is done with respect to emissions, the entire electric charge should be used up before the end of the trip. But if the charge is used up too early, latter driving segments for which the optimal choice would have been to use the EM will have to be done using the ICE. To address this optimization problem, we studied the fuel consumption during different driving conditions. These driving conditions are characterized by hundreds of sensors which collect data about the state of the vehicle continuously when driving. From these data, we constructed 150-second segments, including e.g. vehicle speed, before new descriptive features were engineered for each segment, e.g. max vehicle speed. By using the characteristics of typical driving conditions specified by the Worldwide Harmonized Light Vehicles Test Cycle (WLTC), segments were labelled as highway or city road segments. To reduce the dimensions without losing information, principal component analysis was conducted, and a Gaussian mixture model was used to uncover hidden structures in the data. Three machine learning regression models were trained and tested: a linear mixed model, a kernel ridge regression model with a linear kernel function, and lastly a kernel ridge regression model with an RBF kernel function. By splitting the data into a training set and a test set, the models were evaluated on data which they had not been trained on. The model performance and explanation rate obtained for each model, such as R2, Mean Absolute Error and Mean Squared Error, were compared to find the best model. The study shows that the fuel consumption can be modelled by the sensor data of a PHEV test fleet, where 6 features contribute to an explanation ratio of 0.5 and thus have the highest impact on the fuel consumption. One needs to keep in mind that the data were collected during the Covid-19 outbreak, when travel patterns were not considered to be normal. No regression model can explain the real world better than what the underlying data does.
The automotive industry is taking major technological steps to reduce emissions and fight climate change. To reduce the reliance on fossil fuels, a great deal of research is being invested in electric motors (EM) and their applications. One such application is plug-in hybrid electric vehicles (PHEV), in which internal combustion engines (ICE) and EM are used in combination and take turns propelling the vehicle based on the prevailing driving conditions. The main optimization problem for PHEVs is deciding when to use which motor. If this optimization is done with respect to emissions, the entire electric charge should be used up before the trip ends. But if the charge is used up too early, later parts of the trip, for which the optimal choice would have been to use the EM, must be driven with the ICE. To address this optimization problem, we studied fuel consumption under different driving conditions. These driving conditions are characterized by hundreds of sensors that continuously collect data about the state of the vehicle while driving. From these data we constructed 150-second segments, including for example vehicle speed, before new descriptive features were engineered for each segment, for example maximum vehicle speed. Using the characteristics of typical driving conditions specified by the Worldwide Harmonized Light Vehicles Test Cycle (WLTC), segments were labelled as highway or city road segments. To reduce the dimensionality of the data without losing information, principal component analysis was used, and a Gaussian mixture model was used to uncover hidden structures in the data. Three machine learning regression models were built and tested: a linear mixed model, a kernel ridge regression model with a linear kernel function and, finally, a kernel ridge regression model with an RBF kernel function. By splitting the data into a training set and a test set, the three models were evaluated on data they had not been trained on. R2, Mean Absolute Error and Mean Squared Error were used to evaluate each model and its explanation rate. The study shows that fuel consumption can be modelled from the sensor data of a PHEV test fleet, where 6 features have an explanation rate of 0.5 and thus have the greatest influence on fuel consumption. One must keep in mind that all data were collected during the Covid-19 outbreak, when travel patterns were not considered normal, and that no regression model can explain the real world better than the underlying data do.
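A compact scikit-learn sketch of the modelling pipeline described: PCA for dimension reduction, a Gaussian mixture to uncover hidden segment structure, and kernel ridge regression with an RBF kernel evaluated on held-out data. The synthetic features stand in for the per-segment driving features; the actual test-fleet sensors and the linear mixed model are not reproduced.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic stand-in for per-segment driving features (speed statistics, accelerations, ...).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = 3 + X[:, 0] - 0.5 * X[:, 1] ** 2 + 0.3 * X[:, 2] * X[:, 3] + rng.normal(scale=0.5, size=600)

X_pca = PCA(n_components=6).fit_transform(X)                  # dimension reduction
segment_type = GaussianMixture(n_components=2, random_state=0).fit_predict(X_pca)
print("segments per mixture component:", np.bincount(segment_type))

X_tr, X_te, y_tr, y_te = train_test_split(X_pca, y, test_size=0.3, random_state=0)
model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1).fit(X_tr, y_tr)
pred = model.predict(X_te)
print(f"R2 = {r2_score(y_te, pred):.2f}, MAE = {mean_absolute_error(y_te, pred):.2f}")
```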
16

Blocker, Alexander Weaver. "Distributed and Multiphase Inference in Theory and Practice: Principles, Modeling, and Computation for High-Throughput Science." Thesis, Harvard University, 2013. http://dissertations.umi.com/gsas.harvard:10977.

Full text
Abstract:
The rise of high-throughput scientific experimentation and data collection has introduced new classes of statistical and computational challenges. The technologies driving this data explosion are subject to complex new forms of measurement error, requiring sophisticated statistical approaches. Simultaneously, statistical computing must adapt to larger volumes of data and new computational environments, particularly parallel and distributed settings. This dissertation presents several computational and theoretical contributions to these challenges. In chapter 1, we consider the problem of estimating the genome-wide distribution of nucleosome positions from paired-end sequencing data. We develop a modeling approach based on nonparametric templates that controls for variability due to enzymatic digestion. We use this to construct a calibrated Bayesian method to detect local concentrations of nucleosome positions. Inference is carried out via a distributed HMC algorithm that scales linearly in complexity with the length of the genome being analyzed. We provide MPI-based implementations of the proposed methods, stand-alone and on Amazon EC2, which can provide inferences on an entire S. cerevisiae genome in less than 1 hour on EC2. We then present a method for absolute quantitation from LC-MS/MS proteomics experiments in chapter 2. We present a Bayesian model for the non-ignorable missing data mechanism induced by this technology, which includes an unusual combination of censoring and truncation. We provide a scalable MCMC sampler for inference in this setting, enabling full-proteome analyses using cluster computing environments. A set of simulation studies and actual experiments demonstrate this approach's validity and utility. We close in chapter 3 by proposing a theoretical framework for the analysis of preprocessing under the banner of multiphase inference. Preprocessing forms an oft-neglected foundation for a wide range of statistical and scientific analyses. We provide some initial theoretical foundations for this area, including distributed preprocessing, building upon previous work in multiple imputation. We demonstrate that multiphase inferences can, in some cases, even surpass standard single-phase estimators in efficiency and robustness. Our work suggests several paths for further research into the statistical principles underlying preprocessing.
Statistics
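The inference engine mentioned in chapter 1 is a distributed Hamiltonian Monte Carlo sampler; a single-chain toy version conveys the core mechanics. This is a generic HMC sketch on a simple target density, assuming a user-supplied log-density gradient, and is unrelated to the MPI and EC2 implementations described in the dissertation.

```python
import numpy as np

def hmc_sample(logp_grad, x0, n_samples=2000, step=0.1, n_leapfrog=20, seed=0):
    """Minimal single-chain Hamiltonian Monte Carlo for a differentiable log-density.

    logp_grad(x) must return (log p(x), grad log p(x)).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, float)
    logp, grad = logp_grad(x)
    samples = []
    for _ in range(n_samples):
        p = rng.normal(size=x.shape)                       # resample momentum
        x_new, p_new, grad_new = x.copy(), p.copy(), grad.copy()
        p_new += 0.5 * step * grad_new                     # leapfrog integration
        for i in range(n_leapfrog):
            x_new += step * p_new
            logp_new, grad_new = logp_grad(x_new)
            if i < n_leapfrog - 1:
                p_new += step * grad_new
        p_new += 0.5 * step * grad_new
        # Metropolis accept/reject on the joint (position, momentum) energy.
        log_accept = (logp_new - 0.5 * p_new @ p_new) - (logp - 0.5 * p @ p)
        if np.log(rng.random()) < log_accept:
            x, logp, grad = x_new, logp_new, grad_new
        samples.append(x.copy())
    return np.array(samples)

# Toy target: a standard bivariate normal distribution.
draws = hmc_sample(lambda x: (-0.5 * x @ x, -x), x0=np.zeros(2))
print("sample mean:", draws.mean(axis=0).round(2), " sample variance:", draws.var(axis=0).round(2))
```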
17

Chatora, Tinashe. "Joint models for nonlinear longitudinal profiles in the presence of informative censoring." Doctoral thesis, University of Cape Town, 2018. http://hdl.handle.net/11427/29564.

Full text
Abstract:
Malaria is the parasitic disease which affects the most humans, with Plasmodium falciparum malaria being responsible for the majority of severe malaria and malaria-related deaths. The asexual form of the parasite causes the signs and symptoms associated with malaria infection. The sexual form of the parasite, also known as a gametocyte, is the stage responsible for infectivity of the human host (patient) to the mosquito vector, and thus ongoing transmission of malaria and the spread of antimalarial drug resistance. Historically, malaria therapeutic efficacy studies have focused mainly on the clearance of asexual parasites. However, malaria in a community can only be truly combated if a treatment program is implemented which is able to clear both asexual and sexual parasites effectively. In this thesis the focus is on the modeling of the key features of gametocytemia. Particular emphasis is on the modeling of the time to gametocyte emergence, the density of gametocytes and the duration of gametocytemia. It is also of interest to investigate the impact of the administered treatment on the aforementioned features. Gametocyte data have several interesting features. Firstly, the distribution of gametocyte data is zero-inflated with a long tail to the right. The observed longitudinal gametocyte profile also has a nonlinear relationship with time. In addition, since most malaria intervention studies are not designed to optimally measure the evolution of the longitudinal gametocyte profile, there are very few observation points in the time period where the gametocyte profile is expected to peak. Gametocyte data collected from malaria intervention studies are also affected by informative censoring, which leads to incomplete gametocyte profiles. An example of informative censoring is when a patient who experiences treatment failure is "rescued", and withdrawn, from the study in order to receive alternative treatment. This patient can be considered to be in worse health compared to the patients who remain in the study. There are also competing risks of exit from the study, as a patient can either experience treatment failure or be lost to follow-up. The above-mentioned features of gametocyte data make it a statistically appealing dataset to analyze. In the literature there are several modeling techniques which can be used to analyze individual features of the data. These techniques include standard survival models for modeling the time to gametocyte emergence and the duration of gametocytemia. The longitudinal nonlinear gametocyte profile would typically be modeled using nonlinear mixed-effects models. These nonlinear models could then be extended to accommodate the zero-inflation in the data, by changing the underlying assumption about the distribution of the response variable. However, it is important to note that these standard techniques do not account for informative censoring. Failure to account for informative censoring leads to bias in parameter estimates. Joint modeling techniques can be used to account for informative censoring. The joint models applied in this thesis combine the longitudinal nonlinear gametocyte densities and the time to censoring due to either loss to follow-up or treatment failure. The data analyzed in this thesis were collected from a series of clinical trials conducted between 2002 and 2004 in Mozambique and the Mpumalanga province of South Africa.
These trials were a part of the South East African Combination Antimalarial Therapy (SEACAT) evaluation of the phased introduction of combination antimalarial therapy, nested in the Lubombo Spatial Development Initiative. The aim of these studies was primarily to measure the efficacy of sulfadoxine-pyrimethamine (SP) and a combination of artesunate and sulfadoxine-pyrimethamine (ACT) in eliminating asexual parasites in patients. The patients enrolled in the study had uncomplicated malaria, at a time of increasing resistance to sulfadoxine-pyrimethamine (SP) treatment. Blood samples were taken from patients during the course of 6 weeks on days 0, 1, 2, 3, 7, 14, 21, 28 and 42. Analysis of these blood samples provided longitudinal measurements for asexual parasite densities, gametocyte densities, sulfadoxine drug concentrations and pyrimethamine drug concentrations. The gametocyte data collected in this study were initially analyzed using standard survival modeling techniques. Non-parametric Cox regression models and parametric survival models were applied to the data as part of this initial investigation. These models were used to investigate the factors which affected the time to gametocyte emergence. Subsequently, using the subset of the population which experienced gametocytemia, accelerated failure time models were applied to investigate the factors which affected the duration of gametocytemia. It is evident that the findings from the aforementioned duration investigation would only be able to provide valid duration estimates for patients who were detected to have gametocytemia. This work was extended to allow for population-level duration estimates by incorporating the prevalence of gametocytemia into the estimation of duration, for generic patients with specific covariate patterns. The prevalence of gametocytemia was modeled using an underlying binomial distribution. The delta method was subsequently used to derive confidence intervals for the population-level duration estimates which were associated with specific covariate patterns. An investigation into the factors affecting the early withdrawal of patients from the study was also conducted. Early exit from the study arose either through loss to follow-up (LTFU) or through treatment failure. The longitudinal gametocyte profile was modeled using joint modeling techniques. The resulting joint model used shared random effects to combine a Weibull survival model, describing the cause-specific hazards of patient exit from the study, with a nonlinear zero-adjusted gamma mixed-effects model for the longitudinal gametocyte profile. This model was used to impute the incomplete gametocyte profiles, after adjusting for informative censoring. These imputed profiles were then used to estimate the duration of gametocytemia. It was found, in this thesis, that treatment had a very strong effect on the hazard of gametocyte emergence, the density of gametocytes and the duration of gametocytemia. Patients who received a combination of sulfadoxine-pyrimethamine and artesunate were found to have significantly lower hazards of gametocyte emergence, lower predicted durations of gametocytemia and lower predicted longitudinal gametocyte densities compared to patients who received sulfadoxine-pyrimethamine treatment only.
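One ingredient described above, multiple imputation of interval-censored times from a truncated Weibull distribution, can be sketched directly with inverse-CDF sampling. The Weibull parameters below are taken as given, whereas the thesis estimates them from the data; the intervals are invented.

```python
import numpy as np

def impute_interval_censored(L, R, shape, scale, n_imputations=5, seed=0):
    """Draw imputed event times from a Weibull(shape, scale) truncated to (L, R].

    Uses inverse-CDF sampling: U ~ Uniform(F(L), F(R)), T = F^{-1}(U).
    """
    rng = np.random.default_rng(seed)
    L, R = np.asarray(L, float), np.asarray(R, float)
    F = lambda t: 1.0 - np.exp(-(t / scale) ** shape)            # Weibull CDF
    Finv = lambda u: scale * (-np.log1p(-u)) ** (1.0 / shape)    # Weibull quantile function
    u = rng.uniform(F(L), F(R), size=(n_imputations, len(L)))
    return Finv(u)                                               # one row per imputed data set

# Toy intervals (e.g. an event known only to lie between two study visits, in days).
L = np.array([7.0, 14.0, 21.0])
R = np.array([14.0, 21.0, 28.0])
print(np.round(impute_interval_censored(L, R, shape=1.4, scale=20.0), 1))
```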
18

Yamangil, Elif. "Rich Linguistic Structure from Large-Scale Web Data." Thesis, Harvard University, 2013. http://dissertations.umi.com/gsas.harvard:11162.

Full text
Abstract:
The past two decades have shown an unexpected effectiveness of Web-scale data in natural language processing. Even the simplest models, when paired with unprecedented amounts of unstructured and unlabeled Web data, have been shown to outperform sophisticated ones. It has been argued that the effectiveness of Web-scale data has undermined the necessity of sophisticated modeling or laborious data set curation. In this thesis, we argue for and illustrate an alternative view, that Web-scale data not only serves to improve the performance of simple models, but also can allow the use of qualitatively more sophisticated models that would not be deployable otherwise, leading to even further performance gains.
Engineering and Applied Sciences
19

Scholz, Stefan [Verfasser]. "Dealing with uncertainty in health economic decision modeling. Applying statistical and data science methods / Stefan Scholz." Bielefeld : Universitätsbibliothek Bielefeld, 2021. http://d-nb.info/1241740089/34.

Full text
20

Muller, Christoffel Joseph Brand. "Bayesian approaches of Markov models embedded in unbalanced panel data." Thesis, Stellenbosch : Stellenbosch University, 2012. http://hdl.handle.net/10019.1/71910.

Full text
Abstract:
Thesis (PhD)--Stellenbosch University, 2012.
ENGLISH ABSTRACT: Multi-state models are used in this dissertation to model panel data, also known as longitudinal or cross-sectional time-series data. These are data sets which include units that are observed across two or more points in time. These models have been used extensively in medical studies where the disease states of patients are recorded over time. A theoretical overview of the current multi-state Markov models when applied to panel data is presented and based on this theory, a simulation procedure is developed to generate panel data sets for given Markov models. Through the use of this procedure a simulation study is undertaken to investigate the properties of the standard likelihood approach when fitting Markov models and then to assess its shortcomings. One of the main shortcomings highlighted by the simulation study, is the unstable estimates obtained by the standard likelihood models, especially when fitted to small data sets. A Bayesian approach is introduced to develop multi-state models that can overcome these unstable estimates by incorporating prior knowledge into the modelling process. Two Bayesian techniques are developed and presented, and their properties are assessed through the use of extensive simulation studies. Firstly, Bayesian multi-state models are developed by specifying prior distributions for the transition rates, constructing a likelihood using standard Markov theory and then obtaining the posterior distributions of the transition rates. A selected few priors are used in these models. Secondly, Bayesian multi-state imputation techniques are presented that make use of suitable prior information to impute missing observations in the panel data sets. Once imputed, standard likelihood-based Markov models are fitted to the imputed data sets to estimate the transition rates. Two different Bayesian imputation techniques are presented. The first approach makes use of the Dirichlet distribution and imputes the unknown states at all time points with missing observations. The second approach uses a Dirichlet process to estimate the time at which a transition occurred between two known observations and then a state is imputed at that estimated transition time. The simulation studies show that these Bayesian methods resulted in more stable results, even when small samples are available.
AFRIKAANSE OPSOMMING: Multi-state models are used in this dissertation to model panel data, also known as longitudinal or cross-sectional time-series data. These are data sets that include units observed at two or more points in time. Models of this type are often used in medical studies when different stages of a disease are observed over time. A theoretical overview of the current multi-state Markov models applied to panel data is given. Based on this theory, a simulation procedure is developed to simulate panel data sets for given Markov models. This procedure is then used in a simulation study to investigate the properties of the standard likelihood approach to fitting Markov models and to assess any shortcomings arising from it. One of the main shortcomings highlighted by the simulation study is the unstable estimates obtained when the models are fitted to small data sets in particular. A Bayesian approach to the modelling of multi-state panel data is developed to overcome this instability by incorporating prior information into the modelling process. Two Bayesian techniques are developed and presented, and their properties are investigated by means of a comprehensive simulation study. Firstly, Bayesian multi-state models are developed by specifying prior distributions for the transition rates, constructing the likelihood function using standard Markov theory and determining the posterior distributions of the transition rates. A selected number of prior distributions are used in these models. Secondly, Bayesian multi-state imputation techniques are proposed that use prior information to impute missing values in the panel data sets. Once the values have been imputed, standard Markov models are fitted to the imputed data set to estimate the transition rates. Two different Bayesian multi-state imputation techniques are discussed. The first technique uses a Dirichlet distribution to impute the missing state at all time points with a missing observation. The second approach uses a Dirichlet process to estimate the transition time between two observations and then imputes the missing state at that estimated transition time. The simulation studies show that the Bayesian methods yield results that are more stable, even when small data sets are available.
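The simulation procedure described above, generating panel observations from a continuous-time Markov model, can be sketched by evaluating the matrix exponential of a transition intensity matrix over the gaps between visits. The three-state intensity matrix and visit times below are illustrative assumptions, not those used in the dissertation.

```python
import numpy as np
from scipy.linalg import expm

# Illustrative 3-state transition intensity matrix (rows sum to zero); state 3 is absorbing.
Q = np.array([[-0.20,  0.15, 0.05],
              [ 0.10, -0.25, 0.15],
              [ 0.00,  0.00, 0.00]])

def simulate_panel(Q, obs_times, n_subjects=5, start_state=0, seed=0):
    """Observe a continuous-time Markov chain only at the panel times obs_times."""
    rng = np.random.default_rng(seed)
    panels = []
    for _ in range(n_subjects):
        state, states = start_state, [start_state]
        for dt in np.diff(obs_times):
            P = expm(Q * dt)                                  # transition probabilities over the gap
            state = rng.choice(len(Q), p=P[state] / P[state].sum())
            states.append(state)
        panels.append(states)
    return np.array(panels)

obs_times = np.array([0.0, 1.0, 2.5, 4.0, 6.0])               # unbalanced spacing between visits
print(simulate_panel(Q, obs_times) + 1)                       # states reported as 1, 2, 3
```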
21

Lienhard, Jasper Z. (Jasper Zebulon). "What is measured is managed : statistical analysis of compositional data towards improved materials recovery." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/98661.

Full text
Abstract:
Thesis: S.B., Massachusetts Institute of Technology, Department of Materials Science and Engineering, 2015.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 35-36).
As materials consumption increases globally, minimizing the end-of-life impact of solid waste has become a critical challenge. Cost-effective methods of quantifying and tracking municipal solid waste contents and disposal processes are necessary to drive and track increases in material recovery and recycling. This work presents an algorithm for estimating the average quantity and composition of municipal waste produced by individual locations. Mass fraction confidence intervals for different types of waste were calculated from data collected by sorting and weighing waste samples from municipal sites. This algorithm recognizes the compositional nature of mass fraction waste data. The algorithm developed in this work also evaluated the value of additional waste samples in refining mass fraction confidence intervals. Additionally, a greenhouse gas emissions model compared carbon dioxide emissions for different disposal methods of waste, in particular landfilling and recycling, based on the waste stream. This allowed for identification of recycling opportunities based on carbon dioxide emission savings from offsetting the need for primary materials extraction. Casework was conducted with this methodology using site-specific waste audit data from industry. The waste streams and carbon dioxide emissions of three categories of municipal waste producers, retail, commercial, and industrial, were compared. Paper and plastic products, whose mass fraction averages ranged from 40% to 52% and 26% to 29%, respectively, dominated the waste streams of these three industries. Average carbon dioxide emissions in each of these three industries ranged from 2.18 kg of CO₂ to 2.5 kg of CO₂ per kilogram of waste thrown away. On average, Americans throw away about 2 kilograms per person per day of solid waste.
by Jasper Z. Lienhard.
S.B.
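One way to compute mass-fraction confidence intervals that respects the compositional (sum-to-one) nature of the data is a percentile bootstrap over sampled loads; the sketch below illustrates that idea on invented waste-sort numbers. The thesis's actual algorithm is not reproduced here.

```python
import numpy as np

# Toy waste-sort data: mass (kg) of each material in several sampled loads.
materials = ["paper", "plastic", "organics", "other"]
samples = np.array([[42.0, 25.0, 18.0, 15.0],
                    [55.0, 30.0, 10.0, 12.0],
                    [38.0, 22.0, 25.0,  9.0],
                    [47.0, 28.0, 14.0, 13.0],
                    [50.0, 26.0, 20.0, 11.0]])

def bootstrap_fraction_ci(samples, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap CIs for the aggregate mass fraction of each material."""
    rng = np.random.default_rng(seed)
    n = samples.shape[0]
    fracs = np.empty((n_boot, samples.shape[1]))
    for b in range(n_boot):
        resample = samples[rng.integers(0, n, size=n)]
        totals = resample.sum(axis=0)
        fracs[b] = totals / totals.sum()              # compositional: fractions sum to one
    lo, hi = np.percentile(fracs, [(1 - level) / 2 * 100, (1 + level) / 2 * 100], axis=0)
    return lo, hi

lo, hi = bootstrap_fraction_ci(samples)
point = samples.sum(axis=0) / samples.sum()
for m, p, l, h in zip(materials, point, lo, hi):
    print(f"{m:9s} {p:.2f}  [{l:.2f}, {h:.2f}]")
```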
22

Ntushelo, Nombasa Sheroline. "Exploratory and inferential multivariate statistical techniques for multidimensional count and binary data with applications in R." Thesis, Stellenbosch : Stellenbosch University, 2011. http://hdl.handle.net/10019.1/17949.

Full text
Abstract:
Thesis (MComm)--Stellenbosch University, 2011.
ENGLISH ABSTRACT: The analysis of multidimensional (multivariate) data sets is a very important area of research in applied statistics. Over the decades many techniques have been developed to deal with such datasets. The multivariate techniques that have been developed include inferential analysis, regression analysis, discriminant analysis, cluster analysis and many more exploratory methods. Most of these methods deal with cases where the data contain numerical variables. However, there are powerful methods in the literature that also deal with multidimensional binary and count data. The primary purpose of this thesis is to discuss the exploratory and inferential techniques that can be used for binary and count data. In Chapter 2 of this thesis we give the detail of correspondence analysis and canonical correspondence analysis. These methods are used to analyze the data in contingency tables. Chapter 3 is devoted to cluster analysis. In this chapter we explain four well-known clustering methods and we also discuss the distance (dissimilarity) measures available in the literature for binary and count data. Chapter 4 contains an explanation of metric and non-metric multidimensional scaling. These methods can be used to represent binary or count data in a lower dimensional Euclidean space. In Chapter 5 we give a method for inferential analysis called the analysis of distance. This method use a similar reasoning as the analysis of variance, but the inference is based on a pseudo F-statistic with the p-value obtained using permutations of the data. Chapter 6 contains real-world applications of these above methods on two special data sets called the Biolog data and Barents Fish data. The secondary purpose of the thesis is to demonstrate how the above techniques can be performed in the software package R. Several R packages and functions are discussed throughout this thesis. The usage of these functions is also demonstrated with appropriate examples. Attention is also given to the interpretation of the output and graphics. The thesis ends with some general conclusions and ideas for further research.
AFRIKAANSE OPSOMMING: The analysis of multidimensional (multivariate) data sets is an important area of research in applied statistics. Over the past decades, various techniques have been developed to analyse such data. The multivariate techniques that have been developed include inferential analysis, regression analysis, discriminant analysis, cluster analysis and many more exploratory data analysis techniques. The majority of these methods handle cases where the data contain numerical variables. However, powerful methods also exist in the literature for the analysis of multidimensional binary and count data. The primary aim of this thesis is to discuss techniques for exploratory and inferential analysis of binary and count data. In Chapter 2 of this thesis we discuss correspondence analysis and canonical correspondence analysis. These methods are used to analyse data in contingency tables. Chapter 3 contains techniques for cluster analysis. In this chapter we explain four popular cluster analysis methods. We also discuss the distance measures available in the literature for binary and count data. Chapter 4 contains an explanation of metric and non-metric multidimensional scaling. These methods can be used to represent binary or count data in a low-dimensional Euclidean space. In Chapter 5 we describe an inference method known as the analysis of distance. This method uses reasoning similar to the analysis of variance. The inference here is based on a pseudo F-test statistic and the p-values are obtained by using permutations of the data. Chapter 6 contains applications of the above techniques to real data sets known as the Biolog data and the Barents Fish data. The secondary aim of the thesis is to demonstrate how these techniques are carried out in the R software. Various R packages and functions are discussed throughout the thesis. The use of the functions is demonstrated with appropriate examples. Attention is also given to the interpretation of the output and the graphs. The thesis concludes with general conclusions and suggestions for further research.
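The analysis-of-distance test in Chapter 5 is demonstrated in R in the thesis; the sketch below gives a generic Python version of the same idea for two groups, a PERMANOVA-style pseudo F-statistic computed from a distance matrix, with a p-value from permutations. The Jaccard distance and toy binary data are assumptions of the example.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pseudo_f(D, groups):
    """PERMANOVA-style pseudo F from a condensed distance matrix and group labels."""
    D2, n = squareform(D) ** 2, len(groups)
    labels = np.unique(groups)
    ss_total = D2[np.triu_indices(n, 1)].sum() / n
    ss_within = sum(
        D2[np.ix_(groups == g, groups == g)][np.triu_indices((groups == g).sum(), 1)].sum()
        / (groups == g).sum()
        for g in labels
    )
    ss_among = ss_total - ss_within
    return (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))

def permutation_test(X, groups, n_perm=999, metric="jaccard", seed=0):
    rng = np.random.default_rng(seed)
    D = pdist(X, metric=metric)                        # e.g. Jaccard distance for binary data
    obs = pseudo_f(D, groups)
    perms = np.array([pseudo_f(D, rng.permutation(groups)) for _ in range(n_perm)])
    return obs, (1 + np.sum(perms >= obs)) / (n_perm + 1)

# Toy binary (presence/absence) data with two groups of 15 observations each.
rng = np.random.default_rng(1)
X = np.vstack([rng.random((15, 12)) < 0.7, rng.random((15, 12)) < 0.3]).astype(bool)
groups = np.repeat([0, 1], 15)
f_stat, p_val = permutation_test(X, groups)
print(f"pseudo-F = {f_stat:.2f}, permutation p-value = {p_val:.3f}")
```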
23

Khakipoor, Banafsheh. "Applied Science for Water Quality Monitoring." University of Akron / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=akron1595858677325397.

Full text
24

Comerford, Michael. "Statistical disclosure control : an interdisciplinary approach to the problem of balancing privacy risk and data utility." Thesis, University of Glasgow, 2014. http://theses.gla.ac.uk/7044/.

Full text
Abstract:
The recent increase in the availability of data sources for research has put significant strain on existing data management work-flows, especially in the field of statistical disclosure control. New statistical methods for disclosure control are frequently set out in the literature, however, few of these methods become functional implementations for data owners to utilise. Current workflows often provide inconsistent results dependent on ad hoc approaches, and bottlenecks can form around statistical disclosure control checks which prevent research from progressing. These problems contribute to a lack of trust between researchers and data owners and contribute to the under utilisation of data sources. This research is an interdisciplinary exploration of the existing methods. It hypothesises that algorithms which invoke a range of statistical disclosure control methods (recoding, suppression, noise addition and synthetic data generation) in a semi-automatic way will enable data owners to release data with a higher level of data utility without any increase in disclosure risk when compared to existing methods. These semi-automatic techniques will be applied in the context of secure data-linkage in the e-Health sphere through projects such as DAMES and SHIP. This thesis sets out a theoretical framework for statistical disclosure control and draws on qualitative data from data owners, researchers, and analysts. With these contextual frames in place, the existing literature and methods were reviewed, and a tool set for implementing k-anonymity and a range of disclosure control methods was created. This tool-set is demonstrated in a standard workflow and it is shown how it could be integrated into existing e-Science projects and governmental settings. Comparing this approach with existing workflows within the Scottish Government and NHS Scotland, it allows data owners to process queries from data users in a semi-automatic way and thus provides for an enhanced user experience. This utility is drawn from the consistency and replicability of the approach combined with the increase in the speed of query processing.
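A very small pandas sketch of one ingredient mentioned above, a k-anonymity check over quasi-identifiers with suppression of records in equivalence classes smaller than k. The quasi-identifiers, threshold and toy records are illustrative; the thesis combines suppression with recoding, noise addition and synthetic data generation in a semi-automatic workflow.

```python
import pandas as pd

# Toy microdata; in practice this would be a sensitive administrative data set.
df = pd.DataFrame({
    "age_band":  ["20-29", "20-29", "20-29", "30-39", "30-39", "40-49"],
    "sex":       ["F", "F", "F", "M", "M", "F"],
    "postcode":  ["G12", "G12", "G12", "G3", "G3", "G4"],
    "diagnosis": ["A", "B", "A", "C", "A", "B"],
})

QUASI_IDENTIFIERS = ["age_band", "sex", "postcode"]
K = 3

# Size of each equivalence class defined by the quasi-identifiers.
class_size = df.groupby(QUASI_IDENTIFIERS)[QUASI_IDENTIFIERS[0]].transform("size")

# Suppress (drop) records in classes smaller than K; recoding or noise addition
# would be less destructive alternatives applied before resorting to suppression.
released = df[class_size >= K]
print(f"k-anonymity satisfied at k={K} for {len(released)} of {len(df)} records")
print(released)
```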
APA, Harvard, Vancouver, ISO, and other styles
25

Lauretig, Adam M. "Natural Language Processing, Statistical Inference, and American Foreign Policy." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1562147711514566.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Kobakian, Stephanie Rose. "New algorithms for effectively visualising Australian spatio-temporal disease data." Thesis, Queensland University of Technology, 2020. https://eprints.qut.edu.au/203908/1/Stephanie_Kobakian_Thesis.pdf.

Full text
Abstract:
This thesis contributes to improvements in effectively communicating population-related cancer distributions and the associated burden of cancer on Australian communities. This thesis presents a new algorithm for creating alternative map displays of tessellating hexagons. Alternative map displays can emphasise statistics in countries that contain densely populated cities. It is accompanied by a software implementation that automates the choice of one hexagon to represent each geographic unit, ensuring the statistic for each is equitably presented. The case study comparing a traditional choropleth map to the alternative hexagon tile map contributes to a growing field of visual inference studies.
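A toy sketch of the core idea of a hexagon tile map is given below: each geographic centroid is greedily assigned to the nearest unused hexagon centre. The grid construction and greedy rule are illustrative assumptions only, not the algorithm or software developed in the thesis.

import numpy as np

def hex_grid(nx, ny, spacing=1.0):
    # Centres of an offset grid that approximates a hexagon tiling.
    centres = []
    for row in range(ny):
        offset = 0.5 * spacing if row % 2 else 0.0
        for col in range(nx):
            centres.append((col * spacing + offset, row * spacing * np.sqrt(3) / 2))
    return np.array(centres)

def assign_hexagons(centroids, hex_centres):
    taken, assignment = set(), {}
    for i, c in enumerate(centroids):
        order = np.argsort(np.linalg.norm(hex_centres - c, axis=1))
        for j in order:                      # nearest free hexagon wins
            if j not in taken:
                taken.add(j)
                assignment[i] = int(j)
                break
    return assignment

regions = np.array([[0.2, 0.1], [0.9, 0.2], [1.1, 1.0]])   # hypothetical centroids
print(assign_hexagons(regions, hex_grid(4, 4)))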
APA, Harvard, Vancouver, ISO, and other styles
27

Akinc, Deniz. "Statistical Modelling Of Financial Statements Of Turkey: A Panel Data Analysis." Master's thesis, METU, 2008. http://etd.lib.metu.edu.tr/upload/2/12609824/index.pdf.

Full text
Abstract:
Financial failure is an important subject both for the economic development of the country and for the self-evaluation of individual companies. An increase in the number of financially failed companies points to the misuse of the country's resources. Recently, financial failure threatens both small and large companies in Turkey. It is important to determine the factors that affect financial failure by analyzing models and to use these models for auditing the financial situation. In today's Turkey, the statistical methods that are used for this purpose involve single-level models applied to cross-sectional data. However, multilevel models applied to panel data are preferable as they gather more information and also enable the calculated financial success probabilities to be more trustworthy. In this thesis, publicly available panel data collected from the Istanbul Stock Exchange are investigated. Mainly, the financial success of companies from two sectors, namely industry and services, is investigated. For the analysis of this panel data, data exploration methods, missing data imputation, possible solutions to the multicollinearity problem, single-level logistic regression models and multilevel models are used. With these models, financial success probabilities for each company are calculated; the factors related to financial failure are determined, and changes over time are observed. Models and early warning systems resulted in correct classification rates of up to 100%. In the services sector, the small number of companies with publicly available data results in a decline in the success of the models. It is concluded that sharing data with academicians, with more subjects observed over a longer time period and collected in the same format, will result in better justified outputs, which are useful for both academicians and managers.
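As an illustration of the single-level logistic step described in the abstract, the sketch below estimates failure probabilities from a few simulated financial ratios; the ratios, coefficients and use of scikit-learn are assumptions of this example, not the ISE panel data or the multilevel models of the thesis.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
ratios = rng.normal(size=(200, 3))            # e.g. liquidity, leverage, profitability
failed = (ratios @ np.array([-0.8, 1.2, -1.0]) + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(ratios, failed)
success_prob = 1 - model.predict_proba(ratios)[:, 1]   # P(no failure) per company
print(success_prob[:5].round(3))

A multilevel extension would add random effects, for example at the sector or year level, so that companies within the same group share a common deviation.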
APA, Harvard, Vancouver, ISO, and other styles
28

Offei, Felix. "Denoising Tandem Mass Spectrometry Data." Digital Commons @ East Tennessee State University, 2017. https://dc.etsu.edu/etd/3218.

Full text
Abstract:
Protein identification using tandem mass spectrometry (MS/MS) has proven to be an effective way to identify proteins in a biological sample. An observed spectrum is constructed from the data produced by the tandem mass spectrometer. A protein can be identified if the observed spectrum aligns with the theoretical spectrum. However, data generated by the tandem mass spectrometer are affected by errors thus making protein identification challenging in the field of proteomics. Some of these errors include wrong calibration of the instrument, instrument distortion and noise. In this thesis, we present a pre-processing method, which focuses on the removal of noisy data with the hope of aiding in better identification of proteins. We employ the method of binning to reduce the number of noise peaks in the data without sacrificing the alignment of the observed spectrum with the theoretical spectrum. In some cases, the alignment of the two spectra improved.
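As a rough sketch of the binning idea described above, the example below pools peaks into fixed-width m/z bins and keeps only the strongest peak in each bin; the bin width and peak values are illustrative assumptions, not parameters from the thesis.

import numpy as np

def bin_spectrum(mz, intensity, bin_width=1.0):
    edges = np.arange(mz.min(), mz.max() + bin_width, bin_width)
    idx = np.digitize(mz, edges)
    kept_mz, kept_int = [], []
    for b in np.unique(idx):
        mask = idx == b
        best = np.argmax(intensity[mask])     # keep the strongest peak per bin
        kept_mz.append(mz[mask][best])
        kept_int.append(intensity[mask][best])
    return np.array(kept_mz), np.array(kept_int)

mz = np.array([100.1, 100.4, 250.2, 250.9, 410.5])
inten = np.array([40.0, 90.0, 15.0, 70.0, 55.0])
print(bin_spectrum(mz, inten))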
APA, Harvard, Vancouver, ISO, and other styles
29

Hansson, Lisbeth. "Statistical Considerations in the Analysis of Matched Case-Control Studies. With Applications in Nutritional Epidemiology." Doctoral thesis, Uppsala University, Department of Information Science, 2001. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-1092.

Full text
Abstract:

The case-control study is one of the most frequently used study designs in analytical epidemiology. This thesis focuses on some methodological aspects in the analysis of the results from this kind of study.

A population-based case-control study was conducted in northern Norway and central Sweden in order to study the associations of several potential risk factors with thyroid cancer. Cases and controls were individually matched, and the information on the factors under study was provided by means of a self-completed questionnaire. The analysis was conducted with logistic regression. No association was found with pregnancies, oral contraceptives or hormone replacement after menopause. Early pregnancy and artificial menopause were associated with an increased risk, and cigarette smoking with a decreased risk, of thyroid cancer (paper I). The relation with diet was also examined. High consumption of a fat- and starch-rich diet was associated with an increased risk (paper II).

Conditional and unconditional maximum likelihood estimations of the parameters in a logistic regression were compared through a simulation study. Conditional estimation had higher root mean square error but better model fit than unconditional, especially for 1:1 matching, with relatively little effect of the proportion of missing values (paper III). Two common approaches to handle partial non-response in a questionnaire when calculating nutrient intake from diet variables were compared. In many situations it is reasonable to interpret the omitted self-reports of food consumption as indication of "zero-consumption" (paper IV).

The reproducibility of dietary reports was presented and problems with its measurement and analysis were discussed. The most advisable approach to measuring repeatability is to look at different correlation methods. Among the factors affecting reproducibility, frequency and homogeneity of consumption are presumably the most important (paper V). Nutrient variables can often have a mixed distribution form, and therefore transformation to normality can be troublesome. When analysing nutrients we therefore recommend comparing the result from a parametric test with that of an analogous distribution-free test. Different methods to transform nutrient variables to achieve normality were discussed (paper VI).
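As a small illustration of one transformation-to-normality strategy of the kind discussed in paper VI, the sketch below Box-Cox transforms a simulated right-skewed nutrient variable and compares Shapiro-Wilk tests before and after; the simulated intake values are an assumption of this example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
intake = rng.lognormal(mean=1.0, sigma=0.6, size=150)    # right-skewed intake

transformed, lam = stats.boxcox(intake)                   # requires positive values
print("lambda:", round(lam, 2))
print("raw p-value:        ", stats.shapiro(intake).pvalue)
print("transformed p-value:", stats.shapiro(transformed).pvalue)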

APA, Harvard, Vancouver, ISO, and other styles
30

Robbin, Alice, and Lee Frost-Kumpf. "Extending theory for user-centered information systems: Diagnosing and learning from error in complex statistical data." John Wiley & Sons, Inc, 1997. http://hdl.handle.net/10150/105746.

Full text
Abstract:
Utilization of complex statistical data has come at great cost to individual researchers, the information community, and to the national information infrastructure. Dissatisfaction with the traditional approach to information system design and information services provision, and, by implication, the theoretical bases on which these systems and services have been developed has led librarians and information scientists to propose that information is a user construct and therefore system designs should place greater emphasis on user-centered approaches. This article extends Dervin's and Morris's theoretical framework for designing effective information services by synthesizing and integrating theory and research derived from multiple approaches in the social and behavioral sciences. These theoretical frameworks are applied to develop general design strategies and principles for information systems and services that rely on complex statistical data. The focus of this article is on factors that contribute to error in the production of high quality scientific output and on failures of communication during the process of data production and data utilization. Such insights provide useful frameworks to diagnose, communicate, and learn from error. Strategies to design systems that support communicative competence and cognitive competence emphasize the utilization of information systems in a user-centered learning environment. This includes viewing cognition as a generative process and recognizing the continuing interdependence and active involvement of experts, novices, and technological gatekeepers.
APA, Harvard, Vancouver, ISO, and other styles
31

Zaremba, Wojciech. "Modeling the variability of EEG/MEG data through statistical machine learning." Habilitation à diriger des recherches, Ecole Polytechnique X, 2012. http://tel.archives-ouvertes.fr/tel-00803958.

Full text
Abstract:
Brain neural activity generates electrical discharges, which manifest as electrical and magnetic potentials around the scalp. Those potentials can be registered with magnetoencephalography (MEG) and electroencephalography (EEG) devices. Data acquired by M/EEG is extremely difficult to work with due to the inherent complexity of underlying brain processes and low signal-to-noise ratio (SNR). Machine learning techniques have to be employed in order to reveal the underlying structure of the signal and to understand the brain state. This thesis explores a diverse range of machine learning techniques which model the structure of M/EEG data in order to decode the mental state. It focuses on measuring a subject's variability and on modeling intrasubject variability. We propose to measure subject variability with a spectral clustering setup. Further, we extend this approach to a unified classification framework based on Laplacian regularized support vector machine (SVM). We solve the issue of intrasubject variability by employing a model with latent variables (based on a latent SVM). Latent variables describe transformations that map samples into a comparable state. We focus mainly on intrasubject experiments to model temporal misalignment.
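As a loose illustration of a spectral clustering setup for grouping subjects, the sketch below clusters simulated per-subject feature vectors; the features are random stand-ins, not M/EEG summaries, and the sketch does not reproduce the Laplacian-regularised SVM or latent SVM developed in the thesis.

import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(3)
subjects = np.vstack([rng.normal(0, 1, size=(10, 5)),    # one putative group of subjects
                      rng.normal(2, 1, size=(10, 5))])   # a second group

labels = SpectralClustering(n_clusters=2, affinity="rbf", random_state=0).fit_predict(subjects)
print(labels)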
APA, Harvard, Vancouver, ISO, and other styles
32

Mendez, Kevin M. "Deriving statistical inference from the application of artificial neural networks to clinical metabolomics data." Thesis, Edith Cowan University, Research Online, Perth, Western Australia, 2020. https://ro.ecu.edu.au/theses/2296.

Full text
Abstract:
Metabolomics data are complex with a high degree of multicollinearity. As such, multivariate linear projection methods, such as partial least squares discriminant analysis (PLS-DA), have become standard. Non-linear projection methods, typified by Artificial Neural Networks (ANNs), may be more appropriate to model potential non-linear latent covariance; however, they are not widely used due to the difficulty in deriving statistical inference, and thus biological interpretation. Herein, we illustrate the utility of ANNs for clinical metabolomics using publicly available data sets and develop an open framework for deriving and visualising statistical inference from ANNs equivalent to standard PLS-DA methods.
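As a small sketch of the PLS-DA baseline mentioned above, the example below fits a PLS regression to a 0/1 class label and thresholds the prediction at 0.5; the simulated metabolite matrix is an assumption of this example, not one of the public data sets used in the thesis.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 200))                 # 60 samples, 200 metabolite features
y = np.repeat([0, 1], 30)
X[y == 1, :10] += 1.0                          # a small class-related shift

pls = PLSRegression(n_components=2).fit(X, y)
pred = (pls.predict(X).ravel() > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())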
APA, Harvard, Vancouver, ISO, and other styles
33

Holmgren, Rachelle. "Challenges Involved in the Automation of Regression Analysis." Scholarship @ Claremont, 2016. http://scholarship.claremont.edu/cmc_theses/1405.

Full text
Abstract:
Extracting meaningful insights from massive datasets to help guide business decisions requires specialized skills in data analysis. Unfortunately, the supply of these skills does not meet the demand, due to the massive amount of data generated by society each day. This leaves businesses with a large amount of unanalyzed data that could have been used to support business decision making. Automating the process of analyzing this data would help address many companies' key challenge of a lack of appropriate analytical skills. This paper examines the process and challenges in automating this analysis of data. Central challenges include removing outliers without context, transforming data to a format that is compatible with the analysis method that will be used, and analyzing the results of the model.
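As a rough illustration of one automation step discussed above, the sketch below removes outliers with a context-free interquartile-range rule before fitting a regression; the rule, thresholds and simulated data are assumptions of this example.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(0, 1, size=200)
y[:5] += 40                                    # a few gross outliers

q1, q3 = np.percentile(y, [25, 75])
iqr = q3 - q1
keep = (y > q1 - 1.5 * iqr) & (y < q3 + 1.5 * iqr)

model = LinearRegression().fit(x[keep].reshape(-1, 1), y[keep])
print("slope estimated without outliers:", round(model.coef_[0], 2))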
APA, Harvard, Vancouver, ISO, and other styles
34

Hazarika, Subhashis. "Statistical and Machine Learning Approaches For Visualizing and Analyzing Large-Scale Simulation Data." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1574692702479196.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Hlongwane, Rivalani Willie. "Selecting the best model for predicting a term deposit product take-up in banking." Master's thesis, University of Cape Town, 2018. http://hdl.handle.net/11427/29789.

Full text
Abstract:
In this study, we use data mining techniques to build predictive models on data collected by a Portuguese bank through a term savings product campaign conducted between May 2008 and November 2010. This data is imbalanced, given an observed take-up rate of 11.27%. Ling et al. (1998) indicated that predictive models built on imbalanced data tend to yield low sensitivity and high specificity, an indication of low true positive and high true negative rates. Our study confirms this finding. We, therefore, use three sampling techniques, namely under-sampling, over-sampling and the Synthetic Minority Over-sampling Technique, to balance the data; this results in three additional datasets to use for modelling. We build the following predictive models: random forest, multivariate adaptive regression splines, neural network and support vector machine on the datasets, and we compare the models against each other for their ability to identify customers that are likely to take up a term savings product. As part of the model building process, we investigate parameter permutations related to each modelling technique to tune the models; we find that this assists in building robust models. We assess our models for predictive performance through the use of the receiver operating characteristic curve, confusion matrix, GINI, kappa, sensitivity, specificity, and lift and gains charts. A multivariate adaptive regression splines model built on over-sampled data is found to be the best model for predicting term savings product take-up.
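As a rough sketch of the balancing step described above, the example below randomly over-samples the minority class, fits a random forest and reports a test AUC; the simulated features, class rate and model settings are assumptions of this example, not the bank campaign data or the tuned models of the study.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 8))
y = ((X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000)) > 1.6).astype(int)  # roughly 10% take-up

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])          # duplicate minority rows until balanced
y_bal = np.concatenate([y_tr, y_tr[extra]])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
print("test AUC:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3))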
APA, Harvard, Vancouver, ISO, and other styles
36

D'Antuono, Damiano. "Torque-based statistical analysis for condition monitoring of automatic machines." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019.

Find full text
Abstract:
This dissertation deals with the study, development and implementation of condition monitoring algorithms for automatic machines. These are implemented in the machine's PLC. First, data are collected from the machine. Each motion group has been studied to sample the most relevant points (10 points in total) in its trajectory. Second, the data have been analysed; in particular, normality of the data has been checked. Third, the condition monitoring task is fulfilled with a statistical analysis. In this context two tests are presented: the scalar test and the vector test. The scalar test is simple and immediate and must be evaluated at each of the 10 sampled points. The vector test takes into account the relationship between all the points, and its outcome is a single number. The vector test can be used to check the state of all the axes and, if it reveals any criticalities, the scalar test can be used to check why a particular axis is not performing as expected. A crucial issue is to find the healthy (reference) values for each motion group. The reference is not constant but varies with temperature, and a solution to obtain a robust reference value is presented in this thesis. Lastly, the condition monitoring task has also been addressed with machine learning techniques. A classifier is trained to estimate the time elapsed since the last maintenance of a motion group. This can then be compared with the actual time since the last maintenance to detect whether the component is ageing worse than it should. To build the classifier, data have been acquired and processed. The final classifier is a voting classifier that shows some interesting robustness properties: to decide, it makes three sub-classifiers vote and chooses the majority. A graphical user interface has also been created for the machine learning approach, which can be added to the human-machine interface panel. Finally, a thorough comparison between the two approaches is presented.
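A minimal numerical sketch of the two tests described above is given below: a per-point scalar z-score against a healthy reference and a Mahalanobis-style vector statistic over all 10 sampled points; the reference values and the degraded cycle are simulated assumptions, not measurements from the machine.

import numpy as np

rng = np.random.default_rng(7)
healthy = rng.normal(loc=5.0, scale=0.2, size=(500, 10))   # torque at 10 trajectory points
mu = healthy.mean(axis=0)
cov = np.cov(healthy, rowvar=False)

new_cycle = mu + np.array([0, 0, 0, 1.2, 0, 0, 0, 0, 0, 0])  # a degraded fourth point

scalar_z = (new_cycle - mu) / healthy.std(axis=0)            # one value per sampled point
diff = new_cycle - mu
vector_stat = float(diff @ np.linalg.inv(cov) @ diff)        # a single summary number

print("scalar z-scores:", scalar_z.round(1))
print("vector statistic:", round(vector_stat, 1))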
APA, Harvard, Vancouver, ISO, and other styles
37

Gustafson, Fredrik, and Marcus Lindahl. "Evaluation of Statistical Distributions for VoIP Traffic Modelling." Thesis, University West, Department of Economics and IT, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:hv:diva-1643.

Full text
Abstract:

Statistical distributions are used to model behaviour of real VoIP traffic. We investigate call holding and inter-arrival times as well as speech patterns. The consequences of using an inappropriate model for network dimensioning are briefly discussed. Visual examination is used to compare well known distributions with empirical data. Our results support the general opinion that the Exponential distribution is not appropriate for modelling call holding time. We find that the distribution of talkspurt periods is well modelled by the Lognormal distribution and the silence periods by the generalized Pareto distribution. It is also observed that the call inter-arrival times tend to follow a heavy tailed distribution.
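As a small illustration of fitting the two distributions named above, the sketch below fits a lognormal distribution to simulated talkspurt durations and a generalized Pareto distribution to simulated silence durations; real traffic traces would replace the simulated samples.

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
talkspurts = rng.lognormal(mean=0.0, sigma=0.8, size=1000)          # seconds
silences = stats.genpareto.rvs(c=0.3, scale=1.0, size=1000, random_state=8)

shape, loc, scale = stats.lognorm.fit(talkspurts, floc=0)
print("lognormal sigma and scale:", round(shape, 2), round(scale, 2))

c, loc, scale = stats.genpareto.fit(silences, floc=0)
print("generalized Pareto shape and scale:", round(c, 2), round(scale, 2))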

APA, Harvard, Vancouver, ISO, and other styles
38

Percival, Colin. "Matching with mismatches and assorted applications." Thesis, University of Oxford, 2006. http://ora.ox.ac.uk/objects/uuid:4f0d53cc-fb9f-4246-a835-3c8734eba735.

Full text
Abstract:
This thesis consists of three parts, each of independent interest, yet tied together by the problem of matching with mismatches. In the first chapter, we present a motivated exposition of a new randomized algorithm for indexed matching with mismatches which, for constant error (substitution) rates, locates a substring of length m within a string of length n faster than existing algorithms by a factor of O(m / log(n)). The second chapter turns from this theoretical problem to an entirely practical concern: delta compression of executable code. In contrast to earlier work which has either generated very large deltas when applied to executable code, or has generated small deltas by utilizing platform and processor-specific knowledge, we present a naïve approach — that is, one which does not rely upon any external knowledge — which nevertheless constructs deltas of size comparable to those produced by a platform-specific approach. In the course of this construction, we utilize the result from the first chapter, although it is of primary utility only when producing deltas between very similar executables. The third chapter lies between the horn and ivory gates, being both highly interesting from a theoretical viewpoint and of great practical value. Using the algorithm for matching with mismatches from the first chapter, combined with error-correcting codes, we give a practical algorithm for "universal" delta compression (often called "feedback-free file synchronization") which can operate in the presence of multiple indels and a large number of substitutions.
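For readers unfamiliar with the underlying problem, the sketch below states matching with mismatches in its naive form: slide the pattern over the text and report alignments whose Hamming distance stays within a mismatch budget. This brute-force version is for illustration only and is not the randomized indexed algorithm of the first chapter.

def matches_with_mismatches(text, pattern, max_mismatches):
    hits = []
    m = len(pattern)
    for i in range(len(text) - m + 1):
        mismatches = sum(a != b for a, b in zip(text[i:i + m], pattern))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits

print(matches_with_mismatches("abracadabra", "acadxb", 1))   # [(3, 1)]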
APA, Harvard, Vancouver, ISO, and other styles
39

Ebrahimvandi, Alireza. "Three Essays on Analysis of U.S. Infant Mortality Using Systems and Data Science Approaches." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/96266.

Full text
Abstract:
High infant mortality (IM) rates in the U.S. have been a major public health concern for decades. Many studies have focused on understanding causes, risk factors, and interventions that can reduce IM. However, death of an infant is the result of the interplay between many risk factors, which in some cases can be traced to the infancy of their parents. Consequently, these complex interactions challenge the effectiveness of many interventions. The long-term goal of this study is to advance the common understanding of effective interventions for improving health outcomes and, in particular, infant mortality. To achieve this goal, I implemented systems and data science methods in three essays to contribute to the understanding of IM causes and risk factors. In the first study, the goal was to identify patterns in the leading causes of infant mortality across states that successfully reduced their IM rates. I explore the trends at the state-level between 2000 and 2015 to identify patterns in the leading causes of IM. This study shows that the main drivers of IM rate reduction is the preterm-related mortality rate. The second study builds on these findings and investigates the risk factors of preterm birth (PTB) in the largest obstetric population that has ever been studied in this field. By applying the latest statistical and machine learning techniques, I study the PTB risk factors that are both generalizable and identifiable during the early stages of pregnancy. A major finding of this study is that socioeconomic factors such as parent education are more important than generally known factors such as race in the prediction of PTB. This finding is significant evidence for theories like Lifecourse, which postulate that the main determinants of a health trajectory are the social scaffolding that addresses the upstream roots of health. These results point to the need for more comprehensive approaches that change the focus from medical interventions during pregnancy to the time where mothers become vulnerable to the risk factors of PTB. Therefore, in the third study, I take an aggregate approach to study the dynamics of population health that results in undesirable outcomes in major indicators like infant mortality. Based on these new explanations, I offer a systematic approach that can help in addressing adverse birth outcomes—including high infant mortality and preterm birth rates—which is the central contribution of this dissertation. In conclusion, this dissertation contributes to a better understanding of the complexities in infant mortality and health-related policies. This work contributes to the body of literature both in terms of the application of statistical and machine learning techniques, as well as in advancing health-related theories.
Doctor of Philosophy
The U.S. infant mortality rate (IMR) is 71% higher than the average rate for comparable countries in the Organization for Economic Co-operation and Development (OECD). High infant mortality and preterm birth rates (PBR) are major public health concerns in the U.S. A wide range of studies have focused on understanding the causes and risk factors of infant mortality and interventions that can reduce it. However, infant mortality is a complex phenomenon that challenges the effectiveness of the interventions, and the IMR and PBR in the U.S. are still higher than any other advanced OECD nation. I believe that systems and data science methods can help in enhancing our understanding of infant mortality causes, risk factors, and effective interventions. There are more than 130 diagnoses—causes—for infant mortality. Therefore, for 50 states tracking the causes of infant mortality trends over a long time period is very challenging. In the first essay, I focus on the medical aspects of infant mortality to find the causes that helped the reduction of the infant mortality rates in certain states from 2000 to 2015. In addition, I investigate the relationship between different risk factors with infant mortality in a regression model to investigate and find significant correlations. This study provides critical recommendations to policymakers in states with high infant mortality rates and guides them on leveraging appropriate interventions. Preterm birth (PTB) is the most significant contributor to the IMR. The first study showed that a reduction in infant mortality happened in states that reduced their preterm birth. There exists a considerable body of literature on identifying the PTB risk factors in order to find possible explanations for consistently high rates of PTB and IMR in the U.S. However, they have fallen short in two key areas: generalizability and being able to detect PTB in early pregnancy. In the second essay, I investigate a wide range of risk factors in the largest obstetric population that has ever been studied in PTB research. The predictors in this study consist of a wide range of variables from environmental (e.g., air pollution) to medical (e.g., history of hypertension) factors. Our objective is to increase the understanding of factors that are both generalizable and identifiable during the early stage of pregnancy. I implemented state-of-the-art statistical and machine learning techniques and improved the performance measures compared to the previous studies. The results of this study reveal the importance of socioeconomic factors such as, parent education, which can be as important as biomedical indicators like the mother's body mass index in predicting preterm delivery. The second study showed an important relationship between socioeconomic factors such as, education and major health outcomes such as preterm birth. Short-term interventions that focus on improving the socioeconomic status of a mother during pregnancy have limited to no effect on birth outcomes. Therefore, we need to implement more comprehensive approaches and change the focus from medical interventions during pregnancy to the time where mothers become vulnerable to the risk factors of PTB. Hence, we use a systematic approach in the third study to explore the dynamics of health over time. This is a novel study, which enhances our understanding of the complex interactions between health and socioeconomic factors over time. 
I explore why some communities experience the downward spiral of health deterioration, how resources are generated and allocated, how the generation and allocation mechanisms are interconnected, and why we can see significantly different health outcomes across otherwise similar states. I use Ohio as the case study, because it suffers from poor health outcomes despite having one of the best healthcare systems in the nation. The results identify the trap of health expenditure and how an external financial shock can exacerbate health and socioeconomic factors in such a community. I demonstrate how overspending or underspending in healthcare can affect health outcomes in a society in the long-term. Overall, this dissertation contributes to a better understanding of the complexities associated with major health issues of the U.S. I provide health professionals with theoretical and empirical foundations of risk assessment for reducing infant mortality and preterm birth. In addition, this study provides a systematic perspective on the issue of health deterioration that many communities in the US are experiencing, and hope that this perspective improves policymakers' decision-making.
APA, Harvard, Vancouver, ISO, and other styles
40

Yildiz, Meliha Yetisgen. "Using statistical and knowledge-based approaches for literature-based discovery /." Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/7178.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Barra, Hugo Botelho 1976. "Evaluating the implementation of new services models in the financial advisory industry : a statistical data mining and system dynamics approach." Thesis, Massachusetts Institute of Technology, 2002. http://hdl.handle.net/1721.1/8067.

Full text
Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002.
Includes bibliographical references (p. 74).
Program Alpha is a new business practice model designed to increase the service quality and productivity of one of the world's largest financial services organizations by implementing structured time management and a disciplined client and prospect contract process. This thesis quantitatively and qualitatively evaluates the business impact of this program by developing and applying two analytical frameworks. We first present and develop a System Dynamics framework for the interpretation of qualitative information collected through interviews, focus groups and surveys, which measure the impact of Program Alpha from operational, organizational and behavioral perspectives. Secondly, we present a Statistical Data Mining framework for the interpretation of quantitative financial and customer preference information. Using this framework, we generate a preliminary set of algorithmic guidelines for the improvement of Program Alpha in future deployment stages. Such guidelines, based on statistical learning algorithms applied to historical data, aim to streamline the client segmentation process at the core of Program Alpha.
by Hugo Botelho Barra.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
42

Trutschel, Diana [Verfasser], Ivo [Gutachter] Große, Steffen [Gutachter] Neumann, and André [Gutachter] Scherag. "Multivariate statistical methods to analyse multidimensional data in applied life science : [kumulative Dissertation] / Diana Trutschel ; Gutachter: Ivo Grosse, Steffen Neumann, André Scherag." Halle (Saale) : Universitäts- und Landesbibliothek Sachsen-Anhalt, 2019. http://d-nb.info/1210731126/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Trutschel, Diana [Verfasser], Ivo [Gutachter] Grosse, Steffen [Gutachter] Neumann, and André [Gutachter] Scherag. "Multivariate statistical methods to analyse multidimensional data in applied life science : [kumulative Dissertation] / Diana Trutschel ; Gutachter: Ivo Grosse, Steffen Neumann, André Scherag." Halle (Saale) : Universitäts- und Landesbibliothek Sachsen-Anhalt, 2019. http://d-nb.info/1210731126/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Hechter, Trudie. "A comparison of support vector machines and traditional techniques for statistical regression and classification." Thesis, Stellenbosch : Stellenbosch University, 2004. http://hdl.handle.net/10019.1/49810.

Full text
Abstract:
Thesis (MComm)--Stellenbosch University, 2004.
ENGLISH ABSTRACT: Since its introduction in Boser et al. (1992), the support vector machine has become a popular tool in a variety of machine learning applications. More recently, the support vector machine has also been receiving increasing attention in the statistical community as a tool for classification and regression. In this thesis support vector machines are compared to more traditional techniques for statistical classification and regression. The techniques are applied to data from a life assurance environment for a binary classification problem and a regression problem. In the classification case the problem is the prediction of policy lapses using a variety of input variables, while in the regression case the goal is to estimate the income of clients from these variables. The performance of the support vector machine is compared to that of discriminant analysis and classification trees in the case of classification, and to that of multiple linear regression and regression trees in regression, and it is found that support vector machines generally perform well compared to the traditional techniques.
AFRIKAANS SUMMARY: Since the introduction of the support vector machine in Boser et al. (1992), it has become a popular technique in a variety of machine learning applications. More recently, the support vector machine has also started to receive more attention in the statistical community as a technique for classification and regression. In this thesis, support vector machines are compared with more traditional techniques for statistical classification and regression. The techniques are applied to data from a life assurance environment for a binary classification problem as well as a regression problem. In the classification case the problem is the prediction of policy lapses using a variety of input variables, while in the regression case the aim is to predict the income of clients using these variables. The results of the support vector machine are compared with those of discriminant analysis and classification trees in the classification case, and with multiple linear regression and regression trees in the regression case. The conclusion is that support vector machines generally perform well in comparison with the traditional techniques.
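As a small illustration of the kind of comparison described in the abstracts, the sketch below fits a support vector machine and a classification tree to a simulated binary problem and compares test accuracy; the simulated data stand in for the life assurance data, which are not public.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf").fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

print("SVM accuracy: ", svm.score(X_te, y_te))
print("tree accuracy:", tree.score(X_te, y_te))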
APA, Harvard, Vancouver, ISO, and other styles
45

Flaspohler, Genevieve Elaine. "Statistical models and decision making for robotic scientific information gathering." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/120607.

Full text
Abstract:
Thesis: S.M., Joint Program in Applied Ocean Physics and Engineering (Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science; and the Woods Hole Oceanographic Institution), 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 97-107).
Mobile robots and autonomous sensors have seen increasing use in scientific applications, from planetary rovers surveying for signs of life on Mars, to environmental buoys measuring and logging oceanographic conditions in coastal regions. This thesis makes contributions in both planning algorithms and model design for autonomous scientific information gathering, demonstrating how theory from machine learning, decision theory, theory of optimal experimental design, and statistical inference can be used to develop online algorithms for robotic information gathering that are robust to modeling errors, account for spatiotemporal structure in scientific data, and have probabilistic performance guarantees. This thesis first introduces a novel sample selection algorithm for online, irrevocable sampling in data streams that have spatiotemporal structure, such as those that commonly arise in robotics and environmental monitoring. Given a limited sampling capacity, the proposed periodic secretary algorithm uses an information-theoretic reward function to select samples in real-time that maximally reduce posterior uncertainty in a given scientific model. Additionally, we provide a lower bound on the quality of samples selected by the periodic secretary algorithm by leveraging the submodularity of the information-theoretic reward function. Finally, we demonstrate the robustness of the proposed approach by employing the periodic secretary algorithm to select samples irrevocably from a seven-year oceanographic data stream collected at the Martha's Vineyard Coastal Observatory off the coast of Cape Cod, USA. Secondly, we consider how scientific models can be specified in environments - such as the deep sea or deep space - where domain scientists may not have enough a priori knowledge to formulate a formal scientific model and hypothesis. These domains require scientific models that start with very little prior information and construct a model of the environment online as observations are gathered. We propose unsupervised machine learning as a technique for science model-learning in these environments. To this end, we introduce a hybrid Bayesian-deep learning model that learns a nonparametric topic model of a visual environment. We use this semantic visual model to identify observations that are poorly explained in the current model, and show experimentally that these highly perplexing observations often correspond to scientifically interesting phenomena. On a marine dataset collected by the SeaBED AUV on the Hannibal Sea Mount, images of high perplexity in the learned model corresponded, for example, to a scientifically novel crab congregation in the deep sea. The approaches presented in this thesis capture the depth and breadth of the problems facing the field of autonomous science. Developing robust autonomous systems that enhance our ability to perform exploratory science in environments such as the oceans, deep space, agricultural and disaster-relief zones will require insight and techniques from classical areas of robotics, such as motion and path planning, mapping, and localization, and from other domains, including machine learning, spatial statistics, optimization, and theory of experimental design. This thesis demonstrates how theory and practice from these diverse disciplines can be unified to address problems in autonomous scientific information gathering.
by Genevieve Elaine Flaspohler.
S.M.
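A toy sketch of online sample selection under a fixed budget is given below: the stream is split into one period per sample and, within each period, a classical secretary-style rule observes roughly the first 37% of items before committing to the first item that beats that benchmark. This generic rule and the uniform reward stream are illustrative assumptions, not the periodic secretary algorithm or its information-theoretic reward function.

import math
import numpy as np

def periodic_threshold_select(rewards, budget):
    n = len(rewards)
    period = n // budget
    chosen = []
    for p in range(budget):
        window = rewards[p * period:(p + 1) * period]
        observe = max(1, int(len(window) / math.e))          # observation phase
        benchmark = max(window[:observe])
        pick = next((i for i, r in enumerate(window[observe:], start=observe)
                     if r > benchmark), len(window) - 1)     # else take the last item
        chosen.append(p * period + pick)
    return chosen

stream = np.random.default_rng(9).uniform(size=120)          # stand-in reward stream
print(periodic_threshold_select(stream, budget=4))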
APA, Harvard, Vancouver, ISO, and other styles
46

Okazawa, Yasuhiro. "The scientific rationality of early statistics, 1833-1877." Thesis, University of Cambridge, 2019. https://www.repository.cam.ac.uk/handle/1810/289440.

Full text
Abstract:
This thesis examines the activities of the Statistical Society of London (SSL) and its contribution to early statistics-conceived as the science of humans in society-in Britain. The SSL as a collective entity played a crucial role in the formation of early statistics, as statisticians envisaged early statistics as a collaborative scientific project and prompted large-scale observation, which required cooperation among numerous statistical observers. The first three chapters discuss how the SSL shaped the concepts, practices, and institutions of statistical data production. The SSL demonstrated how the use of a hierarchical division of labour and blank form minimised observers' leeway to exercise individual observational skills and ensured uniformity in the production of statistical facts. This arrangement effectively depreciated first-hand observation in statistics and allowed statisticians to rely on the statistical facts collected by other people. It prompted the SSL to launch the Journal of the Statistical Society of London to serve as a virtual storage of observed facts where one could share their data for further aggregation and retrieve that of others for their analysis. The statisticians also engaged in contemporaneous discussion on the best mode of a statistical office with a view towards producing complete and internationally comparable statistical facts. The SSL's endorsement of the Belgian Central Statistical Commission model and the International Statistical Congress was intended to support the introduction of uniformity into statistical data at both the national and international levels. The last two chapters of this thesis discuss how the SSL's activities contributed to the historical formation of human sciences and the emergence of social scientists. Statisticians demanded the recognition of a scientific field which, independent from natural science, studied people as social beings and whose discourses moulded the treatment of the people they studied. The SSL's activities helped statisticians not only establish their scientific expertise but also develop their unique scientific ethos. Statisticians learnt not to trust their personal observations since individuals could see only a partial, and potentially distorted, picture of society. Instead, statisticians disciplined themselves to patiently wait for the accumulation of statistical facts and analyse data in their entirety because this was the only way, they believed, to truly understand the complex relationships people had with each other. The SSL's activities assisted statisticians' conception of statistical fact and produced a new kind of intellectual inquirer who patiently collected statistical facts as the basis of knowing and intervening in people's lives.
APA, Harvard, Vancouver, ISO, and other styles
47

Almér, Henrik. "Machine learning and statistical analysis in fuel consumption prediction for heavy vehicles." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-172306.

Full text
Abstract:
I investigate how to use machine learning to predict fuel consumption in heavy vehicles. I examine data from several different sources describing road, vehicle, driver and weather characteristics, and I fit a regression for fuel consumption measured in liters per distance. The thesis was carried out for Scania and uses data sources available to Scania. I evaluate which machine learning methods are most successful, how the data collection frequency affects the prediction, and which features are most influential for fuel consumption. I find that a lower collection frequency of 10 minutes is preferable to a higher collection frequency of 1 minute. I also find that the evaluated models are comparable in their performance and that the most important features for fuel consumption are related to the road slope, vehicle speed and vehicle weight.
I investigate how machine learning can be used to predict fuel consumption in heavy vehicles. I examine data from several different sources describing road, vehicle, driver and weather characteristics. The collected data are used to find a regression for fuel consumption measured in liters per distance. The study was carried out on behalf of Scania, and I use data sources that are available to Scania. I evaluate which machine learning methods are best suited to the problem, how the collection frequency affects the result of the prediction, and which attributes in the data are most influential for fuel consumption. I find that a lower collection frequency of 10 minutes is preferable to a higher frequency of 1 minute. I also find that the evaluated models give comparable results and that the most important attributes relate to the road slope, the vehicle's speed and the vehicle's weight.
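As a small illustration of a regression on the three attribute groups found most important above (road slope, vehicle speed and vehicle weight), the sketch below fits a linear model to simulated data; the coefficients, units and data are assumptions of this example, not Scania's sources or the models evaluated in the thesis.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(10)
slope = rng.normal(0, 2, 500)            # percent road grade
speed = rng.uniform(60, 90, 500)         # km/h
weight = rng.uniform(20, 60, 500)        # tonnes
fuel = 15 + 2.0 * slope + 0.1 * speed + 0.3 * weight + rng.normal(0, 1, 500)  # liters per 100 km

X = np.column_stack([slope, speed, weight])
print("cross-validated R^2:", round(cross_val_score(LinearRegression(), X, fuel, cv=5, scoring="r2").mean(), 3))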
APA, Harvard, Vancouver, ISO, and other styles
48

Choi, Ickwon. "Computational Modeling for Censored Time to Event Data Using Data Integration in Biomedical Research." Case Western Reserve University School of Graduate Studies / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=case1307969890.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Anbalagan, Sindhuja. "On Occurrence Of Plagiarism In Published Computer Science Thesis Reports At Swedish Universities." Thesis, Högskolan Dalarna, Datateknik, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:du-5377.

Full text
Abstract:
In recent years, it has been observed that software clones and plagiarism are becoming an increasing threat to creativity. Clones are the result of copying and using others' work. According to the Merriam-Webster dictionary, "a clone is one that appears to be a copy of an original form"; it is a synonym for duplicate. Clones lead to redundancy of code, but not all redundant code is a clone. On the basis of this background knowledge, and in order to safeguard original ideas and to discourage intentional code duplication that passes off others' work as one's own, software clone detection should be emphasised more. The objective of this paper is to review methods for clone detection, to apply those methods to measure the extent of plagiarism in Master-level computer science theses at Swedish universities, and to analyse the results. The remainder of the paper discusses software plagiarism detection using a data analysis technique, followed by statistical analysis of the results. Plagiarism is the act of stealing the ideas and words of another person and passing them off as one's own. Using a data analysis technique, samples (Master-level computer science thesis reports) were taken from various Swedish universities and processed with the Ephorus anti-plagiarism detection software. Ephorus gives the percentage of plagiarism for each thesis document, and from these results a statistical analysis was carried out using Minitab software. The results give a very low percentage of plagiarism among the Swedish universities, which suggests that plagiarism is not a threat to Sweden's standard of education in computer science. This paper is based on data analysis, intelligence techniques, the Ephorus plagiarism detection tool and Minitab statistical software.
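As a generic illustration of one document-similarity signal that an anti-plagiarism check could use, the sketch below computes TF-IDF cosine similarities between short report snippets; this is not how the Ephorus service computes its percentages.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reports = [
    "software clones are the result of copying and reusing code",
    "clones in software result from copying and reusing code fragments",
    "statistical analysis of sensor data in heavy vehicles",
]
tfidf = TfidfVectorizer().fit_transform(reports)
print(cosine_similarity(tfidf).round(2))      # pairwise similarity matrix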
APA, Harvard, Vancouver, ISO, and other styles
50

Myers, James William. "Stochastic algorithms for learning with incomplete data an application to Bayesian networks /." Full text available online (restricted access), 1999. http://images.lib.monash.edu.au/ts/theses/Myers.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles