Dissertations on the topic "Statistical data science"
Format your source in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations for your research on the topic "Statistical data science".
Next to each work in the reference list there is an "Add to bibliography" button. Use it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the scholarly publication as a .pdf file and read its abstract online, whenever these are available in the metadata.
Browse dissertations from a wide range of disciplines and compile an accurate bibliography.
Alarcón, Soto Yovaninna. "Data science in HIV : statistical approaches for therapeutic HIV vaccine data." Doctoral thesis, Universitat Politècnica de Catalunya, 2021. http://hdl.handle.net/10803/672179.
This thesis contributes to data science by addressing biological problems relevant to the development of therapeutic vaccines for the Human Immunodeficiency Virus (HIV), through the modelling of data from three different clinical trials. Among the questions raised in these studies and addressed in this thesis are: identifying biomarkers to study risk factors for HIV viral rebound, explaining the time to viral rebound after cessation of antiretroviral therapy (cART) while accounting for the variability of the data sources, and studying the relationship between the spot size and spot count variables in enzyme-linked immunospot (ELISpot) assays. To approach each of these questions from a statistical perspective, we adapted an elastic net penalty for the accelerated failure time (AFT) model with interval-censored data, fitted a mixed-effects Cox model with interval-censored data, and improved existing statistical methodologies for handling ELISpot assay data and binary-response data, respectively. First, we addressed the problem of having more than five thousand messenger RNAs (mRNAs) available to explain the time to viral rebound. To do so, we considered an elastic net penalisation approach for the accelerated failure time model. This regularisation accommodates a possible correlation structure among the covariates, as occurs with mRNAs. We first derived the expression for the penalised likelihood function with an interval-censored response (time to viral rebound). We then maximised this function using different optimisation approaches and methods. Finally, we applied these methods to the DCV2 clinical trial and discussed different numerical approaches to maximising the likelihood.
Second, to explain the time to viral rebound we propose fitting a mixed-effects Cox model. Since the time to viral rebound is interval-censored, we use multiple imputation based on a truncated Weibull distribution. This model allows us to control for heterogeneity across analytical treatment interruption (ATI) studies and for the fact that patients have different numbers of ATI episodes. According to the simulation study we carried out, our method has desirable properties in terms of the accuracy and precision of the fixed-effects parameter estimators. Finally, we addressed two different problems within the BCN02 clinical trial. On the one hand, we fitted univariate log-binomial models as an alternative to classical logistic regression. On the other hand, we used an unbalanced ANOVA model to analyse the variability of the main ELISpot assay outcomes over time. Although ELISpot assays are often used in HIV research, the relationship between variables such as spot size, spot count and others had not been studied until now. In this thesis we proposed and developed different statistical approaches that answered biological questions raised in three clinical trials. This work highlights the importance of close collaboration among the members of a multidisciplinary scientific team, in order to determine the appropriate methodology, make correct clinical interpretations of the results and thereby contribute to meaningful scientific progress. We hope that the original results of this thesis will contribute to the development and evaluation of a therapeutic HIV vaccine, which would markedly help improve the quality of life of people living with HIV.
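The first contribution described in this entry — an elastic-net-penalised accelerated failure time (AFT) model for an interval-censored response — can be sketched in a few lines. Everything below is a hypothetical illustration: simulated data, a log-normal error assumption, and arbitrary penalty values, not the thesis's actual model or the DCV2 trial data.

```python
# Sketch: penalised likelihood for an AFT model with interval-censored times.
# Assumptions (not from the thesis): log-normal errors, lam=0.05, alpha=0.5.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 120, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.5])
log_t = X @ beta_true + rng.normal(scale=0.5, size=n)
# Interval censoring: each event time is only known to lie in (L, R].
L = np.exp(log_t) * rng.uniform(0.6, 0.9, size=n)
R = np.exp(log_t) * rng.uniform(1.1, 1.5, size=n)

def penalised_nll(params, lam=0.05, alpha=0.5):
    """Negative log-likelihood of a log-normal AFT model for interval-censored
    data, plus an elastic net penalty on the regression coefficients."""
    beta, sigma = params[:p], np.exp(params[p])
    mu = X @ beta
    # P(L < T <= R) when log T ~ Normal(mu, sigma^2)
    prob = norm.cdf((np.log(R) - mu) / sigma) - norm.cdf((np.log(L) - mu) / sigma)
    nll = -np.sum(np.log(np.clip(prob, 1e-300, None)))
    enet = lam * (alpha * np.abs(beta).sum() + (1 - alpha) / 2 * (beta ** 2).sum())
    return nll + n * enet

fit = minimize(penalised_nll, x0=np.zeros(p + 1), method="Powell")
beta_hat = fit.x[:p]
print(np.round(beta_hat, 2))
```

A derivative-free optimiser (Powell) is used here only because the L1 part of the penalty is non-smooth; the thesis discusses several numerical approaches to this maximisation.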
Bruno, Rexanne Marie. "Statistical Analysis of Survival Data." UNF Digital Commons, 1994. http://digitalcommons.unf.edu/etd/150.
Ramaboa, Kutlwano K. K. M. "A comparative evaluation of data mining classification techniques on medical trauma data." Master's thesis, University of Cape Town, 2004. http://hdl.handle.net/11427/5973.
The purpose of this research was to determine the extent to which a selection of data mining classification techniques (specifically, Discriminant Analysis, Decision Trees, and three artificial neural network models - Backpropagation, Probabilistic Neural Networks, and the Radial Basis Function) are able to correctly classify cases into the different categories of an outcome measure from a given set of input variables (i.e. estimate their classification accuracy) on a common database.
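The kind of comparison this abstract describes can be sketched as follows. The synthetic dataset and the scikit-learn model choices are assumptions standing in for the trauma database and the five original techniques.

```python
# Sketch: comparing classification accuracy of several techniques on one
# common dataset (synthetic stand-in data; model choices are illustrative).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "discriminant analysis": LinearDiscriminantAnalysis(),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "backpropagation ANN": MLPClassifier(hidden_layer_sizes=(16,),
                                         max_iter=2000, random_state=0),
}
# Fit each model on the same training split and score it on the same test split
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```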
Yuan, Yinyin. "Statistical inference from large-scale genomic data." Thesis, University of Warwick, 2009. http://wrap.warwick.ac.uk/1066/.
Guo, Danni. "Contributions to spatial uncertainty modelling in GIS : small sample data." Doctoral thesis, University of Cape Town, 2007. http://hdl.handle.net/11427/19031.
Environmental data are very costly and difficult to collect, and are often vague (subjective) or imprecise in nature (e.g. the hazard level of a pollutant may be classified as "harmful for human beings"). These practical realities (fuzziness and small datasets) lead to uncertainty, which is addressed by my research objective: "To model spatial environmental data with fuzzy uncertainty, and to explore the use of small sample data in spatial modelling predictions, within Geographic Information Systems (GIS)." The methodologies underlying the theoretical foundations for spatial modelling are examined, such as geostatistics, fuzzy mathematics, Grey System Theory, and (V,·) Credibility Measure Theory. Fifteen papers, including three journal papers, were written in contribution to the development of spatial fuzzy and grey uncertainty modelling, to which I contributed a share of 50 to 65%. The methods and theories are merged in these papers and applied to two datasets: PM10 air pollution data and soil dioxin data. The papers fall into two broad categories: fuzzy spatial GIS modelling and grey spatial GIS modelling. In fuzzy spatial GIS modelling, the fuzzy uncertainty (Zadeh, 1965) in environmental data is addressed. The thesis develops a fuzzy membership grades kriging approach by converting spatial modelling of fuzzy subsets into spatial modelling of membership grades. As this method develops, fuzzy membership grades kriging is placed on the foundation of credibility measure theory, and a fully data-assimilated membership function is approached in terms of the maximum fuzzy entropy principle. This variable modelling method for fuzzy data is a unique contribution to the fuzzy spatial GIS modelling literature. In grey spatial GIS modelling, spatial prediction using small sample data is addressed.
The thesis develops a Grey GIS modelling approach in which two-dimensional, order-less spatial observations are converted into two one-dimensional, ordered data sequences. The papers also explore foundational problems within the grey differential equation models (Deng, 1985). The coupling feature of grey differential equations, together with an e-similarity measure, is used to generalise the classical GM(1,1) model into wider classes of extended GM(1,1) models, in order to fully assimilate the sample data information. The development of grey spatial GIS modelling is a creative contribution to handling small sample data.
Smith, Jeremy Stewart. "A statistical approach to automated detection of multi-component radio sources." Master's thesis, Faculty of Science, 2021. http://hdl.handle.net/11427/32986.
Rao, Ashwani Pratap. "Statistical information retrieval models: experiments, evaluation on real time data." Thesis, University of Delaware, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1567821.
We are all aware of the rise of the information age: heterogeneous sources of information and the ability to publish rapidly and indiscriminately are responsible for information chaos. In this work, we are interested in a system which can separate the "wheat" of vital information from the chaff within this information chaos. An efficient filtering system can accelerate meaningful utilization of knowledge. Consider Wikipedia, an example of community-driven knowledge synthesis. Facts about topics on Wikipedia are continuously being updated by users interested in a particular topic. Consider an automatic system (or an invisible robot) to which a topic such as "President of the United States" can be fed. This system will work ceaselessly, filtering new information created on the web in order to provide the small set of documents about the "President of the United States" that are vital to keeping the Wikipedia page relevant and up-to-date. In this work, we present an automatic information filtering system for this task. While building such a system, we have encountered issues related to scalability, retrieval algorithms, and system evaluation; we describe our efforts to understand and overcome these issues.
Dikkala, Sai Nishanth. "Statistical inference from dependent data : networks and Markov chains." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/127016.
Cataloged from the official PDF of the thesis.
Includes bibliographical references (pages 259-270).
In recent decades, the study of high-dimensional probability has taken center stage within many research communities, including Computer Science, Statistics and Machine Learning. Very often, due to the process by which data is collected, the samples in a dataset have implicit correlations amongst them. Such correlations are commonly ignored as a first approximation when trying to analyze statistical and computational aspects of an inference task. In this thesis, we explore how to model such dependencies between samples using structured high-dimensional distributions which result from imposing a Markovian property on the joint distribution of the data, namely Markov Random Fields (MRFs) and Markov chains. On MRFs, we explore a quantification of the amount of dependence and we strengthen previously known measure-concentration results under a certain weak dependence condition on an MRF called the high-temperature regime. We then apply our novel measure-concentration bounds to improve the accuracy of samples computed according to a certain Markov Chain Monte Carlo procedure. We then show how to extend some classical results from statistical learning theory on PAC-learnability and uniform convergence to training data which is dependent under the high-temperature condition. Next, we explore the task of regression on data which is dependent according to an MRF under a stronger amount of dependence than the high-temperature condition allows. We then shift our focus to Markov chains, where we explore the question of testing whether a certain observed trajectory corresponds to a chain P or not. We discuss what a reasonable formulation of this problem is, and provide a tester which works without observing a trajectory whose length contains multiplicative factors of the mixing or covering time of the chain P. We conclude with some broad directions for further research on statistical inference under data dependence.
by Sai Nishanth Dikkala.
Ph.D. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
Hennessey, Anthony. "Statistical shape analysis of large molecular data sets." Thesis, University of Nottingham, 2018. http://eprints.nottingham.ac.uk/52088/.
Chaudhuri, Abon. "Geometric and Statistical Summaries for Big Data Visualization." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1382235351.
Chavali, Krishna Kumar. "Integration of statistical and neural network method for data analysis." Morgantown, W. Va. : [West Virginia University Libraries], 2006. https://eidr.wvu.edu/etd/documentdata.eTD?documentid=4749.
Title from document title page. Document formatted into pages; contains viii, 68 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 50-51).
Lai, Ian 1980. "A Web-based tutorial for statistical analysis of fMRI data." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/29669.
Includes bibliographical references (p. 59-63).
A dearth of educational material exists for functional magnetic resonance imaging (fMRI), a relatively new tool used in neuroscience research. A computer demonstration for understanding statistical analysis in fMRI was developed in Matlab, along with an accompanying tutorial for its users. The demo makes use of Dview, an existing software package for viewing 3D brain data, and utilizes precomputed data to improve interactivity. The demo and client were used in an HST graduate course in methods for acquisition and analysis of fMRI data. For wider accessibility, a Web-based version of the demo was designed with a client/server architecture. The Java client has a layered design for flexibility, and the Matlab server interfaces with Dview to take advantage of its functionality. The client and server communicate via a simple protocol through the Matlab Web Server. The Web-based version of the demo was implemented successfully. Future work includes implementation of additional demo features and expansion of the tutorial before dissemination to a wider group of medical and neuroscience researchers.
by Ian Lai.
M.Eng. and S.B.
Hong, Xinting. "INTEGRATED DATA INTEGRATION AND STATISTICAL ANALYSIS PLATFORM FOR MULTI-CENTER EPILEPSY RESEARCH." Case Western Reserve University School of Graduate Studies / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=case1562864784609067.
Malherbe, Chanel. "Fourier method for the measurement of univariate and multivariate volatility in the presence of high frequency data." Master's thesis, University of Cape Town, 2007. http://hdl.handle.net/11427/4386.
Matteusson, Theodor, and Niclas Persson. "Statistical Modelling of Plug-In Hybrid Fuel Consumption : A study using data science methods on test fleet driving data." Thesis, Umeå universitet, Institutionen för matematik och matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-172812.
The automotive industry is taking major technological steps to reduce emissions and combat climate change. To reduce reliance on fossil fuels, considerable research is being invested in electric motors (EM) and their applications. One such application is the plug-in hybrid electric vehicle (PHEV), in which an internal combustion engine (ICE) and an EM are used in combination, taking turns to propel the vehicle depending on the prevailing driving conditions. The main optimisation problem for a PHEV is deciding when to use which motor. If this optimisation is done with respect to emissions, the entire electric charge should be used up before the end of the trip. But if the charge is used too early, later parts of the trip, for which the optimal choice would have been the EM, must be driven with the ICE. To address this optimisation problem, we studied fuel consumption under different driving conditions. These driving conditions are characterised by hundreds of sensors that continuously collect data on the state of the vehicle while driving. From these data we constructed 150-second segments including, for example, vehicle speed, before new descriptive attributes were constructed for each segment, such as maximum vehicle speed. Using the characteristics of typical driving conditions specified by the Worldwide Harmonized Light Vehicles Test Cycle (WLTC), segments were labelled as highway or urban-road segments. To reduce the dimensionality of the data without losing information, principal component analysis and a Gaussian mixture model were used to reveal hidden structures in the data. Three machine-learning regression models were built and tested: a linear mixed model, a kernel ridge regression model with a linear kernel function, and a kernel ridge regression model with an RBF kernel function. By splitting the data into a training set and a test set, the three models were evaluated on data they had not been trained on.
R2, mean absolute error and mean squared error were used to evaluate the explanatory power of each model. The study shows that fuel consumption can be modelled from sensor data for a PHEV test fleet, with six attributes reaching an explanatory power (R2) of 0.5 and thus having the greatest influence on fuel consumption. It must be kept in mind that all data were collected during the Covid-19 outbreak, when travel patterns could not be considered normal, and that no regression model can explain the real world better than the underlying data do.
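The model comparison this entry describes — kernel ridge regression with a linear versus an RBF kernel — can be sketched minimally. The two-feature synthetic dataset below is a hypothetical stand-in for the segment attributes built from the test-fleet sensor data, and the kernel parameters are illustrative, not the thesis's settings.

```python
# Sketch: kernel ridge regression, linear vs RBF kernel, on synthetic data.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0, 60, size=(400, 2))            # two hypothetical segment attributes
# A nonlinear response: the first attribute acts through a smooth oscillation
y = 5 * np.sin(X[:, 0] / 8) + 0.05 * X[:, 1] + rng.normal(scale=0.3, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lin = KernelRidge(kernel="linear", alpha=1.0).fit(X_tr, y_tr)
rbf = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.01).fit(X_tr, y_tr)
r2_lin = r2_score(y_te, lin.predict(X_te))
r2_rbf = r2_score(y_te, rbf.predict(X_te))
print(f"linear kernel R^2 = {r2_lin:.2f}, RBF kernel R^2 = {r2_rbf:.2f}")
```

On data with a genuinely nonlinear signal, the RBF kernel's held-out R^2 exceeds that of the linear kernel, which is the kind of contrast the study evaluates.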
Blocker, Alexander Weaver. "Distributed and Multiphase Inference in Theory and Practice: Principles, Modeling, and Computation for High-Throughput Science." Thesis, Harvard University, 2013. http://dissertations.umi.com/gsas.harvard:10977.
Statistics
Chatora, Tinashe. "Joint models for nonlinear longitudinal profiles in the presence of informative censoring." Doctoral thesis, University of Cape Town, 2018. http://hdl.handle.net/11427/29564.
Yamangil, Elif. "Rich Linguistic Structure from Large-Scale Web Data." Thesis, Harvard University, 2013. http://dissertations.umi.com/gsas.harvard:11162.
Engineering and Applied Sciences
Scholz, Stefan [Verfasser]. "Dealing with uncertainty in health economic decision modeling. Applying statistical and data science methods / Stefan Scholz." Bielefeld : Universitätsbibliothek Bielefeld, 2021. http://d-nb.info/1241740089/34.
Muller, Christoffel Joseph Brand. "Bayesian approaches of Markov models embedded in unbalanced panel data." Thesis, Stellenbosch : Stellenbosch University, 2012. http://hdl.handle.net/10019.1/71910.
ENGLISH ABSTRACT: Multi-state models are used in this dissertation to model panel data, also known as longitudinal or cross-sectional time-series data. These are data sets which include units that are observed across two or more points in time. These models have been used extensively in medical studies where the disease states of patients are recorded over time. A theoretical overview of the current multi-state Markov models as applied to panel data is presented and, based on this theory, a simulation procedure is developed to generate panel data sets for given Markov models. Through the use of this procedure a simulation study is undertaken to investigate the properties of the standard likelihood approach when fitting Markov models and to assess its shortcomings. One of the main shortcomings highlighted by the simulation study is the unstable estimates obtained by the standard likelihood models, especially when fitted to small data sets. A Bayesian approach is introduced to develop multi-state models that can overcome these unstable estimates by incorporating prior knowledge into the modelling process. Two Bayesian techniques are developed and presented, and their properties are assessed through the use of extensive simulation studies. Firstly, Bayesian multi-state models are developed by specifying prior distributions for the transition rates, constructing a likelihood using standard Markov theory and then obtaining the posterior distributions of the transition rates. A selected few priors are used in these models. Secondly, Bayesian multi-state imputation techniques are presented that make use of suitable prior information to impute missing observations in the panel data sets. Once imputed, standard likelihood-based Markov models are fitted to the imputed data sets to estimate the transition rates. Two different Bayesian imputation techniques are presented.
The first approach makes use of the Dirichlet distribution and imputes the unknown states at all time points with missing observations. The second approach uses a Dirichlet process to estimate the time at which a transition occurred between two known observations, and a state is then imputed at that estimated transition time. The simulation studies show that these Bayesian methods produce more stable results, even when only small samples are available.
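The simulation procedure described in the abstract above — generating panel data from a multi-state Markov model with known transition rates — can be sketched roughly as follows. The three-state rate matrix (with one absorbing state) and the observation scheme are illustrative assumptions, not those of the dissertation.

```python
# Sketch: simulate panel observations from a continuous-time Markov model,
# then recover the one-step transition matrix empirically.
import numpy as np
from scipy.linalg import expm

Q = np.array([[-0.3, 0.2, 0.1],     # transition intensity (rate) matrix
              [0.1, -0.4, 0.3],
              [0.0, 0.0, 0.0]])     # third state is absorbing
P = expm(Q * 1.0)                   # transition probabilities over one time unit

rng = np.random.default_rng(2)
n_units, n_times = 500, 6
panel = np.zeros((n_units, n_times), dtype=int)   # all units start in state 0

def sample_next(state):
    prob = np.clip(P[state], 0.0, None)           # guard tiny numerical negatives
    return rng.choice(3, p=prob / prob.sum())

for t in range(1, n_times):
    for i in range(n_units):
        panel[i, t] = sample_next(panel[i, t - 1])

# Empirical one-step transition matrix recovered from the simulated panel
counts = np.zeros((3, 3))
for t in range(1, n_times):
    for i in range(n_units):
        counts[panel[i, t - 1], panel[i, t]] += 1
P_hat = counts / counts.sum(axis=1, keepdims=True)
print(np.round(P_hat, 2))
```

With many units, the empirical matrix sits close to `expm(Q)`; fitting methods like those studied in the dissertation invert this relationship to estimate the rates from observed panels.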
Lienhard, Jasper Z. (Jasper Zebulon). "What is measured is managed : statistical analysis of compositional data towards improved materials recovery." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/98661.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 35-36).
As materials consumption increases globally, minimizing the end-of-life impact of solid waste has become a critical challenge. Cost-effective methods of quantifying and tracking municipal solid waste contents and disposal processes are necessary to drive and track increases in material recovery and recycling. This work presents an algorithm for estimating the average quantity and composition of municipal waste produced by individual locations. Mass fraction confidence intervals for different types of waste were calculated from data collected by sorting and weighing waste samples from municipal sites. This algorithm recognizes the compositional nature of mass fraction waste data. The algorithm developed in this work also evaluated the value of additional waste samples in refining mass fraction confidence intervals. Additionally, a greenhouse gas emissions model compared carbon dioxide emissions for different disposal methods of waste, in particular landfilling and recycling, based on the waste stream. This allowed for identification of recycling opportunities based on carbon dioxide emission savings from offsetting the need for primary materials extraction. Casework was conducted with this methodology using site-specific waste audit data from industry. The waste streams and carbon dioxide emissions of three categories of municipal waste producers, retail, commercial, and industrial, were compared. Paper and plastic products, whose mass fraction averages ranged from 40% to 52% and 26% to 29%, respectively, dominated the waste streams of these three industries. Average carbon dioxide emissions in each of these three industries ranged from 2.18 kg of CO₂ to 2.5 kg of CO₂ per kilogram of waste thrown away. On average, Americans throw away about 2 kilograms per person per day of solid waste.
by Jasper Z. Lienhard.
S.B.
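One way to respect the compositional (sum-to-one) nature of mass-fraction data, which the abstract above emphasises, is to build confidence intervals in log-ratio coordinates and map them back to fractions. The sketch below uses the additive log-ratio transform on simulated waste-sorting samples; the thesis's own algorithm may differ.

```python
# Sketch: a compositional confidence interval for an average waste composition
# via the additive log-ratio (alr) transform (simulated samples, not thesis data).
import numpy as np

rng = np.random.default_rng(3)
# Each row: mass fractions (paper, plastic, other) from one sorted waste sample
samples = rng.dirichlet([45, 27, 28], size=40)

alr = np.log(samples[:, :-1] / samples[:, -1:])          # alr coordinates
mean = alr.mean(axis=0)
se = alr.std(axis=0, ddof=1) / np.sqrt(len(alr))
lo, hi = mean - 1.96 * se, mean + 1.96 * se              # 95% CI in alr space

def alr_inv(z):
    """Map alr coordinates back to mass fractions that sum to one."""
    e = np.append(np.exp(z), 1.0)
    return e / e.sum()

print("centre composition:", np.round(alr_inv(mean), 3))
print("interval endpoints:", np.round(alr_inv(lo), 3), np.round(alr_inv(hi), 3))
```

Working in log-ratio space guarantees that the centre and both interval endpoints are valid compositions (positive, summing to one), which naive per-component intervals do not.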
Ntushelo, Nombasa Sheroline. "Exploratory and inferential multivariate statistical techniques for multidimensional count and binary data with applications in R." Thesis, Stellenbosch : Stellenbosch University, 2011. http://hdl.handle.net/10019.1/17949.
ENGLISH ABSTRACT: The analysis of multidimensional (multivariate) data sets is a very important area of research in applied statistics. Over the decades many techniques have been developed to deal with such datasets. The multivariate techniques that have been developed include inferential analysis, regression analysis, discriminant analysis, cluster analysis and many more exploratory methods. Most of these methods deal with cases where the data contain numerical variables. However, there are powerful methods in the literature that also deal with multidimensional binary and count data. The primary purpose of this thesis is to discuss the exploratory and inferential techniques that can be used for binary and count data. In Chapter 2 of this thesis we give the details of correspondence analysis and canonical correspondence analysis. These methods are used to analyse the data in contingency tables. Chapter 3 is devoted to cluster analysis. In this chapter we explain four well-known clustering methods and also discuss the distance (dissimilarity) measures available in the literature for binary and count data. Chapter 4 contains an explanation of metric and non-metric multidimensional scaling. These methods can be used to represent binary or count data in a lower-dimensional Euclidean space. In Chapter 5 we present a method for inferential analysis called the analysis of distance. This method uses reasoning similar to that of the analysis of variance, but the inference is based on a pseudo F-statistic, with the p-value obtained using permutations of the data. Chapter 6 contains real-world applications of the above methods to two data sets, the Biolog data and the Barents Fish data. The secondary purpose of the thesis is to demonstrate how the above techniques can be performed in the software package R. Several R packages and functions are discussed throughout this thesis. The usage of these functions is also demonstrated with appropriate examples.
Attention is also given to the interpretation of the output and graphics. The thesis ends with some general conclusions and ideas for further research.
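The analysis-of-distance method mentioned for Chapter 5 can be sketched as a pseudo F-statistic computed from a distance matrix, with a p-value from label permutations. The thesis demonstrates this in R; the illustration below uses Python on synthetic two-group data, and the Gower-style decomposition shown is one standard formulation rather than necessarily the thesis's exact one.

```python
# Sketch: analysis of distance — pseudo-F on a distance matrix, permutation p-value.
import numpy as np

rng = np.random.default_rng(4)
# Two groups of 15 observations in 4 dimensions, shifted apart
X = np.vstack([rng.normal(0.0, 1, size=(15, 4)), rng.normal(1.2, 1, size=(15, 4))])
labels = np.array([0] * 15 + [1] * 15)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)    # squared Euclidean distances

def pseudo_F(D2, labels):
    n = len(labels)
    groups = np.unique(labels)
    sst = D2[np.triu_indices(n, 1)].sum() / n          # total sum of squares
    ssw = 0.0                                          # within-group sum of squares
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = D2[np.ix_(idx, idx)]
        ssw += sub[np.triu_indices(len(idx), 1)].sum() / len(idx)
    return ((sst - ssw) / (len(groups) - 1)) / (ssw / (n - len(groups)))

F_obs = pseudo_F(D2, labels)
perm = [pseudo_F(D2, rng.permutation(labels)) for _ in range(499)]
p_value = (1 + sum(f >= F_obs for f in perm)) / 500
print(f"pseudo-F = {F_obs:.2f}, p = {p_value:.3f}")
```

Because the null distribution comes from relabelling the same distance matrix, the test needs no distributional assumptions, which is what makes it suitable for binary and count data.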
Khakipoor, Banafsheh. "Applied Science for Water Quality Monitoring." University of Akron / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=akron1595858677325397.
Comerford, Michael. "Statistical disclosure control : an interdisciplinary approach to the problem of balancing privacy risk and data utility." Thesis, University of Glasgow, 2014. http://theses.gla.ac.uk/7044/.
Lauretig, Adam M. "Natural Language Processing, Statistical Inference, and American Foreign Policy." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1562147711514566.
Kobakian, Stephanie Rose. "New algorithms for effectively visualising Australian spatio-temporal disease data." Thesis, Queensland University of Technology, 2020. https://eprints.qut.edu.au/203908/1/Stephanie_Kobakian_Thesis.pdf.
Akinc, Deniz. "Statistical Modelling Of Financial Statements Of Turkey: A Panel Data Analysis." Master's thesis, METU, 2008. http://etd.lib.metu.edu.tr/upload/2/12609824/index.pdf.
In countries such as Turkey, the statistical methods used for this purpose involve single-level models applied to cross-sectional data. However, multilevel models applied to panel data are preferable, as they gather more information and make the calculated financial success probabilities more trustworthy. In this thesis, publicly available panel data collected from the Istanbul Stock Exchange are investigated. Mainly, the financial success of companies from two sectors, namely industry and services, is investigated. For the analysis of this panel data, data exploration methods, missing data imputation, possible solutions to the multicollinearity problem, single-level logistic regression models and multilevel models are used. With these models, financial success probabilities for each company are calculated; the factors related to financial failure are determined, and changes over time are observed. Models and early warning systems resulted in correct classification rates of up to 100%. In the services sector, the small number of companies with publicly available data leads to a decline in the success of the models. It is concluded that sharing data with academicians, observed for more subjects over a longer time period and collected in the same format, will result in better justified outputs, useful for both academicians and managers.
Offei, Felix. "Denoising Tandem Mass Spectrometry Data." Digital Commons @ East Tennessee State University, 2017. https://dc.etsu.edu/etd/3218.
Full text available
Hansson, Lisbeth. "Statistical Considerations in the Analysis of Matched Case-Control Studies. With Applications in Nutritional Epidemiology." Doctoral thesis, Uppsala University, Department of Information Science, 2001. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-1092.
Повний текст джерелаThe case-control study is one of the most frequently used study designs in analytical epidemiology. This thesis focuses on some methodological aspects in the analysis of the results from this kind of study.
A population based case-control study was conducted in northern Norway and central Sweden in order to study the associations of several potential risk factors with thyroid cancer. Cases and controls were individually matched and the information on the factors under study was provided by means of a self-completed questionnaire. The analysis was conducted with logistic regression. No association was found with pregnancies, oral contraceptives and hormone replacement after menopause. Early pregnancy and artificial menopause were associated with an increased risk, and cigarette smoking with a decreased risk, of thyroid cancer (paper I). The relation with diet was also examined. High consumption with fat- and starch-rich diet was associated with an increased risk (paper II).
Conditional and unconditional maximum likelihood estimations of the parameters in a logistic regression were compared through a simulation study. Conditional estimation had higher root mean square error but better model fit than unconditional, especially for 1:1 matching, with relatively little effect of the proportion of missing values (paper III). Two common approaches to handle partial non-response in a questionnaire when calculating nutrient intake from diet variables were compared. In many situations it is reasonable to interpret the omitted self-reports of food consumption as indication of "zero-consumption" (paper IV).
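The comparison in paper III rests on the fact that, for a 1:1 matched design, conditional maximum likelihood reduces to an intercept-free logistic regression on the within-pair exposure differences. A minimal simulation sketch (hypothetical, not the paper's code; all parameters are illustrative) is:

```python
# Hypothetical simulation: conditional ML for a 1:1 matched case-control design,
# estimated as an intercept-free logit on within-pair exposure differences.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n_pairs, beta = 500, 1.0

# Exposure for the two members of each matched pair.
x1 = rng.normal(0.0, 1.0, n_pairs)
x2 = rng.normal(0.0, 1.0, n_pairs)
d = x1 - x2

# Given exactly one case per pair, P(member 1 is the case) is logistic in d.
p1 = 1.0 / (1.0 + np.exp(-beta * d))
y = rng.binomial(1, p1)  # 1 if member 1 is the case

# Negative conditional log-likelihood of the pair outcomes.
def nll(b):
    return np.sum(np.log1p(np.exp(-(2 * y - 1) * b * d)))

res = minimize_scalar(nll, bounds=(-10.0, 10.0), method="bounded")
print(f"true beta = {beta}, conditional ML estimate = {res.x:.2f}")
```

With 500 pairs the estimate lands close to the true coefficient; the unconditional analogue (one dummy intercept per pair) is the variant the paper finds inferior for 1:1 matching.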
The reproducibility of dietary reports was presented and problems in its measurement and analysis were discussed. The most advisable approach to measuring repeatability is to compare different correlation methods. Among the factors affecting reproducibility, frequency and homogeneity of consumption are presumably the most important (paper V). Nutrient variables often have a mixed distribution, so transformation to normality can be troublesome. When analysing nutrients we therefore recommend comparing the result of a parametric test with that of an analogous distribution-free test. Different methods of transforming nutrient variables to achieve normality were discussed (paper VI).
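The recommendation in paper VI, running a parametric test alongside an analogous distribution-free test on a skewed nutrient variable, can be illustrated with a minimal sketch (synthetic data, not from the thesis; group sizes and distribution parameters are assumptions):

```python
# Illustrative comparison of a parametric test with its distribution-free
# analogue on a right-skewed "nutrient intake" variable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two groups with lognormal (right-skewed) intake; group B shifted upward.
group_a = rng.lognormal(mean=3.0, sigma=0.6, size=200)
group_b = rng.lognormal(mean=3.2, sigma=0.6, size=200)

# Parametric test: Welch's t-test on the raw, non-normal data.
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Distribution-free analogue: Mann-Whitney U test.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t-test p = {t_p:.4f}")
print(f"Mann-Whitney U p = {u_p:.4f}")
```

When the two p-values lead to different conclusions, the skewness of the variable, rather than the effect itself, is the likely culprit, which is exactly the situation the thesis warns about.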
Robbin, Alice, and Lee Frost-Kumpf. "Extending theory for user-centered information systems: Diagnosing and learning from error in complex statistical data." John Wiley & Sons, Inc, 1997. http://hdl.handle.net/10150/105746.
Full text available
Zaremba, Wojciech. "Modeling the variability of EEG/MEG data through statistical machine learning." Habilitation à diriger des recherches, Ecole Polytechnique X, 2012. http://tel.archives-ouvertes.fr/tel-00803958.
Full text available
Mendez, Kevin M. "Deriving statistical inference from the application of artificial neural networks to clinical metabolomics data." Thesis, Edith Cowan University, Research Online, Perth, Western Australia, 2020. https://ro.ecu.edu.au/theses/2296.
Full text available
Holmgren, Rachelle. "Challenges Involved in the Automation of Regression Analysis." Scholarship @ Claremont, 2016. http://scholarship.claremont.edu/cmc_theses/1405.
Full text available
Hazarika, Subhashis. "Statistical and Machine Learning Approaches For Visualizing and Analyzing Large-Scale Simulation Data." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1574692702479196.
Full text available
Hlongwane, Rivalani Willie. "Selecting the best model for predicting a term deposit product take-up in banking." Master's thesis, University of Cape Town, 2018. http://hdl.handle.net/11427/29789.
Full text available
D'Antuono, Damiano. "Torque-based statistical analysis for condition monitoring of automatic machines." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019.
Find full text
Gustafson, Fredrik, and Marcus Lindahl. "Evaluation of Statistical Distributions for VoIP Traffic Modelling." Thesis, University West, Department of Economics and IT, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:hv:diva-1643.
Full text available
Statistical distributions are used to model the behaviour of real VoIP traffic. We investigate call holding and inter-arrival times as well as speech patterns. The consequences of using an inappropriate model for network dimensioning are briefly discussed. Visual examination is used to compare well-known distributions with empirical data. Our results support the general opinion that the Exponential distribution is not appropriate for modelling call holding time. We find that the distribution of talkspurt periods is well modelled by the Lognormal distribution and the silence periods by the generalized Pareto distribution. It is also observed that the call inter-arrival times tend to follow a heavy-tailed distribution.
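The distributional findings above can be illustrated with a small sketch (the parameters are purely illustrative, not the thesis's measurements) that draws synthetic talkspurt and silence durations from the reported families and recovers the parameters by maximum-likelihood fitting:

```python
# Illustrative sketch: sample talkspurt durations from a Lognormal and silence
# durations from a generalized Pareto, then fit each family back to the data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Talkspurt durations (seconds): Lognormal with illustrative parameters.
talkspurts = rng.lognormal(mean=0.0, sigma=0.8, size=5000)

# Silence durations (seconds): generalized Pareto; shape c > 0 gives a heavy tail.
silences = stats.genpareto.rvs(c=0.2, scale=1.5, size=5000, random_state=rng)

# Maximum-likelihood fits with the location fixed at zero.
ln_shape, _, ln_scale = stats.lognorm.fit(talkspurts, floc=0)
gp_shape, _, gp_scale = stats.genpareto.fit(silences, floc=0)

print(f"lognormal: sigma ~= {ln_shape:.2f}, scale ~= {ln_scale:.2f}")
print(f"genpareto: shape ~= {gp_shape:.2f}, scale ~= {gp_scale:.2f}")
```

The same fit-and-inspect loop, applied to measured traffic traces instead of synthetic draws, is the kind of visual and numerical comparison the abstract describes.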
Percival, Colin. "Matching with mismatches and assorted applications." Thesis, University of Oxford, 2006. http://ora.ox.ac.uk/objects/uuid:4f0d53cc-fb9f-4246-a835-3c8734eba735.
Full text available
Ebrahimvandi, Alireza. "Three Essays on Analysis of U.S. Infant Mortality Using Systems and Data Science Approaches." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/96266.
Full text available
Doctor of Philosophy
The U.S. infant mortality rate (IMR) is 71% higher than the average rate for comparable countries in the Organization for Economic Co-operation and Development (OECD). High infant mortality and preterm birth rates (PBR) are major public health concerns in the U.S. A wide range of studies have focused on understanding the causes and risk factors of infant mortality and the interventions that can reduce it. However, infant mortality is a complex phenomenon that challenges the effectiveness of interventions, and the IMR and PBR in the U.S. are still higher than in any other advanced OECD nation. I believe that systems and data science methods can enhance our understanding of infant mortality causes, risk factors, and effective interventions. There are more than 130 diagnoses (causes) of infant mortality, so tracking the causes of infant mortality trends across 50 states over a long time period is very challenging. In the first essay, I focus on the medical aspects of infant mortality to find the causes that helped reduce infant mortality rates in certain states from 2000 to 2015. In addition, I investigate the relationship between different risk factors and infant mortality in a regression model to identify significant correlations. This study provides critical recommendations to policymakers in states with high infant mortality rates and guides them on leveraging appropriate interventions. Preterm birth (PTB) is the most significant contributor to the IMR. The first study showed that reductions in infant mortality happened in states that reduced their preterm births. There exists a considerable body of literature on identifying PTB risk factors in order to find possible explanations for the consistently high rates of PTB and IMR in the U.S. However, these studies have fallen short in two key areas: generalizability and the ability to detect PTB in early pregnancy.
In the second essay, I investigate a wide range of risk factors in the largest obstetric population ever studied in PTB research. The predictors in this study range from environmental (e.g., air pollution) to medical (e.g., history of hypertension) factors. The objective is to increase the understanding of factors that are both generalizable and identifiable during the early stage of pregnancy. I implemented state-of-the-art statistical and machine learning techniques and improved the performance measures compared to previous studies. The results of this study reveal the importance of socioeconomic factors such as parental education, which can be as important as biomedical indicators like the mother's body mass index in predicting preterm delivery. The second study showed an important relationship between socioeconomic factors such as education and major health outcomes such as preterm birth. Short-term interventions that focus on improving the socioeconomic status of a mother during pregnancy have limited to no effect on birth outcomes. Therefore, we need more comprehensive approaches that shift the focus from medical interventions during pregnancy to the time when mothers become vulnerable to the risk factors of PTB. Hence, the third study uses a systems approach to explore the dynamics of health over time. This novel study enhances our understanding of the complex interactions between health and socioeconomic factors over time. I explore why some communities experience a downward spiral of health deterioration, how resources are generated and allocated, how the generation and allocation mechanisms are interconnected, and why we see significantly different health outcomes across otherwise similar states. I use Ohio as the case study because it suffers from poor health outcomes despite having one of the best healthcare systems in the nation.
The results identify the trap of health expenditure and show how an external financial shock can exacerbate health and socioeconomic conditions in such a community. I demonstrate how overspending or underspending in healthcare can affect a society's health outcomes in the long term. Overall, this dissertation contributes to a better understanding of the complexities associated with major health issues in the U.S. I provide health professionals with theoretical and empirical foundations of risk assessment for reducing infant mortality and preterm birth. In addition, this study provides a systems perspective on the health deterioration that many communities in the U.S. are experiencing, and I hope that this perspective improves policymakers' decision-making.
Yildiz, Meliha Yetisgen. "Using statistical and knowledge-based approaches for literature-based discovery /." Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/7178.
Full text available
Barra, Hugo Botelho 1976. "Evaluating the implementation of new services models in the financial advisory industry : a statistical data mining and system dynamics approach." Thesis, Massachusetts Institute of Technology, 2002. http://hdl.handle.net/1721.1/8067.
Full text available
Includes bibliographical references (p. 74).
Program Alpha is a new business practice model designed to increase the service quality and productivity of one of the world's largest financial services organizations by implementing structured time management and a disciplined client and prospect contract process. This thesis evaluates the business impact of this program quantitatively and qualitatively by developing and applying two analytical frameworks. We first develop a System Dynamics framework for interpreting qualitative information collected through interviews, focus groups and surveys, which measures the impact of Program Alpha from operational, organizational and behavioral perspectives. Secondly, we present a Statistical Data Mining framework for interpreting quantitative financial and customer preference information. Using this framework, we generate a preliminary set of algorithmic guidelines for improving Program Alpha in future deployment stages. These guidelines, based on statistical learning algorithms applied to historical data, aim to streamline the client segmentation process at the core of Program Alpha.
by Hugo Botelho Barra.
M.Eng.
Trutschel, Diana [Verfasser], Ivo [Gutachter] Grosse, Steffen [Gutachter] Neumann, and André [Gutachter] Scherag. "Multivariate statistical methods to analyse multidimensional data in applied life science : [kumulative Dissertation] / Diana Trutschel ; Gutachter: Ivo Grosse, Steffen Neumann, André Scherag." Halle (Saale) : Universitäts- und Landesbibliothek Sachsen-Anhalt, 2019. http://d-nb.info/1210731126/34.
Full text available
Full text available
Hechter, Trudie. "A comparison of support vector machines and traditional techniques for statistical regression and classification." Thesis, Stellenbosch : Stellenbosch University, 2004. http://hdl.handle.net/10019.1/49810.
Повний текст джерелаENGLISH ABSTRACT: Since its introduction in Boser et al. (1992), the support vector machine has become a popular tool in a variety of machine learning applications. More recently, the support vector machine has also been receiving increasing attention in the statistical community as a tool for classification and regression. In this thesis support vector machines are compared to more traditional techniques for statistical classification and regression. The techniques are applied to data from a life assurance environment for a binary classification problem and a regression problem. In the classification case the problem is the prediction of policy lapses using a variety of input variables, while in the regression case the goal is to estimate the income of clients from these variables. The performance of the support vector machine is compared to that of discriminant analysis and classification trees in the case of classification, and to that of multiple linear regression and regression trees in regression, and it is found that support vector machines generally perform well compared to the traditional techniques.
AFRIKAANSE OPSOMMING (Afrikaans summary): Since the introduction of the support vector machine in Boser et al. (1992), it has become a popular technique in a variety of machine learning applications. More recently, the support vector machine has also begun to receive more attention in the statistical community as a technique for classification and regression. In this thesis, support vector machines are compared with more traditional techniques for statistical classification and regression. The techniques are applied to data from a life assurance environment for a binary classification problem as well as a regression problem. In the classification case the problem is the prediction of policy lapses using a variety of input variables, while in the regression case the aim is to predict the income of clients using these variables. The results of the support vector machine are compared with those of discriminant analysis and classification trees in the classification case, and with multiple linear regression and regression trees in the regression case. The conclusion is that support vector machines generally perform well in comparison with the traditional techniques.
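The kind of head-to-head comparison the thesis performs can be sketched on synthetic data (hypothetical data and settings, not the thesis's life-assurance dataset) by fitting a support vector machine and a classification tree on the same train/test split:

```python
# Illustrative comparison of an SVM and a classification tree on a
# synthetic binary classification problem.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a binary problem such as policy-lapse prediction.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

print(f"SVM  test accuracy: {svm.score(X_te, y_te):.3f}")
print(f"Tree test accuracy: {tree.score(X_te, y_te):.3f}")
```

For the regression side of the comparison, `SVR` and `DecisionTreeRegressor` slot into the same pattern.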
Flaspohler, Genevieve Elaine. "Statistical models and decision making for robotic scientific information gathering." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/120607.
Full text available
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 97-107).
Mobile robots and autonomous sensors have seen increasing use in scientific applications, from planetary rovers surveying for signs of life on Mars, to environmental buoys measuring and logging oceanographic conditions in coastal regions. This thesis makes contributions in both planning algorithms and model design for autonomous scientific information gathering, demonstrating how theory from machine learning, decision theory, theory of optimal experimental design, and statistical inference can be used to develop online algorithms for robotic information gathering that are robust to modeling errors, account for spatiotemporal structure in scientific data, and have probabilistic performance guarantees. This thesis first introduces a novel sample selection algorithm for online, irrevocable sampling in data streams that have spatiotemporal structure, such as those that commonly arise in robotics and environmental monitoring. Given a limited sampling capacity, the proposed periodic secretary algorithm uses an information-theoretic reward function to select samples in real-time that maximally reduce posterior uncertainty in a given scientific model. Additionally, we provide a lower bound on the quality of samples selected by the periodic secretary algorithm by leveraging the submodularity of the information-theoretic reward function. Finally, we demonstrate the robustness of the proposed approach by employing the periodic secretary algorithm to select samples irrevocably from a seven-year oceanographic data stream collected at the Martha's Vineyard Coastal Observatory off the coast of Cape Cod, USA. Secondly, we consider how scientific models can be specified in environments - such as the deep sea or deep space - where domain scientists may not have enough a priori knowledge to formulate a formal scientific model and hypothesis. 
These domains require scientific models that start with very little prior information and construct a model of the environment online as observations are gathered. We propose unsupervised machine learning as a technique for science model-learning in these environments. To this end, we introduce a hybrid Bayesian-deep learning model that learns a nonparametric topic model of a visual environment. We use this semantic visual model to identify observations that are poorly explained in the current model, and show experimentally that these highly perplexing observations often correspond to scientifically interesting phenomena. On a marine dataset collected by the SeaBED AUV on the Hannibal Sea Mount, images of high perplexity in the learned model corresponded, for example, to a scientifically novel crab congregation in the deep sea. The approaches presented in this thesis capture the depth and breadth of the problems facing the field of autonomous science. Developing robust autonomous systems that enhance our ability to perform exploratory science in environments such as the oceans, deep space, agricultural and disaster-relief zones will require insight and techniques from classical areas of robotics, such as motion and path planning, mapping, and localization, and from other domains, including machine learning, spatial statistics, optimization, and theory of experimental design. This thesis demonstrates how theory and practice from these diverse disciplines can be unified to address problems in autonomous scientific information gathering.
by Genevieve Elaine Flaspohler.
S.M.
Okazawa, Yasuhiro. "The scientific rationality of early statistics, 1833-1877." Thesis, University of Cambridge, 2019. https://www.repository.cam.ac.uk/handle/1810/289440.
Full text available
Almér, Henrik. "Machine learning and statistical analysis in fuel consumption prediction for heavy vehicles." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-172306.
Full text available
I investigate how machine learning can be used to predict fuel consumption in heavy vehicles. I examine data from several different sources describing road, vehicle, driver and weather characteristics. The collected data are used to fit a regression for fuel consumption measured in litres per distance travelled. The study is carried out on behalf of Scania, and I use data sources available to Scania. I evaluate which machine learning methods are best suited to the problem, how the sampling frequency affects the prediction results, and which attributes in the data are most influential for fuel consumption. I find that a lower sampling frequency of 10 minutes is preferable to a higher frequency of 1 minute. I also find that the evaluated models give comparable results, and that the most important attributes relate to the road gradient, the vehicle's speed and the vehicle's weight.
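The finding that gradient, speed and weight dominate can be mimicked on synthetic data (a hypothetical sketch, not Scania's data or the thesis's models; coefficients and ranges are invented) by fitting a tree ensemble and inspecting feature importances:

```python
# Illustrative sketch: regress synthetic fuel consumption on road gradient,
# vehicle speed and vehicle weight, then inspect feature importances.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 2000
gradient = rng.normal(0.0, 2.0, n)   # road gradient (%)
speed = rng.uniform(40.0, 90.0, n)   # vehicle speed (km/h)
weight = rng.uniform(20.0, 60.0, n)  # gross vehicle weight (tonnes)

# Synthetic fuel consumption (l/100 km), dominated by gradient and weight.
fuel = (20.0 + 3.0 * gradient + 0.35 * weight + 0.05 * speed
        + rng.normal(0.0, 1.0, n))

X = np.column_stack([gradient, speed, weight])
model = GradientBoostingRegressor(random_state=0).fit(X, fuel)

for name, imp in zip(["gradient", "speed", "weight"], model.feature_importances_):
    print(f"{name:8s} importance: {imp:.2f}")
```

On real telematics data the same importance inspection, run at 1-minute versus 10-minute aggregation, would reproduce the thesis's sampling-frequency comparison.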
Choi, Ickwon. "Computational Modeling for Censored Time to Event Data Using Data Integration in Biomedical Research." Case Western Reserve University School of Graduate Studies / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=case1307969890.
Full text available
Anbalagan, Sindhuja. "On Occurrence Of Plagiarism In Published Computer Science Thesis Reports At Swedish Universities." Thesis, Högskolan Dalarna, Datateknik, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:du-5377.
Full text available
Myers, James William. "Stochastic algorithms for learning with incomplete data an application to Bayesian networks /." Full text available online (restricted access), 1999. http://images.lib.monash.edu.au/ts/theses/Myers.pdf.
Full text available