Dissertations on the topic "Statistics and Computer Science"
Format your citation in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations for your research on the topic "Statistics and Computer Science".
Next to each entry in the list you will find an "Add to bibliography" button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the publication as a .pdf file and read its abstract online, whenever these are available in the metadata.
Browse dissertations from a wide range of disciplines and compile an accurate bibliography.
Raj, Alvin Andrew. "Ambiguous statistics - how a statistical encoding in the periphery affects perception." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/79214.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 159-163).
Recent understanding in human vision suggests that the periphery compresses visual information to a set of summary statistics. Some visual information is robust to this lossy compression, but other information, such as spatial location and phase, is not perfectly represented, leading to ambiguous interpretations. Using the statistical encoding, we can visualize the information available in the periphery to gain intuitions about human performance in visual tasks, which has implications for user interface design and, more generally, for whether the periphery encodes sufficient information to perform a task without additional eye movements. The periphery is most of the visual field; if it undergoes these losses of information, then our perception and ability to perform tasks efficiently are affected. We show that the statistical encoding explains human performance in classic visual search experiments. Based on the statistical understanding, we also propose a quantitative model that can estimate the average number of fixations humans would need to find a target in a search display. Further, we show that the ambiguities in the peripheral representation predict many aspects of some illusions. In particular, the model correctly predicts how polarity and width affect the Pinna-Gregory illusion. Visualizing the statistical representation of the illusion shows that many of its qualitative aspects are captured by the statistical ambiguities. We also investigate a phenomenon known as Object Substitution Masking (OSM), in which identification of an object is impaired when a sparse, non-overlapping, and temporally trailing mask surrounds that object. We find that different types of grouping of object and mask produce different levels of impairment. This contradicts a theory of OSM which predicts that grouping should always increase masking strength. We speculate on reasons why the statistical model of the periphery may explain OSM.
by Alvin Andrew Raj.
Ph.D.
Goudie, Robert J. B. "Bayesian structural inference with applications in social science." Thesis, University of Warwick, 2011. http://wrap.warwick.ac.uk/78778/.
Meintjes, M. M. (Maria Magdalena). "Evaluating the properties of sensory tests using computer intensive and biplot methodologies." Thesis, Stellenbosch : Stellenbosch University, 2007. http://hdl.handle.net/10019.1/20881.
ENGLISH ABSTRACT: This study is the result of part-time work done at a product development centre. The organisation extensively makes use of trained panels in sensory trials designed to assess the quality of its product. Although standard statistical procedures are used for analysing the results arising from these trials, circumstances necessitate deviations from the prescribed protocols. Therefore the validity of conclusions drawn as a result of these testing procedures might be questionable. This assignment deals with these questions. Sensory trials are vital in the development of new products, control of quality levels and the exploration of improvement in current products. Standard test procedures used to explore such questions exist but are in practice often implemented by investigators who have little or no statistical background. Thus test methods are implemented as black boxes and procedures are used blindly without checking all the appropriate assumptions and other statistical requirements. The specific product under consideration often warrants certain modifications to the standard methodology. These changes may have some unknown effect on the obtained results and therefore should be scrutinized to ensure that the results remain valid. The aim of this study is to investigate the distribution and other characteristics of sensory data, comparing the hypothesised, observed and bootstrap distributions. Furthermore, the standard testing methods used to analyse sensory data sets will be evaluated. After comparing these methods, alternative testing methods may be introduced and then tested using newly generated data sets. Graphical displays are also useful to get an overall impression of the data under consideration. Biplots are especially useful in the investigation of multivariate sensory data.
The underlying relationships among attributes and their combined effect on the panellists’ decisions can be visually investigated by constructing a biplot. Results obtained by implementing biplot methods are compared to those of sensory tests, i.e. whether a significant difference between objects will correspond to large distances between the points representing objects in the display. In conclusion some recommendations are made as to how the organisation under consideration should implement sensory procedures in future trials. However, these proposals are preliminary and further research is necessary before final adoption. Some issues for further investigation are suggested.
AFRIKAANSE OPSOMMING: Hierdie studie spruit uit deeltydse werk by ’n produk-ontwikkeling-sentrum. Die organisasie maak in al hul sensoriese proewe rakende die kwaliteit van hul produkte op groot skaal gebruik van opgeleide panele. Alhoewel standaard prosedures ingespan word om die resultate te analiseer, noodsaak sekere omstandighede dat die voorgeskrewe protokol in ’n aangepaste vorm geïmplementeer word. Dié aanpassings mag meebring dat gevolgtrekkings gebaseer op resultate ongeldig is. Hierdie werkstuk ondersoek bogenoemde probleem. Sensoriese proewe is noodsaaklik in kwaliteitbeheer, die verbetering van bestaande produkte, asook die ontwikkeling van nuwe produkte. Daar bestaan standaard toets- prosedures om vraagstukke te verken, maar dié word dikwels toegepas deur navorsers met min of geen statistiese kennis. Dit lei daartoe dat toetsprosedures blindelings geïmplementeer en resultate geïnterpreteer word sonder om die nodige aannames en ander statistiese vereistes na te gaan. Alhoewel ’n spesifieke produk die wysiging van die standaard metode kan regverdig, kan hierdie veranderinge ’n groot invloed op die resultate hê. Dus moet die geldigheid van die resultate noukeurig ondersoek word. Die doel van hierdie studie is om die verdeling sowel as ander eienskappe van sensoriese data te bestudeer, deur die verdeling onder die nulhipotese sowel as die waargenome- en skoenlusverdelings te beskou. Verder geniet die standaard toetsprosedure, tans in gebruik om sensoriese data te analiseer, ook aandag. Na afloop hiervan word alternatiewe toetsprosedures voorgestel en dié geëvalueer op nuut gegenereerde datastelle. Grafiese voorstellings is ook nuttig om ’n geheelbeeld te kry van die data onder bespreking. Bistippings is veral handig om meerdimensionele sensoriese data te bestudeer. Die onderliggende verband tussen die kenmerke van ’n produk sowel as hul gekombineerde effek op ’n paneel se besluit, kan hierdeur visueel ondersoek word. 
Resultate verkry in die voorstellings word vergelyk met dié van sensoriese toetsprosedures om vas te stel of statisties betekenisvolle verskille in ’n produk korrespondeer met groot afstande tussen die relevante punte in die bistippingsvoorstelling. Ten slotte word sekere aanbevelings rakende die implementering van sensoriese proewe in die toekoms aan die betrokke organisasie gemaak. Hierdie aanbevelings word gemaak op grond van die voorafgaande ondersoeke, maar verdere navorsing is nodig voor die finale aanvaarding daarvan. Waar moontlik, word voorstelle vir verdere ondersoeke gedoen.
Billups, Robert Brent. "COMPUTER ASSISTED TREATMENT EVALUATION." University of Cincinnati / OhioLINK, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin997908439.
Sjöbergh, Jonas. "Language Technology for the Lazy : Avoiding Work by Using Statistics and Machine Learning." Doctoral thesis, KTH, Numerisk Analys och Datalogi, NADA, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4023.
Kress, Linda. "Analysis of computer science curriculum through development of an online crime reporting system." Morgantown, W. Va. : [West Virginia University Libraries], 2006. https://eidr.wvu.edu/etd/documentdata.eTD?documentid=4601.
Title from document title page. Document formatted into pages; contains vii, 189 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 175-189).
Xiang, Gang. "Fast algorithms for computing statistics under interval uncertainty with applications to computer science and to electrical and computer engineering /." To access this resource online via ProQuest Dissertations and Theses @ UTEP, 2007. http://0-proquest.umi.com.lib.utep.edu/login?COPT=REJTPTU0YmImSU5UPTAmVkVSPTI=&clientId=2515.
Clough, Andrew Lawrence. "Increasing adder efficiency by exploiting input statistics." Thesis, Massachusetts Institute of Technology, 2007. http://hdl.handle.net/1721.1/42424.
Includes bibliographical references (p. 49-50).
Current techniques for characterizing the power consumption of adders rely on assuming that the inputs are completely random. However, the inputs generated by realistic applications are not random, and in fact contain a great deal of structure. Input bits are more likely to remain in the same logical states from addition to addition than would be expected by chance, and bits, especially the most significant bits, are very likely to be in the same state as their neighbors. Taking these data, I look at ways they can be used to improve the design of adders. The first method examines how different adder architectures respond to the different characteristics of input data from the more significant and less significant bits of the adder, and tries to use these responses to create a hybrid adder. Unfortunately the differences are not sufficient for this approach to be effective. I next look at the implications of the data I collected for the optimization of Kogge-Stone adder trees, and find that in certain circumstances the use of experimentally derived activity maps rather than ones based on simple assumptions can increase adder performance by as much as 30%.
by Andrew Lawrence Clough.
M.Eng.
Tikekar, Mehul (Mehul Deepak). "Energy-efficient video decoding using data statistics." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/113990.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 103-108).
Video traffic over the Internet is growing rapidly and is projected to be about 82% of the total consumer Internet traffic by 2020. To address this, new video coding standards such as H.265/HEVC (High Efficiency Video Coding) provide better compression especially at Full HD and higher video resolutions. HEVC achieves this through a variety of algorithmic techniques such as larger transform sizes and more accurate inter-frame prediction. However, these techniques increase the complexity of software and hardware-based video decoders. In this thesis, we design a hardware-based video decoder chip that exploits the statistics of the video to reduce the energy/pixel cost in several ways. For example, we exploit the sparsity in transform coefficients to reduce the energy/pixel cost of inverse transform by 29%. With the proposed architecture, larger transforms have the same energy/pixel cost as smaller transforms owing to their higher sparsity, thus addressing the increased complexity of HEVC's larger transform sizes. As a second example, the energy/pixel cost of inter-prediction is dominated by off-chip memory access. We eliminate off-chip memory access by using on-chip embedded DRAM (eDRAM). However, eDRAM banks spend 80% of their energy on frequent refresh operations to retain stored data. To reduce refresh energy, we compress the video data stored in the eDRAM by exploiting spatial correlation among pixels. Thus, unused eDRAM banks can be turned off to reduce refresh energy by 55%. This thesis presents measured results for a 40 nm CMOS test chip that can decode Full HD video at 20-50 frames per second while consuming only 25-31 mW of system power. The system power is 6 times lower than the state-of-the-art and can enable even extremely energy-constrained wearable devices to decode video without exceeding their power budgets.
The inverse transform result can enable future coding standards to use even larger transform sizes to improve compression without sacrificing energy efficiency.
by Mehul Tikekar.
Ph.D.
Sharan, Lavanya. "Image statistics and the perception of surface reflectance." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/34356.
MIT Institute Archives copy: p. 223 (last page) bound in reverse order.
Includes bibliographical references (p. 217-223).
Humans are surprisingly good at judging the reflectance of complex surfaces even when the surfaces are viewed in isolation, contrary to the Gelb effect. We argue that textural cues are important for this task. Traditional machine vision systems, on the other hand, are incapable of recognizing reflectance properties. Estimating the reflectance of a complex surface under unknown illumination from a single image is a hard problem. Recent work in reflectance recognition has shown that certain statistics measured off an image of a surface are diagnostic of reflectance. We consider opaque surfaces with medium scale structure and spatially homogeneous reflectance properties. For such surfaces, we find that statistics of intensity histograms and histograms of filtered outputs are indicative of the diffuse surface reflectance. We compare the performance of a learning algorithm that employs these image statistics to human performance in two psychophysical experiments. In the first experiment, observers classify images of complex surfaces according to the perceived reflectance. We find that the learning algorithm rivals human performance at the classification task. In the second experiment, we manipulate the statistics of images and ask observers to provide reflectance ratings. In this case, the learning algorithm performs similarly to human observers. These findings lead us to conclude that the image statistics capture perceptually relevant information.
by Lavanya Sharan.
S.M.
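The flavor of the intensity-histogram statistics the Sharan abstract refers to can be illustrated with a short sketch. The synthetic "images", the particular moments chosen (mean, standard deviation, skewness), and the matte/glossy contrast are illustrative assumptions, not the thesis's actual feature set:

```python
import math

def histogram_stats(pixels):
    """Summary statistics of an intensity histogram: mean, standard
    deviation, and skewness.  Glossier surfaces tend to produce more
    positively skewed intensity distributions (sparse bright highlights)."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    sd = math.sqrt(var)
    skew = sum((p - mean) ** 3 for p in pixels) / (n * sd ** 3)
    return mean, sd, skew

# Two synthetic intensity samples: a matte-like surface (symmetric
# intensities) versus a glossy-like one (mostly dark, a few highlights).
matte = [0.4, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5, 0.5]
glossy = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9]
_, _, matte_skew = histogram_stats(matte)
_, _, glossy_skew = histogram_stats(glossy)
```

A classifier trained on such features would, as in the thesis, never need to recover the illumination explicitly.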
Dror, Ron O. (Ron Ofer) 1975. "Surface reflectance recognition and real-world illumination statistics." Thesis, Massachusetts Institute of Technology, 2002. http://hdl.handle.net/1721.1/16911.
Includes bibliographical references (p. 141-150).
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Humans distinguish materials such as metal, plastic, and paper effortlessly at a glance. Traditional computer vision systems cannot solve this problem at all. Recognizing surface reflectance properties from a single photograph is difficult because the observed image depends heavily on the amount of light incident from every direction. A mirrored sphere, for example, produces a different image in every environment. To make matters worse, two surfaces with different reflectance properties could produce identical images. The mirrored sphere simply reflects its surroundings, so in the right artificial setting, it could mimic the appearance of a matte ping-pong ball. Yet, humans possess an intuitive sense of what materials typically "look like" in the real world. This thesis develops computational algorithms with a similar ability to recognize reflectance properties from photographs under unknown, real-world illumination conditions. Real-world illumination is complex, with light typically incident on a surface from every direction. We find, however, that real-world illumination patterns are not arbitrary. They exhibit highly predictable spatial structure, which we describe largely in the wavelet domain. Although they differ in several respects from the typical photographs, illumination patterns share much of the regularity described in the natural image statistics literature. These properties of real-world illumination lead to predictable image statistics for a surface with given reflectance properties. We construct a system that classifies a surface according to its reflectance from a single photograph under unknown illumination. Our algorithm learns relationships between surface reflectance and certain statistics computed from the observed image.
(cont.) Like the human visual system, we solve the otherwise underconstrained inverse problem of reflectance estimation by taking advantage of the statistical regularity of illumination. For surfaces with homogeneous reflectance properties and known geometry, our system rivals human performance.
by Ron O. Dror.
Ph.D.
Terrell, David. "Racial inequalities in America| Examining socioeconomic statistics using the Semantic Web." Thesis, Florida Atlantic University, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10154928.
The visualization of recent episodes regarding apparently unjustifiable deaths of minorities, caused by police and federal law enforcement agencies, has been amplified through today’s social media and television networks. Such events may seem to imply that issues concerning racial inequalities in America are getting worse. However, we do not know whether such indications are factual; whether this is a recent phenomenon, whether racial inequality is escalating relative to earlier decades, or whether it is better in certain regions of the nation compared to others.
We have built a semantic engine for the purpose of querying statistics on various metropolitan areas, based on a database of individual deaths. Separately, we have built a database of demographic data on poverty, income, education attainment, and crime statistics for the top 25 most populous metropolitan areas. These data will ultimately be combined with government data to evaluate this hypothesis, and provide a tool for predictive analytics. In this thesis, we will provide preliminary results in that direction.
The methodology in our research consisted of multiple steps. We initially described our requirements and drew data from numerous datasets, which contained information on the 23 highest-populated Metropolitan Statistical Areas in the United States. After all of the required data was obtained, we decomposed the Metropolitan Statistical Area records into domain components and created an Ontology/Taxonomy via Protégé to determine a hierarchy of nouns for identifying significant keywords throughout the datasets to use as search queries. Next, we used a Semantic Web implementation, together with the Python programming language and FuXi, to build and instantiate a vocabulary. The Ontology was then parsed for the entered search query and returned corresponding results, providing a semantically organized and relevant output in RDF/XML format.
Li, Shaolin 1963. "Stochastic approximation algorithms for statistical estimation." Thesis, McGill University, 1996. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=42077.
Almulla, Mohammed Ali. "A class of greedy algorithms for solving the travelling salesman problem /." Thesis, McGill University, 1990. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=59557.
This thesis looks closely at one of the approximate methods, namely sub-optimal tour building. In particular, it focuses on the nearest neighbour algorithm (a greedy algorithm). By being greedy at every step of the procedure, this algorithm returns an approximate solution that is near optimal in terms of solution cost. Next, this greedy algorithm is used in implementing a new algorithm called the "Multi-Degree Greedy Algorithm". By being greedy at half of the procedure steps, this algorithm returns optimal solutions to travelling salesman problems 99% of the time. Thus, it is an approximate algorithm, designed to run on small-scale travelling salesman problems (n < 20).
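The nearest neighbour heuristic the abstract describes can be sketched in a few lines. The point set and the tie-breaking behaviour below are illustrative assumptions, not data or code from the thesis:

```python
import math

def nearest_neighbour_tour(points, start=0):
    """Greedy TSP heuristic: from the current city, always move to the
    nearest unvisited city; the tour implicitly closes back to the start."""
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        current = tour[-1]
        nxt = min(unvisited,
                  key=lambda j: math.dist(points[current], points[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

# Four corners of a unit square; the greedy tour visits each city once.
points = [(0, 0), (0, 1), (1, 1), (1, 0)]
tour = nearest_neighbour_tour(points)
```

On adversarial instances the greedy choice can be far from optimal, which is why the thesis treats it as an approximate method.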
Duguay, Richard. "Speech recognition : transition probability training in diphone bootstraping." Thesis, McGill University, 1999. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=21544.
Sverchkov, Yuriy. "Detection and explanation of statistical differences across a pair of groups." Thesis, University of Pittsburgh, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3647988.
The task of explaining differences across groups is one that people encounter often, not only in the research environment but also in less formal settings. Existing statistical tools designed specifically for discovering and understanding differences are limited. The methods developed in this dissertation provide such tools, help establish what properties such tools should have to be successful, and motivate further development of new approaches to discovering and understanding differences.
This dissertation presents a novel approach to comparing groups of data points. The process of comparing groups of data is divided into multiple stages: The learning of maximum a posteriori models for the data in each group, the identification of statistical differences between model parameters, the construction of a single model that captures those differences, and finally, the explanation of inferences of differences in marginal distributions in the form of an account of clinically significant contributions of elemental model differences to the marginal difference. A general framework for the process, applicable to a broad range of model types, is presented. This dissertation focuses on applying this framework to Bayesian networks over multinomial variables.
To evaluate model learning and the detection of parameter differences an empirical evaluation of methods for identifying statistically significant differences and clinically significant differences is performed. To evaluate the generated explanations of how differences in the models account for the differences in probabilities computed from those models, case studies with real clinical data are presented, and the findings generated by explanations are discussed. An interactive prototype that allows a user to navigate through such an explanation is presented, and ideas are discussed for further development of data analysis tools for comparing groups of data.
Fang, Youhan. "Efficient Markov Chain Monte Carlo Methods." Thesis, Purdue University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10809188.
Generating random samples from a prescribed distribution is one of the most important and challenging problems in machine learning, Bayesian statistics, and the simulation of materials. Markov Chain Monte Carlo (MCMC) methods are usually the required tool for this task if the desired distribution is known only up to a multiplicative constant. Samples produced by an MCMC method are real values in N-dimensional space, called the configuration space. The distribution of such samples converges to the target distribution in the limit. However, existing MCMC methods still face many challenges that are not well resolved. Difficulties in sampling with MCMC methods include, but are not limited to, dealing with high-dimensional and multimodal problems, high computational cost due to extremely large datasets in Bayesian machine learning models, and a lack of reliable indicators for detecting convergence and measuring the accuracy of sampling. This dissertation focuses on new theory and methodology for efficient MCMC methods that aim to overcome these difficulties.
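The basic setting the abstract describes, sampling from a density known only up to a multiplicative constant, can be illustrated with a minimal random-walk Metropolis sketch. The standard-normal target, step size, and seed are illustrative assumptions, not the thesis's methods:

```python
import math
import random

def metropolis(log_unnorm, n_samples, x0=0.0, step=1.0, seed=0):
    """Random-walk Metropolis in 1-D: propose x' = x + N(0, step) and
    accept with probability min(1, p(x')/p(x)).  Only an unnormalized
    log-density is needed, so the normalizing constant cancels."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Metropolis criterion on the log scale avoids overflow/underflow.
        if math.log(rng.random()) < log_unnorm(proposal) - log_unnorm(x):
            x = proposal
        samples.append(x)
    return samples

# Unnormalized standard normal: log p(x) = -x^2/2 up to an additive constant.
samples = metropolis(lambda x: -0.5 * x * x, 20000)
mean = sum(samples) / len(samples)
var = sum(x * x for x in samples) / len(samples)
```

The sample mean and variance approach 0 and 1 only slowly, which is exactly the kind of inefficiency the dissertation's methods target.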
One contribution of this dissertation is generalizations of hybrid Monte Carlo (HMC). An HMC method combines a discretized dynamical system in an extended space, called the state space, with an acceptance test based on the Metropolis criterion. The discretized dynamical system used in HMC is volume preserving, meaning that in the state space the absolute Jacobian of a map from one point on the trajectory to another is 1. Volume preservation is, however, not necessary for the general purpose of sampling. A general theory allowing the use of non-volume-preserving dynamics for proposing MCMC moves is proposed. Examples, including isokinetic dynamics and variable-mass Hamiltonian dynamics with an explicit integrator, are designed with fewer restrictions based on the general theory. Experiments show improved efficiency for sampling high-dimensional multimodal problems. A second contribution is stochastic gradient samplers with reduced bias. An in-depth analysis of the noise introduced by the stochastic gradient is provided. Two methods to reduce the bias in the distribution of samples are proposed: one corrects the dynamics by using an estimated noise based on subsampled data, and the other introduces additional variables and corresponding dynamics to adaptively reduce the bias. Extensive experiments show that both methods outperform existing methods. A third contribution is quasi-reliable estimates of effective sample size. We propose a more reliable indicator, the longest integrated autocorrelation time over all functions in the state space, for detecting the convergence and measuring the accuracy of MCMC methods. The superiority of the new indicator is supported by experiments on both synthetic and real problems.
Minor contributions include a general framework of changing variables, and a numerical integrator for the Hamiltonian dynamics with fourth order accuracy. The idea of changing variables is to transform the potential energy function as a function of the original variable to a function of the new variable, such that undesired properties can be removed. Two examples are provided and preliminary experimental results are obtained for supporting this idea. The fourth order integrator is constructed by combining the idea of the simplified Takahashi-Imada method and a two-stage Hessian-based integrator. The proposed method, called two-stage simplified Takahashi-Imada method, shows outstanding performance over existing methods in high-dimensional sampling problems.
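The volume-preserving discretized dynamics that HMC relies on is typically the second-order leapfrog integrator; a sketch for a quadratic potential is shown below for illustration only. It is not the thesis's fourth-order two-stage method, and the step size and trajectory length are arbitrary choices:

```python
def leapfrog(grad_u, x, p, step, n_steps):
    """Leapfrog integration of Hamiltonian dynamics H(x, p) = U(x) + p^2/2.
    The map (x, p) -> (x', p') is volume preserving (Jacobian 1) and
    nearly conserves H, which keeps Metropolis acceptance rates high."""
    p = p - 0.5 * step * grad_u(x)      # initial half step for momentum
    for _ in range(n_steps - 1):
        x = x + step * p                # full step for position
        p = p - step * grad_u(x)        # full step for momentum
    x = x + step * p
    p = p - 0.5 * step * grad_u(x)      # final half step for momentum
    return x, p

# Quadratic potential U(x) = x^2/2, i.e. a standard normal target.
grad = lambda x: x
x0, p0 = 1.0, 0.0
x1, p1 = leapfrog(grad, x0, p0, step=0.1, n_steps=10)
energy0 = 0.5 * x0 * x0 + 0.5 * p0 * p0
energy1 = 0.5 * x1 * x1 + 0.5 * p1 * p1
```

The energy drift is O(step^2), which is what higher-order integrators such as the thesis's two-stage simplified Takahashi-Imada method improve upon.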
Wei, Wutao. "Model Based Clustering Algorithms with Applications." Thesis, Purdue University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10830711.
In machine learning prediction, unsupervised learning is applied when labels are unavailable, laborious to obtain, or available only for a limited proportion of the data. Based on the special properties of the data, we can build models by understanding those properties and making reasonable assumptions. In this thesis, we introduce three practical problems and discuss them in detail. This thesis comprises three papers:
Wei, Wutao, et al. "A Non-parametric Hidden Markov Clustering Model with Applications to Time Varying User Activity Analysis." ICMLA 2015.
Wei, Wutao, et al. "Dynamic Bayesian predictive model for box office forecasting." IEEE Big Data 2017.
Wei, Wutao, Bowei Xi, and Murat Kantarcioglu. "Adversarial Clustering: A Grid Based Clustering Algorithm Against Active Adversaries." Submitted.
User Profiling Clustering: Activity data of individual users on social media are easily accessible in this big data era. However, proper modeling strategies for user profiles have not been well developed in the literature. Existing methods or models usually have two limitations. The first limitation is that most methods target the population rather than individual users, and the second is that they cannot model non-stationary time-varying patterns. Different users in general demonstrate different activity modes on social media. Therefore, one population model may fail to characterize activities of individual users. Furthermore, online social media are dynamic and ever evolving, so are users’ activities. Dynamic models are needed to properly model users’ activities. In this paper, we introduce a non-parametric hidden Markov model to characterize the time-varying activities of social media users. In addition, based on the proposed model, we develop a clustering method to group users with similar activity patterns.
Adversarial Clustering: Nowadays more and more data are gathered for detecting and preventing cyber-attacks. Unique to the cyber security applications, data analytics techniques have to deal with active adversaries that try to deceive the data analytics models and avoid being detected. The existence of such adversarial behavior motivates the development of robust and resilient adversarial learning techniques for various tasks. In the past most of the work focused on adversarial classification techniques, which assumed the existence of a reasonably large amount of carefully labeled data instances. However, in real practice, labeling the data instances often requires costly and time-consuming human expertise and becomes a significant bottleneck. Meanwhile, a large number of unlabeled instances can also be used to understand the adversaries' behavior. To address the above mentioned challenges, we develop a novel grid based adversarial clustering algorithm. Our adversarial clustering algorithm is able to identify the core normal regions, and to draw defensive walls around the core positions of the normal objects utilizing game theoretic ideas. Our algorithm also identifies sub-clusters of attack objects, the overlapping areas within clusters, and outliers which may be potential anomalies.
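The grid-based idea in the paragraph above, binning points into cells and treating dense cells as core normal regions while sparse cells remain candidate outliers or attack objects, can be sketched simply. The cell size, density threshold, and sample points are illustrative assumptions, not the algorithm from the paper (which additionally uses game-theoretic defensive walls):

```python
from collections import Counter

def grid_core_regions(points, cell=1.0, min_count=3):
    """Grid-based clustering sketch: bin 2-D points into square cells and
    mark cells with at least min_count points as 'core' normal regions.
    Points in sparse cells are potential outliers or attack objects."""
    counts = Counter((int(x // cell), int(y // cell)) for x, y in points)
    return {c for c, n in counts.items() if n >= min_count}

# A dense blob near the origin plus two isolated points.
pts = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.1), (0.2, 0.8),
       (5.0, 5.0), (9.0, 1.0)]
core = grid_core_regions(pts, cell=1.0, min_count=3)
```

Because no labels are needed, such a pass can run on the large unlabeled datasets the paragraph describes before any costly human labeling.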
Dynamic Bayesian Update for Profiling Clustering: The movie industry has become one of the most important consumer businesses, and it is increasingly competitive. Movie producers face large costs in production and marketing, while theater owners must decide how to allocate a limited number of screens among the movies currently showing. However, current models in the movie industry can only give an estimate for the opening week. We improve the dynamic linear model with a Bayesian framework. Using this updating method, we are also able to update on streaming adversarial data and make defensive recommendations for defensive systems.
Zaetz, Jiaqi L. "A Riemannian Framework for Shape Analysis of Annotated 3D Objects." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1440368778.
Chen, Guo. "Implementation of Cumulative Probability Models for Big Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=case1619624862283514.
Повний текст джерелаKonzem, Scott R. "Tenability and Computability of Generalized Polya Urns." Thesis, The George Washington University, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10263413.
Urn models have a storied part in the history of probability and have been studied extensively over the past century for their wide range of applications. We analyze a generalized class of urn models introduced in the past decade, the so-called "multiset" class, in which more than one ball is sampled at a time. We investigate sufficient conditions for a multiset urn process to be tenable, meaning the process can continue indefinitely without getting stuck. We fully characterize the "strongly tenable" class of Pólya urn schemes, which is tenable under any starting conditions that allow the process to begin. We find several "weakly tenable" classes of Pólya urn schemes that are tenable only under restricted starting conditions. We enumerate the size of some of these tenable classes using combinatorics, probabilistically analyze them, and provide an algorithm to assess the tenability of an arbitrary urn scheme using breadth-first search. We further analyze the computational complexity of the tenability problem itself. By showing how to encode the Boolean satisfiability problem within a Pólya urn scheme, we find that the problem of determining whether a multiset urn scheme is untenable is in the complexity class NP-hard, and this places constraints on the kinds of tenability theorems we can hope to find. Finally, we analyze a generalized “fault tolerant” urn model that can take action to avoid getting stuck, and by showing that this model is Turing-equivalent, we show that the tenability problem for this model is undecidable.
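The breadth-first-search tenability check mentioned in the abstract can be sketched for a simple two-colour scheme. The state encoding, the net-change rule format, and the ball-count cap that keeps the search finite are illustrative assumptions, not the thesis's construction:

```python
from collections import deque
from itertools import combinations_with_replacement

def is_tenable(start, rules, sample_size, max_balls=50):
    """BFS over reachable urn states (white, black).  The process is stuck
    at a state if fewer than sample_size balls remain, or if some drawable
    multiset's rule would make a count negative.  Any reachable stuck state
    makes the scheme untenable from this start.  rules maps each drawable
    multiset (dw, db) to a net change (cw, cb); states above max_balls are
    not expanded, so this is an approximate, bounded check."""
    seen, queue = {start}, deque([start])
    while queue:
        w, b = queue.popleft()
        if w + b < sample_size:
            return False                 # too few balls to draw: stuck
        for draw in combinations_with_replacement('WB', sample_size):
            dw, db = draw.count('W'), draw.count('B')
            if dw > w or db > b:
                continue                 # this multiset cannot be drawn here
            cw, cb = rules[(dw, db)]     # net change after this draw
            nw, nb = w + cw, b + cb
            if nw < 0 or nb < 0:
                return False             # rule removes balls that are absent
            if nw + nb <= max_balls and (nw, nb) not in seen:
                seen.add((nw, nb))
                queue.append((nw, nb))
    return True

# Friedman-like scheme, sample size 1: drawing either colour adds one of
# the other colour, so the process can never get stuck.
good = {(1, 0): (0, 1), (0, 1): (1, 0)}
ok = is_tenable((1, 1), good, sample_size=1)
# A scheme that discards every drawn ball eventually empties the urn.
bad = {(1, 0): (-1, 0), (0, 1): (0, -1)}
stuck = is_tenable((1, 1), bad, sample_size=1)
```

The NP-hardness result quoted above suggests no such check can be both exact and efficient for arbitrary multiset schemes.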
Angiuli, Olivia Marie. "The effect of quasi-identifier characteristics on statistical bias introduced by k-anonymization." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:14398529.
Zheng, Shijie. "The Differential Privacy of Bayesian Inference." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:14398533.
Comiter, Marcus Zachary. "A Future of Abundant Sparsity: Novel Use and Analysis of Sparse Coding in Machine Learning Applications." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:17417575.
Computer Science
Swaminathan, Adith. "Counterfactual Evaluation and Learning From Logged User Feedback." Thesis, Cornell University, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10258968.
Interactive systems that interact with and learn from user behavior are ubiquitous today. Machine learning algorithms are core components of such systems. In this thesis, we will study how we can re-use logged user behavior data to evaluate interactive systems and train their machine learned components in a principled way. The core message of the thesis is:
• Using simple techniques from causal inference, we can improve popular machine learning algorithms so that they interact reliably.
• These improvements are effective and scalable, and complement current algorithmic and modeling advances in machine learning.
• They open further avenues for research in Counterfactual Evaluation and Learning to ensure machine learned components interact reliably with users and with each other.
This thesis explores two fundamental tasks—evaluation and training of interactive systems. Solving evaluation and training tasks using logged data is an exercise in counterfactual reasoning. So we will first review concepts from causal inference for counterfactual reasoning, assignment mechanisms, statistical estimation and learning theory. The thesis then contains two parts.
In the first part, we will study scenarios where unknown assignment mechanisms underlie the logged data we collect. These scenarios often arise in learning-to-rank and learning-to-recommend applications. We will view these applications through the lens of causal inference and modularize the problem of building a good ranking engine or recommender system into two components—first, infer a plausible assignment mechanism and second, reliably learn to rank or recommend assuming this mechanism was active when collecting data.
The second part of the thesis focuses on scenarios where we collect logged data from past interventions. We will formalize these scenarios as batch learning from logged contextual bandit feedback. We will first develop better off-policy estimators for evaluating online user-centric metrics in information retrieval applications. In subsequent chapters, we will study the bias-variance trade-off when learning from logged interventions. This study will yield new learning principles, algorithms and insights into the design of statistical estimators for counterfactual learning.
The thesis outlines a few principles, tools, datasets and software that hopefully prove to be useful to you as you build your interactive learning system.
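The off-policy evaluation task described above rests on counterfactual estimators; the simplest and most standard is inverse propensity scoring (IPS), sketched here under the assumption that the logging policy's action probabilities were recorded (the variable names are illustrative, not the thesis's code):

```python
def ips_estimate(logs, target_prob):
    """Inverse propensity scoring: reweight each logged reward by how much
    more (or less) likely the target policy is to pick the logged action
    than the logging policy was. Unbiased when the logged propensities are
    correct and cover every action the target policy can take.

    logs: iterable of (context, action, reward, logging_prob) tuples.
    target_prob(context, action): probability the target policy takes
    that action in that context.
    """
    total = 0.0
    n = 0
    for context, action, reward, logging_prob in logs:
        total += reward * target_prob(context, action) / logging_prob
        n += 1
    return total / n
```

With a uniform logging policy over two actions, the estimator recovers the value of any deterministic target policy exactly on balanced logs.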
Foulds, James Richard. "Latent Variable Modeling for Networks and Text| Algorithms, Models and Evaluation Techniques." Thesis, University of California, Irvine, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3631094.
In the era of the internet, we are connected to an overwhelming abundance of information. As more facets of our lives become digitized, there is a growing need for automatic tools to help us find the content we care about. To tackle the problem of information overload, a standard machine learning approach is to perform dimensionality reduction, transforming complicated high-dimensional data into a manageable, low-dimensional form. Probabilistic latent variable models provide a powerful and elegant framework for performing this transformation in a principled way. This thesis makes several advances for modeling two of the most ubiquitous types of online information: networks and text data.
Our first contribution is to develop a model for social networks as they vary over time. The model recovers latent feature representations of each individual, and tracks these representations as they change dynamically. We also show how to use text information to interpret these latent features.
Continuing the theme of modeling networks and text data, we next build a model of citation networks. The model finds influential scientific articles and the influence relationships between the articles, potentially opening the door for automated exploratory tools for scientists.
The increasing prevalence of web-scale data sets provides both an opportunity and a challenge. With more data we can fit more accurate models, as long as our learning algorithms are up to the task. To meet this challenge, we present an algorithm for learning latent Dirichlet allocation topic models quickly, accurately and at scale. The algorithm leverages stochastic techniques, as well as the collapsed representation of the model. We use it to build a topic model on 4.6 million articles from the open encyclopedia Wikipedia in a matter of hours, and on a corpus of 1740 machine learning articles from the NIPS conference in seconds.
Finally, evaluating the predictive performance of topic models is an important yet computationally difficult task. We develop one algorithm for comparing topic models, and another for measuring the progress of learning algorithms for these models. The latter method achieves better estimates than previous algorithms, in many cases with an order of magnitude less computational effort.
Kaftan, David. "Design Day Analysis - Forecasting Extreme Daily Natural Gas Demand." Thesis, Marquette University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10825062.
This work provides a framework for Design Day analysis. First, we estimate the temperature conditions which are expected to be colder than all but one day in N years. This temperature is known as the Design Day condition. Then, we forecast an upper bound on natural gas demand when temperature is at the Design Day condition.
Natural gas distribution companies (LDCs) need to meet demand during extreme cold days. Just as bridge builders design for a nominal load, natural gas distribution companies need to design for a nominal temperature. This nominal temperature is the Design Day condition. The Design Day condition is the temperature that is expected to be colder than every day except one in N years. Once Design Day conditions are estimated, LDCs need to prepare for the Design Day demand. We provide an upper bound on Design Day demand to ensure LDCs will be able to meet demand.
Design Day conditions are determined in a variety of ways. First, we fit a kernel density function to surrogate temperatures - this method is referred to as the Surrogate Kernel Density Fit. Second, we apply Extreme Value Theory - a field dedicated to finding the maxima or minima of a distribution. In particular, we apply Block-Maxima and Peak-Over-Threshold (POT) techniques. The upper bound of Design Day demand is determined using a modified version of quantile regression.
Similar Design Day conditions are estimated by both the Surrogate Kernel Density Fit and Peaks-Over-Threshold methods. Both methods perform well. The theory supporting the POT method and the empirical performance of the SKDF method lend confidence in the Design Day condition estimates. The upper bound of demand on these conditions is well modeled by the modified quantile regression technique.
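A minimal peaks-over-threshold sketch of a Design Day style estimate is shown below. This is not the dissertation's fitting procedure: it uses a method-of-moments generalized Pareto fit and illustrative parameter names, negating temperatures so that cold minima become upper-tail maxima.

```python
import numpy as np

def pot_return_level(temps, threshold, obs_per_year, return_years):
    """Peaks-over-threshold sketch for a 1-in-N-year cold temperature.

    Negates temperatures so cold extremes sit in the upper tail, fits a
    generalized Pareto distribution (GPD) to threshold exceedances by the
    method of moments, and inverts it for the N-year return level.
    """
    x = -np.asarray(temps, dtype=float)   # cold extremes -> upper tail
    u = -threshold
    exceed = x[x > u] - u                 # amounts by which u is exceeded
    zeta = len(exceed) / len(x)           # empirical exceedance rate
    m, s2 = exceed.mean(), exceed.var()
    xi = 0.5 * (1.0 - m * m / s2)         # method-of-moments GPD shape
    sigma = 0.5 * m * (m * m / s2 + 1.0)  # method-of-moments GPD scale
    n = return_years * obs_per_year * zeta
    if abs(xi) > 1e-9:
        level = u + sigma / xi * (n ** xi - 1.0)
    else:                                 # xi -> 0 reduces to exponential tail
        level = u + sigma * np.log(n)
    return -level                         # back to the temperature scale
```

On a decade of synthetic daily temperatures, the 30-year level comes out colder than the chosen threshold, as expected.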
Lin, Lei. "Data science application in intelligent transportation systems| An integrative approach for border delay prediction and traffic accident analysis." Thesis, State University of New York at Buffalo, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3683052.
With the great progress in information and communications technologies in the past few decades, intelligent transportation systems (ITS) have accumulated vast amounts of data regarding the movement of people and goods from one location to another. Besides the traditional fixed sensors and GPS devices, new emerging data sources and approaches such as social media and crowdsourcing can be used to extract travel-related data, especially given the wide popularity of mobile devices such as smartphones and tablets, along with their associated apps. To take advantage of all these data and to address the associated challenges, big data techniques, and a new emerging field called data science, are currently receiving more and more attention. Data science employs techniques and theories from many fields such as statistics, machine learning, data mining, analytical models and computer programming to solve data analysis tasks. It is therefore timely and important to explore how data science may be best employed for transportation data analysis. In this doctoral study, an integrative approach is proposed for data science applications in ITS. The proposed approach consists of integrating multiple steps in the data analysis process, or integrating different models to build a more powerful one. The integrative approach is applied and tested on two case studies: border crossing delay prediction and traffic accident data analysis.
For the first case study, a two-step border crossing delay prediction model is proposed, consisting of a short-term traffic volume prediction model and a multi-server queueing model. As such, this can be seen as an integration of data-driven models and analytical models. For the first step, the short-term traffic volume prediction model, an integration of data "width" decreasing (i.e., data grouping) step and model development step is applied. For model development, a model combination step of a Seasonal Autoregressive Integrated Moving Average Model (SARIMA) and Support Vector Regression (SVR) is applied to realize better performance than when using each single model. In addition, the spinning network (SPN) forecasting paradigm is enhanced for border crossing traffic prediction through the utilization of a dynamic time warping (DTW) similarity metric. The DTW-SPN is shown to yield several advantages such as computational efficiency and accuracy as demonstrated by a promising Mean Absolute Percent Error (MAPE) compared to SARIMA and SVR.
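The dynamic time warping similarity mentioned above can be illustrated with the classic dynamic-programming recurrence (a generic sketch, not the dissertation's DTW-SPN implementation): two traffic-volume series are aligned by stretching either axis, so a repeated sample no longer counts as a mismatch.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric series.
    Classic O(len(a) * len(b)) recurrence; smaller means more similar."""
    inf = float("inf")
    # d[i][j] = cost of best alignment of a[:i] and b[:j]
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch a
                                 d[i][j - 1],      # stretch b
                                 d[i - 1][j - 1])  # match both
    return d[len(a)][len(b)]
```

A duplicated reading costs nothing under DTW, whereas a pointwise (Euclidean) comparison would penalize the shift.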
This dissertation also proposes the introduction of a data diagnosis step before short-term traffic prediction. In order to develop a methodology for model selection guidance, the author calculated the statistical measures of nonlinearity and complexity for multiple datasets and correlated those to the performances of multiple models SARIMA, SVR and k nearest neighbor (k-NN). Based on this, useful insights are revealed pertaining to parameter setting and model selection based on the data diagnosis results.
For the second step, namely the queueing model development, heuristic solutions are presented for two types of queueing models, M/Ek/n and BMAP/PH/n. These models take the predicted traffic volume as input and use it to calculate future waiting time. The analytical results are compared to the results from a VISSIM simulation model and shown to be comparable. Finally, an Android smartphone app, which utilizes the two-step border prediction methodology described above, is developed to collect, share and predict waiting times at the three Niagara Frontier border crossings.
For the second case study involving traffic accident data analysis, first an integration of a data "depth" decreasing step and a model development step is once again applied. To do this, the modularity-optimizing community detection algorithm is used to cluster the dataset, and for each cluster, the association rule algorithm is applied to yield insight into traffic accident hotspots and incident clearance time. The results show that more meaningful association rules can be derived when the data is clustered compared to when using the whole dataset directly. Secondly, an integration of a data "width" decreasing step (variable selection) and model development step is applied for real-time traffic accident risk prediction. For this, a novel variable selection method based on the Frequent Pattern tree (FP tree) algorithm is proposed and tested, before applying Bayesian networks and k-NN algorithms. The experiment shows the models based on variables selected by the FP tree always performed better than those using variables selected by the random forest method. Lastly, an integration of the data mining model, M5P tree, and the hazard-based duration model (HBDM) statistical method is applied to traffic accident duration prediction. The M5P-HBDM method is shown to be capable of identifying more meaningful factors that impact the traffic accident duration, and to have a better prediction performance, than either M5P or HBDM.
The two case studies considered in this dissertation serve to illustrate the advantages of an integrative data science approach to analyzing transportation data. With this approach, invaluable insight is gained that can help solve transportation problems and guide public policy.
Gligorijevic, Djordje. "Predictive Uncertainty Quantification and Explainable Machine Learning in Healthcare." Diss., Temple University Libraries, 2018. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/520057.
Ph.D.
Predictive modeling is an increasingly important part of decision making. Advances in machine learning predictive modeling have spread across many domains, bringing significant improvements in performance and providing unique opportunities for novel discoveries. Notably important domains of the human world are the medical and healthcare domains, which take care of people's wellbeing. And while they are among the most developed areas of science, with active research, there are many ways they can be improved. In particular, novel tools developed based on machine learning theory have brought benefits across many areas of clinical practice, pushing the boundaries of medical science and directly affecting the well-being of millions of patients. Additionally, the healthcare and medicine domains require predictive modeling to anticipate and overcome many obstacles the future may hold. These kinds of applications employ precise decision-making processes which require accurate predictions. However, a good prediction on its own is often insufficient. There has been no major focus on developing algorithms with good-quality uncertainty estimates. Ergo, this thesis aims at providing a variety of ways to incorporate solutions by learning high-quality uncertainty estimates or providing interpretability of the models where needed, for the purpose of improving existing tools built in practice and allowing many other tools to be used where uncertainty is the key factor for decision making. The first part of the thesis proposes approaches for learning high-quality uncertainty estimates for both short- and long-term predictions in multi-task learning, developed on top of continuous probabilistic graphical models. In many scenarios, especially in long-term predictions, it may be of great importance for the models to provide a reliability flag in order to be accepted by domain experts.
To this end we explored a widely applied structured regression model with the goal of providing meaningful uncertainty estimates on various predictive tasks. Our particular interest is in modeling uncertainty propagation while predicting far into the future. To address this important problem, our approach centers on providing an uncertainty estimate by modeling input features as random variables. This allows modeling uncertainty from noisy inputs. In cases when the model iteratively produces errors, it should propagate uncertainty over the predictive horizon, which may provide invaluable information for decision making based on predictions. In the second part of the thesis we propose novel neural embedding models for learning low-dimensional embeddings of medical concepts, such as diseases and genes, show how they can be interpreted to assess their quality, and show how they can be used to solve many problems in medical and healthcare research. We use EHR data to discover novel relationships between diseases by studying their comorbidities (i.e., co-occurrences in patients). We trained our models on a large-scale EHR database comprising more than 35 million inpatient cases. To confirm the value and potential of the proposed approach we evaluate its effectiveness on a held-out set. Furthermore, for select diseases we provide a candidate gene list for which disease-gene associations were not studied previously, allowing biomedical researchers to better focus their often very costly lab studies. We furthermore examine how disease heterogeneity can affect the quality of learned embeddings and propose an approach for learning types of such heterogeneous diseases; in our study we primarily focus on learning types of sepsis. Finally, we evaluate the quality of low-dimensional embeddings on tasks of predicting hospital quality indicators such as length of stay, total charges and mortality likelihood, demonstrating their superiority over other approaches.
In the third part of the thesis we focus on decision making in medicine and healthcare domain by developing state-of-the-art deep learning models capable of outperforming human performance while maintaining good interpretability and uncertainty estimates.
Temple University--Theses
Nelson, Emily W. (Emily Wyke) 1977. "Counting statistics of a system to produce entangled photon pairs." Thesis, Massachusetts Institute of Technology, 2001. http://hdl.handle.net/1721.1/86724.
Huang, Yen-Chin. "Empirical distribution function statistics, speed of convergence, and p-variation." Thesis, Massachusetts Institute of Technology, 1994. http://hdl.handle.net/1721.1/12017.
Herring, Keith 1981. "Propagation models for multiple-antenna systems : methodology, measurements and statistics." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/43027.
Includes bibliographical references (leaves 219-223).
The trend in wireless communications is towards utilization of multiple antenna systems. While techniques such as beam-forming and spatial diversity have been implemented for some time, the emergence of Multiple-Input Multiple-Output (MIMO) communications has increased commercial interest and development in multiple-antenna technology. Given this trend it has become increasingly important that we understand the propagation characteristics of the environments where this new technology will be deployed. In particular the development of low-cost, high-performance system architectures and protocols is largely dependent on the accuracy of available channel models for approximating realized propagation behavior. The first contribution of this thesis is a methodology for the modeling of wireless propagation in multiple antenna systems. Specifically we consider the problem of propagation modeling from the perspective of the protocol designer and system engineer. By defining the wireless channel as the complex narrow-band channel response h ∈ ℂ between two devices, we characterize the important degrees of freedom associated with the channel by modeling it as a function of its path-loss, multipath/frequency, time stability, spatial, and polarization characteristics. We then motivate this model by presenting a general set of design decisions that depend on these parameters such as network density, channel allocation, and channel-state information (CSI) update rate. Lastly we provide a parametrization of the environment into measurable factors that can be used to predict channel behavior including link-length, Line-Of-Sight (LOS), link topology (e.g. air-to-ground), building density, and other physical parameters. The second contribution of this thesis is the experimental analysis and development of this modeling space.
Specifically we have gathered a large database of real wireless channel data from a diverse set of propagation environments. A mobile channel-data collection system was built for obtaining the required data which includes an eight-channel software receiver and a collection of WiFi channel sounders. The software receiver synchronously samples the 20-MHz band centered at 2.4 GHz from eight configurable antennas. Measurements have been carried out for both air-to-ground and ground-to-ground links for distances ranging from tens of meters to several kilometers throughout the city of Cambridge, MA. Here we have developed a collection of models for predicting channel behavior, including a model for estimating the path-loss coefficient α in street environments that utilizes two physical parameters: P1 = percentage of building gaps averaged over each side of the street, P2 = percentage of the street length that has a building gap on at least one side of the street. Results show a linear increase in α of 0.53 and 0.32 per 10% increase in P1 and P2, respectively, with RMS errors of 0.47 and 0.27 in α for α between 2 and 5. Experiments indicate a 10dB performance advantage in estimating path-loss with this multi-factor model over the optimal linear estimator (upper-bound empirical model) for link lengths as short as 100 meters. In contrast, air-to-ground links have been shown to exhibit log-normal fading with an average attenuation of α ≈ 2 and standard deviation of 8dB. Additionally we provide exhaustive evidence that the small-scale fading behavior (frequency domain) of both Non-Line-Of-Sight (NLOS) air-to-ground and ground-to-ground links as short as tens of meters is Rayleigh distributed. More specifically, fading distributions across a diverse set of environments and link lengths have been shown to have Rician K-factors smaller than 1, suggesting robust performance of the Rayleigh model.
A model is also presented that defines a stochastic distribution for the delay-spread of the channel as a function of the link-length (d0), multipath component (MPC) decay-rate (attenuation per unit delay), and MPC arrival-rate (q, MPCs per unit delay). Experiments support the use of this model over a spectrum of link-lengths (50m-700m) and indicate a dense arrival-rate (q) (on the order of 1 MPC) in ground-to-ground links. In this range the frequency structure of the channel is insensitive to q, which reduces the modeling complexity to a single unknown parameter, β. We provide estimators for β over a variety of environment types that have been shown to closely replicate the fade width distribution in these environments. The observed time-coherence length (tc) of MPCs tends to be either less than 300ms (high-frequency) or 5 seconds and longer (low-frequency), resulting in a Rician-like distribution for fading in the time domain. We show that the time characteristics of the channel are accurately modeled as the superposition of two independent circularly symmetric complex Gaussian random variables corresponding to the channel response due to a set of stable and unstable MPCs. We observe that the S-factor, defined as the ratio of average power in stable to unstable MPCs (distinct from the Rician K-factor), ranges between 0 and 30dB depending on environment and link length, and can be estimated with an rms error of 3dB in both ground-to-ground and air-to-ground link regimes. Experiments show improved performance of this model over the Rician fading model, which has been shown to underestimate high fade events (tails) in the time domain, corresponding to cases where the stable MPCs destructively combine to form a null. Additionally, the Kronecker MIMO channel model is shown to predict channel capacity (of a 7x7 system) with an rms error of 1.7 ... (at 20dB SNR) over a diverse set of observed outdoor environments.
Experiments indicate a 3dB performance advantage in this prediction when applied to environments that are not dominated by single-bounce propagation paths (Single-bounce: 2.1 ... rms, Multi-bounce: 1 ... rms).
by Keith T. Herring.
Ph.D.
Herring, Keith 1981. "Blind separation of noisy multivariate data using second-order statistics." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/30173.
Includes bibliographical references (leaves 81-83).
A second-order method for blind source separation of noisy instantaneous linear mixtures is presented and analyzed for the case where the signal order k and noise covariance GG^H are unknown. Only a data set X of dimension n > k and of sample size m is observed, where X = AP + GW. The quality of separation depends on the source-observation ratio k/n, the degree of spectral diversity, and the second-order non-stationarity of the underlying sources. The algorithm estimates the Second-Order separation transform A, the signal Order, and Noise, and is therefore referred to as SOON. SOON iteratively estimates: 1) k using a scree metric, and 2) the values of AP, G, and W using the Expectation-Maximization (EM) algorithm, where W is white noise and G is diagonal. The final step estimates A and the set of k underlying sources P using a variant of the joint diagonalization method, where P has k independent unit-variance elements. Tests using simulated Auto Regressive (AR) Gaussian data show that SOON improves the quality of source separation in comparison to the standard second-order separation algorithms, i.e., Second-Order Blind Identification (SOBI) [3] and Second-Order Non-Stationary (SONS) blind identification [4]. The sensitivity in performance of SONS and SOON to several algorithmic parameters is also displayed in these experiments. To reduce sensitivities in the pre-whitening step of these algorithms, a heuristic is proposed by this thesis for whitening the data set; it is shown to improve separation performance. Additionally the application of blind source separation techniques to remote sensing data is discussed.
Analysis of remote sensing data collected by the AVIRIS multichannel visible/infrared imaging instrument shows that SOON reveals physically significant dynamics within the data not found by the traditional methods of Principal Component Analysis (PCA) and Noise Adjusted Principal Component Analysis (NAPCA).
by Keith Herring.
S.M.
Haulcy, R'mani(R'mani Symon). "Time-to-contact statistics as a proxy for accident probabilities." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/122699.
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 56-58).
Accidents are relatively rare, and this makes it difficult to study the impact of traffic system changes or vehicle control changes on accident rates. One potential solution to this problem is the use of time-to-contact (TTC) statistics as a proxy for accident probabilities. Low TTC can be used as a measure of potential danger. Simulations were performed to explore whether inverse TTC can serve as a good proxy of accident probability. The resulting data was then analyzed to investigate how inverse TTC varies with the mixture of vehicles with bilateral control as opposed to car-following control. Previously, it was found that a relatively high mixture ratio is needed to prevent phantom traffic jams. The results in this paper show that there is a benefit to mixing bilateral control cars into general traffic, even at relatively low mixture ratios. Simulations were also performed to see how acceleration and jerk vary with the mixture of vehicles with bilateral control so that passenger comfort could be quantified. The results show that bilateral control improves passenger comfort.
by R'mani Haulcy.
S.M.
S.M. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
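The inverse-TTC proxy discussed in the abstract above can be sketched in a few lines (hypothetical helper names; the thesis's traffic simulation is not reproduced): a high inverse TTC flags a dangerous closing situation, and the fraction of time steps above a danger threshold stands in for an accident probability when crashes themselves are too rare to count.

```python
def inverse_ttc(gap, closing_speed):
    """Inverse time-to-contact for a following vehicle.
    closing_speed = follower speed minus leader speed (m/s); only a
    positive value (a shrinking gap) represents collision risk."""
    if gap <= 0.0 or closing_speed <= 0.0:
        return 0.0
    return closing_speed / gap

def danger_exposure(gaps, follower_speeds, leader_speeds, threshold=0.5):
    """Fraction of simulation steps whose inverse TTC exceeds a danger
    threshold: a proxy for accident probability when crashes are rare."""
    risky = sum(
        1
        for g, vf, vl in zip(gaps, follower_speeds, leader_speeds)
        if inverse_ttc(g, vf - vl) > threshold
    )
    return risky / len(gaps)
```

For example, closing at 5 m/s on a 10 m gap gives an inverse TTC of 0.5 s⁻¹ (two seconds to contact), while an opening gap contributes no risk.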
Tolle, Kristin M. "Domain-independent semantic concept extraction using corpus linguistics, statistics and artificial intelligence techniques." Diss., The University of Arizona, 2003. http://hdl.handle.net/10150/280502.
Chen, Hui 1974. "Algorithms and statistics for the detection of binding sites in coding regions." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=97926.
The inter-species sequence conservation observed in coding regions may be the result of two types of selective pressure: the selective pressure on the protein encoded and, sometimes, the selective pressure on the binding sites. To predict some region in coding regions as a binding site, one needs to make sure that the conservation observed in this region is not due to the selective pressure on the protein encoded. To achieve this, COSMO built a null model with only the selective pressure on the protein encoded and computed p-values for the observed conservation scores, conditional on the fixed set of amino acids observed at the leaves.
It is believed, however, that the selective pressure on the protein assumed in COSMO is overly strong. Consequently, some interesting regions may be left undetected. In this thesis, a new method, COSMO-2, is developed to relax this assumption.
The amino acids are first classified into a fixed number of overlapping functional classes by applying an expectation maximization algorithm on a protein database. Two probabilities for each gene position are then calculated: (i) the probability of observing a certain degree of conservation in the orthologous sequences generated under each class in the null model (i.e. the p-value of the observed conservation under each class); and (ii) the probability that the codon column associated with that gene position belongs to each class. The p-value of the observed conservation for each gene position is the sum of the products of the two probabilities for all classes. Regions with low p-values are identified as potential binding sites.
Five sets of orthologous genes are analyzed using COSMO-2. The results show that COSMO-2 can detect the interesting regions identified by COSMO and can detect more interesting regions than COSMO in some cases.
Van, Rooyen Marchand. "Stable parametric optimization." Thesis, McGill University, 1992. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=70259.
Ostberg, Colin R. "Computational pain quantification and the effects of age, gender, culture and cause." Thesis, Marquette University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1554606.
Chronic pain affects more than 100 million Americans and more than 1.5 billion people worldwide. Pain is a multidimensional construct, expressed through a variety of means. Facial expressions are one such type of pain expression. Automatic facial expression recognition, and in particular pain expression recognition, are fields that have been studied extensively. However, no prior work has explored the possibility of an automatic pain quantification algorithm, able to output pain levels based upon a facial image.
Developed for a remote monitoring context, a computational pain quantification algorithm has been developed and validated by two distinct sets of data. The second set of data also included associated data for the fields of age, gender, culture and cause of pain. These four fields were investigated for their effect on automatic pain quantification, determining that age and gender have a definite impact and should be involved in the algorithm, while culture and cause require further investigation.
Deng, Wenping. "Algorithms for Reconstruction of Gene Regulatory Networks from High-Throughput Gene Expression Data." Thesis, Michigan Technological University, 2019. http://pqdtopen.proquest.com/#viewpdf?dispub=13420080.
Understanding gene interactions in complex living systems is one of the central tasks in systems biology. With the availability of microarray and RNA-Seq technologies, a multitude of gene expression datasets has been generated toward novel biological knowledge discovery through statistical analysis and reconstruction of gene regulatory networks (GRNs). Reconstruction of GRNs can reveal the interrelationships among genes and identify the hierarchies of genes and hubs in networks. The new algorithms I developed in this dissertation focus specifically on reconstructing GRNs with increased accuracy from microarray and RNA-Seq high-throughput gene expression data sets.
The first algorithm (Chapter 2) focuses on modeling the transcriptional regulatory relationships between transcription factors (TFs) and pathway genes. Multiple linear regression and its regularized versions, such as ridge regression and the LASSO, are common tools for modeling the relationship between predictor variables and a dependent variable. To deal with outliers in gene expression data, capture the group effect of TFs in regulation, and improve statistical efficiency, the Huber function is proposed as the loss and the Berhu function as the penalty when modeling the relationship between a pathway gene and many or all TFs. A proximal gradient descent algorithm was developed to solve the corresponding optimization problem; it is much faster than the general convex optimization solver CVX. This Huber-Berhu regression was then embedded into a partial least squares (PLS) framework to deal with the high dimensionality and multicollinearity of gene expression data. The results showed that this method can identify the true regulatory TFs for each pathway gene with high efficiency.
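As a minimal sketch of the two ingredients named above (scalar versions only; the thesis applies them inside a proximal gradient solver, which is not reproduced here):

```python
def huber(r, delta=1.0):
    """Huber loss: quadratic near zero, linear in the tails,
    which makes the fit robust to outlying expression values."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

def berhu(b, tau=1.0):
    """Berhu (reversed Huber) penalty: linear near zero (sparsity-inducing),
    quadratic in the tails (ridge-like shrinkage of large coefficients,
    encouraging a group effect among strongly involved TFs)."""
    a = abs(b)
    return a if a <= tau else (a * a + tau * tau) / (2.0 * tau)
```

The objective for one pathway gene would then sum `huber` over residuals and add a multiple of `berhu` summed over the TF coefficients.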
The second algorithm (Chapter 3) focuses on building multilayered hierarchical gene regulatory networks (ML-hGRNs). A backward elimination random forest (BWERF) algorithm was developed for constructing an ML-hGRN operating above a biological pathway or a biological process. The algorithm first divides construction of the ML-hGRN into multiple regression tasks, each involving a regression between a pathway gene and all TFs. Random forest models with backward elimination were used to determine the importance of each TF to a pathway gene. The importance of a TF to the whole pathway was then computed by aggregating the importance values of that TF across the individual pathway genes. Next, an expectation maximization algorithm was used to cut the TFs to form the first layer of direct regulatory relationships. The upper layers of the GRN were constructed in the same way, replacing the pathway genes with the newly cut TFs. Both simulated and real gene expression data were used to test the algorithm and demonstrated its accuracy and efficiency.
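The backward-elimination loop can be sketched as below; `importance_fn` stands in for a random forest importance computation, and the halving schedule is an assumption rather than the exact BWERF schedule:

```python
def backward_eliminate(tfs, importance_fn, keep_fraction=0.5, min_keep=2):
    """Repeatedly drop the least important TFs until only min_keep remain.
    importance_fn maps a candidate TF list to a {tf: importance} dict,
    e.g. refit random forest importances on the surviving candidates."""
    current = list(tfs)
    while len(current) > min_keep:
        scores = importance_fn(current)
        # Keep the most important fraction of the surviving TFs.
        current.sort(key=lambda tf: scores[tf], reverse=True)
        next_n = max(min_keep, int(len(current) * keep_fraction))
        if next_n == len(current):
            break
        current = current[:next_n]
    return current
```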
The third algorithm (Chapter 4) focuses on joint reconstruction of multiple gene regulatory networks (JRmGRN) using gene expression data from multiple tissues or conditions. The formulation assumes hub genes shared across different tissues or conditions. Under the framework of the Gaussian graphical model, the JRmGRN method constructs the GRNs by maximizing a penalized log-likelihood function. The problem was formulated as a convex optimization problem and solved with an alternating direction method of multipliers (ADMM) algorithm. On both simulated and real gene expression data, JRmGRN performed better than existing methods.
Green, Michael A. "Improving Identification of Subtle Changes in Wide-Area Sensing through Dynamic Zoom." Thesis, Delaware State University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10794023.
The past decade has seen an abundance of applications that utilize sensors to collect data. One such example is a gigapixel image, which combines a multitude of high-quality images into a panorama capable of viewing hundreds of acres. The resulting datasets can be quite large, making analysis time consuming and resource intensive. Moreover, coverage of such broad environments can mean numerous sensor feeds to which one must attend. A suitable approach for analysis and sense-making of such data is to focus on “interesting” samples of data, namely regions of interest, or ROIs. ROIs are especially useful in wide-area sensing situations that return datasets that are largely similar from one instance to the next, but also possess small differences. Identifying subtle changes is relevant to certain scenarios in surveillance, such as detecting evidence of human activity. Several ROI detection techniques exist in the research literature. My work focuses on ROI detection tuned to subtle differences for images at varying zoom levels. My thesis consists of developing a method that identifies regions of interest for subtle changes in images. In this pursuit, my contributions address key questions including the characterization of image information dynamics through the introduction of dynamic zoom, the definition and measurement of subtlety, and an approach for scoring and selecting ROIs. This work provides an automated attention mechanism for zoomed images, but is also applicable to domains including satellite imagery and cyber security.
Eiland, E. Earl. "A Coherent Classifier/Prediction/Diagnostic Problem Framework and Relevant Summary Statistics." Thesis, New Mexico Institute of Mining and Technology, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10617960.
Classification is a ubiquitous decision activity. Regardless of whether it is predicting the future, e.g., a weather forecast, determining an existing state, e.g., a medical diagnosis, or some other activity, classifier outputs drive future actions. Because of their importance, classifier research and development is an active field.
Regardless of whether one is a classifier developer or an end user, evaluating and comparing classifier output quality is important. Intuitively, classifier evaluation may seem simple; in practice it is not. There is a plethora of classifier summary statistics, and new ones seem to surface regularly, suggesting that users are not satisfied with the existing options. For end users in particular, many existing summary statistics do not provide actionable information. This dissertation addresses the end user's quandary.
The work consists of four parts: 1. Considering eight summary statistics with regard to their purpose (what questions they quantitatively answer) and efficacy (as defined by measurement theory). 2. Characterizing the classification problem from the end user's perspective and identifying four axioms for end-user-efficacious classifier evaluation summary statistics. 3. Applying the axioms and measurement theory to evaluate the eight summary statistics and create two compliant (end-user-efficacious) summary statistics. 4. Using the compliant summary statistics to show the actionable information they generate.
By applying the recommendations in this dissertation, both end users and researchers benefit. Researchers have summary statistic selection and classifier evaluation protocols that generate the most usable information. End users can also generate information that facilitates tool selection and optimal deployment, if classifier test reports provide the necessary information.
Xu, Yushi Ph D. Massachusetts Institute of Technology. "Combining linguistics and statistics for high-quality limited domain English-Chinese machine translation." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/44726.
Includes bibliographical references (p. 86-87).
Second language learning is a compelling activity in today's global markets. This thesis focuses on critical technology necessary to produce a computer spoken translation game for learning Mandarin Chinese in a relatively broad travel domain. Three main aspects are addressed: efficient Chinese parsing, high-quality English-Chinese machine translation, and how these technologies can be integrated into a translation game system. In the language understanding component, the TINA parser is enhanced with bottom-up and long-distance constraint features. The results showed that with these features, the Chinese grammar ran ten times faster and covered 15% more of the test set. In the machine translation component, a method combining a linguistic and a statistical system is introduced. English-Chinese translation is done via an intermediate language, "Zhonglish": English-Zhonglish translation is accomplished by a parse-and-paraphrase paradigm using hand-coded rules, mainly for structural reconstruction, while Zhonglish-Chinese translation is handled by a standard phrase-based statistical machine translation system, mostly performing word sense disambiguation and lexicon mapping. We evaluated on an independent test set from the IWSLT travel-domain spoken language corpus. Substantial improvements were achieved in GIZA alignment crossover: a 45% decrease in crossovers compared to a traditional phrase-based statistical MT system, and the BLEU score improved by 2 points. Finally, a framework for the translation game system is described, and the feasibility of integrating the components to produce reference translations and to automatically assess students' translations is verified.
by Yushi Xu.
S.M.
Schmidt, Molly A. "Weighting protein ensembles with Bayesian statistics and small-angle X-ray scattering data." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/119574.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 52-54).
Intrinsically Disordered Proteins (IDPs) are involved in a number of neurodegenerative disorders such as Parkinson's and Alzheimer's diseases. Their disordered nature allows them to sample many different conformations, so their structures must be represented as ensembles. Typically, structural ensembles for IDPs are constructed by generating a set of conformations that yield ensemble averages that agree with pre-existing experimental data. However, as the number of experimental constraints is usually much smaller than the degrees of freedom in the protein, the ensemble construction process is under-determined, meaning there are many different ensembles that agree with a given set of experimental observables. The Variational Bayesian Weighting program uses Bayesian statistics to fit conformational ensembles, and in doing so also quantifies the uncertainty in the underlying ensemble. The present work sought to introduce new functionality to this program, allowing it to use data obtained from Small-Angle X-ray Scattering.
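A minimal sketch of scoring one candidate weight vector against a SAXS curve, assuming per-conformer scattering profiles computed on a common q-grid (the actual program infers the weights and their uncertainty via variational Bayes, which is not shown):

```python
def chi_squared(weights, profiles, experimental, sigma):
    """Reduced chi-squared between the weighted-ensemble SAXS curve and experiment.
    profiles[k][i] is conformer k's computed intensity at the i-th q value;
    sigma[i] is the experimental uncertainty at that q value."""
    n_q = len(experimental)
    # Ensemble average: weighted sum of per-conformer profiles at each q.
    model = [sum(w * p[i] for w, p in zip(weights, profiles)) for i in range(n_q)]
    return sum(((m - e) / s) ** 2 for m, e, s in zip(model, experimental, sigma)) / n_q
```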
by Molly A. Schmidt.
M. Eng.
Yong, Florence Hiu-Ling. "Quantitative Methods for Stratified Medicine." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:17463130.
Biostatistics
Vũ, John Huân. "Software Internationalization: A Framework Validated Against Industry Requirements for Computer Science and Software Engineering Programs." DigitalCommons@CalPoly, 2010. https://digitalcommons.calpoly.edu/theses/248.
Chavali, Krishna Kumar. "Integration of statistical and neural network method for data analysis." Morgantown, W. Va. : [West Virginia University Libraries], 2006. https://eidr.wvu.edu/etd/documentdata.eTD?documentid=4749.
Title from document title page. Document formatted into pages; contains viii, 68 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 50-51).
Xiong, Kuangnan. "Roughened Random Forests for Binary Classification." Thesis, State University of New York at Albany, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3624962.
Binary classification plays an important role in many decision-making processes. Random forests build a strong ensemble classifier by combining weaker classification trees that are de-correlated. The strength of and correlation among individual classification trees are the key factors that determine the ensemble performance of random forests. We propose roughened random forests, a new set of tools which show further improvement over random forests in binary classification. Roughened random forests modify the original dataset for each classification tree, further reducing the correlation among individual trees. This data modification process consists of artificially imposing missing data that are missing completely at random (MCAR), followed by missing-data imputation.
Through this dissertation we aim to answer a few important questions in building roughened random forests: (1) What is the ideal rate of missing data to impose on the original dataset? (2) Should we impose missing data on both the training and testing datasets, or only on the training dataset? (3) What are the best missing data imputation methods to use in roughened random forests? (4) Do roughened random forests share the same ideal number of covariates selected at each tree node as the original random forests? (5) Can roughened random forests be used in medium- to high-dimensional datasets?
Navaroli, Nicholas Martin. "Generative Probabilistic Models for Analysis of Communication Event Data with Applications to Email Behavior." Thesis, University of California, Irvine, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3668831.
Our daily lives increasingly involve interactions with others via different communication channels, such as email, text messaging, and social media. In this context, the ability to analyze and understand our communication patterns is becoming increasingly important. This dissertation focuses on generative probabilistic models for describing different characteristics of communication behavior, focusing primarily on email communication.
First, we present a two-parameter kernel density estimator for estimating the probability density over recipients of an email (or, more generally, items which appear in an itemset). A stochastic gradient method is proposed for efficiently inferring the kernel parameters given a continuous stream of data. Next, we apply the kernel model and the Bernoulli mixture model to two important prediction tasks: given a partially completed email recipient list, 1) predict which others will be included in the email, and 2) rank potential recipients based on their likelihood to be added to the email. Such predictions are useful in suggesting future actions to the user (i.e. which person to add to an email) based on their previous actions. We then investigate a piecewise-constant Poisson process model for describing the time-varying communication rate between an individual and several groups of their contacts, where changes in the Poisson rate are modeled as latent state changes within a hidden Markov model.
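As an illustration of the recipient-ranking task, the sketch below scores candidates by simple co-occurrence with the partial list across past emails; this is a frequency proxy, not the kernel or Bernoulli mixture models used in the dissertation:

```python
from collections import Counter

def rank_candidates(partial, history):
    """Rank potential recipients for a partially completed recipient list.
    history is a list of past emails, each a list of recipients."""
    scores = Counter()
    partial_set = set(partial)
    for recipients in history:
        # Only past emails sharing someone with the partial list are evidence.
        if partial_set & set(recipients):
            for r in recipients:
                if r not in partial_set:
                    scores[r] += 1
    return [r for r, _ in scores.most_common()]
```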
We next focus on the time it takes for an individual to respond to an event, such as receiving an email. We show that this response time depends heavily on the individual's typical daily and weekly patterns, which are not adequately captured in standard models of response time (e.g. the Gamma distribution or Hawkes processes). A time-warping mechanism is introduced in which the absolute response time is modeled as a transformation of an effective response time, relative to the daily and weekly patterns of the individual. The usefulness of applying the time-warping mechanism to standard models of response time, both in terms of log-likelihood and accuracy in predicting which events will be responded to quickly, is illustrated on several individual email histories.
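The time-warping idea can be sketched with a discrete hourly activity profile: hours when the individual is typically inactive contribute less effective time. The hourly granularity and function name are illustrative assumptions, not the dissertation's continuous formulation:

```python
def effective_response_time(send_hour, reply_hour, hourly_rate):
    """Warp an absolute response time (whole hours, possibly spanning days)
    into effective time under a 24-hour activity profile.
    hourly_rate[h] is the individual's typical activity level at hour h."""
    total = 0.0
    h = send_hour
    while h < reply_hour:
        total += hourly_rate[h % 24]  # inactive hours add little effective time
        h += 1
    return total
```

With a flat profile the effective time equals the absolute time; with a profile that is zero overnight, an email received at night is not "slow" until working hours resume.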
Vang, Yeeleng Scott. "An Ensemble Prognostic Model for Metastatic, Castrate-Resistant Prostate Cancer." Thesis, University of California, Irvine, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10162542.
Metastatic, castrate-resistant prostate cancer (mCRPC) is one of the most prevalent cancers and the third leading cause of cancer death among men. Several treatment options have been developed to combat mCRPC; however, none has produced tangible benefits to patients' overall survival. As part of a crowd-sourced algorithm development competition, participants were asked to develop new prognostic models for mCRPC patients treated with docetaxel. Such results could potentially assist clinical decision making for future mCRPC patients.
In this thesis, we present a new ensemble prognostic model for risk prediction in mCRPC patients treated with docetaxel. We rely on a traditional survival analysis model, the Cox proportional hazards model, as well as a more recently developed boosting model that incorporates a smooth approximation of the concordance index for direct optimization. Our model outperforms the current state-of-the-art mCRPC prognostic models on the concordance index and is competitive with these models on the integrated time-dependent area under the receiver operating characteristic curve.
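The concordance index that the boosting model smoothly approximates can be computed exactly for small data as follows (a higher predicted risk should pair with an earlier observed event; ties in risk count as half):

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs where the higher predicted risk
    corresponds to the earlier observed event. events[i] is 1 if the
    event was observed (uncensored) for subject i, 0 if censored."""
    num = den = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable only if i's event was observed
            # and happened strictly before j's observed time.
            if events[i] and times[i] < times[j]:
                den += 1
                if risks[i] > risks[j]:
                    num += 1
                elif risks[i] == risks[j]:
                    num += 0.5
    return num / den if den else 0.0
```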
Wu, Tao. "Higher-order Random Walk Methods for Data Analysis." Thesis, Purdue University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10790747.
Markov random walk models are powerful analytical tools for many areas in machine learning, numerical optimization and data mining. The key assumption of a first-order Markov chain is memorylessness, which restricts the dependence of the transition distribution to the current state only. In many applications, however, this assumption is not appropriate. We propose a set of higher-order random walk techniques and discuss their applications to tensor co-clustering, user trail modeling, and solving linear systems. First, we develop a new random walk model that we call the super-spacey random surfer, which simultaneously clusters the rows, columns, and slices of a nonnegative three-mode tensor; the algorithm generalizes to tensors with any number of modes. We partition the tensor by minimizing the exit probability between clusters when the super-spacey random walk is at stationarity. The second application is user trail modeling, where user trails record sequences of activities as individuals interact with the Internet and the world. We propose the retrospective higher-order Markov process as a two-step process: first choose a state from the history, then transition as a first-order chain conditional on that state. This restricts the total number of parameters and thus protects the model from overfitting. Lastly, we propose using a time-inhomogeneous Markov chain to approximate the solution of a linear system, where multiple simulations of the random walk are conducted to approximate the solution. By allowing the random walk to transition based on multiple matrices, we decrease the variance of the simulations and thus increase the speed of the solver.
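One step of the retrospective higher-order process described above can be sketched as follows; the data structures (a weighted history list and a nested transition dict) are illustrative assumptions:

```python
import random

def retrospective_step(history, history_weights, transition, rng):
    """Two-step retrospective process: first pick a past state by weight,
    then take one first-order transition conditional on that state.
    transition[s] maps next states to probabilities, so only first-order
    transition parameters are needed regardless of the history length."""
    state = rng.choices(history, weights=history_weights)[0]
    nexts = list(transition[state].keys())
    probs = list(transition[state].values())
    return rng.choices(nexts, weights=probs)[0]
```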