Dissertations / Theses on the topic 'Statistics and Computer Science'

Consult the top 50 dissertations / theses for your research on the topic 'Statistics and Computer Science.'

Next to every source in the list of references there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Raj, Alvin Andrew. "Ambiguous statistics - how a statistical encoding in the periphery affects perception." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/79214.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 159-163).
Recent understanding in human vision suggests that the periphery compresses visual information to a set of summary statistics. Some visual information is robust to this lossy compression, but other information, such as spatial location and phase, is not perfectly represented, leading to ambiguous interpretations. Using the statistical encoding, we can visualize the information available in the periphery to gain intuitions about human performance in visual tasks, which have implications for user interface design, or more generally, whether the periphery encodes sufficient information to perform a task without additional eye movements. The periphery is most of the visual field. If it undergoes these losses of information, then our perception and ability to perform tasks efficiently are affected. We show that the statistical encoding explains human performance in classic visual search experiments. Based on the statistical understanding, we also propose a quantitative model that can estimate the average number of fixations humans would need to find a target in a search display. Further, we show that the ambiguities in the peripheral representation predict many aspects of some illusions. In particular, the model correctly predicts how polarity and width affect the Pinna-Gregory illusion. Visualizing the statistical representation of the illusion shows that many qualitative aspects of the illusion are captured by the statistical ambiguities. We also investigate a phenomenon known as Object Substitution Masking (OSM), where the identity of an object is impaired when a sparse, non-overlapping, and temporally trailing mask surrounds that object. We find that different types of grouping of object and mask produce different levels of impairment. This contradicts a theory about OSM which predicts that grouping should always increase masking strength. We speculate on some reasons why the statistical model of the periphery may explain OSM.
by Alvin Andrew Raj.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
2

Goudie, Robert J. B. "Bayesian structural inference with applications in social science." Thesis, University of Warwick, 2011. http://wrap.warwick.ac.uk/78778/.

Full text
Abstract:
Structural inference for Bayesian networks is useful in situations where the underlying relationship between the variables under study is not well understood. This is often the case in social science settings in which, whilst there are numerous theories about interdependence between factors, there is rarely a consensus view that would form a solid base upon which inference could be performed. However, there are now many social science datasets available with sample sizes large enough to allow a more exploratory structural approach, and this is the approach we investigate in this thesis. In the first part of the thesis, we apply Bayesian model selection to address a key question in empirical economics: why do some people take unnecessary risks with their lives? We investigate this question in the setting of road safety, and demonstrate that less satisfied individuals wear seatbelts less frequently. Bayesian model selection over restricted structures is a useful tool for exploratory analysis, but fuller structural inference is more appealing, especially when there is a considerable quantity of data available, but scant prior information. However, robust structural inference remains an open problem. Surprisingly, it is especially challenging for large n problems, which are sometimes encountered in social science. In the second part of this thesis we develop a new approach that addresses this problem: a Gibbs sampler for structural inference, which we show gives robust results in many settings in which existing methods do not. In the final part of the thesis we use the sampler to investigate depression in adolescents in the US, using data from the Add Health survey. The result stresses the importance of adolescents not getting medical help even when they feel they should, an aspect that has been discussed previously, but not emphasised.
APA, Harvard, Vancouver, ISO, and other styles
3

Meintjes, M. M. (Maria Magdalena). "Evaluating the properties of sensory tests using computer intensive and biplot methodologies." Thesis, Stellenbosch : Stellenbosch University, 2007. http://hdl.handle.net/10019.1/20881.

Full text
Abstract:
Assignment (MComm)--University of Stellenbosch, 2007.
ENGLISH ABSTRACT: This study is the result of part-time work done at a product development centre. The organisation extensively makes use of trained panels in sensory trials designed to assess the quality of its product. Although standard statistical procedures are used for analysing the results arising from these trials, circumstances necessitate deviations from the prescribed protocols. Therefore the validity of conclusions drawn as a result of these testing procedures might be questionable. This assignment deals with these questions. Sensory trials are vital in the development of new products, control of quality levels and the exploration of improvement in current products. Standard test procedures used to explore such questions exist but are in practice often implemented by investigators who have little or no statistical background. Thus test methods are implemented as black boxes and procedures are used blindly without checking all the appropriate assumptions and other statistical requirements. The specific product under consideration often warrants certain modifications to the standard methodology. These changes may have some unknown effect on the obtained results and therefore should be scrutinized to ensure that the results remain valid. The aim of this study is to investigate the distribution and other characteristics of sensory data, comparing the hypothesised, observed and bootstrap distributions. Furthermore, the standard testing methods used to analyse sensory data sets will be evaluated. After comparing these methods, alternative testing methods may be introduced and then tested using newly generated data sets. Graphical displays are also useful to get an overall impression of the data under consideration. Biplots are especially useful in the investigation of multivariate sensory data. The underlying relationships among attributes and their combined effect on the panellists’ decisions can be visually investigated by constructing a biplot. Results obtained by implementing biplot methods are compared to those of sensory tests, i.e. whether a significant difference between objects will correspond to large distances between the points representing objects in the display. In conclusion some recommendations are made as to how the organisation under consideration should implement sensory procedures in future trials. However, these proposals are preliminary and further research is necessary before final adoption. Some issues for further investigation are suggested.
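The comparison of hypothesised, observed and bootstrap distributions described above rests on ordinary nonparametric resampling. As a minimal Python illustration (with made-up panel ratings, not the organisation's data), the following sketch bootstraps the distribution of a mean rating difference between two product variants:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical panel ratings for two product variants (9-point scale).
ratings_a = np.array([6, 7, 5, 6, 8, 7, 6, 5, 7, 6], dtype=float)
ratings_b = np.array([5, 6, 5, 5, 7, 6, 5, 4, 6, 5], dtype=float)

observed_diff = ratings_a.mean() - ratings_b.mean()

# Bootstrap the difference in means by resampling panellists with replacement.
n_boot = 10_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    boot_diffs[i] = (rng.choice(ratings_a, size=ratings_a.size, replace=True).mean()
                     - rng.choice(ratings_b, size=ratings_b.size, replace=True).mean())

# Percentile confidence interval for the mean difference.
lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
print(f"observed diff = {observed_diff:.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```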
APA, Harvard, Vancouver, ISO, and other styles
4

Billups, Robert Brent. "COMPUTER ASSISTED TREATMENT EVALUATION." University of Cincinnati / OhioLINK, 2001. http://rave.ohiolink.edu/etdc/view?acc_num=ucin997908439.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Sjöbergh, Jonas. "Language Technology for the Lazy : Avoiding Work by Using Statistics and Machine Learning." Doctoral thesis, KTH, Numerisk Analys och Datalogi, NADA, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4023.

Full text
Abstract:
Language technology is when a computer processes human languages in some way. Since human languages are irregular and hard to define in detail, this is often difficult. Despite this, good results can often be achieved. Often a lot of manual work is used in creating these systems though. While this usually gives good results, it is not always desirable. For smaller languages the resources for manual work might not be available, since it is usually time consuming and expensive. This thesis discusses methods for language processing where manual work is kept to a minimum. Instead, the computer does most of the work. This usually means basing the language processing methods on statistical information. These kinds of methods can normally be applied to other languages than they were originally developed for, without requiring much manual work for the language transition. The first half of the thesis mainly deals with methods that are useful as tools for other language processing methods. Ways to improve part of speech tagging, which is an important part in many language processing systems, without using manual work, are examined. Statistical methods for analysis of compound words, also useful in language processing, are also discussed. The first part is rounded off by a presentation of methods for evaluation of language processing systems. As languages are not very clearly defined, it is hard to prove that a system does anything useful. Thus it is very important to evaluate systems, to see if they are useful. Evaluation usually entails manual work, but in this thesis two methods with minimal manual work are presented. One uses a manually developed resource for evaluating other properties than originally intended with no extra work. The other method shows how to calculate an estimate of the system performance without using any manual work at all. In the second half of the thesis, language technology tools that are in themselves useful for a human user are presented. This includes statistical methods for detecting errors in texts. These methods complement traditional methods, based on manually written error detection rules, for instance by being able to detect errors that the rule writer could not imagine that writers could make. Two methods for automatic summarization are also presented. One is based on comparing the overall impression of the summary to that of the original text. This is based on statistical methods for measuring the contents of a text. The second method tries to mitigate the common problem of very sudden topic shifts in automatically generated summaries. After this, a modified method for automatically creating a lexicon between two languages by using lexicons to a common intermediary language is presented. This type of method is useful since there are many language pairs in the world lacking a lexicon, but many languages have lexicons available with translations to one of the larger languages of the world, for instance English. The modifications were intended to improve the coverage of the lexicon, possibly at the cost of lower translation quality. Finally a program for generating puns in Japanese is presented. The generated puns are not very funny; the main purpose of the program is to test the hypothesis that by using "bad words" things become a little funnier.
APA, Harvard, Vancouver, ISO, and other styles
6

Kress, Linda. "Analysis of computer science curriculum through development of an online crime reporting system." Morgantown, W. Va. : [West Virginia University Libraries], 2006. https://eidr.wvu.edu/etd/documentdata.eTD?documentid=4601.

Full text
Abstract:
Thesis (M.S.)--West Virginia University, 2006.
Title from document title page. Document formatted into pages; contains vii, 189 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 175-189).
APA, Harvard, Vancouver, ISO, and other styles
7

Xiang, Gang. "Fast algorithms for computing statistics under interval uncertainty with applications to computer science and to electrical and computer engineering /." To access this resource online via ProQuest Dissertations and Theses @ UTEP, 2007. http://0-proquest.umi.com.lib.utep.edu/login?COPT=REJTPTU0YmImSU5UPTAmVkVSPTI=&clientId=2515.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Clough, Andrew Lawrence. "Increasing adder efficiency by exploiting input statistics." Thesis, Massachusetts Institute of Technology, 2007. http://hdl.handle.net/1721.1/42424.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2008.
Includes bibliographical references (p. 49-50).
Current techniques for characterizing the power consumption of adders rely on assuming that the inputs are completely random. However, the inputs generated by realistic applications are not random, and in fact include a great deal of structure. Input bits are more likely to remain in the same logical states from addition to addition than would be expected by chance, and bits, especially the most significant bits, are very likely to be in the same state as their neighbors. Taking this data, I look at ways that it can be used to improve the design of adders. The first method examines how different adder architectures respond to the different characteristics of input data from the more significant and less significant bits of the adder, and tries to use these responses to create a hybrid adder. Unfortunately the differences are not sufficient for this approach to be effective. I next look at the implications of the data I collected for the optimization of Kogge-Stone adder trees, and find that in certain circumstances the use of experimentally derived activity maps rather than ones based on simple assumptions can increase adder performance by as much as 30%.
by Andrew Lawrence Clough.
M.Eng.
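The bit-level structure noted in the abstract (bits tending to hold their state between additions and to agree with their more significant neighbours) can be measured from any operand trace. A minimal sketch in Python, using a hypothetical stream of small signed offsets rather than the thesis's benchmark data:

```python
import numpy as np

def bit_matrix(values, width=16):
    """Unpack integers into an (n_samples, width) array of bits, LSB first."""
    values = np.asarray(values, dtype=np.int64)
    return (values[:, None] >> np.arange(width)) & 1

# Hypothetical operand stream: small signed offsets truncated to 16 bits,
# as many real workloads would produce.
rng = np.random.default_rng(1)
samples = rng.normal(0, 200, size=5000).astype(np.int64) & 0xFFFF
bits = bit_matrix(samples)

# Per-bit toggle probability between consecutive additions.
toggle_prob = (bits[1:] != bits[:-1]).mean(axis=0)

# Probability that a bit equals its more significant neighbour (same sample).
neighbour_agree = (bits[:, :-1] == bits[:, 1:]).mean(axis=0)

print("toggle probability per bit:", np.round(toggle_prob, 3))
print("agreement with next bit:   ", np.round(neighbour_agree, 3))
```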
APA, Harvard, Vancouver, ISO, and other styles
9

Tikekar, Mehul (Mehul Deepak). "Energy-efficient video decoding using data statistics." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/113990.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 103-108).
Video traffic over the Internet is growing rapidly and is projected to be about 82% of the total consumer Internet traffic by 2020. To address this, new video coding standards such as H.265/HEVC (High Efficiency Video Coding) provide better compression especially at Full HD and higher video resolutions. HEVC achieves this through a variety of algorithmic techniques such as larger transform sizes and more accurate inter-frame prediction. However, these techniques increase the complexity of software and hardware-based video decoders. In this thesis, we design a hardware-based video decoder chip that exploits the statistics of the video to reduce the energy/pixel cost in several ways. For example, we exploit the sparsity in transform coefficients to reduce the energy/pixel cost of inverse transform by 29%. With the proposed architecture, larger transforms have the same energy/pixel cost as smaller transforms owing to their higher sparsity, thus addressing the increased complexity of HEVC's larger transform sizes. As a second example, the energy/pixel cost of inter-prediction is dominated by off-chip memory access. We eliminate off-chip memory access by using on-chip embedded DRAM (eDRAM). However, eDRAM banks spend 80% of their energy on frequent refresh operations to retain stored data. To reduce refresh energy, we compress the video data stored in the eDRAM by exploiting spatial correlation among pixels. Thus, unused eDRAM banks can be turned off to reduce refresh energy by 55%. This thesis presents measured results for a 40 nm CMOS test chip that can decode Full HD video at 20 - 50 frames per second while consuming only 25 - 31 mW of system power. The system power is 6 times lower than the state-of-the-art and can enable even extremely energy-constrained wearable devices to decode video without exceeding their power budgets. The inverse transform result can enable future coding standards to use even larger transform sizes to improve compression without sacrificing energy efficiency.
by Mehul Tikekar.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
10

Sharan, Lavanya. "Image statistics and the perception of surface reflectance." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/34356.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.
MIT Institute Archives copy: p. 223 (last page) bound in reverse order.
Includes bibliographical references (p. 217-223).
Humans are surprisingly good at judging the reflectance of complex surfaces even when the surfaces are viewed in isolation, contrary to the Gelb effect. We argue that textural cues are important for this task. Traditional machine vision systems, on the other hand, are incapable of recognizing reflectance properties. Estimating the reflectance of a complex surface under unknown illumination from a single image is a hard problem. Recent work in reflectance recognition has shown that certain statistics measured on an image of a surface are diagnostic of reflectance. We consider opaque surfaces with medium scale structure and spatially homogeneous reflectance properties. For such surfaces, we find that statistics of intensity histograms and histograms of filtered outputs are indicative of the diffuse surface reflectance. We compare the performance of a learning algorithm that employs these image statistics to human performance in two psychophysical experiments. In the first experiment, observers classify images of complex surfaces according to the perceived reflectance. We find that the learning algorithm rivals human performance at the classification task. In the second experiment, we manipulate the statistics of images and ask observers to provide reflectance ratings. In this case, the learning algorithm performs similarly to human observers. These findings lead us to conclude that the image statistics capture perceptually relevant information.
by Lavanya Sharan.
S.M.
APA, Harvard, Vancouver, ISO, and other styles
11

Dror, Ron O. (Ron Ofer) 1975. "Surface reflectance recognition and real-world illumination statistics." Thesis, Massachusetts Institute of Technology, 2002. http://hdl.handle.net/1721.1/16911.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2003.
Includes bibliographical references (p. 141-150).
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Humans distinguish materials such as metal, plastic, and paper effortlessly at a glance. Traditional computer vision systems cannot solve this problem at all. Recognizing surface reflectance properties from a single photograph is difficult because the observed image depends heavily on the amount of light incident from every direction. A mirrored sphere, for example, produces a different image in every environment. To make matters worse, two surfaces with different reflectance properties could produce identical images. The mirrored sphere simply reflects its surroundings, so in the right artificial setting, it could mimic the appearance of a matte ping-pong ball. Yet, humans possess an intuitive sense of what materials typically "look like" in the real world. This thesis develops computational algorithms with a similar ability to recognize reflectance properties from photographs under unknown, real-world illumination conditions. Real-world illumination is complex, with light typically incident on a surface from every direction. We find, however, that real-world illumination patterns are not arbitrary. They exhibit highly predictable spatial structure, which we describe largely in the wavelet domain. Although they differ in several respects from the typical photographs, illumination patterns share much of the regularity described in the natural image statistics literature. These properties of real-world illumination lead to predictable image statistics for a surface with given reflectance properties. We construct a system that classifies a surface according to its reflectance from a single photograph under unknown illumination. Our algorithm learns relationships between surface reflectance and certain statistics computed from the observed image.
Like the human visual system, we solve the otherwise underconstrained inverse problem of reflectance estimation by taking advantage of the statistical regularity of illumination. For surfaces with homogeneous reflectance properties and known geometry, our system rivals human performance.
by Ron O. Dror.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
12

Terrell, David. "Racial inequalities in America| Examining socioeconomic statistics using the Semantic Web." Thesis, Florida Atlantic University, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10154928.

Full text
Abstract:

The visualization of recent episodes regarding apparently unjustifiable deaths of minorities, caused by police and federal law enforcement agencies, has been amplified through today’s social media and television networks. Such events may seem to imply that issues concerning racial inequalities in America are getting worse. However, we do not know whether such indications are factual; whether this is a recent phenomenon, whether racial inequality is escalating relative to earlier decades, or whether it is better in certain regions of the nation compared to others.

We have built a semantic engine for the purpose of querying statistics on various metropolitan areas, based on a database of individual deaths. Separately, we have built a database of demographic data on poverty, income, education attainment, and crime statistics for the top 25 most populous metropolitan areas. These data will ultimately be combined with government data to evaluate this hypothesis, and provide a tool for predictive analytics. In this thesis, we will provide preliminary results in that direction.

The methodology in our research consisted of multiple steps. We initially described our requirements and drew data from numerous datasets, which contained information on the 23 highest populated Metropolitan Statistical Areas in the United States. After all of the required data was obtained, we decomposed the Metropolitan Statistical Area records into domain components and created an Ontology/Taxonomy via Protégé to determine a hierarchy of nouns for identifying significant keywords throughout the datasets to use as search queries. Next, we used a Semantic Web implementation, accompanied by the Python programming language and FuXi, to build and instantiate a vocabulary. The Ontology was then parsed for the entered search query and returned corresponding results, providing a semantically organized and relevant output in RDF/XML format.
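The pipeline described above (Python, an RDF/XML vocabulary, and queries over it) can be illustrated with the rdflib library. This is only a sketch; the file name, namespace and property names below are hypothetical placeholders, not the thesis's actual vocabulary:

```python
from rdflib import Graph

# Load the RDF/XML output produced by the ontology pipeline.
# "msa_statistics.rdf" and the ex: vocabulary below are hypothetical placeholders.
g = Graph()
g.parse("msa_statistics.rdf", format="xml")

# SPARQL query: poverty rate for each metropolitan statistical area,
# assuming hasName / povertyRate properties in the vocabulary.
query = """
PREFIX ex: <http://example.org/msa#>
SELECT ?name ?poverty
WHERE {
    ?msa ex:hasName ?name ;
         ex:povertyRate ?poverty .
}
ORDER BY DESC(?poverty)
"""

for name, poverty in g.query(query):
    print(name, poverty)
```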

APA, Harvard, Vancouver, ISO, and other styles
13

Li, Shaolin 1963. "Stochastic approximation algorithms for statistical estimation." Thesis, McGill University, 1996. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=42077.

Full text
Abstract:
This thesis presents some broadly applicable algorithms for computing maximum likelihood estimates (MLE) from the incomplete data based on the stochastic approximation (SA) proposed by Robbins and Monro (1951). The usual approach for such problems is the EM algorithm. In many interesting examples, however, it is impossible to carry out either the E-step or the M-step of the EM algorithm. Although some remedial EM algorithms were developed, these algorithms could be very expensive numerically, especially when both the E-step and the M-step become intractable. The SA algorithms proposed are appealing because they avoid computing expectation within iterations and are easy to implement. These advantages are reinforced by a discussion of some examples illustrating how these SA algorithms can succeed while the EM algorithm is intractable. Theory showing convergence of these algorithms along with the rate of their optimal convergence is developed. Moreover, the thesis also explores some theoretical issues about the robustness, consistency and asymptotic theory of the SA estimation of MLE with incomplete data in the double array sense.
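The Robbins-Monro (1951) iteration underlying these SA algorithms is easy to show on a toy problem. The sketch below is not the thesis's incomplete-data MLE setting; it simply finds the root of E[H(theta, X)] = 0 with H(theta, x) = x - theta, so the iterate converges to the mean of the sampling distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Robbins-Monro iteration: theta_{t+1} = theta_t + a_t * H(theta_t, X_t),
# where E[H(theta, X)] = 0 at the solution.  Here H(theta, x) = x - theta,
# the score of a unit-variance normal likelihood, so theta converges to the mean.
def robbins_monro(sample, theta0=0.0, n_iter=5000):
    theta = theta0
    for t in range(1, n_iter + 1):
        x = sample()
        a_t = 1.0 / t          # step sizes with sum a_t = inf, sum a_t^2 < inf
        theta += a_t * (x - theta)
    return theta

true_mean = 3.0
estimate = robbins_monro(lambda: rng.normal(true_mean, 2.0))
print(f"SA estimate {estimate:.3f} vs true mean {true_mean}")
```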
APA, Harvard, Vancouver, ISO, and other styles
14

Almulla, Mohammed Ali. "A class of greedy algorithms for solving the travelling salesman problem /." Thesis, McGill University, 1990. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=59557.

Full text
Abstract:
The travelling salesman problem is one of the NP-complete problems. It has been under consideration in computer science for at least forty years. Solving this hard problem using search methods can be accomplished by choosing: a starting point, a solution generation scheme and a termination rule. When the termination rule is such that search stops if and only if the tour is optimal, we call the method "exact". When the termination rule is such that the search stops but not necessarily with an optimal tour, we call the method "approximate".
This thesis looks closely at one of the approximate methods, namely sub-optimal tour building. In particular, it focuses on the nearest neighbour algorithm (a greedy algorithm). By being greedy at every step of the procedure, this algorithm returns an approximate solution that is near optimal in terms of solution cost. Next, this greedy algorithm is used in implementing a new algorithm that is called the "Multi-Degree Greedy Algorithm". By being greedy at half of the procedure steps, this algorithm returns optimal solutions to travelling salesman problems 99% of the time. Thus, this algorithm is an approximate algorithm, designed to run on small-scale travelling salesman problems (n < 20).
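The nearest neighbour heuristic the thesis builds on is short enough to state in full. A sketch on randomly generated cities (not the instances used in the thesis):

```python
import math
import random

def nearest_neighbour_tour(points, start=0):
    """Greedy tour: repeatedly visit the closest unvisited city."""
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(last, points[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(points, tour):
    """Total length of the closed tour."""
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(15)]   # n < 20
tour = nearest_neighbour_tour(cities)
print(tour, round(tour_length(cities, tour), 3))
```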
APA, Harvard, Vancouver, ISO, and other styles
15

Duguay, Richard. "Speech recognition : transition probability training in diphone bootstraping." Thesis, McGill University, 1999. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=21544.

Full text
Abstract:
This work explores possible methods of improving already well-trained diphone models using the same data set that was used to train the base monophones. The emphasis is placed on transition probability training. A simple approach to probability adaptation is used as a test of the expected magnitude of change in performance. Various other methods of probability modification are explored, including sample pruning, unseen model substitution, and the use of phonetically tied mixtures. Model performance improvement is observed by comparison with similar experiments.
APA, Harvard, Vancouver, ISO, and other styles
16

Sverchkov, Yuriy. "Detection and explanation of statistical differences across a pair of groups." Thesis, University of Pittsburgh, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3647988.

Full text
Abstract:

The task of explaining differences across groups is a task that people encounter often, not only in the research environment, but also in less formal settings. Existing statistical tools designed specifically for discovering and understanding differences are limited. The methods developed in this dissertation provide such tools and help understand what properties such tools should have to be successful and to motivate further development of new approaches to discovering and understanding differences.

This dissertation presents a novel approach to comparing groups of data points. The process of comparing groups of data is divided into multiple stages: The learning of maximum a posteriori models for the data in each group, the identification of statistical differences between model parameters, the construction of a single model that captures those differences, and finally, the explanation of inferences of differences in marginal distributions in the form of an account of clinically significant contributions of elemental model differences to the marginal difference. A general framework for the process, applicable to a broad range of model types, is presented. This dissertation focuses on applying this framework to Bayesian networks over multinomial variables.

To evaluate model learning and the detection of parameter differences an empirical evaluation of methods for identifying statistically significant differences and clinically significant differences is performed. To evaluate the generated explanations of how differences in the models account for the differences in probabilities computed from those models, case studies with real clinical data are presented, and the findings generated by explanations are discussed. An interactive prototype that allows a user to navigate through such an explanation is presented, and ideas are discussed for further development of data analysis tools for comparing groups of data.

APA, Harvard, Vancouver, ISO, and other styles
17

Fang, Youhan. "Efficient Markov Chain Monte Carlo Methods." Thesis, Purdue University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10809188.

Full text
Abstract:

Generating random samples from a prescribed distribution is one of the most important and challenging problems in machine learning, Bayesian statistics, and the simulation of materials. Markov Chain Monte Carlo (MCMC) methods are usually the required tool for this task, if the desired distribution is known only up to a multiplicative constant. Samples produced by an MCMC method are real values in N-dimensional space, called the configuration space. The distribution of such samples converges to the target distribution in the limit. However, existing MCMC methods still face many challenges that are not well resolved. Difficulties in sampling with MCMC methods include, but are not limited to, dealing with high dimensional and multimodal problems, high computation cost due to extremely large datasets in Bayesian machine learning models, and lack of reliable indicators for detecting convergence and measuring the accuracy of sampling. This dissertation focuses on new theory and methodology for efficient MCMC methods that aim to overcome the aforementioned difficulties.

One contribution of this dissertation is generalizations of hybrid Monte Carlo (HMC). An HMC method combines a discretized dynamical system in an extended space, called the state space, and an acceptance test based on the Metropolis criterion. The discretized dynamical system used in HMC is volume preserving—meaning that in the state space, the absolute Jacobian of a map from one point on the trajectory to another is 1. Volume preservation is, however, not necessary for the general purpose of sampling. A general theory allowing the use of non-volume preserving dynamics for proposing MCMC moves is proposed. Examples including isokinetic dynamics and variable mass Hamiltonian dynamics with an explicit integrator, are all designed with fewer restrictions based on the general theory. Experiments show improvement in efficiency for sampling high dimensional multimodal problems. A second contribution is stochastic gradient samplers with reduced bias. An in-depth analysis of the noise introduced by the stochastic gradient is provided. Two methods to reduce the bias in the distribution of samples are proposed. One is to correct the dynamics by using an estimated noise based on subsampled data, and the other is to introduce additional variables and corresponding dynamics to adaptively reduce the bias. Extensive experiments show that both methods outperform existing methods. A third contribution is quasi-reliable estimates of effective sample size. Proposed is a more reliable indicator—the longest integrated autocorrelation time over all functions in the state space—for detecting the convergence and measuring the accuracy of MCMC methods. The superiority of the new indicator is supported by experiments on both synthetic and real problems.
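For reference, the baseline that the first contribution generalizes is standard HMC: volume-preserving leapfrog integration of Hamiltonian dynamics followed by a Metropolis acceptance test. A minimal sketch for a toy bivariate Gaussian target, illustrative only and not the thesis's non-volume-preserving samplers:

```python
import numpy as np

rng = np.random.default_rng(0)

def U(q):
    return 0.5 * q @ q          # potential energy = negative log density of N(0, I)

def grad_U(q):
    return q

def hmc_step(q, eps=0.1, n_leapfrog=20):
    p = rng.standard_normal(q.shape)            # resample momentum
    q_new, p_new = q.copy(), p.copy()
    # Leapfrog integration (volume preserving, time reversible).
    p_new -= 0.5 * eps * grad_U(q_new)
    for _ in range(n_leapfrog - 1):
        q_new += eps * p_new
        p_new -= eps * grad_U(q_new)
    q_new += eps * p_new
    p_new -= 0.5 * eps * grad_U(q_new)
    # Metropolis acceptance test on the Hamiltonian.
    h_old = U(q) + 0.5 * p @ p
    h_new = U(q_new) + 0.5 * p_new @ p_new
    return q_new if rng.random() < np.exp(h_old - h_new) else q

q = np.zeros(2)
samples = []
for _ in range(2000):
    q = hmc_step(q)
    samples.append(q)
print("sample mean:", np.mean(samples, axis=0), "sample var:", np.var(samples, axis=0))
```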

Minor contributions include a general framework of changing variables, and a numerical integrator for the Hamiltonian dynamics with fourth order accuracy. The idea of changing variables is to transform the potential energy function as a function of the original variable to a function of the new variable, such that undesired properties can be removed. Two examples are provided and preliminary experimental results are obtained for supporting this idea. The fourth order integrator is constructed by combining the idea of the simplified Takahashi-Imada method and a two-stage Hessian-based integrator. The proposed method, called two-stage simplified Takahashi-Imada method, shows outstanding performance over existing methods in high-dimensional sampling problems.

APA, Harvard, Vancouver, ISO, and other styles
18

Wei, Wutao. "Model Based Clustering Algorithms with Applications." Thesis, Purdue University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10830711.

Full text
Abstract:

In predictive machine learning, unsupervised learning is applied when the labels of the data are unavailable, laborious to obtain, or available only in limited proportion. Based on the special properties of data, we can build models by understanding the properties and making some reasonable assumptions. In this thesis, we will introduce three practical problems and discuss them in detail. This thesis produces three papers, as follows: (1) Wei, Wutao, et al. "A Non-parametric Hidden Markov Clustering Model with Applications to Time Varying User Activity Analysis." ICMLA 2015. (2) Wei, Wutao, et al. "Dynamic Bayesian predictive model for box office forecasting." IEEE Big Data 2017. (3) Wei, Wutao, Bowei Xi, and Murat Kantarcioglu. "Adversarial Clustering: A Grid Based Clustering Algorithm Against Active Adversaries." Submitted.

User Profiling Clustering: Activity data of individual users on social media are easily accessible in this big data era. However, proper modeling strategies for user profiles have not been well developed in the literature. Existing methods or models usually have two limitations. The first limitation is that most methods target the population rather than individual users, and the second is that they cannot model non-stationary time-varying patterns. Different users in general demonstrate different activity modes on social media. Therefore, one population model may fail to characterize activities of individual users. Furthermore, online social media are dynamic and ever evolving, so are users’ activities. Dynamic models are needed to properly model users’ activities. In this paper, we introduce a non-parametric hidden Markov model to characterize the time-varying activities of social media users. In addition, based on the proposed model, we develop a clustering method to group users with similar activity patterns.

Adversarial Clustering: Nowadays more and more data are gathered for detecting and preventing cyber-attacks. Unique to the cyber security applications, data analytics techniques have to deal with active adversaries that try to deceive the data analytics models and avoid being detected. The existence of such adversarial behavior motivates the development of robust and resilient adversarial learning techniques for various tasks. In the past most of the work focused on adversarial classification techniques, which assumed the existence of a reasonably large amount of carefully labeled data instances. However, in real practice, labeling the data instances often requires costly and time-consuming human expertise and becomes a significant bottleneck. Meanwhile, a large number of unlabeled instances can also be used to understand the adversaries' behavior. To address the above mentioned challenges, we develop a novel grid based adversarial clustering algorithm. Our adversarial clustering algorithm is able to identify the core normal regions, and to draw defensive walls around the core positions of the normal objects utilizing game theoretic ideas. Our algorithm also identifies sub-clusters of attack objects, the overlapping areas within clusters, and outliers which may be potential anomalies.

Dynamic Bayesian Update for Profiling Clustering: The movie industry has become one of the most important consumer businesses, and it is increasingly competitive. For a movie producer, production and marketing carry large costs; for the owner of a movie theater, deciding how to allocate a limited number of screens to the movies currently showing is also a problem. However, current models in the movie industry can only give an estimate for the opening week. We improve the dynamic linear model with a Bayesian framework. By using this updating method, we are also able to update the streaming adversarial data and make defensive recommendations for the defensive systems.

APA, Harvard, Vancouver, ISO, and other styles
19

Zaetz, Jiaqi L. "A Riemannian Framework for Shape Analysis of Annotated 3D Objects." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1440368778.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Chen, Guo. "Implementation of Cumulative Probability Models for Big Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=case1619624862283514.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Konzem, Scott R. "Tenability and Computability of Generalized Polya Urns." Thesis, The George Washington University, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10263413.

Full text
Abstract:

Urn models have a storied part in the history of probability and have been studied extensively over the past century for their wide range of applications. We analyze a generalized class of urn models introduced in the past decade, the so-called "multiset" class, in which more than one ball is sampled at a time. We investigate sufficient conditions for a multiset urn process to be tenable, meaning the process can continue indefinitely without getting stuck. We fully characterize the "strongly tenable" class of Pólya urn schemes, which is tenable under any starting conditions that allow the process to begin. We find several "weakly tenable" classes of Pólya urn schemes that are tenable only under restricted starting conditions. We enumerate the size of some of these tenable classes using combinatorics, probabilistically analyze them, and provide an algorithm to assess the tenability of an arbitrary urn scheme using breadth-first search. We further analyze the computational complexity of the tenability problem itself. By showing how to encode the Boolean satisfiability problem within a Pólya urn scheme, we find that the problem of determining whether a multiset urn scheme is untenable is in the complexity class NP-hard, and this places constraints on the kinds of tenability theorems we can hope to find. Finally, we analyze a generalized “fault tolerant” urn model that can take action to avoid getting stuck, and by showing that this model is Turing-equivalent, we show that the tenability problem for this model is undecidable.

APA, Harvard, Vancouver, ISO, and other styles
22

Angiuli, Olivia Marie. "The effect of quasi-identifier characteristics on statistical bias introduced by k-anonymization." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:14398529.

Full text
Abstract:
The de-identification of publicly released datasets that contain personal information is necessary to preserve personal privacy. One such de-identification algorithm, k-anonymization, reduces the risk of the re-identification of such datasets by requiring that each combination of information-revealing traits be represented by at least k different records in the dataset. However, this requirement may skew the resulting dataset by preferentially deleting records that contain more rare information-revealing traits. This paper investigates the amount of bias and loss of utility introduced into an online education dataset by the k-anonymization process, as well as suggesting future directions that may decrease the amount of bias introduced during de-identification procedures.
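The mechanism behind the bias studied here is easy to see in miniature: suppression-based k-anonymization drops exactly the records whose quasi-identifier combinations are rare. A small pandas sketch with invented columns and values (not the thesis's dataset or de-identification code):

```python
import pandas as pd

def k_anonymize_by_suppression(df, quasi_identifiers, k=5):
    """Drop every record whose quasi-identifier combination occurs fewer than k times."""
    counts = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return df[counts >= k]

# Hypothetical slice of an online-education dataset.
df = pd.DataFrame({
    "country":   ["US", "US", "US", "IN", "IN", "BR", "NP"],
    "age_band":  ["20s", "20s", "20s", "20s", "20s", "30s", "40s"],
    "completed": [1, 0, 1, 1, 0, 1, 1],
})

anon = k_anonymize_by_suppression(df, ["country", "age_band"], k=2)
print("completion rate before:", df["completed"].mean())
print("completion rate after: ", anon["completed"].mean())   # rare rows (BR, NP) are lost
```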
APA, Harvard, Vancouver, ISO, and other styles
23

Zheng, Shijie. "The Differential Privacy of Bayesian Inference." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:14398533.

Full text
Abstract:
Differential privacy is one recent framework for analyzing and quantifying the amount of privacy lost when data is released. Meanwhile, multiple imputation is an existing Bayesian-inference based technique from statistics that learns a model using real data, then releases synthetic data by drawing from that model. Because multiple imputation does not directly release any real data, it is generally believed to protect privacy. In this thesis, we examine that claim. While there exist newer synthetic data algorithms specifically designed to provide differential privacy, we evaluate whether multiple imputation already includes differential privacy for free. Thus, we focus on several method variants for releasing the learned model and releasing the synthetic data, and how these methods perform for models taking on two common distributions: the Bernoulli and the Gaussian with known variance. We prove a number of new or improved bounds on the amount of privacy afforded by multiple imputation for these distributions. We find that while differential privacy is ostensibly achievable for most of our method variants, the conditions needed for it to do so are often not realistic for practical usage. At least in theory, this is particularly true if we want absolute privacy (ε-differential privacy), but the methods are more practically compatible with privacy when we allow a small probability of a catastrophic data leakage ((ε, δ)-differential privacy).
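For contrast with the multiple-imputation variants analyzed in the thesis, the textbook way to release a Bernoulli proportion with ε-differential privacy is the Laplace mechanism. A generic sketch on simulated data, not a method from the thesis:

```python
import numpy as np

rng = np.random.default_rng(7)

def dp_proportion(data, epsilon):
    """epsilon-differentially private estimate of a Bernoulli proportion.

    Changing one individual's bit changes the sample mean by at most 1/n,
    so Laplace noise with scale 1/(n * epsilon) gives epsilon-DP.
    """
    n = len(data)
    sensitivity = 1.0 / n
    return np.mean(data) + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

data = rng.binomial(1, 0.3, size=1000)        # hypothetical Bernoulli sample
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps:5.1f}  private estimate={dp_proportion(data, eps):.3f}")
```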
APA, Harvard, Vancouver, ISO, and other styles
24

Comiter, Marcus Zachary. "A Future of Abundant Sparsity: Novel Use and Analysis of Sparse Coding in Machine Learning Applications." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:17417575.

Full text
Abstract:
We present novel applications and analysis of the use of sparse coding within the context of machine learning. We first present Sparse Coding Trees (SC-trees), a sparse coding-based framework for resolving classification conflicts, which occur when different classes are mapped to similar feature representations. More specifically, SC-trees are novel supervised hierarchical clustering trees that use node specific dictionary and classifier training to direct input images based on classification results in the feature space at each node. We validate SC-trees on image-based emotion classification, combining it with Mirrored Nonnegative Sparse Coding (MNNSC), a novel sparse coding algorithm leveraging a nonnegativity constraint and the inherent symmetry of the domain, to achieve results exceeding or competitive with the state-of-the-art. We next present SILQ, a sparse coding-based link state model that can predictively buffer packets during wireless link outages to avoid disruption to higher layer protocols such as TCP. We demonstrate empirically that SILQ increases TCP throughput by a factor of 2-4x in varied scenarios.
Computer Science
APA, Harvard, Vancouver, ISO, and other styles
25

Swaminathan, Adith. "Counterfactual Evaluation and Learning From Logged User Feedback." Thesis, Cornell University, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10258968.

Full text
Abstract:

Interactive systems that interact with and learn from user behavior are ubiquitous today. Machine learning algorithms are core components of such systems. In this thesis, we will study how we can re-use logged user behavior data to evaluate interactive systems and train their machine learned components in a principled way. The core message of the thesis is:
• Using simple techniques from causal inference, we can improve popular machine learning algorithms so that they interact reliably.
• These improvements are effective and scalable, and complement current algorithmic and modeling advances in machine learning.
• They open further avenues for research in Counterfactual Evaluation and Learning to ensure machine learned components interact reliably with users and with each other.
This thesis explores two fundamental tasks—evaluation and training of interactive systems. Solving evaluation and training tasks using logged data is an exercise in counterfactual reasoning. So we will first review concepts from causal inference for counterfactual reasoning, assignment mechanisms, statistical estimation and learning theory. The thesis then contains two parts.

In the first part, we will study scenarios where unknown assignment mechanisms underlie the logged data we collect. These scenarios often arise in learning-to-rank and learning-to-recommend applications. We will view these applications through the lens of causal inference and modularize the problem of building a good ranking engine or recommender system into two components—first, infer a plausible assignment mechanism and second, reliably learn to rank or recommend assuming this mechanism was active when collecting data.

The second part of the thesis focuses on scenarios where we collect logged data from past interventions. We will formalize these scenarios as batch learning from logged contextual bandit feedback. We will first develop better off-policy estimators for evaluating online user-centric metrics in information retrieval applications. In subsequent chapters, we will study the bias-variance trade-off when learning from logged interventions. This study will yield new learning principles, algorithms and insights into the design of statistical estimators for counterfactual learning.
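The standard starting point for the off-policy estimators mentioned above is inverse propensity scoring (IPS), which reweights logged rewards by the ratio of target to logging action probabilities. A self-contained sketch on simulated bandit logs; the thesis develops improved estimators beyond this baseline:

```python
import numpy as np

rng = np.random.default_rng(3)

def ips_estimate(contexts, actions, rewards, logging_probs, target_policy):
    """Inverse propensity scoring: reweight logged rewards by pi_target / pi_logging."""
    weights = np.array([target_policy(x, a) for x, a in zip(contexts, actions)]) / logging_probs
    return np.mean(weights * rewards)

# Hypothetical logs: 2 actions, uniform logging policy, reward depends on (context, action).
n = 10_000
contexts = rng.integers(0, 2, size=n)
actions = rng.integers(0, 2, size=n)                       # logged uniformly at random
logging_probs = np.full(n, 0.5)
rewards = (contexts == actions).astype(float)              # reward 1 iff action matches context

# Target policy to evaluate: always play the action equal to the context (prob 1, else 0).
target = lambda x, a: 1.0 if a == x else 0.0

print("IPS estimate of target policy value:",
      ips_estimate(contexts, actions, rewards, logging_probs, target))
# True value is 1.0; the logging policy's on-policy average reward is only about 0.5.
```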

The thesis outlines a few principles, tools, datasets and software that hopefully prove to be useful to you as you build your interactive learning system.

APA, Harvard, Vancouver, ISO, and other styles
26

Foulds, James Richard. "Latent Variable Modeling for Networks and Text| Algorithms, Models and Evaluation Techniques." Thesis, University of California, Irvine, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3631094.

Full text
Abstract:

In the era of the internet, we are connected to an overwhelming abundance of information. As more facets of our lives become digitized, there is a growing need for automatic tools to help us find the content we care about. To tackle the problem of information overload, a standard machine learning approach is to perform dimensionality reduction, transforming complicated high-dimensional data into a manageable, low-dimensional form. Probabilistic latent variable models provide a powerful and elegant framework for performing this transformation in a principled way. This thesis makes several advances for modeling two of the most ubiquitous types of online information: networks and text data.

Our first contribution is to develop a model for social networks as they vary over time. The model recovers latent feature representations of each individual, and tracks these representations as they change dynamically. We also show how to use text information to interpret these latent features.

Continuing the theme of modeling networks and text data, we next build a model of citation networks. The model finds influential scientific articles and the influence relationships between the articles, potentially opening the door for automated exploratory tools for scientists. The increasing prevalence of web-scale data sets provides both an opportunity and a challenge. With more data we can fit more accurate models, as long as our learning algorithms are up to the task. To meet this challenge, we present an algorithm for learning latent Dirichlet allocation topic models quickly, accurately and at scale. The algorithm leverages stochastic techniques, as well as the collapsed representation of the model. We use it to build a topic model on 4.6 million articles from the open encyclopedia Wikipedia in a matter of hours, and on a corpus of 1740 machine learning articles from the NIPS conference in seconds.
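As a point of reference for the topic-modeling scale-up described above, a latent Dirichlet allocation model can be fit on a toy corpus with the gensim library. This is a generic illustration only; gensim's default online variational inference is not the stochastic collapsed algorithm developed in the thesis:

```python
from gensim import corpora
from gensim.models import LdaModel

# Tiny illustrative corpus; the thesis-scale runs use millions of documents.
docs = [
    "gibbs sampling for topic models".split(),
    "stochastic variational inference scales topic models".split(),
    "citation networks reveal influential articles".split(),
    "latent features track social networks over time".split(),
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=20, random_state=0)
for topic_id in range(2):
    print(topic_id, lda.print_topic(topic_id, topn=4))
```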

Finally, evaluating the predictive performance of topic models is an important yet computationally difficult task. We develop one algorithm for comparing topic models, and another for measuring the progress of learning algorithms for these models. The latter method achieves better estimates than previous algorithms, in many cases with an order of magnitude less computational effort.

APA, Harvard, Vancouver, ISO, and other styles
27

Kaftan, David. "Design Day Analysis - Forecasting Extreme Daily Natural Gas Demand." Thesis, Marquette University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10825062.

Full text
Abstract:

This work provides a framework for Design Day analysis. First, we estimate the temperature conditions which are expected to be colder than all but one day in N years. This temperature is known as the Design Day condition. Then, we forecast an upper bound on natural gas demand when temperature is at the Design Day condition.

Natural gas distribution companies (LDCs) need to meet demand during extreme cold days. Just as bridge builders design for a nominal load, natural gas distribution companies need to design for a nominal temperature. This nominal temperature is the Design Day condition. The Design Day condition is the temperature that is expected to be colder than every day except one in N years. Once Design Day conditions are estimated, LDCs need to prepare for the Design Day demand. We provide an upper bound on Design Day demand to ensure LDCs will be able to meet demand.

Design Day conditions are determined in a variety of ways. First, we fit a kernel density function to surrogate temperatures - this method is referred to as the Surrogate Kernel Density Fit. Second, we apply Extreme Value Theory - a field dedicated to finding the maxima or minima of a distribution. In particular, we apply Block-Maxima and Peak-Over-Threshold (POT) techniques. The upper bound of Design Day demand is determined using a modified version of quantile regression.
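A generic peaks-over-threshold fit of the kind described above can be sketched with scipy: negate the temperatures so that cold extremes become maxima, fit a generalized Pareto distribution to the excesses over a high threshold, and read off an N-year return level. The data below are simulated and the details (threshold choice, declustering) are simplified relative to the thesis:

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(11)

# Hypothetical daily winter temperatures (deg C) over 30 years; negate so that
# cold extremes become large positive values.
years = 30
temps = rng.normal(loc=-2.0, scale=8.0, size=years * 150)
x = -temps

# Peaks-over-threshold: keep exceedances above a high threshold of the negated series.
u = np.quantile(x, 0.95)
excess = x[x > u] - u
rate_per_year = (x > u).sum() / years

# Fit a generalized Pareto distribution to the excesses (location fixed at 0).
xi, _, sigma = genpareto.fit(excess, floc=0.0)

# N-year return level: exceeded on average once in N years.
N = 30
p = 1.0 / (N * rate_per_year)                 # tail probability among exceedances
return_level = u + genpareto.ppf(1.0 - p, xi, loc=0.0, scale=sigma)
design_day_temp = -return_level
print(f"estimated 1-in-{N}-year design day temperature: {design_day_temp:.1f} C")
```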

Similar Design Day conditions are estimated by both the Surrogate Kernel Density Fit and Peaks-Over-Threshold methods. Both methods perform well. The theory supporting the POT method and the empirical performance of the SKDF method lend confidence in the Design Day condition estimates. The upper bound of demand under these conditions is well modeled by the modified quantile regression technique.

APA, Harvard, Vancouver, ISO, and other styles
28

Lin, Lei. "Data science application in intelligent transportation systems| An integrative approach for border delay prediction and traffic accident analysis." Thesis, State University of New York at Buffalo, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3683052.

Full text
Abstract:

With the great progress in information and communications technologies in the past few decades, intelligent transportation systems (ITS) have accumulated vast amounts of data regarding the movement of people and goods from one location to another. Besides the traditional fixed sensors and GPS devices, new emerging data sources and approaches such as social media and crowdsourcing can be used to extract travel-related data, especially given the wide popularity of mobile devices such as smartphones and tablets, along with their associated apps. To take advantage of all these data and to address the associated challenges, big data techniques, and a new emerging field called data science, are currently receiving more and more attention. Data science employs techniques and theories from many fields such as statistics, machine learning, data mining, analytical models and computer programming to solve the data analysis task. It is therefore timely and important to explore how data science may be best employed for transportation data analysis. In this doctoral study, an integrative approach is proposed for data science applications in ITS. The proposed approach constitutes an integration of multiple steps in the data analysis process, or integration of different models to build a more powerful one. The integrative approach is applied and tested on two case studies: border crossing delay prediction and traffic accident data analysis.

For the first case study, a two-step border crossing delay prediction model is proposed, consisting of a short-term traffic volume prediction model and a multi-server queueing model. As such, this can be seen as an integration of data-driven models and analytical models. For the first step, the short-term traffic volume prediction model, an integration of data "width" decreasing (i.e., data grouping) step and model development step is applied. For model development, a model combination step of a Seasonal Autoregressive Integrated Moving Average Model (SARIMA) and Support Vector Regression (SVR) is applied to realize better performance than when using each single model. In addition, the spinning network (SPN) forecasting paradigm is enhanced for border crossing traffic prediction through the utilization of a dynamic time warping (DTW) similarity metric. The DTW-SPN is shown to yield several advantages such as computational efficiency and accuracy as demonstrated by a promising Mean Absolute Percent Error (MAPE) compared to SARIMA and SVR.
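The dynamic time warping similarity used to enhance the SPN paradigm is the classic dynamic-programming recurrence. A minimal sketch comparing two hypothetical hourly border-traffic profiles, not the thesis's data or implementation:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# Hypothetical hourly border traffic profiles: same shape, shifted by one hour.
day_a = np.array([30, 40, 80, 120, 150, 120, 90, 60], dtype=float)
day_b = np.array([30, 30, 40, 80, 120, 150, 120, 90], dtype=float)

print("pointwise mismatch:", np.abs(day_a - day_b).sum())
print("DTW distance      :", dtw_distance(day_a, day_b))
```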

This dissertation also proposes the introduction of a data diagnosis step before short-term traffic prediction. In order to develop a methodology for model selection guidance, the author calculated statistical measures of nonlinearity and complexity for multiple datasets and correlated those with the performance of multiple models: SARIMA, SVR, and k-nearest neighbor (k-NN). Based on the data diagnosis results, useful insights are revealed pertaining to parameter setting and model selection.

For the second step, namely the queueing model development, heuristic solutions are presented for two types of queueing models, M/E_k/n and BMAP/PH/n. These models take the predicted traffic volume as input and use it to calculate future waiting time. The analytical results are compared to results from a VISSIM simulation model and shown to be comparable. Finally, an Android smartphone app, which utilizes the two-step border prediction model methodology described above, is developed to collect, share and predict waiting time at the three Niagara Frontier border crossings.

For the second case study involving traffic accident data analysis, first an integration of a data "depth" decreasing step and a model development step is once again applied. To do this, the modularity-optimizing community detection algorithm is used to cluster the dataset, and for each cluster, the association rule algorithm is applied to yield insight into traffic accident hotspots and incident clearance time. The results show that more meaningful association rules can be derived when the data is clustered than when using the whole dataset directly. Secondly, an integration of a data "width" decreasing step (variable selection) and a model development step is applied for real-time traffic accident risk prediction. For this, a novel variable selection method based on the Frequent Pattern tree (FP tree) algorithm is proposed and tested before applying Bayesian network and k-NN algorithms. The experiments show that models based on variables selected by the FP tree always performed better than those using variables selected by the random forest method. Lastly, an integration of a data mining model, the M5P tree, and the hazard-based duration model (HBDM) statistical method is applied to traffic accident duration prediction. The M5P-HBDM method is shown to be capable of identifying more meaningful factors that impact traffic accident duration, and to have better prediction performance, than either M5P or HBDM alone.
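
To make the association-rule step concrete, the sketch below mines simple one-antecedent rules from a toy one-hot accident table with plain pandas; the column names and thresholds are invented for illustration, and the dissertation's FP-tree based implementation is not reproduced here.

    # Sketch: brute-force single-antecedent association rules on a toy
    # accident table (support and confidence thresholds are illustrative).
    import itertools
    import pandas as pd

    records = pd.DataFrame(
        [
            {"wet_road": 1, "night": 1, "injury": 1, "rush_hour": 0},
            {"wet_road": 1, "night": 0, "injury": 1, "rush_hour": 1},
            {"wet_road": 0, "night": 1, "injury": 0, "rush_hour": 1},
            {"wet_road": 1, "night": 1, "injury": 1, "rush_hour": 1},
            {"wet_road": 0, "night": 0, "injury": 0, "rush_hour": 1},
        ]
    ).astype(bool)

    min_support, min_confidence = 0.4, 0.7
    n = len(records)
    for a, b in itertools.permutations(records.columns, 2):
        support_ab = (records[a] & records[b]).sum() / n   # P(a and b)
        support_a = records[a].sum() / n                   # P(a)
        if support_a == 0 or support_ab < min_support:
            continue
        confidence = support_ab / support_a                # P(b | a)
        if confidence >= min_confidence:
            print(f"{a} -> {b}: support={support_ab:.2f}, confidence={confidence:.2f}")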

The two case studies considered in this dissertation serve to illustrate the advantages of an integrative data science approach to analyzing transportation data. With this approach, invaluable insight is gained that can help solve transportation problems and guide public policy.

APA, Harvard, Vancouver, ISO, and other styles
29

Gligorijevic, Djordje. "Predictive Uncertainty Quantification and Explainable Machine Learning in Healthcare." Diss., Temple University Libraries, 2018. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/520057.

Full text
Abstract:
Computer and Information Science
Ph.D.
Predictive modeling is an ever-increasingly important part of decision making. Advances in Machine Learning predictive modeling have spread across many domains, bringing significant improvements in performance and providing unique opportunities for novel discoveries. Notably important domains are medicine and healthcare, which take care of people's wellbeing. While these are among the most developed areas of science, with active research, there are many ways they can be improved. In particular, novel tools developed based on Machine Learning theory have brought benefits across many areas of clinical practice, pushing the boundaries of medical science and directly affecting the well-being of millions of patients. Additionally, the healthcare and medicine domains require predictive modeling to anticipate and overcome many obstacles that the future may hold. These applications employ precise decision-making processes that require accurate predictions. However, good prediction on its own is often insufficient, and there has been no major focus on developing algorithms with good-quality uncertainty estimates. This thesis therefore aims to provide a variety of solutions that learn high-quality uncertainty estimates or provide interpretability of the models where needed, for the purpose of improving existing tools used in practice and allowing many other tools to be used where uncertainty is the key factor for decision making. The first part of the thesis proposes approaches for learning high-quality uncertainty estimates for both short- and long-term predictions in multi-task learning, developed on top of continuous probabilistic graphical models. In many scenarios, especially in long-term prediction, it may be of great importance for the models to provide a reliability flag in order to be accepted by domain experts. To this end we explored a widely applied structured regression model with the goal of providing meaningful uncertainty estimates on various predictive tasks. Our particular interest is in modeling uncertainty propagation while predicting far into the future. To address this important problem, our approach centers around providing an uncertainty estimate by modeling input features as random variables. This allows modeling uncertainty from noisy inputs. In cases when the model iteratively produces errors, it should propagate uncertainty over the predictive horizon, which may provide invaluable information for decision making based on predictions. In the second part of the thesis we propose novel neural embedding models for learning low-dimensional embeddings of medical concepts, such as diseases and genes, show how they can be interpreted to assess their quality, and show how they can be used to solve many problems in medical and healthcare research. We use EHR data to discover novel relationships between diseases by studying their comorbidities (i.e., co-occurrences in patients). We trained our models on a large-scale EHR database comprising more than 35 million inpatient cases. To confirm the value and potential of the proposed approach, we evaluate its effectiveness on a held-out set. Furthermore, for select diseases we provide a candidate gene list for which disease-gene associations were not studied previously, allowing biomedical researchers to better focus their often very costly lab studies.
We furthermore examine how disease heterogeneity can affect the quality of learned embeddings and propose an approach for learning types of such heterogeneous diseases, focusing in our study primarily on learning types of sepsis. Finally, we evaluate the quality of the low-dimensional embeddings on tasks of predicting hospital quality indicators such as length of stay, total charges and mortality likelihood, demonstrating their superiority over other approaches. In the third part of the thesis we focus on decision making in the medicine and healthcare domain by developing state-of-the-art deep learning models capable of outperforming human performance while maintaining good interpretability and uncertainty estimates.
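
The abstract does not spell out the embedding model; a common baseline for turning comorbidity (co-occurrence) counts into low-dimensional concept vectors is positive pointwise mutual information followed by a truncated SVD, sketched below on an invented toy matrix.

    # Sketch: disease embeddings from a toy comorbidity count matrix via
    # PPMI + truncated SVD. A generic baseline, not the thesis's neural model.
    import numpy as np

    diseases = ["diabetes", "hypertension", "ckd", "sepsis"]
    cooc = np.array([            # symmetric co-occurrence counts across patients
        [0, 90, 40, 10],
        [90, 0, 35, 12],
        [40, 35, 0, 8],
        [10, 12, 8, 0],
    ], dtype=float)

    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    U, S, _ = np.linalg.svd(ppmi)            # rank-2 embeddings from the PPMI matrix
    embeddings = U[:, :2] * np.sqrt(S[:2])
    for name, vec in zip(diseases, embeddings):
        print(name, np.round(vec, 3))
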
Temple University--Theses
APA, Harvard, Vancouver, ISO, and other styles
30

Nelson, Emily W. (Emily Wyke) 1977. "Counting statistics of a system to produce entangled photon pairs." Thesis, Massachusetts Institute of Technology, 2001. http://hdl.handle.net/1721.1/86724.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Huang, Yen-Chin. "Empirical distribution function statistics, speed of convergence, and p-variation." Thesis, Massachusetts Institute of Technology, 1994. http://hdl.handle.net/1721.1/12017.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Herring, Keith 1981. "Propagation models for multiple-antenna systems : methodology, measurements and statistics." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/43027.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.
Includes bibliographical references (leaves 219-223).
The trend in wireless communications is towards utilization of multiple antenna systems. While techniques such as beam-forming and spatial diversity have been implemented for some time, the emergence of Multiple-Input Multiple-Output (MIMO) communications has increased commercial interest and development in multiple-antenna technology. Given this trend it has become increasingly important that we understand the propagation characteristics of the environments where this new technology will be deployed. In particular the development of low-cost, high-performance system architectures and protocols is largely dependent on the accuracy of available channel models for approximating realized propagation behavior. The first contribution of this thesis is a methodology for the modeling of wireless propagation in multiple antenna systems. Specifically we consider the problem of propagation modeling from the perspective of the protocol designer and system engineer. By defining the wireless channel as the complex narrow-band channel response h ∈ C between two devices, we characterize the important degrees of freedom associated with the channel by modeling it as a function of its path-loss, multipath/frequency, time stability, spatial, and polarization characteristics. We then motivate this model by presenting a general set of design decisions that depend on these parameters such as network density, channel allocation, and channel-state information (CSI) update rate. Lastly we provide a parametrization of the environment into measurable factors that can be used to predict channel behavior including link-length, Line-Of-Sight (LOS), link topology (e.g. air-to-ground), building density, and other physical parameters. The second contribution of this thesis is the experimental analysis and development of this modeling space.
(cont) Specifically we have gathered a large database of real wireless channel data from a diverse set of propagation environments. A mobile channel-data collection system was built for obtaining the required data which includes an eight-channel software receiver and a collection of WiFi channel sounders. The software receiver synchronously samples the 20-MHz band centered at 2.4 GHz from eight configurable antennas. Measurements have been carried out for both air-to-ground and ground-to-ground links for distances ranging from tens of meters to several kilometers throughout the city of Cambridge, MA. Here we have developed a collection of models for predicting channel behavior, including a model for estimating the path-loss coefficient α in street environments that utilizes two physical parameters: P1 = percentage of building gaps averaged over each side of the street, P2 = percentage of the street length that has a building gap on at least one side of the street. Results show a linear increase in α of 0.53 and 0.32 per 10% increase in P1 and P2, respectively, with RMS errors of 0.47 and 0.27 in α for α's between 2 and 5. Experiments indicate a 10 dB performance advantage in estimating path-loss with this multi-factor model over the optimal linear estimator (upper-bound empirical model) for link lengths as short as 100 meters. In contrast, air-to-ground links have been shown to exhibit log-normal fading with an average attenuation of α ≈ 2 and standard deviation of 8 dB. Additionally we provide exhaustive evidence that the small-scale fading behavior (frequency domain) of both Non-Line-Of-Sight (NLOS) air-to-ground and ground-to-ground links as short as tens of meters is Rayleigh distributed. More specifically, fading distributions across a diverse set of environments and link lengths have been shown to have Rician K-factors smaller than 1, suggesting robust performance of the Rayleigh model.
(cont) A model is also presented that defines a stochastic distribution for the delay-spread of the channel as a function of the link-length (d0), the multipath component (MPC) decay-rate (... attenuation per unit delay ...), and the MPC arrival-rate (q = MPCs per unit delay ...). Experiments support the use of this model over a spectrum of link-lengths (50 m-700 m) and indicate a dense arrival-rate (q) (on the order of 1 MPC) in ground-to-ground links. In this range the frequency structure of the channel is insensitive to q, which reduces the modeling complexity to a single unknown parameter, the decay-rate. We provide estimators for this parameter over a variety of environment types that have been shown to closely replicate the fade width distribution in these environments. The observed time-coherence length (tc) of MPCs tends to be either less than 300 ms (high-frequency) or 5 seconds and longer (low-frequency), resulting in a Rician-like distribution for fading in the time domain. We show that the time characteristics of the channel are accurately modeled as the superposition of two independent circularly symmetric complex Gaussian random variables corresponding to the channel response due to a set of stable and unstable MPCs. We observe that the S-factor, defined as the ratio of average power in stable to unstable MPCs (distinct from the Rician K-factor), ranges between 0-30 dB depending on environment and link length, and can be estimated with an rms error of 3 dB in both ground-to-ground and air-to-ground link regimes. Experiments show improved performance of this model over the Rician fading model, which has been shown to underestimate high fade events (tails) in the time domain, corresponding to cases where the stable MPCs destructively combine to form a null. Additionally, the Kronecker MIMO channel model is shown to predict channel capacity (of a 7x7 system) with an rms error of 1.7 ... (at 20 dB SNR) over a diverse set of observed outdoor environments.
(cont) Experiments indicate a 3 dB performance advantage in this prediction when applied to environments that are not dominated by single-bounce propagation paths (Single-bounce: 2.1 ... rms, Multi-bounce: 1 ... rms).
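
As a point of reference for the street path-loss model above, the path-loss coefficient of a log-distance model can be estimated from link measurements by ordinary least squares; the sketch below uses synthetic data and is not the thesis's multi-factor estimator.

    # Sketch: least-squares estimate of the path-loss coefficient alpha in
    # PL(d) = PL(d0) + 10*alpha*log10(d/d0) + noise, on synthetic links.
    import numpy as np

    rng = np.random.default_rng(1)
    d0 = 10.0                                    # reference distance (m)
    d = rng.uniform(50, 700, size=200)           # link lengths (m)
    true_alpha, pl_d0, sigma = 3.2, 40.0, 6.0    # assumed ground truth (dB)
    pl = pl_d0 + 10 * true_alpha * np.log10(d / d0) + rng.normal(0, sigma, d.size)

    X = np.column_stack([np.ones_like(d), 10 * np.log10(d / d0)])
    coef, *_ = np.linalg.lstsq(X, pl, rcond=None)
    print(f"estimated PL(d0) = {coef[0]:.1f} dB, estimated alpha = {coef[1]:.2f}")
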
by Keith T. Herring.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
33

Herring, Keith 1981. "Blind separation of noisy multivariate data using second-order statistics." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/30173.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.
Includes bibliographical references (leaves 81-83).
A second-order method for blind source separation of noisy instantaneous linear mixtures is presented and analyzed for the case where the signal order k and noise covariance GG^H are unknown. Only a data set X of dimension n > k and of sample size m is observed, where X = AP + GW. The quality of separation depends on the source-observation ratio k/n, the degree of spectral diversity, and the second-order non-stationarity of the underlying sources. The algorithm estimates the Second-Order separation transform A, the signal Order, and the Noise, and is therefore referred to as SOON. SOON iteratively estimates: 1) k using a scree metric, and 2) the values of AP, G, and W using the Expectation-Maximization (EM) algorithm, where W is white noise and G is diagonal. The final step estimates A and the set of k underlying sources P using a variant of the joint diagonalization method, where P has k independent unit-variance elements. Tests using simulated Auto-Regressive (AR) Gaussian data show that SOON improves the quality of source separation in comparison to the standard second-order separation algorithms, i.e., Second-Order Blind Identification (SOBI) [3] and Second-Order Non-Stationary (SONS) blind identification [4]. The sensitivity in performance of SONS and SOON to several algorithmic parameters is also displayed in these experiments. To reduce sensitivities in the pre-whitening step of these algorithms, a heuristic is proposed by this thesis for whitening the data set; it is shown to improve separation performance. Additionally the application of blind source separation techniques to remote sensing data is discussed.
(cont.) Analysis of remote sensing data collected by the AVIRIS multichannel visible/infrared imaging instrument shows that SOON reveals physically significant dynamics within the data not found by the traditional methods of Principal Component Analysis (PCA) and Noise Adjusted Principal Component Analysis (NAPCA).
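
SOON's scree metric is not detailed in this abstract; a common way to guess the signal order k is to count the eigenvalues of the sample covariance that stand well above the noise floor, as in the simplified sketch below (all sizes and the factor of 5 are illustrative assumptions).

    # Sketch: rough signal-order estimate from the eigenvalue "scree" of the
    # sample covariance of X = A @ P + noise. Simplified stand-in for SOON's metric.
    import numpy as np

    rng = np.random.default_rng(2)
    n, k, m = 8, 3, 2000                        # sensors, true sources, samples
    A = rng.normal(size=(n, k))                 # mixing matrix
    P = rng.normal(size=(k, m))                 # unit-variance sources
    X = A @ P + 0.3 * rng.normal(size=(n, m))   # noisy observations

    eigvals = np.linalg.eigvalsh(np.cov(X))[::-1]       # descending eigenvalues
    noise_floor = np.median(eigvals[n // 2:])           # rough noise level
    k_hat = int(np.sum(eigvals > 5 * noise_floor))
    print("eigenvalues:", np.round(eigvals, 2), " estimated order:", k_hat)
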
by Keith Herring.
S.M.
APA, Harvard, Vancouver, ISO, and other styles
34

Haulcy, R'mani(R'mani Symon). "Time-to-contact statistics as a proxy for accident probabilities." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/122699.

Full text
Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 56-58).
Accidents are relatively rare, and this makes it difficult to study the impact of traffic system changes or vehicle control changes on accident rates. One potential solution to this problem is the use of time-to-contact (TTC) statistics as a proxy for accident probabilities. Low TTC can be used as a measure of potential danger. Simulations were performed to explore whether inverse TTC can serve as a good proxy of accident probability. The resulting data was then analyzed to investigate how inverse TTC varies with the mixture of vehicles with bilateral control as opposed to car-following control. Previously, it was found that a relatively high mixture ratio is needed to prevent phantom traffic jams. The results in this paper show that there is a benefit to mixing bilateral control cars into general traffic, even at relatively low mixture ratios. Simulations were also performed to see how acceleration and jerk vary with the mixture of vehicles with bilateral control so that passenger comfort could be quantified. The results show that bilateral control improves passenger comfort.
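
The inverse-TTC proxy itself is a one-line computation once the gap and closing speed between a follower and its leader are known; a minimal illustration with invented values is given below.

    # Sketch: time-to-contact (TTC) and inverse TTC for a follower closing on a
    # leader. Larger inverse TTC = higher potential danger. Values are invented.
    import numpy as np

    gap = np.array([30.0, 22.0, 15.0, 9.0, 5.0])          # bumper-to-bumper gap (m)
    closing_speed = np.array([2.0, 3.5, 4.0, 4.5, 5.0])   # v_follower - v_leader (m/s)

    # TTC is defined only while the follower is actually closing the gap.
    ttc = np.where(closing_speed > 0, gap / closing_speed, np.inf)
    inv_ttc = 1.0 / ttc                                    # 0 when not closing
    print("TTC (s):", np.round(ttc, 2))
    print("inverse TTC (1/s):", np.round(inv_ttc, 2))
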
by R'mani Haulcy.
S.M.
S.M. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
APA, Harvard, Vancouver, ISO, and other styles
35

Tolle, Kristin M. "Domain-independent semantic concept extraction using corpus linguistics, statistics and artificial intelligence techniques." Diss., The University of Arizona, 2003. http://hdl.handle.net/10150/280502.

Full text
Abstract:
For this dissertation two software applications were developed and three experiments were conducted to evaluate the viability of a unique approach to medical information extraction. The first system, the AZ Noun Phraser, was designed as a concept extraction tool. The second application, ANNEE, is a neural net-based entity extraction (EE) system. These two systems were combined to perform concept extraction and semantic classification specifically for use in medical document retrieval systems. The goal of this research was to create a system that automatically (without human interaction) enabled semantic type assignment, such as gene name and disease, to concepts extracted from unstructured medical text documents. Improving conceptual analysis of search phrases has been shown to improve the precision of information retrieval systems. Enabling this capability in the field of medicine can aid medical researchers, doctors and librarians in locating information, potentially improving healthcare decision-making. Due to the flexibility and non-domain specificity of the implementation, these applications have also been successfully deployed in other text retrieval experimentation for law enforcement (Atabakhsh et al., 2001; Hauck, Atabakhsh, Ongvasith, Gupta, & Chen, 2002), medicine (Tolle & Chen, 2000), query expansion (Leroy, Tolle, & Chen, 2000), web document categorization (Chen, Fan, Chau, & Zeng, 2001), Internet spiders (Chau, Zeng, & Chen, 2001), collaborative agents (Chau, Zeng, Chen, Huang, & Hendriawan, 2002), competitive intelligence (Chen, Chau, & Zeng, 2002), and Internet chat-room data visualization (Zhu & Chen, 2001).
APA, Harvard, Vancouver, ISO, and other styles
36

Chen, Hui 1974. "Algorithms and statistics for the detection of binding sites in coding regions." Thesis, McGill University, 2006. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=97926.

Full text
Abstract:
This thesis deals with the problem of detecting binding sites in coding regions. A new comparative analysis method is developed by improving an existing method called COSMO.
The inter-species sequence conservation observed in coding regions may be the result of two types of selective pressure: the selective pressure on the protein encoded and, sometimes, the selective pressure on the binding sites. To predict some region in coding regions as a binding site, one needs to make sure that the conservation observed in this region is not due to the selective pressure on the protein encoded. To achieve this, COSMO built a null model with only the selective pressure on the protein encoded and computed p-values for the observed conservation scores, conditional on the fixed set of amino acids observed at the leaves.
It is believed, however, that the selective pressure on the protein assumed in COSMO is overly strong. Consequently, some interesting regions may be left undetected. In this thesis, a new method, COSMO-2, is developed to relax this assumption.
The amino acids are first classified into a fixed number of overlapping functional classes by applying an expectation maximization algorithm on a protein database. Two probabilities for each gene position are then calculated: (i) the probability of observing a certain degree of conservation in the orthologous sequences generated under each class in the null model (i.e. the p-value of the observed conservation under each class); and (ii) the probability that the codon column associated with that gene position belongs to each class. The p-value of the observed conservation for each gene position is the sum of the products of the two probabilities for all classes. Regions with low p-values are identified as potential binding sites.
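
The combination rule described in this paragraph amounts to a small mixture computation; the numbers in the sketch below are invented placeholders for one gene position.

    # Sketch: COSMO-2 style p-value at one gene position:
    #   p = sum over classes of P(conservation | class) * P(class | codon column).
    # The probabilities below are invented placeholders, not values from the thesis.
    import numpy as np

    p_conservation_given_class = np.array([0.002, 0.04, 0.30])   # p-value under each class
    p_class_given_column = np.array([0.70, 0.25, 0.05])          # class membership probabilities

    assert np.isclose(p_class_given_column.sum(), 1.0)
    p_value = float(p_conservation_given_class @ p_class_given_column)
    print(f"combined p-value: {p_value:.4f}")    # low values flag candidate binding sites
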
Five sets of orthologous genes are analyzed using COSMO-2. The results show that COSMO-2 can detect the interesting regions identified by COSMO and can detect more interesting regions than COSMO in some cases.
APA, Harvard, Vancouver, ISO, and other styles
37

Van, Rooyen Marchand. "Stable parametric optimization." Thesis, McGill University, 1992. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=70259.

Full text
Abstract:
This thesis is a study of convex parametric programs on regions of stability. The main tools are complete characterizations of optimality without constraint qualifications and a theory of point-to-set mappings. We prove various new results that describe the Lipschitzian behaviour of the optimal value function and optimal solution point-to-set mapping. Then we show how these results can be used in the algorithms of Input Optimization, and other applications. These applications include new results on structural optima in nonlinear programming, determination of optimal trade-off directions in interactive multi-objective optimization, and formulation of new dynamic models for efficiency testing in data envelopment analysis.
APA, Harvard, Vancouver, ISO, and other styles
38

Ostberg, Colin R. "Computational pain quantification and the effects of age, gender, culture and cause." Thesis, Marquette University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1554606.

Full text
Abstract:

Chronic pain affects more than 100 million Americans and more than 1.5 billion people worldwide. Pain is a multidimensional construct, expressed through a variety of means. Facial expressions are one such type of pain expression. Automatic facial expression recognition, and in particular pain expression recognition, are fields that have been studied extensively. However, nothing has explored the possibility of an automatic pain quantification algorithm, able to output pain levels based upon a facial image.

A computational pain quantification algorithm, designed for a remote monitoring context, has been developed and validated on two distinct sets of data. The second set of data also included associated information on age, gender, culture and cause of pain. These four fields were investigated for their effect on automatic pain quantification, determining that age and gender have a definite impact and should be involved in the algorithm, while culture and cause require further investigation.

APA, Harvard, Vancouver, ISO, and other styles
39

Deng, Wenping. "Algorithms for Reconstruction of Gene Regulatory Networks from High-Throughput Gene Expression Data." Thesis, Michigan Technological University, 2019. http://pqdtopen.proquest.com/#viewpdf?dispub=13420080.

Full text
Abstract:

Understanding gene interactions in complex living systems is one of the central tasks in system biology. With the availability of microarray and RNA-Seq technologies, a multitude of gene expression datasets has been generated towards novel biological knowledge discovery through statistical analysis and reconstruction of gene regulatory networks (GRN). Reconstruction of GRNs can reveal the interrelationships among genes and identify the hierarchies of genes and hubs in networks. The new algorithms I developed in this dissertation are specifically focused on the reconstruction of GRNs with increased accuracy from microarray and RNA-Seq high-throughput gene expression data sets.

The first algorithm (Chapter 2) focuses on modeling the transcriptional regulatory relationships between transcription factors (TFs) and pathway genes. Multiple linear regression and its regularized versions, such as Ridge regression and LASSO, are common tools used to model the relationship between predictor variables and a dependent variable. To deal with outliers in gene expression data, to account for the group effect of TFs in regulation, and to improve statistical efficiency, it is proposed to use the Huber function as the loss function and the Berhu function as the penalty function to model the relationships between a pathway gene and many or all TFs. A proximal gradient descent algorithm was developed to solve the corresponding optimization problem; this algorithm is much faster than the general convex optimization solver CVX. This Huber-Berhu regression was then embedded into a partial least squares (PLS) framework to deal with the high dimensionality and multicollinearity of gene expression data. The results showed this method can identify the true regulatory TFs for each pathway gene with high efficiency.
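
A minimal proximal-gradient sketch of robust regression with the Huber loss is given below; for brevity it uses the ordinary L1 soft-thresholding prox in place of the Berhu penalty described in the dissertation, and all data are synthetic.

    # Sketch: ISTA-style proximal gradient descent with a Huber loss and an L1
    # penalty (substituted for the Berhu penalty to keep the sketch short).
    import numpy as np

    def huber_grad(r, delta):
        """Gradient of the Huber loss with respect to the residuals r."""
        return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    rng = np.random.default_rng(3)
    n, p = 200, 50
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]
    y = X @ beta_true + rng.standard_t(df=3, size=n)     # heavy-tailed noise (outliers)

    delta, lam = 1.0, 0.1
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)             # 1 / Lipschitz constant of the loss
    beta = np.zeros(p)
    for _ in range(500):
        grad = X.T @ huber_grad(X @ beta - y, delta)
        beta = soft_threshold(beta - step * grad, step * lam)

    print("nonzero coefficients:", np.flatnonzero(np.abs(beta) > 1e-3))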

The second algorithm (Chapter 3) focuses on building multilayered hierarchical gene regulatory networks (ML-hGRNs). A backward elimination random forest (BWERF) algorithm was developed for constructing an ML-hGRN operating above a biological pathway or a biological process. The algorithm first divided construction of the ML-hGRN into multiple regression tasks, each involving a regression between a pathway gene and all TFs. Random forest models with backward elimination were used to determine the importance of each TF to a pathway gene. The importance of a TF to the whole pathway was then computed by aggregating the importance values of the TF to the individual pathway genes. Next, an expectation maximization algorithm was used to cut the TFs to form the first layer of direct regulatory relationships. The upper layers of the GRN were constructed in the same way, only replacing the pathway genes with the newly cut TFs. Both simulated and real gene expression data were used to test the algorithms and demonstrated the accuracy and efficiency of the method.

The third algorithm (Chapter 4) focuses on Joint Reconstruction of Multiple Gene Regulatory Networks (JRmGRN) using gene expression data from multiple tissues or conditions. In the formulation, shared hub genes across different tissues or conditions are assumed. Under the framework of the Gaussian graphical model, the JRmGRN method constructs the GRNs by maximizing a penalized log-likelihood function. The problem was formulated as a convex optimization problem and then solved with an alternating direction method of multipliers (ADMM) algorithm. Both simulated and real gene expression data showed that JRmGRN had better performance than existing methods.
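
JRmGRN's joint penalized likelihood is not reproduced here; its single-condition building block, a sparse Gaussian graphical model estimated by penalized maximum likelihood (the graphical lasso), can be sketched with scikit-learn on synthetic data as below.

    # Sketch: single-condition Gaussian graphical model via the graphical lasso,
    # the building block JRmGRN extends to joint estimation with shared hubs.
    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(4)
    n_samples, n_genes = 300, 10
    # Chain-structured true precision matrix -> sparse conditional dependencies.
    prec = np.eye(n_genes) + 0.4 * (np.eye(n_genes, k=1) + np.eye(n_genes, k=-1))
    X = rng.multivariate_normal(np.zeros(n_genes), np.linalg.inv(prec), size=n_samples)

    model = GraphicalLasso(alpha=0.05).fit(X)
    edges = np.abs(model.precision_) > 1e-3
    np.fill_diagonal(edges, False)
    print("recovered edges:", int(edges.sum()) // 2)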

APA, Harvard, Vancouver, ISO, and other styles
40

Green, Michael A. "Improving Identification of Subtle Changes in Wide-Area Sensing through Dynamic Zoom." Thesis, Delaware State University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10794023.

Full text
Abstract:

The past decade has seen an abundance of applications that utilize sensors to collect data. One such example is a gigapixel image, which combines a multitude of high-quality images into a panorama capable of viewing hundreds of acres. The resulting datasets can be quite large, making analysis time consuming and resource intensive. Moreover, coverage of such broad environments can mean numerous sensor feeds to which one must attend. A suitable approach for analysis and sense-making of such data is to focus on “interesting” samples of data, namely regions of interest, or ROI. ROIs are especially useful in wide-area sensing situations that return datasets that are largely similar from one instance to the next, but also possess small differences. Identifying subtle changes is relevant to certain scenarios in surveillance, such as the evidence of human activity. Several ROI detection techniques exist in the research literature. My work focuses on ROI detection tuned to subtle differences for images at varying zoom levels. My thesis consists of developing a method that identifies regions of interest for subtle changes in images. In this pursuit, my contributions will address key questions including the characterization of image information dynamics through introduction of dynamic zoom, the definition and measurement of subtlety, and an approach for scoring and selecting ROIs. This work will provide an automated attention mechanism for zoomed images, but is also applicable to domains including satellite imagery and cyber security.

APA, Harvard, Vancouver, ISO, and other styles
41

Eiland, E. Earl. "A Coherent Classifier/Prediction/Diagnostic Problem Framework and Relevant Summary Statistics." Thesis, New Mexico Institute of Mining and Technology, 2017. http://pqdtopen.proquest.com/#viewpdf?dispub=10617960.

Full text
Abstract:

Classification is a ubiquitous decision activity. Regardless of whether it is predicting the future, e.g., a weather forecast, determining an existing state, e.g., a medical diagnosis, or some other activity, classifier outputs drive future actions. Because of their importance, classifier research and development is an active field.

Regardless of whether one is a classifier developer or an end user, evaluating and comparing classifier output quality is important. Intuitively, classifier evaluation may seem simple, however, it is not. There is a plethora of classifier summary statistics and new summary statistics seem to surface regularly. Summary statistic users appear not to be satisfied with the existing summary statistics. For end users, many existing summary statistics do not provide actionable information. This dissertation addresses the end user's quandary.

The work consists of four parts: 1. Considering eight summary statistics with regard to their purpose (what questions do they quantitatively answer) and efficacy (as defined by measurement theory). 2. Characterizing the classification problem from the end user's perspective and identifying four axioms for end user efficacious classifier evaluation summary statistics. 3. Applying the axioms and measurement theory to evaluate eight summary statistics and create two compliant (end user efficacious) summary statistics. 4. Using the compliant summary statistics to show the actionable information they generate.

By applying the recommendations in this dissertation, both end users and researchers benefit. Researchers have summary statistic selection and classifier evaluation protocols that generate the most usable information. End users can also generate information that facilitates tool selection and optimal deployment, if classifier test reports provide the necessary information.

APA, Harvard, Vancouver, ISO, and other styles
42

Xu, Yushi Ph D. Massachusetts Institute of Technology. "Combining linguistics and statistics for high-quality limited domain English-Chinese machine translation." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/44726.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.
Includes bibliographical references (p. 86-87).
Second language learning is a compelling activity in today's global markets. This thesis focuses on critical technology necessary to produce a computer spoken translation game for learning Mandarin Chinese in a relatively broad travel domain. Three main aspects are addressed: efficient Chinese parsing, high-quality English-Chinese machine translation, and how these technologies can be integrated into a translation game system. In the language understanding component, the TINA parser is enhanced with bottom-up and long-distance constraint features. The results showed that with these features, the Chinese grammar ran ten times faster and covered 15% more of the test set. In the machine translation component, a combined linguistic and statistical method is introduced. The English-Chinese translation is done via an intermediate language "Zhonglish", where the English-Zhonglish translation is accomplished by a parse-and-paraphrase paradigm using hand-coded rules, mainly for structural reconstruction. Zhonglish-Chinese translation is accomplished by a standard phrase-based statistical machine translation system, mostly accomplishing word sense disambiguation and lexicon mapping. We evaluated on an independent test set from the IWSLT travel-domain spoken language corpus. Substantial improvements were achieved for GIZA alignment crossover: we obtained a 45% decrease in crossovers compared to a traditional phrase-based statistical MT system. Furthermore, the BLEU score improved by 2 points. Finally, a framework for the translation game system is described, and the feasibility of integrating the components to produce reference translations and to automatically assess students' translations is verified.
by Yushi Xu.
S.M.
APA, Harvard, Vancouver, ISO, and other styles
43

Schmidt, Molly A. "Weighting protein ensembles with Bayesian statistics and small-angle X-ray scattering data." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/119574.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 52-54).
Intrinsically Disordered Proteins (IDPs) are involved in a number of neurodegenerative disorders such as Parkinson's and Alzheimer's diseases. Their disordered nature allows them to sample many different conformations, so their structures must be represented as ensembles. Typically, structural ensembles for IDPs are constructed by generating a set of conformations that yield ensemble averages that agree with pre-existing experimental data. However, as the number of experimental constraints is usually much smaller than the degrees of freedom in the protein, the ensemble construction process is under-determined, meaning there are many different ensembles that agree with a given set of experimental observables. The Variational Bayesian Weighting program uses Bayesian statistics to fit conformational ensembles, and in doing so also quantifies the uncertainty in the underlying ensemble. The present work sought to introduce new functionality to this program, allowing it to use data obtained from Small-Angle X-ray Scattering.
by Molly A. Schmidt.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
44

Yong, Florence Hiu-Ling. "Quantitative Methods for Stratified Medicine." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:17463130.

Full text
Abstract:
Stratified medicine has tremendous potential to deliver more effective therapeutic intervention to improve public health. For practical implementation, reliable prediction models and clinically meaningful categorization of some comprehensible summary measures of individual treatment effect are vital elements to aid the decision-making process and bring stratified medicine to fruitful realization. We tackle the quantitative issues involved on three fronts: 1) prediction model building and selection; 2) reproducibility assessment; and 3) stratification. First, we propose a systematic model development strategy that integrates cross-validation and predictive accuracy measures in the prediction model building and selection process. Valid inference is made possible via internal holdout sample or external data evaluation to enhance generalizability of the selected prediction model. Second, we employ parametric or semi-parametric modeling to derive individual treatment effect scoring systems. We introduce a stratification algorithm with constrained optimization by utilizing dynamic programming and supervised-learning techniques to group patients into different actionable categories. We integrate the stratification and newly proposed prediction performance metric into the model development process. The methodologies are first presented in the single-treatment case and then extended to two-treatment cases. Finally, adapting the concept of uplift modeling, we provide a framework to identify the subgroup(s) with the most beneficial prospect, as well as wasteful, harmful, and futile subgroups, to save resources and reduce unnecessary exposure to treatment adverse effects. The proposals are illustrated by AIDS clinical study data and cardiology studies for non-censored and censored outcomes. The contribution of this dissertation is to provide an operational framework to bridge predictive modeling and decision making for more practical applications in stratified medicine.
Biostatistics
APA, Harvard, Vancouver, ISO, and other styles
45

Vũ, John Huân. "Software Internationalization: A Framework Validated Against Industry Requirements for Computer Science and Software Engineering Programs." DigitalCommons@CalPoly, 2010. https://digitalcommons.calpoly.edu/theses/248.

Full text
Abstract:
View John Huân Vũ's thesis presentation at http://youtu.be/y3bzNmkTr-c. In 2001, the ACM and IEEE Computing Curriculum stated that it was necessary to address "the need to develop implementation models that are international in scope and could be practiced in universities around the world." With increasing connectivity through the internet, the move towards a global economy and growing use of technology places software internationalization as a more important concern for developers. However, there has been a "clear shortage in terms of numbers of trained persons applying for entry-level positions" in this area. Eric Brechner, Director of Microsoft Development Training, suggested five new courses to add to the computer science curriculum due to the growing "gap between what college graduates in any field are taught and what they need to know to work in industry." He concludes that "globalization and accessibility should be part of any course of introductory programming," stating: A course on globalization and accessibility is long overdue on college campuses. It is embarrassing to take graduates from a college with a diverse student population and have to teach them how to write software for a diverse set of customers. This should be part of introductory software development. Anything less is insulting to students, their family, and the peoples of the world. There is very little research into how the subject of software internationalization should be taught to meet the major requirements of the industry. The research question of the thesis is thus, "Is there a framework for software internationalization that has been validated against industry requirements?" The answer is no. The framework "would promote communication between academia and industry ... that could serve as a common reference point in discussions." Since no such framework for software internationalization currently exists, one will be developed here. The contribution of this thesis includes a provisional framework to prepare graduates to internationalize software and a validation of the framework against industry requirements. The requirement of this framework is to provide a portable and standardized set of requirements for computer science and software engineering programs to teach future graduates.
APA, Harvard, Vancouver, ISO, and other styles
46

Chavali, Krishna Kumar. "Integration of statistical and neural network method for data analysis." Morgantown, W. Va. : [West Virginia University Libraries], 2006. https://eidr.wvu.edu/etd/documentdata.eTD?documentid=4749.

Full text
Abstract:
Thesis (M.S.)--West Virginia University, 2006.
Title from document title page. Document formatted into pages; contains viii, 68 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 50-51).
APA, Harvard, Vancouver, ISO, and other styles
47

Xiong, Kuangnan. "Roughened Random Forests for Binary Classification." Thesis, State University of New York at Albany, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3624962.

Full text
Abstract:

Binary classification plays an important role in many decision-making processes. Random forests can build a strong ensemble classifier by combining weaker classification trees that are de-correlated. The strength of and correlation among individual classification trees are the key factors that contribute to the ensemble performance of random forests. We propose roughened random forests, a new set of tools which show further improvement over random forests in binary classification. Roughened random forests modify the original dataset for each classification tree and further reduce the correlation among individual classification trees. This data modification process consists of artificially imposing missing data that are missing completely at random, followed by missing data imputation.
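
The roughening step itself is easy to sketch: impose a missing-completely-at-random mask on the training data, impute, and fit a random forest. The missing rate, imputer, and dataset below are illustrative choices, not the dissertation's settings.

    # Sketch: one "roughened" training set - MCAR masking, mean imputation,
    # then a random forest. Missing rate and dataset are illustrative only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(5)
    X, y = make_classification(n_samples=600, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    missing_rate = 0.2                                           # cf. question (1) below
    X_rough = X_train.copy()
    X_rough[rng.random(X_rough.shape) < missing_rate] = np.nan   # MCAR mask

    X_imputed = SimpleImputer(strategy="mean").fit_transform(X_rough)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_imputed, y_train)
    print("test accuracy:", round(clf.score(X_test, y_test), 3))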

Through this dissertation we aim to answer a few important questions in building roughened random forests: (1) What is the ideal rate of missing data to impose on the original dataset? (2) Should we impose missing data on both the training and testing datasets, or only on the training dataset? (3) What are the best missing data imputation methods to use in roughened random forests? (4) Do roughened random forests share the same ideal number of covariates selected at each tree node as the original random forests? (5) Can roughened random forests be used in medium- to high- dimensional datasets?

APA, Harvard, Vancouver, ISO, and other styles
48

Navaroli, Nicholas Martin. "Generative Probabilistic Models for Analysis of Communication Event Data with Applications to Email Behavior." Thesis, University of California, Irvine, 2015. http://pqdtopen.proquest.com/#viewpdf?dispub=3668831.

Full text
Abstract:

Our daily lives increasingly involve interactions with others via different communication channels, such as email, text messaging, and social media. In this context, the ability to analyze and understand our communication patterns is becoming increasingly important. This dissertation focuses on generative probabilistic models for describing different characteristics of communication behavior, focusing primarily on email communication.

First, we present a two-parameter kernel density estimator for estimating the probability density over recipients of an email (or, more generally, items which appear in an itemset). A stochastic gradient method is proposed for efficiently inferring the kernel parameters given a continuous stream of data. Next, we apply the kernel model and the Bernoulli mixture model to two important prediction tasks: given a partially completed email recipient list, 1) predict which others will be included in the email, and 2) rank potential recipients based on their likelihood to be added to the email. Such predictions are useful in suggesting future actions to the user (i.e. which person to add to an email) based on their previous actions. We then investigate a piecewise-constant Poisson process model for describing the time-varying communication rate between an individual and several groups of their contacts, where changes in the Poisson rate are modeled as latent state changes within a hidden Markov model.

We next focus on the time it takes for an individual to respond to an event, such as receiving an email. We show that this response time depends heavily on the individual's typical daily and weekly patterns - patterns not adequately captured in standard models of response time (e.g. the Gamma distribution or Hawkes processes). A time-warping mechanism is introduced where the absolute response time is modeled as a transformation of effective response time, relative to the daily and weekly patterns of the individual. The usefulness of applying the time-warping mechanism to standard models of response time, both in terms of log-likelihood and accuracy in predicting which events will be quickly responded to, is illustrated over several individual email histories.

APA, Harvard, Vancouver, ISO, and other styles
49

Vang, Yeeleng Scott. "An Ensemble Prognostic Model for Metastatic, Castrate-Resistant Prostate Cancer." Thesis, University of California, Irvine, 2016. http://pqdtopen.proquest.com/#viewpdf?dispub=10162542.

Full text
Abstract:

Metastatic, castrate-resistant prostate cancer (mCRPC) is one of the most prevalent cancers and the third leading cause of cancer death among men. Several treatment options have been developed to combat mCRPC; however, none has produced any tangible benefit to patients' overall survivability. As part of a crowd-sourced algorithm development competition, participants were asked to develop new prognostic models for mCRPC patients treated with docetaxel. Such results could potentially assist in clinical decision making for future mCRPC patients.

In this thesis, we present a new ensemble prognostic model to perform risk prediction for mCRPC patients treated with docetaxel. We rely on a traditional survival analysis model, the Cox Proportional Hazards model, as well as a more recently developed boosting model that incorporates a smooth approximation of the concordance index for direct optimization. Our model performs better than the current state-of-the-art mCRPC prognostic models on the concordance index performance measure and is competitive with these models on the integrated time-dependent area under the receiver operating characteristic curve.
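
As a rough illustration of the Cox component and the concordance-index metric, the sketch below uses the lifelines library on synthetic survival data; the covariates and effect sizes are invented, and the ensemble with the concordance-optimizing booster is not shown.

    # Sketch: Cox proportional hazards fit and concordance index on synthetic
    # data with lifelines. Only the Cox component of the ensemble is shown.
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter
    from lifelines.utils import concordance_index

    rng = np.random.default_rng(6)
    n = 400
    df = pd.DataFrame({
        "psa": rng.lognormal(2.0, 0.5, n),       # invented covariates
        "alp": rng.lognormal(4.0, 0.4, n),
        "ecog": rng.integers(0, 3, n),
    })
    risk = 0.01 * df["psa"] + 0.002 * df["alp"] + 0.4 * df["ecog"]
    df["time"] = rng.exponential(1.0 / np.exp(0.1 * risk))   # survival times
    df["event"] = rng.random(n) < 0.8                        # ~20% censored

    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    c_index = concordance_index(df["time"], -cph.predict_partial_hazard(df), df["event"])
    print("concordance index:", round(c_index, 3))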

APA, Harvard, Vancouver, ISO, and other styles
50

Wu, Tao. "Higher-order Random Walk Methods for Data Analysis." Thesis, Purdue University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10790747.

Full text
Abstract:

Markov random walk models are powerful analytical tools for multiple areas in machine learning, numerical optimization and data mining. The key assumption of a first-order Markov chain is memorylessness, which restricts the dependence of the transition distribution to the current state only. However, in many applications this assumption is not appropriate. We propose a set of higher-order random walk techniques and discuss their applications to tensor co-clustering, user trails modeling, and solving linear systems. First, we develop a new random walk model that we call the super-spacey random surfer, which simultaneously clusters the rows, columns, and slices of a nonnegative three-mode tensor. This algorithm generalizes to tensors with any number of modes. We partition the tensor by minimizing the exit probability between clusters when the super-spacey random walk is at stationarity. The second application is user trails modeling, where user trails record sequences of activities as individuals interact with the Internet and the world. We propose the retrospective higher-order Markov process as a two-step process that first chooses a state from the history and then transitions as a first-order chain conditional on that state. This way the total number of parameters is restricted and the model is thus protected from overfitting. Lastly, we propose to use a time-inhomogeneous Markov chain to approximate the solution of a linear system. Multiple simulations of the random walk are conducted to approximate the solution. By allowing the random walk to transition based on multiple matrices, we decrease the variance of the simulations and thus increase the speed of the solver.
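
The linear-system application can be illustrated with the classical von Neumann-Ulam view: to solve (I - A)x = b with a contraction A, simulate killed random walks whose accumulated weights estimate the Neumann series. The sketch below uses uniform transitions and invented sizes, not the dissertation's time-inhomogeneous construction.

    # Sketch: Monte Carlo random-walk solver for x = A x + b, i.e. (I - A) x = b.
    # Uniform proposals with a kill probability; all sizes are illustrative.
    import numpy as np

    rng = np.random.default_rng(7)
    n = 5
    A = rng.uniform(-0.1, 0.1, size=(n, n))     # contraction: spectral radius < 1
    b = rng.uniform(0.5, 1.5, size=n)

    def walk_estimate(i, n_walks=20000, kill=0.3):
        """Estimate x_i = sum_k (A^k b)_i by averaging killed random walks."""
        total = 0.0
        for _ in range(n_walks):
            state, weight, score = i, 1.0, b[i]
            while rng.random() > kill:                      # survive this step
                nxt = rng.integers(n)                       # uniform next state
                weight *= A[state, nxt] * n / (1.0 - kill)  # importance weight
                score += weight * b[nxt]
                state = nxt
            total += score
        return total / n_walks

    x_exact = np.linalg.solve(np.eye(n) - A, b)
    x_mc = np.array([walk_estimate(i) for i in range(n)])
    print("exact:      ", np.round(x_exact, 3))
    print("monte carlo:", np.round(x_mc, 3))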

APA, Harvard, Vancouver, ISO, and other styles