Dissertations / Theses on the topic 'Outlier analyses'


Consult the top 50 dissertations / theses for your research on the topic 'Outlier analyses.'


1

Zhang, Ji. "Towards outlier detection for high-dimensional data streams using projected outlier analysis strategy." University of Southern Queensland, Faculty of Sciences, 2008. http://eprints.usq.edu.au/archive/00005645/.

Abstract:
Outlier detection is an important research problem in data mining that aims to discover useful abnormal and irregular patterns hidden in large data sets. Most existing outlier detection methods deal only with static data of relatively low dimensionality. Recently, outlier detection for high-dimensional stream data has emerged as a new research problem. A key observation that motivates this research is that outliers in high-dimensional data are projected outliers, i.e., they are embedded in lower-dimensional subspaces. Detecting projected outliers from high-dimensional stream data is a very challenging task for several reasons. First, detecting projected outliers is difficult even for high-dimensional static data: the exhaustive search for the outlying subspaces in which projected outliers are embedded is an NP-hard problem. Second, algorithms for handling data streams are constrained to a single pass over the streaming data, under conditions of limited space and time criticality. The existing methods for outlier detection are ineffective for detecting projected outliers in high-dimensional data streams. In this thesis, we present a new technique, called the Stream Projected Outlier deTector (SPOT), which attempts to detect projected outliers in high-dimensional data streams. SPOT employs an innovative window-based time model to capture dynamic statistics from stream data, and a novel data structure containing a set of top sparse subspaces to detect projected outliers effectively. SPOT also employs a multi-objective genetic algorithm as an effective search method for finding the outlying subspaces in which most projected outliers are embedded. The experimental results demonstrate that SPOT is efficient and effective in detecting projected outliers in high-dimensional data streams.
The main contribution of this thesis is that it provides a backbone for tackling the challenging problem of outlier detection in high-dimensional data streams. SPOT can facilitate the discovery of useful abnormal patterns and can potentially be applied to a variety of high-demand applications, such as sensor network data monitoring and online transaction protection.
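The core idea behind SPOT — one-pass processing of a sliding window, with outliers defined by sparsity in low-dimensional projections — can be illustrated with a deliberately simplified sketch. This is not Zhang's algorithm: a grid-based cell count stands in for SPOT's sparse-subspace statistics, the genetic search and multi-resolution time model are omitted, and every name, bin count, and threshold below is invented for illustration.

```python
from collections import deque
from itertools import combinations
import numpy as np

def sparse_subspace_outliers(window, bins=4, min_frac=0.02):
    """Toy stand-in for a sparse-subspace test: flag points that fall
    into an under-populated grid cell of some 2-D projection."""
    X = np.asarray(window)
    n, d = X.shape
    flagged = np.zeros(n, dtype=bool)
    for i, j in combinations(range(d), 2):
        proj = X[:, [i, j]]
        lo, hi = proj.min(axis=0), proj.max(axis=0)
        # Discretise the 2-D projection into a bins x bins grid.
        cells = np.floor((proj - lo) / (hi - lo + 1e-12) * bins).astype(int)
        cells = np.minimum(cells, bins - 1)
        ids = cells[:, 0] * bins + cells[:, 1]
        counts = np.bincount(ids, minlength=bins * bins)
        flagged |= counts[ids] < max(1, min_frac * n)
    return flagged

rng = np.random.default_rng(0)
stream = rng.normal(size=(500, 5))
stream[250, 2:4] = [8.0, -8.0]          # outlying only in subspace (2, 3)
window = deque(maxlen=200)              # one pass, fixed memory
hits = []
for t, x in enumerate(stream):
    window.append(x)
    if len(window) == window.maxlen and sparse_subspace_outliers(window)[-1]:
        hits.append(t)
```

The planted point is unremarkable in every single dimension; only the joint cell it occupies in one 2-D projection is sparse — the defining property of a projected outlier.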
2

Cheng, Gongxian. "Outlier management in intelligent data analysis." Thesis, Birkbeck (University of London), 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.417120.

3

Abghari, Shahrooz. "Data Modeling for Outlier Detection." Licentiate thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-16580.

Abstract:
This thesis explores data modeling for outlier detection in three application domains: maritime surveillance, district heating, and online media and sequence datasets. The proposed models are evaluated and validated under different experimental scenarios, taking into account the specific characteristics and setups of each domain. Outlier detection has been studied and applied in many domains. Outliers arise for different reasons, such as fraudulent activities, structural defects, health problems, and mechanical issues; their detection is a challenging task that can reveal system faults, uncover fraud, and save lives. Outlier detection techniques are often domain-specific. The main challenge in outlier detection is modeling the normal behavior in order to identify abnormalities. The choice of model is important: an incorrect choice of data model can lead to poor results. This requires a good understanding and interpretation of the data, the constraints, and the requirements of the problem domain. Outlier detection is largely an unsupervised problem, since labeled data is scarce and expensive to obtain. We have studied and applied a combination of machine learning and data mining techniques to build data-driven and domain-oriented outlier detection models. We have shown the importance of data preprocessing and feature selection in building suitable methods for data modeling, and have taken advantage of both supervised and unsupervised techniques to create hybrid methods. For example, we have proposed a rule-based outlier detection system based on open data for the maritime surveillance domain, and have combined cluster analysis with regression to identify manual changes in heating systems at the building level. Sequential pattern mining has also been exploited to identify contextual and collective outliers in online media data.
In addition, we have proposed a minimum spanning tree clustering technique for detecting groups of outliers in online media and sequence data. The proposed models have been shown to be capable of explaining the underlying properties of the detected outliers, which can help domain experts narrow the scope of analysis and understand the reasons for such anomalous behavior. We have also investigated the reproducibility of the proposed models in similar application domains.
Scalable resource-efficient systems for big data analytics
4

Birch, Gary Edward. "Single trial EEG signal analysis using outlier information." Thesis, University of British Columbia, 1988. http://hdl.handle.net/2429/28626.

Abstract:
The goal of this thesis work was to study the characteristics of the EEG signal and then, based on the insights gained from these studies, pursue an initial investigation into a processing method that would extract useful event related information from single trial EEG. The fundamental tool used to study the EEG signal characteristics was autoregressive modeling. Early investigations pointed to the need to employ robust techniques in both model parameter estimation and signal estimation applications. Pursuing robust techniques ultimately led to the development of a single trial processing method which was based on a simple neurological model that assumed an additive outlier nature of event related potentials to the ongoing EEG process. When event related potentials, such as motor related potentials, are generated by a unique additional process they are "added" into the ongoing process and hence, will appear as additive outlier content when considered from the point of view of the ongoing process. By modeling the EEG with AR models with robustly estimated (GM-estimates) parameters and by using those models in a robust signal estimator, a "cleaned" EEG signal is obtained. The outlier content, data that is extracted from the EEG during cleaning, is then processed to yield event related information. The EEG from four subjects formed the basis of the initial investigation into the viability of this single trial processing scheme. The EEG was collected under two conditions: an active task in which subjects performed a skilled thumb movement and an idle task in which subjects remained alert but did not carry out any motor activity. The outlier content was processed which provided single trial outlier waveforms. In the active case these waveforms possessed consistent features which were found to be related to events in the individual thumb movements. In the idle case the waveforms did not contain consistent features. 
Bayesian classification of active trials versus idle trials was carried out using a cost statistic resulting from the application of dynamic time warping to the outlier waveforms. Across the four subjects, when the decision boundary was set with the cost of misclassification equal, 93% of the active trials were classified correctly and 18% of the idle trials were incorrectly classified as active. When the cost of misclassifying an idle trial was set to be five times greater, 80% of the active trials were classified correctly and only 1.7% of the idle trials were incorrectly classified as active.
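The processing chain Birch describes — model the ongoing EEG with an AR model, treat event-related activity as additive outliers, and recover it from the residuals — can be sketched as follows. The thesis uses robust GM-estimates of the AR parameters and a robust signal estimator; this sketch substitutes ordinary least squares plus a MAD-based residual threshold, and the simulated "EEG" and burst are invented.

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 500, 4
# Simulate an AR(2) "ongoing EEG" process and superimpose a short
# event-related burst as additive outlier content.
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
x[300:304] += [8.0, 10.0, 6.0, 3.0]

# AR(p) fit by ordinary least squares (the thesis uses robust
# GM-estimates; plain OLS keeps this sketch short).
A = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
coef, *_ = np.linalg.lstsq(A, x[p:], rcond=None)

# Residuals from one-step predictions; a MAD-based scale keeps the burst
# itself from inflating the detection threshold.
resid = x[p:] - A @ coef
sigma = 1.4826 * np.median(np.abs(resid - np.median(resid)))
outlier_idx = np.where(np.abs(resid) > 4 * sigma)[0] + p
```

The samples flagged in `outlier_idx` are the "outlier content" in the abstract's sense: what remains after the ongoing process explains the rest.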
Faculty of Applied Science
Department of Electrical and Computer Engineering
Graduate
5

Mitchell, Napoleon. "Outliers and Regression Models." Thesis, University of North Texas, 1992. https://digital.library.unt.edu/ark:/67531/metadc279029/.

Abstract:
The mitigation of outliers serves to increase the strength of a relationship between variables. This study defined outliers in three different ways and used five regression procedures to describe the effects of outliers on 50 data sets. This study also examined the relationship among the shape of the distribution, skewness, and outliers.
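The sensitivity such studies examine is easy to demonstrate: a single extreme point can noticeably shift a least-squares fit. A small sketch with synthetic data (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)   # true slope: 2

def ols_slope(x, y):
    # Least-squares slope: cov(x, y) / var(x).
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

clean_slope = ols_slope(x, y)
y_out = y.copy()
y_out[-1] += 40.0                              # one extreme, high-leverage outlier
contaminated_slope = ols_slope(x, y_out)
```

One corrupted response out of fifty is enough to pull the slope estimate visibly away from the clean fit, which is why outlier definitions and regression procedures are studied together.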
6

Soon, Shih Chung. "On detection of extreme data points in cluster analysis." Connect to resource, 1987. http://rave.ohiolink.edu/etdc/view.cgi?acc%5Fnum=osu1262886219.

7

Robson, Geoffrey. "Multiple outlier detection and cluster analysis of multivariate normal data." Thesis, Stellenbosch : Stellenbosch University, 2003. http://hdl.handle.net/10019.1/53508.

Abstract:
Thesis (MscEng)--Stellenbosch University, 2003.
ENGLISH ABSTRACT: Outliers may be defined as observations that are sufficiently aberrant to arouse the suspicion of the analyst as to their origin. They could be the result of human error, in which case they should be corrected, but they may also be an interesting exception, and this would deserve further investigation. Identification of outliers typically consists of an informal inspection of a plot of the data, but this is unreliable for dimensions greater than two. A formal procedure for detecting outliers allows for consistency when classifying observations. It also enables one to automate the detection of outliers by using computers. The special case of univariate data is treated separately to introduce essential concepts, and also because it may well be of interest in its own right. We then consider techniques used for detecting multiple outliers in a multivariate normal sample, and go on to explain how these may be generalized to include cluster analysis. Multivariate outlier detection is based on the Minimum Covariance Determinant (MCD) subset, and is therefore treated in detail. Exact bivariate algorithms were refined and implemented, and the solutions were used to establish the performance of the commonly used heuristic, Fast–MCD.
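The MCD-based detection pipeline the abstract describes — robust location/scatter, then robust Mahalanobis distances against a chi-squared cutoff — can be sketched with scikit-learn's `MinCovDet`, which implements the Fast-MCD heuristic mentioned above. The data and the 97.5% cutoff are illustrative choices, not the thesis's exact setup.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
X[:10] += [6.0, -6.0]                      # plant 10 multivariate outliers

# Robust location/scatter via the Fast-MCD heuristic, then flag points
# whose robust squared Mahalanobis distance exceeds the chi-squared
# 97.5% cutoff for 2 dimensions.
mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)
outliers = np.where(d2 > chi2.ppf(0.975, df=2))[0]
```

Because the MCD subset excludes the contaminated points when estimating the covariance, the planted outliers do not mask themselves — the failure mode that breaks classical mean/covariance estimates.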
8

Halldestam, Markus. "ANOVA - The Effect of Outliers." Thesis, Uppsala universitet, Statistiska institutionen, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-295864.

Abstract:
This bachelor's thesis focuses on the effect of outliers on one-way analysis of variance, examining whether the estimates in ANOVA are robust and whether the test itself is robust to the influence of extreme outliers. The robustness of the estimates is examined via the breakdown point, while the robustness of the test is examined by simulating the hypothesis test under some extreme situations. The study finds evidence that the estimates in ANOVA are sensitive to outliers, i.e., that the procedure is not robust. Samples with a larger portion of extreme outliers have a type-I error probability higher than the nominal level.
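The simulation approach described — generate equal-mean groups under the null hypothesis, contaminate with extreme outliers, and track the rejection rate — can be sketched as follows. The group sizes, outlier magnitude, and contamination level are invented for illustration, not taken from the thesis.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(3)

def rejection_rate(n_outliers, trials=500, n=20, alpha=0.05):
    """Fraction of simulated one-way ANOVAs (three equal-mean groups,
    H0 true) that reject at level alpha, with extreme +10 outliers
    planted in the first group."""
    rejections = 0
    for _ in range(trials):
        groups = [rng.normal(size=n) for _ in range(3)]
        groups[0][:n_outliers] += 10.0
        _, pval = f_oneway(*groups)
        rejections += pval < alpha
    return rejections / trials

clean_rate = rejection_rate(0)          # should sit near the nominal 5%
contaminated_rate = rejection_rate(5)   # a quarter of group one are outliers
```

Planting same-sign outliers in one group shifts that group's mean, so the between-group variance grows faster than the inflated error variance and the empirical type-I error rises well above the nominal level, consistent with the thesis's finding.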
9

Astl, Stefan Ludwig. "Suboptimal LULU-estimators in measurements containing outliers." Thesis, Stellenbosch : Stellenbosch University, 2013. http://hdl.handle.net/10019.1/85833.

Abstract:
Thesis (MSc)--Stellenbosch University, 2013.
ENGLISH ABSTRACT: Techniques for estimating a signal in the presence of noise containing outliers are currently not well developed. In this thesis, we consider a constant signal superimposed by a family of noise distributions structured as a tunable mixture f(x) = α g(x) + (1 − α) h(x) between finite-support components of "well-behaved" noise g(x) with small variance and "impulsive" noise h(x) with large amplitude and a strongly asymmetric character. When α ≈ 1, h(x) can, for example, model a cosmic ray striking an experimental detector. In the first part of our work, a method for obtaining the expected values of the positive and negative pulses in the first resolution level of a LULU Discrete Pulse Transform (DPT) is established. Subsequent analysis of sequences smoothed by the operators L1U1 or U1L1 of LULU theory shows that a robust estimator of the location parameter of g is achieved, in the sense that the contribution of h to the expected average of the smoothed sequences is suppressed to order (1 − α)² or higher. The specific shape of h, which can be difficult to guess given scarce data, is thus also shown to be of lesser importance. Furthermore, upon smoothing a sequence with L1U1 or U1L1, estimators for the scale parameters of the model distribution become easily available. In the second part of our work, the same problem and data are approached from a Bayesian inference perspective. The Bayesian estimators are optimal in the sense that they make full use of the available information in the data. Heuristic comparison shows, however, that Bayes estimators do not always outperform the LULU estimators. Although the Bayesian perspective provides much insight into the logical connections inherent in the problem, its estimators can be difficult to obtain in analytic form and are slow to compute numerically. Suboptimal LULU estimators are shown to be reasonable compromises in practical problems.
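The basic LULU smoothers are simple compositions of running minima and maxima; the following numpy sketch applies U1L1 to a constant signal contaminated with an impulsive outlier and recovers a robust location estimate, in the spirit of the abstract. The signal level, noise scale, and spike are invented for illustration.

```python
import numpy as np

def L1(x):
    """LULU lower smoother: removes isolated upward impulses of width one."""
    p = np.pad(x, 1, mode='edge')
    return np.maximum(np.minimum(p[:-2], p[1:-1]), np.minimum(p[1:-1], p[2:]))

def U1(x):
    """LULU upper smoother: removes isolated downward impulses of width one."""
    p = np.pad(x, 1, mode='edge')
    return np.minimum(np.maximum(p[:-2], p[1:-1]), np.maximum(p[1:-1], p[2:]))

rng = np.random.default_rng(4)
sig = 5.0 + rng.normal(scale=0.1, size=100)   # constant signal, mild noise
sig[40] += 30.0                               # impulsive outlier ("cosmic ray")
smoothed = U1(L1(sig))
location_estimate = smoothed.mean()           # robust estimate of the level of g
outlier_content = sig - smoothed              # the extracted impulsive part
```

The spike is removed entirely by the width-one smoother, so the average of the smoothed sequence barely feels the impulsive component — the suppression property the abstract quantifies as order (1 − α)².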
10

Lipkovich, Ilya A. "Bayesian Model Averaging and Variable Selection in Multivariate Ecological Models." Diss., Virginia Tech, 2002. http://hdl.handle.net/10919/11045.

Abstract:
Bayesian Model Averaging (BMA) is a new area in modern applied statistics that provides data analysts with an efficient tool for discovering promising models and obtaining estimates of their posterior probabilities via Markov chain Monte Carlo (MCMC). These probabilities can be further used as weights for model-averaged predictions and estimates of the parameters of interest. As a result, variance components due to model selection are estimated and accounted for, contrary to the practice of conventional data analysis (such as, for example, stepwise model selection). In addition, variable activation probabilities can be obtained for each variable of interest. This dissertation is aimed at connecting BMA and various ramifications of the multivariate technique called Reduced-Rank Regression (RRR). In particular, we are concerned with Canonical Correspondence Analysis (CCA) in ecological applications where the data are represented by a site-by-species abundance matrix with site-specific covariates. Our goal is to incorporate the multivariate techniques, such as Redundancy Analysis and Canonical Correspondence Analysis, into the general machinery of BMA, taking into account such complicating phenomena as outliers and clustering of observations within a single data-analysis strategy. Traditional implementations of model averaging are concerned with selection of variables. We extend the methodology of BMA to selection of subgroups of observations and implement several approaches to cluster and outlier analysis in the context of the multivariate regression model. The proposed algorithm of cluster analysis can accommodate restrictions on the resulting partition of observations when some of them form sub-clusters that have to be preserved when larger clusters are formed.
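The mechanics of model averaging and variable activation probabilities can be sketched in miniature. The following uses the common large-sample BIC approximation to posterior model probabilities over all subsets of a small linear model — a stand-in for the MCMC-based BMA and the RRR/CCA machinery the dissertation actually develops; the data and signal strengths are invented.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)   # only x0 and x2 active

def bic(cols):
    """BIC of the OLS fit of y on an intercept plus the given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    return n * np.log(rss / n) + A.shape[1] * np.log(n)

# Enumerate all 2^p subsets; exp(-BIC/2) approximates the posterior model
# probability up to normalisation, giving model weights and per-variable
# activation probabilities.
models = [cols for r in range(p + 1) for cols in combinations(range(p), r)]
bics = np.array([bic(cols) for cols in models])
weights = np.exp(-(bics - bics.min()) / 2)
weights /= weights.sum()
activation = np.array([sum(w for w, cols in zip(weights, models) if j in cols)
                       for j in range(p)])
```

The truly active variables accumulate activation probability near one, the spurious ones stay low, and predictions weighted by `weights` automatically carry the model-selection uncertainty that a single stepwise-selected model would hide.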
Ph.D.
11

Rodriguez, Gabriel. "Unit root, outliers and cointegration analysis with macroeconomic applications." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape3/PQDD_0028/NQ48794.pdf.

12

馮榮錦 and Wing-kam Tony Fung. "Analysis of outliers using graphical and quasi-Bayesian methods." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1987. http://hub.hku.hk/bib/B31230842.

13

Fung, Wing-kam Tony. "Analysis of outliers using graphical and quasi-Bayesian methods." [Hong Kong]: University of Hong Kong, 1987. http://sunzi.lib.hku.hk/hkuto/record.jsp?B1236146X.

14

Al-Kahwati, Kammal. "Outlier detection on sparse-encoded vibration signals from rolling element bearings." Thesis, Luleå tekniska universitet, Institutionen för system- och rymdteknik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-76592.

Abstract:
The demand for reliable condition monitoring systems on rotating machinery for power generation is continuously increasing due to the wider use of wind power as an energy source, which requires expertise in the diagnostics of these systems. An alternative to the limited availability of diagnostics and maintenance experts in the wind energy sector is to use unsupervised machine learning algorithms as a support tool for condition monitoring. Condition monitoring systems can employ such algorithms to prioritize the assets to monitor according to the number of anomalies detected in the vibration signals of the rolling element bearings. Previous work has focused on detecting anomalies using features taken directly from the time or frequency domain of the vibration signals to determine whether a machine has a fault. In this work, I detect outliers using features derived from vibration signals encoded via sparse coding with dictionary learning. I investigate multiple outlier detection algorithms and evaluate their performance using different features taken from the sparse representation. I show that it is possible to detect abnormal behavior of a bearing earlier than the fault dates reported by typical condition monitoring systems.
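The detection scheme described — learn a sparse dictionary on healthy vibration segments and flag segments the dictionary reconstructs poorly — can be sketched with scikit-learn's `DictionaryLearning`. The synthetic "vibration" windows, atom counts, and the three-sigma threshold are illustrative assumptions, not the thesis's setup.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(6)
t = np.arange(64) / 64.0
# Healthy windowed "vibration" segments: sums of two fixed tones.
healthy = np.array([np.sin(2 * np.pi * 5 * t + ph) + 0.5 * np.sin(2 * np.pi * 9 * t)
                    for ph in rng.uniform(0, 2 * np.pi, 80)])
faulty = np.sin(2 * np.pi * 23 * t)[None, :]    # a tone the dictionary never saw
X = np.vstack([healthy, faulty])                # faulty segment is row 80

# Learn a sparse dictionary on healthy segments only; segments the
# dictionary reconstructs poorly are flagged as outliers.
dico = DictionaryLearning(n_components=8, transform_algorithm='omp',
                          transform_n_nonzero_coefs=3, max_iter=100,
                          random_state=0).fit(healthy)
codes = dico.transform(X)
errors = np.linalg.norm(X - codes @ dico.components_, axis=1)
threshold = errors[:80].mean() + 3 * errors[:80].std()
outliers = np.where(errors > threshold)[0]
```

Because the learned atoms span only the healthy signal subspace, the unfamiliar fault tone cannot be sparsely encoded and its reconstruction error stands far above the healthy baseline.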
15

Sothinathan, Nalaiyini. "Bayesian Analysis for outliers in binomial, Normal and circular data." Thesis, Queen Mary, University of London, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.498204.

16

Schubert, Daniel Dice. "A multivariate adaptive trimmed likelihood algorithm." Access via Murdoch University Digital Theses Project, 2005. http://wwwlib.murdoch.edu.au/adt/browse/view/adt-MU20061019.132720.

17

Kinns, David Jonathan. "Multiple case influence analysis with particular reference to the linear model." Thesis, University of Birmingham, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.368427.

18

Keller, Fabian. "Attribute Relationship Analysis in Outlier Mining and Stream Processing." Supervisor: K. Böhm. Karlsruhe: KIT-Bibliothek, 2015. http://d-nb.info/1075254019/34.

19

Goi, Yoshinao. "Bayesian Damage Detection for Vibration Based Bridge Health Monitoring." Kyoto University, 2018. http://hdl.handle.net/2433/232013.

20

Andrésen, Anton, and Adam Håkansson. "Comparing unsupervised clustering algorithms to locate uncommon user behavior in public travel data : A comparison between the K-Means and Gaussian Mixture Model algorithms." Thesis, Tekniska Högskolan, Jönköping University, JTH, Datateknik och informatik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-49243.

Abstract:
Clustering machine learning algorithms have existed for a long time, and a multitude of variations are available to implement. Each has its advantages and disadvantages, which makes it challenging to select one for a particular problem and application. This study compares two algorithms, K-Means and the Gaussian Mixture Model, for outlier detection within public travel data from the travel planning mobile application MobiTime. The purpose was to identify differences between their outlier detection results, mainly by comparing the number of outliers located by each model with respect to the outlier threshold and the number of clusters. The study found that the algorithms differ substantially in their capability to detect outliers. These differences depend heavily on the type of data used, but one major finding was that K-Means was more restrictive than the Gaussian Mixture Model in classifying data points as outliers. The results of this study could help practitioners determine which algorithm to implement for their specific application and use case.
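The comparison the study performs can be sketched as follows: score each point by its distance to the nearest K-Means centroid versus by its negative log-density under a fitted Gaussian mixture, then apply the same percentile threshold to both. The two-cluster synthetic "travel" data and the 98th-percentile rule are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
clusters = np.vstack([rng.normal([0, 0], 0.5, size=(200, 2)),
                      rng.normal([5, 5], 0.5, size=(200, 2))])
uncommon = np.array([[-3., 8.], [8., -3.], [-3., -3.], [8., 8.],
                     [2.5, -3.], [2.5, 8.], [-3., 2.5], [8., 2.5]])
X = np.vstack([clusters, uncommon])    # uncommon points get indices 400..407

# K-Means scores a point by its distance to the nearest centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_score = km.transform(X).min(axis=1)

# The Gaussian mixture scores a point by its negative log-density.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
gm_score = -gm.score_samples(X)

# The same outlier rule for both: flag scores above the 98th percentile.
km_out = set(np.where(km_score > np.percentile(km_score, 98))[0])
gm_out = set(np.where(gm_score > np.percentile(gm_score, 98))[0])
```

Varying the percentile threshold and the number of clusters, and counting how the two flagged sets diverge, reproduces the kind of comparison the study reports.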
21

Stark, Love. "Outlier detection with ensembled LSTM auto-encoders on PCA transformed financial data." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-296161.

Abstract:
Financial institutions today generate large amounts of data, which can contain information worth investigating to further the institution's economic growth. There is an interest in analyzing these points of information, especially those that are anomalous relative to the normal day-to-day work. Finding these outliers, however, is not an easy task and is not possible to do manually due to the massive amounts of data generated daily. Previous work has explored the use of machine learning to find outliers in financial datasets, and previous studies have shown that the pre-processing of data usually accounts for a large part of the information loss. This work studies whether a proper balance can be found in the pre-processing, retaining as much information as possible while not leaving the data too complex for the machine learning models. The dataset consisted of foreign exchange transactions supplied by the host company and was pre-processed using Principal Component Analysis (PCA). The main purpose of this work is to test whether an ensemble of Long Short-Term Memory recurrent neural networks (LSTMs), configured as autoencoders, can be used to detect outliers in the data and whether the ensemble is more accurate than a single LSTM autoencoder. Previous studies have shown that ensembles of autoencoders can be more accurate than a single autoencoder, especially when SkipCells are implemented (a configuration that skips over LSTM cells to make the models more varied). A data point is considered an outlier if the LSTM model has trouble recreating it properly, i.e., it exhibits a pattern that is hard to reconstruct, making it available for further manual investigation. The results show that the ensembled LSTM model was more accurate than a single LSTM model with regard to reconstructing the dataset and, by our definition of an outlier, more accurate in outlier detection.
The pre-processing experiments reveal different methods of obtaining an optimal number of components for the data, one of which is to study the retained variance and accuracy of the PCA transformation against model performance for a given number of components. One conclusion of the work is that ensembled LSTM networks can prove very powerful, but that alternatives to the pre-processing, such as categorical embedding instead of PCA, should be explored.
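The component-selection idea mentioned above — study retained variance as a function of the number of PCA components — can be sketched in plain numpy. The synthetic factor-structured "transaction" data and the 95% variance target are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
# Three latent factors observed through 12 noisy features (4 per factor):
# a stand-in for correlated transaction features.
latent = rng.normal(size=(1000, 3))
mixing = np.kron(np.eye(3), np.ones((1, 4)))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 12))

# PCA via SVD of the centred data; keep the smallest number of components
# whose cumulative explained variance reaches 95%.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)
n_components = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
reduced = Xc @ Vt[:n_components].T   # data expressed in the retained basis
```

The retained-variance curve jumps sharply at the true factor dimension, which is the balance point the thesis describes: enough components to keep the information, few enough to keep the downstream model simple.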
22

Van, Deventer Petrus Jacobus Uys. "Outliers, influential observations and robust estimation in non-linear regression analysis and discriminant analysis." Doctoral thesis, University of Cape Town, 1993. http://hdl.handle.net/11427/4363.

23

Arruda, Gabriel Domingos de. "Análise de viés em notícias na língua portuguesa." Universidade de São Paulo, 2015. http://www.teses.usp.br/teses/disponiveis/100/100131/tde-10012016-144315/.

Abstract:
The project described here proposes a model to study bias on newswire texts, related to political entities. Three types of bias are analysed: selection bias, which refers to the amount of times an entity is referenced by the media outlet; coverage bias, which assesses the amount of coverage given to an entity and, finally, the assertion bias, which analyses whether the news is a positive or negative report of an entity. To accomplish this, a corpus was systematically built by extracting news from 5 different newswires. These texts were manually classified according to their polarity alignment and associated entity. Sentiment Analysis techniques were applied and evaluated using the corpus. Based on the concept of outliers, a methodology for bias detection was created. Bias was analysed using the proposed methodology on the generated corpus for candidates to the government of the state of São Paulo and to presidency, being identified in two newswires for the three above-defined types
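The outlier-based bias identification described above can be illustrated with a simple sketch: an outlet whose indicator metric is an IQR-rule outlier is flagged as potentially biased. The metric and the numbers below are hypothetical, not taken from the thesis:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v < lo or v > hi]

# hypothetical selection-bias indicator: share of articles mentioning a candidate
mention_share = [0.12, 0.15, 0.14, 0.13, 0.45]   # the fifth outlet stands out
biased = iqr_outliers(mention_share)             # -> [4]
```

The same rule can be applied per entity and per bias metric to flag outlet/entity pairs.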
APA, Harvard, Vancouver, ISO, and other styles
24

Balasubramanian, Vijay. "Variance reduction and outlier identification for IDDQ testing of integrated chips using principal component analysis." Texas A&M University, 2006. http://hdl.handle.net/1969.1/4766.

Full text
Abstract:
Integrated circuits manufactured in current technology consist of millions of transistors with dimensions shrinking into the nanometer range. These small transistors have quiescent (leakage) currents that are increasingly sensitive to process variations, which have increased the variation in good-chip quiescent current and consequently reduced the effectiveness of IDDQ testing. This research proposes the use of a multivariate statistical technique known as principal component analysis for the purpose of variance reduction. Outlier analysis is applied to the reduced leakage current values as well as the good chip leakage current estimate, to identify defective chips. The proposed idea is evaluated using IDDQ values from multiple wafers of an industrial chip fabricated in 130 nm technology. It is shown that the proposed method achieves significant variance reduction and identifies many outliers that escape identification by other established techniques. For example, it identifies many of the absolute outliers in bad neighborhoods, which are not detected by Nearest Neighbor Residual and Nearest Current Ratio. It also identifies many of the spatial outliers that pass when using Current Ratio. The proposed method also identifies both active and passive defects.
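The variance-reduction-then-outlier-analysis idea summarized above can be sketched roughly as follows: project out the leading principal components (shared process variation), then flag chips whose residual is a robust z-score outlier. This is a minimal illustration on synthetic data; the thresholds and single-component removal are assumptions, not the thesis's exact procedure:

```python
import numpy as np

def residual_outliers(X, n_remove=1, z_thresh=3.5):
    """Project out the top `n_remove` principal components (shared
    process variation), then flag rows whose residual norm is a
    robust z-score outlier (median/MAD rule)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:n_remove]
    resid = Xc - (Xc @ top.T) @ top          # remove shared variation
    score = np.linalg.norm(resid, axis=1)
    med = np.median(score)
    mad = max(np.median(np.abs(score - med)), 1e-12)
    z = 0.6745 * (score - med) / mad
    return np.where(np.abs(z) > z_thresh)[0]

rng = np.random.default_rng(1)
common = 3.0 * rng.normal(size=(200, 1)) @ np.ones((1, 8))  # shared process variation
X = common + 0.2 * rng.normal(size=(200, 8))
X[7, 0] += 4.0                              # a "defective chip" off the common trend
flagged = residual_outliers(X)
```

After the common component is removed, the defect stands out against a much smaller residual spread.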
APA, Harvard, Vancouver, ISO, and other styles
25

Chen, Feng. "Efficient Algorithms for Mining Large Spatio-Temporal Data." Diss., Virginia Tech, 2013. http://hdl.handle.net/10919/19220.

Full text
Abstract:
Knowledge discovery on spatio-temporal datasets has attracted
growing interest. Recent advances in remote sensing technology mean
that massive amounts of spatio-temporal data are being collected,
and their volume keeps increasing at an ever faster pace. It becomes
critical to design efficient algorithms for identifying novel and
meaningful patterns from massive spatio-temporal datasets. Different
from the other data sources, this data exhibits significant
space-time statistical dependence, and the assumption of i.i.d. is
no longer valid. Exact modeling of space-time dependence leads to
exponential growth of model complexity as the data size increases.
This research focuses on the construction of efficient
and effective approaches using approximate inference techniques for
three main mining tasks, including spatial outlier detection, robust
spatio-temporal prediction, and novel applications to real world
problems.

Spatial novelty patterns, or spatial outliers, are those data points
whose characteristics are markedly different from their spatial
neighbors. There are two major branches of spatial outlier detection
methodologies, which can be either global Kriging based or local
Laplacian smoothing based. The former approach requires exact
modeling of spatial dependence, which is computationally expensive; the
latter approach requires the i.i.d. assumption on the smoothed
observations, which is not statistically sound. These two approaches
are constrained to numerical data, but in real world applications we
are often faced with a variety of non-numerical data types, such as
count, binary, nominal, and ordinal. To summarize, the main research
challenges are: 1) how much spatial dependence can be eliminated via
Laplace smoothing; 2) how to effectively and efficiently detect
outliers for large numerical spatial datasets; 3) how to generalize
numerical detection methods and develop a unified outlier detection
framework suitable for large non-numerical datasets; 4) how to
achieve accurate spatial prediction even when the training data has
been contaminated by outliers; 5) how to deal with spatio-temporal
data for the preceding problems.

To address the first and second challenges, we mathematically
validated the effectiveness of Laplacian smoothing on the
elimination of spatial autocorrelations. This work provides
fundamental support for existing Laplacian smoothing based methods.
We also discovered a nontrivial side-effect of Laplacian smoothing,
which injects additional spatial variation into the data due to
convolution effects. To capture this extra variability, we proposed
a generalized local statistical model, and designed two fast forward
and backward outlier detection methods that achieve a better balance
between computational efficiency and accuracy than most existing
methods, and are well suited to large numerical spatial datasets.

We addressed the third challenge by mapping non-numerical variables
to latent numerical variables via a link function, such as logit
function used in logistic regression, and then utilizing
error-buffer artificial variables, which follow a Student-t
distribution, to capture the large deviations caused by outliers. We
proposed a unified statistical framework, which integrates the
advantages of spatial generalized linear mixed model, robust spatial
linear model, reduced-rank dimension reduction, and Bayesian
hierarchical model. A linear-time approximate inference algorithm
was designed to infer the posterior distribution of the error-buffer
artificial variables conditioned on observations. We demonstrated
that traditional numerical outlier detection methods can be directly
applied to the estimated artificial variables for outlier
detection. To the best of our knowledge, this is the first
linear-time outlier detection algorithm that supports a variety of
spatial attribute types, such as binary, count, ordinal, and
nominal.

To address the fourth and fifth challenges, we proposed a robust
version of the Spatio-Temporal Random Effects (STRE) model, namely
the Robust STRE (R-STRE) model. The regular STRE model is a recently
proposed statistical model for large spatio-temporal data that has a
linear order time complexity, but is not best suited for
non-Gaussian and contaminated datasets. This deficiency can be
systematically addressed by increasing the robustness of the model
using heavy-tailed distributions, such as the Huber, Laplace, or
Student-t distribution to model the measurement error, instead of
the traditional Gaussian. However, the resulting R-STRE model
becomes analytically intractable, and direct application of
approximate inference techniques still has a cubic order time
complexity. To address the computational challenge, we reformulated
the prediction problem as a maximum a posteriori (MAP) problem with a
non-smooth objective function, transformed it to an equivalent
quadratic programming problem, and developed an efficient
interior-point numerical algorithm with a near linear order
complexity. This work presents the first near linear time robust
prediction approach for large spatio-temporal datasets in both
offline and online cases.
Ph. D.
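A minimal sketch of the local, Laplacian-smoothing-style spatial outlier idea discussed above: each grid cell is compared against the average of its 4-neighbours and extreme residuals are flagged. The neighbourhood definition and threshold are illustrative assumptions:

```python
import numpy as np

def spatial_outliers(grid, z_thresh=3.0):
    """Flag cells whose deviation from the mean of their 4-neighbours
    is extreme -- a simplified local, Laplacian-smoothing-style
    spatial outlier detector."""
    g = np.asarray(grid, dtype=float)
    p = np.pad(g, 1, mode='edge')
    neigh = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
    resid = g - neigh                       # smoothed residuals
    z = (resid - resid.mean()) / resid.std()
    return list(zip(*np.where(np.abs(z) > z_thresh)))

rng = np.random.default_rng(2)
field = rng.normal(size=(20, 20))           # spatially flat noise field
field[5, 5] += 10.0                         # one injected spatial outlier
hits = spatial_outliers(field)              # contains (5, 5)
```

Smoothing removes the locally shared signal, so a cell markedly different from its neighbours is exposed in the residuals.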
APA, Harvard, Vancouver, ISO, and other styles
26

Åkerberg, Ludvig. "Using Unsupervised Machine Learning for Outlier Detection in Data to Improve Wind Power Production Prediction." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-200336.

Full text
Abstract:
The expansion of wind power for electrical energy production has increased in recent years and shows no signs of slowing down. This unpredictable source of energy has contributed to destabilization of the electrical grid, causing energy market prices to vary significantly on a daily basis. For energy producers and consumers to make good investments, methods have been developed to predict wind power production. These methods are often based on machine learning, where historical weather prognoses and wind power production data are used. However, the data often contain outliers, causing the machine learning methods to produce inaccurate predictions. The goal of this Master's Thesis was to identify and remove these outliers from the data so that the accuracy of machine learning predictions can improve. To do this, an outlier detection method using unsupervised clustering has been developed, and research has been conducted on using machine learning for outlier detection and wind power production prediction.
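The unsupervised-clustering outlier detector described above can be sketched as follows, assuming a k-means-style approach (the thesis's exact method may differ); points unusually far from their own centroid are flagged:

```python
import numpy as np

def kmeans_outliers(X, k=2, z_thresh=3.0, iters=20, seed=0):
    """Cluster with a minimal k-means, then flag points whose distance
    to their own centroid exceeds mean + z_thresh * std of all such
    distances."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    dist = np.linalg.norm(X - centers[labels], axis=1)
    cut = dist.mean() + z_thresh * dist.std()
    return np.where(dist > cut)[0]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),     # cluster A
               rng.normal(5.0, 0.3, (100, 2)),     # cluster B
               [[2.5, 2.5]]])                      # a point between the clusters
out = kmeans_outliers(X)                           # flags index 200
```

The flagged points would then be removed before training the production-prediction model.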
APA, Harvard, Vancouver, ISO, and other styles
27

Kuzmak, Barbara R. "An examination of outliers and interaction in a nonreplicated two-way table." Diss., Virginia Tech, 1990. http://hdl.handle.net/10919/37747.

Full text
Abstract:
The additive-plus-multiplicative model, Y_{ij} = μ + α_i + β_j + Σ_{p=1}^{k} λ_p τ_{pi} γ_{pj}, has been used to describe multiplicative interaction in an unreplicated experiment. Outlier effects often appear as interaction in a two-way analysis of variance with one observation per cell. I use this model in the same setting to study outliers. In data sets with significant interaction, one may be interested in determining whether the cause of the interaction is a true interaction, outliers, or both. I develop a new technique which shows how outliers can be distinguished from interaction when there are simple outliers in a two-way table. Several examples illustrating the use of this model to describe outliers and interaction are presented. I briefly address the topics of leverage and influence. Leverage measures the impact a change in an observation has on fitted values, whereas influence evaluates the effect deleting an observation has on model estimates. I extend the leverage tables for an additive-plus-multiplicative model of rank 1 to a rank k model. Several examples studying influence in a two-way nonreplicated table are given.
Ph. D.
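The additive-plus-multiplicative model above has a standard least-squares fit: grand, row and column means give the additive part, and the leading singular vectors of the interaction residuals give the multiplicative terms. A sketch on a toy table with an exact rank-1 interaction:

```python
import numpy as np

def additive_plus_multiplicative(Y, rank=1):
    """Least-squares fit of Y_ij = mu + a_i + b_j + sum_p l_p t_pi g_pj:
    grand, row and column means give the additive part; the leading
    singular vectors of the interaction residuals give the
    multiplicative part."""
    mu = Y.mean()
    a = Y.mean(axis=1) - mu                     # row effects
    b = Y.mean(axis=0) - mu                     # column effects
    R = Y - mu - a[:, None] - b[None, :]        # interaction residuals
    U, s, Vt = np.linalg.svd(R)
    fit = mu + a[:, None] + b[None, :] + (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return fit, s[:rank]

# toy two-way table containing an exact rank-1 interaction
i = np.arange(5)[:, None]
j = np.arange(4)[None, :]
Y = 10 + 2 * i + 3 * j + 1.5 * (i - 2) * (j - 1.5)
fit, lam = additive_plus_multiplicative(Y, rank=1)   # reproduces Y exactly
```

On real tables, the question studied in the thesis is whether the recovered multiplicative term reflects genuine interaction or merely a few outlying cells.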
APA, Harvard, Vancouver, ISO, and other styles
28

He, Tian Ying. "Outline-based image content analysis using partial signature." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp03/MQ40415.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Masood, Adnan. "Measuring Interestingness in Outliers with Explanation Facility using Belief Networks." NSUWorks, 2014. http://nsuworks.nova.edu/gscis_etd/232.

Full text
Abstract:
This research explores the potential of improving the explainability of outliers using Bayesian Belief Networks as background knowledge. Outliers are deviations from the usual trends of data. Mining outliers may help discover potential anomalies and fraudulent activities. Meaningful outliers can be retrieved and analyzed by using domain knowledge. Domain knowledge (or background knowledge) is represented using probabilistic graphical models such as Bayesian belief networks. Bayesian networks are graph-based representations used to model and encode mutual relationships between entities. Due to their probabilistic graphical nature, belief networks are an ideal way to capture the sensitivity, causal inference, uncertainty and background knowledge in real-world data sets. Bayesian networks effectively present the causal relationships between different entities (nodes) using conditional probability. This probabilistic relationship shows the degree of belief between entities. A quantitative measure which computes changes in this degree of belief acts as a sensitivity measure. The first contribution of this research is enhancing the performance of sensitivity measurement based on earlier research work, the Interestingness Filtering Engine Miner algorithm. The algorithm developed (IBOX - Interestingness based Bayesian outlier eXplainer) provides progressive improvement over the performance and sensitivity scoring of earlier works. Earlier approaches compute sensitivity by measuring divergence among conditional probabilities of training and test data, while using only a couple of probabilistic interestingness measures, such as mutual information and support, to calculate belief sensitivity. With ingrained support from the literature as well as quantitative evidence, IBOX provides a framework that uses multiple interestingness measures, resulting in better performance and improved sensitivity analysis.
The results provide improved performance, and therefore explainability of rare class entities. This research quantitatively validated probabilistic interestingness measures as an effective sensitivity analysis technique in rare class mining. This results in a novel, original, and progressive research contribution to the areas of probabilistic graphical models and outlier analysis.
APA, Harvard, Vancouver, ISO, and other styles
30

Zhou, Bin. "Computational Analysis of LC-MS/MS Data for Metabolite Identification." Thesis, Virginia Tech, 2011. http://hdl.handle.net/10919/36109.

Full text
Abstract:
Metabolomics aims at the detection and quantitation of metabolites within a biological system. As the most direct representation of phenotypic changes, metabolomics is an important component of systems biology research. Recent developments in high-resolution, high-accuracy mass spectrometers enable the simultaneous study of hundreds or even thousands of metabolites in one experiment. Liquid chromatography-mass spectrometry (LC-MS) is a commonly used instrument for metabolomic studies due to its high sensitivity and broad coverage of the metabolome. However, the identification of metabolites remains a bottleneck for current metabolomic studies. This thesis focuses on utilizing computational approaches to improve the accuracy and efficiency of metabolite identification in LC-MS/MS-based metabolomic studies. First, an outlier screening approach is developed to identify LC-MS runs with low analytical quality, so they will not adversely affect the identification of metabolites. The approach is computationally simple but effective, and does not depend on any preprocessing approach. Second, an integrated computational framework is proposed and implemented to improve the accuracy of metabolite identification and to prioritize the multiple putative identifications of one peak in LC-MS data. Through the framework, peaks are more likely to receive appropriate putative identifications, and important guidance for metabolite verification is provided by prioritizing the putative identifications. Third, an MS/MS spectral matching algorithm is proposed based on support vector machine classification. The approach provides improved retrieval performance in spectral matching, especially in the presence of data heterogeneity due to different instruments or experimental settings used during MS/MS spectra acquisition.
Master of Science
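The run-level outlier screening step described above can be sketched as follows; the abstract does not give the exact criterion, so this illustration assumes a simple one: flag runs whose median correlation with the other runs is low:

```python
import numpy as np

def screen_runs(intensity, min_corr=0.8):
    """Columns of `intensity` are LC-MS runs, rows are feature
    intensities.  Flag runs whose median Pearson correlation with the
    other runs falls below `min_corr`."""
    C = np.corrcoef(intensity.T)        # run-by-run correlation matrix
    np.fill_diagonal(C, np.nan)
    return np.where(np.nanmedian(C, axis=1) < min_corr)[0]

rng = np.random.default_rng(4)
profile = rng.gamma(2.0, 10.0, size=300)                 # shared metabolite profile
runs = profile[:, None] + rng.normal(0.0, 1.0, (300, 6))
runs[:, 5] = rng.gamma(2.0, 10.0, size=300)              # one unrelated, poor-quality run
bad = screen_runs(runs)                                  # flags run 5
```

Because a healthy batch of runs shares most of its signal, a run that barely correlates with the rest is a natural candidate for exclusion.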
APA, Harvard, Vancouver, ISO, and other styles
31

Kaltenbach, Kelley J. "Analysis of magnetic anomalies in determining fault displacement in the crystalline Precambrian basement underneath the Bellefontaine Outlier, Ohio /." Connect to resource, 1998. http://hdl.handle.net/1811/28551.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Fidler, Michael L. "Three dimensional digital analysis of 2,500 square kilometers of gravity and magnetic survey data, Bellefontaine Outlier area, Ohio /." Columbus, Ohio : Ohio State University, 2003. http://hdl.handle.net/1811/6110.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Kanneganti, Raghuveer. "CLASSIFICATION OF ONE-DIMENSIONAL AND TWO-DIMENSIONAL SIGNALS." OpenSIUC, 2014. https://opensiuc.lib.siu.edu/dissertations/892.

Full text
Abstract:
This dissertation focuses on the classification of one-dimensional and two-dimensional signals. The one-dimensional signal classification problem involves the classification of brain signals for identifying the emotional responses of human subjects under given drug conditions. A strategy is developed to accurately classify ERPs in order to identify human emotions based on brain reactivity to emotional, neutral, and cigarette-related stimuli in smokers. A multichannel spatio-temporal model is employed to overcome the curse of dimensionality that plagues the design of parametric multivariate classifiers for multi-channel ERPs. The strategy is tested on the ERPs of 156 smokers who participated in a smoking cessation program. One half of the subjects were given nicotine patches and the other half were given placebo patches. ERPs were collected from 29 channels in response to the presentation of pictures with emotional (pleasant and unpleasant), neutral/boring, and cigarette-related content. It is shown that human emotions can be classified accurately and the results also show that smoking cessation causes a drop in the classification accuracies of emotions in the placebo group, but not in the nicotine patch group. Given that individual brain patterns were compared with group average brain patterns, the findings support the view that individuals tend to have similar brain reactions to different types of emotional stimuli. Overall, this new classification approach to identify differential brain responses to different emotional types could lead to new knowledge concerning brain mechanisms associated with emotions common to most or all people. This novel classification technique for identifying emotions in the present study suggests that smoking cessation without nicotine replacement results in poorer differentiation of brain responses to different emotional stimuli. 
Future directions in this area would be to use these methods to assess individual differences in responses to emotional stimuli and to different drug treatments. Advantages of this and other brain-based assessments include temporal precision (e.g., 400-800 ms post stimulus) and the elimination of biases related to self-report measures. The two-dimensional signal classification problems include the detection of graphite in testing documents and the detection of fraudulent bubbles in test sheets. A strategy is developed to detect graphite responses in optical mark recognition (OMR) documents using inexpensive visible light scanners. The main challenge in the formulation of the strategy is that the detection should be invariant to the numerous background colors and artwork in typical optical mark recognition documents. A test document is modeled as a superposition of a graphite response image and a background image. The background image in turn is modeled as a superposition of screening artwork, lines, and machine text components. A sequence of image processing operations and a pattern recognition algorithm are developed to estimate the graphite response image from a test document by systematically removing the components of the background image. The proposed strategy is tested on a wide range of scanned documents, and it is shown that the estimated graphite response images are visually similar to those scanned by very expensive infra-red scanners currently employed for optical mark recognition. The robustness of the detection strategy is also demonstrated by testing a large number of simulated test documents. A procedure is also developed to autonomously determine whether cheating has occurred by detecting the presence of aberrant responses in scanned OMR test books. The challenges introduced by the significant imbalance in the numbers of typical and aberrant bubbles were identified. The aberrant bubble detection problem is formulated as an outlier detection problem. 
A feature-based outlier detection procedure in conjunction with a one-class SVM classifier is developed. A multi-criteria rank-of-rank-sum technique is introduced to rank and select a subset of features from a pool of candidate features. Using the data set of 11 individuals, it is shown that a detection accuracy of over 90% is possible. Experiments conducted on three real test books flagged for suspected cheating showed that the proposed strategy has the potential to be deployed in practice.
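The multi-criteria rank-of-rank-sum feature selection mentioned above can be sketched as follows (assuming higher scores are better; the criteria and scores are hypothetical):

```python
import numpy as np

def rank_of_rank_sum(criteria):
    """criteria: (n_features, n_criteria) array, higher = better.
    Rank the features under each criterion, sum the ranks, then rank
    the sums (0 = best) -- the ranking used to pick a feature subset."""
    per_criterion = (-criteria).argsort(axis=0).argsort(axis=0)
    return per_criterion.sum(axis=1).argsort().argsort()

scores = np.array([[0.9, 0.9],    # feature 0: strong on both criteria
                   [0.5, 0.8],    # feature 1: middling
                   [0.1, 0.1]])   # feature 2: weak on both
final = rank_of_rank_sum(scores)  # -> ranks [0, 1, 2]
```

Summing ranks rather than raw scores makes the selection robust to criteria measured on different scales.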
APA, Harvard, Vancouver, ISO, and other styles
34

Gracie, Christina. "Bayesian analysis of agricultural treatment effects in the presence of a fertility trend and outliers." Thesis, Gracie, Christina (2005) Bayesian analysis of agricultural treatment effects in the presence of a fertility trend and outliers. Honours thesis, Murdoch University, 2005. https://researchrepository.murdoch.edu.au/id/eprint/40843/.

Full text
Abstract:
This thesis studies the Bayesian approach to making inferences, assisted by the use and study of the Markov Chain Monte Carlo (MCMC) approach and accompanied by the implementation of some models using the Bayesian program WinBUGS. The models investigated in this thesis were based on a model used by Taplin and Raftery (1994). These models were concerned with estimating treatment effects in the presence of a fertility trend and outliers, where the response variable was crop yield and the other parameters were treatment effects, fertility effects and an error term. The thesis includes a brief review of some underlying principles concerning Bayesian analysis (Section 1) and MCMC (Section 2). It also includes a review of some literature related to the Bayesian estimation of treatment effects for agricultural data and techniques for accommodating outliers (Section 3). The practical section of the thesis (Section 4) was the estimation of treatment effects for some data. WinBUGS code was written and its components verified and explored. Further research into the robustness of two techniques for accommodating outliers was also carried out, namely a comparison between the variance inflation technique and allowing the response variable to follow a Student's t-distribution.
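The two outlier-accommodation techniques compared above differ in how they treat large residuals. Under a Student-t error model, the implicit iteratively-reweighted-least-squares weight w(r) = (ν+1)/(ν + r²) automatically downweights outliers, whereas a Gaussian model weights every residual equally. A minimal illustration (ν = 4 is an arbitrary choice, not from the thesis):

```python
def t_weight(r, nu=4.0):
    """Implicit IRLS weight of a residual r under a Student-t error
    model: w(r) = (nu + 1) / (nu + r**2).  Large residuals are
    automatically downweighted; under a Gaussian model every residual
    would get weight 1."""
    return (nu + 1.0) / (nu + r ** 2)

small, large = t_weight(0.5), t_weight(5.0)   # the outlier's weight is far smaller
```

Variance inflation achieves a similar effect by assigning suspected outliers a larger error variance instead of a heavier-tailed likelihood.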
APA, Harvard, Vancouver, ISO, and other styles
35

Anderson, Cynthia 1962. "A Comparison of Five Robust Regression Methods with Ordinary Least Squares: Relative Efficiency, Bias and Test of the Null Hypothesis." Thesis, University of North Texas, 2001. https://digital.library.unt.edu/ark:/67531/metadc5808/.

Full text
Abstract:
A Monte Carlo simulation was used to generate data for a comparison of five robust regression estimation methods with ordinary least squares (OLS) under 36 different outlier data configurations. Two of the robust estimators, Least Absolute Value (LAV) estimation and MM estimation, are commercially available. Three author-modified variations on MM were also included (MM1, MM2, and MM3). Design parameters that were varied include sample size (n=60 and n=180), number of independent predictor variables (2, 3 and 6), outlier density (0%, 5% and 15%) and outlier location (2sx,2sy; 8sx,8sy; 4sx,8sy; and 8sx,4sy). Criteria on which the regression methods were measured are relative efficiency, bias and a test of the null hypothesis. Results indicated that MM2 was the best performing robust estimator on relative efficiency. The best performing estimator on bias was MM1. The best performing regression method on the test of the null hypothesis was MM2. Overall, the MM-type robust regression methods outperformed OLS and LAV on relative efficiency, bias, and the test of the null hypothesis.
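Of the estimators compared above, LAV minimizes absolute rather than squared residuals. A rough sketch of LAV fitting via iteratively reweighted least squares on contaminated synthetic data (production LAV solvers use linear programming; the data and tolerances here are illustrative):

```python
import numpy as np

def lav_regression(x, y, iters=50, eps=1e-6):
    """Least Absolute Value (L1) regression via iteratively reweighted
    least squares: weight 1/|residual| approximates the L1 objective.
    (Production LAV solvers use linear programming instead.)"""
    A = np.column_stack([np.ones(len(x)), x])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]          # start from OLS
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(y - A @ beta), eps)  # L1-style weights
        beta = np.linalg.solve(A.T @ (A * w[:, None]), A.T @ (y * w))
    return beta

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 10.0, 80)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, 80)
y[:8] += 40.0                      # 10% gross outliers
beta = lav_regression(x, y)        # stays close to the true (2, 3)
```

Unlike OLS, whose fit is dragged upward by the contaminated points, the L1 objective caps each outlier's influence at its sign.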
APA, Harvard, Vancouver, ISO, and other styles
36

Liu, Yan. "Documenting the impact of outliers on decisions about the number of factors in exploratory factor analysis." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/39802.

Full text
Abstract:
The overall purpose of this dissertation is to investigate how outliers affect the decisions about the number of factors in exploratory factor analysis (EFA) as determined by four widely used and/or highly recommended methods. Very few studies have looked into this issue in the literature and their conclusions are contradictory, i.e., studies disagree as to whether outliers result in extra factors or a reduced number of factors. For this dissertation I systematically studied the impact of outliers arising from different sources and matched outlier simulation models with different types of outliers. Chapter 1 provides an overview of the gap between statistical theory regarding outliers and researchers' day-to-day practice and their understanding of the effects of outliers. Chapter 2 presents a review of EFA with an emphasis on the four commonly used or highly recommended decision methods on the number of factors as well as a review of outliers, which includes the sources of outliers and problems of outliers in factor analysis. Chapter 3 examines the effects of outliers arising from errors using the deterministic and slippage models. The results revealed that outliers can inflate, deflate, or have no effect on the decisions about the number of factors, depending on the decision method used and the magnitude and number of outliers. Chapter 4 investigates the effects of outliers arising from an unintended and unknowingly included subpopulation using the mixture contamination model. The general conclusions are similar to Chapter 3, but Chapter 4 also reveals that symmetric and asymmetric contamination have different effects on different decision methods and that the effects of outliers do not depend on sample size. Chapter 5 provides a general discussion of the findings of this dissertation, describes four novel contributions, and points out the limitations of the present research as well as future research directions. 
This dissertation aims to bridge the gap between day-to-day researchers' practice and understanding of the effects of outliers and current outlier research that emphasizes robust statistics. The findings of this dissertation address the contradictory conclusions made in previous studies.
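The effect of a contaminating subpopulation on a number-of-factors decision can be illustrated with one of the simpler decision rules, the Kaiser eigenvalue-greater-than-one criterion (the dissertation studies four such methods; this sketch and its data are illustrative only):

```python
import numpy as np

def kaiser_n_factors(X):
    """Kaiser criterion: count eigenvalues of the correlation matrix
    that exceed 1 (one of several number-of-factors decision rules)."""
    return int((np.linalg.eigvalsh(np.corrcoef(X.T)) > 1.0).sum())

rng = np.random.default_rng(6)
n = 500
f1, f2 = rng.normal(size=n), rng.normal(size=n)
# six variables loading cleanly on two factors
X = np.column_stack([f1, f1, f1, f2, f2, f2]) + 0.5 * rng.normal(size=(n, 6))
X_out = X.copy()
X_out[:25] += 10.0        # 5% of rows from an unintended subpopulation
```

Here the shared mean shift inflates all inter-variable correlations into one dominant component, so the criterion deflates from two factors to one.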
APA, Harvard, Vancouver, ISO, and other styles
37

Sun, Yi. "New matching algorithm -- Outlier First Matching (OFM) and its performance on Propensity Score Analysis (PSA) under new Stepwise Matching Framework (SMF)." Thesis, State University of New York at Albany, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3633233.

Full text
Abstract:

An observational study is an empirical investigation of treatment effect when randomized experimentation is not ethical or feasible (Rosenbaum 2009). Observational studies are common in real life for the following reasons: a) randomization is not feasible for ethical or financial reasons; b) data are collected from surveys or other resources where the object and design of the study have not been determined (e.g. a retrospective study using administrative records); c) little is known about the given region, so some preliminary studies of observational data are conducted to formulate hypotheses to be tested in subsequent experiments. When statistical analyses are done using observational studies, the following issues need to be considered: a) the lack of randomization may lead to a selection bias; b) representativeness of sampling with respect to the problem under consideration (e.g. a study of factors influencing a rare disease using a survey that is nationally representative with respect to race, income, and gender but not with respect to the rare disease condition). We will use the following example to illustrate the challenges of observational studies and possible mitigation measures.

Our example is based on the study by Lalonde (1986), which evaluated the impact of job training on the earnings improvement of low-skilled workers in the 1970's (in Paper 1, Section 1.5.2, we discuss this data set in more detail). The treatment effect estimated from the observational study was quite different from the one obtained using the baseline randomized "National Supported Work (NSW) Experiment" carried out in the mid-1970's. Here the treatment effect is the impact of job training. Selection bias may contaminate the treatment effect; in other words, workers who receive the job training may be fundamentally different from those who do not. Furthermore, the sample of control subjects selected for the observational study by Lalonde may not represent the control group from the original NSW experiment.

In this study, we address the issue of lack of randomization by applying a new matching algorithm (Outlier First Matching, OFM), which can be used in conjunction with Propensity Score Analysis (PSA) or other similar methods to obtain convincing treatment effect estimates in observational studies.

This dissertation consists of three papers.

Paper 1 proposes a new "Stepwise Matching Framework (SMF)" and rationalizes its usage in causal inference study (especially for PSA study using observational data). Furthermore, under the new framework of SMF, one new matching algorithm (Outlier First Matching or OFM in short) will be introduced. Its performance along with other well-known matching algorithms will be studied using the cross sectional data.

Paper 2 extends the methods of Paper 1 to correlated data (especially longitudinal data). With correlated data (e.g. longitudinal data), besides the selection bias present in cross-sectional observational data, the repeated measures introduce between-subject and within-subject correlation. Furthermore, the repeated measures can also introduce missing values and rolling enrollment. All of the above challenges complicate the data structure and need to be addressed using more complex models and methodology. Our methodology calculates the time-varying propensity score of control subjects at each time point and generates the propensity score difference from each control subject to every treatment subject at the treatment subject's time point. These propensity score differences are then summarized to create the distance matrix for the next step of the analysis. Once again, the performance of OFM and other well-established matching algorithms are compared side by side, and the conclusions are summarized through simulation and real data applications.

Paper 3 handles the missing value problem in longitudinal data. As mentioned in Paper 2, the complexity of longitudinal data often comes with the problem of missing data. Because of possible between-subject and within-subject correlation, traditional imputation methods that ignore these two correlations may lead to biased or inefficient imputation of missing data. We adopt a missing value imputation strategy introduced by Schafer and Yucel (2002), via the R package "pan", to handle these two correlations. The "imputed complete data" are then treated using a methodology similar to that of Paper 2, and the multiple imputation results are summarized using Rubin's rules (1987). Conclusions are drawn from the simulation study findings and compared to what we found in the complete longitudinal data study in Paper 2.

In the last section, we conclude the dissertation with a discussion of the preliminary results, as well as the strengths and limitations of the present research. We will also point out directions for future study and provide suggestions for practice.

APA, Harvard, Vancouver, ISO, and other styles
38

Patyk, Sylwia. "Forces analysis rolling burnishing rough surfaces with triangular outlines asperity : PhD thesis summary." Rozprawa doktorska, [s.n.], 2015. http://dlibra.tu.koszalin.pl/Content/1059.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Labaš, Dominik. "Analýza metod pro detekci odlehlých hodnot." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2021. http://www.nusl.cz/ntk/nusl-445527.

Full text
Abstract:
The topic of this thesis is the analysis of methods for outlier detection. First, outliers and various methods for detecting them are described, followed by the data sets selected for testing those methods. Next, an application design for analyzing the described methods is presented, together with the technologies that provide models for them, and the implementation is described in more detail. Subsequently, the results of the experiments, which form the main part of this thesis, are presented; the results are evaluated and the individual models compared with each other. Lastly, a method for accelerating outlier detection is demonstrated.
APA, Harvard, Vancouver, ISO, and other styles
40

Aldas, Cem Nuri. "An Analysis Of Peculiarity Oriented Interestingness Measures On Medical Data." Master's thesis, METU, 2008. http://etd.lib.metu.edu.tr/upload/12609856/index.pdf.

Full text
Abstract:
Peculiar data are patterns that are significantly distinguishable from other records and relatively few in number; they are accepted as one of the most striking aspects of the interestingness concept. In the clinical domain, peculiar records are likely signals of malignancy or a disorder calling for immediate intervention. Investigating the rules and mechanisms that lie behind these records would be a meaningful contribution to improved clinical decision support systems. In order to discover the most interesting records and patterns, many peculiarity oriented interestingness measures, each fulfilling a specific requirement, have been developed. In this thesis, the well-known peculiarity oriented interestingness measures Local Outlier Factor (LOF), Cluster Based Local Outlier Factor (CBLOF) and Record Peculiar Factor (RPF) are compared. The insights derived from the theoretical infrastructure of the algorithms were evaluated through experiments on synthetic and real-world medical data. The results are discussed from the interestingness perspective, and some departure points for building a more developed methodology for knowledge discovery in databases are proposed.
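Of the three measures compared in the thesis, LOF is the one with a widely available implementation; CBLOF and RPF are not in scikit-learn. A hedged sketch on synthetic data (not the thesis's experiments):

```python
# Hedged sketch: scoring records with scikit-learn's LocalOutlierFactor.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # dense cluster of records
               np.array([[8.0, 8.0]])])           # one peculiar record

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)              # -1 marks detected outliers
scores = -lof.negative_outlier_factor_   # higher = more peculiar
```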
APA, Harvard, Vancouver, ISO, and other styles
41

Slezak, Thomas Joseph. "Quantitative Morphological Classification of Planetary Craterforms Using Multivariate Methods of Outline-Based Shape Analysis." BYU ScholarsArchive, 2017. https://scholarsarchive.byu.edu/etd/6639.

Full text
Abstract:
Craters formed by impact and volcanic processes are among the most fundamental planetary landforms. This study examines the morphology of diverse craterforms on Io, the Moon, Mars, and Earth using quantitative, outline-based shape analysis and multivariate statistical methods to evaluate the differences between different types of craterforms. Ultimately, this should help establish relationships between the form and origin of craterforms. Developed in the field of geometric morphometrics by the paleontological and biological sciences communities, these methods were used for the analysis of the shapes of crater outlines. The shapes of terrestrial ash-flow calderas, terrestrial basaltic shield calderas, martian calderas, Ionian paterae, and lunar impact craters were quantified and compared. Specifically, we used circularity, ellipticity, elliptic Fourier analysis (EFA), the Zahn and Roskies (Z-R) shape function, and diameter. Quantitative shape descriptors obtained from EFA yield coefficients from the decomposition of the Fourier series that separates the vertical and horizontal components among the outline points for each shape. The shape descriptors extracted from Z-R analysis represent the angular deviation of the shapes from a circle. These quantities were subjected to multivariate statistical analysis, including principal component analysis (PCA) and discriminant analysis, to examine the maximum differences between the a priori established groups. Univariate analyses of morphological quantities including diameter, circularity, and ellipticity, as well as multivariate analyses of elliptic Fourier coefficients and Z-R shape function angular quantities, show that ash-flow calderas and paterae on Io, as well as basaltic shield calderas and martian calderas, are most similar in shape. Other classes of craters are also shown to be statistically distinct from one another. Multivariate statistical models provide successful classification of different types of craters.
Three classification models were built, with overall successful classification rates ranging from 75% to 90%, each conveying different shape information. The EFA model including coefficients from the 2nd to the 10th harmonic was the most successful supervised model, with the highest overall classification rate and the most successful predictive group-membership assignments for the population of examined craterforms. Multivariate statistical methods and classification models can be effective tools for analyzing landforms on planetary surfaces and geologic morphology. With larger data sets used to enhance supervision of the model, more successful classification by the supervised model could likely reveal clues to the formation of landforms and the variables involved in their genesis.
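The simplest of the univariate descriptors named above, circularity, has a standard closed form: 4πA/P², which equals 1 for a perfect circle. A minimal sketch (not the study's code) assuming the outline is given as ordered vertex coordinates:

```python
# Hedged sketch: circularity 4*pi*A/P^2 of a closed outline, using the
# shoelace formula for area and summed edge lengths for perimeter.
import numpy as np

def circularity(x, y):
    """4*pi*area/perimeter^2; equals 1 for a perfect circle."""
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    perim = np.sum(np.hypot(np.roll(x, -1) - x, np.roll(y, -1) - y))
    return 4 * np.pi * area / perim ** 2

theta = np.linspace(0, 2 * np.pi, 400, endpoint=False)
circ = circularity(np.cos(theta), np.sin(theta))      # close to 1 for a circle
square = circularity(np.array([0, 1, 1, 0.0]),
                     np.array([0, 0, 1, 1.0]))        # pi/4 for a unit square
```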
APA, Harvard, Vancouver, ISO, and other styles
42

Wedlake, Ryan Stuart. "Robust principal component analysis biplots." Thesis, Link to the online version, 2008. http://hdl.handle.net/10019/929.

Full text
APA, Harvard, Vancouver, ISO, and other styles
43

Fritsch, Virgile. "High-dimensional statistical methods for inter-subject studies in neuroimaging." Phd thesis, Université Paris Sud - Paris XI, 2013. http://tel.archives-ouvertes.fr/tel-00934695.

Full text
Abstract:
Inter-individual variability is a major obstacle to the analysis of medical images, particularly in neuroimaging. One must distinguish natural or statistical variability, a source of potential effects of interest for diagnosis, from artifactual variability, made up of nuisance effects tied to experimental or technical problems arising during data acquisition or processing. The latter can prove far greater than the former: in neuroimaging, acquisition problems can mask the functional variability that is otherwise associated with a disease, a psychological disorder, or the expression of a specific genetic code. The quality of the statistical procedures used in group studies is then diminished, because those procedures rest on the assumption of a homogeneous population, an assumption that is difficult to verify manually on high-dimensional neuroimaging data. Automatic methods have been implemented to try to eliminate overly deviant subjects and thus make the studied groups more homogeneous. This practice has not fully proven itself, however, since no study has clearly validated it and the tolerance level to be chosen remains arbitrary. Another approach is therefore to use analysis and processing procedures that are intrinsically insensitive to the homogeneity assumption. These are also better suited to real data in that they tolerate, to some extent, other more subtle assumption violations such as data normality.
A related problem is the lack of stability and sensitivity of voxel-level analysis methods, a source of non-reproducible results. We begin this thesis by developing an outlier-detection method suited to neuroimaging data, which provides statistical control over the inclusion of subjects: we propose a regularized version of a robust covariance estimator to make it usable in high dimension. We compare several types of regularization and conclude that random projections offer the best compromise. We also present non-parametric procedures whose good performance we demonstrate, although they offer no statistical control. The second contribution of this thesis is a new approach, called RPBI (Randomized Parcellation Based Inference), addressing the lack of reproducibility of standard methods. We stabilize the parcel-level analysis approach by aggregating several independent analyses, for which the partitioning of the brain into parcels varies from one analysis to another. The method achieves a higher level of sensitivity than state-of-the-art methods, which we demonstrate through experiments on synthetic and real data. Our third contribution is an application of robust regression to neuroimaging studies. Building on existing work, we focus on large-scale studies carried out on more than a hundred subjects. Considering both simulated and real data, we show that the use of robust regression improves the sensitivity of analyses. We demonstrate that it is important to ensure resistance to assumption violations, even in cases where a careful inspection of the dataset has been conducted beforehand.
Finally, we combine robust regression with our RPBI analysis method to obtain even more sensitive statistical tests.
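The robust-covariance outlier detection at the core of the first contribution can be illustrated with scikit-learn's Minimum Covariance Determinant estimator. Note this is the plain (unregularized) estimator, so it stands in for the thesis's high-dimensional variant only in low dimension; the data and cutoff below are illustrative:

```python
# Hedged sketch: robust Mahalanobis distances from the MCD estimator,
# flagging the most deviant "subjects" in a synthetic cohort.
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:5] += 6.0                        # five deviant subjects

mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)             # squared robust Mahalanobis distances
outliers = np.argsort(d2)[-5:]      # the five largest distances
```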
APA, Harvard, Vancouver, ISO, and other styles
44

Bartholomäus, Jenny, Sven Wunderlich, and Zoltán Sasvári. "Identification of Suspicious Semiconductor Devices Using Independent Component Analysis with Dimensionality Reduction." Institute of Electrical and Electronics Engineers (IEEE), 2019. https://tud.qucosa.de/id/qucosa%3A35129.

Full text
Abstract:
In the semiconductor industry the reliability of devices is of paramount importance. Therefore, after removing the defective ones, one wants to detect irregularities in measurement data, because the corresponding devices have a higher risk of failing early in the product lifetime. The paper presents a method to improve the detection of such suspicious devices in which the screening is performed on transformed measurement data, so that, for example, dependencies between tests can be taken into account. Additionally, a new dimensionality reduction is performed within the transformation, so that the reduced and transformed data comprise only the informative content of the raw data. This reduces the complexity of the subsequent screening steps. The new approach is applied to semiconductor measurement data, and it is shown by means of examples how the screening can be improved.
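The transformation-plus-reduction idea can be approximated with FastICA, requesting fewer components than tests; the screening step here (thresholding component scores) is a simplified stand-in for the paper's method, and the data are synthetic:

```python
# Hedged sketch: ICA with n_components < n_features as a combined
# transformation and dimensionality reduction, then a simple score screen.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(7)
# 300 "devices" x 10 correlated "tests", driven by 3 latent factors
base = rng.normal(size=(300, 3))
X = base @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

ica = FastICA(n_components=3, random_state=0)
S = ica.fit_transform(X)            # reduced, independent component scores

# Flag devices whose component scores sit far from the bulk
z = (S - S.mean(axis=0)) / S.std(axis=0)
suspicious = np.where((np.abs(z) > 4).any(axis=1))[0]
```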
APA, Harvard, Vancouver, ISO, and other styles
45

Almeida, Júnior José de. "Detecção de outlier como suporte para o controle estatístico do processo multivariado: um estudo de caso em uma empresa do setor plástico." Universidade Federal da Paraí­ba, 2013. http://tede.biblioteca.ufpb.br:8080/handle/tede/5225.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES
The research project applied a forward search algorithm to aid decision making in multivariate statistical process control in the manufacture of crates at a plastic products company. In addition, principal component analysis (PCA) and the Hotelling T² chart were used to summarize relevant information about this process, producing two results of considerable importance: the scores of the principal components and an adapted Hotelling T² chart highlighting the relationship between the ten variables analyzed. The forward search algorithm detects points discordant from the rest of the data; when such points are very distant or have very different characteristics, they are called outliers. The BACON algorithm was used to detect such occurrences: it starts from a small subset of the original data that is demonstrably free of outliers and keeps adding new non-outlier observations to this initial subset until no further observations can be absorbed. One advantage of this algorithm is that it combats the masking and swamping phenomena that distort the mean and covariance estimates. The research results showed that, for the data set studied, the BACON algorithm detected no discordant points. A simulation was then developed, drawing random numbers from a uniform distribution within a range to modify the mean and standard deviation values, in order to show that the method is effective at detecting outliers. In this simulation, the mean and standard deviation values of 5% of the original data were randomly altered. The result of the simulation showed that the BACON algorithm is perfectly applicable to this case study, and its use is recommended in other processes that depend simultaneously on several variables.
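The BACON forward search described in this abstract can be sketched in a simplified form (Billor, Hadi & Velleman, 2000). This is not the dissertation's code: the initial subset rule and chi-square cutoff below are simplified stand-ins for BACON's corrected versions, and the data are synthetic:

```python
# Hedged sketch of a BACON-style forward search: grow a presumed-clean subset
# by absorbing observations whose Mahalanobis distance falls under a cutoff.
import numpy as np
from scipy import stats

def bacon(X, alpha=0.05, max_iter=50):
    """Return indices of observations deemed non-outliers (simplified BACON)."""
    n, p = X.shape
    # Initial subset: the 4*p observations closest to the coordinate-wise median
    d0 = np.linalg.norm(X - np.median(X, axis=0), axis=1)
    subset = np.argsort(d0)[:4 * p]
    cutoff = stats.chi2.ppf(1 - alpha / n, p)   # Bonferroni-style cutoff
    for _ in range(max_iter):
        mu = X[subset].mean(axis=0)
        inv_cov = np.linalg.inv(np.cov(X[subset].T))
        diff = X - mu
        d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
        new = np.where(d2 <= cutoff)[0]
        if np.array_equal(new, np.sort(subset)):
            break                                # nothing more can be absorbed
        subset = new
    return np.sort(subset)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(150, 2)), [[10.0, 10.0]]])
clean = bacon(X)    # the planted point at index 150 should be excluded
```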
APA, Harvard, Vancouver, ISO, and other styles
46

Lausberg, Isabel. "Kundenpräferenzen für neue Angebotsformen im Einzelhandel eine Analyse am Beispiel von Factory Outlet Centern /." [S.l. : s.n.], 2002. http://d-nb.info/965502074/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Sienkiewicz, Stefan Fareed Abbas. "Five modes of scepticism : an analysis of the Agrippan modes in Sextus Empiricus' Outlines of Pyrrhonism." Thesis, University of Oxford, 2013. http://ora.ox.ac.uk/objects/uuid:2f49a75d-164c-4534-aa9e-9579d55be086.

Full text
Abstract:
This thesis has as its focus five argumentative modes that lie at the heart of Sextus Empiricus’ Outlines of Pyrrhonism. They are the modes of disagreement, hypothesis, infinite regression, reciprocity and relativity. They are analysed, individually, in the first five chapters of the thesis (one mode per chapter) and, collectively, in the sixth. The first four chapters deal, respectively, with the modes of disagreement, hypothesis, infinite regression and reciprocity. They distinguish between two versions of these modes: “dogmatic versions”, on the basis of which a dogmatic philosopher, who holds some theoretical beliefs, might reach a sceptical conclusion; and “sceptical versions”, on the basis of which a sceptical philosopher, who lacks all theoretical beliefs, might do so. It is argued that scholars such as Jonathan Barnes have offered reconstructions of these modes which are dogmatic in the sense just described, and alternative sceptical versions of the modes are presented. A stand-alone fifth chapter offers an analysis of a stand-alone mode - the mode of relativity. It argues that there are in fact three different modes of relativity at play in the Outlines, that only one of them is non-trivial, and that the non-trivial version is incompatible with the mode of disagreement. The sixth and final chapter offers an analysis of how the modes (excluding relativity) are meant to work in combination with one another. Four different combinations are presented and it is argued that all of them are underscored by a variety of theoretical assumptions, which a sceptic, who lacks all theoretical beliefs, cannot make. The ultimate conclusion of the thesis is that, though the sceptic can deploy the various modes individually (by means of exercising his particular sceptical ability), he is not able to systematise them into a net by means of which he might trap his dogmatic opponent. 
Unless specified otherwise, translations are based on Annas, J., and Barnes, J., Sextus Empiricus: Outlines of Scepticism (Cambridge: Cambridge University Press, 2000).
APA, Harvard, Vancouver, ISO, and other styles
48

Tuma, Josef. "Strategická analýza firmy "Marie Tumová" a nástin strategie." Master's thesis, Vysoká škola ekonomická v Praze, 2008. http://www.nusl.cz/ntk/nusl-76982.

Full text
Abstract:
The goal of this diploma thesis is to carry out a strategic analysis of the Marie Tumová Company, including external and internal analyses, to reveal the effect of different factors, give some recommendations, and outline an appropriate strategy.
APA, Harvard, Vancouver, ISO, and other styles
49

Waddle, Ashleigh Danielle. "A Market Analysis for Specialty Beef in Virginia." Thesis, Virginia Tech, 2009. http://hdl.handle.net/10919/32656.

Full text
Abstract:
Virginia beef producers have been overwhelmed by increasing costs and decreasing profits, and face challenges such as development pressure, drought, and increasing competition for grazing land. Together these have reduced opportunities for expansion and often increased incentives for farmers to sell land for non-agricultural use. Nevertheless, opportunities exist in the Virginia beef market. Consumer demand is changing, and consumers are seeking food from alternative production systems based on attributes related to human health, the environment, animal welfare, and other social concerns. Consumers are also interested in increasing their consumption of locally produced foods. Specialty beef such as natural, organic, and pasture-fed beef addresses this changing consumer demand and provides alternatives to commodity beef production. This thesis analyzes the potential for, and the constraints on, specialty beef producers in Virginia selling their beef through alternative market outlets such as large retail outlets, specialty stores, restaurants, or direct to consumers. The study researches the potential demand for specialty beef through alternative market outlets, the market entry requirements for supplying specialty beef to these outlets, and the potential for Virginia's specialty beef producers to serve as suppliers to them. A survey is used to evaluate these alternative markets and determine whether they present an opportunity for Virginia producers of specialty beef. The results of this study evaluate the viability of selling and buying between producer and retailer and offer valuable information and recommendations to Virginia specialty beef producers about the potential and requirements in each of these markets.
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
50

Cenonfolo, Filippo. "Signal cleaning techniques and anomaly detection algorithms for motorbike applications." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
This paper outlines the results of the curricular internship project at the Research and Development section of Ducati Motor Holding S.p.A. in collaboration with the Motorvehicle University of Emilia-Romagna (MUNER). The focus is the development of a diagnostic plugin specifically tailored for motorcycle applications, with the aim of automatically detecting anomalous behavior in the signals recorded from the sensors mounted on board. Acquisitions are performed whenever motorbikes are tested, and they contain a variable number of channels related to the different parameters engineers decide to store for after-run analysis. Dealing with this complexity can be hard on its own, but the correct interpretation of data becomes even more demanding whenever signals are corrupted or affected by a significant degree of noise. For this reason, the whole internship project is centered on research into signal cleaning techniques and anomaly detection algorithms, aimed at developing an automatic diagnostic tool. The final goal is to implement a preliminary processing step on the acquisition that assesses the quality of the recorded signals and, where possible, applies strategies that reduce the impact of the anomalies on the overall dataset.
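The kind of preliminary processing the abstract describes can be sketched with a rolling-median cleaner and a robust residual threshold. This is a generic illustration on synthetic data, not Ducati's tool; the spike positions and thresholds are illustrative assumptions:

```python
# Hedged sketch: median-filter cleaning of a 1-D channel plus a simple
# MAD-based anomaly flag on the residual.
import numpy as np
from scipy.signal import medfilt

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 500)
signal = np.sin(2 * np.pi * 5 * t) + 0.05 * rng.normal(size=t.size)
signal[[50, 200, 350]] += 4.0            # injected spikes (corrupted samples)

cleaned = medfilt(signal, kernel_size=7)  # robust smoothing removes spikes
residual = signal - cleaned
# Flag samples whose residual exceeds several robust standard deviations
mad = np.median(np.abs(residual - np.median(residual)))
anomalies = np.where(np.abs(residual) > 6 * 1.4826 * mad)[0]
```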
APA, Harvard, Vancouver, ISO, and other styles