Dissertations / Theses on the topic 'K-statistics'

To see the other types of publications on this topic, follow the link: K-statistics.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 dissertations / theses for your research on the topic 'K-statistics.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Li, Songzi. "K-groups: A Generalization of K-means by Energy Distance." Bowling Green State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1428583805.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Angiuli, Olivia Marie. "The effect of quasi-identifier characteristics on statistical bias introduced by k-anonymization." Thesis, Harvard University, 2015. http://nrs.harvard.edu/urn-3:HUL.InstRepos:14398529.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The de-identification of publicly released datasets that contain personal information is necessary to preserve personal privacy. One such de-identification algorithm, k-anonymization, reduces the risk of re-identification by requiring that each combination of information-revealing traits be represented by at least k different records in the dataset. However, this requirement may skew the resulting dataset by preferentially deleting records that contain rarer information-revealing traits. This paper investigates the amount of bias and loss of utility introduced into an online education dataset by the k-anonymization process and suggests future directions that may decrease the amount of bias introduced during de-identification procedures.
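To make the k-anonymity requirement concrete, here is a minimal sketch (illustrative only, not the author's code; the column names, example records and choice of k are hypothetical) that suppresses records whose quasi-identifier combination occurs fewer than k times, which is exactly the mechanism that preferentially deletes rare records and introduces the bias discussed above.

```python
import pandas as pd

def k_anonymize_by_suppression(df, quasi_identifiers, k=5):
    """Drop records whose quasi-identifier combination occurs fewer than k times.

    This is the simplest suppression-based k-anonymization, meant only to show
    why records with rare trait combinations are preferentially deleted.
    """
    counts = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return df[counts >= k].copy()

# Hypothetical example data: age bracket, country and gender act as quasi-identifiers.
data = pd.DataFrame({
    "age_bracket": ["20-29", "20-29", "20-29", "30-39", "30-39", "70-79"],
    "country":     ["US",    "US",    "US",    "US",    "US",    "IS"],
    "gender":      ["F",     "F",     "F",     "M",     "M",     "F"],
    "grade":       [0.81,    0.77,    0.90,    0.55,    0.60,    0.95],
})

anonymized = k_anonymize_by_suppression(data, ["age_bracket", "country", "gender"], k=2)
print(anonymized)  # the single record with the rare (70-79, IS, F) combination is removed
```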
3

Fisher, Julia Marie. "Classification Analytics in Functional Neuroimaging: Calibrating Signal Detection Parameters." Thesis, The University of Arizona, 2015. http://hdl.handle.net/10150/594646.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Classification analyses are a promising way to localize signal, especially scattered signal, in functional magnetic resonance imaging data. However, there is not yet a consensus on the most effective analysis pathway. We explore the efficacy of k-Nearest Neighbors classifiers on simulated functional magnetic resonance imaging data. We utilize a novel construction of the classification data. Additionally, we vary the spatial distribution of signal, the design matrix of the linear model used to construct the classification data, and the feature set available to the classifier. Results indicate that the k-Nearest Neighbors classifier is not sufficient under the current paradigm to adequately classify neural data and localize signal. Further exploration of the data using k-means clustering indicates that this is likely due in part to the amount of noise present in each data point. Suggestions are made for further research.
4

Dey, Rajarshi. "Inference for the K-sample problem based on precedence probabilities." Diss., Kansas State University, 2011. http://hdl.handle.net/2097/12000.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Doctor of Philosophy
Department of Statistics
Paul I. Nelson
Rank-based inference using independent random samples to compare K>1 continuous distributions, called the K-sample problem, is developed and explored based on precedence probabilities. There are many parametric and nonparametric approaches, most dealing with hypothesis testing, to this important, classical problem. Most existing tests are designed to detect differences among the location parameters of different distributions. The best known and most widely used of these is the F-test, which assumes normality. A comparable nonparametric test was developed by Kruskal and Wallis (1952). When dealing with location-scale families of distributions, both of these tests can perform poorly if the differences among the distributions are in their scale parameters and not in their location parameters. Overall, existing tests are not effective in detecting changes in both location and scale. In this dissertation, I propose a new class of rank-based, asymptotically distribution-free tests, based on precedence probabilities, that are effective in detecting changes in both location and scale. Let X_i be a random variable with distribution function F_i, and let Pi be the set of all permutations of the numbers (1, 2, ..., K). Then P(X_{i_1} < ...
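For readers unfamiliar with precedence probabilities, a minimal Monte Carlo sketch (illustrative only, not the estimator developed in the dissertation; the sample sizes and distributions are hypothetical) approximates P(X_1 < X_2 < X_3) from three independent samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def precedence_probability(samples, n_draws=100_000):
    """Monte Carlo estimate of P(X_1 < X_2 < ... < X_K) from K independent samples.

    `samples` is a list of 1-D arrays, one per population; K-tuples are formed
    by drawing one observation from each sample independently.
    """
    draws = np.column_stack([rng.choice(s, size=n_draws, replace=True) for s in samples])
    ordered = np.all(np.diff(draws, axis=1) > 0, axis=1)
    return ordered.mean()

# Hypothetical K = 3 populations that differ in both location and scale.
x1 = rng.normal(0.0, 1.0, size=200)
x2 = rng.normal(0.5, 2.0, size=200)
x3 = rng.normal(1.0, 0.5, size=200)

print(precedence_probability([x1, x2, x3]))  # about 1/6 when all distributions are equal
```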
5

Bsharat, Rebhi S. "Evaluation of [subscript n]C[subscript k] estimators." Diss., Manhattan, Kan. : Kansas State University, 2007. http://hdl.handle.net/2097/410.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Gong, Rongsheng. "A Segmentation and Re-balancing Approach for Classification of Imbalanced Data." University of Cincinnati / OhioLINK, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1296594422.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Sutula, Glenn Eric. "Developing a Framework for the Purposes of Locating Undiscovered Hydrogeologic Windows." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1462458460.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Tolos, Siti. "Nonparametric tests to detect relationship between variables in the presence of heteroscedastic treatment effects." Diss., Kansas State University, 2010. http://hdl.handle.net/2097/6760.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Doctor of Philosophy
Department of Statistics
Haiyan Wang
Statistical tools to detect nonlinear relationships between variables are commonly needed in various practices. The first part of the dissertation presents a test of independence between a response variable, either discrete or continuous, and a continuous covariate after adjusting for heteroscedastic treatment effects. The method first involves augmenting each pair of the data for all treatments with a fixed number of nearest neighbors as pseudo-replicates. A test statistic is then constructed by taking the difference of two quadratic forms. Using such differences eliminates the need to estimate any nonlinear regression function, reducing the computational time. Although using a fixed number of nearest neighbors poses significant difficulty for the inference compared to when the number of nearest neighbors goes to infinity, the parametric standardizing rate is obtained for the asymptotic distribution of the proposed test statistics. Numerical studies show that the new test procedure maintains the intended type I error rate and has robust power to detect nonlinear dependency in the presence of outliers. The second part of the dissertation discusses the theory and numerical studies for testing the nonparametric effects of no covariate-treatment interaction and no main covariate effect, based on the decomposition of the conditional mean of the regression function, which is potentially nonlinear. A similar test was discussed in Wang and Akritas (2006) for effects defined through the decomposition of the conditional distribution function, but with the number of pseudo-replicates going to infinity. Consequently, their test statistics have slow convergence rates and computational speeds. Both limitations are overcome using the new model and tests. The last part of the dissertation develops theory and numerical studies to test for no covariate-treatment interaction, no simple covariate and no main covariate effects for cases where the number of factor levels and the number of covariate values are large.
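The nearest-neighbor augmentation step described above can be pictured with a small sketch (illustrative only, with hypothetical data and a fixed number of neighbors; it is not the dissertation's test statistic): each observation is paired with the responses at the covariate values closest to it, and these act as pseudo-replicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def augment_with_pseudo_replicates(x, y, n_neighbors=3):
    """For each (x_i, y_i), collect the responses at the n_neighbors covariate values
    closest to x_i (including x_i itself); these response sets act as pseudo-replicates."""
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(x.reshape(-1, 1))
    _, idx = nn.kneighbors(x.reshape(-1, 1))
    return y[idx]  # shape (n, n_neighbors): row i holds the pseudo-replicates for x_i

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=50)  # hypothetical nonlinear relationship

pseudo = augment_with_pseudo_replicates(x, y, n_neighbors=3)
print(pseudo.shape)  # (50, 3)
```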
9

Ge, Wentao. "Bootstrap-adjusted Quasi-likelihood Information Criteria for Mixed Model Selection." Bowling Green State University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu156207676645628.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Hezel, Claudia Regina. "Avaliação das pressões em silos verticais conforme diferentes normas internacionais." Universidade Estadual do Oeste do Parana, 2007. http://tede.unioeste.br:8080/tede/handle/tede/213.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In engineering, the goal is always to build structures that are strong, safe and economically viable. The safe and economical design of a structure depends on the actions imposed on it; in the case of silos there is no Brazilian standard prescribing their design and loads, and, in addition, several divergences are observed among the foreign standards. The main objective of this work is a comparative analysis of the prescriptions of the international standards ENV (1995), AS 3774 (1996), ACI 313 (1991), DIN 1055 (1987) and BMHB (1985), carrying out a statistical analysis among them and developing an electronic spreadsheet for calculating the pressures, in which the properties can be varied, in order to facilitate the analysis and the development of a theoretical-experimental study. A further aim is to present a theoretical study of the pressures in vertical silos; to this end, a state of the art of the pressure theories proposed by the most important researchers was developed and, finally, a silo was modelled for analysis with the Ansys® software. The comparative statistical analysis of the main foreign standards showed quite significant differences among the values obtained: for the horizontal pressures there are differences of up to 59% (between the BMHB and DIN standards), with the smallest values obtained, on average, from the British standard and the largest from the German one. It was also verified that most foreign standards adopt Janssen's theory for determining the horizontal pressures. For the vertical pressures, the difference between standards reached almost 400% (between the ENV and AS standards); on average the smallest values are obtained from the European standard and the largest from the Australian one, and it was also observed that Janssen's model, without any modification, is the one proposed by the ACI standard. For the wall friction pressures, the smallest values are obtained from the BMHB standard and the largest from DIN, with differences reaching 59% (between BMHB and DIN). Regarding the use of the Ansys® program, an initial plan for modelling a silo was outlined; the program and the methodology proved useful, allowing refinement, comparison of results and variation of the pressures according to the various existing standards, which may be developed in future work into a comparison between theory and practice using a prototype silo and pressure cells.
11

Åkerblom, Thea, and Tobias Thor. "Fraud or Not?" Thesis, Uppsala universitet, Statistiska institutionen, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-388695.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This paper uses statistical learning to examine and compare three different statistical methods with the aim of predicting credit card fraud. The methods compared are Logistic Regression, K-Nearest Neighbour and Random Forest. They are applied and estimated on a data set consisting of nearly 300,000 credit card transactions to determine their performance, using classification of fraud as the outcome variable. The three models all have different properties and advantages. The K-NN model performed the best in this paper but has some disadvantages, since it does not explain the data but rather predicts the outcome accurately. Random Forest explains the variables but performs less precisely. The Logistic Regression model seems to be unfit for this specific data set.
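A rough sketch of this kind of three-way comparison in scikit-learn might look as follows; the synthetic, imbalanced data set and all settings below are placeholders, not the authors' data or tuning.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, highly imbalanced stand-in for the credit card data (placeholder only).
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.995, 0.005],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbour": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test), digits=3))
```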
12

Feng, Ying (Olivia). "The development of an instrument to measure individual dispositions towards rules and principles, with implications for financial regulation." Thesis, University of Glasgow, 2014. http://theses.gla.ac.uk/5300/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The main focus of this PhD project is the development and validation of a psychometric instrument for the measurement of individual dispositions towards rules and principles. Literature review and focus groups were used to generate insights into the reasons why individuals prefer rules and principles. On the basis of that review, an initial item pool was created covering the conceptual space of dispositions towards rules and principles. The final instrument consists of 10 items, 5 items each for the rules and principles subscales. The psychometric analysis suggested that it is valid and reliable. The instrument has sound predictive power and was able to significantly predict individuals’ behavioral intentions in relation to rules and principles across contexts. I found there were gender and ethnic differences in the relationship between dispositions towards rules and principles scores and behavioural intentions. This PhD is relevant to an emerging literature in behavioural accounting research that examines how practitioners’ personal characteristics and styles affect financial reporting practice.
13

Zhou, Dunke. "High-dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1338303646.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Zhao, Jianmin. "Optimal Clustering: Genetic Constrained K-Means and Linear Programming Algorithms." VCU Scholars Compass, 2006. http://hdl.handle.net/10156/1583.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Ruzgys, Martynas. "IT žinių portalo statistikos modulis pagrįstas grupavimu." Master's thesis, Lithuanian Academic Libraries Network (LABT), 2007. http://vddb.library.lt/obj/LT-eLABa-0001:E.02~2007~D_20070816_143545-16583.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Pristatomas duomenų gavybos ir grupavimo naudojimas paplitusiose sistemose bei sukurtas IT žinių portalo statistikos prototipas duomenų saugojimui, analizei ir peržiūrai atlikti. Siūlomas statistikos modulis duomenų saugykloje periodiškais laiko momentais vykdantis duomenų transformacijas. Portale prieinami statistiniai duomenys gali būti grupuoti. Sugrupuotą informaciją pateikus grafiškai, duomenys gali būti interpretuojami ir stebimi veiklos mastai. Panašių objektų grupėms išskirti pritaikytas vienas iš žinomiausių duomenų grupavimo metodų – lygiagretusis k-vidurkių metodas.
This thesis presents the use of data mining and clustering methods in existing statistical systems and describes a statistics module prototype created for data storage, analysis and visualization in an IT knowledge portal. In the proposed statistics module, periodic data transformations are performed in the data store. The statistical data accessible in the portal can be clustered; when the clustered information is presented graphically, the data can be interpreted and activity trends can be observed. One of the best-known data clustering methods, the parallel k-means method, is adapted for separating groups of similar objects.
16

Soale, Abdul-Nasah. "Spatio-Temporal Analysis of Point Patterns." Digital Commons @ East Tennessee State University, 2016. https://dc.etsu.edu/etd/3120.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In this thesis, the basic tools of spatial statistics and time series analysis are applied to a case study of the earthquakes in a certain geographical region and time frame. Then some of the existing methods for joint analysis of time and space are described and applied. Finally, additional research questions about the spatio-temporal distribution of the earthquakes are posed and explored using statistical plots and models. The last section focuses on the relationship between the number of events per year and the maximum magnitude, its effect on how clustered the spatial distribution is, and the relationship between the temporal and spatial distances between consecutive events, as well as the distribution of those distances.
17

Solomon, Mary Joanna. "Multivariate Analysis of Korean Pop Music Audio Features." Bowling Green State University / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1617105874719868.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Thorstensson, Linnea. "Clustering Methods as a Recruitment Tool for Smaller Companies." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-273571.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
With the help of new technology it has become much easier to apply for a job. Reaching a larger audience also results in many more applications to consider when hiring for a new position. As a result, many big companies use statistical learning methods as a tool in the first step of the recruiting process. Smaller companies that do not have access to the same amount of historical data and big data sets do not have the same opportunities to digitalise their recruitment process. Using topological data analysis, this thesis explores how clustering methods can be used on smaller data sets in the early stages of the recruitment process. It also studies how the level of abstraction in the data representation affects the results. The methods seem to perform well on higher-level job announcements but struggle on basic-level positions. It also shows that the representation of candidates and jobs has a huge impact on the results.
Ny teknologi har förenklat processen för att söka arbete. Detta har resulterat i att företag får tusentals ansökningar som de måste ta hänsyn till. För att förenkla och påskynda rekryteringsprocessen har många stora företag börjat använda sig av maskininlärningsmetoder. Mindre företag, till exempel start-ups, har inte samma möjligheter för att digitalisera deras rekrytering. De har oftast inte tillgång till stora mängder historisk ansökningsdata. Den här uppsatsen undersöker därför med hjälp av topologisk dataanalys hur klustermetoder kan användas i rekrytering på mindre datauppsättningar. Den analyserar också hur abstraktionsnivån på datan påverkar resultaten. Metoderna visar sig fungera bra för jobbpositioner av högre nivå men har problem med jobb på en lägre nivå. Det visar sig också att valet av representation av kandidater och jobb har en stor inverkan på resultaten.
19

Sampath, Srinath. "Analysis of Agreement Between Two Long Ranked Lists." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1385415346.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Bergström, Sebastian. "Customer segmentation of retail chain customers using cluster analysis." Thesis, KTH, Matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-252559.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In this thesis, cluster analysis was applied to data comprising customer spending habits at a retail chain in order to perform customer segmentation. The method used was a two-step cluster procedure in which the first step consisted of feature engineering, a square-root transformation of the data in order to handle big spenders in the data set, and finally principal component analysis in order to reduce the dimensionality of the data set and thereby the effects of high dimensionality. The second step consisted of applying clustering algorithms to the transformed data. The methods used were K-means clustering, Gaussian mixture models in the MCLUST family, t-distributed mixture models in the tEIGEN family and non-negative matrix factorization (NMF). For the NMF clustering a slightly different data pre-processing step was taken; specifically, no PCA was performed. Clustering partitions were compared on the basis of the Silhouette index, the Davies-Bouldin index and subject matter knowledge, which revealed that K-means clustering with K = 3 produces the most reasonable clusters. This algorithm was able to separate the customers into different segments depending on how many purchases they made overall, and in these clusters some minor differences in spending habits are also evident. In other words, there is some support for the claim that the customer segments have some variation in their spending habits.
I denna uppsats har klusteranalys tillämpats på data bestående av kunders konsumtionsvanor hos en detaljhandelskedja för att utföra kundsegmentering. Metoden som använts bestod av en två-stegs klusterprocedur där det första steget bestod av att skapa variabler, tillämpa en kvadratrotstransformation av datan för att hantera kunder som spenderar långt mer än genomsnittet och slutligen principalkomponentanalys för att reducera datans dimension. Detta gjordes för att mildra effekterna av att använda en högdimensionell datamängd. Det andra steget bestod av att tillämpa klusteralgoritmer på den transformerade datan. Metoderna som användes var K-means klustring, gaussiska blandningsmodeller i MCLUST-familjen, t-fördelade blandningsmodeller från tEIGEN-familjen och icke-negativ matrisfaktorisering (NMF). För klustring med NMF användes förbehandling av datan, mer specifikt genomfördes ingen PCA. Klusterpartitioner jämfördes baserat på silhuettvärden, Davies-Bouldin-indexet och ämneskunskap, som avslöjade att K-means klustring med K=3 producerar de rimligaste resultaten. Denna algoritm lyckades separera kunderna i olika segment beroende på hur många köp de gjort överlag och i dessa segment finns vissa skillnader i konsumtionsvanor. Med andra ord finns visst stöd för påståendet att kundsegmenten har en del variation i sina konsumtionsvanor.
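A hedged sketch of the two-step procedure described in the abstract above (hypothetical spending data and parameter choices, not the thesis code): square-root transform, PCA, then K-means for several candidate K compared via the Silhouette and Davies-Bouldin indices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
spend = rng.gamma(shape=2.0, scale=50.0, size=(1000, 12))  # hypothetical spend per category

# Step 1: square-root transform to tame big spenders, then PCA to reduce the dimension.
features = PCA(n_components=4).fit_transform(np.sqrt(spend))

# Step 2: cluster and compare partitions for a few candidate numbers of clusters.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    print(k, round(silhouette_score(features, labels), 3),
          round(davies_bouldin_score(features, labels), 3))
```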
21

Warnqvist, Åsa. "Poesifloden : Utgivningen av diktsamlingar i Sverige 1976–1995." Doctoral thesis, Uppsala universitet, Litteraturvetenskapliga institutionen, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-8329.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The subject of this dissertation is the publishing of poetry in Sweden 1976–1995. The purpose is to examine the position of poetry in the Swedish book market and in the literary process. It is an empirical and statistical study based primarily on an inventory of the published works. The study shows that the publication of Swedish poetry collections in 1976–1995 consisted of 3 848 titles (new works only), which was more than ever before. Publication was consistent over the period, partly due to the allocation of the literature grant introduced by the Swedish government in 1975, but also to the technical development which made it possible for small and private publishers to release collections of poetry at a lower cost. The main publishers were the general publishing houses of Bonniers, Norstedts and Wahlström & Widstrand, but more than a third of the collections were published by vanity press and self publishers. Publication was strongly concentrated to the capital area. Regardless of the size of the publisher, poetry collections were printed in small numbers and generally sold poorly. Along with the technical development offset-printed books replaced duplicated publications, and more books were hardbound. The publishing houses made bigger efforts than ever before to publish female poets. The number increased over the period, but the men were still in a clear majority by 1995. The women were also largely responsible for rejuvenating the body of authors. The number of debutants was relatively constant during the period. The results in this dissertation indicate a hierarchic order among the publishing houses that determine the conditions for the authors and their works. This is verified through analyses of coverage in the national and regional daily papers, as well as three analyses of the authorships of Yngve Aldhagen, Else-Britt Kjellqvist and Bruno K. Öijer. The dissertation concludes that poetry exists on the publishing lists mainly for symbolical reasons; to publish poetry gives cultural capital to the publishers.
22

Tandan, Isabelle, and Erika Goteman. "Bank Customer Churn Prediction : A comparison between classification and evaluation methods." Thesis, Uppsala universitet, Statistiska institutionen, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-411918.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This study aims to assess which of three supervised statistical learning methods (random forest, logistic regression or K-nearest neighbour) is best at predicting bank customer churn. Additionally, the study evaluates which cross-validation approach, k-fold cross-validation or leave-one-out cross-validation, yields the most reliable results. Predicting customer churn has increased in popularity since new technology, regulation and changed demand have led to an increase in competition for banks. Thus, with greater reason, banks acknowledge the importance of maintaining their customer base. The findings of this study are that an unrestricted random forest model estimated using k-fold cross-validation is preferable in terms of performance measurements, computational efficiency and theoretical considerations. Although k-fold cross-validation and leave-one-out cross-validation yield similar results, k-fold cross-validation is preferred due to its computational advantages. For future research, methods that generate models with both good interpretability and high predictability would be beneficial, in order to combine knowledge of which customers end their engagement with an understanding of why. Moreover, interesting future research would be to analyse at which dataset size leave-one-out cross-validation and k-fold cross-validation yield the same results.
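A minimal sketch of the cross-validation comparison discussed in the abstract above (synthetic placeholder data and default model settings, not the authors' study design):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # placeholder data
model = RandomForestClassifier(n_estimators=100, random_state=0)

kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # one fit per observation: slow

print("10-fold accuracy:", kfold_scores.mean())
print("LOOCV accuracy:  ", loo_scores.mean())
```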
23

Kondapalli, Swetha. "An Approach To Cluster And Benchmark Regional Emergency Medical Service Agencies." Wright State University / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=wright1596491788206805.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Wang, Kaijun. "Graph-based Modern Nonparametrics For High-dimensional Data." Diss., Temple University Libraries, 2019. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/578840.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Statistics
Ph.D.
Developing nonparametric statistical methods and inference procedures for high-dimensional large data has been a challenging frontier problem of statistics. To attack this problem, in recent years a clear rising trend has been observed with a radically different viewpoint, "Graph-based Nonparametrics," which is the main research focus of this dissertation. The basic idea consists of two steps: (i) representation step: code the given data using graphs; (ii) analysis step: apply statistical methods to the graph-transformed problem to systematically tackle various types of data structures. Under this general framework, this dissertation develops two major research directions. Chapter 2, based on Mukhopadhyay and Wang (2019a), introduces a new nonparametric method for the high-dimensional k-sample comparison problem that is distribution-free, robust, and continues to work even when the dimension of the data is larger than the sample size. The proposed theory is based on modern LP-nonparametrics tools and unexplored connections with spectral graph theory. The key is to construct a specially designed weighted graph from the data and to reformulate the k-sample problem as a community detection problem. The procedure is shown to possess various desirable properties along with a characteristic exploratory flavor that has practical consequences. The numerical examples show surprisingly good performance of our method under a broad range of realistic situations. Chapter 3, based on Mukhopadhyay and Wang (2019b), revisits some foundational questions about network modeling that are still unsolved. In particular, we present a unified statistical theory of the fundamental spectral graph methods (e.g., Laplacian, Modularity, Diffusion map, regularized Laplacian, Google PageRank model), which are often viewed as spectral heuristic-based empirical mystery facts. Despite half a century of research, this question has been one of the most formidable open issues, if not the core problem, in modern network science. Our approach integrates modern nonparametric statistics, mathematical approximation theory (of integral equations), and computational harmonic analysis in a novel way to develop a theory that unifies and generalizes the existing paradigm. From a practical standpoint, it is shown that this perspective can provide adequate guidance for designing next-generation computational tools for large-scale problems. As an example, we have described the high-dimensional change-point detection problem. Chapter 4 discusses some further extensions and applications of our methodologies to regularized spectral clustering and spatial graph regression problems. The dissertation concludes with a discussion of two important areas of future study.
Temple University--Theses
25

Lu, Shihai. "Novel Step-Down Multiple Testing Procedures Under Dependence." Bowling Green State University / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1416594298.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Chegancas, Rito Tiago Miguel. "Modelling and comparing protein interaction networks using subgraph counts." Thesis, University of Oxford, 2012. http://ora.ox.ac.uk/objects/uuid:dcc0eb0d-1dd8-428d-b2ec-447a806d6aa8.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The astonishing progress of molecular biology, engineering and computer science has resulted in mature technologies capable of examining multiple cellular components at a genome-wide scale. Protein-protein interactions are one example of such growing data. These data are often organised as networks with proteins as nodes and interactions as edges. Albeit still incomplete, there is now a substantial amount of data available and there is a need for biologically meaningful methods to analyse and interpret these interactions. In this thesis we focus on how to compare protein interaction networks (PINs) and on the relationship between network architecture and the biological characteristics of proteins. The underlying theme throughout the dissertation is the use of small subgraphs – small interaction patterns between 2-5 proteins. We start by examining two popular scores that are used to compare PINs and network models. When comparing networks of the same model type we find that the typical scores are highly unstable and depend on the number of nodes and edges in the networks. This is unsatisfactory and we propose a method based on non-parametric statistics to make more meaningful comparisons. We also employ principal component analysis to judge model fit according to subgraph counts. From these analyses we show that no current model fits to the PINs; this may well reflect our lack of knowledge on the evolution of protein interactions. Thus, we use explanatory variables such as protein age and protein structural class to find patterns in the interactions and subgraphs we observe. We discover that the yeast PIN is highly heterogeneous and therefore no single model is likely to fit the network. Instead, we focus on ego-networks containing an initial protein plus its interacting partners and their interaction partners. In the final chapter we propose a new, alignment-free method for network comparison based on such ego-networks. The method compares subgraph counts in neighbourhoods within PINs in an averaging, many-to-many fashion. It clusters networks of the same model type and is able to successfully reconstruct species phylogenies solely based on PIN data providing exciting new directions for future research.
27

Ježek, Jan. "Statistický modul k pracovní databázi TIVOLE." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2008. http://www.nusl.cz/ntk/nusl-217191.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This thesis was written for the IBM Company, specifically the IBM IDC Czech Republic, s.r.o. branch. It describes the development of a statistics module for the TIVOLE work database using the Lotus Notes program. The basic aim of the thesis is to ease the creation of reports and graphical data outputs, speeding up operators' work and reducing the human resources required for solving problems.
28

Li, Xiaohu. "Security Analysis on Network Systems Based on Some Stochastic Models." ScholarWorks@UNO, 2014. http://scholarworks.uno.edu/td/1931.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Thanks to great effort from mathematicians, physicists and computer scientists, network science has developed rapidly during the past decades. However, because of the complexity involved, most research in this area is conducted only on the basis of experiments and simulations; it is therefore critical to do research based on theoretical results so as to gain more insight into how the structure of a network affects its security. This dissertation introduces some stochastic and statistical models of certain networks and uses a k-out-of-n tolerant structure to characterize, both logically and physically, the behavior of nodes. Based on these models, we draw several illuminating results in the following two aspects, which are consistent with what computer scientists have observed in either practical situations or experimental studies. Suppose that a node in a P2P network loses its designed function or service when some of its neighbors are disconnected. By studying the isolation probability and the durable time of a single user, we prove that a network in which the user's lifetime has more NWUE-ness is more resilient, in the sense of having a smaller probability of being isolated by neighbors and a longer time online without being interrupted. Meanwhile, some preservation properties are also studied for the durable time of a network. Additionally, in order to apply the model in practice, both graphical and nonparametric statistical methods are developed and applied to a real data set. On the other hand, a stochastic model is introduced to investigate the security of network systems based on their vulnerability graph abstractions. A node loses its designed function when a certain number of its neighbors are compromised, in the sense of being taken over by malicious code or a hacker. The attack compromises some nodes, and the victimized nodes become accomplices. We derive an equation for the probability that a node in a network is compromised. Since this equation has no explicit solution, we also establish new lower and upper bounds for the probability. The two models proposed herewith generalize existing models in the literature, and the corresponding theoretical results effectively improve known results, hence offering insight into designing a more secure system and enhancing the security of an existing system.
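The k-out-of-n idea can be made concrete with a small numerical sketch (hypothetical numbers, and neighbors assumed independent, which simplifies the dependence structure studied in the dissertation): a node with n neighbors loses its designed function when fewer than k of them remain available.

```python
from scipy.stats import binom

def prob_node_fails(n_neighbors, k_required, p_neighbor_up):
    """P(fewer than k_required of the n_neighbors are up), assuming independent neighbors."""
    return binom.cdf(k_required - 1, n_neighbors, p_neighbor_up)

# Hypothetical P2P node with 8 neighbours that needs at least 3 of them online.
for p_up in (0.9, 0.7, 0.5):
    print(p_up, round(prob_node_fails(8, 3, p_up), 4))
```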
29

Zhao, Yanchun. "Comparison of Proposed K Sample Tests with Dietz's Test for Nondecreasing Ordered Alternatives for Bivariate Normal Data." Thesis, North Dakota State University, 2011. https://hdl.handle.net/10365/28836.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
There are many situations in which researchers want to consider a set of response variables simultaneously rather than just one response variable. For instance, a possible example is when a researcher wishes to determine the effects of an exercise and diet program on both the cholesterol levels and the weights of obese subjects. Dietz (1989) proposed two multivariate generalizations of the Jonckheere test for ordered alternatives. In this study, we propose k-sample tests for nondecreasing ordered alternatives for bivariate normal data and compare their powers with Dietz's sum statistic. The proposed k-sample tests are based on transformations of bivariate data to univariate data. The transformations considered are the sum, maximum and minimum functions. The ideas for these transformations come from Leconte, Moreau, and Lellouch (1994). After the underlying bivariate normal data are reduced to univariate data, the Jonckheere-Terpstra (JT) test (Terpstra, 1952 and Jonckheere, 1954) and the Modified Jonckheere-Terpstra (MJT) test (Tryon and Hettmansperger, 1973) are applied to the univariate data. A simulation study is conducted to compare the proposed tests with Dietz's test for k bivariate normal populations (k=3, 4, 5). A variety of sample sizes and various location shifts are considered in this study. Two different correlations are used for the bivariate normal distributions. The simulation results show that generally the Dietz test performs the best for the situations considered with the underlying bivariate normal distribution. The estimated powers of MJT sum and JT sum are often close, with the MJT sum generally having slightly higher power. The sum transformation was the best of the three transformations to use for bivariate normal data.
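To make the transformation idea concrete, here is a small sketch (illustrative only; the simulation design, shifts and sample sizes are placeholders, and only the plain JT statistic is computed, without the modification or power comparison carried out in the thesis): bivariate observations are reduced with the sum, maximum and minimum functions, and the Jonckheere-Terpstra statistic is computed on the resulting univariate data.

```python
import numpy as np

def jonckheere_terpstra(groups):
    """Plain JT statistic: sum over ordered group pairs of Mann-Whitney counts."""
    stat = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            xi, xj = groups[i][:, None], groups[j][None, :]
            stat += np.sum(xi < xj) + 0.5 * np.sum(xi == xj)
    return stat

rng = np.random.default_rng(0)
# Three bivariate normal samples with increasing location shifts (hypothetical).
samples = [rng.multivariate_normal([d, d], [[1, 0.5], [0.5, 1]], size=30)
           for d in (0.0, 0.3, 0.6)]

for name, reduce_fn in (("sum", np.sum), ("max", np.max), ("min", np.min)):
    univariate = [reduce_fn(s, axis=1) for s in samples]
    print(name, jonckheere_terpstra(univariate))
```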
30

Xie, Wen. "A Monte Carlo Simulation Study for Poly-k Test in Animal Carcinogenicity Studies." Thesis, California State University, Long Beach, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10638898.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:

An objective of animal carcinogenicity studies is to identify a tumorigenic potential in animals and to assess relevant risks in humans. Without using the cause-of-death information, the Cochran-Armitage test is applied for detecting a linear trend in the incidence of a tumor of interest across dose groups. The survival-adjusted Cochran-Armitage test, known as the Poly-k test, is investigated for animals not at equal risk of tumor development, as it reflects the shapes of the tumor onset distributions. In this thesis, we validate the Poly-k test through a Monte Carlo simulation study. The simulation study is designed to assess the size and power of the Poly-k test using a wide range of k values for various tumor onset rates, competing risks rates, and tumor lethality rates. The Poly-k testing approach is investigated for evaluating a dose-related linear trend in tumor incidence and is implemented in an R package to be used widely among toxicologists.

31

Lindström, Alfred Minh. "Cutting and Destroying Graphs using k-cuts." Thesis, Uppsala universitet, Analys och sannolikhetsteori, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-395663.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Janštová, Michaela. "Segmentace měkkých tkání v obličejové části myších embryí v mikrotomografických datech." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2019. http://www.nusl.cz/ntk/nusl-400988.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This diploma thesis deals with the segmentation of soft tissues in the facial part of mouse embryos in Matlab. Segmentation of soft tissues of mouse embryos has not been fully automated, and every case needs a specific solution; solving parts of this issue can provide valuable data for evolutionary biologists. Issues concerning staining and segmentation techniques are described. On the basis of the accessible literature, Otsu thresholding, region growing, k-means clustering and atlas-based segmentation were tested. At the end of the thesis, these methods are evaluated on 3D microtomography data.
33

Charitidis, Theoharis. "Sequence Prediction for Identifying User Equipment Patterns in Mobile Networks." Thesis, KTH, Matematisk statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-266380.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
With an increasing demand for bandwidth and lower latency in mobile communication networks, it becomes gradually more important to improve current mobile network management solutions using available network data. To improve the network management it can, for instance, be of interest to infer the future available bandwidth for the end user of the network. This can be done by utilizing current knowledge of real-time user equipment (UE) behaviour in the network. Within the scope of this thesis, the interest lies in predicting, given a set of visited radio access points (cells), what the next one is going to be. For this reason the aim is to investigate the prediction performance when utilizing the All-K-Order Markov (AKOM) model, with some added variations, on collected data generated from train trajectories. Moreover, a method for testing the suitability of modeling the sequence of cells as a time-homogeneous Markov chain is proposed, in order to determine the goodness-of-fit with the available data. Lastly, the elapsed time in each cell is predicted using linear regression, given a history window of previous cell and elapsed-time pairs. The results show that moderate to good prediction accuracy on the upcoming cell can be achieved with AKOM and associated variations. For predicting the upcoming sojourn time in future cells the results reveal that linear regression does not yield satisfactory results and possibly another regression model should be utilized.
Med en ökande efterfrågan på banbredd och kortare latens i mobila nätverk har det gradvis blivit viktigare att förbättra nuvarande lösningar för hantering av nätverk genom att använda tillgänglig nätverksdata. Specifikt är det av intresse att kunna dra slutsatser kring vad framtida bandbredsförhållanden kommer vara, samt övriga parametrar av intresse genom att använda tillgänglig information om aktuell mobil användarutrustnings (UE) beteende i det mobila nätverket. Inom ramen av detta masterarbete ligger fokus på att, givet tidigare besökta radio accesspunkter (celler), kunna förutspå vilken nästkommande besökta cell kommer att vara. Av denna anledning är målet att undersöka vilken prestanda som kan uppnås när All-$K$-Order Markov (AKOM) modellen, med associerade varianter av denna, används på samlad data från tågfärder. Dessutom ges det förslag på test som avgör hur lämpligt det är att modelera observerade sekvenser av celler som en homogen Markovkedja med tillgänglig data. Slutligen undersöks även om besökstiden i en framtida cell kan förutspås med linjär regression givet ett historiskt fönster av tidigare cell och besökstids par. Erhållna resultat visar att måttlig till bra prestanda kan uppnås när kommande celler förutspås med AKOM modellen och associerade variationer. För prediktering av besökstid i kommande cell med linjär regression erhålles det däremot inte tillfredsställande resultat, vilket tyder på att en alternativ regressionsmetod antagligen är bättre lämpad för denna data.
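The All-K-Order Markov idea can be sketched as follows (a simplified illustration with hypothetical cell identifiers, not the thesis implementation): transition counts are kept for every context length up to K, and prediction falls back from the longest matching context to shorter ones.

```python
from collections import Counter, defaultdict

class AKOMPredictor:
    """All-K-Order Markov predictor: order-k counts for k = 1..K with fallback."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(Counter)  # context tuple -> Counter of next symbols

    def fit(self, sequence):
        for k in range(1, self.max_order + 1):
            for i in range(len(sequence) - k):
                context = tuple(sequence[i:i + k])
                self.counts[context][sequence[i + k]] += 1

    def predict(self, history):
        # Try the longest context first, then fall back to shorter ones.
        for k in range(min(self.max_order, len(history)), 0, -1):
            context = tuple(history[-k:])
            if context in self.counts:
                return self.counts[context].most_common(1)[0][0]
        return None

# Hypothetical sequence of visited cells along a train route.
cells = ["A", "B", "C", "D", "B", "C", "D", "E", "B", "C", "D"]
model = AKOMPredictor(max_order=3)
model.fit(cells)
print(model.predict(["B", "C"]))  # -> "D"
```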
34

Sebastian, Maria Treesa. "Modelling Bitcell Behaviour." Thesis, Linköpings universitet, Statistik och maskininlärning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166218.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
With advancements in technology, the dimensions of transistors are scaling down. This leads to shrinkage in the size of memory bitcells, increasing their sensitivity to process variations introduced during manufacturing. Failure of a single bitcell can cause the failure of an entire memory; hence careful statistical analysis is essential in estimating the highest reliable performance of the bitcell before using it in memory design. With the high repetitiveness of the bitcell, the traditional method of Monte Carlo simulation would require a long time for accurate estimation of rare failure events. A more practical approach is importance sampling, where more samples are collected from the failure region. Even though importance sampling is much faster than Monte Carlo simulation, it is still fairly time-consuming, as it demands an iterative search, making it impractical for large simulation sets. This thesis proposes two machine learning models that can be used in estimating the performance of a bitcell. The first model predicts the time taken by the bitcell for a read or write operation. The second model predicts the minimum voltage required to maintain bitcell stability. The models were trained using the K-nearest neighbors algorithm and Gaussian process regression. Three sparse approximations were implemented in the time prediction model as a bigger dataset was available. The obtained results show that the models trained using Gaussian process regression were able to provide promising results.
35

Cai, Weixing. "Multiple decision rules for equivalence among k populations and their applications in signal processing, clinical trials and classification." Related electronic resource: Current Research at SU : database of SU dissertations, recent titles available full text, 2008. http://wwwlib.umi.com/cr/syr/main.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Ferm, Martin. "Prediktering av skogliga variabler med data från flygburen laser : En jämförelse mellan multipla regressionsmodeller och k nearest neighbour-modeller." Thesis, Umeå University, Department of Statistics, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-34833.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Dineff, Dimitris. "Clustering using k-means algorithm in multivariate dependent models with factor structure." Thesis, Uppsala universitet, Tillämpad matematik och statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-429528.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Bhattacharya, Abhishek. "Nonparametric Statistics on Manifolds With Applications to Shape Spaces." Diss., The University of Arizona, 2008. http://hdl.handle.net/10150/194508.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This thesis presents certain recent methodologies and some new results for the statistical analysis of probability distributions on non-Euclidean manifolds. The notions of Frechet mean and variation as measures of center and spread are introduced and their properties are discussed. The sample estimates from a random sample are shown to be consistent under fairly broad conditions. Depending on the choice of distance on the manifold, intrinsic and extrinsic statistical analyses are carried out. In both cases, sufficient conditions are derived for the uniqueness of the population means and for the asymptotic normality of the sample estimates. Analytic expressions for the parameters in the asymptotic distributions are derived. The manifolds of particular interest in this thesis are the shape spaces of k-ads. The statistical analysis tools developed on general manifolds are applied to the spaces of direct similarity shapes, planar shapes, reflection similarity shapes, affine shapes and projective shapes. Two-sample nonparametric tests are constructed to compare the mean shapes and variation in shapes for two random samples. The samples in consideration can be either independent of each other or be the outcome of a matched pair experiment. The testing procedures are based on the asymptotic distribution of the test statistics, or on nonparametric bootstrap methods suitably constructed. Real life examples are included to illustrate the theory.
39

Ramler, Ivan Peter. "Improved statistical methods for k-means clustering of noisy and directional data." [Ames, Iowa : Iowa State University], 2008.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
40

Wong, Wing Sing. "K-nearest-neighbor queries with non-spatial predicates on range attributes /." View abstract or full-text, 2005. http://library.ust.hk/cgi/db/thesis.pl?COMP%202005%20WONGW.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Adkins, Laura Jean. "A Generalization of the EM Algorithm for Maximum Likelihood Estimation in Mallows' Model Using Partially Ranked Data and Asymptotic Relative Efficiencies for Some Ranking Tests of The K-Sample Problem /." The Ohio State University, 1996. http://rave.ohiolink.edu/etdc/view?acc_num=osu1487933245538208.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Devie, Arnaud. "Caractérisation de l'usage des batteries Lithium-ion dans les véhicules électriques et hybrides. Application à l'étude du vieillissement et de la fiabilité." Phd thesis, Université Claude Bernard - Lyon I, 2012. http://tel.archives-ouvertes.fr/tel-00783338.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
New traction architectures (hybrid, electric) are competing with conventional internal-combustion powertrains. Lithium-ion batteries equip these innovative vehicles. The durability of these batteries is a major issue but depends on many external environmental parameters. The battery-life prediction tools currently in use are often too simplistic in their approach. The purpose of this work is to characterise the usage conditions of these batteries (temperature, voltage, current, SOC and DOD) in order to study precisely the lifetime that can be expected from them as a function of the target application. Several types of electrified vehicles (electric-assist bicycles, electric cars, hybrid cars and trolleybuses) were instrumented in order to document the real usage conditions of the batteries. Large volumes of data were collected and then analysed using an innovative method based on the classification of current pulses with the k-means algorithm and the generation of synthetic cycles through Markov-chain modelling. The synthetic cycles obtained in this way have characteristics very close to the complete sample of collected data and can therefore faithfully represent real usage. Used in battery ageing campaigns, they should make it possible to obtain an accurate prediction of battery lifetime for the application under consideration. Several experimental results are presented to support the relevance of this approach.
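A hedged sketch of the general pipeline described above (not the thesis code; the pulse features, cluster count and cycle length are hypothetical): current pulses are grouped with k-means, and a Markov chain over the cluster labels is then sampled to generate a synthetic cycle.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical pulse features: (mean current in A, duration in s) for recorded pulses.
pulses = np.vstack([
    rng.normal([ 20.0,  5.0], [3.0, 1.0], size=(300, 2)),   # discharge pulses
    rng.normal([-10.0, 20.0], [2.0, 4.0], size=(200, 2)),   # regenerative/charge pulses
    rng.normal([  0.0, 60.0], [0.5, 10.0], size=(100, 2)),  # rest periods
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pulses)
labels = kmeans.labels_

# First-order Markov chain over cluster labels, estimated from the observed sequence.
n = kmeans.n_clusters
transitions = np.ones((n, n))  # add-one smoothing
for a, b in zip(labels[:-1], labels[1:]):
    transitions[a, b] += 1
transitions /= transitions.sum(axis=1, keepdims=True)

# Sample a synthetic cycle: a sequence of cluster labels mapped back to typical pulses.
state, synthetic = labels[0], []
for _ in range(50):
    synthetic.append(kmeans.cluster_centers_[state])
    state = rng.choice(n, p=transitions[state])
print(np.array(synthetic)[:5])
```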
43

Berrett, Thomas Benjamin. "Modern k-nearest neighbour methods in entropy estimation, independence testing and classification." Thesis, University of Cambridge, 2017. https://www.repository.cam.ac.uk/handle/1810/267832.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Nearest neighbour methods are a classical approach in nonparametric statistics. The k-nearest neighbour classifier can be traced back to the seminal work of Fix and Hodges (1951) and they also enjoy popularity in many other problems including density estimation and regression. In this thesis we study their use in three different situations, providing new theoretical results on the performance of commonly-used nearest neighbour methods and proposing new procedures that are shown to outperform these existing methods in certain settings. The first problem we discuss is that of entropy estimation. Many statistical procedures, including goodness-of-fit tests and methods for independent component analysis, rely critically on the estimation of the entropy of a distribution. In this chapter, we seek entropy estimators that are efficient and achieve the local asymptotic minimax lower bound with respect to squared error loss. To this end, we study weighted averages of the estimators originally proposed by Kozachenko and Leonenko (1987), based on the k-nearest neighbour distances of a sample. A careful choice of weights enables us to obtain an efficient estimator in arbitrary dimensions, given sufficient smoothness, while the original unweighted estimator is typically only efficient in up to three dimensions. A related topic of study is the estimation of the mutual information between two random vectors, and its application to testing for independence. We propose tests for the two different situations of the marginal distributions being known or unknown and analyse their performance. Finally, we study the classical k-nearest neighbour classifier of Fix and Hodges (1951) and provide a new asymptotic expansion for its excess risk. We also show that, in certain situations, a new modification of the classifier that allows k to vary with the location of the test point can provide improvements. This has applications to the field of semi-supervised learning, where, in addition to labelled training data, we also have access to a large sample of unlabelled data.
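One common, unweighted form of the Kozachenko-Leonenko estimator mentioned above can be sketched as follows; this is a generic textbook version with assumed constants, not the weighted estimator developed in the thesis.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=3):
    """k-nearest-neighbour (Kozachenko-Leonenko type) estimate of differential entropy.

    x : array of shape (n, d). Uses the form
    H_hat = mean_i [ d*log(rho_i) + log(V_d) + log(n - 1) - psi(k) ],
    where rho_i is the distance from x_i to its k-th nearest neighbour and V_d is the
    volume of the d-dimensional unit ball.
    """
    n, d = x.shape
    rho = cKDTree(x).query(x, k=k + 1)[0][:, -1]  # k-th NN distance, excluding the point itself
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
    return np.mean(d * np.log(rho) + log_vd + np.log(n - 1) - digamma(k))

rng = np.random.default_rng(0)
sample = rng.normal(size=(5000, 2))
print(kl_entropy(sample, k=3))    # estimate
print(np.log(2 * np.pi * np.e))   # true entropy of a bivariate standard normal
```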
44

Balabdaoui, Fadoua. "Nonparametric estimation of a k-monotone density : a new asymptotic distribution theory /." Thesis, Connect to this title online; UW restricted, 2004. http://hdl.handle.net/1773/8964.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Gatz, Philip L. Jr. "A comparison of three prediction based methods of choosing the ridge regression parameter k." Thesis, Virginia Tech, 1985. http://hdl.handle.net/10919/45724.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
A solution to the regression model y = Xβ + ε is usually obtained using ordinary least squares. However, when the condition of multicollinearity exists among the regressor variables, many qualities of this solution deteriorate, including the variances, the length, the stability, and the prediction capabilities of the solution. An analysis called ridge regression introduced a solution to combat this deterioration (Hoerl and Kennard, 1970a). The method uses a solution biased by a parameter k. Many methods have been developed to determine an optimal value of k. This study chose to investigate three little-used methods of determining k: the PRESS statistic, Mallows' C_k statistic, and DF-trace. The study compared the prediction capabilities of the three methods using data that contained various levels of both collinearity and leverage. This was done by means of a Monte Carlo experiment.
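For concreteness, a small sketch of the ridge solution and a PRESS-style leave-one-out criterion over a grid of k values (illustrative only; the simulated collinear data and the hat-matrix shortcut are assumptions, not the thesis's simulation design):

```python
import numpy as np

def ridge_press(X, y, k):
    """Ridge fit beta = (X'X + kI)^{-1} X'y and its leave-one-out PRESS statistic,
    computed through the hat matrix of the linear smoother."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X + k * np.eye(p), X.T)  # ridge hat matrix
    resid = y - H @ y
    return np.sum((resid / (1.0 - np.diag(H))) ** 2)

rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
X = np.hstack([z + 0.05 * rng.normal(size=(100, 1)) for _ in range(4)])  # collinear regressors
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(size=100)

for k in (0.0, 0.01, 0.1, 1.0, 10.0):
    print(k, round(ridge_press(X, y, k), 2))
```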
Master of Science
46

Staake, Thorsten R. "IP traffic statistics : a Markovian approach." Link to electronic thesis, 2002. http://www.wpi.edu/Pubs/ETD/Available/etd-0429102-123525.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Czudek, Marek. "Detekce síťových anomálií na základě NetFlow dat." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2013. http://www.nusl.cz/ntk/nusl-235461.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This thesis describes the use of NetFlow data in systems for the detection of disruptions or anomalies in computer network traffic. Various methods for network data collection are described, focusing especially on the NetFlow protocol. Further, various methods for anomaly detection in network traffic are discussed and evaluated, and their advantages as well as disadvantages are listed. Based on this analysis, one method is chosen, and a test data set is analyzed using it. An algorithm for real-time network traffic anomaly detection is designed based on the analysis outcomes. This method was chosen mainly because it enables the detection of anomalies even in unlabelled network traffic. The last part of the thesis describes the implementation of the algorithm, as well as experiments performed using the resulting application on real NetFlow data.
48

Kolar, Michal. "Statistical Physics and Message Passing Algorithms. Two Case Studies: MAX-K-SAT Problem and Protein Flexibility." Doctoral thesis, SISSA, 2006. http://hdl.handle.net/20.500.11767/4659.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In the last decades the theory of spin glasses has been developed within the framework of statistical physics. The results obtained proved to be novel not only from the physical point of view, but they have also brought new mathematical techniques and algorithmic approaches. Indeed, the problem of finding the ground state of a spin glass is (in general) NP-complete. The methods that were found brought new ideas to the field of Combinatorial Optimization and, on the other side, similar methods of Combinatorial Optimization were applied to physical systems. As happened with Monte Carlo sampling and Simulated Annealing, the novel Cavity Method also led to algorithms that are open to wide use in various fields of research. The Cavity Method turns out to be equivalent to the Bethe Approximation in its most symmetric version, and the derived algorithm is equivalent to Belief Propagation, an inference method used widely, for example, in the field of Pattern Recognition. The Cavity Method in a less symmetric situation, when one has to account correctly for the clustering of the configuration space, led to a novel message-passing algorithm, Survey Propagation. The class of Message-Passing algorithms, to which both Belief Propagation and Survey Propagation belong, has found application as inference algorithms in many engineering fields. Among others, let us mention the Low-Density Parity-Check Codes, which are widely used as Error-Correcting Codes for communication over noisy channels. In the first part of this work we have compared the efficiency of the Survey Propagation algorithm and of standard heuristic algorithms in the case of the random MAX-K-SAT problem. The results showed that the algorithms perform similarly in the regions where the clustering of the configuration space does not appear, but that Survey Propagation finds much better solutions to the optimization problem in the critical region where one has to consider the existence of many ergodic components explicitly. The second part of the thesis targets the problem of protein structure and flexibility. In many proteins the mobility of certain regions and the rigidity of other regions of their structure is crucial for their function or interaction with other cellular elements. Our simple model tries to point out the flexible regions from knowledge of the native 3D structure of the protein. The problem is mapped to a spin glass model which is successfully solved by the Belief Propagation algorithm.
49

Gard, Rikard. "Design-based and Model-assisted estimators using Machine learning methods : Exploring the k-Nearest Neighbor metod applied to data from the Recreational Fishing Survey." Thesis, Örebro universitet, Handelshögskolan vid Örebro Universitet, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-72488.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Yan, Mingjin. "Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion." Diss., Virginia Tech, 2005. http://hdl.handle.net/10919/29957.

Full text
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In cluster analysis, a fundamental problem is to determine the best estimate of the number of clusters, which has a deterministic effect on the clustering results. However, a limitation in current applications is that no convincingly acceptable solution to the best-number-of-clusters problem is available, due to the high complexity of real data sets. In this dissertation, we tackle this problem of estimating the number of clusters, with particular attention to processing very complicated data which may contain multiple types of cluster structure. Two new methods of choosing the number of clusters are proposed which have been shown empirically to be highly effective given clear and distinct cluster structure in a data set. In addition, we propose a sequential type of clustering approach, called multi-layer clustering, by combining these two methods. Multi-layer clustering not only functions as an efficient method of estimating the number of clusters but also, by superimposing a sequential idea, improves the flexibility and effectiveness of any arbitrary existing one-layer clustering method. Empirical studies have shown that multi-layer clustering has higher efficiency than one-layer clustering approaches, especially in detecting clusters in complicated data sets. The multi-layer clustering approach has been successfully implemented in clustering the WTCHP microarray data and the results can be interpreted very well based on known biological knowledge. Choosing an appropriate clustering method is another critical step in clustering. K-means clustering is one of the most popular clustering techniques used in practice. However, the k-means method tends to generate clusters containing a nearly equal number of objects, which is referred to as the "equal-size" problem. We propose a clustering method which competes with the k-means method. Our newly defined method is aimed at overcoming the so-called "equal-size" problem associated with the k-means method, while maintaining its advantage of computational simplicity. Advantages of the proposed method over k-means clustering have been demonstrated empirically using simulated data with low dimensionality.
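As a simple illustration of the best-number-of-clusters problem (not the methods proposed in the dissertation; the data and the silhouette criterion are generic placeholders), one can scan candidate values of K with k-means and compare a validity index:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=0)  # placeholder data

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("estimated number of clusters:", best_k)  # should recover 4 for well-separated blobs
```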
Ph. D.