To see the other types of publications on this topic, follow the link: K-means clustering.

Dissertations / Theses on the topic 'K-means clustering'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'K-means clustering.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Buchta, Christian, Martin Kober, Ingo Feinerer, and Kurt Hornik. "Spherical k-Means Clustering." American Statistical Association, 2012. http://epub.wu.ac.at/4000/1/paper.pdf.

Full text
Abstract:
Clustering text documents is a fundamental task in modern data analysis, requiring approaches which perform well both in terms of solution quality and computational efficiency. Spherical k-means clustering is one approach to address both issues, employing cosine dissimilarities to perform prototype-based partitioning of term weight representations of the documents. This paper presents the theory underlying the standard spherical k-means problem and suitable extensions, and introduces the R extension package skmeans which provides a computational environment for spherical k-means clustering featuring several solvers: a fixed-point and genetic algorithm, and interfaces to two external solvers (CLUTO and Gmeans). Performance of these solvers is investigated by means of a large scale benchmark experiment. (authors' abstract)
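For a reader unfamiliar with the method, the core iteration of spherical k-means can be sketched in a few lines of Python. This is an illustrative sketch only, not the skmeans package described above; the data matrix X, the number of clusters k and the iteration count are placeholder assumptions, and the rows of X are assumed to be non-zero.

import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    # Work with unit-length rows so cosine similarity reduces to a dot product.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each document to the prototype with the highest cosine similarity.
        labels = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                centers[j] = m / np.linalg.norm(m)  # re-normalise the prototype
    return labels, centers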
APA, Harvard, Vancouver, ISO, and other styles
2

Musco, Cameron N. (Cameron Nicholas). "Dimensionality reduction for k-means clustering." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/101473.

Full text
Abstract:
Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 123-131).
In this thesis we study dimensionality reduction techniques for approximate k-means clustering. Given a large dataset, we consider how to quickly compress to a smaller dataset (a sketch), such that solving the k-means clustering problem on the sketch will give an approximately optimal solution on the original dataset. First, we provide an exposition of technical results of [CEM+15], which show that provably accurate dimensionality reduction is possible using common techniques such as principal component analysis, random projection, and random sampling. We next present empirical evaluations of dimensionality reduction techniques to supplement our theoretical results. We show that our dimensionality reduction algorithms, along with heuristics based on these algorithms, indeed perform well in practice. Finally, we discuss possible extensions of our work to neurally plausible algorithms for clustering and dimensionality reduction. This thesis is based on joint work with Michael Cohen, Samuel Elder, Nancy Lynch, Christopher Musco, and Madalina Persu.
by Cameron N. Musco.
S.M.
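As a rough illustration of the kind of pipeline studied in this thesis (not the author's code), the data can be compressed with principal component analysis before clustering; the placeholder data, the number of clusters and the number of retained components below are assumptions.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 200))   # stand-in for a large dataset
k = 10

# Sketch: project onto a small number of principal components, then cluster the sketch.
X_sketch = PCA(n_components=2 * k).fit_transform(X)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_sketch)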
APA, Harvard, Vancouver, ISO, and other styles
3

Persu, Elena-Mădălina. "Approximate k-means clustering through random projections." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/99847.

Full text
Abstract:
Thesis: S.M. in Computer Science and Engineering, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 39-41).
Using random row projections, we show how to approximate a data matrix A with a much smaller sketch Ã that can be used to solve a general class of constrained k-rank approximation problems to within (1 + ε) error. Importantly, this class of problems includes k-means clustering. By reducing data points to just O(k) dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For k-means dimensionality reduction, we provide (1 + ε) relative error results for random row projections which improve on the (2 + ε) prior known constant factor approximation associated with this sketching technique, while preserving the number of dimensions. For k-means clustering, we show how to achieve a (9 + ε) approximation by Johnson-Lindenstrauss projecting data points to just O(log k/ε²) dimensions. This gives the first result that leverages the specific structure of k-means to achieve dimension independent of input size and sublinear in k.
by Elena-Mădălina Persu.
S.M. in Computer Science and Engineering
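A minimal sketch of the Johnson-Lindenstrauss step described in the abstract, assuming a plain Gaussian random projection and placeholder data; it is illustrative only and not the author's implementation.

import numpy as np
from sklearn.cluster import KMeans

def jl_project(X, k=10, eps=0.5, seed=0):
    # Target dimension on the order of log(k) / eps^2.
    d = max(1, int(np.ceil(np.log(k) / eps ** 2)))
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], d)) / np.sqrt(d)  # scaled Gaussian projection
    return X @ R

X = np.random.default_rng(1).normal(size=(5_000, 300))   # placeholder data matrix
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(jl_project(X))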
APA, Harvard, Vancouver, ISO, and other styles
4

Xiang, Chongyuan. "Private k-means clustering : algorithms and applications." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/106394.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 77-80).
Today is a new era of big data. We contribute our personal data for the common good simply by using our smart phones, searching the web and doing online transactions. Researchers, companies and governments use the collected data to learn various user behavior patterns and make impactful decisions based on that. Is it possible to publish and run queries on those databases without disclosing information about any specific individual? Differential privacy is a strong notion of privacy which guarantees that very little will be learned about individual records in the database, no matter what the attackers already know or wish to learn. Still, there is no practical system applying differential privacy algorithms for clustering points on real databases. This thesis describes the construction of small coresets for computing k-means clustering of a set of points while preserving differential privacy. As a result, it gives the first k-means clustering algorithm that is both differentially private, and has an approximation error that depends sub-linearly on the data's dimension d. Previous results introduced errors that are exponential in d. This thesis implements this algorithm and uses it to create differentially private location data from GPS tracks. Specifically the algorithm allows clustering GPS databases generated from mobile nodes, while letting the user control the introduced noise due to privacy. This thesis also provides experimental results for the system and algorithms, and compares them to existing techniques. To the best of my knowledge, this is the first practical system that enables differentially private clustering on real data.
by Chongyuan Xiang.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
5

Nelson, Joshua. "On K-Means Clustering Using Mahalanobis Distance." Thesis, North Dakota State University, 2012. https://hdl.handle.net/10365/26766.

Full text
Abstract:
A problem that arises quite frequently in statistics is that of identifying groups, or clusters, of data within a population or sample. The most widely used procedure to identify clusters in a set of observations is known as K-Means. The main limitation of this algorithm is that it uses the Euclidean distance metric to assign points to clusters. Hence, this algorithm operates well only if the covariance structures of the clusters are nearly spherical and homogeneous in nature. To remedy this shortfall in the K-Means algorithm, the Mahalanobis distance metric was used to capture the variance structure of the clusters. The issue with using Mahalanobis distances is that the accuracy of the distance is sensitive to initialization. If this method serves as a significant improvement over its competitors, then it will provide a useful tool for analyzing clusters.
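A minimal Python sketch of the idea, assuming per-cluster covariance estimates and random initialization; it is illustrative only and not the thesis code.

import numpy as np

def mahalanobis_kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    covs = [np.eye(X.shape[1]) for _ in range(k)]   # start with spherical clusters
    for _ in range(n_iter):
        # Assignment step: squared Mahalanobis distance to each center.
        d = np.empty((len(X), k))
        for j in range(k):
            inv = np.linalg.pinv(covs[j])
            diff = X - centers[j]
            d[:, j] = np.einsum('ij,jk,ik->i', diff, inv, diff)
        labels = np.argmin(d, axis=1)
        # Update step: recompute each center and covariance from its members.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 1:
                centers[j] = members.mean(axis=0)
                covs[j] = np.cov(members, rowvar=False)
    return labels, centers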
APA, Harvard, Vancouver, ISO, and other styles
6

Li, Yanjun. "High Performance Text Document Clustering." Wright State University / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=wright1181005422.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

ELIASSON, PHILIP, and NIKLAS ROSÉN. "Efficient K-means clustering and the importance of seeding." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-134910.

Full text
Abstract:
Data clustering is the process of grouping data elements based on some aspect of similarity between the elements in the group. Clustering has many applications such as data compression, data mining, pattern recognition and machine learning and there are many different clustering methods. This paper examines the k-means method of clustering and how the choice of initial seeding affects the result. Lloyd’s algorithm is used as a base line and it is compared to an improved algorithm utilizing kd-trees. Two different methods of seeding are compared, random seeding and partial clustering seeding.
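For reference, the Lloyd's algorithm baseline with random seeding can be sketched as follows; the convergence test and iteration cap are assumptions of this illustration, not details taken from the thesis.

import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    # Random seeding: pick k distinct data points as the initial centers.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers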
APA, Harvard, Vancouver, ISO, and other styles
8

Kondo, Yumi. "Robustification of the sparse K-means clustering algorithm." Thesis, University of British Columbia, 2011. http://hdl.handle.net/2429/37093.

Full text
Abstract:
Searching a dataset for the "natural grouping / clustering" is an important exploratory technique for understanding complex multivariate datasets. One might expect that the true underlying clusters present in a dataset differ only with respect to a small fraction of the features. Furthermore, one might fear that the dataset contains potential outliers. Through simulation studies, we find that an existing sparse clustering method can be severely affected by a single outlier. In this thesis, we develop a robust clustering method that is also able to perform variable selection: we robustified sparse K-means (Witten and Tibshirani [28]), based on the idea of trimmed K-means introduced by Gordaliza [7] and Gordaliza [8]. Since high dimensional datasets often contain quite a few missing observations, we made our proposed method capable of handling datasets with missing values. The performance of the proposed robust sparse K-means is assessed in various simulation studies and two data analyses. The simulation studies show that robust sparse K-means performs better than other competing algorithms in terms of both the selection of features and the selection of a partition when datasets are contaminated. The analysis of a microarray dataset shows that robust sparse K-means best reflects the oestrogen receptor status of the patients among all competing algorithms. We also adapt Clest (Dudoit and Fridlyand [5]) to our robust sparse K-means to provide an automatic robust procedure for selecting the number of clusters. Our proposed methods are implemented in the R package RSKC.
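The trimming idea behind the robustification can be sketched as follows. This is a simplified illustration of trimmed K-means only, without the sparse feature weighting or missing-value handling of RSKC, and the trimming fraction alpha is a placeholder.

import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    n_keep = int(np.floor((1 - alpha) * len(X)))
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        nearest = d[np.arange(len(X)), labels]
        keep = np.argsort(nearest)[:n_keep]   # discard the alpha fraction farthest from their centers
        for j in range(k):
            members = X[keep][labels[keep] == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels, centers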
APA, Harvard, Vancouver, ISO, and other styles
9

Chowuraya, Tawanda. "Online content clustering using variant K-Means Algorithms." Thesis, Cape Peninsula University of Technology, 2019. http://hdl.handle.net/20.500.11838/3089.

Full text
Abstract:
Thesis (MTech)--Cape Peninsula University of Technology, 2019
We live at a time when a great deal of information is created, and much of it is redundant. There is a huge amount of online information in the form of news articles that discuss similar stories, and the number of articles is projected to grow. This growth makes it difficult for a person to process all that information in order to stay up to date on a subject, so there is a need for a solution that can organise this similar information into specific themes. The solution is a branch of Artificial Intelligence (AI) called machine learning (ML), using clustering algorithms: similar pieces of information are grouped into containers. When the information is clustered, people can be presented with information on their subject of interest grouped together, and the information in a group can be further processed into a summary. This research focuses on unsupervised learning. The literature indicates that K-Means is one of the most widely used unsupervised clustering algorithms; it is easy to learn, easy to implement and efficient. However, there is a horde of variations of K-Means. The research seeks to find a variant of K-Means that can be used, with acceptable performance, to cluster duplicate or similar news articles into correct semantic groups. The research is an experiment. News articles were collected from the internet using gocrawler, a program that takes Uniform Resource Locators (URLs) as an argument and collects a story from the website pointed to by each URL. The URLs are read from a repository. The stories come riddled with adverts and images from the web page; this is referred to as dirty text. The dirty text is sanitized, that is, cleaned by removing the adverts and images. The clean text is stored in a repository and serves as the input for the algorithm. The other input is the K value; all K-Means based variants take a K value that defines the number of clusters to be produced. The stories are manually classified and labelled in order to check the accuracy of the machine clustering: each story is labelled with the class to which it belongs. The data collection process itself was not unsupervised, but the algorithms used to cluster are totally unsupervised. A total of 45 stories were collected and 9 manual clusters were identified. Under each manual cluster there are sub-clusters of stories talking about one specific event. The performance of all the variants is compared to find the one with the best clustering results. Performance was checked by comparing the manual classification with the clustering results from the algorithm. Each K-Means variant is run on the same settings and the same dataset of 45 stories. The settings used are:
• Dimensionality of the feature vectors,
• Window size,
• Maximum distance between the current and predicted word in a sentence,
• Minimum word frequency,
• Specified range of words to ignore,
• Number of threads to train the model,
• The training algorithm, either distributed memory (PV-DM) or distributed bag of words (PV-DBOW),
• The initial learning rate (the learning rate decreases to a minimum alpha as training progresses),
• Number of iterations per cycle,
• Final learning rate,
• Number of clusters to form,
• The number of times the algorithm will be run,
• The method used for initialization.
The results obtained show that K-Means can perform better than K-Modes. The results are tabulated and presented in graphs in chapter six. Clustering can be improved by incorporating Named Entity Recognition (NER) into the K-Means algorithms. Results can also be improved by implementing a multi-stage clustering technique, where an initial clustering is performed and the resulting cluster groups are clustered further to achieve finer results.
APA, Harvard, Vancouver, ISO, and other styles
10

Li, Songzi. "K-groups: A Generalization of K-means by Energy Distance." Bowling Green State University / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1428583805.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Xie, Qing Yan. "K-Centers Dynamic Clustering Algorithms and Applications." University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1384427644.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

He, Gaojie. "Authoritative K-Means for Clustering of Web Search Results." Thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, 2010. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-11116.

Full text
Abstract:
Clustering is increasingly applied to hyperlinked documents, especially web search results. Although most commercial web search engines provide ranking algorithms that sort the matched results to raise the most relevant pages to the top, the result set is still so huge that many pages, including some that surfers are really interested in, will be discarded. Clustering of web search results separates unrelated pages and groups similar pages on the same topic together, thus helping surfers locate the pages they want much faster. Many features of web pages have been studied for use in clustering, such as content information including the title, snippet and anchor text. Hyperlinks are another primary feature of web pages, and some content-link coupled clustering methods have been studied. We propose an authoritative K-Means clustering method that combines content, in-links, out-links and PageRank. In this project, we adjust the construction of the in-link and out-link vectors and introduce a new PageRank vector with two patterns: one is a single-value representation of PageRank and the other is an 11-dimensional vector. We study the difference between these two types of PageRank in clustering, and compare clusterings based on different web page representations, such as content-based and content-link coupled. The effect of the different elements of a web page is also studied. We apply the authoritative clustering to web search results retrieved from the Google search engine. Three experiments are conducted and different evaluation metrics are adopted to analyze the results.
APA, Harvard, Vancouver, ISO, and other styles
13

Jankovsky, Zachary Kyle. "Clustering Analysis of Nuclear Proliferation Resistance Measures." The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1398354675.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Leisch, Friedrich. "Bagged clustering." SFB Adaptive Information Systems and Modelling in Economics and Management Science, WU Vienna University of Economics and Business, 1999. http://epub.wu.ac.at/1272/1/document.pdf.

Full text
Abstract:
A new ensemble method for cluster analysis is introduced, which can be interpreted in two different ways: as a complexity-reducing preprocessing stage for hierarchical clustering and as a combination procedure for several partitioning results. The basic idea is to locate and combine structurally stable cluster centers and/or prototypes. Random effects of the training set are reduced by repeatedly training on resampled sets (bootstrap samples). We discuss the algorithm both from a more theoretical and an applied point of view and demonstrate it on several data sets. (author's abstract)
Series: Working Papers SFB "Adaptive Information Systems and Modelling in Economics and Management Science"
APA, Harvard, Vancouver, ISO, and other styles
15

Zhao, Jianmin. "Optimal Clustering: Genetic Constrained K-Means and Linear Programming Algorithms." VCU Scholars Compass, 2006. http://hdl.handle.net/10156/1583.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Al-Guwaizani, Abdulrahman. "Variable neighbourhood search based heuristic for K-harmonic means clustering." Thesis, Brunel University, 2011. http://bura.brunel.ac.uk/handle/2438/5827.

Full text
Abstract:
Although there has been rapid development of technology and an increase in computation speed, most real-world optimization problems still cannot be solved in a reasonable time. Sometimes it is impossible for them to be solved optimally, as there are many instances of real problems which cannot be addressed by computers at their present speed. In such cases, a heuristic approach can be used; heuristics have been used by many researchers to meet this need, giving a sufficiently good solution in reasonable time. The clustering problem, which arises in many applications, is one example. In this thesis, I suggest a Variable Neighbourhood Search (VNS) to improve a recent clustering local search called K-Harmonic Means (KHM). Many experiments are presented to show the strength of my code compared with some algorithms from the literature. Some counter-examples are introduced to show that KHM may degenerate entirely, in one or more runs. Furthermore, it degenerates and then stops on some familiar datasets, which significantly affects the final solution. Hence, I present code that removes this degeneracy from KHM. I also apply VNS to improve the KHM code after the degeneracy has been removed.
APA, Harvard, Vancouver, ISO, and other styles
17

Salman, Raied. "CONTRIBUTIONS TO K-MEANS CLUSTERING AND REGRESSION VIA CLASSIFICATION ALGORITHMS." VCU Scholars Compass, 2012. http://scholarscompass.vcu.edu/etd/2738.

Full text
Abstract:
The dissertation deals with clustering algorithms and transforming regression problems into classification problems. The main contributions of the dissertation are twofold; first, to improve (speed up) the clustering algorithms and second, to develop a strict learning environment for solving regression problems as classification tasks by using support vector machines (SVMs). An extension to the most popular unsupervised clustering method, the k-means algorithm, is proposed, dubbed the k-means2 (k-means squared) algorithm, applicable to ultra large datasets. The main idea is based on using a small portion of the dataset in the first stage of the clustering. Thus, the centers of such a smaller dataset are computed much faster than if computing the centers based on the whole dataset. These final centers of the first stage are naturally much closer to the locations of the final centers, rendering a great reduction in the total computational cost. For large datasets the speed up in computation exhibited a trend which is shown to be high and rising with the increase in the size of the dataset. The total transient time for the fast stage was found to depend largely on the portion of the dataset selected in the stage. For medium size datasets it has been shown that an 8-10% portion of data used in the fast stage is a reasonable choice. The centers of the 8-10% samples computed during the fast stage may oscillate towards the final centers' positions of the fast stage along the centers' movement path. The slow stage will start with the final centers of the fast phase and the paths of the centers in the second stage will be much shorter than the ones of a classic k-means algorithm. Additionally, the oscillations of the slow stage centers' trajectories along the path to the final centers' positions are also greatly minimized. In the second part of the dissertation, a novel approach of posing the solution of regression problems as multiclass classification tasks within the common framework of kernel machines is proposed. Based on such an approach both the nonlinear (NL) regression problems and NL multiclass classification tasks will be solved as multiclass classification problems by using SVMs. The accuracy of an approximating classification (hyper)surface (averaged over several benchmarking data sets used in this study) to the data points over a given high-dimensional input space created by a nonlinear multiclass classifier is slightly superior to the solution obtained by a regression (hyper)surface. In terms of the CPU time needed for training (i.e. for tuning the hyperparameters of the models), the nonlinear SVM classifier also shows significant advantages. Here, the comparisons between the solutions obtained by an SVM solving a given regression problem as a classic SVM regressor and as an SVM classifier have been performed. In order to transform a regression problem into a classification task, four possible discretizations of a continuous output (target) vector y are introduced and compared. A very strict double (nested) cross-validation technique has been used for measuring the performances of regression and multiclass classification SVMs. In order to carry out fair comparisons, SVMs are used for solving both tasks - regression and multiclass classification. The readily available and most popular benchmarking SVM tool, LibSVM, was used in all experiments.
The results in solving twelve benchmarking regression tasks shown here will present SVM regression and classification algorithms as strongly competing models where each approach shows merits for a specific class of high-dimensional function approximation problems.
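The two-stage idea, clustering a small portion of the data first and then starting the full run from those centers, might be sketched like this with scikit-learn; the sample fraction and other settings are placeholder assumptions, not the author's implementation.

import numpy as np
from sklearn.cluster import KMeans

def two_stage_kmeans(X, k, fraction=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Fast stage: cluster a small random portion of the dataset.
    sample = X[rng.choice(len(X), size=max(k, int(fraction * len(X))), replace=False)]
    fast = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(sample)
    # Slow stage: start the full run from the fast-stage centers.
    slow = KMeans(n_clusters=k, init=fast.cluster_centers_, n_init=1,
                  random_state=seed).fit(X)
    return slow.labels_, slow.cluster_centers_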
APA, Harvard, Vancouver, ISO, and other styles
18

Hong, Sui. "Experiments with K-Means, Fuzzy c-Means and Approaches to Choose K and C." Honors in the Major Thesis, University of Central Florida, 2006. http://digital.library.ucf.edu/cdm/ref/collection/ETH/id/1224.

Full text
Abstract:
This item is only available in print in the UCF Libraries. If this is your Honors Thesis, you can help us make it available online for use by researchers around the world by following the instructions on the distribution consent form at http://library.ucf
Bachelors
Engineering and Computer Science
Computer Engineering
APA, Harvard, Vancouver, ISO, and other styles
19

Hinz, Joel. "Clustering the Web : Comparing Clustering Methods in Swedish." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-95228.

Full text
Abstract:
Clustering -- automatically sorting -- web search results has been the focus of much attention but is by no means a solved problem, and there is little previous work in Swedish. This thesis studies the performance of three clustering algorithms -- k-means, agglomerative hierarchical clustering, and bisecting k-means -- on a total of 32 corpora, as well as whether clustering web search previews, called snippets, instead of full texts can achieve reasonably decent results. Four internal evaluation metrics are used to assess the data. Results indicate that k-means performs worse than the other two algorithms, and that snippets may be good enough to use in an actual product, although there is ample opportunity for further research on both issues; however, results are inconclusive regarding bisecting k-means vis-à-vis agglomerative hierarchical clustering. Stop word and stemmer usage results are not significant, and appear to not affect the clustering by any considerable magnitude.
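Of the three algorithms compared, bisecting k-means is the least standard; a minimal sketch is shown below. It is illustrative only and splits the largest cluster at each step, which is one of several common splitting rules and not necessarily the one used in the thesis.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < k:
        # Pick the currently largest cluster and split it with 2-means.
        idx = np.where(labels == np.bincount(labels).argmax())[0]
        if len(idx) < 2:
            break
        sub = KMeans(n_clusters=2, n_init=5, random_state=seed).fit_predict(X[idx])
        labels[idx[sub == 1]] = labels.max() + 1   # one half keeps its id, the other gets a new one
    return labels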
APA, Harvard, Vancouver, ISO, and other styles
20

CALENDER, CHRISTOPHER R. "APPROXIMATE N-NEAREST NEIGHBOR CLUSTERING ON DISTRIBUTED DATABASES USING ITERATIVE REFINEMENT." University of Cincinnati / OhioLINK, 2004. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1092929952.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Thirathon, Nattavude 1980. "Cyclic exchange neighborhood search technique for the K-means clustering problem." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/17981.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.
Includes bibliographical references (p. 151-152).
Cyclic Exchange is an application of the cyclic transfers neighborhood search technique to the k-means clustering problem. Neighbors of a feasible solution are obtained by moving points between clusters in a cycle. This method attempts to improve local minima obtained by the well-known Lloyd's algorithm. Although the results did not establish the usefulness of Cyclic Exchange, our experiments reveal some insights into k-means clustering and Lloyd's algorithm. While Lloyd's algorithm finds the best local optimum within a thousand iterations for most datasets, it repeatedly finds better local minima after several thousand iterations for some other datasets. For the latter case, Cyclic Exchange also finds better solutions than Lloyd's algorithm. Although we are unable to identify the features that lead Cyclic Exchange to perform better, our results verify the robustness of Lloyd's algorithm on most datasets.
by Nattavude Thirathon.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
22

Van, Tilburg Ken. "Identifying boosted objects with N-subjettiness and linear k-means clustering." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/65536.

Full text
Abstract:
Thesis (S.B.)--Massachusetts Institute of Technology, Dept. of Physics; and, (S.B.)--Massachusetts Institute of Technology, Dept. of Mathematics, 2011.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 57-59).
In this thesis, I explore aspects of a new jet shape - N-subjettiness - designed to identify boosted hadronically-decaying objects (with a particular focus on tagging top quarks) at particle accelerators such as the Large Hadron Collider. Combined with an invariant mass cut on jets, N-subjettiness is a powerful discriminating variable for tagging boosted objects such as top quarks and rejecting the fake background of QCD jets with large invariant mass. In a crossover analysis, the N-subjettiness method is found to outperform the common top tagging methods of the BOOST2010 conference, with top tagging efficiencies of 50% and 20% against mistag rates of 4.0% and 0.19%, respectively. The N-subjettiness values are calculated using a new infrared- and collinear-safe minimization procedure which I call the linear k-means clustering algorithm. As a true jet shape with highly effective tagging performances, N-subjettiness has many advantages on the experimental as well as on the theoretical side.
by Ken Van Tilburg.
S.B.
APA, Harvard, Vancouver, ISO, and other styles
23

Soheily-Khah, Saeid. "Generalized k-means-based clustering for temporal data under time warp." Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM064/document.

Full text
Abstract:
Temporal alignment of multiple time series is an important unresolved problem in many scientific disciplines. Major challenges for an accurate temporal alignment include determining and modeling the common and differential characteristics of classes of time series. This thesis is motivated by recent works in extending Dynamic Time Warping for aligning multiple time series from several applications including speech recognition, curve matching, micro-array data analysis, temporal segmentation and human motion analysis. However, these DTW-based works suffer from several limitations: 1) they address the problem of aligning two time series regardless of the remaining time series, 2) they involve the features of the multiple time series uniformly, 3) the time series are aligned globally by including all the observations. The aim of this thesis is to explore a generalized dynamic time warping for time series clustering. This work first addresses the problem of prototype extraction, then the alignment of multiple and multidimensional time series.
APA, Harvard, Vancouver, ISO, and other styles
24

Stanforth, Robert William. "Extending K-Means clustering for analysis of quantitative structure activity relationships (QSAR)." Thesis, Birkbeck (University of London), 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.500005.

Full text
Abstract:
A Quantitative Structure-Activity Relationship (QSAR) study is an attempt to model some biological activity over a collection of chemical compounds in terms of their structural properties. A QSAR model may be constructed through (typically linear) multivariate regression analysis of the biological activity data against a number of features or 'descriptors' of chemical structure. As with any regression model, there are a number of issues emerging in real applications, including (a) domain of applicability of the model, (b) validation of the model within its domain of applicability, and (c) possible non-linearity of the QSAR. Unfortunately the existing methods commonly used in QSAR for overcoming these issues all suffer from problems such as computational inefficiency and poor treatment of non-linearity. In practice this often results in the omission of proper analysis of them altogether. In this thesis we develop methods for tackling the issues listed above using K-means clustering. Specifically, we model the shape of a dataset in terms of intelligent K-means clustering results and use this to develop a non-parametric estimate for the domain of applicability of a QSAR model. Next we propose a 'hybrid' variant of K-means, incorporating a regression-wise element, which engenders a technique for non-linear QSAR modelling. Finally we demonstrate how to partition a dataset into training and testing subsets, using the K-means clustering to ensure that the partitioning respects the overall distribution. Our experiments involving real QSAR data confirm the effectiveness of the methods developed in the project.
APA, Harvard, Vancouver, ISO, and other styles
25

Malheiros, Larinni. "Detecção de posição e quedas corporais baseado em K-means clustering eThreshold." reponame:Repositório Institucional da UnB, 2017. http://repositorio.unb.br/handle/10482/31978.

Full text
Abstract:
Dissertação (mestrado)—Universidade de Brasília, Faculdade de Tecnologia, Departamento de Engenharia Elétrica, 2017.
Falls among the elderly are a public health issue all over the world, and the subject has been a target of research and technological development aimed at mitigating the physical and psychological consequences for these people and their families. In 2017, 15.7% of the elderly in Brazil lived alone, according to [1]. There are several hypotheses to explain this trend, among them the desire for autonomy and the dispersion and fragmentation of families, with many children living far from their parents. In this context, this work presents a device capable of helping to monitor the elderly in their activities, especially domestic ones. The theoretical foundations for the development of the device are presented, covering every phase of development, from the installation of the hardware to the development of the algorithms used to process the information. The challenges encountered throughout this work were precision and suitability. The precision of the device is divided into sensitivity and specificity, both parameters used to determine the accuracy of the system; the challenge was to evaluate whether the accuracy of the device is sufficient to provide the reliability required for body position and fall detection applications. In addition, the device must adapt to the physical characteristics of the patient who uses it, since variables such as height, weight and age influence the prediction result. The performance of the device is evaluated in several scenarios and in real-world use, and the results are compared with those of a previous undergraduate project [2]. A methodology based on machine learning is presented to predict static positions (sitting, lying and standing), and thresholds are used to determine dynamic positions (walking and falling). Information about these positions indicates whether the patient has fallen, a situation that must be addressed immediately by the caregiver. The machine learning algorithm used is K-Means Clustering, which yields the static position assumed by the patient, and a series of threshold-based decision conditions is used to detect dynamic positions such as walking and falling. The MPU6050 sensor is used to collect the information and a Raspberry Pi is used to process and present the data, which are displayed in an Android and Web application so that caregivers can monitor the elderly. As a result of this work, it was observed that body position and fall detection using machine learning for static positions gives reliable results for the lying position but is statistically weaker at differentiating movements such as sitting and standing. Regarding dynamic movements, it was found that they can be differentiated using parameters such as linear regression and the area of the integral between the point of greatest amplitude and the remaining values of the data vector obtained from the MPU6050 sensor.
Falls among the elderly are a health issue all over the world. The matter has been researched and developed in the technology field with the goal of reducing the physical and psychological consequences for the elderly and their families. There are some hypotheses to explain this trend, among them the desire for independence and the dispersion and fragmentation of families, with sons and daughters living away from their parents. In this context, this work presents a device capable of assisting with and monitoring the activities of the elderly, especially domestic activities. The work uses a machine learning approach to predict static body positions (standing, lying and sitting) and thresholds to identify dynamic body positions (walking and falling). The machine learning algorithm used to detect static positions is K-Means Clustering, and a series of threshold-based decision conditions detects dynamic movements such as walking and falling. An MPU6050 sensor is used to collect the information, and a Raspberry Pi is used to process and present the data. As a result of this work, it is possible to conclude that fall and position detection using machine learning for static positions gives reliable results for the lying position but lower accuracy when differentiating the sitting and standing positions. Dynamic movements can be differentiated through linear regression and by computing the integral of the vector obtained from the MPU6050 sensor.
APA, Harvard, Vancouver, ISO, and other styles
26

Groth, Gerson Eduardo. "Attribute field K-means : clustering trajectories with attribute by fitting multiple fields." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2016. http://hdl.handle.net/10183/150038.

Full text
Abstract:
The amount of high-dimensional trajectory data and its increasing complexity impose a challenge when visualizing and analysing this information. Trajectory visualization must deal with changes in both the space and time dimensions, but the attributes of each trajectory may provide insights about its behavior and important aspects, so they should not be neglected. In this work, we tackle this problem by interpreting multivariate time series as attribute-rich trajectories in a configuration space that encodes an explicit relationship among the time series variables. We propose a novel trajectory-clustering technique called Attribute Field k-means (AFKM). It uses a dynamic configuration space to generate clusters based on attributes and parameters set by the user. Furthermore, by incorporating a sketching-based interface, our approach is capable of finding clusters that approximate the input sketches. In addition, we developed a prototype to explore the trajectories and clusters generated by AFKM in an interactive manner. Our results on synthetic and real time series datasets prove the efficiency and visualization power of our approach.
APA, Harvard, Vancouver, ISO, and other styles
27

Dineff, Dimitris. "Clustering using k-means algorithm in multivariate dependent models with factor structure." Thesis, Uppsala universitet, Tillämpad matematik och statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-429528.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Ramler, Ivan Peter. "Improved statistical methods for k-means clustering of noisy and directional data." [Ames, Iowa : Iowa State University], 2008.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
29

Ranby, Erik. "A comparison of clustering techniques for short social text messages." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-196735.

Full text
Abstract:
The amount of social text messages authored each day is huge and the information contained within is potentially very valuable. Software that can cluster and thereby help analyze these messages would consequently be helpful. This thesis explores several ways of clustering social text messages. Two algorithms and several setups with these algorithms have been tested and evaluated with the same data as input. Based on these evaluations, a comparison has been conducted in order to answer the question of which algorithm setup is best suited for the task. The two clustering algorithms that have been the main subjects of the comparison are K-means and agglomerative hierarchical clustering. All setups were run with 3-grams as well as with only single words as features. The evaluation measures used were intra-cluster distance, inter-cluster distance and silhouette value. Intra-cluster distance is the distance between data points in the same cluster, while inter-cluster distance is the distance between the clusters. Silhouette value is another, more general, evaluation measure that is often used to estimate the quality of a clustering. The results showed that if running time is a high priority, using K-means without 3-grams is preferred. On the other hand, if the quality of the clusters is important and performance is less so, introducing 3-grams together with either of the two algorithms will suit your needs better.
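The three evaluation measures mentioned can be computed roughly as follows; this is an illustrative sketch on placeholder data rather than on the thesis's message corpus.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(500, 20))   # stand-in for the message vectors
k = 5
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette value over all points: higher indicates a better clustering.
print("silhouette:", silhouette_score(X, labels))

# Intra-cluster distance: average distance from each point to its own cluster mean.
centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
intra = np.mean(np.linalg.norm(X - centers[labels], axis=1))
# Inter-cluster distance: average pairwise distance between cluster means.
inter = np.mean([np.linalg.norm(centers[i] - centers[j])
                 for i in range(k) for j in range(i + 1, k)])
print("intra:", intra, "inter:", inter)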
APA, Harvard, Vancouver, ISO, and other styles
30

Mayer-Jochimsen, Morgan. "Clustering Methods and Their Applications to Adolescent Healthcare Data." Scholarship @ Claremont, 2013. http://scholarship.claremont.edu/scripps_theses/297.

Full text
Abstract:
Clustering is a mathematical method of data analysis which identifies trends in data by efficiently separating data into a specified number of clusters, so it is incredibly useful and widely applicable for questions of interrelatedness of data. Two methods of clustering are considered here. K-means clustering defines clusters in relation to the centroid, or center, of a cluster. Spectral clustering establishes connections between all of the data points to be clustered, then eliminates those connections that link dissimilar points. This is represented as an eigenvector problem where the solution is given by the eigenvectors of the Normalized Graph Laplacian. Spectral clustering establishes groups so that the similarity between points of the same cluster is stronger than the similarity between different clusters. K-means and spectral clustering are used to analyze adolescent data from the 2009 California Health Interview Survey. Differences were observed between the results of the clustering methods on 3294 individuals and 22 health-related attributes. K-means clustered the adolescents by exercise, poverty, and variables related to psychological health, while the spectral clustering groups were informed by smoking, alcohol use, low exercise, psychological distress, low parental involvement, and poverty. We posit some guesses as to this difference, observe characteristics of the clustering methods, and comment on the viability of spectral clustering on healthcare data.
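A compact sketch of the normalized-Laplacian spectral clustering described above; the Gaussian similarity graph and its bandwidth are assumptions of this illustration and are not taken from the thesis.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0, seed=0):
    # Similarity graph: Gaussian kernel on pairwise squared distances, no self-loops.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized graph Laplacian L = I - D^(-1/2) W D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    # Embed each point using the eigenvectors of the k smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # row-normalise the embedding
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(U)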
APA, Harvard, Vancouver, ISO, and other styles
31

Narreddy, Naga Sambu Reddy, and Tuğrul Durgun. "Clusters (k) Identification without Triangle Inequality : A newly modelled theory." Thesis, Uppsala universitet, Institutionen för informatik och media, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-183608.

Full text
Abstract:
Cluster analysis organizes data that are similar enough, and useful, into meaningful groups (clusters). For example, cluster analysis can be applied to find groups of genes and proteins that are similar, to retrieve information from the World Wide Web, and to identify locations that are prone to earthquakes. The study of clustering has therefore become very important in several fields, including psychology and other social sciences, biology, statistics, pattern recognition, information retrieval, machine learning and data mining [1] [2]. Cluster analysis is one of the most widely used techniques in the area of data mining. Depending on the complexity and amount of data in a system, we can use a variety of cluster analysis algorithms. K-means clustering is one of the most popular, and is among the top ten algorithms in data mining [3]. Like other clustering algorithms, it is not a silver bullet: K-means clustering requires prior analysis and knowledge before the number of clusters and their centroids can be determined. Recent studies show a new approach to K-means clustering which does not require any prior knowledge for determining the number of clusters [4]. In this thesis, we propose a new clustering procedure to solve the central problem of identifying the number of clusters (k) by imitating the desired number of clusters with proper properties. The proposed algorithm is validated by investigating different characteristics of the analyzed data with the modified theory, analyzing parameter efficiency and their relationships. The parameters in this theory include the selection of embryo size (m), significance level (α), distributions (d), and training set (n), in the identification of clusters (k).
APA, Harvard, Vancouver, ISO, and other styles
32

Reanier, Richard Eugene. "Refinements to K-means clustering : spatial analysis of the Bateman site, arctic Alaska /." Thesis, Connect to this title online; UW restricted, 1992. http://hdl.handle.net/1773/6420.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Camara, Assa. "Využití fuzzy množin ve shlukové analýze se zaměřením na metodu Fuzzy C-means Clustering." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2020. http://www.nusl.cz/ntk/nusl-417051.

Full text
Abstract:
This master thesis deals with cluster analysis, more specifically with clustering methods that use fuzzy sets. Basic clustering algorithms and the necessary multivariate transformations are described in the first chapter. In the practical part, which is in the third chapter, we apply fuzzy c-means clustering and k-means clustering to real data. The data used for clustering are the inputs of the chemical transport model CMAQ, which is used to approximate the concentration of air pollutants in the atmosphere. We apply two different clustering methods to the data and use two different methods to select the optimal weighting exponent for finding the structure in the data. We compared all three resulting data structures; they resembled each other, but with fuzzy c-means clustering one of the clusters did not resemble any of the clustering inputs. The end of the third chapter is dedicated to an attempt to find a regression model that captures the relationship between the inputs and outputs of the CMAQ model.
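For reference, the fuzzy c-means updates with weighting exponent m (the fuzzifier whose choice the thesis examines) can be sketched as follows; the inputs are placeholders and this is not the thesis code.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial membership matrix U with rows summing to one.
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]         # membership-weighted centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2 / (m - 1)).
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    return U, centers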
APA, Harvard, Vancouver, ISO, and other styles
34

Chahine, Firas Safwan. "A Genetic Algorithm that Exchanges Neighboring Centers for Fuzzy c-Means Clustering." NSUWorks, 2012. http://nsuworks.nova.edu/gscis_etd/116.

Full text
Abstract:
Clustering algorithms are widely used in pattern recognition and data mining applications. Due to their computational efficiency, partitional clustering algorithms are better suited for applications with large datasets than hierarchical clustering algorithms. K-means is among the most popular partitional clustering algorithms, but has a major shortcoming: it is extremely sensitive to the choice of initial centers used to seed the algorithm. Unless k-means is carefully initialized, it converges to an inferior local optimum and results in poor quality partitions. Developing improved methods for selecting initial centers for k-means is an active area of research. Genetic algorithms (GAs) have been successfully used to evolve a good set of initial centers. Among the most promising GA-based methods are those that exchange neighboring centers between candidate partitions in their crossover operations. K-means is best suited to work when datasets have well-separated non-overlapping clusters. Fuzzy c-means (FCM) is a popular variant of k-means that is designed for applications when clusters are less well-defined. Rather than assigning each point to a unique cluster, FCM determines the degree to which each point belongs to a cluster. Like k-means, FCM is also extremely sensitive to the choice of initial centers. Building on GA-based methods for initial center selection for k-means, this dissertation developed an evolutionary program for center selection in FCM called FCMGA. The proposed algorithm utilized region-based crossover and other mechanisms to improve the GA. To evaluate the effectiveness of FCMGA, three independent experiments were conducted using real and simulated datasets. The results from the experiments demonstrate the effectiveness and consistency of the proposed algorithm in identifying better quality solutions than extant methods. Moreover, the results confirmed the effectiveness of region-based crossover in enhancing the search process for the GA and the convergence speed of FCM. Taken together, findings in these experiments illustrate that FCMGA was successful in solving the problem of initial center selection in partitional clustering algorithms.
APA, Harvard, Vancouver, ISO, and other styles
35

Karatzoglou, Alexandros, and Ingo Feinerer. "Text Clustering with String Kernels in R." Department of Statistics and Mathematics, WU Vienna University of Economics and Business, 2006. http://epub.wu.ac.at/1002/1/document.pdf.

Full text
Abstract:
We present a package which provides a general framework, including tools and algorithms, for text mining in R using the S4 class system. Using this package and the kernlab R package we explore the use of kernel methods for clustering (e.g., kernel k-means and spectral clustering) on a set of text documents, using string kernels. We compare these methods to a more traditional clustering technique like k-means on a bag of word representation of the text and evaluate the viability of kernel-based methods as a text clustering technique. (author's abstract)
Series: Research Report Series / Department of Statistics and Mathematics
APA, Harvard, Vancouver, ISO, and other styles
36

Zhou, Dunke. "High-dimensional Data Clustering and Statistical Analysis of Clustering-based Data Summarization Products." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1338303646.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Krajčír, Martin. "Internetové souřadnicové systémy." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2009. http://www.nusl.cz/ntk/nusl-218186.

Full text
Abstract:
A network coordinates (NC) system is an efficient mechanism for predicting Internet distances with a limited number of measurements. This work focuses on a distributed coordinate system which is evaluated by its relative error. Based on experimental results from a simulated application, an original algorithm for computing network coordinates was created. The algorithm was tested using a simulated network as well as RTT values from the PlanetLab network. Experiments show that clustered nodes achieve good synthetic coordinates even with a limited number of connections between nodes. This work proposes an implementation of its own NC system in a network with hierarchical aggregation. The created application was published on the research projects web page of the Department of Telecommunications.
APA, Harvard, Vancouver, ISO, and other styles
38

Dsouza, Jeevan. "Region-based Crossover for Clustering Problems." NSUWorks, 2012. http://nsuworks.nova.edu/gscis_etd/139.

Full text
Abstract:
Data clustering, which partitions data points into clusters, has many useful applications in economics, science and engineering. Data clustering algorithms can be partitional or hierarchical. The k-means algorithm is the most widely used partitional clustering algorithm because of its simplicity and efficiency. One problem with the k-means algorithm is that the quality of the partitions produced is highly dependent on the initial selection of centers. This problem has been tackled using genetic algorithms (GAs), where a set of centers is encoded into an individual of a population and solutions are generated using evolutionary operators such as crossover, mutation and selection. Of the many GA methods, the region-based genetic algorithm (RBGA) has proven to be an effective technique when the centroid is used as the representative object of a cluster (ROC) and the Euclidean distance is used as the distance metric. The RBGA uses a region-based crossover operator that exchanges subsets of centers that belong to a region of space rather than exchanging random centers. The rationale is that subsets of centers that occupy a given region of space tend to serve as building blocks; exchanging such centers preserves and propagates high-quality partial solutions. This research aims at assessing the RBGA with a variety of ROCs and distance metrics. The RBGA was tested, along with other GA methods, on four benchmark datasets using four distance metrics, varied numbers of centers, and both centroids and medoids as ROCs. The results showed the superior performance of the RBGA across all datasets and parameter settings, indicating that region-based crossover may prove an effective strategy across a broad range of clustering problems.
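A hedged sketch of the region-based idea, exchanging the centers that fall in a randomly chosen half-space between two parent solutions, is given below; the cut rule and the repair step are illustrative assumptions rather than the RBGA's exact operator.

```python
import numpy as np

def region_based_crossover(parent_a, parent_b, rng=np.random.default_rng()):
    """parent_a, parent_b: arrays of shape (k, d) holding cluster centers.
    Swap the centers lying in a randomly chosen half-space (a simple 'region')."""
    k, d = parent_a.shape
    dim = rng.integers(0, d)                       # pick an axis
    cut = rng.uniform(min(parent_a[:, dim].min(), parent_b[:, dim].min()),
                      max(parent_a[:, dim].max(), parent_b[:, dim].max()))
    in_region_a = parent_a[:, dim] <= cut
    in_region_b = parent_b[:, dim] <= cut
    # child keeps A's centers outside the region and takes B's centers inside it
    child = np.vstack([parent_a[~in_region_a], parent_b[in_region_b]])
    # repair so the child has exactly k centers (a crude repair step)
    if len(child) > k:
        child = child[:k]
    elif len(child) < k:
        pad = parent_a[in_region_a][: k - len(child)]
        child = np.vstack([child, pad])
    return child

a = np.random.rand(5, 2)
b = np.random.rand(5, 2)
print(region_based_crossover(a, b).shape)
```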
APA, Harvard, Vancouver, ISO, and other styles
39

Dunkel, Christopher T. "Person detection and tracking using binocular Lucas-Kanade feature tracking and k-means clustering." Connect to this title online, 2008. http://etd.lib.clemson.edu/documents/1219850371/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Liu, Meng-Ting, and 劉孟庭. "A study of k-means clustering." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/11948168797031986321.

Full text
Abstract:
Master's thesis
Chaoyang University of Technology
Department of Information Management, Master's Program
97
Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster have similar traits; similarity can be assessed with a distance measure or by the number of nearest-neighbor points. Clustering is a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. There are two major parts in this thesis. The first part applies K-means clustering to food image segmentation; the second part applies K-means clustering to general data. In both parts, we modify the K-means clustering algorithm to support these tasks. In the first part, we demonstrate that our method can segment a food image into enough clusters for the food grading process. In the second part, we provide a heuristic approach to K-means clustering in which the initial centers are chosen by the proposed algorithm instead of being selected randomly, and a statistical approach is then used to choose a suitable number of clusters. The experiments show that our proposed algorithm helps the clustering process.
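The abstract does not spell out the seeding heuristic or the statistic used to pick the number of clusters, so the following Python sketch is only one plausible reading of the idea: farthest-point seeding plus a simple elbow-style scan of the within-cluster sum of squares computed from the seeds (before any Lloyd iterations).

```python
import numpy as np

def farthest_point_seeds(X, k):
    """Non-random seeding: start from the point closest to the data mean,
    then repeatedly add the point farthest from all chosen seeds."""
    seeds = [int(np.argmin(((X - X.mean(axis=0)) ** 2).sum(axis=1)))]
    for _ in range(k - 1):
        d = np.min(((X[:, None] - X[seeds]) ** 2).sum(-1), axis=1)
        seeds.append(int(np.argmax(d)))
    return X[seeds]

def wcss(X, centers):
    """Within-cluster sum of squares for a given set of centers."""
    d = ((X[:, None] - centers[None]) ** 2).sum(-1)
    return d.min(axis=1).sum()

X = np.random.rand(300, 2)
# elbow-style scan over candidate k (a simple stand-in for the thesis's statistic)
for k in range(2, 7):
    print(k, round(wcss(X, farthest_point_seeds(X, k)), 3))
```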
APA, Harvard, Vancouver, ISO, and other styles
41

Chiu, Chao-Wei, and 邱兆偉. "GPU-Accelerated K-Means Image Clustering." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/02981949068777197122.

Full text
Abstract:
Master's thesis
National Chung Hsing University
Department of Civil Engineering
102
K-Means clustering has been a widely used approach for unsupervised classification of remotely sensed images. Owing to recent developments in Graphics Processing Units (GPUs), the computing performance and memory bandwidth of GPUs have become much higher than those of Central Processing Units (CPUs), so it is natural to accelerate K-Means clustering with parallel computing on GPUs. This research aims at developing a GPU-optimized parallel processing approach for fast unsupervised classification of remotely sensed images using C++ and NVIDIA's CUDA. The basic idea of the traditional K-Means approach is refined with a minimum distance classifier for clustering images. Numerical experiments in clustering 3-band color aerial images, at the original size of 1360×1020 and at a scaled-down size of 680×510, into a specified number of spectral clusters demonstrate a 10- to 20-fold speed-up in computational efficiency for the highly parallel, multi-thread, multi-core GPU-based approach over the traditional CPU-based approach.
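The thesis's CUDA kernels are not reproduced here; the following CPU-side Python sketch only illustrates the minimum-distance assignment step that a GPU implementation parallelizes across threads, with a small synthetic image standing in for the aerial imagery.

```python
import numpy as np

def kmeans_image(pixels, k=8, n_iter=20, seed=0):
    """pixels: (n, 3) array of RGB values. Each iteration performs the
    minimum-distance assignment (the embarrassingly parallel step on a GPU)
    followed by a mean update of the cluster centers."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # one distance row per pixel, one column per center
        d2 = ((pixels[:, None, :].astype(float) - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        for c in range(k):
            mask = labels == c
            if mask.any():
                centers[c] = pixels[mask].mean(axis=0)
    return labels, centers

# usage on a small synthetic image flattened to pixel rows (assumption)
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
labels, centers = kmeans_image(img.reshape(-1, 3), k=4)
```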
APA, Harvard, Vancouver, ISO, and other styles
42

Ranjan, Sameer. "Hyperspectral Image Classification Using K-means Clustering." Thesis, 2015. http://ethesis.nitrkl.ac.in/7895/1/624.pdf.

Full text
Abstract:
A hyperspectral image stores the reflectance of objects across the electromagnetic spectrum, and each object is identified by its spectral signature. Hyperspectral sensors record these images from airborne devices. By processing these images, we can obtain information about landforms, the seabed, and so on. This thesis presents an efficient and accurate classification technique for hyperspectral images. The approach consists of three steps. First, the dimension of the hyperspectral image is reduced using Principal Component Analysis, in order to reduce the time complexity of the subsequent processing. Second, the reduced features are clustered by k-means clustering. Last, each cluster is individually trained with a Support Vector Machine. The scheme was tested on the Pavia University dataset acquired by the ROSIS sensor. With this scheme, an overall accuracy of 90.2% was achieved, which is very promising compared to conventional Support Vector Machine classification, which had an overall accuracy of 78.67% on the same dataset.
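A minimal sketch of the described three-step pipeline (PCA, k-means, per-cluster SVM) using scikit-learn is shown below; the synthetic 103-band data stands in for the Pavia University scene and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# synthetic stand-in for a hyperspectral cube: 2000 pixels x 103 bands, 9 classes
X = np.random.rand(2000, 103)
y = np.random.randint(0, 9, size=2000)

# 1) dimensionality reduction
X_red = PCA(n_components=10).fit_transform(X)

# 2) unsupervised grouping of the reduced pixels
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_red)

# 3) one SVM per cluster, trained only on the pixels assigned to it
models = {}
for c in np.unique(clusters):
    idx = clusters == c
    if len(np.unique(y[idx])) > 1:          # SVC needs at least two classes
        models[c] = SVC(kernel="rbf").fit(X_red[idx], y[idx])

print(len(models), "per-cluster SVMs trained")
```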
APA, Harvard, Vancouver, ISO, and other styles
43

Zheng, Hao-Wen, and 鄭皓文. "Achieve K-Anonymity with k-means Clustering and Differential Privacy." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/ks2kts.

Full text
Abstract:
Master's thesis
National Sun Yat-sen University
Department of Computer Science and Engineering
107
De-identification is a technique for protecting personal privacy in public datasets, and many de-identification methods exist. All of them produce a de-identified dataset by clustering the tuples of the dataset, so clustering tuples with a better algorithm is an important issue. We propose a method whose clustering algorithm groups the tuples of a dataset better than the known method, and whose second algorithm improves the security of the sensitive attribute, so that the sensitive attribute in our result is better protected than with the known method.
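The abstract does not give the algorithms, so the following toy Python sketch only illustrates the general idea of approaching k-anonymity through clustering: records are grouped on their quasi-identifiers and each group releases its cluster mean. Every name, size rule, and threshold here is an assumption, not the thesis's method.

```python
import numpy as np
from sklearn.cluster import KMeans

def anonymize(records, k_anon=5):
    """Toy sketch: cluster the records on their quasi-identifiers and release
    the cluster mean in place of each record's quasi-identifiers, so every
    released tuple is shared by a group of roughly k_anon or more records."""
    n_clusters = max(1, len(records) // k_anon)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(records)
    labels, centers = km.labels_.copy(), km.cluster_centers_
    sizes = np.bincount(labels, minlength=n_clusters)
    big = np.where(sizes >= k_anon)[0]
    if len(big) == 0:                        # fall back to the largest cluster
        big = np.array([int(np.argmax(sizes))])
    for c in np.where(sizes < k_anon)[0]:    # merge undersized clusters
        target = big[np.argmin(((centers[big] - centers[c]) ** 2).sum(-1))]
        labels[labels == c] = target
    return centers[labels]

data = np.random.rand(100, 3)                # quasi-identifier columns (assumption)
print(anonymize(data)[:3])
```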
APA, Harvard, Vancouver, ISO, and other styles
44

Tsui, Chen-Kai, and 崔承愷. "Tabular K-means Clustering on Remote Sensing Images." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/60138681590550827372.

Full text
Abstract:
Master's thesis
National Chung Hsing University
Department of Civil Engineering
105
The processing capability of computer hardware accelerates as computer technology evolves. However, the quantity of remotely sensed images grows much faster than computer hardware as sensor technology advances toward ever higher spatial and spectral resolution. Real-time handling of remote sensing images therefore demands optimization of traditional algorithms in image processing, spectral analysis, and image classification under limited computational capability. This study develops a Tabular K-means approach for clustering remotely sensed multispectral images. The proposed approach employs principal component transformation, peak detection on the two-dimensional (2-D) scatter diagram of the first two principal components to obtain initial seeds, and the Voronoi diagram of these seeds in the 2-D spectral space to accelerate unsupervised classification of such images. Experimental results from clustering 7-band Landsat-4 Thematic Mapper (TM) images using Visual C++ programs demonstrate that the proposed Tabular K-means performs much more efficiently than the traditional K-means approach.
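An approximate Python sketch of the described seeding pipeline follows: project pixels onto the first two principal components, take local maxima of a 2-D histogram as seeds, and assign each pixel to its nearest seed, which is exactly a Voronoi partition of the 2-D spectral space. The bin count, peak rule, and seed count are assumptions, not the thesis's parameters.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def tabular_seeds(X, bins=64, n_seeds=6):
    """X: (n_pixels, n_bands). Project to 2 PCs, histogram the projection,
    and take the strongest local maxima as seeds in the 2-D PC space."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Xc @ Vt[:2].T                                     # first two PCs
    H, xe, ye = np.histogram2d(P[:, 0], P[:, 1], bins=bins)
    peaks = (H == maximum_filter(H, size=5)) & (H > 0)    # local maxima
    ij = np.argwhere(peaks)
    ij = ij[np.argsort(-H[peaks])][:n_seeds]              # strongest peaks first
    cx = (xe[ij[:, 0]] + xe[ij[:, 0] + 1]) / 2
    cy = (ye[ij[:, 1]] + ye[ij[:, 1] + 1]) / 2
    seeds = np.stack([cx, cy], axis=1)
    # nearest-seed assignment = Voronoi partition of the 2-D spectral space
    labels = ((P[:, None] - seeds[None]) ** 2).sum(-1).argmin(axis=1)
    return seeds, labels

X = np.random.rand(5000, 7)          # stand-in for a 7-band TM image (assumption)
seeds, labels = tabular_seeds(X)
print(seeds.shape, np.bincount(labels))
```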
APA, Harvard, Vancouver, ISO, and other styles
45

Hsiang-Yen, Lin, and 林相延. "Color-Based K-means Clustering for Image Segmentation." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/7sh6mq.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Hsueh, Meng-Lun, and 薛孟倫. "Wheeze Detection using Modified k-Means Clustering Algorithm." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/25519734434912722746.

Full text
Abstract:
Doctoral dissertation
National Taiwan University
Graduate Institute of Electrical Engineering
99
The aim of this study is to present the time-frequency characteristics of wheezes in a color-scale spectrogram and to apply the k-means clustering algorithm to detect them. Spectrogram bins are grouped by k-means clustering according to their spectral nature. The first step is to preset the value of k, which represents the number of groups; after experimental testing, k is set to three, corresponding to the red, green, and blue color scales of the spectrogram, on which wheeze sounds can be displayed. However, the k-means algorithm assigns group numbers randomly, so the color corresponding to wheezing is not fixed. Through a color-indexing method based on the production proportions of each color index, the wheezing color is set to red. The method is also applied to normal respiratory sounds, and its noise-reduction effects are discussed. With the modified k-means clustering algorithm, the results show that the signal-to-noise ratio improves by about 2 dB for both wheeze and normal cases. The color index marks wheezing sounds in red on the color spectrogram with a stable representation and good reproducibility, which greatly assists physicians in wheeze detection.
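A hedged sketch of the overall idea, clustering spectrogram magnitudes into k = 3 groups and indexing the highest-energy group as the "wheeze" color, is given below; the synthetic signal and the specific color-indexing rule are assumptions rather than the dissertation's procedure.

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.cluster import KMeans

fs = 8000
t = np.arange(0, 2.0, 1 / fs)
# synthetic breath sound: broadband noise plus a 400 Hz "wheeze" tone (assumption)
x = 0.3 * np.random.randn(len(t)) + np.sin(2 * np.pi * 400 * t) * (t > 1.0)

f, tt, Sxx = spectrogram(x, fs=fs, nperseg=256)
logS = 10 * np.log10(Sxx + 1e-12)

# cluster every time-frequency bin by its log-magnitude into k = 3 groups
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    logS.reshape(-1, 1)).reshape(logS.shape)

# simple color indexing: the cluster with the highest mean energy is marked "red"
means = [logS[labels == c].mean() for c in range(3)]
wheeze_cluster = int(np.argmax(means))
print("bins flagged as wheeze-like:", int((labels == wheeze_cluster).sum()))
```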
APA, Harvard, Vancouver, ISO, and other styles
47

Huang, Yi-Fen, and 黃奕棻. "Using K-means Clustering Algorithms for Anomaly Detection." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/50724702843472906729.

Full text
Abstract:
Master's thesis
National Taiwan Ocean University
Department of Computer Science and Engineering
97
With the enormous growth of computer and Internet use in recent years, network security has become an extremely important issue, and intrusion detection systems are used extensively in this field. To defend a network system, firewalls and intrusion detection systems are deployed to protect it. However, traditional detection methods perform increasingly poorly against unknown and varied network attacks, so many new anomaly-detection methods have recently been proposed to enhance systems' defensive capability. We propose a feature selection method to improve the accuracy and detection rate of an intrusion detection system; the method chooses specific features using correlation coefficients. In this thesis, we use clustering techniques and anomaly detection to build the intrusion detection system. Experiments are performed to evaluate detection, classification, and false alarm rates, and the results show that the proposed method outperforms other traditional intrusion detection approaches.
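The following Python sketch illustrates one way to combine correlation-based feature selection with k-means for anomaly scoring; the number of selected features, the number of clusters, and the distance-based scoring rule are assumptions, not the thesis's exact design.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_features(X, y, top=10):
    """Rank features by |Pearson correlation| with the label and keep the top ones."""
    corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(-corr)[:top]

def anomaly_scores(X, n_clusters=5):
    """Cluster the selected features; score each record by its distance to the
    nearest cluster center, so far-away records look anomalous."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    return np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# synthetic stand-in for labeled connection records (assumption)
X = np.random.rand(1000, 40)
y = np.random.randint(0, 2, size=1000)
cols = select_features(X, y)
scores = anomaly_scores(X[:, cols])
print("flagged:", int((scores > np.percentile(scores, 95)).sum()))
```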
APA, Harvard, Vancouver, ISO, and other styles
48

Su, Ruey-Chyi, and 蘇睿頎. "Sperm Whale's Voiceprint Classification via K-means Clustering." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/11187908401356738201.

Full text
Abstract:
Master's thesis
National Taiwan Ocean University
Department of Electrical Engineering
104
The sperm whale is the world's largest odontocete, has a highly developed social system, and uses different pulse-sequence combinations for social communication. In this thesis, we focus on two sperm whale voiceprints, clicks and codas, and perform feature analysis and classification on them. The source audio files are retrieved from CIBRA (Centro Interdisciplinare di Bioacustica e Ricerche Ambientali) and TCOML (The Cornell Lab of Ornithology Macaulay Library); besides the recorded voices, these files also contain background noise. We use the spectral subtraction method to eliminate the background noise, construct spectrograms of the two voiceprints via time-frequency analysis, extract the maximum, minimum, and average values of the inter-click intervals (ICIs) as features, and perform classification with the k-means clustering method. Feature analysis shows that both clicks and codas consist of pulse sequences, but their patterns and average ICIs are quite different. Clustering analysis shows that 100% correct classification can be achieved.
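A minimal sketch of the feature-and-clustering stage follows, assuming click onset times have already been detected; each recording is summarized by its maximum, minimum, and mean inter-click interval and the recordings are then grouped with k-means. The synthetic click and coda trains are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def ici_features(click_times):
    """click_times: sorted array of click onsets (seconds) in one recording.
    Features are the max, min and mean inter-click interval (ICI)."""
    ici = np.diff(np.sort(click_times))
    return np.array([ici.max(), ici.min(), ici.mean()])

# synthetic recordings: slow regular 'click' trains vs short 'coda' bursts (assumption)
rng = np.random.default_rng(0)
clicks = [np.cumsum(rng.uniform(0.5, 1.0, size=30)) for _ in range(10)]
codas = [np.cumsum(rng.uniform(0.05, 0.2, size=8)) for _ in range(10)]
F = np.array([ici_features(c) for c in clicks + codas])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(F)
print(labels)
```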
APA, Harvard, Vancouver, ISO, and other styles
49

Yang, Hong-Xiang, and 楊閎翔. "A Modified K-means Algorithm for Sequence Clustering." Thesis, 2007. http://ndltd.ncl.edu.tw/handle/22490180184996174354.

Full text
Abstract:
Master's thesis
Fu Jen Catholic University
Department of Computer Science and Information Engineering
95
Our research interest has focused on content-based retrieval and feature extraction for multimedia databases. In this thesis, we extend that research to construct a system that provides clustering services rather than only user-initiated search. We use DCT mapping for feature extraction, and our method covers two cases: equal-length and variable-length sequences. In the equal-length case, we map each sequence to an f-dimensional point in feature space and then cluster the sequences according to overall similarity, applying hierarchical clustering (single-linkage, average-linkage, complete-linkage) and partitional clustering (k-means). In the variable-length case, we cut each sequence into subsequences with a sliding window and map them to f-dimensional points. We propose a Modified K-means (MK) algorithm that clusters sequences according to partial similarity, and further apply Minimum Bounding Rectangles (MBRs) in an MBR Modified K-means (MMK) algorithm that performs more efficiently than the MK algorithm. Finally, we implemented our methods and carried out experiments.
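For the equal-length case, the DCT mapping plus k-means step can be sketched as follows; scipy's DCT and the choice f = 8 are assumptions standing in for the thesis's exact mapping, and the synthetic sequences are only illustrative.

```python
import numpy as np
from scipy.fft import dct
from sklearn.cluster import KMeans

def dct_map(seq, f=8):
    """Map a numeric sequence to an f-dimensional point by keeping the
    first f DCT coefficients (an energy-compacting summary of its shape)."""
    return dct(np.asarray(seq, dtype=float), norm="ortho")[:f]

# equal-length case: each sequence becomes one point, then standard k-means
rng = np.random.default_rng(1)
seqs = [np.sin(np.linspace(0, 4 * np.pi, 128)) + 0.1 * rng.standard_normal(128)
        for _ in range(20)]
seqs += [np.linspace(0, 1, 128) + 0.1 * rng.standard_normal(128) for _ in range(20)]

points = np.array([dct_map(s) for s in seqs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)
```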
APA, Harvard, Vancouver, ISO, and other styles
50

NEGI, ROHIT. "K-MEANS CLUSTERING ALGORITHM ON MAP REDUCE ARCHITECTURE." Thesis, 2016. http://dspace.dtu.ac.in:8080/jspui/handle/repository/14749.

Full text
Abstract:
The rapid development of the Internet and its impact on every aspect of life has caused data sizes to grow from the GB level to the TB and even PB level. This has brought about new technologies such as Hadoop for efficiently storing and analyzing the data. Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. A cluster is a collection of data members having similar characteristics, and the process of establishing relations or deriving information from raw data by performing operations such as clustering on a data set is known as data mining. Data collected in practical scenarios is more often than not completely random and unstructured, so there is always a need to analyze unstructured data sets to derive meaningful information. This is where unsupervised algorithms come into the picture to process unstructured or even semi-structured data sets. K-Means clustering is one such technique used to give structure to unstructured data so that valuable information can be extracted. This thesis discusses the implementation of the K-Means clustering algorithm in a distributed environment using Apache Hadoop. The key to the implementation of the K-Means algorithm is the design of the Mapper and Reducer routines, which is discussed in the later part of the thesis. The steps involved in executing the K-Means algorithm are also described, based on a small-scale implementation on an experimental setup, to serve as a guide for practical implementations.
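A plain-Python simulation of one MapReduce-style k-means iteration is sketched below to make the Mapper/Reducer split concrete; it is not the Hadoop job itself, and the driver loop stands in for re-submitting the job until the centers stabilize.

```python
import numpy as np
from collections import defaultdict

def mapper(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    c = int(np.argmin(((centers - point) ** 2).sum(axis=1)))
    return c, (point, 1)

def reducer(key, values):
    """Reduce step: average all points assigned to one center."""
    pts = np.array([v[0] for v in values])
    return key, pts.mean(axis=0)

def kmeans_iteration(points, centers):
    groups = defaultdict(list)
    for p in points:                       # in Hadoop, mappers run in parallel
        key, value = mapper(p, centers)
        groups[key].append(value)
    new_centers = centers.copy()
    for key, values in groups.items():     # one reducer call per center id
        _, new_centers[key] = reducer(key, values)
    return new_centers

points = np.random.rand(500, 2)
centers = points[:3].copy()
for _ in range(10):                        # the driver re-runs the job until stable
    centers = kmeans_iteration(points, centers)
print(centers)
```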
APA, Harvard, Vancouver, ISO, and other styles