Dissertations / Theses: 'Clustering'

1

Yoo, Jaiyul. "From galaxy clustering to dark matter clustering." Columbus, Ohio : Ohio State University, 2007. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1186586898.

2

Hinz, Joel. "Clustering the Web : Comparing Clustering Methods in Swedish." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-95228.

Abstract:

Clustering -- automatically sorting -- web search results has been the focus of much attention but is by no means a solved problem, and there is little previous work in Swedish. This thesis studies the performance of three clustering algorithms -- k-means, agglomerative hierarchical clustering, and bisecting k-means -- on a total of 32 corpora, as well as whether clustering web search previews, called snippets, instead of full texts can achieve reasonably decent results. Four internal evaluation metrics are used to assess the data. Results indicate that k-means performs worse than the other two algorithms, and that snippets may be good enough to use in an actual product, although there is ample opportunity for further research on both issues; however, results are inconclusive regarding bisecting k-means vis-à-vis agglomerative hierarchical clustering. Stop word and stemmer usage results are not significant, and appear to not affect the clustering by any considerable magnitude.

3

Bacarella, Daniele. "Distributed clustering algorithm for large scale clustering problems." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-212089.

Abstract:

Clustering is a task that has received much attention in data mining. The task of finding subsets of objects that share common attributes is applied in fields such as biology, medicine, business and computer science. A document search engine, for instance, uses the information obtained by clustering the document database to return results relevant to the query. Two main factors that make clustering challenging are the size of the dataset and the dimensionality of the objects to cluster. Sometimes the nature of the object makes it difficult to identify its attributes. This is the case in image clustering. A common approach is to compare two images using visual features such as the colors or shapes they contain. However, images sometimes come with textual information that claims to be sufficiently descriptive of the content (e.g. tags on web images). The purpose of this thesis is to propose a text-based image clustering algorithm that combines two techniques, namely MinHash Locality Sensitive Hashing (MinHash LSH) and frequent itemset mining.
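
Illustrative note (not taken from the thesis): the MinHash part of the approach named above can be sketched as follows. The sketch only shows how MinHash signatures of two tag sets approximate their Jaccard similarity; it omits the LSH banding step and the frequent itemset mining, and the function names, the choice of BLAKE2 as the hash, and the signature length of 64 are all assumptions made for the example.

```python
import hashlib


def minhash_signature(tags, num_hashes=64):
    """Return a MinHash signature for a non-empty collection of string tags.

    For each of `num_hashes` seeded hash functions, keep the smallest hash
    value observed over the tag set.
    """
    signature = []
    for seed in range(num_hashes):
        smallest = min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{tag}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for tag in set(tags)
        )
        signature.append(smallest)
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree (estimates Jaccard similarity)."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(tags_a, tags_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical image tags, invented for the example.
    image_1 = ["beach", "sunset", "sea", "holiday"]
    image_2 = ["beach", "sea", "holiday", "family"]
    print(exact_jaccard(image_1, image_2))
    print(estimated_jaccard(minhash_signature(image_1), minhash_signature(image_2)))
```

A longer signature gives a less noisy estimate at the cost of more hashing per image.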

4

Zimek, Arthur. "Correlation Clustering." Diss., lmu, 2008. http://nbn-resolving.de/urn:nbn:de:bvb:19-87361.

5

Rutten, Jeroen Hendrik Gerardus Christiaan. "Polyhedral clustering." Maastricht : Maastricht : Universiteit Maastricht ; University Library, Maastricht University [Host], 1998. http://arno.unimaas.nl/show.cgi?fid=6061.

6

Leisch, Friedrich. "Bagged clustering." SFB Adaptive Information Systems and Modelling in Economics and Management Science, WU Vienna University of Economics and Business, 1999. http://epub.wu.ac.at/1272/1/document.pdf.

Abstract:

A new ensemble method for cluster analysis is introduced, which can be interpreted in two different ways: as a complexity-reducing preprocessing stage for hierarchical clustering and as a combination procedure for several partitioning results. The basic idea is to locate and combine structurally stable cluster centers and/or prototypes. Random effects of the training set are reduced by repeatedly training on resampled sets (bootstrap samples). We discuss the algorithm from both a theoretical and an applied point of view and demonstrate it on several data sets. (author's abstract)
Series: Working Papers SFB "Adaptive Information Systems and Modelling in Economics and Management Science"

7

Eldridge, Justin. "Clustering Consistently." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1512070374903249.

8

Salamone, Johnny. "Speaker Clustering." Master's Degree Thesis, Università Ca' Foscari Venezia, 2018. http://hdl.handle.net/10579/12958.

Abstract:

After a study of the reference papers on speaker clustering, this thesis project aims to reimplement the clustering algorithm, targeting a better-performing implementation that demonstrates the effectiveness and flexibility of a rather new approach. Unlike the usual methods, this alternative approach to speaker clustering slightly redefines the notion of a cluster and is called Dominant Set. The notion of a Dominant Set revolves around graph theory and the optimization problem of finding a maximal subgraph, aided by game theory. Such subgraphs are analogous to a set with high internal coherence and weak coherence with external elements. The input data set, known as TIMIT, was provided by a research group, with the feature vectors already extracted from audio recordings. Although TIMIT was designed for supervised methods and neural-network-based implementations, the objective is precisely to demonstrate the flexibility of dominant sets on feature vectors for recognizing speakers by classifying utterances. Several implementations in different programming languages demonstrate the potential of Dominant Sets for speaker clustering, following an initial comparative test against other similar clustering techniques and using both the reduced and the full versions of the TIMIT data set.

9

Rosell, Magnus. "Text Clustering Exploration : Swedish Text Representation and Clustering Results Unraveled." Doctoral thesis, KTH, Numerisk Analys och Datalogi, NADA, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-10129.

Abstract:

Text clustering divides a set of texts into clusters (parts), so that texts within each cluster are similar in content. It may be used to uncover the structure and content of unknown text sets as well as to give new perspectives on familiar ones. The main contributions of this thesis are an investigation of text representation for Swedish and some extensions of the work on how to use text clustering as an exploration tool. We have also done some work on synonyms and evaluation of clustering results. Text clustering, at least such as it is treated here, is performed using the vector space model, which is commonly used in information retrieval. This model represents texts by the words that appear in them and considers texts similar in content if they share many words. Languages differ in what is considered a word. We have investigated the impact of some of the characteristics of Swedish on text clustering. Swedish has more morphological variation than for instance English. We show that it is beneficial to use the lemma form of words rather than the word forms. Swedish has a rich production of solid compounds. Most of the constituents of these are used on their own as words and in several different compounds. In fact, Swedish solid compounds often correspond to phrases or open compounds in other languages. Our experiments show that it is beneficial to split solid compounds into their parts when building the representation. The vector space model does not regard word order. We have tried to extend it with nominal phrases in different ways. We have also tried to differentiate between homographs, words that look alike but mean different things, by augmenting all words with a tag indicating their part of speech. None of our experiments using phrases or part of speech information have shown any improvement over using the ordinary model. Evaluation of text clustering results is very hard. What is a good partition of a text set is inherently subjective. 
External quality measures compare a clustering with a (manual) categorization of the same text set. The theoretical best possible value for a measure is known, but it is not obvious what a good value is – text sets differ in difficulty to cluster and categorizations are more or less adapted to a particular text set. We describe how evaluation can be improved for cases where a text set has more than one categorization. In such cases the result of a clustering can be compared with the result for one of the categorizations, which we assume is a good partition. In some related work we have built a dictionary of synonyms. We use it to compare two different principles for automatic word relation extraction through clustering of words. Text clustering can be used to explore the contents of a text set. We have developed a visualization method that aids such exploration, and implemented it in a tool, called Infomat. It presents the representation matrix directly in two dimensions. When the order of texts and words are changed, by for instance clustering, distributional patterns that indicate similarities between texts and words appear. We have used Infomat to explore a set of free text answers about occupation from a questionnaire given to over 40 000 Swedish twins. The questionnaire also contained a closed answer regarding smoking. We compared several clusterings of the text answers to the closed answer, regarded as a categorization, by means of clustering evaluation. A recurring text cluster of high quality led us to formulate the hypothesis that “farmers smoke less than the average”, which we later could verify by reading previous studies. This hypothesis generation method could be used on any set of texts that is coupled with data that is restricted to a limited number of possible values.

10

Rossi, Alfred Vincent III. "Temporal Clustering of Finite Metric Spaces and Spectral k-Clustering." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1500033042082458.

11

Keller, Jens. "Clustering biological data using a hybrid approach : Composition of clusterings from different features." Thesis, University of Skövde, School of Humanities and Informatics, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-1078.

Abstract:

Clustering of data is a well-researched topic in computer science. Many approaches have been designed for different tasks. In biology many of these approaches are hierarchical and the result is usually represented in dendrograms, e.g. phylogenetic trees. However, many non-hierarchical clustering algorithms are also well-established in biology. The approach in this thesis is based on such common algorithms. The algorithm which was implemented as part of this thesis uses a non-hierarchical graph clustering algorithm to compute a hierarchical clustering in a top-down fashion. It performs the graph clustering iteratively, with a previously computed cluster as input set. The innovation is that it focuses on another feature of the data in each step and clusters the data according to this feature. Common hierarchical approaches cluster, e.g. in biology, a set of genes according to the similarity of their sequences. The clustering then reflects a partitioning of the genes according to their sequence similarity. The approach introduced in this thesis uses many features of the same objects. These features can vary; in biology they include, for instance, similarities of the sequences, of gene expression or of motif occurrences in the promoter region. As part of this thesis, not only was the algorithm itself implemented and evaluated, but also a complete software package that provides a graphical user interface. The software was implemented as a framework providing the basic functionality, with the algorithm as a plug-in extending the framework. The software is meant to be extended in the future, integrating a set of algorithms and analysis tools related to the process of clustering and analysing data not necessarily related to biology.

The thesis deals with topics in biology, data mining and software engineering and is divided into six chapters. The first chapter gives an introduction to the task and the biological background. It gives an overview of common clustering approaches and explains the differences between them. Chapter two shows the idea behind the new clustering approach and points out differences and similarities between it and common clustering approaches. The third chapter discusses the aspects concerning the software, including the algorithm. It illustrates the architecture and analyses the clustering algorithm. After the implementation the software was evaluated, which is described in the fourth chapter, pointing out observations made due to the use of the new algorithm. Furthermore this chapter discusses differences and similarities to related clustering algorithms and software. The thesis ends with the last two chapters, namely conclusions and suggestions for future work. Readers who are interested in repeating the experiments which were made as part of this thesis can contact the author via e-mail, to get the relevant data for the evaluation, scripts or source code.

12

Gondek, David. "Non-redundant clustering." View online version; access limited to Brown University users, 2005. http://wwwlib.umi.com/dissertations/fullcit/3174612.

13

Gupta, Pramod. "Robust clustering algorithms." Thesis, Georgia Institute of Technology, 2011. http://hdl.handle.net/1853/39553.

Abstract:

One of the most widely used techniques for data clustering is agglomerative clustering. Such algorithms have long been used across many different fields, ranging from computational biology to social sciences to computer vision, in part because they are simple and their output is easy to interpret. However, many of these algorithms lack any performance guarantees when the data is noisy, incomplete or has outliers, which is the case for most real-world data. It is well known that standard linkage algorithms perform extremely poorly in the presence of noise. In this work we propose two new robust algorithms for bottom-up agglomerative clustering and give formal theoretical guarantees for their robustness. We show that our algorithms can be used to cluster accurately in cases where the data satisfies a number of natural properties and where the traditional agglomerative algorithms fail. We also extend our algorithms to an inductive setting with similar guarantees, in which we randomly choose a small subset of points from a much larger instance space, generate a hierarchy over this sample, and then insert the rest of the points into it to generate a hierarchy over the entire instance space. We then carry out a systematic experimental analysis of various linkage algorithms, compare their performance on a variety of real-world data sets, and show that our algorithms handle various forms of noise much better than other hierarchical algorithms.

14

Achtert, Elke. "Hierarchical Subspace Clustering." Diss., lmu, 2007. http://nbn-resolving.de/urn:nbn:de:bvb:19-68071.

15

Whissell, John. "Significant Feature Clustering." Thesis, University of Waterloo, 2006. http://hdl.handle.net/10012/2926.

Abstract:

In this thesis, we present a new clustering algorithm we call Significance Feature Clustering (SFC), which is designed to cluster text documents. Its central premise is the mapping of raw frequency count vectors to discrete-valued significance vectors which contain values of -1, 0, or 1. These values represent whether a word is significantly negative, neutral, or significantly positive, respectively. Initially, standard tf-idf vectors are computed from raw frequency vectors; these tf-idf vectors are then transformed to significance vectors using a parameter alpha, which controls the mapping to -1, 0, or 1 for each vector entry. SFC clusters agglomeratively: each document's significance vector initially represents a cluster of size one containing just that document, and SFC iteratively merges the two clusters that exhibit the most similar average using cosine similarity. We show that by using a good alpha value, the significance vectors produced by SFC provide an accurate indication of which words are significant to which documents, as well as the type of significance, and therefore correspondingly yield a good clustering in terms of a well-known definition of clustering quality. We further demonstrate that a user need not manually select an alpha, as we develop a new definition of clustering quality that is highly correlated with text clustering quality. Our metric extends the family of metrics known as internal similarity, so that it can be applied to a tree of clusters rather than a set, but it also factors in an aspect of recall that was absent from previous internal similarity metrics. Using this new definition of internal similarity, which we call maximum tree internal similarity, we show that a close to optimal text clustering may be picked from any number of clusterings created by different alpha values. The automatically selected clusterings have qualities that are close to those of a well-known and powerful hierarchical clustering algorithm.

16

Johnson, Samuel. "Document Clustering Interface." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-112878.

Abstract:

This project created a first-step prototype interface for a document clustering search engine. The goal is to meet the needs of people with reading difficulties while also being a useful tool for general users trying to find relevant but easy-to-read documents. The hypothesis is that minimizing the amount of text and focusing on graphical representation will make the service easier to use for all users. The interface was developed using previously established persona and evaluated by general users (i.e. not users with reading disabilities) in order to see whether the interface was easy to use and understand without tooltips and tutorials. The results showed that even though the participants understood the interface and found it intuitive, there was still some information they thought was missing, such as an explanation of the reading indexes and how they determined readability.

17

Evans, Reuben James Emmanuel. "Clustering for Classification." The University of Waikato, 2007. http://hdl.handle.net/10289/2403.

Abstract:

Advances in technology have provided industry with an array of devices for collecting data. The frequency and scale of data collection mean that many large datasets are now being generated. To find patterns in these datasets it would be useful to be able to apply modern methods of classification such as support vector machines. Unfortunately these methods are computationally expensive, in fact quadratic in the number of data points, so they cannot be applied directly. This thesis proposes a framework whereby a variety of clustering methods can be used to summarise datasets, that is, reduce them to a smaller but still representative dataset so that these advanced methods can be applied. It compares the results of using this framework against using random selection on a large number of classification and regression problems. Results show that the clustered datasets are on average fifty percent smaller than the original datasets without loss of classification accuracy, which is significantly better than random selection. They also show that there is no free lunch: for each dataset it is important to choose a clustering method carefully.

18

Karim, Ehsanul, Sri Phani Venkata Siva Krishna Madani, and Feng Yun. "Fuzzy Clustering Analysis." Thesis, Blekinge Tekniska Högskola, Sektionen för ingenjörsvetenskap, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2165.

Abstract:

The objective of this thesis is to discuss the use of fuzzy logic in pattern recognition. There are different fuzzy approaches to recognizing patterns and structure in data. The fuzzy approach chosen to process the data depends entirely on the type of data. Pattern recognition involves various mathematical transforms that render the pattern or structure with desired properties, such as the identification of a probabilistic model that explains the process generating the observed data. With this in mind, we turn to fuzzy logic for the process of pattern recognition. Fuzzy logic, like any other mathematical field, has its own set of principles, types, representations and uses. Our work therefore focuses on the ways in which fuzzy logic is applied to pattern recognition and on understanding the results, which are discussed in the topics that follow. Pattern recognition is the collection of all approaches that understand, represent and process the data as segments and features by using fuzzy sets. The representation and processing depend on the selected fuzzy technique and on the problem to be solved. In the broadest sense, pattern recognition is any form of information processing for which both the input and output are different kinds of data: medical records, aerial photos, market trends, library catalogs, galactic positions, fingerprints, psychological profiles, cash flows, chemical constituents, demographic features, stock options, military decisions. Most pattern recognition techniques involve treating the data as a variable and applying standard processing techniques to it.

19

Parker, Jonathon Karl. "Accelerated Fuzzy Clustering." Scholar Commons, 2013. http://scholarcommons.usf.edu/etd/4929.

Abstract:

Clustering algorithms are a primary tool in data analysis, facilitating the discovery of groups and structure in unlabeled data. They are used in a wide variety of industries and applications. Despite their ubiquity, clustering algorithms have a flaw: they take an unacceptable amount of time to run as the number of data objects increases. The need to compensate for this flaw has led to the development of a large number of techniques intended to accelerate their performance. This need grows greater every day, as collections of unlabeled data grow larger and larger. How does one increase the speed of a clustering algorithm as the number of data objects increases and at the same time preserve the quality of the results? This question was studied using the Fuzzy c-means clustering algorithm as a baseline. Its performance was compared to the performance of four of its accelerated variants. Four key design principles of accelerated clustering algorithms were identified. Further study and exploration of these principles led to four new and unique contributions to the field of accelerated fuzzy clustering. The first was the identification of a statistical technique that can estimate the minimum amount of data needed to ensure a multinomial, proportional sample. This technique was adapted to work with accelerated clustering algorithms. The second was the development of a stopping criterion for incremental algorithms that minimizes the amount of data required, while maximizing quality. The third and fourth techniques were new ways of combining representative data objects. Five new accelerated algorithms were created to demonstrate the value of these contributions. One additional discovery made during the research was that the key design principles most often improve performance when applied in tandem. This discovery was applied during the creation of the new accelerated algorithms. 
Experiments show that the new algorithms improve speedup with minimal quality loss, are demonstrably better than related methods and occasionally are an improvement in both speedup and quality over the base algorithm.
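
Illustrative note (not taken from the thesis): the baseline named in the abstract, fuzzy c-means, alternates between a membership update and a center update. The sketch below is a plain textbook version under assumed defaults (fuzzifier m = 2, a fixed number of iterations, caller-supplied initial centers); it does not reproduce any of the accelerated variants the thesis studies, and the function names are invented for the example.

```python
import math


def memberships_for(points, centers, m=2.0, eps=1e-12):
    """Membership of every point in every cluster; each row sums to 1."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        # Clamp distances so a point sitting exactly on a center does not divide by zero.
        dists = [max(math.dist(p, c), eps) for c in centers]
        rows.append([
            1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists
        ])
    return rows


def fuzzy_c_means(points, initial_centers, m=2.0, iterations=50):
    """Run a fixed number of fuzzy c-means iterations on a list of coordinate tuples."""
    centers = [tuple(c) for c in initial_centers]
    dim = len(points[0])
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        new_centers = []
        for i in range(len(centers)):
            # Each center is the mean of all points, weighted by membership**m.
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ))
        centers = new_centers
    return centers, memberships_for(points, centers, m)


if __name__ == "__main__":
    data = [(0.0,), (0.1,), (5.0,), (5.1,)]
    centers, u = fuzzy_c_means(data, [(0.0,), (5.0,)])
    print(centers)
    print(u)
```

Every iteration touches every point for every cluster, which is the cost that grows with the number of data objects and that the accelerated variants aim to reduce.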

20

Shih, Benjamin. "Target Sequence Clustering." Research Showcase @ CMU, 2011. http://repository.cmu.edu/dissertations/177.

Abstract:

Researchers have discovered many successful algorithms and methodologies for solving problems at the intersection of machine learning and education research. This umbrella category, “educational data mining,” has enjoyed a series of successes that span the research process, from post-hoc data analysis that generates models to the use of those models in successful educational interventions. However, most of these successes have arisen from the use of pre-existing psychological and educational constructs (e.g., guessing) and thus from the use of semi-supervised or fully-supervised machine learning algorithms. Algorithms for novel discovery, also known as unsupervised clustering, have enjoyed significantly fewer successes in this domain, partially because education data exhibit unique, complex structure. This thesis is a mixture of algorithm development, simulation, and experimentation on real-world data, all designed to define and test a novel paradigm for clustering in education (and a range of other domains). This paradigm, target clustering, revolves around the inclusion of high-level targets, such as student learning from pre-test to post-test. This approach differs from other existing machine learning approaches in that it is designed completely, from the initial concept to the final execution, for solving educational research problems, taking advantage of the structural complexities that are problematic for other algorithms. This thesis includes a range of data sets drawn from a variety of research domains, but does not include new data from experiments in the psychological sense. However, the thesis includes analysis of methodology, results, and implications from an educational research perspective and relies entirely on education data and research problems.

21

Afsarmanesh, Tehrani Nazanin. "Clustering Multilayer Networks." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-279745.

Abstract:

Detecting community structure is an important methodology for studying complex networks. Community detection methods can be divided into two main categories: partitioning methods and overlapping clustering methods. In partitioning methods, each node can belong to at most one community, while overlapping clustering methods also allow communities with overlapping nodes. Community detection is today not only a problem in single networks, but also in multilayer networks, where several networks with the same participants are considered at the same time. In recent years, several methods have been proposed for recognizing communities in multilayer networks; however, none of these methods finds overlapping communities. In many types of systems, this restriction is not realistic. For example, in social networks, individuals communicate with different groups of people, such as friends, colleagues, and family, and this determines overlaps between communities, while they also communicate through several networks such as Facebook, Twitter, etc. The overall purpose of this study was to introduce a method for finding overlapping communities in multilayer networks. The proposed method is an extension of the popular Clique Percolation Method (CPM) for simple networks. It has been shown that the structure of communities depends on the definition of cliques in multilayer networks, which are the smallest components of communities in CPM, and therefore several types of communities can be defined based on different definitions of cliques. As the most conventional definition of communities, it is necessary for all nodes to be densely connected in single networks to form a community in the multilayer network. In the last part of the thesis, a method has been proposed for finding these types of communities in multilayer networks.
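
Illustrative note (not taken from the thesis): the single-network Clique Percolation Method that the thesis extends can be sketched as follows. This is the plain single-layer version only, not the multilayer extension proposed in the thesis. It enumerates maximal cliques with a basic Bron-Kerbosch recursion and merges cliques of size at least k that share at least k - 1 nodes; all function names are invented for the example.

```python
from itertools import combinations


def build_adjacency(edges):
    """Turn an undirected edge list into a dict of node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def clique_percolation(adj, k=3):
    """Return k-clique communities as sorted node lists; communities may overlap."""
    cliques = [c for c in maximal_cliques(adj) if len(c) >= k]
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two cliques belong to the same community if they share at least k - 1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Two triangles that share only node 3, so node 3 sits in both communities.
    edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
    print(clique_percolation(build_adjacency(edges), k=3))
```

Because a node can sit in several cliques that are never merged, the resulting communities can overlap, which is the property the abstract contrasts with partitioning methods.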

22

Oshiro, Marcio Takashi Iura. "Clustering de trajetórias." Universidade de São Paulo, 2015. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-29102015-142559/.

Abstract:

This work aimed to study kinetic clustering problems, i.e., clustering problems in which the objects are moving. The study focused on the one-dimensional case, where the objects are points moving on the real line. Several variants of this case were addressed. Regarding the movement, we consider the case where each point moves at a constant velocity in a given time interval, the case where the points move arbitrarily and we only know their positions at discrete time instants, the case where the points move at a random velocity of which only the expected value is known, and the case where, given a partition of the time interval, the points move at constant velocities in each sub-interval. Regarding the kind of clustering sought, we focused on the case where the number of clusters is part of the input of the problem, and we consider different measures of quality for the clustering. Two of them are traditional measures for clustering problems: the sum of the cluster diameters and the maximum diameter of a cluster. The third measure takes into account the kinetic nature of the problem and allows the clustering to change over time in a controlled manner. For each of the variants of the problem, we present algorithms, exact or approximate, some complexity results, and open questions.
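
Illustrative note (not taken from the thesis): for a single static snapshot of points on the real line, the second quality measure mentioned above, the maximum cluster diameter with a given number of clusters k, can be minimized with a greedy sweep plus a binary search over candidate diameters. The sketch below covers only this static case and none of the kinetic variants studied in the thesis; the function names are invented, and it assumes a non-empty input and k >= 1.

```python
def clusters_needed(sorted_points, max_diameter):
    """Greedy sweep: fewest contiguous clusters whose diameters stay within max_diameter."""
    count = 0
    start = None
    for x in sorted_points:
        if start is None or x - start > max_diameter:
            count += 1
            start = x
    return count


def min_max_diameter(points, k):
    """Smallest achievable maximum cluster diameter when using at most k clusters."""
    pts = sorted(points)
    # The optimal value is always a distance between two input points (or zero).
    candidates = sorted({b - a for i, a in enumerate(pts) for b in pts[i:]})
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if clusters_needed(pts, candidates[mid]) <= k:
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]


if __name__ == "__main__":
    print(min_max_diameter([1, 2, 10, 11, 20], k=2))
```

The binary search is valid because allowing a larger diameter never increases the number of clusters the greedy sweep needs.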

23

Alqurashi, Tahani. "Clustering ensemble method." Thesis, University of East Anglia, 2017. https://ueaeprints.uea.ac.uk/62679/.

Abstract:

Clustering is an unsupervised learning paradigm that partitions a given dataset into clusters so that objects in the same cluster are more similar to each other than to the objects in the other clusters. However, when clustering algorithms are used individually, their results are often inconsistent and unreliable. This research applies the philosophy of ensemble learning, which combines multiple partitions using a consensus function, to address these issues and improve clustering performance. A clustering ensemble framework is presented consisting of three phases: Ensemble Member Generation, Consensus and Evaluation. This research focuses on two points: the consensus function and ensemble diversity. For the first, we propose three new consensus functions: the Object-Neighbourhood Clustering Ensemble (ONCE), the Dual-Similarity Clustering Ensemble (DSCE), and the Adaptive Clustering Ensemble (ACE). ONCE takes into account the neighbourhood relationship between object pairs in the similarity matrix, while DSCE and ACE are based on two similarity measures: cluster similarity and membership similarity. The proposed ensemble methods were tested on benchmark real-world and artificial datasets. The results demonstrate that ONCE outperforms the other similar methods, and is more consistent and reliable than k-means. Furthermore, DSCE and ACE were compared to the ONCE, CO, MCLA and DICLENS clustering ensemble methods. The results demonstrate that on average ACE outperforms the state-of-the-art clustering ensemble methods CO, MCLA and DICLENS. On diversity, we experimentally investigated all the existing measures to determine their relationship with ensemble quality. The results indicate that none of them is capable of revealing a clear relationship, for two reasons: (1) they are all inappropriately defined to measure the useful difference between the members, and (2) none of them has been used directly by any consensus function. Therefore, we point out that these two issues need to be addressed in future research.
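
To make the idea of a consensus function concrete, here is a minimal Python sketch of a co-association style consensus, assuming that the "CO" baseline in the abstract refers to that family of methods. It is not the ONCE, DSCE or ACE method from the thesis. The function names and the 0.5 threshold are hypothetical choices for illustration.

```python
def co_association(partitions):
    """Fraction of ensemble members that place objects i and j in the same cluster."""
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus(partitions, threshold=0.5):
    """Link objects co-clustered in more than `threshold` of the members.

    The consensus clusters are the connected components of that link graph
    (an assumed, deliberately simple readout).
    """
    co = co_association(partitions)
    n = len(co)
    labels = [-1] * n
    current = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three ensemble members over four objects; cluster ids are arbitrary per member.
members = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
final_labels = consensus(members)
```

Because only co-membership is counted, the arbitrary cluster ids used by each member never need to be aligned, which is one reason this style of consensus is a common baseline.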

24

Pahmp, Oliver. "N-sphere Clustering." Thesis, Umeå universitet, Statistik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-172387.

Abstract:

This thesis introduces n-sphere clustering, a new method of cluster analysis, akin to agglomerative hierarchical clustering. It relies on expanding n-spheres around each observation until they intersect. It then clusters observations based on these intersects, the distance between the spheres, and density of observations. Currently, many commonly used clustering methods struggle when clusters have more complex shapes. The aim of n-sphere clustering is to have a method which functions reasonably well, regardless of the shape of the clusters. Accuracy is shown to be low, particularly when clusters overlap, and extremely sensitive to noise. The time complexity of the algorithm is prohibitively large for large datasets, further limiting its potential use.

25

Tadepalli, Sriram Satish. "Schemas of Clustering." Diss., Virginia Tech, 2009. http://hdl.handle.net/10919/26261.

Abstract:

Data mining techniques, such as clustering, have become a mainstay in many applications such as bioinformatics, geographic information systems, and marketing. Over the last decade, due to new demands posed by these applications, clustering techniques have been significantly adapted and extended. One such extension is the idea of finding clusters in a dataset that preserve information about some auxiliary variable. These approaches guide clustering algorithms, which are traditionally unsupervised learning techniques, with background knowledge of the auxiliary variable. The auxiliary information could be some prior class label attached to the data samples or it could be the relations between data samples across different datasets. In this dissertation, we consider the latter problem of simultaneously clustering several vector-valued datasets by taking into account the relationships between the data samples. We formulate objective functions that can be used to find clusters that are local in each individual dataset and at the same time maximally similar or dissimilar with respect to clusters across datasets. We introduce diverse applications of these clustering algorithms: (1) time series segmentation, (2) reconstructing temporal models from time series segmentations, (3) simultaneously clustering several datasets according to database schemas using a multi-criteria optimization, and (4) clustering datasets with many-to-many relationships between data samples. For each of the above, we demonstrate applications, including modeling the yeast cell cycle and the yeast metabolic cycle, understanding the temporal relationships between yeast biological processes, and cross-genomic studies involving multiple organisms and multiple stresses. The key contribution is to structure the design of complex clustering algorithms over a database schema in terms of clustering algorithms over the underlying entity sets.
Ph. D.

26

Loganathan, Satish Kumar. "Distributed Hierarchical Clustering." University of Cincinnati / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1544001912266574.

27

Al-Razgan, Muna Saleh. "Weighted clustering ensembles." Fairfax, VA : George Mason University, 2008. http://hdl.handle.net/1920/3212.

Abstract:

Thesis (Ph.D.)--George Mason University, 2008.
Vita: p. 134. Thesis director: Carlotta Domeniconi. Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Information Technology. Title from PDF t.p. (viewed Oct. 14, 2008). Includes bibliographical references (p. 128-133). Also issued in print.

28

Xu, Tianbing. "Nonparametric evolutionary clustering." Diss., Online access via UMI:, 2009.

29

Zhong, Wei. "Clustering System and Clustering Support Vector Machine for Local Protein Structure Prediction." Digital Archive @ GSU, 2006. http://digitalarchive.gsu.edu/cs_diss/7.

Abstract:

A protein's tertiary structure plays a very important role in determining its possible functional sites and chemical interactions with other related proteins. Experimental methods for determining protein structure are time-consuming and expensive. As a result, the gap between known protein sequences and known structures has widened substantially with the advent of high-throughput sequencing techniques. These limitations of experimental methods motivate the development of computational algorithms for protein structure prediction. In this work, a clustering system is used to predict local protein structure. First, recurring sequence clusters are explored with an improved K-means clustering algorithm. Carefully constructed sequence clusters are used to predict local protein structure. After obtaining the sequence clusters and motifs, we study how sequence variation within sequence clusters may influence their structural similarity. Analysis of the relationship between sequence variation and structural similarity shows that sequence clusters with tight sequence variation have high structural similarity, whereas sequence clusters with wide sequence variation have poor structural similarity. Based on this knowledge, the established clustering system is used to predict the tertiary structure of local sequence segments. Test results indicate that the highest-quality clusters give highly reliable predictions and that high-quality clusters give reliable predictions. To improve the performance of the clustering system for local protein structure prediction, a novel computational model called Clustering Support Vector Machines (CSVMs) is proposed. In our previous work, the sequence-to-structure relationship was explored with the conventional K-means algorithm. The K-means clustering algorithm may not capture nonlinear sequence-to-structure relationships effectively. As a result, we consider using Support Vector Machines (SVMs) to capture the nonlinear sequence-to-structure relationship. However, SVMs are not well suited to huge datasets containing millions of samples. Therefore, we propose a novel computational model called CSVMs. Taking advantage of both the theory of granular computing and advanced statistical learning methodology, CSVMs are built specifically for each information granule partitioned intelligently by the clustering algorithm. Compared with the clustering system introduced previously, our experimental results show that the accuracy of local structure prediction improves noticeably when CSVMs are applied.

30

Hoffmann, Kai Delf. "Cosmology with galaxy clustering." Doctoral thesis, Universitat Autònoma de Barcelona, 2015. http://hdl.handle.net/10803/297700.

Abstract:

Per constrènyer models cosmològics mitjançant el creixement de les fluctuacions a gran escala de la matèria és cabdal entendre com les galàxies que observem tracen el camp de densitat de tot el conjunt de matèria. La relació entre el camp de densitat de matèria i el de galàxies s'acostuma a aproximar amb una expansió de segon ordre de la funció anomenada bias. La llibertat en els paràmetres d'aquesta funció redueix la informació cosmològica que es pot extreure de les observacions. En aquesta tesi estudiem dos mètodes per determinar els paràmetres del bias independentment del creixement. L'anàlisi es basa en la distribució de matèria de la gran simulació MICE Grand Challenge. Als halos, identificats en aquesta simulació, se'ls associen galàxies. El primer mètode consisteix en mesurar directament els paràmetres del bias d'estadístiques de tercer ordre de les distribucions d'halos i de matèria. El segon en predir-los a partir de l'abundància d'halos en funció de la seva massa (concepte al qual ens referirem com a funció de massa). Les nostres estimacions del bias amb estadístiques de tercer ordre es basen en les autocorrelacions i correlacions creuades de tres punts dels camps de densitat d'halos i de matèria, en l'espai de configuració tridimensional. Usant les autocorrelacion de tres punts i un model local i quadràtic del bias trobem una sobreestimació del $\sim20\%$ en el paràmetre lineal del bias respecte a la referència provinent de correlacions de dos punts. Aquesta desviació es pot deure a ignorar contribucions no locals i d'ordre superior a la funció bias, així com sistemàtics en les mesures. L'efecte d'aquestes inexactituds en les estimacions del bias en les mesures del creixement són comparables amb els errors en les nostres mesures, procedents de la variància de la mostra i del soroll. També presentem un nou mètode per mesurar el creixement que no requereix un model per a la correlació de tres punts de la matèria fosca. 
Els resultats d'ambdós enfocaments estan en acord amb les prediccions. Combinant les autocorrelacions i les correlacions creuades de tres punts, per una banda podem mesurar el bias lineal sense ser afectats per termes quadràtics (locals o no locals) en les funcions del bias, i de l'altra podem aïllar aquests termes i comparar-los amb les prediccions. Les nostres mesures de bias lineal a partir d'aquestes combinacions són molt consistents amb el bias lineal de referència. La comparació de les contribucions no lineals amb les prediccions revelen una forta dependència de les mesures amb desviacions significatives de les prediccions, inclús a escales molt grans. El nostre segon enfoc per obtenir els paràmetres de bias són prediccions derivades de la funció de massa a través de l'aproximació de "peak-background !split". Trobem desviacions significatives del 5-10% entre aquestes prediccions i la referència a partir de les estadístiques de dos punts. Aquestes desviacions poden ser explicades només en part a partir dels sistemàtics que afecten les prediccions de bias, provinent del "binning" de la funció de massa d'halos, l'estimació de l'error de la funció de massa i la parametrització de la funció de massa a partir de la qual se'n deriven les prediccions de bias. Estudiant la funció de massa trobem relacions entre diferents parametritzacions de la funció de massa. A més, trobem que el mètode estàndard de Jack-Knife sobreestima la covariança d'error de la funció de massa en el rang de baixa massa. Expliquem aquestes desviacions i presentem un nou i estimador de covariança millorat.
For constraining cosmological models via the growth of large-scale matter fluctuations it is important to understand how the observed galaxies trace the full matter density field. The relation between the density fields of matter and galaxies is often approximated by a second-order expansion of a so-called bias function. The freedom of the parameters in the bias function weakens cosmological constraints from observations. In this thesis we study two methods for determining the bias parameters independently of the growth. Our analysis is based on the matter field from the large MICE Grand Challenge simulation. Haloes, identified in this simulation, are associated with galaxies. The first method is to measure the bias parameters directly from third-order statistics of the halo and matter distributions. The second method is to predict them from the abundance of haloes as a function of halo mass (hereafter referred to as the mass function). Our bias estimations from third-order statistics are based on three-point auto- and cross-correlations of halo and matter density fields in three-dimensional configuration space. Using three-point auto-correlations and a local quadratic bias model we find a ∼20% overestimation of the linear bias parameter with respect to the reference from two-point correlations. This deviation can originate from ignoring non-local and higher-order contributions to the bias function, as well as from systematics in the measurements. The effect of such inaccuracies in the bias estimations on growth measurements is comparable with the errors in our measurements, which come from sampling variance and noise. We also present a new method for measuring the growth which does not require a model for the dark matter three-point correlation. Results from both approaches are in good agreement with predictions. By combining three-point auto- and cross-correlations one can either measure the linear bias without being affected by quadratic (local or non-local) terms in the bias functions or one can isolate such terms and compare them to predictions. Our linear bias measurements from such combinations are in very good agreement with the reference linear bias. The comparison of the non-local contributions with predictions reveals a strong scale dependence of the measurements with significant deviations from the predictions, even at very large scales. Our second approach for obtaining the bias parameters uses predictions derived from the mass function via the peak-background split approach. We find significant 5-10% deviations between these predictions and the reference from two-point clustering. These deviations can only partly be explained by systematics affecting the bias predictions, which come from the halo mass function binning, the mass function error estimation and the mass function parameterisation from which the bias predictions are derived. Studying the mass function we find unifying relations between different mass function parameterisations. Furthermore, we find that the standard Jack-Knife method overestimates the mass function error covariance in the low mass range. We explain these deviations and present a new improved covariance estimator.

31

Batet, Sanromà Montserrat. "Ontology based semantic clustering." Doctoral thesis, Universitat Rovira i Virgili, 2011. http://hdl.handle.net/10803/31913.

Abstract:

Els algoritmes de clustering desenvolupats fins al moment s’han centrat en el processat de dades numèriques i categòriques, no considerant dades textuals. Per manegar adequadament aquestes dades, es necessari interpretar el seu significat a nivell semàntic. En aquest treball es presenta un nou mètode de clustering que es capaç d’interpretar, de forma integrada, dades numèriques, categòriques i textuals. Aquest últims es processaran mitjançant mesures de similitud semàntica basades en 1) la utilització del coneixement taxonòmic contingut en una o diferents ontologies i 2) l’estimació de la distribució de la informació dels termes a la Web. Els resultats mostren que una interpretació precisa de la informació textual a nivell semàntic millora els resultats del clustering i facilita la interpretació de les classificacions.
Clustering algorithms have focused on the management of numerical and categorical data. However, in recent years, textual information has grown in importance. Proper processing of this kind of information within data mining methods requires interpreting its meaning at a semantic level. In this work, we present a clustering method that interprets numerical, categorical and textual data in an integrated manner. Textual data are interpreted by means of semantic similarity measures, which calculate the alikeness between words by exploiting one or several knowledge sources. We also propose two new ways of computing semantic similarity, based on 1) the exploitation of the taxonomical knowledge available in one or several ontologies and 2) the estimation of the information distribution of terms on the Web. Results show that a proper interpretation of textual data at a semantic level improves clustering results and eases the interpretability of the classifications.

32

Chang, Soong Uk. "Clustering with mixed variables /." [St. Lucia, Qld.], 2005. http://www.library.uq.edu.au/pdfserve.php?image=thesisabs/absthe19086.pdf.

33

Galåen, Magnus. "Dokument-klynging (document clustering)." Thesis, Norwegian University of Science and Technology, Department of Computer and Information Science, 2008. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-8868.

Abstract:

As document searching becomes more and more important with the rapid growth of document bases today, document clustering also becomes more important. Some of the most commonly used document clustering algorithms today are purely statistical in nature. Other algorithms have emerged, addressing some of the issues with numerical algorithms and claiming to be better. This thesis compares two well-known algorithms: Elliptic K-Means and Suffix Tree Clustering (STC). They are compared in speed and quality, and it is shown that Elliptic K-Means performs better in speed, while STC performs better in quality. It is further shown that STC performs better using small portions of relevant text (snippets) on real web data compared to the full document. It is also shown that a threshold value for base cluster merging is unnecessary. As STC is shown to perform adequately in speed when running on snippets only, it is concluded that STC is the better algorithm for the purpose of search results clustering.

34

Buchta, Christian, Martin Kober, Ingo Feinerer, and Kurt Hornik. "Spherical k-Means Clustering." American Statistical Association, 2012. http://epub.wu.ac.at/4000/1/paper.pdf.

Abstract:

Clustering text documents is a fundamental task in modern data analysis, requiring approaches which perform well both in terms of solution quality and computational efficiency. Spherical k-means clustering is one approach to address both issues, employing cosine dissimilarities to perform prototype-based partitioning of term weight representations of the documents. This paper presents the theory underlying the standard spherical k-means problem and suitable extensions, and introduces the R extension package skmeans which provides a computational environment for spherical k-means clustering featuring several solvers: a fixed-point and genetic algorithm, and interfaces to two external solvers (CLUTO and Gmeans). Performance of these solvers is investigated by means of a large scale benchmark experiment. (authors' abstract)
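
The fixed-point iteration mentioned in the abstract can be sketched roughly as follows. This is a plain Python illustration, not the R package skmeans or its interface: documents are scaled to unit length, each is assigned to the prototype with the highest cosine similarity, and each prototype is replaced by the normalized sum of its members. The function names and the naive initialization from the first k documents are assumptions made for this sketch.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def spherical_kmeans(docs, k, iterations=20):
    """Fixed-point spherical k-means on term-weight vectors (illustrative sketch)."""
    X = [normalize(d) for d in docs]
    # Naive initialization (an assumption): the first k documents act as prototypes.
    prototypes = [list(x) for x in X[:k]]
    labels = [0] * len(X)
    for _ in range(iterations):
        # Assignment step: on unit vectors the dot product is the cosine similarity.
        labels = [
            max(range(k), key=lambda c: sum(a * b for a, b in zip(x, prototypes[c])))
            for x in X
        ]
        # Update step: each prototype becomes the normalized sum of its members.
        for c in range(k):
            members = [x for x, label in zip(X, labels) if label == c]
            if members:
                prototypes[c] = normalize([sum(col) for col in zip(*members)])
    return labels, prototypes


docs = [[1, 0], [0, 1], [2, 0.1], [0.1, 3]]
labels, prototypes = spherical_kmeans(docs, k=2)
```

Because every vector is normalized first, document length has no influence on the assignment; only the direction of the term-weight vector matters.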

35

Cole, Rowena Marie. "Clustering with genetic algorithms." University of Western Australia. Dept. of Computer Science, 1998. http://theses.library.uwa.edu.au/adt-WU2003.0008.

Abstract:

Clustering is the search for those partitions that reflect the structure of an object set. Traditional clustering algorithms search only a small sub-set of all possible clusterings (the solution space) and consequently, there is no guarantee that the solution found will be optimal. We report here on the application of Genetic Algorithms (GAs) -- stochastic search algorithms touted as effective search methods for large and complex spaces -- to the problem of clustering. GAs which have been made applicable to the problem of clustering (by adapting the representation, fitness function, and developing suitable evolutionary operators) are known as Genetic Clustering Algorithms (GCAs). There are two parts to our investigation of GCAs: first we look at clustering into a given number of clusters. The performance of GCAs on three generated data sets, analysed using 4320 differing combinations of adaptions, establishes their efficacy. Choice of adaptions and parameter settings is data set dependent, but comparison between results using generated and real data sets indicate that performance is consistent for similar data sets with the same number of objects, clusters, attributes, and a similar distribution of objects. Generally, group-number representations are better suited to the clustering problem, as are dynamic scaling, elite selection and high mutation rates. Independent generalised models fitted to the correctness and timing results for each of the generated data sets produced accurate predictions of the performance of GCAs on similar real data sets. While GCAs can be successfully adapted to clustering, and the method produces results as accurate and correct as traditional methods, our findings indicate that, given a criterion based on simple distance metrics, GCAs provide no advantages over traditional methods. Second, we investigate the potential of genetic algorithms for the more general clustering problem, where the number of clusters is unknown. 
We show that only simple modifications to the adapted GCAs are needed. We have developed a merging operator, which with elite selection, is employed to evolve an initial population with a large number of clusters toward better clusterings. With regards to accuracy and correctness, these GCAs are more successful than optimisation methods such as simulated annealing. However, such GCAs can become trapped in local minima in the same manner as traditional hierarchical methods. Such trapping is characterised by the situation where good (k-1)-clusterings do not result from our merge operator acting on good k-clusterings. A marked improvement in the algorithm is observed with the addition of a local heuristic.

36

Hou, Jean Fen-ju. "Clustering with obstacle entities." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape7/PQDD_0023/MQ51360.pdf.

37

Tzerpos, Vassilios. "Comprehension-driven software clustering." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2001. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/NQ63614.pdf.

38

Shortreed, Susan. "Learning in spectral clustering /." Thesis, Connect to this title online; UW restricted, 2006. http://hdl.handle.net/1773/8977.

39

Sheth, Ravi Kiran. "Gravitational clustering of galaxies." Thesis, University of Cambridge, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.320096.

40

Stratton, R. A. "Clustering in light nuclei." Thesis, University of Oxford, 1985. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.355812.

41

Bielby, Richard. "Galaxy clustering and feedback." Thesis, Durham University, 2008. http://etheses.dur.ac.uk/2344/.

Abstract:

I cross-correlate the WMAP third year data with the ACO, APM and 2MASS galaxy and cluster catalogues, confirming the presence of the SZ effect in the WMAP 3rd year data around ACO, APM and 2MASS clusters, showing an increase in detection significance compared to previous analyses of the 1-year WMAP data release. I compare the cross-correlation results for a number of clusters to their SZ β-model profiles estimated from ROSAT and Chandra X-ray data. I conclude that the SZ profiles estimated from the β-model over-predict the observed SZ effect in the cluster samples. Additionally, I develop colour cuts using the SDSS optical bands to photometrically select emission line galaxies at redshifts of z < 0.35, 0.35 < z < 0.55 and z > 0.55. The selections have been calibrated using a combination of photometric redshifts from the COMBO-17 survey and spectroscopic observations. I estimate correlation lengths of $r_0 = 2.64^{+2.64}_{-0.08}\,h^{-1}$ Mpc, $r_0 = 3.62 \pm 0.06\,h^{-1}$ Mpc and $r_0 = 5.88 \pm 0.12\,h^{-1}$ Mpc for the low, mid and high redshift samples respectively. Using these photometric samples I search for the Integrated Sachs-Wolfe signal in the WMAP 5yr data, but find no significant detection. I also present a survey of star-forming galaxies at z ≈ 3. Using Lyman Break and U-dropout photometric selections, we identify a total of ≈ 21,000 candidate z > 2 galaxies and perform spectroscopic observations of a selection of these candidates with integration times of 10,000s with the VLT VIMOS. In total this survey has so far produced 1149 LBGs at redshifts of 2 < z < 3.5 over a total area of 1.18 deg$^2$, with a mean redshift of $\bar{z} = 2.87 \pm 0.34$. Using both the photometric and spectroscopic LBG catalogues, I investigate the clustering properties of the z > 2 galaxy sample using the angular correlation function, measuring a clustering amplitude of $r_0 = 4.32^{+0.13}_{-0.12}\,h^{-1}$ Mpc with a slope of $\gamma_2 = 1.90^{+0.09}_{-0.14}$ at separations of $r > 0.4\,h^{-1}$ Mpc. We then measure the redshift space clustering based on the spectroscopically observed sample and estimate the infall parameter, β, of the sample by fitting a redshift space distortion model to $\xi(\sigma, \pi)$. To conclude this work, I analyze the correlation of LBGs with the Lyα forest transmissivity of a number of z ~ 3 QSOs, with the aim of looking for the imprint of high velocity winds on the IGM. The data show a fall in the transmissivity in the Lyα forest at scales of $5\,h^{-1}$ Mpc < r < $10\,h^{-1}$ Mpc away from LBGs, indicating an increase in gas densities at these scales. However we find no significant change from the mean transmissivity at scales of < $3\,h^{-1}$ Mpc, potentially signifying the presence of low density ionised regions close to LBGs.

42

Smith, Robert James. "QSO clustering and environments." Thesis, University of Cambridge, 1998. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.624809.

43

Isheden, Gabriel. "Bayesian Hierarchic Sample Clustering." Thesis, KTH, Matematik (Inst.), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-168316.

Abstract:

This report presents a novel algorithm for hierarchical clustering called Bayesian Sample Clustering (BSC). BSC is a single linkage algorithm that uses data samples to produce a predictive distribution for each sample. The predictive distributions are compared using the Chan-Darwiche distance, a metric for finite probability distributions, to produce a hierarchy of samples. The implemented version of BSC is found at https://github.com/Skjulet/Bayesian Sample Clustering.
Denna rapport presenterar en ny algoritm för hierarkisk klustring, Bayesian Sample Clustering (BSC). BSC är en single-linkage algoritm som använder stickprov av data för att skapa en prediktiv fördelning för varje stickprov. De prediktiva fördelningarna jämförs med Chan-Darwiche avståndet, en metrik över ändliga sannolikhetsfördelningar, vilket möjliggör skapandet av en hierarki av kluster. BSC finns i implementerad version på https://github.com/Skjulet/Bayesian Sample Clustering.
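
For readers unfamiliar with the Chan-Darwiche distance named in the abstract, the following Python sketch computes it for two finite distributions given as lists of probabilities over the same outcomes. It uses the standard definition, D(P, Q) = ln max_w Q(w)/P(w) - ln min_w Q(w)/P(w). The function name is hypothetical, the restriction to strictly positive probabilities is a simplifying assumption, and nothing here is taken from the BSC implementation.

```python
import math


def chan_darwiche_distance(p, q):
    """Chan-Darwiche distance between two finite distributions over the same outcomes.

    Simplifying assumption of this sketch: all probabilities are strictly positive.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same outcomes")
    if any(x <= 0 for x in p) or any(x <= 0 for x in q):
        raise ValueError("this sketch requires strictly positive probabilities")
    ratios = [qw / pw for pw, qw in zip(p, q)]
    return math.log(max(ratios)) - math.log(min(ratios))


d = chan_darwiche_distance([0.5, 0.5], [0.25, 0.75])  # equals ln 3 up to rounding
```

The measure is symmetric and is zero exactly when the two distributions coincide, so it can serve as the pairwise dissimilarity in a single-linkage procedure.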

44

Hahmann, Martin. "Feedback-Driven Data Clustering." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-135647.

Abstract:

The acquisition of data and its analysis has become a common yet critical task in many areas of the modern economy and research. Unfortunately, the ever-increasing scale of datasets has long outgrown the capacities and abilities humans can muster to extract information from them and gain new knowledge. For this reason, research areas like data mining and knowledge discovery steadily gain importance. The algorithms they provide for the extraction of knowledge are mandatory prerequisites that enable people to analyze large amounts of information. Among the approaches offered by these areas, clustering is one of the most fundamental. By finding groups of similar objects inside the data, it aims to identify meaningful structures that constitute new knowledge. Clustering results are also often used as input for other analysis techniques like classification or forecasting. As clustering extracts new and unknown knowledge, it has no access to any form of ground truth. For this reason, clustering results have a hypothetical character and must be interpreted with respect to the application domain. This makes clustering very challenging and leads to an extensive and diverse landscape of available algorithms. Most of these are expert tools that are tailored to a single narrowly defined application scenario. Over the years, this specialization has become a major trend that arose to counter the inherent uncertainty of clustering by including as many domain specifics as possible in algorithms. While customized methods often improve result quality, they become more and more complicated to handle and lose versatility. This creates a dilemma, especially for amateur users, whose numbers are increasing as clustering is applied in more and more domains. While an abundance of tools is offered, guidance is severely lacking and users are left alone with critical tasks like algorithm selection, parameter configuration, and the interpretation and adjustment of results.
This thesis aims to solve this dilemma by structuring and integrating the necessary steps of clustering into a guided and feedback-driven process. In doing so, users are provided with a default modus operandi for the application of clustering. Two main components constitute the core of this process: the algorithm management and the visual-interactive interface. Algorithm management handles all aspects of actual clustering creation and the involved methods. It employs a modular approach for algorithm description that allows users to understand, design, and compare clustering techniques with the help of building blocks. In addition, algorithm management offers facilities for the integration of multiple clusterings of the same dataset into an improved solution. New approaches based on ensemble clustering not only allow the utilization of different clustering techniques, but also ease their application by acting as an abstraction layer that unifies individual parameters. Finally, this component provides a multi-level interface that structures all available control options and provides the docking points for user interaction. The visual-interactive interface supports users during result interpretation and adjustment. For this, the defining characteristics of a clustering are communicated via a hybrid visualization. In contrast to traditional data-driven visualizations that tend to become overloaded and unusable with increasing volume or dimensionality of data, this novel approach communicates the abstract aspects of cluster composition and relations between clusters. This aspect orientation allows the use of easy-to-understand visual components and makes the visualization immune to scale-related effects of the underlying data. This visual communication is attuned to a compact and universally valid set of high-level feedback that allows the modification of clustering results.
Instead of technical parameters that indirectly cause changes in the whole clustering by influencing its creation process, users can employ simple commands like merge or split to directly adjust clusters. The orchestrated cooperation of these two main components creates a modus operandi in which clusterings are no longer created and discarded as a whole until a satisfying result is obtained. Instead, users apply the feedback-driven process to iteratively refine an initial solution. Performance and usability of the proposed approach were evaluated with a user study. Its results show that the feedback-driven process enabled amateur users to easily create satisfying clustering results even from different and suboptimal starting situations.
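
The merge and split commands mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the function names `merge` and `split`, the representation of a clustering as a list of integer labels, and the split rule (cut the cluster at the midpoint of its widest coordinate) are all assumptions made for illustration.

```python
from typing import List, Sequence


def merge(labels: List[int], keep: int, absorb: int) -> List[int]:
    """Relabel every member of cluster `absorb` as cluster `keep`."""
    return [keep if lab == absorb else lab for lab in labels]


def split(labels: List[int], points: Sequence[Sequence[float]], target: int) -> List[int]:
    """Split cluster `target` in two at the midpoint of its widest coordinate.

    Hypothetical split rule: the thesis abstract does not say how a split is computed.
    """
    members = [i for i, lab in enumerate(labels) if lab == target]
    if len(members) < 2:
        return list(labels)  # nothing to split

    def spread(dim: int) -> float:
        values = [points[i][dim] for i in members]
        return max(values) - min(values)

    # Pick the coordinate along which the cluster is most stretched out.
    dim = max(range(len(points[0])), key=spread)
    values = [points[i][dim] for i in members]
    cut = (max(values) + min(values)) / 2

    new_label = max(labels) + 1
    result = list(labels)
    for i in members:
        if points[i][dim] > cut:
            result[i] = new_label  # upper half becomes a new cluster
    return result


points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0)]
labels = [0, 0, 0, 0, 1]
after_split = split(labels, points, target=0)
after_merge = merge(after_split, keep=2, absorb=1)
```

Each command changes only the clusters it names, which is the contrast the abstract draws with technical parameters that affect the whole clustering.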

45

Al-Harbi, Sami. "Clustering in metric spaces." Thesis, University of East Anglia, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.396604.

46

Zhou, Hanson M. (Hanson Mi) 1977. "Clustering via matrix exponentiation." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/17671.

Abstract:

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2004.
Includes bibliographical references (leaves 26-27).
Given a set of n points with a matrix of pairwise similarity measures, one would like to partition the points into clusters so that similar points are together and different ones apart. We present an algorithm requiring only matrix exponentiation that performs well in practice and bears an elegant interpretation in terms of random walks on a graph. Under a certain mixture model involving planting a partition via randomized rounding of tailored matrix entries, the algorithm can be proven effective for only a single squaring. It is shown that the clustering performance of the algorithm degrades with larger values of the exponent, thus revealing that a single squaring is optimal.
by Hanson M. Zhou.
S.M.
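
The random-walk reading of a single squaring can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm from the thesis: the similarity matrix is row-normalized into a transition matrix, squared once, and points are then grouped when their two-step transition probability exceeds the uniform level 1/n. The grouping rule and the names `row_normalize`, `matmul`, and `cluster_by_squaring` are hypothetical, and every row of the similarity matrix is assumed to have a positive sum.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(w: Matrix) -> Matrix:
    """Turn a similarity matrix into random-walk transition probabilities."""
    return [[value / sum(row) for value in row] for row in w]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def cluster_by_squaring(w: Matrix) -> List[int]:
    """Label points by connected components of the thresholded two-step walk.

    Assumed grouping rule: i and j are linked when both two-step
    probabilities exceed 1/n, the value a structureless walk would give.
    """
    n = len(w)
    p = row_normalize(w)
    p2 = matmul(p, p)  # a single squaring: two-step transition probabilities
    threshold = 1.0 / n

    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over linked points
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and p2[i][j] > threshold and p2[j][i] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Two planted groups of three points: similarity 1.0 inside a group, 0.1 across.
similarity = [[1.0 if (i < 3) == (j < 3) else 0.1 for j in range(6)] for i in range(6)]
found = cluster_by_squaring(similarity)
```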

47

Bouvrie, Jacob V. "Multi-source contingency clustering." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/33122.

Abstract:

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.
Includes bibliographical references (p. 93-96).
This thesis examines the problem of clustering multiple, related sets of data simultaneously. Given datasets which are in some way connected (e.g. temporally) but which do not necessarily share label compatibility, we exploit co-occurrence information in the form of normalized multidimensional contingency tables in order to recover robust mappings between data points and clusters for each of the individual data sources. We outline a unifying formalism by which one might approach cross-channel clustering problems, and begin by defining an information-theoretic objective function that is small when the clustering can be expected to be good. We then propose and explore several multi-source algorithms for optimizing this and other relevant objective functions, borrowing ideas from both continuous and discrete optimization methods. More specifically, we adapt gradient-based techniques, simulated annealing, and spectral clustering to the multi-source clustering problem. Finally, we apply the proposed algorithms to a multi-source human identification task, where the overall goal is to cluster grayscale face images according to identity, using additional temporally connected features. It is our hope that the proposed multi-source clustering framework can ultimately shed light on the problem of when and how models might be automatically created to account for, and adapt to, novel individuals as a surveillance/recognition system accumulates sensory experience.
by Jacob V. Bouvrie.
M.Eng.

48

Bădoiu, Mihai 1978. "Clustering in high dimensions." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/87376.

Abstract:

Thesis (M.Eng. and S.B.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.
Includes bibliographical references (p. 47-48).
by Mihai Bădoiu.
M.Eng. and S.B.

49

Dimitriadou, Evgenia, Andreas Weingessel, and Kurt Hornik. "Fuzzy voting in clustering." SFB Adaptive Information Systems and Modelling in Economics and Management Science, WU Vienna University of Economics and Business, 1999. http://epub.wu.ac.at/742/1/document.pdf.

Abstract:

In this paper we present a fuzzy voting scheme for cluster algorithms. This fuzzy voting method allows us to combine several runs of cluster algorithms resulting in a common fuzzy partition. This helps us to overcome instabilities of the cluster algorithms and results in a better clustering.
Series: Report Series SFB "Adaptive Information Systems and Modelling in Economics and Management Science"
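
The idea of combining several clustering runs into one fuzzy partition can be sketched with a simple voting scheme. This is an illustrative assumption, not the procedure from the paper: each run's labels are first aligned to the first run by trying every label permutation (practical only for a small number of clusters), and each point's membership in a cluster is then the fraction of runs that voted for it. The names `align` and `fuzzy_vote` are hypothetical.

```python
from itertools import permutations
from typing import List


def align(reference: List[int], labels: List[int], k: int) -> List[int]:
    """Rename the labels of one run so they agree with `reference` as often as possible."""
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(perm[lab] == ref for lab, ref in zip(labels, reference)),
    )
    return [best[lab] for lab in labels]


def fuzzy_vote(runs: List[List[int]], k: int) -> List[List[float]]:
    """Return an n-by-k membership matrix: the share of runs voting for each cluster."""
    reference = runs[0]
    votes = [[0] * k for _ in reference]
    for labels in runs:
        for point, lab in enumerate(align(reference, labels, k)):
            votes[point][lab] += 1
    return [[count / len(runs) for count in row] for row in votes]


# Three runs over four points; the second run uses swapped label names.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
membership = fuzzy_vote(runs, k=2)
```

In this example the second point receives memberships of 2/3 and 1/3 because one of the three runs disagrees about it, while the other points are assigned unanimously.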

50

Madureira, Erikson Manuel Geraldo Vieira de. "Análise de mercado : clustering." Master's thesis, Instituto Superior de Economia e Gestão, 2016. http://hdl.handle.net/10400.5/13122.

Abstract:

Master's in Economic and Business Decision-Making
This report describes the activities carried out during an internship at the company Quidgest. Because the company needed to study its various lines of business, it was decided to extract and identify the information held in the company's database. To this end, a well-known data analysis process, Knowledge Discovery in Databases (KDD), was used. The biggest challenge in applying this process was the large volume of information accumulated by the company, which has grown more rapidly since 2013. Of the phases of the KDD process, the most relevant is data mining, in which the characterizing variables required for the analysis are studied. Cluster analysis was chosen as the data mining technique so that the analysis would be efficient and effective and would produce results that are easy to read. After the KDD process had been developed, it was decided that the data mining phase could be automated to facilitate future analyses by the company. To automate this phase, cluster analysis techniques were used and a user-centered program was developed in VBA/Excel. The program was tested on a concrete case from the company: determining which current customers contributed most to the company's growth from 2013 to 2015. Applying the program to this case yielded results and information that were then analyzed and interpreted.
