Theses / dissertations on the topic "Variables clustering"
See the top 50 works (theses / dissertations) for research on the topic "Variables clustering".
Chang, Soong Uk. "Clustering with mixed variables /". [St. Lucia, Qld.], 2005. http://www.library.uq.edu.au/pdfserve.php?image=thesisabs/absthe19086.pdf.
Endrizzi, Isabella <1975>. "Clustering of variables around latent components: an application in consumer science". Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2008. http://amsdottorato.unibo.it/667/1/Tesi_Endrizzi_Isabella.pdf.
Endrizzi, Isabella <1975>. "Clustering of variables around latent components: an application in consumer science". Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2008. http://amsdottorato.unibo.it/667/.
Saraiya, Devang. "The Impact of Environmental Variables in Efficiency Analysis: A fuzzy clustering-DEA Approach". Thesis, Virginia Tech, 2005. http://hdl.handle.net/10919/34637.
Master of Science
Dean, Nema. "Variable selection and other extensions of the mixture model clustering framework /". Thesis, Connect to this title online; UW restricted, 2006. http://hdl.handle.net/1773/8943.
Doan, Nath-Quang. "Modèles hiérarchiques et topologiques pour le clustering et la visualisation des données". Paris 13, 2013. http://scbd-sto.univ-paris13.fr/secure/edgalilee_th_2013_doan.pdf.
This thesis focuses on clustering approaches inspired by topological models and on an autonomous hierarchical clustering method. The clustering problem becomes more complicated and difficult due to the growth in quality and quantity of structured data such as graphs, trees or sequences. In this thesis, we are particularly interested in self-organizing maps, which have generally been used for learning topological preservation, clustering, vector quantization and graph visualization. Our study also concerns a hierarchical clustering method, AntTree, which models the ability of real ants to build structures by connecting themselves to one another. By combining the topological map with the self-assembly rules inspired by AntTree, the goal is to represent data in a hierarchical and topological structure that provides more insight into the data. The advantage is that the clustering results can be visualized as multiple hierarchical trees and a topological network. In this report, we present three new models that address clustering, visualization and feature selection problems. In the first model, our study shows the value of a hierarchical and topological structure through several applications on numerical datasets as well as structured datasets, e.g. graphs and biological datasets. The second model consists of a flexible and growing structure which does not impose strict network-topology preservation rules. Using statistical characteristics provided by hierarchical trees, it significantly accelerates the learning process. The third model specifically addresses unsupervised feature selection. The idea is to use the hierarchical structure provided by AntTree to automatically discover local data structure and local neighbors. Using the tree topology, we propose a new score for feature selection by constraining the Laplacian score. Finally, this thesis offers several perspectives for future work.
Ndaoud, Mohamed. "Contributions to variable selection, clustering and statistical estimation in high dimension". Electronic Thesis or Diss., Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLG005.
This PhD thesis deals with the following statistical problems: variable selection in high-dimensional linear regression, clustering in the Gaussian mixture model, some effects of adaptivity under sparsity, and simulation of Gaussian processes. Under the sparsity assumption, variable selection corresponds to recovering the "small" set of significant variables. We study non-asymptotic properties of this problem in high-dimensional linear regression. Moreover, we recover optimal necessary and sufficient conditions for variable selection in this model. We also study some effects of adaptation under sparsity. Namely, in the sparse vector model, we investigate the changes in the estimation rates of some of the model parameters when the noise level or its nominal law is unknown. Clustering is an unsupervised machine learning task that aims to group observations that are close to each other in some sense. We study the problem of community detection in the Gaussian mixture model with two components, and precisely characterize the sharp separation between clusters required to recover the clusters exactly. We also provide a fast polynomial-time procedure achieving optimal recovery. Gaussian processes are extremely useful in practice, for instance when modeling price fluctuations. Nevertheless, their simulation is not easy in general. We propose and study a new rate-optimal series expansion to simulate a large class of Gaussian processes.
Naik, Vaibhav C. "Fuzzy C-means clustering approach to design a warehouse layout". [Tampa, Fla.] : University of South Florida, 2004. http://purl.fcla.edu/fcla/etd/SFE0000437.
Ndaoud, Mohamed. "Contributions to variable selection, clustering and statistical estimation in high dimension". Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLG005/document.
Giacofci, Joyce. "Classification non supervisée et sélection de variables dans les modèles mixtes fonctionnels. Applications à la biologie moléculaire". Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENM025/document.
More and more scientific studies lead to the collection of large amounts of data consisting of sets of curves recorded on individuals. These data can be seen as an extension of longitudinal data in high dimension and are often modeled as functional data in a mixed-effects framework. In the first part, we focus on unsupervised clustering of these curves in the presence of inter-individual variability. To this end, we develop a new procedure based on a wavelet representation of the model, for both fixed and random effects. Our approach follows two steps: a dimension reduction step, based on wavelet thresholding techniques, is performed first; then a clustering step is applied to the selected coefficients. An EM algorithm is used for maximum likelihood estimation of the parameters. The properties of the overall procedure are validated by an extensive simulation study. We also illustrate our method on high-throughput molecular data (omics data) such as microarray CGH or mass spectrometry data. Our procedure is available through the R package "curvclust", available on the CRAN website. In the second part, we concentrate on estimation and dimension reduction issues in the mixed-effects functional framework. Two distinct approaches are developed to address these issues. The first approach deals with parameter estimation in a nonparametric setting. We demonstrate that the functional fixed-effects estimator based on wavelet thresholding techniques achieves the expected rate of convergence toward the true function. The second approach is dedicated to the selection of both fixed and random effects. We propose a method based on a penalized likelihood criterion with SCAD penalties for the estimation and selection of both fixed effects and random-effects variances. In the context of variable selection, we prove that the penalized estimators enjoy the oracle property when the signal size diverges with the sample size. A simulation study is carried out to assess the behaviour of the two proposed approaches.
Michel, Pierre. "Sélection d'items en classification non supervisée et questionnaires informatisés adaptatifs : applications à des données de qualité de vie liée à la santé". Thesis, Aix-Marseille, 2016. http://www.theses.fr/2016AIXM4097/document.
An adaptive test provides a valid measure of patients' quality of life and reduces the number of items to be filled in. This approach depends on the models used, which are sometimes based on unverifiable assumptions. We propose an alternative approach based on decision trees. This approach is not based on any assumptions and requires less calculation time for item administration. We present different simulations that demonstrate the relevance of our approach. We present an unsupervised classification method called CUBT. CUBT includes three steps to obtain an optimal partition of a data set. The first step grows a tree by recursively dividing the data set. The second step groups together pairs of terminal nodes of the tree. The third step aggregates terminal nodes that do not come from the same split. Different simulations are presented to compare CUBT with other approaches. We also define heuristics for the choice of CUBT parameters. CUBT identifies the variables that are active in the construction of the tree. However, although some variables may be irrelevant, they may be competitive with the active variables. It is essential to rank the variables according to an importance score to determine their relevance in a given model. We present a method to measure the importance of variables based on CUBT and competitive binary splits to define a variable importance score. We analyze the efficiency and stability of this new index, comparing it with other methods.
Youssfi, Younès. "Exploring Risk Factors and Prediction Models for Sudden Cardiac Death with Machine Learning". Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAG006.
Sudden cardiac death (SCD) is defined as a sudden natural death presumed to be of cardiac cause, heralded by abrupt loss of consciousness in the presence of a witness or, in the absence of a witness, occurring within an hour after the onset of symptoms. Despite progress in clinical profiling and interventions, it remains a major public health problem, accounting for 10 to 20% of deaths in industrialised countries, with survival after SCD below 10%. The annual incidence is estimated at 350,000 in Europe and 300,000 in the United States. Efficient treatments for SCD management are available. One of the most effective options is the use of implantable cardioverter defibrillators (ICD). However, identifying the best candidates for ICD implantation remains a difficult challenge, with disappointing results so far. This thesis aims to address this problem and to provide a better understanding of SCD in the general population, using statistical modeling. We analyze data from the Paris Sudden Death Expertise Center and the French National Healthcare System Database to develop three main works. The first part of the thesis aims to identify new subgroups of SCD to improve current stratification guidelines, which are mainly based on cardiovascular variables. To this end, we use natural language processing methods and clustering analysis to build a meaningful representation of patients' medical history. The second part aims to build a prediction model of SCD in order to propose a personalized and explainable risk score for each patient and to accurately identify very-high-risk subjects in the general population. To this end, we train a supervised classification algorithm, combined with the SHapley Additive exPlanation method, to analyze all medical events that occurred up to 5 years prior to the event. The last part of the thesis aims to identify the most relevant information to select from patients' large medical histories. We propose a bi-level variable selection algorithm for generalized linear models, in order to identify both individual and group effects from predictors. Our algorithm is based on a Bayesian approach and uses a Sequential Monte Carlo method to estimate the posterior distribution of variable inclusion.
Ouali, Abdelkader. "Méthodes hybrides parallèles pour la résolution de problèmes d'optimisation combinatoire : application au clustering sous contraintes". Thesis, Normandie, 2017. http://www.theses.fr/2017NORMC215/document.
Combinatorial optimization problems have become the target of much scientific research because of their importance in solving academic problems and real problems encountered in engineering and industry. Solving these problems by exact methods is often intractable because of the exorbitant processing time these methods would require to reach the optimal solution(s). In this thesis, we are interested in the algorithmic context of solving combinatorial problems and in the modeling context of these problems. At the algorithmic level, we explore hybrid methods, which excel in their ability to combine exact and approximate methods in order to rapidly produce high-quality solutions. At the modeling level, we work on the specification and exact resolution of complex problems in pattern set mining, in particular by studying scaling issues in large databases. On the one hand, we propose a first parallelization of the DGVNS algorithm, called CPDGVNS, which explores the different clusters of the tree decomposition in parallel, sharing the best overall solution on a master-worker model. Two other strategies, called RADGVNS and RSDGVNS, are proposed, which improve the frequency of exchanging intermediate solutions between the different processes. Experiments carried out on difficult combinatorial problems show the effectiveness of our parallel methods. On the other hand, we propose a hybrid approach combining techniques from both Integer Linear Programming (ILP) and pattern mining. Our approach is comprehensive and takes advantage of the general ILP framework (which provides a high level of flexibility and expressiveness) and of specialized heuristics for data mining (to improve computing time). In addition to the general framework for pattern set mining, two problems are studied: conceptual clustering and the tiling problem. The experiments carried out show the contribution of our proposal relative to constraint-based approaches and specialized heuristics.
Ta, Minh Thuy. "Techniques d'optimisation non convexe basée sur la programmation DC et DCA et méthodes évolutives pour la classification non supervisée". Thesis, Université de Lorraine, 2014. http://www.theses.fr/2014LORR0099/document.
This thesis focuses on four problems in data mining and machine learning: clustering data streams, clustering massive data sets, weighted hard and fuzzy clustering, and finally clustering without prior knowledge of the number of clusters. Our methods are based on deterministic optimization approaches, namely DC (Difference of Convex functions) programming and DCA (Difference of Convex Algorithm), for solving some classes of the clustering problems cited above. Our methods are also based on elitist evolutionary approaches. We adapt the clustering algorithm DCA–MSSC to deal with data streams using two window models: sub-windows and sliding windows. For the problem of clustering massive data sets, we propose to use the DCA algorithm in two phases. In the first phase, the massive data is divided into several subsets, on which the algorithm DCA–MSSC performs clustering. In the second phase, we propose a DCA–Weight algorithm to perform a weighted clustering on the centers obtained in the first phase. For weighted clustering, we also propose two approaches: weighted hard clustering and weighted fuzzy clustering. We test our approach on an image segmentation application. The final issue addressed in this thesis is clustering without prior knowledge of the number of clusters. We propose an elitist evolutionary approach, where we apply several evolutionary algorithms (EAs) at the same time, to find the optimal combination of initial cluster seeds and, at the same time, the optimal number of clusters. The various tests performed on several large data sets are very promising and demonstrate the effectiveness of the proposed approaches.
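The two-phase scheme described in this abstract (cluster each subset, then run a weighted clustering on the resulting centers) can be illustrated with a minimal sketch. This is not the thesis's DCA–MSSC or DCA–Weight: a plain weighted Lloyd-style k-means is swapped in as an assumption, with a deterministic farthest-first initialisation, and the names `weighted_kmeans` and `two_phase_clustering` are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_kmeans(points, weights, k, n_iter=20):
    """Weighted Lloyd-style k-means (a stand-in for DCA, not the thesis's algorithm)."""
    # Deterministic farthest-first initialisation: start from the first point,
    # then repeatedly add the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    dim = len(points[0])
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center becomes the weighted mean of its members.
        for j in range(k):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            total = sum(w for _, w in members)
            if total > 0:
                centers[j] = tuple(
                    sum(w * p[d] for p, w in members) / total for d in range(dim)
                )
    return centers, labels


def two_phase_clustering(data, k, n_chunks=4):
    """Phase 1: cluster each chunk. Phase 2: weighted clustering of the chunk centers."""
    size = -(-len(data) // n_chunks)  # ceiling division
    centers_all, weights_all = [], []
    for start in range(0, len(data), size):
        chunk = data[start:start + size]
        centers, labels = weighted_kmeans(chunk, [1.0] * len(chunk), k)
        for j, center in enumerate(centers):
            count = labels.count(j)
            if count:
                # Each chunk center is weighted by the number of points it represents.
                centers_all.append(center)
                weights_all.append(float(count))
    final_centers, _ = weighted_kmeans(centers_all, weights_all, k)
    return final_centers
```

Weighting each first-phase center by its cluster size means the second phase works with the same cluster means it would obtain from the raw points, while only handling `k * n_chunks` items.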
Jaafar, Amine. "Traitement de la mission et des variables environnementales et intégration au processus de conception systémique". Thesis, Toulouse, INPT, 2011. http://www.theses.fr/2011INPT0070/document.
This work presents a methodological approach for analyzing and processing mission profiles and, more generally, environmental variables (e.g. solar or wind energy potential, temperature, boundary conditions) in the context of system design. This process is a key issue in ensuring system effectiveness with regard to design constraints and objectives. In this thesis, we pay particular attention to the use of compact profiles for environmental variables in the framework of system-level integrated optimal design, which requires a large number of system simulations. In the first part, we propose a clustering approach based on partition criteria with the aim of analyzing mission profiles. This phase can help designers identify different system configurations corresponding to the clusters: it may guide suppliers towards a "market segmentation" that fulfils not only economic constraints but also technical design objectives. The second stage of the study proposes a synthesis process for a compact profile which represents the data of the studied environmental variable. This compact profile is generated by combining the parameters and number of elementary patterns (segment, sine or cardinal sine) with regard to design indicators. The latter are established with respect to the main objectives and constraints associated with the designed system. All pattern parameters are obtained by solving the corresponding inverse problem with evolutionary algorithms. Finally, this synthesis process is applied to two different case studies. The first consists in simplifying wind data obtained from measurements at two geographic sites in Guadeloupe and Tunisia. The second deals with the reduction of a set of railway mission profiles for a hybrid locomotive devoted to shunting and switching missions. These examples show that our approach leads to a substantial reduction of the profiles associated with environmental variables, which allows a significant decrease in computational time in the context of an integrated optimal design process.
Meynet, Caroline. "Sélection de variables pour la classification non supervisée en grande dimension". Phd thesis, Université Paris Sud - Paris XI, 2012. http://tel.archives-ouvertes.fr/tel-00752613.
Boulin, Alexis. "Partitionnement des variables de séries temporelles multivariées selon la dépendance de leurs extrêmes". Electronic Thesis or Diss., Université Côte d'Azur, 2024. http://www.theses.fr/2024COAZ5039.
In a wide range of applications, from climate science to finance, extreme events with a non-negligible probability can occur, leading to disastrous consequences. Extremes in climatic events such as wind, temperature, and precipitation can profoundly impact humans and ecosystems, resulting in events like floods, landslides, or heatwaves. When the focus is on studying variables measured over time at numerous specific locations, such as the previously mentioned variables, partitioning these variables becomes essential to summarize and visualize spatial trends, which is crucial in the study of extreme events. This thesis explores several models and methods for partitioning the variables of a multivariate stationary process, focusing on extreme dependencies. Chapter 1 introduces the concepts of modeling dependence through copulas, which are fundamental for extreme dependence. The notion of regular variation, essential for studying extremes, is introduced, and weakly dependent processes are discussed. Partitioning is examined through the paradigms of separation-proximity and model-based clustering. Non-asymptotic analysis is also addressed to evaluate our methods in fixed dimensions. Chapter 2 studies the dependence between maximum values, which is crucial for risk analysis. Using the extreme value copula function and the madogram, this chapter focuses on non-parametric estimation with missing data. A functional central limit theorem is established, demonstrating the convergence of the madogram to a tight Gaussian process. Formulas for the asymptotic variance are presented, illustrated by a numerical study. Chapter 3 proposes asymptotically independent block (AI-block) models for partitioning variables, defining clusters based on the independence of maxima. An algorithm is introduced to recover clusters without specifying their number in advance. The theoretical efficiency of the algorithm is demonstrated, and a data-driven parameter selection method is proposed. The method is applied to neuroscience and environmental data, showcasing its potential. Chapter 4 adapts partitioning techniques to analyze composite extreme events in European climate data. Sub-regions with dependencies in extreme precipitation and wind speed are identified using ERA5 data from 1979 to 2022. The obtained clusters are spatially concentrated, offering a deep understanding of the regional distribution of extremes. The proposed methods efficiently reduce data size while extracting critical information on extreme events. Chapter 5 proposes a new estimation method for matrices in a latent factor linear model, where each component of a random vector is expressed by a linear equation with factors and noise. Unlike classical approaches based on joint normality, we assume factors are distributed according to standard Fréchet distributions, allowing a better description of extreme dependence. An estimation method is proposed, ensuring a unique solution under certain conditions. An adaptive upper bound for the estimator is provided, adaptable to the dimension and the number of factors.
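To make the madogram mentioned for Chapter 2 concrete, here is a minimal sketch of the pairwise rank-based madogram for complete data. It assumes no ties and no missing values, so it does not reproduce the thesis's missing-data estimator. The conversion to an extremal coefficient uses the standard bivariate relation θ = (1 + 2ν) / (1 − 2ν), which is not stated in the abstract; the function names are hypothetical.

```python
def ecdf_ranks(values):
    """Rank-based empirical CDF values in (0, 1); assumes the values have no ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def madogram(x, y):
    """Empirical madogram: half the mean absolute difference of the rank-transformed series."""
    u, v = ecdf_ranks(x), ecdf_ranks(y)
    return 0.5 * sum(abs(a - b) for a, b in zip(u, v)) / len(x)


def extremal_coefficient(x, y):
    """Convert the madogram nu into theta = (1 + 2 nu) / (1 - 2 nu)."""
    nu = madogram(x, y)
    return (1 + 2 * nu) / (1 - 2 * nu)


# Two series that always move together have identical ranks, so nu = 0 and theta = 1.
x = [3.0, 1.0, 2.0]
y = [30.0, 10.0, 20.0]
print(extremal_coefficient(x, y))
```

In practice `x` and `y` would be series of block maxima; values of θ near 1 suggest strongly dependent maxima and values near 2 suggest nearly independent ones, which is the kind of pairwise information a variable-partitioning method can build on.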
Labenne, Amaury. "Méthodes de réduction de dimension pour la construction d'indicateurs de qualité de vie". Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0239/document.
The purpose of this thesis is to develop and suggest new dimension reduction methods to construct composite indicators on a municipal scale. The developed statistical methodology highlights the consideration of the multi-dimensionality of the quality of life concept, with particular attention to the treatment of mixed data (quantitative and qualitative variables) and the introduction of environmental conditions. We opt for a variable clustering approach and for a multi-table method (multiple factorial analysis for mixed data). These two methods allow us to build composite indicators that we propose as a measure of living conditions at the municipal scale. In order to facilitate the interpretation of the created composite indicators, we introduce a variable selection method based on a bootstrap approach. Finally, we suggest a method for clustering observations, named hclustgeo, which integrates geographical proximity constraints into the clustering procedure in order to better capture spatial specificities.
Kuentz, Vanessa. "Contributions à la réduction de dimension". Thesis, Bordeaux 1, 2009. http://www.theses.fr/2009BOR13871/document.
This thesis concentrates on dimension reduction approaches, which seek lower-dimensional subspaces that minimize the loss of statistical information. First we focus on multivariate analysis for categorical data. The rotation problem in Multiple Correspondence Analysis (MCA) is treated. We give the analytic expression of the optimal angle of planar rotation for the chosen criterion. If more than two principal components are to be retained, this planar solution is used in a practical algorithm applying successive pairwise planar rotations. Different algorithms for the clustering of categorical variables are also proposed to maximize a given partitioning criterion based on correlation ratios. A real data application highlights the benefits of using rotation in MCA and provides an empirical comparison of the proposed algorithms for categorical variable clustering. Then we study the semiparametric regression method SIR (Sliced Inverse Regression). We propose an extension based on the partitioning of the predictor space that can be used when the crucial linearity condition of the predictor is not satisfied. We also introduce bagging versions of SIR to improve the estimation of the basis of the dimension reduction subspace. Asymptotic properties of the estimators are obtained, and a simulation study shows the good numerical behaviour of the proposed methods. Finally, applied multivariate data analysis in various areas is described.
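The partitioning criterion above is built from correlation ratios. As a small illustration, the following sketch computes the correlation ratio η² between one categorical variable and one numeric variable, defined as the between-group sum of squares divided by the total sum of squares. The abstract does not give the full criterion, so only this building block is shown, and the function name is hypothetical.

```python
from collections import defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio eta^2: between-group sum of squares over total sum of squares.

    Assumes `values` is not constant (otherwise the total sum of squares is zero).
    """
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    grand_mean = sum(values) / len(values)
    total = sum((v - grand_mean) ** 2 for v in values)
    between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups.values()
    )
    return between / total


# The categories explain the values perfectly here, so eta^2 equals 1.
print(correlation_ratio(["a", "a", "b", "b"], [1.0, 1.0, 3.0, 3.0]))
```

A value of 1 means the numeric variable is constant within each category, and a value of 0 means all category means coincide, so the ratio plays a role for categorical variables similar to a squared correlation.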
Makkhongkaew, Raywat. "Semi-supervised co-selection : instances and features : application to diagnosis of dry port by rail". Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1341.
We are drowning in massive data but starved for knowledge retrieval. It is well known from the dimensionality tradeoff that more data are more informative but come at a price in computational complexity, which has to be made up in some way. When the labeled sample is too small to carry sufficient information about the target concept, supervised learning fails to meet this serious challenge. Unsupervised learning can be an alternative for this problem. However, as these algorithms ignore label information, important hints from labeled data are left out, which generally degrades the performance of unsupervised learning algorithms. Using both labeled and unlabeled data is expected to perform better in semi-supervised learning, which is better suited to large-domain applications where labels are hard and costly to obtain. In addition, when data are large, feature selection and instance selection are two important dual operations for removing irrelevant information. Both tasks, in a semi-supervised setting, pose different challenges for the machine learning and data mining communities with respect to data dimensionality reduction and knowledge retrieval. In this thesis, we focus on co-selection of instances and features in the context of semi-supervised learning. In this context, co-selection becomes a more challenging problem, as the data contain labeled and unlabeled examples sampled from the same population. To perform such semi-supervised co-selection, we propose two unified frameworks, which efficiently integrate the labeled and unlabeled parts into the co-selection process. The first framework is based on weighting constrained clustering and the second one is based on similarity-preserving selection. Both approaches evaluate the usefulness of features and instances in order to select the most relevant ones simultaneously. Finally, we present a variety of empirical studies on high-dimensional data sets that are well known in the literature. The results are promising and demonstrate the efficiency and effectiveness of the proposed approaches. In addition, the developed methods are validated on a real-world application, using data provided by the State Railway of Thailand (SRT). The purpose is to derive application models from our methodological contributions to diagnose the performance of rail dry port systems. First, we present the results of some ensemble methods applied to a first data set, which is fully labeled. Second, we show how our co-selection approaches can improve the performance of learning algorithms on partially labeled data provided by SRT.
Liu, Gang. "Spatiotemporal Sensing and Informatics for Complex Systems Monitoring, Fault Identification and Root Cause Diagnostics". Scholar Commons, 2015. https://scholarcommons.usf.edu/etd/5727.
Ning, Hoi-Kwan Flora. "Model-based regression clustering with variable selection". Thesis, University of Oxford, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.497059.
McClelland, Robyn L. "Regression based variable clustering for data reduction /". Thesis, Connect to this title online; UW restricted, 2000. http://hdl.handle.net/1773/9611.
Channarond, Antoine. "Recherche de structure dans un graphe aléatoire : modèles à espace latent". Thesis, Paris 11, 2013. http://www.theses.fr/2013PA112338/document.
This thesis addresses the clustering of the nodes of a graph in the framework of random models with latent variables. Each node i is assigned an unobserved (latent) variable Zi, and the probability that nodes i and j are connected depends conditionally on Zi and Zj. Unlike in the Erdős-Rényi model, connections are not independent and identically distributed; the latent variables govern the connection distribution of the nodes. These models are thus heterogeneous, and their structure is fully described by the latent variables and their distribution. Hence we aim to infer them from the graph, which is the only observed data. In both original works of this thesis, we propose consistent inference methods with a computational cost that is at most linear in the number of nodes or edges, so that large graphs can be processed in a reasonable time. Both are based on a study of the distribution of the degrees, which are normalized in a way suited to the model. The first work deals with the Stochastic Blockmodel. We show the consistency of an unsupervised classification algorithm using concentration inequalities. From it we deduce a parametric estimation method, a model selection method for the number of latent classes, and a clustering test (testing whether there is one cluster or more), all of which are proved to be consistent. In the second work, the latent variables are positions in the space ℝ^d, with density f. The connection probability depends on the distance between the node positions. The clusters are defined as connected components of some level set of f. The goal is to estimate the number of such clusters from the observed graph only. We estimate the density at the latent positions of the nodes from their degrees, which establishes a link between clusters and connected components of certain subgraphs of the observed graph, obtained by removing low-degree nodes. In particular, we derive an estimator of the number of clusters and show its consistency in some sense.
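The second work's idea of counting connected components after removing low-degree nodes can be sketched as follows. This is only an illustrative sketch, not the estimator from the thesis: the function name `count_clusters_by_degree`, the fixed threshold `min_degree`, and the toy graph are all assumptions introduced here, and the thesis normalizes degrees in a model-specific way that is not reproduced.

```python
from collections import deque

def count_clusters_by_degree(adjacency, min_degree):
    """Count connected components among nodes whose degree is >= min_degree.

    adjacency maps each node to the set of its neighbours (undirected graph).
    Degrees are taken in the full observed graph, before any node is removed.
    """
    kept = {v for v, nbrs in adjacency.items() if len(nbrs) >= min_degree}
    seen = set()
    components = 0
    for start in kept:
        if start in seen:
            continue
        components += 1          # new component found
        seen.add(start)
        queue = deque([start])
        while queue:             # breadth-first search restricted to kept nodes
            v = queue.popleft()
            for w in adjacency[v]:
                if w in kept and w not in seen:
                    seen.add(w)
                    queue.append(w)
    return components

# Hypothetical toy graph: two 4-cliques joined through a low-degree node 8.
graph = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 8},
    4: {5, 6, 7, 8}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
    8: {3, 4},
}

print(count_clusters_by_degree(graph, min_degree=3))
print(count_clusters_by_degree(graph, min_degree=0))
```

With the threshold at 3, the bridging node is dropped and the two dense groups separate; with no threshold, the graph is a single component.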
Ta, Minh Thuy. "Techniques d'optimisation non convexe basée sur la programmation DC et DCA et méthodes évolutives pour la classification non supervisée". Electronic Thesis or Diss., Université de Lorraine, 2014. http://www.theses.fr/2014LORR0099.
This thesis focuses on four problems in data mining and machine learning: clustering data streams, clustering massive data sets, weighted hard and fuzzy clustering, and clustering without prior knowledge of the number of clusters. Our methods are based on deterministic optimization approaches, namely DC (Difference of Convex functions) programming and DCA (Difference of Convex Algorithm), for solving some classes of the clustering problems cited above. Our methods are also based on elitist evolutionary approaches. We adapt the clustering algorithm DCA-MSSC to deal with data streams using two window models: sub-windows and sliding windows. For the problem of clustering massive data sets, we propose a two-phase use of DCA. In the first phase, the massive data is divided into several subsets, on which the DCA-MSSC algorithm performs clustering. In the second phase, we propose a DCA-Weight algorithm that performs a weighted clustering on the centers obtained in the first phase. For weighted clustering, we also propose two approaches: weighted hard clustering and weighted fuzzy clustering. We test our approach on an image segmentation application. The final issue addressed in this thesis is clustering without prior knowledge of the number of clusters. We propose an elitist evolutionary approach in which we apply several evolutionary algorithms (EAs) at the same time to find the optimal combination of initial cluster seeds and, simultaneously, the optimal number of clusters. The various tests performed on several large data sets are very promising and demonstrate the effectiveness of the proposed approaches.
Sanchez, Merchante Luis Francisco. "Learning algorithms for sparse classification". Phd thesis, Université de Technologie de Compiègne, 2013. http://tel.archives-ouvertes.fr/tel-00868847.
Benkaci, Mourad. "Surveillance des systèmes mécatronique d'automobile par des méthodes d'apprentissage". Toulouse 3, 2011. https://tel.archives-ouvertes.fr/tel-00647456.
Monitoring mechatronic systems, especially those built into today's vehicles, is increasingly complicated. The interconnection of these systems for increased vehicle performance and comfort increases the complexity of the information needed for real-time decision-making. This PhD thesis is devoted to the problem of fault detection and isolation (FDI) in automotive systems, using algorithms based on the search for and evaluation of information by mono-criterion approaches. Relevant variables for rapid fault detection are selected automatically using two different approaches: I. The first introduces the notion of conflict between all the measurable variables of the mechatronic system and analyzes these variables using their projections onto hyper-rectangle classification spaces. II. The second uses Kolmogorov complexity as a tool for classifying fault signatures. Estimating the Kolmogorov complexity with lossless compression algorithms makes it possible to define a dictionary of faults and to assign a criticality score with respect to the healthy functioning of the vehicle. The two proposed approaches have been successfully applied to many types of automotive data in the ANR-DIAP project.
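As an illustration of estimating Kolmogorov complexity with a lossless compressor, the sketch below uses the normalized compression distance (NCD) with Python's standard `zlib` module. This is a generic, well-known compression-based technique swapped in for illustration; the abstract does not say which compressor or score the thesis uses. The names `ncd`, `closest_signature`, and `signatures`, and the byte strings themselves, are hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a crude proxy for its complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when x and y share structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)

def closest_signature(signal: bytes, dictionary: dict) -> str:
    """Name of the reference signature with the smallest NCD to the signal."""
    return min(dictionary, key=lambda name: ncd(signal, dictionary[name]))

# Hypothetical reference signatures (made-up bytes, not real vehicle data).
signatures = {
    "healthy": b"0101" * 200,
    "fault_A": bytes((i * i * 31 + 7 * i) % 251 for i in range(800)),
}

measured = b"0101" * 150
print(closest_signature(measured, signatures))
```

Under these assumptions, the distance of a measured signal to the "healthy" signature could also serve as a rough criticality score, though the thesis may define that score differently.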
Moraes, Renan Manhabosco. "Aplicações de técnicas multivariadas na área comercial de uma empresa de comunicação". reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2017. http://hdl.handle.net/10183/173130.
The change in consumer behavior brought about by technology and social networks has greatly empowered consumers, substantially altering how companies relate to their final audience. Attentive to this market, media companies are undergoing profound changes, both in how they deliver content to their audience and in their administrative, strategic and financial format. This dissertation presents approaches supported by multivariate techniques for the composition of the commercial teams and the remuneration of the sales group of a communication company. In article 1, the objective is to build a model to estimate the commercial awards of the sales teams of the RBS Group radio stations. To do this, we first group the RBS Group radio stations in the states of Rio Grande do Sul and Santa Catarina according to the similarity of their profiles. For each resulting cluster, a multiple linear regression of the commercial award is fitted and validated by cross-validation using the adjusted R² and the Mean Absolute Percentage Error (MAPE). The second article addresses the clustering of the RBS Group's top clients and its impact on the composition of business teams, using a variable selection method. The original 7 variables were evaluated with the "omit one variable at a time" selection method; the best average Silhouette Index (SI), the metric used to evaluate the quality of the generated clusters, was obtained when 3 variables were retained. The clusters generated from these variables reflect customers' media-buying behavior and were considered satisfactory when evaluated by RBS Group experts.
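The "omit one variable at a time" selection guided by the Silhouette Index can be sketched as a backward elimination loop. This is a minimal sketch under stated assumptions, not the dissertation's procedure: the clustering algorithm (a small deterministic k-means), the stopping rule `min_vars`, the function names, and the toy data are all introduced here for illustration.

```python
import math

def kmeans(points, k, iters=50):
    """Tiny k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def silhouette(points, labels):
    """Average silhouette index; singleton clusters contribute 0."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        def mean_dist(c):
            ds = [math.dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            return sum(ds) / len(ds) if ds else None
        a = mean_dist(labels[i])
        others = [mean_dist(c) for c in clusters if c != labels[i]]
        if a is None or not others:
            continue
        b = min(others)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def score_subset(points, variables, k):
    """Silhouette of the k-means clustering computed on the chosen variables only."""
    sub = [tuple(p[v] for v in variables) for p in points]
    return silhouette(sub, kmeans(sub, k))

def omit_one_at_a_time(points, k, min_vars=2):
    """Repeatedly drop the variable whose omission gives the best silhouette."""
    current = tuple(range(len(points[0])))
    best_vars, best_score = current, score_subset(points, current, k)
    while len(current) > min_vars:
        candidates = [tuple(v for v in current if v != drop) for drop in current]
        current = max(candidates, key=lambda vs: score_subset(points, vs, k))
        score = score_subset(points, current, k)
        if score > best_score:
            best_vars, best_score = current, score
    return best_vars, best_score

# Hypothetical data: variables 0 and 1 separate two groups, variable 2 is noise.
data = [(0, 0, 0), (0, 1, 8), (1, 0, 3), (1, 1, 9),
        (10, 10, 1), (10, 11, 7), (11, 10, 4), (11, 11, 6)]
selected, si = omit_one_at_a_time(data, k=2)
print(selected, round(si, 3))
```

On this toy data the noise variable is the one dropped, since removing it tightens both clusters and raises the average silhouette.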
Devijver, Emilie. "Modèles de mélange pour la régression en grande dimension, application aux données fonctionnelles". Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112130/document.
Finite mixture regression models are useful for modeling the relationship between a response and predictors arising from different subpopulations. In this thesis, we focus on high-dimensional predictors and a high-dimensional response. First of all, we provide an ℓ1-oracle inequality satisfied by the Lasso estimator. We focus on this estimator for its ℓ1-regularization properties rather than for the variable selection procedure. We also propose two procedures to deal with this issue. The first procedure estimates the unknown conditional mixture density by a maximum likelihood estimator restricted to the relevant variables selected by an ℓ1-penalized maximum likelihood estimator. The second procedure jointly considers predictor selection and rank reduction to obtain lower-dimensional approximations of the parameter matrices. For each procedure, we obtain an oracle inequality, which yields the shape of the penalty in the criterion, depending on the complexity of the random model collection. We extend these procedures to the functional case, where predictors and responses are functions. For this purpose, we use a wavelet-based approach. For each situation, we provide algorithms and apply and evaluate our methods on both simulated and real datasets. In particular, we illustrate the first procedure on an electricity load consumption dataset.
Cozzini, Alberto Maria. "Supervised and unsupervised model-based clustering with variable selection". Thesis, Imperial College London, 2012. http://hdl.handle.net/10044/1/9973.
Benkaci, Mourad. "Surveillance des systèmes automatiques et systèmes embraqués". Phd thesis, Université Paul Sabatier - Toulouse III, 2011. http://tel.archives-ouvertes.fr/tel-00647456.
Kim, Sinae. "Bayesian variable selection in clustering via dirichlet process mixture models". Texas A&M University, 2003. http://hdl.handle.net/1969.1/5888.
Al-Guwaizani, Abdulrahman. "Variable neighbourhood search based heuristic for K-harmonic means clustering". Thesis, Brunel University, 2011. http://bura.brunel.ac.uk/handle/2438/5827.
Lynch, Sarah K. "A scale-independent clustering method with automatic variable selection based on trees". Thesis, Monterey, California: Naval Postgraduate School, 2014. http://hdl.handle.net/10945/41412.
Clustering is the process of putting observations into groups based on their distance, or dissimilarity, from one another. Measuring distance for continuous variables often requires scaling or monotonic transformation. Determining dissimilarity when observations have both continuous and categorical measurements can be difficult because each type of measurement must be approached differently. We introduce a new clustering method that uses one of three new distance metrics. In a dataset with p variables, we create p trees, one with each variable as the response. Distance is measured by determining on which leaf an observation falls in each tree. Two observations are similar if they tend to fall on the same leaf and dissimilar if they are usually on different leaves. The distance metrics are not affected by scaling or transformations of the variables and easily determine distances in datasets with both continuous and categorical variables. This method is tested on several well-known datasets, both with and without added noise variables, and performs very well in the presence of noise due in part to automatic variable selection. The new distance metrics outperform several existing clustering methods in a large number of scenarios.
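The leaf-based notion of distance can be illustrated with a short sketch. The abstract does not define the three metrics, so the version below is only the simplest form consistent with the description (the fraction of trees in which two observations land on different leaves) and should not be read as any of the thesis's actual metrics. The table `leaf_table` is hypothetical; in practice the leaf indices would come from fitting the p trees.

```python
def leaf_distance(leaves_a, leaves_b):
    """Fraction of trees in which two observations fall on different leaves."""
    if len(leaves_a) != len(leaves_b):
        raise ValueError("both observations need one leaf index per tree")
    differing = sum(1 for a, b in zip(leaves_a, leaves_b) if a != b)
    return differing / len(leaves_a)

def distance_matrix(leaf_table):
    """Pairwise leaf distances; leaf_table[i][t] is the leaf of observation i in tree t."""
    n = len(leaf_table)
    return [[leaf_distance(leaf_table[i], leaf_table[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical leaf assignments for 4 observations in p = 3 trees.
leaf_table = [
    [0, 2, 1],  # observation 0
    [0, 2, 1],  # observation 1: same leaf as observation 0 in every tree
    [0, 3, 4],  # observation 2: shares a leaf with observation 0 in one tree
    [5, 3, 4],  # observation 3: never shares a leaf with observation 0
]

D = distance_matrix(leaf_table)
print(D[0][1], D[0][3])
```

Because the distance depends only on leaf membership, rescaling or monotonically transforming a predictor leaves it unchanged as long as the fitted trees produce the same partitions.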
Palla, Konstantina. "Probabilistic nonparametric models for relational data, variable clustering and reversible Markov chains". Thesis, University of Cambridge, 2015. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.709019.
Giovinazzi, Francesco <1988>. "Solution Path Clustering for Fixed-Effects Models in a Latent Variable Context". Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amsdottorato.unibo.it/8740/1/giovinazzi_phdthesis.pdf.
Cappozzo, Andrea. "Robust model-based classification and clustering: advances in learning from contaminated datasets". Doctoral thesis, Università degli Studi di Milano-Bicocca, 2020. http://hdl.handle.net/10281/262919.
At the time of writing, an ever-increasing amount of data is collected every day, with its volume estimated to be doubling every two years. Thanks to technological advancements, datasets are becoming massive in size and substantially more complex in nature. Nevertheless, this abundance of "raw information" comes at a price: wrong measurements, data-entry errors, breakdowns of automatic collection systems and several other causes may ultimately undermine the overall data quality. In this respect, robust methods play a central role in properly converting contaminated "raw information" into trustworthy knowledge, a primary goal of any statistical analysis. The present manuscript presents novel methodologies for performing reliable inference, within the model-based classification and clustering framework, in the presence of contaminated data. First, we propose a robust modification to a family of semi-supervised patterned models for accomplishing classification when dealing with both class and attribute noise. Second, we develop a discriminant analysis method for anomaly and novelty detection, with the final aim of discovering label noise, outliers and unobserved classes in an unlabelled dataset. Third, we introduce two robust variable selection methods that effectively perform high-dimensional discrimination within an adulterated scenario.
Abonyi, J., FD Tamás, S. Potgieter and H. Potgieter. "Analysis of Trace Elements in South African Clinkers using Latent Variable Model and Clustering". South African Journal of Chemistry, 2003. http://encore.tut.ac.za/iii/cpro/DigitalItemViewPage.external?sp=1000893.
Lazic, Jasmina. "New variants of variable neighbourhood search for 0-1 mixed integer programming and clustering". Thesis, Brunel University, 2010. http://bura.brunel.ac.uk/handle/2438/4602.
Abonyia, J., FD Tamas and S. Potgieter. "Analysis of trace elements in South African clinkers using latent variable model and clustering". South African Journal of Chemistry, 2003. http://encore.tut.ac.za/iii/cpro/DigitalItemViewPage.external?sp=1001952.
Rastelli, Riccardo, and Nial Friel. "Optimal Bayesian estimators for latent variable cluster models". Springer Nature, 2018. http://dx.doi.org/10.1007/s11222-017-9786-y.
Mello, Paula Lunardi de. "Sistemáticas de agrupamento de países com base em indicadores de desempenho". reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2017. http://hdl.handle.net/10183/158359.
The world economy underwent transformations in the last century: periods of sustained growth followed by periods of stagnation, governments alternating between market liberalization strategies and policies of commercial protectionism, and instability in markets, among others. As an aid to understanding economic and social problems in a systemic way, the analysis of performance indicators generates relevant information about patterns, behavior and trends, and guides policies and strategies to improve economic and social results. Indicators describing the main economic dimensions of a country can be used as guiding principles in designing and monitoring the development and growth policies of these countries. This dissertation uses World Bank data to develop a method for grouping countries with similar characteristics in terms of the indicators that describe them. To do so, it integrates clustering techniques (hierarchical and non-hierarchical), variable selection (through the "leave one variable out at a time" technique) and dimensionality reduction (applying Principal Component Analysis). The quality of the generated clusters is evaluated by the Silhouette, Calinski-Harabasz and Davies-Bouldin indexes. The results were satisfactory regarding both the representativeness of the highlighted indicators and the quality of the clustering obtained.
Ren, Sheng. "New Methods of Variable Selection and Inference on High Dimensional Data". University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1511883302569683.
Huynh, Bao Tuyen. "Estimation and feature selection in high-dimensional mixtures-of-experts models". Thesis, Normandie, 2019. http://www.theses.fr/2019NORMC237.
This thesis deals with the problem of modeling and estimating high-dimensional mixture-of-experts (MoE) models, with a view to effective density estimation, prediction and clustering of heterogeneous, high-dimensional data. We propose new strategies based on regularized maximum-likelihood estimation (MLE) of MoE models to overcome the limitations of standard methods, including MLE with Expectation-Maximization (EM) algorithms, and to simultaneously perform feature selection so that sparse models are encouraged in such a high-dimensional setting. We first introduce a parameter estimation and variable selection methodology for mixtures of experts, based on l1 (lasso) regularization and the EM framework, for regression and clustering suited to high-dimensional contexts. Then, we extend the method to regularized mixture-of-experts models for discrete data, including classification. We develop efficient algorithms to maximize the proposed l1-penalized observed-data log-likelihood function. Our proposed strategies enjoy efficient monotone maximization of the optimized criterion and, unlike previous approaches, do not rely on approximations of the penalty functions, avoid matrix inversion, and exploit the efficiency of the coordinate ascent algorithm, particularly within the proximal Newton-based approach.
Šulc, Zdeněk. "Similarity Measures for Nominal Data in Hierarchical Clustering". Doctoral thesis, Vysoká škola ekonomická v Praze, 2013. http://www.nusl.cz/ntk/nusl-261939.
Hilton, Ross P. "Model-based data mining methods for identifying patterns in biomedical and health data". Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54387.
Demigha, Oualid. "Energy Conservation for Collaborative Applications in Wireless Sensor Networks". Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0058/document.
Wireless sensor networks are an emerging technology enabled by recent advances in Micro-Electro-Mechanical Systems, which have led to the design of tiny wireless sensor nodes with limited sensing, data processing and communication capacities. To accomplish complex tasks such as target tracking, data collection and zone surveillance, these nodes need to collaborate with one another to overcome their limited battery capacity. Since battery hardware is developing very slowly, the optimization effort must focus on the software layers of the nodes' protocol stack and on their operating systems. In this thesis, we investigated the energy problem in the context of collaborative applications and proposed an approach based on node selection using predictions and data correlations, to meet the application requirements in terms of energy efficiency and data quality. First, we surveyed almost all the recent approaches proposed in the literature that address the energy efficiency of prediction-based target tracking schemes, in order to extract the relevant recommendations. Next, we proposed a dynamic clustering protocol based on an enhanced version of the Distributed Kalman Filter, used as a prediction algorithm, to design an energy-efficient target tracking scheme. Our scheme uses these predictions to anticipate the actions of the nodes and their roles, so as to minimize the number of nodes involved in the tasks. Based on our findings from the simulation data, we generalized our approach to any data collection scheme that uses a geographic clustering algorithm. We formulated the problem of energy minimization under data precision constraints as a binary integer linear program in order to find its exact solution in the general context. We validated the model and proved some of its fundamental properties. Finally, given the complexity of the problem, we proposed and evaluated a heuristic solution consisting of a correlation-based adaptive clustering algorithm for data collection. We showed that, by relaxing some constraints of the problem, our heuristic solution achieves an acceptable level of energy efficiency while preserving the quality of data.
Jin, Zhongnan. "Statistical Methods for Multivariate Functional Data Clustering, Recurrent Event Prediction, and Accelerated Degradation Data Analysis". Diss., Virginia Tech, 2019. http://hdl.handle.net/10919/102628.
Doctor of Philosophy
Ghebre, Michael Abrha. "A statistical framework for modeling asthma and COPD biological heterogeneity, and a novel variable selection method for model-based clustering". Thesis, University of Leicester, 2016. http://hdl.handle.net/2381/38488.
Harz, Jonas. "Variablen-Verdichtung und Clustern von Big Data – Wie lassen sich die Free-Floating-Carsharing-Nutzer typisieren?" Master's thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-210015.