Theses: « Variables clustering »

1

Chang, Soong Uk. « Clustering with mixed variables ». [St. Lucia, Qld.], 2005. http://www.library.uq.edu.au/pdfserve.php?image=thesisabs/absthe19086.pdf.

2

Endrizzi, Isabella <1975>. « Clustering of variables around latent components : an application in consumer science ». Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2008. http://amsdottorato.unibo.it/667/1/Tesi_Endrizzi_Isabella.pdf.

Abstract:

The present work proposes a method based on CLV (Clustering around Latent Variables) for identifying groups of consumers in L-shape data. This kind of data structure is very common in consumer studies, where a panel of consumers is asked to rate their overall liking of a number of products and the preference scores are then arranged in a two-way table Y. External information on both products (physico-chemical description or sensory attributes) and consumers (socio-demographic background, purchase behaviours or consumption habits) may be available in a row descriptor matrix X and a column descriptor matrix Z, respectively. The aim of the method is to automatically provide a consumer segmentation in which all three matrices play an active role in the classification, yielding groups that are homogeneous from all points of view: preferences, products and consumer characteristics. The proposed clustering method is illustrated on data from preference studies on food products: juices based on berry fruits and traditional cheeses from Trentino. The hedonic ratings given by the consumer panel to the products under study were explained in terms of the products' chemical compounds and sensory evaluation and the consumers' socio-demographic information, purchase behaviour and consumption habits.

3

Endrizzi, Isabella <1975>. « Clustering of variables around latent components : an application in consumer science ». Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2008. http://amsdottorato.unibo.it/667/.

4

Saraiya, Devang. « The Impact of Environmental Variables in Efficiency Analysis : A fuzzy clustering-DEA Approach ». Thesis, Virginia Tech, 2005. http://hdl.handle.net/10919/34637.

Abstract:

Data Envelopment Analysis (DEA; Charnes et al., 1978) is a technique used to evaluate the relative efficiency of a process or an organization. The efficiency evaluation is relative, meaning that each unit is compared with other processes or organizations. In real-life situations, different processes or units seldom operate in similar environments. Within a relative efficiency context, if units operating in different environments are compared, the units that operate in less desirable environments are at a disadvantage. To ensure that the comparison is fair within the DEA framework, a two-stage framework is presented in this thesis. In the first stage, fuzzy clustering is used to group units with similar environments. In the second stage, a relative efficiency analysis is performed within these groups. Approaching the problem in this manner removes the influence of environmental variables from the efficiency analysis. The thesis also introduces the concept of an environmental dependency index (EDI). The EDI reflects the extent to which the efficiency behavior of units is due to their operating environment. The EDI also helps the decision maker choose appropriate peers to guide the changes that inefficient units need to make. A more rigorous series of steps for obtaining the clustering solution is presented in a separate chapter (chapter 5).
Master of Science
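
The two-stage idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the six units and their numbers are hypothetical, the environment is reduced to a single score, and a one-input, one-output ratio (which coincides with constant-returns DEA efficiency in that special case) stands in for a full DEA model, which would require a linear-programming solver.

```python
import math

def memberships(points, centers, m=2.0):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    result = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # A point lying exactly on a center belongs fully to it.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            result.append([h / sum(hits) for h in hits])
        else:
            weights = [d ** (-2.0 / (m - 1.0)) for d in dists]
            result.append([w / sum(weights) for w in weights])
    return result

def fuzzy_c_means(points, centers, m=2.0, iters=100, tol=1e-9):
    """Alternate membership and center updates until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    dim = len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        new_centers = []
        for k in range(len(centers)):
            w = [row[k] ** m for row in u]
            new_centers.append(tuple(
                sum(wi * p[j] for wi, p in zip(w, points)) / sum(w)
                for j in range(dim)))
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)

def ratio_efficiency(units):
    """One input, one output: output/input ratio divided by the best ratio."""
    best = max(out / inp for _, inp, out in units)
    return {name: (out / inp) / best for name, inp, out in units}

# Hypothetical units: (name, environment score, input, output).
data = [("A", 1.0, 10.0, 20.0), ("B", 1.2, 10.0, 15.0), ("C", 0.9, 8.0, 12.0),
        ("D", 5.0, 10.0, 8.0), ("E", 5.3, 12.0, 6.0), ("F", 4.8, 9.0, 9.0)]

# Stage 1: fuzzy clustering on the environment variable only.
env = [(e,) for _, e, _, _ in data]
centers, u = fuzzy_c_means(env, centers=[(0.0,), (6.0,)])
labels = [row.index(max(row)) for row in u]

# Stage 2: efficiency relative to peers in the same environment group.
grouped = {}
for k in set(labels):
    members = [(n, i, o) for (n, _, i, o), lab in zip(data, labels) if lab == k]
    grouped.update(ratio_efficiency(members))

# For comparison: efficiency when all units are pooled together.
pooled = ratio_efficiency([(n, i, o) for n, _, i, o in data])
```

In this toy data, unit D scores higher when compared only with units in its own environment group than when pooled with all units, which is the kind of disadvantage the abstract describes.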

5

Dean, Nema. « Variable selection and other extensions of the mixture model clustering framework ». Thesis, 2006. http://hdl.handle.net/1773/8943.

6

Doan, Nath-Quang. « Modèles hiérarchiques et topologiques pour le clustering et la visualisation des données ». Paris 13, 2013. http://scbd-sto.univ-paris13.fr/secure/edgalilee_th_2013_doan.pdf.

Abstract:

Cette thèse se concentre sur les approches hiérarchiques et topologiques pour le clustering et la visualisation de données. Le problème du clustering devient de plus en plus compliqué en raison de présence de données structurées sous forme de graphes, arbres ou données séquentielles. Nous nous sommes particulièrement intéressés aux cartes auto-organisatrices et au modèle hiérarchique AntTree qui modélise la capacité des fourmis réelles. En combinant ces approches, l’objectif est de présenter les données dans une structure hiérarchique et topologique. Dans ce rapport, nous présentons trois modèles, dans le premier modèle nous montrons l’intérêt d’utiliser les structures hiérarchiques et topologiques sur des ensembles de données structurés sous forme de graphes. Le second modèle est une version incrémentale qui n’impose pas de règles sur la préservation de la topologie. Le troisième modèle aborde notamment la problématique de la sélection de variable en utilisant la structure hiérarchique, nous proposons un nouveau score pour sélectionner les variables pertinentes en contraignant le score Laplacien. Enfin, cette thèse propose plusieurs perspectives pour des travaux futurs
This thesis focuses on clustering approaches inspired by topological models and on an autonomous hierarchical clustering method. The clustering problem becomes more complicated and difficult as the quality and quantity of structured data such as graphs, trees or sequences grow. In this thesis, we are particularly interested in self-organizing maps, which have generally been used for learning topological preservation, clustering, vector quantization and graph visualization. Our study also concerns a hierarchical clustering method, AntTree, which models the ability of real ants to build structures by connecting to one another. By combining the topological map with the self-assembly rules inspired by AntTree, the goal is to represent data in a hierarchical and topological structure that gives more insight into the data. The advantage is that the clustering results can be visualized as multiple hierarchical trees and a topological network. In this report, we present three new models that address clustering, visualization and feature selection problems. In the first model, our study shows the value of using hierarchical and topological structure through several applications on numerical datasets as well as structured datasets, e.g. graphs and biological datasets. The second model consists of a flexible, growing structure that does not impose strict network-topology preservation rules. Using statistical characteristics provided by the hierarchical trees, it significantly accelerates the learning process. The third model specifically addresses unsupervised feature selection. The idea is to use the hierarchical structure provided by AntTree to automatically discover local data structure and local neighbors. Using the tree topology, we propose a new score for feature selection by constraining the Laplacian score. Finally, this thesis offers several perspectives for future work.
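
As background for the third model, the sketch below computes the standard, unconstrained Laplacian score for each feature; it does not reproduce the tree-constrained variant proposed in the thesis. The dense heat-kernel graph, the bandwidth `t` and the toy data are assumptions made for illustration. A lower score means the feature varies little between similar samples.

```python
import math

def laplacian_scores(rows, t=1.0):
    """Standard Laplacian score of each column of `rows` (lower is better)."""
    n, d = len(rows), len(rows[0])
    # Heat-kernel similarity between every pair of samples (dense graph,
    # assumed here instead of a k-nearest-neighbour graph).
    S = [[0.0 if i == j else
          math.exp(-sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])) / t)
          for j in range(n)] for i in range(n)]
    degree = [sum(S[i]) for i in range(n)]
    scores = []
    for r in range(d):
        f = [row[r] for row in rows]
        # Remove the degree-weighted mean of the feature.
        mean = sum(fi * di for fi, di in zip(f, degree)) / sum(degree)
        f = [fi - mean for fi in f]
        # Numerator f^T L f, written as 1/2 * sum_ij S_ij (f_i - f_j)^2.
        num = 0.5 * sum(S[i][j] * (f[i] - f[j]) ** 2
                        for i in range(n) for j in range(n))
        # Denominator f^T D f, the degree-weighted variance of the feature.
        den = sum(di * fi * fi for fi, di in zip(f, degree))
        scores.append(num / den)
    return scores

# Hypothetical data: column 0 separates two groups, column 1 is noise.
rows = [(0.0, 0.5), (0.1, -0.5), (0.2, 0.4),
        (5.0, -0.4), (5.1, 0.5), (5.2, -0.5)]
scores = laplacian_scores(rows)
ranking = sorted(range(len(scores)), key=scores.__getitem__)
```

Here `ranking` lists the feature indices from most to least locality-preserving, so the group-separating column comes first.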

7

Ndaoud, Mohamed. « Contributions to variable selection, clustering and statistical estimation in high dimension ». Electronic Thesis or Diss., Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLG005.

Abstract:

Cette thèse traite les problèmes statistiques suivants : la sélection de variables dans le modèle de régression linéaire en grande dimension, le clustering dans le modèle de mélange Gaussien, quelques effets de l'adaptabilité sous l'hypothèse de parcimonie ainsi que la simulation des processus Gaussiens. Sous l'hypothèse de parcimonie, la sélection de variables correspond au recouvrement du "petit" ensemble de variables significatives. Nous étudions les propriétés non-asymptotiques de ce problème dans la régression linéaire en grande dimension. De plus, nous caractérisons les conditions optimales nécessaires et suffisantes pour la sélection de variables dans ce modèle. Nous étudions également certains effets de l'adaptation sous la même hypothèse. Dans le modèle à vecteur parcimonieux, nous analysons les changements dans les taux d'estimation de certains des paramètres du modèle lorsque le niveau de bruit ou sa loi nominale sont inconnus. Le clustering est une tâche d'apprentissage statistique non supervisée visant à regrouper des observations proches les unes des autres dans un certain sens. Nous étudions le problème de la détection de communautés dans le modèle de mélange Gaussien à deux composantes, et caractérisons précisément la séparation optimale entre les groupes afin de les recouvrir de façon exacte. Nous fournissons également une procédure en temps polynomial permettant un recouvrement optimal des communautés. Les processus Gaussiens sont extrêmement utiles dans la pratique, par exemple lorsqu'il s'agit de modéliser les fluctuations de prix. Néanmoins, leur simulation n'est pas facile en général. Nous proposons et étudions un nouveau développement en série à taux optimal pour simuler une grande classe de processus Gaussiens.
This PhD thesis deals with the following statistical problems: variable selection in high-dimensional linear regression, clustering in the Gaussian mixture model, some effects of adaptivity under sparsity, and simulation of Gaussian processes. Under the sparsity assumption, variable selection corresponds to recovering the "small" set of significant variables. We study non-asymptotic properties of this problem in high-dimensional linear regression. Moreover, we derive optimal necessary and sufficient conditions for variable selection in this model. We also study some effects of adaptation under sparsity. Namely, in the sparse vector model, we investigate the changes in the estimation rates of some of the model parameters when the noise level or its nominal law is unknown. Clustering is an unsupervised machine learning task that aims to group observations that are close to each other in some sense. We study the problem of community detection in the Gaussian mixture model with two components, and characterize precisely the sharp separation between clusters required to recover them exactly. We also provide a fast polynomial-time procedure achieving optimal recovery. Gaussian processes are extremely useful in practice, for instance for modeling price fluctuations. Nevertheless, their simulation is not easy in general. We propose and study a new rate-optimal series expansion to simulate a large class of Gaussian processes.

8

Naik, Vaibhav C. « Fuzzy C-means clustering approach to design a warehouse layout ». [Tampa, Fla.] : University of South Florida, 2004. http://purl.fcla.edu/fcla/etd/SFE0000437.

9

Ndaoud, Mohamed. « Contributions to variable selection, clustering and statistical estimation in high dimension ». Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLG005/document.

10

Giacofci, Joyce. « Classification non supervisée et sélection de variables dans les modèles mixtes fonctionnels. Applications à la biologie moléculaire ». Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENM025/document.

Abstract:

Un nombre croissant de domaines scientifiques collectent de grandes quantités de données comportant beaucoup de mesures répétées pour chaque individu. Ce type de données peut être vu comme une extension des données longitudinales en grande dimension. Le cadre naturel pour modéliser ce type de données est alors celui des modèles mixtes fonctionnels. Nous traitons, dans une première partie, de la classification non-supervisée dans les modèles mixtes fonctionnels. Nous présentons dans ce cadre une nouvelle procédure utilisant une décomposition en ondelettes des effets fixes et des effets aléatoires. Notre approche se décompose en deux étapes : une étape de réduction de dimension basée sur les techniques de seuillage des ondelettes et une étape de classification où l'algorithme EM est utilisé pour l'estimation des paramètres par maximum de vraisemblance. Nous présentons des résultats de simulations et nous illustrons notre méthode sur des jeux de données issus de la biologie moléculaire (données omiques). Cette procédure est implémentée dans le package R "curvclust" disponible sur le site du CRAN. Dans une deuxième partie, nous nous intéressons aux questions d'estimation et de réduction de dimension au sein des modèles mixtes fonctionnels et nous développons en ce sens deux approches. La première approche se place dans un objectif d'estimation dans un contexte non-paramétrique et nous montrons dans ce cadre, que l'estimateur de l'effet fixe fonctionnel basé sur les techniques de seuillage par ondelettes possède de bonnes propriétés de convergence. Notre deuxième approche s'intéresse à la problématique de sélection des effets fixes et aléatoires et nous proposons une procédure basée sur les techniques de sélection de variables par maximum de vraisemblance pénalisée et utilisant deux pénalités SCAD sur les effets fixes et les variances des effets aléatoires. 
Nous montrons dans ce cadre que le critère considéré conduit à des estimateurs possédant des propriétés oraculaires dans un cadre où le nombre d'individus et la taille des signaux divergent. Une étude de simulation visant à appréhender les comportements des deux approches développées est réalisée dans ce contexte
More and more scientific studies lead to the collection of large amounts of data consisting of sets of curves recorded on individuals. These data can be seen as an extension of longitudinal data to high dimension and are often modeled as functional data in a mixed-effects framework. In the first part, we focus on unsupervised clustering of these curves in the presence of inter-individual variability. To this end, we develop a new procedure based on a wavelet representation of the model, for both fixed and random effects. Our approach follows two steps: a dimension reduction step, based on wavelet thresholding techniques, is performed first. A clustering step is then applied to the selected coefficients. An EM algorithm is used for maximum likelihood estimation of the parameters. The properties of the overall procedure are validated by an extensive simulation study. We also illustrate our method on high-throughput molecular data (omics data) such as microarray CGH or mass spectrometry data. Our procedure is available through the R package "curvclust", available on the CRAN website. In the second part, we concentrate on estimation and dimension reduction issues in the mixed-effects functional framework. Two distinct approaches are developed to address these issues. The first approach deals with parameter estimation in a nonparametric setting. We demonstrate that the functional fixed-effects estimator based on wavelet thresholding techniques achieves the expected rate of convergence toward the true function. The second approach is dedicated to the selection of both fixed and random effects. We propose a method based on a penalized likelihood criterion with SCAD penalties for the estimation and selection of both fixed effects and random-effects variances. In the context of variable selection, we prove that the penalized estimators enjoy the oracle property when the signal size diverges with the sample size. A simulation study is carried out to assess the behaviour of the two proposed approaches.
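
For readers unfamiliar with the SCAD penalty named in this abstract, here is a minimal sketch of the penalty function in its usual form (Fan and Li), with the commonly suggested default a = 3.7. The function name and the example values are illustrative assumptions; the thesis's penalized-likelihood procedure itself is not reproduced here.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of a single coefficient `theta` with tuning parameter `lam`."""
    t = abs(theta)
    if t <= lam:
        # Small coefficients: same linear penalty as the lasso.
        return lam * t
    if t <= a * lam:
        # Intermediate zone: quadratic piece that flattens the slope.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return lam * lam * (a + 1) / 2

# The lasso penalty keeps growing; SCAD levels off at lam^2 * (a + 1) / 2.
lam = 1.0
for theta in (0.5, 2.0, 10.0):
    print(theta, lam * abs(theta), scad_penalty(theta, lam))
```

The three pieces meet continuously at |theta| = lam and |theta| = a * lam, and the flat tail is what lets large effects be estimated with little bias while small ones are still set to zero.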

11

Michel, Pierre. « Sélection d'items en classification non supervisée et questionnaires informatisés adaptatifs : applications à des données de qualité de vie liée à la santé ». Thesis, Aix-Marseille, 2016. http://www.theses.fr/2016AIXM4097/document.

Abstract:

Un questionnaire adaptatif fournit une mesure valide de la qualité de vie des patients et réduit le nombre d'items à remplir. Cette approche est dépendante des modèles utilisés, basés sur des hypothèses parfois non vérifiables. Nous proposons une approche alternative basée sur les arbres de décision. Cette approche n'est basée sur aucune hypothèse et requiert moins de temps de calcul pour l'administration des items. Nous présentons différentes simulations qui démontrent la pertinence de notre approche. Nous présentons une méthode de classification non supervisée appelée CUBT. CUBT comprend trois étapes pour obtenir une partition optimale d'un jeu de données. La première étape construit un arbre en divisant récursivement le jeu de données. La deuxième étape regroupe les paires de noeuds terminaux de l'arbre. La troisième étape agrège des nœuds terminaux qui ne sont pas issus de la même division. Différentes simulations sont présentés pour comparer CUBT avec d'autres approches. Nous définissons également des heuristiques concernant le choix des paramètres de CUBT. CUBT identifie les variables qui sont actives dans la construction de l'arbre. Cependant, bien que certaines variables peuvent être sans importance, elles peuvent être compétitives pour les variables actives. Il est essentiel de classer les variables en fonction d'un score d'importance pour déterminer leur pertinence dans un modèle donné. Nous présentons une méthode pour mesurer l'importance des variables basée sur CUBT et les divisions binaires compétitives pour définir un score d'importance des variables. Nous analysons l'efficacité et la stabilité de ce nouvel indice, en le comparant à d'autres méthodes
An adaptive test provides a valid measure of patients' quality of life and reduces the number of items to be completed. This approach depends on the models used, which are sometimes based on unverifiable assumptions. We propose an alternative approach based on decision trees. This approach is not based on any assumptions and requires less computation time for item administration. We present different simulations that demonstrate the relevance of our approach. We present an unsupervised classification method called CUBT. CUBT includes three steps to obtain an optimal partition of a data set. The first step grows a tree by recursively dividing the data set. The second step groups together pairs of terminal nodes of the tree. The third step aggregates terminal nodes that do not come from the same split. Different simulations are presented to compare CUBT with other approaches. We also define heuristics for the choice of the CUBT parameters. CUBT identifies the variables that are active in the construction of the tree. However, although some variables may be irrelevant, they may compete with the active variables. It is essential to rank the variables according to an importance score to determine their relevance in a given model. We present a method for measuring variable importance, based on CUBT and competitive binary splits, that defines a variable importance score. We analyze the efficiency and stability of this new index, comparing it with other methods.

12

Youssfi, Younès. « Exploring Risk Factors and Prediction Models for Sudden Cardiac Death with Machine Learning ». Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAG006.

Abstract:

La mort subite de l'adulte est définie comme une mort inattendue sans cause extracardiaque évidente, survenant avec un effondrement rapide en présence d'un témoin, ou en l'absence de témoin dans l'heure après le début des symptômes. Son incidence est estimée à 350,000 personnes par an en Europe et 300,000 personnes aux Etats-Unis, ce qui représente 10 à 20% des décès dans les pays industrialisés. Malgré les progrès réalisés dans la prise en charge, le pronostic demeure extrêmement sombre. Moins de 10% des patients sortent vivants de l'hôpital après la survenue d'une mort subite. Les défibrillateurs automatiques implantables offrent une solution thérapeutique efficace chez les patients identifiés à haut risque de mort subite. Leur identification en population générale demeure donc un enjeu de santé publique majeur, avec des résultats jusqu'à présent décevants. Cette thèse propose des outils statistiques pour répondre à ce problème, et améliorer notre compréhension de la mort subite en population générale. Nous analysons les données du Centre d'Expertise de la Mort Subite et les bases médico-administratives de l'Assurance Maladie, pour développer trois travaux principaux :- La première partie de la thèse vise à identifier de nouveaux sous-groupes de mort subite pour améliorer les modèles actuels de stratification du risque, qui reposent essentiellement sur des variables cardiovasculaires. Nous utilisons des modèles d'analyse du langage naturel et de clustering pour construire une nouvelle représentation pertinente de l'historique médical des patients.- La deuxième partie vise à construire un modèle de prédiction de la mort subite, capable de proposer un score de risque personnalisé et explicable pour chaque patient, et d'identifier avec précision les individus à très haut risque en population générale. 
Nous entraînons pour cela un algorithme de classification supervisée, combiné avec l'algorithme SHapley Additive exPlanations, pour analyser l'ensemble des consommations de soin survenues jusqu'à 5 ans avant l'événement.- La dernière partie de la thèse vise à identifier le niveau optimal d'information à sélectionner dans des bases médico-administratives de grande dimension. Nous proposons un algorithme de sélection de variables bi-niveaux pour des modèles linéaires généralisés, permettant de distinguer les effets de groupe des effets individuels pour chaque variable. Cet algorithme repose sur une approche bayésienne et utilise une méthode de Monte Carlo séquentiel pour estimer la loi a posteriori de sélection des variables
Sudden cardiac death (SCD) is defined as a sudden natural death presumed to be of cardiac cause, heralded by abrupt loss of consciousness in the presence of a witness or, in the absence of a witness, occurring within an hour of the onset of symptoms. Despite progress in clinical profiling and interventions, it remains a major public health problem, accounting for 10 to 20% of deaths in industrialised countries, with survival after SCD below 10%. The annual incidence is estimated at 350,000 in Europe and 300,000 in the United States. Efficient treatments for SCD management are available. One of the most effective options is the implantable cardioverter defibrillator (ICD). However, identifying the best candidates for ICD implantation remains a difficult challenge, with disappointing results so far. This thesis aims to address this problem, and to provide a better understanding of SCD in the general population, using statistical modeling. We analyze data from the Paris Sudden Death Expertise Center and the French National Healthcare System Database to develop three main contributions:
- The first part of the thesis aims to identify new subgroups of SCD to improve current stratification guidelines, which are mainly based on cardiovascular variables. To this end, we use natural language processing methods and clustering analysis to build a meaningful representation of patients' medical history.
- The second part aims to build a prediction model of SCD in order to propose a personalized and explainable risk score for each patient, and to accurately identify very-high-risk subjects in the general population. To this end, we train a supervised classification algorithm, combined with the SHapley Additive exPlanations method, to analyze all medical events that occurred up to 5 years prior to the event.
- The last part of the thesis aims to identify the most relevant information to select from patients' extensive medical histories. We propose a bi-level variable selection algorithm for generalized linear models, in order to identify both individual and group effects from predictors. Our algorithm is based on a Bayesian approach and uses a Sequential Monte Carlo method to estimate the posterior distribution of variable inclusion.

13

Ouali, Abdelkader. « Méthodes hybrides parallèles pour la résolution de problèmes d'optimisation combinatoire : application au clustering sous contraintes ». Thesis, Normandie, 2017. http://www.theses.fr/2017NORMC215/document.

Abstract:

Les problèmes d’optimisation combinatoire sont devenus la cible de nombreuses recherches scientifiques pour leur importance dans la résolution de problèmes académiques et de problèmes réels rencontrés dans le domaine de l’ingénierie et dans l’industrie. La résolution de ces problèmes par des méthodes exactes ne peut être envisagée à cause des délais de traitement souvent exorbitants que nécessiteraient ces méthodes pour atteindre la (les) solution(s) optimale(s). Dans cette thèse, nous nous sommes intéressés au contexte algorithmique de résolution des problèmes combinatoires, et au contexte de modélisation de ces problèmes. Au niveau algorithmique, nous avons appréhendé les méthodes hybrides qui excellent par leur capacité à faire coopérer les méthodes exactes et les méthodes approchées afin de produire rapidement des solutions. Au niveau modélisation, nous avons travaillé sur la spécification et la résolution exacte des problématiques complexes de fouille des ensembles de motifs en étudiant tout particulièrement le passage à l’échelle sur des bases de données de grande taille. D'une part, nous avons proposé une première parallélisation de l'algorithme DGVNS, appelée CPDGVNS, qui explore en parallèle les différents clusters fournis par la décomposition arborescente en partageant la meilleure solution trouvée sur un modèle maître-travailleur. Deux autres stratégies, appelées RADGVNS et RSDGVNS, ont été proposées qui améliorent la fréquence d'échange des solutions intermédiaires entre les différents processus. Les expérimentations effectuées sur des problèmes combinatoires difficiles montrent l'adéquation et l'efficacité de nos méthodes parallèles. D'autre part, nous avons proposé une approche hybride combinant à la fois les techniques de programmation linéaire en nombres entiers (PLNE) et la fouille de motifs. 
Notre approche est complète et tire profit du cadre général de la PLNE (en procurant un haut niveau de flexibilité et d’expressivité) et des heuristiques spécialisées pour l’exploration et l’extraction de données (pour améliorer les temps de calcul). Outre le cadre général de l’extraction des ensembles de motifs, nous avons étudié plus particulièrement deux problèmes : le clustering conceptuel et le problème de tuilage (tiling). Les expérimentations menées ont montré l’apport de notre proposition par rapport aux approches à base de contraintes et aux heuristiques spécialisées
Combinatorial optimization problems have become the target of much scientific research because of their importance in solving academic problems and real problems encountered in engineering and industry. Solving these problems with exact methods is often intractable because of the exorbitant processing time these methods would require to reach the optimal solution(s). In this thesis, we were interested in the algorithmic context of solving combinatorial problems and in the modeling context of these problems. At the algorithmic level, we explored hybrid methods, which excel in their ability to make exact and approximate methods cooperate in order to rapidly produce high-quality solutions. At the modeling level, we worked on the specification and exact resolution of complex problems in pattern set mining, in particular by studying scaling issues on large databases. On the one hand, we proposed a first parallelization of the DGVNS algorithm, called CPDGVNS, which explores in parallel the different clusters of the tree decomposition while sharing the best overall solution in a master-worker model. Two other strategies, called RADGVNS and RSDGVNS, were proposed; they improve the frequency with which intermediate solutions are exchanged between the different processes. Experiments carried out on difficult combinatorial problems show the effectiveness of our parallel methods. On the other hand, we proposed a hybrid approach combining techniques from both Integer Linear Programming (ILP) and pattern mining. Our approach is complete and takes advantage of the general ILP framework (which provides a high level of flexibility and expressiveness) and of specialized heuristics for data mining (to improve computing time). In addition to the general framework for pattern set mining, two problems were studied: conceptual clustering and the tiling problem. The experiments carried out showed the benefit of our proposal compared with constraint-based approaches and specialized heuristics.

14

Ta, Minh Thuy. « Techniques d'optimisation non convexe basée sur la programmation DC et DCA et méthodes évolutives pour la classification non supervisée ». Thesis, Université de Lorraine, 2014. http://www.theses.fr/2014LORR0099/document.

Abstract:

In this thesis we are particularly interested in four problems in machine learning and data mining: clustering of evolving data, clustering of massive data, clustering with variable weighting, and clustering without prior knowledge of the number of clusters, with optimal initialization of the cluster centers. The methods we describe are based on deterministic optimization approaches, namely DC (Difference of Convex functions) programming and DCA (Difference of Convex Algorithms), for solving the clustering problems mentioned above, as well as on elitist evolutionary approaches. We adapt the DCA–MSSC clustering algorithm to process evolving data by windows, handling the evolving data with two models: fixed windows and sliding windows. For the problem of clustering massive data, we use the DCA algorithm in two phases. In the first phase, the massive data are divided into several subsets, on which we apply the DCA–MSSC algorithm to perform clustering. In the second phase, we propose a DCA-Weight algorithm to perform weighted clustering on the set of centers obtained in the first phase. Regarding clustering with variable weighting, we also propose two approaches: hard clustering with variable weighting and fuzzy clustering with variable weighting. We test our approach on an image segmentation problem. The last problem addressed in this thesis is clustering without prior knowledge of the number of clusters. For this we propose an elitist evolutionary approach. The principle is to use several evolutionary algorithms (EAs) at the same time and to make them compete in order to obtain the best combination of initial centers for the clustering and, at the same time, the optimal number of clusters. The various tests carried out on several large data sets are very promising and demonstrate the effectiveness of the proposed approaches.
This thesis focuses on four problems in data mining and machine learning: clustering data streams, clustering massive data sets, weighted hard and fuzzy clustering, and clustering without prior knowledge of the number of clusters. Our methods are based on deterministic optimization approaches, namely DC (Difference of Convex functions) programming and DCA (Difference of Convex Algorithm), for solving the classes of clustering problems cited above. They are also based on elitist evolutionary approaches. We adapt the clustering algorithm DCA–MSSC to deal with data streams using two window models: sub–windows and sliding windows. For the problem of clustering massive data sets, we propose to use the DCA algorithm in two phases. In the first phase, the massive data are divided into several subsets, on which the DCA–MSSC algorithm performs clustering. In the second phase, we propose a DCA–Weight algorithm to perform weighted clustering on the centers obtained in the first phase. For weighted clustering, we also propose two approaches: weighted hard clustering and weighted fuzzy clustering. We test our approach on an image segmentation application. The final issue addressed in this thesis is clustering without prior knowledge of the number of clusters. We propose an elitist evolutionary approach in which we apply several evolutionary algorithms (EAs) at the same time to find the optimal combination of initial cluster seeds and, at the same time, the optimal number of clusters. The various tests performed on several large data sets are very promising and demonstrate the effectiveness of the proposed approaches.
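
The two-phase scheme for massive data described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: plain weighted Lloyd (k-means) iterations stand in for DCA–MSSC and DCA–Weight, which are not reproduced here, and the function names `weighted_kmeans` and `two_phase_clustering` are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def weighted_kmeans(points, weights, k, iters=50):
    """Weighted Lloyd iterations (a stand-in for DCA-MSSC / DCA-Weight)."""
    # Farthest-first initialization keeps the sketch deterministic.
    centers = [tuple(points[0])]
    while len(centers) < k:
        centers.append(tuple(max(
            points, key=lambda p: min(sq_dist(p, c) for c in centers))))
    labels = [0] * len(points)
    dim = len(points[0])
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: weighted mean of the members of each cluster.
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            total = sum(weights[i] for i in members)
            if total > 0:
                centers[c] = tuple(
                    sum(weights[i] * points[i][d] for i in members) / total
                    for d in range(dim))
    return centers, labels

def two_phase_clustering(data, k, n_chunks):
    """Phase 1: cluster each subset. Phase 2: weighted clustering of centers."""
    chunk_size = -(-len(data) // n_chunks)  # ceiling division
    reps, rep_weights = [], []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        centers, labels = weighted_kmeans(chunk, [1.0] * len(chunk), k)
        for c in range(k):
            size = labels.count(c)
            if size:
                reps.append(centers[c])
                rep_weights.append(float(size))  # weight = cluster size
    final_centers, _ = weighted_kmeans(reps, rep_weights, k)
    return final_centers

# Synthetic demo data: two well-separated Gaussian blobs.
rng = random.Random(1)
data = ([(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
        + [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(200)])
rng.shuffle(data)
final_centers = sorted(two_phase_clustering(data, k=2, n_chunks=4))
```

Weighting each first-phase center by the size of its cluster means the second phase sees a compressed but proportionate summary of the full data set, which is the point of the two-phase design.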

15

Jaafar, Amine. « Traitement de la mission et des variables environnementales et intégration au processus de conception systémique ». Thesis, Toulouse, INPT, 2011. http://www.theses.fr/2011INPT0070/document.

Abstract:

This work presents a methodological approach for the "processing of profiles" of a "mission" and, more generally, of "environmental variables" (mission, resource, boundary conditions), an approach that forms the essential upstream phase of a systemic design process. The "classification" and "synthesis" of the profiles of the system's environmental variables are an unavoidable first step that makes it possible to guarantee, to a large extent, the quality of the designed device, provided that they rely on "indicators" that are relevant to the design criteria and constraints. This approach therefore serves as a decision-support tool in a systemic design context. In this thesis we place particular emphasis on the contribution of our approach in the context of design by optimization, which requires a large number of iterations (evaluations of design solutions) and therefore demands profiles that are "compact" in informational terms (time, frequency, ...). In a first phase of the study, we propose an approach for the "classification" and "segmentation" of profiles based on partitioning criteria. This step guides the designer towards the choice of the number of devices to design in order to divide the products created into a range. In a second phase, we propose a process for the "synthesis of a compact profile" that is representative of the data on the environmental variables studied and whose characterization indicators match the reference characteristics of the real data. This reduced-duration signal is obtained by solving an inverse problem with an evolutionary algorithm that aggregates parameterized elementary patterns (sinusoid, segments, cardinal sines). This "compact synthesis" process is then applied to examples of railway mission profiles and to wind resources (wind speed) associated with the design of wind energy chains. Finally, we show that synthesizing a representative and compact profile notably increases the efficiency of the optimization by minimizing the computational cost, thereby facilitating a design-by-optimization approach.
This work presents a methodological approach aimed at analyzing and processing mission profiles and, more generally, environmental variables (e.g. solar or wind energy potential, temperature, boundary conditions) in the context of system design. This process is a key step in ensuring system effectiveness with regard to design constraints and objectives. In this thesis, we pay particular attention to the use of compact profiles for environmental variables in the framework of system-level integrated optimal design, which requires a large number of system simulations. In the first part, we propose a clustering approach based on partition criteria with the aim of analyzing mission profiles. This phase can help designers identify different system configurations in accordance with the corresponding clusters: it may guide suppliers towards a “market segmentation” that fulfills not only economic constraints but also technical design objectives. The second stage of the study proposes a process for synthesizing a compact profile that represents the data of the studied environmental variable. This compact profile is generated by combining the parameters and number of elementary patterns (segment, sine or cardinal sine) with regard to design indicators. The latter are established with respect to the main objectives and constraints associated with the designed system. All pattern parameters are obtained by solving the corresponding inverse problem with evolutionary algorithms. Finally, this synthesis process is applied to two case studies. The first consists of simplifying wind data obtained from measurements at two geographic sites, in Guadeloupe and Tunisia. The second deals with the reduction of a set of railway mission profiles for a hybrid locomotive devoted to shunting and switching missions. These examples show that our approach leads to a large reduction of the profiles associated with environmental variables, which allows a significant decrease in computation time in the context of an integrated optimal design process.

16

Meynet, Caroline. « Sélection de variables pour la classification non supervisée en grande dimension ». Phd thesis, Université Paris Sud - Paris XI, 2012. http://tel.archives-ouvertes.fr/tel-00752613.

Abstract:

There are statistical modeling situations in which the classical problem of unsupervised classification (that is, without prior information on the nature or number of classes to be formed) is coupled with the problem of identifying the variables that are truly relevant for determining the classification. This issue is all the more essential as so-called high-dimensional data, comprising many more variables than observations, have multiplied in recent years: gene expression data, curve classification, and so on. We propose a variable selection procedure for unsupervised classification suited to high-dimensional problems. We consider an approach based on Gaussian mixture models, which allows us to recast the problem of selecting the variables and choosing the number of classes as a global model selection problem. We exploit the variable selection properties of l1 regularization to build efficiently, from the data, a collection of models that remains of reasonable size even in high dimension. We depart from classical variable selection procedures based on l1 regularization with regard to parameter estimation: in each model, instead of considering the Lasso estimator, we compute the maximum likelihood estimator. We then select one of these maximum likelihood estimators using a non-asymptotic penalized criterion based on the slope heuristic introduced by Birgé and Massart. From a theoretical point of view, we establish a model selection theorem for density estimation by maximum likelihood over a random collection of models. We apply it in our context to find a minimal penalty shape for our penalized criterion. From a practical point of view, simulations are carried out to validate our procedure, in particular in the setting of unsupervised classification of curves. The key idea of our procedure is to use l1 regularization only to build a restricted collection of models, and not also to estimate the model parameters. This estimation step is carried out by maximum likelihood. This hybrid procedure was inspired by a theoretical study, conducted in the first part, in which we establish l1 oracle inequalities for the Lasso in the Gaussian regression and Gaussian regression mixture frameworks; these differ from the l0 oracle inequalities traditionally established in that they require no assumptions at all.
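
The hybrid idea in this abstract (l1 regularization only to build the model collection, maximum likelihood to estimate within each model) can be illustrated with a minimal sketch. As an assumption made for brevity, it works in plain Gaussian linear regression rather than Gaussian mixtures, where the maximum likelihood estimate of the coefficients is the least-squares fit; the final selection step by slope heuristic is not shown. All function names are hypothetical.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, iters=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(iters):
        for j in range(p):
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new
    return b

def solve(A, rhs):
    """Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def refit(X, y, support):
    """Unpenalized least-squares (Gaussian MLE) restricted to a support."""
    S = sorted(support)
    A = [[sum(row[a] * row[b] for row in X) for b in S] for a in S]
    rhs = [sum(row[a] * yi for row, yi in zip(X, y)) for a in S]
    return dict(zip(S, solve(A, rhs)))

def model_collection(X, y, lambdas):
    """Lasso supports define the models; each one is then refit without penalty."""
    models = {}
    for lam in lambdas:
        support = frozenset(
            j for j, bj in enumerate(lasso_cd(X, y, lam)) if bj != 0.0)
        if support and support not in models:
            models[support] = refit(X, y, support)
    return models

# Synthetic demo: only variables 0 and 2 are relevant.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(100)]
y = [3 * row[0] - 2 * row[2] + rng.gauss(0, 0.1) for row in X]
models = model_collection(X, y, lambdas=[2.0, 1.0, 0.5, 0.1])
```

The Lasso coefficients themselves are shrunk towards zero by the penalty, whereas the refitted coefficients on each support are not, which is the motivation for separating the two roles.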

17

Boulin, Alexis. « Partitionnement des variables de séries temporelles multivariées selon la dépendance de leurs extrêmes ». Electronic Thesis or Diss., Université Côte d'Azur, 2024. http://www.theses.fr/2024COAZ5039.

Abstract:

In a wide range of applications, from climate science to finance, extreme events with a far from negligible probability can occur and lead to disastrous consequences. Extremes of climatic events such as wind, temperature and precipitation can profoundly affect human beings and ecosystems, leading to events such as floods, landslides or heat waves. When the emphasis is on studying variables measured over time at a large number of stations with a specific location, such as the variables mentioned above, partitioning the variables becomes essential for summarizing and visualizing spatial trends, which is crucial in the study of extreme events. This thesis explores several models and methods for partitioning the variables of a multivariate stationary process, focusing on extremal dependence.

Chapter 1 presents the concepts of dependence modeling via copulas, which are fundamental for extremal dependence. The notion of regular variation, essential for the study of extremes, is introduced, and weakly dependent processes are discussed. Partitioning is discussed through the paradigms of separation-proximity and model-based partitioning. We also address non-asymptotic analysis in order to evaluate our methods in fixed dimensions.

Chapter 2 is about the dependence between maximum values, which is crucial for risk analysis. Using the extreme-value copula function and the madogram, this chapter focuses on nonparametric estimation with missing data. A functional central limit theorem is established, demonstrating the convergence of the madogram to a tight Gaussian process. Formulas for the asymptotic variance are presented and illustrated by a numerical study.

Chapter 3 proposes asymptotically independent block (AI-block) models for partitioning variables, defining clusters based on the independence of maxima. An algorithm is introduced to recover the clusters without specifying their number in advance. The theoretical efficiency of the algorithm is demonstrated, and a data-driven parameter selection method is proposed. The method is applied to neuroscience and environmental data, demonstrating its potential.

Chapter 4 adapts partitioning techniques to analyze compound extreme events in European climate data. Sub-regions showing dependence between extremes of precipitation and wind speed are identified using ERA5 data from 1979 to 2022. The clusters obtained are spatially concentrated, offering an in-depth understanding of the regional distribution of extremes. The proposed methods effectively reduce the size of the data while extracting crucial information about extreme events.

Chapter 5 proposes a new estimation method for the matrices in a linear latent factor model, in which each component of a random vector is expressed by a linear equation with factors and noise. Unlike classical approaches based on joint normality, we assume that the factors follow standard Fréchet distributions, which allows a better description of extremal dependence. An estimation method is proposed that guarantees a unique solution under certain conditions. An adaptive upper bound for the estimator is provided, adaptable to the dimension and the number of factors.
In a wide range of applications, from climate science to finance, extreme events with a non-negligible probability can occur, leading to disastrous consequences. Extremes in climatic events such as wind, temperature, and precipitation can profoundly impact humans and ecosystems, resulting in events like floods, landslides, or heatwaves. When the focus is on studying variables measured over time at numerous specific locations, such as the previously mentioned variables, partitioning these variables becomes essential to summarize and visualize spatial trends, which is crucial in the study of extreme events. This thesis explores several models and methods for partitioning the variables of a multivariate stationary process, focusing on extreme dependencies.

Chapter 1 introduces the concepts of modeling dependence through copulas, which are fundamental for extreme dependence. The notion of regular variation, essential for studying extremes, is introduced, and weakly dependent processes are discussed. Partitioning is examined through the paradigms of separation-proximity and model-based clustering. Non-asymptotic analysis is also addressed to evaluate our methods in fixed dimensions.

Chapter 2 studies the dependence between maximum values, which is crucial for risk analysis. Using the extreme value copula function and the madogram, this chapter focuses on non-parametric estimation with missing data. A functional central limit theorem is established, demonstrating the convergence of the madogram to a tight Gaussian process. Formulas for the asymptotic variance are presented and illustrated by a numerical study.

Chapter 3 proposes asymptotically independent block (AI-block) models for partitioning variables, defining clusters based on the independence of maxima. An algorithm is introduced to recover clusters without specifying their number in advance. The theoretical efficiency of the algorithm is demonstrated, and a data-driven parameter selection method is proposed. The method is applied to neuroscience and environmental data, showcasing its potential.

Chapter 4 adapts partitioning techniques to analyze composite extreme events in European climate data. Sub-regions with dependencies in extreme precipitation and wind speed are identified using ERA5 data from 1979 to 2022. The obtained clusters are spatially concentrated, offering a deep understanding of the regional distribution of extremes. The proposed methods efficiently reduce data size while extracting critical information on extreme events.

Chapter 5 proposes a new estimation method for matrices in a latent factor linear model, where each component of a random vector is expressed by a linear equation with factors and noise. Unlike classical approaches based on joint normality, we assume factors are distributed according to standard Fréchet distributions, allowing a better description of extreme dependence. An estimation method is proposed, ensuring a unique solution under certain conditions. An adaptive upper bound for the estimator is provided, adaptable to the dimension and the number of factors.
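
As a rough illustration of the madogram mentioned for Chapter 2, the sketch below computes a rank-based empirical F-madogram for two series, ν = ½·mean|F(X) − G(Y)|, and the associated pairwise extremal coefficient θ = (1 + 2ν)/(1 − 2ν), which equals 1 under complete dependence and 2 under independence. It assumes complete, untied data and that the inputs already play the role of (block) maxima; the thesis's treatment of missing data and of weakly dependent processes is not reproduced, and the function names are hypothetical.

```python
import random

def pseudo_observations(xs):
    """Map a sample to (0, 1) through its ranks: rank / (n + 1)."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u

def madogram(xs, ys):
    """Empirical F-madogram: half the mean absolute difference of ranks."""
    u, v = pseudo_observations(xs), pseudo_observations(ys)
    return 0.5 * sum(abs(a - b) for a, b in zip(u, v)) / len(xs)

def extremal_coefficient(xs, ys):
    """Pairwise extremal coefficient derived from the madogram."""
    nu = madogram(xs, ys)
    return (1 + 2 * nu) / (1 - 2 * nu)

# Synthetic demo: one independent pair and one perfectly dependent pair.
rng = random.Random(0)
a = [rng.random() for _ in range(2000)]
b = [rng.random() for _ in range(2000)]
theta_independent = extremal_coefficient(a, b)
theta_dependent = extremal_coefficient(a, [x ** 3 for x in a])
```

Because only ranks enter the estimator, any increasing transformation of a series (here the cube) leaves the result unchanged, which is why the second pair is treated as perfectly dependent.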

18

Labenne, Amaury. « Méthodes de réduction de dimension pour la construction d'indicateurs de qualité de vie ». Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0239/document.

Abstract:

The objective of this thesis is to develop and propose new dimension reduction methods for constructing composite quality-of-life indicators at the municipal scale. The statistical methodology developed emphasizes taking into account the multidimensionality of the concept of quality of life, with particular attention to the treatment of mixed data (quantitative and qualitative variables) and the introduction of environmental conditions. We opt for a variable clustering approach and for a multi-table method (multiple factor analysis for mixed data). These two methods make it possible to build composite indicators that we propose as a measure of living conditions at the municipal scale. To ease the interpretation of the composite indicators built, a bootstrap-type variable selection method is introduced in multiple factor analysis. Finally, we propose the hclustgeo method for clustering observations, which incorporates geographical proximity constraints in order to better capture the spatial nature of the phenomena involved.
The purpose of this thesis is to develop and suggest new dimension reduction methods to construct composite indicators on a municipal scale. The developed statistical methodology emphasizes the multidimensionality of the quality-of-life concept, with particular attention to the treatment of mixed data (quantitative and qualitative variables) and the introduction of environmental conditions. We opt for a variable clustering approach and for a multi-table method (multiple factorial analysis for mixed data). These two methods make it possible to build composite indicators that we propose as a measure of living conditions at the municipal scale. In order to facilitate the interpretation of the created composite indicators, we introduce a variable selection method based on a bootstrap approach. Finally, we suggest a method for clustering observations, named hclustgeo, which integrates geographical proximity constraints into the clustering procedure in order to better capture spatial specificities.
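
The general idea of clustering observations under geographical proximity constraints can be sketched as below. This is not the hclustgeo algorithm itself: as a simplifying assumption, it blends a feature dissimilarity matrix and a geographical one with a mixing parameter `alpha` and then applies ordinary average-linkage agglomeration, whereas the abstract does not specify the aggregation criterion. The function names and the toy data are hypothetical.

```python
def mix_dissimilarities(d_feat, d_geo, alpha):
    """Convex combination of two dissimilarity matrices, each scaled to [0, 1]."""
    m_feat = max(max(row) for row in d_feat)
    m_geo = max(max(row) for row in d_geo)
    n = len(d_feat)
    return [[(1 - alpha) * d_feat[i][j] / m_feat + alpha * d_geo[i][j] / m_geo
             for j in range(n)] for i in range(n)]

def average_linkage(d, k):
    """Agglomerative clustering (average linkage) down to k clusters."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = (sum(d[i][j] for i in clusters[a] for j in clusters[b])
                        / (len(clusters[a]) * len(clusters[b])))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Toy example: features pair up (0, 1) and (2, 3); geography pairs (0, 2) and (1, 3).
features = [0.0, 0.1, 5.0, 5.1]
positions = [0.0, 10.0, 0.2, 10.1]
d_feat = [[abs(p - q) for q in features] for p in features]
d_geo = [[abs(p - q) for q in positions] for p in positions]

by_features = average_linkage(mix_dissimilarities(d_feat, d_geo, 0.0), 2)
by_geography = average_linkage(mix_dissimilarities(d_feat, d_geo, 1.0), 2)
```

Intermediate values of `alpha` trade homogeneity in the feature space against spatial contiguity, which is the compromise the geographical constraint is meant to control.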

19

Kuentz, Vanessa. « Contributions à la réduction de dimension ». Thesis, Bordeaux 1, 2009. http://www.theses.fr/2009BOR13871/document.

Abstract:

This thesis is devoted to the problem of dimension reduction. This central topic in statistics aims to find low-dimensional subspaces while minimizing the loss of information contained in the data. First, we focus on multivariate statistical methods for qualitative variables. We address the question of rotation in Multiple Correspondence Analysis (MCA). We define the analytic expression of the optimal planar rotation angle for the chosen rotation criterion. When more than two principal components are retained, we use an algorithm of successive planar rotations of pairs of factors. We also propose several algorithms for clustering qualitative variables that aim to optimize a partitioning criterion based on the notion of correlation ratios. A real data set illustrates the practical benefits of rotation in MCA and allows an empirical comparison of the proposed algorithms for clustering qualitative variables. We then consider a semiparametric regression model; more precisely, we focus on the Sliced Inverse Regression (SIR) method. We develop an approach based on a partitioning of the covariate space, which can be used when the fundamental linearity condition on the explanatory variable is violated. A second adaptation, using the bootstrap, is proposed to improve the estimation of the basis of the dimension reduction subspace. Asymptotic results are given, and a study on simulated data demonstrates the superiority of the proposed approaches. Finally, the various applications and interdisciplinary collaborations carried out during the thesis are described.
This thesis concentrates on dimension reduction approaches, which seek lower-dimensional subspaces that minimize the loss of statistical information. First we focus on multivariate analysis for categorical data. The rotation problem in Multiple Correspondence Analysis (MCA) is treated. We give the analytic expression of the optimal angle of planar rotation for the chosen criterion. If more than two principal components are to be retained, this planar solution is used in a practical algorithm applying successive pairwise planar rotations. Different algorithms for the clustering of categorical variables are also proposed to maximize a given partitioning criterion based on correlation ratios. A real data application highlights the benefits of using rotation in MCA and provides an empirical comparison of the proposed algorithms for categorical variable clustering. Then we study the semiparametric regression method SIR (Sliced Inverse Regression). We propose an extension based on the partitioning of the predictor space that can be used when the crucial linearity condition on the predictor is not satisfied. We also introduce bagging versions of SIR to improve the estimation of the basis of the dimension reduction subspace. Asymptotic properties of the estimators are obtained, and a simulation study shows the good numerical behaviour of the proposed methods. Finally, applied multivariate data analysis in various areas is described.
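
The correlation ratio on which the partitioning criterion for categorical variables is based can be computed as in the following sketch: η² is the share of the variance of a quantitative score that is explained by the categories of a qualitative variable. The quantitative score used here is a made-up stand-in for a cluster's synthetic variable; how the thesis builds that variable is not reproduced, and the function name is hypothetical.

```python
def correlation_ratio(categories, values):
    """Squared correlation ratio: between-group over total sum of squares."""
    n = len(values)
    grand_mean = sum(values) / n
    groups = {}
    for cat, val in zip(categories, values):
        groups.setdefault(cat, []).append(val)
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return between / total if total > 0 else 0.0

# Hypothetical quantitative score and two categorical variables.
score = [1.0, 1.0, 3.0, 3.0]
strongly_linked = ["a", "a", "b", "b"]   # categories explain the score fully
unrelated = ["a", "b", "a", "b"]         # group means coincide

eta2_linked = correlation_ratio(strongly_linked, score)
eta2_unrelated = correlation_ratio(unrelated, score)
```

A cluster of categorical variables can then be judged homogeneous when its members all have a high correlation ratio with the same quantitative score.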

20

Makkhongkaew, Raywat. « Semi-supervised co-selection : instances and features : application to diagnosis of dry port by rail ». Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1341.

Abstract:

Since the proliferation of partially labeled databases, machine learning has developed considerably in the semi-supervised mode. This trend is due, on the one hand, to the difficulty of labeling data and, on the other hand, to the cost of this labeling when it is possible.

Semi-supervised learning generally consists in modeling a statistical function from a database that contains both labeled and unlabeled examples. To tackle such a problem, two families of approaches exist: those based on propagating the supervision with a view to supervised classification, and those based on constraints with a view to (unsupervised) clustering. Here we are interested in the second family, with a particular difficulty: learning from data in which the labeled part is very small relative to the unlabeled part.

In this thesis, we are interested in optimizing statistical databases in order to improve learning models. This optimization can be horizontal and/or vertical. The first defines instance selection and the second defines the task of variable selection.

The two tasks are usually studied independently, with a considerable body of work in the literature. Here we propose to study them simultaneously, which defines the topic of co-selection. To do so, we propose two unified frameworks that consider both the labeled part of the data and the unlabeled part. The first framework is based on weighted constrained clustering and the second on preserving similarities between the data. Both approaches consist in rating the instances and the variables in order to select the most relevant ones simultaneously.

Finally, we present a series of empirical studies on public data sets known from the literature to validate the proposed approaches and compare them with other known approaches. In addition, an experimental validation is provided on a real problem concerning rail transport diagnosis for the State Railway of Thailand.
We are drowning in massive data but starved for knowledge retrieval. It is well known from the dimensionality tradeoff that more data is more informative but comes at a price in computational complexity, which has to be made up for in some way. When the labeled sample is too small to carry sufficient information about the target concept, supervised learning fails in the face of this serious challenge. Unsupervised learning can be an alternative for this problem. However, as these algorithms ignore label information, important hints from labeled data are left out, and this generally degrades the performance of unsupervised learning algorithms. Using both labeled and unlabeled data is expected to give better results in semi-supervised learning, which is better suited to large-domain applications where labels are hard and costly to obtain. In addition, when data are large, feature selection and instance selection are two important dual operations for removing irrelevant information. Both tasks, combined with semi-supervised learning, pose distinct challenges to the machine learning and data mining communities for data dimensionality reduction and knowledge retrieval. In this thesis, we focus on the co-selection of instances and features in the context of semi-supervised learning. In this context, co-selection becomes a more challenging problem, as the data contain labeled and unlabeled examples sampled from the same population. To perform such semi-supervised co-selection, we propose two unified frameworks, which efficiently integrate the labeled and unlabeled parts into the co-selection process. The first framework is based on weighted constrained clustering and the second on similarity-preserving selection. Both approaches evaluate the usefulness of features and instances in order to select the most relevant ones simultaneously. Finally, we present a variety of empirical studies over high-dimensional data sets that are well known in the literature. The results are promising and show the efficiency and effectiveness of the proposed approaches. In addition, the developed methods are validated on a real-world application, using data provided by the State Railway of Thailand (SRT). The purpose is to derive application models from our methodological contributions to diagnose the performance of rail dry port systems. First, we present the results of some ensemble methods applied to a first data set, which is fully labeled. Second, we show how our co-selection approaches can improve the performance of learning algorithms on partially labeled data provided by SRT.

21

Liu, Gang. « Spatiotemporal Sensing and Informatics for Complex Systems Monitoring, Fault Identification and Root Cause Diagnostics ». Scholar Commons, 2015. https://scholarcommons.usf.edu/etd/5727.

Abstract:

In order to cope with system complexity and dynamic environments, modern industries are investing in a variety of sensor networks and data acquisition systems to increase information visibility. Multi-sensor systems bring the proliferation of high-dimensional functional Big Data that capture rich information on the evolving dynamics of natural and engineered processes. With spatially and temporally dense data readily available, there is an urgent need to develop advanced methodologies and associated tools that will enable and assist (i) the handling of the big data communicated by contemporary complex systems, (ii) the extraction and identification of pertinent knowledge about the environmental and operational dynamics driving these systems, and (iii) the exploitation of the acquired knowledge for enhanced design, analysis, monitoring, diagnostics and control. My methodological and theoretical research, as well as a considerable portion of my applied and collaborative work in this dissertation, aim at addressing the high-dimensional functional big data communicated by these systems. An innovative contribution of my work is the establishment of a series of systematic methodologies to investigate complex system informatics, including multi-dimensional modeling, feature extraction and selection, model-based monitoring and root cause diagnostics. This study presents systematic methodologies to investigate the spatiotemporal informatics of complex systems, from multi-dimensional modeling and feature extraction to model-driven monitoring, fault identification and root cause diagnostics. In particular, we developed a multiscale adaptive basis function model to represent and characterize high-dimensional nonlinear functional profiles, thereby reducing the large amount of data to a parsimonious set of variables (i.e., model parameters) while preserving the information. Furthermore, the complex interdependence structure among variables is identified by a novel self-organizing network algorithm, in which homogeneous variables are clustered into sub-network communities. We then minimize the redundancy of variables in each cluster and integrate the new set of clustered variables with predictive models to identify a sparse set of sensitive variables for process monitoring and fault diagnostics. We evaluated and validated our methodologies using real-world case studies that extract parameters from representation models of vectorcardiogram (VCG) signals for the diagnosis of myocardial infarctions. The proposed systematic methodologies are generally applicable to modeling, monitoring and diagnosis in many disciplines that involve a large number of highly redundant variables extracted from big data. The self-organizing approach was also developed to derive the steady geometric structure of a network from the recurrence-based adjacency matrix. As such, novel network-theoretic measures can be obtained based on actual node-to-node distances in the self-organized network topology.

22

Ning, Hoi-Kwan Flora. « Model-based regression clustering with variable selection ». Thesis, University of Oxford, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.497059.

23

McClelland, Robyn L. « Regression based variable clustering for data reduction ». Thesis, Connect to this title online ; UW restricted, 2000. http://hdl.handle.net/1773/9611.

24

Channarond, Antoine. « Recherche de structure dans un graphe aléatoire : modèles à espace latent ». Thesis, Paris 11, 2013. http://www.theses.fr/2013PA112338/document.

Abstract:

This thesis addresses the clustering of the nodes of a graph in the framework of random models with latent variables. Each node i is assigned an unobserved (latent) variable Zi, and the probability that nodes i and j are connected depends conditionally on Zi and Zj. Unlike in the Erdős-Rényi model, connections are not independent and identically distributed; the latent variables govern the connection distribution of the nodes. These models are thus heterogeneous, and their structure is fully described by the latent variables and their distribution. Hence we aim at inferring them from the graph, which is the only observed data. In both original works of this thesis, we propose consistent inference methods whose computational cost is at most linear in the number of nodes or edges, so that large graphs can be processed in reasonable time. Both are based on a fine study of the distribution of the degrees, suitably normalized for the model. The first work deals with the Stochastic Blockmodel. We show the consistency of an unsupervised classification algorithm using concentration inequalities. From it we deduce a parameter estimation method, a model selection method for the number of latent classes, and a clustering test (testing whether there is one cluster or more), all of which are proved to be consistent. In the second work, the latent variables are positions in ℝ^d with density f, and the connection probability depends on the distance between node positions. The clusters are defined as the connected components of the level set of f at a fixed level t > 0, and the goal is to estimate their number from the observed graph alone. We estimate the density at the latent positions of the nodes from their degrees, which establishes a correspondence between clusters and connected components of certain subgraphs of the observed graph, obtained by removing low-degree nodes. In particular, we derive an estimator of the number of clusters and show its consistency in a certain sense.

25

Ta, Minh Thuy. « Techniques d'optimisation non convexe basée sur la programmation DC et DCA et méthodes évolutives pour la classification non supervisée ». Electronic Thesis or Diss., Université de Lorraine, 2014. http://www.theses.fr/2014LORR0099.

Abstract:

This thesis focuses on four problems in data mining and machine learning: clustering data streams, clustering massive data sets, clustering with variable weighting (hard and fuzzy), and clustering without prior knowledge of the number of clusters, with optimal initialization of the cluster centers. Our methods are based on deterministic optimization approaches, namely DC (Difference of Convex functions) programming and DCA (Difference of Convex Algorithm), as well as on elitist evolutionary approaches. We adapt the clustering algorithm DCA–MSSC to data streams using two window models: sub–windows and sliding windows. For the problem of clustering massive data sets, we use the DCA algorithm in two phases. In the first phase, the massive data set is divided into several subsets, on each of which DCA–MSSC performs clustering. In the second phase, we propose a DCA–Weight algorithm to perform a weighted clustering of the centers obtained in the first phase. For clustering with variable weighting, we also propose two approaches: weighted hard clustering and weighted fuzzy clustering. We test our approach on an image segmentation problem. The final issue addressed in this thesis is clustering without prior knowledge of the number of clusters. We propose an elitist evolutionary approach in which several evolutionary algorithms (EAs) are run at the same time and compete to find the best combination of initial cluster centers and, at the same time, the optimal number of clusters. Tests on several large data sets are very promising and demonstrate the effectiveness of the proposed approaches.

26

Sanchez, Merchante Luis Francisco. « Learning algorithms for sparse classification ». Phd thesis, Université de Technologie de Compiègne, 2013. http://tel.archives-ouvertes.fr/tel-00868847.

Abstract:

This thesis deals with the development of estimation algorithms with embedded feature selection in the context of high-dimensional data, in the supervised and unsupervised frameworks. The contributions of this work are materialized by two algorithms: GLOSS for the supervised domain and Mix-GLOSS for its unsupervised counterpart. Both algorithms are based on the resolution of optimal scoring regression regularized with a quadratic formulation of the group-Lasso penalty, which encourages the removal of uninformative features. The theoretical foundations proving that a group-Lasso penalized optimal scoring regression can be used to solve a linear discriminant analysis were first developed in this work. The theory that adapts this technique to the unsupervised domain by means of the EM algorithm is not new, but it had never been clearly laid out for a sparsity-inducing penalty. This thesis demonstrates that using group-Lasso penalized optimal scoring regression inside an EM algorithm is possible. Our algorithms have been tested on real and artificial high-dimensional data sets, with impressive results in terms of parsimony and without compromising prediction performance.

27

Benkaci, Mourad. « Surveillance des systèmes mécatronique d'automobile par des méthodes d'apprentissage ». Toulouse 3, 2011. https://tel.archives-ouvertes.fr/tel-00647456.

Abstract:

Monitoring mechatronic systems, especially those embedded in today's vehicles, is increasingly difficult. The interconnection of these systems to improve vehicle performance and comfort increases the complexity of the information needed for real-time decision-making. This PhD thesis is devoted to fault detection and isolation (FDI) in automotive systems, using algorithms that search for and evaluate information through mono-criterion approaches. The variables relevant for rapid fault detection are selected automatically using two different approaches. I. The first introduces the notion of conflict between all the measurable variables of the mechatronic system and analyzes these variables through their projections in hyper-rectangle classification spaces. II. The second uses Kolmogorov complexity as a tool for classifying fault signatures. Estimating the Kolmogorov complexity with lossless compression algorithms makes it possible to define a dictionary of faults and to assign a criticality score relative to the healthy functioning of the vehicle. The two proposed approaches have been successfully applied to several types of automotive data within the ANR-DIAPA project.
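The compression-based idea in approach II can be illustrated with a minimal sketch. It assumes the normalized compression distance (NCD) as the similarity measure and zlib as the lossless compressor; neither choice is confirmed by the abstract, and the function names and the two fault signatures below are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when x and y share structure."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(signal: bytes, dictionary: dict) -> str:
    """Return the label of the dictionary signature closest to the signal."""
    return min(dictionary, key=lambda label: ncd(signal, dictionary[label]))


# Hypothetical, made-up signatures (not data from the thesis).
signatures = {
    "healthy": b"0123456789" * 50,  # regular periodic reading
    "stuck_sensor": b"7" * 500,     # sensor frozen at a single value
}
signal = b"0123456789" * 40
print(classify(signal, signatures))
```

The distance to each dictionary entry could also serve as a rough score of how far a signal is from healthy behaviour, although the thesis may define its criticality score differently.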

28

Moraes, Renan Manhabosco. « Aplicações de técnicas multivariadas na área comercial de uma empresa de comunicação ». Biblioteca Digital de Teses e Dissertações da UFRGS, 2017. http://hdl.handle.net/10183/173130.

Abstract:

The change in consumer behavior brought about by technology and social networks greatly empowers consumers, substantially altering how companies relate to their final audience. Attentive to this market, media companies are undergoing profound changes, both in how they deliver content to their audience and in their administrative, strategic and financial organization. This dissertation presents approaches based on multivariate techniques for composing commercial teams and for remunerating the sales teams of a communication company. The first article aims to build a model to estimate the commercial awards of the sales teams of the RBS Group radio stations. To do so, the RBS Group radio stations in the states of Rio Grande do Sul and Santa Catarina are first clustered based on their similarity profiles. For each cluster, a multiple linear regression of the commercial award is fitted and validated through cross-validation using the adjusted R² and the Mean Absolute Percentage Error (MAPE). The second article addresses the clustering of the RBS Group's top clients and its impact on the composition of commercial teams, by means of variable selection. The 7 original variables were evaluated with the "omit one variable at a time" selection method; the best average Silhouette Index (SI), the metric used to assess the quality of the clusters, was obtained when 3 variables were retained. The clusters generated by these variables reflect the clients' media-buying behavior and were considered satisfactory when evaluated by RBS Group experts.

29

Devijver, Emilie. « Modèles de mélange pour la régression en grande dimension, application aux données fonctionnelles ». Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112130/document.

Abstract:

Finite mixture regression models are useful for modeling the relationship between a response and predictors arising from different subpopulations. In this thesis, we focus on high-dimensional predictors and a high-dimensional response. First, we provide an ℓ1-oracle inequality satisfied by the Lasso estimator. We focus on this estimator for its ℓ1-regularization properties rather than for the variable selection procedure. We also propose two procedures to deal with this high-dimensional clustering problem. The first procedure estimates the unknown conditional mixture density by a maximum likelihood estimator restricted to the relevant variables selected by an ℓ1-penalized maximum likelihood estimator. The second procedure jointly considers predictor selection and rank reduction to obtain lower-dimensional approximations of the parameter matrices. For each procedure, we obtain an oracle inequality, which gives the shape of the penalty of the criterion, depending on the complexity of the random model collection. We extend these procedures to the functional case, where predictors and responses are functions, using a wavelet-based approach. For each situation, we provide algorithms and apply and evaluate our methods on both simulated and real data sets. In particular, we illustrate the first procedure on an electricity load consumption data set.

30

Cozzini, Alberto Maria. « Supervised and unsupervised model-based clustering with variable selection ». Thesis, Imperial College London, 2012. http://hdl.handle.net/10044/1/9973.

Abstract:

The thesis tackles the problem of uncovering hidden structures in high-dimensional data in the presence of noise and non-informative variables. It proposes a supervised and an unsupervised mixture model that select the relevant variables and are robust to measurement errors and outliers. Within the class of unsupervised clustering models, we extend variable selection to the family of Student's t mixture models. While t distributions are naturally robust to noise and extreme events, sparsity is achieved by imposing regularization on the location and dispersion parameters. An EM algorithm is implemented to return the maximum likelihood estimate of the model parameters given the added penalty term. To further assess the contribution of each variable, we propose a resampling procedure that ranks the variables according to their selection probability. Supervised clustering is implemented in a Bayesian framework. The model assumes a mixture of Lasso-type regressions with t-distributed errors. While the Lasso representation of the normal linear model imposes regularization on the regression coefficients, variable selection is explicitly modelled by a latent binary indicator variable. The model relies on a particle Markov chain Monte Carlo algorithm to approximate the posterior distribution of the parameters of interest. To highlight the properties and advantages of the proposed models, two real-life problems are considered. The first requires us to identify subtypes of breast cancer tumors by grouping patients based only on their gene expression levels, when only a few of the thousands of genes are informative. In the second, our aim is to cluster different financial markets spanning several macro sectors and explain their trading performance solely on the basis of the observed statistical features of their price dynamics.

31

Benkaci, Mourad. « Surveillance des systèmes automatiques et systèmes embraqués ». Phd thesis, Université Paul Sabatier - Toulouse III, 2011. http://tel.archives-ouvertes.fr/tel-00647456.

Abstract:

Monitoring mechatronic systems, especially those embedded in today's vehicles, is increasingly difficult. The interconnection of these systems to improve vehicle performance and comfort increases the complexity of the information needed for real-time decision-making. This thesis is devoted to fault detection and isolation (FDI) in automotive systems, using systems that search for and evaluate information through mono-criterion approaches. The variables relevant for rapid fault detection are selected automatically using two different approaches. I. The first introduces the notion of conflict between all the measurable variables of the mechatronic system and analyzes them through their projections in hyper-rectangle classification spaces. II. The second uses Kolmogorov complexity as a tool for classifying fault signatures. Estimating the Kolmogorov complexity with lossless compression algorithms makes it possible to define a dictionary of faults and to assign a criticality score relative to the healthy functioning of the vehicle. The two proposed approaches have been successfully applied to several types of automotive data within the ANR-DIAPA project.

32

Kim, Sinae. « Bayesian variable selection in clustering via dirichlet process mixture models ». Texas A&M University, 2003. http://hdl.handle.net/1969.1/5888.

Abstract:

The increased collection of high-dimensional data in various fields has raised a strong interest in clustering algorithms and variable selection procedures. In this dissertation, I propose a model-based method that addresses the two problems simultaneously. I use Dirichlet process mixture models to define the cluster structure and introduce in the model a latent binary vector to identify discriminating variables. I update the variable selection index using a Metropolis algorithm and obtain inference on the cluster structure via a split-merge Markov chain Monte Carlo technique. I evaluate the method on simulated data and illustrate an application with a DNA microarray study. I also show that the methodology can be adapted to the problem of clustering functional high-dimensional data. There I employ wavelet thresholding methods in order to reduce the dimension of the data and to remove noise from the observed curves. I then apply variable selection and sample clustering methods in the wavelet domain. Thus my methodology is wavelet-based and aims at clustering the curves while identifying wavelet coefficients describing discriminating local features. I exemplify the method on high-dimensional and high-frequency tidal volume traces measured under an induced panic attack model in normal humans.

33

Al-Guwaizani, Abdulrahman. « Variable neighbourhood search based heuristic for K-harmonic means clustering ». Thesis, Brunel University, 2011. http://bura.brunel.ac.uk/handle/2438/5827.

Abstract:

Although there has been a rapid development of technology and increase of computation speeds, most real-world optimization problems still cannot be solved in a reasonable time. Sometimes it is impossible to solve them optimally, as many instances of real problems cannot be handled by computers at their present speed. In such cases, a heuristic approach can be used. Heuristics have been used by many researchers to meet this need; they give a sufficiently good solution in reasonable time. The clustering problem, which arises in many applications, is one example. In this thesis, I suggest a Variable Neighbourhood Search (VNS) to improve a recent clustering local search called K-Harmonic Means (KHM). Many experiments are presented to show the strength of my code compared with some algorithms from the literature. Some counter-examples are introduced to show that KHM may degenerate entirely, in one or more runs. Furthermore, it degenerates and then stops on some familiar datasets, which significantly affects the final solution. Hence, I present a degeneracy-removing code for KHM. I also apply VNS to improve the KHM code after removing the degeneracy.

34

Lynch, Sarah K. « A scale-independent clustering method with automatic variable selection based on trees ». Thesis, Monterey, California : Naval Postgraduate School, 2014. http://hdl.handle.net/10945/41412.

Abstract:

Approved for public release; distribution is unlimited.
Clustering is the process of putting observations into groups based on their distance, or dissimilarity, from one another. Measuring distance for continuous variables often requires scaling or monotonic transformation. Determining dissimilarity when observations have both continuous and categorical measurements can be difficult because each type of measurement must be approached differently. We introduce a new clustering method that uses one of three new distance metrics. In a dataset with p variables, we create p trees, one with each variable as the response. Distance is measured by determining on which leaf an observation falls in each tree. Two observations are similar if they tend to fall on the same leaf and dissimilar if they are usually on different leaves. The distance metrics are not affected by scaling or transformations of the variables and easily determine distances in datasets with both continuous and categorical variables. This method is tested on several well-known datasets, both with and without added noise variables, and performs very well in the presence of noise due in part to automatic variable selection. The new distance metrics outperform several existing clustering methods in a large number of scenarios.
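The leaf-based distance described above can be sketched as follows. The abstract mentions three distance metrics without defining them, so this sketch assumes the simplest reading: the distance between two observations is the fraction of trees in which they fall on different leaves. Fitting the p trees is omitted; the leaf identifiers below are hypothetical inputs, and the function names are illustrative only.

```python
def leaf_distance(leaves_a, leaves_b):
    """Fraction of trees in which two observations land on different leaves."""
    if len(leaves_a) != len(leaves_b) or not leaves_a:
        raise ValueError("both observations need leaf ids for the same, non-empty set of trees")
    differing = sum(a != b for a, b in zip(leaves_a, leaves_b))
    return differing / len(leaves_a)


def distance_matrix(leaf_ids):
    """Pairwise distances; leaf_ids[i][t] is the leaf of observation i in tree t."""
    n = len(leaf_ids)
    return [[leaf_distance(leaf_ids[i], leaf_ids[j]) for j in range(n)] for i in range(n)]


# Hypothetical leaf assignments: 3 observations, p = 4 trees.
leaf_ids = [
    [2, 5, 1, 3],
    [2, 5, 1, 4],
    [7, 6, 1, 9],
]
dist = distance_matrix(leaf_ids)
```

Because only leaf membership enters the computation, rescaling or monotonically transforming a variable leaves the distances unchanged whenever the fitted trees produce the same partitions, which is consistent with the scale independence claimed in the abstract.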

35

Palla, Konstantina. « Probabilistic nonparametric models for relational data, variable clustering and reversible Markov chains ». Thesis, University of Cambridge, 2015. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.709019.

36

Giovinazzi, Francesco <1988>. « Solution Path Clustering for Fixed-Effects Models in a Latent Variable Context ». Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amsdottorato.unibo.it/8740/1/giovinazzi_phdthesis.pdf.

Abstract:

The main drawback of estimating latent variable models with fixed effects is the direct dependence between the number of free parameters and the number of observations. We propose to apply a well-suited penalization technique in order to regularize the parameter estimates. In particular, we promote sparsity based on the pairwise differences of subject-specific parameters, inducing the latter to shrink toward each other. This method makes it possible to group statistical units into clusters that are homogeneous with respect to a latent attribute, without the need to specify any distributional assumption and without adopting random effects. In practice, applying the proposed penalization reduces the number of free parameters, and the adopted model becomes more parsimonious. The estimation of the fixed effects is based on an algorithm that builds a solution path, in the form of a hierarchical aggregation tree, whose outcome depends on a single tuning parameter. The method is intended to be general, and in principle it can be applied to the likelihood of any latent variable model with fixed effects. We describe in detail its application to the Rasch model, for which we provide a real data example and a simulation study. We then extend the method to the case of a latent variable model for continuous data, where the number of fixed effects to be estimated is higher.
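A penalty on pairwise differences of this kind is often written in the generic form below. This is only a sketch: the symbols (log-likelihood \ell, subject-specific parameters \theta_i, tuning parameter \lambda) and the choice of the absolute value as penalty function are assumptions for illustration, and the thesis may use a different penalty.

```latex
\hat{\theta}(\lambda) \;=\; \arg\min_{\theta_1,\dots,\theta_n}
  \Bigl\{ -\ell(\theta_1,\dots,\theta_n)
          \;+\; \lambda \sum_{1 \le i < j \le n} \lvert \theta_i - \theta_j \rvert \Bigr\}
```

With \lambda = 0 this reduces to the unpenalized fixed-effects estimate; as \lambda grows, more pairwise differences are set exactly to zero, and subjects whose estimates coincide form a cluster.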

37

Cappozzo, Andrea. « Robust model-based classification and clustering : advances in learning from contaminated datasets ». Doctoral thesis, Università degli Studi di Milano-Bicocca, 2020. http://hdl.handle.net/10281/262919.

Abstract:

At the time of writing, an ever-increasing amount of data is collected every day, with its volume estimated to double every two years. Thanks to technological advancements, datasets are becoming massive in size and substantially more complex in nature. Nevertheless, this abundance of "raw information" comes at a price: wrong measurements, data-entry errors, breakdowns of automatic collection systems and several other causes may ultimately undermine overall data quality. Robust methods therefore have a central role in properly converting contaminated "raw information" into trustworthy knowledge, a primary goal of any statistical analysis. This manuscript presents novel methodologies for performing reliable inference, within the model-based classification and clustering framework, in the presence of contaminated data. First, we propose a robust modification of a family of semi-supervised patterned models for accomplishing classification when dealing with both class and attribute noise. Second, we develop a discriminant analysis method for anomaly and novelty detection, with the final aim of discovering label noise, outliers and unobserved classes in an unlabelled dataset. Third, we introduce two robust variable selection methods that effectively perform high-dimensional discrimination in an adulterated scenario.

38

Abonyi, J., FD Tamás, S. Potgieter and H. Potgieter. « Analysis of Trace Elements in South African Clinkers using Latent Variable Model and Clustering ». South African Journal of Chemistry, 2003. http://encore.tut.ac.za/iii/cpro/DigitalItemViewPage.external?sp=1000893.

Abstract:

The trace element content of clinkers (and possibly of cements) can be used to identify the manufacturing factory. The Mg, Sr, Ba, Mn, Ti, Zr, Zn and V content of clinkers gives detailed information for determining the origin of clinkers produced in different factories. However, analysing such complex data requires algorithmic tools for visualizing and clustering the samples. This paper proposes a new approach for this purpose. The analytical data are transformed into a two-dimensional latent space by factor analysis (probabilistic principal component analysis), and dendrograms are constructed for cluster formation. The classification of South African clinkers is used as an illustrative example of the approach.

39

Lazic, Jasmina. « New variants of variable neighbourhood search for 0-1 mixed integer programming and clustering ». Thesis, Brunel University, 2010. http://bura.brunel.ac.uk/handle/2438/4602.

Abstract:

Many real-world optimisation problems are discrete in nature. Although recent rapid developments in computer technologies are steadily increasing the speed of computations, the size of an instance of a hard discrete optimisation problem solvable in a prescribed time does not increase linearly with computer speed. This calls for the development of new solution methodologies for solving larger instances in shorter time. Furthermore, large instances of discrete optimisation problems are normally impossible to solve to optimality within reasonable computational time and space, and can only be tackled with a heuristic approach. In this thesis the development of so-called matheuristics, heuristics based on the mathematical formulation of the problem, is studied and employed within the variable neighbourhood search framework. Some new variants of the variable neighbourhood search metaheuristic itself are suggested, which naturally emerge from exploiting the information from the mathematical programming formulation of the problem. However, those variants may also be applied to problems described by the combinatorial formulation. A unifying perspective on modern advances in local search-based metaheuristics, a so-called hyper-reactive approach, is also proposed. Two NP-hard discrete optimisation problems are considered: 0-1 mixed integer programming and clustering with application to colour image quantisation. Several new heuristics for the 0-1 mixed integer programming problem are developed, based on the principle of variable neighbourhood search. One set of proposed heuristics consists of improvement heuristics, which attempt to find high-quality near-optimal solutions starting from a given feasible solution. Another set consists of constructive heuristics, which attempt to find initial feasible solutions for 0-1 mixed integer programs. Finally, some variable neighbourhood search based clustering techniques are applied to the colour image quantisation problem. All new methods presented are compared to other algorithms recommended in the literature, and a comprehensive performance analysis is provided. Computational results show that the proposed methods either outperform the existing state-of-the-art methods for the problems observed or provide comparable results. The theory and algorithms presented in this thesis indicate that hybridisation of the CPLEX MIP solver and the VNS metaheuristic can be very effective for solving large instances of the 0-1 mixed integer programming problem. More generally, the results presented in this thesis suggest that hybridisation of exact (commercial) integer programming solvers and some metaheuristic methods is of high interest, and such combinations deserve further practical and theoretical investigation. Results also show that VNS can be successfully applied to a colour image quantisation problem.

40

Abonyia, J., FD Tamas and S. Potgieter. « Analysis of trace elements in South African clinkers using latent variable model and clustering ». South African Journal of Chemistry, 2003. http://encore.tut.ac.za/iii/cpro/DigitalItemViewPage.external?sp=1001952.

Abstract:

The trace element content of clinkers (and possibly of cements) can be used to identify the manufacturing factory. The Mg, Sr, Ba, Mn, Ti, Zr, Zn and V content of clinkers gives detailed information for determining the origin of clinkers produced in different factories. However, analysing such complex data requires algorithmic tools for visualizing and clustering the samples. This paper proposes a new approach for this purpose. The analytical data are transformed into a two-dimensional latent space by factor analysis (probabilistic principal component analysis), and dendrograms are constructed for cluster formation. The classification of South African clinkers is used as an illustrative example of the approach.

41

Rastelli, Riccardo, and Nial Friel. « Optimal Bayesian estimators for latent variable cluster models ». Springer Nature, 2018. http://dx.doi.org/10.1007/s11222-017-9786-y.

Abstract:

In cluster analysis interest lies in probabilistically capturing partitions of individuals, items or observations into groups, such that those belonging to the same group share similar attributes or relational profiles. Bayesian posterior samples for the latent allocation variables can be effectively obtained in a wide range of clustering models, including finite mixtures, infinite mixtures, hidden Markov models and block models for networks. However, due to the categorical nature of the clustering variables and the lack of scalable algorithms, summary tools that can interpret such samples are not available. We adopt a Bayesian decision theoretical approach to define an optimality criterion for clusterings and propose a fast and context-independent greedy algorithm to find the best allocations. One important facet of our approach is that the optimal number of groups is automatically selected, thereby solving the clustering and the model-choice problems at the same time. We consider several loss functions to compare partitions and show that our approach can accommodate a wide range of cases. Finally, we illustrate our approach on both artificial and real datasets for three different clustering models: Gaussian mixtures, stochastic block models and latent block models for networks.

42

Mello, Paula Lunardi de. « Sistemáticas de agrupamento de países com base em indicadores de desempenho ». Biblioteca Digital de Teses e Dissertações da UFRGS, 2017. http://hdl.handle.net/10183/158359.

Abstract:

The world economy underwent major transformations in the last century, including periods of sustained growth followed by periods of stagnation, governments alternating between market liberalization strategies and commercial protectionism policies, and instability in markets, among others. As an aid to understanding economic and social problems in a systemic way, the analysis of performance indicators generates relevant information about behaviour patterns and trends, and guides policies and strategies for improving economic and social outcomes. Indicators describing the main economic dimensions of a country can be used as guiding principles in the design and monitoring of that country's development and growth policies. To this end, this dissertation uses World Bank data to apply and evaluate procedures for grouping countries with similar characteristics in terms of the indicators that describe them. To do so, it integrates clustering techniques (hierarchical and non-hierarchical), variable selection (through the "leave one variable out at a time" technique) and dimensionality reduction (through Principal Component Analysis) with a view to forming consistent groups of countries. The quality of the generated clusters is evaluated by the Silhouette, Calinski-Harabasz and Davies-Bouldin indexes. The results were satisfactory regarding both the representativeness of the highlighted indicators and the quality of the resulting clustering.
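
To make the first of the cluster-quality indexes named in this abstract concrete, here is a minimal pure-Python sketch of the silhouette index with Euclidean distances. The function names and the toy points are hypothetical and are not taken from the dissertation, which may well have used a statistical package instead.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette value s(i) = (b - a) / max(a, b) for every point."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_distance(i, members):
        # Average Euclidean distance from point i to the given members.
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_distance(i, own)  # cohesion: distance to own cluster
        b = min(  # separation: distance to the nearest other cluster
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        values.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return values


def silhouette_index(points, labels):
    """Average silhouette value over all points."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


# Toy data (hypothetical): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_index(points, [0, 0, 1, 1])  # groups nearby points
bad = silhouette_index(points, [0, 1, 0, 1])   # groups distant points
print(round(good, 3), round(bad, 3))
```

Values near 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own, so the sensible grouping scores higher than the mismatched one.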

43

Ren, Sheng. « New Methods of Variable Selection and Inference on High Dimensional Data ». University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1511883302569683.

44

Huynh, Bao Tuyen. « Estimation and feature selection in high-dimensional mixtures-of-experts models ». Thesis, Normandie, 2019. http://www.theses.fr/2019NORMC237.

Abstract:

This thesis deals with the modeling and estimation of high-dimensional mixture-of-experts (MoE) models, towards effective density estimation, prediction and clustering of such heterogeneous and high-dimensional data. We propose new strategies based on regularized maximum-likelihood estimation (MLE) of MoE models to overcome the limitations of standard methods, including MLE with Expectation-Maximization (EM) algorithms, and to simultaneously perform feature selection so that sparse models are encouraged in such a high-dimensional setting. We first introduce a parameter estimation and variable selection methodology for mixtures of experts, based on l1 (lasso) regularization and the EM framework, for regression and clustering suited to high-dimensional contexts. Then, we extend the method to regularized mixture-of-experts models for discrete data, including classification. We develop efficient algorithms to maximize the proposed l1-penalized observed-data log-likelihood function. Our proposed strategies enjoy efficient monotone maximization of the optimized criterion and, unlike previous approaches, they do not rely on approximations of the penalty functions, avoid matrix inversion, and exploit the efficiency of the coordinate ascent algorithm, particularly within the proximal Newton-based approach.

45

Šulc, Zdeněk. « Similarity Measures for Nominal Data in Hierarchical Clustering ». Doctoral thesis, Vysoká škola ekonomická v Praze, 2013. http://www.nusl.cz/ntk/nusl-261939.

Abstract:

This dissertation deals with similarity measures for nominal data in hierarchical clustering that can cope with variables with more than two categories and that aspire to replace the simple matching approach commonly used in this area. These similarity measures take into account additional characteristics of a dataset, such as the frequency distribution of categories or the number of categories of a given variable. The thesis has three main aims. The first is to examine selected similarity measures for nominal data and evaluate their clustering performance in hierarchical clustering of objects and variables. To achieve this goal, four experiments covering both object and variable clustering were performed. They compare the clustering quality of the examined similarity measures for nominal data with commonly used similarity measures based on a binary transformation and with several alternative methods for nominal data clustering. The comparison and evaluation are performed on real and generated datasets. The outputs of these experiments show which similarity measures can generally be used, which ones perform well in a particular situation, and which ones are not recommended for object or variable clustering. The second aim is to propose a theory-based similarity measure, evaluate its properties, and compare it with the other examined similarity measures. To this end, two novel similarity measures, Variable Entropy and Variable Mutability, are proposed; the former in particular performs very well on datasets with a lower number of variables. The third aim is to provide a convenient software implementation based on the examined similarity measures for nominal data, covering the whole clustering process from the computation of a proximity matrix to the evaluation of the resulting clusters. This goal was achieved by creating the nomclust package for R, which is freely available.
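
For orientation, the simple matching baseline that this abstract refers to can be sketched in a few lines of plain Python. The function names and the toy records are hypothetical; this is not the nomclust implementation, and the measures proposed in the thesis (Variable Entropy, Variable Mutability) are not reproduced here.

```python
def simple_matching(a, b):
    """Share of variables on which two objects take the same category."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of variables")
    if not a:
        raise ValueError("objects must have at least one variable")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def proximity_matrix(rows):
    """Dissimilarities (1 - similarity) for every pair of objects."""
    return [[1.0 - simple_matching(r, s) for s in rows] for r in rows]


# Toy nominal data (hypothetical): colour, size, shape.
rows = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
matrix = proximity_matrix(rows)
for line in matrix:
    print([round(value, 3) for value in line])
```

Simple matching weights every agreement equally, whatever the number or frequency of the categories involved, which is the limitation that frequency-aware measures aim to address. A matrix like this one is what a hierarchical clustering routine would take as input.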

46

Hilton, Ross P. « Model-based data mining methods for identifying patterns in biomedical and health data ». Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54387.

Abstract:

In this thesis we provide statistical and model-based data mining methods for pattern detection with applications to biomedical and healthcare data sets. In particular, we examine applications in costly acute or chronic disease management. In Chapter II, we consider nuclear magnetic resonance experiments in which we seek to locate and demix smooth, yet highly localized components in a noisy two-dimensional signal. By using wavelet-based methods we are able to separate components from the noisy background, as well as from other neighboring components. In Chapter III, we pilot methods for identifying profiles of patient utilization of the healthcare system from large, highly-sensitive, patient-level data. We combine model-based data mining methods with clustering analysis in order to extract longitudinal utilization profiles. We transform these profiles into simple visual displays that can inform policy decisions and quantify the potential cost savings of interventions that improve adherence to recommended care guidelines. In Chapter IV, we propose new methods integrating survival analysis models and clustering analysis to profile patient-level utilization behaviors while controlling for variations in the population’s demographic and healthcare characteristics and explaining variations in utilization due to different state-based Medicaid programs, as well as access and urbanicity measures.

47

Demigha, Oualid. « Energy Conservation for Collaborative Applications in Wireless Sensor Networks ». Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0058/document.

Abstract:

Wireless sensor networks are a new technology whose applications span several domains: military, scientific, medical, industrial, etc. Collaboration among sensor nodes, which have minimal capacities in terms of sensing, transmission, processing and energy, is necessary to carry out tasks as complex as data collection, tracking of mobile objects, surveillance of sensitive areas, etc. The hardware constraint on the development of the nodes' energy resources persists, hence the need for software optimization in the different layers of the protocol stack and of the nodes' operating system. In this thesis, we approach the energy optimization problem for collaborative applications through sensor selection methods based on prediction and on the correlation of data produced by the network itself. We develop several methods to conserve the network's energy resources, using prediction as a means of anticipating the actions of the nodes and their roles in order to minimize the number of nodes involved in the task at hand. We take mobile object tracking as a case study, after surveying the state of the art of recent methods and approaches used in this context. We formalize the problem as a linear program with binary variables in order to find an exact general solution. We thus model the problem of minimizing the energy consumption of wireless sensor networks deployed for data collection applications subject to a data precision constraint, called EMDP. We show that this problem is NP-complete, hence the need for heuristic solutions. As an approximate solution, we propose a dynamic clustering algorithm, called CORAD, which adapts the network topology to the dynamics of the sensed data in order to optimize energy consumption by exploiting the correlation that may exist between nodes. All of these methods were implemented and tested through simulations to demonstrate their effectiveness.
Wireless Sensor Networks are an emerging technology enabled by recent advances in Micro-Electro-Mechanical Systems, which led to the design of tiny wireless sensor nodes characterized by small capacities for sensing, data processing and communication. To accomplish complex tasks such as target tracking, data collection and zone surveillance, these nodes need to collaborate with each other to overcome their limited battery capacity. Since battery hardware develops very slowly, the optimization effort must focus on the software layers of the nodes' protocol stack and their operating systems. In this thesis, we investigated the energy problem in the context of collaborative applications and proposed an approach based on node selection using predictions and data correlations, to meet the application requirements in terms of energy efficiency and quality of data. First, we surveyed almost all the recent approaches proposed in the literature that treat the problem of energy efficiency of prediction-based target tracking schemes, in order to extract the relevant recommendations. Next, we proposed a dynamic clustering protocol based on an enhanced version of the Distributed Kalman Filter, used as a prediction algorithm, to design an energy-efficient target tracking scheme. Our proposed scheme uses these predictions to anticipate the actions of the nodes and their roles so as to minimize the number of nodes involved in the tasks. Based on our findings from the simulation data, we generalized our approach to any data collection scheme that uses a geographic-based clustering algorithm. We formulated the problem of energy minimization under data precision constraints as a binary integer linear program to find its exact solution in the general context. We validated the model and proved some of its fundamental properties. Finally, given the complexity of the problem, we proposed and evaluated a heuristic solution consisting of a correlation-based adaptive clustering algorithm for data collection. We showed that, by relaxing some constraints of the problem, our heuristic solution achieves an acceptable level of energy efficiency while preserving the quality of data.

48

Jin, Zhongnan. « Statistical Methods for Multivariate Functional Data Clustering, Recurrent Event Prediction, and Accelerated Degradation Data Analysis ». Diss., Virginia Tech, 2019. http://hdl.handle.net/10919/102628.

Abstract:

In this dissertation, we introduce three projects in machine learning and reliability applications after the general introduction in Chapter 1. The first project concentrates on multivariate sensory data, the second is related to a bivariate recurrent process, and the third introduces thermal index (TI) estimation for accelerated destructive degradation test (ADDT) data, for which an R package is developed. All three projects are related to, and can be used to solve, certain reliability problems. Specifically, in Chapter 2, we introduce a clustering method for multivariate functional data. To cluster the customized events extracted from multivariate functional data, we apply functional principal component analysis (FPCA) and use a model-based clustering method on a transformed matrix. A penalty term is imposed on the likelihood so that variable selection is performed automatically. In Chapter 3, we propose a covariate-adjusted model to predict the next event in a bivariate recurrent event system. Inspired by geyser eruptions in Yellowstone National Park, we consider two event types and model the relationship between their event gap times. External systematic conditions are taken into account in the model through covariates. The proposed covariate-adjusted recurrent process (CARP) model is applied to the Yellowstone National Park geyser data. In Chapter 4, we compare estimation methods for TI. In ADDT, TI is an important index indicating the reliability of materials when the accelerating variable is temperature. Three methods for TI estimation are introduced: the least-squares method, a parametric model and a semi-parametric model. An R package implements all three methods. Applications of the R functions are introduced in Chapter 5 with publicly available ADDT datasets. Chapter 6 includes conclusions and areas for future work.

49

Ghebre, Michael Abrha. « A statistical framework for modeling asthma and COPD biological heterogeneity, and a novel variable selection method for model-based clustering ». Thesis, University of Leicester, 2016. http://hdl.handle.net/2381/38488.

Abstract:

This thesis has two main parts. The first part is an application that focuses on identifying a statistical framework to model the biological heterogeneity of asthma and COPD using sputum cytokines. Clustering subjects using the actual cytokine measurements may not be straightforward, as these mediators have strong correlations, which are currently ignored by standard clustering techniques. Artificial data, which have patterns similar to the cytokines but with known class membership, are simulated. Several approaches, such as data reduction using factor analysis, were applied to the simulated data to identify suitable representatives of the variables to use as input to the clustering algorithm. In the simulation study, using "factor scores" (derived from factor analysis) as input variables for clustering outperformed the alternative approaches. Thus, this approach was applied to model the biological heterogeneity of asthma and COPD, and identified three stable and three exacerbation clusters, with different proportions of overlap between the diseases. The second part is a statistical methodology in which a new method for variable selection in model-based clustering is proposed. This method generalizes the approach of Raftery and Dean (2006, JASA 101, 168-178). It relaxes the global prior assumption of linear relationships between clustering-relevant and irrelevant variables by searching for latent structures among the variables, and accounts for nonlinear relationships between these variables by splitting the data into sub-samples. A Gaussian mixture model (with unconstrained variance-covariance matrices, fitted using the EM algorithm) is applied to identify the optimal clusters. The new method performed considerably better than the Raftery and Dean technique when applied to simulated and real datasets, and demonstrates that variable selection within clustering can substantially improve the identification of optimal clusters. However, at the moment it may not perform adequately in uncovering the optimal clusters in datasets with strong correlations, such as the sputum mediators.

50

Harz, Jonas. « Variablen-Verdichtung und Clustern von Big Data – Wie lassen sich die Free-Floating-Carsharing-Nutzer typisieren ? » Master's thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-210015.

Abstract:

In recent years, free-floating carsharing (carsharing without fixed stations) has spread rapidly worldwide. As a result, various studies have described its effects on transport. The users of free-floating carsharing systems, however, have so far been insufficiently studied. As part of TU Dresden's contribution to the evaluation report on carsharing in the state capital Munich, booking and customer data were available for all Munich carsharing providers. The aim of this thesis was to develop a typology of users for the two free-floating providers. To divide users into groups, input variables were first selected and generated. In addition to the temporal frequencies of use by month, weekday, and time slot, Gini factors were calculated to capture the regularity of use. Further variables were derived from the booking data: how many trips begin and/or end at the user's place of residence, whether trips begin and end at the same location, and for how many trips the providers' parking tariff is used. The thesis also examined how many trips start or end at the airport, how weather affects the number of bookings, and the mean travel time per booking for each user. All of these variables served as input for the user typology.
Cluster analysis was chosen as the method for the typology. Because 30 variables are too many for this, the input variables were first condensed using principal component analysis, which combines variables that are correlated to varying degrees while preserving their information content. From the 30 input variables, principal component analysis yielded four factors, which were then used for the cluster analysis. Each user can be placed in a four-dimensional coordinate system by means of the four factors, and clustering can then be carried out in this space. The k-means method was chosen for this thesis. It produced five clusters covering the 13,000 users. Each cluster can be interpreted through the means of the input variables as well as through socio-demographic variables such as age and gender and the users' places of residence. The five clusters can be divided into two with low usage intensity (nos. 1 and 2), one with medium intensity (no. 3), and two with high intensity (nos. 4 and 5). Cluster 1 comprises users who make trips rarely but spontaneously, with an above-average number of trips at weekends and in the evening. Cluster 2 contains users who mainly make trips with long travel times. Several journeys are made within a single booking, as shown by the high use of the parking tariff and by the fact that most trips end where they started. Among all groups, this one has an above-average share of women. Cluster 3 describes the typical user in terms of usage intensity and timing of use. With 41.4% of customers, it is the largest cluster. Clusters 4 and 5 comprise customers with high usage intensity. Although only about 5% of customers fall into these two groups, they account for a third of all trips. Cluster 4 describes users with typical commuter behaviour: trips are made mainly on working days and during peak hours. A decline in use from January to June suggests that other modes of transport, such as the bicycle, are being used. Cluster 5 contains customers who frequently use carsharing at night, which suggests trips to nightlife activities. Compared with the average, this cluster has the lowest share of women.
Because the results are based solely on provider data, it is not possible to make or assess concrete statements about the effects and impacts of free-floating carsharing. That would require additional data, for example from surveys. The clearly delineated and readily interpretable user groups show, however, that the chosen methodology is suitable for developing a typology of carsharing users. Repeating the procedure with other data, for example from a later study period or another city, is recommended.
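
The pipeline outlined in the abstract (Gini-based regularity measures, principal component analysis, then k-means) can be sketched roughly as follows. This is a minimal pure-Python illustration, not the thesis's implementation: the abstract does not say how the Gini factors were computed or which software was used, so the function names `gini`, `standardize`, `pca_scores`, and `kmeans`, the power-iteration PCA, and the random initialisation of the cluster centres are all assumptions made here.

```python
import math
import random


def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even use)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def standardize(rows):
    """Rescale every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; divide by 1.0 instead of 0.0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def pca_scores(rows, n_components, iters=500, seed=0):
    """Scores on the leading principal axes (power iteration with deflation).

    Assumes the rows are already centred, e.g. by standardize().
    """
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
           for i in range(d)]
    axes = []
    for _component in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to explain
                break
            v = [x / norm for x in w]
        cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = sum(a * b for a, b in zip(v, cv))  # variance along this axis
        # Deflation: remove the found axis so the next pass finds a new one.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
        axes.append(v)
    return [[sum(a * x for a, x in zip(axis, r)) for axis in axes]
            for r in rows]


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct points as starting centres
    labels = None
    for _step in range(iters):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Given a hypothetical table `rows` with one row per user and one column per input variable, `kmeans(pca_scores(standardize(rows), 4), 5)` would mirror the four-factor, five-cluster setup described in the abstract. A per-user call such as `gini(trips_per_weekday)` (again a hypothetical input) would give one possible regularity variable.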
