Dissertations on the topic "Module clustering"

To view other types of publications on this topic, follow the link: Module clustering.

Format your source in APA, MLA, Chicago, Harvard, and other styles

Choose a source type:

Consult the top 50 dissertations for your research on the topic "Module clustering".

Next to every item in the list of references there is an "Add to bibliography" button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scholarly publication as a .pdf file and read its abstract online, whenever these are available in the metadata.

Browse dissertations on a wide variety of disciplines and compile your bibliography correctly.

1

Ptitsyn, Andrey. "New algorithms for EST clustering." Thesis, University of the Western Cape, 2000. http://etd.uwc.ac.za/index.php?module=etd&.

Full text source
Abstract:
The expressed sequence tag database is a rich and fast-growing source of data for gene expression analysis and drug discovery. Clustering of raw EST data is a necessary step for further analysis and one of the most challenging problems of modern computational biology.
Styles: APA, Harvard, Vancouver, ISO, etc.
2

Passmoor, Sean Stuart. "Clustering studies of radio-selected galaxies." Thesis, University of the Western Cape, 2011. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_7521_1332410859.

Full text source
Abstract:

We investigate the clustering of HI-selected galaxies in the ALFALFA survey and compare the results with those obtained for HIPASS. Measurements of the angular correlation function and the inferred 3D clustering are compared with results from direct spatial-correlation measurements. We are able to measure clustering on smaller angular scales and for galaxies with lower HI masses than was previously possible. We calculate the expected clustering of dark matter using the redshift distributions of HIPASS and ALFALFA and show that the ALFALFA sample is somewhat more anti-biased with respect to dark matter than the HIPASS sample. We are able to confirm the validity of the dark matter correlation predictions by performing simulations of non-linear structure formation. Further, we examine how the bias evolves with redshift for radio galaxies detected in the first survey.
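To make the angular-correlation machinery concrete, here is a minimal sketch of the Landy-Szalay estimator w(theta) = (DD - 2DR + RR) / RR on small mock catalogues with brute-force pair counting. The coordinates, patch size, and bins are illustrative; a survey pipeline would use tree-based pair counting, exact pair normalizations, and survey masks.

```python
import numpy as np

def radec_to_vec(ra, dec):
    """Convert RA/Dec in degrees to unit vectors on the sphere."""
    ra, dec = np.radians(ra), np.radians(dec)
    return np.stack([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)], axis=1)

def pair_counts(a, b, bins):
    """Histogram of angular separations (degrees); brute force,
    so suitable only for small catalogues."""
    cosv = np.clip(a @ b.T, -1.0, 1.0)
    return np.histogram(np.degrees(np.arccos(cosv)).ravel(), bins=bins)[0]

rng = np.random.default_rng(0)
# Hypothetical mock catalogue: 400 'galaxies', 2000 randoms on one patch.
gal = radec_to_vec(rng.uniform(0, 30, 400), rng.uniform(-5, 5, 400))
ran = radec_to_vec(rng.uniform(0, 30, 2000), rng.uniform(-5, 5, 2000))

bins = np.logspace(-1, 1, 11)                    # 0.1 to 10 degrees
nd, nr = len(gal), len(ran)
dd = pair_counts(gal, gal, bins) / nd ** 2       # normalization is approximate
dr = pair_counts(gal, ran, bins) / (nd * nr)     # (self/double pairs ignored)
rr = pair_counts(ran, ran, bins) / nr ** 2

w_theta = (dd - 2 * dr + rr) / rr                # Landy-Szalay estimator
print(w_theta)
```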

Styles: APA, Harvard, Vancouver, ISO, etc.
3

Javar, Shima. "Measurement and comparison of clustering algorithms." Thesis, Växjö University, School of Mathematics and Systems Engineering, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:vxu:diva-1735.

Full text source
Abstract:

In this project, a number of different clustering algorithms are described and their workings explained. They are compared to each other by implementing them on a number of graphs with a known architecture.

These clustering algorithms, in the order they are implemented, are as follows: Nearest neighbour hillclimbing, Nearest neighbour big step hillclimbing, Best neighbour hillclimbing, Best neighbour big step hillclimbing, Gem 3D, K-means simple, K-means Gem 3D, One cluster and One cluster per node.

The graphs are Unconnected, Directed KX, Directed Cycle KX and Directed Cycle.

The results of these clusterings are compared with each other according to three criteria: time, quality and extremity of node distribution. This enables us to find out which algorithm is most suitable for which graph. These artificial graphs are then compared with the reference architecture graph to reach the conclusions.

Styles: APA, Harvard, Vancouver, ISO, etc.
4

Hu, Yang. "PV Module Performance Under Real-world Test Conditions - A Data Analytics Approach." Case Western Reserve University School of Graduate Studies / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=case1396615109.

Full text source
Styles: APA, Harvard, Vancouver, ISO, etc.
5

Riedl, Pavel. "Modul shlukové analýzy systému pro dolování z dat." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237095.

Full text source
Abstract:
This master's thesis deals with the development of a module for a data mining system being developed at FIT. The first part describes the general knowledge discovery process and cluster analysis, including cluster validation; it also describes Oracle Data Mining, including the algorithms it uses for clustering. Finally, it deals with the system itself and the technologies it uses, such as the NetBeans Platform and DMSL. The second part describes the design of a clustering module and of a module used to compare its results. It also deals with the visualization of cluster analysis results and presents the results achieved.
Styles: APA, Harvard, Vancouver, ISO, etc.
6

Handfield, Louis-François. "Cis-regulatory modules clustering from sequence similarity." Thesis, McGill University, 2007. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=112632.

Full text source
Abstract:
I present a method that groups cis-regulatory modules by shared sequence motifs. The goal of this approach is to search for clusters of modules that may share some function, using only sequence similarity. The proposed similarity measure is based on likelihood scoring of sequences under a variable-order Markov model. I also introduce an extension of the variable-order Markov model which could better perform the required task. I show that my method can recover subsets of sequences sharing a pattern in a set of generated sequences, and that the proposed approach is successful in finding groups of modules that share a type of transcription factor binding site.
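As a toy illustration of likelihood-based sequence scoring, here is a fixed-order Markov model with add-one smoothing standing in for the variable-order model of the thesis; the sequences and the order k = 2 are made up for the example.

```python
import numpy as np
from collections import defaultdict

def train_markov(seqs, k=2, alphabet="ACGT"):
    """Fit an order-k Markov model with add-one smoothing. The fixed
    order is a simplification; the thesis uses a variable-order model."""
    idx = {c: i for i, c in enumerate(alphabet)}
    counts = defaultdict(lambda: np.ones(len(alphabet)))   # Laplace prior
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][idx[s[i]]] += 1
    probs = {ctx: c / c.sum() for ctx, c in counts.items()}
    return probs, idx, k

def log_likelihood(model, seq):
    """Score a sequence under the trained model; unseen contexts fall
    back to a uniform distribution."""
    probs, idx, k = model
    ll = 0.0
    for i in range(k, len(seq)):
        p = probs.get(seq[i - k:i])
        ll += np.log(p[idx[seq[i]]]) if p is not None else np.log(1 / len(idx))
    return ll

# Toy cluster of module sequences and one candidate scored against it:
model = train_markov(["ACGTACGTAGCT", "ACGTTACGATCG"])
print(log_likelihood(model, "ACGTACGATCGT"))   # higher = more similar
```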
Styles: APA, Harvard, Vancouver, ISO, etc.
7

Wu, Jingwen. "Model-based clustering and model selection for binned data." Thesis, Supélec, 2014. http://www.theses.fr/2014SUPL0005/document.

Full text source
Abstract:
This thesis studies Gaussian mixture model-based clustering approaches and criteria for model selection in binned data clustering. Fourteen binned-EM algorithms and fourteen bin-EM-CEM algorithms are developed for fourteen parsimonious Gaussian mixture models. These new algorithms combine the advantages of binned data in terms of reduced computation time with the advantages of parsimonious Gaussian mixture models in terms of simplified parameter estimation. The complexities of the binned-EM and bin-EM-CEM algorithms are calculated and compared to the complexities of the EM and CEM algorithms, respectively. In order to select the right model, one that fits the data well and satisfies the clustering precision requirements within a reasonable computation time, the AIC, BIC, ICL, NEC and AWE criteria are extended to binned data clustering when the proposed binned-EM and bin-EM-CEM algorithms are used. The advantages of the different proposed methods are illustrated through experimental studies.
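A minimal sketch of EM on binned 1-D data, assuming only bin edges and counts are observed: the E-step uses exact bin probabilities from the normal CDF, while the M-step approximates within-bin expectations by bin midpoints — a simplification of the binned-EM algorithms developed in the thesis, and with none of the parsimonious covariance structures.

```python
import numpy as np
from scipy.stats import norm

def binned_em(edges, counts, K=2, iters=200, seed=0):
    """EM for a 1-D Gaussian mixture observed only through bin counts.
    Responsibilities use exact bin probabilities from the normal CDF;
    the M-step uses bin midpoints in place of within-bin expectations."""
    rng = np.random.default_rng(seed)
    mid = 0.5 * (edges[:-1] + edges[1:])
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(mid, K, replace=False)
    sig = np.full(K, (edges[-1] - edges[0]) / 4.0)
    for _ in range(iters):
        # E-step: P(bin b | component k), then responsibilities per bin.
        P = norm.cdf(edges[1:, None], mu, sig) - norm.cdf(edges[:-1, None], mu, sig)
        r = pi * np.clip(P, 1e-12, None)
        r /= r.sum(axis=1, keepdims=True)
        w = r * counts[:, None]          # expected counts per bin and component
        Nk = w.sum(axis=0)
        # M-step on the weighted midpoints.
        pi = Nk / Nk.sum()
        mu = (w * mid[:, None]).sum(axis=0) / Nk
        sig = np.sqrt((w * (mid[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-6
    return pi, mu, sig

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 2000), rng.normal(5, 1.5, 1000)])
edges = np.linspace(-6, 12, 37)
counts = np.histogram(x, edges)[0].astype(float)
print(binned_em(edges, counts))
```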
Styles: APA, Harvard, Vancouver, ISO, etc.
8

Sampson, Joshua Neil. "Clustering genes in genetical genomics /." Thesis, Connect to this title online; UW restricted, 2007. http://hdl.handle.net/1773/9549.

Full text source
Styles: APA, Harvard, Vancouver, ISO, etc.
9

Yelibi, Lionel. "Introduction to fast Super-Paramagnetic Clustering." Master's thesis, Faculty of Science, 2019. http://hdl.handle.net/11427/31332.

Full text source
Abstract:
We map stock market interactions to spin models to recover their hierarchical structure using a simulated annealing based Super-Paramagnetic Clustering (SPC) algorithm. This is directly compared to a modified implementation of a maximum likelihood approach to fast Super-Paramagnetic Clustering (f-SPC). The methods are first applied to standard toy test-case problems, and then to a dataset of 447 stocks traded on the New York Stock Exchange (NYSE) over 1249 days. The signal-to-noise ratio of stock market correlation matrices is briefly considered. Our results approximately recover clusters representative of standard economic sectors, as well as mixed clusters whose dynamics shed light on the adaptive nature of financial markets and raise concerns about the effectiveness of industry-based static financial market classification in the world of real-time data analytics. A key result is that the standard maximum likelihood methods are confirmed to converge to solutions within a Super-Paramagnetic (SP) phase. We use insights arising from this to discuss the implications of using a Maximum Entropy Principle (MEP) as opposed to the Maximum Likelihood Principle (MLP) as an optimization device for this class of problems.
Styles: APA, Harvard, Vancouver, ISO, etc.
10

Mair, Patrick, and Marcus Hudec. "Session Clustering Using Mixtures of Proportional Hazards Models." Department of Statistics and Mathematics, WU Vienna University of Economics and Business, 2008. http://epub.wu.ac.at/598/1/document.pdf.

Full text source
Abstract:
Emanating from classical Weibull mixture models, we propose a framework for clustering survival data with various proportionality restrictions imposed. By introducing mixtures of Weibull proportional hazards models on a multivariate data set, a parametric clustering approach based on the EM algorithm is carried out. The problem of non-response in the data is considered. The application example is a real-life data set stemming from the analysis of a world-wide operating eCommerce application. Sessions are clustered according to the dwell times a user spends on certain page areas. The solution allows for the interpretation of navigation behavior in terms of survival and hazard functions. A software implementation by means of an R package is provided. (author's abstract)
Series: Research Report Series / Department of Statistics and Mathematics
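A rough sketch of fitting a two-component Weibull mixture to dwell times by direct likelihood maximization; it ignores censoring (the paper's non-response problem) and the proportionality restrictions that are the point of the paper, and the dwell times are synthetic.

```python
import numpy as np
from scipy.optimize import minimize

def weibull_logpdf(t, shape, scale):
    z = t / scale
    return np.log(shape / scale) + (shape - 1) * np.log(z) - z ** shape

def neg_loglik(params, t):
    """Two-component Weibull mixture; weight via a logistic transform,
    shapes and scales via exp, so the search is unconstrained."""
    w = 1.0 / (1.0 + np.exp(-params[0]))
    k1, s1, k2, s2 = np.exp(params[1:])
    comp = np.stack([np.log(w) + weibull_logpdf(t, k1, s1),
                     np.log1p(-w) + weibull_logpdf(t, k2, s2)])
    m = comp.max(axis=0)
    return -np.sum(m + np.log(np.exp(comp - m).sum(axis=0)))  # log-sum-exp

rng = np.random.default_rng(1)
# Hypothetical dwell times (seconds) drawn from two Weibull regimes.
t = np.concatenate([rng.weibull(1.5, 600) * 10, rng.weibull(0.8, 400) * 120])
res = minimize(neg_loglik, x0=np.zeros(5), args=(t,), method="Nelder-Mead",
               options={"maxiter": 5000})
print(res.x)   # logit weight, then log shape/scale of each component
```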
Styles: APA, Harvard, Vancouver, ISO, etc.
11

Lu, Zhengdong. "Constrained clustering and cognitive decline detection /." Full text open access at:, 2008. http://content.ohsu.edu/u?/etd,650.

Full text source
Styles: APA, Harvard, Vancouver, ISO, etc.
12

Madsen, Christopher. "Clustering of the Stockholm County housing market." Thesis, KTH, Matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-252301.

Full text source
Abstract:
In this thesis, a clustering of the Stockholm County housing market has been performed using different clustering methods. Data have been derived and different geographical constraints have been used. DeSO areas (Demographic Statistical Areas), developed by SCB, have been used to divide the housing market into smaller regions for which the derived variables have been calculated. Hierarchical clustering methods, SKATER and Gaussian mixture models have been applied. Methods using different kinds of geographical constraints have also been applied in an attempt to create more geographically contiguous clusters. The different methods are then compared with respect to performance and stability. The best performing method is the Gaussian mixture model EII, also known as the K-means algorithm. The most stable method when applied to bootstrapped samples is the ClustGeo method.
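For orientation, the EII model (spherical, equal-volume components in the mclust naming) can be approximated with scikit-learn as below, side by side with K-means; the feature matrix is a random stand-in for the derived DeSO variables.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in for the derived DeSO-level variables (e.g. price, income, ...).
X = rng.normal(size=(500, 3))

# mclust's EII model means spherical, equal-volume components; sklearn's
# 'spherical' covariance type is the closest built-in approximation.
gmm = GaussianMixture(n_components=5, covariance_type="spherical",
                      random_state=0).fit(X)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(gmm.predict(X)[:10])
print(km.labels_[:10])
```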
Styles: APA, Harvard, Vancouver, ISO, etc.
13

Yan, Guohua. "Linear clustering with application to single nucleotide polymorphism genotyping." Thesis, University of British Columbia, 2008. http://hdl.handle.net/2429/958.

Full text source
Abstract:
Single nucleotide polymorphisms (SNPs) have become increasingly popular for a wide range of genetic studies. A high-throughput genotyping technology usually involves a statistical genotype calling algorithm. Most calling algorithms in the literature, using methods such as k-means and mixture models, rely on elliptical structures of the genotyping data; they may fail when the minor allele homozygous cluster is small or absent, or when the data have extreme tails or linear patterns. We propose an automatic genotype calling algorithm by further developing a linear grouping algorithm (Van Aelst et al., 2006). The proposed algorithm clusters unnormalized data points around lines rather than around centroids. In addition, we associate a quality value, the silhouette width, with each DNA sample and with each whole plate as well. This algorithm shows promise for genotyping data generated from TaqMan technology (Applied Biosystems). A key feature of the proposed algorithm is that it applies to unnormalized fluorescent signals when the TaqMan SNP assay is used. The algorithm could also potentially be adapted to other fluorescence-based SNP genotyping technologies such as the Invader assay. Motivated by the SNP genotyping problem, we propose a partial likelihood approach to linear clustering which explores potential linear clusters in a data set. Instead of fully modelling the data, we assume only that the signed orthogonal distance from each data point to a hyperplane is normally distributed. Its relationships with several existing clustering methods are discussed. Some existing methods for determining the number of components in a data set are adapted to this linear clustering setting. Several simulated and real data sets are analyzed for comparison and illustration purposes. We also investigate some asymptotic properties of the partial likelihood approach. A Bayesian version of this methodology is helpful if some clusters are sparse but there is strong prior information about their approximate locations or properties. We propose a Bayesian hierarchical approach which is particularly appropriate for identifying sparse linear clusters. We show that the sparse cluster in SNP genotyping data sets can be successfully identified after a careful specification of the prior distributions.
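A toy analogue of clustering around lines instead of centroids: points are assigned to the line with the smallest orthogonal distance, and each line is refit by PCA. This is a k-lines heuristic in the spirit of the linear grouping algorithm, not the partial likelihood or Bayesian methods of the thesis, and the data are synthetic.

```python
import numpy as np

def k_lines(X, K=2, iters=50, seed=0):
    """Cluster 2-D points around lines instead of centroids: each point
    joins the line with the smallest orthogonal distance, and each line
    is refit as the first principal direction of its cluster."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))
    for _ in range(iters):
        dist = np.full((len(X), K), np.inf)
        for k in range(K):
            P = X[labels == k]
            if len(P) < 2:
                continue
            c = P.mean(axis=0)
            d = np.linalg.svd(P - c, full_matrices=False)[2][0]  # line direction
            resid = (X - c) - np.outer((X - c) @ d, d)
            dist[:, k] = np.linalg.norm(resid, axis=1)
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels

rng = np.random.default_rng(2)
ts = rng.uniform(0, 1, (300, 1))
X = np.vstack([np.hstack([ts[:150], 2 * ts[:150]]),
               np.hstack([ts[150:], 1 - ts[150:]])]) + rng.normal(0, 0.02, (300, 2))
print(np.bincount(k_lines(X)))
```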
Styles: APA, Harvard, Vancouver, ISO, etc.
14

Corneli, Marco. "Dynamic stochastic block models, clustering and segmentation in dynamic graphs." Thesis, Paris 1, 2017. http://www.theses.fr/2017PA01E012/document.

Full text source
Abstract:
This thesis focuses on the statistical analysis of dynamic graphs, defined either in discrete or in continuous time. We introduce a new extension of the stochastic block model (SBM) for dynamic graphs. The proposed approach, called dSBM, adopts non-homogeneous Poisson processes to model the interaction times between pairs of nodes in dynamic graphs, either in discrete or continuous time. The intensity functions of the processes only depend on the node clusters, in a block modelling perspective. Moreover, all the intensity functions share some regularity properties on hidden time intervals that need to be estimated. A recent estimation algorithm for SBM, based on the greedy maximization of an exact criterion (exact ICL), is adopted for inference and model selection in dSBM. Moreover, an exact algorithm for change point detection in time series, the "pruned exact linear time" (PELT) method, is extended to deal with dynamic graph data modelled via dSBM. The approach we propose can be used for change point analysis in graph data. Finally, a further extension of dSBM is developed to analyse dynamic networks with textual edges (like social networks, for instance). In this context, the graph edges are associated with documents exchanged between the corresponding vertices. The textual content of the documents can provide additional information about the dynamic graph topological structure. The new model we propose is called the "dynamic stochastic topic block model" (dSTBM). Graphs are mathematical structures very suitable for modelling interactions between objects or actors of interest. Several real networks, such as communication networks, financial transaction networks, mobile telephone networks and social networks (Facebook, Linkedin, etc.), can be modelled via graphs. When observing a network, the time variable comes into play in two different ways: we can study the time dates at which the interactions occur and/or the interaction time spans. This thesis only focuses on the first time dimension, and each interaction is assumed to be instantaneous, for simplicity. Hence, the network evolution is given by the interaction time dates only. In this framework, graphs can be used in two different ways to model networks. Discrete time […] Continuous time […]. In this thesis both these perspectives are adopted, alternatively. We consider new unsupervised methods to cluster the vertices of a graph into groups of homogeneous connection profiles. In this manuscript, the node groups are assumed to be time invariant to avoid possible identifiability issues. Moreover, the approaches that we propose aim to detect structural changes in the way the node clusters interact with each other. The building block of this thesis is the stochastic block model (SBM), a probabilistic approach initially used in the social sciences. The standard SBM assumes that the nodes of a graph belong to hidden (disjoint) clusters and that the probability of observing an edge between two nodes only depends on their clusters. Since no further assumption is made on the connection probabilities, SBM is a very flexible model able to detect different network topologies (hubs, stars, communities, etc.).
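The homogeneous building block of such a model is short to state in code: with known cluster labels, the maximum likelihood Poisson intensity between clusters k and l is the event count divided by the number of ordered node pairs times the observation window. A minimal sketch on a made-up event stream follows; label inference, change point detection, and the textual extension are the hard parts the thesis addresses.

```python
import numpy as np

def block_intensities(events, z, T):
    """MLE of the Poisson intensity between each pair of node clusters,
    assuming homogeneity over [0, T] and known labels z.
    events: list of (i, j, t) directed interaction triples."""
    K = z.max() + 1
    counts = np.zeros((K, K))
    for i, j, _ in events:
        counts[z[i], z[j]] += 1
    sizes = np.bincount(z, minlength=K).astype(float)
    pairs = np.outer(sizes, sizes) - np.diag(sizes)   # ordered pairs, no loops
    return counts / (pairs * T)

# Hypothetical stream: 4 nodes in 2 clusters interacting over T = 10.
z = np.array([0, 0, 1, 1])
events = [(0, 1, 0.5), (1, 0, 2.2), (0, 2, 3.1), (2, 3, 7.9), (3, 2, 9.4)]
print(block_intensities(events, z, T=10.0))
```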
Styles: APA, Harvard, Vancouver, ISO, etc.
15

Mak, Brian Kan-Wing. "Towards a compact speech recognizer : subspace distribution clustering hidden Markov model /." Full text open access at:, 1998. http://content.ohsu.edu/u?/etd,215.

Full text source
Styles: APA, Harvard, Vancouver, ISO, etc.
16

Hajj, Hussein Rami. "Étude Raman des alliages (Ge,Si), (Zn,Be)Se et Zn(Se,S) via le modèle de percolation : agrégation vs. dispersion et phonon-polaritons." Thesis, Université de Lorraine, 2014. http://www.theses.fr/2014LORR0103/document.

Full text source
Abstract:
The ins and outs of the phenomenological percolation model (multi-mode per bond), developed by the team for the basic understanding of the Raman and infrared spectra of semiconductor alloys with zincblende (II-VI and III-V) and diamond (IV-IV) structures, are further explored in novel areas with the Ge1-xSix (diamond), Zn1-xBexSe (zincblende) and ZnSe1-xSx (zincblende) alloys. The version of the percolation model worked out for the GeSi diamond alloy (3 bonds, 6 modes/phonons), more refined than the current one for zincblende alloys (2 bonds, 3 phonons), is used as a model version to formalize, via the introduction of a relevant order parameter k, an intrinsic ability of the vibration spectra to 'measure' the nature of the alloy disorder, as to whether this reflects a random substitution or a trend towards local clustering or local anticlustering. The percolation-type Zn0.67Be0.33Se alloy is used as a model system to study, using an unconventional Raman setup corresponding to forward scattering, the dispersion of the transverse optic phonons on approaching tau, the centre of the Brillouin zone. At this limit such modes become equipped with a macroscopic electric field similar in every respect to that carried by a pure electromagnetic wave, namely a photon, and are then identified as phonon-polaritons. A specificity of the alloy-related phonon-polaritons, namely their reinforcement on approaching tau, unexplored so far, is further investigated experimentally with the Zn0.47Be0.53Se and ZnSe0.68S0.32 alloys, selected on purpose, and is indeed confirmed in the latter alloy. A recent infrared study of ZnSeS in the literature revealed a disconcerting multi-phonon pattern for its shorter bond species (Zn-S). We show that such a pattern can be explained within a generalized version of the percolation scheme, more sophisticated than the standard version, which takes into account the effect of the phonon dispersion in addition to the effect of the local strain. Besides, a refined study of the phonon-polariton regime related to the long Zn-Se bond reveals an unsuspected bimodal pattern, which echoes that earlier evidenced for the short (Zn-S) species. This establishes on an experimental basis that the percolation scheme (multi-phonon per bond) is generic and in principle applies to any bond species in an alloy. Last, we explore the behavior of the Zn-S doublet of ZnSeS at the approach of the zincblende->rocksalt (~14 GPa) transition by near-forward Raman scattering under pressure, i.e., in the phonon-polariton regime. The low-frequency Zn-S mode appears to weaken and converge onto the high-frequency Zn-S mode under pressure, as earlier observed for the Be-Se doublet of ZnBeSe in backscattering. Such behavior seems to be intrinsic to the percolation-type doublet for the considered structural phase transition. This would reflect a sensitivity to the local instabilities of the host bonds (Zn-Se) at the approach of their natural structural phase transition, characteristic of the related pure compound (ZnSe). The above-mentioned behaviors are discussed on the basis of a detailed contour modeling of the Raman spectra taken in backscattering (the usual geometry) and in forward scattering (depending then on the scattering angle) within the scope of the linear dielectric response. The assignment of the Raman modes is achieved via ab initio phonon calculations done within the SIESTA code using prototype impurity motifs.
The predictions of the percolation scheme concerning the k-dependence of the GeSi Raman spectra are confronted with direct ab initio calculations of the GeSi Raman spectra done in collaboration (with V.J.B. Torres) using the AIMPRO code on supercells covering a selection of representative k values.
Styles: APA, Harvard, Vancouver, ISO, etc.
17

Louw, Jan Paul. "Evidence of volatility clustering on the FTSE/JSE top 40 index." Thesis, Stellenbosch : Stellenbosch University, 2008. http://hdl.handle.net/10019.1/5039.

Full text source
Abstract:
Thesis (MBA (Business Management))--Stellenbosch University, 2008.
ENGLISH ABSTRACT: This research report investigated whether evidence of volatility clustering exists on the FTSE/JSE Top 40 Index. The presence of volatility clustering has practical implications for market decisions as well as for the accurate measurement and reliable forecasting of volatility. The research was conducted as an in-depth analysis of volatility, measured over five different return interval sizes covering the sample in non-overlapping periods. The volatility of each return interval size was analysed to reveal its distributional characteristics and whether it violated the normality assumption. The volatility was also analysed to identify in which way, if any, subsequent periods are correlated. For each of the interval sizes, one-step-ahead volatility forecasting was conducted using Linear Regression, Exponential Smoothing, GARCH(1,1) and EGARCH(1,1) models. The results were analysed using appropriate criteria to determine which of the forecasting models were more powerful. The forecasting models range from very simple to very complex; the rationale was to determine whether more complex models outperform simpler ones. The analysis showed that there was sufficient evidence to conclude that there is volatility clustering on the FTSE/JSE Top 40 Index. It further showed that more complex models such as GARCH(1,1) and EGARCH(1,1) only marginally outperformed less complex models and do not offer any real benefit over simpler models such as Linear Regression. This can be ascribed to the mean-reversion effect of volatility and gives further insight into the volatility structure over the sample period.
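A minimal sketch of the GARCH(1,1) step using the third-party Python package arch (assumed installed), on simulated heavy-tailed returns rather than index data; an estimated alpha[1] + beta[1] near one signals the volatility persistence the report tests for.

```python
import numpy as np
from arch import arch_model   # third-party 'arch' package, assumed installed

rng = np.random.default_rng(0)
# Hypothetical daily returns in percent; the report uses index returns.
r = rng.standard_t(df=6, size=2000) * 0.8

res = arch_model(r, mean="Constant", vol="GARCH", p=1, q=1).fit(disp="off")
print(res.params)   # omega, alpha[1], beta[1]
# alpha[1] + beta[1] near 1 indicates persistent volatility (clustering);
# one-step-ahead forecast of the conditional variance:
print(res.forecast(horizon=1).variance.iloc[-1])
```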
Styles: APA, Harvard, Vancouver, ISO, etc.
18

Masmoudi, Nesrine. "Modèle bio-inspiré pour le clustering de graphes : application à la fouille de données et à la distribution de simulations." Thesis, Normandie, 2017. http://www.theses.fr/2017NORMLH26/document.

Full text source
Abstract:
In this work, we present a novel method inspired by the behavior of real ants for solving the unsupervised, non-hierarchical classification problem. The approach creates data groups dynamically. It is based on the concept of artificial ants that move simultaneously in a complex manner while following simple local rules. Each ant represents one data item in the algorithm. The movements of the ants aim to create homogeneous groups of data that evolve together in a graph structure. We also propose a method for the incremental construction of neighborhood graphs by artificial ants. We propose two methods belonging to the family of biomimetic algorithms; they are hybrid in the sense that the initial number of classes is obtained with the classical K-means algorithm, which is used to initialize the first partition and the graph structure.
Styles: APA, Harvard, Vancouver, ISO, etc.
19

Revillon, Guillaume. "Uncertainty in radar emitter classification and clustering." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLS098/document.

Full text source
Abstract:
In Electronic Warfare, the identification of radar signals is a major asset for decision making in military tactical situations. By providing information about the presence of threats, classification and clustering of radar signals play a significant role in ensuring that countermeasures against enemies are well chosen and in enabling the detection of unknown radar signals to update databases. Most of the time, Electronic Support Measures systems receive mixtures of signals from different radar emitters in the electromagnetic environment. Hence a radar signal, described by a pulse-to-pulse modulation pattern, is often partially observed due to missing measurements and measurement errors. The identification process relies on statistical analysis of basic measurable parameters of a radar signal, which constitute both quantitative and qualitative data. Many general and practical approaches based on data fusion and machine learning have been developed; they traditionally proceed through feature extraction, dimensionality reduction and classification or clustering. However, these algorithms cannot handle missing data, and imputation methods are required to generate data in order to use them. Hence, the main objective of this work is to define a classification/clustering framework that handles both outliers and missing values for any type of data. Here, an approach based on mixture models is developed, since mixture models provide a mathematically grounded, flexible and meaningful framework for the wide variety of classification and clustering requirements. The proposed approach focuses on the introduction of latent variables that make it possible to handle the sensitivity of the model to outliers and to allow a less restrictive modelling of missing data. A Bayesian treatment is adopted for model learning, supervised classification and clustering, and inference is carried out through a variational Bayesian approximation, since the joint posterior distribution of latent variables and parameters is intractable. Numerical experiments on synthetic and real data show that the proposed method provides more accurate results than standard algorithms.
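A simplified illustration of mixture-model clustering with missing values: EM for a diagonal-covariance Gaussian mixture in which each point is scored only on its observed coordinates. This is a crude stand-in for the thesis's variational Bayesian treatment of outliers and missingness; the data and the missingness rate are synthetic.

```python
import numpy as np

def em_missing(X, K=2, iters=100, seed=0):
    """EM for a diagonal-covariance Gaussian mixture where X may contain
    NaNs; each point is scored on its observed coordinates only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    obs = ~np.isnan(X)
    Xf = np.nan_to_num(X)
    mu = Xf[rng.choice(n, K, replace=False)]
    var = np.ones((K, d))
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: log N(x_obs | mu_k, var_k) summed over observed dims.
        logp = np.zeros((n, K))
        for k in range(K):
            q = -0.5 * (np.log(2 * np.pi * var[k]) + (Xf - mu[k]) ** 2 / var[k])
            logp[:, k] = np.log(pi[k]) + (q * obs).sum(axis=1)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted statistics over observed entries only.
        for k in range(K):
            w = r[:, k][:, None] * obs
            mu[k] = (w * Xf).sum(axis=0) / w.sum(axis=0)
            var[k] = (w * (Xf - mu[k]) ** 2).sum(axis=0) / w.sum(axis=0) + 1e-6
        pi = r.mean(axis=0)
    return pi, mu, var

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (300, 4)), rng.normal(4, 1, (300, 4))])
X[rng.random(X.shape) < 0.15] = np.nan   # 15% missing completely at random
print(em_missing(X)[1])   # recovered component means
```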
Styles: APA, Harvard, Vancouver, ISO, etc.
20

Laclau, Charlotte. "Hard and fuzzy block clustering algorithms for high dimensional data." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB014.

Full text source
Abstract:
With the increasing amount of data available, unsupervised learning has become an important tool used to discover underlying patterns without the need to label instances manually. Among the different approaches proposed to tackle this problem, clustering is arguably the most popular one. Clustering is usually based on the assumption that each group, also called a cluster, is distributed around a center defined in terms of all features, while in some real-world applications dealing with high-dimensional data this assumption may be false. To this end, co-clustering algorithms were proposed to describe clusters by the subsets of features that are most relevant to them. The obtained latent structure of the data is composed of blocks usually called co-clusters. In the first two chapters, we describe two co-clustering methods that proceed by differentiating the relevance of features with respect to their capability of revealing the latent structure of the data, in both a probabilistic and a distance-based framework. The probabilistic approach uses the mixture model framework, where the irrelevant features are assumed to have a probability distribution that is independent of the co-clustering structure. The distance-based (also called metric-based) approach relies on an adaptive metric in which each variable is assigned a weight that defines its contribution to the resulting co-clustering. From the theoretical point of view, we show the global convergence of the proposed algorithms using Zangwill's convergence theorem. In the last two chapters, we consider a special case of co-clustering where, contrary to the original setting, each subset of instances is described by a unique subset of features, resulting in a diagonal structure of the initial data matrix. As for the first two contributions, we consider both probabilistic and metric-based approaches. The main idea of the proposed contributions is to impose two different kinds of constraints: (1) we fix the number of row clusters to the number of column clusters; (2) we seek a structure of the original data matrix that has the maximum values on its diagonal (for instance, for binary data, we look for diagonal blocks composed of ones, with zeros outside the main diagonal). The proposed approaches enjoy the convergence guarantees derived from the results of the previous chapters. Finally, we present both hard and fuzzy versions of the proposed algorithms. We evaluate our contributions on a wide variety of synthetic and real-world benchmark binary and continuous data sets related to text mining applications and analyze the advantages and drawbacks of each approach. To conclude, we believe that this thesis explicitly covers a vast majority of possible scenarios arising in hard and fuzzy co-clustering and can be seen as a generalization of some popular biclustering approaches.
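As a point of reference for the block-structured setting, here is a standard biclustering baseline from scikit-learn on synthetic data with planted blocks; it assigns every row and column to a block but is not the probabilistic or metric-based algorithm proposed in the thesis.

```python
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters

# Synthetic matrix with four planted biclusters, then a spectral
# co-clustering fit that assigns every row and column to one block.
data, _, _ = make_biclusters(shape=(120, 80), n_clusters=4,
                             noise=5, random_state=0)
model = SpectralCoclustering(n_clusters=4, random_state=0).fit(data)
print(model.row_labels_[:10])
print(model.column_labels_[:10])
```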
Styles: APA, Harvard, Vancouver, ISO, etc.
21

Zeng, Jingying. "Latent Factor Models for Recommender Systems and Market Segmentation Through Clustering." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1491255524283942.

Full text source
Styles: APA, Harvard, Vancouver, ISO, etc.
22

Bartcus, Marius. "Bayesian non-parametric parsimonious mixtures for model-based clustering." Thesis, Toulon, 2015. http://www.theses.fr/2015TOUL0010/document.

Full text source
Abstract:
This thesis focuses on statistical learning and multi-dimensional data analysis. It particularly focuses on unsupervised learning of generative models for model-based clustering. We study Gaussian mixture models, both in the context of maximum likelihood estimation via the EM algorithm and in the Bayesian context of maximum a posteriori estimation via Markov Chain Monte Carlo (MCMC) sampling techniques. We mainly consider parsimonious mixture models, which are based on a spectral decomposition of the covariance matrix and provide a flexible framework, particularly for the analysis of high-dimensional data. Then, we investigate non-parametric Bayesian mixtures, which are based on general flexible processes such as the Dirichlet process and the Chinese Restaurant Process. This non-parametric model formulation is relevant both for learning the model and for dealing with the issue of model selection. We propose new Bayesian non-parametric parsimonious mixtures and derive an MCMC sampling technique where the mixture model and the number of mixture components are simultaneously learned from the data. The selection of the model structure is performed using Bayes factors. These models, by their non-parametric and sparse formulation, are useful for the analysis of large data sets when the number of classes is undetermined and increases with the data, and when the dimension is high. The models are validated on simulated data and standard real data sets. Then, they are applied to a difficult real problem of automatic structuring of complex bioacoustic data issued from whale song signals. Finally, we open Markovian perspectives via hierarchical Dirichlet process hidden Markov models.
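The Chinese Restaurant Process mentioned above is short to simulate, which shows why the number of components need not be fixed in advance; a minimal sketch (alpha and n are arbitrary):

```python
import numpy as np

def crp(n, alpha, seed=0):
    """Sample a partition of n items from the Chinese Restaurant Process:
    item i joins an existing cluster with probability proportional to the
    cluster size, or opens a new cluster with probability prop. to alpha."""
    rng = np.random.default_rng(seed)
    labels = [0]
    for _ in range(1, n):
        sizes = np.bincount(labels)
        p = np.append(sizes, alpha).astype(float)
        labels.append(int(rng.choice(len(p), p=p / p.sum())))
    return np.array(labels)

# The number of occupied clusters grows with n (roughly alpha * log n),
# which is why such priors suit problems with an unknown class count.
print(np.bincount(crp(1000, alpha=2.0)))
```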
Styles: APA, Harvard, Vancouver, ISO, etc.
23

Monteiro, Carla Claudia da Rocha Rego. "Bi-clustering de Dados Genéticos Binários Baseado em Modelos de Classificação Logística." Universidade Federal de Pernambuco, 2009. https://repositorio.ufpe.br/handle/123456789/6991.

Full text source
Abstract:
Protein interaction information is fundamental to understanding cellular processes. For this reason, several approaches have been proposed to infer protein-pair networks from all kinds of biological data. In this thesis we propose a bi-clustering method, Lbic, based on a logistic classification model, for analyzing binary biological data. Lbic is compared with two other bi-clustering methods from the literature, showing better results. Its performance is also compared with that of a supervised method, kernel canonical correlation analysis, applied to the same data sets. The results show that Lbic outperforms the supervised approach trained with up to 25% knowledge of the target network.
Styles: APA, Harvard, Vancouver, ISO, etc.
24

Bechchi, Mounir. "Clustering-based Approximate Answering of Query Result in Large and Distributed Databases." Phd thesis, Université de Nantes, 2009. http://tel.archives-ouvertes.fr/tel-00475917.

Full text source
Abstract:
Database users face the problem of information overload when querying their data, which takes the form of too many answers to exploratory queries. To address this problem, we propose an efficient and fast algorithm, called ESRA (Explore-Select-Rearrange Algorithm), which uses SAINTETIQ summaries pre-computed over the whole data set to group the answers to a user query into a set of hierarchically organized classes (or summaries). Each class describes a subset of results whose properties are similar. The user can then explore the hierarchy to locate the data of interest and discard the rest. Experimental results show that the ESRA algorithm is efficient and produces well-formed classes (i.e., their number remains small and they are well separated). However, the SAINTETIQ model used by the ESRA algorithm requires the data to be available on the summary server. This assumption makes the ESRA algorithm inapplicable in distributed environments, where it is often impossible or undesirable to gather all the data on a single site. To address this problem, we propose a collection of algorithms that merge two summaries generated locally and autonomously on two different sites into a single summary of the whole distributed data set, without accessing the original data. Experimental results show that these algorithms perform as well as the centralized approach (i.e., SAINTETIQ applied to the data gathered on a single site) and produce hierarchies very similar in structure and quality to those produced by the centralized approach.
Styles: APA, Harvard, Vancouver, ISO, etc.
25

Talavera, Edwin Rafael Villanueva. "Métodos Bayesianos aplicados em taxonomia molecular." Universidade de São Paulo, 2007. http://www.teses.usp.br/teses/disponiveis/18/18152/tde-03102007-105125/.

Full text source
Abstract:
In this work, two clustering methods intended for application in molecular taxonomy are presented. These methods are based on probabilistic models, which overcome some problems observed in traditional clustering methods, such as the difficulty of choosing a distance metric and the lack of treatment of available prior information. The proposed methods use Bayes' theorem to combine the information in the data with the available prior information, which is why they are called Bayesian methods. The first method implemented in this work is hierarchical Bayesian clustering, an agglomerative hierarchical method that constructs a hierarchy of partitions (dendrogram) guided by the criterion of maximum Bayesian posterior probability of the partition. The second method is based on a type of probabilistic graphical model known as a conditional Gaussian network, which was adapted for data clustering. Both methods were validated on three data sets where the labels are known. The methods were also used on a real problem: the clustering of a Brazilian collection of bacterial strains belonging to the genus Bradyrhizobium, known for their capacity to transform the nitrogen ('N IND.2') of the atmosphere into nitrogen compounds useful for the host plants. This data set consists of genetic data resulting from the analysis of ribosomal RNA. The results show that the hierarchical Bayesian clustering method builds dendrograms of good quality, in some cases better than the best of the hierarchical algorithms analyzed. The method based on conditional Gaussian networks also showed acceptable results, making adequate use of the prior information about the classes both in determining the optimal number of groups and in improving the quality of the clusters.
Стилі APA, Harvard, Vancouver, ISO та ін.
26

Freudenberg, Johannes M. "Bayesian Infinite Mixture Models for Gene Clustering and Simultaneous Context Selection Using High-Throughput Gene Expression Data." University of Cincinnati / OhioLINK, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1258660232.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
27

KC, Rabi. "Study of Some Biologically Relevant Dynamical System Models: (In)stability Regions of Cyclic Solutions in Cell Cycle Population Structure Model Under Negative Feedback and Random Connectivities in Multitype Neuronal Network Models." Ohio University / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou16049254273607.

Повний текст джерела
Стилі APA, Harvard, Vancouver, ISO та ін.
28

Faria, Rodrigo Augusto Dias. "Human skin segmentation using correlation rules on dynamic color clustering." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-01102018-101814/.

Повний текст джерела
Анотація:
Human skin is made of a stack of different layers, each of which reflects a portion of impinging light after absorbing a certain amount of it through the pigments lying in that layer. The main pigments responsible for the origin of skin color are melanin and hemoglobin. Skin segmentation plays an important role in a wide range of image processing and computer vision applications. In short, there are three major approaches to skin segmentation: rule-based, machine learning and hybrid. They differ in terms of accuracy and computational efficiency. Generally, machine learning and hybrid approaches outperform rule-based methods but require a large and representative training dataset and, sometimes, costly classification time as well, which can be a deal breaker for real-time applications. In this work, we propose an improvement, in three distinct versions, of a novel method for rule-based skin segmentation that works in the YCbCr color space. Our motivation is based on the hypotheses that (1) the original rule can be complemented and (2) human skin pixels do not appear isolated, i.e. neighborhood operations should be taken into consideration. The method is a combination of correlation rules based on these hypotheses. Such rules evaluate combinations of chrominance (Cb, Cr) values to identify skin pixels depending on the shape and size of dynamically generated skin color clusters. The method is very efficient in terms of computational effort as well as robust on very complex images.
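To make the rule-based pipeline concrete, here is a hedged sketch of a fixed chrominance box rule with a neighborhood vote. The thesis's actual correlation rules use dynamically generated clusters; the Cb/Cr bounds below are common literature values, not the author's.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Convert an (H, W, 3) RGB image to YCbCr channels (ITU-R BT.601)."""
    img = img.astype(np.float64)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def skin_mask(img, cb_lo=77, cb_hi=127, cr_lo=133, cr_hi=173):
    """Illustrative rule: a pixel is 'skin' if (Cb, Cr) falls inside a fixed
    rectangle; the bounds are classic literature values, used here only as a
    stand-in for the thesis's dynamic cluster-shaped rules."""
    _, cb, cr = rgb_to_ycbcr(img)
    mask = (cb >= cb_lo) & (cb <= cb_hi) & (cr >= cr_lo) & (cr <= cr_hi)
    # Neighborhood hypothesis: skin pixels rarely appear isolated, so keep
    # only pixels whose 3x3 neighborhood is mostly skin (majority vote).
    padded = np.pad(mask.astype(int), 1)
    votes = sum(padded[i:i + mask.shape[0], j:j + mask.shape[1]]
                for i in range(3) for j in range(3))
    return votes >= 5
```

The 3x3 vote implements hypothesis (2) directly: an isolated positive pixel is rejected unless at least five of its nine neighbors (itself included) also pass the chrominance rule.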
Стилі APA, Harvard, Vancouver, ISO та ін.
29

Steinberg, Daniel. "An Unsupervised Approach to Modelling Visual Data." Thesis, The University of Sydney, 2013. http://hdl.handle.net/2123/9415.

Повний текст джерела
Анотація:
For very large visual datasets, producing expert ground-truth data for training supervised algorithms can represent a substantial human effort. In these situations there is scope for the use of unsupervised approaches that can model collections of images and automatically summarise their content. The primary motivation for this thesis comes from the problem of labelling large visual datasets of the seafloor obtained by an Autonomous Underwater Vehicle (AUV) for ecological analysis. It is expensive to label this data, as taxonomical experts for the specific region are required, whereas automatically generated summaries can be used to focus the efforts of experts, and inform decisions on additional sampling. The contributions in this thesis arise from modelling this visual data in entirely unsupervised ways to obtain comprehensive visual summaries. Firstly, popular unsupervised image feature learning approaches are adapted to work with large datasets and unsupervised clustering algorithms. Next, using Bayesian models the performance of rudimentary scene clustering is boosted by sharing clusters between multiple related datasets, such as regular photo albums or AUV surveys. These Bayesian scene clustering models are extended to simultaneously cluster sub-image segments to form unsupervised notions of “objects” within scenes. The frequency distribution of these objects within scenes is used as the scene descriptor for simultaneous scene clustering. Finally, this simultaneous clustering model is extended to make use of whole image descriptors, which encode rudimentary spatial information, as well as object frequency distributions to describe scenes. This is achieved by unifying the previously presented Bayesian clustering models, and in so doing rectifies some of their weaknesses and limitations. Hence, the final contribution of this thesis is a practical unsupervised algorithm for modelling images from the super-pixel to album levels, and is applicable to large datasets.
Стилі APA, Harvard, Vancouver, ISO та ін.
30

Borke, Lukas. "Dynamic Clustering and Visualization of Smart Data via D3-3D-LSA." Doctoral thesis, Humboldt-Universität zu Berlin, 2017. http://dx.doi.org/10.18452/18307.

Повний текст джерела
Анотація:
With the growing popularity of GitHub, the largest host of source code and collaboration platform in the world, it has evolved into a Big Data resource offering a variety of Open Source repositories (OSR). At present, there are more than one million organizations on GitHub, among them Google, Facebook, Twitter, Yahoo, CRAN, RStudio, D3, Plotly and many more. GitHub provides an extensive REST API, which enables scientists to retrieve valuable information about the software and research development life cycles. Our research pursues two main objectives: (I) provide an automatic OSR categorization system for data science teams and software developers, promoting discoverability, technology transfer and coexistence; (II) establish visual data exploration and topic-driven navigation of GitHub organizations for collaborative reproducible research and web deployment. To transform Big Data into value, in other words into Smart Data, storing and processing the data semantics and metadata is essential. Further, the choice of an adequate text mining (TM) model is important. The dynamic calibration of metadata configurations, TM models (VSM, GVSM, LSA), clustering methods and clustering quality indices is abbreviated as "smart clusterization". Data-Driven Documents (D3) and Three.js (3D) are JavaScript libraries for producing dynamic, interactive data visualizations, featuring hardware acceleration for rendering complex 2D or 3D computer animations of large data sets. Both techniques enable visual data mining (VDM) in web browsers and are abbreviated as D3-3D. Latent Semantic Analysis (LSA) measures semantic information through co-occurrence analysis in the text corpus. Its properties and applicability for Big Data analytics are demonstrated. "Smart clusterization" combined with the dynamic VDM capabilities of D3-3D is summarized under the term "Dynamic Clustering and Visualization of Smart Data via D3-3D-LSA".
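As an illustration of the VSM → LSA → clustering pipeline the abstract describes, a compact sketch with scikit-learn follows; the toy corpus, component count and quality index are assumptions for demonstration only, not the thesis's calibration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

docs = ["plot interactive d3 visualization",
        "render 3d scene with three.js",
        "estimate garch volatility model",
        "fit quantile regression in R"]        # toy stand-in for repository texts

tfidf = TfidfVectorizer().fit_transform(docs)            # VSM term weights
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)  # LSA: low-rank semantic space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lsa)
print(labels, silhouette_score(lsa, labels))             # one simple quality index
```

"Smart clusterization" in the thesis amounts to searching over such configurations (TM model, cluster count, quality index) rather than fixing them in advance as this sketch does.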
Стилі APA, Harvard, Vancouver, ISO та ін.
31

Hlosta, Martin. "Modul pro shlukovou analýzu systému pro dolování z dat." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237158.

Повний текст джерела
Анотація:
This thesis deals with the design and implementation of a cluster analysis module for DataMiner, a data mining system under development at FIT BUT. Until now, the system lacked a cluster analysis module, and the main objective of the thesis was therefore to extend the system with such a module. Pavel Riedl and I worked on the module together: we created a common part for all the algorithms so that the system can easily be extended with other clustering algorithms. In the second part, I extended the clustering module with three density-based clustering algorithms: DBSCAN, OPTICS and DENCLUE. The algorithms were implemented, and appropriate sample data were chosen to verify their functionality.
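For readers unfamiliar with the density-based family named above, a minimal DBSCAN run (via scikit-learn, not the DataMiner module itself) illustrates the eps/min_samples parameters and the special noise label:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away outlier; DBSCAN labels noise as -1.
X = np.concatenate([np.random.RandomState(0).normal(0, 0.2, (50, 2)),
                    np.random.RandomState(1).normal(4, 0.2, (50, 2)),
                    [[10.0, 10.0]]])
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # e.g. {0, 1, -1}: two clusters and noise
```

OPTICS relaxes the single eps into a reachability ordering, and DENCLUE replaces the neighborhood counts with a kernel density estimate; both share the same density intuition shown here.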
Стилі APA, Harvard, Vancouver, ISO та ін.
32

Pešout, Pavel. "Přístupy k shlukování funkčních dat." Doctoral thesis, Vysoká škola ekonomická v Praze, 2007. http://www.nusl.cz/ntk/nusl-77066.

Повний текст джерела
Анотація:
Classification is a very common task in information processing and an important problem in many branches of science and industry. In the case of data measured as functions of a dependent variable such as time, the most commonly used algorithms may not capture the shape of each individual curve properly, because they consider only the chosen measurement points. For this reason, this thesis focuses on techniques that directly address the curve clustering problem and the classification of new individuals. The main goals of this work are to develop alternative methodologies by extending various statistical approaches, to consolidate already established algorithms, to present modified forms fitted to the demands of the clustering problem, and to compare several efficient curve clustering methods through extensive experiments on simulated data. Last but not least, a comprehensive comparison of their practical utility is carried out on the executed experiments. The proposed clustering algorithms are based on two principles. Firstly, it is presumed that the set of trajectories may be probabilistically modelled as sequences of points generated from a finite mixture model consisting of regression components; hence density-based clustering methods using maximum likelihood estimation are investigated to recognize the most homogeneous partitioning. Attention is paid both to the maximum likelihood approach, which treats the cluster memberships as model parameters, and to the probabilistic model with the iterative Expectation-Maximization algorithm, which treats them as random variables. To deal with the hidden-data problem, both Gaussian and less conventional gamma mixtures are considered and arranged for use in two dimensions. To cope with data exhibiting high variability within each subpopulation, a two-level random-effects regression mixture is introduced, with the ability to let an individual deviate from the template of its group. Secondly, the well-known K-means algorithm is applied to the estimated regression coefficients. Particular attention is devoted to optimal data fitting, because K-means is not invariant to linear transformations; to overcome this problem, it is suggested to integrate the clustering task with Markov Chain Monte Carlo approaches. Furthermore, this thesis is concerned with functional discriminant analysis, including linear and quadratic scores and their modified probabilistic forms using random mixtures. As with K-means, it is shown how to apply Fisher's method of canonical scores to the regression coefficients. Experiments on simulated datasets demonstrate the performance of all the mentioned methods and make it possible to choose those with the best accuracy and time efficiency. A considerable bonus is the formulation of new recommended application procedures. The implementation is carried out in Mathematica 4.0. Finally, the possibilities offered by the development of curve clustering algorithms in broad research areas of modern science are examined, such as neurology, genome studies, and speech and image recognition systems; future investigation incorporating ubiquitous computing is also envisaged. The utility in economics is illustrated with an application to claims analysis of life insurance products. The goals of the thesis have been achieved.
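The second principle, K-means applied to estimated regression coefficients, can be sketched in a few lines; the polynomial basis and the toy curves below are illustrative assumptions (the thesis works with richer regression mixtures).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
# Toy functional data: two groups of noisy curves with different shapes.
curves = np.vstack([np.sin(2 * np.pi * t) + rng.normal(0, .1, 50) for _ in range(10)] +
                   [2 * t ** 2 + rng.normal(0, .1, 50) for _ in range(10)])

# Represent each curve by its polynomial regression coefficients ...
coefs = np.array([np.polyfit(t, y, deg=3) for y in curves])
# ... and cluster the coefficient vectors instead of the raw measurements.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coefs)
print(labels)
```

This two-step construction is exactly where the non-invariance issue arises: rescaling the basis rescales the coefficients, which changes the K-means solution, motivating the MCMC integration proposed in the thesis.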
Стилі APA, Harvard, Vancouver, ISO та ін.
33

Korger, Christina. "Clustering of Distributed Word Representations and its Applicability for Enterprise Search." Master's thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-208869.

Повний текст джерела
Анотація:
Machine learning of distributed word representations with neural embeddings is a state-of-the-art approach to modelling semantic relationships hidden in natural language. The thesis “Clustering of Distributed Word Representations and its Applicability for Enterprise Search” covers different aspects of how such a model can be applied to knowledge management in enterprises. A review of distributed word representations and related language modelling techniques, combined with an overview of applicable clustering algorithms, constitutes the basis for practical studies. The latter have two goals: first, they examine the quality of German embedding models trained with gensim and a selected choice of parameter configurations. Second, clusterings conducted on the resulting word representations are evaluated against the objective of retrieving immediate semantic relations for a given term. The application of the final results to company-wide knowledge management is subsequently outlined by the example of the platform intergator and conceptual extensions.
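A hedged sketch of the studied pipeline follows: training a small Word2Vec model with gensim and clustering the resulting vectors. The toy corpus and parameter values are placeholders, and the parameter names follow the gensim 4.x API rather than the exact configurations evaluated in the thesis.

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [["enterprise", "search", "index"],
             ["semantic", "search", "query"],
             ["neural", "embedding", "vector"],
             ["word", "embedding", "vector"]]   # toy corpus

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 epochs=50, seed=1, workers=1)   # workers=1 for reproducibility
words = list(model.wv.index_to_key)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(model.wv[words])
print(dict(zip(words, labels)))
```

For the retrieval objective described above, each cluster can then be treated as a set of candidate semantic relations for any of its member terms.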
Стилі APA, Harvard, Vancouver, ISO та ін.
34

Sandoval, Arenas Santiago. "Revisiting stormwater quality conceptual models in a large urban catchment : Online measurements, uncertainties in data and models." Thesis, Lyon, 2017. http://www.theses.fr/2017LYSEI089/document.

Повний текст джерела
Анотація:
Total Suspended Solids (TSS) stormwater models in urban drainage systems are often required for scientific, legal, environmental and operational reasons. However, traditional TSS stormwater model structures have been widely questioned, especially when reproducing data from online measurements at the outlet of large urban catchments. In this thesis, three potential limitations of traditional TSS stormwater models are analyzed in a 185 ha urban catchment (Chassieu, Lyon, France) by means of 365 rainfall events monitored online: (a) uncertainties in TSS data due to field conditions; (b) uncertainties in hydrological models and rainfall measurements; and (c) uncertainties in the stormwater quality model structures. These aspects are investigated in six separate contributions, whose principal results can be summarized as follows. (a) TSS data acquisition and validation: (i) four sampling strategies during rainfall events are simulated and evaluated against online TSS and flow rate measurements; recommended sampling time intervals are 5 min, with average sampling errors between 7 % and 20 % and uncertainties in sampling errors of about 5 %, depending on the sampling interval; (ii) the probability of underestimating the cross-section mean TSS concentration is estimated by two methodologies, one of which yields more realistic TSS underestimations (about 39 %) than the other (about 269 %). (b) Hydrological models and rainfall measurements: (iii) a parameter estimation strategy is proposed for a conceptual rainfall-runoff model by analyzing the variability of the optimal parameters obtained from single-event Bayesian calibrations, based on cluster and graph representations; the new strategy performs better in terms of accuracy and precision in validation; (iv) a methodology is proposed to calculate "mean" areal rainfall estimates, based on the same hydrological model and flow rate data; rainfall estimations using multiplying factors over constant-length time windows, with zero rainfall records filled by a reverse model, show the most satisfactory results compared to other rainfall estimation models. (c) Stormwater TSS pollutograph modelling: (v) the modelling performance of the traditional Rating Curve (RC) model is superior to that of different linear Transfer Function models (TFs), especially in terms of parsimony and precision of the simulations; no relation could be established between the rainfall corrections or hydrological conditions defined in (iii) and (iv) and the performance of RC and TFs, and statistical tests strengthen the finding that the occurrence in time of events not representable by the RC model is independent of antecedent dry weather conditions; (vi) a Bayesian reconstruction method for virtual state variables indicates that potential missing processes in the RC description are hardly interpretable as a unique virtual state of available mass over the catchment decreasing over time, as assumed by a great number of traditional models.
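For contribution (v), the Rating Curve model is conventionally written C = a·Q^b; a minimal sketch of fitting it by least squares in log-log space, on synthetic data with assumed units, may clarify what is being compared against the transfer functions.

```python
import numpy as np

def fit_rating_curve(q, c):
    """Fit the classical rating-curve model C = a * Q**b by ordinary least
    squares in log-log space (both series must be strictly positive)."""
    b, log_a = np.polyfit(np.log(q), np.log(c), deg=1)   # slope, intercept
    return np.exp(log_a), b

rng = np.random.default_rng(2)
q = rng.uniform(0.05, 2.0, 200)                     # flow rate, m3/s (assumed)
c = 80 * q ** 0.6 * rng.lognormal(0, 0.2, 200)      # synthetic TSS, mg/L
a, b = fit_rating_curve(q, c)
print(f"C = {a:.1f} * Q^{b:.2f}")                   # recovers roughly 80 and 0.6
```

The RC model's appeal in the thesis is precisely this parsimony: two parameters against the many coefficients of a linear transfer function.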
Стилі APA, Harvard, Vancouver, ISO та ін.
35

D'ANGELO, LAURA. "Bayesian modeling of calcium imaging data." Doctoral thesis, Università degli Studi di Padova, 2022. https://hdl.handle.net/10281/399067.

Повний текст джерела
Анотація:
Recent advancements in miniaturized fluorescence microscopy have made it possible to investigate neuronal responses to external stimuli in awake behaving animals through the analysis of intra-cellular calcium signals. An ongoing challenge is deconvolving the noisy calcium signals to extract the spike trains, and understanding how this activity is affected by external stimuli and conditions. In this thesis, we aim to provide novel approaches to tackle various aspects of the analysis of calcium imaging data within a Bayesian framework. Following the standard methodology to the analysis of calcium imaging data based on a two-stage approach, we investigate efficient computational methods to link the output of the deconvolved fluorescence traces with the experimental conditions. In particular, we focus on the use of Poisson regression models to relate the number of detected spikes with several covariates. Motivated by this framework, but with a general impact in terms of application to other fields, we develop an efficient Metropolis-Hastings and importance sampling algorithm to simulate from the posterior distribution of the parameters of Poisson log-linear models under conditional Gaussian priors, with superior performance with respect to the state-of-the-art alternatives. Motivated by the lack of clear uncertainty quantification resulting from the use of a two-stage approach, and the impossibility to borrow information between the two stages, we focus on the analysis of individual neurons, and develop a coherent mixture model that allows for estimation of spiking activity and, simultaneously, reconstructing the distributions of the calcium transient spikes' amplitudes under different experimental conditions. More specifically, our modeling framework leverages two nested layers of random discrete mixture priors to borrow information between experiments and discover similarities in the distributional patterns of the neuronal response to different stimuli. Finally, we move to the multivariate analysis of populations of neurons. Here the interest is not only to detect and analyze the spiking activity but also to investigate the existence of groups of co-activating neurons. Estimation of such groups is a challenging problem due to the need to deconvolve the calcium traces and then cluster the resulting latent binary time series of activity. We describe a nonparametric mixture model that allows for simultaneous deconvolution and clustering of time series based on common patterns of activity. The model makes use of a latent continuous process for the spike probabilities to identify groups of co-activating cells. Neurons' dependence is taken into account by informing the mixture weights with their spatial location, following the common neuroscience assumption that neighboring neurons often activate together.
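A baseline random-walk Metropolis sampler for a Poisson log-linear model with a Gaussian prior can clarify the target that the thesis aims to sample more efficiently; this sketch is a generic reference implementation, not the improved Metropolis-Hastings and importance sampling algorithm developed in the thesis.

```python
import numpy as np

def mh_poisson_loglinear(y, X, n_iter=5000, step=0.05, tau2=10.0, seed=0):
    """Random-walk Metropolis for y_i ~ Poisson(exp(x_i' beta)),
    beta ~ N(0, tau2 * I). Returns the chain of beta draws."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])

    def log_post(b):
        eta = X @ b
        # Poisson log-likelihood (up to constants) plus Gaussian log-prior.
        return np.sum(y * eta - np.exp(eta)) - 0.5 * np.dot(b, b) / tau2

    draws, lp = [], log_post(beta)
    for _ in range(n_iter):
        prop = beta + step * rng.standard_normal(len(beta))
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
            beta, lp = prop, lp_prop
        draws.append(beta.copy())
    return np.array(draws)

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = rng.poisson(np.exp(X @ np.array([0.5, 1.0])))
print(mh_poisson_loglinear(y, X)[2500:].mean(axis=0))   # posterior mean near (0.5, 1.0)
```

In the calcium imaging setting, y would be the detected spike counts and X the experimental covariates; the thesis's contribution is to make this posterior exploration far more efficient than the naive random walk shown here.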
Стилі APA, Harvard, Vancouver, ISO та ін.
36

Perronnet, Caroline. "Etude de thérapies génique et pharmacologique visant à restaurer les capacités cognitives d’un modèle murin de la Dystrophie musculaire de Duchenne." Thesis, Paris 11, 2011. http://www.theses.fr/2011PA112009.

Повний текст джерела
Анотація:
Therapies have been developed to treat Duchenne muscular dystrophy (DMD, due to mutations in the dystrophin gene), but their ability to restore the cognitive deficits associated with this syndrome has not yet been studied. We explored two therapeutic approaches to compensate for the brain alterations resulting from the loss of dystrophin in the mdx mouse, a model of DMD. A pharmacological approach based on the overexpression of utrophin, a dystrophin homologue, does not alleviate the behavioural deficits in these mice. In contrast, a genetic intervention based on the splicing of the mutated exon leads to the restoration of endogenous dystrophin and a recovery of brain alterations such as the clustering of GABAA receptors and hippocampal synaptic plasticity in mdx mice. These results suggest a role for dystrophin in adult brain plasticity and indicate that this gene therapy approach is applicable to the treatment of cognitive impairments in DMD.
Стилі APA, Harvard, Vancouver, ISO та ін.
37

Gorin, Arseniy. "Structuration du modèle acoustique pour améliorer les performance de reconnaissance automatique de la parole." Thesis, Université de Lorraine, 2014. http://www.theses.fr/2014LORR0161/document.

Повний текст джерела
Анотація:
This thesis focuses on acoustic model structuring for improving HMM-based automatic speech recognition. The structuring relies on unsupervised clustering of the speech utterances of the training data in order to handle speaker and channel variability. The idea is to split the data into acoustically similar classes. In the conventional multi-modeling (or class-based) approach, separate class-dependent models are built via adaptation of a speaker-independent model. When the number of classes increases, less data becomes available for the estimation of the class-based models, and the parameters become less reliable. One way to handle this problem is to modify the classification criterion applied to the training data, allowing a given utterance to belong to more than one class. This is obtained by relaxing the classification decision through a soft margin, and it is investigated in the first part of the thesis. In the main part of the thesis, a novel approach is proposed that uses the clustered data more efficiently in a class-structured GMM. Instead of adapting all HMM-GMM parameters separately for each class of data, the class information is explicitly introduced into the GMM structure by associating a given density component with a given class. To efficiently exploit such a structured HMM-GMM, two different approaches are proposed. The first approach combines class-structured GMMs with class-dependent mixture weights: the Gaussian components are shared across speaker classes, but they are class-structured, and the mixture weights are class-dependent. For decoding an utterance, the set of mixture weights is selected according to the estimated class. In the second approach, the mixture weights are replaced by density component transition probabilities. The approaches proposed in the thesis are analyzed and evaluated on various speech corpora, which cover different sources of variability (age, gender, accent and noise).
Стилі APA, Harvard, Vancouver, ISO та ін.
38

Hasnat, Md Abul. "Unsupervised 3D image clustering and extension to joint color and depth segmentation." Thesis, Saint-Etienne, 2014. http://www.theses.fr/2014STET4013/document.

Повний текст джерела
Анотація:
Access to 3D images at a reasonable frame rate is now widespread, thanks to recent advances in low-cost depth sensors as well as efficient methods to compute 3D from 2D images. As a consequence, there is high demand for enhancing the capability of existing computer vision applications by incorporating 3D information. Indeed, numerous studies have demonstrated that the accuracy of different tasks increases when 3D information is included as an additional feature. However, for the task of indoor scene analysis and segmentation, several important issues remain, such as: (a) how can the 3D information itself be exploited? and (b) what is the best way to fuse color and 3D in an unsupervised manner? In this thesis, we address these issues and propose novel unsupervised methods for 3D image clustering and joint color and depth image segmentation. To this aim, we consider image normals as the prominent feature from the 3D image and cluster them with methods based on finite statistical mixture models. We adopt the Bregman soft clustering method to ensure computationally efficient clustering. Moreover, we exploit several probability distributions from directional statistics, such as the von Mises-Fisher distribution and the Watson distribution. By combining these, we propose novel model-based clustering methods. We empirically validate these methods using synthetic data and then demonstrate their application to 3D/depth image analysis. Afterwards, we extend these methods to segment synchronized 3D and color images, also called RGB-D images. To this aim, we first propose a statistical image generation model for RGB-D images. Then, we propose a novel RGB-D segmentation method using joint color-spatial-axial clustering and a statistical planar region merging method. Results show that the proposed method is comparable with state-of-the-art methods and requires less computation time. Moreover, it opens interesting perspectives for fusing color and geometry in an unsupervised manner. We believe that the methods proposed in this thesis are equally applicable and extendable to clustering other types of data, such as speech, gene expression, etc. Moreover, they can be used for complex tasks, such as joint image-speech data analysis.
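Spherical k-means, the hard-assignment limit of a von Mises-Fisher mixture with equal concentrations, gives a flavour of clustering surface normals by direction; it is a deliberate simplification of the Bregman soft clustering used in the thesis, which also estimates concentration parameters.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=100, seed=0):
    """Cluster unit vectors (e.g. surface normals) by cosine similarity."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # project to the sphere
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)      # assign by max cosine
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                centers[j] = m / np.linalg.norm(m)     # mean direction update
    return labels, centers

normals = np.random.default_rng(5).normal(size=(200, 3))   # toy "normals"
labels, centers = spherical_kmeans(normals, k=3)
print(np.bincount(labels))
```

In the thesis, the soft version of this assignment step (posterior responsibilities under vMF or Watson densities) is what allows planar regions with similar orientations to be separated gracefully.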
Стилі APA, Harvard, Vancouver, ISO та ін.
39

Schmutz, Amandine. "Contributions à l'analyse de données fonctionnelles multivariées, application à l'étude de la locomotion du cheval de sport." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1241.

Повний текст джерела
Анотація:
With the growth of the smart device market aiming to provide athletes and trainers with a systematic, objective and reliable follow-up, more and more parameters are monitored for the same individual. An alternative to laboratory evaluation methods is the use of inertial sensors, which allow the performance to be followed without hindering it, without space limits and without tedious initialization procedures. Data collected by those sensors can be classified as multivariate functional data: quantitative entities evolving along time and collected simultaneously for the same statistical individual. The aim of this thesis is to find parameters for analysing the athlete horse's locomotion using a sensor placed in the saddle. This connected device (inertial measurement unit, IMU) for equestrian sports collects acceleration and angular velocity along time in the three space directions, with a sampling frequency of 100 Hz. The database used for model development consists of 3221 canter strides from 58 ridden jumping horses of different ages and levels of competition, collected with two protocols: one for straight paths and one for curved paths. We restricted our work to the prediction of three parameters: the speed per stride, the stride length and the jump quality. To meet the first two objectives, we developed a multivariate functional clustering method that allows the division of the database into smaller, more homogeneous sub-groups from the point of view of the collected signals. This method characterizes each group by its average profile, which eases data understanding and interpretation. Surprisingly, however, this clustering model did not improve the speed prediction results: Support Vector Machines (SVM) remain the model with the lowest percentage of errors above 0.6 m/s. The same applies to stride length, where an accuracy of 20 cm is reached with the SVM model. Those results can be explained by the fact that our database is built from only 58 horses, which is a rather low number of individuals for a clustering method. We then extended this method to the co-clustering of multivariate functional data in order to ease the mining of the follow-up data collected for the same horse over time. This method might allow the detection and prevention of locomotor disturbances, the main source of career interruption for jumping horses. Lastly, we looked for correlations between jumping quality and the signals collected by the IMU. First results show that the signals collected by the saddle alone are not sufficient to differentiate jumping quality finely. Additional information will be needed, for example from complementary sensors or by expanding the database to cover a more diverse range of horses and jump profiles.
Стилі APA, Harvard, Vancouver, ISO та ін.
40

Ailem, Melissa. "Sparsity-sensitive diagonal co-clustering algorithms for the effective handling of text data." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB087.

Повний текст джерела
Анотація:
In the current context, there is a clear need for text mining techniques to analyse the huge quantity of unstructured text documents available on the Internet. These textual data are often represented by sparse high-dimensional matrices where rows and columns represent documents and terms, respectively. Thus, it is worthwhile to simultaneously group these terms and documents into meaningful clusters, making this substantial amount of data easier to handle and interpret. Co-clustering techniques serve exactly this purpose. Although many existing co-clustering approaches have been successful in revealing homogeneous blocks in several domains, these techniques are still challenged by the high dimensionality and sparsity exhibited by document-term matrices. Due to this sparsity, several co-clusters are primarily composed of zeros. While homogeneous, these co-clusters are irrelevant and must be filtered out in a post-processing step to keep only the most significant ones. The objective of this thesis is to propose new co-clustering algorithms tailored to take these sparsity-related issues into account. The proposed algorithms seek a block-diagonal structure and allow the most useful co-clusters to be identified directly, which makes them especially effective for the text co-clustering task. Our contributions can be summarized as follows. First, we introduce and demonstrate the effectiveness of a novel co-clustering algorithm based on direct maximization of graph modularity. While existing graph-based co-clustering algorithms rely on spectral relaxation, the proposed algorithm uses an iterative alternating optimization procedure to reveal the most meaningful co-clusters in a document-term matrix. Moreover, the proposed optimization has the advantage of avoiding the computation of eigenvectors, a task which is prohibitive for high-dimensional data; this is an improvement over spectral approaches, where the eigenvector computation is necessary to perform the co-clustering. Second, we use an even more powerful approach to discover block-diagonal structures in document-term matrices. We rely on mixture models, which offer strong theoretical foundations and considerable flexibility that makes it possible to uncover various specific cluster structures. More precisely, we propose a rigorous probabilistic model based on the Poisson distribution and the well-known Latent Block Model. Interestingly, this model includes the sparsity in its formulation, which makes it particularly effective for text data. Setting the estimation of this model's parameters under the Maximum Likelihood (ML) and the Classification Maximum Likelihood (CML) approaches, four co-clustering algorithms are proposed, including a hard, a soft, a stochastic and a fourth algorithm which leverages the benefits of both the soft and stochastic variants simultaneously. As a last contribution of this thesis, we propose a new biomedical text mining framework that includes some of the above-mentioned co-clustering algorithms. This work shows the contribution of co-clustering to a real biomedical text mining problem. The proposed framework is able to propose new clues about the results of genome-wide association studies (GWAS) by mining PUBMED abstracts. It has been tested on asthma and made it possible both to assess the strength of associations between asthma genes reported in previous GWAS and to discover new candidate genes likely associated with asthma. In a nutshell, while several text co-clustering algorithms already exist, their performance can be substantially increased if more appropriate models and algorithms are available. According to the extensive experiments conducted on several challenging real-world text datasets, we believe that this thesis has served this objective well.
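As an illustration of recovering a block-diagonal document-term structure, the sketch below uses scikit-learn's SpectralCoclustering on synthetic Poisson counts; this is a demonstration stand-in only, not the modularity-based or Poisson Latent Block Model algorithms proposed in the thesis.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(3)
# Sparse document-term counts with two planted diagonal blocks.
A = rng.poisson(0.05, size=(40, 60))
A[:20, :30] += rng.poisson(2.0, size=(20, 30))
A[20:, 30:] += rng.poisson(2.0, size=(20, 30))

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(A)
print(model.row_labels_[:5], model.column_labels_[:5])   # recovered blocks
```

The thesis's point is that for genuinely sparse matrices, a model that encodes sparsity directly (Poisson blocks with a background noise block) identifies the relevant diagonal co-clusters without the spectral machinery used here.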
Стилі APA, Harvard, Vancouver, ISO та ін.
41

Tekieh, Mohammad Hossein. "Analysis of Healthcare Coverage Using Data Mining Techniques." Thèse, Université d'Ottawa / University of Ottawa, 2012. http://hdl.handle.net/10393/20547.

Повний текст джерела
Анотація:
This study explores healthcare coverage disparities using a quantitative analysis of a large dataset from the United States. One objective is to build supervised models, including decision trees and neural networks, to study the factors that most influence healthcare coverage. We also discover groups of people with healthcare coverage problems and inconsistencies by employing unsupervised modeling, including the K-Means clustering algorithm. Our modeling is based on a dataset retrieved from the Medical Expenditure Panel Survey, with 98,175 records in the original dataset. After pre-processing the data, including binning, cleaning, dealing with missing values, and balancing, it contains 26,932 records and 23 variables. We build 50 classification models in IBM SPSS Modeler employing decision trees and neural networks. The accuracy of the models varies between 76% and 81%. The models can predict healthcare coverage for a new sample based on its significant attributes. We demonstrate that the decision tree models provide higher accuracy than the models based on neural networks. Having extensively analyzed the results, we find the most influential factors in healthcare coverage to be: access to care, age, family poverty level, and race/ethnicity.
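A hedged sketch of the supervised part of such an analysis follows, using a synthetic stand-in for the MEPS records (the real variables, preprocessing and SPSS Modeler workflow are not reproduced here).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, balanced dataset with 23 features, mimicking the preprocessed shape.
X, y = make_classification(n_samples=5000, n_features=23, n_informative=6,
                           weights=[0.5, 0.5], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_tr, y_tr)
print(f"accuracy: {tree.score(X_te, y_te):.2f}")
# Feature importances play the role of the 'most influential factors' analysis.
print("top features:", tree.feature_importances_.argsort()[-4:][::-1])
```

On the real data, the four top-ranked features correspond to the factors reported above (access to care, age, family poverty level, race/ethnicity).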
Стилі APA, Harvard, Vancouver, ISO та ін.
42

Rastelli, Riccardo, and Nial Friel. "Optimal Bayesian estimators for latent variable cluster models." Springer Nature, 2018. http://dx.doi.org/10.1007/s11222-017-9786-y.

Повний текст джерела
Анотація:
In cluster analysis interest lies in probabilistically capturing partitions of individuals, items or observations into groups, such that those belonging to the same group share similar attributes or relational profiles. Bayesian posterior samples for the latent allocation variables can be effectively obtained in a wide range of clustering models, including finite mixtures, infinite mixtures, hidden Markov models and block models for networks. However, due to the categorical nature of the clustering variables and the lack of scalable algorithms, summary tools that can interpret such samples are not available. We adopt a Bayesian decision theoretical approach to define an optimality criterion for clusterings and propose a fast and context-independent greedy algorithm to find the best allocations. One important facet of our approach is that the optimal number of groups is automatically selected, thereby solving the clustering and the model-choice problems at the same time. We consider several loss functions to compare partitions and show that our approach can accommodate a wide range of cases. Finally, we illustrate our approach on both artificial and real datasets for three different clustering models: Gaussian mixtures, stochastic block models and latent block models for networks.
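One concrete way to act on posterior allocation samples is to score candidate partitions by their expected Binder loss against the posterior similarity matrix; the sketch below restricts the search to the sampled partitions themselves, whereas the paper proposes a greedy search over a much larger space and supports several other loss functions.

```python
import numpy as np

def posterior_similarity(samples):
    """samples: (S, n) array of cluster allocations from S posterior draws.
    Returns the n x n matrix of co-clustering frequencies."""
    S, n = samples.shape
    psm = np.zeros((n, n))
    for z in samples:
        psm += (z[:, None] == z[None, :])
    return psm / S

def binder_loss(z, psm):
    """Expected Binder loss (up to constants) of partition z under the PSM."""
    same = (z[:, None] == z[None, :])
    return np.sum(np.abs(same - psm))

def best_sampled_partition(samples):
    """Pick the sampled partition that minimizes the expected loss."""
    psm = posterior_similarity(samples)
    losses = [binder_loss(z, psm) for z in samples]
    return samples[int(np.argmin(losses))]

samples = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]])
print(best_sampled_partition(samples))   # the majority partition wins
```

Note that the optimal partition fixes its own number of groups, which is exactly how the approach solves clustering and model choice simultaneously.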
Стилі APA, Harvard, Vancouver, ISO та ін.
43

Majed, Aliah. "Sensing-based self-reconfigurable strategies for autonomous modular robotic systems." Electronic Thesis or Diss., Brest, École nationale supérieure de techniques avancées Bretagne, 2022. http://www.theses.fr/2022ENTA0013.

Повний текст джерела
Анотація:
Modular robotic systems (MRSs) have become a highly active research area today. They have the ability to change the perspective on robotic systems, from machines designed to do certain tasks to multipurpose tools capable of accomplishing almost any task. They are used in a wide range of applications, including reconnaissance, rescue missions, space exploration, military tasks, etc. Typically, an MRS is built of "modules", from a few to several hundred or even thousands, where each module involves actuators, sensors, and computational and communication capabilities. Usually, these systems are homogeneous, with all modules identical; however, there can also be heterogeneous systems that contain different modules to maximize versatility. One of the advantages of these systems is their ability to operate in harsh environments, in which contemporary human-in-the-loop working schemes are risky, inefficient and sometimes infeasible. In this thesis, we are interested in self-reconfigurable modular robotics. Such a system uses a set of detectors to continuously sense its surroundings, locate its own position, and then transform into a specific shape to perform the required tasks. Consequently, an MRS faces three major challenges. First, it produces a great amount of collected data that overloads the memory storage of the robot. Second, it generates redundant data, which complicates decision-making about the next morphology in the controller. Third, the self-reconfiguration process necessitates massive communication between the modules to reach the target morphology and takes significant processing time to self-reconfigure the robot. Therefore, research strategies often aim to minimize the amount of data collected by the modules without considerable loss in fidelity. The goal of this reduction is first to save storage space in the MRS, and then to facilitate analyzing the data and making decisions about which morphology to use next in order to adapt to new circumstances and perform new tasks. In this thesis, we propose an efficient mechanism for data processing and self-reconfiguration decision-making dedicated to modular robotic systems. More specifically, we focus on data storage reduction, self-reconfiguration decision-making, and efficient communication management between modules in MRSs, with the main goal of ensuring a fast self-reconfiguration process.
APA, Harvard, Vancouver, ISO, and other styles
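
As a concrete, if simplified, illustration of the on-module data reduction the abstract motivates, a dead-band filter keeps a new sensor reading only when it differs enough from the last kept one. This is a common baseline rather than the mechanism actually developed in the thesis, and the threshold value is arbitrary.

def deadband_reduce(readings, eps=0.5):
    """Keep a sensor reading only when it deviates from the last kept
    value by more than eps -- a standard baseline for reducing stored
    and transmitted data on a module (not the thesis's own scheme)."""
    kept = []
    last = None
    for t, x in enumerate(readings):
        if last is None or abs(x - last) > eps:
            kept.append((t, x))
            last = x
    return kept
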
44

Kulhanek, Raymond Daniel. "A Latent Dirichlet Allocation/N-gram Composite Language Model." Wright State University / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=wright1379520876.

Full text source
APA, Harvard, Vancouver, ISO, and other styles
45

Gallopin, Mélina. "Classification et inférence de réseaux pour les données RNA-seq." Thesis, Université Paris-Saclay (ComUE), 2015. http://www.theses.fr/2015SACLS174/document.

Full text source
Abstract:
This thesis gathers methodological contributions to the statistical analysis of data from transcriptome sequencing technologies (RNA-seq). RNA-seq count data are discrete, and the number of sequenced samples is usually small because of the cost of the technology; these two points are the main statistical challenges in modelling RNA-seq data. The first part of the thesis is dedicated to co-expression analysis of RNA-seq data using model-based clustering, where the goal is to detect modules of co-expressed genes. A natural model for discrete RNA-seq data is a Poisson mixture model; however, a Gaussian mixture model combined with a simple transformation of the data is a reasonable alternative. For each RNA-seq dataset, we propose to compare the candidate models using a data-driven criterion that selects the model best suited to the data. In addition, we present a model selection criterion that takes external biological information about the genes into account and thereby eases the extraction of biologically interpretable clusters. This criterion is not specific to RNA-seq data: it is useful in any co-expression analysis based on mixture models that aims to enrich functional gene annotation databases. The second part of the thesis is dedicated to network inference using graphical models, where the goal is to detect dependence relationships between gene expression levels. We propose a network inference model based on Poisson distributions, taking into account the discrete nature and high inter-sample variability of RNA-seq data. However, network inference methods require a large number of samples. Within the competing Gaussian graphical model framework, we present a non-asymptotic approach for selecting relevant subsets of genes by decomposing the covariance matrix into diagonal blocks. This method is not specific to RNA-seq data and reduces the dimension of any network inference problem based on the Gaussian graphical model.
APA, Harvard, Vancouver, ISO, and other styles
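
A minimal sketch of the clustering side of this work, under the simplifying assumption that a log transform plus a Gaussian mixture stands in for the full Poisson-versus-transformed-Gaussian comparison; the criterion here is plain BIC rather than the thesis's own criteria, and scikit-learn is assumed.

import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_transformed_counts(counts, k_range=range(2, 11), seed=0):
    """Fit Gaussian mixtures to log-transformed RNA-seq counts
    (genes x samples) and pick the number of clusters by BIC --
    a simplified stand-in for the data-driven model comparison
    proposed in the thesis."""
    X = np.log1p(counts)  # simple variance-stabilising transform
    fits = {}
    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
        fits[k] = (gmm.bic(X), gmm)
    best_k = min(fits, key=lambda k: fits[k][0])
    _, gmm = fits[best_k]
    return best_k, gmm.predict(X)  # chosen K and gene-module labels
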
46

El, Assaad Hani. "Modélisation et classification dynamique de données temporelles non stationnaires." Thesis, Paris Est, 2014. http://www.theses.fr/2014PEST1162/document.

Full text source
Abstract:
This thesis tackles the problem of unsupervised classification of data whose class characteristics may evolve over time, also referred to as dynamic clustering of non-stationary temporal data. The application context is pattern-recognition-based diagnosis of complex dynamic systems whose operating classes can drift over time due to wear, progressive misadjustments or varying operating conditions. Diagnosis and monitoring for predictive maintenance of railway components are key subjects for both operators and manufacturers, who seek to anticipate upcoming maintenance actions, reduce maintenance costs and increase the availability of the rail network. We are interested in a major component of the railway infrastructure, the railway switch: an important safety device whose failure can heavily impact the availability of the transportation system. The diagnosis of this system can be carried out by exploiting sequential measurements acquired successively as the state of the system evolves, namely power consumption curves recorded during successive switch operations, whose shape is indicative of the operating state. We propose a new dynamic probabilistic approach within a temporal data clustering framework, based on both Gaussian mixture models and state-space models. The main challenge of this work is the estimation of the model parameters, owing to the model's complex structure; to meet it, a variational variant of the EM algorithm has been developed. With fast processing of data streams in view, a sequential version of this algorithm was also developed, together with a strategy for dynamically selecting the number of classes. Experiments on both synthetic data and real data acquired from the railway switch system highlight the advantage of the proposed algorithms over state-of-the-art methods in terms of clustering and estimation accuracy.
APA, Harvard, Vancouver, ISO, and other styles
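
The combination of a mixture model with state-space dynamics can be sketched as follows. This is a heavily simplified, illustrative update (isotropic components, scalar mean variances, uniform weights), not the variational algorithm developed in the thesis, and all parameter values are placeholders.

import numpy as np

def dynamic_gmm_step(x, means, var_means, obs_var=1.0, drift=0.01):
    """One sequential update for a mixture whose component means follow
    a random walk. x: (d,) new observation; means: (K, d) component
    means; var_means: (K,) scalar posterior variances of each mean."""
    var_means = var_means + drift  # prediction: random-walk drift inflates uncertainty
    d2 = ((means - x) ** 2).sum(axis=1)
    log_w = -0.5 * d2 / obs_var  # E-step under isotropic Gaussian components
    w = np.exp(log_w - log_w.max())
    w /= w.sum()  # responsibilities of each component for x
    # M-step: Kalman-like blend of each predicted mean with the observation,
    # weighted by its responsibility
    gain = w * var_means / (var_means + obs_var)
    means = means + gain[:, None] * (x - means)
    var_means = (1.0 - gain) * var_means
    return means, var_means, w

Calling this once per incoming curve feature vector lets the cluster centres track slow drift, which is the intuition behind coupling mixtures with state-space models.
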
47

Haider, Peter. "Prediction with Mixture Models." Phd thesis, Universität Potsdam, 2013. http://opus.kobv.de/ubp/volltexte/2014/6961/.

Full text source
Abstract:
Learning a model for the relationship between the attributes and the annotated labels of data examples serves two purposes. Firstly, it enables the prediction of the label for examples without annotation. Secondly, the parameters of the model can provide useful insights into the structure of the data. If the data has an inherent partitioned structure, it is natural to mirror this structure in the model. Such mixture models predict by combining the individual predictions generated by the mixture components which correspond to the partitions in the data. Often the partitioned structure is latent and has to be inferred when learning the mixture model. Directly evaluating the accuracy of the inferred partition structure is, in many cases, impossible because the ground truth cannot be obtained for comparison. However, it can be assessed indirectly by measuring the prediction accuracy of the mixture model that arises from it. This thesis addresses the interplay between improving predictive accuracy by uncovering latent cluster structure in data and validating the estimated structure by measuring the accuracy of the resulting predictive model. In the application of filtering unsolicited emails, the emails in the training set are latently clustered into advertisement campaigns. Uncovering this latent structure allows filtering of future emails with very low false-positive rates. In order to model the cluster structure, a Bayesian clustering model for dependent binary features is developed in this thesis. Knowing the clustering of emails into campaigns can also aid in uncovering which emails have been sent on behalf of the same network of compromised hosts, so-called botnets. This association of emails to networks is another layer of latent clustering. Uncovering this latent structure allows service providers to further increase the accuracy of email filtering and to effectively defend against distributed denial-of-service attacks. To this end, a discriminative clustering model based on the graph of observed emails is derived in this thesis. The partitionings inferred using this model are evaluated through their capacity to predict the campaigns of new emails. Furthermore, when classifying the content of emails, statistical information about the sending server can be valuable. Learning a model that is able to make use of this information requires training data that includes server statistics. In order to also use training data where the server statistics are missing, a model that is a mixture over all potential substitutions of the missing statistics is developed. Another application is the prediction of the navigation behavior of the users of a website. Here there is no a priori partitioning of the users into clusters, but imposing a partitioning is necessary in order to understand different usage scenarios and design different layouts for them. The presented approach simultaneously optimizes the discriminative as well as the predictive power of the clusters. Each model is evaluated on real-world data and compared to baseline methods. The results show that explicitly modeling the assumptions about the latent cluster structure leads to improved predictions compared to the baselines. It is beneficial to incorporate a small number of hyperparameters that can be tuned to yield the best predictions in cases where the prediction accuracy cannot be optimized directly.
APA, Harvard, Vancouver, ISO, and other styles
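
The shared mechanism behind these models, combining per-component predictions weighted by each component's responsibility for the input, can be sketched as below. This illustrates the generic mixture-prediction scheme rather than any specific model from the thesis, and every name here is illustrative.

import numpy as np

def mixture_predict(x, weights, means, covs_inv, predictors):
    """Combine per-component predictions, weighted by how responsible
    each component is for input x. predictors: list of K callables,
    one per mixture component, each mapping x to a prediction."""
    # unnormalised log responsibility of each Gaussian component for x
    log_r = []
    for k in range(len(weights)):
        diff = x - means[k]
        log_r.append(np.log(weights[k]) - 0.5 * diff @ covs_inv[k] @ diff)
    log_r = np.array(log_r)
    r = np.exp(log_r - log_r.max())
    r /= r.sum()  # normalised responsibilities
    preds = np.array([f(x) for f in predictors])
    return r @ preds  # responsibility-weighted combination
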
48

Westerlund, Annie M. "Computational Study of Calmodulin’s Ca2+-dependent Conformational Ensembles." Licentiate thesis, KTH, Biofysik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-234888.

Full text source
Abstract:
Ca2+ and calmodulin play important roles in many physiologically crucial pathways. The conformational landscape of calmodulin is intriguing: conformational changes allow binding of target proteins, while binding Ca2+ yields population shifts within the landscape, so that target proteins become Ca2+-sensitive upon calmodulin binding. Calmodulin regulates more than 300 target proteins, and mutations are linked to lethal disorders. The mechanisms underlying Ca2+ and target-protein binding are complex and pose interesting questions. Such questions are typically addressed with experiments, which fail to provide simultaneous molecular and dynamical insight. In this thesis, questions on binding mechanisms are probed with molecular dynamics simulations together with tailored unsupervised learning and data analysis. In Paper 1, a free energy landscape estimator based on Gaussian mixture models with cross-validation was developed and used to evaluate the efficiency of regular molecular dynamics compared to temperature-enhanced molecular dynamics. This comparison revealed interesting properties of the free energy landscapes, highlighting different behaviors of the Ca2+-bound and unbound calmodulin conformational ensembles. In Paper 2, spectral clustering was used to shed light on Ca2+ and target-protein binding. With these tools, it was possible to characterize differences in target-protein binding depending on the Ca2+ state as well as on N-terminal or C-terminal lobe binding. This work invites data-driven analysis into the field of biomolecular molecular dynamics, provides further insight into calmodulin's Ca2+ and target-protein binding, and serves as a stepping stone towards a complete understanding of calmodulin's Ca2+-dependent conformational ensembles.


APA, Harvard, Vancouver, ISO, and other styles
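
The free-energy estimator of Paper 1 rests on the relation F(x) = -kT ln p(x), with the density p estimated by a Gaussian mixture whose number of components is chosen by cross-validation. A compact sketch under those assumptions: scikit-learn's GaussianMixture scores held-out log-likelihood through cross_val_score, kT is set to 2.494 kJ/mol (300 K), and the component range is arbitrary.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_score

def free_energy_landscape(X, k_range=range(1, 16), kT=2.494, seed=0):
    """Estimate a free-energy surface F(x) = -kT ln p(x) from sampled
    collective variables X (n_frames x n_cv), choosing the number of
    mixture components by cross-validated held-out log-likelihood."""
    best_k, best_score = None, -np.inf
    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=seed)
        score = cross_val_score(gmm, X, cv=3).mean()  # mean held-out log-lik.
        if score > best_score:
            best_k, best_score = k, score
    gmm = GaussianMixture(n_components=best_k, random_state=seed).fit(X)
    # return F as a callable over points in collective-variable space
    return lambda x: -kT * gmm.score_samples(np.atleast_2d(x))

Cross-validation here guards against the overfitting that a purely in-sample likelihood would reward as the number of components grows.
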
49

Dahl, Oskar, and Fredrik Johansson. "Understanding usage of Volvo trucks." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-40826.

Full text source
Abstract:
Trucks are designed, configured and marketed for various working environments. There is concern about whether trucks are used as intended by the manufacturer, as usage may impact the longevity, efficiency and productivity of the trucks. In this thesis we propose a framework, divided into two separate parts, that aims to extract customers' driving behaviours from Logged Vehicle Data (LVD) in order to (a) evaluate whether they align with so-called Global Transport Application (GTA) parameters and (b) evaluate the usage in terms of performance. A Gaussian mixture model (GMM) is employed to cluster and classify various driving behaviours. Association rule mining is applied to the resulting clusters to validate that the usage follows the GTA configuration. Furthermore, the correlation coefficient (CC) is used to find linear relationships between usage and performance in terms of fuel consumption (FC). It is found that the vast majority of the trucks seemingly follow GTA parameters and are thus used as marketed. Likewise, fuel economy is found to depend linearly on drivers' performance. The LVD lacks detail, such as Global Positioning System (GPS) information, needed to capture the usage in such a way that more definitive conclusions can be drawn.

This thesis was later developed into a scientific paper and submitted to the ICIMP 2020 conference. The paper was accepted on 23 September 2019 and will be presented in January 2020.

APA, Harvard, Vancouver, ISO, and other styles
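
A condensed sketch of the pipeline the abstract describes: cluster usage features with a GMM, then test for a linear usage-performance relationship with Pearson's correlation coefficient. The feature layout, the number of behaviours and the choice of which feature to correlate with fuel consumption are all assumptions here.

import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.stats import pearsonr

def usage_clusters_and_fc(features, fuel_consumption, n_behaviours=4, seed=0):
    """features: (n_trucks, n_features) usage summary per truck;
    fuel_consumption: (n_trucks,) fuel-consumption figure per truck.
    Returns a driving-behaviour label per truck plus the correlation
    between one usage feature and fuel consumption."""
    gmm = GaussianMixture(n_components=n_behaviours, random_state=seed)
    behaviour = gmm.fit_predict(features)  # driving-behaviour cluster labels
    # linear relationship between (here, the first) usage feature and FC
    r, p = pearsonr(features[:, 0], fuel_consumption)
    return behaviour, r, p
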
50

Barceló, Rico Fátima. "Multimodel Approaches for Plasma Glucose Estimation in Continuous Glucose Monitoring. Development of New Calibration Algorithms." Doctoral thesis, Universitat Politècnica de València, 2012. http://hdl.handle.net/10251/17173.

Full text source
Abstract:
Diabetes Mellitus (DM) embraces a group of metabolic diseases whose main characteristic is the presence of high glucose levels in blood. It is one of the diseases with the greatest social and health impact, both for its prevalence and for the consequences of the chronic complications it implies. One of the research lines aiming to improve the quality of life of people with diabetes is technical in focus. It involves several lines of research, including the development and improvement of devices to estimate plasma glucose online: continuous glucose monitoring systems (CGMS), both invasive and non-invasive. These devices estimate plasma glucose from sensor measurements in compartments alternative to blood. Current commercially available CGMS are minimally invasive and estimate plasma glucose from measurements in the interstitial fluid. CGMS is a key component of the technical approach to building the artificial pancreas, which aims to close the loop in combination with an insulin pump. Yet the accuracy of current CGMS is still poor, and this may partly depend on the low performance of the implemented calibration algorithm (CA). In addition, sensor-to-patient sensitivity differs between patients and, for the same patient, over time. It is clear, then, that the development of new efficient calibration algorithms for CGMS is an interesting and challenging problem. The indirect measurement of plasma glucose through interstitial glucose is a main confounder of CGMS accuracy. Many components take part in the glucose transport dynamics; indeed, physiology suggests the existence of different local behaviours in the glucose transport process. For this reason, local modelling techniques may be the best option for the structure of the desired CA: similar input samples are represented by the same local model, and the integration of all local models, considering the input regions where each is valid, constitutes the final model of the whole data set. Clustering is t
Barceló Rico, F. (2012). Multimodel Approaches for Plasma Glucose Estimation in Continuous Glucose Monitoring. Development of New Calibration Algorithms [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/17173
APA, Harvard, Vancouver, ISO, and other styles
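
The multimodel idea sketched at the end of the abstract, local models blended according to the input region, can be illustrated as below. Gaussian memberships and all names are illustrative choices, not the calibration algorithms actually developed in the thesis.

import numpy as np

def local_model_estimate(x, centres, widths, local_models):
    """Blend local models by fuzzy membership of the input to each
    operating region. centres: (K, d) region centres; widths: (K,)
    region scales; local_models: list of K callables, one per region."""
    d2 = ((centres - x) ** 2).sum(axis=1) / widths ** 2
    mu = np.exp(-0.5 * d2)
    mu /= mu.sum()  # normalised region memberships
    # weighted combination of the local predictions valid near x
    return sum(m * f(x) for m, f in zip(mu, local_models))
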
