
Theses on the topic "Dirichlet modeling"

Consult the 49 best theses for your research on the topic "Dirichlet modeling".

Next to each source in the reference list there is an "Add to bibliography" button. Press this button, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the academic publication in PDF format and read its abstract online whenever it is available in the metadata.

Explore theses on a wide variety of disciplines and organize your bibliography correctly.

1

Heaton, Matthew J. "Temporally Correlated Dirichlet Processes in Pollution Receptor Modeling". Diss., CLICK HERE for online access, 2007. http://contentdm.lib.byu.edu/ETD/image/etd1861.pdf.

Full text
2

Hu, Zhen. "Modeling photonic crystal devices by Dirichlet-to-Neumann maps /". access full-text access abstract and table of contents, 2009. http://libweb.cityu.edu.hk/cgi-bin/ezdb/thesis.pl?phd-ma-b30082559f.pdf.

Full text
Abstract
Thesis (Ph.D.)--City University of Hong Kong, 2009.
"Submitted to Department of Mathematics in partial fulfillment of the requirements for the degree of Doctor of Philosophy." Includes bibliographical references (leaves [85]-91)
3

Gao, Wenyu. "Advanced Nonparametric Bayesian Functional Modeling". Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/99913.

Full text
Abstract
Functional analyses have gained more interest as we have easier access to massive data sets. However, such data sets often contain large heterogeneities, noise, and dimensionalities. When generalizing the analyses from vectors to functions, classical methods might not work directly. This dissertation considers noisy information reduction in functional analyses from two perspectives: functional variable selection to reduce the dimensionality and functional clustering to group similar observations and thus reduce the sample size. The complicated data structures and relations can be easily modeled by a Bayesian hierarchical model, or developed from a more generic one by changing the prior distributions. Hence, this dissertation focuses on the development of Bayesian approaches for functional analyses due to their flexibilities. A nonparametric Bayesian approach, such as the Dirichlet process mixture (DPM) model, has a nonparametric distribution as the prior. This approach provides flexibility and reduces assumptions, especially for functional clustering, because the DPM model has an automatic clustering property, so the number of clusters does not need to be specified in advance. Furthermore, a weighted Dirichlet process mixture (WDPM) model allows for more heterogeneities from the data by assuming more than one unknown prior distribution. It also gathers more information from the data by introducing a weight function that assigns different candidate priors, such that the less similar observations are more separated. Thus, the WDPM model will improve the clustering and model estimation results. In this dissertation, we used an advanced nonparametric Bayesian approach to study functional variable selection and functional clustering methods. We proposed 1) a stochastic search functional selection method with application to 1-M matched case-crossover studies for aseptic meningitis, to examine the time-varying unknown relationship and find out important covariates affecting disease contractions; 2) a functional clustering method via the WDPM model, with application to three pathways related to genetic diabetes data, to identify essential genes distinguishing between normal and disease groups; and 3) a combined functional clustering, with the WDPM model, and variable selection approach with application to high-frequency spectral data, to select wavelengths associated with breast cancer racial disparities.
Doctor of Philosophy
As we have easier access to massive data sets, functional analyses have gained more interest to analyze data providing information about curves, surfaces, or others varying over a continuum. However, such data sets often contain large heterogeneities and noise. When generalizing the analyses from vectors to functions, classical methods might not work directly. This dissertation considers noisy information reduction in functional analyses from two perspectives: functional variable selection to reduce the dimensionality and functional clustering to group similar observations and thus reduce the sample size. The complicated data structures and relations can be easily modeled by a Bayesian hierarchical model due to its flexibility. Hence, this dissertation focuses on the development of nonparametric Bayesian approaches for functional analyses. Our proposed methods can be applied in various applications: the epidemiological studies on aseptic meningitis with clustered binary data, the genetic diabetes data, and breast cancer racial disparities.
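The automatic clustering property mentioned above can be illustrated with a truncated Dirichlet process mixture. The sketch below uses scikit-learn's variational BayesianGaussianMixture on toy Gaussian data; it is only a minimal illustration of the DPM behaviour under assumed placeholder data, truncation level, and prior values, not the functional WDPM model developed in the dissertation.

```python
# Minimal sketch of the Dirichlet process mixture's "automatic clustering"
# property, using scikit-learn's variational truncated DP mixture. This is a
# generic Gaussian illustration, not the functional WDPM model of the thesis.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy data: three well-separated Gaussian clusters (placeholder for real data).
X = np.vstack([
    rng.normal(loc=-5.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=6.0, scale=1.0, size=(100, 2)),
])

# The truncation level (n_components) only caps the number of clusters;
# the DP prior lets superfluous components receive negligible weight.
dpm = BayesianGaussianMixture(
    n_components=15,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,
    covariance_type="full",
    max_iter=500,
    random_state=0,
).fit(X)

effective_k = np.sum(dpm.weights_ > 1e-2)
print("components with non-negligible weight:", effective_k)  # typically 3
```

Even though 15 components are allowed, the Dirichlet process prior typically leaves only about three with appreciable weight, which is the sense in which the number of clusters need not be specified in advance.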
4

Monson, Rebecca Lee. "Modeling Transition Probabilities for Loan States Using a Bayesian Hierarchical Model". Diss., CLICK HERE for online access, 2007. http://contentdm.lib.byu.edu/ETD/image/etd2179.pdf.

Full text
5

Lim, Woobeen. "Bayesian Semiparametric Joint Modeling of Longitudinal Predictors and Discrete Outcomes". The Ohio State University, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=osu1618955725276958.

Full text
6

Domingues, Rémi. "Probabilistic Modeling for Novelty Detection with Applications to Fraud Identification". Electronic Thesis or Diss., Sorbonne université, 2019. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2019SORUS473.pdf.

Full text
Abstract
Novelty detection is the unsupervised problem of identifying anomalies in test data which significantly differ from the training set. While numerous novelty detection methods were designed to model continuous numerical data, tackling datasets composed of mixed-type features, such as numerical and categorical data, or temporal datasets describing discrete event sequences is a challenging task. In addition to the supported data types, the key criteria for efficient novelty detection methods are the ability to accurately dissociate novelties from nominal samples, the interpretability, the scalability and the robustness to anomalies located in the training data. In this thesis, we investigate novel ways to tackle these issues. In particular, we propose (i) a survey of state-of-the-art novelty detection methods applied to mixed-type data, including extensive scalability, memory consumption and robustness tests (ii) a survey of state-of-the-art novelty detection methods suitable for sequence data (iii) a probabilistic nonparametric novelty detection method for mixed-type data based on Dirichlet process mixtures and exponential-family distributions and (iv) an autoencoder-based novelty detection model with encoder/decoder modelled as deep Gaussian processes. The learning of this last model is made tractable and scalable through the use of random feature approximations and stochastic variational inference. The method is suitable for large-scale novelty detection problems and data with mixed-type features. The experiments indicate that the proposed model achieves competitive results with state-of-the-art novelty detection methods
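A minimal sketch of the DPM-based novelty-detection idea follows, assuming purely numerical data (the thesis handles mixed-type features with exponential-family components, which scikit-learn's Gaussian-only mixture does not cover): fit the mixture on nominal training data, then flag test points whose log-density falls below a low training quantile. The data and the threshold quantile are illustrative assumptions.

```python
# Hedged sketch: density-based novelty detection with a truncated DP mixture.
# The thesis's model covers mixed-type features; this toy version is Gaussian.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(500, 3))           # nominal samples
X_test = np.vstack([rng.normal(0.0, 1.0, size=(20, 3)),  # nominal
                    rng.normal(8.0, 1.0, size=(5, 3))])  # anomalous

dpm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X_train)

# Threshold at a low quantile of the training log-density (assumed ~1%
# contamination tolerance); points below it are reported as novelties.
threshold = np.quantile(dpm.score_samples(X_train), 0.01)
is_novelty = dpm.score_samples(X_test) < threshold
print(is_novelty)
```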
7

Race, Jonathan Andrew. "Semi-parametric Survival Analysis via Dirichlet Process Mixtures of the First Hitting Time Model". The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu157357742741077.

Full text
8

Huo, Shuning. "Bayesian Modeling of Complex High-Dimensional Data". Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/101037.

Full text
Abstract
With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional complex data in different forms, such as medical images, genomics measurements. However, acquisition of more data does not automatically lead to better knowledge discovery. One needs efficient and reliable analytical tools to extract useful information from complex datasets. The main objective of this dissertation is to develop innovative Bayesian methodologies to enable effective and efficient knowledge discovery from complex high-dimensional data. It contains two parts—the development of computationally efficient functional mixed models and the modeling of data heterogeneity via Dirichlet Diffusion Tree. The first part focuses on tackling the computational bottleneck in Bayesian functional mixed models. We propose a computational framework called variational functional mixed model (VFMM). This new method facilitates efficient data compression and high-performance computing in basis space. We also propose a new multiple testing procedure in basis space, which can be used to detect significant local regions. The effectiveness of the proposed model is demonstrated through two datasets, a mass spectrometry dataset in a cancer study and a neuroimaging dataset in an Alzheimer's disease study. The second part is about modeling data heterogeneity by using Dirichlet Diffusion Trees. We propose a Bayesian latent tree model that incorporates covariates of subjects to characterize the heterogeneity and uncover the latent tree structure underlying data. This innovative model may reveal the hierarchical evolution process through branch structures and estimate systematic differences between groups of samples. We demonstrate the effectiveness of the model through the simulation study and a brain tumor real data.
Doctor of Philosophy
With the rapid development of modern high-throughput technologies, scientists can now collect high-dimensional data in different forms, such as engineering signals, medical images, and genomics measurements. However, acquisition of such data does not automatically lead to efficient knowledge discovery. The main objective of this dissertation is to develop novel Bayesian methods to extract useful knowledge from complex high-dimensional data. It has two parts—the development of an ultra-fast functional mixed model and the modeling of data heterogeneity via Dirichlet Diffusion Trees. The first part focuses on developing approximate Bayesian methods in functional mixed models to estimate parameters and detect significant regions. Two datasets demonstrate the effectiveness of proposed method—a mass spectrometry dataset in a cancer study and a neuroimaging dataset in an Alzheimer's disease study. The second part focuses on modeling data heterogeneity via Dirichlet Diffusion Trees. The method helps uncover the underlying hierarchical tree structures and estimate systematic differences between the group of samples. We demonstrate the effectiveness of the method through the brain tumor imaging data.
9

Liu, Jia. "Heterogeneous Sensor Data based Online Quality Assurance for Advanced Manufacturing using Spatiotemporal Modeling". Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/78722.

Full text
Abstract
Online quality assurance is crucial for elevating product quality and boosting process productivity in advanced manufacturing. However, the inherent complexity of advanced manufacturing, including nonlinear process dynamics, multiple process attributes, and low signal/noise ratio, poses severe challenges for both maintaining stable process operations and establishing efficacious online quality assurance schemes. To address these challenges, four different advanced manufacturing processes, namely, fused filament fabrication (FFF), binder jetting, chemical mechanical planarization (CMP), and the slicing process in wafer production, are investigated in this dissertation for applications of online quality assurance, with utilization of various sensors, such as thermocouples, infrared temperature sensors, accelerometers, etc. The overarching goal of this dissertation is to develop innovative integrated methodologies tailored for these individual manufacturing processes but addressing their common challenges to achieve satisfying performance in online quality assurance based on heterogeneous sensor data. Specifically, three new methodologies are created and validated using actual sensor data, namely, (1) Real-time process monitoring methods using Dirichlet process (DP) mixture model for timely detection of process changes and identification of different process states for FFF and CMP. The proposed methodology is capable of tackling non-Gaussian data from heterogeneous sensors in these advanced manufacturing processes for successful online quality assurance. (2) Spatial Dirichlet process (SDP) for modeling complex multimodal wafer thickness profiles and exploring their clustering effects. The SDP-based statistical control scheme can effectively detect out-of-control wafers and achieve wafer thickness quality assurance for the slicing process with high accuracy. (3) Augmented spatiotemporal log Gaussian Cox process (AST-LGCP) quantifying the spatiotemporal evolution of porosity in binder jetting parts, capable of predicting high-risk areas on consecutive layers. This work fills the long-standing research gap of lacking rigorous layer-wise porosity quantification for parts made by additive manufacturing (AM), and provides the basis for facilitating corrective actions for product quality improvements in a prognostic way. These developed methodologies surmount some common challenges of advanced manufacturing which paralyze traditional methods in online quality assurance, and embody key components for implementing effective online quality assurance with various sensor data. There is a promising potential to extend them to other manufacturing processes in the future.
Ph. D.
10

Bui, Quang Vu. "Pretopology and Topic Modeling for Complex Systems Analysis : Application on Document Classification and Complex Network Analysis". Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEP034/document.

Full text
Abstract
The work of this thesis presents the development of algorithms for document classification on the one hand, or complex network analysis on the other hand, based on pretopology, a theory that models the concept of proximity. The first work develops a framework for document clustering by combining Topic Modeling and Pretopology. Our contribution proposes using topic distributions extracted from topic modeling treatment as input for classification methods. In this approach, we investigated two aspects: determine an appropriate distance between documents by studying the relevance of Probabilistic-Based and Vector-Based Measurements and effect groupings according to several criteria using a pseudo-distance defined from pretopology. The second work introduces a general framework for modeling Complex Networks by developing a reformulation of stochastic pretopology and proposes Pretopology Cascade Model as a general model for information diffusion. In addition, we proposed an agent-based model, Textual-ABM, to analyze complex dynamic networks associated with textual information using author-topic model and introduced Textual-Homo-IC, an independent cascade model of the resemblance, in which homophily is measured based on textual content obtained by utilizing Topic Modeling
11

Schulte, Lukas. "Investigating topic modeling techniques for historical feature location". Thesis, Karlstads universitet, Institutionen för matematik och datavetenskap (from 2013), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-85379.

Full text
Abstract
Software maintenance and the understanding of where in the source code features are implemented are two strongly coupled tasks that make up a large portion of the effort spent on developing applications. The concept of feature location investigated in this thesis can serve as a supporting factor in those tasks as it facilitates the automation of otherwise manual searches for source code artifacts. Challenges in this subject area include the aggregation and composition of a training corpus from historical codebase data for models as well as the integration and optimization of qualified topic modeling techniques. Building up on previous research, this thesis provides a comparison of two different techniques and introduces a toolkit that can be used to reproduce and extend on the results discussed. Specifically, in this thesis a changeset-based approach to feature location is pursued and applied to a large open-source Java project. The project is used to optimize and evaluate the performance of Latent Dirichlet Allocation models and Pachinko Allocation models, as well as to compare the accuracy of the two models with each other. As discussed at the end of the thesis, the results do not indicate a clear favorite between the models. Instead, the outcome of the comparison depends on the metric and viewpoint from which it is assessed.
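As a rough sketch of the LDA side of such a comparison (not the thesis toolkit, and with placeholder "changeset" documents instead of a real preprocessed corpus), gensim can train an LDA model and expose per-document topic distributions, which is the kind of representation a feature-location approach would then query.

```python
# Minimal LDA topic-modeling sketch with gensim; the tokenized "changesets"
# below are placeholders for a real, preprocessed corpus.
from gensim import corpora, models

docs = [
    ["render", "widget", "layout", "paint"],
    ["parse", "token", "grammar", "ast"],
    ["render", "texture", "shader", "paint"],
    ["token", "lexer", "grammar", "parse"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(
    corpus=corpus, id2word=dictionary,
    num_topics=2, passes=20, random_state=0,
)

# Per-document topic distributions, e.g. as features for feature location.
for bow in corpus:
    print(lda.get_document_topics(bow, minimum_probability=0.0))
```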
12

Hu, Xuequn. "Modeling Endogenous Treatment Effects with Heterogeneity: A Bayesian Nonparametric Approach". Scholar Commons, 2011. http://scholarcommons.usf.edu/etd/3159.

Full text
Abstract
This dissertation explores the estimation of endogenous treatment effects in the presence of heterogeneous responses. A Bayesian Nonparametric approach is taken to model the heterogeneity in treatment effects. Specifically, I adopt the Dirichlet Process Mixture (DPM) model to capture the heterogeneity and show that DPM often outperforms Finite Mixture Model (FMM) in providing more flexible function forms and thus better model fit. Rather than fixing the number of components in a mixture model, DPM allows the data and prior knowledge to determine the number of components in the data, thus providing an automatic mechanism for model selection. Two DPM models are presented in this dissertation. The first DPM model is based on a two-equation selection model. A Dirichlet Process (DP) prior is specified on some or all the parameters of the structural equation, and marginal likelihoods are calculated to select the best DPM model. This model is used to study the incentive and selection effects of having prescription drug coverage on total drug expenditures among Medicare beneficiaries. The second DPM model utilizes a three-equation Roy-type framework to model the observed heterogeneity that arises due to the treatment status, while the unobserved heterogeneity is handled by separate DPM models for the treated and untreated outcomes. This Roy-type DPM model is applied to a data set consisting of 33,081 independent individuals from the Medical Expenditure Panel Survey (MEPS), and the treatment effects of having private medical insurance on the outpatient expenditures are estimated. Key Words: Treatment Effects, Endogeneity, Heterogeneity, Finite Mixture Model, Dirichlet Process Prior, Dirichlet Process Mixture, Roy-type Modeling, Importance Sampling, Bridge Sampling
13

Zerkoune, Abbas. "Modélisation de l'incertitude géologique par simulation stochastique de cubes de proportions de faciès : application aux réservoirs pétroliers de type carbonaté ou silico-clastique". Phd thesis, Grenoble 1, 2009. http://www.theses.fr/2009GRE10104.

Full text
Abstract
After a potential oil field has been discovered, development decisions are based on uncertain representations of the reservoir. Its characterisation relies on numerical, spatial models of the reservoir; even when these are representative of subsoil heterogeneities, the uncertainty linked to subsoil complexity remains. Usually, uncertainty is assumed to be assessed using many equiprobable models, which represent the heterogeneities expected in the reservoir. Nevertheless, those alternative images of the subsurface are merely multiple realizations of one single stochastic model, so such methods ignore the uncertainty related to the choice of the underlying probabilistic model. This work aims at improving that kind of uncertainty assessment when modelling petroleum reservoirs. It carries the doubt about our understanding of subsoil properties up to the probabilistic models and proposes to integrate it at that level. The thesis first defines uncertainty in the context of oil-industry modelling, particularly for 3D geological models comprising several litho-types or facies. Building them requires estimating, before any simulation, the probability of occurrence of each facies at every point in space: this is the proportions cube. Even though these probabilities are often poorly known, current methods of uncertainty assessment keep them frozen, so the impact of an uncertain geological scenario on the definition of the proportions cube is ignored. Two methods based on stochastic simulation of alternative, equiprobable proportion cubes have been developed to sample the complete geological uncertainty space. The first remains closely linked to geology: it directly integrates the uncertainty related to the parameters composing the geological scenario. Based on a multi-realisation approach, it is applied to every parameter of the geological scenario, from information at wells to maps or global hypotheses at reservoir scale. A Monte Carlo approach samples the components of the sedimentary scheme, and each draw is used to build a proportions cube with modelling tools that integrate the scenario parameters more or less explicitly. The methodology is illustrated and applied to a modelling process used for marine carbonate deposits. The second method is more geostatistical and focuses on the proportions cube itself; it rather aims to reconcile the different possible sedimentary models. In the gridded reservoir model, it estimates the distribution of facies proportions cell by cell (the proportions are assumed to follow a Dirichlet distribution) from a few models built under distinct geological scenarios. Facies proportions are then simulated sequentially, cell after cell, introducing a spatial correlation model (variogram) that can be deterministic or probabilistic. Various practical cases, comprising synthetic reservoirs and a real field, illustrate and detail the different steps of the proposed method.
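A toy sketch of the core assumption of the second method, namely cell-wise facies proportions drawn from a Dirichlet distribution, is given below; the concentration parameters are invented for illustration and the sequential simulation with a spatial correlation model (variogram) is deliberately omitted.

```python
# Toy sketch: drawing per-cell facies proportions from a Dirichlet law, as in
# the second method described above. The concentration parameters are
# illustrative and no spatial correlation (variogram) is modelled here.
import numpy as np

rng = np.random.default_rng(0)
# One concentration vector per cell, e.g. estimated from a few proportion
# cubes built under distinct geological scenarios (values are placeholders).
alphas = np.array([
    [8.0, 2.0, 1.0],   # cell dominated by facies 0
    [4.0, 4.0, 2.0],
    [1.0, 6.0, 3.0],
    [2.0, 2.0, 6.0],
    [3.0, 3.0, 3.0],   # no dominant facies
])

proportions = np.vstack([rng.dirichlet(a) for a in alphas])
print(proportions)             # one facies-proportion vector per cell
print(proportions.sum(axis=1)) # each row sums to 1
```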
14

Zerkoune, Abbas. "Modélisation de l'incertitude géologique par simulation stochastique de cubes de proportions de faciès - Application aux réservoirs pétroliers de type carbonaté ou silico-clastique". Phd thesis, Université Joseph Fourier (Grenoble), 2009. http://tel.archives-ouvertes.fr/tel-00410136.

Full text
Abstract
15

Harrysson, Mattias. "Neural probabilistic topic modeling of short and messy text". Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189532.

Full text
Abstract
Exploring massive amounts of user-generated data with topics offers a new way to find useful information. The topics are assumed to be "hidden" and must be "uncovered" by statistical methods such as topic modeling. However, user-generated data is typically short and messy, e.g. informal chat conversations, heavy use of slang words, and "noise" such as URLs or other forms of pseudo-text. This type of data is difficult to process for most natural language processing methods, including topic modeling. This thesis attempts to find, in a comparative study, the approach that objectively gives the better topics from short and messy text. The compared approaches are latent Dirichlet allocation (LDA), Re-organized LDA (RO-LDA), a Gaussian Mixture Model (GMM) with distributed representations of words, and a new approach based on previous work named Neural Probabilistic Topic Modeling (NPTM). It could only be concluded that NPTM has a tendency to achieve better topics on short and messy text than LDA and RO-LDA. GMM, on the other hand, could not produce any meaningful results at all. The results are less conclusive since NPTM suffers from long running times, which prevented enough samples from being obtained for a statistical test.
16

Simonnet, Titouan. "Apprentissage et réseaux de neurones en tomographie par diffraction de rayons X. Application à l'identification minéralogique". Electronic Thesis or Diss., Orléans, 2024. http://www.theses.fr/2024ORLE1033.

Full text
Abstract
Understanding the chemical and mechanical behavior of compacted materials (e.g. soil, subsoil, engineered materials) requires a quantitative description of the material's structure, and in particular the nature of the various mineralogical phases and their spatial relationships. Natural materials, however, are composed of numerous small-sized minerals, frequently mixed on a small scale. Recent advances in synchrotron-based X-ray diffraction tomography (to be distinguished from phase contrast tomography) now make it possible to obtain tomographic volumes with nanometer-sized voxels, with an XRD pattern for each of these voxels (where phase contrast only gives a gray level). On the other hand, the sheer volume of data (typically on the order of 100,000 XRD patterns per sample slice), combined with the large number of phases present, makes quantitative processing virtually impossible without appropriate numerical codes. This thesis aims to fill this gap, using neural network approaches to identify and quantify minerals in a material. Training such models requires the construction of large-scale learning bases, which cannot be made up of experimental data alone. Algorithms capable of synthesizing XRD patterns to generate these bases have therefore been developed. The originality of this work also concerned the inference of proportions using neural networks. To meet this new and complex task, adapted loss functions were designed. The potential of neural networks was tested on data of increasing complexity: (i) from XRD patterns calculated from crystallographic information, (ii) using experimental powder XRD patterns measured in the laboratory, (iii) on data obtained by X-ray tomography. Different neural network architectures were also tested. While a convolutional neural network seemed to provide interesting results, the particular structure of the diffraction signal (which is not translation invariant) led to the use of models such as Transformers. The approach adopted in this thesis has demonstrated its ability to quantify mineral phases in a solid. For more complex data, such as tomography, improvements have been proposed.
17

Johansson, Richard and Heino Otto Engström. "Topic propagation over time in internet security conferences : Topic modeling as a tool to investigate trends for future research". Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-177748.

Full text
Abstract
When conducting research, it is valuable to find high-ranked papers closely related to the specific research area, without spending too much time reading insignificant papers. To make this process more effective an automated process to extract topics from documents would be useful, and this is possible using topic modeling. Topic modeling can also be used to provide topic trends, where a topic is first mentioned, and who the original author was. In this paper, over 5000 articles are scraped from four different top-ranked internet security conferences, using a web scraper built in Python. From the articles, fourteen topics are extracted, using the topic modeling library Gensim and LDA Mallet, and the topics are visualized in graphs to find trends about which topics are emerging and fading away over twenty years. The result found in this research is that topic modeling is a powerful tool to extract topics, and when put into a time perspective, it is possible to identify topic trends, which can be explained when put into a bigger context.
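The trend-extraction step can be sketched as follows, assuming per-document topic distributions have already been inferred (e.g. with Gensim and LDA Mallet as in the thesis): average the topic weights of the papers published in each year and watch which topics rise or fade. The arrays below are placeholders, not the conference data.

```python
# Schematic sketch of topic trends over time: average per-document topic
# weights by publication year. The topic distributions would come from an
# already-trained LDA model; the arrays below are placeholders.
import numpy as np

years = np.array([2001, 2001, 2002, 2002, 2003, 2003])
# One row per paper, one column per topic (rows sum to 1).
doc_topics = np.array([
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.40, 0.40, 0.20],
    [0.30, 0.50, 0.20],
    [0.10, 0.60, 0.30],
    [0.15, 0.55, 0.30],
])

for year in np.unique(years):
    share = doc_topics[years == year].mean(axis=0)
    print(year, np.round(share, 2))   # topic 0 fades, topic 1 rises
```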
18

Malsiner-Walli, Gertraud, Sylvia Frühwirth-Schnatter and Bettina Grün. "Model-based clustering based on sparse finite Gaussian mixtures". Springer, 2016. http://dx.doi.org/10.1007/s11222-014-9500-2.

Full text
Abstract
In the framework of Bayesian model-based clustering based on a finite mixture of Gaussian distributions, we present a joint approach to estimate the number of mixture components and identify cluster-relevant variables simultaneously as well as to obtain an identified model. Our approach consists in specifying sparse hierarchical priors on the mixture weights and component means. In a deliberately overfitting mixture model the sparse prior on the weights empties superfluous components during MCMC. A straightforward estimator for the true number of components is given by the most frequent number of non-empty components visited during MCMC sampling. Specifying a shrinkage prior, namely the normal gamma prior, on the component means leads to improved parameter estimates as well as identification of cluster-relevant variables. After estimating the mixture model using MCMC methods based on data augmentation and Gibbs sampling, an identified model is obtained by relabeling the MCMC output in the point process representation of the draws. This is performed using K-centroids cluster analysis based on the Mahalanobis distance. We evaluate our proposed strategy in a simulation setup with artificial data and by applying it to benchmark data sets. (authors' abstract)
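A hedged sketch of the overfitting-mixture idea is shown below using scikit-learn's variational Bayesian Gaussian mixture: many components are allowed, a very small Dirichlet concentration on the weights empties the superfluous ones, and the number of components keeping non-negligible weight serves as the estimate. Unlike the paper, this uses variational inference rather than MCMC and omits the normal gamma shrinkage prior on the means; data and prior values are illustrative.

```python
# Hedged sketch of a deliberately overfitting finite Gaussian mixture with a
# sparse Dirichlet prior on the weights. scikit-learn's variational inference
# stands in for the paper's MCMC scheme; no normal gamma prior on the means.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.5, size=(150, 2)) for m in (-4.0, 0.0, 4.0)])

sparse_mix = BayesianGaussianMixture(
    n_components=10,                                    # deliberately too many
    weight_concentration_prior_type="dirichlet_distribution",
    weight_concentration_prior=0.001,                   # sparse prior on weights
    max_iter=1000,
    random_state=0,
).fit(X)

print("estimated number of components:", np.sum(sparse_mix.weights_ > 1e-2))
```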
19

Lindgren, Jennifer. "Evaluating Hierarchical LDA Topic Models for Article Categorization". Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-167080.

Full text
Abstract
With the vast amount of information available on the Internet today, helping users find relevant content has become a prioritized task in many software products that recommend news articles. One such product is Opera for Android, which has a news feed containing articles the user may be interested in. In order to easily determine what articles to recommend, they can be categorized by the topics they contain. One approach of categorizing articles is using Machine Learning and Natural Language Processing (NLP). A commonly used model is Latent Dirichlet Allocation (LDA), which finds latent topics within large datasets of for example text articles. An extension of LDA is hierarchical Latent Dirichlet Allocation (hLDA) which is an hierarchical variant of LDA. In hLDA, the latent topics found among a set of articles are structured hierarchically in a tree. Each node represents a topic, and the levels represent different levels of abstraction in the topics. A further extension of hLDA is constrained hLDA, where a set of predefined, constrained topics are added to the tree. The constrained topics are extracted from the dataset by grouping highly correlated words. The idea of constrained hLDA is to improve the topic structure derived by a hLDA model by making the process semi-supervised. The aim of this thesis is to create a hLDA and a constrained hLDA model from a dataset of articles provided by Opera. The models should then be evaluated using the novel metric word frequency similarity, which is a measure of the similarity between the words representing the parent and child topics in a hierarchical topic model. The results show that word frequency similarity can be used to evaluate whether the topics in a parent-child topic pair are too similar, so that the child does not specify a subtopic of the parent. It can also be used to evaluate if the topics are too dissimilar, so that the topics seem unrelated and perhaps should not be connected in the hierarchy. The results also show that the two topic models created had comparable word frequency similarity scores. None of the models seemed to significantly outperform the other with regard to the metric.
20

Le, Hai-Son Phuoc. "Probabilistic Models for Collecting, Analyzing, and Modeling Expression Data". Research Showcase @ CMU, 2013. http://repository.cmu.edu/dissertations/245.

Full text
Abstract
Advances in genomics allow researchers to measure the complete set of transcripts in cells. These transcripts include messenger RNAs (which encode for proteins) and microRNAs, short RNAs that play an important regulatory role in cellular networks. While this data is a great resource for reconstructing the activity of networks in cells, it also presents several computational challenges. These challenges include the data collection stage, which often results in incomplete and noisy measurements, developing methods to integrate several experiments within and across species, and designing methods that can use this data to map the interactions and networks that are activated in specific conditions. Novel and efficient algorithms are required to successfully address these challenges. In this thesis, we present probabilistic models to address the set of challenges associated with expression data. First, we present a novel probabilistic error correction method for RNA-Seq reads. RNA-Seq generates large and comprehensive datasets that have revolutionized our ability to accurately recover the set of transcripts in cells. However, sequencing reads inevitably contain errors, which affect all downstream analyses. To address these problems, we develop an efficient hidden Markov model-based error correction method for RNA-Seq data. Second, for the analysis of expression data across species, we develop clustering and distance function learning methods for querying large expression databases. The methods use a Dirichlet Process Mixture Model with latent matchings and infer soft assignments between genes in two species to allow comparison and clustering across species. Third, we introduce new probabilistic models to integrate expression and interaction data in order to predict targets and networks regulated by microRNAs. Combined, the methods developed in this thesis provide a solution to the pipeline of expression analysis used by experimentalists when performing expression experiments.
21

Apelthun, Catharina. "Topic modeling on a classical Swedish text corpus of prose fiction : Hyperparameters’ effect on theme composition and identification of writing style". Thesis, Uppsala universitet, Statistiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-441653.

Full text
Abstract
A topic modeling method, smoothed Latent Dirichlet Allocation (LDA), is applied to a text corpus of classical Swedish prose fiction. The thesis consists of two parts. In the first part, a smoothed LDA model is applied to the corpus, investigating how changes in hyperparameter values affect the topics in terms of the distribution of words within topics and of topics within novels. In the second part, two smoothed LDA models are applied to a reduced corpus consisting only of adjectives. The generated topics are examined to see whether they are more likely to occur in texts by a particular author and whether the model could be used for identification of writing style. With this new approach, the ability of the smoothed LDA model to act as a writing-style identifier is explored. While the texts analyzed in this thesis are unusually long, as they are not segmented prose fiction, the effect of the hyperparameters on model performance was found to be similar to that reported in previous research. For the adjectives corpus, the models did succeed in generating topics with a higher probability of occurring in novels by the same author. The smoothed LDA was shown to be a good model for identification of writing style. Keywords: Topic modeling, Smoothed Latent Dirichlet Allocation, Gibbs sampling, MCMC, Bayesian statistics, Swedish prose fiction.
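A small numpy illustration of why these hyperparameters matter follows: the concentration of the document-topic Dirichlet controls how peaked each document's topic mixture is. This mimics only the prior, not the full smoothed LDA fitted by Gibbs sampling in the thesis, and the values of alpha are arbitrary.

```python
# Illustration of the document-topic hyperparameter's effect: a Dirichlet with
# small concentration yields sparse (peaked) topic mixtures, a large one yields
# spread-out mixtures. This mimics only the prior, not the fitted LDA model.
import numpy as np

rng = np.random.default_rng(0)
num_topics = 10

for alpha in (0.01, 0.1, 1.0, 10.0):
    theta = rng.dirichlet([alpha] * num_topics, size=1000)
    # Average weight of each document's dominant topic: near 1.0 means sparse.
    print(f"alpha={alpha:5}: mean top-topic weight = {theta.max(axis=1).mean():.2f}")
```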
22

Khan, Mohammed Salman. "A Topic Modeling approach for Code Clone Detection". UNF Digital Commons, 2019. https://digitalcommons.unf.edu/etd/874.

Full text
Abstract
In this thesis work, the potential benefits of Latent Dirichlet Allocation (LDA) as a technique for code clone detection have been described. The objective is to propose a language-independent, effective, and scalable approach for identifying similar code fragments in relatively large software systems. The main assumption is that the latent topic structure of software artifacts gives an indication of the presence of code clones. It can be hypothesized that artifacts with similar topic distributions contain duplicated code fragments, and to prove this hypothesis, an experimental investigation using multiple datasets from various application domains was conducted. In addition, CloneTM, an LDA-based working prototype for code clone detection, was developed. Results showed that, if calibrated properly, topic modeling can deliver a satisfactory performance in capturing different types of code clones, showing particularly good performance in detecting Type III clones. CloneTM also achieved levels of performance comparable to already existing practical tools that adopt different clone detection strategies.
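The comparison step behind this hypothesis can be sketched as follows: once every artifact has an LDA topic distribution, pairs whose distributions lie within a small distance of one another become clone candidates. The Hellinger distance and the threshold used below are illustrative assumptions, not necessarily the measure implemented in CloneTM.

```python
# Toy sketch of comparing artifacts by their topic distributions. Files whose
# distributions are very close are flagged as clone candidates. The Hellinger
# distance and the threshold are illustrative choices.
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Placeholder per-file topic distributions (e.g. inferred by an LDA model).
topic_dists = {
    "Parser.java":     np.array([0.80, 0.15, 0.05]),
    "ParserCopy.java": np.array([0.78, 0.17, 0.05]),
    "Renderer.java":   np.array([0.10, 0.20, 0.70]),
}

names = list(topic_dists)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        d = hellinger(topic_dists[names[i]], topic_dists[names[j]])
        if d < 0.1:   # illustrative threshold
            print(f"clone candidate: {names[i]} ~ {names[j]} (d={d:.3f})")
```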
23

Park, Kyoung Jin. "Generating Thematic Maps from Hyperspectral Imagery Using a Bag-of-Materials Model". The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1366296426.

Full text
24

Sui, Zhenhuan. "Hierarchical Text Topic Modeling with Applications in Social Media-Enabled Cyber Maintenance Decision Analysis and Quality Hypothesis Generation". The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1499446404436637.

Full text
25

Cedervall, Andreas and Daniel Jansson. "Topic classification of Monetary Policy Minutes from the Swedish Central Bank". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-240403.

Full text
Abstract
Over the last couple of years, Machine Learning has seen a very large increase in usage. Many previously manual tasks are becoming automated, and it stands to reason that this development will continue at an incredible pace. This paper builds on work in Topic Classification and attempts to provide a baseline for analysing the Swedish Central Bank Minutes and gathering information using both Latent Dirichlet Allocation and a simple Neural Network. Topic Classification is done on Monetary Policy Minutes from 2004 to 2018 to find how the distributions of topics change over time. The results are compared to empirical evidence that would confirm trends. Finally, a business perspective of the work is analysed to reveal what the benefits of implementing this type of technique could be. The results of the two methods are compared and they differ: specifically, the Neural Network shows larger changes in topic distributions than Latent Dirichlet Allocation. The neural network also proved to yield more trends that correlated with other observations, such as the start of bond purchasing by the Swedish Central Bank. Thus, our results indicate that a Neural Network would perform better than Latent Dirichlet Allocation when analyzing Swedish Monetary Policy Minutes.
26

Schneider, Bruno. "Visualização em multirresolução do fluxo de tópicos em coleções de texto". reponame:Repositório Institucional do FGV, 2014. http://hdl.handle.net/10438/11745.

Full text
Abstract
The combined use of algorithms for topic discovery in document collections with topic flow visualization techniques allows the exploration of thematic patterns in large corpora through compact visual representations. This research investigated the requirements for viewing data about the thematic composition of documents obtained through topic modeling (data that are sparse and multi-attribute) at different levels of detail, comparatively, through the development of a custom visualization technique and the use of an open-source data visualization library. For the studied problem of topic flow visualization, we observed conflicting requirements for displaying the data at different resolutions, which led to a detailed investigation of ways of manipulating and displaying these data. The hypothesis put forward in this study was that the integrated use of more than one visualization technique, chosen according to the resolution of the data, expands the possibilities for exploring the object under study compared with what would be obtained using a single technique. Setting out the limits of these techniques as a function of the data-exploration resolution is the main contribution of this work, intended to support the development of new applications.
O uso combinado de algoritmos para a descoberta de tópicos em coleções de documentos com técnicas orientadas à visualização da evolução daqueles tópicos no tempo permite a exploração de padrões temáticos em corpora extensos a partir de representações visuais compactas. A pesquisa em apresentação investigou os requisitos de visualização do dado sobre composição temática de documentos obtido através da modelagem de tópicos – o qual é esparso e possui multiatributos – em diferentes níveis de detalhe, através do desenvolvimento de uma técnica de visualização própria e pelo uso de uma biblioteca de código aberto para visualização de dados, de forma comparativa. Sobre o problema estudado de visualização do fluxo de tópicos, observou-se a presença de requisitos de visualização conflitantes para diferentes resoluções dos dados, o que levou à investigação detalhada das formas de manipulação e exibição daqueles. Dessa investigação, a hipótese defendida foi a de que o uso integrado de mais de uma técnica de visualização de acordo com a resolução do dado amplia as possibilidades de exploração do objeto em estudo em relação ao que seria obtido através de apenas uma técnica. A exibição dos limites no uso dessas técnicas de acordo com a resolução de exploração do dado é a principal contribuição desse trabalho, no intuito de dar subsídios ao desenvolvimento de novas aplicações.
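As a hedged illustration of the topic-flow idea discussed above, the following sketch aggregates per-document topic weights into yearly bins and draws them as stacked areas with matplotlib. The document-topic matrix and dates are synthetic placeholders, and this is not the visualization technique developed in the dissertation.

```python
# Minimal sketch of a topic-flow view: aggregate document-topic weights per
# time bin and draw them as stacked areas. All data below are synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_docs, n_topics = 200, 4
theta = rng.dirichlet(np.ones(n_topics), size=n_docs)   # doc-topic proportions
dates = np.sort(rng.integers(2000, 2014, size=n_docs))  # one year per document

years = np.arange(dates.min(), dates.max() + 1)
flow = np.array([theta[dates == y].mean(axis=0) if (dates == y).any()
                 else np.zeros(n_topics) for y in years]).T

plt.stackplot(years, flow, labels=[f"topic {k}" for k in range(n_topics)])
plt.legend(loc="upper left")
plt.xlabel("year")
plt.ylabel("mean topic proportion")
plt.show()
```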
Los estilos APA, Harvard, Vancouver, ISO, etc.
27

Moon, Gordon Euhyun. "Parallel Algorithms for Machine Learning". The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1561980674706558.

Texto completo
Los estilos APA, Harvard, Vancouver, ISO, etc.
28

Chiron, Guillaume. "Système complet d’acquisition vidéo, de suivi de trajectoires et de modélisation comportementale pour des environnements 3D naturellement encombrés : application à la surveillance apicole". Thesis, La Rochelle, 2014. http://www.theses.fr/2014LAROS030/document.

Texto completo
Resumen
Ce manuscrit propose une approche méthodologique pour la constitution d’une chaîne complète de vidéosurveillance pour des environnements naturellement encombrés. Nous identifions et levons un certain nombre de verrous méthodologiques et technologiques inhérents : 1) à l’acquisition de séquences vidéo en milieu naturel, 2) au traitement d’images, 3) au suivi multi-cibles, 4) à la découverte et la modélisation de motifs comportementaux récurrents, et 5) à la fusion de données. Le contexte applicatif de nos travaux est la surveillance apicole, et en particulier, l’étude des trajectoires des abeilles en vol devant la ruche. De ce fait, cette thèse se présente également comme une étude de faisabilité et de prototypage dans le cadre des deux projets interdisciplinaires EPERAS et RISQAPI (projets menées en collaboration avec l’INRA Magneraud et le Muséum National d’Histoire Naturelle). Il s’agit pour nous informaticiens et pour les biologistes qui nous ont accompagnés, d’un domaine d’investigation totalement nouveau, pour lequel les connaissances métiers, généralement essentielles à ce genre d’applications, restent encore à définir. Contrairement aux approches existantes de suivi d’insectes, nous proposons de nous attaquer au problème dans l’espace à trois dimensions grâce à l’utilisation d’une caméra stéréovision haute fréquence. Dans ce contexte, nous détaillons notre nouvelle méthode de détection de cibles appelée segmentation HIDS. Concernant le calcul des trajectoires, nous explorons plusieurs approches de suivi de cibles, s’appuyant sur plus ou moins d’a priori, susceptibles de supporter les conditions extrêmes de l’application (e.g. cibles nombreuses, de petite taille, présentant un mouvement chaotique). Une fois les trajectoires collectées, nous les organisons selon une structure de données hiérarchique et mettons en œuvre une approche Bayésienne non-paramétrique pour la découverte de comportements émergents au sein de la colonie d’insectes. L’analyse exploratoire des trajectoires issues de la scène encombrée s’effectue par classification non supervisée, simultanément sur des niveaux sémantiques différents, et où le nombre de clusters pour chaque niveau n’est pas défini a priori mais est estimé à partir des données. Cette approche est dans un premier temps validée à l’aide d’une pseudo-vérité terrain générée par un Système Multi-Agents, puis dans un deuxième temps appliquée sur des données réelles
This manuscript provides the basis for a complete chain of video surveillance for naturally cluttered environments. We identify and address the wide spectrum of methodological and technological barriers inherent to: 1) the acquisition of video sequences in natural conditions, 2) the image processing problems, 3) the multi-target tracking ambiguities, 4) the discovery and modeling of recurring behavioral patterns, and 5) the data fusion. The application context of our work is the monitoring of honeybees, and in particular the study of the trajectories of bees in flight in front of their hive. This thesis is part of a feasibility and prototyping study carried out within the two interdisciplinary projects EPERAS and RISQAPI (undertaken in collaboration with the INRA institute and the French National Museum of Natural History). For us computer scientists, and for the biologists who accompanied us, it is a completely new area of investigation for which the domain knowledge, usually essential for such applications, is still in its infancy. Unlike existing approaches for monitoring insects, we propose to tackle the problem in three-dimensional space through the use of a high-frequency stereo camera. In this context, we detail our new target detection method, which we call HIDS segmentation. Concerning the computation of trajectories, we explore several tracking approaches, relying to varying degrees on a priori knowledge, which are able to deal with the extreme conditions of the application (e.g. many targets, small in size, following chaotic movements). Once the trajectories are collected, we organize them according to a hierarchical data structure and apply a Bayesian nonparametric approach for discovering emergent behaviors within the colony of insects. The exploratory analysis of the trajectories generated by the crowded scene is performed with an unsupervised classification method, simultaneously over different semantic levels, where the number of clusters for each level is not defined a priori but is estimated from the data only. This approach is first validated using a pseudo ground truth generated by a Multi-Agent System, and then applied to real data.
Los estilos APA, Harvard, Vancouver, ISO, etc.
29

Ladouceur, Martin. "Modelling continuous diagnostic test data using Dirichlet process prior distributions". Thesis, McGill University, 2009. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=95623.

Texto completo
Resumen
Diagnostic tests are widely used in medicine and epidemiology. Most diagnostic tests are imperfect at distinguishing subjects with and without the condition of interest, and many that provide results on a continuous scale have overlapping densities from diseased and non-diseased subjects. For these continuous tests, most statistical techniques developed to date assume a parametric (e.g. normal) family for the distribution of the continuous outcomes within groups, an often unverifiable but convenient distributional assumption. In addition, evaluating the properties of these tests typically requires or assumes a perfect gold standard test is available. [...]
Les tests diagnostiques sont abondamment utilisés en médecine et en épidémiologie. La plupart d'entre eux ne distinguent pas parfaitement les sujets qui ont ou non la condition d'intérêt, et ceux qui fournissent des résultats sur une échelle continue ont souvent des densités des résultats des sujets malades et non-malades qui se chevauchent. Pour ces tests continus, la plupart des techniques statistiques développées jusqu'à présent présument une famille de distributions paramétriques des résultats dans les 2 groupes, une hypothèse pratique mais souvent non vérifiable. De plus, l'évaluation de leurs propriétés requiert typiquement qu'un test étalon d'or soit disponible. [...]
Los estilos APA, Harvard, Vancouver, ISO, etc.
30

Jaradat, Shatha. "OLLDA: Dynamic and Scalable Topic Modelling for Twitter : AN ONLINE SUPERVISED LATENT DIRICHLET ALLOCATION ALGORITHM". Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177535.

Texto completo
Resumen
Providing high-quality topic inference in today's large and dynamic corpora, such as Twitter, is a challenging task, especially considering that the content in this environment consists of short texts and many abbreviations. This project proposes an improvement of a popular online topic modelling algorithm for Latent Dirichlet Allocation (LDA) by incorporating supervision to make it suitable for the Twitter context. This improvement is motivated by the need for a single algorithm that achieves both objectives: analyzing huge amounts of documents, including new documents arriving in a stream, and, at the same time, achieving high-quality topic detection in special-case environments, such as Twitter. The proposed algorithm is a combination of an online algorithm for LDA and a supervised variant of LDA - labeled LDA. The performance and quality of the proposed algorithm are compared with these two algorithms. The results demonstrate that the proposed algorithm shows better performance and quality than the supervised variant of LDA, and achieves better results in terms of quality than the online algorithm. These improvements make our algorithm an attractive option when applied to dynamic environments, like Twitter. An environment for analyzing and labelling data was designed to prepare the dataset before executing the experiments. Possible application areas for the proposed algorithm are tweet recommendation and trend detection.
Tillhandahålla högkvalitativa ämnen slutsats i dagens stora och dynamiska korpusar, såsom Twitter, är en utmanande uppgift. Detta är särskilt utmanande med tanke på att innehållet i den här miljön innehåller korta texter och många förkortningar. Projektet föreslår en förbättring med en populär online ämnen modellering algoritm för Latent Dirichlet Tilldelning (LDA), genom att införliva tillsyn för att göra den lämplig för Twitter sammanhang. Denna förbättring motiveras av behovet av en enda algoritm som uppnår båda målen: analysera stora mängder av dokument, inklusive nya dokument som anländer i en bäck, och samtidigt uppnå hög kvalitet på ämnen "upptäckt i speciella fall miljöer, till exempel som Twitter. Den föreslagna algoritmen är en kombination av en online-algoritm för LDA och en övervakad variant av LDA - Labeled LDA. Prestanda och kvalitet av den föreslagna algoritmen jämförs med dessa två algoritmer. Resultaten visar att den föreslagna algoritmen har visat bättre prestanda och kvalitet i jämförelse med den övervakade varianten av LDA, och det uppnådde bättre resultat i fråga om kvalitet i jämförelse med den online-algoritmen. Dessa förbättringar gör vår algoritm till ett attraktivt alternativ när de tillämpas på dynamiska miljöer, som Twitter. En miljö för att analysera och märkning uppgifter är utformad för att förbereda dataset innan du utför experimenten. Möjliga användningsområden för den föreslagna algoritmen är tweets rekommendation och trender upptäckt.
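To make the streaming half of this idea concrete, the hedged sketch below uses scikit-learn's online variational LDA via partial_fit on mini-batches, as a stand-in for the online LDA the thesis builds on; the supervised (labeled LDA) extension is not shown, and the tweet batches are invented placeholders.

```python
# Sketch of online (streaming) LDA with mini-batch updates. The hashing
# vectorizer keeps the feature space fixed as new batches arrive.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vec = HashingVectorizer(n_features=2**12, alternate_sign=False,
                        stop_words="english")
lda = LatentDirichletAllocation(n_components=10, learning_method="online",
                                random_state=0)

tweet_batches = [
    ["rate hike expected says central bank", "new phone launch next week"],
    ["bond purchases announced today", "great match last night"],
]
for batch in tweet_batches:            # batches arriving from a stream
    X = vec.transform(batch)
    lda.partial_fit(X)                 # incremental update of the topics

print(lda.transform(vec.transform(["central bank keeps rate unchanged"])))
```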
Los estilos APA, Harvard, Vancouver, ISO, etc.
31

Habli, Nada. "Nonparametric Bayesian Modelling in Machine Learning". Thesis, Université d'Ottawa / University of Ottawa, 2016. http://hdl.handle.net/10393/34267.

Texto completo
Resumen
Nonparametric Bayesian inference has widespread applications in statistics and machine learning. In this thesis, we examine the most popular priors used in Bayesian nonparametric inference. The Dirichlet process and its extensions are priors on an infinite-dimensional space. Originally introduced by Ferguson (1973), its conjugacy property allows tractable posterior inference, which has lately given rise to significant developments in applications related to machine learning. Another widespread prior used in nonparametric Bayesian inference is the Beta process and its extensions. It was originally introduced by Hjort (1990) for applications in survival analysis. It is a prior on the space of cumulative hazard functions, and it has recently been widely used as a prior on an infinite-dimensional space for latent feature models. Our contribution in this thesis is to collect many diverse groups of nonparametric Bayesian tools and explore algorithms to sample from them. We also explore the machinery behind the theory in order to apply these procedures and expose some of their distinguishing features. These tools can be used by practitioners in many applications.
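One of the standard sampling algorithms for the Dirichlet process is the stick-breaking construction; the minimal numpy sketch below draws a truncated realization of DP(alpha, H) with a standard normal base measure. The truncation level, concentration parameter, and base measure are illustrative choices, not values from the thesis.

```python
# Stick-breaking construction of a Dirichlet process DP(alpha, H): weights
# follow GEM(alpha) and atoms are i.i.d. draws from the base measure H.
import numpy as np

def stick_breaking(alpha, truncation, rng):
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining              # w_k = beta_k * prod_{j<k}(1-beta_j)
    atoms = rng.normal(0.0, 1.0, size=truncation)   # draws from H = N(0, 1)
    return weights, atoms

rng = np.random.default_rng(1)
w, theta = stick_breaking(alpha=2.0, truncation=100, rng=rng)
print("total mass captured by truncation:", w.sum().round(4))
```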
Los estilos APA, Harvard, Vancouver, ISO, etc.
32

Halmann, Marju. "Email Mining Classifier : The empirical study on combining the topic modelling with Random Forest classification". Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-14710.

Texto completo
Resumen
Filtering out and replying automatically to emails is of interest to many but is hard due to the complexity of language and to dependencies on background information that is not present in the email itself. This paper investigates whether Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier can be used for the more general email classification task and how it compares to other existing email classifiers. The comparison is based on a literature study and on empirical experimentation using two real-life datasets. Firstly, a literature study is performed to gain insight into the accuracy of other available email classifiers. Secondly, the proposed model's accuracy is explored experimentally. The literature study shows that the accuracy of more general email classifiers differs greatly across user sets. The proposed model's accuracy is within the reported range, although at the lower end, indicating that it performs poorly compared to other classifiers. On average, classifier performance improves by 15 percentage points with additional information. This indicates that Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier is promising; however, future studies are needed to explore the model and ways to further increase its accuracy.
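A minimal sketch of the kind of pipeline the abstract describes, LDA topic proportions fed into a Random Forest classifier, is given below, assuming scikit-learn; the toy emails and labels are invented placeholders rather than the study's datasets.

```python
# Minimal sketch: bag-of-words -> LDA topic features -> Random Forest.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

emails = ["meeting moved to friday please confirm",
          "invoice attached payment due next week",
          "lunch on thursday?",
          "quarterly report figures attached"]
labels = ["internal", "finance", "internal", "finance"]   # placeholder classes

clf = Pipeline([
    ("bow", CountVectorizer()),
    ("lda", LatentDirichletAllocation(n_components=2, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
clf.fit(emails, labels)
print(clf.predict(["please see the attached invoice"]))
```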
Los estilos APA, Harvard, Vancouver, ISO, etc.
33

Déhaye, Vincent. "Characterisation of a developer’s experience fields using topic modelling". Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-171946.

Texto completo
Resumen
Finding the most relevant candidate for a position is a ubiquitous challenge for organisations. It can also be arduous for a candidate to convey their experience on a concise resume. Because candidates usually have to select which experience to expose and filter out the rest, relevant experience might not be detected by the person carrying out the search even though the candidate does possess it. In the field of software engineering, developing one's experience usually leaves traces behind: the code one has produced. This project explores approaches to tackle these screening challenges with an automated way of extracting experience directly from code, by defining common lexical patterns in code for different experience fields using topic modeling. Two techniques were compared. On one hand, Latent Dirichlet Allocation (LDA) is a generative statistical model which has proven to yield good results in topic modeling. On the other hand, Non-Negative Matrix Factorization (NMF) is simply a factorisation of a matrix representing the code corpus as word counts per piece of code. The code gathered consisted of 30 random repositories from the collaborators of the open-source Ruby-on-Rails project on GitHub, to which common natural language processing transformation steps were then applied. The results of the two techniques were compared using perplexity for LDA, reconstruction error for NMF, and topic coherence for both. The first two measure how well the data can be represented by the topics produced, while the latter estimates how well the elements of a topic hang and fit together, and can reflect human understandability and interpretability. Given that we did not have any similar work to benchmark against, the performance of the values obtained is hard to assess scientifically. However, the method seems promising, as we would have been rather confident in assigning labels to 10 of the topics generated. The results imply that one could probably use natural language processing methods directly on code production in order to extend the detected fields of experience of a developer, with a finer granularity than traditional resumes and with field definitions evolving dynamically with the technology.
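To make the comparison concrete, the hedged sketch below fits both techniques with scikit-learn and reports the two model-specific scores mentioned above (perplexity for LDA, reconstruction error for NMF); the toy "code documents" stand in for the preprocessed repository contents, and topic coherence is omitted.

```python
# Sketch comparing LDA (perplexity) and NMF (reconstruction error) on a tiny
# placeholder corpus of tokenised source-code snippets.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = ["def render view template html request response",
        "select insert update table index query database",
        "thread lock mutex async await pool worker"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print("LDA perplexity:", round(lda.perplexity(counts), 2))

tfidf = TfidfVectorizer().fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(tfidf)
print("NMF reconstruction error:", round(nmf.reconstruction_err_, 4))
```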
Los estilos APA, Harvard, Vancouver, ISO, etc.
34

Chazel, Florent. "Influence de la topographie sur les ondes de surface". Phd thesis, Université Sciences et Technologies - Bordeaux I, 2007. http://tel.archives-ouvertes.fr/tel-00200419.

Texto completo
Resumen
In this thesis, we consider the free-surface Euler problem on a domain with a non-flat bottom, in the regime of long waves of small amplitude. The objective is to construct, justify, and compare new asymptotic models for this problem that account for the effects of bathymetric variations. First, we rigorously construct two classes of symmetric Boussinesq models for two distinct topographic regimes, one of small bathymetric variations and one of strong variations. Second, in the case of small topographic variations we recover and discuss the classical Korteweg-de Vries approximation, and propose a new approximation obtained by adding bathymetric terms. In a third part, these two models, as well as the Boussinesq models constructed in the first part, are simulated numerically and compared on topographic test cases. Finally, we present a numerical study of the Green-Naghdi equations, whose range of physical validity is wider, together with a numerical comparison of this model with the previous ones on specific bathymetries.
Los estilos APA, Harvard, Vancouver, ISO, etc.
35

Bakharia, Aneesha. "Interactive content analysis : evaluating interactive variants of non-negative Matrix Factorisation and Latent Dirichlet Allocation as qualitative content analysis aids". Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/76535/1/Aneesha_Bakharia_Thesis.pdf.

Texto completo
Resumen
This thesis addressed issues that have prevented qualitative researchers from using thematic discovery algorithms. The central hypothesis evaluated whether allowing qualitative researchers to interact with thematic discovery algorithms and incorporate domain knowledge improved their ability to address research questions and trust the derived themes. Non-negative Matrix Factorisation and Latent Dirichlet Allocation find latent themes within document collections but these algorithms are rarely used, because qualitative researchers do not trust and cannot interact with the themes that are automatically generated. The research determined the types of interactivity that qualitative researchers require and then evaluated interactive algorithms that matched these requirements. Theoretical contributions included the articulation of design guidelines for interactive thematic discovery algorithms, the development of an Evaluation Model and a Conceptual Framework for Interactive Content Analysis.
Los estilos APA, Harvard, Vancouver, ISO, etc.
36

Aspilaire, Roseman. "Économie informelle en Haïti, marché du travail et pauvreté : analyses quantitatives". Thesis, Paris Est, 2017. http://www.theses.fr/2017PESC0122/document.

Texto completo
Resumen
La prédominance de l’informel dans l’économie d’Haïti, où plus de 80% de la population vit en dessous du seuil de la pauvreté et plus de 35% au chômage, laisse entrevoir des liens étroits entre l’économie informelle, la pauvreté et le marché du travail. Faire ressortir ces interrelations, exige une évaluation de cette économie informelle qui fait l’objet des quatre chapitres de notre thèse traitant successivement l’évolution de la situation macroéconomique, le capital humain, les gains des travailleurs informels, et la segmentation du marché du travail.Le premier chapitre fait un diagnostic du phénomène selon l’état des lieux des théories élaborées et l’évolution du cadre macro-économique d’Haïti de 1980 à 2010 et propose une évaluation macroéconomique de l’informel à partir d’un modèle PLS (Partial Least Squares) en pourcentage du PIB.Le chapitre deux établit les relations entre l’évolution de l’économie informelle, dérégulation et politiques néolibérales grâce à un modèle LISREL (Linear Structural Relations). Nous examinons les incidences des politiques fiscales, budgétaires et monétaires des 30 dernières années sur l’économie informelle. Nous réévaluons aussi les causes de l’évolution de l’informel généralement évoquées par les études empiriques (taxes, sécurité sociale).Au chapitre trois, nous analysons la dimension micro-réelle de l’informel grâce à un modèle des gains à la Mincer estimé par les équations logit à partir des données d’une enquête nationale sur l’emploi et l’économie informelle (EEEI) de 2007. Nous analysons les déterminants des gains informels au regard de la position des travailleurs sur le marché (salariés, entrepreneurs et indépendants) ; et les revenus (formels et informels) et les caractéristiques socioéconomiques des travailleurs pauvres et non-pauvres par rapport au seuil de pauvreté.Au chapitre quatre, nous testons d’abord la compétitivité et la segmentation du marché de l’emploi en faisant usage de modèle de Roy et du modèle de Roy élargi à travers une estimation d’un modèle Tobit. Nous utilisons un modèle de Processus de Dirichlet : d’abord analyser la segmentation et la compétitivité éventuelle du marché du travail informel ainsi que ses déterminants, selon les données de l’EEEI-2007 ; ensuite, pour distinguer les caractéristiques fondamentales des informels involontaires (exclus du marché du travail formel) de celles des informels volontaires qui en retirent des avantages comparatifs
The predominance of the informal sector in the economy of Haiti, where more than 80% of the population lives below the poverty line and more than 35% is unemployed, suggests links between the informal economy, poverty and the labour market. Highlighting these interrelationships requires an assessment of the informal economy, which is the subject of the four chapters of this thesis, dealing successively with the evolution of the macroeconomic situation, human capital, the earnings of informal workers, and the segmentation of the labour market. The first chapter diagnoses the phenomenon in light of existing theories and the evolution of Haiti's macroeconomic framework from 1980 to 2010, and then offers a macroeconomic assessment of the informal sector as a percentage of GDP using a PLS (Partial Least Squares) model. Chapter two sets out the relationship between the evolution of the informal economy, deregulation and neoliberal policies through a LISREL (Linear Structural Relations) model. We look at the impact of the budgetary, fiscal and monetary policies of the past 30 years on the informal economy, and we also reassess the causes of the evolution of the informal economy generally evoked by empirical studies (taxes, social security). In chapter three, we analyse the micro-level dimension of the informal economy through a Mincer-type earnings model estimated by logit equations from data in a national survey on employment and the informal economy (EEEI) conducted in 2007. We analyse the determinants of informal earnings in terms of workers' position in the market (employees, entrepreneurs and self-employed), and compare the incomes (formal and informal) and socio-economic characteristics of poor and non-poor workers relative to the poverty line. In chapter four, we first test the competitiveness and segmentation of the labour market using the Roy model and the extended Roy model through estimation of a Tobit model. We then use a Dirichlet process model: first, to analyse the segmentation and possible competitiveness of the informal labour market and its determinants, according to data from the EEEI-2007; and second, to distinguish the fundamental characteristics of involuntary informal workers (excluded from the formal labour market) from those of voluntary informal workers who derive comparative advantages from informality.
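As a hedged illustration of the Dirichlet process step described in chapter four, the sketch below uses scikit-learn's BayesianGaussianMixture with a Dirichlet process weight prior, which lets the effective number of labour-market segments be inferred from the data; the synthetic earnings/hours matrix is a placeholder for the EEEI-2007 survey variables, and this is not the thesis's actual estimation procedure.

```python
# Dirichlet-process-style segmentation via a truncated variational DP mixture.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
formal = rng.normal([8.0, 40.0], [0.3, 4.0], size=(150, 2))    # log-wage, hours
informal = rng.normal([6.5, 55.0], [0.6, 8.0], size=(150, 2))
X = np.vstack([formal, informal])

dpgmm = BayesianGaussianMixture(
    n_components=10,                                   # truncation level
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
)
labels = dpgmm.fit_predict(X)
print("segments actually used:", np.unique(labels).size)
```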
Los estilos APA, Harvard, Vancouver, ISO, etc.
37

White, Nicole. "Bayesian mixtures for modelling complex medical data : a case study in Parkinson’s disease". Thesis, Queensland University of Technology, 2011. https://eprints.qut.edu.au/48202/1/Nicole_White_Thesis.pdf.

Texto completo
Resumen
Mixture models are a flexible tool for unsupervised clustering that have found popularity in a vast array of research areas. In studies of medicine, the use of mixtures holds the potential to greatly enhance our understanding of patient responses through the identification of clinically meaningful clusters that, given the complexity of many data sources, may otherwise be intangible. Furthermore, when developed in the Bayesian framework, mixture models provide a natural means for capturing and propagating uncertainty in different aspects of a clustering solution, arguably resulting in richer analyses of the population under study. This thesis aims to investigate the use of Bayesian mixture models in analysing varied and detailed sources of patient information collected in the study of complex disease. The first aim of this thesis is to showcase the flexibility of mixture models in modelling markedly different types of data. In particular, we examine three common variants of the mixture model, namely, finite mixtures, Dirichlet Process mixtures and hidden Markov models. Beyond the development and application of these models to different sources of data, this thesis also focuses on modelling different aspects relating to uncertainty in clustering. Examples of clustering uncertainty considered are uncertainty in a patient's true cluster membership and uncertainty in the true number of clusters present. Finally, this thesis aims to address and propose solutions to the task of comparing clustering solutions, whether this be comparing patients or observations assigned to different subgroups or comparing clustering solutions over multiple datasets. To address these aims, we consider a case study in Parkinson's disease (PD), a complex and commonly diagnosed neurodegenerative disorder. In particular, two commonly collected sources of patient information are considered. The first source of data concerns symptoms associated with PD, recorded using the Unified Parkinson's Disease Rating Scale (UPDRS), and constitutes the first half of this thesis. The second half of this thesis is dedicated to the analysis of microelectrode recordings collected during Deep Brain Stimulation (DBS), a popular palliative treatment for advanced PD. Analysis of this second source of data centers on the problems of unsupervised detection and sorting of action potentials or "spikes" in recordings of multiple cell activity, providing valuable information on real-time neural activity in the brain.
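For concreteness, the sketch below shows one of the three mixture variants named in the abstract, a Gaussian hidden Markov model, segmenting a synthetic one-dimensional signal into latent states. It assumes the third-party hmmlearn package, and the data are invented stand-ins for the recordings analysed in the thesis.

```python
# Hedged sketch: a 2-state Gaussian HMM fitted to a synthetic signal that
# alternates between a quiet and an active regime.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
quiet = rng.normal(0.0, 0.5, size=(300, 1))
active = rng.normal(3.0, 1.0, size=(300, 1))
signal = np.vstack([quiet, active, quiet])

hmm = GaussianHMM(n_components=2, covariance_type="diag", n_iter=50,
                  random_state=0)
hmm.fit(signal)
states = hmm.predict(signal)               # most likely state sequence
print("estimated state means:", hmm.means_.ravel().round(2))
```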
Los estilos APA, Harvard, Vancouver, ISO, etc.
38

Ficapal, Vila Joan. "Anemone: a Visual Semantic Graph". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-252810.

Texto completo
Resumen
Semantic graphs have been used for optimizing various natural language processing tasks as well as augmenting search and information retrieval tasks. In most cases these semantic graphs have been constructed through supervised machine learning methodologies that depend on manually curated ontologies such as Wikipedia or similar. In this thesis, which consists of two parts, we explore in the first part the possibility of automatically populating a semantic graph from an ad hoc data set of 50 000 newspaper articles in a completely unsupervised manner. The utility of the visual representation of the resulting graph is tested on 14 human subjects performing basic information retrieval tasks on a subset of the articles. Our study shows that, for entity finding and document similarity, our feature engineering is viable and the visual map produced by our artifact is visually useful. In the second part, we explore the possibility of identifying entity relationships in an unsupervised fashion by employing abstractive deep learning methods for sentence reformulation. The reformulated sentence structures are qualitatively assessed with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects. The outcomes of this second part are evaluated negatively: they were not good enough to support any definitive conclusion, but they have opened new doors to explore.
Semantiska grafer har använts för att optimera olika processer för naturlig språkbehandling samt för att förbättra sökoch informationsinhämtningsuppgifter. I de flesta fall har sådana semantiska grafer konstruerats genom övervakade maskininlärningsmetoder som förutsätter manuellt kurerade ontologier såsom Wikipedia eller liknande. I denna uppsats, som består av två delar, undersöker vi i första delen möjligheten att automatiskt generera en semantisk graf från ett ad hoc dataset bestående av 50 000 tidningsartiklar på ett helt oövervakat sätt. Användbarheten hos den visuella representationen av den resulterande grafen testas på 14 försökspersoner som utför grundläggande informationshämtningsuppgifter på en delmängd av artiklarna. Vår studie visar att vår funktionalitet är lönsam för att hitta och dokumentera likhet med varandra, och den visuella kartan som produceras av vår artefakt är visuellt användbar. I den andra delen utforskar vi möjligheten att identifiera entitetsrelationer på ett oövervakat sätt genom att använda abstraktiva djupa inlärningsmetoder för meningsomformulering. De omformulerade meningarna utvärderas kvalitativt med avseende på grammatisk korrekthet och meningsfullhet såsom detta uppfattas av 14 testpersoner. Vi utvärderar negativt resultaten av denna andra del, eftersom de inte har varit tillräckligt bra för att få någon definitiv slutsats, men har istället öppnat nya dörrar för att utforska.
Los estilos APA, Harvard, Vancouver, ISO, etc.
39

GIOVANNINI, STEFANO. "Verso un indice "convergente" dell'impatto dei prodotti culturali italiani in Cina". Doctoral thesis, Università Cattolica del Sacro Cuore, 2022. http://hdl.handle.net/10280/122043.

Texto completo
Resumen
La tesi muove dall’ipotesi che sia possibile costruire un modello predittivo del successo di un prodotto mediale italiano sul mercato cinese, laddove per “successo” s’intenda un impatto culturale ed economico pari a quello di alcuni prodotti benchmark degli ultimi anni (il romanzo L’amica geniale, il film Perfetti sconosciuti, la serie TV My Brilliant Friend). L’impatto culturale è misurato come generazione di discorso online e quello economico con indicatori sia tradizionali, nel caso della distribuzione fisica, sia digitali nel caso di quella online. I tre casi usati come benchmark hanno avuto la funzione di fornire materiale discorsivo online da cui estrapolare variabili predittive tramite LDA topic modelling e sentiment analysis in Python 3. Nel capitolo 1 si esplora e definisce l’ambito delle Digital Humanities (DH). Il capitolo 2 si occupa di riassumere le principali nozioni teoriche inerenti alla produzione culturale. Il capitolo 3 dettaglia metodologia e sua applicazione empirica, recando una sezione di collaudo del modello, confermantene l’affidabilità. I risultati indicano che gli strumenti propri delle DH sono stati adatti, mentre le teorie novecentesche sulla produzione culturale potrebbero beneficiare di aggiornamenti. Tre insiemi di variabili predittive del successo di prodotti mediali italiani in Cina sono stati individuati.
The thesis's hypothesis is that it is possible to build a model for predicting the success of an Italian media product in the Chinese market, where "success" means a cultural and economic impact equal to that of some benchmark products of recent years (the novel L'amica geniale, the film Perfetti sconosciuti, the TV series My Brilliant Friend). Cultural impact is measured as the generation of online discourse, while economic impact is measured by traditional and digital indicators, for physical and online distribution respectively. The three cases used as benchmarks served to provide online discourse material from which to extract predictive variables by LDA topic modelling and sentiment analysis in Python 3. In chapter 1, the Digital Humanities (DH) are explored and defined as a field of study. Chapter 2 summarises the main theories of cultural production. Chapter 3 details the methodology and its empirical application, and contains a section devoted to testing, which in turn confirmed the model's reliability. Results show that DH tools were a proper choice, while last century's theories of cultural production may benefit from updates. Three sets of variables for predicting the success of Italian media products in China were identified.
Los estilos APA, Harvard, Vancouver, ISO, etc.
40

Mercado, Salazar Jorge Anibal y S. M. Masud Rana. "A Confirmatory Analysis for Automating the Evaluation of Motivation Letters to Emulate Human Judgment". Thesis, Högskolan Dalarna, Institutionen för information och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:du-37469.

Texto completo
Resumen
Manually reading, evaluating, and scoring motivation letters as part of the admissions process is a time-consuming and tedious task for Dalarna University's program managers. An automated scoring system would provide them with relief as well as the ability to make much faster decisions when selecting applicants for admission. The aim of this thesis was to analyse current human judgment and attempt to emulate it using machine learning techniques. We used various topic modelling methods, such as Latent Dirichlet Allocation and Non-Negative Matrix Factorization, to find the most interpretable topics, build a bridge between topics and human-defined factors, and finally evaluate model performance by predicting scoring values and measuring accuracy using logistic regression, discriminant analysis, and other classification algorithms. Although we were able to discover the meaning of almost all human factors on our own, the topic models' accuracy in predicting the overall score was unexpectedly low. Setting a threshold on the overall score to select applicants for admission yielded good overall accuracy, but did not yield consistently good precision or recall. During our investigation, we attempted to determine the possible causes of these unexpected results and found that not only are topic modelling limitations to blame, but human bias also plays a role.
Los estilos APA, Harvard, Vancouver, ISO, etc.
41

Patel, Virashree Hrushikesh. "Topic modeling using latent dirichlet allocation on disaster tweets". 2018. http://hdl.handle.net/2097/39337.

Texto completo
Resumen
Master of Science
Department of Computer Science
Cornelia Caragea
Doina Caragea
Social media has changed the way people communicate information. It has been noted that social media platforms like Twitter are increasingly being used by people and authorities in the wake of natural disasters. The year 2017 was a historic year for the USA in terms of natural calamities and associated costs. According to NOAA (National Oceanic and Atmospheric Administration), during 2017, the USA experienced 16 separate billion-dollar disaster events, including three tropical cyclones, eight severe storms, two inland floods, a crop freeze, drought, and wildfire. During natural disasters, due to the collapse of infrastructure and telecommunication, it is often hard to reach out to people in need or to determine what areas are affected. In such situations, Twitter can be a lifesaving tool for local government and search and rescue agencies. Using the Twitter streaming API service, disaster-related tweets can be collected and analyzed in real time. Although tweets received from Twitter can be sparse, noisy and ambiguous, some may contain useful information with respect to situational awareness. For example, some tweets express emotions, such as grief, anguish, or calls for help, other tweets provide information specific to a region, place or person, while others simply help spread information from news or environmental agencies. To extract information useful for disaster response teams from tweets, disaster tweets need to be cleaned and classified into various categories. Topic modeling can help identify topics from the collection of such disaster tweets; subsequently, a topic (or a set of topics) will be associated with each tweet. Thus, in this report, we use Latent Dirichlet Allocation (LDA) to accomplish topic modeling for a disaster tweets dataset.
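A minimal sketch of the workflow described above, vectorising disaster-related tweets and fitting LDA to surface topics, might look as follows with scikit-learn; the tweets are invented placeholders, not the report's dataset.

```python
# Minimal sketch: fit LDA on a handful of placeholder disaster tweets and
# print the top words of each topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["flood water rising on main street need rescue",
          "power outage across the county after the storm",
          "donate supplies at the shelter downtown",
          "roads closed due to wildfire smoke"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-4:][::-1]]
    print(f"topic {k}:", ", ".join(top))
```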
Los estilos APA, Harvard, Vancouver, ISO, etc.
42

Lin, Chieh-Hung y 林桀宏. "Survey Topic Modeling and Expert Finding Based on Latent Dirichlet Allocation". Thesis, 2011. http://ndltd.ncl.edu.tw/handle/7f8879.

Texto completo
Resumen
Master's thesis
National Taiwan University of Science and Technology
Department of Computer Science and Information Engineering
99
For a researcher entering a new research topic, studying survey articles is a shortcut that offers several benefits. Survey articles introduce and summarize significant approaches from the important articles of a certain research topic; readers use them to understand the corresponding domain easily and to find related papers quickly. However, it is not easy to find survey articles in every research domain, and some domains lack recent surveys. To deal with this situation, traditional approaches use citing texts to generate surveys; nevertheless, citation-based approaches might limit performance. In this thesis, we propose an approach, namely the Survey Topic Model (STM), which applies the Latent Dirichlet Allocation model (LDA) to facilitate the processes of building a topic model or survey structure. The proposed STM provides two functions for readers, given certain academic keywords: (1) collecting important research articles from online digital libraries; (2) categorizing the collected papers in a structured manner. In the proposed methodology, feature selection for important articles and LDA-based clustering for survey articles are proposed. We evaluate the proposed mechanism on a dataset of survey articles collected from CSUR. The experimental results show that LDA-based clustering leads to significant improvement. We also address the expert finding problem based on LDA, and our system provides fair and relevant expert lists for each proposal.
Los estilos APA, Harvard, Vancouver, ISO, etc.
43

Crea, Catherine. "On the Robustness of Dirichlet-multinomial Regression in the Context of Modeling Pollination Networks". Thesis, 2011. http://hdl.handle.net/10214/3222.

Texto completo
Resumen
Recent studies have suggested that the structure of plant-pollinator networks is driven by two opposing theories: neutrality and linkage rules. However, relatively few studies have tried to exploit both of these theories in building pollination webs. This thesis proposes Dirichlet-Multinomial (DM) regression to model plant-pollinator interactions as a function of plant-pollinator characteristics (e.g. complementary phenotypic traits), for evaluating the contribution of each process to network structure. DM regression models first arose in econometrics for modeling consumers' choice behaviour. Further, this thesis (i) evaluates the robustness of DM regression to misspecification of dispersion structure, and (ii) compares the performance of DM regression to grouped conditional logit (GCL) regression through simulation studies. Results of these studies suggest that DM regression is a robust statistical method for modeling qualitative plant-pollinator interaction networks and outperforms the GCL regression when data are indeed over-dispersed. Finally, using DM regression seems to significantly improve model fit.
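For reference, the Dirichlet-multinomial log-likelihood underlying DM regression can be written down in a few lines; in DM regression the concentration parameters would themselves be modelled as functions of plant-pollinator covariates. The counts and parameter values below are illustrative only, not taken from the thesis.

```python
# Log-pmf of the Dirichlet-multinomial distribution for a vector of counts y
# with concentration parameters alpha.
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logpmf(y, alpha):
    y, alpha = np.asarray(y, float), np.asarray(alpha, float)
    n, a0 = y.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(y + 1).sum()
            + gammaln(a0) - gammaln(n + a0)
            + gammaln(y + alpha).sum() - gammaln(alpha).sum())

# Hypothetical visit counts of one pollinator across three plant species,
# scored under two concentration vectors (e.g. from different covariates).
y = [12, 3, 0]
print(dirichlet_multinomial_logpmf(y, alpha=[2.0, 1.0, 0.5]))
print(dirichlet_multinomial_logpmf(y, alpha=[0.5, 1.0, 2.0]))
```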
Los estilos APA, Harvard, Vancouver, ISO, etc.
44

Wei, Hongchuan. "Sensor Planning for Bayesian Nonparametric Target Modeling". Diss., 2016. http://hdl.handle.net/10161/12863.

Texto completo
Resumen

Bayesian nonparametric models, such as the Gaussian process and the Dirichlet process, have been extensively applied for target kinematics modeling in various applications including environmental monitoring, traffic planning, endangered species tracking, dynamic scene analysis, autonomous robot navigation, and human motion modeling. As shown by these successful applications, Bayesian nonparametric models are able to adjust their complexities adaptively from data as necessary, and are resistant to overfitting or underfitting. However, most existing works assume that the sensor measurements used to learn the Bayesian nonparametric target kinematics models are obtained a priori or that the target kinematics can be measured by the sensor at any given time throughout the task. Little work has been done on controlling a sensor with a bounded field of view to obtain measurements of mobile targets that are most informative for reducing the uncertainty of the Bayesian nonparametric models. To present the systematic sensor planning approach to learning Bayesian nonparametric models, the Gaussian process target kinematics model is introduced first, which is capable of describing time-invariant spatial phenomena, such as ocean currents, temperature distributions and wind velocity fields. The Dirichlet process-Gaussian process target kinematics model is subsequently discussed for modeling mixtures of mobile targets, such as pedestrian motion patterns.

Novel information theoretic functions are developed for these introduced Bayesian nonparametric target kinematics models to represent the expected utility of measurements as a function of sensor control inputs and random environmental variables. A Gaussian process expected Kullback Leibler divergence is developed as the expectation of the KL divergence between the current (prior) and posterior Gaussian process target kinematics models with respect to the future measurements. Then, this approach is extended to develop a new information value function that can be used to estimate target kinematics described by a Dirichlet process-Gaussian process mixture model. A theorem is proposed that shows the novel information theoretic functions are bounded. Based on this theorem, efficient estimators of the new information theoretic functions are designed, which are proved to be unbiased with the variance of the resultant approximation error decreasing linearly as the number of samples increases. Computational complexities for optimizing the novel information theoretic functions under sensor dynamics constraints are studied, and are proved to be NP-hard. A cumulative lower bound is then proposed to reduce the computational complexity to polynomial time.

Three sensor planning algorithms are developed according to the assumptions on the target kinematics and the sensor dynamics. For problems where the control space of the sensor is discrete, a greedy algorithm is proposed. The efficiency of the greedy algorithm is demonstrated by a numerical experiment with data of ocean currents obtained by moored buoys. A sweep line algorithm is developed for applications where the sensor control space is continuous and unconstrained. Synthetic simulations as well as physical experiments with ground robots and a surveillance camera are conducted to evaluate the performance of the sweep line algorithm. Moreover, a lexicographic algorithm is designed based on the cumulative lower bound of the novel information theoretic functions, for the scenario where the sensor dynamics are constrained. Numerical experiments with real data collected from indoor pedestrians by a commercial pan-tilt camera are performed to examine the lexicographic algorithm. Results from both the numerical simulations and the physical experiments show that the three sensor planning algorithms proposed in this dissertation based on the novel information theoretic functions are superior at learning the target kinematics with little or no prior knowledge.


Dissertation
Los estilos APA, Harvard, Vancouver, ISO, etc.
45

Hines, Keegan. "Bayesian approaches for modeling protein biophysics". Thesis, 2014. http://hdl.handle.net/2152/26016.

Texto completo
Resumen
Proteins are the fundamental unit of computation and signal processing in biological systems. A quantitative understanding of protein biophysics is of paramount importance, since even slight malfunction of proteins can lead to diverse and severe disease states. However, developing accurate and useful mechanistic models of protein function can be strikingly elusive. I demonstrate that the adoption of Bayesian statistical methods can greatly aid in modeling protein systems. I first discuss the pitfall of parameter non-identifiability and how a Bayesian approach to modeling can yield reliable and meaningful models of molecular systems. I then delve into a particular case of non-identifiability within the context of an emerging experimental technique called single molecule photobleaching. I show that the interpretation of this data is non-trivial and provide a rigorous inference model for the analysis of this pervasive experimental tool. Finally, I introduce the use of nonparametric Bayesian inference for the analysis of single molecule time series. These methods aim to circumvent problems of model selection and parameter identifiability and are demonstrated with diverse applications in single molecule biophysics. The adoption of sophisticated inference methods will lead to a more detailed understanding of biophysical systems.
Los estilos APA, Harvard, Vancouver, ISO, etc.
46

"Bayesian Nonparametric Modeling and Inference for Multiple Object Tracking". Doctoral diss., 2019. http://hdl.handle.net/2286/R.I.54996.

Texto completo
Resumen
The problem of multiple object tracking seeks to jointly estimate the time-varying cardinality and trajectory of each object. Numerous challenges are encountered in tracking multiple objects, including a time-varying number of measurements, varying constraints, and environmental conditions. In this thesis, the proposed statistical methods integrate the use of physics-based models with Bayesian nonparametric methods to address the main challenges in a tracking problem. In particular, Bayesian nonparametric methods are exploited to efficiently and robustly infer object identity and learn time-dependent cardinality; together with Bayesian inference methods, they are also used to associate measurements to objects and estimate the trajectory of objects. These methods differ fundamentally from current methods, which are mainly based on random finite set theory. The first contribution proposes dependent nonparametric models, such as the dependent Dirichlet process and the dependent Pitman-Yor process, to capture the inherent time-dependency in the problem at hand. These processes are used as priors for object state distributions to learn dependent information between previous and current time steps. Markov chain Monte Carlo sampling methods exploit the learned information to sample from posterior distributions and update the estimated object parameters. The second contribution proposes a novel, robust, and fast nonparametric approach based on a diffusion process over infinite random trees to infer information on object cardinality and trajectory. This method follows the hierarchy induced by objects entering and leaving a scene and the time-dependency between unknown object parameters. Markov chain Monte Carlo sampling methods integrate the prior distributions over the infinite random trees with time-dependent diffusion processes to update object states. The third contribution develops the use of hierarchical models to form a prior for statistically dependent measurements in a single object tracking setup. Dependency among the sensor measurements provides extra information which is incorporated to achieve optimal tracking performance. The hierarchical Dirichlet process as a prior provides the required flexibility to do inference, and the Bayesian tracker is integrated with the hierarchical Dirichlet process prior to accurately estimate the object trajectory. The fourth contribution proposes an approach to model both multiple dependent objects and multiple dependent measurements. This approach integrates dependent Dirichlet process modeling of the objects with hierarchical Dirichlet process modeling of the measurements to fully capture the dependency among both objects and measurements. Bayesian nonparametric models can successfully associate each measurement with the corresponding object and exploit dependency among them to more accurately infer the trajectory of objects. Markov chain Monte Carlo methods amalgamate the dependent Dirichlet process with the hierarchical Dirichlet process to infer object identity and cardinality. Simulations are exploited to demonstrate the improvement in multiple object tracking performance when compared to approaches based on random finite set theory.
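As a toy illustration of why Dirichlet process priors suit this setting, the sketch below uses the Chinese restaurant process view: each incoming measurement either joins an existing object with probability proportional to its current size or instantiates a new object with probability proportional to the concentration parameter. This is a simplified stand-in, not the dependent or hierarchical Dirichlet process machinery developed in the dissertation.

```python
# Toy Chinese restaurant process assignment of measurements to objects; the
# number of objects grows with the data rather than being fixed in advance.
import numpy as np

def crp_assignments(n_measurements, alpha, rng):
    assignments, counts = [], []
    for _ in range(n_measurements):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)        # a new object is instantiated
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

rng = np.random.default_rng(0)
labels, sizes = crp_assignments(n_measurements=50, alpha=1.5, rng=rng)
print("inferred number of objects:", len(sizes), "sizes:", sizes)
```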
Dissertation/Thesis
Doctoral Dissertation Electrical Engineering 2019
Los estilos APA, Harvard, Vancouver, ISO, etc.
47

Karlsson, Kalle. "News media attention in Climate Action: Latent topics and open access". Thesis, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-23413.

Texto completo
Resumen
The purpose of the thesis is i) to discover the latent topics of SDG13 and their coverage in news media, ii) to investigate the share of OA and non-OA articles and reviews in each topic, and iii) to compare the share of different OA types (Green, Gold, Hybrid and Bronze) in each topic. It adopts a heuristic perspective and explorative approach in reviewing the three concepts open access, altmetrics and climate action (SDG13). Data are collected from SciVal, Unpaywall, Altmetric.com and Scopus, rendering a dataset of 70,206 articles and reviews published between 2014 and 2018. The documents retrieved are analyzed with descriptive statistics and topic modeling using Sklearn's package for LDA (Latent Dirichlet Allocation) in Python. The findings show an altmetric advantage for OA in the case of news media and SDG13, which fluctuates over topics. News media are shown to focus on subjects with "visible" effects, in concordance with previous research on media coverage; examples were topics concerning emissions of greenhouse gases and melting glaciers. Gold OA is the most common type mentioned in news outlets; it also generates the highest number of news mentions overall, while the average number of news mentions was highest for documents published as Bronze. Moreover, the thesis is largely driven by the methods used, most notably the programming language Python, and as such it outlines future paths for research into the three concepts reviewed as well as into the methods used for topic modeling and programming.
Los estilos APA, Harvard, Vancouver, ISO, etc.
48

Ely, Nicole. "Rekonstrukce identit ve fake news: Srovnání dvou webových stránek s obsahem fake news". Master's thesis, 2020. http://www.nusl.cz/ntk/nusl-415291.

Texto completo
Resumen
Since the 2016 US presidential campaign of Donald Trump, the term "fake news" has permeated mainstream discourse. The proliferation of disinformation and false narratives on social media platforms has caused concern in security circles in both the United States and European Union. Combining latent Dirichlet allocation, a machine learning method for text mining, with themes on topical analysis, ideology and social identity drawn from Critical Discourse theory, this thesis examines the elaborate fake news environments of two well-known English language websites: InfoWars and Sputnik News. Through the exploration of the ideologies and social representations at play in the larger thematic structure of these websites, a picture of two very different platforms emerges. One, a white dominant, somewhat isolationist counterculture mindset that promotes a racist and bigoted view of the world. Another, a more subtle world order-making perspective intent on reaching people in the realm of the mundane. Keywords: fake news, Sputnik, InfoWars, topical analysis, latent Dirichlet allocation Od americké prezidentské kampaně Donalda Trumpa z roku 2016, termín "fake news" (doslovně falešné zprávy) pronikl do mainstreamového diskurzu. Šíření dezinformací a falešných zpráv na platformách...
Los estilos APA, Harvard, Vancouver, ISO, etc.
49

Ji, Chunlin. "Advances in Bayesian Modelling and Computation: Spatio-Temporal Processes, Model Assessment and Adaptive MCMC". Diss., 2009. http://hdl.handle.net/10161/1609.

Texto completo
Resumen

The modelling and analysis of complex stochastic systems with increasingly large data sets, state-spaces and parameters provides major stimulus to research in Bayesian nonparametric methods and Bayesian computation. This dissertation presents advances in both nonparametric modelling and statistical computation stimulated by challenging problems of analysis in complex spatio-temporal systems and core computational issues in model fitting and model assessment. The first part of the thesis, represented by chapters 2 to 4, concerns novel, nonparametric Bayesian mixture models for spatial point processes, with advances in modelling, computation and applications in biological contexts. Chapter 2 describes and develops models for spatial point processes in which the point outcomes are latent, where indirect observations related to the point outcomes are available, and in which the underlying spatial intensity functions are typically highly heterogenous. Spatial intensities of inhomogeneous Poisson processes are represented via flexible nonparametric Bayesian mixture models. Computational approaches are presented for this new class of spatial point process mixtures and extended to the context of unobserved point process outcomes. Two examples drawn from a central, motivating context, that of immunofluorescence histology analysis in biological studies generating high-resolution imaging data, demonstrate the modelling approach and computational methodology. Chapters 3 and 4 extend this framework to define a class of flexible Bayesian nonparametric models for inhomogeneous spatio-temporal point processes, adding dynamic models for underlying intensity patterns. Dependent Dirichlet process mixture models are introduced as core components of this new time-varying spatial model. Utilizing such nonparametric mixture models for the spatial process intensity functions allows the introduction of time variation via dynamic, state-space models for parameters characterizing the intensities. Bayesian inference and model-fitting is addressed via novel particle filtering ideas and methods. Illustrative simulation examples include studies in problems of extended target tracking and substantive data analysis in cell fluorescent microscopic imaging tracking problems.

The second part of the thesis, consisting of chapters 5 and 6, concerns advances in computational methods for some core and generic Bayesian inferential problems. Chapter 5 develops a novel approach to estimation of upper and lower bounds for marginal likelihoods in Bayesian modelling using refinements of existing variational methods. Traditional variational approaches only provide lower bound estimation; this new lower/upper bound analysis is able to provide accurate and tight bounds in many problems, and so facilitates more reliable computation for Bayesian model comparison while also providing a way to assess the adequacy of variational densities as approximations to exact, intractable posteriors. The advances also include demonstration of the significant improvements that may be achieved in marginal likelihood estimation by marginalizing some parameters in the model. A distinct contribution to Bayesian computation is covered in Chapter 6. This concerns a generic framework for designing adaptive MCMC algorithms, emphasizing the adaptive Metropolized independence sampler and an effective adaptation strategy using a family of mixture distribution proposals. This work is coupled with development of a novel adaptive approach to computation in nonparametric modelling with large data sets; here a sequential learning approach is defined that iteratively utilizes smaller data subsets. Under the general framework of importance sampling based marginal likelihood computation, the proposed adaptive Monte Carlo method and sequential learning approach can facilitate improved accuracy in marginal likelihood computation. The approaches are exemplified in studies of synthetic data examples and in a real data analysis arising in astro-statistics.

Finally, chapter 7 summarizes the dissertation and discusses possible extensions of the specific modelling and computational innovations, as well as potential future work.


Dissertation
Los estilos APA, Harvard, Vancouver, ISO, etc.