Dissertations on the topic "Dirichlet allocation"
Create a reference in APA, MLA, Chicago, Harvard, and other citation styles
Consult the top 50 dissertations for your research on the topic "Dirichlet allocation".
Next to every work in the bibliography there is an "Add to bibliography" option. Use it, and the bibliographic reference to the chosen work will be formatted automatically in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read an online annotation of the work whenever the relevant parameters are available in the metadata.
Browse dissertations on a wide variety of subject areas and build your bibliography correctly.
Ponweiser, Martin. „Latent Dirichlet Allocation in R“. WU Vienna University of Economics and Business, 2012. http://epub.wu.ac.at/3558/1/main.pdf.
Series: Theses / Institute for Statistics and Mathematics
Arnekvist, Isac, and Ludvig Ericson. „Finding competitors using Latent Dirichlet Allocation“. Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186386.
There is interest in being able to identify business competitors, but this is becoming increasingly difficult in an ever-growing and increasingly global market. The purpose of this report is to investigate whether Latent Dirichlet Allocation (LDA) can be used to identify and rank competitors, by comparing the distances between the LDA representations of their company descriptions. The effectiveness of LDA for this purpose was compared with that of bag-of-words and random ordering, using some common information-theoretic measures. Several distance measures were evaluated to determine which of them best brings competing companies close to each other; in this case, cosine similarity was found to outperform the other distance measures. While both LDA and bag-of-words were found to be significantly better than random ordering, LDA was found to perform qualitatively worse than bag-of-words. Computing the distance measures was, however, considerably faster with LDA representations. Converting web content into LDA representations does capture certain unspecific similarities that do not necessarily indicate competitors. It may therefore be advantageous to use LDA representations together with some additional data source and/or heuristic.
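The core of this pipeline (representing each company description as an LDA topic distribution and ranking candidates by cosine similarity between those distributions) can be sketched compactly. The snippet below is a minimal illustration using scikit-learn; the toy descriptions, the two-topic setting, and all parameter choices are assumptions made for the example and are not taken from the thesis.

    # Minimal sketch: rank candidate competitors by cosine similarity between
    # LDA topic distributions of company descriptions. All data and parameter
    # choices here are illustrative assumptions.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    descriptions = [
        "online retailer of books and electronics",
        "e-commerce platform selling consumer electronics",
        "bakery producing artisan bread and pastries",
        "marketplace for second-hand electronics",
    ]

    counts = CountVectorizer(stop_words="english").fit_transform(descriptions)
    theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

    def cosine(p, q):
        return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

    query = 0  # rank the other companies as potential competitors of company 0
    ranking = sorted(((cosine(theta[query], theta[j]), j)
                      for j in range(len(descriptions)) if j != query), reverse=True)
    for score, j in ranking:
        print(f"{score:.3f}  {descriptions[j]}")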
Choubey, Rahul. „Tag recommendation using Latent Dirichlet Allocation“. Thesis, Kansas State University, 2011. http://hdl.handle.net/2097/9785.
Department of Computing and Information Sciences
Doina Caragea
The vast amount of data present on the internet calls for ways to label and organize this data according to specific categories, in order to facilitate search and browsing activities. This can be easily accomplished by making use of folksonomies and user-provided tags. However, it can be difficult for users to provide meaningful tags. Tag recommendation systems can guide the users towards informative tags for online resources such as websites, pictures, etc. The aim of this thesis is to build a system for recommending tags to URLs available through a bookmark sharing service, called BibSonomy. We assume that the URLs for which we recommend tags do not have any prior tags assigned to them. Two approaches are proposed to address the tagging problem, both of them based on Latent Dirichlet Allocation (LDA) (Blei et al. [2003]). LDA is a generative and probabilistic topic model which aims to infer the hidden topical structure in a collection of documents. According to LDA, documents can be seen as mixtures of topics, while topics can be seen as mixtures of words (in our case, tags). The first approach that we propose, called the topic words based approach, recommends the top words in the top topics representing a resource as tags for that particular resource. The second approach, called the topic distance based approach, uses the tags of the most similar training resources (identified using the KL-divergence, Kullback and Leibler [1951]) to recommend tags for an untagged test resource. The dataset used in this work was made available through the ECML/PKDD Discovery Challenge 2009. We construct the documents that are provided as input to LDA in two ways, thus producing two different datasets. In the first dataset, we use only the description and the tags (when available) corresponding to a URL. In the second dataset, we crawl the URL content and use it to construct the document. Experimental results show that the LDA approach is not very effective at recommending tags for new untagged resources. However, using the resource content gives better results than using the description only. Furthermore, the topic distance based approach is better than the topic words based approach when only the descriptions are used to construct documents, while the topic words based approach works better when the contents are used to construct documents.
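Both recommendation strategies described above (top words of the dominant topics, and tags of the KL-nearest training resources) can be made concrete with a short sketch. Everything below is a toy stand-in, not the BibSonomy data, and scikit-learn's LDA is used in place of the thesis's exact setup.

    # Toy sketch of the two LDA-based tag recommendation strategies above.
    import numpy as np
    from scipy.special import rel_entr
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    train_docs = ["python programming tutorial code",
                  "machine learning statistics course",
                  "travel photography landscape tips"]
    train_tags = [{"python", "code"}, {"ml", "stats"}, {"travel", "photo"}]
    test_doc = ["statistics learning lecture notes"]

    vec = CountVectorizer()
    X = vec.fit_transform(train_docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    vocab = np.array(vec.get_feature_names_out())

    theta_test = lda.transform(vec.transform(test_doc))[0]

    # 1) Topic-words approach: top words of the test document's dominant topic.
    top_topic = int(theta_test.argmax())
    print("topic words:", list(vocab[np.argsort(lda.components_[top_topic])[::-1][:3]]))

    # 2) Topic-distance approach: tags of the KL-nearest training document.
    theta_train = lda.transform(X)
    kl = [rel_entr(theta_test, t).sum() for t in theta_train]
    print("topic distance:", train_tags[int(np.argmin(kl))])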
Risch, Johan. „Detecting Twitter topics using Latent Dirichlet Allocation“. Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-277260.
Liu, Zelong. „High performance latent dirichlet allocation for text mining“. Thesis, Brunel University, 2013. http://bura.brunel.ac.uk/handle/2438/7726.
Kulhanek, Raymond Daniel. „A Latent Dirichlet Allocation/N-gram Composite Language Model“. Wright State University / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=wright1379520876.
Anaya, Leticia H. „Comparing Latent Dirichlet Allocation and Latent Semantic Analysis as Classifiers“. Thesis, University of North Texas, 2011. https://digital.library.unt.edu/ark:/67531/metadc103284/.
Jaradat, Shatha. „OLLDA: Dynamic and Scalable Topic Modelling for Twitter : AN ONLINE SUPERVISED LATENT DIRICHLET ALLOCATION ALGORITHM“. Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177535.
Providing high-quality topic inference in today's large and dynamic corpora, such as Twitter, is a challenging task. This is especially challenging considering that the content in this environment consists of short texts and many abbreviations. This project proposes an improvement of a popular online topic modelling algorithm for Latent Dirichlet Allocation (LDA) by incorporating supervision to make it suitable for the Twitter context. The improvement is motivated by the need for a single algorithm that achieves both objectives: analysing large amounts of documents, including new ones arriving in a stream, while at the same time achieving high topic quality in special environments such as Twitter. The proposed algorithm is a combination of an online algorithm for LDA and a supervised variant of LDA, Labeled LDA. The performance and quality of the proposed algorithm are compared with those of these two algorithms. The results show that the proposed algorithm achieves better performance and quality than the supervised variant of LDA, and better quality than the online algorithm. These improvements make our algorithm an attractive option when applied to dynamic environments such as Twitter. An environment for analysing and labelling data was designed to prepare the dataset before the experiments. Possible applications of the proposed algorithm are tweet recommendation and trend detection.
Yalamanchili, Hima Bindu. „A Novel Approach For Cancer Characterization Using Latent Dirichlet Allocation and Disease-Specific Genomic Analysis“. Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright1527600876174758.
Sheikha, Hassan. „Text mining Twitter social media for Covid-19 : Comparing latent semantic analysis and latent Dirichlet allocation“. Thesis, Högskolan i Gävle, Avdelningen för datavetenskap och samhällsbyggnad, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-32567.
Nelaturu, Keerthi. „Content Management and Hashtag Recommendation in a P2P Social Networking Application“. Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32501.
Järvstråt, Lotta. „Functionality Classification Filter for Websites“. Thesis, Linköpings universitet, Statistik, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-93702.
Schenk, Jason Robert. „Meta-uncertainty and resilience with applications in intelligence analysis“. The Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1199129269.
Askling, Kim. „Application of Topic Models for Test Case Selection : A comparison of similarity-based selection techniques“. Thesis, Linköpings universitet, Programvara och system, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-159803.
Malhomme, Nemo. „Statistical learning for climate models“. Electronic Thesis or Diss., université Paris-Saclay, 2024. http://www.theses.fr/2024UPAST165.
Climate models face challenges in accurately representing atmospheric circulation patterns related to extreme weather events, especially regarding regional variability. This thesis explores how Latent Dirichlet Allocation (LDA), a statistical learning method originating from natural language processing, can be adapted to evaluate the ability of climate models to represent data such as Sea-Level Pressure (SLP). LDA identifies a set of local synoptic-scale structures, physically interpretable as cyclones and anticyclones, referred to as motifs. A common basis of motifs can be used to describe reanalysis and model data, so that any SLP map can be represented as a sparse combination of these motifs. The motif weights provide local information on the synoptic configuration of circulation. By analyzing the weights, we can characterize circulation patterns in both reanalysis data and models, allowing us to identify local biases, both in general data and during extreme events. A global dynamic error can be defined for each model run based on the differences between the average weights of the run and of the reanalysis data. This methodology was applied to four CMIP6 models. While large-scale circulation is well predicted by all models on average, higher errors are found for heatwaves and cold spells. In general, a major source of error is found to be associated with Mediterranean motifs, for all models. Additional evaluation criteria were considered: one was based on the frequency of motifs in the sparse map representation; another combined the global dynamic error with the temperature error, making it possible to discriminate between models. These results show the potential of LDA for model evaluation and preselection.
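The abstract does not spell out the formula for the global dynamic error, but one plausible reading is a norm of the difference between the time-averaged motif weights of a model run and of the reanalysis. The sketch below encodes that reading with random stand-in weights; the Euclidean norm, the sizes, and the data are all assumptions, not the thesis's definitions.

    # Hedged sketch: a "global dynamic error" computed as the Euclidean norm of
    # the difference between mean motif weights of a model run and reanalysis.
    # All weights below are random stand-ins; the exact definition may differ.
    import numpy as np

    rng = np.random.default_rng(0)
    n_maps, n_motifs = 1000, 30                       # assumed sizes
    w_reanalysis = rng.dirichlet(np.ones(n_motifs), size=n_maps)
    w_model = rng.dirichlet(np.ones(n_motifs), size=n_maps)

    bias = w_model.mean(axis=0) - w_reanalysis.mean(axis=0)   # per-motif bias
    global_dynamic_error = np.linalg.norm(bias)
    print(f"global dynamic error: {global_dynamic_error:.4f}")
    print("motifs with largest absolute bias:", np.argsort(np.abs(bias))[::-1][:5])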
Lindgren, Jennifer. „Evaluating Hierarchical LDA Topic Models for Article Categorization“. Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-167080.
Morchid, Mohamed. „Représentations robustes de documents bruités dans des espaces homogènes“. Thesis, Avignon, 2014. http://www.theses.fr/2014AVIG0202/document.
In the Information Retrieval field, documents are usually considered as a "bag-of-words". This model does not take into account the temporal structure of the document and is sensitive to noises which can alter its lexical form. These noises can be produced by different sources: uncontrolled form of documents in micro-blogging platforms, automatic transcription of speech documents which is error-prone, lexical and grammatical variabilities in Web forums, etc. The work presented in this thesis addresses issues related to document representations from noisy sources. The thesis consists of three parts in which different representations of content are proposed. The first one compares a classical representation based on term frequency to a higher-level representation based on a topic space. The abstraction of the document content allows us to limit the alteration of the noisy document by representing its content with a set of high-level features. Our experiments confirm that mapping a noisy document into a topic space allows us to improve the results obtained during different information retrieval tasks compared to a classical approach based on term frequency. The major problem with such a high-level representation is that it is based on a topic space whose parameters are chosen empirically. The second part presents a novel representation based on multiple topic spaces that allows us to solve three main problems: the closeness of the subjects discussed in the document, the tricky choice of the "right" values of the topic space parameters, and the robustness of the topic-based representation. Based on the idea that a single representation of the contents cannot capture all the relevant information, we propose to increase the number of views on a single document. This multiplication of views generates "artificial" observations that contain fragments of useful information. The first experiment validated the multi-view approach to represent noisy texts. However, it has the disadvantage of being very large and redundant and of containing additional variability associated with the diversity of views. In the second step, we propose a method based on factor analysis to compact the different views and to obtain a new robust representation of low dimension which contains only the informative part of the document while the noisy variabilities are compensated. During a dialogue classification task, the compression process confirmed that this compact representation allows us to improve the robustness of noisy document representation. Nonetheless, during the learning process of topic spaces, the document is considered as a "bag-of-words", while many studies have shown that the word position in a document is useful. A representation which takes into account the temporal structure of the document based on hyper-complex numbers is proposed in the third part. This representation is based on the hyper-complex numbers of dimension four named quaternions. Our experiments on a classification task have shown the effectiveness of the proposed approach compared to a conventional "bag-of-words" representation.
Hachey, Benjamin. „Towards generic relation extraction“. Thesis, University of Edinburgh, 2009. http://hdl.handle.net/1842/3978.
Paganin, Sally. „Prior-driven cluster allocation in bayesian mixture models“. Doctoral thesis, Università degli studi di Padova, 2018. http://hdl.handle.net/11577/3426831.
Bakharia, Aneesha. „Interactive content analysis : evaluating interactive variants of non-negative Matrix Factorisation and Latent Dirichlet Allocation as qualitative content analysis aids“. Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/76535/1/Aneesha_Bakharia_Thesis.pdf.
Bui, Quang Vu. „Pretopology and Topic Modeling for Complex Systems Analysis : Application on Document Classification and Complex Network Analysis“. Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEP034/document.
The work of this thesis presents the development of algorithms for document classification on the one hand, and complex network analysis on the other, based on pretopology, a theory that models the concept of proximity. The first work develops a framework for document clustering by combining Topic Modeling and Pretopology. Our contribution proposes using topic distributions extracted from topic modeling treatment as input for classification methods. In this approach, we investigated two aspects: determining an appropriate distance between documents by studying the relevance of Probabilistic-Based and Vector-Based Measurements, and performing groupings according to several criteria using a pseudo-distance defined from pretopology. The second work introduces a general framework for modeling Complex Networks by developing a reformulation of stochastic pretopology and proposes the Pretopology Cascade Model as a general model for information diffusion. In addition, we propose an agent-based model, Textual-ABM, to analyze complex dynamic networks associated with textual information using an author-topic model, and introduce Textual-Homo-IC, an independent cascade model of resemblance, in which homophily is measured based on textual content obtained by utilizing Topic Modeling.
Clavijo, García David Mauricio. „Metodología para el análisis de grandes volúmenes de información aplicada a la investigación médica en Chile“. Tesis, Universidad de Chile, 2017. http://repositorio.uchile.cl/handle/2250/146597.
Knowledge in medicine has accumulated in scientific research articles over time; consequently, there has been growing interest in developing text mining methodologies to extract, structure, and analyse the knowledge obtained from large volumes of information in the shortest possible time. This work presents a methodology that achieves this objective using the LDA (Latent Dirichlet Allocation) model. The methodology consists of three steps: first, identifying relevant topics in medical scientific research articles from the Revista Médica de Chile (2012-2015); second, identifying and interpreting the relationships between the resulting topics through visualization methods (LDAvis); third, evaluating characteristics of the research itself, in this case targeted funding, using the two previous steps. The results show that this methodology is effective not only for the analysis of medical scientific research articles but can also be used in other fields of science. Additionally, this method makes it possible to analyse and interpret the state of medical research at the national level, using the Revista Médica de Chile as a reference. Within this context, it is important to consider the planning, management, and production processes of scientific research within hospitals, which have been standard-bearers of knowledge generation, since they function as university campuses of tradition and innovation. For this reason, an analysis is carried out of the environment in the health sector, its structure, and the possibility of applying the methodology proposed in this work, based on the strategic approach and business model of the Hospital Exequiel González Cortés.
Halmann, Marju. „Email Mining Classifier : The empirical study on combining the topic modelling with Random Forest classification“. Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-14710.
Muñoz Cancino, Ricardo Luis. „Diseño, desarrollo y evaluación de un algoritmo para detectar sub-comunidades traslapadas usando análisis de redes sociales y minería de datos“. Tesis, Universidad de Chile, 2013. http://www.repositorio.uchile.cl/handle/2250/112582.
Ingeniero Civil Industrial
Virtual social networking sites have grown enormously in the last decade. Their main objective is to facilitate the creation of links between people who, for example, share interests, activities, knowledge, or connections in real life. The interaction between users generates a community in the social network. There are several types of communities; communities of interest and communities of practice stand out. A community of interest is a group of people interested in sharing and discussing a particular topic of interest. In contrast, in a community of practice people share a concern or passion for something they do and learn how to do it better. If the interactions take place over the internet, they are called virtual communities (VCoP/VCoI). It is common for members to share only with some users, thus forming sub-communities, and a member may belong to more than one. Identifying these substructures is necessary, since that is where the interactions for the creation and development of the community's knowledge are generated. Many algorithms have been designed to detect sub-communities. However, most of them detect disjoint sub-communities and, moreover, do not consider the content generated by the members of the community. The main objective of this work is to design, develop, and evaluate an algorithm to detect overlapping sub-communities using social network analysis (SNA) and text mining. To this end, the SNA-KDD methodology proposed by Ríos et al. [79], which combines Knowledge Discovery in Databases (KDD) and SNA, is used. It was applied to two virtual communities, Plexilandia (VCoP) and The Dark Web Portal (VCoI). In the KDD stage, the users' posts were preprocessed and Latent Dirichlet Allocation (LDA) was then applied, which makes it possible to describe each post in terms of topics. In the SNA stage, filtered networks were built with the information obtained in the previous stage. Then two algorithms developed in this thesis, SLTA and TPA, were used to find overlapping sub-communities. The results show that SLTA achieves, on average, 5% better performance than the best existing algorithm when applied to a VCoP. In addition, it was found that the quality of the detected sub-community structure increases, on average, by 64% when the semantic filter is strengthened. With respect to TPA, this algorithm achieves, on average, a modularity of 0.33, while the best existing algorithm achieves 0.043 when applied to a VCoI. Moreover, the joint application of our algorithms seems to offer a way of determining the type of community being analysed. However, this must be verified by analysing more virtual communities.
Wedenberg, Kim, and Alexander Sjöberg. „Online inference of topics : Implementation of the topic model Latent Dirichlet Allocation using an online variational bayes inference algorithm to sort news articles“. Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-222429.
Long, Hannah. „Geographic Relevance for Travel Search: The 2014-2015 Harvey Mudd College Clinic Project for Expedia, Inc“. Scholarship @ Claremont, 2015. http://scholarship.claremont.edu/scripps_theses/670.
Dupuy, Christophe. „Inference and applications for topic models“. Thesis, Paris Sciences et Lettres (ComUE), 2017. http://www.theses.fr/2017PSLEE055/document.
Most current recommendation systems are based on ratings (i.e. numbers between 0 and 5) and try to suggest content (a movie, a restaurant...) to a user. These systems usually allow users to provide a text review of this content in addition to ratings. It is hard to extract useful information from raw text, while a rating does not contain much information on the content and the user. In this thesis, we tackle the problem of suggesting personalized readable text to users to help them make a quick decision about a content item. More specifically, we first build a topic model that predicts personalized movie descriptions from text reviews. Our model extracts distinct qualitative (i.e., opinion-conveying) and descriptive topics by combining text reviews and movie ratings in a joint probabilistic model. We evaluate our model on an IMDB dataset and illustrate its performance through comparison of topics. We then study parameter inference in large-scale latent variable models, which include most topic models. We propose a unified treatment of online inference for latent variable models from a non-canonical exponential family, and draw explicit links between several previously proposed frequentist or Bayesian methods. We also propose a novel inference method for the frequentist estimation of parameters, which adapts MCMC methods to online inference of latent variable models with the proper use of local Gibbs sampling. For the specific latent Dirichlet allocation topic model, we provide an extensive set of experiments and comparisons with existing work, where our new approach outperforms all previously proposed methods. Finally, we propose a new class of determinantal point processes (DPPs) which can be manipulated for inference and parameter learning in potentially sublinear time in the number of items. This class, based on a specific low-rank factorization of the marginal kernel, is particularly suited to a subclass of continuous DPPs and to DPPs defined on exponentially many items. We apply this new class to modelling text documents as sampling a DPP of sentences, and propose a conditional maximum likelihood formulation to model topic proportions, which is made possible with no approximation for our class of DPPs. We present an application to document summarization with a DPP on 2^500 items, where the summaries are composed of readable sentences.
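For orientation, the streaming-inference theme above can be contrasted with what standard libraries already provide. The sketch below uses scikit-learn's partial_fit, which implements online variational Bayes for LDA; it is a stand-in baseline, not the Gibbs-based estimator proposed in the thesis, and the mini-batches are invented.

    # Online (mini-batch) LDA baseline via scikit-learn's partial_fit.
    # Note: this is online variational Bayes, not the thesis's method.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    stream = [
        ["great movie with a strong plot", "terrible acting and a weak script"],
        ["the soundtrack was beautiful", "a boring film with dull pacing"],
        ["brilliant direction and casting", "the plot twists kept me hooked"],
    ]

    vec = CountVectorizer()
    vec.fit([doc for batch in stream for doc in batch])   # fix the vocabulary once

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    for batch in stream:                                   # documents arrive in batches
        lda.partial_fit(vec.transform(batch))

    print(lda.transform(vec.transform(["a gripping plot"])).round(3))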
Sathi, Veer Reddy, and Jai Simha Ramanujapura. „A Quality Criteria Based Evaluation of Topic Models“. Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-13274.
Chenghua, Lin. „Probabilistic topic models for sentiment analysis on the Web“. Thesis, University of Exeter, 2011. http://hdl.handle.net/10036/3307.
Mungre, Surbhi. „LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification“. Thesis, Kansas State University, 2011. http://hdl.handle.net/2097/8846.
Department of Computing and Information Sciences
Doina Caragea
Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms is largely dependent on the availability of labeled data and the approach used to represent DNA sequences as feature vectors. For many organisms, the labeled DNA data is scarce, while the unlabeled data is easily available. However, for a small number of well-studied model organisms, large amounts of labeled data are available. This calls for domain adaptation approaches, which can transfer knowledge from a source domain, for which labeled data is available, to a target domain, for which large amounts of unlabeled data are available. Intuitively, one approach to domain adaptation can be obtained by extracting and representing the features that the source domain and the target domain sequences share. Latent Dirichlet Allocation (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches, LDA Words (LDAW) and LDA Distribution (LDAD), for DNA sequences. LDA is a probabilistic model, which is generative in nature, and is used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a "document" and k-mers obtained from a sequence are "document words". We use LDA to model our sequence collection. Given the LDA model, each document can be represented as a distribution over topics (where a topic can be seen as a distribution over k-mers). In the LDAW method, we use the top k-mers in each topic as our features (i.e., k-mers with the highest probability); while in the LDAD method, we use the topic distribution to represent a document as a feature vector. We study LDA-based dimensionality reduction approaches for both supervised DNA sequence classification and domain adaptation. We apply the proposed approaches to the splice site prediction problem, which is an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of the LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distance between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used to perform dimensionality reduction and domain adaptation for DNA sequence classification problems.
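The two feature constructions described above can be illustrated with a short sketch: each sequence becomes a "document" of overlapping k-mers, LDA is fitted, and features are either the top k-mers per topic (LDAW) or the per-sequence topic distribution (LDAD). The sequences, k=3, and all sizes below are toy assumptions, not the splice-site data.

    # Toy sketch of the LDAW and LDAD feature constructions for DNA sequences.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def kmer_doc(seq, k=3):
        # Turn a sequence into a "document" of overlapping k-mers.
        return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

    sequences = ["ACGTACGTGCA", "TTGACGTACGA", "GGCATGCATGC", "ACGTTGCAACG"]
    docs = [kmer_doc(s) for s in sequences]

    vec = CountVectorizer(lowercase=False)
    X = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    kmers = np.array(vec.get_feature_names_out())

    # LDAW: the highest-probability k-mers of each topic define the feature set.
    ldaw = {t: list(kmers[np.argsort(row)[::-1][:3]])
            for t, row in enumerate(lda.components_)}
    print("LDAW features per topic:", ldaw)

    # LDAD: each sequence is represented by its topic distribution.
    print("LDAD feature vectors:\n", lda.transform(X).round(3))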
Johansson, Richard, and Heino Otto Engström. „Topic propagation over time in internet security conferences : Topic modeling as a tool to investigate trends for future research“. Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-177748.
Fröjd, Sofia. „Measuring the information content of Riksbank meeting minutes“. Thesis, Umeå universitet, Institutionen för fysik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-158151.
Greau-Hamard, Pierre-Samuel. „Contribution à l'apprentissage non supervisé de protocoles pour la couche de Liaison de données dans les systèmes communicants, à l'aide des Réseaux Bayésiens“. Thesis, CentraleSupélec, 2021. http://www.theses.fr/2021CSUP0009.
The world of telecommunications is developing rapidly, especially in the area of the Internet of Things; in such a context, it would be useful to be able to analyze any unknown protocol one might encounter. For this purpose, obtaining the state machine and frame formats of the target protocol is essential. These two elements can be extracted from network traces and/or execution traces using Protocol Reverse Engineering (PRE) techniques. By analyzing the performance of three algorithms used in PRE systems, we discovered the potential of models based on Bayesian networks. We then developed the Bayesian Network Frame Format Finder (BaNet3F), our own frame format learning model based on Bayesian networks, and showed that its performance is significantly better than the state of the art. BaNet3F also includes an optimized version of the Viterbi algorithm, applicable to any Bayesian network, thanks to its ability to generate the necessary Markov boundaries itself.
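As a baseline reference for the dynamic-programming machinery mentioned above, the classic Viterbi recursion on a simple two-state HMM looks as follows; the thesis's contribution generalizes this idea to arbitrary Bayesian networks via Markov boundaries, which this toy sketch does not attempt. States, observations, and probabilities are all invented.

    # Classic Viterbi decoding on a toy two-state HMM (log domain).
    # Purely illustrative; the thesis's generalized version is not shown.
    import numpy as np

    states = ["header", "payload"]          # hypothetical frame-field states
    start = np.log([0.8, 0.2])
    trans = np.log([[0.7, 0.3],
                    [0.4, 0.6]])
    emit = np.log([[0.9, 0.1],              # P(observation class | state)
                   [0.2, 0.8]])
    obs = [0, 0, 1, 1, 1]                   # observed byte classes

    V = start + emit[:, obs[0]]
    backptr = []
    for o in obs[1:]:
        scores = V[:, None] + trans         # scores[i, j]: end in i, move to j
        backptr.append(scores.argmax(axis=0))
        V = scores.max(axis=0) + emit[:, o]

    path = [int(V.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    path.reverse()
    print([states[s] for s in path])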
Rocha, João Pedro Magalhães da. „The evolution of international business research : a content analysis of EIBA’s conference papers (1999-2011)“. Master's thesis, Instituto Superior de Economia e Gestão, 2020. http://hdl.handle.net/10400.5/20932.
This study seeks to analyse the evolution of the European International Business Academy's Annual Conferences between the years 1999 and 2011. A collection of the 2221 documents presented across the defined period was processed with the use of a computer-aided tool - the Latent Dirichlet Allocation, powered by artificial intelligence systems - to facilitate the content analysis. The study utilized the R software environment as the platform to apply the method, with the support of a group of compatible libraries to support in the pre-processing, topic modelling and result plotting stages of this study. The method was able to identify 30 underlying research topics across all the documents, and a label was manually assigned to each topic according to its overall theme. Results show an overall growth in number of papers presented, as well as new authors throughout the years, indicating an increase in degree of openness in the Association's Annual Conferences. Specific research topics have also shown to be more discussed across the documents than others, and research on the topics of "Dynamic capabilities, resource-based view and firm internationalization" and "Institutional approaches to International Business research and theory" showed to be trending in recent years of the Conferences. With little need for human intervention, this study was able to successfully apply an automated method to identify the Association's research themes and their evolution throughout the years, reducing researcher's bias and allowing for the efficient analysis of large volumes of text content.
Déhaye, Vincent. „Characterisation of a developer’s experience fields using topic modelling“. Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-171946.
Ficapal Vila, Joan. „Anemone: a Visual Semantic Graph“. Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-252810.
Semantic graphs have been used to optimize various natural language processing tasks and to improve search and information retrieval. In most cases, such semantic graphs have been constructed through supervised machine learning methods that presuppose manually curated ontologies such as Wikipedia or similar. In the first part of this two-part thesis, we investigate the possibility of automatically generating a semantic graph from an ad hoc dataset of 50,000 newspaper articles in a completely unsupervised manner. The usefulness of the visual representation of the resulting graph is tested on 14 subjects performing basic information retrieval tasks on a subset of the articles. Our study shows that the approach is viable for finding documents similar to each other, and that the visual map produced by our artifact is visually useful. In the second part, we explore the possibility of identifying entity relations in an unsupervised manner by using abstractive deep learning methods for sentence reformulation. The reformulated sentences are evaluated qualitatively with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects. We evaluate the results of this second part negatively, as they have not been good enough to draw any definitive conclusions, but they have opened new doors to explore.
MAGATTI, DAVIDE. „Graphical models for text mining: knowledge extraction and performance estimation“. Doctoral thesis, Università degli Studi di Milano-Bicocca, 2011. http://hdl.handle.net/10281/19576.
Park, Kyoung Jin. „Generating Thematic Maps from Hyperspectral Imagery Using a Bag-of-Materials Model“. The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1366296426.
Russo, Massimiliano. „Bayesian inference for tensor factorization models“. Doctoral thesis, Università degli studi di Padova, 2019. http://hdl.handle.net/11577/3426830.
Atrevi, Dieudonne Fabrice. „Détection et analyse des évènements rares par vision, dans un contexte urbain ou péri-urbain“. Thesis, Orléans, 2019. http://www.theses.fr/2019ORLE2008.
The main objective of this thesis is the development of complete methods for rare event detection. The work can be summarized in two parts. The first part is devoted to the study of state-of-the-art shape descriptors. On the one hand, the robustness of some descriptors to varying lighting conditions was studied. On the other hand, the ability of geometric moments to describe the human shape was also studied through a 3D human pose estimation application based on 2D images. From this study, we showed that, through a shape retrieval application, geometric moments can be used to estimate a human pose by an exhaustive search in a pose database. This kind of application can be used in human action recognition systems, which may be the final step of an event analysis system. In the second part, three main contributions to rare event detection are presented. The first contribution concerns the development of a global scene analysis method for crowd event detection. In this method, global scene modeling is based on spatiotemporal interest points filtered from the saliency map of the scene. The features used are the histogram of optical flow orientations and a set of shape descriptors studied in the first part. The Latent Dirichlet Allocation algorithm is used to create event models by using a visual document representation of image sequences (video clips). The second contribution is the development of a method for salient motion detection in video. This method is totally unsupervised and relies on the properties of the discrete cosine transform to explore the optical flow information of the scene. Local modeling for event detection and localization is at the core of the last contribution of this thesis. The method is based on the saliency score of movements and a one-class SVM algorithm to create the event model. The methods have been tested on different public databases and the results obtained are promising.
Rusch, Thomas, Paul Hofmarcher, Reinhold Hatzinger und Kurt Hornik. „Model trees with topic model preprocessing: an approach for data journalism illustrated with the WikiLeaks Afghanistan war logs“. Institute of Mathematical Statistics (IMS), 2013. http://dx.doi.org/10.1214/12-AOAS618.
Apelthun, Catharina. „Topic modeling on a classical Swedish text corpus of prose fiction : Hyperparameters' effect on theme composition and identification of writing style“. Thesis, Uppsala universitet, Statistiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-441653.
Cedervall, Andreas, and Daniel Jansson. „Topic classification of Monetary Policy Minutes from the Swedish Central Bank“. Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-240403.
In recent years, artificial intelligence and machine learning have received a great deal of attention and grown at a remarkable pace. Work that used to be manual is now being automated, and much suggests that this development will continue rapidly. This thesis builds on previous work in topic modelling and applies it to a previously unexplored domain: the meeting minutes of the Swedish central bank (Riksbank). Latent Dirichlet Allocation and neural networks are used to investigate whether the distribution of discussion topics changes over time. Finally, a theoretical discussion of the potential business value of implementing a similar method is presented. The results of the two models differ considerably over time: while Latent Dirichlet Allocation finds no major trends in the topics, the neural network shows larger changes over time, and the latter agree well with other observations, such as the start of bond purchases. The results thus indicate that neural networks are the more suitable method for analysing the Riksbank's meeting minutes.
Schneider, Bruno. „Visualização em multirresolução do fluxo de tópicos em coleções de texto“. Repositório Institucional do FGV, 2014. http://hdl.handle.net/10438/11745.
The combined use of algorithms for topic discovery in document collections with topic-flow visualization techniques allows the exploration of thematic patterns in large corpora, where those patterns can be revealed through compact visual representations. This research investigated the requirements for viewing data about the thematic composition of documents obtained through topic modeling - where datasets are sparse and multi-attribute - at different levels of detail, through the development of a purpose-built technique and, comparatively, the use of an open-source library for data visualization. Regarding the studied problem of topic-flow visualization, we observed conflicting requirements for displaying the data at different resolutions, which led to a detailed investigation of ways of manipulating and displaying the data. The hypothesis put forward was that the integrated use of more than one visualization technique, chosen according to the resolution of the data, expands the possibilities for exploring the object under study beyond what would be obtained with a single technique. Establishing the limits on the use of these techniques according to the resolution of data exploration is the main contribution of this work, intended to inform the development of new applications.
Pratt, Landon James. „Cliff Walls: Threats to Validity in Empirical Studies of Open Source Forges“. BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/3511.
Harrysson, Mattias. „Neural probabilistic topic modeling of short and messy text“. Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189532.
Exploring enormous amounts of user-generated data through topics postulates a new way of finding useful information. The topics are assumed to be "hidden" and must be "uncovered" by statistical methods such as topic modelling. However, user-generated data is generally short and messy, e.g. informal chat conversations, heavy use of slang, and "noise" in the form of URLs or other kinds of pseudo-text. This type of data is difficult to process for most natural language processing algorithms, including topic modelling. This work has tried to find the method that objectively yields better topics from short and messy text in a comparative study. The methods compared were Latent Dirichlet Allocation (LDA), Re-organized LDA (RO-LDA), a Gaussian Mixture Model (GMM) with distributed representations of words, and an own method named Neural Probabilistic Topic Modeling (NPTM) based on previous work. The conclusion that can be drawn is that NPTM tends to yield better topics on short and messy text than LDA and RO-LDA. GMM failed to produce any meaningful results at all. The results are less conclusive because NPTM suffers from long running times, which meant that a sufficient number of samples could not be obtained for a statistical test.
Moon, Gordon Euhyun. „Parallel Algorithms for Machine Learning“. The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1561980674706558.
Webb, Jared Anthony. „A Topics Analysis Model for Health Insurance Claims“. BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/3805.
Victors, Mason Lemoyne. „A Classification Tool for Predictive Data Analysis in Healthcare“. BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/5639.
Chen, Yuxin. „Apprentissage interactif de mots et d'objets pour un robot humanoïde“. Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLY003/document.
Future applications of robotics, especially personal service robots, will require continuous adaptability to the environment, and particularly the ability to recognize new objects and learn new words through interaction with humans. Though having made tremendous progress by using machine learning, current computational models for object detection and representation still rely heavily on good training data and ideal learning supervision. In contrast, two-year-old children have an impressive ability to learn to recognize new objects and at the same time to learn the object names during interaction with adults and without precise supervision. Therefore, following the developmental robotics approach, we develop in this thesis learning approaches for objects, associating their names and corresponding features, inspired by the infants' capabilities, in particular the ambiguous interaction with humans, inspired by the interaction that occurs between children and parents. The general idea is to use cross-situational learning (finding the common points between different presentations of an object or a feature) and to implement multi-modal concept discovery based on two latent topic discovery approaches: Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA). Based on vision descriptors and sound/voice inputs, the proposed approaches find the underlying regularities in the raw dataflow to produce sets of words and their associated visual meanings (e.g. the name of an object and its shape, or a color adjective and its correspondence in images). We developed a complete approach based on these algorithms and compared their behavior in front of two sources of uncertainty: referential ambiguities, in situations where multiple words are given that describe multiple object features; and linguistic ambiguities, in situations where the keywords we intend to learn are merged in complete sentences. This thesis highlights the algorithmic solutions required to perform efficient learning of these word-referent associations from data acquired in a simplified but realistic acquisition setup that made it possible to perform extensive simulations and preliminary experiments in real human-robot interactions. We also gave solutions for the automatic estimation of the number of topics for both NMF and LDA. We finally proposed two active learning strategies, Maximum Reconstruction Error Based Selection (MRES) and Confidence Based Exploration (CBE), to improve the quality and speed of incremental learning by letting the algorithms choose the next learning samples. We compared the behaviors produced by these algorithms and show their common points and differences with those of humans in similar learning situations.
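A hedged sketch of the cross-modal idea above: NMF applied to a matrix whose columns concatenate word counts and visual-feature activations, so that each latent component couples words with visual features. The synthetic data and dimensions below are invented; the thesis's actual multimodal setup (and its LDA variant) is far richer.

    # Toy NMF sketch of multimodal word/vision concept discovery.
    # Rows are observations; columns stack word counts and visual features.
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    n_obs, n_words, n_visual = 60, 20, 15
    words = rng.poisson(1.0, size=(n_obs, n_words))     # word occurrence counts
    visual = rng.random((n_obs, n_visual))              # visual feature activations
    X = np.hstack([words, visual]).astype(float)

    nmf = NMF(n_components=4, init="nndsvda", random_state=0, max_iter=500)
    H = nmf.fit_transform(X)              # observation-to-concept activations
    W = nmf.components_                   # concepts over words + visual features

    for k, comp in enumerate(W):
        top_word = int(np.argmax(comp[:n_words]))
        top_visual = int(np.argmax(comp[n_words:]))
        print(f"concept {k}: word #{top_word} <-> visual feature #{top_visual}")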