Dissertations on the topic "Dirichlet allocation"

Consult the top 50 dissertations for your research on the topic "Dirichlet allocation".

Next to every entry in the bibliography there is an "Add to bibliography" option. Use it, and the bibliographic reference for the chosen work will be formatted automatically according to the required citation style (APA, MLA, Harvard, Chicago, Vancouver, etc.).

You can also download the full text of the scientific publication in PDF format and read an online annotation of the work, if the relevant parameters are available in the metadata.

Browse dissertations from various fields of study and compile a correctly formatted bibliography.

1

Ponweiser, Martin. „Latent Dirichlet Allocation in R“. WU Vienna University of Economics and Business, 2012. http://epub.wu.ac.at/3558/1/main.pdf.

Annotation:
Topic models are a young research field within computer science, at the intersection of information retrieval and text mining. They are generative probabilistic models of text corpora inferred by machine learning, and they can be used for retrieval and text mining tasks. The most prominent topic model is latent Dirichlet allocation (LDA), which was introduced in 2003 by Blei et al. and has since sparked the development of other topic models for domain-specific purposes. This thesis focuses on LDA's practical application. Its main goal is the replication of the data analyses from the 2004 LDA paper "Finding scientific topics" by Thomas Griffiths and Mark Steyvers within the framework of the R statistical programming language and the R package topicmodels by Bettina Grün and Kurt Hornik. The complete process, including extraction of a text corpus from the PNAS journal's website, data preprocessing, transformation into a document-term matrix, model selection, model estimation, and presentation of the results, is fully documented and commented. The outcome closely matches the analyses of the original paper; the research by Griffiths and Steyvers can therefore be reproduced. Furthermore, this thesis demonstrates the suitability of the R environment for text mining with LDA. (author's abstract)
Series: Theses / Institute for Statistics and Mathematics
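The thesis itself works in R with the topicmodels package; as a rough, hypothetical analogue of the same corpus-to-topics pipeline, a Python sketch using scikit-learn might look like this (the toy corpus is invented):

    # Minimal analogue of the workflow: corpus -> document-term matrix -> LDA.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    corpus = [
        "gene expression in yeast cells",
        "bayesian inference for topic models",
        "protein folding and molecular dynamics",
    ]

    vectorizer = CountVectorizer(stop_words="english")   # preprocessing + DTM
    dtm = vectorizer.fit_transform(corpus)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(dtm)                  # per-document topic mixtures

    # Inspect the top words per topic, in the spirit of the replicated analyses
    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[::-1][:5]]
        print(f"topic {k}: {top}")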
2

Arnekvist, Isac, und Ludvig Ericson. „Finding competitors using Latent Dirichlet Allocation“. Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-186386.

Annotation:
Identifying business competitors is of interest to many, but is becoming increasingly hard in an expanding global market. The aim of this report is to investigate whether Latent Dirichlet Allocation (LDA) can be used to identify and rank competitors based on distances between LDA representations of company descriptions. The performance of the LDA model was compared to that of bag-of-words and random ordering by evaluating and comparing them on a handful of common information retrieval metrics. Several distance metrics were evaluated to determine which best captured the correspondence between representation distance and companies being competitors; cosine similarity was found to outperform the other distance metrics. While both LDA and bag-of-words representations were found to be significantly better than random ordering, LDA was found to perform worse than bag-of-words. However, computation of distance metrics was considerably faster for LDA representations. The LDA representations capture features that are not helpful for identifying competitors, and it is suggested that LDA representations could be used together with some other data source or heuristic.
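As a loose illustration of the ranking step described above, the following Python sketch ranks candidate competitors by cosine similarity between LDA topic vectors; the vectors are invented placeholders rather than output of the thesis's model:

    # Rank candidates by cosine similarity of their topic vectors to a query company.
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    doc_topics = np.array([[0.8, 0.1, 0.1],   # query company
                           [0.7, 0.2, 0.1],   # candidate A
                           [0.1, 0.1, 0.8]])  # candidate B

    sims = cosine_similarity(doc_topics[0:1], doc_topics[1:]).ravel()
    ranking = sims.argsort()[::-1]            # most similar candidate first
    print(ranking, sims)                      # candidate A ranks above B here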
3

Choubey, Rahul. „Tag recommendation using Latent Dirichlet Allocation“. Thesis, Kansas State University, 2011. http://hdl.handle.net/2097/9785.

Annotation:
Master of Science, Department of Computing and Information Sciences (advisor: Doina Caragea)
The vast amount of data present on the internet calls for ways to label and organize this data according to specific categories, in order to facilitate search and browsing activities. This can be easily accomplished by making use of folksonomies and user-provided tags. However, it can be difficult for users to provide meaningful tags. Tag recommendation systems can guide users towards informative tags for online resources such as websites, pictures, etc. The aim of this thesis is to build a system for recommending tags for URLs available through a bookmark sharing service called BibSonomy. We assume that the URLs for which we recommend tags do not have any prior tags assigned to them. Two approaches are proposed to address the tagging problem, both of them based on Latent Dirichlet Allocation (LDA) [Blei et al., 2003]. LDA is a generative probabilistic topic model which aims to infer the hidden topical structure in a collection of documents. According to LDA, documents can be seen as mixtures of topics, while topics can be seen as mixtures of words (in our case, tags). The first approach that we propose, called the topic-words-based approach, recommends the top words in the top topics representing a resource as tags for that particular resource. The second approach, called the topic-distance-based approach, uses the tags of the most similar training resources (identified using the KL divergence [Kullback and Leibler, 1951]) to recommend tags for an untagged test resource. The dataset used in this work was made available through the ECML/PKDD Discovery Challenge 2009. We construct the documents that are provided as input to LDA in two ways, thus producing two different datasets. In the first dataset, we use only the description and the tags (when available) corresponding to a URL. In the second dataset, we crawl the URL content and use it to construct the document. Experimental results show that the LDA approach is not very effective at recommending tags for new untagged resources. However, using the resource content gives better results than using the description only. Furthermore, the topic-distance-based approach is better than the topic-words-based approach when only the descriptions are used to construct documents, while the topic-words-based approach works better when the contents are used to construct documents.
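The topic-distance-based approach can be illustrated with a small sketch: pick the training resource whose topic distribution has the lowest KL divergence from the test resource and reuse its tags. All distributions, URLs and tags below are hypothetical:

    # Nearest-neighbour tag transfer via KL divergence between topic distributions.
    import numpy as np
    from scipy.stats import entropy   # entropy(p, q) computes KL(p || q)

    test_doc = np.array([0.7, 0.2, 0.1])
    train_docs = {"url_a": np.array([0.6, 0.3, 0.1]),
                  "url_b": np.array([0.1, 0.2, 0.7])}
    train_tags = {"url_a": ["python", "tutorial"], "url_b": ["travel", "photos"]}

    nearest = min(train_docs, key=lambda u: entropy(test_doc, train_docs[u]))
    print(nearest, train_tags[nearest])       # -> url_a ['python', 'tutorial']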
4

Risch, Johan. „Detecting Twitter topics using Latent Dirichlet Allocation“. Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-277260.

Annotation:
Latent Dirichlet Allocation is evaluated for its suitability for detecting topics in a stream of short messages limited to 140 characters. This is done by assessing its ability to model the incoming messages and to classify previously unseen messages with known topics. The evaluation shows that the model can be suitable for certain topic detection applications when the stream size is small enough. Furthermore, suggestions on how to handle larger streams are outlined.
5

Liu, Zelong. „High performance latent dirichlet allocation for text mining“. Thesis, Brunel University, 2013. http://bura.brunel.ac.uk/handle/2438/7726.

Annotation:
Latent Dirichlet Allocation (LDA), a fully generative probabilistic model, is a three-level Bayesian model. LDA computes the latent topic structure of the data and obtains the significant information of documents. However, traditional LDA has several limitations in practical applications. LDA cannot be used directly for classification because it is an unsupervised learning model; it needs to be embedded into appropriate classification algorithms. As a generative model, LDA may generate latent topics in categories to which the target documents do not belong, producing deviations in computation and reducing classification accuracy. The number of topics in LDA greatly influences the learning of the model parameters, and noise samples in the training data also affect the final text classification result. Moreover, the quality of LDA-based classifiers depends to a great extent on the quality of the training samples. Although parallel LDA algorithms have been proposed to deal with huge amounts of data, balancing computing loads in a computer cluster poses another challenge. This thesis presents a text classification method which combines the LDA model and the Support Vector Machine (SVM) classification algorithm for improved classification accuracy while reducing the dimension of datasets. Based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the algorithm automatically optimizes the number of topics to be selected, which reduces the number of iterations in computation. Furthermore, this thesis presents a noise data reduction scheme; when the noise ratio is large in the training data set, the noise reduction scheme can still produce a high level of classification accuracy. Finally, the thesis parallelizes LDA using the MapReduce model, the de facto computing standard for supporting data-intensive applications. A genetic-algorithm-based load balancing algorithm is designed to balance the workloads among computers in a heterogeneous MapReduce cluster where the computers have a variety of computing resources in terms of CPU speed, memory space and hard disk space.
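A minimal sketch of the LDA-plus-SVM combination (without the DBSCAN topic-number optimization, noise reduction, or MapReduce parallelization the thesis adds) might look as follows in Python; documents and labels are placeholders:

    # LDA reduces the document-term matrix to topic proportions, which feed an SVM.
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.svm import SVC

    docs = ["cheap flights and hotels", "football match results",
            "stock market update", "league standings tonight"]
    labels = ["travel", "sports", "finance", "sports"]

    clf = make_pipeline(
        CountVectorizer(),
        LatentDirichletAllocation(n_components=3, random_state=0),
        SVC(kernel="linear"),
    )
    clf.fit(docs, labels)
    print(clf.predict(["latest transfer news from the league"]))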
6

Kulhanek, Raymond Daniel. „A Latent Dirichlet Allocation/N-gram Composite Language Model“. Wright State University / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=wright1379520876.

7

Anaya, Leticia H. „Comparing Latent Dirichlet Allocation and Latent Semantic Analysis as Classifiers“. Thesis, University of North Texas, 2011. https://digital.library.unt.edu/ark:/67531/metadc103284/.

Annotation:
In the Information Age, a proliferation of unstructured electronic text documents exists. Processing these documents by humans is a daunting task, as humans have limited cognitive abilities for processing large volumes of documents that can often be extremely lengthy. To address this problem, text data computer algorithms are being developed. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are two such algorithms that have received much attention individually in the text data literature for topic extraction studies, but not for document classification or for comparison studies. Since classification is considered an important human function and has been studied in the areas of cognitive science and information science, this dissertation presents a research study comparing LDA, LSA and humans as document classifiers. The research questions posed in this study are: R1: How accurate are LDA and LSA in classifying documents in a corpus of textual data over a known set of topics? R2: How accurate are humans in performing the same classification task? R3: How does LDA classification performance compare to LSA classification performance? To address these questions, a classification study involving human subjects was designed in which humans were asked to generate and classify documents (customer comments) at two levels of abstraction for a quality assurance setting. Then two computer algorithms, LSA and LDA, were used to classify these documents. The results indicate that humans outperformed both computer algorithms, with an accuracy rate of 94% at the higher level of abstraction and 76% at the lower level. At the high level of abstraction, the accuracy rates were 84% for both LSA and LDA; at the lower level, the accuracy rates were 67% for LSA and 64% for LDA. The findings of this research have strong implications for the improvement of information systems that process unstructured text. Document classifiers have potential applications in many fields (e.g., fraud detection, information retrieval, national security, and customer management). Development and refinement of algorithms that classify text is a fruitful area of ongoing research, and this dissertation contributes to this area.
8

Jaradat, Shatha. „OLLDA: Dynamic and Scalable Topic Modelling for Twitter : AN ONLINE SUPERVISED LATENT DIRICHLET ALLOCATION ALGORITHM“. Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-177535.

Annotation:
Providing high-quality topic inference in today's large and dynamic corpora, such as Twitter, is a challenging task, especially considering that the content in this environment consists of short texts with many abbreviations. This project proposes an improvement of a popular online topic modelling algorithm for Latent Dirichlet Allocation (LDA) by incorporating supervision to make it suitable for the Twitter context. This improvement is motivated by the need for a single algorithm that achieves both objectives: analyzing huge amounts of documents, including new documents arriving in a stream, while at the same time achieving high-quality topic detection in special-case environments such as Twitter. The proposed algorithm is a combination of an online algorithm for LDA and a supervised variant of LDA, labeled LDA. The performance and quality of the proposed algorithm are compared with these two algorithms. The results demonstrate that the proposed algorithm achieves better performance and quality than the supervised variant of LDA, and better quality than the online algorithm. These improvements make our algorithm an attractive option when applied to dynamic environments like Twitter. An environment for analyzing and labelling data was designed to prepare the dataset before executing the experiments. Possible application areas for the proposed algorithm are tweet recommendation and trend detection.
9

Yalamanchili, Hima Bindu. „A Novel Approach For Cancer Characterization Using Latent Dirichlet Allocation and Disease-Specific Genomic Analysis“. Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright1527600876174758.

10

Sheikha, Hassan. „Text mining Twitter social media for Covid-19 : Comparing latent semantic analysis and latent Dirichlet allocation“. Thesis, Högskolan i Gävle, Avdelningen för datavetenskap och samhällsbyggnad, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-32567.

Annotation:
In this thesis, Twitter social media data is mined for information about the Covid-19 outbreak during the month of March, from the 3rd to the 31st. 100,000 tweets were collected from Harvard's open-source data and recreated using Hydrate. The data is then analyzed with different Natural Language Processing (NLP) methodologies, such as term frequency-inverse document frequency (TF-IDF), lemmatization, tokenization, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). The output of the LSA and LDA algorithms is dimensionally reduced data, which is clustered using the clustering algorithms HDBSCAN and K-Means for later comparison. Different methodologies are used to determine the optimal parameters for the algorithms. All of this is done in the Python programming language, as there are libraries supporting this research, the most important being scikit-learn. The frequent words of each cluster are then displayed and compared with factual data regarding the outbreak to discover whether there are any correlations. The factual data is collected by the World Health Organization (WHO) and visualized in graphs on ourworldindata.org. Correlations with the results are also sought in news articles, to find significant moments and to see whether they affected the top words in the clustered data. The news articles with good timelines used for correlating incidents are those of NBC News and the New York Times. The results show no direct correlations with the data reported by WHO; however, looking at the timelines reported by news sources, some correlation can be seen with the clustered data. Also, the combination of LDA and HDBSCAN yielded the most desirable results compared to the other combinations of dimensionality reduction and clustering. This was largely due to the use of GridSearchCV on LDA to determine the ideal parameters for the LDA models on each dataset, as well as to how well HDBSCAN clusters its data compared to K-Means.
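As a sketch of two steps mentioned above (GridSearchCV over LDA parameters, then clustering of the reduced representation), the following Python fragment uses scikit-learn; the tweets and parameter grid are placeholders, and K-Means stands in for the HDBSCAN variant also evaluated in the thesis:

    # Pick LDA parameters by grid search (scored by LDA's approximate
    # log-likelihood), then cluster the document-topic vectors.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.model_selection import GridSearchCV
    from sklearn.cluster import KMeans

    tweets = ["stay home save lives", "new covid cases reported",
              "vaccine trial results", "lockdown extended again"]
    dtm = CountVectorizer(stop_words="english").fit_transform(tweets)

    search = GridSearchCV(LatentDirichletAllocation(random_state=0),
                          {"n_components": [2, 3], "learning_decay": [0.5, 0.7]},
                          cv=2)
    search.fit(dtm)

    topic_vectors = search.best_estimator_.transform(dtm)
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topic_vectors)
    print(search.best_params_, clusters)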
11

Nelaturu, Keerthi. „Content Management and Hashtag Recommendation in a P2P Social Networking Application“. Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32501.

Annotation:
This thesis focuses on developing an online social network application with a peer-to-peer infrastructure motivated by the BestPeer++ architecture and the BATON overlay structure. BestPeer++ is a data processing platform which enables data sharing between enterprise systems. BATON is an open-source project which implements a peer-to-peer network with a balanced-tree topology. We designed and developed the components for users to manage their accounts, maintain friend relationships, and publish their content with privacy control, newsfeed, and notification requests in this social networking application. We also developed a hashtag recommendation system for this application. A user may invoke a recommendation procedure while writing content. Once invoked, the recommendation procedure returns a list of candidate hashtags, and the user may select one hashtag from the list and embed it into the content. The proposed approach uses the Latent Dirichlet Allocation (LDA) topic model to derive the latent or hidden topics of different content. The LDA topic model is a well-developed data mining algorithm that is generally effective in analyzing text documents of different lengths. The topic model is further used to identify the candidate hashtags that are associated with the texts in the published content through their association with the derived hidden topics. We considered different recommendation methods for selecting candidate hashtags from different content. Some methods consider the hashtags contained in the contents of the whole social network or of the user themselves; these are content-based recommendation techniques, which match the user's own profile with the profiles of items. Other methods consider the hashtags contained in the contents of friends or of similar users; these are collaborative-filtering-based recommendation techniques, which consider the profiles of other users in the system. At the end of the recommendation procedure, the candidate hashtags are ordered by their probability of appearance in the content and returned to the user. We also conducted experiments to evaluate the effectiveness of the hashtag recommendation approach, fed with tweets published on Twitter. The hit-rate of the recommendation, i.e., the percentage of the selected or relevant hashtags contained in the candidate hashtags, is measured in these experiments. Our experimental results show a hit-rate above 50% when each recommendation method is used independently. Moreover, when both similar-user and user preferences are considered at the same time, the hit-rate improves to 87% and 92% for top-5 and top-10 candidate recommendations, respectively.
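The hit-rate metric can be sketched in a few lines; the version below counts a test case as a hit when at least one truly selected hashtag appears among the top-k candidates, which is one plausible reading of the definition above (the recommendation lists are invented):

    # Fraction of test cases where the top-k candidates contain a relevant hashtag.
    def hit_rate(recommended, actual, k=5):
        hits = sum(1 for recs, truth in zip(recommended, actual)
                   if set(recs[:k]) & set(truth))
        return hits / len(actual)

    recommended = [["nlp", "lda", "topics", "ml", "data"], ["food", "travel"]]
    actual = [["lda", "bayes"], ["music"]]
    print(hit_rate(recommended, actual, k=5))   # -> 0.5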
12

Järvstråt, Lotta. „Functionality Classification Filter for Websites“. Thesis, Linköpings universitet, Statistik, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-93702.

Annotation:
The objective of this thesis is to evaluate different models and methods for website classification. The websites are classified based on their functionality, in this case specifically whether they are forums, news sites or blogs. The analysis aims at solving a search engine problem, which means that it is interesting to know from which categories the results of an information search come. The data consists of two datasets, extracted from the web in January and April 2013. Together these datasets comprise approximately 40,000 observations, each observation being the extracted text from a website. Approximately 7,000 new word variables were subsequently created from this text, as were variables based on Latent Dirichlet Allocation. One variable (the number of links) was created using the HTML code of the website. These datasets are used both in multinomial logistic regression with Lasso regularization and to create a Naive Bayes classifier. The best classifier for the data studied was achieved when using the Lasso on all variables with multinomial logistic regression to reduce the number of variables. The accuracy of this model is 99.70%. When the time dependency of the models is considered, using the first dataset to build the model and the second dataset for testing, the accuracy, however, is only 90.74%. This indicates that the data is time dependent and that website topics change over time.
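A minimal sketch of the winning setup, multinomial logistic regression with an L1 (Lasso) penalty over word-count features, could look like this in Python with scikit-learn; the site texts and labels are placeholders:

    # L1-penalized multinomial logistic regression over sparse word features;
    # the L1 penalty drives the weights of irrelevant words to zero.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    sites = ["post reply thread member", "breaking news editor report",
             "my diary today I wrote", "forum topic new post"]
    labels = ["forum", "news", "blog", "forum"]

    X = CountVectorizer().fit_transform(sites)
    clf = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000)
    clf.fit(X, labels)
    print(clf.predict(X))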
13

Schenk, Jason Robert. „Meta-uncertainty and resilience with applications in intelligence analysis“. The Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=osu1199129269.

14

Askling, Kim. „Application of Topic Models for Test Case Selection : A comparison of similarity-based selection techniques“. Thesis, Linköpings universitet, Programvara och system, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-159803.

Annotation:
Regression testing is just as important for the quality assurance of a system as it is time consuming. Several techniques exist with the purpose of lowering the execution times of test suites and providing faster feedback to the developers; examples are ones based on transition models or string distances. These techniques are called test case selection (TCS) techniques and focus on selecting subsets of the test suite deemed relevant for the modifications made to the system under test. This thesis project focused on evaluating the use of a topic model, latent Dirichlet allocation, as a means to create a diverse selection of test cases covering certain test characteristics. The model was tested on authentic data sets from two different companies, and the results were compared against prior work where TCS was performed using similarity-based techniques. The model was also tuned and evaluated, using an algorithm based on differential evolution, to increase its stability in terms of inferred topics and topic diversity. The results indicate that using the model for test case selection was not as efficient as the other similarity-based selection techniques studied in work prior to this thesis. In fact, the results show that the selection generated using the model performs similarly, in terms of coverage, to a randomly selected subset of the test suite. Tuning the model does not improve these results; in fact, the tuned model performs worse than the other methods in most cases. However, the tuning process makes the model more stable in terms of inferred latent topics and topic diversity. The performance of the model is believed to be strongly dependent on the characteristics of the underlying data used to train it, putting emphasis on word frequencies and the overall sizes of the training documents, and implying that accounting for these factors would change the words' relevance scoring for the better.
15

Malhomme, Nemo. „Statistical learning for climate models“. Electronic Thesis or Diss., université Paris-Saclay, 2024. http://www.theses.fr/2024UPAST165.

Annotation:
Climate models face challenges in accurately representing atmospheric circulation patterns related to extreme weather events, especially regarding regional variability. This thesis explores how Latent Dirichlet Allocation (LDA), a statistical learning method originating from natural language processing, can be adapted to evaluate the ability of climate models to represent data such as Sea-Level Pressure (SLP). LDA identifies a set of local synoptic-scale structures, physically interpretable as cyclones and anticyclones, referred to as motifs. A common basis of motifs can be used to describe reanalysis and model data, so that any SLP map can be represented as a sparse combination of these motifs. The motif weights provide local information on the synoptic configuration of circulation. By analyzing the weights, we can characterize circulation patterns in both reanalysis data and models, allowing us to identify local biases, both in general data and during extreme events. A global dynamic error can be defined for each model run based on the differences between the average weights of the run and reanalysis data. This methodology was applied to four CMIP6 models. While large-scale circulation is well predicted by all models on average, higher errors are found for heatwaves and cold spells. In general, a major source of error is found to be associated with Mediterranean motifs, for all models. Additional evaluation criteria were considered: one was based on the frequency of motifs in the sparse map representation. Another one involved combining the global dynamic error with the temperature error, thus making it possible to discriminate between models. These results show the potential of LDA for model evaluation and preselection.
16

Lindgren, Jennifer. „Evaluating Hierarchical LDA Topic Models for Article Categorization“. Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-167080.

Annotation:
With the vast amount of information available on the Internet today, helping users find relevant content has become a prioritized task in many software products that recommend news articles. One such product is Opera for Android, which has a news feed containing articles the user may be interested in. In order to easily determine which articles to recommend, they can be categorized by the topics they contain. One approach to categorizing articles is using machine learning and Natural Language Processing (NLP). A commonly used model is Latent Dirichlet Allocation (LDA), which finds latent topics within large datasets of, for example, text articles. An extension of LDA is hierarchical Latent Dirichlet Allocation (hLDA), a hierarchical variant of LDA. In hLDA, the latent topics found among a set of articles are structured hierarchically in a tree. Each node represents a topic, and the levels represent different levels of abstraction in the topics. A further extension of hLDA is constrained hLDA, where a set of predefined, constrained topics is added to the tree. The constrained topics are extracted from the dataset by grouping highly correlated words. The idea of constrained hLDA is to improve the topic structure derived by an hLDA model by making the process semi-supervised. The aim of this thesis is to create an hLDA and a constrained hLDA model from a dataset of articles provided by Opera. The models are then evaluated using the novel metric word frequency similarity, which is a measure of the similarity between the words representing the parent and child topics in a hierarchical topic model. The results show that word frequency similarity can be used to evaluate whether the topics in a parent-child topic pair are too similar, so that the child does not specify a subtopic of the parent. It can also be used to evaluate whether the topics are too dissimilar, so that they seem unrelated and perhaps should not be connected in the hierarchy. The results also show that the two topic models created had comparable word frequency similarity scores; neither model significantly outperformed the other with regard to the metric.
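The word frequency similarity metric is specific to this thesis, so the sketch below is only a loose illustration under an assumed definition: cosine similarity between the word-frequency vectors of a parent topic and a child topic (the frequency tables are invented, and the thesis's exact definition may differ):

    # Cosine similarity over word-frequency dictionaries of two topics.
    import math

    def word_freq_similarity(parent, child):
        vocab = set(parent) | set(child)
        dot = sum(parent.get(w, 0) * child.get(w, 0) for w in vocab)
        norm = (math.sqrt(sum(v * v for v in parent.values())) *
                math.sqrt(sum(v * v for v in child.values())))
        return dot / norm if norm else 0.0

    parent = {"sport": 10, "game": 8, "team": 5}
    child = {"football": 9, "game": 4, "team": 3}
    print(word_freq_similarity(parent, child))  # mid-range: related but distinct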
17

Morchid, Mohamed. „Représentations robustes de documents bruités dans des espaces homogènes“. Thesis, Avignon, 2014. http://www.theses.fr/2014AVIG0202/document.

Annotation:
In the Information Retrieval field, documents are usually considered as a "bag-of-words". This model does not take into account the temporal structure of the document and is sensitive to noise which can alter its lexical form. This noise can be produced by different sources: the uncontrolled form of documents on micro-blogging platforms, automatic transcription of speech documents, which is error-prone, lexical and grammatical variability in Web forums... The work presented in this thesis addresses issues related to document representations from noisy sources. The thesis consists of three parts, in which different representations of content are proposed. The first one compares a classical representation based on term frequency to a higher-level representation based on a topic space. The abstraction of the document content allows us to limit the alteration of the noisy document by representing its content with a set of high-level features. Our experiments confirm that mapping a noisy document into a topic space improves the results obtained on different information retrieval tasks compared to a classical approach based on term frequency. The major problem with such a high-level representation is that it is based on a topic space whose parameters are chosen empirically. The second part presents a novel representation based on multiple topic spaces that allows us to solve three main problems: the closeness of the subjects discussed in the document, the tricky choice of the "right" values of the topic space parameters, and the robustness of the topic-based representation. Based on the idea that a single representation of the contents cannot capture all the relevant information, we propose to increase the number of views on a single document. This multiplication of views generates "artificial" observations that contain fragments of the useful information. A first experiment validated this multi-view approach to representing noisy texts. However, it has the disadvantage of being very large and redundant and of containing additional variability associated with the diversity of views. In a second step, we propose a method based on factor analysis to merge the multiple views and obtain a new robust representation of low dimension which contains only the informative part of the document while the noisy variabilities are compensated. During a dialogue classification task, this compression process confirmed that the compact representation improves the robustness of the noisy document representation. Nonetheless, during the learning of topic spaces, the document is still considered as a "bag-of-words", while many studies have shown that the position of a word within a document is useful. A representation which takes this temporal structure of the document into account, based on hyper-complex numbers, is proposed in the third part. This representation is based on the hyper-complex numbers of dimension four called quaternions. Our experiments on a classification task have shown the effectiveness of the proposed approach compared to a conventional "bag-of-words" representation.
18

Hachey, Benjamin. „Towards generic relation extraction“. Thesis, University of Edinburgh, 2009. http://hdl.handle.net/1842/3978.

Annotation:
A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational database that can be more effectively used for querying and automated reasoning. However, adapting conventional relation extraction systems to new domains or tasks requires significant effort from annotators and developers. Furthermore, previous adaptation approaches based on bootstrapping start from example instances of the target relations, thus requiring that the correct relation type schema be known in advance. Generic relation extraction (GRE) addresses the adaptation problem by applying generic techniques that achieve comparable accuracy when transferred, without modification of model parameters, across domains and tasks. Previous work on GRE has relied extensively on various lexical and shallow syntactic indicators. I present new state-of-the-art models for GRE that incorporate governor-dependency information. I also introduce a dimensionality reduction step into the GRE relation characterisation sub-task, which serves to capture latent semantic information and leads to significant improvements over an unreduced model. Comparison of dimensionality reduction techniques suggests that latent Dirichlet allocation (LDA), a probabilistic generative approach, successfully incorporates a larger and more interdependent feature set than a model based on singular value decomposition (SVD) and performs as well as or better than SVD in all experimental settings. Finally, I introduce multi-document summarisation as an extrinsic test bed for GRE and present results which demonstrate that the relative performance of GRE models is consistent across tasks and that the GRE-based representation leads to significant improvements over a standard baseline from the literature. Taken together, the experimental results 1) show that GRE can be improved using dependency parsing and dimensionality reduction, 2) demonstrate the utility of GRE for the content selection step of extractive summarisation and 3) validate the GRE claim of modification-free adaptation for the first time with respect to both domain and task. This thesis also introduces data sets derived from publicly available corpora for the purpose of rigorous intrinsic evaluation in the news and biomedical domains.
19

Paganin, Sally. „Prior-driven cluster allocation in bayesian mixture models“. Doctoral thesis, Università degli studi di Padova, 2018. http://hdl.handle.net/11577/3426831.

Annotation:
There is a very rich literature proposing Bayesian approaches for clustering, starting with a prior probability distribution on partitions. Most approaches assume exchangeability, leading to simple representations of such priors in terms of an Exchangeable Partition Probability Function (EPPF). Gibbs-type priors encompass a broad class of such cases, including Dirichlet and Pitman-Yor processes. Even though there have been some proposals to relax the exchangeability assumption, allowing covariate dependence and partial exchangeability, limited consideration has been given to how to include concrete prior knowledge about the partition. Our motivation is drawn from an epidemiological application in which we wish to cluster birth defects into groups and we have prior knowledge of an initial clustering provided by experts. The underlying assumption is that birth defects in the same group may have similar coefficients in logistic regression analyses relating different exposures to the risk of developing the defect. As a general approach for including such prior knowledge, we propose a Centered Partition (CP) process that modifies a base EPPF to favor partitions in a convenient distance neighborhood of the initial clustering. This thesis focuses on providing a characterization of this new class, along with properties and general algorithms for posterior computation. We illustrate the methodology through simulation examples and an application to the motivating epidemiological study of birth defects.
20

Bakharia, Aneesha. „Interactive content analysis : evaluating interactive variants of non-negative Matrix Factorisation and Latent Dirichlet Allocation as qualitative content analysis aids“. Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/76535/1/Aneesha_Bakharia_Thesis.pdf.

Annotation:
This thesis addressed issues that have prevented qualitative researchers from using thematic discovery algorithms. The central hypothesis evaluated whether allowing qualitative researchers to interact with thematic discovery algorithms and incorporate domain knowledge improved their ability to address research questions and trust the derived themes. Non-negative Matrix Factorisation and Latent Dirichlet Allocation find latent themes within document collections but these algorithms are rarely used, because qualitative researchers do not trust and cannot interact with the themes that are automatically generated. The research determined the types of interactivity that qualitative researchers require and then evaluated interactive algorithms that matched these requirements. Theoretical contributions included the articulation of design guidelines for interactive thematic discovery algorithms, the development of an Evaluation Model and a Conceptual Framework for Interactive Content Analysis.
21

Bui, Quang Vu. „Pretopology and Topic Modeling for Complex Systems Analysis : Application on Document Classification and Complex Network Analysis“. Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEP034/document.

Annotation:
The work of this thesis presents the development of algorithms for document classification on the one hand, and complex network analysis on the other, based on pretopology, a theory that models the concept of proximity. The first part develops a framework for document clustering by combining topic modeling and pretopology. Our contribution proposes using topic distributions extracted from topic modeling as input for classification methods. In this approach, we investigated two aspects: determining an appropriate distance between documents by studying the relevance of probabilistic and vector-based measures, and performing groupings according to several criteria using a pseudo-distance defined from pretopology. The second part introduces a general framework for modeling complex networks by developing a reformulation of stochastic pretopology, and proposes the Pretopology Cascade Model as a general model of information diffusion. In addition, we propose an agent-based model, Textual-ABM, to analyze complex dynamic networks associated with textual information using an author-topic model, and introduce Textual-Homo-IC, an independent cascade model based on resemblance, in which homophily is measured from textual content obtained by topic modeling.
22

Clavijo, García David Mauricio. „Metodología para el análisis de grandes volúmenes de información aplicada a la investigación médica en Chile“. Tesis, Universidad de Chile, 2017. http://repositorio.uchile.cl/handle/2250/146597.

Annotation:
Master's in Business Engineering with Information Technology
Medical knowledge has accumulated over time in scientific research articles; consequently, there is growing interest in developing text mining methodologies to extract, structure, and analyze the knowledge obtained from large volumes of information in the shortest possible time. This work presents a methodology that achieves this goal using the LDA (Latent Dirichlet Allocation) model. The methodology consists of three steps: first, identifying relevant topics in medical research articles from the Revista Médica de Chile (2012-2015); second, identifying and interpreting the relationships between the resulting topics through visualization methods (LDAvis); third, evaluating specific characteristics of the research, in this case directed funding, using the two previous steps. The results show that this methodology is effective not only for the analysis of medical research articles but can also be used in other fields of science. Additionally, the method makes it possible to analyze and interpret the state of medical research at the national level, using the Revista Médica de Chile as a reference. Within this context, it is important to consider the planning, management, and production processes of scientific research within hospitals, which have been standard-bearers of knowledge generation, since they function as university campuses of tradition and innovation. For this reason, an analysis is carried out of the health-sector environment, its structure, and the possibility of applying the methodology proposed in this work, based on the strategic planning and business model of the Hospital Exequiel González Cortés.
23

Halmann, Marju. „Email Mining Classifier : The empirical study on combining the topic modelling with Random Forest classification“. Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-14710.

Annotation:
Filtering out and replying automatically to emails are of interest to many, but hard due to the complexity of language and to dependencies on background information that is not present in the email itself. This paper investigates whether Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier can be used for the more general email classification task, and how it compares to other existing email classifiers. The comparison is based on a literature study and on empirical experimentation using two real-life datasets. Firstly, a literature study is performed to gain insight into the accuracy of other available email classifiers. Secondly, the proposed model's accuracy is explored experimentally. The literature study shows that the accuracy of more general email classifiers differs greatly across different user sets. The proposed model's accuracy is within the reported accuracy range, although in the lower part, indicating that it performs poorly compared to other classifiers. On average, the classifier's performance improves by 15 percentage points with additional information. This indicates that Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier is promising; however, future studies are needed to explore the model and ways to further increase its accuracy.
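A minimal sketch of the model under study, LDA topic proportions feeding a Random Forest, might look like this in Python; the emails and labels are invented placeholders:

    # LDA topic proportions as features for a Random Forest email classifier.
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.ensemble import RandomForestClassifier

    emails = ["meeting moved to friday", "invoice attached please pay",
              "lunch tomorrow?", "payment overdue reminder"]
    labels = ["scheduling", "billing", "scheduling", "billing"]

    model = make_pipeline(
        CountVectorizer(),
        LatentDirichletAllocation(n_components=2, random_state=0),
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
    model.fit(emails, labels)
    print(model.predict(["please settle the attached invoice"]))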
24

Muñoz, Cancino Ricardo Luis. „Diseño, desarrollo y evaluación de un algoritmo para detectar sub-comunidades traslapadas usando análisis de redes sociales y minería de datos“. Tesis, Universidad de Chile, 2013. http://www.repositorio.uchile.cl/handle/2250/112582.

Annotation:
Master's in Operations Management
Industrial Civil Engineering degree
Virtual social networking sites have grown enormously over the last decade. Their main objective is to facilitate the creation of links between people who share, for example, interests, activities, knowledge, or real-life connections. The interaction among users generates a community in the social network. Several types of communities exist; communities of interest and communities of practice stand out. A community of interest is a group of people interested in sharing and discussing a particular topic of interest. In a community of practice, by contrast, people share a concern or passion for something they do and learn how to do it better. When the interactions take place over the internet, they are called virtual communities (VCoP/VCoI). Members commonly interact with only some of the other users, thus forming sub-communities, and may belong to more than one. Identifying these substructures is necessary, because that is where the interactions for the creation and development of the community's knowledge take place. Many algorithms have been designed to detect sub-communities. However, most of them detect disjoint sub-communities and, moreover, do not consider the content generated by the community members. The main objective of this work is to design, develop, and evaluate an algorithm for detecting overlapping sub-communities using social network analysis (SNA) and text mining. To this end, the SNA-KDD methodology proposed by Ríos et al. [79], which combines Knowledge Discovery in Databases (KDD) and SNA, is used. It was applied to two virtual communities: Plexilandia (VCoP) and The Dark Web Portal (VCoI). In the KDD stage, users' posts were preprocessed, and Latent Dirichlet Allocation (LDA) was then applied to describe each post in terms of topics. In the SNA stage, filtered networks were built with the information obtained in the previous stage. Two algorithms developed in this thesis, SLTA and TPA, were then used to find overlapping sub-communities. The results show that SLTA achieves, on average, 5% better performance than the best existing algorithm when applied to a VCoP. In addition, the quality of the detected sub-community structure was found to increase by 64% on average as the semantic filter is strengthened. As for TPA, this algorithm achieves an average modularity of 0.33, versus 0.043 for the best existing algorithm, when applied to a VCoI. Moreover, the joint application of our algorithms appears to offer a way of determining the type of community being analyzed; however, this must be verified by analyzing more virtual communities.
25

Wedenberg, Kim, und Alexander Sjöberg. „Online inference of topics : Implementation of the topic model Latent Dirichlet Allocation using an online variational bayes inference algorithm to sort news articles“. Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-222429.

Der volle Inhalt der Quelle
Annotation:
The client of the project has problems with complex queries and noise when querying their stream of five million news articles per day. This results in much manual work when sorting and pruning the search results of their queries. Instead of using direct text matching, the approach of the project was to use a topic model to describe articles in terms of the topics they cover and to use this new information to sort the articles. An online version of the topic model Latent Dirichlet Allocation was implemented using online variational Bayes inference to handle streamed data. Using 100 dimensions, topics such as sports and politics emerged during training on a simulated stream of 1.7 million articles. These topics were used to sort articles based on context. The implementation was found accurate enough to be useful for the client as well as fast and stable enough to be a feasible solution to the problem.
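As a sketch of the kind of streamed processing this abstract describes, the snippet below shows online LDA updates with gensim. This is an assumption on my part; the thesis describes its own implementation, not this library, and `article_batches` and `new_article` are hypothetical inputs.

```python
# Sketch: online variational Bayes LDA over a simulated article stream.
# `article_batches` is a hypothetical iterator yielding lists of tokenized
# articles; the dictionary is fixed from the first batch for simplicity.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

first_batch = next(article_batches)
dictionary = Dictionary(first_batch)
corpus = [dictionary.doc2bow(doc) for doc in first_batch]

# 100 topics, matching the dimensionality mentioned in the abstract.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=100)

for batch in article_batches:
    bows = [dictionary.doc2bow(doc) for doc in batch]
    lda.update(bows)  # incremental variational update on the new chunk

# Sort a new article by its dominant topics.
topics = lda.get_document_topics(dictionary.doc2bow(new_article))
```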
APA, Harvard, Vancouver, ISO und andere Zitierweisen
26

Long, Hannah. „Geographic Relevance for Travel Search: The 2014-2015 Harvey Mudd College Clinic Project for Expedia, Inc“. Scholarship @ Claremont, 2015. http://scholarship.claremont.edu/scripps_theses/670.

Der volle Inhalt der Quelle
Annotation:
The purpose of this Clinic project is to help Expedia, Inc. expand the search capabilities it offers to its users. In particular, the goal is to help the company respond to unconstrained search queries by generating a method to associate hotels and regions around the world with the higher-level attributes that describe them, such as “family-friendly” or “culturally-rich.” Our team utilized machine-learning algorithms to extract metadata from textual data about hotels and cities. We focused on two machine-learning models: decision trees and Latent Dirichlet Allocation (LDA). The first appeared to be a promising approach, but would require more resources to replicate on the scale Expedia needs. On the other hand, we were able to generate useful results using LDA. We created a website to visualize these results.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
27

Dupuy, Christophe. „Inference and applications for topic models“. Thesis, Paris Sciences et Lettres (ComUE), 2017. http://www.theses.fr/2017PSLEE055/document.

Der volle Inhalt der Quelle
Annotation:
Most current recommendation systems are based on ratings (i.e. numbers between 0 and 5) and try to suggest content (a movie, a restaurant...) to a user. These systems usually allow users to provide a text review of this content in addition to the rating. It is hard to extract useful information from raw text, while a rating does not contain much information about the content and the user. In this thesis, we tackle the problem of suggesting personalized readable text to users to help them make a quick decision about a content item. More specifically, we first build a topic model that predicts personalized movie descriptions from text reviews. Our model extracts distinct qualitative (i.e., opinion-conveying) and descriptive topics by combining text reviews and movie ratings in a joint probabilistic model. We evaluate our model on an IMDB dataset and illustrate its performance through a comparison of topics. We then study parameter inference in large-scale latent variable models, which include most topic models. We propose a unified treatment of online inference for latent variable models from a non-canonical exponential family, and draw explicit links between several previously proposed frequentist or Bayesian methods. We also propose a novel inference method for the frequentist estimation of parameters, which adapts MCMC methods to online inference of latent variable models with the proper use of local Gibbs sampling. For the specific latent Dirichlet allocation topic model, we provide an extensive set of experiments and comparisons with existing work, where our new approach outperforms all previously proposed methods. Finally, we propose a new class of determinantal point processes (DPPs) which can be manipulated for inference and parameter learning in potentially sublinear time in the number of items. This class, based on a specific low-rank factorization of the marginal kernel, is particularly suited to a subclass of continuous DPPs and DPPs defined on exponentially many items. We apply this new class to modelling text documents as samples of a DPP over sentences, and propose a conditional maximum likelihood formulation to model topic proportions, which is made possible with no approximation for our class of DPPs. We present an application to document summarization with a DPP on 2^500 items, where the summaries are composed of readable sentences.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
28

Sathi, Veer Reddy, und Jai Simha Ramanujapura. „A Quality Criteria Based Evaluation of Topic Models“. Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-13274.

Der volle Inhalt der Quelle
Annotation:
Context. Software testing is the process in which a particular software product or system is executed in order to find bugs or issues that may otherwise degrade its performance. Software testing is usually done based on pre-defined test cases. A test case can be defined as a set of terms or conditions that are used by software testers to determine whether a particular system under test operates as it is supposed to. However, in numerous situations there can be so many test cases that executing each and every one is practically impossible, as there may be many constraints. This forces the testers to prioritize the functions that are to be tested. This is where the ability of topic models can be exploited. Topic models are unsupervised machine learning algorithms that can explore large corpora of data and classify them by identifying the hidden thematic structure in those corpora. Using topic models for test case prioritization can save a lot of time and resources. Objectives. In our study, we provide an overview of the amount of research that has been done in relation to topic models. We want to uncover various quality criteria, evaluation methods, and metrics that can be used to evaluate topic models. Furthermore, we also compare the performance of two topic models that are optimized for different quality criteria on a particular interpretability task, and thereby determine the topic model that produces the best results for that task. Methods. A systematic mapping study was performed to gain an overview of the previous research on the evaluation of topic models. The mapping study focused on identifying quality criteria, evaluation methods, and metrics that have been used to evaluate topic models. The results of the mapping study were then used to identify the most used quality criteria. The evaluation methods related to those criteria were then used to generate two optimized topic models. An experiment was conducted in which the topics generated from those two topic models were provided to a group of 20 subjects. The task was designed to evaluate the interpretability of the generated topics. The performance of the two topic models was then compared using Precision, Recall, and F-measure. Results. Based on the results obtained from the mapping study, Latent Dirichlet Allocation (LDA) was found to be the most widely used topic model. Two LDA topic models were created, optimizing one for the quality criterion Generalizability (TG) and one for Interpretability (TI), using the Perplexity and Point-wise Mutual Information (PMI) measures respectively. For the selected metrics, TI showed better performance in Precision and F-measure than TG, while the performance of TI and TG was comparable in the case of Recall. The total run time of TI was also found to be significantly higher than that of TG: 46 hours and 35 minutes for TI, versus 3 hours and 30 minutes for TG. Conclusions. Looking at the F-measure, it can be concluded that the interpretability topic model (TI) performs better than the generalizability topic model (TG); however, while TI performed better in precision, recall was comparable, and the computational cost to create TI is significantly higher than for TG.
Hence, we conclude that the selection of topic model optimization should be based on the aim of the task the model is used for. If the task requires high interpretability of the model and precision is important, such as for the prioritization of test cases based on content, then TI would be the right choice, provided time is not a limiting factor. However, if the task aims at generating topics that provide a basic understanding of the concepts (i.e., interpretability is not a high priority), then TG is the more suitable choice, which also makes it better suited to time-critical tasks.
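The two optimization targets named above, perplexity for generalizability and PMI for interpretability, can be scored roughly as follows with gensim. This is a minimal sketch under the assumption that `train_corpus`, `held_out`, `texts`, and `dictionary` already exist; the thesis's own tooling is not shown.

```python
# Sketch: scoring one LDA model on the two quality criteria from the study.
from gensim.models import LdaModel, CoherenceModel

lda = LdaModel(corpus=train_corpus, id2word=dictionary, num_topics=50)

# Lower perplexity on held-out documents suggests better generalizability;
# gensim reports a per-word log-2 bound, hence the 2**(-bound) conversion.
perplexity = 2 ** (-lda.log_perplexity(held_out))

# 'c_uci' coherence is based on pointwise mutual information (PMI);
# higher values suggest more interpretable topics.
coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                           coherence='c_uci').get_coherence()
```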
APA, Harvard, Vancouver, ISO und andere Zitierweisen
29

Chenghua, Lin. „Probabilistic topic models for sentiment analysis on the Web“. Thesis, University of Exeter, 2011. http://hdl.handle.net/10036/3307.

Der volle Inhalt der Quelle
Annotation:
Sentiment analysis aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text, and has received a rapid growth of interest in natural language processing in recent years. Probabilistic topic models, on the other hand, are capable of discovering hidden thematic structure in large archives of documents, and have been an active research area in the field of information retrieval. The work in this thesis focuses on developing topic models for automatic sentiment analysis of web data, by combining the ideas from both research domains. One noticeable issue of most previous work in sentiment analysis is that the trained classifier is domain dependent, and the labelled corpora required for training could be difficult to acquire in real world applications. Another issue is that the dependencies between sentiment/subjectivity and topics are not taken into consideration. The main contribution of this thesis is therefore the introduction of three probabilistic topic models, which address the above concerns by modelling sentiment/subjectivity and topic simultaneously. The first model is called the joint sentiment-topic (JST) model based on latent Dirichlet allocation (LDA), which detects sentiment and topic simultaneously from text. Unlike supervised approaches to sentiment classification which often fail to produce satisfactory performance when applied to new domains, the weakly-supervised nature of JST makes it highly portable to other domains, where the only supervision information required is a domain-independent sentiment lexicon. Apart from document-level sentiment classification results, JST can also extract sentiment-bearing topics automatically, which is a distinct feature compared to the existing sentiment analysis approaches. The second model is a dynamic version of JST called the dynamic joint sentiment-topic (dJST) model. dJST respects the ordering of documents, and allows the analysis of topic and sentiment evolution of document archives that are collected over a long time span. By accounting for the historical dependencies of documents from the past epochs in the generative process, dJST gives a richer posterior topical structure than JST, and can better respond to the permutations of topic prominence. We also derive online inference procedures based on a stochastic EM algorithm for efficiently updating the model parameters. The third model is called the subjectivity detection LDA (subjLDA) model for sentence-level subjectivity detection. Two sets of latent variables were introduced in subjLDA. One is the subjectivity label for each sentence; another is the sentiment label for each word token. By viewing the subjectivity detection problem as weakly-supervised generative model learning, subjLDA significantly outperforms the baseline and is comparable to the supervised approach which relies on much larger amounts of data for training. These models have been evaluated on real world datasets, demonstrating that joint sentiment topic modelling is indeed an important and useful research area with much to offer in the way of good results.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
30

Mungre, Surbhi. „LDA-based dimensionality reduction and domain adaptation with application to DNA sequence classification“. Thesis, Kansas State University, 2011. http://hdl.handle.net/2097/8846.

Der volle Inhalt der Quelle
Annotation:
Master of Science
Department of Computing and Information Sciences
Doina Caragea
Several computational biology and bioinformatics problems involve DNA sequence classification using supervised machine learning algorithms. The performance of these algorithms is largely dependent on the availability of labeled data and the approach used to represent DNA sequences as feature vectors. For many organisms, labeled DNA data is scarce, while unlabeled data is easily available. However, for a small number of well-studied model organisms, large amounts of labeled data are available. This calls for domain adaptation approaches, which can transfer knowledge from a source domain, for which labeled data is available, to a target domain, for which large amounts of unlabeled data are available. Intuitively, one approach to domain adaptation can be obtained by extracting and representing the features that the source domain and the target domain sequences share. Latent Dirichlet Allocation (LDA) is an unsupervised dimensionality reduction technique that has been successfully used to generate features for sequence data such as text. In this work, we explore the use of LDA for generating predictive DNA sequence features that can be used in both supervised and domain adaptation frameworks. More precisely, we propose two dimensionality reduction approaches, LDA Words (LDAW) and LDA Distribution (LDAD), for DNA sequences. LDA is a probabilistic model, which is generative in nature, and is used to model collections of discrete data such as document collections. For our problem, a sequence is considered to be a "document" and k-mers obtained from a sequence are "document words". We use LDA to model our sequence collection. Given the LDA model, each document can be represented as a distribution over topics (where a topic can be seen as a distribution over k-mers). In the LDAW method, we use the top k-mers in each topic as our features (i.e., the k-mers with the highest probability), while in the LDAD method, we use the topic distribution to represent a document as a feature vector. We study LDA-based dimensionality reduction approaches for both supervised DNA sequence classification and domain adaptation. We apply the proposed approaches to the splice site prediction problem, which is an important DNA sequence classification problem in the context of genome annotation. In the supervised learning framework, we study the effectiveness of the LDAW and LDAD methods by comparing them with a traditional dimensionality reduction technique based on the information gain criterion. In the domain adaptation framework, we study the effect of increasing the evolutionary distance between the source and target organisms, and the effect of using different weights when combining labeled data from the source domain with labeled data from the target domain. Experimental results show that LDA-based features can be successfully used to perform dimensionality reduction and domain adaptation for DNA sequence classification problems.
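A minimal sketch of the "sequence as document, k-mers as words" representation with scikit-learn, illustrating the LDAD variant (the toy sequences and parameter values are mine, not the thesis's):

```python
# Toy sketch of LDAD: DNA sequences as "documents", overlapping k-mers as
# "document words", topic mixtures as feature vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sequences = ["ACGTACGTGACTGATC", "TTGACCGTAGCAGGTA"]  # toy stand-ins
k = 4

def kmers(seq, k):
    # Split a sequence into overlapping k-mers, space-separated.
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

docs = [kmers(s, k) for s in sequences]
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
features = lda.fit_transform(counts)  # each row: a topic distribution (LDAD)
```

The LDAW variant would instead keep, for each fitted topic, the highest-probability k-mers as the feature set.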
APA, Harvard, Vancouver, ISO und andere Zitierweisen
31

Johansson, Richard, und Heino Otto Engström. „Topic propagation over time in internet security conferences : Topic modeling as a tool to investigate trends for future research“. Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-177748.

Der volle Inhalt der Quelle
Annotation:
When conducting research, it is valuable to find high-ranked papers closely related to the specific research area, without spending too much time reading insignificant papers. To make this process more effective an automated process to extract topics from documents would be useful, and this is possible using topic modeling. Topic modeling can also be used to provide topic trends, where a topic is first mentioned, and who the original author was. In this paper, over 5000 articles are scraped from four different top-ranked internet security conferences, using a web scraper built in Python. From the articles, fourteen topics are extracted, using the topic modeling library Gensim and LDA Mallet, and the topics are visualized in graphs to find trends about which topics are emerging and fading away over twenty years. The result found in this research is that topic modeling is a powerful tool to extract topics, and when put into a time perspective, it is possible to identify topic trends, which can be explained when put into a bigger context.
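A hedged sketch of the trend step: once per-document topic distributions exist, yearly averages reveal emerging and fading topics. `lda`, `corpus`, and `years` (one publication year per paper) are hypothetical names assumed to exist already; the thesis used Gensim with LDA Mallet, which is not reproduced here.

```python
# Sketch: turning per-document topic distributions into yearly topic trends.
import pandas as pd

rows = []
for bow, year in zip(corpus, years):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        rows.append({"year": year, "topic": topic_id, "prob": prob})

trend = (pd.DataFrame(rows)
           .groupby(["year", "topic"])["prob"].mean()
           .unstack())   # rows: years, columns: topics
trend.plot()             # rising/falling lines show emerging/fading topics
```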
APA, Harvard, Vancouver, ISO und andere Zitierweisen
32

Fröjd, Sofia. „Measuring the information content of Riksbank meeting minutes“. Thesis, Umeå universitet, Institutionen för fysik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-158151.

Der volle Inhalt der Quelle
Annotation:
As the amount of information available on the internet has increased sharply in recent years, methods for measuring and comparing text-based information are gaining popularity on financial markets. Text mining and natural language processing have become important tools for classifying large collections of texts or documents. One field of application is topic modelling of the minutes of central banks' monetary policy meetings, which tend to be about topics such as "inflation", "economic growth" and "rates". The central bank of Sweden is the Riksbank, which holds six annual monetary policy meetings where the members of the Executive Board decide on the new repo rate. Two weeks later, the minutes of the meeting are published and information regarding future monetary policy is given to the market in the form of text. This information has been unknown to the market before release, and thus has the potential to be market-sensitive. Using Latent Dirichlet Allocation (LDA), an algorithm for uncovering latent topics in documents, it should be possible to identify and quantify the topics in the meeting minutes. In this project, 8 topics were found, concerning, among other things, inflation, rates, household debt and economic development. An important factor in the analysis of central bank communication is the underlying tone of the discussions. It is common to classify central bankers as hawkish or dovish. Hawkish members of the board tend to favour tightening monetary policy and rate hikes, while more dovish members advocate a more expansive monetary policy and rate cuts. Thus, analysing the tone of the minutes can give an indication of future moves of the monetary policy rate. The purpose of this project is to provide a fast method for analysing the minutes of the Riksbank monetary policy meetings. The project is divided into two parts. First, an LDA model was trained to identify the topics in the minutes, which was then used to compare the content of two consecutive sets of meeting minutes. Next, the sentiment was measured as a degree of hawkishness or dovishness. This was done by categorising each sentence in terms of its content and then counting words with hawkish or dovish sentiment. The resulting net score assigns larger values to more hawkish minutes and was shown to follow the repo rate path well. At the time of the release of the minutes, the new repo rate is already known, but the net score does give an indication of the stance of the board.
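A minimal sketch of the counting step described above, with illustrative placeholder word lists rather than the thesis's actual lexicon:

```python
# Sketch of the tone-scoring step: count hawkish and dovish terms per
# sentence and aggregate a net score over the minutes.
HAWKISH = {"tightening", "hike", "inflationary", "overheating"}  # placeholders
DOVISH = {"easing", "cut", "stimulus", "expansionary"}           # placeholders

def net_score(sentences):
    """Positive score = more hawkish minutes, negative = more dovish."""
    score = 0
    for sentence in sentences:
        words = sentence.lower().split()
        score += sum(w in HAWKISH for w in words)
        score -= sum(w in DOVISH for w in words)
    return score
```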
APA, Harvard, Vancouver, ISO und andere Zitierweisen
33

Greau-Hamard, Pierre-Samuel. „Contribution à l’apprentissage non supervisé de protocoles pour la couche de Liaison de données dans les systèmes communicants, à l'aide des Réseaux Bayésiens“. Thesis, CentraleSupélec, 2021. http://www.theses.fr/2021CSUP0009.

Der volle Inhalt der Quelle
Annotation:
The world of telecommunications is rapidly developing, especially in the area of the Internet of Things; in such a context, it would be useful to be able to analyze any unknown protocol one might encounter. For this purpose, obtaining the state machine and frame formats of the target protocol is essential. These two elements can be extracted from network traces and/or execution traces using Protocol Reverse Engineering (PRE) techniques.By analyzing the performance of three algorithms used in PRE systems, we discovered the potential of models based on Bayesian networks. We then developed Bayesian Network Frame Format Finder (BaNet3F), our own frame format learning model based on Bayesian networks, and showed that its performance is significantly better than the state of the art. BaNet3F also includes an optimized version of the Viterbi algorithm, applicable to any Bayesian network, thanks to its ability to generate the necessary Markov boundaries itself
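For orientation, the snippet below is the textbook Viterbi algorithm on a chain-structured (HMM-like) model; BaNet3F's generalization to arbitrary Bayesian networks via self-generated Markov boundaries is not reproduced here.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a toy HMM.

    obs: list of observation indices; start_p: (S,) initial probabilities;
    trans_p: (S, S) transition matrix; emit_p: (S, O) emission matrix.
    """
    V = [start_p * emit_p[:, obs[0]]]   # best path probability per state, step 0
    back = []                           # backpointers per step
    for o in obs[1:]:
        scores = V[-1][:, None] * trans_p    # score of each (prev, cur) pair
        back.append(scores.argmax(axis=0))   # best predecessor per state
        V.append(scores.max(axis=0) * emit_p[:, o])
    path = [int(np.argmax(V[-1]))]           # best final state
    for ptr in reversed(back):               # walk the pointers backwards
        path.append(int(ptr[path[-1]]))
    return path[::-1]
```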
APA, Harvard, Vancouver, ISO und andere Zitierweisen
34

Rocha, João Pedro Magalhães da. „The evolution of international business research : a content analysis of EIBA’s conference papers (1999-2011)“. Master's thesis, Instituto Superior de Economia e Gestão, 2020. http://hdl.handle.net/10400.5/20932.

Der volle Inhalt der Quelle
Annotation:
Master's in Economics and Management of Science, Technology and Innovation
This study seeks to analyse the evolution of the European International Business Academy's Annual Conferences between the years 1999 and 2011. A collection of the 2221 documents presented across the defined period was processed with a computer-aided tool - Latent Dirichlet Allocation, powered by artificial intelligence systems - to facilitate the content analysis. The study utilized the R software environment as the platform to apply the method, with compatible libraries supporting the pre-processing, topic modelling, and result plotting stages of this study. The method was able to identify 30 underlying research topics across all the documents, and a label was manually assigned to each topic according to its overall theme. Results show overall growth in the number of papers presented, as well as new authors throughout the years, indicating an increase in the degree of openness of the Association's Annual Conferences. Specific research topics have also proved to be more discussed across the documents than others, and research on the topics of "Dynamic capabilities, resource-based view and firm internationalization" and "Institutional approaches to International Business research and theory" proved to be trending in recent years of the Conferences. With little need for human intervention, this study was able to successfully apply an automated method to identify the Association's research themes and their evolution throughout the years, reducing researcher bias and allowing for the efficient analysis of large volumes of text content.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
35

Déhaye, Vincent. „Characterisation of a developer’s experience fields using topic modelling“. Thesis, Linköpings universitet, Institutionen för datavetenskap, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-171946.

Der volle Inhalt der Quelle
Annotation:
Finding the most relevant candidate for a position represents a ubiquitous challenge for organisations. It can also be arduous for a candidate to explain on a concise resume what they have experience with. Because candidates usually have to select which experience to expose and filter out the rest, they might not be detected by the person carrying out the search even though they do have the desired experience. In the field of software engineering, developing one's experience usually leaves traces behind: the code one produced. This project explores approaches to tackle the screening challenge with an automated way of extracting experience directly from code, by defining common lexical patterns in code for different experience fields using topic modeling. Two different techniques were compared. On one hand, Latent Dirichlet Allocation (LDA) is a generative statistical model which has proven to yield good results in topic modeling. On the other hand, Non-Negative Matrix Factorization (NMF) is a low-rank decomposition of a matrix representing the code corpus as word counts per piece of code. The code gathered consisted of 30 random repositories from all the collaborators of the open-source Ruby-on-Rails project on GitHub, to which common natural language processing transformation steps were then applied. The results of the two techniques were compared using perplexity for LDA, reconstruction error for NMF, and topic coherence for both. The first two represent how well the data could be represented by the topics produced, while the latter estimates how well the elements of a topic hang and fit together, and can reflect human understandability and interpretability. Given that we did not have any similar work to benchmark against, the performance of the values obtained is hard to assess scientifically. However, the method seems promising, as we would have been rather confident in assigning labels to 10 of the topics generated. The results imply that one could probably use natural language processing methods directly on code production in order to extend the detected fields of experience of a developer, with a finer granularity than traditional resumes and with field definitions evolving dynamically with the technology.
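A rough sketch of the comparison setup with scikit-learn, assuming `docs` is a list of preprocessed code "documents" (names and parameter values are illustrative, not taken from the thesis):

```python
# Sketch: the two factorizations compared in the thesis, fitted on the same
# corpus, each reporting its own fit criterion.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=20, random_state=0).fit(counts)
print("LDA perplexity:", lda.perplexity(counts))        # generative fit

tfidf = TfidfVectorizer().fit_transform(docs)
nmf = NMF(n_components=20, random_state=0).fit(tfidf)
print("NMF reconstruction error:", nmf.reconstruction_err_)
```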
APA, Harvard, Vancouver, ISO und andere Zitierweisen
36

Ficapal, Vila Joan. „Anemone: a Visual Semantic Graph“. Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-252810.

Der volle Inhalt der Quelle
Annotation:
Semantic graphs have been used to optimize various natural language processing tasks as well as to augment search and information retrieval tasks. In most cases these semantic graphs have been constructed through supervised machine learning methodologies that depend on manually curated ontologies such as Wikipedia or similar. In this thesis, which consists of two parts, we explore in the first part the possibility of automatically populating a semantic graph from an ad hoc data set of 50,000 newspaper articles in a completely unsupervised manner. The utility of the visual representation of the resulting graph is tested on 14 human subjects performing basic information retrieval tasks on a subset of the articles. Our study shows that, for entity finding and document similarity, our feature engineering is viable and the visual map produced by our artifact is visually useful. In the second part, we explore the possibility of identifying entity relationships in an unsupervised fashion by employing abstractive deep learning methods for sentence reformulation. The reformulated sentence structures are qualitatively assessed with respect to grammatical correctness and meaningfulness as perceived by 14 test subjects. We evaluate the outcomes of this second part negatively, as they have not been good enough to support any definitive conclusion, but they have opened new doors to explore.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
37

MAGATTI, DAVIDE. „Graphical models for text mining: knowledge extraction and performance estimation“. Doctoral thesis, Università degli Studi di Milano-Bicocca, 2011. http://hdl.handle.net/10281/19576.

Der volle Inhalt der Quelle
Annotation:
The amount of information produced every day is steadily increasing. The extraction of knowledge from such information is becoming a key aspect for many companies and institutions, which spend a great deal of effort on document management and organization with barely sufficient results. In this dissertation, we are mainly concerned with probabilistic graphical models for knowledge extraction and performance estimation. In particular, we present a set of models that could improve information mining by relieving the user of tedious duties and offering efficient ways to manage, classify, tag and retrieve documents. We adopt a Bayesian hierarchical model, Latent Dirichlet Allocation, to extract meaningful topics from a given collection, and a naïve Bayes classifier to tag new documents in a multi-label scenario. Moreover, we present a novel and sound technique for the evaluation of topic models based on a probabilistic measure derived from a state-of-the-art performance index, namely the Fowlkes-Mallows index, deriving a representation based on the hypergeometric distribution. Finally, we identify a hybrid approach for searching semantic repositories enriched by textual sources, easing the process of information gathering by exploiting state-of-the-art information retrieval techniques.
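For reference, the standard pairwise-counting form of the Fowlkes-Mallows index (the thesis derives a probabilistic variant from it, which is not shown here) is

\[ \mathrm{FM} = \sqrt{\frac{TP}{TP+FP}\cdot\frac{TP}{TP+FN}}, \]

where TP, FP, and FN count pairs of items placed in the same cluster by both clusterings, by only the first, or by only the second, respectively.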
APA, Harvard, Vancouver, ISO und andere Zitierweisen
38

Park, Kyoung Jin. „Generating Thematic Maps from Hyperspectral Imagery Using a Bag-of-Materials Model“. The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1366296426.

Der volle Inhalt der Quelle
APA, Harvard, Vancouver, ISO und andere Zitierweisen
39

Russo, Massimiliano. „Bayesian inference for tensor factorization models“. Doctoral thesis, Università degli studi di Padova, 2019. http://hdl.handle.net/11577/3426830.

Der volle Inhalt der Quelle
Annotation:
Multivariate categorical data are routinely collected in several applications, including epidemiology, biology, and sociology, among many others. Popular models dealing with these variables include log-linear and tensor factorization models, with the latter having the advantage of flexibly characterizing the dependence structure underlying the data. Within this framework, this Thesis aims to provide novel approaches to define compact representations of the dependence structures and to introduce new inference possibilities in tensor factorization approaches. We introduce a new class of GROuped Tensor (GROT) factorizations, which have superior performance in terms of data compression compared to the standard Parafac approach, using relatively few components to represent the joint probability mass function of the data. While popular Parafac factorizations rely on mixing together independent components, GROT mixes together grouped factorizations, equivalent to replacing vector arms in Parafac with low-dimensional tensor arms. We consider a Bayesian approach to inference with Dirichlet priors on the mixing weights and arm components, to obtain a combined low-rank and sparse structure, while facilitating efficient posterior computation via Markov chain Monte Carlo. Motivated by an application to malaria risk assessment, we also introduce a novel multivariate generalization of mixed membership models, which allows identification of correlated profiles related to different domains corresponding to separate groups of variables. We consider as a case study the Machadinho settlement project in Brazil, with the aim of defining survey-based environmental and behavioral risk profiles and studying their interaction and evolution. To achieve this goal, we show that the use of correlated multiple membership vectors leads to interpretable inference requiring a lower number of profiles compared to standard formulations, while inducing a more compact representation of the population-level model. We propose a novel multivariate logistic normal distribution for the membership vectors, which allows easy introduction of auxiliary information into the membership profiles by leveraging a multivariate latent logistic regression. A Bayesian approach to inference, relying on Pólya-gamma data augmentation, facilitates efficient posterior computation via Markov chain Monte Carlo. The proposed approach is shown to outperform the classical mixed membership model in simulations and in the malaria diffusion application.
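As background for the terminology above, the standard Parafac factorization models the joint probability mass function of p categorical variables as a finite mixture of independent components (my notation, not the thesis's):

\[ P(y_1 = c_1, \dots, y_p = c_p) = \sum_{h=1}^{k} \nu_h \prod_{j=1}^{p} \lambda^{(j)}_{h, c_j}, \qquad \nu_h \ge 0, \; \sum_{h=1}^{k} \nu_h = 1, \]

where each \( \lambda^{(j)}_{h,\cdot} \) is a probability vector over the categories of variable j (the "vector arms"). GROT replaces these vector arms with low-dimensional tensor arms.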
APA, Harvard, Vancouver, ISO und andere Zitierweisen
40

Atrevi, Dieudonne Fabrice. „Détection et analyse des évènements rares par vision, dans un contexte urbain ou péri-urbain“. Thesis, Orléans, 2019. http://www.theses.fr/2019ORLE2008.

Der volle Inhalt der Quelle
Annotation:
The main objective of this thesis is the development of complete methods for rare event detection. The work can be summarized in two parts. The first part is devoted to the study of state-of-the-art shape descriptors. On the one hand, the robustness of some descriptors to varying light conditions was studied. On the other hand, the ability of geometric moments to describe the human shape was also studied through a 3D human pose estimation application based on 2D images. From this study, we have shown that, through a shape retrieval application, geometric moments can be used to estimate a human pose via an exhaustive search in a pose database. This kind of application can be used in a human action recognition system, which may be the final step of an event analysis system. In the second part of this report, three main contributions to rare event detection are presented. The first contribution concerns the development of a global scene analysis method for crowd event detection. In this method, global scene modeling is based on spatiotemporal interest points filtered from the saliency map of the scene. The characteristics used are the histogram of optical flow orientations and a set of shape descriptors studied in the first part. The Latent Dirichlet Allocation algorithm is used to create event models by using a visual document representation of image sequences (video clips). The second contribution is the development of a method for salient motion detection in video. This method is totally unsupervised and relies on the properties of the discrete cosine transform to explore the optical flow information of the scene. Local modeling for event detection and localization is at the core of the last contribution of this thesis. The method is based on the saliency scores of movements and the one-class SVM algorithm to create the event model. The methods have been tested on different public databases and the results obtained are promising.
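A minimal sketch of the one-class modeling step named in the last contribution, assuming feature extraction (optical flow, saliency scores) is already done and `normal_features` holds descriptors of normal clips only (hypothetical names and parameter values):

```python
# Sketch: one-class SVM trained on normal-event features only; anything the
# model rejects is a candidate rare event.
from sklearn.svm import OneClassSVM

model = OneClassSVM(kernel="rbf", nu=0.1).fit(normal_features)

# +1 = consistent with the normal-event model, -1 = candidate rare event.
labels = model.predict(test_features)
```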
APA, Harvard, Vancouver, ISO und andere Zitierweisen
41

Rusch, Thomas, Paul Hofmarcher, Reinhold Hatzinger und Kurt Hornik. „Model trees with topic model preprocessing: an approach for data journalism illustrated with the WikiLeaks Afghanistan war logs“. Institute of Mathematical Statistics (IMS), 2013. http://dx.doi.org/10.1214/12-AOAS618.

Der volle Inhalt der Quelle
Annotation:
The WikiLeaks Afghanistan war logs contain nearly 77,000 reports of incidents in the US-led Afghanistan war, covering the period from January 2004 to December 2009. The recent growth of data on complex social systems and the potential to derive stories from them has shifted the focus of journalistic and scientific attention increasingly toward data-driven journalism and computational social science. In this paper we advocate the usage of modern statistical methods for problems of data journalism and beyond, which may help journalistic and scientific work and lead to additional insight. Using the WikiLeaks Afghanistan war logs for illustration, we present an approach that builds intelligible statistical models for interpretable segments in the data, in this case to explore the fatality rates associated with different circumstances in the Afghanistan war. Our approach combines preprocessing by Latent Dirichlet Allocation (LDA) with model trees. LDA is used to process the natural language information contained in each report summary by estimating latent topics and assigning each report to one of them. Together with other variables these topic assignments serve as splitting variables for finding segments in the data to which local statistical models for the reported number of fatalities are fitted. Segmentation and fitting is carried out with recursive partitioning of negative binomial distributions. We identify segments with different fatality rates that correspond to a small number of topics and other variables as well as their interactions. Furthermore, we carve out the similarities between segments and connect them to stories that have been covered in the media. This gives an unprecedented description of the war in Afghanistan and serves as an example of how data journalism, computational social science and other areas with interest in database data can benefit from modern statistical techniques. (authors' abstract)
APA, Harvard, Vancouver, ISO und andere Zitierweisen
42

Apelthun, Catharina. „Topic modeling on a classical Swedish text corpus of prose fiction : Hyperparameters’ effect on theme composition and identification of writing style“. Thesis, Uppsala universitet, Statistiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-441653.

Der volle Inhalt der Quelle
Annotation:
A topic modeling method, smoothed Latent Dirichlet Allocation (LDA), is applied to a text corpus of classical Swedish prose fiction. The thesis consists of two parts. In the first part, a smoothed LDA model is applied to the corpus, investigating how changes in hyperparameter values affect the topics in terms of the distribution of words within topics and topics within novels. In the second part, two smoothed LDA models are applied to a reduced corpus consisting only of adjectives. The generated topics are examined to see if they are more likely to occur in texts by a particular author and if the model could be used for identification of writing style. With this new approach, the ability of the smoothed LDA model to identify writing style is explored. While the texts analyzed in this thesis are unusually long - as they are not segmented prose fiction - the effect of the hyperparameters on model performance was found to be similar to that found in previous research. For the adjectives corpus, the models did succeed in generating topics with a higher probability of occurring in novels by the same author. Smoothed LDA was shown to be a good model for identification of writing style. Keywords: Topic modeling, Smoothed Latent Dirichlet Allocation, Gibbs sampling, MCMC, Bayesian statistics, Swedish prose fiction.
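In gensim's smoothed LDA, the two Dirichlet hyperparameters studied here correspond to `alpha` (document-topic smoothing) and `eta` (topic-word smoothing). A minimal sketch with illustrative values, assuming `corpus` and `dictionary` already exist:

```python
# Sketch: two smoothed LDA fits differing only in the Dirichlet priors.
from gensim.models import LdaModel

lda_sparse = LdaModel(corpus=corpus, id2word=dictionary, num_topics=30,
                      alpha=0.1, eta=0.01)  # few topics per novel, peaked topics
lda_smooth = LdaModel(corpus=corpus, id2word=dictionary, num_topics=30,
                      alpha=1.0, eta=0.1)   # flatter document and word mixtures
```

Lower values push toward sparser, more peaked distributions; higher values flatten them.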
APA, Harvard, Vancouver, ISO und andere Zitierweisen
43

Cedervall, Andreas, und Daniel Jansson. „Topic classification of Monetary Policy Minutes from the Swedish Central Bank“. Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-240403.

Der volle Inhalt der Quelle
Annotation:
Over the last couple of years, machine learning has seen a very large increase in usage. Many previously manual tasks are becoming automated, and it stands to reason that this development will continue at an incredible pace. This paper builds on the work in topic classification and attempts to provide a baseline for how to analyse the Swedish Central Bank minutes and gather information using both Latent Dirichlet Allocation and a simple neural network. Topic classification is done on monetary policy minutes from 2004 to 2018 to find how the distributions of topics change over time. The results are compared to empirical evidence that would confirm trends. Finally, a business perspective of the work is analysed to reveal what the benefits of implementing this type of technique could be. The results of the two methods are compared, and they differ. Specifically, the neural network shows larger changes in topic distributions than the Latent Dirichlet Allocation. The neural network also proved to yield more trends that correlated with other observations, such as the start of bond purchasing by the Swedish Central Bank. Thus, our results indicate that a neural network would perform better than Latent Dirichlet Allocation when analyzing Swedish monetary policy minutes.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
44

Schneider, Bruno. „Visualização em multirresolução do fluxo de tópicos em coleções de texto“. reponame:Repositório Institucional do FGV, 2014. http://hdl.handle.net/10438/11745.

Der volle Inhalt der Quelle
Annotation:
The combined use of algorithms for topic discovery in document collections with topic flow visualization techniques allows the exploration of thematic patterns in large corpora, where those patterns can be revealed through compact visual representations. This research investigated the requirements for viewing data on the thematic composition of documents obtained through topic modeling - data that are sparse and multi-attribute - at different levels of detail, comparing a visualization technique of our own with an open-source data visualization library. Regarding the studied problem of topic flow visualization, we observed conflicting requirements for displaying data at different resolutions, which led to a detailed investigation of ways of manipulating and displaying these data. The hypothesis put forward in this study was that the integrated use of more than one visualization technique, chosen according to the resolution of the data, expands the possibilities for exploring the object under study relative to what would be obtained using only one method. Mapping out the limits on the use of these techniques according to the resolution of data exploration is the main contribution of this work, intended to support the development of new applications.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
45

Pratt, Landon James. „Cliff Walls: Threats to Validity in Empirical Studies of Open Source Forges“. BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/3511.

Der volle Inhalt der Quelle
Annotation:
Artifact-based research provides a mechanism whereby researchers may study the creation of software yet avoid many of the difficulties of direct observation and experimentation. Open source software forges are of great value to the software researcher, because they expose many of the artifacts of software development. However, many challenges affect the quality of artifact-based studies, especially those studies examining software evolution. This thesis addresses one of these threats: the presence of very large commits, which we refer to as "Cliff Walls." Cliff walls are a threat to studies of software evolution because they do not appear to represent incremental development. In this thesis we demonstrate the existence of cliff walls in open source software projects and discuss the threats they present. We also seek to identify key causes of these monolithic commits, and begin to explore ways that researchers can mitigate the threats of cliff walls.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
46

Harrysson, Mattias. „Neural probabilistic topic modeling of short and messy text“. Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189532.

Der volle Inhalt der Quelle
Annotation:
Exploring massive amounts of user-generated data with topics opens a new way to find useful information. The topics are assumed to be "hidden" and must be "uncovered" by statistical methods such as topic modeling. However, user-generated data is typically short and messy, e.g. informal chat conversations, heavy use of slang words, and "noise" such as URLs or other forms of pseudo-text. This type of data is difficult to process for most natural language processing methods, including topic modeling. This thesis attempts to find, in a comparative study, the approach that objectively gives the better topics from short and messy text. The compared approaches are latent Dirichlet allocation (LDA), Re-organized LDA (RO-LDA), a Gaussian Mixture Model (GMM) with distributed representations of words, and a new approach based on previous work named Neural Probabilistic Topic Modeling (NPTM). It could only be concluded that NPTM has a tendency to achieve better topics on short and messy text than LDA and RO-LDA. GMM, on the other hand, could not produce any meaningful results at all. The results are less conclusive since NPTM suffers from long running times, which prevented enough samples from being obtained for a statistical test.
APA, Harvard, Vancouver, ISO, and other citation styles
47

Moon, Gordon Euhyun. „Parallel Algorithms for Machine Learning“. The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1561980674706558.

The full text of the source
APA, Harvard, Vancouver, ISO, and other citation styles
48

Webb, Jared Anthony. „A Topics Analysis Model for Health Insurance Claims“. BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/3805.

The full text of the source
Annotation:
Mathematical probability has a rich theory and powerful applications. Of particular note is the Markov chain Monte Carlo (MCMC) method for sampling from high-dimensional distributions that may not admit a naive analysis. We develop the theory of the MCMC method from first principles and prove its relevance. We also define a Bayesian hierarchical model for generating data. By understanding how data are generated, we may infer hidden structure about these models. We use a specific MCMC method called a Gibbs sampler to discover topic distributions in a hierarchical Bayesian model called Topics Over Time. We propose an innovative use of this model to discover disease and treatment topics in a corpus of health insurance claims data. By representing individuals as mixtures of topics, we are able to consider their future costs on an individual level rather than as part of a large collective.
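A minimal collapsed Gibbs sampler for plain LDA conveys the core of the sampling step; note that the thesis's Topics Over Time model additionally conditions on timestamps, which this sketch omits, and all data here are toy word-id sequences.

```python
# Minimal sketch of a collapsed Gibbs sampler for plain LDA.
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # topic counts per document
    nkw = np.zeros((K, V))           # word counts per topic
    nk = np.zeros(K)                 # total words per topic
    z = []                           # current topic assignment of every token
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]          # remove token from the counts
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional: p(z=k | rest) is proportional to
                # (ndk + alpha) * (nkw + beta) / (nk + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k          # add token back under the new topic
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

# Toy corpus: documents as lists of word ids over a vocabulary of size 6.
docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 1], [4, 5, 3]]
ndk, nkw = gibbs_lda(docs, V=6, K=2)
print(np.round(nkw / nkw.sum(axis=1, keepdims=True), 2))  # topic-word proportions
```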
APA, Harvard, Vancouver, ISO, and other citation styles
49

Victors, Mason Lemoyne. „A Classification Tool for Predictive Data Analysis in Healthcare“. BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/5639.

The full text of the source
Annotation:
Hidden Markov Models (HMMs) have seen widespread use in a variety of applications ranging from speech recognition to gene prediction. Though developed over forty years ago, they remain a standard tool for sequential data analysis. More recently, Latent Dirichlet Allocation (LDA) was developed and soon gained widespread popularity as a powerful topic analysis tool for text corpora. We thoroughly develop LDA and a generalization of HMMs and demonstrate the conjunctive use of both methods in predictive data analysis for health care problems. While these two tools (LDA and HMM) have been used in conjunction previously, we use LDA in a new way to reduce the dimensionality involved in the training of HMMs. With both LDA and our extension of HMMs, we train classifiers to predict the development of Chronic Kidney Disease (CKD) in the near future.
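A sketch of the general pipeline (using gensim and hmmlearn as stand-ins; the thesis's own HMM generalization is not reproduced) might compress each record to an LDA topic vector and fit an HMM over the resulting sequence:

```python
# Minimal sketch of the general idea: compress each claim record to a
# low-dimensional topic vector with LDA, then model a patient's sequence of
# topic vectors with an HMM. All data below are hypothetical.
import numpy as np
from gensim import corpora, models
from hmmlearn import hmm

# Hypothetical claims: each visit is a bag of medical codes.
visits = [["code_a", "code_b"], ["code_b", "code_c"], ["code_c", "code_d"],
          ["code_a", "code_d"], ["code_b", "code_d"]]
dictionary = corpora.Dictionary(visits)
bows = [dictionary.doc2bow(v) for v in visits]
lda = models.LdaModel(bows, num_topics=3, id2word=dictionary, passes=20)

# Dense topic vector per visit -> observation sequence for the HMM.
X = np.array([
    [dict(lda.get_document_topics(b, minimum_probability=0.0)).get(k, 0.0)
     for k in range(3)]
    for b in bows
])
model = hmm.GaussianHMM(n_components=2, n_iter=50).fit(X)
print(model.predict(X))  # inferred hidden-state sequence for this patient
```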
APA, Harvard, Vancouver, ISO, and other citation styles
50

Chen, Yuxin. „Apprentissage interactif de mots et d'objets pour un robot humanoïde“. Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLY003/document.

The full text of the source
Annotation:
Future applications of robotics, especially personal service robots, will require continuous adaptability to the environment, and particularly the ability to recognize new objects and learn new words through interaction with humans. Though having made tremendous progress by using machine learning, current computational models for object detection and representation still rely heavily on good training data and ideal learning supervision. In contrast, two-year-old children have an impressive ability to learn to recognize new objects and, at the same time, to learn the object names during interaction with adults and without precise supervision. Therefore, following the developmental robotics approach, we develop in this thesis learning approaches for objects, associating their names and corresponding features, inspired by infants' capabilities, in particular the ambiguous interaction with humans, modeled on the interaction that occurs between children and parents. The general idea is to use cross-situational learning (finding the common points between different presentations of an object or a feature) and to implement multi-modal concept discovery based on two latent topic discovery approaches: Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA). Based on vision descriptors and sound/voice inputs, the proposed approaches discover the underlying regularities in the raw data flow in order to produce sets of words and their associated visual meanings (e.g. the name of an object and its shape, or a color adjective and its correspondence in images). We developed a complete approach based on these algorithms and compared their behavior when facing two sources of uncertainty: referential ambiguities, in situations where multiple words are given that describe the features of multiple objects; and linguistic ambiguities, in situations where the keywords we intend to learn are embedded in complete sentences. This thesis highlights the algorithmic solutions required to perform efficient learning of these word-referent associations from data acquired in a simplified but realistic acquisition setup, which made it possible to perform extensive simulations and preliminary experiments in real human-robot interactions. We also provide solutions for the automatic estimation of the number of topics for both NMF and LDA. We finally propose two active learning strategies, Maximum Reconstruction Error Based Selection (MRES) and Confidence Based Exploration (CBE), to improve the quality and speed of incremental learning by letting the algorithms choose the next learning samples. We compared the behaviors produced by these algorithms and showed their common points and differences with those of humans in similar learning situations.
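As a loose illustration of the multi-modal factorization idea (not the thesis's system), one can stack word counts and visual-feature histograms for the same observations into a single matrix and let NMF recover components that tie words to visual features; all data below are synthetic.

```python
# Minimal sketch of multi-modal concept discovery with NMF: stack word counts
# and (hypothetical) visual-feature histograms into one matrix, factorize it,
# and read off which words and visual features load on the same component.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_obs, n_words, n_visual = 30, 8, 6

# Hypothetical paired data: word 0 co-occurs with visual feature 0 (e.g. the
# word "red" with a red-hue histogram bin), word 1 with feature 1, and so on.
X_words = np.zeros((n_obs, n_words))
X_visual = np.zeros((n_obs, n_visual))
for i in range(n_obs):
    concept = i % 4
    X_words[i, concept] = 1 + rng.random()
    X_visual[i, concept] = 1 + rng.random()

X = np.hstack([X_words, X_visual])       # joint multi-modal matrix
model = NMF(n_components=4, init="nndsvda", max_iter=500)
W = model.fit_transform(X)               # observation-to-concept weights
H = model.components_                    # concept-to-feature weights
for k in range(4):
    w = H[k, :n_words].argmax()          # strongest word for this concept
    v = H[k, n_words:].argmax()          # strongest visual feature
    print(f"concept {k}: word {w} <-> visual feature {v}")
```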
APA, Harvard, Vancouver, ISO, and other citation styles