Dissertations / Theses on the topic 'Text modeling'


Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Text modeling.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Sauper, Christina (Christina Joan). "Content modeling for social media text." Thesis, Massachusetts Institute of Technology, 2012. http://hdl.handle.net/1721.1/75648.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 129-136).
This thesis focuses on machine learning methods for extracting information from user-generated content. Instances of this data such as product and restaurant reviews have become increasingly valuable and influential in daily decision making. In this work, I consider a range of extraction tasks such as sentiment analysis and aspect-based review aggregation. These tasks have been well studied in the context of newswire documents, but the informal and colloquial nature of social media poses significant new challenges. The key idea behind our approach is to automatically induce the content structure of individual documents given a large, noisy collection of user-generated content. This structure enables us to model the connection between individual documents and effectively aggregate their content. The models I propose demonstrate that content structure can be utilized at both document and phrase level to aid in standard text analysis tasks. At the document level, I capture this idea by joining the original task features with global contextual information. The coupling of the content model and the task-specific model allows the two components to mutually influence each other during learning. At the phrase level, I utilize a generative Bayesian topic model where a set of properties and corresponding attribute tendencies are represented as hidden variables. The model explains how the observed text arises from the latent variables, thereby connecting text fragments with corresponding properties and attributes.
by Christina Sauper.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
2

Harrysson, Mattias. "Neural probabilistic topic modeling of short and messy text." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-189532.

Full text
Abstract:
Exploring massive amounts of user-generated data with topics offers a new way to find useful information. The topics are assumed to be “hidden” and must be “uncovered” by statistical methods such as topic modeling. However, user-generated data is typically short and messy, e.g. informal chat conversations, heavy use of slang words, and “noise” such as URLs or other forms of pseudo-text. This type of data is difficult to process for most natural language processing methods, including topic modeling. This thesis attempts to find, in a comparative study, the approach that objectively gives better topics from short and messy text. The compared approaches are latent Dirichlet allocation (LDA), Re-organized LDA (RO-LDA), a Gaussian Mixture Model (GMM) with distributed representations of words, and a new approach based on previous work named Neural Probabilistic Topic Modeling (NPTM). It could only be concluded that NPTM has a tendency to achieve better topics on short and messy text than LDA and RO-LDA. GMM, on the other hand, could not produce any meaningful results at all. The results are less conclusive because NPTM suffers from long running times, which prevented enough samples from being obtained for a statistical test.
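As a rough, hedged illustration of the LDA baseline this abstract compares against (not the thesis's actual pipeline), a minimal topic-modeling sketch with scikit-learn might look as follows; the documents and parameter values are invented:

```python
# Minimal LDA baseline sketch (illustrative only; not the thesis's actual pipeline).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "lol check this url http://t.co/xyz",        # short, messy chat-style text
    "new phone battery dies so fast",
    "anyone up for pizza tonight",
    "battery life on the new phone is great",
]

# Bag-of-words counts; real preprocessing would also strip URLs and slang noise.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words per topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:5]
    print(f"topic {k}:", ", ".join(terms[i] for i in top))
```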
APA, Harvard, Vancouver, ISO, and other styles
3

Reynolds, Douglas A. "A Gaussian mixture modeling approach to text-independent speaker identification." Diss., Georgia Institute of Technology, 1992. http://hdl.handle.net/1853/16903.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Sad, Hamed. "Text entry interfaces on mobile devices : modeling, design and evaluation." Lorient, 2009. http://www.theses.fr/2009LORIS153.

Full text
Abstract:
This thesis focuses on one of the currently active areas of Human-Computer Interaction (HCI): text entry on mobile devices. Specifically, it deals with the evaluation of text entry method interfaces on mobile devices. We address the two main approaches to evaluation: empirical user testing and model-based evaluation. A platform is presented for empirical user tests. The platform aims to shorten the evaluation time and make the evaluation more reproducible and generalizable. It is implemented on a mobile device and equipped with many existing entry methods. It also facilitates the implementation of new ideas for text entry: a new idea can be tested by comparison with the already implemented methods. The platform also includes valuable tools for the evaluation. One tool helps create a set of test phrases representative of a given language; another automates the most common analyses in text entry tests. We present a framework to identify, classify and model the different operations in the text entry process on mobile devices. It divides the process into two stages: an action planning stage and an execution stage. The action planning stage studies how the word or text to be typed is mapped to the physical actions allowed by the entry system used. The execution stage concerns the primitive tasks the user must perform to produce text. Some measures are proposed for the evaluation of the two stages. The framework also identifies the different entities that should be taken into account during the evaluation. Model-based evaluation requires human performance models for the primitive tasks used in the entry process. We studied three tasks frequently used in the execution stage of the entry process: word disambiguation, word selection from a list, and tilt-based interaction. We present an algorithm and design guidelines for the design of efficient ambiguous keyboard layouts. A model for the selection of a word from a word list is constructed from an experimental study. Another model for pointing and scrolling using a tilt sensor on a mobile device is also presented. Finally, we present new design directions for the action planning stage that we consider more interesting than current designs. They are based on knowledge we already have about our lives and about the text we write, and they use this knowledge to allow typing a word in a more direct way than typing it letter by letter, as in current designs. We provide two implementations to help concretize the concepts presented in that part of the thesis. The first shows how our knowledge can be used to type frequent words using pictographs. The second exploits language syntax to enable the user to type quickly and to protect her/him from committing typing errors. The two implementations employ a specially constructed text prediction engine. It implements language redundancy in storage-efficient data structures and introduces the concept of meaning (concept, object, action, relation, etc.), separating it from its lexical representation in a given language.
APA, Harvard, Vancouver, ISO, and other styles
5

Cheng, Chi Wa. "Probabilistic topic modeling and classification probabilistic PCA for text corpora." HKBU Institutional Repository, 2011. http://repository.hkbu.edu.hk/etd_ra/1263.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Ren, Zhaowei. "Analysis and Modeling of the Structure of Semantic Dynamics in Texts." University of Cincinnati / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1512045439740177.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Preece, Daniel Joseph. "Text Identification by Example." Diss., CLICK HERE for online access, 2007. http://contentdm.lib.byu.edu/ETD/image/etd2060.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Bischof, Jonathan Michael. "Interpretable and Scalable Bayesian Models for Advertising and Text." Thesis, Harvard University, 2014. http://dissertations.umi.com/gsas.harvard:11400.

Full text
Abstract:
In the era of "big data", scalable statistical inference is necessary to learn from new and growing sources of quantitative information. However, many commercial and scientific applications also require models to be interpretable to end users in order to generate actionable insights about quantities of interest. We present three case studies of Bayesian hierarchical models that improve the interpretability of existing models while also maintaining or improving the efficiency of inference. The first paper is an application to online advertising that presents an augmented regression model interpretable in terms of the amount of revenue a customer is expected to generate over his or her entire relationship with the company---even if complete histories are never observed. The resulting Poisson Process Regression employs a marginal inference strategy that avoids specifying customer-level latent variables used in previous work that complicate inference and interpretability. The second and third papers are applications to the analysis of text data that propose improved summaries of topic components discovered by these mixture models. While the current practice is to summarize topics in terms of their most frequent words, we show significantly greater interpretability in online experiments with human evaluators by using words that are also relatively exclusive to the topic of interest. In the process we develop a new class of topic models that directly regularize the differential usage of words across topics in order to produce stable estimates of the combined frequency-exclusivity metric, and we propose efficient and parallelizable MCMC inference strategies.
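As a loose, illustrative sketch of the frequency-exclusivity idea described above (a simplified stand-in for the thesis's regularized metric, with an invented topic-word matrix):

```python
# Toy frequency-exclusivity ranking for topic summaries (simplified stand-in for
# the thesis's regularized metric; the topic-word matrix below is made up).
import numpy as np

vocab = np.array(["game", "team", "model", "data", "score", "bayesian"])
# Rows = topics, columns = P(word | topic); each row sums to 1.
phi = np.array([
    [0.30, 0.25, 0.05, 0.05, 0.30, 0.05],   # a "sports" topic
    [0.05, 0.05, 0.35, 0.30, 0.05, 0.20],   # a "statistics" topic
])

def frex(phi, weight=0.5):
    """Harmonic mean of within-topic frequency rank and exclusivity rank."""
    # Exclusivity: share of a word's probability mass owned by each topic.
    excl = phi / phi.sum(axis=0, keepdims=True)
    # Convert to within-topic ranks scaled to [0, 1].
    freq_rank = phi.argsort(axis=1).argsort(axis=1) / (phi.shape[1] - 1)
    excl_rank = excl.argsort(axis=1).argsort(axis=1) / (phi.shape[1] - 1)
    return 1.0 / (weight / (excl_rank + 1e-9) + (1 - weight) / (freq_rank + 1e-9))

scores = frex(phi)
for k in range(phi.shape[0]):
    top = scores[k].argsort()[::-1][:3]
    print(f"topic {k}:", ", ".join(vocab[top]))
```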
Statistics
APA, Harvard, Vancouver, ISO, and other styles
9

Foulds, James Richard. "Latent Variable Modeling for Networks and Text| Algorithms, Models and Evaluation Techniques." Thesis, University of California, Irvine, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3631094.

Full text
Abstract:

In the era of the internet, we are connected to an overwhelming abundance of information. As more facets of our lives become digitized, there is a growing need for automatic tools to help us find the content we care about. To tackle the problem of information overload, a standard machine learning approach is to perform dimensionality reduction, transforming complicated high-dimensional data into a manageable, low-dimensional form. Probabilistic latent variable models provide a powerful and elegant framework for performing this transformation in a principled way. This thesis makes several advances for modeling two of the most ubiquitous types of online information: networks and text data.

Our first contribution is to develop a model for social networks as they vary over time. The model recovers latent feature representations of each individual, and tracks these representations as they change dynamically. We also show how to use text information to interpret these latent features.

Continuing the theme of modeling networks and text data, we next build a model of citation networks. The model finds influential scientific articles and the influence relationships between the articles, potentially opening the door for automated exploratory tools for scientists. The increasing prevalence of web-scale data sets provides both an opportunity and a challenge. With more data we can fit more accurate models, as long as our learning algorithms are up to the task. To meet this challenge, we present an algorithm for learning latent Dirichlet allocation topic models quickly, accurately and at scale. The algorithm leverages stochastic techniques, as well as the collapsed representation of the model. We use it to build a topic model on 4.6 million articles from the open encyclopedia Wikipedia in a matter of hours, and on a corpus of 1740 machine learning articles from the NIPS conference in seconds.

Finally, evaluating the predictive performance of topic models is an important yet computationally difficult task. We develop one algorithm for comparing topic models, and another for measuring the progress of learning algorithms for these models. The latter method achieves better estimates than previous algorithms, in many cases with an order of magnitude less computational effort.

APA, Harvard, Vancouver, ISO, and other styles
10

Alsadhan, Majed. "An application of topic modeling algorithms to text analytics in business intelligence." Thesis, Kansas State University, 2014. http://hdl.handle.net/2097/17580.

Full text
Abstract:
Master of Science
Department of Computing and Information Sciences
Doina Caragea
William H. Hsu
In this work, we focus on the task of clustering businesses in the state of Kansas based on the content of their websites and their business listing information. Our goal is to cluster the businesses and overcome the challenges facing current approaches, such as data noise, the low number of clustered businesses, and the lack of an evaluation approach. We propose an LSA-based approach to analyze the businesses' data and cluster those businesses using the Bisecting K-Means algorithm. In this approach, we analyze the businesses' data using LSA and produce representations of the businesses in a reduced space. We then use these representations to cluster the businesses by applying the Bisecting K-Means algorithm. We also apply an existing LDA-based approach to cluster the businesses and compare the results with our proposed LSA-based approach at the end. In this work, we evaluate the results using a human-expert-based evaluation procedure. Finally, we visualize the clusters produced in this work using Google Earth and Tableau. According to our evaluation procedure, the LDA-based approach performed slightly better than the LSA-based approach. However, the LDA-based approach had some limitations: a low number of clustered businesses and the inability to produce a hierarchical tree for the clusters. With the LSA-based approach, we were able to cluster all the businesses and produce a hierarchical tree for the clusters.
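As a hedged sketch of the general LSA-plus-Bisecting-K-Means pipeline described above (placeholder documents and parameters; scikit-learn's BisectingKMeans, available from version 1.1, stands in for the thesis's implementation):

```python
# Sketch of an LSA + Bisecting K-Means pipeline of the kind described above
# (placeholder documents; BisectingKMeans needs scikit-learn >= 1.1).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import BisectingKMeans

sites = [
    "family restaurant serving kansas city style barbecue",
    "auto repair shop brakes oil change tires",
    "barbecue catering for weddings and events",
    "tire rotation alignment and engine diagnostics",
]

# LSA: TF-IDF followed by truncated SVD gives low-dimensional representations.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(sites)
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)

# Bisecting K-Means repeatedly splits the largest cluster, yielding a hierarchy.
clusters = BisectingKMeans(n_clusters=2, random_state=0).fit_predict(Z)
for site, label in zip(sites, clusters):
    print(label, site[:50])
```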
APA, Harvard, Vancouver, ISO, and other styles
11

Sun, Yingcheng. "Topic Modeling and Spam Detection for Short Text Segments in Web Forums." Case Western Reserve University School of Graduate Studies / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=case1575281495398615.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Fries, Jason Alan. "Modeling words for online sexual behavior surveillance and clinical text information extraction." Diss., University of Iowa, 2015. https://ir.uiowa.edu/etd/2076.

Full text
Abstract:
How do we model the meaning of words? In domains like information retrieval, words have classically been modeled as discrete entities using 1-of-n encoding, a representation that elides most of a word's syntactic and semantic structure. Recent research, however, has begun exploring more robust representations called word embeddings. Embeddings model words as a parameterized function mapping into an n-dimensional continuous space and implicitly encode a number of interesting semantic and syntactic properties. This dissertation examines two application areas where existing, state-of-the-art terminology modeling improves the task of information extraction (IE) -- the process of transforming unstructured data into structured form. We show that a large amount of word meaning can be learned directly from very large document collections. First, we explore the feasibility of mining sexual health behavior data directly from the unstructured text of online “hookup" requests. The Internet has fundamentally changed how individuals locate sexual partners. The rise of dating websites, location-aware smartphone apps like Grindr and Tinder that facilitate casual sexual encounters (“hookups"), as well as changing trends in sexual health practices all speak to the shifting cultural dynamics surrounding sex in the digital age. These shifts also coincide with an increase in the incidence rate of sexually transmitted infections (STIs) in subpopulations such as young adults, racial and ethnic minorities, and men who have sex with men (MSM). The reasons for these increases and their possible connections to Internet cultural dynamics are not completely understood. What is apparent, however, is that sexual encounters negotiated online complicate many traditional public health intervention strategies such as contact tracing and partner notification. These circumstances underline the need to examine online sexual communities using computational tools and techniques -- as is done with other social networks -- to provide new insight and direction for public health surveillance and intervention programs. One of the central challenges in this task is constructing lexical resources that reflect how people actually discuss and negotiate sex online. Using a 2.5-year collection of over 130 million Craigslist ads (a large venue for MSM casual sexual encounters), we discuss computational methods for automatically learning terminology characterizing risk behaviors in the MSM community. These approaches range from keyword-based dictionaries and topic modeling to semi-supervised methods using word embeddings for query expansion and sequence labeling. These methods allow us to gather information similar (in part) to the types of questions asked in public health risk assessment surveys, but automatically aggregated directly from communities of interest, in near real-time, and at high geographic resolution. We then address the methodological limitations of this work, as well as the fundamental validation challenges posed by the lack of large-scale sexual behavior survey data and the limited availability of STI surveillance data. Finally, leveraging work on terminology modeling in Craigslist, we present new research exploring representation learning using 7 years of University of Iowa Hospitals and Clinics (UIHC) clinical notes.
Using medication names as an example, we show that modeling a low-dimensional representation of a medication's neighboring words, i.e., a word embedding, encodes a large amount of non-obvious semantic information. Embeddings, for example, implicitly capture a large degree of the hierarchical structure of drug families as well as encode relational attributes of words, such as generic and brand names of medications. These representations -- learned in a completely unsupervised fashion -- can then be used as features in other machine learning tasks. We show that incorporating clinical word embeddings in a benchmark classification task of medication labeling leads to a 5.4% increase in F1-score over a baseline of random initialization and a 1.9% increase over just using non-UIHC training data. This research suggests clinical word embeddings could be shared for use in other institutions and other IE tasks.
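As a toy, hedged sketch of the embedding idea described above (invented sentences rather than Craigslist or UIHC data; gensim 4.x parameter names assumed):

```python
# Toy sketch of learning word embeddings and inspecting nearest neighbours,
# in the spirit of the terminology modeling described above (gensim >= 4.x;
# the sentences are invented, not Craigslist or UIHC data).
from gensim.models import Word2Vec

sentences = [
    ["patient", "started", "on", "lisinopril", "for", "hypertension"],
    ["continue", "metoprolol", "for", "blood", "pressure"],
    ["lisinopril", "dose", "increased", "due", "to", "blood", "pressure"],
    ["metoprolol", "held", "for", "low", "heart", "rate"],
] * 50  # repeat the tiny corpus so the model has something to fit

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20, seed=0)

# Embeddings place words used in similar contexts near one another,
# which is what lets them serve as features for downstream labeling tasks.
print(model.wv.most_similar("lisinopril", topn=3))
```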
APA, Harvard, Vancouver, ISO, and other styles
13

Lund, Jeffrey A. "Fast Inference for Interactive Models of Text." BYU ScholarsArchive, 2015. https://scholarsarchive.byu.edu/etd/5780.

Full text
Abstract:
Probabilistic models of text are a useful tool for enabling the analysis of large collections of digital text. For example, Latent Dirichlet Allocation can quickly produce topical summaries of large collections of text documents. Many important use cases of such models include human interaction during the inference process. For example, the Interactive Topic Model extends Latent Dirichlet Allocation to incorporate human expertise during inference in order to produce topics which are better suited to individual user needs. However, interactive use cases of probabilistic models of text introduce new constraints on inference - the inference procedure must not only be accurate, but also fast enough to facilitate human interaction. If the inference is too slow, the human interaction will be harmed, and the interactive aspect of the probabilistic model will be less useful. Unfortunately, the most popular inference algorithms in use today either require strong approximations which can degrade the quality of some models, or require time-consuming sampling. We explore the use of Iterated Conditional Modes, an algorithm which is able to obtain locally optimal maximum a posteriori estimates, as an alternative to popular inference algorithms such as Gibbs sampling or mean field variational inference. The Iterated Conditional Modes algorithm is not only fast enough to facilitate human interaction, but can produce better maximum a posteriori estimates than sampling. We demonstrate the superior performance of Iterated Conditional Modes on a wide variety of models. First we use a DP Mixture of Multinomials model applied to the problem of web search result clustering, and show that not only can we outperform previous methods in clustering quality, but we can also achieve interactive runtimes when performing inference with Iterated Conditional Modes. We then apply Iterated Conditional Modes to the Interactive Topic Model. Not only is Iterated Conditional Modes much faster than the previously published Gibbs sampler, but we are better able to incorporate human feedback during inference, as measured by accuracy on a classification task using the resultant topic model. Finally, we utilize Iterated Conditional Modes with MomResp, a model used to aggregate multiple noisy crowdsourced annotations. Compared with Gibbs sampling, Iterated Conditional Modes is better able to recover ground truth labels from simulated noisy annotations, and runs orders of magnitude faster.
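As a hedged sketch of the Iterated Conditional Modes idea described above, applied to a toy finite mixture of multinomials rather than the thesis's DP mixture or Interactive Topic Model: each document's cluster assignment is repeatedly set to its conditional maximum a posteriori value while all other assignments are held fixed.

```python
# Minimal Iterated Conditional Modes sketch for a finite mixture of multinomials
# (a toy stand-in for the models discussed above, not the thesis's DP mixture).
import numpy as np

rng = np.random.default_rng(0)
# 40 toy "documents" over a 3-word vocabulary, drawn from two different multinomials.
X = np.vstack([rng.multinomial(30, [0.6, 0.3, 0.1], size=20),
               rng.multinomial(30, [0.1, 0.3, 0.6], size=20)])
K, alpha, beta = 2, 1.0, 0.5
z = rng.integers(K, size=X.shape[0])             # initial hard cluster assignments

for _ in range(20):
    changed = False
    for d in range(X.shape[0]):
        old = z[d]
        z[d] = -1                                # leave doc d out of the counts
        log_post = np.empty(K)
        for k in range(K):
            members = X[z == k]
            word_counts = members.sum(axis=0) + beta
            theta_k = word_counts / word_counts.sum()
            # log p(z_d = k | rest) is proportional to a size-based prior term
            # plus the multinomial log-likelihood of document d under cluster k.
            log_post[k] = np.log(len(members) + alpha) + (X[d] * np.log(theta_k)).sum()
        z[d] = int(log_post.argmax())            # set z_d to its conditional mode
        changed |= (z[d] != old)
    if not changed:                              # stop once no assignment changes
        break

print("cluster sizes:", np.bincount(z, minlength=K))
```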
APA, Harvard, Vancouver, ISO, and other styles
14

Wang, Xuerui. "Structured Topic Models: Jointly Modeling Words and Their Accompanying Modalities." Amherst, Mass. : University of Massachusetts Amherst, 2009. http://scholarworks.umass.edu/open_access_dissertations/58/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Jernite, Yacine. "Learning Representations of Text through Language and Discourse Modeling| From Characters to Sentences." Thesis, New York University, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10680744.

Full text
Abstract:

In this thesis, we consider the problem of obtaining a representation of the meaning expressed in a text. How to do so correctly remains a largely open problem, combining a number of inter-related questions (e.g. what is the role of context in interpreting text? how should language understanding models handle compositionality? etc.). In this work, after reflecting on the notion of meaning and describing the most common sequence modeling paradigms in use in recent work, we focus on two of these questions: at what level of granularity text should be read, and what training objectives can lead models to learn useful representations of a text's meaning.

In the first part, we argue for the use of sub-word information for that purpose, and present new neural network architectures which can either process words in a way that takes advantage of morphological information, or do away with word separations altogether while still being able to identify relevant units of meaning.

The second part starts by arguing for the use of language modeling as a learning objective, and provides algorithms which can help with its scalability issues and proposes a globally rather than locally normalized probability distribution. It then explores the question of what makes a good language learning objective, and introduces discriminative objectives inspired by the notion of discourse coherence which help learn a representation of the meaning of sentences.

APA, Harvard, Vancouver, ISO, and other styles
16

Duong-Trung, Nghia [Verfasser]. "Social Media Learning : Novel Text Analytics for Geolocation and Topic Modeling / Nghia Duong-Trung." Göttingen : Cuvillier Verlag, 2017. http://d-nb.info/1136676988/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Lipka, Nedim [Verfasser], Benno [Akademischer Betreuer] Stein, and James [Gutachter] Shanahan. "Modeling Non-Standard Text Classification Tasks / Nedim Lipka ; Gutachter: James Shanahan ; Betreuer: Benno Stein." Weimar : Professur Content Management / Web-Technologien, 2013. http://d-nb.info/1116094495/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Redyuk, Sergey. "Finding early signals of emerging trends in text through topic modeling and anomaly detection." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-15507.

Full text
Abstract:
Trend prediction has become an extremely popular practice in many industrial sectors and academia. It is beneficial for strategic planning and decision making, and facilitates exploring new research directions that are not yet matured. To anticipate future trends in an academic environment, a researcher needs to analyze an extensive amount of literature and scientific publications, and gain expertise in the particular research domain. This approach is time-consuming and extremely complicated due to the abundance of data and its diversity. Modern machine learning tools, on the other hand, are capable of processing tremendous volumes of data, reaching real-time human-level performance for various applications. Achieving high performance in unsupervised prediction of emerging trends in text can indicate promising directions for future research and potentially lead to breakthrough discoveries in any field of science. This thesis addresses the problem of emerging trend prediction in text in two main steps: it utilizes an HDP topic model to represent the latent topic space of a given temporal collection of documents, the DBSCAN clustering algorithm to detect groups with high-density regions in the document space potentially leading to emerging trends, and applies KL divergence in order to capture deviating text which might indicate the birth of a new, not-yet-seen phenomenon. In order to empirically evaluate the effectiveness of the proposed framework and estimate its predictive capability, both synthetically generated corpora and real-world text collections from arXiv.org, an open-access electronic archive of scientific publications (category: Computer Science), and NIPS publications are used. For synthetic data, a text generator is designed which provides ground truth to evaluate the performance of anomaly detection algorithms. This work contributes to the body of knowledge in the area of emerging trend prediction in several ways. First of all, the method of incorporating topic modeling and anomaly detection algorithms for emerging trend prediction is a novel approach and highlights new perspectives in the subject area. Secondly, the three-level word-document-topic topology of anomalies is formalized in order to detect anomalies in temporal text collections which might lead to emerging trends. Finally, a framework for unsupervised detection of early signals of emerging trends in text is designed. The framework captures new vocabulary, documents with deviating word/topic distributions, and drifts in latent topic space as the three main indicators that a novel phenomenon may occur, in accordance with the three-level topology of anomalies. The framework is not limited to particular sources of data and can be applied to any temporal text collection in combination with any online method for soft clustering.
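As a hedged sketch of two components of the framework described above, density-based clustering of document-topic vectors and KL-divergence deviation scoring (the topic vectors below are invented; the thesis derives them with HDP):

```python
# Sketch of density-based clustering of document-topic vectors plus
# KL-divergence deviation scoring (topic vectors invented, not HDP output).
import numpy as np
from scipy.stats import entropy
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# 30 "ordinary" documents near one topic mix, plus 2 deviating documents.
theta = np.vstack([rng.dirichlet([20, 5, 5], size=30),
                   rng.dirichlet([1, 1, 20], size=2)])

# Dense regions in topic space are candidate (emerging) themes; label -1 = noise.
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(theta)

# Documents far (in KL divergence) from the corpus-average topic distribution
# are flagged as potential early signals of a new phenomenon.
corpus_mean = theta.mean(axis=0)
kl = np.array([entropy(t, corpus_mean) for t in theta])
print("cluster labels:", labels)
print("most deviating documents:", kl.argsort()[::-1][:3])
```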
APA, Harvard, Vancouver, ISO, and other styles
19

Nguyen, Thi Thu Trang. "HMM-based Vietnamese Text-To-Speech : Prosodic Phrasing Modeling, Corpus Design System Design, and Evaluation." Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112201/document.

Full text
Abstract:
The thesis objective is to design and build a high-quality Hidden Markov Model (HMM-)based Text-To-Speech (TTS) system for Vietnamese – a tonal language. The system is called VTED (Vietnamese TExt-to-speech Development system). In view of the great importance of lexical tones, a “tonophone” – an allophone in tonal context – was proposed as a new speech unit in our TTS system. A new training corpus, VDTS (Vietnamese Di-Tonophone Speech corpus), was designed for 100% coverage of di-phones in tonal contexts (i.e. di-tonophones) using the greedy algorithm on a huge raw text. A total of about 4,000 sentences of VDTS were recorded and pre-processed as a training corpus of VTED. In HMM-based speech synthesis, although pause duration can be modeled as a phoneme, the appearance of pauses cannot be predicted by HMMs. Lower phrasing levels above words may not be completely modeled with basic features. This research aimed at automatic prosodic phrasing for Vietnamese TTS using durational clues alone, as it appeared too difficult to disentangle intonation from lexical tones. Syntactic blocks, i.e. syntactic phrases with a bounded number of syllables (n), were proposed for predicting final lengthening (n = 6) and pause appearance (n = 10). Improvements for final lengthening were achieved by strategies of grouping single syntactic blocks. The quality of the predictive J48-decision-tree model for pause appearance using syntactic blocks combined with syntactic link and POS (Part-Of-Speech) features reached an F-score of 81.4% (Precision = 87.6%, Recall = 75.9%), much better than that of the model with only POS (F-score = 43.6%) or syntactic link (F-score = 52.6%) alone. The architecture of the system was proposed on the basis of the core architecture of HTS with an extension of a Natural Language Processing part for Vietnamese. Pause appearance was predicted by the proposed model. The contextual feature set included phone identity features, locational features, tone-related features, and prosodic features (i.e. POS, final lengthening, break levels). Mary TTS was chosen as a platform for implementing VTED. In the MOS (Mean Opinion Score) test, the first VTED, trained with the old corpus and basic features, was rather good, 0.81 (on a 5-point MOS scale) higher than the previous system – HoaSung (using non-uniform unit selection with the same training corpus) – but still 1.2-1.5 points lower than the natural speech. The quality of the final VTED, trained with the new corpus and prosodic phrasing model, progressed by about 1.04 compared to the first VTED, and its gap with the natural speech was much lessened. In the tone intelligibility test, the final VTED received a high correct rate of 95.4%, only 2.6% lower than the natural speech, and 18% higher than the initial one. The error rate of the first VTED in the intelligibility test with the Latin square design was about 6-12% higher than the natural speech depending on syllable, tone or phone levels. The final one diverged from the natural speech by only about 0.4-1.4%.
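As a hedged sketch of the kind of decision-tree pause-appearance classifier evaluated above (scikit-learn's DecisionTreeClassifier stands in for Weka's J48, and the syntactic-block features are invented):

```python
# Sketch of a decision-tree pause-appearance classifier in the spirit of the
# J48 model above (scikit-learn stands in for Weka's J48; features are invented).
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Each row: [syllables in syntactic block, POS code of last word, has syntactic link]
X = [[3, 0, 1], [6, 1, 0], [10, 2, 0], [12, 2, 1], [4, 0, 1], [11, 1, 0],
     [2, 0, 0], [9, 2, 1], [10, 1, 1], [5, 0, 0]]
y = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # 1 = a pause follows the block

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
pred = clf.predict(X)
print("training F1:", f1_score(y, pred))
```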
APA, Harvard, Vancouver, ISO, and other styles
20

Akinepally, Pratima Rao. "Investigating Performance of Different Models at Short Text Topic Modelling." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-288531.

Full text
Abstract:
The key objective of this project was to quantitatively and qualitatively assess the performance of a sentence embedding model, Universal Sentence Encoder (USE), and a word embedding model, word2vec, at the task of topic modelling. The first step in the process was data collection. The data used for the project was podcast descriptions available at Spotify, and the topics associated with them. Following this, the data was used to generate description vectors and topic vectors using the embedding models, which were then used to assign topics to descriptions. The results from this study led to the conclusion that embedding models are well suited to this task, and that overall the USE outperforms the word2vec models.
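As a hedged sketch of the cosine-similarity topic assignment described above (the embed function below is only a toy placeholder for a real encoder such as the Universal Sentence Encoder or averaged word2vec vectors):

```python
# Sketch of topic assignment by cosine similarity between embedding vectors,
# as described above. `embed` is a stand-in for a real encoder.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder encoder: a toy hashed bag-of-words vector, NOT a real USE model."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

topics = {"true crime": embed("crime murder investigation detective case"),
          "sports":     embed("football match team league goal score")}

description = "Two detectives revisit a cold murder case from the nineties"
d = embed(description)

# Cosine similarity (vectors are already unit-normalised, so a dot product suffices).
best = max(topics, key=lambda t: float(d @ topics[t]))
print("assigned topic:", best)
```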
APA, Harvard, Vancouver, ISO, and other styles
21

Kim, Hyowon. "Improving Inferences about Preferences in Choice Modeling." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1587524882296023.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Barford, Paul R. "Modeling, measurement and performance of World Wide Web transactions." Thesis, Boston University, 2001. https://hdl.handle.net/2144/36753.

Full text
Abstract:
Thesis (Ph.D.)--Boston University
PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at open-help@bu.edu. Thank you.
The size, diversity and continued growth of the World Wide Web combine to make its understanding difficult even at the most basic levels. The focus of our work is in developing novel methods for measuring and analyzing the Web which lead to a deeper understanding of its performance. We describe a methodology and a distributed infrastructure for taking measurements in both the network and end-hosts. The first unique characteristic of the infrastructure is our ability to generate requests at our Web server which closely imitate actual users. This ability is based on detailed analysis of Web client behavior and the creation of the Scalable URL Request Generator (SURGE) tool. SURGE provides us with the flexibility to test different aspects of Web performance. We demonstrate this flexibility in an evaluation of the 1.0 and 1.1 versions of the Hyper Text Transfer Protocol. The second unique aspect of our approach is that we analyze the details of Web transactions by applying critical path analysis (CPA). CPA enables us to precisely decompose latency in Web transactions into propagation delay, network variation, server delay, client delay and packet loss delays. We present analysis of performance data collected in our infrastructure. Our results show that our methods can expose surprising behavior in Web servers, and can yield considerable insight into the causes of delay variability in Web transactions.
APA, Harvard, Vancouver, ISO, and other styles
23

Johnson, Barbara Denise. "Modeling Cognitive Authority Relationships." Thesis, University of North Texas, 2016. https://digital.library.unt.edu/ark:/67531/metadc955042/.

Full text
Abstract:
Information-seeking behavior is a mixture of activities and attitudes, oftentimes motivated by an individual's need to make a decision. One underlying element of this mixture is cognitive authority - which sources (e.g., individuals, institutions, texts, etc.) can be trusted to fulfil the information needs? In order to gain insight into the dynamics of cognitive authority selection behavior, which is an information-seeking behavior, this study explored primary source text data (316 text records) that reflected selection in the mundaneness of life (advice column submissions and responses). Linguistic analysis was performed on the data using the Linguistic Inquiry and Word Count (LIWC2015) software package. Pearson correlation and 1-sample T tests revealed the same 45 statistically significant relationships (SSRs) in the word usage behavior of all subgroups. As a result of the study, the gap in research formed from the lack of quantitative models of cognitive authority relationships was addressed via the development of the Wordprint Classification System, which was used to generate a cognitive authority relationship model in the form of a cognitive authority intra-segment wordprint. The findings and implications of this study may provide a contribution to the body of work in the area of information literacy and information-seeker behavior by revealing factors that information scientists can address to help meet information seekers' needs. Additionally, the Wordprint Classification System may be used in such disciplines as psychology, marketing, and forensic linguistics to create models of various relationships or individuals through the use of written or spoken word usage patterns.
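As a hedged sketch of the statistical tests named above, a Pearson correlation and a one-sample t-test, run on invented LIWC-style word-usage percentages with SciPy:

```python
# Sketch of the statistical tests named above (Pearson correlation and a
# one-sample t-test) on invented LIWC-style word-usage percentages.
import numpy as np
from scipy.stats import pearsonr, ttest_1samp

rng = np.random.default_rng(0)
# Percentage of "cognitive process" words and "certainty" words per text record.
cogproc = rng.normal(10.0, 2.0, size=100)
certainty = 0.4 * cogproc + rng.normal(0.0, 1.0, size=100)

r, p_corr = pearsonr(cogproc, certainty)
t, p_t = ttest_1samp(certainty - cogproc, popmean=0.0)

print(f"Pearson r = {r:.2f} (p = {p_corr:.3g})")
print(f"one-sample t = {t:.2f} (p = {p_t:.3g})")
```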
APA, Harvard, Vancouver, ISO, and other styles
24

Das, Manirupa. "Neural Methods Towards Concept Discovery from Text via Knowledge Transfer." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1572387318988274.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Al, Madi Naser S. "A STUDY OF LEARNING PERFORMANCE AND COGNITIVE ACTIVITY DURING MULTIMODAL COMPREHENSION USING SEGMENTATION-INTEGRATION MODEL AND EEG." Kent State University / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=kent1416868268.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Keneshloo, Yaser. "Addressing Challenges of Modern News Agencies via Predictive Modeling, Deep Learning, and Transfer Learning." Diss., Virginia Tech, 2019. http://hdl.handle.net/10919/91910.

Full text
Abstract:
Today's news agencies are moving from traditional journalism, where publishing just a few news articles per day was sufficient, to modern content generation mechanisms, which create thousands of news pieces every day. With the growth of these modern news agencies comes the arduous task of properly handling the massive amount of data that is generated for each news article. Therefore, news agencies are constantly seeking solutions to facilitate and automate some of the tasks that have previously been done by humans. In this dissertation, we focus on some of these problems and provide solutions for two broad problems which help a news agency not only to have a wider view of the behaviour of readers around an article but also to provide automated tools to ease the job of editors in summarizing news articles. These two disjoint problems aim at improving the users' reading experience by helping the content generator to monitor and focus on poorly performing content while allowing them to promote the good-performing ones. We first focus on the task of popularity prediction of news articles via a combination of regression, classification, and clustering models. We next focus on the problem of generating automated text summaries for a long news article using deep learning models. The first problem aims at helping the content developer understand how a news article is performing over the long run, while the second provides automated tools for content developers to generate summaries for each news article.
Doctor of Philosophy
Nowadays, each person is exposed to an immense amount of information from social media, blog posts, and online news portals. Among these sources, news agencies are one of the main content providers for people around the world. Contemporary news agencies are moving from traditional journalism to modern techniques from different angles. This is achieved either by building smart tools to track the behaviour of readers' reactions around a specific news article or by providing automated tools to facilitate the editor's job in providing higher-quality content to readers. These systems should not only be able to scale well with the growth of readers, but they also have to be able to process ad-hoc requests, precisely because most of the policies and decisions in these agencies are made around the results of these analytical tools. As part of this new movement towards adopting new technologies for smart journalism, we have worked on various problems with The Washington Post news agency on building tools for predicting the popularity of a news article and an automated text summarization model. We develop a model that monitors each news article after its publication and provides a prediction of the number of views that the article will receive within the next 24 hours. This model will help the content creator not only to promote potentially viral articles on the main page of the web portal or on social media, but also to gain intuition about potentially poorly performing articles so that editors can edit the content of those articles for better exposure. On the other hand, current news agencies generate more than a thousand news articles per day, and generating three to four summary sentences for each of these news pieces will not only become infeasible in the near future but is also very expensive and time-consuming. Therefore, we also develop a separate model for automated text summarization which generates summary sentences for a news article. Our model generates summaries by selecting the most salient sentences in the news article and paraphrasing them into shorter sentences that can serve as a summary of the entire document.
APA, Harvard, Vancouver, ISO, and other styles
27

Xiong, Hui. "Combining Subject Expert Experimental Data with Standard Data in Bayesian Mixture Modeling." The Ohio State University, 2011. http://rave.ohiolink.edu/etdc/view?acc_num=osu1312214048.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Chen, Le. "Identifying Job Categories and Required Competencies for Instructional Technologist: A Text Mining and Content Analysis." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/99279.

Full text
Abstract:
This study applied both human-based and computer-based techniques to conduct a job analysis in the field of instructional technology. The primary research focus of the job analysis was to examine the efficacy of text mining by comparing text mining results with content analysis results. This agenda was fulfilled by using job announcement data as an example to determine essential job categories and required competencies. In phase one, a job title analysis was conducted. Different categorizing strategies were explored, and primary job categories were reported. In phase two, the human-based content analysis was conducted, which identified 20 competencies in the knowledge domain, 22 in the ability domain, 23 in the skill domain, and 13 other competencies. In phase three, text mining (topic modeling) was applied to the entire data set, resulting in 50 themes. From these 50 themes, the researcher selected 20 themes that were most relevant to instructional technology competencies. The findings of the two research techniques differ in terms of granularity, comprehensibility, and objectivity. Based on evidence revealed in the current study, the author recommends that future studies explore ways to combine the two techniques to complement one another.
Doctor of Philosophy
According to Kimmons and Veletsianos (2018), text mining has not been widely applied in the field of instructional technology. This study provides an example of using text mining techniques to discover a set of required job competencies. It can be helpful to researchers unfamiliar with text mining methodology, allowing them to understand its potentials and limitations better. The primary research focus was to examine the efficacy of text mining by comparing text mining results with content analysis results. Both content analysis and text mining procedures were applied to the same data set to extract job competencies. Similarities and differences between the results were compared, and the pros and cons of each methodology were discussed.
APA, Harvard, Vancouver, ISO, and other styles
29

SUI, ZHENHUAN. "Hierarchical Text Topic Modeling with Applications in Social Media-Enabled Cyber Maintenance Decision Analysis and Quality Hypothesis Generation." The Ohio State University, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=osu1499446404436637.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Fonseca, Felipe Penhorate Carvalho da. "Inferência das áreas de atuação de pesquisadores." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/100/100131/tde-02032018-102111/.

Full text
Abstract:
Nowadays, there is a wide range of academic data available on the web. With this information, it is possible to carry out tasks such as the discovery of specialists in a given area, the identification of potential scholarship holders, and the suggestion of collaborators, among others. However, the success of these tasks depends on the quality of the data used, since incorrect or incomplete data tend to impair the performance of the applied algorithms. Several academic data repositories do not contain or do not require explicit information about researchers' areas of activity. In the data of the Lattes curricula, this information exists, but it is inserted manually by the researcher without any kind of validation (and is potentially outdated, missing or even incorrect). The present work applied machine learning techniques to infer researchers' areas of activity based on the data registered in the Lattes platform. The titles of the scientific production were used as the data source and were enriched with semantically related information present in other bases, in addition to adopting different representations for the text of the titles and other academic information such as student advising and research projects. The objective of this dissertation was to evaluate whether the data enrichment improves the performance of the classification algorithms tested, as well as to analyze the contribution of factors such as social network metrics, the language of the titles and the hierarchical structure of the areas of activity to the performance of the algorithms. The proposed technique can be applied to different academic data (it is not restricted to data present in the Lattes platform), but data from this platform was used for the tests and validations of the proposed solution. As a result, it was identified that the technique used to enrich the text did not improve the accuracy of the inference. However, social network metrics and numerical representations improved inference accuracy when compared to state-of-the-art techniques, as did the use of the hierarchical class structure itself, which returned the best results among those obtained.
APA, Harvard, Vancouver, ISO, and other styles
31

Ahmad, Irfan [Verfasser], Gernot A. [Akademischer Betreuer] Fink, and Laurence [Gutachter] Likforman-Sulem. "Modeling and training options for handwritten Arabic text recognition / Irfan Ahmad ; Gutachter: Laurence Likforman-Sulem ; Betreuer: Gernot A. Fink." Dortmund : Universitätsbibliothek Dortmund, 2016. http://d-nb.info/1128903393/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Wei, Wei. "Probabilistic Models of Topics and Social Events." Research Showcase @ CMU, 2016. http://repository.cmu.edu/dissertations/941.

Full text
Abstract:
Structured probabilistic inference has been shown to be useful in modeling complex latent structures of data. One successful way in which this technique has been applied is the discovery of latent topical structures of text data, which is usually referred to as topic modeling. With the recent popularity of mobile devices and social networking, we can now easily acquire text data attached to meta information, such as geo-spatial coordinates and time stamps. This metadata can provide rich and accurate information that is helpful in answering many research questions related to spatial and temporal reasoning. However, such data must be treated differently from text data. For example, spatial data is usually organized in terms of a two-dimensional region while temporal information can exhibit periodicities. While some work exists in the topic modeling community that utilizes part of this meta information, these models have largely focused on incorporating metadata into text analysis, rather than providing models that make full use of the joint distribution of meta information and text. In this thesis, I propose the event detection problem, which is a multidimensional latent clustering problem on spatial, temporal and topical data. I start with a simple parametric model to discover independent events using geo-tagged Twitter data. The model is then improved in two directions. First, I augment the model using the Recurrent Chinese Restaurant Process (RCRP) to discover events that are dynamic in nature. Second, I study a model that can detect events using data from multiple media sources, examining the characteristics of different media in terms of reported event times and linguistic patterns. The approaches studied in this thesis are largely based on Bayesian nonparametric methods, in order to deal with streaming data and an unpredictable number of clusters. The research will not only serve the event detection problem itself but also shed light on a more general structured clustering problem in spatial, temporal and textual data.
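As a toy illustration of the Bayesian nonparametric ingredient mentioned above (clustering without fixing the number of clusters in advance), the following sketch draws cluster assignments from a plain Chinese Restaurant Process prior. It is not the Recurrent CRP event model of the thesis, only the basic mechanism.

```python
# Toy sketch of a Chinese Restaurant Process prior draw: the nonparametric
# idea behind models that do not fix the number of clusters in advance.
# This is the plain CRP, not the Recurrent CRP used in the thesis.
import random

def crp_assignments(n_points, alpha=1.0, seed=0):
    rng = random.Random(seed)
    counts = []                       # counts[k] = size of cluster k
    assignments = []
    for i in range(n_points):
        # existing cluster k has probability counts[k] / (i + alpha);
        # a brand-new cluster has probability alpha / (i + alpha)
        weights = counts + [alpha]
        r = rng.uniform(0, i + alpha)
        cum, choice = 0.0, len(counts)
        for k, w in enumerate(weights):
            cum += w
            if r <= cum:
                choice = k
                break
        if choice == len(counts):
            counts.append(1)          # open a new cluster
        else:
            counts[choice] += 1
        assignments.append(choice)
    return assignments

print(crp_assignments(20, alpha=1.5))   # the cluster count grows with the data
```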
APA, Harvard, Vancouver, ISO, and other styles
33

Romero, Margaurete. "Comparing Game Simulation to Concept Models for Student-Centered Learning in Biology." Scholar Commons, 2016. http://scholarcommons.usf.edu/etd/6577.

Full text
Abstract:
Science education research continues to demonstrate improved learning with active-learning techniques compared to lectures. However, the question of which active-learning methods are most effective for learning complex scientific principles in various contexts remains open. Models are commonly used in activities that allow students to simplify complex systems and understand how components interact. I investigated the outcomes for student learning and engagement of two model-based activities: concept models and game simulations. The activities were conducted in an introductory biology course in sixteen discussion sections; eight sections were assigned to the concept model activity and eight to the simulation activity. To assess engagement, students filled out a Likert-scale questionnaire on the enjoyment and usefulness of the activity (concept model: 130 students for the food web activity and 131 for the carbon cycle activity; game simulation: 131 students for the food web activity and 126 for the carbon cycle activity). To assess student learning, 152 students completed a pre/post homework assignment based on conservation and transformation of matter. Over 80% of students enjoyed both the concept-mapping and simulation activities. Students reported that the hands-on nature of the concept activity was helpful for understanding the connections in food webs. For the homework assessment, all students significantly increased their scores from pre to post on the MC assessment (paired t-test; mean_pre = 4.86 ± 1.6; mean_post = 5.23 ± 1.6; p < .05) and the TF assessment (paired t-test; mean_pre = 2.06 ± 1.0; mean_post = 2.32 ± 1.0; p < .05). For the TF assessment, we observed a trend in which students in the simulation group improved more than students in the concept-mapping group (t-test; meanΔ_concept = 0.11 ± 1.4; meanΔ_simulation = 0.43 ± 1.0; p = .059). There was no difference between the two groups' improvement on the MC assessment (t-test; meanΔ_concept = 0.27 ± 2.1; meanΔ_simulation = 0.51 ± 1.8; p = .474). Students' responses to short-answer questions showed that their ideas about the concept of matter conservation varied from naive to scientific; for example, some students failed to conserve matter during nutrient cycling, while more scientific responses demonstrated principled reasoning such as references to conservation of matter. The two activity groups did not show large differences in their short-answer responses. Overall, students in both activity types demonstrated learning gains, though there was no significant difference between the activity types.
APA, Harvard, Vancouver, ISO, and other styles
34

Shokat, Imran. "Computational Analyses of Scientific Publications Using Raw and Manually Curated Data with Applications to Text Visualization." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-78995.

Full text
Abstract:
Text visualization is a field dedicated to the visual representation of textual data by using computer technology. A large number of visualization techniques are available, and it is becoming harder for researchers and practitioners to choose an optimal technique for a particular task among the existing techniques. To overcome this problem, the ISOVIS Group developed an interactive survey browser for text visualization techniques. ISOVIS researchers gathered papers which describe text visualization techniques or tools and categorized them according to a taxonomy; several categories were manually assigned to each visualization technique. In this thesis, we aim to analyze the dataset of this browser. We carried out several analyses to find temporal trends and correlations of the categories present in the browser dataset. In addition, these categories were compared with the results of a computational approach. Our results show that some categories have become more popular than before whereas others have declined in popularity. Cases of positive and negative correlation between various categories were found and analyzed. Comparisons between the manually labeled dataset and the results of computational text analyses were presented to the experts, with an opportunity to refine the dataset. The data analyzed in this thesis project is specific to the text visualization field; however, the methods used in the analyses can be generalized to other datasets of scientific literature surveys or, more generally, to other manually curated collections of textual documents.
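The kind of correlation and trend analysis described above can be illustrated with a small pandas computation over per-paper category indicators; the category names and the tiny indicator matrix below are invented and do not reproduce the ISOVIS browser dataset.

```python
# Hedged sketch: correlating taxonomy categories across surveyed papers and
# checking a simple temporal trend. The categories and values are invented.
import pandas as pd

# one row per paper, one binary column per category
papers = pd.DataFrame(
    {"word_cloud":   [1, 0, 1, 0, 1],
     "graph_layout": [0, 1, 0, 1, 0],
     "interaction":  [1, 1, 1, 0, 1],
     "year":         [2008, 2011, 2014, 2016, 2018]})

# pairwise Pearson correlation between category indicators
print(papers.drop(columns="year").corr())

# a simple temporal trend: share of papers per category by period
papers["period"] = pd.cut(papers["year"], bins=[2005, 2012, 2020],
                          labels=["2006-2012", "2013-2020"])
print(papers.groupby("period", observed=True)
            .mean(numeric_only=True)
            .drop(columns="year"))
```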
APA, Harvard, Vancouver, ISO, and other styles
35

Apelthun, Catharina. "Topic modeling on a classical Swedish text corpus of prose fiction : Hyperparameters’ effect on theme composition and identification of writing style." Thesis, Uppsala universitet, Statistiska institutionen, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-441653.

Full text
Abstract:
A topic modeling method, smoothed Latent Dirichlet Allocation (LDA), is applied to a text corpus of classical Swedish prose fiction. The thesis consists of two parts. In the first part, a smoothed LDA model is applied to the corpus, investigating how changes in hyperparameter values affect the topics in terms of the distribution of words within topics and of topics within novels. In the second part, two smoothed LDA models are applied to a reduced corpus consisting only of adjectives. The generated topics are examined to see if they are more likely to occur in texts by a particular author and whether the model could be used for identification of writing style. With this new approach, the ability of the smoothed LDA model as a writing-style identifier is explored. While the texts analyzed in this thesis are unusually long, as they are not segmented prose fiction, the effect of the hyperparameters on model performance was found to be similar to that found in previous research. For the adjective corpus, the models did succeed in generating topics with a higher probability of occurring in novels by the same author. The smoothed LDA was shown to be a good model for identification of writing style. Keywords: Topic modeling, Smoothed Latent Dirichlet Allocation, Gibbs sampling, MCMC, Bayesian statistics, Swedish prose fiction.
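For readers who want to see where the smoothing hyperparameters enter in practice, the following sketch uses scikit-learn's LDA implementation, where the Dirichlet priors appear as doc_topic_prior (alpha) and topic_word_prior (eta). The toy corpus is a stand-in, not the Swedish prose-fiction corpus, and this variational implementation differs from the Gibbs-sampled model used in the thesis.

```python
# Minimal sketch of how the smoothing hyperparameters of LDA are exposed in
# scikit-learn (doc_topic_prior ~ alpha, topic_word_prior ~ eta). The toy
# corpus is a stand-in for the Swedish prose-fiction corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the sea was calm and the ship sailed on",
        "she wrote a long letter about the sea",
        "the horse ran across the silent field",
        "a letter arrived describing the old field"]

bow = CountVectorizer(stop_words="english")
X = bow.fit_transform(docs)

for alpha, eta in [(0.1, 0.01), (1.0, 0.1)]:
    lda = LatentDirichletAllocation(n_components=2,
                                    doc_topic_prior=alpha,   # smaller -> sparser topic mix per doc
                                    topic_word_prior=eta,    # smaller -> sparser words per topic
                                    random_state=0)
    doc_topics = lda.fit_transform(X)
    print(f"alpha={alpha}, eta={eta}")
    print(doc_topics.round(2))       # document-topic mixtures shift with the priors
```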
APA, Harvard, Vancouver, ISO, and other styles
36

Alverio, Gustavo. "DISCUSSION ON EFFECTIVE RESTORATION OF ORAL SPEECH USING VOICE CONVERSION TECHNIQUES BASED ON GAUSSIAN MIXTURE MODELING." Master's thesis, University of Central Florida, 2007. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2909.

Full text
Abstract:
Today's world offers many ways to communicate information, and one of the most effective is speech. Unfortunately, many people lose the ability to converse, which in turn leads to a large negative psychological impact; in addition, skills such as lecturing and singing must then be restored via other methods. The use of text-to-speech synthesis has been a popular way of restoring the capability to use oral speech. Text-to-speech synthesizers convert text into speech. Although text-to-speech systems are useful, they only offer a few default voice selections that do not represent the voice of the user. In order to achieve total restoration, voice conversion must be introduced. Voice conversion is a method that adjusts a source voice to sound like a target voice. It consists of a training process and a conversion process. Training is conducted by composing a speech corpus to be spoken by both the source and the target voice; the speech corpus should encompass a variety of speech sounds. Once training is finished, the conversion function is employed to transform the source voice into the target voice. Effectively, voice conversion allows a speaker to sound like any other person, and it can therefore be applied to alter the voice output of a text-to-speech system to produce the target voice. The thesis investigates how one approach, specifically voice conversion based on Gaussian mixture modeling, can be applied to alter the voice output of a text-to-speech synthesis system. The researchers found that acceptable results can be obtained from these methods. Although voice conversion and text-to-speech synthesis are effective in restoring voice, a sample of the speaker's voice recorded before voice loss must be used during the training process. It is therefore vital that voice samples be recorded in advance to guard against voice loss.
M.S.E.E.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Electrical Engineering MSEE
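A minimal sketch of the Gaussian-mixture conversion idea described in the abstract above: a GMM is fitted on joint source-target feature vectors, and each source frame is mapped to the posterior-weighted conditional mean of the target part. The random vectors stand in for real parallel spectral features (e.g. MFCCs), and this is not the thesis's exact training or conversion pipeline.

```python
# Hedged sketch of GMM-based voice conversion: fit a GMM on joint [source;
# target] feature vectors, then map a source frame to the posterior-weighted
# conditional mean of the target part. Random vectors stand in for real
# spectral features extracted from a parallel training corpus.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
d = 4                                        # feature dimension per speaker
src = rng.normal(size=(500, d))              # stand-in source-speaker frames
tgt = 0.8 * src + 0.5 + 0.1 * rng.normal(size=(500, d))   # aligned target frames

gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(np.hstack([src, tgt]))               # joint density p(x, y)

def convert(x):
    """Map one source frame x (shape (d,)) to an estimated target frame."""
    post = np.empty(gmm.n_components)
    cond = np.empty((gmm.n_components, d))
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k, :d], gmm.means_[k, d:]
        S = gmm.covariances_[k]
        Sxx, Sxy = S[:d, :d], S[:d, d:]
        # responsibility of component k given the source frame only
        post[k] = gmm.weights_[k] * multivariate_normal(mu_x, Sxx).pdf(x)
        # conditional mean E[y | x, k] = mu_y + Syx Sxx^-1 (x - mu_x)
        cond[k] = mu_y + Sxy.T @ np.linalg.solve(Sxx, x - mu_x)
    post /= post.sum()
    return post @ cond

print(convert(src[0]), "target was", tgt[0])
```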
APA, Harvard, Vancouver, ISO, and other styles
37

Al, Madi Naser S. "Modeling Eye Movement for the Assessment of Programming Proficiency." Kent State University / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=kent1595429905152276.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Kramer, Stephanie. "Holy day effects on language: How religious geography, individual affiliation and day of the week relate to sentiment and topics on Twitter." Thesis, University of Oregon, 2018. http://hdl.handle.net/1794/23106.

Full text
Abstract:
Religious belief and attendance predict improved well-being at the individual level. Paradoxically, geographic locations with high rates of religious belief and attendance are often those with differentially high rates of societal instability and suffering. Many of the consequences of religiosity are context-based and vary across time, and holy days are naturally occurring religious cues that have been shown to influence religiously relevant attitudes and behaviors. I investigated the degree to which personal religiosity and religious geography (i.e., religious demographics together with other location variables) individually and interactively predict well-being across days of the week. In the first study, American Christians demonstrated greater well-being by expressing more positive sentiment in Twitter posts, while American Muslims displayed less well-being. Sundays were generally the most positive day, but American Muslims communicated more happiness on Fridays (the Muslim holy day). In the second study, Christianity did not predict increased well-being in the posts of college students. In the third study, global survey data with measures of religiosity and well-being indicated that the well-being consequences of religious affiliation depend on the religious group and location, and that people tend to be especially positive on their group's holy day. Study four explored the latent topical content of Twitter posts. Across studies, religious minority status appeared to have a deleterious effect on well-being.
APA, Harvard, Vancouver, ISO, and other styles
39

Barbieri, Francesco. "Machine learning methods for understanding social media communication: modeling irony and emojis." Doctoral thesis, Universitat Pompeu Fabra, 2018. http://hdl.handle.net/10803/461793.

Full text
Abstract:
In this dissertation we propose algorithms for the analysis of social media texts, focusing on two particular aspects: irony and emojis. We propose novel automatic systems, based on machine learning methods, able to recognize and interpret these two phenomena. We also explore the problem of topic bias in sentiment analysis and irony detection, showing that traditional word-based systems are not robust when they have to recognize irony in a new domain, and we argue that our proposal is better suited to topic changes. We then use our approach to recognize another phenomenon related to irony: satirical news on Twitter. By relying on distributional semantic models, we also introduce a novel method for the study of the meaning and use of emojis in social media texts. Moreover, we propose an emoji prediction task that consists of predicting the emoji present in a text message using only the text. We show that this emoji prediction task can be performed by deep-learning systems with good accuracy, and that this accuracy can be improved by using the images included in the posts.
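The emoji prediction task described above can be framed as ordinary text classification. The sketch below is only a bag-of-words baseline on invented messages, not the deep-learning (and image-based) systems used in the dissertation.

```python
# Hedged sketch of the emoji-prediction task as text classification: predict
# which emoji accompanies a message from its words alone. The example
# messages and labels are invented; the dissertation uses deep-learning models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = ["what a sunny day at the beach",
            "my flight got cancelled again",
            "pizza night with friends",
            "stuck in traffic for two hours",
            "best concert of my life",
            "so proud of my little sister"]
emojis = ["😎", "😡", "🍕", "😡", "🎉", "🎉"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(messages, emojis)

print(model.predict(["relaxing on the beach all afternoon"]))   # likely 😎
```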
APA, Harvard, Vancouver, ISO, and other styles
40

Walker, Daniel David. "Bayesian Test Analytics for Document Collections." BYU ScholarsArchive, 2012. https://scholarsarchive.byu.edu/etd/3530.

Full text
Abstract:
Modern document collections are too large to annotate and curate manually. As increasingly large amounts of data become available, historians, librarians and other scholars increasingly need to rely on automated systems to efficiently and accurately analyze the contents of their collections and to find new and interesting patterns therein. Modern techniques in Bayesian text analytics are becoming widespread and have the potential to revolutionize the way that research is conducted. Much work has been done in the document modeling community towards this end, though most of it is focused on modern, relatively clean text data. We present research for improved modeling of document collections that may contain textual noise or that may include real-valued metadata associated with the documents. This class of documents includes many historical document collections; indeed, our specific motivation for this work is to help improve the modeling of historical documents, which are often noisy and/or have historical context represented by metadata. Many historical documents are digitized by means of Optical Character Recognition (OCR) from document images of old and degraded originals. Historical documents also often include associated metadata, such as timestamps, which can be incorporated in an analysis of their topical content. Many techniques, such as topic models, have been developed to automatically discover patterns of meaning in large collections of text. While these methods are useful, they can break down in the presence of OCR errors, and we show the extent to which this performance breakdown occurs. The specific types of analyses covered in this dissertation are document clustering, feature selection, unsupervised and supervised topic modeling for documents with and without OCR errors, and a new supervised topic model that uses Bayesian nonparametrics to improve the modeling of document metadata. We present results in each of these areas, with an emphasis on studying the effects of noise on the performance of the algorithms and on modeling the metadata associated with the documents. In this research we effectively: improve the state of the art in both document clustering and topic modeling; introduce a useful synthetic dataset for historical document researchers; and present analyses that empirically show how existing algorithms break down in the presence of OCR errors.
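One simple way to study the noise sensitivity discussed above is to inject synthetic character-level errors into clean text at a controlled rate and rerun the clustering or topic-modeling pipeline on the noisier copies. The sketch below does exactly that; real OCR confusions are more structured (e.g. rn confused with m) than this uniform corruption, so it only approximates the synthetic-noise idea.

```python
# Hedged sketch: injecting synthetic character-level "OCR" noise at a
# controlled rate, so that a clustering or topic-modeling pipeline can be
# re-run on progressively noisier copies of a corpus.
import random
import string

def add_ocr_noise(text, error_rate=0.05, seed=0):
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < error_rate:
            op = rng.choice(["substitute", "delete", "insert"])
            if op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(string.ascii_lowercase))
            # "delete": append nothing
        else:
            out.append(ch)
    return "".join(out)

clean = "the quick brown fox jumps over the lazy dog"
for rate in (0.0, 0.05, 0.2):
    print(rate, add_ocr_noise(clean, error_rate=rate))
```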
APA, Harvard, Vancouver, ISO, and other styles
41

Uslu, Tolga [Verfasser], Alexander [Akademischer Betreuer] Mehler, Alexander [Gutachter] Mehler, and Visvanathan [Gutachter] Ramesh. "Multi-document analysis : semantic analysis of large text corpora beyond topic modeling / Tolga Uslu ; Gutachter: Alexander Mehler, Visvanathan Ramesh ; Betreuer: Alexander Mehler." Frankfurt am Main : Universitätsbibliothek Johann Christian Senckenberg, 2020. http://d-nb.info/1221669125/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Di, Fiore Silvia. "La dimensione discorsiva della Politica di Coesione. Confronto fra Content Analysis e Topic Modeling." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amslaurea.unibo.it/17284/.

Full text
Abstract:
This thesis studies the comparison between two types of analysis applied to a collection of articles from Il Sole 24 Ore concerning the "Politica di Coesione" (Cohesion Policy): content analysis and topic modeling. The techniques used for both methods of analysis are described and listed, and, in the case of topic modeling, the most widely used algorithms and the corresponding software are indicated. The final section presents the case study, reporting the results obtained from the qualitative analysis of printed articles provided by the Regione Puglia and the results obtained from the topic modeling analysis of the same articles in digital format using the Mallet software. In both analyses a set of topics emerged, to which labels were assigned in order to identify the sub-topics present in the documents of the collection. At the end of the thesis, the topics that emerged from the two types of analysis are analyzed and compared.
APA, Harvard, Vancouver, ISO, and other styles
43

Dzhambazov, Georgi. "Knowledge-based probabilistic modeling for tracking lyrics in music audio signals." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/404681.

Full text
Abstract:
This thesis proposes specific signal processing and machine learning methodologies for automatically aligning the lyrics of a song to its corresponding audio recording. The research carried out falls in the broader field of music information retrieval (MIR) and, in this respect, we aim at improving some existing state-of-the-art methodologies by introducing domain-specific knowledge. The goal of this work is to devise models capable of tracking in the music audio signal the sequential aspect of one particular element of lyrics: the phonemes. Music can be understood as comprising different facets, one of which is lyrics. The models we build take into account the complementary context that exists around lyrics, i.e., any musical facet complementary to the lyrics. The facets used in this thesis include the structure of the music composition, the structure of a melodic phrase, and the structure of a metrical cycle. From this perspective, we analyse not only the low-level acoustic characteristics, representing the timbre of the phonemes, but also higher-level characteristics in which the complementary context manifests itself. We propose specific probabilistic models to represent how the transitions between consecutive sung phonemes are conditioned by different facets of complementary context. The complementary context which we address unfolds in time according to principles that are particular to a music tradition. To capture these, we created corpora and datasets for two music traditions which have a rich set of such principles: Ottoman Turkish makam and Beijing opera. The datasets and the corpora comprise different data types: audio recordings, music scores, and metadata. From this perspective, the proposed models can take advantage both of the data and of music-domain knowledge of particular musical styles to improve existing baseline approaches. As a baseline, we choose a phonetic recognizer based on hidden Markov models (HMMs): a widely used methodology for tracking phonemes in both singing and speech processing problems. We present refinements in the typical steps of existing phonetic recognizer approaches, tailored towards the characteristics of the studied music traditions. On top of the refined baseline, we devise probabilistic models, based on dynamic Bayesian networks (DBNs), that represent the relation of phoneme transitions to their complementary context. Two separate models are built for two granularities of complementary context: the structure of a melodic phrase (higher level) and the structure of the metrical cycle (finer level). In one model we exploit the fact that syllable durations depend on their position within a melodic phrase; information about the melodic phrases is obtained from the score, as well as from music-specific knowledge. In the other model, we analyse how vocal note onsets, estimated from audio recordings, influence the transitions between consecutive vowels and consonants. We also propose how to detect the time positions of vocal note onsets in melodic phrases by simultaneously tracking the positions in a metrical cycle (i.e. metrical accents). In order to evaluate the potential of the proposed models, we use lyrics-to-audio alignment as a concrete task. Each model improves the alignment accuracy compared to the baseline, which is based solely on the acoustics of the phonetic timbre.
This validates our hypothesis that knowledge of complementary context is an important stepping stone for computationally tracking lyrics, especially in the challenging case of singing with instrumental accompaniment. The outcomes of this study are not only theoretical methodologies and data, but also specific software tools that have been integrated into Dunya - a suite of tools built in the context of CompMusic, a project for advancing the computational analysis of the world's music. With this application, we have also shown that the developed methodologies are useful not only for tracking lyrics, but also for other use cases, such as enriched music listening and appreciation, or educational purposes.
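The dynamic-programming core that such phoneme-tracking models rely on is Viterbi decoding. The sketch below runs plain Viterbi over a toy two-phoneme HMM; the thesis models are far richer (DBNs whose transitions are conditioned on melodic-phrase and metrical-cycle context), so this only illustrates the underlying step.

```python
# Hedged sketch: plain Viterbi decoding over a toy HMM with two "phoneme"
# states and two discrete acoustic symbols. All probabilities are invented.
import numpy as np

states = ["a", "n"]                       # toy phonemes
log_pi = np.log([0.6, 0.4])               # initial state probabilities
log_A = np.log([[0.7, 0.3],               # state transition matrix
                [0.4, 0.6]])
log_B = np.log([[0.8, 0.2],               # emission prob. of the 2 symbols
                [0.3, 0.7]])
obs = [0, 0, 1, 1, 0]                     # observed acoustic symbol indices

T, N = len(obs), len(states)
delta = np.full((T, N), -np.inf)          # best log-prob ending in state j at t
psi = np.zeros((T, N), dtype=int)         # backpointers
delta[0] = log_pi + log_B[:, obs[0]]
for t in range(1, T):
    for j in range(N):
        scores = delta[t - 1] + log_A[:, j]
        psi[t, j] = int(np.argmax(scores))
        delta[t, j] = scores[psi[t, j]] + log_B[j, obs[t]]

path = [int(np.argmax(delta[-1]))]        # backtrace the best state sequence
for t in range(T - 1, 0, -1):
    path.append(psi[t, path[-1]])
path.reverse()
print([states[s] for s in path])          # most likely phoneme sequence
```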
APA, Harvard, Vancouver, ISO, and other styles
44

Efer, Thomas. "Graphdatenbanken für die textorientierten e-Humanities." Doctoral thesis, Universitätsbibliothek Leipzig, 2017. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-219122.

Full text
Abstract:
In light of the recent massive digitization efforts, most of the humanities disciplines are currently undergoing a fundamental transition towards the widespread application of digital methods. Between those traditional scholarly fields and computer science there exists a methodological and communicational gap that the so-called "e-Humanities" aim to bridge systematically via interdisciplinary project work. With text being the most common object of study in this field, many approaches from the area of Text Mining have been adapted to problems of the disciplines. While common workflows and best practices slowly emerge, it is evident that generic solutions are not an ultimate fit for many specific application scenarios. To be able to create custom-tailored digital tools, one of the central issues is to digitally represent the text, as well as its many contexts and related objects of interest, in an adequate manner. This thesis introduces a novel form of text representation that is based on Property Graph databases - an emerging technology that is used to store and query highly interconnected data sets. Based on this modeling paradigm, a new text research system called "Kadmos" is introduced. It provides user-definable asynchronous web services and is built to allow for a flexible extension of the data model and system functionality within a prototype-driven development process. With Kadmos it is possible to easily scale up to text collections containing hundreds of millions of words on a single device and even further when using a machine cluster. It is shown how various methods of Text Mining can be implemented with, and adapted for, the graph representation at a very fine granularity level, allowing the creation of fitting digital tools for different aspects of scholarly work. In extended usage scenarios it is demonstrated how the graph-based modeling of domain data can be beneficial even in research scenarios that go beyond a purely text-based study.
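The property-graph representation of text described above can be approximated in a few lines with networkx: token occurrences become nodes carrying properties, and typed edges encode reading order and containment. This is only a sketch of the modeling idea, not Kadmos itself, which targets dedicated graph databases and their query languages.

```python
# Hedged sketch of a property-graph text representation: token occurrences as
# nodes with properties, typed edges for adjacency and for links to a document
# node. The toy sentence and attributes are invented.
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("doc:1", kind="document", title="toy document", year=1854)

tokens = ["call", "me", "ishmael"]
prev = None
for i, tok in enumerate(tokens):
    node = f"doc:1/tok:{i}"
    G.add_node(node, kind="token", surface=tok, position=i)
    G.add_edge("doc:1", node, type="CONTAINS")
    if prev is not None:
        G.add_edge(prev, node, type="NEXT")      # reading order
    prev = node

# a simple "query": list the token surfaces of doc:1 in reading order
contained = [v for _, v, d in G.out_edges("doc:1", data=True) if d["type"] == "CONTAINS"]
ordered = sorted(contained, key=lambda n: G.nodes[n]["position"])
print([G.nodes[n]["surface"] for n in ordered])
```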
APA, Harvard, Vancouver, ISO, and other styles
45

Albishre, Khaled Mohammed H. "Informative feature discovery for social media mining." Thesis, Queensland University of Technology, 2020. https://eprints.qut.edu.au/199464/1/Khaled%20Mohammed%20H_Albishre_Thesis.pdf.

Full text
Abstract:
Finding relevant information in social media data to satisfy a user's need presents unique challenges due to the nature of the data (e.g. high volume, short length, sparseness). This thesis aims to discover informative feature representations that can help to capture user information needs when no annotated data is available. Using state-of-the-art techniques from text mining and information retrieval research, it proposes novel methods to boost the user information need with representative information in a social media context. The experimental results show that the proposed models outperform baseline models on the standard TREC 2011-2014 microblog datasets.
APA, Harvard, Vancouver, ISO, and other styles
46

Norkevičius, Giedrius. "Method for creating phone duration models using very large, multi-speaker, automatically annotated speech corpus." Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2011. http://vddb.laba.lt/obj/LT-eLABa-0001:E.02~2011~D_20110201_144440-12017.

Full text
Abstract:
Two heretofore unanalyzed aspects are addressed in this dissertation: 1. Building a model capable of predicting the phone durations of Lithuanian. All existing investigations of the phone durations of Lithuanian were performed by linguists; these investigations are usually of an exploratory-statistics kind and are limited to the analysis of a single factor affecting phone duration. In this work, phone duration dependencies on contextual factors were estimated by means of a machine learning method and written in explicit form (a decision tree). 2. Construction of a language-independent method for creating phone duration models using a very large, multi-speaker, automatically annotated speech corpus. Most researchers worldwide use speech corpora that are relatively small-scale, single-speaker, and manually annotated or at least validated by experts. The reasons usually given are that using multi-speaker speech corpora is inappropriate because different speakers have different pronunciation manners and speak at different speech rates, and that automatically annotated corpora lack accuracy. The method created for phone duration modeling enables the use of such corpora; its main components are the reduction of noisy data in the speech corpus and the normalization of speaker-specific phone durations by means of phone type clustering. The listening tests of synthesized speech that were performed showed that the perceived naturalness is affected by the underlying phone durations; the use of contextual... [to full text]
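The explicit decision-tree form mentioned above can be sketched with a regression tree over contextual features; the features and duration values below are invented and do not come from the Lithuanian multi-speaker corpus used in the thesis.

```python
# Hedged sketch: predicting phone duration (in ms) from contextual factors
# with a regression tree, which can then be printed in explicit rule form.
# The contexts and durations are invented for illustration only.
from sklearn.tree import DecisionTreeRegressor, export_text
from sklearn.feature_extraction import DictVectorizer

contexts = [
    {"phone": "a", "stressed": 1, "position_in_word": "final",   "next_is_voiced": 1},
    {"phone": "a", "stressed": 0, "position_in_word": "initial", "next_is_voiced": 0},
    {"phone": "s", "stressed": 0, "position_in_word": "final",   "next_is_voiced": 0},
    {"phone": "s", "stressed": 1, "position_in_word": "medial",  "next_is_voiced": 1},
    {"phone": "a", "stressed": 1, "position_in_word": "medial",  "next_is_voiced": 1},
]
durations_ms = [130, 70, 95, 110, 120]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(contexts)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, durations_ms)

# the learned tree, written out as explicit rules
print(export_text(tree, feature_names=vec.get_feature_names_out().tolist()))
```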
APA, Harvard, Vancouver, ISO, and other styles
47

Rivaldo, Ricardo de Moura. "GraphSchema : uma linguagem visual para a criação de modelos de contratos com SML." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2008. http://hdl.handle.net/10183/14908.

Full text
Abstract:
It is commonplace to talk about the widespread presence of text documents and the amount of unstructured information stored in natural-language text files. This fact is even more dramatic for law professionals, for whom text is the basic working tool: texts come from multiple sources, such as research documents and legislation, and are also the main product of legal activities, i.e., the text documents created by law professionals. Since the first text editor there have been several initiatives to use information technology to help the generation, storage and search of legal documents. Among these, the generation of legal contracts is especially important due to its ubiquity and its use by all social actors, such as individuals, companies and government agencies. The main focus of this work is legal contract model generation. The GraphSchema graphical language is introduced as a proposed solution that enables users to create contract models without help from a computer professional. It uses a visual representation to create legal contract models in which concepts, the relationships between them, and constraints can be represented in a visual paradigm that users can understand. The graphical representation is translated to SML, an XML Schema extension. By enabling end-user conceptual contract modeling without forcing a restricted vocabulary or ontology, GraphSchema and, consequently, the use of SML have several advantages in comparison with the plain use of XML Schema, RDF and OWL, and especially when compared with other approaches based on vocabulary definitions and formal ontologies. Those advantages are mainly due to its simplicity and flexibility, which enable the use of existing standards to define contract models, such as the eContracts standard defined by the LegalXML consortium. In this way, GraphSchema appears as an option for implementing and using this standard in real-world cases. The availability of a language directed towards non-technical users will enable the creation of contracts with tag markup from the beginning when used with XML-guided text editors. This opens a door to productivity growth in the creation of contracts and legal documents.
APA, Harvard, Vancouver, ISO, and other styles
48

Packer, Thomas L. "Scalable Detection and Extraction of Data in Lists in OCRed Text for Ontology Population Using Semi-Supervised and Unsupervised Active Wrapper Induction." BYU ScholarsArchive, 2014. https://scholarsarchive.byu.edu/etd/4258.

Full text
Abstract:
Lists of records in machine-printed documents contain much useful information. As one example, the thousands of family history books scanned, OCRed, and placed on-line by FamilySearch.org probably contain hundreds of millions of fact assertions about people, places, family relationships, and life events. Data like this cannot be fully utilized until a person or process locates the data in the document text, extracts it, and structures it with respect to an ontology or database schema. Yet, in the family history industry and other industries, data in lists goes largely unused because no known approach adequately addresses all of the costs, challenges, and requirements of a complete end-to-end solution to this task. The diverse information is costly to extract because many kinds of lists appear even within a single document, differing from each other in both structure and content. The lists' records and component data fields are usually not set apart explicitly from the rest of the text, especially in a corpus of OCRed historical documents. OCR errors and the lack of document structure (e.g. HTML tags) make list content hard to recognize by a software tool developed without a substantial amount of highly specialized, hand-coded knowledge or machine learning supervision. Making an approach that is not only accurate but also sufficiently scalable in terms of time and space complexity to process a large corpus efficiently is especially challenging. In this dissertation, we introduce a novel family of scalable approaches to list discovery and ontology population. Its contributions include the following. We introduce the first general-purpose methods of which we are aware for both list detection and wrapper induction for lists in OCRed or other plain text. We formally outline a mapping between in-line labeled text and populated ontologies, effectively reducing the ontology population problem to a sequence labeling problem, opening the door to applying sequence labelers and other common text tools to the goal of populating a richly structured ontology from text. We provide a novel admissible heuristic for inducing regular expression wrappers using an A* search. We introduce two ways of modeling list-structured text with a hidden Markov model. We present two query strategies for active learning in a list-wrapper induction setting. Our primary contributions are two complete and scalable wrapper-induction-based solutions to the end-to-end challenge of finding lists, extracting data, and populating an ontology. The first has linear time and space complexity and extracts highly accurate information at a low cost in terms of user involvement. The second has time and space complexity that are linear in the size of the input text and quadratic in the length of an output record and achieves higher F1-measures for extracted information as a function of supervision cost. We measure the performance of each of these approaches and show that they perform better than strong baselines, including variations of our own approaches and a conditional random field-based approach.
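At its simplest, the output of wrapper induction for lists is a regular expression whose capture groups align with ontology properties. The sketch below uses a hand-written wrapper on invented OCR-like lines to show what such a wrapper extracts; in the dissertation the wrappers are induced (e.g. via an A* search) rather than written by hand, and the record format here is not taken from any real collection.

```python
# Hedged sketch: a hand-written regular-expression "wrapper" whose named
# capture groups map list records to ontology-style fields. The OCRed lines
# and the record format are invented for illustration.
import re

ocr_lines = [
    "3. SMITH, John b. 1842 d. 1910, farmer",
    "4. SM1TH, Mary b. 1845, seamstress",          # OCR error: 1 in place of I
    "Chapter notes and other non-list text",
]

wrapper = re.compile(
    r"^\d+\.\s+(?P<surname>[A-Z0-9]+),\s+(?P<given>\w+)"
    r"\s+b\.\s+(?P<birth>\d{4})"
    r"(?:\s+d\.\s+(?P<death>\d{4}))?"
    r",\s+(?P<occupation>.+)$")

for line in ocr_lines:
    m = wrapper.match(line)
    if m:                                   # only list records match
        print(m.groupdict())
```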
APA, Harvard, Vancouver, ISO, and other styles
49

Gupta, Smita. "Modelling Deception Detection in Text." Thesis, Kingston, Ont. : [s.n.], 2007. http://hdl.handle.net/1974/922.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Svensson, Karin, and Johan Blad. "Exploring NMF and LDA Topic Models of Swedish News Articles." Thesis, Uppsala universitet, Avdelningen för systemteknik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-429250.

Full text
Abstract:
The ability to automatically analyze and segment news articles by their content is a growing research field. This thesis explores the unsupervised machine learning method of topic modeling applied to Swedish news articles, generating topics that describe and segment the articles. Specifically, the algorithms non-negative matrix factorization (NMF) and latent Dirichlet allocation (LDA) are implemented and evaluated. Their usefulness in the news media industry is assessed by their ability to serve as a uniform categorization framework for news articles. This thesis fills a research gap by studying the application of topic modeling to Swedish news articles and contributes by showing that this can yield meaningful results. It is shown that Swedish text data requires extensive data preparation for successful topic models and that nouns exclusively, and especially common nouns, are the most suitable words to use. Furthermore, the results show that both NMF and LDA are valuable as content analysis tools and categorization frameworks, but they have different characteristics and hence are optimal for different use cases. Lastly, the conclusion is that topic models have issues, since they can generate unreliable topics that could be misleading for news consumers, but that they can nonetheless be powerful methods for analyzing and segmenting articles efficiently on a grand scale by organizations internally. The thesis project is a collaboration with one of Sweden's largest media groups and its results have led to a topic modeling implementation for large-scale content analysis to gain insight into readers' interests.
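The NMF side of the comparison can be sketched with scikit-learn: TF-IDF features factorized into a small number of topic components. The mini-corpus below stands in for the Swedish news articles, and the noun-only filtering used in the thesis (which would require a Swedish part-of-speech tagger) is omitted.

```python
# Hedged sketch of an NMF topic model on TF-IDF features; the four short
# "articles" are invented stand-ins for the Swedish news corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

articles = ["regeringen presenterade en ny budget för skolan",
            "laget vann matchen efter mål i sista minuten",
            "budgeten ger mer pengar till lärare och skolor",
            "tränaren hyllade målvakten efter matchen"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(articles)

nmf = NMF(n_components=2, init="nndsvda", random_state=0)
doc_topic = nmf.fit_transform(X)             # W: article-topic weights
terms = tfidf.get_feature_names_out()

for k, comp in enumerate(nmf.components_):   # H: topic-term weights
    top = comp.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])
```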
APA, Harvard, Vancouver, ISO, and other styles