Dissertations / Theses on the topic 'Text analysis'

Consult the top 50 dissertations and theses for your research on the topic 'Text analysis.'


1

Haggren, Hugo. "Text Similarity Analysis for Test Suite Minimization." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-290239.

Abstract:
Software testing is the most expensive phase in the software development life cycle. It is thus understandable why test optimization is a crucial area in the software development domain. In software testing, the gradual increase of test cases demands large portions of testing resources (budget and time). Test Suite Minimization is considered a potential approach to deal with the test suite size problem. Several test suite minimization techniques have been proposed to efficiently address the test suite size problem. Proposing a good solution for test suite minimization is a challenging task, where several parameters such as code coverage, requirement coverage, and testing cost need to be considered before removing a test case from the testing cycle. This thesis proposes and evaluates two different NLP-based approaches for similarity analysis between manual integration test cases, which can be employed for test suite minimization. One approach is based on syntactic text similarity analysis and the other is a machine learning based semantic approach. The feasibility of the proposed solutions is studied through analysis of industrial use cases at Ericsson AB in Sweden. The results show that the semantic approach barely manages to outperform the syntactic approach. While both approaches show promise, subsequent studies will have to be done to further evaluate the semantic similarity based method.
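The contrast between the two families of approaches can be made concrete with a minimal sketch (an illustration, not the thesis's implementation): the syntactic measure compares surface tokens directly, while the semantic measure compares vector representations. The embed function mentioned in the final comment is a hypothetical stand-in for any sentence-embedding model.

    # Sketch: syntactic vs. semantic similarity between two test-case descriptions.
    import math

    def jaccard(a: str, b: str) -> float:
        """Syntactic similarity: overlap between the two token sets."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def cosine(u, v) -> float:
        """Semantic similarity: cosine between two embedding vectors."""
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    t1 = "Verify that the node reconnects after a link failure"
    t2 = "Check node reconnection when the link goes down"
    print(jaccard(t1, t2))                 # low: few shared surface tokens
    print(cosine([1.0, 0.0], [0.7, 0.7]))  # toy vectors; real use: cosine(embed(t1), embed(t2))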
2

Romsdorfer, Harald. "Polyglot text to speech synthesis text analysis & prosody control." Aachen Shaker, 2009. http://d-nb.info/993448836/04.

3

Kay, Roderick Neil. "Text analysis, summarising and retrieval." Thesis, University of Salford, 1997. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.360435.

4

Haselton, Curt B., and Gregory G. Deierlein. "Assessing seismic collapse safety of modern reinforced concrete moment-frame buildings." Berkeley, Calif.: Pacific Earthquake Engineering Research Center, 2008. http://nisee.berkeley.edu/elibrary/Text/200803261.

5

Ozsoy, Makbule Gulcin. "Text Summarization Using Latent Semantic Analysis." Master's thesis, METU, 2011. http://etd.lib.metu.edu.tr/upload/12612988/index.pdf.

Abstract:
Text summarization addresses the problem of presenting the information a user needs in a compact form. There are different approaches to creating well-formed summaries in the literature. One of the newest methods in text summarization is Latent Semantic Analysis (LSA). In this thesis, different LSA-based summarization algorithms are explained and two new LSA-based summarization algorithms are proposed. The algorithms are evaluated on Turkish and English documents, and their performances are compared using their ROUGE scores.
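As a hedged illustration of the generic LSA recipe this literature builds on (not the thesis's own algorithms): build a term-by-sentence matrix, apply SVD, and for each leading latent topic select the sentence with the largest weight. The three sentences below are invented.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [
        "Latent semantic analysis maps text into a low-rank topic space.",
        "The weather today is sunny and warm.",
        "Topic spaces obtained by SVD help score sentences for summaries.",
    ]
    A = CountVectorizer().fit_transform(sentences).T.toarray()  # terms x sentences
    U, s, Vt = np.linalg.svd(A, full_matrices=False)            # Vt: topics x sentences

    k = 2  # number of latent topics to keep
    summary_idx = {int(np.argmax(np.abs(Vt[i]))) for i in range(k)}
    for i in sorted(summary_idx):
        print(sentences[i])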
6

O'Connor, Brendan T. "Statistical Text Analysis for Social Science." Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/541.

Abstract:
What can text corpora tell us about society? How can automatic text analysis algorithms efficiently and reliably analyze the social processes revealed in language production? This work develops statistical text analyses of dynamic social and news media datasets to extract indicators of underlying social phenomena, and to reveal how social factors guide linguistic production. This is illustrated through three case studies: first, examining whether sentiment expressed in social media can track opinion polls on economic and political topics (Chapter 3); second, analyzing how novel online slang terms can be very specific to geographic and demographic communities, and how these social factors affect their transmission over time (Chapters 4 and 5); and third, automatically extracting political events from news articles, to assist analyses of the interactions of international actors over time (Chapter 6). We demonstrate a variety of computational, linguistic, and statistical tools that are employed for these analyses, and also contribute MiTextExplorer, an interactive system for exploratory analysis of text data against document covariates, whose design was informed by the experience of researching these and other similar works (Chapter 2). These case studies illustrate recurring themes toward developing text analysis as a social science methodology: computational and statistical complexity, and domain knowledge and linguistic assumptions.
7

Lin, Yuhao. "Text Analysis in Fashion : Keyphrase Extraction." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-290158.

Abstract:
The ability to extract useful information from texts and present it in the form of structured attributes is an important step towards making product-comparison algorithms in fashion smarter and better. Some previous work exploits statistical features, such as word frequency, and graph models to predict keyphrases. In recent years, deep neural networks have proved to be the state-of-the-art methods for language modelling; successful examples include Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), Bidirectional Encoder Representations from Transformers (BERT) and their variations. In addition, word-embedding techniques like word2vec [1] can also help to improve performance. Besides these techniques, a high-quality dataset is also important to the effectiveness of models. In this project, we aim to develop reliable and efficient machine learning models for keyphrase extraction. At Norna AB, we have a collection of product descriptions from different vendors without keyphrase annotations, which motivates the use of unsupervised methods; these should be capable of extracting useful keyphrases that capture the features of a product. To further explore the power of deep neural networks, we also implement several deep learning models. The dataset has two parts: the first part, the fashion dataset, contains keyphrases extracted by our unsupervised method; the second part is a public dataset in the news domain. We find that the deep learning models are also capable of extracting meaningful keyphrases and outperform the unsupervised model. Precision, recall and F1 score are used as evaluation metrics. The results show that the model combining an LSTM with a CRF achieves the best performance. We also compare the performance of the models with respect to keyphrase length and keyphrase number, and find that all models perform better at predicting short keyphrases. Finally, we show that our refined model has the advantage of predicting long keyphrases, which is challenging in this field.
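A minimal unsupervised baseline of the kind the abstract alludes to can be sketched by ranking each description's n-grams by TF-IDF against the rest of the corpus; the product snippets are invented, and the thesis's actual unsupervised method and its LSTM-CRF model are considerably more involved.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "slim fit cotton shirt with button-down collar",
        "relaxed fit linen shirt in breathable summer fabric",
        "leather ankle boots with cushioned insole",
    ]
    vec = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
    X = vec.fit_transform(docs)
    terms = vec.get_feature_names_out()

    for d in range(X.shape[0]):
        row = X[d].toarray().ravel()
        top = row.argsort()[::-1][:3]          # three highest-scoring n-grams
        print([terms[i] for i in top])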
8

Maisto, Alessandro. "A Hybrid Framework for Text Analysis." Doctoral thesis, Università degli studi di Salerno, 2017. http://hdl.handle.net/10556/2481.

Abstract:
In Computational Linguistics there is an essential dichotomy between linguists and computer scientists. The first, with a strong knowledge of language structures, lack engineering skills; the second, expert in computing and mathematics, attach little value to the basic mechanisms and structures of language. This discrepancy has grown over the last decades with the increase of computational resources and the gradual computerization of the world: the use of Machine Learning technologies in Artificial Intelligence problem solving, which allows machines to learn from manually generated examples, has been used more and more often in Computational Linguistics to overcome the obstacle represented by language structures and their formal representation. The dichotomy has resulted in the birth of two main approaches to Computational Linguistics: rule-based methods, which try to imitate the way humans use and understand language, reproducing the syntactic structures on which the understanding process is based and building lexical resources such as electronic dictionaries, taxonomies or ontologies; and statistics-based methods, which conversely treat language as a group of elements, quantifying words mathematically and trying to extract information without identifying syntactic structures or, in some algorithms, trying to confer on the machine the ability to learn these structures. One of the main problems is the lack of communication between these two approaches, due to the substantial differences that characterize them: on the one hand there is a strong focus on how language works and on language characteristics, with a tendency toward analytical, manual work; on the other, an engineering perspective that finds in language an obstacle and recognizes in algorithms the fastest way to overcome it. However, the lack of communication is not only an incompatibility: following Harris, the best way to approach natural language could result from taking the best of both. At the moment, there is a large number of open-source tools that perform text analysis and Natural Language Processing. A great part of these tools are based on statistical models and consist of separate modules which can be combined to create a pipeline for processing text. Many of these resources are code packages without a GUI (Graphical User Interface) and are impossible to use for users without programming skills. Furthermore, the vast majority of these open-source tools support only English and, when Italian is included, their performance decreases significantly; open-source tools for the Italian language are very few. In this work we want to fill this gap by presenting a new hybrid framework for the analysis of Italian texts. It is not intended as a commercial tool: the purpose for which it was built is to help linguists and other scholars perform rapid text analysis and produce linguistic data. The framework, which performs both statistical and rule-based analysis, is called LG-Starship. The idea is to build modular software that includes, from the start, the basic algorithms needed to perform different kinds of analysis. The modules perform the following tasks:

- Preprocessing Module: loads a text, normalizes it and removes stop-words; as output it presents the list of tokens and letters that compose the text, with their occurrence counts, and the processed text.
- Mr. Ling Module: performs POS tagging and lemmatization; it also returns the table of lemmas with occurrence counts and a table quantifying the grammatical tags.
- Statistic Module: calculates Term Frequency and TF-IDF of tokens or lemmas, extracts bi-gram and tri-gram units, and exports the results as tables.
- Semantic Module: uses the Hyperspace Analogue to Language algorithm to calculate semantic similarity between words, returning word-by-word similarity matrices that can be exported and analyzed.
- Syntactic Module: analyzes the syntactic structure of a selected sentence and tags the verbs and their arguments with semantic labels.

The objective of the framework is to build an all-in-one platform for NLP that allows any kind of user to perform basic and advanced text analysis. To make the framework accessible to users without specific computer-science or programming skills, the modules have been provided with an intuitive GUI. The framework can be considered hybrid in a double sense: as explained above, it uses both statistical and rule-based methods, relying on standard statistical algorithms and techniques and, at the same time, on Lexicon-Grammar syntactic theory; in addition, it is written in both the Java and Python programming languages. LG-Starship has a simple graphical user interface but will also be released as separate modules that can be included independently in any NLP pipeline. There are many resources of this kind, but the large majority work only for English; free resources for the Italian language are very few, and this work tries to meet that need by proposing a tool that can be used both by linguists and other scientists interested in language and text analysis who know nothing about programming languages, and by computer scientists, who can use the free modules in their own code or in combination with different NLP algorithms. The framework starts from a text or corpus written directly by the user or loaded from an external resource. The LG-Starship workflow is described in the flowchart shown in fig. 1. The pipeline shows that the Preprocessing Module is applied to the original imported or generated text to produce a clean, normalized preprocessed text; this module includes a text-splitting function, a stop-word list and a tokenization method. To the preprocessed text either the Statistic Module or the Mr. Ling Module can be applied. The first, which includes basic statistical algorithms such as Term Frequency, TF-IDF and n-gram extraction, produces as output databases of lexical and numerical data which can be used to produce charts or to perform further external analysis. The second is divided into two main tasks: a POS tagger, based on the Averaged Perceptron Tagger [?] and trained on the Paisà Corpus [Lyding et al., 2014], performs Part-Of-Speech tagging and produces an annotated text; a lemmatization method, which relies on a set of electronic dictionaries developed at the University of Salerno [Elia, 1995, Elia et al., 2010], takes the POS-tagged text as input and produces a new lemmatized version of the original text with information about its syntactic and semantic properties. This lemmatized text, which can also be processed with the Statistic Module, serves as input for two deeper levels of text analysis, carried out by the Syntactic Module and the Semantic Module. The first rests on Lexicon-Grammar theory [Gross, 1971, 1975] and uses a database of predicate structures under development at the Department of Political, Social and Communication Science; its objective is to produce a dependency graph of the sentences that compose the text. The Semantic Module uses the Hyperspace Analogue to Language distributional-semantics algorithm [Lund and Burgess, 1996], trained on the Paisà Corpus, to produce a semantic network of the words of the text. This workflow has been applied in two different experiments involving two user-generated corpora. The first experiment is a statistical study of the language of rap music in Italy, through the analysis of a large corpus of rap lyrics downloaded from online databases of user-generated lyrics. The second experiment is a feature-based sentiment-analysis project performed on user product reviews; for this project we integrated a large domain database of linguistic resources for sentiment analysis, developed in past years by the Department of Political, Social and Communication Science of the University of Salerno, which consists of polarized dictionaries of verbs, adjectives, adverbs and nouns. These two experiments underline how the linguistic framework can be applied to different levels of analysis and produce both qualitative and quantitative data. As for the results obtained, the framework, which is only at a beta version, achieves fair results both in terms of processing time and in terms of precision. Nevertheless, the work is far from complete: more algorithms will be added to the Statistic Module, the Syntactic Module will be completed, the GUI will be improved and made more attractive and modern and, in addition, an open-source online version of the modules will be published. [edited by author]
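To make the Statistic Module's core computations concrete, here is a from-scratch sketch of term frequency, TF-IDF and n-gram extraction over a toy two-document corpus (an illustration, not the LG-Starship code itself).

    import math
    from collections import Counter

    corpus = [
        "il gatto dorme sul divano".split(),
        "il cane dorme in giardino".split(),
    ]

    def tf(doc):
        counts = Counter(doc)
        return {w: n / len(doc) for w, n in counts.items()}

    def idf(corpus, w):
        df = sum(1 for doc in corpus if w in doc)
        return math.log(len(corpus) / df) if df else 0.0

    def tfidf(doc, corpus):
        return {w: f * idf(corpus, w) for w, f in tf(doc).items()}

    def ngrams(doc, n):
        return [tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)]

    print(tfidf(corpus[0], corpus))  # 'il' and 'dorme' score 0: they occur everywhere
    print(ngrams(corpus[0], 2))      # bi-grams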
9

Algarni, Abdulmohsen. "Relevance feature discovery for text analysis." Thesis, Queensland University of Technology, 2011. https://eprints.qut.edu.au/48230/1/Abdulmohsen_Algarni_Thesis.pdf.

Abstract:
It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing user preferences, because of the large number of terms, patterns, and noise. Most existing popular text mining and classification methods have adopted term-based approaches; however, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern-based methods should perform better than term-based ones in describing user preferences, but many experiments do not support this hypothesis. This research presents a promising method, Relevance Feature Discovery (RFD), for solving this challenging issue. It discovers both positive and negative patterns in text documents as high-level features, in order to accurately weight low-level features (terms) based on their specificity and their distributions in the high-level features. The thesis also introduces an adaptive model (called ARFD) to enhance the flexibility of using RFD in an adaptive environment. ARFD automatically updates the system's knowledge based on a sliding window over new incoming feedback documents, and can efficiently decide which incoming documents bring new knowledge into the system. Substantial experiments using the proposed models on Reuters Corpus Volume 1 and TREC topics show that they significantly outperform both the state-of-the-art term-based methods underpinned by Okapi BM25, Rocchio or Support Vector Machines, and other pattern-based methods.
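The core idea, letting low-level terms inherit weight from the high-level patterns they occur in, can be caricatured in a few lines. The normalization below is an assumption-laden simplification for illustration only; the actual RFD weighting, and its handling of negative patterns, is more elaborate.

    from collections import defaultdict

    # (pattern, support in relevant feedback documents) -- toy values
    patterns = [
        (("text", "mining"), 4),
        (("pattern", "mining"), 3),
        (("text",), 6),
    ]

    weights = defaultdict(float)
    for terms, support in patterns:
        for t in terms:
            weights[t] += support / len(terms)  # smaller patterns = more specific

    print(dict(weights))  # 'mining' accumulates weight from two patterns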
10

Romsdorfer, Harald [Verfasser]. "Polyglot Text-to-Speech Synthesis : Text Analysis & Prosody Control / Harald Romsdorfer." Aachen : Shaker, 2009. http://d-nb.info/1156517354/34.

11

Nikolaou, Angelos. "Texture analysis for Robust Reading Systems." Doctoral thesis, Universitat Autònoma de Barcelona, 2020. http://hdl.handle.net/10803/671279.

Abstract:
This thesis focuses on the use of texture analysis for Robust Reading Systems; in particular, the use of texture analysis for text images is explored. An in-depth analysis of the established Local Binary Pattern (LBP) descriptor is presented. LBP descriptors are used in word spotting and achieve top performance among learning-free methods. A custom variant called Sparse Radial Sampling LBP is developed to exploit the unique properties of text and is used to achieve state-of-the-art performance in writer identification. The same feature descriptors are used in conjunction with deep neural networks to successfully address the problem of script and language identification in multiple modalities.
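For orientation, the plain 8-neighbour LBP code that the Sparse Radial Sampling variant generalizes can be computed as follows (a minimal sketch, not the thesis code; a texture descriptor is then the histogram of such codes over an image).

    import numpy as np

    def lbp_code(patch: np.ndarray) -> int:
        """LBP code of the centre pixel of a 3x3 patch."""
        c = patch[1, 1]
        # clockwise neighbours starting at the top-left corner
        neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                      patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
        return sum(int(v >= c) << i for i, v in enumerate(neighbours))

    patch = np.array([[10, 20, 30],
                      [40, 25, 60],
                      [70, 80, 90]], dtype=np.uint8)
    print(lbp_code(patch))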
12

Oostendorp, Marcelyn Camereldia Antonette. "Investigating changing notions of "text": comparing news text in printed and electronic media." Thesis, University of the Western Cape, 2005. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_9984_1183428106.

Abstract:

This research aimed to give an account of the development of concepts of text and discourse and the various approaches to analysis of texts and discourses, as this is reflected in core linguistic literature since the late 1960s. The idea was to focus specifically on literature that notes the development stimulated by a proliferation of electronic media. Secondly, this research aimed to describe the nature of electronic news texts found on the internet in comparison to an equivalent printed version, namely texts printed in newspapers and simultaneously on the newspaper website.

13

Nyns, Roland. "Text grammar and text processing: a cognitivist approach." Doctoral thesis, Université Libre de Bruxelles, 1989. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/213285.

14

Garrad, Mark. "Computer Aided Text Analysis in Personnel Selection." Griffith University. School of Applied Psychology, 2004. http://www4.gu.edu.au:8080/adt-root/public/adt-QGU20040408.093133.

Abstract:
This program of research was aimed at investigating a novel application of computer aided text analysis (CATA). To date, CATA has been used in a wide variety of disciplines, including Psychology, but never in the area of personnel selection. Traditional personnel selection techniques have met with limited success in the prediction of costly training failures for some occupational groups such as pilot and air traffic controller. Accordingly, the overall purpose of this thesis was to assess the validity of linguistic style to select personnel. Several studies were used to examine the structure of language in a personnel selection setting; the relationship between linguistic style and the individual differences dimensions of ability, personality and vocational interests; the validity of linguistic style as a personnel selection tool and the differences in linguistic style across occupational groups. The participants for the studies contained in this thesis consisted of a group of 810 Royal Australian Air Force Pilot, Air Traffic Control and Air Defence Officer trainees. The results partially supported two of the eight hypotheses; the other six hypotheses were supported. The structure of the linguistic style measure was found to be different in this study compared with the structure found in previous research. Linguistic style was found to be unrelated to ability or vocational interests, although some overlap was found between linguistic style and the measure of personality. In terms of personnel selection validity, linguistic style was found to relate to the outcome of training for the occupations of Pilot, Air Traffic Control and Air Defence Officer. Linguistic style also demonstrated incremental validity beyond traditional ability and selection interview measures. The findings are discussed in light of the Five Factor Theory of Personality, and motivational theory and a modified spreading activation network model of semantic memory and knowledge. A general conclusion is drawn that the analysis of linguistic style is a promising new tool in the area of personnel selection.
15

Keenan, Francis Gerard. "Large vocabulary syntactic analysis for text recognition." Thesis, Nottingham Trent University, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.334311.

16

Rose, Tony Gerard. "Large vocabulary semantic analysis for text recognition." Thesis, Nottingham Trent University, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.333961.

17

Benbrahim, Mohamed. "Automatic text summarisation through lexical cohesion analysis." Thesis, University of Surrey, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.309200.

18

Ibáñez, Jiménez Jorge, Cid Daniela Jiménez, and Merino Naiomi Vera. "Error Analysis in Chilean Tourist Text Translations." Tesis, Universidad de Chile, 2014. http://www.repositorio.uchile.cl/handle/2250/129945.

19

Palerius, Viktor. "Affect analysis for text dialogue in movies." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-353232.

Abstract:
With the surge of services offering video-on-demand through streaming, and the increased competition in the field, it is important for a provider to be able to fit its content to its users. Machine learning can be used to find user or movie patterns automatically by looking at features of the data. In this project I create a model with weighting schemes to find affective content in short texts, and then explore the potential of doing the same for movies by extracting affective features from their subtitles. The affective content is determined with a dictionary of affect-labelled words: in Bag-of-Words fashion, sentences are scored along three dimensions called Valence, Arousal and Dominance (V, A, D). The project also includes a data-gathering phase in which two separate datasets with existing V, A, D labels are collected, one found online and one self-gathered. These datasets are used to validate the affective model and to find the best weighting scheme. The best weighting scheme is then used to determine affective content over the course of a movie, both to find interesting segments in a movie and to compare movies and find similarities. I find that the performance of my model is reasonable, with the best scores on the Valence and Arousal dimensions, and that the differences between the weighting schemes are small. The model shows potential in the movie domain, finding interesting segments within a movie as well as scene similarities between movies. It does, however, have its limitations: it cannot distinguish genres, and it misses affective content expressed through visual or audio cues. Finally, I argue that my model could be incorporated into a larger machine learning model to determine similarities between movies or find user patterns, although this would also require similar models that determine affective content from a movie's audio and visuals.
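The dictionary-based scoring the abstract describes reduces to a small Bag-of-Words sketch: every subtitle word found in a V, A, D lexicon contributes its three ratings, and the sentence score is their (here unweighted) mean. The lexicon entries below are invented placeholders for a real affect lexicon.

    VAD = {  # word -> (valence, arousal, dominance)
        "love": (0.9, 0.7, 0.6),
        "fight": (0.2, 0.9, 0.7),
        "quiet": (0.6, 0.1, 0.4),
    }

    def vad_score(sentence: str):
        hits = [VAD[w] for w in sentence.lower().split() if w in VAD]
        if not hits:
            return None
        n = len(hits)
        return tuple(sum(dim) / n for dim in zip(*hits))

    print(vad_score("They fight but they love each other"))  # (V, A, D) triple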
20

Wu, Yingyu. "Using Text based Visualization in Data Analysis." Kent State University / OhioLINK, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=kent1398079502.

21

Garrad, Mark. "Computer Aided Text Analysis in Personnel Selection." Thesis, Griffith University, 2004. http://hdl.handle.net/10072/367424.

Abstract:
This program of research was aimed at investigating a novel application of computer aided text analysis (CATA). To date, CATA has been used in a wide variety of disciplines, including Psychology, but never in the area of personnel selection. Traditional personnel selection techniques have met with limited success in the prediction of costly training failures for some occupational groups such as pilot and air traffic controller. Accordingly, the overall purpose of this thesis was to assess the validity of linguistic style to select personnel. Several studies were used to examine the structure of language in a personnel selection setting; the relationship between linguistic style and the individual differences dimensions of ability, personality and vocational interests; the validity of linguistic style as a personnel selection tool and the differences in linguistic style across occupational groups. The participants for the studies contained in this thesis consisted of a group of 810 Royal Australian Air Force Pilot, Air Traffic Control and Air Defence Officer trainees. The results partially supported two of the eight hypotheses; the other six hypotheses were supported. The structure of the linguistic style measure was found to be different in this study compared with the structure found in previous research. Linguistic style was found to be unrelated to ability or vocational interests, although some overlap was found between linguistic style and the measure of personality. In terms of personnel selection validity, linguistic style was found to relate to the outcome of training for the occupations of Pilot, Air Traffic Control and Air Defence Officer. Linguistic style also demonstrated incremental validity beyond traditional ability and selection interview measures. The findings are discussed in light of the Five Factor Theory of Personality, and motivational theory and a modified spreading activation network model of semantic memory and knowledge. A general conclusion is drawn that the analysis of linguistic style is a promising new tool in the area of personnel selection.
22

Coccetta, Francesca. "Multimodal Text Analysis and English Language Teaching." Doctoral thesis, Università degli studi di Padova, 2009. http://hdl.handle.net/11577/3426506.

Abstract:
Corpora of spoken texts are commonly investigated by applying approaches borrowed from the investigation of corpora of written texts, partly due to the lack of adequate concordancing software tools. This common practice has somewhat limited the potential spoken texts bring to the study of oral discourse. Based on the theoretical and technical innovations which have taken place in the field of multimodal corpus linguistics (Baldry and Thibault, 2001; 2006a; 2006b; forthcoming), especially within the MCA project (Baldry, 2007b; 2008a; Baldry and Thibault, 2008), this thesis presents an alternative method for analysing spoken corpora for language functions and notions (van Ek and Trim, 1998a; 1998b; 2001). In particular, it applies the scalar-level approach developed within multimodal corpus linguistics to a corpus of 52 texts, carefully selected from the Padova Multimedia English Corpus (Ackerley and Coccetta, 2007a; 2007b), and demonstrates how this approach to text analysis facilitates the study of language functions and notions vis-à-vis their multimodal co-text (Baldry, 2008a). To illustrate this, the online multimodal concordancer MCA (Multimodal Corpus Authoring System) (Baldry, 2005; Baldry and Beltrami, 2005) was used to create, annotate and concordance the corpus in terms of functions and notions, as well as non-verbal features including gestures, dynamics and gaze. The findings of this research have been applied to English language teaching and learning by creating interactive activities illustrating the way in which corpora of spoken texts and multimodal concordancing techniques can be used by language learners and teaching material developers alike. The activities have been included in the online English course Le@rning Links (Ackerley, 2004; Ackerley and Cloke, 2005; Ackerley, Cloke and Mazurelle, 2006; Ackerley and Cloke, 2006; Ackerley and Coccetta, in press).
23

Tirkkonen-Condit, Sonja. "Argumentative text structure and translation." Jyväskylä : University of Jyväskylä, 1985. http://catalog.hathitrust.org/api/volumes/oclc/13332106.html.

24

Bafuka, Freddy Nole. "Beyond text analysis : image-based evaluation of health-related text readability using style features." Thesis, Massachusetts Institute of Technology, 2009. http://hdl.handle.net/1721.1/53121.

Abstract:
Many studies have shown that the readability of health documents presented to consumers does not match their reading levels. An accurate assessment of the readability of health-related texts is an important step in providing material that matches readers' literacy. Current readability measurements depend heavily on text analysis (NLP) but neglect style (text layout). In this study, we show that style properties are important predictors of a document's readability. In particular, we build an automated computer program that uses a document's style to predict its readability score. The style features are extracted by analyzing only one page of the document as an image. The scores produced by our system were tested against scores given by human experts. Our tool shows a stronger correlation with experts' scores than the Flesch-Kincaid readability grading method. We provide an end-user program, VisualGrader, which provides a Graphical User Interface to the scoring model.
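For reference, the Flesch-Kincaid grade level used as the baseline comparison is a purely text-based formula: 0.39 x (words per sentence) + 11.8 x (syllables per word) - 15.59. The syllable counter below is a rough vowel-group heuristic, not the exact one used in the study.

    import re

    def count_syllables(word: str) -> int:
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text: str) -> float:
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z]+", text)
        syllables = sum(count_syllables(w) for w in words)
        n_words = max(1, len(words))
        return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

    print(flesch_kincaid_grade("Take two tablets daily. Do not exceed the stated dose."))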
25

Valeš, Miroslav. "Seeking the Pattern: Using Quantitative Text Analysis to Assess Text Influence on Grant Program Results." Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-193924.

Abstract:
Now that software and hardware for automated text analysis are readily available, and a large dataset describing real projects submitted to a grant program has been opened up, it is possible to investigate phenomena from behavioral economics and psycholinguistics according to which particular features of a textual description may be statistically associated with a reader's behavior or decision-taking, which in this case means an influence on the final allocation of grant funds. The thesis takes the aforementioned areas as a starting point and also employs quantitative indicators from the field of forensic linguistics in order to perform a computer-aided quantitative text analysis. The main goal is to evaluate, from a correlation perspective, whether in real operational programmes there were associable relationships between the quantitative features of a proposed project's textual description and the amount of grant allocated to the project. The thesis is divided into four chapters: it introduces the background, describes the analyzed data and the methods used, comments on the analyses performed and the relations found, and summarizes and evaluates the research in the last chapter.
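The kind of correlation question the thesis asks can be sketched in a few lines: compute a quantitative feature of each project description (here simply its word count) and correlate it with the amount allocated. All values below are invented.

    from scipy.stats import pearsonr

    descriptions = [
        "short project summary",
        "a considerably longer and more detailed description of the project goals",
        "medium length project description text",
    ]
    amounts = [120_000, 310_000, 180_000]  # allocated grants (toy values)

    word_counts = [len(d.split()) for d in descriptions]
    r, p = pearsonr(word_counts, amounts)
    print(f"r={r:.2f}, p={p:.3f}")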
26

Sudhahar, Saatviga. "Automated analysis of narrative text using network analysis in large corpora." Thesis, University of Bristol, 2015. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.685924.

Abstract:
In recent years there has been increased interest in the computational social sciences, digital humanities and political science in performing automated quantitative narrative analysis (QNA) of text at large scale, by studying the actors, actions and relations in a given narration. Social scientists have always relied on news media content to study opinion biases and to extract socio-historical relations and events, yet in order to perform analysis they have had to face labour-intensive coding, in which basic narrative information is manually extracted from text and annotated by hand. This PhD thesis addresses the problem with a big-data approach based on automated information extraction using state-of-the-art Natural Language Processing, text mining and Artificial Intelligence tools. A text corpus is transformed into a semantic network formed of subject-verb-object (SVO) triplets, and the resulting network is analysed by drawing on various theories and techniques such as graph partitioning, network centrality, assortativity, hierarchy and structural balance. Furthermore, we study the position of actors in the network of actors and actions; generate scatter plots describing the subject/object bias and positive/negative bias of each actor; and investigate the types of actions each actor is most associated with. Apart from QNA, SVO triplets extracted from text can also be used to summarize documents. Our findings are demonstrated on two corpora containing English news articles, about US elections and crime, and a third corpus containing ancient folklore stories from the Gutenberg Project. Among the potentially interesting findings, we found that the 2012 US election campaign was very much focused on 'Economy' and 'Rights', and that, overall, the media reported positive statements more frequently for the Democrats than for the Republicans. In the crime study we found that the network identified men as frequent perpetrators, and women and children as victims, of violent crime. A network approach to text based on semantic graphs is a promising way to analyse large corpora of texts: by retaining relational information pertaining to actors and objects, it can reveal latent and hidden patterns, and it therefore has relevance for the social sciences and humanities.
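A hedged sketch of how subject-verb-object triplets, the building block of such semantic networks, can be pulled out of a dependency parse; it assumes spaCy with the en_core_web_sm model installed, and the thesis's own extraction pipeline differs in its tooling.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes the model has been downloaded

    def svo_triplets(text):
        doc = nlp(text)
        triplets = []
        for tok in doc:
            if tok.pos_ == "VERB":
                subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in tok.children if c.dep_ == "dobj"]
                for s in subjects:
                    for o in objects:
                        triplets.append((s.text, tok.lemma_, o.text))
        return triplets

    print(svo_triplets("The senator criticized the bill and the press praised her."))
    # Each triplet becomes an edge of the semantic network.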
27

Abu, Sheikha Fadi. "Analysis and Generation of Formal and Informal Text." Thesis, University of Ottawa (Canada), 2010. http://hdl.handle.net/10393/28845.

Abstract:
In this thesis, we discuss an important issue in computational linguistics: distinguishing between the formal and informal style of texts, in document classification and in text generation. There is a need to identify formal and informal texts automatically, and also a need for a computer system that can generate correct English texts in a formal or informal style. We therefore propose two main techniques to solve these two tasks. The first is to build a model that can classify any text or sentence as having formal or informal style. The second is based on natural language generation (NLG) and generates correct English sentences with formal or informal style. To achieve our goals, we start by studying the main differences between formal and informal style and summarizing their characteristics. In addition, we manually collect parallel lists of formal versus informal words, phrases and expressions from different sources, which are used in the proposed work. We then build our model for the classification task using machine learning techniques to classify texts and sentences into formal and informal style. The evaluation results show that our model is able to predict the formal/informal class of any text or sentence with high accuracy. After that, we build a system that can generate formal and informal sentences using NLG techniques. The evaluation results on a sample of generated sentences show that our NLG system can produce high-quality sentences in formal or informal style. The main contribution of this work consists in designing a set of features that led to good results for both tasks: text classification and text generation with different formality levels.
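The classification half can be sketched as a standard supervised pipeline over labelled sentences; the four training examples below are invented stand-ins for the thesis's collected word lists and corpus, and its actual feature set is richer than plain bag-of-words.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    sentences = [
        "We would be grateful for your prompt reply.",   # formal
        "The committee shall convene on Monday.",        # formal
        "gonna grab food, wanna come?",                  # informal
        "that movie was awesome lol",                    # informal
    ]
    labels = ["formal", "formal", "informal", "informal"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(sentences, labels)
    print(clf.predict(["kindly find the attached document", "see ya later"]))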
28

Johansson, Christian. "Computer Forensic Text Analysis with Open Source Software." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik och datavetenskap, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-4994.

Abstract:
This paper concentrates on forensic examination of text, with a focus on the use of open-source software. The paper discusses and examines different techniques for the future automation of such examinations.
29

Pulliam, John Mark. "An analysis of the Septuagint text of Habakkuk." Theological Research Exchange Network (TREN), 2006. http://www.tren.com/search.cfm?p001-1086.

30

Kowalczyk, Thomas L. "Performance analysis of text-oriented printing using PostScript." Online version of thesis, 1988. http://hdl.handle.net/1850/10451.

31

Ramachandran, Venkateshwaran. "A temporal analysis of natural language narrative text." Thesis, This resource online, 1990. http://scholar.lib.vt.edu/theses/available/etd-03122009-040648/.

32

Shepherd, David. "TEFL methods articles : text analysis and reader interaction." Thesis, Durham University, 1992. http://etheses.dur.ac.uk/5710/.

Abstract:
EFL teachers from the Brazilian public sector have often experienced difficulties in efficiently accessing the relevant information from articles published in 'English Teaching Forum'. This study attempts to investigate these difficulties from both 'text-analytical' and 'reader-based' perspectives and begins with a brief profile of the teachers concerned. An analytical framework incorporating elements from several approaches, specifically those of Hoey (1973) and Swales (1990) is used to highlight the organisational features from a selection of 'Forum' articles. It is then hypothesised that certain clause-relational macropatterns will facilitate access and be focused upon by 'successful' readers; in contrast, writer 'justification' moves are seen as potential barriers to efficient comprehension. A sample of FL methods articles written by Brazilians and published in Portuguese is then analysed and the same set of analytical parameters are found to be valid for describing their organisational features. A review of processing models of text comprehension and related FL reading research is made following the second 'reader-based' perspective. A set of criteria regarding the processing strategies of 'successful' and 'less-skilled' FL readers is established. Verbal report methodologies are argued as a suitable means of testing both the text-analytical hypotheses and the reader processing criteria. Various types of field work carried out in the collection of verbal report data from Brazilian EFL teachers reading 'Forum' articles are then described. Groups of 'successful' and 'problematic' readers are defined according to the processing strategies revealed in the verbal reports. Although there are substantial variations in the individual strategies of individual readers, and evidence of the influence of text informativity, the 'successful' processing consistently included focusing on the clause-relational macro signals; in contrast, there was little evidence of activation of the same text features by the 'problematic' readers. Finally suggestions are made for including FL methods articles, text-analytical elements, and verbal reporting on INSED-TEFL courses in Brazil.
33

Wharton, Chris. "Text and context : an analysis of advertising reception." Thesis, Northumbria University, 2005. http://nrl.northumbria.ac.uk/2831/.

Abstract:
The aim of this study is to explore advertising and in particular advertising reception as a significant part of contemporary social practice. Although advertising in some form has been a feature of a wide range of societies, historically and culturally, its economic and social importance has perhaps never been greater. Advertising, across the industrial period and in particular since the Second World War, has through the entrenchment of market economies and the development of different media technologies increased its reach and density through a variety of means. It has become a significant media form, received by audiences differentiated by social, economic, spatial and other factors. This study enquires into the nature of audience reception of advertising through an exploration and application of the encoding/decoding media model. The study argues that attention to the textual and formal elements of the model need to be given greater emphasis and the decoding aspect of the model broadened to deal with a complexity of contextual factors contributing to the process. Advertising media by their nature are comprised of different formal and presentational means. The study focuses on newsprint, television and billboard and other outdoor advertising. The public and private environments in which these forms appear can be characterised through the social and symbolic difference between the domestic environment in which much television is viewed and the outdoor urban environment in which much billboard advertising appears. These are recognised as contributory elements in the reception of advertising and any significance the advert may have for its audience. Audience decoding of advertisements is then a combination of producer intent and a complexity of contributory factors brought to or found in the decoding process. This includes a recognition of various ways of seeing associated with different media forms and social and spatial circumstances and the presentation and reception of adverts as part of a flow of advertising and of a wider social experience. The relation between adverts and other texts also has important intertextual consequences for reception. In the process of decoding, it will be argued that social groups can be understood to act as interpretive communities and a process of advertising diffusion can be observed. Three empirical case studies form a survey of mainly car or car related advertising, featuring television, billboard and newsprint advertising, and highlight a range of possible decodings. The significance of historical and social factors is confirmed as important in securing particular readings of advertisements, and spatial, environmental and contextual features are emphasised in this survey. The survey acknowledges the significance of advertising form and medium and highlights the circumstances in which negotiated and oppositional readings may occur. This study re-emphasises that advertising texts form their signification within a complex arrangement of synchronic and diachronic circumstances in which immediate social and environmental factors should be accorded further significance in the study of advertising. The study concludes with a reflection on its methods and procedures and a consideration of further work that might be carried out in the area of empirical advertising studies. 
In the interest of a richer understanding of advertising, further research would acknowledge the complexities of audience reception and might include an enquiry into further advertising contexts and environments.
34

Cohen, F. "TASS - Text Analysis System for Understanding News Stories." Thesis, University of Reading, 1988. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.383567.

35

Boulton, David. "Fine art image classification based on text analysis." Thesis, University of Surrey, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.252478.

36

Le, Thien-Hoa. "Neural Methods for Sentiment Analysis and Text Summarization." Electronic Thesis or Diss., Université de Lorraine, 2020. http://www.theses.fr/2020LORR0037.

Abstract:
This thesis focuses on two Natural Language Processing tasks that require extracting semantic information from raw text: Sentiment Analysis and Text Summarization. This dissertation discusses issues with, and seeks to improve, neural models on both tasks, as such models have become the dominant paradigm in the past several years. Accordingly, this dissertation is composed of two parts: the first part (Neural Sentiment Analysis) deals with the computational study of people's opinions and sentiments, and the second part (Neural Text Summarization) tries to extract salient information from a complex sentence and rewrite it in a human-readable form. Neural Sentiment Analysis. As in computer vision, numerous deep convolutional neural networks have been adapted to sentiment analysis and text classification tasks. However, unlike in the image domain, these studies are carried out on different input data types and on different datasets, which makes it hard to know whether a deep network is truly needed. In this thesis, we seek elements to address this question, i.e. whether neural networks must compute deep hierarchies of features for textual data in the same way as they do in vision. We thus propose a new adaptation of the deepest convolutional architecture (DenseNet) for text classification and study the importance of depth in convolutional models with different input granularities (word or character). We show that deep models indeed give better performances than shallow networks when the text input is represented as a sequence of characters. However, a simple shallow-and-wide network outperforms the deep DenseNet models with word inputs. Moreover, to further improve sentiment classifiers and contextualize them, we propose to model them jointly with dialog acts, which are a factor of explanation and correlate with sentiments but are nevertheless often ignored. We have manually annotated both dialogues and sentiments on a Twitter-like social medium, and trained a multi-task hierarchical recurrent network on joint sentiment and dialog act recognition. We show that transfer learning may be efficiently achieved between both tasks, and further analyze some specific correlations between sentiments and dialogues on social media. Neural Text Summarization. Detecting sentiments and opinions in large digital documents does not always enable users of such systems to take informed decisions, as other important semantic information is missing. People also need the main arguments and supporting reasons from the source documents to truly understand and interpret them. To capture such information, we aim at making neural text summarization models more explainable. We propose a model that has better explainability properties and is flexible enough to support various shallow syntactic parsing modules. More specifically, we linearize the syntactic tree into the form of overlapping text segments, which are then selected with reinforcement learning (RL) and regenerated into a compressed form. Hence, the proposed model is able to handle both extractive and abstractive summarization. Further, we observe that RL-based models are becoming increasingly ubiquitous for many text summarization tasks. We are interested in better understanding what types of information are taken into account by such models, and we propose to study this question from the syntactic perspective. We thus provide a detailed comparison of RL-based and syntax-aware approaches and of their combination along several dimensions that relate to the perceived quality of the generated summaries, such as number of repetitions, sentence length, distribution of part-of-speech tags, relevance, and grammaticality. We show that when there is a resource constraint (computation and memory), it is wise to train models with RL only and without any syntactic information, as they provide nearly as good results as syntax-aware models, with fewer parameters and faster training convergence.
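The segment linearization step can be pictured with a minimal Python sketch (assuming an NLTK constituency parse is available; the toy sentence, the multi-word filter, and the omission of the RL selection and compression stages are all simplifications of the thesis's actual pipeline):

    from nltk import Tree

    # A toy constituency parse; in practice this comes from a syntactic parser.
    parse = Tree.fromstring(
        "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))")

    # Linearize the tree into overlapping text segments: every multi-word
    # constituent becomes one candidate segment for later selection.
    segments = []
    for sub in parse.subtrees():
        leaves = sub.leaves()
        if len(leaves) > 1:
            segments.append(" ".join(leaves))

    print(segments)
    # ['the cat sat on the mat', 'the cat', 'sat on the mat', 'on the mat', 'the mat']

Note how the segments overlap: selecting and compressing them, rather than whole sentences, is what lets a single model cover both extractive and abstractive summarization.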
APA, Harvard, Vancouver, ISO, and other styles
37

Widdowson, Henry George. "Text, Context, Pretext: Critical Issues in Discourse Analysis." Oxford: Blackwell, 2004. http://catalogue.bnf.fr/ark:/12148/cb41322428h.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Gränsbo, Gustav. "Word Clustering in an Interactive Text Analysis Tool." Thesis, Linköpings universitet, Interaktiva och kognitiva system, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-157497.

Full text
Abstract:
A central operation of users of the text analysis tool Gavagai Explorer is to look through a list of words and arrange them in groups. This thesis explores the use of word clustering to automatically arrange the words in groups intended to help users. A new word clustering algorithm is introduced, which attempts to produce word clusters tailored to be small enough for a user to quickly grasp the common theme of the words. The proposed algorithm computes similarities among words using word embeddings, and clusters them using hierarchical graph clustering. Multiple variants of the algorithm are evaluated in an unsupervised manner by analysing the clusters they produce when applied to 110 data sets previously analysed by users of Gavagai Explorer. A supervised evaluation is performed to compare clusters to the groups of words previously created by users of Gavagai Explorer. Results show that it was possible to choose a set of hyperparameters deemed to perform well across most data sets in the unsupervised evaluation. These hyperparameters also performed among the best on the supervised evaluation. It was concluded that the choice of word embedding and graph clustering algorithm had little impact on the behaviour of the algorithm. Rather, limiting the maximum size of clusters and filtering out similarities between words had a much larger impact on behaviour.
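A rough Python sketch of the pipeline this abstract describes: cosine similarities from word embeddings, filtering of weak similarities, graph clustering, and a cap on cluster size. The threshold, the cap, and the use of connected components (instead of the hierarchical graph clustering actually studied) are illustrative assumptions:

    import numpy as np
    import networkx as nx

    def cluster_words(words, vectors, threshold=0.6, max_size=8):
        # Cosine similarity between all pairs of word embeddings.
        normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        sims = normed @ normed.T
        g = nx.Graph()
        g.add_nodes_from(range(len(words)))
        for i in range(len(words)):
            for j in range(i + 1, len(words)):
                if sims[i, j] >= threshold:  # filter out weak similarities
                    g.add_edge(i, j)
        clusters = []
        for comp in nx.connected_components(g):
            comp = sorted(comp)
            # Enforce the maximum cluster size by greedy splitting, so each
            # group stays small enough to grasp its common theme at a glance.
            for k in range(0, len(comp), max_size):
                clusters.append([words[i] for i in comp[k:k + max_size]])
        return clusters

The finding that cluster-size limits and similarity filtering matter more than the choice of embedding or clustering algorithm is reflected here: both levers appear as explicit parameters.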
APA, Harvard, Vancouver, ISO, and other styles
39

Cano, Erion. "Text-based Sentiment Analysis and Music Emotion Recognition." Doctoral thesis, Politecnico di Torino, 2018. http://hdl.handle.net/11583/2709436.

Full text
Abstract:
Nowadays, with the expansion of social media, large amounts of user-generated texts like tweets, blog posts or product reviews are shared online. Sentiment polarity analysis of such texts has become highly attractive and is utilized in recommender systems, market prediction, business intelligence and more. We also witness deep learning techniques becoming top performers on those types of tasks. There are, however, several problems that need to be solved for efficient use of deep neural networks on text mining and text polarity analysis. First of all, deep neural networks are data-hungry. They need to be fed with datasets that are big in size, cleaned and preprocessed, as well as properly labeled. Second, the modern natural language processing concept of word embeddings as a dense and distributed text feature representation solves the sparsity and dimensionality problems of the traditional bag-of-words model. Still, there are various uncertainties regarding the use of word vectors: should they be generated from the same dataset that is used to train the model, or is it better to source them from big and popular collections that work as generic text feature representations? Third, it is not easy for practitioners to find a simple and highly effective deep learning setup for various document lengths and types. Recurrent neural networks are weak with longer texts, and optimal convolution-pooling combinations are not easily conceived. It is thus convenient to have generic neural network architectures that are effective and can adapt to various texts, encapsulating much of the design complexity. This thesis addresses the above problems to provide methodological and practical insights for utilizing neural networks on sentiment analysis of texts and achieving state-of-the-art results. Regarding the first problem, the effectiveness of various crowdsourcing alternatives is explored, and two medium-sized, emotion-labeled song datasets are created utilizing social tags. One of the research interests of Telecom Italia was the exploration of relations between music emotional stimulation and driving style. Consequently, a context-aware music recommender system that aims to enhance driving comfort and safety was also designed. To address the second problem, a series of experiments with large text collections of various contents and domains were conducted. Word embeddings of different parameters were exercised, and the results revealed that their quality is influenced (mostly but not only) by the size of the texts they were created from. When working with small text datasets, it is thus important to source word features from popular and generic word embedding collections. Regarding the third problem, a series of experiments involving convolutional and max-pooling neural layers were conducted. Various patterns relating text properties and network parameters to optimal classification accuracy were observed. Combining convolutions of words, bigrams, and trigrams with regional max-pooling layers in a couple of stacks produced the best results. The derived architecture achieves competitive performance on sentiment polarity analysis of movie, business and product reviews. Given that labeled data are becoming the bottleneck of current deep learning systems, a future research direction could be the exploration of various data programming possibilities for constructing even bigger labeled datasets. Investigation of feature-level or decision-level ensemble techniques in the context of deep neural networks could also be fruitful. Different feature types usually represent complementary characteristics of the data. Combining word embeddings with traditional text features, or utilizing recurrent networks on document splits and then aggregating the predictions, could further increase the prediction accuracy of such models.
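The derived architecture, combining convolutions of words, bigrams, and trigrams with regional max-pooling layers in a couple of stacks, might be sketched in Keras as follows; the filter counts, pooling sizes, and two-stack depth are hypothetical stand-ins for the thesis's tuned values:

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_model(vocab_size=20000, seq_len=200, dim=100, n_classes=2):
        inp = layers.Input(shape=(seq_len,))
        emb = layers.Embedding(vocab_size, dim)(inp)
        branches = []
        for k in (1, 2, 3):  # word, bigram, and trigram convolutions
            h = layers.Conv1D(128, k, activation="relu", padding="same")(emb)
            h = layers.MaxPooling1D(pool_size=5)(h)  # regional max pooling
            h = layers.Conv1D(128, k, activation="relu", padding="same")(h)
            h = layers.GlobalMaxPooling1D()(h)       # second stack, then pool
            branches.append(h)
        out = layers.Dense(n_classes, activation="softmax")(
            layers.Concatenate()(branches))
        return tf.keras.Model(inp, out)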
APA, Harvard, Vancouver, ISO, and other styles
40

Cowie, James Reid. "Automatic analysis of descriptive texts." Thesis, University of Strathclyde, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.387066.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Ashton, Triss A. "Accuracy and Interpretability Testing of Text Mining Methods." Thesis, University of North Texas, 2013. https://digital.library.unt.edu/ark:/67531/metadc283791/.

Full text
Abstract:
Extracting meaningful information from large collections of text data is problematic because of the sheer size of the database. However, automated analytic methods capable of processing such data have emerged. These methods, collectively called text mining, first began to appear in 1988. A number of additional text mining methods quickly developed in independent research silos, each based on unique mathematical algorithms. How good each of these methods is at analyzing text is unclear. Method development typically evolves from some research-silo-centric requirement, with the success of the method measured by a custom requirement-based metric. Results of the new method are then compared to another method that was similarly developed. The proposed research introduces an experimentally designed testing method for text mining that eliminates research silo bias and simultaneously evaluates methods from all of the major context-region text mining method families. The proposed research method follows a random block factorial design with two treatments consisting of three and five levels (RBF-35) with repeated measures. The contribution of the research is threefold. First, the users perceived a difference in the effectiveness of the various methods. Second, while still not clear, there are characteristics within the text collection that affect an algorithm's ability to extract meaningful results. Third, this research develops an experimental design process for testing the algorithms that is adaptable to other areas of software development and algorithm testing. This design eliminates the bias-based practices historically employed by algorithm developers.
APA, Harvard, Vancouver, ISO, and other styles
42

Dumont-Le Brazidec, Joffrey. "An Object-Oriented Data Analysis approach for text population." Thesis, KTH, Matematisk statistik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-223244.

Full text
Abstract:
With more and more digital text-valued data available, the need arises to cluster, classify, and study it. We develop in this thesis statistical tools to perform null-hypothesis testing and clustering or classification on text-valued data in the framework of Object-Oriented Data Analysis. The project includes research on semantic methods to represent texts, comparisons between representations, distances for such representations, and the performance of permutation tests. The main methods compared are the Vector Space Model and topic models. More precisely, this thesis provides an algorithm to compute permutation tests at document or sentence level to study the equality in distribution of two texts for different representations and distances. Lastly, we study texts from a syntactic point of view, describing their structure with a tree representation.
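The permutation tests the abstract mentions can be sketched in a few lines of Python, assuming documents have already been embedded as vectors; the centroid-distance statistic and the Euclidean metric shown in the usage comment are illustrative choices, whereas the thesis compares several representations and distances:

    import numpy as np

    def permutation_test(docs_a, docs_b, distance, n_perm=9999, seed=0):
        # Null hypothesis: the two groups of document vectors share a
        # distribution. Statistic: distance between the group centroids.
        rng = np.random.default_rng(seed)
        pooled = np.vstack([docs_a, docs_b])
        n_a = len(docs_a)
        observed = distance(docs_a.mean(axis=0), docs_b.mean(axis=0))
        hits = 0
        for _ in range(n_perm):
            perm = rng.permutation(len(pooled))
            a, b = pooled[perm[:n_a]], pooled[perm[n_a:]]
            if distance(a.mean(axis=0), b.mean(axis=0)) >= observed:
                hits += 1
        return (hits + 1) / (n_perm + 1)  # p-value with add-one smoothing

    # e.g. with Euclidean distance on random document vectors:
    # p = permutation_test(np.random.rand(30, 50), np.random.rand(40, 50),
    #                      lambda u, v: np.linalg.norm(u - v))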
APA, Harvard, Vancouver, ISO, and other styles
43

Maloney, Ross J. "Assisting Reading and Analysis of Text Documents by Visualization." Murdoch University, 2005. http://wwwlib.murdoch.edu.au/adt/browse/view/adt-MU20060502.150150.

Full text
Abstract:
The research reported here examined the use of computer-generated graphics as a means to assist humans to analyse text documents which have not been subject to markup. The approach taken was to survey available visualization techniques in a broad selection of disciplines, including applications to text documents, group those techniques using a taxonomy proposed in this research, and then develop a selection of techniques that assist the text analysis objective. Development of the selected techniques from their fundamental basis, through their visualization, to their demonstration in application comprises most of the body of this research. A scientific orientation employing measurements, combined with visual depiction and explanation of each technique with limited mathematics, is used, as opposed to fully utilising any one of the resulting techniques for performing a complete text document analysis. Visualization techniques which apply directly to the text and those which exploit measurements produced by associated techniques are considered. Both approaches employ visualization to assist the human viewer to discover patterns which are then used in the analysis of the document. In the measurement case, this requires consideration of data with dimensions greater than three, which imposes a visualization difficulty. Several techniques for overcoming this problem are proposed. Word frequencies, Zipf considerations, parallel coordinates, colour maps, Cusum plots, and fractal dimensions are some of the techniques considered. One direct application of visualization to text documents is to assist reading of a document by de-emphasising selected words, fading them on the display from which they are read. Three word selection techniques are proposed for automatically selecting which words to fade. An experiment using such word fading techniques is reported. It indicated that some readers do have improved reading speed under such conditions, but others do not. The experimental design enabled the separation of the group whose reading times did decrease from the remaining readers. Measurements of comprehension errors made under different types of word fading were shown not to increase beyond those obtained under normal reading conditions. A visualization based on categorising the words in a text document is proposed, which contrasts with visualization of measurements based on counts. The result is a visual impression of the word composition, and of the evolution of that composition within the document. The text documents used to demonstrate these techniques include English novels and short stories, emails, and a series of eighteenth-century newspaper articles known as the Federalist Papers. This range of documents was needed because not all analysis techniques are applicable to all types of documents. This research proposes that interactive use of the techniques at hand, in a non-prescribed order, can yield useful results in a document analysis. An example of this is author attribution, i.e. assigning authorship of documents via patterns characteristic of an individual's writing style. Different visual techniques can be used to explore the patterns of writing in given text documents. A software toolkit as a platform for implementing the proposed interactive analysis of text documents is described, and how the techniques could be integrated into such a toolkit is outlined. A prototype of software to implement such a toolkit is included in this research, and issues relating to the implementation of each technique are also outlined.
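A purely illustrative frequency-based word selector for the fading idea, in Python, and not necessarily one of the three selection techniques proposed in the thesis:

    from collections import Counter
    import re

    def words_to_fade(text, fraction=0.2):
        # Under Zipf's law a few very frequent word types account for most
        # tokens, so fading the top-ranked types de-emphasises much of the
        # text while leaving the rarer content words fully visible.
        tokens = re.findall(r"[a-z']+", text.lower())
        ranked = [w for w, _ in Counter(tokens).most_common()]
        return set(ranked[: max(1, int(len(ranked) * fraction))])

A reader-facing implementation would then render the selected words in a lighter colour, matching the fading display used in the reading experiment.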
APA, Harvard, Vancouver, ISO, and other styles
44

Li, Yanjun. "High Performance Text Document Clustering." Wright State University / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=wright1181005422.

Full text
APA, Harvard, Vancouver, ISO, and other styles
45

Bonora, Filippo. "Dynamic networks, text analysis and Gephi: the art math." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2013. http://amslaurea.unibo.it/6327/.

Full text
Abstract:
In numerous scientific fields the analysis of complex networks has led to many recent discoveries. In this thesis we experimented with this approach on human language, in particular written language, where words do not interact randomly. We first presented measures capable of extracting important topological structures from linguistic networks (Degree, Strength, Entropy, ...) and examined the software used to represent and visualize the graphs (Gephi). We then analyzed the different statistical properties of the same text in its various forms (shuffled, without stopwords, and without low-frequency words): our database contains five books by five authors who lived in the nineteenth century. Finally, we showed how certain measures are important for distinguishing a real text from its modified versions, and why the degree distributions of a normal text and a shuffled one follow the same trend. These results may prove useful in the increasingly active analysis of linguistic phenomena such as authorship attribution and the recognition of shuffled texts.
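A minimal sketch of how such a linguistic network can be built and its Degree and Strength measures read off, using Python and networkx (the window size and whitespace tokenization are simplifying assumptions):

    import networkx as nx

    def cooccurrence_network(tokens, window=2):
        # Link words that co-occur within a sliding window; edge weights
        # count how often each pair co-occurs.
        g = nx.Graph()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if w != v:
                    weight = g[w][v]["weight"] + 1 if g.has_edge(w, v) else 1
                    g.add_edge(w, v, weight=weight)
        return g

    tokens = "the cat sat on the mat because the mat was warm".split()
    g = cooccurrence_network(tokens)
    degree = dict(g.degree())                   # number of distinct neighbours
    strength = dict(g.degree(weight="weight"))  # sum of incident edge weights

Comparing such measures between a real text and its shuffled or stopword-stripped versions is exactly the kind of contrast the thesis draws.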
APA, Harvard, Vancouver, ISO, and other styles
46

Boynukalin, Zeynep. "Emotion Analysis Of Turkish Texts By Using Machine Learning Methods." Master's thesis, METU, 2012. http://etd.lib.metu.edu.tr/upload/12614521/index.pdf.

Full text
Abstract:
Automatically analysing the emotion in texts is of increasing interest in today's research fields. The aim is to develop a machine that can detect the type of a user's emotion from his or her text. Emotion classification of English texts has been studied by several researchers, and promising results have been achieved. In this thesis, an emotion classification study on Turkish texts is introduced. To the best of our knowledge, this is the first study on emotion analysis of Turkish texts. In English there exist some well-defined datasets for the purpose of emotion classification, but we could not find datasets in Turkish suitable for this study. Therefore, another important contribution is the generation of a new dataset in Turkish for emotion analysis. The dataset is generated by combining two types of sources. Several classification algorithms are applied on the dataset and the results are compared. Due to the nature of the Turkish language, new features are added to the existing methods to improve the success of the proposed method.
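As a generic baseline for this kind of emotion classification, not the thesis's actual feature set (which adds Turkish-specific features), a bag-of-words pipeline in Python might look like:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy Turkish sentences with hypothetical emotion labels.
    texts = ["bugün çok mutluyum", "bu habere çok üzüldüm",
             "beni çok korkuttun", "harika bir gün"]
    labels = ["joy", "sadness", "fear", "joy"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    print(clf.predict(["çok mutlu hissediyorum"]))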
APA, Harvard, Vancouver, ISO, and other styles
47

Uchimoto, Kiyotaka. "Maximum Entropy Models for Japanese Text Analysis and Generation." 京都大学 (Kyoto University), 2004. http://hdl.handle.net/2433/147595.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Kof, Leonid. "Text Analysis for Requirements Engineering: Application of Computational Linguistics." Saarbrücken: VDM Verl. Dr. Müller, 2007. http://deposit.d-nb.de/cgi-bin/dokserv?id=3021639&prov=M&dok_var=1&dok_ext=htm.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Green, Pamela Dilys. "Extracting group relationships within changing software using text analysis." Thesis, University of Hertfordshire, 2013. http://hdl.handle.net/2299/11896.

Full text
Abstract:
This research looks at identifying and classifying changes in evolving software by making simple textual comparisons between groups of source code files. The two areas investigated are software origin analysis and collusion detection. Textual comparison is attractive because it can be used in the same way for many different programming languages. The research includes the first major study using machine learning techniques in the domain of software origin analysis, which looks at the movement of code in an evolving system. The training set for this study, which focuses on restructured files, is created by analysing 89 software systems. Novel features, which capture abstract patterns in the comparisons between source code files, are used to build models which classify restructured files from unseen systems with a mean accuracy of over 90%. The unseen code is not only in C, the language of the training set, but also in Java and Python, which helps to demonstrate the language independence of the approach. As well as generating features for the machine learning system, textual comparisons between groups of files are used in other ways throughout the system: in filtering to find potentially restructured files, in ranking the possible destinations of the code moved from the restructured files, and as the basis for a new file comparison tool. This tool helps in the demanding task of manually labelling the training data, is valuable to the end user of the system, and is applicable to other file comparison tasks. These same techniques are used to create a new text-based visualisation for use in collusion detection, and to generate a measure which focuses on the unusual similarity between submissions. This measure helps to overcome problems in detecting collusion in data where files are of uneven size, where there is high incidental similarity, or where more than one programming language is used. The visualisation highlights interesting similarities between files, making the task of inspecting the texts easier for the user.
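The group-level textual comparisons the abstract relies on can be pictured with a much simpler stand-in, token-set overlap between source files; the thesis's actual comparison features are more elaborate, but like this Python sketch they work identically across programming languages:

    import re

    def token_set(path):
        # Language-independent tokenization: lower-cased word tokens.
        with open(path, encoding="utf-8", errors="ignore") as f:
            return set(re.findall(r"\w+", f.read().lower()))

    def jaccard(a, b):
        # Similarity between two files as the overlap of their token sets.
        return len(a & b) / len(a | b) if a | b else 0.0

    # e.g. jaccard(token_set("old/parser.c"), token_set("new/lexer.c"))
    # (hypothetical paths) scores one candidate destination of moved code.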
APA, Harvard, Vancouver, ISO, and other styles
50

Stein, Roger Alan. "An analysis of hierarchical text classification using word embeddings." Universidade do Vale do Rio dos Sinos, 2018. http://www.repositorio.jesuita.org.br/handle/UNISINOS/7624.

Full text
Abstract:
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvements on automatic document classification tasks. However, the effectiveness of such techniques has not yet been assessed for hierarchical text classification (HTC). This study investigates the application of those models and algorithms to this specific problem by means of experimentation and analysis. Classification models were trained with prominent machine learning algorithm implementations (fastText, XGBoost, and Keras' CNN) and notable word embedding generation methods (GloVe, word2vec, and fastText) on publicly available data, and evaluated with measures specifically appropriate for the hierarchical context. FastText achieved an LCAF1 of 0.871 on a single-labeled version of the RCV1 dataset. The analysis of the results indicates that using word embeddings is a very promising approach for HTC.
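A minimal fastText supervised run of the kind compared in the study could look like this in Python; the file name, hyperparameters, and label scheme are placeholders rather than the thesis's exact RCV1 setup:

    import fasttext

    # train.txt holds one example per line, e.g.
    # "__label__markets central bank raises interest rates ..."
    # (hypothetical file; each label encodes a node of the topic hierarchy)
    model = fasttext.train_supervised(input="train.txt", lr=0.5,
                                      epoch=25, wordNgrams=2)
    print(model.predict("efficient word representations for classification"))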
APA, Harvard, Vancouver, ISO, and other styles
