
Dissertations / Theses on the topic 'Text linguistics'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Text linguistics.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Atwell, Eric Steven. "Corpus linguistics and language learning : bootstrapping linguistic knowledge and resources from text." Thesis, University of Leeds, 2008. http://etheses.whiterose.ac.uk/7504/.

Full text
Abstract:
This submission for the award of the degree of PhD by published work must: “make a contribution to knowledge in a coherent and related subject area; demonstrate originality and independent critical ability; satisfy the examiners that it is of sufficient merit to qualify for the award of the degree of PhD.” It includes a selection of my work as a Lecturer (and later, Senior Lecturer) at Leeds University, from 1984 to the present. The overall theme of my research has been bootstrapping linguistic knowledge and resources from text. A persistent strand of interest has been unsupervised and semi-supervised machine learning of linguistic knowledge from textual sources; the attraction of this approach is that I could start with English, but go on to apply analogous techniques to other languages, in particular Arabic. This theme covers a broad range of research over more than 20 years at Leeds University which I have divided into 8 sub-topics: A: Constituent-Likelihood statistical modelling of English grammar; B: Machine Learning of grammatical patterns from a corpus; C: Detecting grammatical errors in English text; D: Evaluation of English grammatical annotation models; E: Machine Learning of semantic language models; F: Applications in English language teaching; G: Arabic corpus linguistics; H: Applications in Computing teaching and research. The first section builds on my early years as a lecturer at Leeds University, when my research was essentially a progression from my previous work at Lancaster University on the LOB Corpus Part-of-Speech Tagging project (which resulted in the Tagged LOB Corpus, a resource for Corpus Linguistics research still in use today); I investigated a range of ideas for extending and/or applying techniques related to Part-of-Speech tagging in Corpus Linguistics. The second section covers a range of co-authored papers representing grant-funded research projects in Corpus Linguistics; in this mode of research, I had to come up with the original ideas and guide the project, but much of the detailed implementation was down to research assistant staff. Another highly productive mode of research has been supervision of research students, leading to further jointly-authored research papers. I helped formulate the research plans, and guided and advised the students; as with research-grant projects, the detailed implementation of the research has been down to the research students. The third section includes a few of the most significant of these jointly-authored Corpus Linguistics research papers. A “standard” PhD generally includes a survey of the field to put the work in context; so as a fourth section, I include some survey papers aimed at introducing new developments in corpus linguistics to a wider audience.
APA, Harvard, Vancouver, ISO, and other styles
2

Clough, Paul D. "Measuring text reuse." Thesis, University of Sheffield, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.275023.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Boer, Maria Ângela de Sousa. "Systemic linguistics and the grammar of the text." Repositório Institucional da UFPR, 2010. http://hdl.handle.net/1884/24322.

Full text
Abstract:
This dissertation deals with the interrelation between the study of grammar and the study of the text as a communicative entity. Our experience teaching text analysis at the tertiary level has shown that text study based on traditional grammar does not integrate with functional approaches to text comprehension; in other words, studying grammar along traditional lines does not foster the ability to comprehend and produce texts. We believe this kind of incongruity leads students to think that the study of grammar is an aspect of language separate from the text as a unit of communication. Against this background, we decided to review Systemic Linguistics. This theory was preferred because it is a theory of language in use: as such, it treats the text as the basic unit of communication and describes the linguistic component in the light of what happens in the text. The review of Systemic Linguistics pursues three objectives: i) to establish how Systemic Linguistics describes the linguistic component as a whole, that is, what the levels of the linguistic component are; ii) to examine in more detail how Systemic Linguistics describes the grammatical component; iii) in the light of Systemic Linguistics, to highlight the reasons for the incompatibility between the study of grammar and functional approaches to text comprehension, and to explain the advantages of using a functional grammar as a basis for text comprehension. These three objectives determine the organisation of the work, which is subdivided into three parts: i) a review of the description of the linguistic component as a whole; ii) a review of the grammatical component and an application of the grammar to the analysis of three texts; iii) based on the systemic description of language, a discussion of the linguistic aspects that explain the incompatibility between the study of grammar and functional approaches to the text, together with an explanation of the linguistic characteristics that make functional grammar an efficient basis for text analysis. Through an understanding of Systemic Linguistics it was possible to perceive and make explicit why the study of grammar can often be incompatible with functional approaches to the text. The main factor is that the theory of language underlying traditional grammar is often incompatible with the theories of language underlying functional approaches to the text. In general terms, the conclusion reached is that the incompatibility between traditional grammar and functional approaches to the text is a consequence of the incompatibility between the specifically syntagmatic perspective of traditional grammar and the paradigmatic character of the text. As explained in the last part of the work, the paradigmatic perspective orients the configuration of the linguistic system as a whole; accordingly, approaches that work at different linguistic levels and with different hierarchical units must also follow this perspective, since it is the paradigmatic dimension that carries the description of a text to the level of its content.
APA, Harvard, Vancouver, ISO, and other styles
4

Tagg, Caroline. "A corpus linguistics study of SMS text messaging." Thesis, University of Birmingham, 2009. http://etheses.bham.ac.uk//id/eprint/253/.

Full text
Abstract:
This thesis reports a study using a corpus of text messages in English (CorTxt) to explore linguistic features which define texting as a language variety. It focuses on how the language of texting, Txt, is shaped by texters actively fulfilling interpersonal goals. The thesis starts with an overview of the literature on texting, which indicates the need for thorough linguistic investigation of Txt based on a large dataset. It then places texting within the tradition of research into the speech-writing continuum, which highlights limitations of focusing on mode at the expense of other user-variables. The thesis also argues the need for inductive investigation alongside the quantitative corpus-based frameworks that dominate the field. A number of studies are then reported which explore the unconventional nature of Txt. Firstly, drawing on the argument that respelling constitutes a meaning-making resource, spelling variants are retrieved using word-frequency lists and categorised according to form and function. Secondly, identification of everyday creativity in CorTxt challenges studies focusing solely on spelling as a creative resource, and suggests that creativity plays an important role in texting because of, rather than despite, physical constraints. Thirdly, word frequency analysis suggests that the distinct order of the most frequent words in CorTxt can be explained with reference to the frequent phrases in which they occur. Finally, application of a spoken grammar model reveals similarities and differences between spoken and texted interaction. The distinct strands of investigation highlight, on the one hand, the extent to which texting differs from speech and, on the other, the role of user agency, awareness and choice in shaping Txt. The argument is made that this can be explained through performativity and, in particular, the observation that texters perform brevity, speech-like informality and group deviance in construing identities through Txt.
APA, Harvard, Vancouver, ISO, and other styles
5

Roloff, Vera Lucia Posnik. "Foreign language reading comprehension: Text representation and the effects of text explicitness and reading ability." Thesis, University of Ottawa (Canada), 1999. http://hdl.handle.net/10393/8791.

Full text
Abstract:
The present study investigated text reconstruction performance of EFL university-level students reading a fairly long naturally-occurring popular magazine article taking two factors into consideration: degree of text content explicitness and EFL reading ability level. More specifically, it attempted to examine a deeper level of text representation, or what Kintsch and his associates label the situation model (Kintsch & van Dijk, 1978; van Dijk & Kintsch, 1983; Kintsch, 1974, 1988, 1992, 1994, 1998), by subjects of two reading ability levels in EFL [high and low]. Subjects performed an immediate written reconstructive recall after reading one of the two versions [fully explicit and less explicit ] of a popular science article. This recall or text representation reflected (1) the comprehension or the reconstruction of the text as a whole, (2) the distribution of information in the text, i.e., in terms of its macro and microstructure, and (3) any correct inferences that may have been generated. In addition, the study considered the influence of text difficulty, topic interest, and topic familiarity in the reconstructive representation. Ninety-two Brazilian university-level subjects participated in this study. Comprehension was measured quantitatively in terms of: (1) the number of propositions recalled from reading one of the two versions, (2) textbase recall, and (3) inferential recall. In both text versions, six hierarchical levels of information were considered. There were four main findings of the present study: (1) Text version had an impact on the reconstructive process. Readers benefited from reading a less explicit version of the text regardless of their reading ability level, although high reading ability level readers outperformed low reading ability ones. (2) The fully explicit version had an advantage over the less explicit version only with respect to the construction of the textbase representation. (3) Results respected the Hierarchy Principle, that is, higher-level propositions were better and more frequently recalled than lower-level ones. (4) Text difficulty and topic familiarity were not determining factors in the reconstructive representation. Topic interest, however, was shown to be a significant factor in the construction of the textbase as well as in the reconstructive process as a whole of low reading ability subjects. The findings of the present study are broadly consistent with those reported in earlier cognitive research in the area of text representation, particularly with those which examined text comprehension in the context of the Kintsch & van Dijk (Kintsch & van Dijk, 1978; van Dijk & Kintsch, 1983, Kintsch, 1974, 1988, 1992, 1994, 1998) model of reading comprehension.
APA, Harvard, Vancouver, ISO, and other styles
6

Maisto, Alessandro. "A Hybrid Framework for Text Analysis." Doctoral thesis, Universita degli studi di Salerno, 2017. http://hdl.handle.net/10556/2481.

Full text
Abstract:
2015 - 2016
In Computational Linguistics there is an essential dichotomy between linguists and computer scientists. The former, with a strong knowledge of language structures, lack engineering skills; the latter, expert in computing and mathematics, attach little value to the basic mechanisms and structures of language. This discrepancy has widened in recent decades with the growth of computational resources and the progressive computerization of the world: Machine Learning, which allows machines to learn from manually generated examples, has been used more and more often in Computational Linguistics to overcome the obstacle posed by language structures and their formal representation. The dichotomy has given rise to two main approaches to Computational Linguistics: rule-based methods, which try to imitate the way humans use and understand language, reproducing the syntactic structures on which the understanding process is based and building lexical resources such as electronic dictionaries, taxonomies or ontologies; and statistics-based methods, which treat language as a set of elements, quantify words mathematically and try to extract information without identifying syntactic structures or, in some algorithms, try to give the machine the ability to learn those structures. One of the main problems is the lack of communication between these two approaches, due to the substantial differences that characterize them: on one side there is a strong focus on how language works and on its characteristics, with a tendency towards analytical and manual work; on the other, the engineering perspective sees language as an obstacle and regards algorithms as the fastest way to overcome it. However, the lack of communication is not only an incompatibility: following Harris, the best way to approach natural language may be to take the best of both. At the moment there is a large number of open-source tools for text analysis and Natural Language Processing. Most of them are based on statistical models and consist of separate modules that can be combined into a text-processing pipeline. Many of these resources are code packages without a GUI (Graphical User Interface) and are effectively unusable for people without programming skills. Furthermore, the vast majority of these open-source tools support only English and, where Italian is included, their performance decreases significantly; open-source tools for Italian are very few. This work aims to fill that gap by presenting a new hybrid framework for the analysis of Italian texts. It is not intended as a commercial tool; it was built to help linguists and other scholars perform rapid text analysis and produce linguistic data. The framework, which performs both statistical and rule-based analysis, is called LG-Starship. The idea is to build modular software that initially includes the basic algorithms needed for different kinds of analysis. The modules perform the following tasks:

Preprocessing Module: loads a text, normalizes it and removes stop-words; as output it presents the list of tokens and letters that compose the text, with their occurrence counts, together with the processed text.
Mr. Ling Module: performs POS tagging and lemmatization; it also returns the table of lemmas with occurrence counts and a table quantifying the grammatical tags.
Statistic Module: calculates Term Frequency and TF-IDF of tokens or lemmas, extracts bigram and trigram units and exports the results as tables.
Semantic Module: uses the Hyperspace Analogue to Language algorithm to calculate semantic similarity between words, returning word-by-word similarity matrices that can be exported and analyzed.
Syntactic Module: analyzes the syntactic structure of a selected sentence and tags the verbs and their arguments with semantic labels.

The objective of the framework is to build an all-in-one NLP platform that allows any kind of user to perform basic and advanced text analysis. To make the framework accessible to users without specific computer science or programming skills, the modules are provided with an intuitive GUI. The framework can be considered hybrid in a double sense: it uses both statistical and rule-based methods, relying on standard statistical algorithms and, at the same time, on Lexicon-Grammar syntactic theory, and it is written in both the Java and Python programming languages. The LG-Starship framework has a simple graphical user interface but will also be released as separate modules that can be included independently in any NLP pipeline. There are many resources of this kind, but the large majority work only for English; free resources for Italian are very few, and this work tries to cover that need by proposing a tool that can be used both by linguists and other scholars interested in language and text analysis who know nothing about programming languages, and by computer scientists, who can use the free modules in their own code or in combination with different NLP algorithms. The framework starts from a text or corpus written directly by the user or loaded from an external resource. The LG-Starship workflow is described in the flowchart shown in fig. 1. The Pre-Processing Module is applied to the original imported or generated text to produce a clean, normalized preprocessed text; this module includes a text-splitting function, a stop-word list and a tokenization method. The Statistic Module or the Mr. Ling Module can then be applied to the preprocessed text. The former, which includes basic statistical algorithms such as Term Frequency, tf-idf and n-gram extraction, produces as output databases of lexical and numerical data that can be used to produce charts or to perform further external analysis. The latter is divided into two main tasks: a POS tagger, based on the Averaged Perceptron Tagger and trained on the Paisà Corpus [Lyding et al., 2014], performs Part-of-Speech tagging and produces an annotated text; a lemmatization method, which relies on a set of electronic dictionaries developed at the University of Salerno [Elia, 1995, Elia et al., 2010], takes the POS-tagged text as input and produces a new lemmatized version of the original text with information about its syntactic and semantic properties.
This lemmatized text, which can also be processed with the Statistic Module, serves as input for two deeper levels of text analysis carried out by the Syntactic Module and the Semantic Module. The former rests on Lexicon-Grammar theory [Gross, 1971, 1975] and uses a database of predicate structures under development at the Department of Political, Social and Communication Science; its objective is to produce a dependency graph of the sentences that compose the text. The Semantic Module uses the Hyperspace Analogue to Language distributional semantics algorithm [Lund and Burgess, 1996], trained on the Paisà Corpus, to produce a semantic network of the words of the text. This workflow has been applied in two experiments involving two user-generated corpora. The first is a statistical study of the language of rap music in Italy, based on a large corpus of rap lyrics downloaded from online databases of user-generated lyrics. The second is a feature-based Sentiment Analysis project on user product reviews; for this project we integrated a large domain database of linguistic resources for Sentiment Analysis, developed over the past years by the Department of Political, Social and Communication Science of the University of Salerno, which consists of polarized dictionaries of verbs, adjectives, adverbs and nouns. These two experiments show how the framework can be applied to different levels of analysis and can produce both qualitative and quantitative data. As for the results obtained, the framework, which is only at a beta version, achieves fair results both in terms of processing time and in terms of precision. Nevertheless, the work is far from complete: more algorithms will be added to the Statistic Module, the Syntactic Module will be completed, the GUI will be improved and made more attractive and modern, and an open-source online version of the modules will be published. [edited by author]
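As a concrete illustration of the kind of computation the Statistic Module is said to perform, here is a minimal, self-contained Python sketch of term frequency, TF-IDF and bigram extraction; it is not LG-Starship code, and the toy documents and function names are invented for the example.

```python
# Minimal sketch (not LG-Starship itself) of the computations the abstract
# attributes to the Statistic Module: term frequency, TF-IDF and bigram
# extraction over a small tokenized corpus. Tokenization here is naive.
import math
from collections import Counter

def tokenize(text):
    return [w.lower() for w in text.split()]

def tf_idf(corpus_tokens):
    """corpus_tokens: list of token lists, one per document."""
    n_docs = len(corpus_tokens)
    df = Counter()                      # document frequency of each term
    for tokens in corpus_tokens:
        df.update(set(tokens))
    scores = []
    for tokens in corpus_tokens:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

docs = ["il gatto dorme sul divano", "il cane dorme in giardino"]
tokens = [tokenize(d) for d in docs]
print(tf_idf(tokens)[0])
print(bigrams(tokens[0]))
```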
XV n.s.
APA, Harvard, Vancouver, ISO, and other styles
7

Kof, Leonid. "Text analysis for requirements engineering : application of computational linguistics /." Saarbrücken : VDM Verl. Dr. Müller, 2007. http://deposit.d-nb.de/cgi-bin/dokserv?id=3021639&prov=M&dok_var=1&dok_ext=htm.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Dawson, David Allan. "Text-linguistics and Biblical Hebrew : an examination of methodologies." Thesis, University of Edinburgh, 1994. http://hdl.handle.net/1842/19674.

Full text
Abstract:
This dissertation focusses on the theoretical base, and accompanying methodologies, required for text-linguistic analysis of Biblical Hebrew texts, and the degree of clarity required for communication of the results. After a brief theoretical introduction, and explanation of a few common terms, two chapters are devoted to interacting with five works which concern themselves to some degree with this issue (including works by Niccacci, Eskhult, Andersen, Khan, and Longacre). Longacre's book was used as a springboard to launch into an introduction to the tagmemic school of text-linguistics (or "discourse analysis"); my intention has been to contribute explanations in plain English of some of the fundamental concepts of this model, in order that hebraists may make more use of its considerable benefits. In particular, Longacre's identification of several possible text-types (which free us from trying to describe Reported Speech as a single text-type with extremely flexible rules), and of the correlation of a scale of foregrounded to backgrounded clause-types for each significant text-type, promises to streamline description of Hebrew considerably. The next two chapters apply these concepts to biblical texts taken from Judges, Leviticus, Exodus, and Ruth. In these chapters, several text-types are confirmed, and their verb ranking identified. Reported Speech is found to have a slight modifying influence on these text-types, but it is suggested that this is due to internal cohesion with the speech formula into which it is embedded (contra Niccacci).
APA, Harvard, Vancouver, ISO, and other styles
9

Fulford, Heather. "Term acquisition : a text-probing approach." Thesis, University of Surrey, 1997. http://epubs.surrey.ac.uk/843700/.

Full text
Abstract:
In order to assist terminologists in the compilation of terminology collections in specialist domains, a "text probing" approach to the acquisition of English terms from special language texts is specified, designed, implemented, and evaluated. This approach draws on aspects of general language corpus linguistics and computational lexicography, and follows current trends towards corpus-based terminology compilation work. Our text-probing approach is founded specifically on observations about the linguistic features of English terms and their collocational behaviour in special language texts, and represents an effort to extend the scope of existing collocation studies from general language to special language. It aims to be both domain- and text-type independent. By operating on the premise that a term is likely to reside in a special language text between boundary markers comprising closed class words/punctuation, it permits the acquisition of single- and multi-word terms spanning a range of word classes. Our approach has been implemented in a prototype computer program ("Termspotter") which has been written in Quintus Prolog. This program processes untagged special language texts, either individually or in batches. It functions by "probing" texts for closed class words and punctuation, extracting as term candidates those items which reside between them. A systematic evaluation of the text-probing approach is presented in which, using an innovative experimental design, the term acquisition efficiency of Termspotter is measured against the manual scanning output of domain experts, as well as compared with the scanning output of terminologists. Results in the special language texts studied so far indicate that, on average, Termspotter can accurately retrieve 80% of the terms identified by a domain expert, and can typically partially retrieve the remaining 20%. The program performed very favourably in comparison with human terminologists. Extensions of our text-probing approach to other languages are anticipated. Moreover, wider applications of the notion of text probing are envisaged, both within and beyond the terminology community, for abstracting other structures from special language texts.
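The boundary-probing idea described in this abstract is simple enough to sketch directly. The following Python fragment is a hypothetical illustration only (Termspotter itself is written in Quintus Prolog), with a toy closed-class stop list; real special-language processing would need a far fuller list.

```python
# Hypothetical sketch of the "text-probing" idea: treat closed-class words and
# punctuation as boundary markers and extract whatever lies between them as
# single- or multi-word term candidates.
import re

CLOSED_CLASS = {"the", "a", "an", "of", "in", "on", "for", "and", "or",
                "to", "is", "are", "be", "by", "with", "that", "this"}

def term_candidates(text):
    tokens = re.findall(r"[A-Za-z-]+|[^\sA-Za-z-]", text)
    candidates, span = [], []
    for tok in tokens:
        if tok.lower() in CLOSED_CLASS or not tok[0].isalpha():
            if span:                      # a boundary closes the current span
                candidates.append(" ".join(span))
                span = []
        else:
            span.append(tok)
    if span:
        candidates.append(" ".join(span))
    return candidates

print(term_candidates("The boundary markers consist of closed class words, "
                      "punctuation and other function words."))
# -> ['boundary markers consist', 'closed class words', 'punctuation', 'other function words']
```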
APA, Harvard, Vancouver, ISO, and other styles
10

Laffling, John D. "Machine disambiguation and translation of polysemous nouns : a lexicon-driven model for text-semantic analysis and parallel text-dependent transfer in German-English translation of party political texts." Thesis, University of Wolverhampton, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.254466.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Wilson, Christin M. L. "Variation and Text Type in Old Occitan Texts." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1331136026.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Law, Yee Wah Mary. "The study of register differentiation of two types of press text : opinion article & feature news." HKBU Institutional Repository, 2003. http://repository.hkbu.edu.hk/etd_ra/488.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Cheng, Chi Wa. "Probabilistic topic modeling and classification probabilistic PCA for text corpora." HKBU Institutional Repository, 2011. http://repository.hkbu.edu.hk/etd_ra/1263.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Whitelaw, Casey. "Systemic features for text classification." Thesis, The University of Sydney, 2005. https://hdl.handle.net/2123/28097.

Full text
Abstract:
This thesis applies Systemic Functional Linguistics (SFL) to the automatic analysis of text. SFL is a theory that describes language use primarily in terms of meaning. While widely used for text generation, the difficulty of complete automatic SFL analysis has kept it out of the text analysis mainstream. This thesis presents a new partial analytical model for SFL, designed to allow domain-specific systemic models to be used in shallow processing for text classification. In this model, language use in a document is identified through the use of a systemic extractor, for which algorithms are presented and shown to be fast, efficient and scalable. Documents are then represented as a set of systemic features, which leverage SFL theory to provide more meaningful representations. These systemic features are used to perform supervised text classification using statistical machine learning algorithms. The properties of systemic features are explored in a series of case studies upon different types of text classification tasks, using different parts of SFL. Systemic features prove useful in identifying interpersonally close and distant documents; in improving the classification of financial scams; and in the identification of positive and negative opinion. As presented in this thesis, language use described by SFL can be modelled and extracted efficiently and used effectively in real-world text classification tasks.
APA, Harvard, Vancouver, ISO, and other styles
15

J'Fellers, J., and Theresa McGarry. "Language and Linguistics." Digital Commons @ East Tennessee State University, 2009. https://dc.etsu.edu/etsu-works/6151.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Heister, Julian, and Reinhold Kliegl. "Comparing word frequencies from different German text corpora." Universität Potsdam, 2012. http://opus.kobv.de/ubp/volltexte/2012/6234/.

Full text
Abstract:
Contents: Introduction; Developments in creating corpora: dlexDB, subtitles, and tabloid newspapers; Rating corpus emotionality; Current study; Method; Materials; Corpora; Results; Type-token ratio; Validity: Effects of task difficulty; Emotionality of a corpus; Validity: Effects of emotionality; Discussion; Outlook; References
APA, Harvard, Vancouver, ISO, and other styles
17

Keenan, Francis Gerard. "Large vocabulary syntactic analysis for text recognition." Thesis, Nottingham Trent University, 1992. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.334311.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Unaldi, Aylin. "Investigating reading for academic purposes : sentence, text and multiple texts." Thesis, University of Bedfordshire, 2010. http://hdl.handle.net/10547/279255.

Full text
Abstract:
This study examines the nature of reading in academic environments and suggests ways for a more appropriate assessment of it. Research studies show that reading in academic settings is a complex knowledge management process in which information is selected, combined and organised not from a single, isolated text but from multiple information sources. This study initially gathered evidence from students studying at a British university on their perceived and observed reading purposes and processes in three studies: a large-scale questionnaire, a longitudinal reading diary study and, finally, individual interviews, in order both to establish whether the prominent reading skills used by them were as put forth in the studies on academic reading, and to examine in detail the actual cognitive processes (reading operations) used in reading for academic purposes. The study draws on the reading theories that explain reading comprehension and focuses specifically on different levels of careful reading such as sentence, text and multiple texts in order to explicate that increasingly more complex cognitive processes explain higher levels of reading comprehension. Building on the findings from the three initial studies, it is suggested that reading tests of English for Academic Purposes (EAP) should involve not only local level comprehension questions but also reading tasks at text and multiple texts levels. For this aim, taking the Khalifa and Weir (2009) framework as the basis, cognitive processes extracted from the theories defining each level of reading, and contextual features extracted through the analysis of university course books were combined to form the test specifications for each level of careful reading and sample tests assessing careful reading at sentence, text and intertextual levels were designed. Statistical findings confirmed the differential nature of the three levels of careful reading; however, the expected difficulty continuum could not be observed among the tests. Possible reasons underlying this are discussed, suggestions on reading tasks that might operationalise text level reading more efficiently and intertextual level reading more extensively are made and additional components of intertextual reading are offered for the Khalifa and Weir (2009) reading framework. The implications of the findings for the teaching and assessment of English for Academic Purposes are also discussed.
APA, Harvard, Vancouver, ISO, and other styles
19

Fournier, Christopher. "Evaluating Text Segmentation." Thèse, Université d'Ottawa / University of Ottawa, 2013. http://hdl.handle.net/10393/24064.

Full text
Abstract:
This thesis investigates the evaluation of automatic and manual text segmentation. Text segmentation is the process of placing boundaries within text to create segments according to some task-dependent criterion. An example of text segmentation is topical segmentation, which aims to segment a text according to the subjective definition of what constitutes a topic. A number of automatic segmenters have been created to perform this task, and the question that this thesis answers is how to select the best automatic segmenter for such a task. This requires choosing an appropriate segmentation evaluation metric, confirming the reliability of a manual solution, and then finally employing an evaluation methodology that can select the automatic segmenter that best approximates human performance. A variety of comparison methods and metrics exist for comparing segmentations (e.g., WindowDiff, Pk), and all save a few are able to award partial credit for nearly missing a boundary. Those comparison methods that can award partial credit unfortunately lack consistency, symmetricity, intuition, and a host of other desirable qualities. This work proposes a new comparison method named boundary similarity (B) which is based upon a new minimal boundary edit distance to compare two segmentations. Near misses are frequent, even among manual segmenters (as is exemplified by the low inter-coder agreement reported by many segmentation studies). This work adapts some inter-coder agreement coefficients to award partial credit for near misses using the new metric proposed herein, B. The methodologies employed by many works introducing automatic segmenters evaluate them simply in terms of a comparison of their output to one manual segmentation of a text, and often only by presenting nothing other than a series of mean performance values (along with no standard deviation, standard error, or little if any statistical hypothesis testing). This work asserts that one segmentation of a text cannot constitute a “true” segmentation; specifically, one manual segmentation is simply one sample of the population of all possible segmentations of a text and of that subset of desirable segmentations. This work further asserts that the adapted inter-coder agreement statistics proposed herein should be used to determine the reproducibility and reliability of a coding scheme and set of manual codings, and then statistical hypothesis testing using the specific comparison methods and methodologies demonstrated herein should be used to select the best automatic segmenter. This work proposes new segmentation evaluation metrics, adapted inter-coder agreement coefficients, and methodologies. Most importantly, this work experimentally compares the state-of-the-art comparison methods to those proposed herein on artificial data that simulates a variety of scenarios and chooses the best one (B). The ability of adapted inter-coder agreement coefficients, based upon B, to discern between various levels of agreement in artificial and natural data sets is then demonstrated. Finally, a contextual evaluation of three automatic segmenters is performed using the state-of-the-art comparison methods and B using the methodology proposed herein to demonstrate the benefits and versatility of B as opposed to its counterparts.
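For readers unfamiliar with the window-based metrics the thesis argues against, here is a simplified Python sketch of WindowDiff (Pevzner & Hearst, 2002) over boundary-indicator sequences; the exact window size and normalisation vary slightly across the literature, and this is an illustration only, not the thesis's proposed boundary similarity B.

```python
# Simplified WindowDiff sketch: segmentations are lists of 0/1 boundary
# indicators between adjacent units; a window is penalised when the reference
# and hypothesis disagree on how many boundaries it contains.
def window_diff(reference, hypothesis, k=None):
    n = len(reference)
    if k is None:
        # window size: roughly half the average reference segment length
        n_segments = sum(reference) + 1
        k = max(1, round(n / (2 * n_segments)))
    errors = 0
    for i in range(n - k + 1):
        if sum(reference[i:i + k]) != sum(hypothesis[i:i + k]):
            errors += 1
    return errors / (n - k + 1)

ref = [0, 0, 1, 0, 0, 0, 1, 0, 0]
hyp = [0, 1, 0, 0, 0, 0, 1, 0, 0]   # one near miss
print(window_diff(ref, hyp))
```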
APA, Harvard, Vancouver, ISO, and other styles
20

Scott, Sam. "Feature engineering for a symbolic approach to text classification." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1998. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp01/MQ36741.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Mohamed, Muhidin Abdullahi. "Automatic text summarisation using linguistic knowledge-based semantics." Thesis, University of Birmingham, 2016. http://etheses.bham.ac.uk//id/eprint/6659/.

Full text
Abstract:
Text summarisation is the reduction of a text document to a short substitute summary. Since the commencement of the field, almost all summarisation research to date has involved the identification and extraction of the most important document/cluster segments, an approach called extraction. This typically involves scoring each document sentence according to a composite scoring function consisting of surface level and semantic features. Enabling machines to analyse text features and understand their meaning potentially requires both text semantic analysis and equipping computers with external semantic knowledge. This thesis addresses extractive text summarisation by proposing a number of semantic and knowledge-based approaches. The work combines the high-quality semantic information in WordNet, the crowdsourced encyclopaedic knowledge in Wikipedia, and the manually crafted categorial variation in CatVar, to improve the summary quality. Such improvements are accomplished through sentence level morphological analysis and the incorporation of Wikipedia-based named-entity semantic relatedness while using heuristic algorithms. The study also investigates how sentence-level semantic analysis based on semantic role labelling (SRL), leveraged with background world knowledge, influences sentence textual similarity and text summarisation. The proposed sentence similarity and summarisation methods were evaluated on standard publicly available datasets such as the Microsoft Research Paraphrase Corpus (MSRPC), TREC-9 Question Variants, and the Document Understanding Conference 2002, 2005, 2006 (DUC 2002, DUC 2005, DUC 2006) Corpora. The project also uses Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for the quantitative assessment of the proposed summarisers’ performances. Results of our systems showed their effectiveness as compared to related state-of-the-art summarisation methods and baselines. Of the proposed summarisers, the SRL Wikipedia-based system demonstrated the best performance.
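The extractive skeleton the abstract refers to, scoring each sentence with a composite function and keeping the top-ranked ones, can be sketched as follows. This is a hedged toy example with a purely frequency-based score; the thesis's WordNet, Wikipedia, CatVar and SRL features would replace or extend the scoring function.

```python
# Bare-bones extractive summarisation skeleton: score sentences, keep the top n.
from collections import Counter

def summarise(sentences, n=2):
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)

    def score_sentence(sentence):
        toks = sentence.lower().split()
        return sum(freq[t] for t in toks) / len(toks)

    ranked = sorted(sentences, key=score_sentence, reverse=True)
    keep = set(ranked[:n])
    return [s for s in sentences if s in keep]   # preserve original order

doc = ["Text summarisation shortens a document.",
       "Extractive methods select the most important sentences.",
       "The weather was pleasant that day.",
       "Sentence scores combine surface and semantic features."]
print(summarise(doc, n=2))
```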
APA, Harvard, Vancouver, ISO, and other styles
22

Teich, Elke, and Peter Fankhauser. "Exploring lexical patterns in text : lexical cohesion analysis with WordNet." Universität Potsdam, 2005. http://opus.kobv.de/ubp/volltexte/2006/868/.

Full text
Abstract:
We present a system for the linguistic exploration and analysis of lexical cohesion in English texts.
Using an electronic thesaurus-like resource, Princeton WordNet, and the Brown Corpus of English, we have implemented a process of annotating text with lexical chains and a graphical user interface for inspection of the annotated text.
We describe the system and report on some sample linguistic analyses carried out using the combined thesaurus-corpus resource.
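A rough sketch of this kind of lexical-chain annotation, using NLTK's interface to Princeton WordNet (the actual system also uses the Brown Corpus and a graphical user interface, neither of which is reproduced here); the linking rule below, which only checks shared synsets and direct hypernym/hyponym relations, is an assumption made for the example.

```python
# Toy lexical-chain builder over WordNet. Requires: nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def related(w1, w2):
    s1, s2 = set(wn.synsets(w1)), set(wn.synsets(w2))
    if s1 & s2:                      # the two words share a sense
        return True
    hypers = {h for s in s1 for h in s.hypernyms()}
    hypos = {h for s in s1 for h in s.hyponyms()}
    return bool(s2 & (hypers | hypos))

def lexical_chains(words):
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, member) for member in chain):
                chain.append(w)
                break
        else:
            chains.append([w])       # start a new chain
    return chains

print(lexical_chains(["car", "automobile", "wheel", "banana", "fruit"]))
```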
APA, Harvard, Vancouver, ISO, and other styles
23

Calderon, de Bolivar Adriana. "Interaction through written text : a discourse analysis of newspaper editorials." Thesis, University of Birmingham, 1986. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.312040.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

af Geijerstam, Åsa. "Att skriva i naturorienterande ämnen i skolan." Doctoral thesis, Uppsala University, Department of Linguistics and Philology, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-7352.

Full text
Abstract:

When children encounter new subjects in school, they are also faced with new ways of using language. Learning science thus means learning the language of science, and writing is one of the ways this is accomplished. The present study investigates writing in natural sciences in grades 5 and 8 in Swedish schools. Major theoretical influences for these investigations are found within the socio-cultural, dialogical and social semiotic perspectives on language use.

The study is based on texts written by 97 students, interviews around these texts and observations from 16 different classroom practices. Writing is seen as a situated practice; therefore analysis is carried out of the activities surrounding the texts. The student texts are analysed in terms of genre and in relation to their abstraction, density and use of expansions. This analysis shows among other things that the texts show increasing abstraction and density with increasing age, whereas the text structure and the use of expansions do not increase.

It is also argued that a central point in school writing must be the students’ way of talking about their texts. Analysis of interviews with the students is thus carried out in terms of text movability. The results from this analysis indicate that students find it difficult to talk about their texts. They find it hard to express the main content of the text, as well as to discuss its function and potential readers.

Previous studies argue that writing constitutes a potential for learning. In the material studied in this thesis, this potential learning tool is not used to any large extent. To be able to participate in natural sciences in higher levels, students need to take part in practices where the specialized language of natural science is used in writing as well as in speech.

APA, Harvard, Vancouver, ISO, and other styles
25

Vajjala, Balakrishna Sowmya [Verfasser], and Detmar [Akademischer Betreuer] Meurers. "Analyzing Text Complexity and Text Simplification : Connecting Linguistics, Processing and Educational Applications / Sowmya Vajjala Balakrishna ; Betreuer: Detmar Meurers." Tübingen : Universitätsbibliothek Tübingen, 2015. http://d-nb.info/1163397652/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

McGarry, Theresa, and J. Mwinyelle. "Adverbial Clauses and Gender in English and Spanish." Digital Commons @ East Tennessee State University, 2014. https://dc.etsu.edu/etsu-works/6155.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Pindi, Makaya ma Kimvwela. "Schematic structure and the modulation of propositions in economics forecasting text." Thesis, Online version, 1988. http://ethos.bl.uk/OrderDetails.do?did=1&uin=uk.bl.ethos.3821053.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Mason, Oliver Jan. "The automatic extraction of linguistic information from text corpora." Thesis, University of Birmingham, 2006. http://etheses.bham.ac.uk//id/eprint/116/.

Full text
Abstract:
This is a study exploring the feasibility of a fully automated analysis of linguistic data. It identifies a requirement for large-scale investigations, which cannot be done manually by a human researcher. Instead, methods from natural language processing are suggested as a way to analyse large amounts of corpus data without any human intervention. Human involvement hinders scalability and introduces a bias which prevents studies from being completely replicable. The fundamental assumption underlying this work is that linguistic analysis must be empirical, and that reliance on existing theories or even descriptive categories should be avoided as far as possible. In this thesis we report the results of a number of case studies investigating various areas of language description, lexis, grammar, and meaning. The aim of these case studies is to see how far we can automate the analysis of different aspects of language, both with data gathering and subsequent processing of the data. The outcomes of the feasibility studies demonstrate the practicability of such automated analyses.
APA, Harvard, Vancouver, ISO, and other styles
29

Danielsson, Benjamin. "A Study on Text Classification Methods and Text Features." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-159992.

Full text
Abstract:
When it comes to the task of classification, the data used for training is the most crucial part. It follows that how this data is processed and presented to the classifier plays an equally important role. This thesis attempts to investigate the performance of multiple classifiers depending on the features that are used, the type of classes to classify and the optimization of said classifiers. The classifiers of interest are support-vector machines (SMO) and multilayer perceptrons (MLP); the features tested are word vector spaces and text complexity measures, along with principal component analysis (PCA) on the complexity measures. The features are created based on the Stockholm-Umeå Corpus (SUC) and DigInclude, a dataset containing standard and easy-to-read sentences. For the SUC dataset the classifiers attempted to classify texts into nine different text categories, while for the DigInclude dataset the sentences were classified into either standard or simplified classes. The classification tasks on the DigInclude dataset showed poor performance in all trials. The SUC dataset showed best performance when using SMO in combination with word vector spaces. Comparing the SMO classifier on the text complexity measures when using or not using PCA showed that the performance was largely unchanged between the two, although not using PCA had slightly better performance.
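A hedged illustration of the comparison described in the abstract, using scikit-learn stand-ins (LinearSVC for the SMO support-vector classifier, MLPClassifier for the multilayer perceptron) over a bag-of-words vector space; the four toy sentences and labels are invented, and the thesis itself works with the SUC and DigInclude data.

```python
# Compare an SVM-style classifier and an MLP on the same TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

texts = ["a short easy sentence",
         "this sentence is considerably more elaborate",
         "plain words only",
         "subordinate clauses increase structural complexity"]
labels = ["easy", "standard", "easy", "standard"]

for clf in (LinearSVC(), MLPClassifier(max_iter=500)):
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=2)
    print(type(clf).__name__, scores.mean())
```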
APA, Harvard, Vancouver, ISO, and other styles
30

Doyle, Paul G. "Replicating corpus linguistics : a corpus-driven investigation of lexical networks in text." Thesis, Lancaster University, 2003. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.418685.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Baka, Farida. "The discourse of biology lectures : aspects of its mode and text structure." Thesis, Aston University, 1989. http://publications.aston.ac.uk/14815/.

Full text
Abstract:
The present thesis investigates mode related aspects in biology lecture discourse and attempts to identify the position of this variety along the spontaneous spoken versus planned written language continuum. Nine lectures (of 43,000 words) consisting of three sets of three lectures each, given by the three lecturers at Aston University, make up the corpus. The indeterminacy of the results obtained from the investigation of grammatical complexity as measured in subordination motivates the need to take the analysis beyond sentence level to the study of mode related aspects in the use of sentence-initial connectives, sub-topic shifting and paraphrase. It is found that biology lecture discourse combines features typical of speech and writing at sentence as well as discourse level: thus, subordination is more used than co-ordination, but one degree complexity sentence is favoured; some sentence initial connectives are only found in uses typical of spoken language but sub-topic shift signalling (generally introduced by a connective) typical of planned written language is a major feature of the lectures; syntactic and lexical revision and repetition, interrupted structures are found in the sub-topic shift signalling utterance and paraphrase, but the text is also amenable to analysis into sentence like units. On the other hand, it is also found that: (1) while there are some differences in the use of a given feature, inter-speaker variation is on the whole not significant; (2) mode related aspects are often motivated by the didactic function of the variety; and (3) the structuring of the text follows a sequencing whose boundaries are marked by sub-topic shifting and the summary paraphrase. This study enables us to draw four theoretical conclusions: (1) mode related aspects cannot be approached as a simple dichotomy since a combination of aspects of both speech and writing are found in a given feature. It is necessary to go to the level of textual features to identify mode related aspects; (2) homogeneity is dominant in this sample of lectures which suggests that there is a high level of standardization in this variety; (3) the didactic function of the variety is manifested in some mode related aspects; (4) the features studied play a role in the structuring of the text.
APA, Harvard, Vancouver, ISO, and other styles
32

Ewert, Doreen Elizabeth. "The expression of temporality in the written discourse of L2 learners of English : distinguishing text-types and text passages /." [Bloomington, Ind.] : Indiana University, 2006. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3220175.

Full text
Abstract:
Thesis (Ph.D.)--Indiana University, Dept. of Linguistics, 2006.
Source: Dissertation Abstracts International, Volume: 67-05, Section: A, page: 1710. Adviser: Kathleen Bardovi-Harlig. "Title from dissertation home page (viewed June 20, 2007)."
APA, Harvard, Vancouver, ISO, and other styles
33

Lindén, Johannes. "Extracting Text into Meta-Data : Improving machine text-understanding of news-media articles." Licentiate thesis, Mittuniversitetet, Institutionen för informationssystem och –teknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-41775.

Full text
Abstract:
Society is constantly in need of information. It is important to consume event-based information about what is happening around us as well as facts and knowledge. As society grows, the amount of information to consume grows with it. This thesis demonstrates one way to extract and represent knowledge from text in a machine-readable way for news media articles. Three objectives are considered when developing a machine learning system to retrieve categories, entities, relations and other meta-data from text paragraphs. The first is to sort the terminology by topic; this makes it easier for machine learning algorithms to understand the text and the unique words used. The second objective is to construct a service for use in production, where scalability and performance are evaluated. Features are implemented to iteratively improve the model predictions, and several versions are run at the same time to, for example, compare them in an A/B test. The third objective is to further extract the gist of what is expressed in the text. The gist is extracted in the form of triples by connecting two related entities using a combination of natural language processing algorithms. The research presents a comparison between five different auto-categorization algorithms, and an evaluation of their hyperparameters and how they would perform under the pressure of thousands of big, concurrent predictions. The aim is to build an auto-categorization system that can be used in the news media industry to help writers and journalists focus more on the story rather than filling in meta-data for each article. The best-performing algorithm is a Bidirectional Long Short-Term Memory neural network. Three different information extraction algorithms for extracting the gist of paragraphs are also compared. The proposed information extraction algorithm supports extracting information from texts in multiple languages with competitive accuracy compared with the state-of-the-art OpenIE and MinIE algorithms, which extract information in a single language. The use of the multilingual models helps local news media write articles in different languages, which in turn helps integrate immigrants into society.


At the time of the public defence the following papers were unpublished: paper 4 submitted.
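The triple ("gist") extraction the abstract describes can be approximated crudely with a dependency parse. The sketch below uses spaCy and a naive subject-verb-object rule; it is only an illustration of the idea, not the thesis's multilingual pipeline or the OpenIE/MinIE systems it is compared against, and it assumes the en_core_web_sm model is installed.

```python
# Naive subject-verb-object triple extraction from a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")

def triples(text):
    results = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children
                            if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children
                           if c.dep_ in ("dobj", "obj", "attr")]
                for s in subjects:
                    for o in objects:
                        results.append((s.text, token.lemma_, o.text))
    return results

print(triples("The journalist wrote an article. The editor approved the headline."))
# e.g. [('journalist', 'write', 'article'), ('editor', 'approve', 'headline')]
```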

APA, Harvard, Vancouver, ISO, and other styles
34

Folkeryd, Jenny W. "Writing with an Attitude : Appraisal and student texts in the school subject of Swedish." Doctoral thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-7410.

Full text
Abstract:
Learning in school is in many respects done through language. However, it has been shown that the language of school assignments is seldom explicitly discussed in school. Writing tasks are furthermore assigned without clear guidelines for how certain lexical choices make one text more powerful than another. The present study is a contribution to a linguistic and pedagogical discussion of student writing. More specifically the focus is on the use of evaluative language in texts written by students in the school subject of Swedish in grades 5, 8 and 11. The major investigations of the study have been accommodated within the theoretical framework of Appraisal. An overview is given of the language resources in the student texts for constructing emotion, judging behavior in ethical terms and valuing objects aesthetically. Another question addressed is that of how attitudinal meaning is intensified, thus creating greater or lesser degrees of positivity or negativity associated with the feelings. The results show that manifestations of attitude are found in practically all texts in the study. However, variations are noted in relation to different genres, age, proficiency level, language background and gender. A contribution of the study in relation to the theoretical framework upon which it draws is an extension of the system of Attitude as well as an identification of different patterns in the use of attitudinal resources. These patterns are furthermore discussed in relation to how students talk about their own written production in terms of text movability. Results indicate that students with a high degree of text movability also use attitudinal resources to a large extent. It is argued that applying the linguistic tool of Appraisal can facilitate a discussion of how to make one aspect of the hidden curriculum more visible, namely, how to write with an Attitude.
APA, Harvard, Vancouver, ISO, and other styles
35

Zhang, Yaxi. "Named Entity Recognition for Social Media Text." Thesis, Uppsala universitet, Institutionen för lingvistik och filologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-395978.

Full text
Abstract:
This thesis aims to perform named entity recognition for English social media texts. Named Entity Recognition (NER) is applied in many NLP tasks as an important preprocessing procedure. Social media texts contain lots of real-time data and therefore serve as a valuable source for information extraction. Nevertheless, NER for social media texts is a rather challenging task due to the noisy context. Traditional approaches to deal with this task use hand-crafted features but prove to be both time-consuming and very task-specific. As a result, they fail to deliver satisfactory performance. The goal of this thesis is to tackle this task by automatically identifying and annotating the named entities with multiple types with the help of neural network methods. In this thesis, we experiment with three different word embeddings and character embedding neural network architectures that combine long short-term memory (LSTM), bidirectional LSTM (BI-LSTM) and conditional random field (CRF) to get the best result. The data and evaluation tool come from the 2017 shared task on Noisy User-generated Text (W-NUT). We achieve the best F1 score, 42.44, using a BI-LSTM-CRF with character-level representations extracted by a BI-LSTM and pre-trained GloVe word embeddings. We also find that the results could be improved with larger training data sets.
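A skeleton of the word-level part of such a tagger, written in PyTorch purely as an illustration; the thesis's actual models add a character-level BI-LSTM, pre-trained GloVe embeddings and a CRF output layer, all omitted here, and the vocabulary and tag-set sizes below are invented.

```python
# Minimal bidirectional LSTM sequence tagger producing per-token tag scores.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, n_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)   # one score per NE tag

    def forward(self, token_ids):
        x = self.embed(token_ids)          # (batch, seq, emb_dim)
        h, _ = self.lstm(x)                # (batch, seq, 2*hidden)
        return self.out(h)                 # emission scores; a CRF would decode these

model = BiLSTMTagger(vocab_size=5000, n_tags=13)
dummy = torch.randint(0, 5000, (2, 10))    # batch of 2 sentences, 10 tokens each
print(model(dummy).shape)                  # torch.Size([2, 10, 13])
```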
APA, Harvard, Vancouver, ISO, and other styles
36

Brewer, C. D. "Some implications of the Z-text for the textual tradition of Piers Plowman." Thesis, University of Oxford, 1985. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.371610.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Edling, Agnes. "Abstraction and authority in textbooks : The textual paths towards specialized language." Doctoral thesis, Uppsala University, Department of Linguistics and Philology, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-6989.

Full text
Abstract:

During a few hours of a school day, a student might read textbook texts which are highly diversified in terms of abstraction. Abstraction is a central feature of specialized language and the transition from everyday language to specialized language is one of the most important things formal education can offer students. That transition is the focus of this thesis.

Based on a discussion of the concept of abstraction, the study introduces a new three-grade classification comprising the levels of specificity, generalization and abstraction. The investigations performed on the basis of this classification show that texts from different subject areas display distinct patterns of abstraction. The Swedish literary texts had the lowest degree of abstraction, the social science texts an intermediate degree, and the natural science texts were the most generalized and abstract. The results also show that the degree of abstraction in the textbook texts increases in later grade levels.

The thesis presents a new way of analyzing shifts between levels of abstraction and their functions. Interestingly, the texts with a medium degree of abstraction, the social science texts, are the ones with the greatest variety in shifts. The functions of the shifts differ with respect to cultural domains. The shifts in the Swedish literary texts in general belong to the everyday domain while the shifts in the natural science texts belong to a specialized domain. The shifts in the social science texts had features of both domains.

A secondary aim of the thesis is to develop the understanding of the relationship between author and reader in the texts. The results from my investigation of modality in the Swedish textbook texts confirm the earlier findings from English and Spanish textbooks. In comparison to other text types, textbook texts present knowledge in a more authoritative and less modalized way.

Abstraction is sometimes described as a feature that hinders students' access to texts. Some researchers even suggest removing features of specialized language from textbook texts in order to increase students' understanding. However, in a society where specialized knowledge is necessary, access to specialized texts is important. A democratic view of education demands that children and adolescents have the opportunity to encounter, and learn to handle, specialized language in school. In analyzing the texts, special attention is paid to the relationship between the texts, the contexts of use and the student readers.

APA, Harvard, Vancouver, ISO, and other styles
38

Paun, Silviu. "Topic models for short text data." Thesis, University of Essex, 2017. http://repository.essex.ac.uk/19715/.

Full text
Abstract:
Topic models are known to suffer from sparsity when applied to short text data. The problem is caused by the reduced number of observations (i.e. the words in a document) available for reliable inference. A popular heuristic for overcoming this problem is to perform some form of document aggregation by context (e.g. author, hashtag) before training. One part of this dissertation is dedicated to modeling explicitly the implicit assumptions of the document aggregation heuristic and applying it to two well-known model architectures: a mixture and an admixture. Our findings indicate that an admixture model benefits more from aggregation than a mixture model, which rarely improved over its baseline (the standard mixture). We also find that the state of the art in short text data can be surpassed as long as every context is shared by a small number of documents. In the second part of the dissertation we develop a more general-purpose topic model which can also be used when contextual information is not available. The proposed model is built around the observation that in normal text data a classic topic model such as an admixture works well because patterns of word co-occurrence arise across the documents, whereas in a short text dataset such patterns are much less likely to arise. The model assumes every document is a bag of word co-occurrences, where each co-occurrence belongs to a latent topic. The documents are enhanced a priori with related co-occurrences from the other documents, so that the collection has a greater chance of exhibiting word patterns. The proposed model performs well, managing to surpass the state of the art and popular topic model baselines.
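The document-aggregation heuristic discussed above can be pictured in a few lines of Python. The sketch below is not the dissertation's model; it simply merges hypothetical tweets by hashtag into pseudo-documents and trains a standard LDA admixture, assuming the gensim package is available. The texts and hashtags are invented.

```python
from collections import defaultdict
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical short documents, each tagged with a context (here: a hashtag).
tweets = [
    ("#nlp",    ["topic", "models", "struggle", "with", "short", "texts"]),
    ("#nlp",    ["word", "cooccurrence", "patterns", "are", "sparse"]),
    ("#soccer", ["great", "goal", "in", "the", "final", "minute"]),
    ("#soccer", ["the", "keeper", "saved", "a", "penalty"]),
]

# Aggregation heuristic: merge all tweets sharing a context into one pseudo-document,
# so that word co-occurrence patterns have a chance to emerge.
pseudo_docs = defaultdict(list)
for context, tokens in tweets:
    pseudo_docs[context].extend(tokens)

docs = list(pseudo_docs.values())
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```

The dissertation's contribution is precisely to replace this pre-training heuristic with models that encode its assumptions, or that enhance documents with related co-occurrences when no context is available.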
APA, Harvard, Vancouver, ISO, and other styles
39

Williams, Ken. "A framework for text categorization." Thesis, The University of Sydney, 2003. https://hdl.handle.net/2123/27951.

Full text
Abstract:
The field of automatic Text Categorization (TC) concerns the creation of categorizer functions, usually involving Machine Learning techniques, to assign labels from a pre-defined set of categories to documents based on the documents' content. Because of the many variations on how this can be achieved and the diversity of applications in which it can be employed, creating specific TC applications is often a difficult task. This thesis concerns the design, implementation, and testing of an Object-Oriented Application Framework for Text Categorization. By encoding expertise in the architecture of the framework, many of the barriers to creating TC applications are eliminated. Developers can focus on the domain-specific aspects of their applications, leaving the generic aspects of categorization to the framework. This allows significant code and design reuse when building new applications. Chapter 1 provides an introduction to automatic Text Categorization, Object-Oriented Application Frameworks, and Design Patterns. Some common application areas and benefits of using automatic TC are discussed. Frameworks are defined and their advantages compared to other software engineering strategies are presented. Design patterns are defined and placed in the context of framework development. An overview of three related products in the TC space, Weka, Autonomy, and Teragram, follows. Chapter 2 contains a detailed presentation of Text Categorization. TC is formally defined, followed by a detailed account of the main functional areas in Text Categorization that a modern TC framework must provide. These include document tokenizing, feature selection and reduction, Machine Learning techniques, and categorization runtime behavior. Four Machine Learning techniques (Naïve Bayes categorizers, k-Nearest-Neighbor categorizers, Support Vector Machines, and Decision Trees) are presented, with discussions of their core algorithms and the computational complexity involved. Several measures for evaluating the quality of a categorizer are then defined, including precision, recall, and the Fβ measure. The design of a framework that addresses the functional areas from Chapter 2 is presented in Chapter 3. This design is motivated by consideration of the framework's audience and some expected usage scenarios. The core architectural classes in the framework are then presented, and Design Patterns are employed in a detailed discussion of the cooperative relationships among framework classes. This is the first known use of Design Patterns in an academic work on Text Categorization software. Following the presentation of the framework design, some possible design limitations are discussed. The design in Chapter 3 has been implemented as the AI::Categorizer Perl package. Chapter 4 is a short discussion of implementation issues, including considerations in choosing the programming language. Special consideration is given to the implementation of constructor methods in the framework, since they are responsible for enforcing the structural relationships among framework classes. Three data structure issues within the framework are then discussed: feature vectors, sets of document or category objects, and the serialized representation of a framework object. Chapter 5 evaluates the framework from several different perspectives on two corpora. The first corpus is the standard Reuters-21578 benchmark corpus, and the second is assembled from messages sent to an educational ask-an-expert service. Using these corpora, the framework is evaluated on the measures introduced in Chapter 2. The performance on the first corpus is compared to the well-known results in [50]. The Naïve Bayes categorizer is found to be competitive with standard implementations in the literature, and the Support Vector Machine and k-Nearest-Neighbor implementations are outperformed by comparable systems by other researchers. The framework is then evaluated in terms of its resource usage, and several applications using AI::Categorizer are presented in order to show the framework's ability to function in the usage scenarios discussed in Chapter 3.
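For a concrete picture of the pipeline stages the framework covers (tokenizing and weighting features, reducing the feature set, training a categorizer, and evaluating it with precision, recall and Fβ), the following is a hedged scikit-learn sketch in Python. It is not the AI::Categorizer Perl framework itself, and the toy documents and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_score, recall_score, fbeta_score

# Toy corpus standing in for labelled training documents (labels invented).
train_docs = ["wheat prices rose sharply", "corn harvest exceeded forecasts",
              "central bank raised interest rates", "inflation pushed bond yields higher"]
train_labels = ["grain", "grain", "finance", "finance"]
test_docs = ["wheat exports fell", "the bank cut rates"]
test_labels = ["grain", "finance"]

# Pipeline mirroring the generic stages: tokenize/weight features,
# reduce the feature set, then train a Naive Bayes categorizer.
categorizer = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=10),
    MultinomialNB(),
)
categorizer.fit(train_docs, train_labels)
predicted = categorizer.predict(test_docs)

print(precision_score(test_labels, predicted, pos_label="grain"))
print(recall_score(test_labels, predicted, pos_label="grain"))
print(fbeta_score(test_labels, predicted, beta=1.0, pos_label="grain"))
```

The point of the framework described in the thesis is that each of these stages is a pluggable class, so a developer can swap in a different tokenizer, feature selector or learner without rewriting the surrounding application.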
APA, Harvard, Vancouver, ISO, and other styles
40

Mills, Jon. "Computer assisted lemmatisation of a Cornish text corpus for lexicographical purposes." Thesis, University of Kent, 2002. http://kar.kent.ac.uk/8301/.

Full text
Abstract:
This project sets out to discover and develop techniques for the lemmatisation of a historical corpus of the Cornish language, so that a lemmatised dictionary macrostructure can be generated from the corpus. The system should be capable of uniquely identifying every lexical item attested in the corpus. A survey of published and unpublished Cornish dictionaries, glossaries and lexicographical notes was carried out. A corpus was compiled, incorporating specially prepared new critical editions. An investigation into the history of Cornish lemmatisation was undertaken. A systemic description of Cornish inflection was written. Three methods of corpus lemmatisation were trialled. The findings were as follows. Lexicographical history shapes current Cornish lexicographical practice. Lexicon-based tokenisation has advantages over character-based tokenisation. System networks provide the means to generate base forms from attested word types. Grammatical difference is the most reliable way of disambiguating homographs. A lemma comprising three fields (the canonical form, the part of speech and a semantic field label) provides a unique code for every lexeme attested in the corpus. Programs which involve human interaction during the lemmatisation process allow bootstrapping of the lemmatisation database. Computerised morphological processing may be used at least partially to create the lemmatisation database. Disambiguation of at least some of the most common homographs may be automated by the use of computer programs.
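The three-field lemma described above can be pictured as a simple keyed database. The sketch below is an illustration only, not the project's software; the Cornish-like forms and glosses are placeholders.

```python
from collections import defaultdict

# A lemma is the triple (canonical form, part of speech, semantic field label);
# the triple serves as a unique key for every lexeme attested in the corpus.
# The entries below are invented placeholders, not genuine Cornish data.
Lemma = tuple  # (canonical_form, pos, semantic_field)

lemma_db: dict[Lemma, set[str]] = defaultdict(set)

def record(word_type: str, canonical: str, pos: str, field: str) -> None:
    """Map an attested word type onto its lemma, bootstrapping the database."""
    lemma_db[(canonical, pos, field)].add(word_type)

# Homographs are disambiguated by grammatical difference: the same surface form
# with a different part of speech receives a different lemma key.
record("gwel", "gwel", "NOUN", "PERCEPTION")    # illustrative noun reading
record("gwel", "gweles", "VERB", "PERCEPTION")  # illustrative verbal reading

for lemma, attested in lemma_db.items():
    print(lemma, sorted(attested))
```

A human-in-the-loop program of the kind the abstract mentions would query the user only when the grammatical context does not settle which lemma key an attested form belongs to, and would store the decision so that later occurrences are handled automatically.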
APA, Harvard, Vancouver, ISO, and other styles
41

Micallef, Paul. "A text to speech synthesis system for Maltese." Thesis, University of Surrey, 1997. http://epubs.surrey.ac.uk/842702/.

Full text
Abstract:
The subject of this thesis covers a considerably varied multidisciplinary area which needs to be addressed in order to achieve a high-quality text-to-speech synthesis system in any language. This is the first time that such a system has been built for Maltese, and there was therefore the additional problem of having no computerised sources or corpora. However, many problems, and much of the system design, are common to all languages. This thesis focuses on two general problems. The first is the automatic labelling of phonemic data, since this is crucial for setting up Maltese speech corpora, which in turn can be used to improve the system. A novel way of achieving such automatic segmentation was investigated. This uses a mixed parameter model with maximum likelihood training of the first derivative of the features across a set of phonetic class boundaries. It was found to give good results even for continuous speech, provided that a phonemic labelling of the text is available. A second general problem is that of segment concatenation, since the end and beginning of subsequent diphones can have mismatches in amplitude, frequency, phase and spectral envelope. The use of intermediate frames, built up from the last and first frames of two concatenated diphones, to achieve smoother continuity was analysed, both in time and in frequency. The use of wavelet theory for separating the spectral envelope from the excitation was also investigated. The linguistic system modules were built for this thesis. In particular, a rule-based grapheme-to-phoneme conversion system that is serial rather than hierarchical was developed. The morphological analysis required the design of a system which allows two dissimilar lexical structures (Semitic and Romance) to be integrated into one overall morphological analyser. Appendices with the detailed rules of the linguistic modules are included. The present system, while giving satisfactory intelligibility and the capability of modifying duration, does not yet include a prosodic module.
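A serial (ordered, non-hierarchical) grapheme-to-phoneme rule system of the kind mentioned above can be sketched as a list of rewrite rules applied in a fixed order. The rules below are toy examples for illustration, not the Maltese rule set developed in the thesis.

```python
import re

# Ordered, serial rewrite rules: each rule is applied to the whole string in turn,
# so later rules see the output of earlier ones. These are toy rules only.
G2P_RULES = [
    (r"għ", "ˤ"),    # the digraph is rewritten before any later rule can touch its letters
    (r"ie", "iː"),
    (r"x",  "ʃ"),
    (r"ċ",  "tʃ"),
    (r"z",  "ts"),
]

def graphemes_to_phonemes(word: str) -> str:
    """Serial application of grapheme-to-phoneme rules (fixed order, no hierarchy)."""
    out = word.lower()
    for pattern, replacement in G2P_RULES:
        out = re.sub(pattern, replacement, out)
    return out

print(graphemes_to_phonemes("xitwa"))   # 'ʃitwa' with these toy rules
```

The serial design means rule ordering carries the disambiguation work that a hierarchical system would express through nested contexts.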
APA, Harvard, Vancouver, ISO, and other styles
42

Forsyth, Richard. "Stylistic structures : a computational approach to text classification." Thesis, University of Nottingham, 1996. http://eprints.nottingham.ac.uk/13445/.

Full text
Abstract:
The problem of authorship attribution has received attention both in the academic world (e.g. did Shakespeare or Marlowe write Edward III?) and outside (e.g. is this confession really the words of the accused or was it made up by someone else?). Previous studies by statisticians and literary scholars have sought "verbal habits" that characterize particular authors consistently. By and large, this has meant looking for distinctive rates of usage of specific marker words -- as in the classic study by Mosteller and Wallace of the Federalist Papers. The present study is based on the premiss that authorship attribution is just one type of text classification and that advances in this area can be made by applying and adapting techniques from the field of machine learning. Five different trainable text-classification systems are described, which differ from current stylometric practice in a number of ways, in particular by using a wider variety of marker patterns than customary and by seeking such markers automatically, without being told what to look for. A comparison of the strengths and weaknesses of these systems, when tested on a representative range of text-classification problems, confirms the importance of paying more attention than usual to alternative methods of representing distinctive differences between types of text. The thesis concludes with suggestions on how to make further progress towards the goal of a fully automatic, trainable text-classification system.
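The "marker word" baseline that this thesis moves beyond can be illustrated in a few lines: count the relative rate of a fixed list of candidate markers per thousand tokens and compare the resulting profiles across texts. The marker list and the text snippet below are invented for illustration.

```python
import re
from collections import Counter

# Classic stylometry: relative rates of a fixed set of marker words per text.
# The marker list and the snippet below are illustrative only.
MARKERS = ["upon", "whilst", "while", "on", "by", "to"]

def marker_rates(text: str, per: int = 1000) -> dict[str, float]:
    """Occurrences of each marker word per `per` tokens of running text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {m: per * counts[m] / len(tokens) for m in MARKERS}

disputed = "Upon the whole, the power vested by the plan is no greater than before."
print(marker_rates(disputed))
```

The systems described in the thesis differ from this baseline by searching automatically for a much wider variety of distinctive patterns rather than being told which markers to count.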
APA, Harvard, Vancouver, ISO, and other styles
43

Delisle, Sylvain. "Text processing without a priori domain knowledge: Semi-automatic linguistic analysis for incremental knowledge acquisition." Thesis, University of Ottawa (Canada), 1994. http://hdl.handle.net/10393/6574.

Full text
Abstract:
Technical texts are an invaluable source of the domain-specific knowledge which plays a crucial role in advanced knowledge-based systems today. However, acquiring such knowledge has always been a major difficulty in the construction of these systems--this critical obstacle is sometimes referred to as the "knowledge acquisition bottleneck". In order to lessen the burden on the knowledge engineer's shoulders, several approaches have been proposed in the literature. A few of these suggest processing texts pertaining to the domain of interest in order to extract the knowledge they contain and thus facilitate the domain modelling. We herein propose a new approach to knowledge acquisition from texts; this approach is comprised of a new methodology and computational framework for the implementation of a linguistic processor which represents the central component of a system for the acquisition of knowledge from text. The system, named TANKA, is not given the complete domain model beforehand. It is designed to process technical texts in order to incrementally build a knowledge base containing a conceptual model of the domain. TANKA is an intelligent assistant to the knowledge engineer; when it cannot proceed entirely on its own, the user is asked to collaborate. In the process, the system acquires knowledge from text; it can be said to learn about the domain. The originality of the research is due mainly to the fact that we do not assume significant a priori domain-specific (semantic) knowledge: this assumption represents a severe constraint on the natural language processor. The only external elements of knowledge we consider in the proposed framework are "off-the-shelf" publicly available and domain-independent repositories, such as a basic dictionary containing surface syntactic information (i.e. The Collins) and a lexical database (i.e. WordNet). Other components of the proposed framework are general-purpose. The parser (DIPETT) is domain-independent with a large coverage of English: our approach relies on full syntactic analysis. The Case-based semantic analyzer (HAIKU) is semi-automatic: it interacts with the user in order to get his approval of the analysis it has just proposed and negotiates refined elements of the analysis when necessary. The combined processing of DIPETT and HAIKU allows TANKA, the encompassing system, to acquire knowledge, based on the conceptual elements produced by HAIKU. The thesis also describes experiments that have been conducted on a Prolog implementation of both of these text analysis components. The approach presented in the thesis is general and in principle portable to any domain in which suitable technical texts are available. The thesis presents theoretical considerations as well as engineering aspects of the many facets of this research work. We also provide a detailed discussion of many future work items that could be added to what has already been accomplished in order to make the framework even more productive. (Abstract shortened by UMI.) Footnotes: (1) In order to lighten the text, the terms 'he' and 'his' have been used generically to refer equally to persons of either sex; no discrimination is either implied or intended. (2) DIPETT and HAIKU constitute a conceptual analyzer that can be used independently of TANKA or within a different encompassing system.
APA, Harvard, Vancouver, ISO, and other styles
44

Xu, Jingguo. "A study of the reading process in Chinese through detecting errors in a meaningful text." Diss., The University of Arizona, 1998. http://hdl.handle.net/10150/282855.

Full text
Abstract:
The Goodman Reading Model differs from the word recognition model on the issues of (a) whether reading depends on perception of every single word; (b) whether prediction is used in the reading process; and (c) whether reading comprehension depends on individual words. The study tested the validity of the two models by investigating the reading process in Chinese through error detection. Two hundred subjects, with equal numbers of college and middle school students, participated in the experiment. The subjects at each educational level were randomly divided into error-focus and meaning-focus groups. The error-focus groups were instructed to search for errors embedded in a Chinese text, and the meaning-focus groups to read the same text for meaning, within a limited time. They were then asked to recall in writing the errors detected and the contents of the story, and to answer a questionnaire. After that they were given unlimited time to search for as many errors as they could. The main results showed that (a) all subjects failed to detect half of the errors under limited exposure, and failed to detect all of the errors even under unlimited exposure; (b) the error-focus subjects detected significantly more errors than the meaning-focus subjects under limited exposure, but the meaning-focus subjects scored significantly higher than the error-focus subjects in recall of the story; (c) reading time made no significant difference to the number of errors detected, but it did to the scores for recall of the story; (d) the college subjects performed significantly better than the middle school subjects in error detection and reading comprehension; (e) more errors were detected in the content word category than in the function word category; and (f) some extralinguistic factors had an effect on task performance. The results suggest (a) that characters and/or words are not recognized in a linear process in reading; (b) that prediction is used under the influence of knowledge of various kinds; and (c) that reading comprehension employs words but does not depend on individual words. The Goodman Reading Model is validated and proved applicable to reading in Chinese.
APA, Harvard, Vancouver, ISO, and other styles
45

Plum, Guenter Arnold. "Text and Contextual Conditioning in Spoken English: A genre approach." Thesis, The University of Sydney, 1988. http://hdl.handle.net/2123/608.

Full text
Abstract:
This study brings together two approaches to linguistic variation, Hallidayan systemic-functional grammar and Labovian variation theory, and in doing so brings together a functional interpretation of language and its empirical investigation in its social context. The study reports on an empirical investigation of the concept of text. The investigation proceeds on the basis of a corpus of texts gathered in sociolinguistic interviews with fifty adult speakers of Australian English in Sydney. The total corpus accounted for in terms of text type or genre numbers 420 texts of varying length, 125 of which, produced in response to four narrative questions, are investigated in greater detail in respect both of the types of text they constitute as well as of some of their linguistic realisations. These largely narrative-type texts, which represent between two and three hours of spoken English and total approximately 53000 words, are presented in a second volume analysed in terms of their textual or generic structure as well as their realisation at the level of the clause complex. The study explores in some detail models of register and genre developed within systemic-functional linguistics, adopting a genre model developed by J.R. Martin and others working within his model which foregrounds the notion that all aspects of the system(s) involved are related to one another probabilistically. In order to investigate the concept of text in actual discourse under conditions which permit us to become sufficiently confident of our understanding of it to proceed to generalisations about text and its contextual conditioning in spoken discourse, we turn to Labovian methods of sociolinguistic inquiry, i.e. to quantitative methods or methods of quantifying linguistic choice. The study takes the sociolinguistic interview as pioneered by Labov in his study of phonological variation in New York City and develops it for the purpose of investigating textual variation. The question of methodology constitutes a substantial part of the study, contributing in the process to a much greater understanding of the very phenomenon of text in discourse, for example by addressing itself to the question of the feasibility of operationalising a concept of text in the context of spoken discourse. The narrative-type texts investigated in further detail were found to range on a continuum from most experientially-oriented texts such as procedure and recount at one end to the classic narrative of personal experience and anecdote to the increasingly interpersonally-oriented exemplum and observation, both of which become interpretative of the real world in contrast to the straightforwardly representational slant taken on the same experience by the more experientially-oriented texts. The explanation for the generic variation along this continuum must be sought in a system of generic choice which is essentially cultural. A quantitative analysis of clausal theme and clause complex-type relations was carried out, the latter by means of log-linear analysis, in order to investigate their correlation with generic structure. While it was possible to relate the choice of theme to the particular stages of generic structures, clause complex-type relations are chosen too infrequently to be related to stages and were thus related to genres as a whole. We find that while by and large the choice of theme correlates well with different generic stages, it only discriminates between different genres, i.e. 
generic structures in toto, for those genres which are maximally different. Similarly, investigating the two choices in the principal systems involved in the organisation of the clause complex, i.e. the choice of taxis (parataxis vs. hypotaxis) and the (grammatically independent) choice of logico-semantic relations (expansion vs. projection), we find that both those choices discriminate better between types more distant on a narrative continuum. The log-linear analysis of clause complex-type relations also permitted the investigation of the social characteristics of speakers. We found that the choice of logico-semantic relations correlates with genre and question, while the choice of taxis correlates with a speaker's sex and his membership of some social group (in addition to genre). Parataxis is favoured by men and by members of the group lowest in the social hierarchy. Age on the other hand is not significant in the choice of taxis at all. In other words, since social factors are clearly shown to be significant in the making of abstract grammatical choices where they cannot be explained in terms of the functional organisation of text, we conclude that social factors must be made part of a model of text in order to fully account for its contextual conditioning. The study demonstrates that an understanding of the linguistic properties of discourse requires empirical study and, conversely, that it is possible to study discourse empirically without relaxing the standards of scientific inquiry.
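The log-linear analysis referred to above models cell counts in a contingency table of categorical choices. A minimal sketch, assuming the pandas and statsmodels packages and using entirely invented counts, might look as follows; it is not the analysis carried out in the study.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical cell counts: clause-complex taxis choices cross-classified
# by speaker sex and genre. All figures are invented for illustration.
data = pd.DataFrame({
    "taxis": ["parataxis", "parataxis", "hypotaxis", "hypotaxis"] * 2,
    "sex":   ["m", "f", "m", "f"] * 2,
    "genre": ["narrative"] * 4 + ["recount"] * 4,
    "count": [62, 41, 35, 48, 50, 37, 28, 44],
})

# Log-linear model: Poisson regression on the cell counts, with interaction
# terms testing whether taxis varies with sex and with genre.
model = smf.glm("count ~ taxis * sex + taxis * genre", data=data,
                family=sm.families.Poisson()).fit()
print(model.summary())
```

Significant interaction terms in such a model correspond to the study's finding that the choice of taxis co-varies with a speaker's sex and social group as well as with genre.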
APA, Harvard, Vancouver, ISO, and other styles
47

Santos, Rodrigo Maia Theodoro dos. "Procedimentos e operações de reconstrução textual." Pontifícia Universidade Católica de São Paulo, 2012. https://tede2.pucsp.br/handle/handle/14248.

Full text
Abstract:
This thesis takes as its theme a study of procedures for textual revision which take as their basis, beyond grammatical criteria, the factors of textuality, in search of operational rearticulations in the process of reconstructing a text. The main objective is to identify, in the selected corpus, items that exhibit the possible methodological operations of revision. The intention is to develop a diagram of procedures to guide professionals who work with text. The theoretical discussion is based on the approach of Text Linguistics and on the criteria of textuality, identified as the main factors responsible for articulating the text as a meaningful unit. From this perspective, the corpus of the thesis consists of academic abstracts of final-year undergraduate projects. This genre was chosen because the abstract is a succinct way of recovering a broader text, which characterizes a procedure of retextualization, so the search for procedures of textual reconstruction can be carried out with greater quality and clarity. The thesis shows, objectively and through examples, that the teacher or reviser needs to consider items that lie beyond the grammatical aspects. From the procedures and operations developed in the thesis, it became clear that adaptation to the genre and to the factors of textuality is fundamental to achieving competent text production.
APA, Harvard, Vancouver, ISO, and other styles
48

Al-Jubouri, Adnan J. R. "Computer-aided categorisation and quantification of connectives in English and Arabic (based on newspaper text corpora)." Thesis, Aston University, 1987. http://publications.aston.ac.uk/10283/.

Full text
Abstract:
This study presents a detailed contrastive description of the textual functioning of connectives in English and Arabic. Particular emphasis is placed on the organisational force of connectives and their role in sustaining cohesion. The description is intended as a contribution to a better understanding of the variations in the dominant tendencies for text organisation in each language. The findings are expected to be utilised for pedagogical purposes, particularly in improving EFL teaching of writing at the undergraduate level. The study is based on an empirical investigation of the phenomenon of connectivity and, for optimal efficiency, employs computer-aided procedures, particularly those adopted in corpus linguistics. One important methodological requirement was the establishment of two comparable and statistically adequate corpora, as well as the design of software, alongside existing packages, to achieve the basic analysis. Each corpus comprises ca 250,000 words of newspaper material, sampled in accordance with a specific set of criteria and assembled in machine-readable form prior to the computer-assisted analysis. A suite of programs was written in SPITBOL to accomplish a variety of analytical tasks, and in particular to perform a battery of measurements intended to quantify the textual functioning of connectives in each corpus. Concordances and some word lists were produced using OCP. The results of this research confirm the existence of fundamental differences in text organisation in Arabic in comparison to English. This manifests itself in the way textual operations of grouping and sequencing are performed, in the intensity of the textual role of connectives in imposing linearity and continuity, and in maintaining overall stability. Furthermore, computation of connective functionality and range of operationality has identified fundamental differences in the way favourable choices for text organisation are made and implemented.
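One of the simpler measurements mentioned above, the overall density of connectives in a corpus, can be sketched in a few lines of modern Python. The original analysis was done with purpose-written SPITBOL programs and OCP concordances; the tiny connective inventories below are illustrative only, and in practice Arabic clitic connectives would need morphological segmentation first.

```python
import re
from collections import Counter

# A tiny illustrative connective inventory for each language; the thesis works
# with a much fuller inventory and two ~250,000-word newspaper corpora.
CONNECTIVES = {
    "english": {"however", "therefore", "moreover", "thus", "and", "but"},
    "arabic":  {"و", "ف", "لكن", "إذن", "كذلك"},
}

def connective_density(tokens: list[str], language: str, per: int = 1000) -> float:
    """Connective occurrences per `per` running words."""
    inventory = CONNECTIVES[language]
    hits = sum(1 for t in tokens if t in inventory)
    return per * hits / len(tokens)

sample = re.findall(r"\w+", "The plan, however, failed; therefore a new one was drafted and approved.")
print(connective_density([t.lower() for t in sample], "english"))
```

Comparing such per-thousand-word rates across the two corpora is one way of making the claimed difference in the intensity of connective use concrete.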
APA, Harvard, Vancouver, ISO, and other styles
49

Cash, Cash Phillip E. "Timnakni Timat (writing from the heart): Sahaptin discourse and text in the speaker writing of Xiluxin." Thesis, The University of Arizona, 2000. http://hdl.handle.net/10150/278750.

Full text
Abstract:
The unique contributions of speaker scholarship to the study of Sahaptian languages in the Columbia Plateau have rarely been considered a domain of inquiry in the field of linguistics. In the present study, I utilize a discourse-centered approach to investigate the ways in which an indigenous language is employed as a resource in the creation of texts. I examine the status of Sahaptin language use in a series of unpublished texts produced by X̣ilux̣in (Charlie McKay, 1910–1996), a multilingual Sahaptin speaker and scholar from the Umatilla Indian Reservation of northeastern Oregon. I account for the merging of internal indigenous linguistic forms with writing in two occurrences: language documentation and individual expression. The study found that, when a Sahaptin speaker writer transfers his or her internalized language to the written form, Sahaptin discourse and world view play a key role in its outcome.
APA, Harvard, Vancouver, ISO, and other styles
50

Rennes, Evelina. "Improved Automatic Text Simplification by Manual Training." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-120001.

Full text
Abstract:
The purpose of this thesis was the further development of a rule set used in an automatic text simplification system, and the exploration of whether it is possible to improve the performance of a rule-based text simplification system by manual training. A first rule set was developed from a thorough literature review, and the rules were refined by manually adapting this first rule set to a set of training texts. When no more changes were being added to the set of rules, the training was considered complete, and the two rule sets were applied to a test set for evaluation. The thesis evaluated the performance of the text simplification system as a classification task, using the objective metrics of precision and recall. The comparison of the rule sets revealed a clear improvement of the system: precision increased from 45% to 82%, and recall increased from 37% to 53%. Both recall and precision were improved after training for the majority of the rules, with a few exceptions. All rule types resulted in a higher score on correctness for R2. Automatic text simplification systems targeting real-life readers need to account for qualitative aspects, which have not been considered in this thesis. Future evaluation should, in addition to quantitative metrics such as precision, recall, and complexity metrics, also account for the experience of the reader.
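Since the evaluation treats rule applications as a classification task, precision and recall reduce to simple ratios over correct, spurious and missed applications. The sketch below uses invented counts, chosen only so that they reproduce the reported 82% precision and 53% recall.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Precision and recall for rule applications treated as a classification task."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Invented counts: 53 correct applications, 12 spurious ones, and 47 places
# where a rule should have fired but did not.
p, r = precision_recall(53, 12, 47)
print(f"precision={p:.0%} recall={r:.0%}")  # precision=82% recall=53%
```

Qualitative aspects of simplification, as the abstract notes, are not captured by these two figures and call for reader-based evaluation.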
APA, Harvard, Vancouver, ISO, and other styles