Dissertations / Theses on the topic 'Discourse analysis, Literary – Data processing'




Consult the top 18 dissertations / theses for your research on the topic 'Discourse analysis, Literary – Data processing.'




1

李嘉雯 and Ka-man Carmen Lee. "Chinese and English computer-mediated communication in the context of New Literacy Studies." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2002. http://hub.hku.hk/bib/B29872959.

2

Stephens, Maegan R. "A computerized content analysis of Oprah Winfrey's discourse during the James Frey controversy." Virtual Press, 2008. http://liblink.bsu.edu/uhtbin/catkey/1397651.

Abstract:
This analysis utilizes the computer-based content analysis program DICTION to gain a better understanding of Oprah Winfrey's specific discourse types (praise, blame, and standard) and her language surrounding the James Frey controversy. Grounded in Social Influence Theory, this thesis argues that it is important to understand the language styles of such a significant rhetor because she has the potential to influence the public. The findings indicate that Oprah's discourse types differ in the level of Optimism her language represents, and that the two episodes of The Oprah Winfrey Show relating to the controversy differ in terms of Certainty. This thesis also provides a new application of the program DICTION, and the implications of such procedures are discussed.
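DICTION's variables are computed from proprietary word lists, so they cannot be reproduced here; the sketch below only illustrates the general dictionary-count idea behind such tools. The mini-lexicons and the per-100-words normalization are invented stand-ins, not DICTION's actual variables or formulas.

```python
from collections import Counter

# Hypothetical mini-lexicons; DICTION's actual word lists are proprietary.
PRAISE = {"wonderful", "brave", "honest", "inspiring"}
BLAME = {"lie", "betray", "shame", "deceive"}

def tone_scores(text):
    """Count praise/blame lexicon hits per 100 words (a generic
    dictionary-count measure, not DICTION's exact formula)."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    counts = Counter(words)
    n = max(len(words), 1)
    praise = sum(counts[w] for w in PRAISE) / n * 100
    blame = sum(counts[w] for w in BLAME) / n * 100
    return {"praise": round(praise, 2), "blame": round(blame, 2)}

print(tone_scores("You were brave and honest, but the book was a lie."))
```

Comparing such scores across transcripts (e.g., praise-heavy versus blame-heavy episodes) is the basic move behind dictionary-based content analysis.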
Department of Communication Studies
3

Caines, Andrew Paul. "You talking to me? : zero auxiliary constructions in British English." Thesis, University of Cambridge, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.609153.

4

Paterson, Kimberly Laurel. "TSPOONS: Tracking Salience Profiles Of Online News Stories." DigitalCommons@CalPoly, 2014. https://digitalcommons.calpoly.edu/theses/1222.

Abstract:
News space is a relatively nebulous term describing the general discourse concerning events that affect the populace. Past research has focused on qualitatively analyzing news space in an attempt to answer big questions about how the populace relates and responds to the news. We want to ask: when do stories begin? Which stories stand out among the noise? To answer the big questions about news space, we need to track the course of individual stories in the news. By analyzing the specific articles that comprise stories, we can synthesize information from several stories into a more complete picture of the discourse. The individual articles, the groups of articles that become stories, and the overall themes that connect stories together all complete the narrative about what is happening in society. TSPOONS provides a framework for analyzing news stories and answering two main questions: what were the important stories during a given time frame, and what were the important stories involving a given topic? Drawing technical news stories from Techmeme.com, TSPOONS generates a profile of each news story, quantitatively measuring the importance, or salience, of news stories as well as quantifying the impact of these stories over time.
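The notion of a salience profile can be illustrated as per-story article counts over time. The story clustering is assumed already done, and the toy article stream and peak-count salience score below are illustrative assumptions, not TSPOONS's actual metric.

```python
from collections import defaultdict
from datetime import date

# Toy article stream: (story_id, publication_date). In a real pipeline these
# would be news articles already clustered into stories.
articles = [
    ("acquisition", date(2014, 3, 1)),
    ("acquisition", date(2014, 3, 1)),
    ("acquisition", date(2014, 3, 2)),
    ("outage", date(2014, 3, 1)),
]

def salience_profiles(stream):
    """Daily article counts per story; peak daily count as a crude salience score."""
    daily = defaultdict(lambda: defaultdict(int))
    for story, day in stream:
        daily[story][day] += 1
    return {s: {"profile": dict(d), "salience": max(d.values())}
            for s, d in daily.items()}

profiles = salience_profiles(articles)
print(profiles["acquisition"]["salience"])  # peak daily coverage
```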
5

Rickly, Rebecca J. "Exploring the dimensions of discourse : a multi-model analysis of electronic and oral discussions in developmental English." Virtual Press, 1995. http://liblink.bsu.edu/uhtbin/catkey/1001179.

Abstract:
This study investigated participation levels of developmental writing students in oral discussions and electronic discussions using the synchronous conferencing software InterChange. The study used a combination of quantitative and qualitative methods in a naturalistic/quasi-experimental design under a social constructivist epistemology. The methods included: word counts onto which biological sex and socially constructed gender (as measured by the Bem Sex-Role Inventory) were overlaid as variables; a modified taxonomy based on Bales' Conversational Analysis measure; a taxonomy which measured the direction of discourse; and "thick description" in the form of subjective reactions to videotaped oral discussions and transcribed electronic discussions. The multi-modal, descriptive findings indicate that students participate more frequently in electronic discussions; that subsequent oral classes take on participatory characteristics of an InterChange session; and that while the more frequent participation in InterChange discussions does appear to carry over into subsequent oral discussions, socially constructed variables such as gender may, in fact, lead students to participate less frequently in oral discussions after using InterChange. The findings also show that InterChange discussions are primarily student-centered: most of the responses generated are aimed at other students. In the oral classroom, very little student-to-student interaction occurs. The findings of this study indicate that while the computer environment may not promote egalitarian discourse, it does tend to produce more democratic discourse.
Department of English
6

Mazidi, Karen. "Infusing Automatic Question Generation with Natural Language Understanding." Thesis, University of North Texas, 2016. https://digital.library.unt.edu/ark:/67531/metadc955021/.

Abstract:
Automatically generating questions from text for educational purposes is an active research area in natural language processing. The automatic question generation system accompanying this dissertation is MARGE, a recursive acronym: MARGE Automatically Reads, Generates, and Evaluates. MARGE generates questions from both individual sentences and the passage as a whole, and is the first question generation system to successfully generate meaningful questions from textual units larger than a sentence. Prior work in automatic question generation from text treats a sentence as a string of constituents to be rearranged into as many questions as English grammar rules allow. Consequently, such systems overgenerate and create mainly trivial questions, and none of them to date has been able to automatically determine which questions are meaningful and which are trivial. This is because the research focus has been placed on natural language generation (NLG) at the expense of natural language understanding (NLU). In contrast, the work presented here infuses the question generation process with natural language understanding. From the input text, MARGE creates a meaning analysis representation for each sentence in a passage via the DeconStructure algorithm presented in this work. Questions are generated from these representations using templates, and the generated questions are automatically evaluated for quality and importance via a ranking algorithm.
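The template stage described above can be sketched as gap-fill templates instantiated from pre-parsed (subject, verb, object) triples. MARGE's actual DeconStructure representations and templates are far richer; the triple format and both templates here are hypothetical.

```python
# Hypothetical gap-fill templates over a (subject, verb, object) triple;
# a real system would choose templates based on a deeper meaning analysis.
TEMPLATES = [
    "{subject} {verb} ____.",   # object gap-fill
    "____ {verb} {object}.",    # subject gap-fill
]

def generate_questions(subject, verb, obj):
    """Instantiate every gap-fill template for one parsed triple."""
    return [t.format(subject=subject, verb=verb, object=obj)
            for t in TEMPLATES]

qs = generate_questions("the committee", "approved", "the budget")
print(qs[0])
```

Overgeneration is visible even at this scale: every triple yields every template, so a separate ranking step is needed to keep only the meaningful questions.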
7

Faruque, Md Ehsanul. "A Minimally Supervised Word Sense Disambiguation Algorithm Using Syntactic Dependencies and Semantic Generalizations." Thesis, University of North Texas, 2005. https://digital.library.unt.edu/ark:/67531/metadc4969/.

Abstract:
Natural language is inherently ambiguous. For example, the word "bank" can mean a financial institution or a river shore. Finding the correct meaning of a word in a particular context is a task known as word sense disambiguation (WSD), which is essential for many natural language processing applications such as machine translation, information retrieval, and others. While most current WSD methods try to disambiguate a small number of words for which enough annotated examples are available, the method proposed in this thesis attempts to address all words in unrestricted text. The method is based on constraints imposed by syntactic dependencies and concept generalizations drawn from an external dictionary. The method was tested on standard benchmarks as used during the SENSEVAL-2 and SENSEVAL-3 WSD international evaluation exercises, and was found to be competitive.
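A dictionary-based WSD step of this family can be illustrated with the classic gloss-overlap (Lesk) heuristic below. The thesis's actual algorithm additionally exploits syntactic dependencies and semantic generalizations, and the two-sense inventory here is invented for illustration.

```python
# A simplified Lesk-style gloss-overlap step, not the thesis's full method.
SENSES = {  # hypothetical mini sense inventory for "bank"
    "bank.finance": "an institution that accepts deposits and lends money",
    "bank.river": "sloping land beside a body of water such as a river",
}

def lesk(context, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    ctx = set(context.lower().split())
    def overlap(gloss):
        return len(ctx & set(gloss.split()))
    return max(senses, key=lambda s: overlap(senses[s]))

print(lesk("he sat on the bank of the river and watched the water", SENSES))
```

Because it needs only a dictionary and the surrounding words, a step like this scales to all words in unrestricted text, which is the setting the thesis targets.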
8

Sinha, Ravi Som. "Graph-based Centrality Algorithms for Unsupervised Word Sense Disambiguation." Thesis, University of North Texas, 2008. https://digital.library.unt.edu/ark:/67531/metadc9736/.

Abstract:
This thesis introduces a methodology combining traditional dictionary-based approaches to word sense disambiguation (semantic similarity measures and overlap of word glosses, both based on WordNet) with graph-based centrality methods, namely vertex degree, PageRank, closeness, and betweenness. The approach is completely unsupervised and is based on creating graphs for the words to be disambiguated. We experiment with several possible combinations of the semantic similarity measures as the first stage of our experiments. The next stage scores individual vertices in the graphs previously created using several graph connectivity measures. In the final stage, several voting schemes are applied to the results obtained from the different centrality algorithms. The most important contributions of this work are not only that the approach is novel and works well, but also that it has great potential for overcoming the knowledge-acquisition bottleneck which has apparently brought research in supervised WSD to a plateau. Research of the kind reported in this thesis, which does not require manually annotated data, holds considerable promise, and this work is one of the first steps, albeit a small one, in that direction. The complete system is built and tested on standard benchmarks, and is comparable with work done on graph-based word sense disambiguation as well as lexical chains. The evaluation indicates that the right combination of the above-mentioned metrics can be used to develop an unsupervised disambiguation engine as powerful as the state of the art in WSD.
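The graph stage can be sketched with the simplest of the centrality measures named above, weighted degree: build a graph whose vertices are candidate senses, connect senses of different words by their similarity, and pick the highest-scoring sense per word. The candidate senses and pairwise similarities below are invented placeholders for the WordNet-based measures the thesis combines.

```python
from itertools import combinations

# Hypothetical candidate senses per target word, with assumed similarities.
candidates = {"bank": ["bank.finance", "bank.river"],
              "water": ["water.liquid"],
              "shore": ["shore.land"]}
similarity = {frozenset(p): s for p, s in [
    (("bank.river", "water.liquid"), 0.8),
    (("bank.river", "shore.land"), 0.9),
    (("bank.finance", "water.liquid"), 0.1),
    (("bank.finance", "shore.land"), 0.1),
    (("water.liquid", "shore.land"), 0.7),
]}

def degree_scores(candidates, similarity, threshold=0.5):
    """Weighted-degree centrality: sum of above-threshold edge weights
    per sense (pairs with no listed similarity default to 0)."""
    senses = [s for ss in candidates.values() for s in ss]
    score = {s: 0.0 for s in senses}
    for a, b in combinations(senses, 2):
        w = similarity.get(frozenset((a, b)), 0.0)
        if w >= threshold:
            score[a] += w
            score[b] += w
    return score

scores = degree_scores(candidates, similarity)
best = max(candidates["bank"], key=scores.get)
print(best)
```

Swapping in PageRank, closeness, or betweenness changes only the scoring function; the voting stage then combines the rankings the different measures produce.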
9

Silveira, Gabriela. "Narrativas produzidas por indivíduos afásicos e indivíduos cognitivamente sadios: análise computadorizada de macro e micro estrutura." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/5/5170/tde-01112018-101055/.

Abstract:
INTRODUCTION: The topic under investigation, the discourse of aphasic individuals, provides important information about the phonological, morphological, syntactic, semantic and pragmatic aspects of the language of patients who have suffered a stroke. One way to study discourse is through simple or sequenced thematic pictures. The "Cinderella" picture sequence is frequently used in studies because it is familiar throughout the world, which favours cross-cultural studies, and because it induces the production of narratives rather than the descriptions usually obtained when a single picture is used to elicit discourse. Another advantage of the "Cinderella" sequence is that it generates enough linguistic material for detailed analysis. OBJECTIVES: (1) to analyze, by means of computerized technologies, macro- and microstructural aspects of the discourse of cognitively healthy individuals, individuals with Broca's aphasia and individuals with anomic aphasia; (2) to explore discourse as an indicator of the evolution of aphasia; (3) to analyze the contribution of SPECT, alongside discourse, in verifying the evolution of aphasia. METHOD: The study included eight individuals with Broca's or anomic aphasia who formed the longitudinal study group (G1), 15 individuals with Broca's or anomic aphasia who formed the other study group (G2), and 30 cognitively healthy participants (GC). Participants were asked to examine the pictures of the "Cinderella" story and then retell the story in their own words. Computerized technologies were explored, and macro- and microstructural aspects of the produced discourses were analyzed. For G1 in particular, discourse was also collected with the "Cookie Theft" picture, the SPECT exam was analyzed, and participants were followed longitudinally for a period of six months.
RESULTS: Comparing GC and G2 with respect to macrostructure, the aphasic individuals in G2 differed significantly from GC in all propositions; with respect to microstructure, seven metrics were able to differentiate the two groups. There were significant macro- and microstructural differences between the Broca's and anomic aphasic subjects. Differences in macro- and microstructure measures were observed in G1 as time after stroke advanced. The "Cinderella" story provided more complete microstructure data than the "Cookie Theft" picture. The SPECT results remained the same, showing no change with the evolution of the aphasia. CONCLUSION: Narrative production generated material for macrostructure and microstructure analysis, and both macro- and microstructural aspects differentiated cognitively healthy individuals from aphasic subjects. Analysis of the "Cinderella" discourse served as an instrument for measuring the language improvement of aphasic subjects. The computational tool aided the discourse analyses.
INTRODUCTION: Aphasic discourse analysis provides important information about the phonological, morphological, syntactic, semantic and pragmatic aspects of the language of patients who have suffered a stroke. The evaluation of discourse, along with other methods, can contribute to observing the evolution of the language and communication of aphasic patients; however, manual analysis is laborious and can lead to errors. OBJECTIVES: (1) to analyze, by computerized technologies, macro- and microstructural aspects of the discourse of cognitively healthy individuals, Broca's aphasics and anomic aphasics; (2) to explore discourse as an indicator of the evolution of aphasia; (3) to analyze the contribution of single photon emission computed tomography (SPECT) to verifying the correlation between behavioral and neuroimaging evolution data. METHOD: Two groups of patients were studied: GA1, consisting of eight individuals with Broca's or anomic aphasia, who were analyzed longitudinally from the sub-acute phase of the lesion and after three and six months; GA2, composed of 15 individuals with Broca's or anomic aphasia with varying times since stroke; and GC, consisting of 30 cognitively healthy participants. Computerized technologies were explored for the analysis of metrics related to the micro- and macrostructure of discourses elicited from the Cinderella story and the Cookie Theft picture. RESULTS: Comparing GC and GA2 in relation to discourse macrostructure, the GA2 aphasics differed significantly from GC in the total number of propositions emitted; considering the microstructure, seven metrics differentiated the two groups. There was a significant difference in macro- and microstructure between the discourses of Broca's aphasic subjects and anomic ones. It was possible to verify differences in macro- and microstructure measurements in GA1 as time after injury advanced.
In GA1, the comparison between parameters in the sub-acute phase and after six months of stroke revealed differences in macrostructure: an increase in the number of propositions of the orientation block and in the total number of propositions. Regarding the microstructure, the initial measures of syllables per content word, incidence of nouns and incidence of content words differed after six months of intervention. The variable incidence of words missing from the dictionary showed a significantly lower value after three months of stroke. The Cinderella story provided more complete microstructure data than the Cookie Theft picture. There was no change in SPECT over time, with no demonstrated change with the evolution of the aphasia. CONCLUSION: The discourse produced from the Cinderella story and the Cookie Theft picture generated material for macrostructure and microstructure analysis of cognitively healthy and aphasic individuals, made it possible to quantify and qualify the evolution of language in different phases of stroke recovery, and distinguished the behavior of healthy individuals from those with Broca's and anomic aphasia in macro- and microstructure aspects. The exploration of computerized tools facilitated the analysis of the data in relation to the microstructure, but it was not applicable to the macrostructure, demonstrating that tool adjustments are needed for the discourse analysis of patients. SPECT data did not reflect the behavioral improvement of the language of aphasic subjects.
10

Pienaar, Cheryl Leelavathie. "Towards a corpus of Indian South African English (ISAE) : an investigation of lexical and syntactic features in a spoken corpus of contemporary ISAE." Thesis, Rhodes University, 2008. http://hdl.handle.net/10962/d1002640.

Abstract:
There is consensus among scholars that there is not just one English language but a family of “World Englishes”. The umbrella-term “World Englishes” provides a conceptual framework to accommodate the different varieties of English that have evolved as a result of the linguistic cross-fertilization attendant upon colonization, migration, trade and transplantation of the original “strain” or variety. Various theoretical models have emerged in an attempt to understand and classify the extant and emerging varieties of this global language. The hierarchically based model of English, which classifies world English as “First Language”, “Second Language” and “Foreign Language”, has been challenged by more equitably-conceived models which refer to the emerging varieties as New Englishes. The situation in a country such as multi-lingual South Africa is a complex one: there are 11 official languages, one of which is English. However the English used in South Africa (or “South African English”), is not a homogeneous variety, since its speakers include those for whom it is a first language, those for whom it is an additional language and those for whom it is a replacement language. The Indian population in South Africa are amongst the latter group, as theirs is a case where English has ousted the traditional Indian languages and become a de facto first language, which has retained strong community resonances. This study was undertaken using the methodology of corpus linguistics to initiate the creation of a repository of linguistic evidence (or corpus), of Indian South African English, a sub-variety of South African English (Mesthrie 1992b, 1996, 2002). Although small (approximately 60 000 words), and representing a narrow age band of young adults, the resulting corpus of spoken data confirmed the existence of robust features identified in prior research into the sub-variety. 
These features include the use of ‘y’all’ as a second-person plural pronoun, the use of ‘but’ in sentence-final position, and ‘lakker’ /ˈlʌkə/ as a pronunciation variant of ‘lekker’ (meaning ‘good’, ‘nice’ or ‘great’). An examination of lexical frequency lists revealed examples of general South African English such as the colloquially pervasive ‘ja’, ‘bladdy’ (for ‘bloody’) and ‘jol(ling)’ (for partying or enjoying oneself), together with neologisms such as ‘eish’, the latter previously associated with speakers of Black South African English. The frequency lists facilitated cross-corpora comparisons with data from the British National Corpus and the Corpus of London Teenage Language, and similarities and differences were noted and discussed. The study also used discourse analysis frameworks to investigate the role of high-frequency lexical items such as ‘like’ in the data. In recent times ‘like’ has emerged globally as a lexicalized discourse marker, and its appearance in the corpus of Indian South African English confirms this trend. The corpus built as part of this study is intended as the first building block towards a full corpus of Indian South African English which could serve as a standard for referencing research into the sub-variety. Ultimately, it is argued that the establishment of similar corpora of other known sub-varieties of South African English could contribute towards the creation of a truly representative large corpus of South African English and a more nuanced understanding and definition of this important variety of World English.
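The frequency lists behind these cross-corpora comparisons can be sketched as a simple ranked word count. The toy utterance below is invented, merely echoing features the study reports (‘ja’, ‘jol’, ‘lakker’); a real corpus study would tokenize transcribed speech and normalize counts before comparing corpora of different sizes.

```python
from collections import Counter

def freq_list(tokens, top=5):
    """Ranked word-frequency list, the basic unit compared across corpora."""
    return Counter(tokens).most_common(top)

# Invented toy utterance echoing features the study reports.
isae = "ja the jol was lakker ja y'all must come to the jol".split()
print(freq_list(isae, top=3))
```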
11

Barakat, Arian. "What makes an (audio)book popular?" Thesis, Linköpings universitet, Statistik och maskininlärning, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-152871.

Abstract:
Audiobook reading has traditionally been used for educational purposes but has in recent times grown into a popular alternative to more traditional means of consuming literature. In order to differentiate themselves from other players in the market, but also to provide their users with enjoyable literature, several audiobook companies have lately directed their efforts toward producing their own content. Creating highly rated content is, however, no easy task, and one recurring challenge is how to craft a bestselling story. In an attempt to identify latent features shared by successful audiobooks and to evaluate proposed methods for literary quantification, this thesis employs an array of frameworks from statistics, machine learning and natural language processing on data and literature provided by Storytel, Sweden's largest audiobook company. We analyze and identify important features from a collection of 3077 Swedish books concerning their promotional and literary success. By considering features from the aspects Metadata, Theme, Plot, Style and Readability, we found that popular books are typically published as part of a book series, cover 1-3 central topics, and write about, e.g., daughter-mother relationships and human closeness, but also that they hold, on average, a higher proportion of verbs and a lower degree of short words. Despite successfully identifying these and other factors, we recognized that none of our models predicted "bestseller" adequately, and that future work may wish to study additional factors, employ other models, or even use different metrics to define and measure popularity. From our evaluation of the literary quantification methods, namely topic modeling and narrative approximation, we found that these methods are in general suitable for Swedish texts but that they require further improvement and experimentation before they can be successfully deployed for Swedish literature.
For topic modeling, we recognized that using nouns alone produced more interpretable topics and that including character names tended to pollute the topics. We also identified and discussed the possible problem of word inflections when modeling topics for morphologically complex languages, and noted that additional preprocessing treatments such as word lemmatization or post-training text normalization may improve the quality and interpretability of topics. For the narrative approximation, we discovered that the method currently suffers from three shortcomings: (1) unreliable sentence segmentation, (2) unsatisfactory dictionary-based sentiment analysis and (3) the possible loss of sentiment information induced by translation. Despite examining only a handful of literary works, we further found that books originally written in Swedish had narratives that were more cross-language consistent than books written in English and then translated into Swedish.
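The narrative-approximation method critiqued above is, at its core, a dictionary-based sentiment arc: score each sentence against a lexicon, then smooth the sequence. The five-word lexicon, the pre-split sentences, and the moving-average window below are all assumptions, not the exact method the thesis evaluates, but the sketch shows why shortcomings (1) and (2) matter: bad sentence boundaries or a thin lexicon distort every point of the arc.

```python
# Hypothetical tiny sentiment lexicon; real lexicons have thousands of entries.
LEXICON = {"happy": 1, "love": 1, "dark": -1, "death": -1, "hope": 1}

def sentiment_arc(sentences, window=2):
    """Per-sentence lexicon scores, smoothed with a trailing moving average."""
    raw = [sum(LEXICON.get(w.lower().strip(".,"), 0) for w in s.split())
           for s in sentences]
    arc = []
    for i in range(len(raw)):
        lo = max(0, i - window + 1)
        chunk = raw[lo:i + 1]
        arc.append(sum(chunk) / len(chunk))
    return arc

story = ["They were happy.", "Then came the dark.", "Death followed.",
         "At last, hope."]
print(sentiment_arc(story))
```

Comparing the arc of an original text with the arc of its translation is one way to probe shortcoming (3), the loss of sentiment information under translation.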
12

López, del Castillo Wilderbeek Francisco Leslie. "El Discurso social en España." Doctoral thesis, Universitat Pompeu Fabra, 2018. http://hdl.handle.net/10803/663746.

Abstract:
This research set out to understand the entire discursive flow of a society at a given time. The effort takes as its reference the work of the historian of ideas Marc Angenot, who in El discurso social (2010) aimed to interpret all the written material of the France of 1889. The scale of contemporary content production has required extending the discursive fields in which the voice of a society is observed, from mass media to social media. This quantitative change has entailed applying diverse methodologies, from models for processing large volumes of text to the semio-discursive analysis of significant samples obtained computationally. The final result, both in its conclusions and in the path taken, represents an attempt to translate a concept with broad limits into a defined and fruitful method.
13

Oyerinde, Oyeyinka Dantala. "Creating public value in information and communication technology: a learning analytics approach." Thesis, 2019. http://hdl.handle.net/10500/26446.

Abstract:
This thesis contributes to the ongoing global discourse in ICT4D on ICT and its effect on socio-economic development, in both theory and practice. The thesis comprises five studies, presented in chapters 5 to 9, and employs a Mixed Methods research methodology within the Critical Realist epistemological perspective in Information Systems research. Studies 1-4 employ different quantitative research and analytical methods, while study 5 employs a qualitative research and analytical method. Study 1 proposes and operationalizes a predictive analytics framework in Learning Analytics, using a case study of the Computer Science Department of the University of Jos, Nigeria. Multiple Linear Regression was used with the aid of the Statistical Package for the Social Sciences (SPSS) analysis tool, and statistical hypothesis testing was then used to validate the model at a 5% level of significance. Results show how predictive learning analytics can be successfully operationalized and used to predict students' academic performance. Study 2 investigates the relative efficiency of ICT infrastructure utilization with respect to the educational component of the Human Development Index (HDI). A novel conceptual model is proposed, and the Data Envelopment Analysis (DEA) methodology is used to measure the relative efficiency of the components of ICT infrastructure (inputs) and the components of education (outputs). Ordinary Least Squares (OLS) regression analysis is used to determine the effect of ICT infrastructure on educational attainment and adult literacy rates. Results show a strong positive effect of ICT infrastructure on educational attainment and adult literacy rates, a strong correlation between this infrastructure and literacy rates, and theoretical support for the argument that increasing ICT infrastructure increases human development, especially in the educational context.
Study 3 examines the relative efficiency and productivity of ICT infrastructure utilization in education. The research employs Data Envelopment Analysis (DEA) and the Malmquist Index (MI), well-established non-parametric data analysis methodologies, applied to archival data on countries grouped into Arab States, Europe, Sub-Saharan Africa and World regions. OLS regression analysis is applied to determine the effect of ICT infrastructure on adult literacy rates. Findings show relatively efficient utilization and a steady increase in productivity across the regions, although only Europe and the Arab States are currently in a state of positive productivity growth; a strong positive effect of ICT infrastructure on adult literacy rates is also observed. Study 4 investigates the efficiency and productivity of ICT utilization in public value creation with respect to adult literacy rates, applying the same DEA and MI methodologies to the same regional groupings. Findings show relatively efficient utilization of ICT in public value creation but an average decline in productivity levels. Finally, Study 5 carries out a Critical Discourse Analysis (CDA) of the UNDP Human Development Research Reports from 2010-2016 to determine whether public value is created or derived from the policy directions being put forward and their subsequent implementations. The CDA is operationalized through Habermas' Theory of Communicative Action (TCA). Findings show that public value is indeed being created and lies at the core of the policy directions called for in these reports.
School of Computing
Ph.D. (Information Systems)
14

"Chinese readability analysis and its applications on the internet." 2007. http://library.cuhk.edu.hk/record=b5893108.

Abstract:
Lau Tak Pang.
Thesis submitted in: October 2006.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2007.
Includes bibliographical references (leaves 110-122).
Abstracts in English and Chinese.
Abstract --- p.i
Acknowledgement --- p.v
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Motivation and Major Contributions --- p.1
Chapter 1.1.1 --- Chinese Readability Analysis --- p.1
Chapter 1.1.2 --- Web Readability Analysis --- p.3
Chapter 1.2 --- Thesis Chapter Organization --- p.6
Chapter 2 --- Related Work --- p.7
Chapter 2.1 --- Readability Assessment --- p.7
Chapter 2.1.1 --- Assessment for Text Document --- p.8
Chapter 2.1.2 --- Assessment for Web Page --- p.13
Chapter 2.2 --- Support Vector Machine --- p.14
Chapter 2.2.1 --- Characteristics and Advantages --- p.14
Chapter 2.2.2 --- Applications --- p.16
Chapter 2.3 --- Chinese Word Segmentation --- p.16
Chapter 2.3.1 --- Difficulty in Chinese Word Segmentation --- p.16
Chapter 2.3.2 --- Approaches for Chinese Word Segmentation --- p.17
Chapter 3 --- Chinese Readability Analysis --- p.20
Chapter 3.1 --- Chinese Readability Factor Analysis --- p.20
Chapter 3.1.1 --- Systematic Analysis --- p.20
Chapter 3.1.2 --- Feature Extraction --- p.30
Chapter 3.1.3 --- Limitation of Our Analysis and Possible Extension --- p.32
Chapter 3.2 --- Research Methodology --- p.33
Chapter 3.2.1 --- Definition of Readability --- p.33
Chapter 3.2.2 --- Data Acquisition and Sampling --- p.34
Chapter 3.2.3 --- Text Processing and Feature Extraction --- p.35
Chapter 3.2.4 --- Regression Analysis using Support Vector Regression --- p.36
Chapter 3.2.5 --- Evaluation --- p.36
Chapter 3.3 --- Introduction to Support Vector Regression --- p.38
Chapter 3.3.1 --- Basic Concept --- p.38
Chapter 3.3.2 --- Non-Linear Extension using Kernel Technique --- p.41
Chapter 3.4 --- Implementation Details --- p.42
Chapter 3.4.1 --- Chinese Word Segmentation --- p.42
Chapter 3.4.2 --- Building Basic Chinese Character / Word Lists --- p.47
Chapter 3.4.3 --- Full Sentence Detection --- p.49
Chapter 3.4.4 --- Feature Selection Using Genetic Algorithm --- p.50
Chapter 3.5 --- Experiments --- p.55
Chapter 3.5.1 --- Experiment 1: Evaluation on Chinese Word Segmentation using the LMR-RC Tagging Scheme --- p.56
Chapter 3.5.2 --- Experiment 2: Initial SVR Parameters Searching with Different Kernel Functions --- p.61
Chapter 3.5.3 --- Experiment 3: Feature Selection Using Genetic Algorithm --- p.63
Chapter 3.5.4 --- Experiment 4: Training and Cross-validation Performance using the Selected Feature Subset --- p.67
Chapter 3.5.5 --- Experiment 5: Comparison with Linear Regression --- p.74
Chapter 3.6 --- Summary and Future Work --- p.76
Chapter 4 --- Web Readability Analysis --- p.78
Chapter 4.1 --- Web Page Readability --- p.79
Chapter 4.1.1 --- Readability as Comprehension Difficulty --- p.79
Chapter 4.1.2 --- Readability as Grade Level --- p.81
Chapter 4.2 --- Web Site Readability --- p.83
Chapter 4.3 --- Experiments --- p.85
Chapter 4.3.1 --- Experiment 1: Web Page Readability Analysis - Comprehension Difficulty --- p.87
Chapter 4.3.2 --- Experiment 2: Web Page Readability Analysis - Grade Level --- p.92
Chapter 4.3.3 --- Experiment 3: Web Site Readability Analysis --- p.98
Chapter 4.4 --- Summary and Future Work --- p.101
Chapter 5 --- Conclusion --- p.104
Chapter A --- List of Symbols and Notations --- p.107
Chapter B --- List of Publications --- p.110
Bibliography --- p.113
APA, Harvard, Vancouver, ISO, and other styles
15

"Towards discourse classification for Chinese, a resource-poor language." 2014. http://repository.lib.cuhk.edu.hk/en/item/cuhk-1290645.

Full text
Abstract:
Discourse raises issues about semantics, especially the nature of coherence and cohesion in texts. Like part-of-speech tagging and syntactic parsing, discourse classification is fundamental to computational linguistics, yet it remains comparatively understudied. The lack of annotated corpora limits research on discourse classification for most languages other than English (e.g., Chinese). Manual annotation for discourse classification is complex, time-consuming and costly. To overcome this predicament, one alternative is to explore unsupervised learning methods. Nevertheless, previous work on English showed that unsupervised methods could only handle coarse-grained discourse relations and suffered from low precision. Another possible approach is to transfer discourse classification capabilities from languages that have rich discourse corpora, but cross-language discourse classification remains very much open for investigation. Using Chinese as the target, this thesis presents the first study of discourse classification for a resource-poor language. Furthermore, we also annotate the first open discourse treebank for Chinese, which comprises 890 news articles.
At the beginning, we propose a novel bootstrapping unsupervised method based on semantic sequential representation (SSR) for discourse classification. SSR is a new representation for discourse instances that integrates basic bag-of-words information with lexical, semantic and word-sequential information. Our method starts with a small set of cue-phrase-based patterns to collect a large number of discourse instances, which are later converted to SSRs. We then propose an unsupervised SSR learner to generate, weigh and filter new SSRs without cue phrases for recognizing discourse relations. Experimental results showed that our method outperformed the previous unsupervised method by 7% in F-score. We also show that SSRs are effective features for supervised learning methods.
The SSR-based method (F-score = 0.63) ignores the ambiguities of discourse connectives and, as a result, suffers from low recall (recall = 0.49). To discover and eliminate these ambiguities, we further propose a cross-language framework for discourse classification. In our framework, discourse classification for Chinese is achieved in two steps: (1) discourse connective/trigger identification and (2) sense classification. The English Penn Discourse Treebank 2 (PDTB2) and Chinese-English parallel data are coupled to provide the training data for a co-training based framework. Experimental results showed that our method achieved significant improvement compared to the SSR-based method. The proposed framework is practical and effective, especially in coping with the inter-community problem, which is common in cross-language discourse classification. Moreover, the proposed framework does not integrate any language-specific features, making it theoretically applicable to other languages.
Every language has its unique characteristics, and our cross-language framework, which focuses on the characteristics common across languages, is ineffective at detecting those specific to Chinese. We therefore package the corpus used in this research into the Discourse Treebank for Chinese (DTBC). DTBC adopts the principles of PDTB2 while incorporating the linguistic characteristics of Chinese. The annotation work adds a discourse layer to 890 articles from the Penn Chinese Tree Bank 5 (CTB5). DTBC is the first open Chinese discourse treebank and will be an invaluable linguistic resource for future research on Chinese discourse.
語篇(Discourse)提出了關於語義理解的問題,特別是篇章的銜接與連貫問題。與詞法分析、語法分析相似,語篇分類问题是計算語言學的基本問題之一。較同领域其他問題而言,語篇分類的研究尚處於初級階段。對於除英文外的絕大多數語言,由於缺乏语篇標注資料,語篇分類的研究受到了很大的限制。眾所周知,語篇資料的標注工作複雜度较高而且需要花費大量的時間。為了克服這一困境,一種方法是探索無指導的語篇分類方法。然而,在英文上的先行研究表明,無指導语篇分类方法的缺陷是準確率較低並且僅能處理粗粒度的語篇關係。另一種方法是將語篇分類技術從有大量標注資料的源語言遷移到其他目標語言。然而,當前跨語言語篇分類技術尚不成熟。本文以中文為目標語言,首創了在本地標注資料非常有限(Resource-Poor)的情況下,對中文進行語篇分類的研究。不僅如此,我們還標註了中文第一個公開的,包含890篇新聞文章的語篇樹庫。
為了克服以往無指導方法的缺點,我們首先提出了一種新穎的,基於語義有序標記法 (SSR: Semantic Sequential Representation) 的無指導方法。語義有序標記法是一種新的表示語篇實例的方法,它集成了詞袋(bag-of-words)資訊,詞法資訊,語義資訊以及詞序資訊。我們的方法首先從一小組基於語篇連接詞的模式出發,在中文生語料中獲取大量的語篇實例,我們用語義有序標記法表示這些語篇實例。然後,我們提出了一種無指導的,在不考慮語篇連接詞的情況下,對語義有序表示進行挖掘,打分和過濾的方法。實驗結果證明,我們提出的方法比先前的方法在F值上提高了7%。我們還證明了語義有序表示也可以成為有指導語篇分類方法的有效特徵。
基於挖掘語義有序表示的無指導方法(F-score=0.63)忽略了語篇連接詞的歧義性。因此,其召回率較低。爲消除歧義,我們進一步提出了一種跨語言的語篇分類框架。在我們的框架中,中文語篇分類任務由兩個步驟組成:(1)語篇連詞/觸發詞的發現;(2)語篇關係分類。我們將英文語篇樹庫(PDTB2: Penn Discourse TreeBank 2.0)和中文樹庫(CTB5: Chinese TreeBank 5.0)結合起來作為訓練資料,作為co-training演算法框架的輸入。實驗結果表明,我們提出的跨語言語篇分類方法比單純使用語義有序表示的方法在F值上有非常顯著的提高。 這說明我們提出的跨語言框架可以有效地通過雙語平行語料的橋樑作用,識別不同語言之間的語篇分類的共通性。值得一提的是,我們提出的演算法框架並不需要特定的,語言相關的特徵,因此,它具有很強的擴展並應用到其他語言的能力。
每種語言都有其獨特的特點,我們提出的跨語言方法主要注重於發掘語言之間的共同特點,因此並不能有效地發掘中文篇章分類的獨有特點。我們將實驗中標注過的中文語篇分析資料進行了總結和歸納,形成了中文語篇樹庫(DTBC: Discourse TreeBank for Chinese)。中文語篇樹庫繼承了英文語篇庫的構建原則,與此同時,它針對中文獨有的特點進行了大量的本地化工作。我們的標注工作為中文樹庫 (CTB5: The Chinese TreeBank 5.0)的全部890篇新聞文章添加了語篇資訊層。中文語篇樹庫是第一個開放的、大規模中文語篇樹庫語料。它為未來的中文語篇分析研究提供了至關重要的基礎性標註數據。
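The cue-phrase-based bootstrapping step described in the abstract above can be sketched as follows. The patterns use English connectives ("because", "but") purely for illustration; the thesis itself bootstraps from Chinese connectives, and these two toy patterns and relation labels are assumptions, not the thesis's actual pattern set:

```python
import re

# Hypothetical cue-phrase patterns mapping a connective to a discourse
# relation and splitting the sentence into its two arguments.
CUE_PATTERNS = {
    "causal":   re.compile(r"(?P<arg1>.+?),?\s*because\s+(?P<arg2>.+)", re.I),
    "contrast": re.compile(r"(?P<arg1>.+?),?\s*but\s+(?P<arg2>.+)", re.I),
}

def collect_instances(sentences):
    """Match each sentence against the cue-phrase patterns and emit
    (relation, arg1, arg2) triples, as in the bootstrapping step that
    gathers discourse instances before they are converted to SSRs."""
    instances = []
    for sent in sentences:
        for relation, pattern in CUE_PATTERNS.items():
            match = pattern.match(sent)
            if match:
                instances.append(
                    (relation, match.group("arg1").strip(), match.group("arg2").strip())
                )
                break  # one relation per sentence in this sketch
    return instances

sample = [
    "The match was cancelled because it rained.",
    "The data set is small, but the results are stable.",
]
for instance in collect_instances(sample):
    print(instance)
```

The collected triples would then feed the SSR learner; the later cross-language step exists precisely because simple matching like this cannot resolve ambiguous connectives.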
Zhou, Lanjun.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2014.
Includes bibliographical references (leaves 98-104).
Abstracts also in Chinese.
Title from PDF title page (viewed on 20 December 2016).
Detailed summary in vernacular field only.
APA, Harvard, Vancouver, ISO, and other styles
16

Stewart, Graham Douglas James. "The implications of e-text resource development for Southern African literary studies in terms of analysis and methodology." Thesis, 1999. http://hdl.handle.net/10413/9002.

Full text
Abstract:
This study was aimed at investigating established electronic text and information projects and resources to inform the design and implementation of a South African electronic text resource. Literature was surveyed on a wide variety of electronic text projects and virtual libraries in the humanities, bibliographic databases, electronic encyclopaedias, literature webs, on-line learning, concordancing and textual analysis, and computer application programs for searching and displaying electronic texts. The SALIT Web CD-ROM, a supplementary outcome of the research (including the database, relational table structure, keyword search criteria, search screens, and hypertext linking of title entries to the electronic full texts in the virtual library section) was based on this research. Other outcomes of the project include encoded electronic texts and an Internet web site. The research was undertaken to investigate the benefits of designing and developing an e-text database (hypertext web) that could be used effectively as a learning/teaching and research resource in South African literary studies. The backbone of the resource would be an indexed "virtual library" containing electronic texts (books and other documents in digital form), conforming to international standards for interchange and for sharing with others. Working on the assumption that hypertext is an essentially democratic and anti-canonical environment in which learner/users are free to construct meaning for themselves, it seemed an ideal medium in which to conduct learning, teaching and research in South African literature. By undertaking this project I hoped to start a process, based on international standards, that would provide a framework for a virtual library of South African literature, especially works considered "marginal", works that had gone out of print, or works that were difficult to access for a variety of reasons.
Internationally, the TEI (Text Encoding Initiative) and other literature-based hypertext projects promised the emergence of networked information resources that could absorb and then share texts essential for contemporary South African literary research. Investigation of the current status of on-line reference sources revealed that the digital frameworks underlying bibliographic databases, electronic encyclopaedias and literature webs are now very similar. Specially designed displays allow the SALIT Web to be used as a digital library, providing an opportunity to read books that may not be available from any other library. The on-line learning potential of the SALIT Web is extensive. Asynchronous Learning Network (ALN) programmes in use were assessed and found to offer a high degree of learner-tutor and learner-learner interaction. The Text Analysis Computing Tools (TACT) program was used to investigate the possibility of detailed text analysis of the full texts included in the SALIT library on the CD-ROM. Features such as keyword-in-context and word-frequency generators offer valuable methods to automate the more time-consuming aspects of both thematic and formal text analysis. In the light of current hypertext theory, which emphasises hypertext's lack of fixity and closure, the SALIT Web can be seen to transfer authority from the author/teacher/librarian to the user by offering free access to information, thereby weakening the established power relations of education and access to education. The resource allows the user to examine previously unnoticed but significant contradictions, inconsistencies and patterns, and to construct meaning from them. Yet the resource may still contain interventions by the author/teacher, consisting of pathways that promote the construction of meaning without dictating it.
A hypertext web resource harnesses the cheap and powerful benefits of information technology for literary research, especially in the under-resourced area of South African literary studies. By making a large amount of information readily available and easily accessible, it saves time and reduces frustration for both learners and teachers. An electronic text resource provides users with a virtual library at their fingertips. Its resources can be standardised so that others can add to them, compounding the benefits over time. It can place scarce works (books, articles and papers) within easy reach for student use. Students may then use its resources for independent discovery, or via guided sets of exercises or assignments. Electronic texts break the tyranny of inadequate library resources, restricted access to rare documents and the unavailability of comprehensive bibliographical information in South African literary studies. The publication of the CD-ROM enables the launch of new, related projects, with the emphasis on building a collection of South African texts in all languages and in translation. Training in electronic text preparation, and Internet access to the resource, will also be addressed to take these projects forward.
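The keyword-in-context (KWIC) feature attributed to TACT in the abstract above can be illustrated with a minimal sketch. This is a generic KWIC routine written for illustration, not TACT's actual implementation:

```python
def kwic(text, keyword, width=3):
    """Keyword-in-context: list each occurrence of `keyword` with up to
    `width` words of context on either side, TACT-style."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring trailing punctuation.
        if word.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}")
    return lines

text = ("The library holds many texts, and each text in the library "
        "can be searched by keyword.")
for line in kwic(text, "library"):
    print(line)
```

A concordance built this way automates exactly the kind of thematic lookup the abstract describes: every hit is shown in its immediate verbal surroundings, so recurring patterns become visible at a glance.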
Thesis (Ph.D)-University of Durban-Westville, Durban,1999.
APA, Harvard, Vancouver, ISO, and other styles
17

Mak, King Tong. "The dynamics of collocation: a corpus-based study of the phraseology and pragmatics of the introductory-it construction." Thesis, 2005. http://hdl.handle.net/2152/1776.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Akova, Ferit. "A nonparametric Bayesian perspective for machine learning in partially-observed settings." Thesis, 2014. http://hdl.handle.net/1805/4825.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
Robustness and generalizability of supervised learning algorithms depend on how well the labeled data set represents the real-life problem. In many real-world domains, however, we may not have full knowledge of the underlying data-generating mechanism, which may even evolve, continually introducing new classes. This constitutes a partially-observed setting, where it would be impractical to obtain a labeled data set exhaustively defined by a fixed set of classes. Traditional supervised learning algorithms, which assume an exhaustive training library, would misclassify a future sample of an unobserved class with probability one, leading to an ill-defined classification problem. Our goal is to address situations where this assumption is violated by a non-exhaustive training library, a realistic yet overlooked issue in supervised learning. In this dissertation we pursue a new direction for supervised learning by defining self-adjusting models that relax the fixed-model assumption imposed on classes and their distributions. We let the model adapt itself to prospective data by dynamically adding new classes/components as the data demand, which gradually makes the model more representative of the entire population. In this framework, we first employ suitably chosen nonparametric priors to model class distributions for observed as well as unobserved classes and then utilize new inference methods to classify samples from observed classes and to discover and model novel classes for samples from unobserved ones. This thesis presents the initiating steps of an ongoing effort to address one of the most overlooked bottlenecks in supervised learning and indicates the potential for new perspectives on some of the most heavily studied areas of machine learning: novelty detection, online class discovery and semi-supervised learning.
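The core idea of the abstract above, classifying against the classes observed so far while flagging samples no known class explains well as candidates for new classes, can be illustrated with a minimal sketch. The one-dimensional Gaussian class models and the fixed log-density threshold below are assumptions for illustration only, not the dissertation's nonparametric Bayesian machinery:

```python
import math

# Hypothetical observed classes: name -> (mean, std) of a 1-D Gaussian.
classes = {"A": (0.0, 1.0), "B": (5.0, 1.0)}

def log_density(x, mean, std):
    """Log of the Gaussian density N(x; mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def classify_or_discover(x, threshold=-5.0):
    """Assign x to the best-scoring observed class, unless even the best
    score falls below the threshold, in which case x is flagged as a
    candidate for a previously unobserved class."""
    scores = {name: log_density(x, m, s) for name, (m, s) in classes.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return "new-class-candidate"
    return best

print(classify_or_discover(0.3))   # near class A -> assigned to A
print(classify_or_discover(20.0))  # far from both -> flagged as novel
```

In the dissertation's framework the flagged samples would then seed a new mixture component under the nonparametric prior, rather than being handled by a hard threshold as in this sketch.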
APA, Harvard, Vancouver, ISO, and other styles