Dissertations / Theses on the topic 'LL. Automated language processing'


Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 40 dissertations / theses for your research on the topic 'LL. Automated language processing.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and the bibliographic reference for the chosen work will be generated automatically in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Allott, Nicholas Mark. "A natural language processing framework for automated assessment." Thesis, Nottingham Trent University, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.314333.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Onyenwe, Ikechukwu Ekene. "Developing methods and resources for automated processing of the African language Igbo." Thesis, University of Sheffield, 2017. http://etheses.whiterose.ac.uk/17043/.

Full text
Abstract:
Natural Language Processing (NLP) research is still in its infancy in Africa. Most languages in Africa have few or no NLP resources available, and Igbo is among those with none. In this study, we develop NLP resources to support NLP-based research in the Igbo language. The springboard is the development of a new part-of-speech (POS) tagset for Igbo (IgbTS), based on a slight adaptation of the EAGLES guidelines to accommodate language-internal features not recognized in EAGLES. The tagset comes in three granularities: fine-grained (85 tags), medium-grained (70 tags) and coarse-grained (15 tags). The medium-grained tagset strikes a balance between the other two for practical purposes. This is followed by the preprocessing of Igbo electronic texts through normalization and tokenization. The tokenizer developed in this study uses the tagset's definition of a word token, and the outcome is an Igbo corpus (IgbC) of about one million tokens. The IgbTS was applied to part of the IgbC to produce the first Igbo tagged corpus (IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an inter-annotator agreement (IAA) exercise was undertaken, which led to revisions of the IgbTS where necessary. A novel automatic method was developed to bootstrap the manual annotation process by exploiting the by-products of this IAA exercise, improving the IgbTC. To further improve the quality of the IgbTC, a committee-of-taggers approach was adopted to propose erroneous instances in the IgbTC for correction. A novel automatic method that uses knowledge of affixes to flag and correct morphologically-inflected words in the IgbTC whose assigned tags are inconsistent with their inflected status was also developed and used. Experiments towards the development of an automatic POS tagging system for Igbo using the IgbTC show good accuracy scores, comparable to other languages on which these taggers have been tested, such as English. Accuracy on words unseen during the taggers' training (also called unknown words) is considerably lower, and lower still on unknown words that are morphologically complex, which indicates difficulty in handling morphologically complex words in Igbo. This was improved by adopting a morphological reconstruction method (a linguistically-informed segmentation into stems and affixes) that reformatted these morphologically complex words into patterns learnable by machines. This enables taggers to use knowledge of the stems and associated affixes of these words during tagging to predict their appropriate tags. Interestingly, this method outperforms the methods existing taggers use to handle unknown words, and achieves an impressive increase in accuracy on morphologically-inflected unknown words and on unknown words overall. These developments constitute the first NLP toolkit for the Igbo language and a step towards a Basic Language Resource Kit (BLARK) for the language. This IgboNLP toolkit will be made available to the NLP community and should encourage further research and development for the language.
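As an illustration of the morphological reconstruction idea described above, the sketch below segments a word into a prefix, stem and suffix and emits a pattern a tagger could generalise over. The affix lists and example strings are invented for illustration; the thesis's actual Igbo affix inventory and linguistically-informed rules are not reproduced here.

    # Hypothetical affix lists -- not the inventory developed in the thesis.
    HYPOTHETICAL_PREFIXES = ["a", "e", "o", "n"]
    HYPOTHETICAL_SUFFIXES = ["ghi", "la", "ra", "go"]

    def segment(word):
        """Split a word into (prefix, stem, suffix) with a greedy affix match."""
        prefix, suffix = "", ""
        for p in sorted(HYPOTHETICAL_PREFIXES, key=len, reverse=True):
            if word.startswith(p) and len(word) > len(p) + 1:
                prefix, word = p, word[len(p):]
                break
        for s in sorted(HYPOTHETICAL_SUFFIXES, key=len, reverse=True):
            if word.endswith(s) and len(word) > len(s) + 1:
                suffix, word = s, word[:-len(s)]
                break
        return prefix, word, suffix

    def reformat_for_tagger(word):
        """Emit a pattern like 'a+STEM+ghi' so a tagger can learn over affixes."""
        prefix, stem, suffix = segment(word)
        return "+".join(part for part in (prefix, "STEM", suffix) if part)

    for w in ["abiaghi", "riela"]:   # made-up strings, not attested Igbo forms
        print(w, "->", segment(w), reformat_for_tagger(w))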
APA, Harvard, Vancouver, ISO, and other styles
3

Leonhard, Annette Christa. "Automated question answering for clinical comparison questions." Thesis, University of Edinburgh, 2012. http://hdl.handle.net/1842/6266.

Full text
Abstract:
This thesis describes the development and evaluation of new automated Question Answering (QA) methods tailored to clinical comparison questions, which give clinicians a rank-ordered list of MEDLINE® abstracts targeted to natural language clinical drug comparison questions (e.g. "Have any studies directly compared the effects of Pioglitazone and Rosiglitazone on the liver?"). Three corpora were created to develop and evaluate a new QA system for clinical comparison questions called RetroRank. RetroRank takes the clinician's plain-text question as input, processes it, and outputs a rank-ordered list of potential answer candidates, i.e. MEDLINE® abstracts, which is reordered using new post-retrieval ranking strategies to ensure that the most topically relevant abstracts are displayed as high in the result set as possible. RetroRank achieves a significant improvement over the PubMed recency baseline and performs as well as or better than previous approaches to post-retrieval ranking that rely on query frames and annotated data, such as the approach by Demner-Fushman and Lin (2007). The performance of RetroRank shows that it is possible to successfully use natural language input and a fully automated approach to obtain answers to clinical drug comparison questions. This thesis also introduces two new evaluation corpora of clinical comparison questions with "gold standard" references that are freely available and are a valuable resource for future research in medical QA.
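A minimal sketch of the post-retrieval re-ranking idea (not RetroRank's actual strategies): abstracts that mention both compared drugs and explicit comparison cues are promoted. The cue list, weights and example abstracts are assumptions made for illustration.

    COMPARISON_CUES = {"compared", "comparison", "versus", "vs", "superior", "inferior"}

    def rerank(abstracts, drug_a, drug_b):
        """Sort abstracts so the most comparison-like ones come first."""
        def score(text):
            tokens = text.lower().split()
            s = 2 * (drug_a.lower() in text.lower()) + 2 * (drug_b.lower() in text.lower())
            s += sum(1 for cue in COMPARISON_CUES if cue in tokens)
            return s
        return sorted(abstracts, key=score, reverse=True)

    docs = [
        "Pioglitazone monotherapy in type 2 diabetes.",
        "Pioglitazone versus rosiglitazone: hepatic outcomes compared directly.",
    ]
    for d in rerank(docs, "pioglitazone", "rosiglitazone"):
        print(d)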
APA, Harvard, Vancouver, ISO, and other styles
4

Xozwa, Thandolwethu. "Automated statistical audit system for a government regulatory authority." Thesis, Nelson Mandela Metropolitan University, 2015. http://hdl.handle.net/10948/6061.

Full text
Abstract:
Governments all over the world face numerous challenges in running their countries on a daily basis, and the predominant challenges which arise are those involving statistical methodologies. Official statistics are very important to South Africa's infrastructure, and because of this it is important that an effort is made to reduce the challenges that occur during the development of official statistics. For official statistics to be developed successfully, quality standards need to be built into an organisational framework and form a system of architecture (Statistics New Zealand 2009:1). Therefore, this study seeks to develop a statistical methodology that is appropriate and scientifically correct, using an automated statistical system for audits in government regulatory authorities. The study makes use of Mathematica to provide guidelines on how to develop and use an automated statistical audit system. A comprehensive literature study was conducted using existing secondary sources. A quantitative research paradigm was adopted for this study, to empirically assess the demographic characteristics of tenants of Social Housing Estates and their perceptions of the rental units they inhabit; more specifically, a descriptive study was undertaken. Furthermore, a sample was selected by means of convenience sampling for a case study on the SHRA to assess the respondents' biographical information. From this sample, a pilot study was conducted investigating the general perceptions of the respondents regarding the physical conditions and quality of their units. The technical development of an automated statistical audit system is then discussed; this process involved the development and use of a questionnaire design tool, statistical analysis and reporting, and the use of Mathematica software as a platform for developing the system. The findings of this study provide insights into how government regulatory authorities can best utilise automated statistical audits for regulation purposes, demonstrated through the development of an automated statistical audit system for government regulatory authorities. It is hoped that the findings of this study will provide government regulatory authorities with practical suggestions or solutions regarding the generation of official statistics for regulatory purposes, and that the suggestions for future research will inspire future researchers to further investigate automated statistical audit systems, statistical analysis, automated questionnaire development, and government regulatory authorities.
APA, Harvard, Vancouver, ISO, and other styles
5

Sommers, Alexander Mitchell. "EXPLORING PSEUDO-TOPIC-MODELING FOR CREATING AUTOMATED DISTANT-ANNOTATION SYSTEMS." OpenSIUC, 2021. https://opensiuc.lib.siu.edu/theses/2862.

Full text
Abstract:
We explore the use of a Latent Dirichlet Allocation (LDA)-imitating pseudo-topic-model, based on our original relevance metric, as a tool to facilitate distant annotation of short (often one to two sentences or fewer) documents. Our exploration takes the form of annotating tweets for emotions, this being the current use-case of interest to us, but we believe the method could be extended to any multi-class labeling task over documents of similar length. Tweets are gathered via the Twitter API using "track" terms thought likely to capture tweets with a greater chance of exhibiting each emotional class: 3,000 tweets for each of 26 topics anticipated to elicit emotional discourse. Our pseudo-topic-model is used to produce relevance-ranked vocabularies for each corpus of tweets, and these are used to distribute emotional annotations to the tweets not manually annotated, magnifying the number of annotated tweets by a factor of 29. The vector labels the annotators produce for the topics are cascaded out to the tweets via three different schemes, which are compared for performance by proxy through the competition of bidirectional LSTMs trained on the tweets labeled at a distance. An SVM and two emotionally annotated vocabularies are also tested on each task to provide context and comparison.
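The sketch below illustrates the general workflow in off-the-shelf terms: derive a relevance-ranked vocabulary for an emotion-specific tweet corpus with standard LDA, then distantly label unannotated tweets by vocabulary overlap. The thesis uses its own pseudo-topic-model and relevance metric rather than LDA itself, and the tiny corpora here are invented, so treat this only as an orientation aid.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    def ranked_vocabulary(tweets, n_terms=10):
        """Return the n_terms highest-weighted terms across LDA topics for one corpus."""
        vec = CountVectorizer(stop_words="english")
        X = vec.fit_transform(tweets)
        lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)
        weights = lda.components_.sum(axis=0)            # aggregate term weight over topics
        terms = np.array(vec.get_feature_names_out())
        return list(terms[np.argsort(weights)[::-1][:n_terms]])

    def distant_label(tweet, vocabularies):
        """Assign the emotion whose vocabulary overlaps the tweet the most."""
        tokens = set(tweet.lower().split())
        return max(vocabularies, key=lambda emo: len(tokens & set(vocabularies[emo])))

    corpora = {
        "joy": ["what a wonderful sunny day", "so happy about the great news today"],
        "anger": ["this traffic is absolutely infuriating", "angry about the terrible service"],
    }
    vocabularies = {emo: ranked_vocabulary(tweets) for emo, tweets in corpora.items()}
    print(distant_label("feeling happy and grateful today", vocabularies))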
APA, Harvard, Vancouver, ISO, and other styles
6

Wang, Wei. "Automated spatiotemporal and semantic information extraction for hazards." Diss., University of Iowa, 2014. https://ir.uiowa.edu/etd/1415.

Full text
Abstract:
This dissertation explores three research topics related to automated spatiotemporal and semantic information extraction about hazard events from Web news reports and other social media. The dissertation makes a unique contribution by bridging geographic information science, geographic information retrieval, and natural language processing. Geographic information retrieval and natural language processing techniques are applied to extract spatiotemporal and semantic information automatically from Web documents, to retrieve information about patterns of hazard events that are not explicitly described in the texts. Chapters 2, 3 and 4 can be regarded as three standalone journal papers. The research topics covered by the three chapters are related to each other, and are presented in a sequential way. Chapter 2 begins with an investigation of methods for automatically extracting spatial and temporal information about hazards from Web news reports. A set of rules is developed to combine the spatial and temporal information contained in the reports based on how this information is presented in text, in order to capture the dynamics of hazard events (e.g., changes in event locations, new events occurring) as they occur over space and time. Chapter 3 presents an approach for retrieving semantic information about hazard events using ontologies and semantic gazetteers. With this work, information on the different kinds of events (e.g., impact, response, or recovery events) can be extracted, as well as information about hazard events at different levels of detail. Using the methods presented in Chapters 2 and 3, an approach for automatically extracting spatial, temporal, and semantic information from tweets is discussed in Chapter 4. Four different elements of tweets are used for assigning appropriate spatial and temporal information to hazard events in tweets. Since tweets represent shorter but more current information about hazards and how they are impacting a local area, key information about hazards can be retrieved through the spatiotemporal and semantic information extracted from tweets.
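The specific rule sets and ontologies are the dissertation's own, but the kind of building block such a pipeline rests on can be shown with off-the-shelf named-entity recognition: pulling place and time expressions out of a report sentence. The snippet assumes spaCy and its small English model are installed; the sentence is invented.

    import spacy

    nlp = spacy.load("en_core_web_sm")   # install with: python -m spacy download en_core_web_sm

    doc = nlp("Flooding spread from Cedar Rapids to Iowa City on June 13, 2008.")
    for ent in doc.ents:
        if ent.label_ in {"GPE", "LOC", "DATE", "TIME"}:
            print(ent.text, ent.label_)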
APA, Harvard, Vancouver, ISO, and other styles
7

Teske, Alexander. "Automated Risk Management Framework with Application to Big Maritime Data." Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/38567.

Full text
Abstract:
Risk management is an essential tool for ensuring the safety and timeliness of maritime operations and transportation. Some of the many risk factors that can compromise the smooth operation of maritime activities include harsh weather and pirate activity. However, identifying and quantifying the extent of these risk factors for a particular vessel is not a trivial process. One challenge is that processing the vast amounts of automatic identification system (AIS) messages generated by the ships requires significant computational resources. Another is that the risk management process partially relies on human expertise, which can be time-consuming and error-prone. In this thesis, an existing Risk Management Framework (RMF) is augmented to address these issues. A parallel/distributed version of the RMF is developed to efficiently process large volumes of AIS data and assess the risk levels of the corresponding vessels in near-real-time. A genetic fuzzy system is added to the RMF's Risk Assessment module in order to automatically learn the fuzzy rule base governing the risk assessment process, thereby reducing the reliance on human domain experts. A new weather risk feature is proposed, and an existing regional hostility feature is extended to automatically learn about pirate activity by ingesting unstructured news articles and incident reports. Finally, a geovisualization tool is developed to display the position and risk levels of ships at sea. Together, these contributions pave the way towards truly automatic risk management, a crucial component of modern maritime solutions. The outcomes of this thesis will contribute to enhancing Larus Technologies' Total::Insight, a risk-aware decision support system successfully deployed in maritime scenarios.
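A hand-written miniature of the fuzzy-inference step is sketched below; the genetic algorithm that learns the rule base in the thesis is omitted, and the membership functions, rules and weights are illustrative assumptions rather than the deployed model.

    def tri(x, a, b, c):
        """Triangular membership function."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def assess_risk(wave_height_m, pirate_incidents_30d):
        rough_sea = tri(wave_height_m, 2.0, 5.0, 8.0)
        hostile_region = tri(pirate_incidents_30d, 1.0, 5.0, 10.0)
        high = min(rough_sea, hostile_region)     # rule 1: rough sea AND hostile region
        medium = max(rough_sea, hostile_region)   # rule 2: rough sea OR hostile region
        return 0.7 * high + 0.3 * medium          # crisp score from rule activations

    print(round(assess_risk(4.5, 6.0), 3))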
APA, Harvard, Vancouver, ISO, and other styles
8

Salov, Aleksandar. "Towards automated learning from software development issues : Analyzing open source project repositories using natural language processing and machine learning techniques." Thesis, Linnéuniversitetet, Institutionen för medieteknik (ME), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-66834.

Full text
Abstract:
This thesis presents an in-depth investigation on the subject of how natural language processing and machine learning techniques can be utilized in order to perform a comprehensive analysis of programming issues found in different open source project repositories hosted on GitHub. The research is focused on examining issues gathered from a number of JavaScript repositories based on their user generated textual description. The primary goal of the study is to explore how natural language processing and machine learning methods can facilitate the process of identifying and categorizing distinct issue types. Furthermore, the research goes one step further and investigates how these same techniques can support users in searching for potential solutions to these issues. For this purpose, an initial proof-of-concept implementation is developed, which collects over 30 000 JavaScript issues from over 100 GitHub repositories. Then, the system extracts the titles of the issues, cleans and processes the data, before supplying it to an unsupervised clustering model which tries to uncover any discernible similarities and patterns within the examined dataset. What is more, the main system is supplemented by a dedicated web application prototype, which enables users to utilize the underlying machine learning model in order to find solutions to their programming related issues. Furthermore, the developed implementation is meticulously evaluated through a number of measures. First of all, the trained clustering model is assessed by two independent groups of external reviewers - one group of fellow researchers and another group of practitioners in the software industry, so as to determine whether the resulting categories contain distinct types of issues. Moreover, in order to find out if the system can facilitate the search for issue solutions, the web application prototype is tested in a series of user sessions with participants who are not only representative of the main target group which can benefit most from such a system, but who also have a mixture of both practical and theoretical backgrounds. The results of this research demonstrate that the proposed solution can effectively categorize issues according to their type, solely based on the user generated free-text title. This provides strong evidence that natural language processing and machine learning techniques can be utilized for analyzing issues and automating the overall learning process. However, the study was unable to conclusively determine whether these same methods can aid the search for issue solutions. Nevertheless, the thesis provides a detailed account of how this problem was addressed and can therefore serve as the basis for future research.
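A minimal sketch of the unsupervised-clustering idea (the thesis's preprocessing and model choices may differ): cluster issue titles with TF-IDF features and k-means. The titles are invented examples.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    titles = [
        "TypeError: cannot read property 'map' of undefined",
        "Add documentation for the config options",
        "Memory leak when watching many files",
        "Docs: clarify installation steps on Windows",
    ]

    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(titles)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    for title, label in zip(titles, km.labels_):
        print(label, title)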
APA, Harvard, Vancouver, ISO, and other styles
9

Sunil, Kamalakar FNU. "Automatically Generating Tests from Natural Language Descriptions of Software Behavior." Thesis, Virginia Tech, 2013. http://hdl.handle.net/10919/23907.

Full text
Abstract:
Behavior-Driven Development (BDD) is an emerging agile development approach where all stakeholders (including developers and customers) work together to write user stories in structured natural language to capture a software application's functionality in terms of required "behaviors". Developers then manually write "glue" code so that these scenarios can be executed as software tests. This glue code represents individual steps within unit and acceptance test cases, and tools exist that automate the mapping from scenario descriptions to manually written code steps (typically using regular expressions). Instead of requiring programmers to write manual glue code, this thesis investigates a practical approach to convert natural language scenario descriptions into executable software tests fully automatically. To show feasibility, we developed a tool called Kirby that uses natural language processing techniques, code information extraction and probabilistic matching to automatically generate executable software tests from structured English scenario descriptions. Kirby relieves the developer from the laborious work of writing code for the individual steps described in scenarios, so that developers and customers can both focus on the scenarios as pure behavior descriptions (understandable to all, not just programmers). Results from assessing the performance and accuracy of this technique are presented.
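For context, the snippet below is a toy version of the regular-expression "glue" that conventional BDD tools use to bind natural-language steps to test code, i.e. the manual layer Kirby aims to generate automatically. The step patterns and functions are made up.

    import re

    STEP_DEFINITIONS = []

    def step(pattern):
        def register(func):
            STEP_DEFINITIONS.append((re.compile(pattern), func))
            return func
        return register

    @step(r"the user enters (\d+) and (\d+)")
    def enter_numbers(a, b):
        return int(a) + int(b)

    def run_step(text):
        for pattern, func in STEP_DEFINITIONS:
            match = pattern.search(text)
            if match:
                return func(*match.groups())
        raise LookupError(f"No step definition matches: {text!r}")

    print(run_step("Given the user enters 4 and 5"))   # -> 9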
Master of Science
APA, Harvard, Vancouver, ISO, and other styles
10

Mao, Jin, Lisa R. Moore, Carrine E. Blank, Elvis Hsin-Hui Wu, Marcia Ackerman, Sonali Ranade, and Hong Cui. "Microbial phenomics information extractor (MicroPIE): a natural language processing tool for the automated acquisition of prokaryotic phenotypic characters from text sources." BIOMED CENTRAL LTD, 2016. http://hdl.handle.net/10150/622562.

Full text
Abstract:
Background: The large-scale analysis of phenomic data (i.e., full phenotypic traits of an organism, such as shape, metabolic substrates, and growth conditions) in microbial bioinformatics has been hampered by the lack of tools to rapidly and accurately extract phenotypic data from existing legacy text in the field of microbiology. To quickly obtain knowledge on the distribution and evolution of microbial traits, an information extraction system needed to be developed to extract phenotypic characters from large numbers of taxonomic descriptions so they can be used as input to existing phylogenetic analysis software packages. Results: We report the development and evaluation of Microbial Phenomics Information Extractor (MicroPIE, version 0.1.0). MicroPIE is a natural language processing application that uses a robust supervised classification algorithm (Support Vector Machine) to identify characters from sentences in prokaryotic taxonomic descriptions, followed by a combination of algorithms applying linguistic rules with groups of known terms to extract characters as well as character states. The input to MicroPIE is a set of taxonomic descriptions (clean text). The output is a taxon-by-character matrix, with taxa in the rows and a set of 42 pre-defined characters (e.g., optimum growth temperature) in the columns. The performance of MicroPIE was evaluated against a gold standard matrix and another student-made matrix. Results show that, compared to the gold standard, MicroPIE extracted 21 characters (50%) with a Relaxed F1 score > 0.80 and 16 characters (38%) with Relaxed F1 scores ranging between 0.50 and 0.80. Inclusion of a character prediction component (SVM) improved the overall performance of MicroPIE, notably the precision. Evaluated against the same gold standard, MicroPIE performed significantly better than the undergraduate students. Conclusion: MicroPIE is a promising new tool for the rapid and efficient extraction of phenotypic character information from prokaryotic taxonomic descriptions. However, further development, including incorporation of ontologies, will be necessary to improve the performance of the extraction for some character types.
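The supervised sentence-classification step can be pictured with a linear SVM over bag-of-words features, as in the hedged sketch below; the sentences and character labels are invented, and MicroPIE's real pipeline adds linguistic rules and term lists on top of the classifier.

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    sentences = [
        "Optimal growth occurs at 37 degrees C.",
        "Cells are rod-shaped and motile.",
        "Growth is observed between pH 6.0 and 8.5.",
        "Colonies are yellow and circular.",
    ]
    characters = ["growth_temperature", "cell_shape", "ph_range", "colony_morphology"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    clf.fit(sentences, characters)

    print(clf.predict(["The optimum growth temperature is 30 degrees C."]))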
APA, Harvard, Vancouver, ISO, and other styles
11

Munnecom, Lorenna, and Miguel Chaves de Lemos Pacheco. "Exploration of an Automated Motivation Letter Scoring System to Emulate Human Judgement." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34563.

Full text
Abstract:
As the popularity of the master's programme in data science at Dalarna University increases, so does the number of applicants. The aim of this thesis was to explore different approaches to an automated motivation letter scoring system which could emulate human judgement and automate the process of candidate selection. Several steps, such as image processing and text processing, were required to retrieve numerous features which could lead to the identification of the factors graded by the program managers. Grammar-based features and advanced textual features were extracted from the motivation letters, followed by the application of topic modelling methods to extract the probability of each topic occurring within a motivation letter. Furthermore, correlation analysis was applied to quantify the association between the features and the different factors graded by the program managers, followed by Ordinal Logistic Regression and Random Forest to build models with the most impactful variables. Finally, the Naïve Bayes algorithm, Random Forest and Support Vector Machine were used, first for classification and then for prediction purposes. These results were not promising, as the factors were not accurately identified. Nevertheless, the authors suspect that the factors may be strongly related to the prominence of specific topics within a motivation letter, which could lead to further research.
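The correlation-analysis step mentioned above can be illustrated with Spearman's rank correlation between one extracted feature and an ordinal grade; the feature values and grades below are invented examples, not the study's data.

    from scipy.stats import spearmanr

    topic_probability = [0.05, 0.20, 0.35, 0.10, 0.60, 0.45]   # feature value per letter
    grade = [1, 2, 3, 1, 4, 3]                                  # grade given by a program manager

    rho, p_value = spearmanr(topic_probability, grade)
    print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")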
APA, Harvard, Vancouver, ISO, and other styles
12

Cunningham-Nelson, Samuel Kayne. "Enhancing student conceptual understanding and learning experience through automated textual analysis." Thesis, Queensland University of Technology, 2019. https://eprints.qut.edu.au/134145/1/Samuel_Cunningham-Nelson_Thesis.pdf.

Full text
Abstract:
Supporting students in developing a strong foundation for thorough understanding, and assisting educators in teaching effectively, both require meaningful feedback. The contributions presented in this thesis aimed to provide instantaneous and individualised feedback for both students and educators through the use of text analysis. The methodologies and models described are all automated and, once implemented, can therefore provide feedback routinely and recurrently. These solutions facilitate learning and teaching for students and educators, respectively, helping to close the quality assurance loop.
APA, Harvard, Vancouver, ISO, and other styles
13

Paterson, Kimberly Laurel Ms. "TSPOONS: Tracking Salience Profiles Of Online News Stories." DigitalCommons@CalPoly, 2014. https://digitalcommons.calpoly.edu/theses/1222.

Full text
Abstract:
News space is a relatively nebulous term that describes the general discourse concerning events that affect the populace. Past research has focused on qualitatively analyzing news space in an attempt to answer big questions about how the populace relates to the news and how it responds to it. We want to ask: when do stories begin? What stories stand out among the noise? In order to answer the big questions about news space, we need to track the course of individual stories in the news. By analyzing the specific articles that comprise stories, we can synthesize the information gained from several stories to see a more complete picture of the discourse. The individual articles, the groups of articles that become stories, and the overall themes that connect stories together all complete the narrative about what is happening in society. TSPOONS provides a framework for analyzing news stories and answering two main questions: what were the important stories during some time frame, and what were the important stories involving some topic? Drawing technical news stories from Techmeme.com, TSPOONS generates profiles of each news story, quantitatively measuring the importance, or salience, of news stories as well as quantifying the impact of these stories over time.
APA, Harvard, Vancouver, ISO, and other styles
14

Svensson, Pontus. "Automated Image Suggestions for News Articles : An Evaluation of Text and Image Representations in an Image Retrieval System." Thesis, Linköpings universitet, Interaktiva och kognitiva system, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166669.

Full text
Abstract:
Multimodal machine learning is a subfield of machine learning that aims to relate data from different modalities, such as texts and images. One of the many applications that could be built upon this technique is an image retrieval system that, given a text query, retrieves suitable images from a database. In this thesis, a retrieval system based on canonical correlation is used to suggest images for news articles. Different dense text representations produced by Word2vec and Doc2vec, and image representations produced by pre-trained convolutional neural networks, are explored to find out how they affect the suggestions. Which part of an article is best suited as a query to the system is also studied, and experiments are carried out to determine whether an article's date of publication can be used to improve the suggestions. The results show that Word2vec outperforms Doc2vec in the task, which indicates that the meaning of article texts is not as important as the individual words they consist of. Furthermore, the queries are improved by rewarding words that are particularly significant.
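A compact illustration of the canonical-correlation idea behind such a retrieval system: learn a shared space for text and image feature vectors, then rank images by similarity to a projected text query. The random matrices below are stand-ins for Word2vec/Doc2vec and CNN features, and the component count is an arbitrary choice.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    text_feats = rng.normal(size=(100, 50))    # e.g. averaged Word2vec vectors
    image_feats = rng.normal(size=(100, 64))   # e.g. CNN penultimate-layer features

    cca = CCA(n_components=10).fit(text_feats, image_feats)

    def rank_images(query_vec, image_matrix):
        q, imgs = cca.transform(query_vec.reshape(1, -1), image_matrix)
        q = q / np.linalg.norm(q)
        imgs = imgs / np.linalg.norm(imgs, axis=1, keepdims=True)
        return np.argsort(imgs @ q.ravel())[::-1]   # indices of best-matching images first

    print(rank_images(text_feats[0], image_feats)[:5])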
APA, Harvard, Vancouver, ISO, and other styles
15

Lepage, Yves. "Un système de grammaires correspondancielles d'identification." Grenoble 1, 1989. http://www.theses.fr/1989GRE10059.

Full text
Abstract:
This thesis proposes a declarative programming language whose basic objects are 'planches' (boards). A board expresses the correspondence between a string and a tree. The proposed system rests on identification (matching), in which the variables are not term variables but forest variables.
APA, Harvard, Vancouver, ISO, and other styles
16

Xia, Menglin. "Text readability and summarisation for non-native reading comprehension." Thesis, University of Cambridge, 2019. https://www.repository.cam.ac.uk/handle/1810/288740.

Full text
Abstract:
This thesis focuses on two important aspects of non-native reading comprehension: text readability assessment, which estimates the reading difficulty of a given text for L2 learners, and learner summarisation assessment, which evaluates the quality of learner summaries to assess their reading comprehension. We approach both tasks as supervised machine learning problems and present automated assessment systems that achieve state-of-the-art performance. We first address the task of text readability assessment for L2 learners. One of the major challenges for a data-driven approach to text readability assessment is the lack of significantly-sized level-annotated data aimed at L2 learners. We present a dataset of CEFR-graded texts tailored for L2 learners and look into a range of linguistic features affecting text readability. We compare the text readability measures for native and L2 learners and explore methods that make use of the more plentiful data aimed at native readers to help improve L2 readability assessment. We then present a summarisation task for evaluating non-native reading comprehension and demonstrate an automated summarisation assessment system aimed at evaluating the quality of learner summaries. We propose three novel machine learning approaches to assessing learner summaries. In the first approach, we examine using several NLP techniques to extract features to measure the content similarity between the reading passage and the summary. In the second approach, we calculate a similarity matrix and apply a convolutional neural network (CNN) model to assess the summary quality using the similarity matrix. In the third approach, we build an end-to-end summarisation assessment model using recurrent neural networks (RNNs). Further, we combine the three approaches to a single system using a parallel ensemble modelling technique. We show that our models outperform traditional approaches that rely on exact word match on the task and that our best model produces quality assessments close to professional examiners.
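The second approach can be pictured as follows: build a sentence-by-sentence similarity matrix between the reading passage and a learner summary, which a CNN then scores. The sketch below uses plain TF-IDF cosine similarity and invented texts; the thesis's actual similarity computation and network are not reproduced.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    passage = [
        "The committee approved the new policy after a long debate.",
        "Opponents argued that the costs outweighed the benefits.",
    ]
    summary = [
        "A new policy was approved despite disagreement about its costs.",
    ]

    vec = TfidfVectorizer().fit(passage + summary)
    sim_matrix = cosine_similarity(vec.transform(summary), vec.transform(passage))
    print(sim_matrix)   # shape: (summary sentences, passage sentences)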
APA, Harvard, Vancouver, ISO, and other styles
17

Fancellu, Federico. "Computational models for multilingual negation scope detection." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/33038.

Full text
Abstract:
Negation is a common property of languages: few languages, if any, lack means to reverse the truth-value of a statement. A challenge for cross-lingual studies of negation lies in the fact that languages encode and use it in different ways. Although this variation has been extensively researched in linguistics, little has been done in automated language processing. In particular, we lack computational models of processing negation that can be generalized across languages, and we even lack knowledge of what the development of such models would require. Such models can, however, be built by means of existing cross-lingual resources, even when annotated data for a language other than English is not available. This thesis shows this in the context of detecting string-level negation scope, i.e. the set of tokens in a sentence whose meaning is affected by a negation marker (e.g. 'not'). Our contribution has two parts. First, we investigate the scenario where annotated training data is available. We show that Bi-directional Long Short Term Memory (BiLSTM) networks are state-of-the-art models whose features can be generalized across languages. We also show that these models suffer from genre effects and that, for most of the corpora we have experimented with, high performance is simply an artifact of the annotation styles, where negation scope is often a span of text delimited by punctuation. Second, we investigate the scenario where annotated data is available in only one language, experimenting with model transfer. To test our approach, we first build NEGPAR, a parallel corpus annotated for negation, where pre-existing annotations on English sentences have been edited and extended to Chinese translations. We then show that transferring a model for negation scope detection across languages is possible by means of structured neural models where negation scope is detected on top of a cross-linguistically consistent representation, Universal Dependencies. On the other hand, we found that cross-lingual lexical information helped only very little with performance. Finally, error analysis shows that performance is better when a negation marker is in the same dependency substructure as its scope, and that some of the phenomena related to negation scope that require lexical knowledge are still not captured correctly. In the conclusions, we tie up the contributions of this thesis and point future work towards representing negation scope across languages at the level of logical form as well.
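A minimal BiLSTM token-tagging skeleton of the kind described above is sketched below with Keras; the dimensions and random data are placeholders, and the thesis's models additionally use cross-lingual embeddings and Universal Dependencies structure, which are omitted here.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

    vocab_size, seq_len = 5000, 40
    model = Sequential([
        Embedding(vocab_size, 100),
        Bidirectional(LSTM(64, return_sequences=True)),
        TimeDistributed(Dense(2, activation="softmax")),   # in-scope vs out-of-scope per token
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    X = np.random.randint(0, vocab_size, size=(8, seq_len))   # toy batch of word indices
    y = np.random.randint(0, 2, size=(8, seq_len, 1))          # per-token scope labels
    model.fit(X, y, epochs=1, verbose=0)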
APA, Harvard, Vancouver, ISO, and other styles
18

Lermuzeaux, Jean-Marc. "Contribution à l'intégration des niveaux de traitement automatique de la langue écrite : ANAEL : un environnement de compréhension basé sur les objets, les actions et les grammaires d'événements." Caen, 1988. http://www.theses.fr/1988CAEN2029.

Full text
Abstract:
The ANAEL environment is based on a knowledge representation in the form of objects and actions, and on event grammars. ANAEL is organized on three levels: 1) the acquisition level, for the real-world and linguistic models; 2) the comprehension level, where an interpreter manipulates a knowledge base in natural language; and 3) the meta level, where the models themselves are treated as facts.
APA, Harvard, Vancouver, ISO, and other styles
19

Dyremark, Johanna, and Caroline Mayer. "Bedömning av elevuppsatser genom maskininlärning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-262041.

Full text
Abstract:
Today, a large amount of a teacher's workload consists of essay scoring, and there is considerable variability between teachers' gradings. This report aims to examine what accuracy can be achieved with an automated essay scoring system for Swedish. Three machine learning models for classification are trained and tested with 5-fold cross-validation on essays from Swedish national tests: Linear Discriminant Analysis, K-Nearest Neighbour and Random Forest. Essays are classified based on 31 language- and structure-related attributes, such as token-based length measures, similarity to texts of different levels of formality, and use of grammar. The results show a maximal quadratic weighted kappa value of 0.4829 and a grading identical to the expert's assessment in 57.53% of all tests. These results were achieved by a model based on Linear Discriminant Analysis, which showed higher inter-rater reliability with expert grading than a local teacher. Despite ongoing digitalization within the Swedish educational system, a number of obstacles prevent a complete automation of essay scoring, such as users' attitudes, ethical issues and the current techniques' difficulties in understanding semantics. Nevertheless, a partial integration of automatic essay scoring has the potential to effectively identify essays suitable for double grading, which can increase the consistency of large-scale tests at a low cost.
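The agreement statistic reported above, quadratic weighted kappa, can be computed directly with scikit-learn; the two grade sequences below are invented examples, not the study's data.

    from sklearn.metrics import cohen_kappa_score

    # Map ordinal grades to integers so the quadratic weighting reflects distance.
    order = {g: i for i, g in enumerate(["F", "E", "D", "C", "B", "A"])}
    expert = [order[g] for g in ["F", "E", "C", "A", "C", "D"]]
    model = [order[g] for g in ["F", "D", "C", "B", "C", "D"]]

    qwk = cohen_kappa_score(expert, model, weights="quadratic")
    print(round(qwk, 4))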
APA, Harvard, Vancouver, ISO, and other styles
20

Marshall, Susan LaVonne. "Concept of Operations (CONOPS) for foreign language and speech translation technologies in a coalition military environment." Thesis, Monterey, Calif. : Springfield, Va. : Naval Postgraduate School ; Available from National Technical Information Service, 2005. http://library.nps.navy.mil/uhtbin/hyperion/05Mar%5FMarshall.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Silveira, Gabriela. "Narrativas produzidas por indivíduos afásicos e indivíduos cognitivamente sadios: análise computadorizada de macro e micro estrutura." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/5/5170/tde-01112018-101055/.

Full text
Abstract:
INTRODUCTION: The analysis of aphasic discourse provides important information about the phonological, morphological, syntactic, semantic and pragmatic aspects of the language of patients who have suffered a stroke. The evaluation of discourse, along with other methods, can contribute to monitoring the evolution of language and communication in aphasic patients; however, manual analysis is laborious and error-prone. OBJECTIVES: (1) to analyze, by means of computerized technologies, macro- and microstructural aspects of the discourse of cognitively healthy individuals and of individuals with Broca's and anomic aphasia; (2) to explore discourse as an indicator of the evolution of aphasia; (3) to analyze the contribution of single photon emission computed tomography (SPECT) to verifying the correlation between behavioral and neuroimaging evolution data. METHOD: Two groups of patients were studied: GA1, consisting of eight individuals with Broca's and anomic aphasia, analyzed longitudinally from the sub-acute phase of the lesion and after three and six months; GA2, composed of 15 individuals with Broca's and anomic aphasia with varying times since stroke; and GC, consisting of 30 cognitively healthy participants. Computerized technologies were explored for the analysis of metrics related to the micro- and macrostructure of discourses elicited by the Cinderella story and the Cookie Theft picture. RESULTS: Comparing GC and GA2 with respect to discourse macrostructure, the GA2 aphasics differed significantly from GC in the total number of propositions produced; considering the microstructure, seven metrics differentiated the two groups. There was a significant difference in macro- and microstructure between the discourses of Broca's and anomic aphasic subjects. It was possible to verify differences in macro- and microstructure measurements in GA1 as time since injury advanced. In GA1, the comparison between parameters in the sub-acute phase and after six months of stroke revealed differences in macrostructure: an increase in the number of propositions of the orientation block and in the total number of propositions. Regarding the microstructure, the initial measures of syllables per content word, incidence of nouns and incidence of content words differed after six months of intervention. The incidence of words missing from the dictionary showed a significantly lower value after three months of stroke. The Cinderella story provided more complete microstructure data than the Cookie Theft picture. SPECT findings did not change over time, showing no correspondence with the evolution of aphasia. CONCLUSION: The discourse produced from the Cinderella story and the Cookie Theft picture generated material for macrostructure and microstructure analysis of cognitively healthy and aphasic individuals, made it possible to quantify and qualify the evolution of language in different phases of stroke recovery, and distinguished the behavior of healthy individuals from those with Broca's and anomic aphasia in macro- and microstructure aspects. The use of computerized tools facilitated the analysis of the data in relation to the microstructure, but was not applicable to the macrostructure, demonstrating that tool adjustments are needed for the analysis of patients' discourse. SPECT data did not reflect the behavioral improvement of the aphasic subjects' language.
APA, Harvard, Vancouver, ISO, and other styles
22

Toledo, Cíntia Matsuda. "Análise de aspectos micro e macrolinguísticos da narrativa de indivíduos com doença de Alzheimer, comprometimento cognitivo leve e sem comprometimentos cognitivos." Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/5/5170/tde-11092017-133850/.

Full text
Abstract:
INTRODUCTION: Population aging is a well-known social trend in developed countries and is increasingly pronounced in developing countries. Dementia is considered one of the main health problems arising from the rapid growth of the elderly population, and language disorders are an important feature of these conditions. Discourse has gained prominence for the identification of linguistic disorders in dementia as well as in the follow-up of these patients. Characterizing discourse differences can aid differential diagnosis and contribute to the creation of future tools that support clinical intervention and help prevent the evolution and/or progression of dementia. Transcription and discourse analysis are laborious, so computational methods have been used to assist in the identification and extraction of linguistic features. OBJECTIVE: to identify changes in micro- and macrolinguistic aspects that differentiate individuals with Alzheimer's disease, mild cognitive impairment and cognitively healthy elderly individuals in a narrative task with pictures in sequence, and to explore a computational tool (Coh-Metrix-Dementia) for analyzing these subjects' discourse. METHOD: 60 subjects were evaluated, 20 in each research group (mild Alzheimer's disease - GDA, amnestic mild cognitive impairment - GCCLa, and control - GC). The subjects were asked to produce a narrative based on 22 scenes in sequence depicting the Cinderella story. The following linguistic-cognitive tests were also applied: Verbal Fluency, the Boston Naming Test, and the Camel and Cactus Test. Coh-Metrix-Dementia was used for automatic extraction of the metrics. RESULTS: The values extracted by Coh-Metrix-Dementia were treated statistically, and it was possible to identify metrics capable of distinguishing the groups studied. Regarding the microlinguistic aspects, the GDA showed reduced syntactic abilities, greater difficulty in lexical retrieval, and discourses with less cohesion and local coherence. At the macrolinguistic level, the GDA produced the least informative discourses, with greater impairment of global coherence and a larger number of modalizations; the GDA also showed greater impairment of narrative structure. It was not possible to discriminate between GCCLa and GC on any discourse metric in this study. Adaptations to sentence segmentation were made so that the computational tool would work better. CONCLUSION: The GDA subjects presented discourses with greater macro- and microstructural impairment. The computational tool proved to be an important ally for discourse analysis.
APA, Harvard, Vancouver, ISO, and other styles
23

Murakami, Tiago R. M. "Tesauros e a World Wide Web." Thesis, 2005. http://eprints.rclis.org/9863/1/murakami-tesauros.pdf.

Full text
Abstract:
Thesauri are tools of growing importance in the Web context. For this, it is necessary to adapt thesauri to Web technologies and functionalities. The present work is an exploratory study that aims to identify how documentary thesauri are being used and/or incorporated for the management of information on the Web.
APA, Harvard, Vancouver, ISO, and other styles
24

Vidal-Santos, Gerard. "Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos." Thesis, 2018. http://eprints.rclis.org/33589/1/VidalSantos_TFG_2018.pdf.

Full text
Abstract:
The aim of the study is to explore the use of unsupervised Named-Entity Recognition (NER) processes to generate descriptive metadata capable of assisting information retrieval interfaces in large-scale digital collections and of supporting the construction of more diverse knowledge representation models in academic libraries. For this purpose, the study reviews some experiences and canonical literature on the automated creation of subject headings in library and archive environments, as a way of counterbalancing the overexploited use of search engines as the main access points for retrieving assets in catalogs and digital collections. It focuses on the guidelines established by two articles that address this task from complementary points of view: • van Hooland S, de Wilde M, Verborgh R, Steiner T, Van de Walle R. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities. 2013 Nov 1;30(2):262–79. • Zeng M. Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky's Theory and Two Research Samples. Knowledge Organization. 2014 Jan 1;440–51. The first provides the tools to generate named entities from large text samples and establishes the parameters to assess the suitability of these entities at a quantitative level. The second provides the guidelines to analyze the quality of those results through a three-layered framework (identification-description-interpretation) based on Erwin Panofsky's work on the analysis and interpretation of pictorial works. A work environment is built on this premise to extract and analyze the entities detected by DBpedia Spotlight (the NER service used for extraction) in a random collection of bibliographic records taken from a thesis aggregator (Open Access Theses and Dissertations). The results show a clear quantitative improvement in descriptive access points provided by these processes, allowing users to browse more effectively through better contextualized records when the extracted entities are combined with the keywords already indexed, even though the entities lack the consistency needed to pass the quality filter established in the evaluation table. This setback, however, only partially limits the possibility of improving the visibility of records in large collections by these means, provided that the logical constructions of the semantic base behind the extraction service are taken into consideration in iterative cataloging processes, establishing an iterative and cost-effective way of building more diverse knowledge graphs that connect manually or automatically indexed keywords to other nodes in the linked open data (LOD) cloud.
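For orientation, entity extraction of the kind used in the study can be reproduced against the public DBpedia Spotlight annotation endpoint as sketched below; the endpoint, confidence threshold and example text are assumptions, and response fields may change over time.

    import requests

    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": "A study of machine translation at the University of Edinburgh",
                "confidence": 0.5},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    for resource in resp.json().get("Resources", []):
        print(resource["@surfaceForm"], "->", resource["@URI"])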
APA, Harvard, Vancouver, ISO, and other styles
25

Vidal-Santos, Gerard. "Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos." Thesis, 2018. http://eprints.rclis.org/33692/1/VidalSantos_TFG_2018.pdf.

Full text
Abstract:
The aim of the study is to explore the use of unsupervised Named-Entity Recognition (NER) processes to generate descriptive metadata capable of assisting information retrieval interfaces in large-scale digital collections and of supporting the construction of more diverse knowledge representation models in academic libraries. For this purpose, the study reviews some experiences and canonical literature on the automated creation of subject headings in library and archive environments, as a way of counterbalancing the overexploited use of search engines as the main access points for retrieving assets in catalogs and digital collections. It focuses on the guidelines established by two articles that address this task from complementary points of view: • van Hooland S, de Wilde M, Verborgh R, Steiner T, Van de Walle R. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities. 2013 Nov 1;30(2):262–79. • Zeng M. Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky's Theory and Two Research Samples. Knowledge Organization. 2014 Jan 1;440–51. The first provides the tools to generate named entities from large text samples and establishes the parameters to assess the suitability of these entities at a quantitative level. The second provides the guidelines to analyze the quality of those results through a three-layered framework (identification-description-interpretation) based on Erwin Panofsky's work on the analysis and interpretation of pictorial works. A work environment is built on this premise to extract and analyze the entities detected by DBpedia Spotlight (the NER service used for extraction) in a random collection of bibliographic records taken from a thesis aggregator (Open Access Theses and Dissertations). The results show a clear quantitative improvement in descriptive access points provided by these processes, allowing users to browse more effectively through better contextualized records when the extracted entities are combined with the keywords already indexed, even though the entities lack the consistency needed to pass the quality filter established in the evaluation table. This setback, however, only partially limits the possibility of improving the visibility of records in large collections by these means, provided that the logical constructions of the semantic base behind the extraction service are taken into consideration in iterative cataloging processes, establishing an iterative and cost-effective way of building more diverse knowledge graphs that connect manually or automatically indexed keywords to other nodes in the linked open data (LOD) cloud.
APA, Harvard, Vancouver, ISO, and other styles
26

Wille, Jens. "Automatisches Klassifizieren bibliographischer Beschreibungsdaten: Vorgehensweise und Ergebnisse." Thesis, 2006. http://eprints.rclis.org/7790/1/wille_-_automatisches_klassifizieren_bibliographischer_beschreibungsdaten_%28diplomarbeit%29.pdf.

Full text
Abstract:
This work deals with the practical aspects of automated categorization of bibliographic records. Its main concern is the workflow implemented in the ad hoc developed open-source program "COBRA – Classification Of Bibliographic Records, Automatic". Preconditions and parameters for application in the library field are clarified. Finally, categorization results for socio-scientific records from the database SOLIS are evaluated.
APA, Harvard, Vancouver, ISO, and other styles
27

Gómez-Díaz, Raquel. "Estudio de la incidencia del conocimiento lingüístico en los sistemas de recuperación de la información para el español." Thesis, 2001. http://eprints.rclis.org/15670/1/DBD_G%C3%B3mezD%C3%ADazR_Estudiodelaincidencia.pdf.

Full text
Abstract:
Today it is necessary to be well informed, and the characteristics of the information we need call for systems that work with natural language or where control of the terms is minimal. For this work we have created a stemmer based on a non-deterministic finite-state machine, to be applied to information retrieval in Spanish. The stemmer's function is to remove suffixes automatically and establish the lemma of each word. Indexing and retrieval are then performed on the resulting stems. To test its effectiveness, stemming experiments are performed on inflectional and derivational forms, combined with the removal of stop words.
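To make the suffix-stripping idea concrete, here is a minimal, purely illustrative Python sketch of rule-based suffix removal for Spanish. The tiny suffix inventory and the longest-match strategy are assumptions made for the example; they do not reproduce the finite-state machine actually built in the thesis.

```python
# Illustrative suffix stripper: remove the longest matching suffix from a word.
# The suffix inventory below is a small, made-up sample for demonstration only.
SUFFIXES = sorted(
    ["aciones", "amiento", "adoras", "adores", "ancias", "mente",
     "ación", "antes", "istas", "anza", "ista", "able", "ible",
     "ismo", "oso", "osa", "ar", "er", "ir", "es", "s"],
    key=len, reverse=True,  # try longer suffixes first
)

def stem(word, min_stem_len=3):
    """Return a crude stem by stripping the longest known suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

if __name__ == "__main__":
    for w in ["bibliotecas", "recuperación", "documentalistas", "indizar"]:
        print(w, "->", stem(w))
```

In a retrieval setting, both the indexed documents and the query terms would be passed through the same stemmer so that inflected and derived forms collapse to a common stem.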
APA, Harvard, Vancouver, ISO, and other styles
28

Çapkın, Çağdaş. "Türkçe metin tabanlı açık arşivlerde kullanılan dizinleme yönteminin değerlendirilmesi / Evaluation of indexing method used in Turkish text-based open archives." Thesis, 2011. http://eprints.rclis.org/28804/1/Cagdas_CAPKIN_Yuksek_Lisans_tezi.pdf.

Full text
Abstract:
The purpose of this research is to evaluate the performance of information retrieval systems designed for open archives, and of the standards/protocols that enable retrieving and organizing information in open archives. To this end, an open archive was developed with 2215 text-based documents from the journal "Turkish Librarianship", and three different information retrieval systems based on Boolean and Vector Space models were designed in order to evaluate information retrieval performance in the developed archive. The designed information retrieval systems are: a "metadata information retrieval system" (ÜBES) involving indexing with metadata created by humans only, a "full-text information retrieval system" (TBES) involving (automatic) machine-only indexing, and a "mixed information retrieval system" (KBES) involving indexing by both humans and machine. A descriptive research method is used to describe the current situation, and findings are evaluated against the literature. In order to evaluate the performance of the information retrieval systems, "precision and recall" and "normalized recall" measurements were made. The following results were found: the precision performance of the KBES system designed for open archives yields a statistically significant difference compared to ÜBES and TBES. In each information retrieval system, a strong negative correlation is identified between recall and precision, where precision decreases as recall increases. The "normalized recall" performance of ÜBES and KBES shows a statistically significant difference compared to TBES, while no statistically significant difference is identified between ÜBES and KBES. ÜBES retrieves the fewest relevant and nonrelevant documents; TBES retrieves the most nonrelevant documents and the second-most relevant documents; and KBES retrieves the most relevant documents and the second-most nonrelevant documents. It is concluded that using the OAI-PMH and OAI-ORE protocols together, rather than OAI-PMH alone, better fits the purpose of open archives.
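For readers unfamiliar with the evaluation measures mentioned, the short Python sketch below computes precision, recall, and Salton-style normalized recall for a single query over a ranked result list. The ranking and relevance judgements are invented, and the normalized-recall formula used is the standard textbook one, which may differ in detail from the thesis's measurements.

```python
# Toy evaluation of one query: ranked results vs. a set of relevant documents.
def precision_recall(ranked, relevant, cutoff):
    retrieved = ranked[:cutoff]
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved), hits / len(relevant)

def normalized_recall(ranked, relevant, collection_size):
    # Salton's normalized recall: compares the actual ranks of the relevant
    # documents with the best possible ranks 1..n.  Assumes every relevant
    # document appears somewhere in `ranked`.
    ranks = sorted(i + 1 for i, doc in enumerate(ranked) if doc in relevant)
    n = len(relevant)
    ideal = sum(range(1, n + 1))
    return 1 - (sum(ranks) - ideal) / (n * (collection_size - n))

if __name__ == "__main__":
    ranked = ["d3", "d7", "d1", "d9", "d2", "d5"]   # invented ranking
    relevant = {"d1", "d2", "d9"}                    # invented judgements
    p, r = precision_recall(ranked, relevant, cutoff=5)
    print(f"P@5 = {p:.2f}, R@5 = {r:.2f}")
    print(f"normalized recall = {normalized_recall(ranked, relevant, 10):.3f}")
```

Comparing such figures across ÜBES, TBES and KBES is essentially what the thesis does, only over many queries and with significance testing.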
APA, Harvard, Vancouver, ISO, and other styles
29

Oberhauser, Otto. "Automatisches Klassifizieren : Verfahren zur Erschliessung elektronischer Dokumente." Thesis, 2004. http://eprints.rclis.org/8526/1/OCO_MLIS_Thesis.pdf.

Full text
Abstract:
Automatic classification of text documents refers to the computerized allocation of class numbers from existing classification schemes to natural language texts by means of suitable algorithms. Based upon a comprehensive literature review, this thesis establishes an informed and up-to-date view of the applicability of automatic classification for the subject approach to electronic documents, particularly to Web resources. Both methodological aspects and the experiences drawn from relevant projects and applications are covered. Concerning methodology, the present state-of-the-art comprises a number of statistical approaches that rely on machine learning; these methods use pre-classified example documents for establishing a model - the "classifier" - which is then used for classifying new documents. However, the four large-scale projects conducted in the 1990s by the Universities of Lund, Wolverhampton and Oldenburg, and by OCLC (Dublin, OH), still used rather simple and more traditional methodological approaches. These projects are described and analyzed in detail. As they made use of traditional library classifications their results are significant for LIS, even if no permanent quality services have resulted from these endeavours. The analysis of other relevant applications and projects reveals a number of attempts to use automatic classification for document processing in the fields of patent and media documentation. Here, semi-automatic solutions that support human classifiers are preferred, due to the yet unsatisfactory classification results obtained by fully automated systems. Other interesting implementations include Web portals, search engines and (commercial) information services, whereas only little interest has been shown in the automatic classification of books and bibliographic records. In the concluding part of the study the author discusses the most significant applications and projects, and also addresses several problems and issues in the context of automatic classification.
APA, Harvard, Vancouver, ISO, and other styles
30

Bejarano-Ballen, Juan S. "Análisis de los altos cargos de la Generalitat Valenciana." Thesis, 2017. http://eprints.rclis.org/31994/1/TFM_Juan_Sebastian_Bejarano.pdf.

Full text
Abstract:
Senior officials are the political base of the public administration in our country. Citizens know them mainly through the media, but this image provides a fragmented vision. The administration's websites offer complete information about them and their competencies, but only as a snapshot of the corresponding legislature. So where is the map of an administration's political power throughout its history? Nowhere. The dispersion of such significant information, and the difficulty of making sense of it, may work to the detriment of government transparency. Citizens' interest in politics and in the actions of politicians reflects a social change whose maxim is transparency and accessibility of public information in a clear and organized form. The project proposes to build an automated methodology and a prototype graph based on this information from unstructured sources, useful for both citizens and the administration, in order to relate and organize the information of the Generalitat Valenciana. The model, together with the data that support it, will be released for later reuse and adaptation to other administrations.
APA, Harvard, Vancouver, ISO, and other styles
31

Schmidt, Nora. "Semantisches Publizieren im interdisziplinären Wissenschaftsnetzwerk. Theoretische Grundlagen und Anforderungen." Thesis, 2014. http://eprints.rclis.org/24215/1/schmidt_semantic-publishing_e-lis.html.

Full text
Abstract:
The study examines the preconditions for adopting semantic web technologies for a novel specialized medium of scholarly communication that, also across disciplines, enables the synchronicity of publication and knowledge representation on the one hand and the dynamic bundling of assertions on the other. To this end it is first of all necessary to determine a concept of "(scholarly) publication" and of neighbouring concepts. These considerations are informed by theories related to radical constructivism. From this derives a critique of the mainstream of knowledge representation, which has resigned itself to being unable to represent the dynamics of knowledge. Finally, the study presents a conceptual outline of a technical system that is built upon the known concept of nanopublications and is called the "scholarly network". The increased effort required when publishing in the scholarly network is outweighed by the benefits of this publication medium: it may help to render research outputs more precisely and to raise their connectivity by reducing the complexity of assertions. Beyond that, it would generate an openly accessible and finely structured discourse archive, provided there is wide participation.
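To hint at what an assertion-level publication unit might look like technically, here is a small rdflib sketch that serialises one assertion together with minimal provenance, loosely in the spirit of nanopublications. The namespaces, URIs, and property names are invented for the example and are not taken from the study.

```python
# Sketch: one assertion plus provenance, serialised as Turtle with rdflib.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/scholarly-network/")  # invented namespace

g = Graph()
g.bind("ex", EX)

assertion = EX["assertion-001"]
# The assertion itself: a single claim expressed as one reified statement.
g.add((assertion, RDF.type, EX.Assertion))
g.add((assertion, EX.subject, EX["concept/automatic-indexing"]))
g.add((assertion, EX.predicate, EX["relation/improves"]))
g.add((assertion, EX.object, EX["concept/subject-access"]))
# Minimal provenance: who asserted it and when.
g.add((assertion, EX.assertedBy, URIRef("https://orcid.org/0000-0000-0000-0000")))
g.add((assertion, EX.assertedOn, Literal("2014-01-01", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```

A full nanopublication would separate assertion, provenance and publication info into named graphs; the flat graph above is only meant to show how a single, citable claim can carry its own metadata.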
APA, Harvard, Vancouver, ISO, and other styles
32

"An automated Chinese text processing system (ACCESS): user-friendly interface and feature enhancement." Chinese University of Hong Kong, 1994. http://library.cuhk.edu.hk/record=b5888227.

Full text
Abstract:
Suen Tow Sunny.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1994.
Includes bibliographical references (leaves 65-67).
Contents:
Introduction (p. 1)
Chapter 1. ACCESS with an Extendible User-friendly X/Chinese Interface (p. 4)
1.1. System requirement (p. 4)
1.1.1. User interface issue (p. 4)
1.1.2. Development issue (p. 5)
1.2. Development decision (p. 6)
1.2.1. X window system (p. 6)
1.2.2. X/Chinese toolkit (p. 7)
1.2.3. C language (p. 8)
1.2.4. Source code control system (p. 8)
1.3. System architecture (p. 9)
1.4. User interface (p. 10)
1.5. Sample screen (p. 13)
1.6. System extension (p. 14)
1.7. System portability (p. 18)
Chapter 2. Study on Algorithms for Automatically Correcting Characters in Chinese Cangjie-typed Text (p. 19)
2.1. Chinese character input (p. 19)
2.1.1. Chinese keyboards (p. 20)
2.1.2. Keyboard redefinition scheme (p. 21)
2.2. Cangjie input method (p. 24)
2.3. Review on existing techniques for automatically correcting words in English text (p. 26)
2.3.1. Nonword error detection (p. 27)
2.3.2. Isolated-word error correction (p. 28)
2.3.2.1. Spelling error patterns (p. 29)
2.3.2.2. Correction techniques (p. 31)
2.3.3. Context-dependent word correction research (p. 32)
2.3.3.1. Natural language processing approach (p. 33)
2.3.3.2. Statistical language model (p. 35)
2.4. Research on error rates and patterns in Cangjie input method (p. 37)
2.5. Similarities and differences between Chinese and English typed text (p. 41)
2.5.1. Similarities (p. 41)
2.5.2. Differences (p. 42)
2.6. Proposed algorithm for automatic Chinese text correction (p. 44)
2.6.1. Sentence level (p. 44)
2.6.2. Part-of-speech level (p. 45)
2.6.3. Character level (p. 47)
Conclusion (p. 50)
Appendix A. Cangjie Radix Table (p. 51)
Appendix B. Sample Text (p. 52): Article 1 (p. 52), Article 2 (p. 53), Article 3 (p. 56), Article 4 (p. 58)
Appendix C. Error Statistics (p. 61)
References (p. 65)
APA, Harvard, Vancouver, ISO, and other styles
33

Gruzd, Anatoliy A., and Caroline Haythornthwaite. "Automated Discovery and Analysis of Social Networks from Threaded Discussions." 2008. http://hdl.handle.net/10150/105081.

Full text
Abstract:
To gain greater insight into the operation of online social networks, we applied Natural Language Processing (NLP) techniques to text-based communication to identify and describe underlying social structures in online communities. This paper presents our approach and preliminary evaluation for content-based, automated discovery of social networks. Our research question is: What syntactic and semantic features of postings in threaded discussions help uncover explicit and implicit ties between network members, and which provide a reliable estimate of the strengths of interpersonal ties among the network members? To evaluate our automated procedures, we compare the results from the NLP processes with social networks built from basic who-to-whom data, and a sample of hand-coded data derived from a close reading of the text. For our test case, and as part of ongoing research on networked learning, we used the archive of threaded discussions collected over eight iterations of an online graduate class. We first associate personal names and nicknames mentioned in the postings with class participants. Next we analyze the context in which each name occurs in the postings to determine whether or not there is an interpersonal tie between a sender of the posting and a person mentioned in it. Because information exchange is a key factor in the operation and success of a learning community, we estimate and assign weights to the ties by measuring the amount of information exchanged between each pair of the nodes; information in this case is operationalized as counts of important concept terms in the postings as derived through the NLP analyses. Finally, we compare the resulting network(s) against those derived from other means, including basic who-to-whom data derived from posting sequences (e.g., whose postings follow whose). In this comparison we evaluate what is gained in understanding network processes by our more elaborate analyses.
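A drastically simplified Python sketch of the name-network idea follows: it scans posting texts for mentions of known participant names and adds weighted ties from the poster to the people mentioned. The roster, the postings, and the weighting by raw mention count are invented simplifications; the paper's actual procedure resolves nicknames and weights ties by NLP-derived concept terms.

```python
# Toy "name network": poster -> mentioned participant, weighted by mention count.
from collections import defaultdict

participants = {"alice", "bob", "carol"}          # invented class roster
postings = [                                      # invented thread excerpts
    ("alice", "Bob, I agree with your point about tagging."),
    ("bob", "Thanks Alice. Carol raised this earlier too."),
    ("carol", "Alice and Bob both make good arguments."),
]

weights = defaultdict(int)
for sender, text in postings:
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    for name in participants - {sender}:
        if name in tokens:
            weights[(sender, name)] += 1

for (src, dst), w in sorted(weights.items()):
    print(f"{src} -> {dst}: weight {w}")
```

The resulting weighted edge list is the kind of structure that can then be compared against who-replied-to-whom networks or hand-coded ties.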
APA, Harvard, Vancouver, ISO, and other styles
34

Zhang, Lei. "DASE: Document-Assisted Symbolic Execution for Improving Automated Test Generation." Thesis, 2014. http://hdl.handle.net/10012/8532.

Full text
Abstract:
Software testing is crucial for uncovering software defects and ensuring software reliability. Symbolic execution has been utilized for automatic test generation to improve testing effectiveness. However, existing test generation techniques based on symbolic execution fail to take full advantage of programs’ rich amount of documentation specifying their input constraints, which can further enhance the effectiveness of test generation. In this paper we present a general approach, Document-Assisted Symbolic Execution (DASE), to improve automated test generation and bug detection. DASE leverages natural language processing techniques and heuristics to analyze programs’ readily available documentation and extract input constraints. The input constraints are then used as pruning criteria; inputs far from being valid are trimmed off. In this way, DASE guides symbolic execution to focus on those inputs that are semantically more important. We evaluated DASE on 88 programs from 5 mature real-world software suites: GNU Coreutils, GNU findutils, GNU grep, GNU Binutils, and elftoolchain. Compared to symbolic execution without input constraints, DASE increases line coverage, branch coverage, and call coverage by 5.27–22.10%, 5.83–21.25% and 2.81–21.43% respectively. In addition, DASE detected 13 previously unknown bugs, 6 of which have already been confirmed by the developers.
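The core idea, extracting input constraints from documentation and using them to prune the input space, can be caricatured in a few lines of Python. The sketch below pulls option names out of a `--help`-style text with a regular expression and rejects candidate command lines that use undocumented options; the help text and the filtering rule are invented for illustration and are far simpler than DASE's actual use of NLP over documentation to guide symbolic execution.

```python
# Toy constraint extraction: accept only candidate invocations whose options
# appear in the program's help text.
import re

HELP_TEXT = """Usage: mytool [OPTION]... FILE
  -a, --all        include all entries
  -n, --number     number output lines
      --width=N    set output width to N columns
"""  # invented help text

def documented_options(help_text):
    """Collect option strings such as '-a' or '--width' from help output."""
    return set(re.findall(r"(?<!\w)(--?[A-Za-z][\w-]*)", help_text))

def is_plausible(candidate_args, allowed):
    """Reject candidates that use options the documentation never mentions."""
    return all(arg.split("=")[0] in allowed
               for arg in candidate_args if arg.startswith("-"))

if __name__ == "__main__":
    allowed = documented_options(HELP_TEXT)
    candidates = [["-a", "input.txt"], ["--width=80", "-n"], ["--frobnicate"]]
    for c in candidates:
        print(c, "->", "keep" if is_plausible(c, allowed) else "prune")
```

In DASE the analogous constraints are fed to the symbolic execution engine so that exploration concentrates on semantically valid inputs rather than being spent on trivially rejected ones.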
APA, Harvard, Vancouver, ISO, and other styles
35

Cerveira, João Miguel dos Santos. "Automated Metrics System to Support Software Development Process with Natural Language Assistant." Master's thesis, 2017. http://hdl.handle.net/10316/83083.

Full text
Abstract:
Master's dissertation in Informatics Engineering presented to the Faculty of Sciences and Technology
A Whitesmith é uma empresa de produtos e consultoria de desenvolvimento de software, que recorre a várias ferramentas de monitorização para auxiliar no seu processo de desenvolvimento de produtos.Para que este método seja bem aplicado, é necessário a existência de vários repositórios de dados sobre todo o planeamento e monitorização de desenvolvimento. Esta informação tem de estar guardada em ferramentas de fácil alcance e de rápida compreensão. Posto esta necessidade de alojamento de dados, começaram a surgir, no mercado, várias ferramentas com a capacidade de guardar e manipular informação, de modo a ajudar no desenvolvimento de software.Com o crescimento da empresa, seguiu-se uma grande quantidade de informação distribuída em várias destas ferramentas. Para ser possível fazer uma análise ao desenvolvimento de um determinado projeto, é necessário procurar informação e introduzi-la manualmente. Assim, surgiu a necessidade de criar uma solução para este problema que, não só consiga recolher toda a informação, mas que também execute uma análise ao estado de desenvolvimento de todos os projetos. Para não criar atrito no processo de desenvolvimento, vai ser necessário que a solução contenha o mínimo de interacção humano-computacional, sendo que todo o seu processo seja automatizado.A única interacção requisitada pela empresa, foi a integração de um assistente de linguagem natural na plataforma de comunicação usada por todos os membros, com a finalidade de melhorar a usabilidade na recolha de informação.
Whitesmith is a software development and product consulting company that uses a variety of monitoring tools to aid in its product development process. For this method to work well, several data repositories covering all development planning and monitoring are needed, and this information must be stored in tools that are easy to reach and quick to understand. To meet this need, several tools capable of storing and manipulating information have appeared on the market to support software development. As the company grows, a large amount of information is distributed across these tools, so analysing the development stage of a given project requires finding the information and entering it manually. This motivated the creation of a solution that not only collects all the information, but also analyses the development status of all projects. To avoid adding friction to the development process, the solution must require minimal human-computer interaction, with the entire process automated. The only interaction requested by the company was the integration of a natural language assistant into the communication platform used by all members, in order to improve the usability of information collection. This communication works in both directions, depending on the metric in question.
APA, Harvard, Vancouver, ISO, and other styles
36

Radford, Benjamin James. "Automated Learning of Event Coding Dictionaries for Novel Domains with an Application to Cyberspace." Diss., 2016. http://hdl.handle.net/10161/13386.

Full text
Abstract:

Event data provide high-resolution and high-volume information about political events. From COPDAB to KEDS, GDELT, ICEWS, and PHOENIX, event datasets and the frameworks that produce them have supported a variety of research efforts across fields, including political science. While these datasets are machine-coded from vast amounts of raw text input, they nonetheless require substantial human effort to produce and update sets of required dictionaries. I introduce a novel method for generating large dictionaries appropriate for event-coding given only a small sample dictionary. This technique leverages recent advances in natural language processing and deep learning to greatly reduce the researcher-hours required to go from defining a new domain of interest to producing structured event data that describes that domain. An application to cybersecurity is described and both the generated dictionaries and resultant event data are examined. The cybersecurity event data are also examined in relation to existing datasets in related domains.
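As a loose illustration of growing an event-coding dictionary from a small seed, the sketch below trains gensim word vectors on a toy corpus and proposes near-neighbour terms for each seed entry. The corpus, seed terms, and similarity threshold are invented, and the dissertation's actual pipeline (deep learning over large news corpora) is considerably more involved.

```python
# Toy dictionary expansion: propose neighbours of seed terms from word vectors.
from gensim.models import Word2Vec

corpus = [  # invented mini-corpus of tokenised "news" sentences
    ["hackers", "breached", "the", "government", "network"],
    ["attackers", "breached", "a", "corporate", "network"],
    ["the", "malware", "infected", "thousands", "of", "computers"],
    ["the", "virus", "infected", "the", "hospital", "systems"],
    ["officials", "reported", "a", "data", "breach", "yesterday"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=200, seed=1)

seed_dictionary = ["breached", "infected"]  # invented seed verbs for a cyber domain
expanded = {}
for term in seed_dictionary:
    # keep only reasonably similar neighbours; the threshold here is arbitrary
    expanded[term] = [w for w, score in model.wv.most_similar(term, topn=5)
                      if score > 0.0]

for term, neighbours in expanded.items():
    print(term, "->", neighbours)
```

On a corpus this small the neighbours are noise; with a realistic news corpus the same loop yields candidate dictionary entries that a researcher can review far faster than building the dictionary by hand.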


Dissertation
APA, Harvard, Vancouver, ISO, and other styles
37

Gruzd, Anatoliy. "Name Networks: A Content-Based Method for Automated Discovery of Social Networks to Study Collaborative Learning." 2009. http://hdl.handle.net/10150/105553.

Full text
Abstract:
As a way to gain greater insight into the operation of Library and Information Science (LIS) e-learning communities, the presented work applies automated text mining techniques to text-based communication to identify, describe and evaluate underlying social networks within such communities. The main thrust of the study is to find a way to use computers to automatically discover social ties that form between students just from their threaded discussions. Currently, one of the most common but time consuming methods for discovering social ties between people is to ask questions about their perceived social ties via a survey. However, such a survey is difficult to collect due to the high cost associated with data collection and the sensitive nature of the types of questions that must be asked. To overcome these limitations, the paper presents a new, content-based method for automated discovery of social networks from threaded discussions dubbed name networks. When fully developed, name networks can be used as a real time diagnostic tool for educators to evaluate and improve teaching models and to identify students who might need additional help or students who may provide such help to others.
APA, Harvard, Vancouver, ISO, and other styles
38

"Analysis and Decision-Making with Social Media." Doctoral diss., 2019. http://hdl.handle.net/2286/R.I.54830.

Full text
Abstract:
The rapid advancements of technology have greatly extended the ubiquitous nature of smartphones acting as a gateway to numerous social media applications. This brings an immense convenience to the users of these applications wishing to stay connected to other individuals through sharing their statuses, posting their opinions, experiences, suggestions, etc., on online social networks (OSNs). Exploring and analyzing this data has great potential to enable deep and fine-grained insights into the behavior, emotions, and language of individuals in a society. This proposed dissertation focuses on utilizing these online social footprints to research two main threads: 1) Analysis: to study the behavior of individuals online (content analysis) and 2) Synthesis: to build models that influence the behavior of individuals offline (incomplete action models for decision-making). A large percentage of posts shared online are in an unrestricted natural language format that is meant for human consumption. One of the demanding problems in this context is to leverage and develop approaches to automatically extract important insights from this incessant massive data pool. Efforts in this direction emphasize mining or extracting the wealth of latent information in the data from multiple OSNs independently. The first thread of this dissertation focuses on analytics to investigate the differentiated content-sharing behavior of individuals. The second thread of this dissertation attempts to build decision-making systems using social media data. The results of the proposed dissertation emphasize the importance of considering multiple data types while interpreting the content shared on OSNs. They highlight the unique ways in which the data and the extracted patterns from text-based platforms or visual-based platforms complement and contrast in terms of their content. The proposed research demonstrated that, in many ways, the results obtained by focusing on either only text or only visual elements of content shared online could lead to biased insights. On the other hand, it also shows the power of a sequential set of patterns that have some sort of precedence relationships, and of collaboration between humans and automated planners.
Dissertation/Thesis
Doctoral Dissertation Computer Science 2019
APA, Harvard, Vancouver, ISO, and other styles
39

Silva, Filipe José Good da. "Criação de um Módulo de Aprendizagem Computacional Automatizada para Cientistas de Dados." Master's thesis, 2020. http://hdl.handle.net/10316/92522.

Full text
Abstract:
Master's dissertation in Informatics Engineering presented to the Faculty of Sciences and Technology
A área de Aprendizagem Computacional nunca teve tanto interesse e influência como nos dias de hoje. Várias são as outras áreas em que esta pode acrescentar valor e fazer face à crescente necessidade de melhoria, desde a área humana em que as nossas decisões são tomadas por algoritmos informáticos que foram desenvolvidos para executar determinadas tarefas, à área industrial onde as empresas recorrem a Aprendizagem Computacional para obter valor da quantidade enorme de dados que produzem. Contudo, desenvolver sistemas de Aprendizagem Computacional não é trivial, exigindo muito conhecimento e tempo, tornando assim o trabalho limitado a pessoas com experiência na área.Aprendizagem Computacional Automatizada (AutoML) procura remover limitações associadas ao desenvolvimento de sistemas dotados de inteligência ao automatizar as diferentes fases de um projecto de Aprendizagem Computacional. Esta nova área tenciona fazer face à necessidade crescente de ferramentas que tornam Aprendizagem Computacional mais acessível e menos complexa.Neste trabalho explorámos as capacidades actuais de AutoML de forma a implementar um módulo de AutoML. O módulo implementado está capacitado para realizar diversas etapas de um projecto de Aprendizagem Computacional de forma automatizada. Além disso, explorámos também um cenário onde AutoML pode ser integrado. Neste sentido, o módulo implementado foi integrado num assistente virtual, criando assim uma prova de conceito que permite a execução de operações de AutoML com recurso à comunicação em linguagem natural. Os resultados obtidos demonstram que as duas ferramentas implementadas permitem ultrapassar duas dificuldades no que toca à implementação de projectos de Aprendizagem Computacional. Por um lado, o módulo de AutoML reduz a complexidade associada ao desenvolvimento de sistemas inteligentes, permitindo assim que indivíduos sem conhecimento em Aprendizagem Computacional possam beneficiar da mesma. Por outro, o assistente virtual implementado elimina a necessidade de experiência de programação que é, por norma, fundamental em projectos de Aprendizagem Computacional.
The area of Machine Learning has never attracted as much interest as it does today. There are several areas in which it can add value and address the growing need for improvement, from the human sphere, in which our decisions are shaped by computer algorithms developed to perform certain tasks, to the industrial sphere, where companies use Machine Learning to gain value from the huge amount of data they produce. However, developing a Machine Learning system is not trivial. It is a complex task that requires considerable knowledge and time, limiting its development to people with experience in the area. Automated Machine Learning (AutoML) seeks to remove the limitations associated with developing intelligent systems by automating the different phases of a Machine Learning project. This new area aims to address the growing need for tools that make Machine Learning more accessible and less complex. In this work, we explored the current capabilities of AutoML in order to develop an AutoML module. The implemented module is able to execute several phases of a Machine Learning project in an automated way. In addition, we also explored a scenario where AutoML could be integrated: the implemented module was integrated into a virtual assistant, creating a proof of concept that allows the execution of AutoML operations using natural language. Our results suggest that the two implemented tools help overcome two obstacles in the implementation of Machine Learning projects. On the one hand, the AutoML module reduces the complexity associated with the development of intelligent systems, allowing individuals without Machine Learning knowledge to benefit from it. On the other hand, the implemented virtual assistant eliminates the need for programming experience, which is usually essential in Machine Learning projects.
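As a very small stand-in for what an AutoML module automates, the sketch below runs an automated model and hyperparameter search with scikit-learn's GridSearchCV on a bundled dataset. It is only meant to illustrate the kind of search an AutoML system performs; it is not the module implemented in the dissertation, and the candidate models and grids are arbitrary choices.

```python
# Minimal automated model selection: grid-search two classifiers on one dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "forest": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 200]}),
}

best_name, best_score, best_model = None, -1.0, None
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5)   # exhaustive search per model
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_name, best_score, best_model = name, search.best_score_, search.best_estimator_

print(f"selected {best_name} (cv accuracy {best_score:.3f})")
print(f"test accuracy: {best_model.score(X_test, y_test):.3f}")
```

A natural-language front end, as described in the abstract, would simply translate a request such as "train a classifier on this dataset" into a call to a routine like the loop above.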
APA, Harvard, Vancouver, ISO, and other styles
40

Nogueira, Afonso Manuel Salazar. "Comparação de desempenho de algoritmos de Machine Learning na classificação de IT incident tickets." Master's thesis, 2020. http://hdl.handle.net/1822/71092.

Full text
Abstract:
Dissertation of the integrated master's degree in Engineering and Management of Information Systems
Esta dissertação, inserida no projeto de dissertação de mestrado em Engenharia e Gestão de Sistemas de Informação do departamento de Sistemas de Informação da Universidade do Minho, tem como tema “Comparação de Desempenho de Algoritmos de Machine Learning na Classificação de IT Incident Tickets”, que deriva do estágio profissional que o autor realizou no Grupo Petrotec. Todos os dias, colaboradores dos inúmeros departamentos da instituição reportam incidentes tecnológicos, isto é, problemas relacionados com os mais variados elementos de trabalho do seu quotidiano que, a priori, possam ser resolvidos pelos profissionais de TI. Quando se deparam com algum problema, dirigem-se a uma plataforma onde podem detalhar categórica e textualmente o incidente ocorrido, de forma a que o support agent perceba facilmente o cerne da questão. Contudo, nem todos os colaboradores são rigorosos e precisos a descrever o incidente, onde, por muitas vezes, se verifica uma categoria totalmente desfasada com a descrição textual do ticket, o que torna mais demorada a dedução da solução por parte do profissional. Nesta dissertação, é proposta uma solução que visa atribuir uma categoria ao novo incident ticket através da classificação do mesmo, especificando o técnico informático especializado na solução do incidente em questão, sendo um mecanismo que recorre a técnicas de Text Mining, Processamento de Linguagem Natural (PLN) e Machine Learning que tenta reduzir ao máximo a intervenção humana na classificação dos tickets, diminuindo o tempo gasto na perceção e resolução dos mesmos. Com isso, a classificação do atributo relativo à descrição textual do ticket vai ser fulcral para a dedução do agente informático a resolver o incidente. Os resultados obtidos foram bastante satisfatórios, decifrando qual os melhores procedimentos de processamento textual a serem realizados, obtendo posteriormente, na maior parte dos modelos de classificação utilizados, uma acuidade superior a 90%, o que torna legítima a implementação de todas as metodologias adotadas num cenário real, isto é, no Grupo Petrotec. No que concerne à recolha, processamento e mining dos dados, teve-se em conta a metodologia Cross Industry Standard Process for Data Mining (CRISP-DM) e como metodologia de investigação utilizou-se a Design Science Research (DSR).
This dissertation, part of the master's programme in Engineering and Management of Information Systems of the Information Systems department of the University of Minho, has the theme "Performance Comparison of Machine Learning Algorithms in Classifying IT Incident Tickets" and derives from the professional internship the author carried out at the Petrotec Group. Every day, employees from the institution's numerous departments report technological incidents, that is, problems related to the most varied elements of their daily work that can be solved by IT professionals. When faced with a problem, they go to a platform where they can describe the incident categorically and textually, so that the support agent easily understands the heart of the matter. However, not all employees are rigorous and accurate in describing the incident, and the selected category is often completely out of step with the textual description of the ticket, which makes it more time-consuming for the professional to work out a solution. In this dissertation, a solution is proposed that assigns a category to each new incident ticket by classifying it, identifying the support agent specialized in solving the incident in question. The mechanism uses Text Mining, Natural Language Processing (NLP) and Machine Learning techniques and tries to reduce human intervention in ticket classification as much as possible, decreasing the time spent understanding and resolving tickets. The classification of the attribute containing the ticket's textual description is therefore central to assigning the support agent who will solve the incident. The results obtained were quite satisfactory: after identifying the best text-processing procedures, most of the classification models used achieved an accuracy above 90%, which legitimizes implementing all the adopted methodologies in a real scenario, namely the Petrotec Group. Regarding data collection, processing and mining, the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology was followed, and Design Science Research (DSR) was used as the research methodology.
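To give a flavour of the kind of pipeline such a comparison evaluates, here is a small scikit-learn sketch that classifies ticket descriptions into support categories using TF-IDF features and a linear SVM. The example tickets and categories are invented, and the dissertation compares several algorithms and CRISP-DM stages beyond this single pipeline.

```python
# Toy IT-ticket classifier: TF-IDF features + linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tickets = [  # invented ticket descriptions
    "cannot connect to the office wifi network",
    "vpn connection drops every few minutes",
    "printer on the second floor is out of toner",
    "need a new toner cartridge for the hp printer",
    "outlook keeps asking for my password",
    "cannot log in to my email account",
]
categories = ["network", "network", "printer", "printer", "email", "email"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(tickets, categories)

new_ticket = "the wifi in meeting room b is not working"
print(model.predict([new_ticket])[0])   # expected: network
```

In practice the predicted category would be used to route the ticket to the appropriate support agent, and per-category accuracy would be measured on a held-out test set.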
APA, Harvard, Vancouver, ISO, and other styles
